Before joining SocialCops, I had a very difficult time accessing data. I usually got data from government MIS systems, in PDFs and other formats that were either inaccessible or couldn’t be processed. Most of the time, since manual copy-paste was tremendously cumbersome, I had to deliberately drop some variables or indicators from my research. But now I can deal with even the worst data sets.
Two months ago, I joined SocialCops’ R&A Team, which handles every stage of a data set’s journey, from coming across a data set in its natural habitat (unstructured, raw, and outlandish) to making it talk in a language everyone understands. As a Research Associate, I got several opportunities to work closely on all aspects of data cleaning and processing. From the DISHA project with the Ministry of Rural Development to the United Nations’ Sustainable Development Goals project, this role gave me an insight into a data set’s roller coaster ride.
So what does it take to make sense of a data set and turn data into attractive dashboards at SocialCops? Let’s find out about our data intelligence platform, which makes the day-to-day life of a Research Associate at SocialCops so much easier!
In a nutshell, here’s the journey of a data set at SocialCops:
Fetch data from an external source
Getting data from Collect is a breeze, since the tool is a part of our platform. For external services (for instance, a government MIS), we use an internal ingestion tool to extract the giant data set.
What does this data usually look like? Collect data can easily be extracted in an Excel/CSV format. If we’re lucky (which is rare), our ingestion tool gets data in the same format.
When we fetch data, we make sure to identify the following:
- Data source: our data comes from either Collect (our mobile data collection app) or secondary data sources (like the Census, sample surveys, sector-specific reports, M&E reports, external source dumps, and web services).
- File format: PDF, Excel, CSV, web portal, etc.
- Structure of the data files: for example, a single file for a single geography, time or variable type
- Data structure: long or wide
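To make the long/wide distinction concrete, here is a minimal sketch in plain Python. The district names, indicators and numbers are made up for illustration, and this is not the code of our ingestion tool.

```python
# Hypothetical wide-format data: one row per district, one column per indicator.
wide_rows = [
    {"district": "Pune", "population": 100, "schools": 5},
    {"district": "Nashik", "population": 80, "schools": 3},
]


def wide_to_long(rows, id_column):
    """Turn one row per unit into one row per (unit, variable) pair."""
    long_rows = []
    for row in rows:
        for column, value in row.items():
            if column == id_column:
                continue
            long_rows.append(
                {id_column: row[id_column], "variable": column, "value": value}
            )
    return long_rows


def long_to_wide(rows, id_column):
    """Rebuild one row per unit from (unit, variable, value) rows."""
    wide = {}
    for row in rows:
        unit = wide.setdefault(row[id_column], {id_column: row[id_column]})
        unit[row["variable"]] = row["value"]
    return list(wide.values())


long_rows = wide_to_long(wide_rows, "district")
```

The same information lives in both shapes: the wide version has two rows with two indicator columns each, while the long version has four rows with a single value column.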
How we process online data
Step 1: Parsing PDF reports
Data from the web often doesn’t come in one nice Excel/CSV file. Instead, it’s often a series of PDF reports.
Why? A government website usually allows visitors to generate and download reports from the data housed on the site. To get that data, our web scraper visits that site multiple times (if necessary) to generate every PDF report that we need. Because it’s hard to analyze data in PDF files, our in-house PDF data extraction tool then automatically finds and recognizes the data tables within the PDF and transforms them into a CSV file.
Step 2: Storing the data
We store the collated data in our central data repository, which saves and organizes all of SocialCops’ data. It includes data with many structures, data types, levels of granularity and sectors (including demographics, schemes, health and nutrition, agriculture, economy, and water and sanitation). This data repository is one of the largest and most comprehensive in India, yet it remains very convenient to use.
Step 3: Data exploration
After storing our data, we use our in-house static data insights engine, which reduces our data cleaning time. Before we analyze and process data, this tool gives us a broad sense of a data set’s number and type of variables, geography, time period, etc.
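As a rough illustration of what a first look at a data set involves, the sketch below profiles a small CSV using only Python’s standard library. The function names, the sample data and the choice to treat "NA" as missing are assumptions for this example; our in-house engine is not shown here.

```python
import csv
import io


def infer_type(values):
    """Call a column numeric only if every non-missing value parses as a number."""
    if not values:
        return "empty"
    try:
        for value in values:
            float(value)
    except ValueError:
        return "text"
    return "numeric"


def profile_csv(text):
    """Summarize each column: inferred type, missing count, distinct count."""
    reader = csv.DictReader(io.StringIO(text))
    rows = list(reader)
    columns = {}
    for column in reader.fieldnames:
        values = [(row[column] or "").strip() for row in rows]
        present = [v for v in values if v not in ("", "NA")]
        columns[column] = {
            "type": infer_type(present),
            "missing": len(values) - len(present),
            "distinct": len(set(present)),
        }
    return {"rows": len(rows), "columns": columns}


# Made-up sample data, for illustration only.
sample = (
    "district,year,literacy_rate\n"
    "Pune,2011,86.2\n"
    "Nashik,2011,NA\n"
    "Nagpur,2011,88.4\n"
)
profile = profile_csv(sample)
```

Even a summary this small answers the questions that matter before cleaning starts: how many variables there are, which ones are numeric, and where values are missing.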
Step 4: Data cleaning
After basic data exploration, we start with data processing.
The first step of data processing is data cleaning — a process to transform the extracted data files into a structured format while examining them and correcting inconsistent, inaccurate, incomplete, or unreasonable data values. After cleaning the data and converting it into a standardized format, we load it into SocialCops’ clean data repository.
We do data cleaning with R, STATA, Python or our tool Transform. These platforms allow us to play with the data and run it through a series of checks to ensure that it’s complete and error-free. They also let us manipulate the data without changing the original data set.
Since there are multiple ways to jump from raw data to final cleaned data, data cleaning can become very complex. To solve this, we follow a set framework. Broadly, this framework includes four steps:
- Data structure analysis: The number of units and variables in the final output should match the source file.
- Consistency analysis: Zero/missing values or NAs should make sense, and aggregate figures or ratios should be replicable.
- Validations: A 5-10% sample of the data points in the final data set is cross-checked against the source file.
- Triangulation: The final data set is compared to external data sets to ensure that similar information is consistent across both data sources.
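The first three checks in this framework can be sketched in a few lines of Python. This is a simplified illustration under assumed names and made-up numbers, not the checks we actually run in R, STATA or Transform. Triangulation is left out because it depends on which external data set is available.

```python
import math
import random


def structure_matches(source_rows, clean_rows):
    """Data structure analysis: same number of units and of variables."""
    same_units = len(source_rows) == len(clean_rows)
    source_widths = {len(row) for row in source_rows}
    clean_widths = {len(row) for row in clean_rows}
    return same_units and source_widths == clean_widths


def totals_replicate(rows, part_columns, total_column):
    """Consistency analysis: a reported total should equal the sum of its parts."""
    return all(
        math.isclose(sum(row[c] for c in part_columns), row[total_column])
        for row in rows
    )


def validation_sample(rows, fraction=0.05, seed=0):
    """Validations: draw a reproducible sample to cross-check against the source."""
    size = min(len(rows), max(1, math.ceil(len(rows) * fraction)))
    return random.Random(seed).sample(rows, size)


# Made-up rows, for illustration only.
source = [
    {"block": "A", "male": 10, "female": 12, "total": 22},
    {"block": "B", "male": 7, "female": 9, "total": 16},
]
clean = [dict(row) for row in source]

structure_ok = structure_matches(source, clean)
totals_ok = totals_replicate(clean, ["male", "female"], "total")
to_review = validation_sample(clean, fraction=0.5)
```

Fixing the random seed means a second reviewer can draw the same sample and repeat the cross-check.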
Step 5: Data enrichment
The second step of data processing is enriching a data set via our favorite tool, our entity recognition engine.
Why is this important? Indian data sets are not at all geographically standardized. Different data sets use different names for the same place. Standard names are often written in different ways, and even the official names change over time. For example, there’s only a 15% match between the Census and DISE (District Information System for Education) data sets, even though they cover the same set of villages in India.
At SocialCops, we are trying to unify the way Indian geographies are represented. Our entity recognition engine automatically fixes these geographic issues so we can match and merge data sets. It accomplishes this by comparing the geographic names in external data sets to our standard internal codes. These internal codes have been developed from over a year of work on Indian data sets, and they’re self-learning — they get even better over time as our engine processes more data!
Be it Mumbai or Bombay, Pompurna or Pomburna, Andaman and Nicobar Island or A & N Island, our engine can fix any geographic mismatch. It also understands the difference between different places with the same name, like Aurangabad, Bihar, and Aurangabad, Maharashtra. This saves us a lot of time when trying to find and fix naming errors so we can join data from multiple data sets.
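To give a feel for the problem, here is a toy matcher built on Python’s standard `difflib` module. It is only a sketch: the codes, the alias table and the 0.8 similarity cutoff are invented for this example, and our actual engine (which learns over time) works differently.

```python
import difflib

# Hypothetical standard list: (state, place) -> internal code. Codes are made up.
STANDARD = {
    ("maharashtra", "mumbai"): "MH-01",
    ("maharashtra", "aurangabad"): "MH-02",
    ("maharashtra", "pomburna"): "MH-03",
    ("bihar", "aurangabad"): "BR-01",
}

# Renamed places cannot be found by spelling similarity, so they need an alias table.
ALIASES = {"bombay": "mumbai"}


def match_place(state, name, cutoff=0.8):
    """Return the internal code for a place name, or None if nothing is close."""
    state_key = state.strip().lower()
    name_key = name.strip().lower()
    name_key = ALIASES.get(name_key, name_key)
    # Only compare against places in the same state, so that two places
    # with the same name in different states stay separate.
    candidates = [place for (s, place) in STANDARD if s == state_key]
    close = difflib.get_close_matches(name_key, candidates, n=1, cutoff=cutoff)
    if not close:
        return None
    return STANDARD[(state_key, close[0])]
```

Here, `match_place("Maharashtra", "Pompurna")` resolves to the same code as Pomburna, while Aurangabad gets a different code depending on whether the state is Bihar or Maharashtra.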
Step 6: Data cataloguing
The third step in the data processing stage is cataloguing the data — adding descriptive labels or attaching metadata to our data set to make it useful. We use an internal data cataloguing and storage tool, which is really a boon for us because we can access any data set via keywords, client/partner name, year, source, government ministry and department, published date… I could go on!
This is because our data’s metadata includes both static descriptors and dynamic descriptors. Static descriptors are tags like country, date published, sector, source (e.g. India, 2014, Agriculture, Ministry). Dynamic descriptors (like % completeness) are qualitative attributes that will likely change over time if we get updated versions of that data set. We have an in-built tool that provides the dynamic descriptors.
Cataloguing the data with these descriptors helps us to be flexible and organized, especially considering the vast amount of data that we have. They also allow us to search for data sets easily. For example, imagine you are trying to answer a question about India’s education system in 2013, but you’re not sure exactly which data set in particular will help you answer your question (since we have data sets from multiple sources about education). Searching for “India, Education, 2013” will show a list of data sets with all those descriptive labels.
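A search like this boils down to matching keywords against each data set’s static descriptors. The sketch below shows the idea with a tiny in-memory catalogue. The data set names, tag values and completeness figures are all made up, and our cataloguing tool is not implemented this way as far as this post describes.

```python
from dataclasses import dataclass


@dataclass
class CatalogEntry:
    name: str
    static: dict   # tags that stay fixed, e.g. country, year, sector, source
    dynamic: dict  # attributes that change with new versions, e.g. % completeness


# Hypothetical catalogue entries, for illustration only.
catalog = [
    CatalogEntry(
        "School enrolment survey",
        {"country": "India", "year": 2013, "sector": "Education", "source": "Ministry"},
        {"completeness_pct": 92.5},
    ),
    CatalogEntry(
        "Crop yield report",
        {"country": "India", "year": 2014, "sector": "Agriculture", "source": "Ministry"},
        {"completeness_pct": 80.0},
    ),
    CatalogEntry(
        "Teacher census",
        {"country": "India", "year": 2012, "sector": "Education", "source": "Ministry"},
        {"completeness_pct": 99.0},
    ),
]


def search(entries, *keywords):
    """Return the names of entries whose static tags include every keyword."""
    wanted = {keyword.strip().lower() for keyword in keywords}
    results = []
    for entry in entries:
        tags = {str(value).lower() for value in entry.static.values()}
        if wanted <= tags:
            results.append(entry.name)
    return results


matches = search(catalog, "India", "Education", "2013")
```

Keeping the dynamic descriptors in a separate field means they can be refreshed when a new version of a data set arrives, without touching the tags that searches rely on.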
Step 7: Data visualization
With data processing done, we can finally move on to data visualization. We access our transformed data from our visualization platform, where we can visualize the data easily and intuitively on interactive dashboards.
Our dashboards give us very interesting insights in seconds. For example, we can track a KPI and its disaggregation across geographies. So, “the number of women in India receiving maternity benefits” can be viewed with disaggregations like age or region across multiple geographies like states, districts, blocks and villages.
We spend a lot of time researching to determine the most relevant indicators, best types of visualization, and flow for each dashboard. Getting these details right helps us empower decision makers with any level of data or tech knowledge to make better data-driven decisions.
Step 8: Setting up a data pipeline
In most of our deployments, we use real-time data (coming through updated MIS systems or Collect, for example). An internal tool lets us set up data pipelines to automate the entire process, so we don’t have to extract and analyze data and update our dashboards repeatedly. With pipelines, our dashboards automatically reflect data as it updates.
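Conceptually, a pipeline is just the same ordered list of steps re-run whenever new data arrives. The sketch below shows that idea in plain Python. The step names, column names and numbers are assumptions for illustration and do not describe our internal tool.

```python
def drop_incomplete(rows):
    """Cleaning step: skip rows that have no beneficiary count."""
    return [row for row in rows if row.get("beneficiaries") not in (None, "")]


def to_numbers(rows):
    """Cleaning step: convert counts from text to integers."""
    return [{**row, "beneficiaries": int(row["beneficiaries"])} for row in rows]


def total_by_state(rows):
    """Aggregation step: produce the figures a dashboard would display."""
    totals = {}
    for row in rows:
        totals[row["state"]] = totals.get(row["state"], 0) + row["beneficiaries"]
    return totals


PIPELINE = [drop_incomplete, to_numbers, total_by_state]


def run_pipeline(rows, steps=PIPELINE):
    """Feed the output of each step into the next one."""
    data = rows
    for step in steps:
        data = step(data)
    return data


# Made-up data pulls, for illustration only.
first_pull = [
    {"state": "Bihar", "beneficiaries": "120"},
    {"state": "Kerala", "beneficiaries": ""},
]
second_pull = first_pull + [{"state": "Kerala", "beneficiaries": "45"}]

dashboard_before = run_pipeline(first_pull)
dashboard_after = run_pipeline(second_pull)
```

Because every step is a plain function, re-running the list on a fresh pull is all it takes for the dashboard figures to pick up the new Kerala row.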
How we process data from Collect
Data collected on Collect (our mobile data collection tool) is automatically stored as a CSV file on our internal server. Once that data arrives on our server, there are two possible next steps:
- The implementation (per the partnership agreement) ends there. The Administrator is able to view their data and download CSV file(s).
- We pull the CSV file from the Collect server to our visualization platform. Then the data can be displayed on a dashboard.
How we process data sets from multiple sources
You now know how external and collected data are handled separately to become Viz dashboards. When we want to bring both sources of data into one dashboard, we again rely on our data pipeline tool.
- The external data is procured via a web service (or PDF or CSV file), then it’s stored in our internal data catalogue.
- Data is collected from the field using Collect, then it’s automatically stored on the Collect server.
- We set up a pipeline to bring data from our Collect server and our data storage system to our visualization platform.
- Our team takes over to turn this data into a beautiful, functional dashboard!
Interested in joining SocialCops? Check out our Careers page for much more information!
Image credit: Photo by Christina Boemio