I did not have a traditional college experience. While studying at Grinnell College, I spent a year living in Pune, Delhi and Mumbai, where I completed two internships, learned Hindi, traveled the country, and studied Economics at St. Stephen’s College. That year inspired me to leave my home in Kentucky and move back to India as soon as I graduated. As a William J. Clinton Fellow with the American India Foundation, I received several fellowship placement options and ultimately joined SocialCops as a Data Analyst. Though I’ve only been at SocialCops for two months, its fast-paced work environment and massive set of open data have made it possible to dive right into a wide variety of exciting projects.
As a Data Analyst at SocialCops, every day I am motivated by three long-term goals: filling data gaps, building transparency in the way governments and organizations operate and, most importantly, making our data stack accessible to stakeholders so they can use data to make better decisions.
Filling the Data Gaps
Day to day, working towards our goal of powering better decisions involves extensive data collection, data cleaning, and the development of data wrangling tools.
The Data Team collects and processes approximately 5,000 secondary data files per week spanning a variety of sectors, including health, agriculture, economics, demographics, and education.
We collect, clean and present data from these sectors in user-friendly formats that will soon be available through our open data platform.
Making secondary data available in a user-friendly format is a necessary step towards data transparency, accessibility and utilization.
Data sets available in PDF or image formats require several intermediate steps of data processing before data can be analyzed. For example, one data set on District Domestic Product from 1999-2009 published by the Planning Commission is stored in 197 separate PDF files.
Other data sets are stored online and are difficult to retrieve all at once. For example, to get Gram Panchayat-level data from NREGA’s public data portal, our engineering team scraped around 40,000 separate files. These then had to be cleaned and appended through automated processes.
The bottom line – any stakeholder wishing to use these data sets would first need to figure out how to convert them to a final tabular format before performing an analysis. In the end, they may not use the data because it is inaccessible and, as a result, insights that could be afforded by the data would be lost.
Formatting is not the only challenge: secondary data is also difficult to obtain at the district level.
For example, major data sets such as the National Sample Surveys are conducted at the district level. However, the operating agencies report their findings (in PDF format) only at the state or regional level. Unit-level data must be purchased, extracted, and analyzed by someone with a data background before district-level analyses can be performed, and even then only if the sample size is adequate for a district-level analysis.
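To make the sample size caveat concrete, here is a minimal sketch of rolling unit-level records up to district-level estimates. It is an illustration rather than our actual method: the field names (`district`, `value`, `weight`), the helper `district_estimates`, and the minimum sample threshold are all hypothetical, and a real survey would call for the agency's own weighting and estimation guidance.

```python
from collections import defaultdict


def district_estimates(records, value_key, weight_key, min_sample=30):
    """Return a weighted mean per district, or None where the sample is too small.

    min_sample is an arbitrary illustrative threshold, not an official rule.
    """
    groups = defaultdict(list)
    for rec in records:
        groups[rec["district"]].append(rec)

    estimates = {}
    for district, recs in groups.items():
        if len(recs) < min_sample:
            # Too few observations to report a district-level figure.
            estimates[district] = None
            continue
        total_weight = sum(r[weight_key] for r in recs)
        weighted_sum = sum(r[value_key] * r[weight_key] for r in recs)
        estimates[district] = weighted_sum / total_weight
    return estimates


# Made-up unit-level records for two hypothetical districts.
sample_records = [
    {"district": "District X", "value": 10, "weight": 1},
    {"district": "District X", "value": 20, "weight": 3},
    {"district": "District Y", "value": 15, "weight": 2},
]
estimates = district_estimates(sample_records, "value", "weight", min_sample=2)
```

With the threshold set to 2 for this toy data, District X gets a weighted mean while District Y, with a single record, is left blank instead of being reported on thin evidence.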
State-level data is inadequate, especially for stakeholders looking to make decisions about resource allocation and development planning. By making district-level data available, we empower leaders to incorporate data into their decisions.
Despite the buzz about “big data” and its benefits for governance and development, few holistic solutions exist for data cleaning.
We use a variety of data wrangling tools, including R, Excel, Stata, Python, Tabula, and Smallpdf. No single tool solves all of our data cleaning challenges, so we often cycle through several programs in the process of cleaning one data set.
For example, we may download a batch of PDF files using the DownThemAll! Mozilla Firefox add-on, convert them to CSV files using Tabula, read the CSV files into R to automate the cleaning process, export these files from R to Excel, then add relevant metadata before saving the final file.
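The automated cleaning step in the middle of that workflow might look something like the sketch below. We do this step in R; the sketch uses Python's standard library so that it is self-contained. The helpers `clean_rows` and `append_files`, the `source_file` metadata column, and the sample data are all hypothetical, and the CSV text stands in for files already exported from a tool like Tabula.

```python
import csv
import io


def clean_rows(raw_csv_text, source_name):
    """Parse one CSV export, tidy headers and cells, and tag rows with their source."""
    reader = csv.reader(io.StringIO(raw_csv_text))
    rows = [[cell.strip() for cell in row] for row in reader]
    # Drop rows that are entirely empty, which are common after PDF table extraction.
    rows = [row for row in rows if any(row)]
    if not rows:
        return []
    # Normalise header names so files with slightly different headers line up.
    header = [h.lower().replace(" ", "_") for h in rows[0]]
    cleaned = []
    for row in rows[1:]:
        record = dict(zip(header, row))
        record["source_file"] = source_name  # simple metadata column
        cleaned.append(record)
    return cleaned


def append_files(files):
    """Clean each file and append the results into one list of records."""
    combined = []
    for name, text in files.items():
        combined.extend(clean_rows(text, name))
    return combined


# Made-up CSV text standing in for two converted PDF tables.
sample_files = {
    "district_a.csv": "District , Year ,GDP\n Pune ,2008, 100 \n,,\n",
    "district_b.csv": "District,Year,GDP\nNashik,2008,80\n",
}
combined = append_files(sample_files)
```

Keeping the source file name on every row is a small design choice that makes it easier to trace a suspicious value back to the PDF it came from.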
Developing a Better Product
Unlike at other organizations, Data Analysts at SocialCops work with engineers on developing products to solve our data processing challenges in real time.
Our Engineering Team is developing in-house data cleaning software designed to make data wrangling possible for analysts at every level of expertise. In addition to cleaning data with existing software, we also do data wrangling on our own platform. Working on our team’s platform lets us give feedback that leads to almost instant fixes for our data cleaning challenges.
By working with the engineering team, we are coming closer to a holistic solution for data wrangling, one that lets stakeholders spend less time preparing data and more time analyzing it to inform decisions.
Building Intelligent Indices
Once data sets are cleaned and available at the district level, our data team creates indices that rank districts according to performance across sectors. We collaborate with sector experts to derive insights from our data stack and construct relevant indicators that form the basis of our indices.
For example, this month we worked with an agriculture expert to design an index that measures performance of districts across a holistic range of indicators related to productivity, assets, gender equality, and health in the agriculture sector.
We are developing a methodology for constructing indices that can be both replicated and executed within our data wrangling software, so that index creation becomes part of our decision-making platform.
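As a rough illustration of what a replicable index can look like, the sketch below rescales each indicator to a 0 to 1 range (min-max normalization), averages the indicators with equal weights, and ranks districts by the result. This is one common approach and not necessarily the methodology we will settle on. The district names, indicator names, and values are invented, and the sketch assumes a higher value is better for every indicator.

```python
def min_max(values):
    """Rescale a list of numbers to the 0-1 range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All districts are identical on this indicator, so it carries no signal.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def build_index(data, indicators):
    """Return (district, score, rank) tuples, best-scoring district first.

    Assumes equal weights and that higher values are better for every indicator.
    """
    districts = list(data)
    scores = {d: 0.0 for d in districts}
    for indicator in indicators:
        normalised = min_max([data[d][indicator] for d in districts])
        for district, value in zip(districts, normalised):
            scores[district] += value / len(indicators)
    ranked = sorted(districts, key=lambda d: scores[d], reverse=True)
    return [(d, round(scores[d], 3), rank) for rank, d in enumerate(ranked, start=1)]


# Invented values for three hypothetical districts.
sample_data = {
    "District A": {"crop_yield": 20, "irrigated_share": 50},
    "District B": {"crop_yield": 30, "irrigated_share": 20},
    "District C": {"crop_yield": 10, "irrigated_share": 10},
}
index = build_index(sample_data, ["crop_yield", "irrigated_share"])
```

In practice, the choice of weights and the direction of each indicator are exactly the kind of decisions where a sector expert's input matters.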
Sharing Our Learnings
While our data stack is not yet available through our open data platform (look for a launch soon!), we work closely with the Growth, Design and Viz teams to make our data insights publicly available through infographics, maps, and other data visualizations such as web dashboards.
For example, our agriculture data stack is available on a dashboard that enables the user to visualize district comparisons on a map using indicators from a variety of sectors including Crop Productivity, Agricultural Inputs, Nutrition, and Women’s Empowerment. We also file approximately 50 Right to Information applications (RTIs) per month and publish the results in an effort to fill gaps where data is incomplete or unavailable.
Growing My Learnings
As a Data Analyst at SocialCops, I am part of an organization that actively encourages learning. A colleague on the Data Team leads bi-weekly sessions on R programming. I also attend weekly “Teach on Thursday” sessions, where SocialCops team members teach concepts and ideas that help me grow professionally, and I am encouraged to take the time to learn new programs that will improve and scale my data analysis skills.
As our data stack grows and our methodology for creating indices evolves, we also develop as professionals. We are encouraged to learn as we go and share knowledge across teams to reach our collective goal of powering India’s most important decisions.