When we started SocialCops, our aim was to drive decision making through data in places where it really mattered. Now, that shouldn’t be a hard problem, right? For those of us in the digital world, data has been about identifying patterns, beautiful visualizations, and predictive algorithms.

Solving real-world problems is a lot harder, something that we learned the hard way. The problems, for one, are a lot more complex.

1. No data. Outdated data.

The Indian Census happens once in 10 years. Enough said. Public data is collected sporadically and after long gaps; the Census, for example, surveys households only once every 10 years. Researchers and policymakers struggle to make decisions in the absence of relevant data, which leads to poor decisions and heavy losses on investments.

According to Census data, in the decades 1981–1991 and 1991–2001, Nagaland’s population grew by 56.09% and 64.53% respectively. But the 2011 Census reveals that, between 2001 and 2011, Nagaland’s population shrank by 0.47%. This means that Nagaland’s population declined in absolute terms, despite the absence of war, famine, natural calamities, political disturbance, or dramatic changes in the socio-economic correlates of fertility. This is unprecedented in the history of independent India. Does this mean that Nagaland’s population was overestimated in the earlier censuses? Or was there something wrong with the 2011 Census?
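A break like this is easy to catch with a simple consistency check. The Python sketch below is illustrative only: it uses the three decadal growth rates quoted above, and the function name `flag_growth_breaks` and the 30-percentage-point threshold are our own assumptions for the example, not an official statistical test.

```python
def flag_growth_breaks(growth_by_decade, max_swing=30.0):
    """Flag decades whose growth rate moved by more than `max_swing`
    percentage points relative to the previous decade.

    `max_swing` is an arbitrary, hypothetical threshold chosen for
    illustration; a real check would need a defensible cutoff.
    """
    decades = list(growth_by_decade)  # dicts preserve insertion order
    flagged = []
    for prev, curr in zip(decades, decades[1:]):
        swing = growth_by_decade[curr] - growth_by_decade[prev]
        if abs(swing) > max_swing:
            flagged.append((curr, round(swing, 2)))
    return flagged


# Decadal growth rates (in percent) for Nagaland, as quoted in the text.
nagaland = {"1981-1991": 56.09, "1991-2001": 64.53, "2001-2011": -0.47}

flagged = flag_growth_breaks(nagaland)
for decade, swing in flagged:
    print(f"{decade}: growth changed by {swing} percentage points")
```

A check like this only says that something is off. It cannot say whether the earlier censuses overcounted or the 2011 Census undercounted.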

Despite glaring inaccuracies like this in the data, and despite the state government rejecting the figures from the 2001 Census, we are forced to keep using Nagaland’s flawed population series for sampling and analysis for lack of a better alternative!

It became very clear to us that solving a developing-world problem needed a developing-world solution. To solve the decision-making problem in India, it is incredibly important to solve the data-collection problem.

Some of the most successful Indian startups became successful because they took a global problem and solved it in an Indian way! Flipkart flipped the traditional e-commerce model by offering cash on delivery and Zomato flipped the restaurant search model by spending time and money on collecting restaurant data.

At SocialCops, we hope to flip the “big data” model by not just building a great analytics engine, but also building our own tech to collect, check, and verify data in real time. 

2. Bad data.

At SocialCops, we recently visited an organization that does massive amounts of work with the government in revolutionizing healthcare. The economics grad, who reads “Information is Beautiful” day in and day out, looked at our household-level visualization tool and said, “This is great. But we don’t believe the underlying data as of today! So what is the point of analytics and visualizations on it?”

It is pretty easy to understand why most smart, educated people would raise their eyebrows at the validity of public data when one understands the difficulties involved in data collection.

For instance, DISE is an annual data collection survey from schools in India. This is how the survey is undertaken:

STEP 1: The central authority mails a paper form to the different schools in India.

STEP 2: The headmaster takes on the responsibility of filling in the paper form with basic details, such as the number of toilets, ramps, etc.

STEP 3: The headmaster mails the completed form back to the central authority.

An analysis of raw DISE data shows that a number of parameters are missing and unfilled. It doesn’t take a rocket scientist to figure out why.
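To show what such an analysis can look like, here is a minimal Python sketch that computes the share of forms in which each field was left blank. The records and field names (`school_id`, `toilets`, `ramps`) are made up for illustration and are not the actual DISE schema.

```python
from collections import Counter


def missing_rates(records, fields):
    """Return, for each field, the fraction of records where it is
    absent, None, or an empty/whitespace-only string."""
    if not records:
        return {}
    missing = Counter()
    for record in records:
        for field in fields:
            value = record.get(field)
            if value is None or str(value).strip() == "":
                missing[field] += 1
    return {field: missing[field] / len(records) for field in fields}


# Hypothetical survey forms, invented for this example.
forms = [
    {"school_id": "A", "toilets": 4, "ramps": None},
    {"school_id": "B", "toilets": "", "ramps": 1},
    {"school_id": "C", "toilets": 2},  # "ramps" never filled in
]

rates = missing_rates(forms, ["toilets", "ramps"])
for field, rate in rates.items():
    print(f"{field}: {rate:.0%} missing")
```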

Now you’d think this was an easy problem to solve — build a mobile-based data collection tool. But making data collection tools work in remote parts of the world is not easy.  Building for the real world involves optimizing for low memory, lack of internet, sporadic electricity, and regional language support.

Building our Collect tool was an intensive 6-month process involving multiple pilots, field visits, and information from the field.

3. Where is the data?

Now let’s just forget points 1 and 2 for a while. Let’s assume that data quality has no issues. Now, as someone looking for data online, you’ll quickly realize how inaccessible public data is. Public data today is released in the form of obscure PDF documents and HTML tables, making it impossible to use this data for decision-making purposes.

Satellite images mapping flood-inundated areas and their proximity to roads and district boundaries in part of Uttarakhand. Available from ISRO in the public domain.

Electoral voter roll in Hindi showcasing voter demographics broken down by constituency in India. Available as PDF files.

The wealth of information that unused public data can give companies, organizations, and nonprofits never ceases to amaze me. Using public data, we can break down India to the household level in the remotest of our villages. Mashing this data with purchasing power can help companies understand how to plan rural expansion. Image processing of the ISRO images might help understand how to mitigate floods.

In October 2014, we started building a dashboard to understand healthcare interventions along with a policy partner. We created a dashboard for trends in anemia, hygiene and sanitation practices, food and nutrient intake, and improved access to healthcare facilities from different open data sources.

A national and state level dashboard to measure efficacy of healthcare interventions and drive national level policy in several states. Uses 6 different public data sources.

The results astonished us. It became easy to see that Andhra Pradesh’s indicators had improved greatly over the years. This made sense since AP had introduced a mid-day meal scheme for mothers that was possibly leading to improvement.

Just the scale of being able to affect these large-scale decisions through better data made us hold our breath. We had answered our question. Yes, public data presented a trove of information that could be used for better decision making to solve massive problems – but it was never going to be used in this current format. Thus began our ambitious project of making public data accessible. We are building the developing world’s first search engine for public data — identifying data sources, cleaning data, tagging data and making it searchable by data point.

Making public data in India searchable is a hard engineering problem to solve, much more so due to the inconsistencies in data. The data is in different formats, unstructured, and impossible to parse through one Python script. Yes, local language PDF files are the reason our engineers think PDF is evil.

Over time, we are hoping to build a system that learns from different kinds of file types, recognizes data types, automatically extracts data, and even deals with the difficulties in cleaning, tagging and managing data. An intelligent system that could identify outliers and inconsistencies could reduce time taken in data management cycles by over 60%.

This is difficult. Yet the possibility of driving millions of decisions that could affect billions of people makes all the hard work worth it!

4. What does all this data mean?

Data means nothing until it tells a story. We’ve realized that all these troves of data might not lead to better decision making unless they tell a story.

For instance, a district education officer might know that a school has 500 boys and 300 girls from the annual education survey. From another survey, he might know that the school has 5 toilets. But does he know that there are 500 boys and 300 girls, 4 boy toilets and 1 girl toilet, which means the school needs 2 more girl toilets? And this means he needs to allocate INR 75,000 to fixing toilets in the school? No. And this story is exactly what visualizations can tell decision makers.
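The arithmetic behind that story can be sketched in a few lines of Python. The example above does not state the norm it uses, so this sketch makes two assumptions that happen to reproduce its numbers: girls should get the same pupils-per-toilet ratio that boys already have, and each new toilet costs INR 37,500 (INR 75,000 split across 2 toilets). The function name is ours.

```python
import math


def extra_girl_toilets(boys, girls, boy_toilets, girl_toilets):
    """Toilets to add so girls match the boys' pupils-per-toilet ratio.

    Assumption for illustration: parity with the boys' current ratio
    is the target. An official norm may differ.
    """
    pupils_per_toilet = boys / boy_toilets          # 500 / 4 = 125
    needed = math.ceil(girls / pupils_per_toilet)   # ceil(300 / 125) = 3
    return max(0, needed - girl_toilets)


# Assumed unit cost, derived from the INR 75,000 figure in the text.
COST_PER_TOILET_INR = 37_500

extra = extra_girl_toilets(boys=500, girls=300, boy_toilets=4, girl_toilets=1)
budget = extra * COST_PER_TOILET_INR
print(f"Build {extra} more girl toilets, budget INR {budget:,}")
```

The point is that neither survey alone contains this answer. It only appears once the enrollment data and the infrastructure data are joined for the same school.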

This is a screenshot of a household-level visualization of a village. The top left image shows how Google Earth sees the village. The top right is a household-level visualization that we built by visualizing the houses. The bottom image shows households shaded by income level, healthcare, and education.

The visualization shows interesting patterns in how houses are clustered by income. Such patterns could relate to social factors, and the same view could help link a diarrhea outbreak to a drinking water source or find patterns in disease outbreaks.

The last step in driving decisions from data involves being able to find stories in data. At SocialCops, we believe strongly in the power of images and visualizations for driving insights. Through our visualization engine, we hope to help organizations and decision makers make sense of hundreds of pages of data through a simple visualization. Dashboards, household level views, maps – we are building an internal engine to understand data and identify the best way to visualize it.

Today is the age of full-stack startups. Taxi companies existed for a long time, and the biggest taxi company could possibly have country-wide operations. Then Uber came along and added a layer of technology to aggregate taxi drivers – with that, they revolutionized the taxi industry. At SocialCops, we like to think that we are building a full-stack data startup, all the way from finding the data and cleaning the data to making sense of the data.