At SocialCops, data is at the core of all our decisions, from small decisions like changing the header image of our website to big decisions like empowering policy makers with the correct information. We never take a step without consulting our friend, data.
This means that the Data Team at SocialCops is like Jarvis from Iron Man — the central system where information comes in from all teams, is structured and cleaned, and gets dispatched to the right people at the right time. The Partnerships Team comes to us for the data that clients need to make better decisions. The Engineering Team comes to us for feedback on their state-of-the-art data tools, which help us process gigabytes and petabytes of data with ease.
SocialCops’ partners even come to us for feedback on their indicators and models. For example, the Data Team recently met engineers from the Ministry of Drinking Water and Sanitation to identify the best way to manage India’s water. We presented our models to these engineers, brainstormed the best solution for them, and even checked our models by speaking with people enrolled in the Ministry’s schemes.
With so many people coming to the Data Team for guidance on big decisions, the pressure is on. Our work impacts millions of households, so we have to be sure of every step we take.
Our Data Pipeline
Here’s a look into our data pipeline — the steps we follow to use data and make decisions in the SocialCops Data Team.
1. Setting the objective
Sometimes, data scientists are lucky enough to receive clear-cut problem statements — e.g. increasing the number of downloads on the Google Play Store, or figuring out the right marketing channels for promoting a new product. Not at SocialCops. Our problems are typically far more open-ended — e.g. improving an $8 million agricultural investment, or spurring rapid village development to create a model constituency.
Albert Einstein reportedly said, “If I had an hour to solve a problem, I’d spend 55 minutes thinking about the problem and 5 minutes thinking about solutions.” At SocialCops, we spend those 55 minutes working with our partners: understanding the roots of their problems, mapping their current decision-making flows, and brainstorming potential solutions together.
At times, this brainstorming surfaces even bigger or vaguer problems. For example, while discussing policies and schemes with OSDs (Officers on Special Duty) at the Government of India, we were asked to effectively implement a brand-new workflow. When that happens, we buckle up: the real work has only just begun.
2. Working with the data
The data we work with is a far cry from the clean, structured exports of a MongoDB or SQL database. We usually have to get in touch with multiple people across a government or organization, connect various dots to find and gain access to the data sets we need, and finally receive either a web service link (which is never 100% reliable) or gigabytes of Excel or PDF files. State departments, and sometimes even ministries, maintain their own separate databases, which rarely integrate well with each other.
But let’s focus on solutions, not problems. Since we’ve been dealing with difficult data like this for a couple of years, we have developed a data transformation engine called Transform. It runs on top of every data set we use, checks for issues, and transforms the data into the right format and structure.
Yes, data cleaning can often take 80% of a data scientist’s time, but with Transform we want to reduce this to a mere 20%. Many issues have already been nailed down, new ones surface every day, and our objective is to solve each of them not once but for good.
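Transform itself isn’t public, but the kind of check-and-coerce pass it describes is easy to picture. Here’s a minimal, hypothetical sketch of one such step — normalizing messy column names, flagging missing-value markers, and parsing numeric strings (including Indian comma grouping). All names here are illustrative, not SocialCops’ actual code:

```python
# Hypothetical sketch of a single cleaning pass, in the spirit of Transform.
# A sample messy row, like something exported from a government Excel sheet:
RAW = {" District Name": "Pune ", "Households ": "1,20,450", "LPG Coverage %": "N/A"}

def clean_record(record, numeric_fields):
    """Normalize keys, strip whitespace, coerce numeric strings,
    and collect issues rather than failing silently."""
    cleaned, issues = {}, []
    for key, value in record.items():
        # "District Name " -> "district_name"
        norm_key = key.strip().lower().replace(" ", "_")
        if isinstance(value, str):
            value = value.strip()
        if value in ("", "NA", "N/A"):          # common missing-value markers
            issues.append(f"{norm_key}: missing")
            cleaned[norm_key] = None
            continue
        if norm_key in numeric_fields:
            try:
                # handle grouped digits like "1,20,450"
                value = float(str(value).replace(",", ""))
            except ValueError:
                issues.append(f"{norm_key}: unparseable {value!r}")
                value = None
        cleaned[norm_key] = value
    return cleaned, issues
```

In a real pipeline, the collected `issues` would feed a report back to the data provider instead of being silently dropped — solving the issue at its source is what “not once but for good” means in practice.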
3. Optimizing our solutions for massive scale
What makes solving problems with governments and social sector organizations so challenging is the scale. In a country as diverse as India, everything — people, geography, water, sand, and more — changes within a hundred-kilometer radius, so generalized solutions rarely give fruitful results.
Recently, we worked on a major problem: where to open 10,000 new centers to distribute LPG (liquefied petroleum gas) cylinders to women below the poverty line. This initiative will reach 50 million women across India, and creating a solution that worked across such diverse geographies was a big challenge. We ended up incorporating an enormous amount of data to account for India’s sheer variability.
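The actual models behind that work aren’t public, but siting problems like this are classically framed as facility location or maximum coverage, often attacked with a greedy heuristic. As a toy illustration only (not SocialCops’ method), here is a greedy picker that repeatedly chooses the candidate site serving the most not-yet-covered villages:

```python
# Toy maximum-coverage sketch: choose k sites to cover the most villages.
# `candidates` maps a site name to the set of village ids it could serve.
# Note: the function consumes `candidates`, so pass a copy if you need it later.

def pick_centers(candidates, k):
    chosen, covered = [], set()
    for _ in range(k):
        # the site that adds the most villages not yet covered
        best = max(candidates, key=lambda s: len(candidates[s] - covered), default=None)
        if best is None or not (candidates[best] - covered):
            break  # nothing left to gain
        chosen.append(best)
        covered |= candidates.pop(best)
    return chosen, covered
```

The greedy heuristic is attractive at this scale because it is fast and carries a well-known approximation guarantee for coverage objectives; a real deployment would also weigh demand, terrain, and supply-chain constraints, which is where all that extra data comes in.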
4. Validating our solutions
In the LPG problem above, figuring out a solution was hard. But the next step — making sure every government official understood each and every variable that went into our equation — was perhaps even harder.
Our work is never just about running a random forest or a neural network to get to a solution. We always have to think about how the solution will be validated and implemented, which is rare in data science work. So merely measuring precision and recall or calculating an AUC (area under the curve), though helpful, is not enough. Our models are always validated and scrutinized by real people who want to understand every choice we made. This is our chance to convince them why they should believe in our results.
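The metrics themselves are the easy part. For reference, here is a minimal sketch of what an AUC actually measures — the probability that a randomly chosen positive example is scored above a randomly chosen negative one (ties count half). This is the standard rank-statistic formulation, not anything SocialCops-specific:

```python
# AUC as a rank statistic: fraction of (positive, negative) pairs
# where the positive example gets the higher score; ties count 0.5.

def auc(scores, labels):
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    pairs = len(pos) * len(neg)
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / pairs
```

A single number like this summarizes ranking quality, but it says nothing about *why* the model ranks one district above another — which is exactly the question a government official will ask.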
Growing the Data Team
That was just a glimpse of what we do on a daily basis. That’s not all — we also build dashboards that drive big decisions for our partners, work to make our data experience as engaging and helpful as possible, run tons of experiments with our tech stack, explore opportunities to integrate natural language processing and machine learning into our work, and more.
But most importantly, data science is a team sport and shouldn’t be played alone. We’ve learned a lot about hiring for the Data Team as we’ve grown. Recently, we started receiving close to 100 applications a week for a Data Analyst role. Though this number was encouraging, converting those applications into hires was less successful. Candidates applied from every background, be it economics, physics, computer science, or even the public sector, and it was tough to judge them all by the same yardstick. For example, some candidates were skilled economists with deep econometrics knowledge, engineers who were amazing at automating data processes, or machine learning enthusiasts with a gift for predictive analytics, but they were new to the challenges of public data. Meanwhile, other candidates weren’t as skilled at handling big data processes, but they were excellent social researchers with deep sector knowledge who knew the ins and outs of government policies, schemes, and data sets.
To build a team, you can’t expect every hire to have all of these skills. You need people with diverse talents who balance one another. So, after looking at the numbers, we decided to diversify the profiles we were hiring for. We now have five new positions for applicants to consider, targeted at both experienced data and analytics professionals and graduates fresh out of college.
We are always looking for passionate individuals to join our team — folks who can play the team sport to the best of their abilities, and who are excited about changing lives as we go.
Take a look at the roles on our data team. If you think you’re a good fit for multiple roles, apply to the best option, and we’ll be sure to tweak and refine your role as you make your way through our interviews!