There are over 1.4 million schools in India. With different kinds of schooling systems and a multitude of “school” types (government-owned schools, central government schools, state government schools, low-cost private schools, private schools), it has been almost impossible to track school performance, utilization of funds and other important parameters. And, as they say, “What can’t be measured cannot be improved.”
DISE and Its Origins
The need for a centralized school statistics database resulted in the creation of the District Information System for Education (DISE) in 1995. Beginning with a few districts, DISE now covers almost all the eligible schools across most of the districts in India with NUEPA (National University of Educational Planning and Administration, New Delhi) being the agency responsible for data collection and dissemination.
For the academic year 2013-14, it covered 14.5 lakh schools across 662 districts, excluding the Ariyalur and Tirupur districts of Tamil Nadu. This is a rich database that provides basic information about schools, facilities in schools, enrollments, teachers, and other general information such as work days, number of inspections, etc. Crucially, it also provides information related to the Right to Education Act 2009 (RTE). The disseminated raw data contains over 240 fields. Despite our misgivings regarding data quality, NUEPA needs to be complimented for creating such a huge database.
Having worked with DISE raw data for the past year, we have some concerns regarding the data quality. Anomalies such as a school with a single teacher for 12,000 students and schools with a negative number of classrooms raised red flags for us.
Data Quality Issues on Consistency and Validation
Although DISE is believed to be following basic data check rules such as data validation, data consistency check, and 5% sample check, the final data disseminated to the public still has many data quality issues.
Take the case of the establishment year for each school. Its values range from as low as -1 to as high as 5005, which are simply impossible values for a year of establishment. Now consider classrooms. There are 23,149 schools with zero classrooms. In theory a school may exist without any classrooms (for example, after a calamity), but some of these schools do so well on other indicators that it is implausible for them to have no classrooms.
There are over 4,500 schools that have five or more facilities or favorable ratios but have no classrooms, compounding our concern about data quality. A school named “PRIVT.HS.SAVITRI VIDYA PEETH” at Pipariya in Hoshangabad district in Madhya Pradesh reportedly has negative classrooms.
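The range problems described above (impossible establishment years, negative classroom counts) are the kind of thing a simple bounds check can catch. The following is a minimal sketch, not the validation DISE actually runs: the field names `school_code`, `establishment_year` and `classrooms`, and the bounds themselves, are assumptions made for illustration.

```python
from datetime import date

def find_range_violations(records, current_year=None):
    """Return (school_code, field, value) for values outside plausible bounds."""
    if current_year is None:
        current_year = date.today().year
    # Assumed bounds, chosen for illustration only.
    rules = {
        "establishment_year": (1800, current_year),
        "classrooms": (0, 500),
    }
    violations = []
    for rec in records:
        for field, (low, high) in rules.items():
            value = rec.get(field)
            # A missing value is reported as well as an out-of-range one.
            if value is None or not (low <= value <= high):
                violations.append((rec["school_code"], field, value))
    return violations

# Hypothetical records, not taken from the DISE raw data.
sample = [
    {"school_code": "A1", "establishment_year": 1987, "classrooms": 6},
    {"school_code": "B2", "establishment_year": -1, "classrooms": 4},
    {"school_code": "C3", "establishment_year": 5005, "classrooms": -2},
]
print(find_range_violations(sample, current_year=2014))
```

A check like this only says that a value is implausible; deciding what the correct value should be still requires going back to the school or the original questionnaire.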
On the same note, the financial aid received by some schools was found to be negative. In around 9,000 private unaided schools, the number of students enrolled in Class 1 from weaker sections of society exceeded the number of students who applied from those sections. These types of data quality issues seriously hamper the credibility of DISE data.
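Several of the anomalies in this section only show up when two or more fields are compared with each other. A rough sketch of such cross-field checks follows. All field names (`classrooms`, `grant_received`, `ews_applied`, `ews_enrolled` and the `has_*` facility flags) are hypothetical, and the threshold of five facilities simply mirrors the figure quoted above.

```python
def cross_field_flags(rec, facility_fields, min_facilities=5):
    """Return a list of human-readable consistency warnings for one school."""
    flags = []
    # Count how many of the listed facilities the school reports having.
    facilities = sum(1 for f in facility_fields if rec.get(f))
    if rec.get("classrooms") == 0 and facilities >= min_facilities:
        flags.append("zero classrooms but many facilities")
    if rec.get("grant_received", 0) < 0:
        flags.append("negative financial aid")
    if rec.get("ews_enrolled", 0) > rec.get("ews_applied", 0):
        flags.append("more weaker-section students enrolled than applied")
    return flags

# Hypothetical facility columns and records for illustration.
FACILITIES = ["has_library", "has_playground", "has_electricity",
              "has_drinking_water", "has_boundary_wall"]

suspect = {"classrooms": 0, "grant_received": -500,
           "ews_applied": 10, "ews_enrolled": 14,
           **{f: True for f in FACILITIES}}
clean = {"classrooms": 5, "grant_received": 1000,
         "ews_applied": 10, "ews_enrolled": 8,
         **{f: True for f in FACILITIES}}

print(cross_field_flags(suspect, FACILITIES))
print(cross_field_flags(clean, FACILITIES))
```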
Data Quality Issues with Multiple Choice Questions
In multiple choice questions (MCQ), the questionnaire used to collect data has specific codes for specific answers. However, final results exhibit codes which are not even on the code list.
For example, in “Medium of Instruction”, the questionnaire uses codes from 1 to 29 for different languages and 99 for languages that are not on the list. Surprisingly, the raw data includes codes such as 30 to 39 for around 250 schools.
These kinds of data errors are not expected, given that the questionnaires are filled in by teachers. They might be data entry errors arising from the manual filling of questionnaires. However, in this era of technology, such problems can be solved easily by using digital data collection tools that reject invalid codes at the point of entry.
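As an illustration of how a code-list check could work, the sketch below validates “Medium of Instruction” against the codes described above (1 to 29, plus 99 for unlisted languages). The field names `school_code` and `medium` are assumptions, and the sample records are invented.

```python
# Codes allowed by the questionnaire as described in the text:
# 1 to 29 for listed languages, 99 for any other language.
VALID_MEDIUM_CODES = set(range(1, 30)) | {99}

def invalid_medium_codes(records):
    """Return (school_code, medium) pairs whose code is not on the code list."""
    return [
        (rec["school_code"], rec["medium"])
        for rec in records
        if rec["medium"] not in VALID_MEDIUM_CODES
    ]

# Hypothetical records for illustration.
schools = [
    {"school_code": "S1", "medium": 5},
    {"school_code": "S2", "medium": 99},
    {"school_code": "S3", "medium": 34},
    {"school_code": "S4", "medium": 0},
]
print(invalid_medium_codes(schools))
```

The same membership test could run inside a data entry form, so that an off-list code is rejected before it ever reaches the database.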
Data Quality Issues with Conditionality
Another data quality issue pertains to conditional questions. To give some context, some questions can only be answered if a previous question was answered in a certain way. For example, a question like “Are you pregnant?” can only be asked if the respondent is female. However, in DISE data, some of the variables do not satisfy this criterion.
For example, Pupil Cumulative Records (PCR) can be maintained by a school only if Continuous and Comprehensive Evaluation (CCE) has been implemented by the school. However, DISE data shows that more than 50,000 schools maintain PCR even though they have not implemented CCE.
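The PCR and CCE dependency can be expressed as a simple conditional rule. The sketch below assumes two boolean fields, `pcr_maintained` and `cce_implemented`; these names and the boolean encoding are hypothetical, and the raw data may encode the answers differently (for example, as numeric yes/no codes).

```python
def violates_pcr_rule(rec):
    """True if a school reports PCR without reporting that CCE is implemented."""
    return rec.get("pcr_maintained") is True and rec.get("cce_implemented") is not True

# Hypothetical records for illustration.
records = [
    {"school_code": "P1", "cce_implemented": True, "pcr_maintained": True},
    {"school_code": "P2", "cce_implemented": False, "pcr_maintained": True},
    {"school_code": "P3", "cce_implemented": False, "pcr_maintained": False},
]
flagged = [rec["school_code"] for rec in records if violates_pcr_rule(rec)]
print(flagged)
```

In a digital questionnaire, the same rule could be applied as skip logic: the PCR question would only be shown when CCE has been marked as implemented.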
While the work done for DISE is prodigious given the scale, the quality concerns necessitate technological interventions to make the education data spotless and reliable. Currently, most of the anomalies can be attributed to data entry errors… or so we hope. At the very least, this makes the case for migrating DISE data collection to technology even stronger.