Data collection is a minefield of errors. It doesn’t matter whether you’re a researcher with one survey in the field, an NGO with 10 data collection drives per month, or a market research agency with tens of thousands of surveyors — any survey is full of opportunities for errors to slip into your data.
Surveyors may be poorly trained, or well-trained ones may misunderstand a question’s responses or units. They may mistype a respondent’s answer or enter fake data to save time. Respondents may lie to save face, or they may answer randomly if they don’t understand questions. And anyone involved may get confused or distracted, even during short surveys.
In short, errors are inevitable. So how can you deal with them?
Extensively training your data collection team and conducting a thorough pilot are a great start, and they are essential steps of any good survey. But they won’t completely end data errors. To reduce errors even further, turn to data validations. Digital data collection tools have diverse question types and technological tricks to help you easily build validations into your survey and collect spotless data.
Keep reading for an overview of all the different data validations you can add to any survey. This guide is pretty long, so pour yourself some coffee, settle in, and use the table of contents above to skip to your favorite section.
What are data validations?
The most basic data validation is different question types. These are a simple way to ensure that data for each question can only be submitted in one standardized format.
For example, a basic question asking for someone’s age could receive the answers “17”, “seventeen”, and “seventine”. All represent the same value, but a computer won’t know that. A numerical question restricts this to “17”.
Even basic question types can be helpful for validating data. For example, an email question will automatically check that the entered text is a valid email address. A phone number question can check whether the number has the right number of digits, based on its country code.
Data validations also go way beyond different question types. You can place limits within questions, specifying what data can be entered. You can create questions just to keep an eye on surveyors. You can control which questions appear based on someone’s previous answers. You can even flag and re-collect data while your survey is still in progress.
Each type of data validation has pros and cons. Use the right data validations at the right time, and you’ll be amazed how much data quality increases and data cleaning time decreases.
Make your most important questions mandatory
Anyone who has filled out an online form is familiar with this data validation — mandatory questions. Users are required to fill in mandatory questions before they can submit a survey, but they can choose whether to answer non-mandatory questions.
Mandatory questions are helpful because they guarantee that every respondent submits the most critical data. Trying to analyze data without a UID (like an Aadhaar number or employee ID) or draw conclusions without key impact questions, for example, is just a waste of time.
Mandatory questions are simple, but they can be incredibly frustrating if not set up correctly. Imagine a survey that asks Americans what browser they use to access the internet. Mark this question mandatory, and the 11% of Americans who don’t use the internet will be stuck. Some will enter fake data (Netscape it is!) to complete the survey, which skews the final data analysis. Other people will abandon the survey, which means losing the rest of their data.
Bulletproof your multiple choice questions
One of the most common question types, multiple choice questions (MCQs) seem simple enough. With a set list of options, it should be a breeze to keep respondents from submitting bad data, right? If only.
Even the most basic MCQ needs data validations. Without them, someone can select options that don’t make sense, like a 20-year-old who selects both 13-18 and 19-24 as their age bracket.
Here are a few key data validations for MCQs that any data collector should know.
Set a maximum and minimum number of choices
Most MCQs need a limit on the number of choices a respondent can select. Limits help to reduce contradictory or illogical data, since surveyors won’t be able to submit data until they’ve selected the correct number of choices.
The minimum number of choices could be zero. (However, it’s usually better to include an “NA” or “None of the above” option to cover this possibility, rather than letting the question go unfilled). The maximum shouldn’t be greater than the number of options.
Sometimes the maximum and minimum may be the same. Asking someone to report their total yearly income? Restrict respondents to one income bracket. Being in zero or two income brackets isn’t possible, as long as the brackets are MECE (mutually exclusive and collectively exhaustive).
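To make the idea concrete, here is a minimal sketch of a choice-count check in Python. The `validate_choice_count` helper is hypothetical (it isn't part of any specific survey tool); it simply shows the rule that most platforms apply behind the scenes.

```python
def validate_choice_count(selected, min_choices=1, max_choices=1):
    # Valid only if the number of selected options falls within the limits.
    return min_choices <= len(selected) <= max_choices

# Income-bracket question: brackets are MECE, so exactly one choice is allowed.
print(validate_choice_count(["25k-50k"]))             # one bracket: valid
print(validate_choice_count(["25k-50k", "50k-75k"]))  # two brackets: invalid
```

With `min_choices=0`, an empty selection would also pass, which is exactly why an explicit “None of the above” option is usually the better design.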
Pro tip: If the question itself mentions limits, it’s easy to think you don’t need the limits built into the MCQ options. After all, people will just follow the question, right? Unfortunately, no.
For example, the form we use for job applications doesn’t support data validations on multiple choice questions. One question asks candidates for their 3 strongest and 3 weakest skills. We’ve gotten all sorts of unexpected responses, like candidates who select no options or 5 options!
Use dynamic limits
Want to get fancy with minimums and maximums? Make them dynamic, which means the limit is based on the answer to previous questions.
For example, imagine you’re asking which mobile phone brands someone has purchased. To get more accurate data, you can set the maximum number of options to the number of phones they’ve purchased before. Someone who has only purchased 3 mobile phones can’t have purchased from 6 mobile phone brands.
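A dynamic limit can be sketched as a cap computed from an earlier answer. The helper names below are illustrative, not from any particular tool:

```python
def max_brand_choices(phones_purchased, num_brand_options):
    # The effective cap is whichever is smaller: phones bought or brands listed.
    return min(phones_purchased, num_brand_options)

def validate_brands(selected_brands, phones_purchased, num_brand_options):
    return len(selected_brands) <= max_brand_choices(phones_purchased, num_brand_options)

# Someone who bought 3 phones can pick at most 3 of the 6 listed brands.
print(validate_brands(["Nokia", "Samsung"], 3, 6))  # within the dynamic cap
```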
Add logic to “None of the above” and “All of the above”
“None of the above” and “All of the above” are helpful for ensuring that an MCQ is MECE (mutually exclusive and collectively exhaustive). But it’s important to make sure that these options behave correctly, or they can lead to confusing data.
Imagine that you’re surveying homeowners, and one question asks which household appliances they own. These special options should definitely be included. “None of the above” is helpful for new homeowners with few assets, and “All of the above” is helpful for wealthy homeowners with fully stocked homes.
If these options behave like normal — where users can select any combination of options they want — it can lead to errors. What if a user selects “Fridge”, “All of the above” and “None of the above”? How will analysts interpret that data? They’ll most likely end up throwing it out.
We logically know how these options should behave. If someone selects “All of the above”, they shouldn’t be able to select anything else. It’s all already covered. The same is true for “None of the above”. Though this seems intuitive, it’s important to make sure your survey includes this logic. (P.S. Good data collection tools, like our app Collect, will have this logic built into special options for MCQs.)
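The exclusivity rule is easy to express in code. This is a generic sketch of the logic (tools like Collect build it in, so you wouldn’t normally write it yourself):

```python
EXCLUSIVE_OPTIONS = {"All of the above", "None of the above"}

def validate_exclusive(selected):
    # If an exclusive option is selected, it must be the only selection.
    if EXCLUSIVE_OPTIONS & set(selected):
        return len(selected) == 1
    return True

print(validate_exclusive(["Fridge", "Oven"]))             # normal combination: valid
print(validate_exclusive(["Fridge", "All of the above"]))  # contradictory: invalid
```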
Randomize the order of your options
People can only hold a small amount of information in mind at any time. This makes us terrible at choosing from lists, since we have trouble remembering all the items in the list.
People are likely to pick one of the first options in a list (which is called “primacy bias”). People generally rush to finish surveys as quickly as possible, so they pay less attention as lists of options go on. For example, in New York City’s Democratic primary elections and North Dakota’s general elections, candidates received more votes when they were listed first on the ballot.
Alternatively, people are also likely to pick one of the last options they hear (called “recency bias”), since it’s in their short-term memory.
A great way to mitigate these biases is randomizing the order of MCQ options. This means that the order of your options will randomly change for each respondent. (“All of the above” or “None of the above” will always come as the last option though.)
Randomized order is great for most unordered options, like a list of assets that a household has or a list of books. It helps to ensure that surveyors aren’t just going through the motions when they select answers. However, randomization shouldn’t be used if the options have an inherent order, such as age or income brackets.
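Randomization with pinned special options can be sketched in a few lines. This is an illustrative snippet, not the implementation of any specific tool:

```python
import random

SPECIAL_LAST = ("None of the above", "All of the above")

def randomize_options(options):
    # Shuffle the regular options; pin the special options to the end.
    regular = [o for o in options if o not in SPECIAL_LAST]
    special = [o for o in options if o in SPECIAL_LAST]
    random.shuffle(regular)
    return regular + special

# The household-assets list appears in a different order for each respondent,
# but "None of the above" always stays last.
print(randomize_options(["Fridge", "Oven", "TV", "None of the above"]))
```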
Set maximums and minimums for numerical questions
Numerical questions — which only take numbers as answers — help you control what numbers can be submitted. There are two types of numerical data validations.
Value-based validations let you set minimum and maximum values. For example, if you are surveying schoolchildren, you might select 5 as the minimum age and 18 as the maximum age that can be entered.
Soft value-based validations (also called “soft limits”) are similar but more flexible. They also set minimum and maximum values, but when data collectors try to submit data outside of the range, they get a warning instead of a hard stop. They can still choose to submit the data anyway.
Soft limits are useful when there is an expected range of values but you anticipate some outliers. For instance, some children start school early or finish late, so you might make their age question a soft limit. Surveyors can then submit data for children who fall outside the expected age range, after checking to make sure that this age is correct.
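The difference between a hard and a soft limit boils down to one branch. A minimal sketch, using made-up status labels for illustration:

```python
def check_numeric(value, minimum, maximum, soft=False):
    # "ok" inside the range; outside it, "warn" for a soft limit, "error" for a hard one.
    if minimum <= value <= maximum:
        return "ok"
    return "warn" if soft else "error"

# A schoolchild's age, with the expected range 5-18.
print(check_numeric(10, 5, 18))             # in range
print(check_numeric(4, 5, 18))              # hard limit blocks submission
print(check_numeric(4, 5, 18, soft=True))   # soft limit warns but allows it
```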
Limit the number of digits or characters
Imagine that you’re asking for a phone number, but your survey tool doesn’t have a specific phone number question. How can you ensure that you are getting a valid phone number? If you know your country’s phone numbers have 10 digits, you can set a minimum value of 1,000,000,000 and a maximum of 9,999,999,999.
However, that’s a pretty clumsy workaround. There’s a much easier solution — set a limit on the number of digits.
Digit-based validations help you improve data quality by limiting the number of digits in a numerical question. Similarly, character-based validations limit the number of characters in a text question. These data validations are useful when you know exactly how many digits or characters something should have — like an ID code or phone number.
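A digit-based check is much cleaner than the min/max workaround above. Here is one way to sketch it with a regular expression (the helper name is hypothetical):

```python
import re

def has_exact_digits(value, digits):
    # True only if the value consists of exactly `digits` digit characters.
    return re.fullmatch(rf"\d{{{digits}}}", str(value)) is not None

print(has_exact_digits("9876543210", 10))   # a valid 10-digit phone number
print(has_exact_digits("12345", 10))        # too short
print(has_exact_digits("98765x4321", 10))   # contains a non-digit
```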
Add data validation questions
Adding data validations to existing questions is helpful. However, that doesn’t stop surveyors from sitting at home and filling out your forms themselves, or from breaking survey protocols when they are in the field. There are many question types you can add to catch these rogue surveyors and ensure that your data is actually coming from respondents.
Location questions allow surveyors to geotag the location of each survey, even if they don’t have internet. This helps you ensure that surveyors are filling forms from the correct locations.
Besides being a source of valuable data, photo questions are a great way to guard against bad data. Asking surveyors to take a picture of a person, location, or any other data point proves that a data collector actually engaged with what they were supposed to. It’s also useful for verifying that responses are unique, so there won’t be any duplicates in your data set.
Geotagged photo questions kill two birds with one stone. The photo shows that a surveyor engaged with what they were supposed to, while the geotagged location shows exactly where that photo was taken.
Like photo questions, audio features are great for verifying data, and they come in two types. Audio questions, which record a snippet of audio, can be used to verify a particular piece of data and hear it in the respondent’s own words.
You can also use audio audits, which randomly record background audio while surveys are in progress. These verify that surveyors are asking questions accurately and recording what respondents actually say.
Signature questions, which allow respondents to submit their signature, can be useful for verifying that a user has actually submitted a form. They also can be used to confirm or acknowledge any agreements or allow a user to give informed consent for a survey.
Customize your survey for different users
Different users have different needs, so why give them the same survey? Long surveys with irrelevant questions open the door for bad data.
Skip logic (also called conditionality) is a simple way to create surveys that flow seamlessly for respondents and surveyors alike. It allows you to show or hide questions based on the answers to previous questions. For example, with skip logic, only women will be asked if they’re pregnant. Men can skip straight to the next question, without having to waste time marking “NA”.
Skip logic can improve data quality because it reduces the length of a survey, ensures that people don’t answer a question that’s not meant for them, and lets you make questions more relevant for the people who see them. It also lets a survey have more mandatory questions (since each question will only be shown to the right people), which can lead to less missing data.
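Under the hood, skip logic is just a relevance condition attached to each question. A minimal sketch, assuming a simple dictionary-based question format (the `show_if` key is an invented convention for illustration):

```python
def should_show(question, answers):
    # Show a question unconditionally unless it carries a show_if condition.
    condition = question.get("show_if")
    if condition is None:
        return True
    field, expected = condition
    return answers.get(field) == expected

# The pregnancy question only appears for respondents recorded as female.
pregnancy_q = {"id": "pregnant", "show_if": ("sex", "female")}
print(should_show(pregnancy_q, {"sex": "female"}))  # shown
print(should_show(pregnancy_q, {"sex": "male"}))    # skipped
```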
Flag and correct bad data on the fly
While checking your data, you might come across iffy responses, like a school with zero rooms. If you discover this during data cleaning, you’ll be stuck asking what the data means and whether you should keep it. But what if you could flag iffy responses before the data is finalized?
Flagging (also called “resurveying”) is a feature built for that situation. Flagging lets you flag questionable data as it’s being collected. Then your surveyors can recheck and, if necessary, re-collect that data point or survey immediately.
For example, you can send surveyors back to the zero-room school to see exactly what’s going on. Was the zero a mistyped “10”? Or is the school actually a teacher and students under a tree? In either case, the surveyor can check and re-collect accurate data immediately, leading to fewer questions during data cleaning and analysis.
Are data validations right for you?
The short answer — absolutely. Data validations are a simple, affordable way to increase data quality. All you need is a digital data collection tool (which you should be using anyway!) and a bit of time spent building a better survey.
Though there is one catch — data validations need to be added thoughtfully. Create mandatory questions that people won’t be able to answer, add incorrect character limits, or ask for the wrong number of MCQ options, and your respondents will be stuck and confused.
Not sure if you’ve added the right data validations? Pilot them. Run them by your team, sector experts, and field staff. Test them out with a small group of people. Listen to your surveyors and learn what questions are confusing them. Watch the pilot data and see what’s going wrong. The more you test your data validations before you roll them out, the more likely you are to get them right when you start collecting data at scale.
Piloting and flagging can seem like a lot of work up front, but they’re worth the effort. The time you spend on piloting and resurveying will be saved several times over during data cleaning and analysis.
Want to know more about collecting high quality data? We’ve compiled our learnings on how to build a stellar survey in our first ebook. This 30-page guide contains everything you need to know, including full chapters dedicated to how to choose your survey questions, choose the right question types, and write great questions. Download it now.