Imagine that you have a room filled with dozens of sleeping cats, and you want to know how many cats there are. It would also be good to gain some basic insights about your new cat colony: for example, what colors the cats are and whether any of them have extra-long tails.
This doesn’t seem too difficult, right? Just go around the room and check out each cat.
Now imagine that the room is also filled with dozens of birds and flying squirrels, and all the cats are hyped up on catnip. It’s hard enough to stick your head in the room without getting smacked by a flying animal; counting the cats is now out of the question, let alone checking out their tails.
A dirty data set is like a crazy animal-filled room. It’s possible to wrangle your cats (or, say, your data points) and get insights from them, but it won’t be fun and you’ll still be pretty uncertain by the end. Instead, it’s better to first bring order to the room by calming the animals down, removing the unnecessary ones, and organizing the remaining cats, and only then get the insights you need.
Cats and data points are quite different, but cleaning some data sets is just about as hard as wrangling hyper cats, and it’s just as necessary. In fact, the more you care about your insights, the more you should care about data cleaning, which is the process of finding and dealing with problematic data points within a data set. After all, you don’t want to make important decisions based on incorrect insights simply because your data had errors.
This ebook is designed to help anyone ensure that their data set is complete and correct. It includes an introduction to the importance of data cleaning (don’t worry, we won’t subject you to more cat analogies), plus seven chapters on basic data cleaning techniques.
In addition, we’ve included a chapter on one of our earliest case studies, walking through how we cleaned data from a paper-based survey, plus exercises throughout the ebook to help you practice each new skill or technique in Excel.
Know someone who would find this ebook useful? Share it with them!
Any issues or feedback? Drop us a note at [email protected].