Cleansing Data
I’ll assume that you survived the previous stage of the ETL process and you’re now the proud owner of a nice-looking table. But the smile will vanish from your face if you now find a record of a 123-year-old new mom living in a city called Cincinatti, TX.
The second stage of ETL, transformation, deals with data manipulation, but the first transformation, data cleansing, is so important and specific that it deserves to be promoted to its own step. Data cleansing suggests, of course, that the data is dirty. Data is dirty because it contains typos or inconsistencies or fails in some way to meet a standard.
All this “dirt” must be cleansed before any serious analysis can take place, and again a pivot table can be very handy for this purpose. If you count every category in a field, you’d soon find only one reference to Cincinatti, TX, while there are many references to Cincinnati, OH. So, you’ll probably need to change that record because the city name is misspelled and associated with the wrong state. And what about the 123-year old new mom? Check the age range. She’s probably only 23. Please note the word “probably”; just because a value seems strange, that doesn’t mean it’s not real. Be sure to cross-check against a lookup table and against other fields for logical inconsistencies, and don’t forget to have a log that includes all your edits.
