Transforming Data
One of the benefits of making data cleansing an autonomous step is that now transformation can focus on adapting the dataset to the goals of the analysis. If you’re using a spreadsheet, you’re now moving from the cell level to the column level where you add, remove, or change variables. Here are a few examples of possible data transformations:
Encoding: If a column includes answers to an open question (where there are no predefined answers), you must add one or more columns to categorize those answers. For example, if you asked people to name three of their preferred movie actors, you’d have to parse the answer and code every one of the names.
Aggregation: The level of detail may be excessive for the purposes of analysis, and we’ll need to aggregate the data at a higher level. Our 23-year-old new mom can belong to a larger category (for example, ages 20–24), or data at the daily level can hide a pattern that can only be spotted at the week level.
Derived data: If we’re studying obesity and have weight and height data, we can calculate Body Mass Index (BMI) and add it as a new variable.
Removal: Changes in project scope may make some of the observations irrelevant, or some variables may only be needed to calculate derived data (like BMI above). Keep in the dataset only the data you need.
Standardization: If we need to link our new table to other tables in our system, some standardization may be needed, including changes in table structure and in labeling (for example, M/F instead of Male/Female).
