Publishers of technology books, eBooks, and videos for creative people

Home > Articles

This chapter is from the book

Transforming Data

One of the benefits of making data cleansing an autonomous step is that now transformation can focus on adapting the dataset to the goals of the analysis. If you’re using a spreadsheet, you’re now moving from the cell level to the column level where you add, remove, or change variables. Here are a few examples of possible data transformations:

  • Encoding: If a column includes answers to an open question (where there are no predefined answers), you must add one or more columns to categorize those answers. For example, if you asked people to name three of their preferred movie actors, you’d have to parse the answer and code every one of the names.

  • Aggregation: The level of detail may be excessive for the purposes of analysis, and we’ll need to aggregate the data at a higher level. Our 23-year-old new mom can belong to a larger category (for example, ages 20–24), or data at the daily level can hide a pattern that can only be spotted at the week level.

    • icon1.jpg Read the blog post
  • Derived data: If we’re studying obesity and have weight and height data, we can calculate Body Mass Index (BMI) and add it as a new variable.

  • Removal: Changes in project scope may make some of the observations irrelevant, or some variables may only be needed to calculate derived data (like BMI above). Keep in the dataset only the data you need.

  • Standardization: If we need to link our new table to other tables in our system, some standardization may be needed, including changes in table structure and in labeling (for example, M/F instead of Male/Female).

Peachpit Promotional Mailings & Special Offers

I would like to receive exclusive offers and hear about products from Peachpit and its family of brands. I can unsubscribe at any time.