Publishers of technology books, eBooks, and videos for creative people

Home > Articles

This chapter is from the book

What Does “Well-Structured Data” Mean, Anyway?

The acronym GIGO (garbage in, garbage out) summarizes the issues we deal with every day: Results and insights depend on data quality. We can handle data critically (being aware of the “garbage” and factoring it in to the evaluation of results) or uncritically (“if the data has been subject to extensive processing by the computer, it can’t be wrong”).

Data integrity becomes essential when the volume of data increases and we need to update, filter, and aggregate it, and use data as a basis for derivative calculations. A clean, consistent, and well-structured table means lower update and maintenance costs and more flexibility to multiply the perspectives from which we can analyze the data.

This may not be good news for the user accustomed to the loose spreadsheet environment, where storage, presentation, intermediate calculations, and parameters often share the same sheet. Let’s start untangling this mess with a concrete example.

The first step toward improving data structures is understanding that storing data and presenting data are two very different things. You should never use storage and presentation features together in a single worksheet. Share your source table if requested, of course, but otherwise bury it deep down in a data-only sheet. If you have a well-structured table, you’ll never have to touch it again, except when using a client like a pivot table or when adding a variable. In Excel, tables are for storing data, and pivot tables are for analyzing and presenting data.

A Helping Hand: Pivot Tables

Ah, pivot tables! Pivot tables are great at many levels. They can even serve as a litmus test for checking how well a table is structured. If every single cross-tabulation is done easily and you don’t have to change the pivot table following an update, you can be reasonably sure that you have a well-structured table.

Figure 4.4 shows a sample of one of the output formats for the Consumer Expenditure Survey. Assuming we know the meaning of the Series ID, this is the typical manner of presenting the data, with time periods in columns and entities in rows.

Figure 4.4

Figure 4.4 Sample output from the Consumer Expenditure Survey (Bureau of Labor Statistics).

  • icon1.jpg Go to the web page

Think of the table as a cross tabulation (Series ID × Year) that must be uncrossed so that we can use it. Unlike other output formats from the Bureau of Labor Statistics, you can get all the data you need in a single table, and it’s very easy to reverse it to the right format, resulting in the table you see in Figure 4.5.

Figure 4.5

Figure 4.5 Un-pivoting the data table.

Series ID contains multiple variables, so we must parse it and look for the descriptive text for each code. Figure 4.6 shows how the final table will look.

Figure 4.6

Figure 4.6 A few rows of the final data table.

Creating dynamic charts in Excel requires knowledge of advanced formulas, but often we only need them because the data table is not properly structured. Figure 4.7 shows a simple dynamic chart (not a pivot chart) that you can create without a single formula. It displays the proportion of food expenditure away from home, over the years, for the selected income quintile. Select a different quintile and the chart will update.

  • icon.jpg Download the original chart
Figure 4.7

Figure 4.7 A dynamic chart using a pivot table.

From Figure 4.6 we can see that a well-structured table is essentially a list of observations and their characteristics (category and item, income quintile, and time) and the associated measure (expenditure). In a pivot table, measures are usually placed in the Values area, while characteristics go into the Rows, Columns, or Filters areas.

In a well-structured table that can be easily used as a pivot table source, the content of each column must be understood as a group (years, quintiles), and the values in each measure should be comparable (expenditure in dollars in a column and expenditure units in a second column).

Reality can get more complicated, and so will the structure. Suppose you get expenditure by gender. Ideally, you’d add a new column (“Gender”) with two values (Male, Female). But if they are averages instead of totals, you can’t aggregate them, and, in this case, you have to add them as measures.

Peachpit Promotional Mailings & Special Offers

I would like to receive exclusive offers and hear about products from Peachpit and its family of brands. I can unsubscribe at any time.