Excel Data Preparation
Jacques Bertin defines his semiology of graphics as a “visual transcription of a data table.” In a perfect world, this table materializes in front of us when we need it, ready to use. In everyday reality, however, things involve more sweat and less magic. People coined the expression “data janitor” for a reason.
In a data visualization project, data extraction costs and data preparation are often overlooked, either by management that doesn’t understand the level of detail required or by data analysts making overly optimistic assumptions. This translates into many hours of data cleansing that most people don’t see. If not taken into account, these labor-intensive tasks can consume several times the resources available for a project, whether it’s a simple chart for an upcoming meeting or an organization-wide project.
Brilliant visualizations cannot redeem bad data, either in content or in structure. Many spreadsheet users are not familiar with well-structured data, and that’s one more reason to discuss data preparation.
We can summarize all preparation work on the data table, regarding both structure and content, by the acronym ETL, for Extract, Transform, and Load. ETL is just as applicable to your Excel files as it is to large, formal systems.
This chapter is not strictly about data visualization. If the tables you need actually materialize in front of you, ready to use, if you know how to structure the tables to take advantage of pivot tables, and if you organize sheets in your workbook by content type, it’s probably safe to skip this chapter. In a more sophisticated organization, most of the issues discussed here are not relevant, and most of the data comes from internal systems. However, many people still struggle with these basic issues, so if you’re in this category, read on.
Problems with the Data
Let’s split data problems into two broad categories: 1) structure without content, and 2) content without structure. The first category affects our data in particular; the second is common in data we get from other sources.
Structure without Content
Even if you’ve never seen a table for which multiple users can enter data (such as a table for telemarketing operators), you can imagine how much garbage data is collected: incomplete ZIP codes, multiple abbreviations for the same entity, misspellings, logical inconsistencies...you name it.
It’s challenging to define good data validation rules without forcing exclusions: What happens when a few ZIP codes are missing from a lookup table? Suppose, though, that you can maintain a table with a minimum number of errors. Figure 4.1 represents an example of such a table. To make things more interesting, try linking this table to a second table containing other personal data (Figure 4.2). First, you’ll have to split the field Name into Name and Surname, to be able to join both tables. Now, is John Doe in the first table the same person referred to as John F. Doe in the second table? The solution in these cases is to have common fields in both tables that are not subject to different interpretations (social security or driver’s license numbers are good candidates). If there are no safe common fields, you’ll need to allocate additional resources to determine whether it’s the same person. Multiply this process by thousands of records and you have a problem on your hands that, if not anticipated, would generate serious time and resource management issues.
Figure 4.1 A table with names and addresses.
Figure 4.2 A table with socio-demographic characteristics. To get a better feel for structure without content, imagine that there are many more rows (records) and many entry errors in them.
A few other special cases also belong to the category of structure without content. One of the most common is a break in a time series, whereby you still get the same measure (an unemployment rate, for example), but changes in methodologies, concepts, technologies, or regional administrative boundaries make comparisons meaningless. Or, at least, comparisons must be carried out with extra care—the same care you should use when comparing countries that use different ways of measuring the same reality. For example, infant mortality rate depends on how a country defines “live birth.” Because the definition is not the same across countries, this may affect country rankings in international comparisons.1
Content without Structure
Suppose you’re a data provider, perhaps at the U.S. Census Bureau or at a small public relations company. The moment you release the data, you cease controlling it. You don’t know how people will read and reuse the data. They may want to cross-check it if they suspect that the data is not telling the whole story. Or they will misunderstand the concepts. Whatever they do, first they must have access to the data in a format they can use.
Providers often make it hard to use the data beyond the format in which they released it; they’re often unaware of this issue or focus on the end user and forget the data professional, who probably needs a more specific format.
Data providers should then ask themselves two simple questions: How many data reuse issues are we causing by releasing the data in this format? Is this reuse friction level acceptable for our data dissemination goals?2 Typical answers are, respectively, “a lot” and “no.” The end result is that data reuse friction levels can range from none (rare), to mildly annoying, to a source of a barrage of unprintable curses.
Let me give you an unfair example. Suppose you want to know the military budget as a percentage of GDP in each country. There are several sources, but you could start with the CIA’s website publication The World Factbook. Country profiles in the Factbook contain several sections and subsections.
Figure 4.3 displays the Military section for the United Kingdom. You can manually open this section and copy the data you need for each country, or you could use a scraping tool that automates the process. If you’re unable to automate the process, you’ll have a few long and boring days ahead of you. Because the data are not displayed the way you need it, time and resource costs will increase since you’ll have to structure it first.
Figure 4.3 UK Military data in the The World Factbook from the CIA.
I said this is an unfair example because the Factbook actually allows us to jump between the country profile level and the list level. At the bottom of the page on the website, you’ll see “country comparison to the world: 28.” If you click the number 28, you’ll get a list of all countries sorted by military expenditures as a percentage of GDP. Then you can choose a country from that list and return to the profile view. This nice feature is still quite rare, unfortunately.
These two broad categories of structure without content and content without structure try to make sense of the variety of issues when using data presented in an unfriendly format. Hadley Wickham brilliantly captured the difference between well-structured and poorly structured data in an excellent article3 in which he quotes the first paragraph of Leo Tolstoy’s Anna Karenina: “Happy families are all alike; every unhappy family is unhappy in its own way.” The “happy family” dataset is structured according to some rules that make it similar to other “happy families,” while there is a virtually infinite number of ways to create an unhappy dataset.
