Publishers of technology books, eBooks, and videos for creative people

Home > Articles

This chapter is from the book

Extracting the Data

You successfully complete the first stage in the ETL process when you access a file that you can edit and manipulate. When you get a text file, you may need to open it in a text editor (such as the free Notepad++ for Windows) to solve multiple small issues with Search and Replace. Do your computer’s regional settings and the text share the same symbols for decimal places and thousands separators? (Some may use periods while others use commas.) Are there any strange characters? Can they be removed?

Extraction can be a very long and rocky journey, so let’s start with a smooth example first, again from the Bureau of Labor Statistics. I’m looking for the monthly unemployment rate, at the state level, for a period of several years. Figure 4.8 shows a sample of the output. There are several output options, including an Excel file, but for now we’ll work with a tab-delimited text file. I’m getting the data for each state, which means that I’ll have to consolidate them into a single table, removing all unwanted text.

Figure 4.8

Figure 4.8 Pasting data into Excel, from a web page and from a text editor.

Figure 4.8 explains why you should have a text editor between a web page and the spreadsheet. Scenario 1, on the left, shows the result of a direct paste from the web page, while scenario 2 shows what happens when you paste to Notepad++ first: Excel recognizes the tab character and automatically parses the text.

As with the example on expenditure, we’ll have to find what the Series ID codes mean. You may want to split the Series ID codes into multiple columns using the Text to Column function in Excel. Also, create a real date from the Year and Period columns.

When extracting data from other public sources, you may run into some limits imposed by the organization. The United Nations Population Division doesn’t allow you to select more than five variables or countries in each query (Figure 4.9). Other organizations impose limitations at the cell level. The Eurostat limits each query to 750,000 cells. Depending on how high the limit is or how detailed are the data you need, you may have to run multiple queries to get all the data and then merge the results into a single file.

Figure 4.9

Figure 4.9 Extracting data from the UN Population Division.

The PDF Plague

With more or less pain, the chance of getting a text file from official statistical offices is high. Other data providers, such as professional associations, may have other, more restrictive policies regarding data dissemination.

Many years ago, I needed to get data on the various types of electricity consumption (high voltage, low voltage, domestic, industrial, public roads, and so on) at a very detailed regional level. The data were available only in large sheets of paper, where someone had elegantly handwritten all these thousands of values. It was an admirable job, almost worthy of a Charles Dickens novel. It also had an unanticipated cost, because my organization had to purchase a copy of all those sheets and hire someone to enter the data manually.

Today, no sane organization would share its data in this format. With all the technology we have in our hands, that would be ridiculous, right? Well, not so fast. Let’s abstract for a moment from the technology and focus on the goal: getting a few thousand values into an editable table. Now tell me: What difference does it make if we have handwritten numbers on a sheet of paper or a PDF file with such a twisted formatting that the cost of extracting the data is higher than entering them by hand? Actually, there is a difference: I found those handwritten sheets only once, while I keep stumbling upon data tables in PDF files, to my despair and exasperation.

If you’re a data provider, you have a degree of control over your data when you share them in a PDF. You might persuade some people not to use the data in a way different than you intend. This is not wrong if you have a strong reason to do it, but it will anger your users, even if that’s not your plan. Again, make sure that the way you share your data is aligned with your goals. In addition to presenting your data the way you want people to see it by default, provide a link to the raw data. That way everyone is happy.

If you’re a user of internal data, you might assume that you’ll never have to extract data from PDF files. But, sooner or later, you will. And there will not be a quick fix. You may be able to open simple and well-behaved PDFs in Word 2013 or 2016, so there’s no harm if you try that first. If that doesn’t work, try copying the data from the PDF and pasting it into the text editor (such as Notepad++), and then from the text editor into Excel. Then you can try an additional application, such as the free tool Tabula, to extract the data into CSV or XLS files. None of the solutions will be entirely satisfactory, but the cost of editing the table should be lower than manual data entry.

  • icon1.jpg Go to the web page

“Can It Export to Excel?”

Internal business intelligence (BI) systems should allow you full control over the content you want to extract and how you want to extract it. Unfortunately, that’s not always the case. Let me paint a grim and somewhat exaggerated picture here.

First, you have to solve a communications problem. You, the business user, and the IT people apparently don’t speak the same language: They don’t understand why a market share above 100 percent is not possible, and you don’t understand that they must have a rule for each of your beloved exceptions. So when you get the data from IT, crosscheck it to make sure you’ve got the right data.

Second, there is a political problem. The data you want and the way you want it may not fit into the current formal corporate policies regarding access privileges, data security, or data dissemination. You can also be caught in a power struggle between IT and other areas, and they may start dragging their feet to avoid granting you access to the data.

Finally, there may be technical issues. The eternal question “Can it export to Excel?” forced BI vendors to make this option available. After so many years, I think they still hate it, judging from the output files I have to deal with. If the application can export data to CSV or Excel, there’s hardly a reason to create unfriendly table structures that force the user to take additional steps to clean the data. This means extra work for you, but if in every update the format is wrong but consistent, you might use a macro to correct it and solve the problem.

Peachpit Promotional Mailings & Special Offers

I would like to receive exclusive offers and hear about products from Peachpit and its family of brands. I can unsubscribe at any time.