Note: This text is published under a Creative Commons License (CC-BY). Means: You can take it, embed it elsewhere, translate it. Only thing I would ask it to name the source.
Why this tutorial?
Journalists have a lot of work each day. This is one reason that keeps most of them from digging into numbers, look for patterns and question common claims and believes.
How to find relevant data, clean it and publish it
Here is a tutorial to change that and get a publishable, embeddable chart in very short time, using Datawrapper, a tool I developed with Nicolas Kayser-Bril for ABZV, a German training institution for journalists.
Tutorial goal: Create a visualization, example in full screen here: Teacher salaries.
Step by step
The goal is to show that using data as a basis for reporting is a rich soil. The experience comes with doing it, not reading about it.
We hope that you will experience that in 10, 20 minutes you can have some interesting, usable content. There is so much to find and it is effectful, relevant for the public and satisfying for you to do this. This is why quite a few people are so enthusiastic about data-driven journalism (Links to Wikipedia).
Search, clean, publish: The steps
- Start with a (good) question
- Search for data
- Clean the data
- Visualize the data
1. Start with a good question
In this example we ask: How much do teachers make, country by country?
Education is matters and getting a bit more of context on something this vital might lead to interesting insights. Asking good questions is generally important, otherwise you can get easily lost in a sea of too much information. Be inventive, be critical, check believes.
2. Search for data
Let's assume this is not your very first search into the deep web and that you basically know how to use Google or other methods. If not, you should practise this (but that's another tutorial and for now we can simply do it like the chef in the cooking show - jump to the next step).
Looking for numbers on teachers salaries might lead you to a website called Worldsalaries.org. The site collects extensive data for all kinds of professions, from multiple official sources. Here, the site is used as our source. If you dig deeper into the web, you'll will find tons of such sites for almost all conceivable topics. Tip: There is a specialized search engine to dig for numerical data, graphs and charts - it's often better than using Google. The name is Zanran.
Back to our little project: On Worldsalaries.com there is a detailed table for teachers salaries. It's detailed and convuluted at the same time. There is A LOT OF INFORMATION. For a statistician, this is great. For all others this is hard to understand. TLDR, which is slang for "Too long, did not read". This applies to a lot of already available information on the web. So, now data journalists come to the rescue and make it easier to understand.
Wha we need to do here is copy the data. With HTML tables this is often a bit tricky.
- Go to http://www.worldsalaries.org/
- Search for "teachers" (search box is on the upper right)
- Copy the table you see now, it's enough if you manage to copy this for the first three columns ("Net Monthly income")
2. Clean the data
This is a bit like cooking again. With the data you copied just a minute ago it is a bit like coming back from the market. You got some nice, fresh produce, now we peel it and chop it up to make it digestible.
You will find that simply compying the HTML table as a whole get's ou into troube. Go and try to drop the copied content into Excel.
See, Excel does not understand this. It presumably dropped all the data into one column.
Solution though is easy:
- Open Word (yes, Word) on your PC.
- Drop the copied data table from Worldsalaries into an empty page.
- If you copied the whole HTML table, this should now show you the table in Word.
The reason why we do this is that HTML has a lot of hidden commands, which makes it difficult for Excel to understand (though, go try it). Anyway, Word "transforms" the table a bit.
Goal: Getting to a "controlled" spreadsheet
In this tutorial we do everything in Word. What we need to get now, what you always need is a "controlled spreadsheet". Controlled means that you reduce the information and throw away all the markers, text elements, etc. you don't need.
- We want a table where we have the countries in the first column and the numbers in the second column. Note that in this particular example they used two different lists of average salaries. This is kind of nice, but confusing. We simply take the first one here - yes, it cut's a bit of the depth but that can be discussed in the article. I a simple chart that level of detail is too much. After all, we can dig into how they got to these numbers in the future. Here we simply do a first examination.
- Then, the PPP that in the column. What does that mean? PPP stands for "Purchasing Power". This is a way to make salaries in different countries and currencies somehow comparable - here everthing is in Dollars. For comparison, this is great.
- Try to get rid of the PPP. From experience with this particular dataset I know it is a bit tricky. Do it manually. If this sometimes deletes too much or the number all of a sudden changes - remember that you can always undo the last steps).
- Then: Get rid of the "comma" that separates the thousands in the figures. You do that with "Search and replace", which is an often used, handy feature of both Wordpress and Excel. We simply don't need the commas here. Plus, a lot of tools would interprete that comma differently, turning a 4,000 Dollar Salary into 4.00. We don't want that.
Then: Clean up the row of countries, simply delete the lengthy "average income" or "median income". This is clutter, as we can simply write one line under the chart that these are all "average" salaries (Beats me why they wrote PPP and average income into every line, it just makes the information harder to use)
So, to be fair, every dataset is different and what you need for very basic cleaning is often Word, sometimes Excel. For advanced users there is Google Refine.
But the point here is, that if you know that you can safely delete a lot of stuff in order to get to a clean, controlled list - you have already learned an important point. After all, what we are after is to see the pattern. If there are problems or notable outliers we will write about that in our text. But NOT in the chart.
Spoiler: Should you get into trouble with the cleaning process, here is a cleaned up list of the data from the web. You can work with this list going on, but remember/practice doing it yourself in the future.
Open: Teacher Salaries (Cleaned) - on Google Docs.
4. Visualize the data
By now you have accomplished the most important steps for data journalism already: Having some data.
Now the fun starts. Because being able to visualize the data is always helpful, whether you publish or do it just to get a better view for yourself.
- Go to Datawrapper, log in (or register if this is your first visit) and then go the Google Spreadsheet or your own in word and simply copy the table (just the table, including header rows).
- Datawrapper is very easy to use, especially when you have some "controlled", clean data.
- Just note that for this particular data a horizontal bar chart might be best, but tinker around with it and you will see yourself.
- Make sure that you fill out all the fields Datawrapper offers to create a nice, correct and understandable chart.
- My result can be viewed here: How much do teachers earn? (Fullscreen)
This was simple, hopefully. But from here you know have many, many options. For example: Why do the US pay considerably more? How come the salaries for teachers are relatively low in Finland, though they get excellent results?
With this visualization as a starting point, you can do what journalists do - only better now. And should you publish the chart people will see the pattern and then will ask for more details in the text. This is it.
I hope you liked this tutorial and it was comprehensible. There will be more like these. Each project is a bit different. With each you learn - tricks, new insights, new options and future ideas.
Full screen view of the finished dataset: Teacher salaries.
How much do teachers earn? von Mirko Lorenz steht unter einer Creative Commons Namensnennung 3.0 Unported Lizenz.
Beruht auf einem Inhalt unter www.mirkolorenz.com.
Über diese Lizenz hinausgehende Erlaubnisse können Sie unter http://www.datawrapper.de erhalten.