Part 4. Digital Tools Explained
4.3 Data Scraping & Cleaning Tools
What are Data Scraping & Cleaning Tools?
In this chapter, we will learn how to use a formula in conjunction with Google Sheets to download data from Wikipedia. We will also use OpenRefine, formerly Google Refine, to clean data. These two formulas make scraping tabular data from online sources accessible for beginners. This process can be used to collect information for datasets and data stories.
Data Scraping and Cleaning
With the mass proliferation of data over the last decade, data scraping and cleaning have become all the more essential. These two tools – Google Sheets and Open Refine – facilitate the process of data scraping and cleaning.
Do you ever have Wikipedia data you need in a spreadsheet? Google sheets has a function that will pull a Wikipedia table into the worksheet.Data scraping, also known as web scraping, is the process of importing information from a website into a spreadsheet or local file saved on your computer. It’s one of the most efficient ways to get data from the web, and in some cases to channel that data to another website. The size of Wikipedia makes it both the world’s largest encyclopedia and slightly intimidating to work with. However, size is not an issue with the right tools.
No matter the type and size of data, the quality is of the upmost importance. Another tool developed by Google, OpenRefine, is yet another useful tool to Clean data. Data cleaning, also called data cleansing, is the process of ensuring that your data is correct, consistent and useable by identifying any errors or corruptions in the data, correcting or deleting them, or manually processing them as needed to prevent the error from happening again. Therefore, after scraping data, you must clean it to make sure it is ready to be manipulated
Below, read more information about the two tools. In the sub-section, there is a step-by-step example about how to use the tools together to create a dataset by scraping data from Wikipedia and preparing it for analysis and interpretation.
Google Sheets Formula (Data Scraping)
Using Google Sheets you don’t have to copy and paste. Instead, use the ImportHTML function in Google Sheets and get the data from Wikipedia. In this video, Chris Menard explains how to use the following Google spreadsheet function to scrape a table from an HTML web page into a Google spreadsheet:
=ImportHtml(“Insert URL“, “table”, Insert Table Number)
The URL of the target web page, and the target table element both need to be in double quotes. The number N identifies the N’th table in the page (counting starts at 0) as the target table for data scraping.
OpenRefine (Data Cleaning)
OpenRefine, formerly called Google Refine and before that Freebase Gridworks, is an open-source tool that was built to help people clean data. It provides functions that let you investigate your data and then apply fixes to groups of data at the same time. You can also write short scripts to transform columns of data.
While at first glance it may look like using spreadsheet software like Excel there are a few key things that make OpenRefine a good data cleaning tool:
- You work with a new version of your data in OpenRefine, leaving your original untouched.
- OpenRefine lets you work with data as rows or records and columns, not just cells.
- OpenRefine keeps track of the steps you go through from import to export.
- You can “Undo” steps if you want to go back to an earlier version of the data.
- You can copy the steps and apply them to another dataset with the same structure.
- OpenRefine has features that let you cluster data that looks the same, simplifying clean up.
- You can use open refine to get data from other sources through APIs.
Getting Started with Google Sheets Formula and OpenRefine
Part One
In the first part of this tutorial, I will show you how to scrape data from a table on a Wikipedia page using the importhtml command on Google Sheets, and I will also show you how to save that file as an excel file. Wikipedia has lots of information, and while the majority of that information is simple text, there are tons of tables on Wikipedia that can be used for data analysis. While this example is Wikipedia specific, the importhtml command can be used for any table on any public website.
Step One: Scraping data from a Wikipedia Table
In this step I will show you how to save a Wikipedia table into a spreadsheet using the importhtml function on Google Sheets. I will be scraping this data on Jay Z, but this method will work on any table on Wikipedia.
- Create a blank spreadsheet using Google Sheets.
- In order to do this, make sure you are logged into your Google account and then go to drive.google.com.
- Click on the button labeled “New” in the top left corner, and then create a blank spreadsheet by clicking “Google Sheets”.
- I’m going to title this spreadsheet “Jay-Z videography as lead artist”, because that’s the data we will be scraping from Wikipedia.
- In the first cell of the spreadsheet, type =importhtml.
- This is the command we will use to tell the spreadsheet to scrape the data from Wikipedia.
- Copy and paste the URL of the Wikipedia page with the data that you want to download.
- Make sure the URL is in quotation marks. Type a comma followed by the word “table” in quotation marks.
- This tells the computer that the data we want to download is in a table.
- Lastly, type a comma followed by the number that corresponds to which table you are going to download from the wikipedia page.
- The program treats all multimedia elements on the page as tables, including pictures, videos, and maps.
- This number does not go in quotation marks. If you get the number wrong, you can always delete the data and modify your original command.
- Press enter, and wait for your the program to download you data.
- As you can see, the data from this Wikipedia page has been saved into the Google spreadsheet.
- Download the second table on this Wikipedia page, which lists Jay Z’s videography as a featured artist.
- I could save it to a separate google sheet; however, my end goal is to analyze these data sets together, so I will go ahead and import the second table into this spreadsheet as well.
- Using the same command as before, (but I will change the number one to a number two), download the data from the second table on the page.
- Delete the second header row since the two tables share the same headings.
Step Two: Saving Google Sheets as an Excel file
In this step, I will show you how to save your google sheets spreadsheet as an excel file. We have to download our spreadsheet in order to make use of Open Refine, which will help us clean our data for analysis.
- Open the spreadsheet you want to download in Google Sheets.
- Click on “file” > “download” > “Microsoft xlsx”.
- The file will download as a Excel spreadsheet automatically.
Part Two: Modifying Excel Data
In Part Two, I will be modifying the dataset for my specific research purposes. Sometimes the way that the data is presented on the Wikipedia page is not optimal for your own research goals. In this step, I will show you a few tricks to quickly move around data in an Excel spreadsheet so that you can modify the layout of data to suit your own specific research purposes.
Step Three: Adding a Column to a Spreadsheet and Mass Entering Data with Excel
- Add a column that specifies whether Jay Z was a lead or featured artist on each video in the spreadsheet.
- Start by adding a header for this new column and I will title it “main or featured artist”.
- The Wikipedia page differentiated between these two types of videos by making two separate tables, and we downloaded each table individually.
- Because of this, all of the videos in which Jay Z is the lead artist are listed first, followed by all of the videos in which he is the featured artist. This will allow us to quickly enter whether Jay Z was a lead or featured artist into this column.
- Highlight all of the cells in which I want to enter that Jay Z was the main artist.
- Type the word “main” into the function bar, and then hold down control and enter. This allows me to enter a lot of data quickly into a spreadsheet.
- To do the same thing for the videos in which Jay Z was a featured artist, go to the function bar, type in “feature”, and hit control enter.
Step Four: Using Excel’s Fill Below Command
If we take a look at the original table on the Wikipedia page, we will see that the table only lists each year once for every video. For example, these three videos were all produced in 1996, but the year 1996 is only listed once. Wikipedia uses this kind of layout a lot, because it is easy for human readers to understand. However, this layout is lost when we save the data to a spreadsheet, leaving lots of empty cells that we will want to fill in order to analyze the data. We could go through manually and fill in each individual empty cell. However, with larger data sets it can become tedious. In this step, we will copy and paste each value into all of the empty cells below it for all cells in the column.
- Highlight the cells that should have a value in them.
- Use Excel’s “Go To Special” command to select only those cells which are blank.
- You can do this quickly by holding down the control key and pressing G and clicking on special, or you can navigate to this feature by clicking on “Home” > “Find and Select” > “Go to special”.
- Click the bubble labeled “blanks” and then hit “ok”. This will select all the cells that are blank in this column.
- Click on the formula bar and type an equals sign followed by the value of the cell above the first blank cell in the column.
- This tells the computer to fill in every blank cell with the value immediately above it.
- In this example, the first blank cell is B4, so , type in B3 and then hit control and enter. This quickly filled in every cell with the value of the cell immediately above it, which is useful for quickly modifying Wikipedia tables for data analysis.
- For the director column as well, to highlight all of these, hit control G, click on special, select the blank cells, and then type =c2 and hit control enter.
Part Three: Use Open Refine to Clean Date
In part three of this tutorial, I am going to show you how to clean data using Open Refine. Open Refine is a free, open source software that runs in your web browser. It lets you manipulate data in a spreadsheet. In order to download openrefine, go to openrefine.org/download, and follow the onsite directions to download the latest version of the program to your specific operating system.
Step Five: Opening a Spreadsheet in Open Refine
- Open Open Refine.
- The program will open in your default web browser.
- To open the spreadsheet with Jay Z’s videography data, click on “Create Project” > “Choose File”, and then navigate to the specific file you want to open.
- Click on “open” and then I’ll click on “next”.
- Click on “Create Project” in the top right of the screen so that I can begin manipulating the data in the spreadsheet.
Step Six: Using the Replace Function
All of the titles in this data set have quotation marks around them, and I want to get rid of these. I can use the replace feature to do this. The replace feature will find all characters in a column and replace them with another character, and we can use this to remove the quotation marks surrounding the song titles.
- Open the drop down menu for the specific column that I want to transform.
- Click on “edit cells” >”transform”.
- In this window, we can type a code into the text box labeled “expressions” and it will perform certain actions on our spreadsheet.
- The column on the left is the current state of the spreadsheet, and the column on the right is what will happen if we run our code.
- Use “ value.replace() “ in order to replace certain characters or strings of characters in the column.
- The first variable is the character or characters that you want to replace, and the second variable is what you want to replace them with.
- Both sets of characters go in quotation marks.
- For example, this formula will replace all of the capitol I’s with lower case f’s (This function is case sensitive).
- To get rid of the quotation marks (not replace the I’s with F’s), type \ followed by a ” for the first variable.
- The backwards slash tells the program to treat the next character as a character to be replaced, rather than as a quotation mark that is part of the code.
- For the second one, leave it blank.
- The two empty quotation marks for the second input signal to the computer that we don’t want to replace the character with anything—this effectively removes it.
There is an alternative way to use the replace command in Open Refine, and it’s a bit easier.
- Navigate to the drop down menu again and click on “edit cell” > “replace”.
- This will open a window like this, and it works the same way as before; however, with this version you do not have to worry about quotation marks and backslashes.
- Type what you want to replace in the first text box, and what you want to replace it with in the second box.
- However, this method does not have the advantage of showing the effects of your find/replace command before you run it.
Step Seven: Split Parenthetical Phrases into New Column
I also want to create a separate column that lists the featured artist on each song. The original table on Wikipedia listed the featured artists in the song title in parentheses, but I want these in a separate column.
- Open the drop down menu of the song title column, which I want to split into two separate columns.
- Click on “edit column” > “split into several columns”.
- I have to specify to Open Refine which character I want to mark the division between the two columns. In this example, the featured artist is always listed after the song title in parentheses. We can use an open parentheses to mark the division between the two columns.
- In the text input box labeled “separator”, I am going to type a parentheses, and then hit OK.
- As you can see, this split my original column into two separate columns.
- Click on “rename column” under the “edit this column” option in the drop down menu.
There’s still a bit of cleaning up to do for this column. Each artist is followed by a close parentheses which I want to get rid of. As before I can use the replace feature to get rid of these. I also want to get rid of the word “featuring” that proceeds each listed artist, because that information is provided in the title of the column.
Step Eight: Removing Unwanted Columns
I also want to remove the bracketed numbers in the directors’ column. If we take a look at the original Wikipedia table, we can see that these numbers originally were hyperlinks to the sources at the bottom of the page. Google sheets preserved those numbers when we scraped the data from Wikipedia, but I have no use for them. I can use the “split into several columns” feature to get rid of these.
- Navigate to “split into several columns” under “edit column” in the drop down menu.
- This time, I’m going type an open bracket as the separator, and hit OK.
- Now that the column has been split into two, delete the second column, which contains all of the excess information I do not want.
- Click on “remove column” under “edit column” in the column’s drop down menu.
Step Nine: Removing Blank Space from Bottom of Spreadsheet
I want to get rid of all of the white space before and after any words in my spreadsheet. It can be hard to see on your own if there are any extra spaces before or after values in your sheet. Fortunately, there is an easy way to do this in Open Refine.
- In the column’s drop down menu, go to “edit cells” > “common transforms” and then click on “trim leading and trailing whitespace”.
- This will get rid of any excess spaces in your cells, which will interfere with the computer’s ability to recognize your data.
- Repeat this step for every column in the spreadsheet.
Step Ten: Exporting Data from Open Refine
I have finished cleaning my dataset with Open Refine, and now I want to save my data as an excel sheet again.
- In the top right corner of the browser window, I can click on “export” > “microsoft excel” and the newly cleaned file will automatically download onto my computer.
Part Four: Final Steps
Step Eleven: Adding Director Gender
I want to add a column that denotes the gender of each director. Because this is new information that wasn’t listed in the original Wikipedia page, I will have to add it manually. However, there are a few steps I can take in order to speed up this process.
- Create a new column for the gender of each director.
- Click on the arrow in the top left corner of the excel sheet.
- This turns on filters so that I only have to enter the gender once for each director in the spreadsheet.
- Click on “data” in the menu bar and then click on “filter”.
- By clicking on the filter icon at the top of the director column, I can isolate each director and add their gender.
- For example, I can click on the box next to Abdul Malik Abbot in order to only see those rows in which he is the director.
- Enter his gender for all of these rows by highlighting the respective cells typing in “male” into the funciton bar, and then hitting control and enter.
- Repeat this step for every director in the column
Step Twelve: Saving
Our Wikipedia data is finally cleaned up and ready to be analyzed. Now we just have to save the file. Saving an Excel sheet is just like any other.
- Go to “file” > “save” and you’re good to go.
OpenRefine – Extended Information
Once OpenRefine is launched in your browser, the left margin has options to Create Project, Open Project, or Import Project. Here we will create a new project:
- Click Create Project and select Get data from This Computer.
- Click Choose Files and select the file SAFI_messy_openrefine.csv. Click Open or double-click on the filename.
- Click Next>> under the browse button to upload the data into OpenRefine.
- OpenRefine gives you a preview – a chance to show you it understood the file. If, for example, your file was really tab-delimited, the preview might look strange, you would choose the correct separator in the box shown and click Update Preview (bottom left). If this is the wrong file, click <<Start Over (upper left). There are also options to indicate whether the dataset has column headers included and whether OpenRefine should skip a number of rows before reading the data.
- If all looks well, delete csv from the end of the name of your project and click Create Project>> (upper right). Files we export from the OpenRefine project will be given the name of the project, so if you end up creating multiple versions you will want to be sure and rename the file once you have exported it.
- Note that at step 1, you could upload data in a standard form from a web address by selecting Get data from Web Addresses (URLs). However, this won’t work for all URLs.
What Data Types Can You Manipulate in OpenRefine
Text –Any data can be classified as text. Like in a spreadsheet, text will be stored exactly as it was it was entered. Text facets will look for exact matches, taking extra spaces and case into account, and give you a list with a count of each value. Text appears in the table in black and is left-justified.
Number –Numbers can be any numeric value and can include decimal places. If you try and classify a value that is not a number as a number you will get an error. You can perform mathematical calculations on numbers. They can be turned into text in a script by using toString(). Numbers appear in green and are right-justified.
Date- Dates are formatted in yyyy-mm-ddThh:mm:ss format. If no time was given it will default to midnight. Dates are in green and are right-justified.
Boolean – Boolean values are either true or false. You will often see these if you have created a facet that separates values on whether or not they have a particular characteristic or if you perform a transform that is expressed as true or false.
Converting Data in OpenRefine
When a table is imported into OpenRefine, all columns are treated as having text values. This interprets the values as numbers for the purposes of sorting but keeps the underlying data type as is. We can, however, transform columns to other data types (e.g. number or date) using the Edit cells > Common transforms feature.
To transform cells in the years farm column to numbers, click the down arrow for that column, then Edit cells > Common transforms… > To number.
Saving a Project
By default OpenRefine is saving your project continuously. If you close OpenRefine and open it up again, you’ll see a list of your projects. You can click on any one of them to open it up again. You can also export just your cleaned data, rather than the entire project. Click Export in the top right and select the file type you want to export the data in. Tab-separated values (tsv) or Comma-separated values (csv) would be good choices.
Useful Functions of OpenRefine
Trim leading and trailing whitespace
Words with spaces at the beginning or end are particularly hard for we humans to tell from strings without, but the blank characters will make a difference to the computer. We usually want to remove these. OpenRefine provides a tool to remove blank characters from the beginning and end of any entries that have them.
To remove the whitespace, choose Edit cells > Common transforms > Trim leading and trailing whitespace.
Using undo and redo.
It’s common while exploring and cleaning a dataset to discover after you’ve made a change that you really should have done something else first. OpenRefine provides Undo and Redo operations to make this easy.
How OpenRefine records what you have done
As you conduct your data cleaning and preliminary analysis, OpenRefine saves every change you make to the dataset. These changes are saved in a format known as JSON (JavaScript Object Notation). You can export this JSON script and apply it to other data files. If you had 20 files to clean, and they all had the same type of errors (e.g. misspellings, leading white spaces), and all files had the same column names, you could save the JSON script, open a new file to clean in OpenRefine, paste in the script and run it. This gives you a quick way to clean all of your related data.
Useful Expressions in OpenRefine
Adapted from: https://kb.refinepro.com/
Remove all numbers from a string
To remove all digits value from a string the following regex should do the work (from the menu edit cell > transform):
replace(value, /\d/, ”)
This expression replace all numbers (identified by the regular expression /\d/) by blank. All regular expression (regex) in google refine should starts and ends by the character: /
Remove “ (quotation) mark
Hard time removing the ” (quote sign) from your expression. Instead of quoting your quote mark with double quote, do it with simple like this :
replace(value, ‘”‘, “”)
the expression read : replace(value , single quote double quote single quote , two double quote)
Remove or replace a specific character in a column
– https://kb.refinepro.com/2011/07/remove-specific-character-in-column.html#uds-search-results
You want to remove a space or a specific character from your column like the sign # before some number. 2 expressions does the same job:
value.split(“#”).join(“”)
This splits the value at every space (“#”) and joins them with no space (“”)
replace(value, “#”, “”)
this look for every # (“#”) sign and replace it with no space (“”)
By John Merritt & Kenton Rambsy
Parts of this chapter were adapted from “Scrape data from wikipedia and put into Google Sheets by Chris Menard.”