Project: Scraping Nuclear Reactors

In this project, you’re going to look at data about nuclear reactors. (This project was devised initially by Prof. Nicholas Horton, Amherst College.) Let’s use Japan as an example. Often, when you are doing a quick or informal project, sources like Wikipedia are useful.

Go to the page http://en.wikipedia.org/wiki/List_of_nuclear_reactors. Find the reactor list for Japan. Figure 18.1 shows part of the list, as of October 28, 2019, as a cut-and-paste image from a web browser.

Figure 18.1: Part of the Wikipedia table describing nuclear reactors in Japan.

Unfortunately, it is not a matter of cut-and-paste to get the tables in Wikipedia into the form of a data frame in R. The tables often have a complex, non-tidy form. In addition, the tables are written using HTML tags, which can be confusing. For instance, here is a bit of the HTML behind the table of reactors in Japan.

<table class="wikitable sortable">
<tr>
<th rowspan="2" style="background:#FFDEAD;">Name</th>
<th rowspan="2" style="background:#FFDEAD;">Unit No.</th>
<th colspan="2" style="background:#FFDEAD;">Reactor</th>
<th rowspan="2" style="background:#FFDEAD;">Status</th>
<th colspan="2" style="background:#FFDEAD;">Capacity in MW</th>
...
</tr>
<tr>
<td>Fukushima Daiichi</td>
<td>1</td><td>BWR</td><td>BWR-3</td><td>Inoperable</td>
<td>439</td><td>460</td><td>25 Jul, 1967</td>
<td>26 Mar, 1971</td><td>19 May 2011</td>
</tr>

Compare the human-readable version of the table with the HTML markup. You’ll see that the data is there, but there is a lot of extraneous material and the arrangement is set not by position in a spreadsheet layout but by HTML tags like <td> and <tr>. (An HTML tag is a markup indicator, analogous to * or ### or [text](link) in Markdown.)

library(rvest)

page <- "http://en.wikipedia.org/wiki/List_of_nuclear_reactors"
tableList <- page %>%
  read_html() %>%
  html_nodes(css = "table") %>%
  html_table(fill = TRUE)

The code above takes advantage of the HTML tags that identify tables in the page and reads all of them into the R environment at once. The resulting object is not a data frame; it is a list of data frames. Here are some of the operations you can apply to lists:

Description	Syntax	Example
Number of elements in the list object	length(list_object)	length(tableList)
Access a single element of the list object	list_object[[element]]	tableList[[20]]
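As a small illustration of these list operations, here is a sketch that parses a made-up HTML snippet rather than the Wikipedia page. The snippet contents ("Plant A", "Plant B") and the name toyTables are assumptions for illustration only.

```r
library(rvest)

# A made-up page fragment holding two small tables (placeholder values)
snippet <- "
<table>
  <tr><th>Name</th><th>Unit</th></tr>
  <tr><td>Plant A</td><td>1</td></tr>
</table>
<table>
  <tr><th>Name</th><th>Unit</th></tr>
  <tr><td>Plant B</td><td>1</td></tr>
  <tr><td>Plant B</td><td>2</td></tr>
</table>"

# Same pipeline as above, applied to the snippet instead of a URL
toyTables <- snippet %>%
  read_html() %>%
  html_nodes(css = "table") %>%
  html_table()

length(toyTables)           # number of tables found
second <- toyTables[[2]]    # double brackets extract one data frame
second
```

Double brackets return the data frame itself. Single brackets, as in toyTables[2], would instead return a list of length one that still has to be unwrapped.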

Find the table element

Start with head(tableList[[5]]) and go down the list until you find the table for Japan. The tables are listed by number in the same order that they appear on the page. As of the time of this writing, tableList[[5]] is data from Belarus, so you’ll have to go a good distance down the list to get to Japan. (Wikipedia articles are works in progress. Over a period of even a few days they may have been modified substantially.)
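If stepping through the list by hand gets tedious, one possible shortcut is to ask which tables mention a reactor name you expect to see. The sketch below uses a toy list (toyList, with placeholder plant names) as a stand-in for tableList, and it assumes the name of interest appears in the first column of its table.

```r
# Toy stand-in for tableList (placeholder values, not the scraped data)
toyList <- list(
  data.frame(Name = c("Plant A", "Plant B"), Unit = c(1, 1)),
  data.frame(Name = c("Fugen", "Fukushima Daiichi"), Unit = c(1, 1)),
  data.frame(Name = c("Plant C"), Unit = c(1))
)

# For each table, check whether any first-column entry contains the search text
hasMatch <- vapply(
  toyList,
  function(tbl) any(grepl("Fukushima", tbl[[1]], fixed = TRUE)),
  logical(1)
)

candidates <- which(hasMatch)   # positions of the matching tables
candidates
head(toyList[[candidates[1]]])  # inspect the first match, as with head() above
```

It is still worth looking at the result with head(), since other tables on a page could mention the same name.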

The table will look something like this:

Name	UnitNo.	Reactor	Reactor	Status	Capacity in MW	Capacity in MW	Construction start	Commercial operation	Closure
Name	UnitNo.	Type	Reactor	Model	Capacity in MW	Capacity in MW	Construction start	Net	Gross
Fugen	1	HWLWR	ATR	Shut down	148	165	10 May 1972	20 March 1979	29 March 2003
Fukushima Daiichi	1	BWR	BWR-3	Inoperable	439	460	25 July 1967	26 March 1971	19 May 2011
Fukushima Daiichi	2	BWR	BWR-4	Inoperable	760	784	9 June 1969	18 July 1974	19 May 2011
Fukushima Daiichi	3	BWR	BWR-4	Inoperable	760	784	28 December 1970	27 March 1976	19 May 2011
Fukushima Daiichi	4	BWR	BWR-4	Shut down/Inoperable	760	784	12 February 1973	12 October 1978	19 May 2011
… and so on for 65 rows altogether.

Your turn: In what ways is the table tidy? How is it not tidy? What’s different about it from a tidy table?

Once you’ve answered the above questions … and only then … continue reading.

Among other things, some of the variable names appear redundant and others have multiple words separated by spaces. You can rename them using the data verb rename(), finding appropriate names from the Wikipedia table. Another problem is that the first row is not data but a continuation of the variable names. So row number 1 should be dropped.

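library(dplyr)

# Japan is assumed to already hold the element of tableList
# that you identified as the Japan table.

# you may want to do your own investigation to understand why this line is needed
names(Japan)[c(3,7)] <- c("type", "grossMW")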

Japan <-
  Japan %>%
  filter(row_number() > 1) %>%
  rename(name = Name, 
         reactor = `UnitNo.`,
         model = Reactor,
         status = Status, 
         netMW = `Capacity in MW`,
         construction = `Construction start`,
         operation = `Commercial operation`, 
         closure = Closure)

This sort of variable-name cleaning is common. (Notice the use of back ticks ` around variable names with special characters or spaces in them.) But it’s not the only sort of reformatting that’s needed here. Look at each of the variables and decide what the data type is: character, numerical, date, etc. Now use str() to see how the variable is typed in the data frame itself.

You are going to need to mutate() the variables that are not of the right type. Some suggestions:

To convert a character string of digits into a number, use as.numeric() or as.integer().
The lubridate package functions can be used to turn character string dates into a Date or POSIXct date object. (These are types of R object representing points in time and allowing plotting, mathematical operations and extraction of components, such as the year or day of the week.) Identify what the format of the date is. The lubridate translation functions are mdy(), mdy_hms(), dmy(), and so on.
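The two suggestions above can be tried on a few values before applying them to the whole table. In this sketch the three capacity and date values are copied from the Japan table shown earlier; the data frame names raw and cleaned are hypothetical.

```r
library(dplyr)
library(lubridate)

# Values copied from the Japan table above, stored as character strings
raw <- data.frame(
  netMW = c("148", "439", "760"),
  construction = c("10 May 1972", "25 July 1967", "9 June 1969"),
  stringsAsFactors = FALSE
)

# The dates are written day, month, year, so dmy() is the matching translator
cleaned <- raw %>%
  mutate(netMW = as.numeric(netMW),
         construction = dmy(construction))

str(cleaned)                 # netMW is now numeric, construction is a date
year(cleaned$construction)   # components can be extracted from dates
```

If a string does not match the expected format, the lubridate functions return NA with a warning, so it is worth checking for missing values after converting.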

Your turn: Using your cleaned data, make a plot of net generation capacity versus date of construction. Color the points by the type of reactor (for example: BWR, PWR, etc). Tip: boiling water reactor (BWR), pressurized water reactor (PWR), and other abbreviations are easily decoded with an Internet search. In addition to your plot, give a sentence or two of interpretation; what patterns do you see?

Your turn: Carry out the same cleaning process for the China reactor table, and then append it to the Japan data. Hint: functions such as bind_cols() or bind_rows() from the dplyr package are helpful for appending data frames. Use mutate() to add a variable that has the name of the country.
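To see how the hinted functions behave before using them on the real tables, here is a minimal sketch on two toy data frames. The names tableA, tableB and combined, and all the values in them, are placeholders rather than reactor data.

```r
library(dplyr)

# Toy stand-ins for two cleaned country tables (placeholder values)
tableA <- data.frame(name = c("Plant A1", "Plant A2"), netMW = c(100, 200))
tableB <- data.frame(name = "Plant B1", netMW = 300)

# Label each table with its country, then stack the rows
combined <- bind_rows(
  tableA %>% mutate(country = "Country A"),
  tableB %>% mutate(country = "Country B")
)

combined
```

bind_rows() matches columns by name and stacks the rows, which is why the two tables need the same variable names after cleaning. bind_cols() instead places data frames side by side.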

Collating the data for all countries is a matter of repeating this process over and over (you don’t need to do this). Inevitably, there are inconsistencies. For example, for many years the US data was organized in a somewhat different format from the Japan and China data, until Wikipedia editors decided to reconcile them.

Your turn: Make an informative graphic similar to Figure 18.2 that shows how long it took between start of construction and commissioning for operation of each nuclear reactor in Japan (or another country of your choice). One possibility: use reactor name vs date as the frame. For each reactor, set the glyph to be a line extending from start of construction to commissioning. You can do this with geom_segment() using name as the y coordinate and time as the x coordinate.

Figure 18.2: Time interval from start of construction to operation. Tip: use the paste() function to create the reactorID on the vertical axis.
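The mechanics of geom_segment() and paste() can be sketched on a few rows before working with the full table. The three rows below are copied from the Japan table shown earlier; the data frame name reactors and the column names unit and reactorID are assumptions for illustration, so adapt them to the names in your own cleaned data.

```r
library(ggplot2)
library(lubridate)

# Three rows copied from the Japan table shown earlier
reactors <- data.frame(
  name = c("Fugen", "Fukushima Daiichi", "Fukushima Daiichi"),
  unit = c(1, 1, 2),
  construction = dmy(c("10 May 1972", "25 July 1967", "9 June 1969")),
  operation = dmy(c("20 March 1979", "26 March 1971", "18 July 1974")),
  stringsAsFactors = FALSE
)

# Combine plant name and unit number into one label per reactor
reactors$reactorID <- paste(reactors$name, reactors$unit)

# One horizontal segment per reactor, from construction start to operation
p <- ggplot(reactors) +
  geom_segment(aes(x = construction, xend = operation,
                   y = reactorID, yend = reactorID)) +
  labs(x = "Date", y = "Reactor")
p
```

Setting y and yend to the same variable keeps each segment horizontal, so its length shows the time from start of construction to operation.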