A text file of the book is available at http://www.gutenberg.org/ebooks/2701. At that page is a link to a UTF-8 encoded text document named "pg2701.txt". I downloaded the file and stored it on my machine as pg2701.txt. I can read that using readLines().
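If you prefer to do the downloading from inside R, download.file() can fetch the same document. A quick sketch, using the UTF-8 link shown on that page:
download.file("http://www.gutenberg.org/ebooks/2701.txt.utf-8",
              destfile = "pg2701.txt")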
Moby <- readLines("pg2701.txt")
You could also read the file directly from Project Gutenberg:
con <- file("http://www.gutenberg.org/ebooks/2701.txt.utf-8")
Moby <- readLines(con)
close(con)
The result, stored in the Moby object, is a character vector of 22108 strings. Some of these are prefatory matter, some postscript.
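A quick sanity check on what was read in:
length(Moby)   # 22108 for the copy of the file used here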
The text itself begins after a line containing
start_text <- "START OF THIS PROJECT GUTENBERG"
and ends before a line containing
end_text <- "END OF THIS PROJECT GUTENBERG"
Using these as delimiters includes some transcriber’s notes, etc. For simplicity, I’ll take as the start the line
start_text <- "CHAPTER 1\\."
The last line of the book is simply “orphan.”, so for the ending line I’ll use
end_text <- "^orphan\\.$"
Why the funny spelling? The start_text and end_text are being specified as “regexes” (regular expressions). In end_text, the ^, \\., and $ say that the word “orphan” sits at the very beginning of the line, followed by a literal period, and then the line ends.
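To see how the escaping and the anchors behave, here is a quick check on a few made-up strings:
grepl("CHAPTER 1.", "CHAPTER 10")    # TRUE: an unescaped . matches any character
grepl("CHAPTER 1\\.", "CHAPTER 10")  # FALSE: \\. matches only a literal period
grepl("^orphan\\.$", c("orphan.", "an orphan.", "orphanage"))
# TRUE FALSE FALSE: only a line that is exactly "orphan." matches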
Regexes are a way of describing patterns. For our purposes, we’ll use them to identify the first and last line of Melville’s work in the Project Gutenberg text. As it happens, there are two instances of “CHAPTER 1.” in Moby Dick. The second is a book within the book. We want to start with the earlier instance.
first_line <- min(grep(start_text, Moby))  # the earlier of the two matches
last_line <- grep(end_text, Moby)
Moby <- Moby[first_line:last_line]         # keep only the text of the novel
We want to break the strings up into individual words. We’ll do this “by hand” because I want to render the text as a simple set of words and punctuation. The steps: (1) surround each punctuation mark with spaces, (2) split the strings at the spaces, (3) store the resulting words, in lower case, in a data frame.
tmp <- Moby
# Also keep a character-by-character version, used later for character frequencies
characters <- unlist(strsplit(tolower(Moby), split = NULL))
# Step 1: surround each punctuation mark with spaces
punctuation <- c(".", ",", ";", ":", "?", "!", '"',
                 "'", "&", "-", "(", ")", "[", "]")
for (symbol in punctuation) {
  result <- paste0(" ", symbol, " ")
  tmp <- gsub(symbol, result, tmp, fixed = TRUE)
}
# Step 2: split the strings at the spaces
Words <- unlist(strsplit(tmp, split = " "))
# Step 3: store the words, in lower case, in a data frame
Words <- data.frame(word = tolower(Words),
                    stringsAsFactors = FALSE)
# Get rid of the empty strings left behind by the splitting
library(dplyr)
Words <-
  Words %>%
  filter(word != "")
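To see what the three steps accomplish, here is a short sentence (adapted from the book’s opening line) pushed through the same treatment:
toy <- "Call me Ishmael. Some years ago."
for (symbol in punctuation) {
  toy <- gsub(symbol, paste0(" ", symbol, " "), toy, fixed = TRUE)
}
unlist(strsplit(toy, split = " "))
# gives: "Call" "me" "Ishmael" "." "" "Some" "years" "ago" "."
The empty string comes from the double space left around the first period; that is exactly what the filter(word != "") step above removes.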
What are the character frequencies in the book?
table(characters) %>%
  data.frame(stringsAsFactors = FALSE) %>%
  arrange(desc(Freq)) %>%
  mutate(cumulative = 100 * cumsum(Freq) / sum(Freq)) %>%
  DT::datatable(.)
Popular words
Popular <-
  Words %>%
  group_by(word) %>%
  tally() %>%
  arrange(desc(n)) %>%
  mutate(relfreq = 100 * cumsum(n) / sum(n))  # cumulative percentage of all words
DT::datatable(head(Popular, 1000))
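The same table can be queried directly; for instance, to see how often one particular word appears (the word “whale” is just an illustrative choice):
Popular %>% filter(word == "whale")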
To look at sequences of consecutive words, line each word up with the ones that follow it:
Sequences <-
  Words %>%
  filter(grepl("[a-zA-Z]", word)) %>%
  mutate(two = lead(word, 1), three = lead(two, 1),
         four = lead(three, 1))
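The work here is done by lead(), which shifts a column upward so that each row is paired with the values that follow it. A tiny made-up example:
data.frame(word = c("the", "white", "whale", "swam")) %>%
  mutate(two = lead(word, 1), three = lead(two, 1))
# row 1 becomes: the, white, whale; row 2: white, whale, swam;
# the final rows get NA because nothing follows them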
Popular pairs
CommonPairs <-
  Sequences %>%
  group_by(word, two) %>%
  tally() %>%
  ungroup() %>%
  arrange(desc(n)) %>%
  mutate(relfreq = 100 * cumsum(n) / sum(n))
DT::datatable(head(CommonPairs, 1000))
Popular triplets
Triplets <-
  Sequences %>%
  group_by(word, two, three) %>%
  tally() %>%
  ungroup() %>%
  arrange(desc(n)) %>%
  mutate(relfreq = 100 * cumsum(n) / sum(n))
DT::datatable(head(Triplets, 1000))
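These tables make it easy to ask follow-up questions, such as which words most often follow a given word. Using “white” purely as an illustration:
CommonPairs %>%
  filter(word == "white") %>%
  head(5)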