Unsupervised Genetics

Data Computing

Computing project

There are many different kinds of cancer, often given the name of the tissue in which they originate: lung cancer, ovarian cancer, prostate cancer, and so on.

In this exercise, you are going to look for possible relationships among different cancer types. The basis for this will be the data in NCI60, which records the level of gene expression indicated by each of 41,078 probes against each of 60 different cell lines collected from different people with different kinds of cancer.

Looking at so many probes is a kind of shotgun approach enabled by microarray chips. It’s assumed that only a few genes might be involved in any given type of cancer; by looking at the expression of many genes, it’s hoped to be able to identify those few.

Clustering the cancer types

The point of clustering is to identify pairs or sets of cases that are similar. In hierarchical clustering, the pairs or sets are themselves grouped together in terms of similarity. The result is a kind of tree, a dendrogram, that shows which cases and sets of cases are similar.

As an example, here is a dendrogram constructed for the different cars in mtcars. There are two major groups of cars. From the base of the tree, two branches identify two major groups of cars. Each of those groups is subdivided, with the process repeating until reaching a leaf: an individual case. The level of the horizontal line connecting two sub-branches indicates the distance between those sub-branches. For example, the Honda Civic, Toyota Corolla, and Fiat 128 are quite similar.

Glyph-ready data for constructing a dendrogram is in the usual tidy form, with the cases being the objects to cluster and the variables being the means to judge similarity between cases. The mtcars data table is already in glyph-ready form; take a look! Note that the identifiers for the object, e.g. “Datsun 710,” are not stored in a variable but in the row names of the data table.

                   mpg cyl disp  hp
Mazda RX4         21.0   6  160 110
Mazda RX4 Wag     21.0   6  160 110
Datsun 710        22.8   4  108  93
Hornet 4 Drive    21.4   6  258 110
Hornet Sportabout 18.7   8  360 175
… and so on for 32 rows.
 
Table 1. Part of the mtcars data table

The process is as follows:

  1. Find the distances from each case to every other case. The dist() function accomplishes this.

    Dists <- dist(mtcars)
  2. Apply the clustering algorithm to make the dendrogram.

    Dendrogram <- hclust(Dists)
    ddata <- dendro_data(Dendrogram)
  3. Visualize the dendrogram

    ggdendrogram(Dendrogram, rotate = TRUE) +
      geom_text(data = ddata$labels, aes(x=x, y=y, label = label), vjust = 0) 

Wrangling the genetics data

The NCI60 are not yet in a form suitable for clustering. There are several shortcomings.

        Probe cellLine expression
 A_23_P100001     BR.B      -4.58
 A_23_P100001     BR.H      -4.57
 A_23_P100001    BR.MC       1.50
 A_23_P100001    BR.MD      -4.30
 A_23_P100001     BR.T       2.16
… and so on for 1940640 rows.
 
Table 2. NCI60 gathered into a narrow form. There is a row for each probe at each of the 60 cell lines.

  1. The objects to be clustered — cell lines — are represented as variables instead of cases.
  2. There are far too many probes to be able to identify “similarity” in a meaningful way. Recall that most of the probes are irrelevant.
  3. The identifiers are not stored as the row names.

To start, put the data in a narrow form, like Table 2.

Narrow <- 
  NCI60 %>%
  gather(value=expression, key=cellLine, -Probe) %>%
  group_by(Probe, cellLine) %>%
  summarise(expression = mean(expression)) %>% 
  ungroup()

It happens that the same probe can appear multiple times on the gene chip. Combine all such repeats by grouping by Probe and cellLine and summarizing each group with expression = mean(expression). Then use ungroup() on the result.1 When you group on more than one variable, summarise() leaves the result partially grouped. This can interfere with later operations.

In the narrow form, there’s a simple way to pull out probes that might be of relevance to distinguishing among the different kinds of cancer. Presumably, for a relevant probe the expression will vary a lot among the cancer types. A very simple indicator of this is the standard deviation of each probe’s expression. You can calculate this by grouping over Probe and summarising with spread = sd(expression).

The spread of expression of probes across cell lines ordered from biggest to smallest for the leading 500 probes. For comparison, the red line shows the spread under the Null Hypothesis assumption that there is no relationship between probe and cell line. The spread of expression of probes across cell lines ordered from biggest to smallest for the leading 500 probes. For comparison, the red line shows the spread under the Null Hypothesis assumption that there is no relationship between probe and cell line.

keep <- 500
Best <-
  Narrow %>%
  group_by(Probe) %>%
  summarise(spread = sd(expression)) %>%
  arrange(desc(spread)) %>%
  mutate(i = row_number()) %>%
  head(keep)
Randomized <-
  Narrow %>%
  mutate(Probe = shuffle(Probe)) %>%
  group_by(Probe) %>%
  summarise(spread = sd(expression)) %>%
  arrange(desc(spread)) %>%
  mutate(i = row_number()) %>%
  head(keep)
Best %>% 
  ggplot(aes(x=i, y=spread)) + 
  geom_line() +
  geom_line(data=Randomized, color="red", size=1, alpha=.5)

Identify a threshold for spread that identifies potential candidates for differential expression across cell lines and filter to keep only the probes above that threshold.

Keepers <-
  Narrow %>% group_by(Probe) %>%
  filter(sd(expression) > 4.5)

Wrangle the result into a wide format with the variables being the probes and the cases being the cell lines.

Keepers <-
  Keepers %>%
  spread(key=Probe, value=expression)
scale_colour_discrete <- function(...) scale_colour_hue(...)

Cluster those cases and display your result. (You’ll have to select out the cellLine variable in the data table you hand to dist(). The dist() function doesn’t work with names like those in cellLine.)