15 Clustering

15.1 K-means

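K-means can be viewed as an EM-style algorithm: it alternates between assigning each point to its nearest cluster center and recomputing each center as the mean of the points assigned to it, until the assignments stop changing.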
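To make the alternation concrete, here is a minimal sketch of the two steps in base R. It is not how kmeans() is implemented internally (kmeans() defaults to the Hartigan-Wong method); simple_kmeans is a hypothetical helper written only for illustration, and the seed and k = 3 are arbitrary choices.

```r
# Minimal k-means sketch (Lloyd-style iteration); simple_kmeans is a hypothetical helper
simple_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  # Start from k distinct rows chosen at random
  centers <- x[sample(nrow(x), k), , drop = FALSE]
  cluster <- rep(0L, nrow(x))
  for (iter in seq_len(max_iter)) {
    # Assignment step: squared distance from every point to every center
    d2 <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centers[j, ])^2))
    new_cluster <- max.col(-d2, ties.method = "first")
    if (all(new_cluster == cluster)) break  # assignments are stable: stop
    cluster <- new_cluster
    # Update step: each center becomes the mean of its assigned points
    for (j in seq_len(k)) {
      members <- cluster == j
      if (any(members)) centers[j, ] <- colMeans(x[members, , drop = FALSE])
    }
  }
  list(cluster = cluster, centers = centers)
}

set.seed(1)  # arbitrary seed, only so the random start is reproducible
fit <- simple_kmeans(iris[, 1:4], k = 3)
table(fit$cluster)
```

Because the starting centers are random, different seeds can give different clusterings; this is the usual reason for running k-means from several starts.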

library(dplyr)    # provides select() and %>%
library(ggplot2)  # provides ggplot()
names(iris)
## [1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width" 
## [5] "Species"
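# kmeans() starts from randomly chosen centers, so the clusters can differ between runs
mod0 <- kmeans(iris %>% select(-Species), centers = 6)
To_plot <- iris
# Label each flower with a letter (a to f) for its assigned cluster
To_plot$cluster <- letters[mod0$cluster]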
ggplot(To_plot, aes(x = Sepal.Length, y = Petal.Length, color = cluster, shape = Species)) +
  geom_point()

ggplot(To_plot, aes(x = Sepal.Length, y = Sepal.Width, color = cluster, shape = Species)) +
  geom_point()
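The plots compare cluster (color) with species (shape) by eye. A cross-tabulation is one way to summarize the same comparison numerically. This is only a sketch: the seed and nstart = 20 are assumptions added here for reproducibility and stability, not settings taken from the model above.

```r
set.seed(1)  # arbitrary seed, only for reproducibility
# nstart = 20 reruns k-means from 20 random starts and keeps the best fit (an added assumption)
fit6 <- kmeans(iris[, 1:4], centers = 6, nstart = 20)

# Rows are the known species, columns are the cluster labels found by k-means
confusion <- table(Species = iris$Species, cluster = letters[fit6$cluster])
confusion
```

A column with counts in only one row is a cluster made up of a single species; counts spread across rows show where k-means mixes species.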

# Load the blood-cell data; the file is expected to provide the Cells1 data frame used below
load("Blood-Cell-data.rda")
ggplot(Cells1, aes(x=x1, y=x2)) +
  geom_point()

mod <- kmeans(Cells1 %>% select(x1, x2), centers=8)
Cells1$cluster <- letters[mod$cluster]
ggplot(Cells1, aes(x=x1, y=x2, color = cluster, shape = class)) +
  geom_point()

15.2 Hierarchical clustering

Linkage

  • Complete: maximum distance between points in the two clusters
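  • Single: minimum distance between points in the two clusters
  • Average: mean of all pairwise distances between points in the two clusters
  • Centroid: distance between the two cluster centroids (center to center)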
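The linkage choices above correspond to the method argument of hclust(). The sketch below fits all four on the iris measurements and compares the cluster sizes after cutting each tree into three groups. Scaling the data and using k = 3 are illustrative assumptions; squaring the distances for centroid linkage follows the usage shown in the hclust() help page.

```r
x <- scale(iris[, 1:4])  # standardize so no single measurement dominates
d <- dist(x)             # Euclidean distances between flowers

linkages <- c("complete", "single", "average", "centroid")
trees <- lapply(linkages, function(m) {
  # Centroid linkage is meant to be used with squared Euclidean distances
  hclust(if (m == "centroid") d^2 else d, method = m)
})
names(trees) <- linkages

# Cut each tree into 3 clusters and count the cluster sizes
sizes <- lapply(trees, function(tr) table(cutree(tr, k = 3)))
sizes
```

Comparing the size tables shows how much the result depends on the linkage; plot(trees$complete) draws the corresponding dendrogram.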

15.3 Example: Gene expression in cancer

## Loading required package: ggdendro
## Warning: Setting row names on a tibble is deprecated.