Topic 13 Programming Basics

13.1 Programming Basics I: Names, classes, and objects {progbasics1}

13.1.1 Names

Composed of letters, numbers, _ and ..
- Don’t use . — it’s a bad habit. But plenty of people do. - Can’t lead with a number. - Capitalization counts. - Unquoted (… almost always)

13.1.2 Objects

Information (bits) in a particular format.
- Different formats for different purposes. - The format is the class() or mode(). mode is more basic than class.

Assignment: Give a name to an object

13.1.3 Vectors

1-dimensional homogeneous collections of numbers, character strings, booleans/logicals, etc.

13.1.3.1 Must Know!

There are some basic types of vectors. The most common are:

  1. numeric, e.g. 3, 3.14159, 6.023e26,6.626196e-34
  2. character, e.g. "hello", `“When in the course of human events …”
  3. logical (or “booleans”). The only allowable values: TRUE and FALSE

Much of the software you’ll use in this course will work with an alternative to character strings called “factors.”

  1. factor, an encoded representation of levels of a categorical variable.

Some operations you will use when dealing with categorical variables are as.character() and (for older software) as.factor().

Vectors are “1-dimensional” collections. That is, you need only one index to refer to a specific element.

my_vector <- c("apple", "berry", "cherry")
my_vector[2]
## [1] "berry"
my_vector[c(3, 1, 2, 2)]
## [1] "cherry" "apple"  "berry"  "berry"
my_vector[4] <- "durian"

Boolean indexing

my_vector[my_vector > "b"]
## [1] "berry"  "cherry" "durian"

If you want, you can convert the boolean style to a number style with which().

Other important functions:

  • length(), to say how many elements there are in the vector. The length can be zero.
  • Arithmetic operations, e.g. sum(), mean(), max(), cumsum(), and other functions (e.g. sqrt(), log(), sin()), …
  • Logical operations, that is, operations that transform vectors into a logical vector. Examples:
    • Comparison: >, ==, >=, !=, <, <=
    • Boolean operations: !, |, &. (Note, these are single characters.)
  • Categorical operations, e.g. table()
funny <- ceiling(length(my_vector)*runif(10))
my_vector[funny]
##  [1] "durian" "apple"  "apple"  "berry"  "berry"  "berry"  "durian" "apple"  "durian" "durian"
length(my_vector)
## [1] 4
dim(my_vector)
## NULL

13.1.4 Matrices

  • 2-dimensional homogeneous collections of numbers or of character strings. These collections are called matrices
    • 2-dimensional means you need two indices to refer to a specific element.
    • dim()
    • Operations, e.g., t(), colSums(), rowSums(), %*%, …
    • Index matrices with [ , ]. You need to give two

13.1.5 Lists

1-dimensional heterogeneous collections

  • create with list()
my_list <- list(magic = 1:10, greeting = "hello", is_wednesday = FALSE)
my_list[c("greeting", "magic", "magic")]
## $greeting
## [1] "hello"
## 
## $magic
##  [1]  1  2  3  4  5  6  7  8  9 10
## 
## $magic
##  [1]  1  2  3  4  5  6  7  8  9 10
my_list$magic
##  [1]  1  2  3  4  5  6  7  8  9 10
  • index with [[ ]] or [ ]. The first is for an individual item which you want in the form of that item itself. The second is to create a subset of the list which will be a list itself.
  • you can name items in lists and refer to them by name. [["name"]] To set names of list items, use named arguments to list() or use the names() function on the left-hand side of the <- operation. (The left-hand side of <- is called an “assignable.” These can be names, but they can be other things as well such as indexed arrays, names(), etc.)

  • Data frames

A list of vectors.

  • Each component in a given vector must be the same kind of thing as the other components. (Special cases: NA for missing data. Two more special things for numerical data.NaN, & Inf)
  • Important functions: nrow(), names(), $.

13.1.6 Functions

Take inputs and produce an output (and maybe a side-effect). - Make them with function(){ }, a special form. (It’s not actually a function!) - Try this

class(sin)
class(3)
class("a string")
class(function)
## Error: <text>:4:15: unexpected ')'
## 3: class("a string")
## 4: class(function)
##                  ^

13.2 Programming basics: Linear Models

Syntactic element: formulas. Formulas provide a way of using variables “symbolically.” This is useful, for instance, in depicting the desired relationship among variables. Two forms:

  • y ~ x two-sided
  • ~ x one-sided
  • NOT ALLOWED, y ~

Important functions:

  • lm(), predict() , anova(), summary().
  • For later: solve(), model.matrix().
  • From 155: coef(), fitted(), resid()

Training with lm(): Specify formula Y ~ X1 + X2

data(College, package = "ISLR")
mod <- lm(Outstate ~ Enroll + Accept + perc.alumni, data = College ) 
coef(mod)
## (Intercept)      Enroll      Accept perc.alumni 
## 6387.368606   -2.888077    1.101019  179.528731

What kind of thing is mod?

Model output with predict()

predict(mod, 
        newdata = data.frame(Enroll = 100, Accept = 1000, perc.alumni = 25))
##       1 
## 11687.8

What kind of thing is the output of predict()?

predict(mod, interval = "confidence",
        newdata = data.frame(Enroll = 100, Accept = 1000, perc.alumni = 25))
##       fit      lwr      upr
## 1 11687.8 11383.57 11992.03
predict(mod, interval = "prediction",
        newdata = data.frame(Enroll = 100, Accept = 1000, perc.alumni = 25))
##       fit      lwr      upr
## 1 11687.8 5548.956 17826.64

Why is the “confidence interval” so much narrower than the “prediction interval?”

Inference with anova() and summary(). This is all “in-sample” inference, not cross-validated.

M <- model.matrix(~ Enroll + Accept * perc.alumni, data = College)
qr.solve(M, College$Outstate)
##        (Intercept)             Enroll             Accept        perc.alumni Accept:perc.alumni 
##      7024.34843865        -3.00377409         0.78401053       147.31268325         0.02011475

For the people who have had linear algebra, why doesn’t this work?

solve(M, College$Outstate)
## Error in solve.default(M, College$Outstate): 'a' (777 x 5) must be square

Indexing on data: training and testing data sets

13.2.1 Graphics basics

  1. API for graphics: plot(), points(), lines(), polygon(), text(), …

13.3 K-nearest neighbors

K-nearest neighbors is a simple, general kind of function-building method. But some problems:

  • Interpretability: but you can always take partial derivatives.
  • When you have prediction (aka “explanatory”) variables in dollars and in miles, how do you calculate the distance between points? What are the dimensions of distance?
    • Dimensionality refers to the physical feature, e.g. time, distance, area, volume, money, charge, luminance, mass, …
    • Units are the ways in which dimensions are measured, e.g., cups, gallons, liters … all refer to volume
      • Give some examples of units for each of the dimensions.
      • Some everyday quantities are dimensionless, e.g. pure numbers. Give some examples: … (angles, percent, fractions, … but not ratios in general.)
    • Regression fixes units automatically, since the coefficients themselves have dimensionality. They will adjust automatically to changes in units, so the model is the same regardless of whether we use miles, km, parsecs, …
    • In KNN, to avoid dependence on units, need to do some standardization by dividing by something in the same units, e.g. sd.
  • Curse of dimensionality. Let’s create 1000 randomly placed points in the unit square:
rpts <- matrix(runif(2*1000), ncol=2)

What’s the distribution of distances from a single random point to the 1000 others:

our_point <- runif(2)

The distance between our point and each of the others

tmp <- matrix(our_point, ncol=2, nrow=1000, byrow=TRUE)
delta <- sqrt(rowSums((rpts - tmp)^2))
  • How far away is a typical point?
  • Write a function that takes the matrix of points and the “our point” and finds the distance from our point to each and every one of the points in the matrix.
  • How far away is a typical point in 1-dimensional space?
  • In 10-dimensional space?
  • In 100-dimensional space?

13.4 Loops/Iteration

Loops are the programming control structure that allows you to repeat the same commands many times.

A definition of insanity: Doing something over and over again and expecting a different result.

13.5 Parts of a loop

  1. Preparation — creating a place to hold the results

    This is called the “accumulator.”

  2. Identify a set to loop over.
  3. Inside the loop, modify the accumulator
  4. When the loop is done, package up the results.

13.6 Trivial examples

  • Find the sum of squares of a vector. (R sum(x^2))
  • Find the biggest element of a vector. (R max())
  • Generate \(k\) random numbers from the set 1:n (R sample())
    • with replacement: pre-allocate result, loop and select
    • without replacement: shrink the set of possibilities each time.
  • Find the \(k\)th Fibonacci number:
    • with previous and before_that as the state
    • with an array as the state: initialize to c(1, 1)
    • with a global array as the state memoization.
    • with an array created in the capturing environment.

**In practice, we would use the already parallelized R functions. See also Vectorize().

13.7 Bootstrapping

Process

  • Set up accumulator — what should we store? (all coefs)
  • but we don’t know what this should look like until we’ve tried it out
  • Loop:
  • create a new random sample
  • fit the model
  • store away the results
  • Post-process:
  • Give std-err?
  • Give covariance matrix?
  • Give array?

13.8 Leave-one-out cross-validation.

# preparation
my_data <- mosaicData::KidsFeet
error <- numeric(nrow(my_data))

# The looping set: each row in my_data
for (k in 1:nrow(my_data)) {
  # the body of the loop
  mod <- lm(width ~ length * sex, data = my_data[ -k, ])
  mod_value <- predict(mod, newdata = my_data[k, ])
  error[k] <- my_data$width[k] - mod_value
}

# packaging up the results
result <- sum(error^2)

Look at the result:

result
## [1] 6.304381
regular_model <- lm(width ~ length * sex, data = my_data)
sum(resid(regular_model)^2)
## [1] 5.329327
anova(regular_model)
## Analysis of Variance Table
## 
## Response: width
##            Df Sum Sq Mean Sq F value   Pr(>F)    
## length      1 4.0557  4.0557 26.6353 9.86e-06 ***
## sex         1 0.4790  0.4790  3.1456  0.08484 .  
## length:sex  1 0.0037  0.0037  0.0246  0.87638    
## Residuals  35 5.3293  0.1523                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

In ANOVA, we use a degrees of freedom to adjust for the under-estimate of residuals.

sum(resid(regular_model)^2) / 35
## [1] 0.1522665

In leave-one-out, we can simply average the errors:

result / 38
## [1] 0.1659048

13.9 Building a package

Divide up into groups of two or three.

  1. Open a new project: choose “new directory”, “choose R package”
  2. Go to the “Build” tab, select More/Configure Build Tools and fill in the checkmark for Roxygen comments.
  3. Write functions for \(C_p\), \(AIC\), \(BIC\), and adjusted \(R^2\). The function should take a model as input.
  4. Document the functions.
  5. Compile the package.