Data Computing

Wrangling

The process of transforming, aggregating, and combining data to put it in a glyph-ready form for your particular purpose.

Other terms used

  • Data Processing
  • Data Manipulation

Data Computing book uses "Wrangling" because

  • "Wrangling" is distinctive and therefore rises about cognitive noise
  • "Wrangling" doesn't carry misleading connotations
    • "Processing" suggests routine
    • "Manipulation" suggests nefarious intent

Five Main Kinds of "Things"

"Thing" a.k.a. "object"

  1. Data tables: contain variables Our convention: first letter is upper-case: BabyNames
  2. Variables: the stuff in data tables Our convention: first letter is lower-case: year
  3. Scalars: e.g. "Treatment", 42
  4. Functions and their arguments: Specify actions and carry them out
    • positional arguments
    • named arguments
  5. Pipes: Arrange sequences of actions

Three Main Kinds of Functions for Wrangling

  1. Data verbs
  2. Transformation functions
  3. Reduction functions

Data Verbs

  • Always take a data table as an input.
  • May also take other arguments, e.g.
    • variables,
    • transformation or reduction operations
    • a second data table
  • Always return a data table as the output

Note: Our style is never to use $ or other indexing operations. If you're using $, try to figure out another way.

Transformation Functions

Always take variables and/or scalars as inputs

EXAMPLES

A transformation expression is typically:

  • function(variable)
  • function(variable1, variable2)
  • function(variable, scalar)

For instance:

  • Age > 18
  • weight / height ^ 2

Reduction Functions

Take as arguments

  • variable(s)
  • transformation expressions

The Data We're Going to Use

require(babynames)
BabyNames <- babynames  # another name for the data frame

Examples of Simple Wrangling Statements

filter(NHANES, Age >= 21)
select(NHANES, Age, SmokeNow)
mutate(NHANES, Age_group = ntiles(Age, 5))
summarise(NHANES, mean.age = mean(Age))
arrange(NHANES, Age)

Quiz: What kind of thing is each argument?

Pipes

  • Streamline the appearance of data verbs
  • Support chains of operations: a step-by-step style
NHANES %>% filter(Age >= 21)
NHANES %>% select(Age, SmokeNow)
NHANES %>% mutate(Age_group = ntiles(Age, 5))
NHANES %>% summarise(Age_group = ntiles(Age, 5))
NHANES %>% arrange(Age)

Key Data Verbs

Memorize these and what they do!

  • arrange() – put data in a specified order
  • filter() – subset some of the rows/cases
  • select() – subset some of the columns/variables
  • mutate() – add a variable (column)
  • summarise() – summarise the data with a single row

Working subset by subset with group_by

group_by(BabyNames, name, year)
BabyNames %>% group_by(name, year)
  • group_by() puts the data into groups
  • nothing will be apparent until you do subsequent operations

Bringing in data from multiple sources

Also very important, but to be covered later,

  • joins such as inner_join()

There are a few other things as well

Examples: sample_n(), rename(), …

Zebrafish

Try them out