COMP 110 - Weekly Goals

Week 1

At the end of this week, you should be able to:

Building on the Week 1 goals, at the end of Week 2, you should be able to:

Relate glyphs and frames to data. More specifically, given a simple scientific graph, be able to sketch the form of the data table that underlies the graph.
Use the helper functions mScatter() and mBar() to generate graphics interactively by assigning variables to graphical attributes.
Create an Rmd document using the template.
Add R calculations to an Rmd document using the “chunk” notation.
Use the group_by() and summarise() data verbs to calculate all-cases and groupwise summaries of data.
Use the chaining syntax (%>%) to send the output of one expression as the input to another. Distinguish between assignment (<-) and chaining (%>%).
Use str(), names(), nrow(), and head() to get brief displays of data tables, and be able to interpret the output of those functions.
Be able to use, and understand the use of, functions such as mean(), sum(), and n() within summarise(). This includes constructing appropriate variable names in the output of summarise() and using the named-argument syntax (=).
When a function such as mean() returns NA, use the optional argument na.rm=TRUE.¹
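A minimal sketch of how several of the goals above fit together, assuming the dplyr package is installed. The measurements table and its variable names are made up for illustration.

```r
library(dplyr)

# Hypothetical data table: one row per measurement, with one missing value.
measurements <- data.frame(
  site   = c("A", "A", "B", "B", "B"),
  height = c(10, 14, 20, NA, 30),
  stringsAsFactors = FALSE
)

# Without na.rm = TRUE, a single NA makes the whole mean NA.
naive_mean <- mean(measurements$height)

# Assignment (<-) stores a result under a name;
# chaining (%>%) passes the output of one step as the input to the next.
by_site <- measurements %>%
  group_by(site) %>%
  summarise(
    count       = n(),                        # number of cases in each group
    mean_height = mean(height, na.rm = TRUE)  # skip the NA in site B
  )

# All-cases summary: no group_by(), so the result has a single row.
overall <- measurements %>%
  summarise(total_height = sum(height, na.rm = TRUE))
```

Calling str(by_site), names(by_site), nrow(by_site), or head(by_site) afterwards gives a brief display of the groupwise result.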

Construct a graphic with more than one layer of glyphs.
Recognize and understand the basic operation and purpose of these data verbs: filter, select, mutate, and arrange.
Be able to chain multiple operations together to produce novel, more complex operations.
Understand the function of the join data verb: combining information from two data tables into a single output data table.
Recognize the criterion that join uses to match cases from the two tables, and that a case in input table A may match no cases in B, one case in B, or multiple cases in B.
Understand that there are different varieties of join to deal with cases that have no match in the other table.
Recognize the use of join for each of these purposes:
- translate a variable to new levels (as in re-coding a categorical variable)
- augment a data table to add new variables
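The join goals above can be illustrated with a small sketch, assuming the dplyr package. The countries and cities tables are hypothetical; they are set up so that one case in the first table matches two cases in the second, one matches a single case, and one matches none.

```r
library(dplyr)

# Hypothetical tables that share the matching variable "code".
countries <- data.frame(
  code = c("US", "FR", "DE"),
  name = c("United States", "France", "Germany"),
  stringsAsFactors = FALSE
)
cities <- data.frame(
  code = c("US", "US", "FR"),
  city = c("Boston", "Chicago", "Paris"),
  stringsAsFactors = FALSE
)

# left_join() keeps every case from the left table:
# US matches two cities, FR matches one, DE matches none (city is NA).
all_countries <- countries %>% left_join(cities, by = "code")

# inner_join() drops cases that have no match in the other table.
matched_only <- countries %>% inner_join(cities, by = "code")

# Re-coding: translate each city's country code into the full country name.
cities_named <- cities %>%
  left_join(countries, by = "code") %>%
  select(city, name)
```

The same left_join() call serves both purposes listed above: it augments cities with a new variable, and that variable happens to be a translation of code into new levels.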

Combine data from two tables using a join data verb. (See also Week 3 goals for join.)
Understand how filter and mutate, when used in conjunction with group_by(), can draw on group-specific properties for comparison to individual cases.
Understand the concept of “rank” and how it can be used with filter and group_by() to perform operations such as “find the 3 biggest in each group.”
Recognize the difference between wide and narrow formats of data. Understand that it’s possible to convert between wide and narrow and have examples in mind of when this is appropriate.
Recognize that networks consist of vertices and edges that connect pairs of vertices.
Be able to construct (using data verbs) data tables that list edges and edge-wise variables. Be able to follow instructions to use layers in ggplot() to graph a network along with the edge and vertex properties. (You don’t need to be able to graph networks from scratch, although it’s not that hard.)
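As one possible illustration of the groupwise goals above, the sketch below uses dplyr on a made-up scores table. It keeps the 2 biggest cases in each group (rather than 3, only because the example table is small) and then compares each case to its own group's mean.

```r
library(dplyr)

# Hypothetical data: points scored by players on two teams.
scores <- data.frame(
  team   = c("x", "x", "x", "y", "y", "y"),
  player = c("a", "b", "c", "d", "e", "f"),
  points = c(5, 9, 7, 3, 8, 6),
  stringsAsFactors = FALSE
)

# "Find the 2 biggest in each group": rank within each team, then filter.
top2 <- scores %>%
  group_by(team) %>%
  filter(rank(desc(points)) <= 2) %>%
  ungroup()

# Compare individual cases to a group-specific property (the team mean).
above_team_mean <- scores %>%
  group_by(team) %>%
  mutate(team_mean = mean(points)) %>%
  filter(points > team_mean) %>%
  ungroup()
```

Because the table is grouped first, rank() and mean() are computed separately within each team instead of over all cases.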

Recognize aspects of graphs chosen in order to enhance readability or interpretability.
- Order of categorical levels
- Logarithmic axes
- Discrete colors
- Shades, colors, and shapes for “minimal ink”
Understand that “statistics” describe the collective properties of a set of cases.
- Be able to plot densities in one and two variables.
- Have a basic understanding of how box-and-whisker plots, or violin plots, can be used to compare densities of a quantitative variable at different levels of a categorical variable.
Recognize that a density can be a more effective presentation of the location of data, particularly when there are a very large number of cases.
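A short sketch of the density goals above, assuming the ggplot2 package and using R's built-in mtcars data table. The choice of variables is only for illustration.

```r
library(ggplot2)

# One variable: density of fuel economy (mpg).
density_1d <- ggplot(mtcars, aes(x = mpg)) +
  geom_density()

# Two variables: contour lines of the joint density of weight and mpg.
density_2d <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_density_2d()

# Comparing the density of a quantitative variable across the levels
# of a categorical variable: violins with narrow box plots layered on top.
by_cylinders <- ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_violin() +
  geom_boxplot(width = 0.1)
```

Typing the name of any of these objects at the console, for example density_1d, draws the graph.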

Recognize different common ways of making data available over the Internet.
Be able to download or access data stored in these common ways.
Use simple regular expressions to find patterns in character strings.
Be able to identify “blunders” using graphics. Use filter to eliminate blundering cases, or use NA to mark them as blunders.
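The regular-expression and blunder goals above might look like this in practice. This is only a sketch, assuming dplyr; the records table, the ID pattern, and the plausible height range of 50 to 250 cm are all assumptions made for the example.

```r
library(dplyr)

# Hypothetical records: one malformed ID and two implausible heights.
records <- data.frame(
  id        = c("A-101", "B-202", "a103", "C-304"),
  height_cm = c(165, 1720, 180, -1),
  stringsAsFactors = FALSE
)

# Regular expression: a capital letter, a hyphen, then exactly three digits.
well_formed <- grepl("^[A-Z]-[0-9]{3}$", records$id)

# Option 1: use filter to eliminate the blundering cases.
plausible <- records %>%
  filter(height_cm > 50, height_cm < 250)

# Option 2: keep every case, but mark the blundering values as NA.
marked <- records %>%
  mutate(height_cm = ifelse(height_cm > 50 & height_cm < 250, height_cm, NA))
```

Marking with NA keeps the other variables of a blundering case available for later analysis, while filter removes the whole case.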

Understand why graphical presentations are ill-suited to display relationships that involve many variables.
Be able to create predictive models using clustering or decision trees.
Know how to evaluate the quality of two-level classification models.
Be aware of “dimension reduction” techniques such as SVD.
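A rough sketch of three of the ideas above, using only functions that ship with base R and the built-in iris data. The actual and predicted vectors are invented for the example, and k-means is just one of several possible clustering methods.

```r
# Evaluating a two-level classifier with a confusion matrix.
# Hypothetical true classes and model predictions for eight cases.
actual    <- factor(c("yes", "yes", "yes", "no", "no", "no", "no", "yes"),
                    levels = c("no", "yes"))
predicted <- factor(c("yes", "no", "yes", "no", "yes", "no", "no", "yes"),
                    levels = c("no", "yes"))

confusion   <- table(predicted = predicted, actual = actual)
accuracy    <- sum(diag(confusion)) / sum(confusion)
sensitivity <- confusion["yes", "yes"] / sum(confusion[, "yes"])  # true "yes" found
specificity <- confusion["no", "no"] / sum(confusion[, "no"])     # true "no" found

# Clustering: group the iris flowers using their four measurements.
scaled <- scale(iris[, 1:4])
set.seed(110)  # k-means starts from random centers
clusters <- kmeans(scaled, centers = 3, nstart = 10)

# Dimension reduction: SVD of the same scaled measurements.
decomposition  <- svd(scaled)
variance_share <- decomposition$d^2 / sum(decomposition$d^2)
```

Here variance_share gives the fraction of the total variation captured by each successive component, which suggests how many dimensions are worth keeping.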


¹ I hate that you have to use na.rm=TRUE so often. In my view, this is what the default should be. Unfortunately, it’s hard to create a new default value, so we’re stuck with specifying na.rm=TRUE each time we use the functions.