Chapter 7 Informing statistical investigation

The paramount goal of data science is to extract information from data. It’s natural, but misleading, to focus the attention of data science on “data.” But data has value only insofar as it informs our belief and understanding of the world and the decisions we make in interacting with the world. This chapter explores some of the richness of the idea of informing, which goes well beyond the everyday notion of transmitting a fact.

The prefix “in” is the Latin for “into.” The stem, “form,” has several related meanings including: the visible shape or arrangement or expression of a thing; a frame or mold which shapes something; a document with blank spaces for the insertion of data. As a verb it can mean to make into a shape; to bring together and organize people or things into a new entity; to appear gradually; to influence. “Form” is also paired with other prefixes: conform, reform, transform.

The point here is to suggest that the process of “informing” involves multiple components: things that are brought together, changed, updated, extended. One of these things is a person’s state of belief or knowledge before interaction with the data. This previous belief or knowledge is called, in statistics, prior belief or knowledge.

Data can be used to transform our prior knowledge into something new, to reshape our prior belief to conform to observed facts. To inform is not just to derive descriptions from data, but to bring data into contrast and comparison with what was in our head before we encountered the data.

This chapter is about starting with existing belief and knowledge. Based on this we engage in a process, often gradual and multifaceted, of deciding what facts to collect and how to interpret them in the light of what we already knew or believed. This is sometimes called the process of statistical investigation.

There is not a single, universal process of statistical investigation. Each investigation is unique and shaped by the problem at hand, the reason behind the investigation, and the resources available. In discussing statistical investigation generally, the best we can do here is to identify common components of typical investigations. I’ll put these components into a sensible order that statistical investigations often follow. Other authors might reasonably define the components differently, put them in a different order, or insert new components that those authors believe help in undertaking a statistical investigation.

An individual might work on only part of the process. Still, to make informed decisions about the paths to take on your journey, you should be familiar with all the components.

7.1 Component I: Identify the goal

To start, think about why you are about to undertake a statistical investigation. Some general kinds of goals are to:

  1. make a prediction of the as-yet-unknown value of a variable from already available knowledge and facts
  2. anticipate how a proposed intervention, whether that be as serious as the administration of a new drug or as superficial as choosing the color of a “buy” button on a website, will shape outcomes
  3. detect anomalies, such as a purchase with a credit card that breaks the pattern already established by the legitimate owner of that card
  4. form new hypotheses as proposals to move forward with other work

Depending on your goal, one statistical technique or another might be appropriate. And, depending on your goal, the data you already have at hand may be adequate or completely inadequate. Until you know your goal, there’s no telling.

Very often data scientists are working as part of a team. The overall goal of the team or the client who sponsors the work may not be the same as your own personal goals. Professional ethics require you not to use your specialized skills to take unfair advantage of others on your team or of your client. Often, you may find that the goal expressed by your client is not feasible or is inconsistent with what you believe is a deeper, underlying goal of your client. You should try, as much as feasible, to have a broad understanding of your client’s objectives. Sometimes, you may be in the position of explaining to your client why what the client thinks they want may not be what they actually want.

7.2 Component II: Identify existing knowledge

Find out what you can about the setting, system, or situation you are working with. Is there a published research literature on the topic or on closely related topics? Is there an expert on your team who can explain or opine about the mechanisms involved? Is there a specialized vocabulary? Are there relevant demographics?

What sorts of data are available? How were they collected? Exactly what does each variable mean?

Are there analogous settings or previous work that might provide some insight into how to go about understanding the system you are working with?

If your goal involves anticipating an intervention or generating hypotheses, you will want to think broadly about what causes what in your system, what quantities or characteristics might be involved, and how those quantities or characteristics might be connected to one another. This may well be speculative on your part.

It often happens that there are one or more covariates – aspects of a system in which you have little or no direct interest but which might play a role in shaping the aspects that you do care about.

To illustrate, consider the work of political pollsters. The motivating goals for a political poll are varied: e.g., to predict what the outcome of a vote might be, to identify segments of the population that are especially worthwhile to devote attention to, to assess whether a campaign message is getting through. Pollsters routinely inquire not just about political preferences but about covariates such as age, sex, race, past voting record, neighborhood or region of residence, and so on. Pollsters have learned that stratifying data on political preferences by these covariates allows them to make more specific predictions by group. That’s helpful for developing and targeting campaign messages, but it’s also important for prediction: Overall demographic information from, say, the national census, can be used to paste the groups back together into a whole that’s more representative of the actual population. Looking back on the poor poll predictions in the 2016 US presidential election, pollsters realized that an important covariate was level of education.
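The step of pasting the groups back together can be sketched as a weighted average, a technique often called post-stratification. The sketch below is only an illustration: every number is invented, and the names `poll_support`, `poll_counts`, `census_share`, `raw_estimate`, and `poststratify` are hypothetical rather than taken from any polling library.

```python
# Hypothetical poll: support for a candidate, stratified by education level.
# All numbers are made up for illustration.
poll_support = {"no_degree": 0.40, "degree": 0.60}  # fraction in favor, by group
poll_counts = {"no_degree": 200, "degree": 800}     # respondents per group
census_share = {"no_degree": 0.65, "degree": 0.35}  # population share per group


def raw_estimate(support, counts):
    """Average support, weighting each group by its share of the sample."""
    total = sum(counts.values())
    return sum(support[g] * counts[g] / total for g in support)


def poststratify(support, population_share):
    """Average support, weighting each group by its share of the population."""
    return sum(support[g] * population_share[g] for g in support)


raw = raw_estimate(poll_support, poll_counts)
adjusted = poststratify(poll_support, census_share)
print(f"raw sample estimate:      {raw:.3f}")
print(f"post-stratified estimate: {adjusted:.3f}")
```

In this invented sample, respondents with a degree are over-represented relative to the census shares, so the raw estimate and the reweighted estimate differ noticeably. A covariate that is left out of the stratification, as education was in many 2016 polls, cannot be corrected for in this way.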

Often it can be helpful to draw plausible graphical causal networks to make explicit hypotheses about the causal connections between variables, including covariates and unmeasured quantities. The next chapter introduces graphical causal networks.

7.3 Component III: Design a data collection plan

Based on your current understanding, you may discover that your goal can best be achieved by collecting additional data. Your primary data needs to include both response and explanatory variables. But there may be auxiliary data that’s helpful, as with the pollster’s use of census data.

Often, a data scientist is brought onto the team to work with already collected data. Nevertheless, an effective data scientist understands basic principles and practices of data collection so that she can anticipate problems and envision possibilities for improving the available data and interpreting it in a correct context. (See CHAPTER XXXX CHAPTER)

As much as possible, measure the quantities that appear in your graphical causal networks. Later in the process, when you interpret and communicate your results, keep in mind the quantities that you had no way to measure.
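One simple way to keep track of which quantities in a causal network went unmeasured is to write the network down as a data structure and compare it with the list of recorded variables. The following is a minimal sketch under that assumption. The network, the variable names, and the helper functions `all_variables` and `direct_causes` are all hypothetical.

```python
# A hypothetical causal network, written as: cause -> list of direct effects.
# The variable names are invented for illustration.
causal_network = {
    "education": ["income", "vote_preference"],
    "age": ["income", "vote_preference"],
    "media_exposure": ["vote_preference"],
    "income": ["vote_preference"],
    "vote_preference": [],
}

# Variables that were actually recorded in the (hypothetical) data set.
measured = {"age", "income", "vote_preference"}


def all_variables(network):
    """Every variable that appears as a cause or as an effect."""
    nodes = set(network)
    for effects in network.values():
        nodes.update(effects)
    return nodes


def direct_causes(network, variable):
    """Variables with an arrow pointing directly into `variable`."""
    return sorted(cause for cause, effects in network.items() if variable in effects)


unmeasured = sorted(all_variables(causal_network) - measured)
print("unmeasured variables:", unmeasured)
print("direct causes of vote_preference:",
      direct_causes(causal_network, "vote_preference"))
```

The list of unmeasured variables is a reminder of what to flag when the results are reported: any conclusion about `vote_preference` here would have to acknowledge that two of its hypothesized causes were never recorded.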

Appendix @ref(data_collection) describes some of the most commonly encountered approaches to collecting data. The topic is important and broad, too broad to include in this short section.

Often, the data scientist will be working with data that already exists. But even when it’s not possible to participate in the design of data collection, you should take care to be aware of what the design is so that you can choose appropriate methods and interpret results in the right context.

7.4 Component IV: Statistical modeling

Statistical modeling is a process of constructing a mathematical representation of the system under study that accords, as much as possible, with the data collected from the system. A model is an intermediary between you and the actual system.

The mathematical intermediary has the great advantage of being easy to work with. You can interrogate a model to reveal features that can help you to achieve your goal. Such features might be about the relationships between variables, about prediction of an output variable given values for the inputs, and so on.
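To make the idea of interrogating a model concrete, here is a minimal sketch that fits a straight line to a handful of invented data points by ordinary least squares and then asks the fitted model two questions. The data and the names `fit_line` and `predict` are assumptions made for illustration. A straight line is only one of many possible model forms.

```python
# Made-up data: an input x and an output y.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    sxx = sum((a - mean_x) ** 2 for a in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


intercept, slope = fit_line(x, y)


def predict(new_x):
    """Interrogate the model: what output does it give for this input?"""
    return intercept + slope * new_x


# Question 1: how does the output change per unit change in the input?
print(f"slope: {slope:.2f}")
# Question 2: what output does the model predict for an input not in the data?
print(f"prediction at x = 6: {predict(6.0):.2f}")
```

Both questions are answered by the model, not by the system itself. How far the answers can be trusted depends on how well the model accords with the data and with what is already known about the system.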

Statistical modeling has been a mainstream way of working with data for almost a century. As might be expected, modeling methods developed before the advent of modern computing were generally framed using traditional algebraic approaches and notation. The last couple of decades have seen the development of new approaches that are only practical with today’s powerful and inexpensive computing. These approaches, usually labeled machine learning or statistical learning, often incorporate important features of the traditional techniques, but are accessible without algebraic notation.

Much of this book is about modeling techniques, including both traditional techniques (that are still useful) and more modern learning techniques.

7.5 Component V: Assessing model performance

Of course it’s important to know whether a statistical model works well. To some extent, this is because you want to decide whether the claims you deduce from interrogating your model are justified. Part of assessing model performance is oriented to statistical inference, a set of techniques developed to address questions such as the precision of the model or the strength of evidence for a claim.
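As one illustration of what a question about precision can look like, the sketch below uses a percentile bootstrap to put an interval around a sample mean. This is only one of several inference techniques, chosen here because it needs little machinery. The data are invented, and the function `bootstrap_interval` and its parameters are hypothetical names.

```python
import random

# Made-up measurements; the quantity of interest is their mean.
data = [4.1, 5.3, 4.8, 6.0, 5.1, 4.4, 5.7, 5.0, 4.9, 5.5]


def mean(values):
    return sum(values) / len(values)


def bootstrap_interval(values, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for the mean of `values`."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n = len(values)
    # Resample the data with replacement and record the mean of each resample.
    means = sorted(mean(rng.choices(values, k=n)) for _ in range(n_resamples))
    tail = (1 - level) / 2
    lower = means[int(tail * n_resamples)]
    upper = means[int((1 - tail) * n_resamples) - 1]
    return lower, upper


sample_mean = mean(data)
lower, upper = bootstrap_interval(data)
print(f"sample mean: {sample_mean:.2f}")
print(f"approximate 95% interval: {lower:.2f} to {upper:.2f}")
```

A narrow interval suggests the estimate is precise given the data at hand. A wide one signals that claims based on the estimate should be made cautiously.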

A major purpose for assessing model performance is to help guide a process of model development. Statisticians and data scientists often construct a series of models, not just a single one. One part of this is exploring how best to make use of the available data. Another reason to work with a series of models is that different models can incorporate different perspectives (also called prior beliefs or knowledge) on how the actual system works. It is often important to be able to compare these perspectives.

It can be helpful to think of model building and assessment of performance as part of a cycle of statistical modeling.
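One turn of such a cycle might look like the following sketch, which builds two candidate models on one portion of the data and compares their errors on a held-out portion. The data, the hand-made split, and the names `fit_mean`, `fit_line`, and `mean_squared_error` are assumptions for illustration. Mean squared error on held-out data is one common performance measure among several.

```python
# Made-up data, split by hand into training and testing portions.
train_x = [1.0, 3.0, 5.0, 7.0]
train_y = [2.0, 6.1, 9.9, 14.0]
test_x = [2.0, 6.0]
test_y = [4.1, 11.9]


def fit_mean(xs, ys):
    """Model A: ignore x and always predict the training mean of y."""
    mean_y = sum(ys) / len(ys)
    return lambda new_x: mean_y


def fit_line(xs, ys):
    """Model B: least-squares straight line, y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    sxx = sum((a - mean_x) ** 2 for a in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda new_x: intercept + slope * new_x


def mean_squared_error(model, xs, ys):
    """Average squared gap between model predictions and observed values."""
    return sum((model(a) - b) ** 2 for a, b in zip(xs, ys)) / len(xs)


models = {
    "mean only": fit_mean(train_x, train_y),
    "straight line": fit_line(train_x, train_y),
}
# Assess each model on data it did not see during fitting.
errors = {name: mean_squared_error(m, test_x, test_y) for name, m in models.items()}
best = min(errors, key=errors.get)

for name, err in errors.items():
    print(f"{name}: held-out mean squared error = {err:.4f}")
print("lowest held-out error:", best)
```

The two models encode different perspectives on the system: one says the input is irrelevant, the other says the output changes steadily with the input. Comparing their held-out errors is a way to let the data speak to that disagreement before the next round of model building.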

7.6 Component VI: Communication

Ultimately, the work encompassed by the previous components aims at achieving appropriate goals for the project. Decisions will be based on your work. Those decisions will often be taken by others, and so you need to be able to communicate your work and results to them. But even when you yourself are the decision maker, it can be important to document your work for “future you,” for instance when you are prompted to go back and reconsider the work you did earlier.

Communicating data science results can be challenging. Often, the people you are communicating with will have little technical understanding of statistics or data science. Care needs to be taken to report results clearly and to allow the decision-maker to assess the credibility of your work. Some results, such as those related to expressing risk or uncertainty or dealing with confounding, are notoriously hard for a layperson to make effective use of. Choosing appropriate formats for communicating results helps in achieving the project goals. It also helps to master standard formats so that you can communicate effectively with other data science experts who may be called on to evaluate or extend your work.

7.7 Exercises
