Chapter 9 Collaboration and Reproducibility with Git (Draft)


Data analysis is often collaborative. Sometimes this collaboration involves a team of programmers simultaneously contributing code to the same project, even the same document. In other cases, the collaboration may be asynchronous or an analyst may decide to continue work left behind by someone else. Each of these cases is simplified if the users adopt a workflow that includes tools designed to facilitate version control (sometimes called “source code management”).

As you read this chapter, you will be introduced to one such tool, Git, and to a basic collaborative programming workflow that includes version control. A valuable byproduct of such a workflow is the ability to preserve and access the entire history of a document’s evolution.

9.1 Version Control

Version control is a system that records changes to a file or a set of files so that you can keep track of those changes and, if needed, revert or modify them later. More specifically, it allows you to:

  • record the entire history of a file;
  • revert to a specific version of the file;
  • collaborate on the same platform with other people;
  • make changes without modifying the main file and add them once you feel comfortable with them.

As projects become more complex or the number of contributors increases, these features become increasingly valuable. Users unfamiliar with tools designed to manage version control sometimes recognize the need and attempt an inefficient solution, such as saving many copies of the same document in order to preserve its history or the contributions of different collaborators.

[Figure: An inefficient attempt to implement version control by file name.]

Rather than save many independent copies of each file, most version control software tools simply track incremental changes to the files under version control. Storing a complete record of just the portions that change from one version to the next is far more efficient than saving many copies of whole documents in which most of the content is unchanged from one iteration to the next. Moreover, with a complete record of every incremental change, it’s just as easy to piece together the current state of a document as it is to rebuild its state at any earlier point recorded in its development lifecycle. For this reason, some have quipped that including version control in your workflow is like having a document-editing time machine!
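To make the “time machine” idea concrete, here is a minimal sketch using Git from a terminal. It builds a throwaway Repo, records two versions of a file, and then looks back at the history. The file name `draft.txt`, its contents, and the placeholder identity are assumptions made purely for illustration.

```shell
# Work in a throwaway directory so nothing on your system is affected.
repo="$(mktemp -d)"
cd "$repo"
git init -q

# Placeholder identity, stored only in this scratch Repo.
git config user.name "Example Analyst"
git config user.email "analyst@example.com"

# Version 1 of a hypothetical document.
printf 'Line one\nLine two\n' > draft.txt
git add draft.txt
git commit -q -m "First version"

# Version 2: one line revised, one line added.
printf 'Line one\nLine two, revised\nLine three\n' > draft.txt
git add draft.txt
git commit -q -m "Second version"

# The recorded history of the file, newest first.
git log --oneline -- draft.txt

# Only the lines that changed between the two versions.
git diff HEAD~1 HEAD -- draft.txt

# Rebuild the earlier state without disturbing the current file.
git show HEAD~1:draft.txt
```

Only one file exists on disk the whole time, yet both versions remain available; there is no need for names like `draft_v2_final.txt`.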

9.1.1 Collaboration

  • collaboration with others
  • collaboration with self on multiple computers

9.1.2 Reproducibility

  • preserve complete record of work as a matter of transparency
  • another person (or future self) can see every decision and detail of analysis from source data through final product
  • description of complete reproducibility

9.2 Git and GitHub

Several well-established software tools are designed for version control, including Git, Subversion, and Mercurial. In Git, a repository, or Repo, is a group of documents subject to version control. A Repo is analogous to a folder at some directory location on your computer: everything in the folder can be tracked by Git, including Rmd files, images, PDFs, R Notebooks, data sets, and more.

Furthermore, several web-based services, such as GitLab, GitHub, and Bitbucket, host version-controlled repositories and make them available to collaborators. In the sections that follow, we introduce Git and GitHub as tools for version control and collaboration.

Importantly, adding version control with Git and GitHub doesn’t necessarily change much of your workflow. In fact, if you have properly configured RStudio and Git, the vast majority of your version control workflow can happen entirely within the convenient RStudio interface. We’ll discuss specific details later, but for now it’s sufficient to note the “Git” tab in the upper-right pane of the RStudio window.

[Figure: Git pane within RStudio window]

GitHub uses the Git platform and stores files in a flexible folder called a “repository”.

Git(Hub) uses repositories to organize your work. If you like, you can store a bunch of files in a repository (or Repo) on the GitHub remote servers and delete them from your computer entirely. You can replace the files with new versions or even edit certain types of files right from the GitHub webpage. When you are ready to get them back, you can simply locate the files through your GitHub account and retrieve them.

More commonly, users establish a link between a file directory on their (local) computer and a Repo stored on the GitHub remote. You edit files on your computer and save your progress as you normally would, except now that you have established a link with Git(Hub) you can periodically update the Repo on the GitHub remote with the latest progress.
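A minimal sketch of establishing that link from a terminal follows. To keep it self-contained, a local “bare” Repo in a temporary folder stands in for the GitHub remote; with a real GitHub Repo you would use its URL instead (of the form `https://github.com/<user>/<repo>.git`). All paths, file names, and the identity are placeholders.

```shell
# Scratch area; remote.git is only a stand-in for a Repo on the GitHub remote.
work="$(mktemp -d)"
git init -q --bare "$work/remote.git"

# A local project directory with one committed file.
git init -q "$work/project"
cd "$work/project"
git config user.name "Example Analyst"      # placeholder identity
git config user.email "analyst@example.com"
echo "# My analysis" > README.md
git add README.md
git commit -q -m "Add README"

# Establish the link between the local directory and the remote,
# then send the committed work to it.
git remote add origin "$work/remote.git"
git push -q -u origin HEAD
git remote -v

# Later: edit and save as usual, commit, and update the remote again.
echo "Progress notes" >> README.md
git commit -q -am "Update README"
git push -q

# The latest progress can now be retrieved from the remote into a new folder.
git clone -q "$work/remote.git" "$work/retrieved"
cat "$work/retrieved/README.md"
```

The link only needs to be set up once per local directory; after that, each update is a commit followed by a push.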

9.2.1 Git & RStudio Configuration

As mentioned previously, with proper configuration the vast majority of your use of Git can happen entirely within the RStudio environment. APPENDIX REFERENCE

If you change computers, or switch to a new RStudio web service, you will have to repeat the configuration on that new system.
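When you sit down at an unfamiliar system, a quick read-only check can tell you whether some of that configuration is already in place. The sketch below assumes Git is reachable from a terminal (for example, the Terminal tab in RStudio) and looks only at the name and email that Git attaches to commits; it changes nothing, and it does not cover whatever additional steps the appendix describes.

```shell
# Confirm that Git is installed and report its version.
git --version

# Show the identity Git will attach to your commits, if one has been set.
git config --global --get user.name  || echo "user.name is not set yet"
git config --global --get user.email || echo "user.email is not set yet"
```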

9.2.2 Basic workflow

Complete the configuration described in the APPENDIX linked to in the previous section. You need to do this only once on each system.

  • create Repo
  • add, change, or remove files in Repo
  • diff
  • commit
  • pull
  • push
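The first four steps can be sketched at the command line as follows; the RStudio Git pane offers buttons for the same operations. The Repo location, file names, and identity below are hypothetical. Pull and push are left out here because they only make sense once the Repo is linked to a remote such as GitHub.

```shell
# Create a Repo in a throwaway location (hypothetical path and file names).
repo="$(mktemp -d)/demo-repo"
git init -q "$repo"
cd "$repo"
git config user.name "Example Analyst"      # placeholder identity
git config user.email "analyst@example.com"

# Add files to the Repo and record a first snapshot.
echo "x <- c(1, 2, 3)" > analysis.R
echo "scratch" > old-notes.txt
git add analysis.R old-notes.txt
git commit -q -m "Add analysis script and notes"

# Change one file and remove another.
echo "mean(x)" >> analysis.R
git rm -q old-notes.txt

# Diff: review what changed before recording it.
git status --short
git diff            # the not-yet-staged edit to analysis.R
git diff --staged   # the staged removal of old-notes.txt

# Commit the reviewed changes.
git add analysis.R
git commit -q -m "Compute mean; drop old notes"
git log --oneline
```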

show diagram?

9.2.3 Checklist when starting work in RStudio

  1. Start up RStudio (duh!)
  2. Make sure RStudio is pointing to the right project.
  3. If you are collaborating with others, make sure to PULL to get their most recent changes. Even if you’re collaborating with yourself (e.g. sometimes working on a different RStudio system), do the PULL.
  4. Do your editing, debugging, etc. But very often …
  5. DIFF
  6. COMMIT
  7. Go back to 4 until you’re done for the session.
  8. PULL AGAIN! Your colleagues might have changed something.
  9. PUSH
  10. Take a well-deserved break until your next work session.
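For readers who prefer to see the checklist as commands, here is a sketch of one work session. The first part only builds a scratch scenario: a local bare Repo standing in for the GitHub remote, and a colleague who pushes a change while you are away. Every path, name, and file in it is made up for illustration. The second part follows steps 3 through 9.

```shell
# --- Scratch setup (not part of a normal session) ---
work="$(mktemp -d)"
git init -q --bare "$work/remote.git"        # stand-in for the GitHub remote

git clone -q "$work/remote.git" "$work/colleague"
cd "$work/colleague"
git config user.name "Colleague"             # placeholder identity
git config user.email "colleague@example.com"
echo "x <- c(1, 2, 3)" > analysis.R
git add analysis.R
git commit -q -m "Start analysis"
git push -q -u origin HEAD

git clone -q "$work/remote.git" "$work/me"   # your copy of the project
git -C "$work/me" config user.name "Example Analyst"
git -C "$work/me" config user.email "analyst@example.com"

# The colleague pushes more work while you are away.
echo "mean(x)" >> analysis.R
git commit -q -am "Compute mean"
git push -q

# --- Your work session ---
cd "$work/me"
git pull -q                                     # step 3: get the most recent changes
echo "sd(x)" >> analysis.R                      # step 4: edit
git diff                                        # step 5: review what changed
git commit -q -am "Compute standard deviation"  # step 6: commit
git pull -q                                     # step 8: pull again before sharing
git push -q                                     # step 9: push
```

In this scenario the second pull finds nothing new. It is still worth the habit, because a push is rejected when the remote holds commits that your copy has not yet pulled.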

9.2.4 Troubleshooting issues

Merge conflicts

  • why they happen, and why that’s a good thing
  • how to fix it
  • how to avoid them happening inadvertently

Working in the wrong project in RStudio

It’s a common mistake to forget to change from one RStudio Project to the next. If you forget, it may look like your changes aren’t tracked by Git. In reality, Git is still monitoring the changes, but it does so in the Repo that the file actually belongs to, so you won’t be able to commit those files until you switch to the corresponding RStudio Project.

Large files

GitHub stores all sorts of things for its users, and even does so free of charge for academic users. That said, storage presumably costs them money and large files slow down performance, so GitHub is inclined to resist storing large files. It warns you when you push any single file larger than 50 MB and rejects files larger than 100 MB. There are several sensible ways to work around this, but a common strategy is to tell Git to simply “ignore” the large file by listing it in a file named .gitignore in the Repo; that is, don’t include it in your snapshots and don’t archive it on the GitHub remote.

Finding help & a last resort

  • many solutions found in web forums will recommend a shell command because many users only use git through shell commands
  • last resort: blow it up and start over. Compromises some of the virtues of complete traceability of project evolution, so should happen less frequently as users become more proficient with version control tools.
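What “blow it up and start over” looks like in practice is not spelled out above; one common interpretation, sketched here as an assumption, is to set the troubled folder aside, take a fresh copy from the remote, and carry over only the files you still want. The setup lines merely create a scratch remote and a local copy with unsaved work, using made-up paths and names.

```shell
# Scratch setup: a stand-in remote and a local copy with work not yet committed.
work="$(mktemp -d)"
git init -q --bare "$work/remote.git"
git clone -q "$work/remote.git" "$work/project"
cd "$work/project"
git config user.name "Example Analyst"      # placeholder identity
git config user.email "analyst@example.com"
echo "x <- c(1, 2, 3)" > analysis.R
git add analysis.R
git commit -q -m "Start analysis"
git push -q -u origin HEAD
echo "mean(x)" >> analysis.R                # edits you still want to keep

# The last resort itself.
cd "$work"
mv project project-broken                        # 1. set the old folder aside, don't delete it yet
git clone -q "$work/remote.git" project          # 2. fresh copy from the remote
cp project-broken/analysis.R project/analysis.R  # 3. carry over the files worth keeping
git -C project status --short                    # 4. they appear as ordinary uncommitted changes
```

Any commits that existed only in the old folder and were never pushed are absent from the fresh copy, which is exactly the loss of traceability noted above.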