Tidy Tales

What’s data science?

Data science is an exciting discipline that allows you to transform raw data into understanding, insight, and knowledge.

— Hadley Wickham, Mine Çetinkaya-Rundel, and Garrett Grolemund in R for Data Science

A data scientist is someone who creates understanding, insight, and knowledge from raw data with programming. Programming is an essential tool in nearly every part of a data science project because it allows you to do data science efficiently and reproducibly.

There are many different programming languages you can use to do data science, but here we cover my favourite programming language: R.

What’s R?

R is an open source programming language for wrangling, visualizing, modelling, and communicating data, and so much more. It has a strong community behind it and is widely used among researchers, statisticians, and data scientists in a variety of fields.

Where do I start?

I believe every R user should work through these two books (in order):

Hands-On Programming with R by Garrett Grolemund
R for Data Science by Hadley Wickham, Mine Çetinkaya-Rundel, and Garrett Grolemund

Together these two books teach you a core set of tools that you can use to accomplish the vast majority of tasks in any data science project:

Hands-On Programming with R is a short and friendly introduction to the fundamentals of R programming written for non-programmers who want to become data scientists.
R for Data Science is a friendly introduction to the tools useful for doing data science with R.

As a companion to these two books I also recommend reading the tidyverse style guide, which describes conventions for writing R code that is easy to read.

Helping yourself

Part of the challenge of learning a programming language is building up a vocabulary of commands to accomplish different tasks. Posit (formerly RStudio) maintains a number of cheat sheets to remind you how to do common tasks in your favourite R packages.

Finally, I recommend learning to write reports and presentations in Quarto, a tool for integrating prose, code, and results into reproducible documents. This is covered in the Communicate sections of R for Data Science, but I’d like to emphasize it here because it’s a useful, beginner-friendly skill that goes a long way.

Quarto is a successor to R Markdown that has ironed out some of the friction points to make writing reproducible documents an even better experience than it already was. Because Quarto is relatively new, you might find R Markdown currently supports certain use cases better (like APA style manuscripts). In cases like these it’s perfectly fine to use R Markdown; it isn’t going away and is still actively maintained.

If you do use R Markdown, the Communicate sections of the first edition of R for Data Science provide a short and friendly introduction, and the R Markdown: The Definitive Guide and R Markdown Cookbook books provide more comprehensive coverage.

What else should I learn?

Hands-On Programming with R and R for Data Science provide an excellent foundation in R programming for data science, but there are a number of topics these books don’t cover that are equally important for doing data science successfully:

Content knowledge
Interpersonal skills
Research skills
Statistical modelling

Having content knowledge about the problem you are using data science to answer allows you to ask better questions, identify data problems, and develop solutions that are meaningful to your audience. Content knowledge is something you can build over time through experience, interactions with the people affected by the problem you are trying to answer, books and courses, and so forth. You don’t always need to be an expert on the problem at hand, but learning more about it will help you avoid mistakes, build confidence in your solutions, and connect with your audience.

This also underscores the importance of interpersonal skills for practicing data scientists. Data science problems are ultimately people problems: Much of our data is about people. All of our data is communicated to people. And the solutions we develop using data science will only make an impact if our audience chooses to adopt them. You have to speak for the data, because the data doesn’t speak for itself. Developing strong leadership, teamwork, empathy, and communication skills will help you navigate the human side of data science.

Complementing content knowledge and interpersonal skills are research skills and statistical modelling, which are embodied in the term data science. Research skills such as observation, measurement, experiment design, and survey design allow us to collect data that addresses a problem we care about; and statistical modelling helps us transform that raw data into understanding, insight, and knowledge through estimation and testing. Research and statistics, along with programming, are core skills in data science that can be challenging to learn, partly because developing these skills takes time, practice, and humility, and partly because pedagogy on research and statistics (whether in books or courses) has a lot of gaps for people who aren’t “real” statisticians.¹

How do I learn statistics?

My answer to this question makes the following assumptions about you:

You are not a formally trained statistician
You have received some prior training in statistics
You have (re-)discovered your own ignorance of statistics
You want to address your ignorance of statistics
You want to use statistics to solve problems

If this describes you then I hope the resources that follow help you on your journey like they have for me. If this doesn’t describe you, stick around anyway; you might find something new.

Let’s start with some book recommendations:

Regression and Other Stories by Andrew Gelman, Jennifer Hill, and Aki Vehtari (Source)
Data Analysis Using Regression and Multilevel/Hierarchical Models by Andrew Gelman and Jennifer Hill²
Bayes Rules! by Alicia A. Johnson, Miles Q. Ott, and Mine Dogucu (GitHub)
An Introduction to Statistical Learning with Applications in R by Gareth James, Daniela Witten, Trevor Hastie, and Rob Tibshirani
Statistical Rethinking by Richard McElreath
Improving Your Statistical Inferences by Daniël Lakens

All of these books are well-written, engaging, and have examples in R.³ Together they cover applications of fundamental and state-of-the-art tools in statistical modelling. Most importantly, they encourage statistical thinking. They’re ideal books for self-study and I can’t recommend them enough.

All you need is the linear model

Most people have the unfortunate experience of learning statistics through arcane rituals with numerous statistical tests that appear to have no fundamental framework to tie them together. This is a failure of pedagogy rather than a feature of statistics. Fortunately, there is a fundamental framework to tie all these statistical tests together: It’s called the linear model.

Most common statistical tests are either special cases or extensions of the linear model; understanding this will improve your statistical thinking and make it easier to abandon the arcanum and statistical rituals in favour of a unified statistical framework.

For a short introduction to this concept I recommend reading Common statistical tests are linear models (or: how to teach stats) by Jonas Kristoffer Lindeløv (note: there are a few small errors that haven’t been fixed, so also see the GitHub Issues).
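To make the idea concrete, here is a minimal sketch (my own illustration, not taken from Lindeløv's article) using the built-in mtcars data. It assumes an equal-variance two-sample t-test, which is the version that corresponds to a linear model with a single 0/1 predictor; the object names are arbitrary.

```r
# Student's t-test (equal variances) comparing mpg between transmission types
t_res <- t.test(mpg ~ am, data = mtcars, var.equal = TRUE)

# The same comparison expressed as a linear model with a 0/1 predictor
lm_res <- summary(lm(mpg ~ am, data = mtcars))

# Pull out the test statistic and p-value from each approach
t_stat_ttest <- unname(t_res$statistic)
t_stat_lm <- lm_res$coefficients["am", "t value"]
p_ttest <- t_res$p.value
p_lm <- lm_res$coefficients["am", "Pr(>|t|)"]

# The statistics agree up to sign (the two functions subtract the group
# means in opposite orders), and the p-values agree
all.equal(abs(t_stat_ttest), abs(t_stat_lm))
all.equal(p_ttest, p_lm)
```

The slope for am in the linear model is the difference between the two group means, which is exactly the quantity the t-test examines.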

I’d like to also give honourable mentions to the following books, which have all the great qualities of the books above, but cover more specialized topics:

Handbook of Graphs and Networks in People Analytics: With Examples in R and Python by Keith McNulty (Source)
Doing Meta-Analysis in R: A Hands-on Guide by Mathias Harrer, Pim Cuijpers, Toshi A. Furukawa, and David D. Ebert (Source)
Text Mining with R by Julia Silge and David Robinson (Source)

These are just a few choice examples. There are a lot of statistics books written for R users, and you can almost always find a book covering whatever topic you’re interested in.

Social media

There are strong #RStats communities on most social media platforms, where you can discover new people and follow or participate in conversations about R and statistics. The #RStats community on Mastodon (formerly on Twitter; RIP) has been an invaluable learning tool and helped me discover things about R and statistics I wouldn’t have on my own. Frank Harrell also created the Data Methods Discussion Forum to provide a place for longer, more in-depth discussions. Finally, Stack Overflow and Cross Validated are great public Q&A platforms for R programming and statistics, respectively.

A lot of people in the R community also have their own R programming, data science, and statistics blogs, vlogs, or websites. Some of my favourite authors include:

Starting your own blog or website is also a great way to learn, and gives you a place to share your work! Quarto, the open-source scientific and technical publishing system, makes the process of creating and publishing a website simple and friendly. It’s what I use for Tidy Tales (and lots of other projects).

Finally, a lot of people in the R community who have taught courses or workshops make their materials openly available (including myself and some of the people listed above). One source I want to highlight in particular is psyTeachR from the University of Glasgow School of Psychology and Neuroscience, which covers an entire curriculum of courses for doing reproducible data science.

Where do I learn more about R programming?

You don’t need to be an expert programmer to be a successful data scientist, but learning more about programming pays off, because becoming a better programmer allows you to automate common tasks, and solve new problems with greater ease.

— Hadley Wickham, Mine Çetinkaya-Rundel, and Garrett Grolemund in R for Data Science

If you want to become a better R programmer, I think these books are a good place to start:

R Packages by Hadley Wickham and Jenny Bryan
Advanced R by Hadley Wickham
ggplot2: Elegant Graphics for Data Analysis by Hadley Wickham, Danielle Navarro, and Thomas Lin Pedersen
Mastering Shiny by Hadley Wickham

R Packages is a friendly introduction teaching you how to develop and publish your own R packages.

Advanced R is written for intermediate R users who want to improve their programming skills and understanding of the language, teaching you useful tools, techniques, and idioms that can help solve many types of problems.

ggplot2: Elegant Graphics for Data Analysis is written for R users who want to understand the details of the theory underlying the ggplot2 R package, teaching you the elements of ggplot2’s grammar and how they fit together, and giving you the power to tailor any plot specifically to your needs.

Mastering Shiny is a comprehensive introduction teaching you how to easily create rich, interactive web apps with the shiny R package.

The R Manuals

The R Manuals are manuals for the R language written by the R Core Team. I mention them here because it’s good to know about them, but the book recommendations I’ve already made largely cover the contents of these manuals in a friendlier way. Posit also maintains nicely formatted HTML versions of the manuals.

How do I make my work reproducible?

This series is called Reproducible Data Science, so I should probably talk more about that. If you’ve learned even a portion of what’s covered above then you already have a lot of the skills needed to do reproducible data science; but you are likely missing out on some essential tools and a stable framework for making reproducible data products.

A data product is the combination of data with code that wrangles, visualizes, models, or communicates data, whose outputs will be shared with some end-user. Some common examples of data products are quarterly earnings presentations, scientific papers, and interactive dashboards. For these examples the end-user might be your coworkers, other scientists, or members of the public, respectively. You will even be the end-user of your data products sometimes. Because data products will be used by others (or yourself), it’s good practice to do quality assurance so your end-users (that includes you!) can be confident in the quality of your data product. Making your data products reproducible is one small but important step you can take to ensure they meet the expectations of your end-user.⁴

Shades of reproducibility

The basic idea behind a reproducible data product is that the steps, processes, and procedures that went into making it can be repeated exactly by yourself and others, resulting in the exact same outcome every time. Ideally, there is no expiration date for reproducibility—the reproducibility of a data product could be tested tomorrow or in ten decades and should give the exact same outcome both times. If the outcomes were different, the data product is no longer reproducible (and perhaps we should no longer trust the original results). Realistically, it might be okay if a data product stops being reproducible, so long as this change happens after the data product has outlived its purpose.
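As a small illustration of "the exact same outcome every time", here is a sketch of one common source of non-reproducibility, random number generation, and one way to control it. The function name, the seed value, and the bootstrap step are all hypothetical choices made for this example.

```r
# Hypothetical analysis step: a bootstrap estimate of the mean of a vector
bootstrap_mean <- function(x, n_resamples = 1000, seed = 2024) {
  set.seed(seed)  # fixing the seed makes the resampling repeatable
  resampled_means <- replicate(n_resamples, mean(sample(x, replace = TRUE)))
  mean(resampled_means)
}

# Running the step twice gives exactly the same number
first_run <- bootstrap_mean(mtcars$mpg)
second_run <- bootstrap_mean(mtcars$mpg)
identical(first_run, second_run)
```

A fixed seed only covers one piece of the puzzle. The result can still change across R or package versions, which is why the environment itself also needs to be recorded.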

There are three core components that need to be accessible for a data product to be reproducible:

Data
Software
Documentation

The data and software should be packaged together somewhere, like an R project stored in a GitHub repository or Docker container, with documentation on how to reproduce the data product. Reproducing the data product should be convenient for the end-user, without being disruptive. Ideally the entire pipeline can be run with a single command, and it should not install packages into someone’s local library or change settings on their computer without their permission. Achieving this requires new tools and a stable framework to glue them together.

In particular, I think the following R packages are essential for reproducibility:

renv for creating reproducible environments in R projects
targets for creating reproducible workflows
testthat for testing the reproducibility of results
sessioninfo for getting system and R session information
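To give a flavour of what testing the reproducibility of results can look like, here is a minimal sketch using testthat. It assumes the package is installed; the helper fit_slope() and the recorded reference value are hypothetical stand-ins for a result from your own analysis.

```r
library(testthat)

# Hypothetical analysis step whose result we want to stay stable over time
fit_slope <- function(data) {
  unname(coef(lm(mpg ~ wt, data = data))["wt"])
}

# Compare the freshly computed result against a value recorded earlier;
# the test fails loudly if the result ever drifts
test_that("the estimated slope matches the recorded value", {
  expect_equal(fit_slope(mtcars), -5.3445, tolerance = 1e-4)
})
```

In a real project the reference value would come from the original run of the analysis, and the test would live alongside the code so it can be rerun whenever the environment changes.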

But there are tools beyond R that are also essential for reproducibility:

Quarto for reproducible documents
Git for version control
GitHub or GitLab for hosting Git repositories
Docker for packaging data products and their dependencies into reproducible containers

Each of these R packages and tools plays a different role in making a data product reproducible. Together they create a system for making reproducible data products. Depending on how long you hope a data product will be reproducible for, you might use all these R packages and tools or you might only use some.
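As a rough sketch of how a pipeline that runs with a single command might be laid out, here is a small _targets.R file for the targets package. It assumes targets is installed, and the target names (raw_data, fit, slope) are hypothetical; a real project would read its own data and do more work at each step.

```r
library(targets)

# Each target is a named step; targets works out which steps depend on which
pipeline <- list(
  tar_target(raw_data, mtcars),
  tar_target(fit, lm(mpg ~ wt, data = raw_data)),
  tar_target(slope, unname(coef(fit)["wt"]))
)

# A _targets.R file must end with the list of targets
pipeline
```

With this file in the project root, someone reproducing the work could call renv::restore() to install the recorded package versions into the project's own library (rather than their personal one), and then targets::tar_make() to run every step in order.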

The best place to learn each of these R packages and tools individually is their respective documentation. To learn how to use these R packages and tools as a system for making reproducible data products, the following are a good place to start:

Reproducible Analytical Pipelines by Bruno Rodrigues
Automating Computational Reproducibility in R using renv, Docker, and GitHub Actions by Nathaniel Haines
Combining R and Python with {reticulate} and Quarto by Nicola Rennie

What did you forget to teach me about R?

See What They Forgot to Teach You About R by Jenny Bryan and Jim Hester.

See also Happy Git and GitHub for the useR by Jenny Bryan, the STAT 545 TAs, and Jim Hester.

Parting words

Data science is not 100% about writing code. There’s a human side to it.

— Hadley Wickham in Designing Data Science

I discussed this earlier, but it bears repeating: the human side of data science is really important if you want to solve problems successfully. One of the reasons R is my favourite language is that it’s been designed to make statistical thinking and computing accessible to anyone. This accessibility has had a big impact on me (I doubt I would be doing data science without R), and I think it’s why we have such a strong, diverse community of R users and programmers.

So I try to make all my work as accessible as it can be, and I recommend you do too. It makes a difference.

Footnotes

¹ Myself included. It’s hard for me to recommend how to learn research or statistics in the same way I’ve recommended how to learn R. Hands-On Programming with R and R for Data Science are excellent, beginner-friendly books that will get you started using the tools of the trade in the way they were intended to be used. But a lot of the excellent statistics books I’ve read are not beginner-friendly (even if they claim to be) and assume you have prior training in statistics. On the other hand, beginner-friendly books can encourage statistical rituals over statistical thinking, which you then have to unlearn in the future as your knowledge and skills develop.
² Regression and Other Stories is an updated and expanded second edition of the regression parts of Data Analysis Using Regression and Multilevel/Hierarchical Models. The authors are also working on an updated and expanded second edition of the multilevel modelling parts of Data Analysis Using Regression and Multilevel/Hierarchical Models, but it isn’t out yet.
³ Most of these books have also had their examples translated to use different R packages than the authors used. For example, Andrew Heiss has translated Bayes Rules! and Statistical Rethinking into the tidyverse, brms, and marginaleffects packages; Emil Hvitfeldt has translated An Introduction to Statistical Learning into the tidymodels set of packages; and A. Solomon Kurz has translated Regression and Other Stories into the tidyverse and brms packages.
⁴ This should go without saying, but the old “garbage-in garbage-out” adage still applies to reproducible data products. If your data has problems, your code has bugs, your visualizations are misleading, your models are inappropriate, or your communications are unclear, then your data product will be reproducible but not very useful (or maybe even harmful). Quality assurance has to happen at every step, and reproducibility is the last step. It’s supposed to be the little bow on top that ties all the other great work you’ve done together.