What’s data science?
Data science is an exciting discipline that allows you to transform raw data into understanding, insight, and knowledge.
— Hadley Wickham, Mine Çetinkaya-Rundel, and Garrett Grolemund in R for Data Science
A data scientist is someone who creates understanding, insight, and knowledge from raw data with programming. Programming is an essential tool in nearly every part of a data science project because it allows you to do data science efficiently and reproducibly.
There are many different programming languages you can use to do data science, but here we cover my favourite programming language: R.
What’s R?
R is an open source programming language for wrangling, visualizing, modelling, and communicating data, and so much more. It has a strong community behind it and is widely used among researchers, statisticians, and data scientists in a variety of fields.
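To give a flavour of what that looks like, here is a minimal sketch that uses only base R and the built-in mtcars data set. The object names (mean_mpg, fit) are my own choices for illustration; the books recommended below teach more capable tools for each step.

```r
# Wrangle: average fuel efficiency (mpg) by number of cylinders.
mean_mpg <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)

# Visualize: scatterplot of car weight against fuel efficiency.
plot(mpg ~ wt, data = mtcars,
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")

# Model: linear regression of fuel efficiency on weight.
fit <- lm(mpg ~ wt, data = mtcars)

# Communicate: print the summary table and the estimated coefficients.
print(mean_mpg)
print(coef(fit))
```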
Where do I start?
I believe every R user should work through these two books (in order):
- Hands-On Programming with R by Garrett Grolemund
- R for Data Science by Hadley Wickham, Mine Çetinkaya-Rundel, and Garrett Grolemund
Together these two books teach you a core set of tools that you can use to accomplish the vast majority of tasks in any data science project:
Hands-On Programming with R is a short and friendly introduction to the fundamentals of R programming written for non-programmers who want to become data scientists.
R for Data Science is a friendly introduction to the tools useful for doing data science with R.
As a companion to these two books I also recommend reading the tidyverse style guide, which shows you how to write R code that is easy to read.
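As a small illustration of the kind of conventions the style guide covers (descriptive snake_case names, `<-` for assignment, spaces around operators, and consistent indentation), here is the same function written twice. The function names are hypothetical and only there for the comparison.

```r
# Harder to read: a cryptic name, "=" for assignment, and no spacing.
z=function(x){(x-mean(x))/sd(x)}

# Easier to read: a descriptive name, "<-" for assignment,
# spaces around operators, and the body on its own indented line.
standardize <- function(x) {
  (x - mean(x)) / sd(x)
}

# Both versions compute the same thing; only the readability differs.
print(identical(z(mtcars$mpg), standardize(mtcars$mpg)))
```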
Finally, I recommend learning to write reports and presentations in Quarto, a tool for integrating prose, code, and results into reproducible documents. This is covered in the Communicate sections of R for Data Science, but I’d like to emphasize it here because it’s a useful, beginner-friendly skill that goes a long way.
Quarto is a successor to R Markdown that has ironed out some of the friction points to make writing reproducible documents an even better experience than it already was. Because Quarto is relatively new, you might find R Markdown currently supports certain use cases better (like APA style manuscripts). In cases like these it’s perfectly fine to use R Markdown; it isn’t going away and is still actively maintained.
If you do use R Markdown, the Communicate sections of the first edition of R for Data Science provide a short and friendly introduction, and the R Markdown: The Definitive Guide and R Markdown Cookbook books provide more comprehensive coverage.
What else should I learn?
Hands-On Programming with R and R for Data Science provide an excellent foundation in R programming for data science, but there are a number of topics these books don’t cover that are equally important for doing data science successfully:
- Content knowledge
- Interpersonal skills
- Research skills
- Statistical modelling
Having content knowledge about the problem you are using data science to answer allows you to ask better questions, identify data problems, and develop solutions that are meaningful to your audience. Content knowledge is something you can build over time through experience, interactions with the people affected by the problem you are trying to answer, books and courses, and so forth. You don’t always need to be an expert on the problem at hand, but learning more about it will help you avoid mistakes, build confidence in your solutions, and connect with your audience.
This also underscores the importance of interpersonal skills for practicing data scientists. Data science problems are ultimately people problems: Much of our data is about people. All of our data is communicated to people. And the solutions we develop using data science will only make an impact if our audience chooses to adopt them. You have to speak for the data, because the data doesn’t speak for itself. Developing strong leadership, teamwork, empathy, and communication skills will help you navigate the human side of data science.
Complementing content knowledge and interpersonal skills are research skills and statistical modelling, which are embodied in the term data science. Research skills such as observation, measurement, experiment design, and survey design allow us to collect data that addresses a problem we care about; and statistical modelling helps us transform that raw data into understanding, insight, and knowledge through estimation and testing. Research and statistics, along with programming, are core skills in data science that can be challenging to learn. That is partly because developing these skills takes time, practice, and humility, and partly because pedagogy on research and statistics, whether in books or courses, has a lot of gaps for people who aren’t “real” statisticians.1
How do I learn statistics?
My answer to this question makes the following assumptions about you:
- You are not a formally trained statistician
- You have received some prior training in statistics
- You have (re-)discovered your own ignorance of statistics
- You want to address your ignorance of statistics
- You want to use statistics to solve problems
If this describes you then I hope the resources that follow help you on your journey like they have for me. If this doesn’t describe you, stick around anyway; you might find something new.
Let’s start with some book recommendations:
- Regression and Other Stories by Andrew Gelman, Jennifer Hill, and Aki Vehtari (Source)
- Data Analysis Using Regression and Multilevel/Hierarchical Models by Andrew Gelman and Jennifer Hill2
- Bayes Rules! by Alicia A. Johnson, Miles Q. Ott, and Mine Dogucu (GitHub)
- An Introduction to Statistical Learning with Applications in R by Gareth James, Daniela Witten, Trevor Hastie, and Rob Tibshirani
- Statistical Rethinking by Richard McElreath
- Improving Your Statistical Inferences by Daniël Lakens
All of these books are well-written, engaging, and have examples in R.3 Together they cover applications of fundamental and state-of-the-art tools in statistical modelling. But most importantly, they encourage statistical thinking. They’re ideal books for self-study and I can’t recommend them enough.
I’d like to also give honourable mentions to the following books, which have all the great qualities of the books above, but cover more specialized topics:
- Handbook of Graphs and Networks in People Analytics: With Examples in R and Python by Keith McNulty (Source)
- Doing Meta-Analysis in R: A Hands-on Guide by Mathias Harrer, Pim Cuijpers, Toshi A. Furukawa, and David D. Ebert (Source)
- Text Mining with R by Julia Silge and David Robinson (Source)
These are just a few choice examples. There are a lot of statistics books written for R users, and you can almost always find a book covering whatever topic you’re interested in.
A lot of people in the R community also have their own R programming, data science, and statistics blogs, vlogs, or websites. Some of my favourite authors include:
Starting your own blog or website is also a great way to learn, and gives you a place to share your work! Quarto, the open-source scientific and technical publishing system, makes the process of creating and publishing a website simple and friendly. It’s what I use for Tidy Tales (and lots of other projects).
Finally, a lot of people in the R community who have taught courses or workshops make their materials openly available (including myself and some of the people listed above). One source I want to highlight in particular is psyTeachR from the University of Glasgow School of Psychology and Neuroscience, which covers an entire curriculum of courses for doing reproducible data science.
Where do I learn more about R programming?
You don’t need to be an expert programmer to be a successful data scientist, but learning more about programming pays off, because becoming a better programmer allows you to automate common tasks, and solve new problems with greater ease.
— Hadley Wickham, Mine Çetinkaya-Rundel, and Garrett Grolemund in R for Data Science
If you want to become a better R programmer, I think these books are a good place to start:
- R Packages by Hadley Wickham and Jenny Bryan
- Advanced R by Hadley Wickham
- ggplot2: Elegant Graphics for Data Analysis by Hadley Wickham, Danielle Navarro, and Thomas Lin Pedersen
- Mastering Shiny by Hadley Wickham
R Packages is a friendly introduction teaching you how to develop and publish your own R packages.
Advanced R is written for intermediate R users who want to improve their programming skills and understanding of the language, teaching you useful tools, techniques, and idioms that can help solve many types of problems.
ggplot2: Elegant Graphics for Data Analysis is written for R users who want to understand the details of the theory underlying the ggplot2 R package, teaching you the elements of ggplot2’s grammar and how they fit together, and giving you the power to tailor any plot specifically to your needs.
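To make the idea of a grammar a little more concrete, here is a minimal sketch of how the elements fit together as layers added to one plot. It assumes the ggplot2 package is installed (the code is skipped otherwise) and uses the built-in mtcars data set; the particular variables and theme are my own choices for illustration.

```r
# Only run the example if ggplot2 is available on this machine.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)

  p <- ggplot(mtcars, aes(x = wt, y = mpg)) +               # data and aesthetic mappings
    geom_point() +                                          # geometric layer: one point per car
    geom_smooth(method = "lm", formula = y ~ x, se = FALSE) + # statistical layer: fitted line
    facet_wrap(vars(cyl)) +                                 # facets: one panel per cylinder count
    labs(x = "Weight (1000 lbs)", y = "Miles per gallon") + # axis labels
    theme_minimal()                                         # theme: non-data appearance

  print(p)
}
```

Each line after the first adds or changes one element, which is what lets you tailor a plot piece by piece.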
Mastering Shiny is a comprehensive introduction teaching you how to easily create rich, interactive web apps with the shiny R package.
How do I make my work reproducible?
This series is called Reproducible Data Science, so I should probably talk more about that. If you’ve learned even a portion of what’s covered above then you already have a lot of the skills needed to do reproducible data science; but you are likely missing out on some essential tools and a stable framework for making reproducible data products.
A data product is the combination of data with code that wrangles, visualizes, models, or communicates data, whose outputs will be shared with some end-user. Some common examples of data products are quarterly earnings presentations, scientific papers, and interactive dashboards. For these examples the end-user might be your coworkers, other scientists, or members of the public, respectively. You will even be the end-user of your data products sometimes. Because data products will be used by others (or yourself), it’s good practice to do quality assurance so your end-users (that includes you!) can be confident in the quality of your data product. Making your data products reproducible is one small but important step you can take to ensure they meet the expectations of your end-user.4
There are three core components that need to be accessible for a data product to be reproducible:
- Data
- Software
- Documentation
The data and software should be packaged together somewhere, like an R project stored in a GitHub repository or Docker container, with documentation on how to reproduce the data product. Reproducing the data product should be convenient for the end-user, without being disruptive. Ideally the entire pipeline can be run with a single command, and it should not install packages into someone’s local library or change settings on their computer without their permission. Achieving this requires new tools and a stable framework to glue them together.
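As a rough sketch of what running an entire pipeline with a single command can mean, here is a hand-rolled example in base R. The function name run_pipeline(), the file names, and the use of mtcars as stand-in data are all hypothetical; a real project would more likely use a dedicated workflow tool. Note that it only writes to a temporary directory and does not install packages or change any settings.

```r
# Hypothetical single entry point for a small analysis pipeline.
run_pipeline <- function(out_dir = tempdir()) {
  # Step 1: load the data (the built-in mtcars data set stands in here).
  raw <- mtcars

  # Step 2: fit the model.
  fit <- lm(mpg ~ wt, data = raw)

  # Step 3: save the results as a plain-text table.
  results_path <- file.path(out_dir, "coefficients.csv")
  results <- data.frame(term = names(coef(fit)), estimate = unname(coef(fit)))
  write.csv(results, results_path, row.names = FALSE)

  # Step 4: document the software used to produce the results.
  session_path <- file.path(out_dir, "session-info.txt")
  writeLines(capture.output(sessionInfo()), session_path)

  # Return the output paths invisibly so callers can find the files.
  invisible(list(results = results_path, session = session_path))
}

# The single command an end-user would run.
outputs <- run_pipeline()
```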
In particular, I think the following R packages are essential for reproducibility:
- renv for creating reproducible environments in R projects
- targets for creating reproducible workflows
- testthat for testing the reproducibility of results
- sessioninfo for getting system and R session information
But there are tools beyond R that are also essential for reproducibility:
- Quarto for reproducible documents
- Git for version control
- GitHub or GitLab for hosting Git repositories
- Docker for packaging data products and their dependencies into reproducible containers
Each of these R packages and tools plays a different role in making a data product reproducible. Together they create a system for making reproducible data products. Depending on how long you hope a data product will be reproducible for, you might use all these R packages and tools or you might only use some.
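Here is a minimal sketch of the roles the R packages play. The bootstrap_mean() function is a hypothetical analysis step I made up for illustration. The testthat and sessioninfo parts only run if those packages are installed, and the renv and targets calls are shown as comments because they act on a whole project rather than a single script.

```r
# Hypothetical analysis step: a seeded bootstrap estimate of mean mpg.
bootstrap_mean <- function(seed, n_boot = 200) {
  set.seed(seed)
  mean(replicate(n_boot, mean(sample(mtcars$mpg, replace = TRUE))))
}

# testthat: check that the result is reproducible for a fixed seed.
if (requireNamespace("testthat", quietly = TRUE)) {
  testthat::test_that("bootstrap estimate is reproducible", {
    testthat::expect_equal(bootstrap_mean(seed = 1), bootstrap_mean(seed = 1))
  })
}

# sessioninfo: record the R version, platform, and package versions.
if (requireNamespace("sessioninfo", quietly = TRUE)) {
  print(sessioninfo::session_info())
}

# renv and targets, run from inside an R project:
# renv::init()         # create a project-specific library and lockfile
# renv::snapshot()     # record the package versions currently in use
# renv::restore()      # reinstall the recorded versions on another machine
# targets::tar_make()  # run the pipeline defined in _targets.R
```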
The best place to learn each of these R packages and tools individually is their respective documentation. To learn how to use these R packages and tools as a system for making reproducible data products, the following are a good place to start:
- Reproducible Analytical Pipelines by Bruno Rodrigues
- Automating Computational Reproducibility in R using renv, Docker, and GitHub Actions by Nathaniel Haines
- Combining R and Python with {reticulate} and Quarto by Nicola Rennie
You might also like:
- Open source is a hard requirement for reproducibility by Bruno Rodrigues
- Functional programming explains why containerization is needed for reproducibility by Bruno Rodrigues
- Code longevity of the R programming language by Bruno Rodrigues
- MRAN is getting shutdown - what else is there for reproducibility with R, or why reproducibility is on a continuum? by Bruno Rodrigues
What did you forget to teach me about R?
See What They Forgot to Teach You About R by Jenny Bryan and Jim Hester.
See also Happy Git and GitHub for the useR by Jenny Bryan, the STAT 545 TAs, and Jim Hester.
Parting words
Data science is not 100% about writing code. There’s a human side to it.
— Hadley Wickham in Designing Data Science
I discussed this earlier, but it bears repeating: the human side of data science is really important if you want to solve problems successfully. One of the reasons R is my favourite language is that it’s been designed to make statistical thinking and computing accessible to anyone. This accessibility has had a big impact on me (I doubt I would be doing data science without R), and I think it’s why we have such a strong, diverse community of R users and programmers.
So I try to make all my work as accessible as it can be, and I recommend you do too. It makes a difference.
Footnotes
Myself included. It’s hard for me to recommend how to learn research or statistics in the same way I’ve recommended how to learn R. Hands-On Programming with R and R for Data Science are excellent, beginner-friendly books that will get you started using the tools of the trade in the way they were intended to be used. But a lot of the excellent statistics books I’ve read are not beginner-friendly (even if they claim to be) and assume you have prior training in statistics. On the other hand, beginner-friendly books can encourage statistical rituals over statistical thinking, which you then have to unlearn in the future as your knowledge and skills develop.↩︎
Regression and Other Stories is an updated and expanded second edition of the regression parts of Data Analysis Using Regression and Multilevel/Hierarchical Models. The authors are also working on an updated and expanded second edition of the multilevel modelling parts of Data Analysis Using Regression and Multilevel/Hierarchical Models, but it isn’t out yet.↩︎
Most of these books have also had their examples translated to use different R packages than the authors used. For example, Andrew Heiss has translated Bayes Rules! and Statistical Rethinking into the tidyverse, brms, and marginaleffects packages; Emil Hvitfeldt has translated An Introduction to Statistical Learning into the tidymodels set of packages; and A. Solomon Kurz has translated Regression and Other Stories into the tidyverse and brms packages.↩︎
This should go without saying, but the old “garbage-in garbage-out” adage still applies to reproducible data products. If your data has problems, your code has bugs, your visualizations are misleading, your models are inappropriate, or your communications are unclear, then your data product will be reproducible but not very useful (or maybe even harmful). Quality assurance has to happen at every step, and reproducibility is the last step. It’s supposed to be the little bow on top that ties all the other great work you’ve done together.↩︎