library(tidyverse)
library(stringi)
library(scales)
library(gt)
Overview
During my (ongoing) job search for a data science or developer-focused role where I get to do R programming, one question came to me: just how many R developers are there? That question inspired this post. However, the data needed to answer it can also be used to answer other interesting questions about R developers, such as how many packages they’ve contributed to, their roles in package development, and so forth. So that’s what we’ll be doing here.
If you just want to see the stats, you can skip to the R developer statistics section. Otherwise follow along to see how I retrieved and wrangled the data into a usable state.
Prerequisites
I’ll be using the CRAN package repository data returned by tools::CRAN_package_db() to get package and author metadata for the current packages available on CRAN. This returns a data frame with character columns containing most metadata from the DESCRIPTION file of a given R package.
Since this data will change over time, here’s when tools::CRAN_package_db() was run for reference: 2023-05-03.
cran_pkg_db <- tools::CRAN_package_db()
glimpse(cran_pkg_db)
#> Rows: 19,473
#> Columns: 67
#> $ Package <chr> "A3", "AalenJohansen", "AATtools", "ABACUS",…
#> $ Version <chr> "1.0.0", "1.0", "0.0.2", "1.0.0", "0.1", "0.…
#> $ Priority <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ Depends <chr> "R (>= 2.15.0), xtable, pbapply", NA, "R (>=…
#> $ Imports <chr> NA, NA, "magrittr, dplyr, doParallel, foreac…
#> $ LinkingTo <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "Rcp…
#> $ Suggests <chr> "randomForest, e1071", "knitr, rmarkdown", N…
#> $ Enhances <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ License <chr> "GPL (>= 2)", "GPL (>= 2)", "GPL-3", "GPL-3"…
#> $ License_is_FOSS <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ License_restricts_use <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ OS_type <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ Archs <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ MD5sum <chr> "027ebdd8affce8f0effaecfcd5f5ade2", "d7eb2a6…
#> $ NeedsCompilation <chr> "no", "no", "no", "no", "no", "no", "no", "n…
#> $ Additional_repositories <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ Author <chr> "Scott Fortmann-Roe", "Martin Bladt [aut, cr…
#> $ `Authors@R` <chr> NA, "c(person(\"Martin\", \"Bladt\", email =…
#> $ Biarch <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ BugReports <chr> NA, NA, "https://github.com/Spiritspeak/AATt…
#> $ BuildKeepEmpty <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ BuildManual <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ BuildResaveData <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ BuildVignettes <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ Built <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ ByteCompile <chr> NA, NA, "true", NA, NA, NA, NA, NA, NA, NA, …
#> $ `Classification/ACM` <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ `Classification/ACM-2012` <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ `Classification/JEL` <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ `Classification/MSC` <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ `Classification/MSC-2010` <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ Collate <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ Collate.unix <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ Collate.windows <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ Contact <chr> NA, NA, NA, NA, NA, NA, NA, NA, "Ian Morison…
#> $ Copyright <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "Eli…
#> $ Date <chr> "2015-08-15", NA, NA, NA, "2021-12-12", NA, …
#> $ `Date/Publication` <chr> "2015-08-16 23:05:52", "2023-03-01 10:42:09 …
#> $ Description <chr> "Supplies tools for tabulating and analyzing…
#> $ Encoding <chr> NA, "UTF-8", "UTF-8", "UTF-8", "UTF-8", NA, …
#> $ KeepSource <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ Language <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ LazyData <chr> NA, NA, "true", "true", NA, "true", NA, NA, …
#> $ LazyDataCompression <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ LazyLoad <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, "yes", N…
#> $ MailingList <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ Maintainer <chr> "Scott Fortmann-Roe <scottfr@berkeley.edu>",…
#> $ Note <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ Packaged <chr> "2015-08-16 14:17:33 UTC; scott", "2023-02-2…
#> $ RdMacros <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ StagedInstall <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ SysDataCompression <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ SystemRequirements <chr> NA, NA, NA, NA, NA, NA, NA, NA, "GNU make", …
#> $ Title <chr> "Accurate, Adaptable, and Accessible Error M…
#> $ Type <chr> "Package", "Package", "Package", NA, "Packag…
#> $ URL <chr> NA, NA, NA, "https://shiny.abdn.ac.uk/Stats/…
#> $ UseLTO <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ VignetteBuilder <chr> NA, "knitr", NA, "knitr", NA, "knitr", NA, N…
#> $ ZipData <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ Path <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ `X-CRAN-Comment` <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ Published <chr> "2015-08-16", "2023-03-01", "2022-08-12", "2…
#> $ `Reverse depends` <chr> NA, NA, NA, NA, NA, NA, "abctools, EasyABC",…
#> $ `Reverse imports` <chr> NA, NA, NA, NA, NA, NA, "ecolottery, poems",…
#> $ `Reverse linking to` <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ `Reverse suggests` <chr> NA, NA, NA, NA, NA, NA, "coala", "abctools",…
#> $ `Reverse enhances` <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
Wrangle
Since we only care about package and author metadata, a good first step is to remove everything else. This leaves us with a Package field and two author fields: Author and Authors@R. The difference between the two author fields is that Author is an unstructured text field that can contain any text in any format, and Authors@R is a structured text field containing R code that defines authors’ names and roles with the person() function.
cran_pkg_db <- cran_pkg_db |>
  select(package = Package, authors = Author, authors_r = `Authors@R`) |>
  as_tibble()
Here’s a comparison of the two fields, using the dplyr package as an example:
# Author
cran_pkg_db |>
  filter(package == "dplyr") |>
  pull(authors) |>
  cat()
#> Hadley Wickham [aut, cre] (<https://orcid.org/0000-0003-4757-117X>),
#> Romain François [aut] (<https://orcid.org/0000-0002-2444-4226>),
#> Lionel Henry [aut],
#> Kirill Müller [aut] (<https://orcid.org/0000-0002-1416-3412>),
#> Davis Vaughan [aut] (<https://orcid.org/0000-0003-4777-038X>),
#> Posit Software, PBC [cph, fnd]
# Authors@R
cran_pkg_db |>
  filter(package == "dplyr") |>
  pull(authors_r) |>
  cat()
#> c(
#> person("Hadley", "Wickham", , "hadley@posit.co", role = c("aut", "cre"),
#> comment = c(ORCID = "0000-0003-4757-117X")),
#> person("Romain", "François", role = "aut",
#> comment = c(ORCID = "0000-0002-2444-4226")),
#> person("Lionel", "Henry", role = "aut"),
#> person("Kirill", "Müller", role = "aut",
#> comment = c(ORCID = "0000-0002-1416-3412")),
#> person("Davis", "Vaughan", , "davis@posit.co", role = "aut",
#> comment = c(ORCID = "0000-0003-4777-038X")),
#> person("Posit Software, PBC", role = c("cph", "fnd"))
#> )
And a glimpse at the data:
cran_pkg_db
#> # A tibble: 19,473 × 3
#> package authors autho…¹
#> <chr> <chr> <chr>
#> 1 A3 "Scott Fortmann-Roe" <NA>
#> 2 AalenJohansen "Martin Bladt [aut, cre],\n Christian Furrer [aut]" "c(per…
#> 3 AATtools "Sercan Kahveci [aut, cre]" "perso…
#> 4 ABACUS "Mintu Nath [aut, cre]" <NA>
#> 5 abbreviate "Sigbert Klinke [aut, cre]" "\n p…
#> 6 abbyyR "Gaurav Sood [aut, cre]" "perso…
#> 7 abc "Csillery Katalin [aut],\n Lemaire Louisiane [aut],\n… "c( \n…
#> 8 abc.data "Csillery Katalin [aut],\n Lemaire Louisiane [aut],\n… "c( \n…
#> 9 ABC.RAP "Abdulmonem Alsaleh [cre, aut], Robert Weeks [aut], Ia… <NA>
#> 10 ABCanalysis "Michael Thrun, Jorn Lotsch, Alfred Ultsch" <NA>
#> # … with 19,463 more rows, and abbreviated variable name ¹authors_r
From the output above you can see that every package uses the Author field, but not all packages use the Authors@R field. This is unfortunate, because it means that the names and roles of authors need to be extracted from the unstructured text in the Author field for a subset of packages, which is difficult to do and somewhat error-prone. Just for consideration, here’s how many packages don’t use the Authors@R field.
cran_pkg_db |>
  filter(is.na(authors_r)) |>
  nrow()
#> [1] 6361
So roughly one-third of all packages (6,361 of 19,473). From the output above it’s also clear that although there are similarities in how different packages populate the Author field, it does vary, so a simple rule like splitting the text on commas isn’t sufficient. These are fairly tame examples; some packages even use multiple sentences describing each author’s roles and affiliations, or contain other comments such as copyright disclaimers. All of these things make it more difficult to extract names and roles without errors.
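To make the difficulty concrete, here is a minimal base R sketch of a regex-based extraction from the Author field. It is not the helper used in this post: the function name extract_from_author_text() and its rules (drop parenthesised comments, split on commas outside square brackets) are assumptions chosen for illustration.

```r
# Hypothetical, simplified extraction from the free-text Author field.
# Illustration only; it is not the helper used elsewhere in this post.
extract_from_author_text <- function(x) {
  # Drop parenthesised comments such as ORCID links.
  x <- gsub("\\([^)]*\\)", "", x)
  # Split on commas that are not inside square brackets, so "[aut, cre]"
  # stays in one piece.
  parts <- strsplit(x, ",(?![^\\[]*\\])", perl = TRUE)[[1]]
  parts <- trimws(parts)
  parts <- parts[nzchar(parts)]
  # Pull out the bracketed role codes where present; NA otherwise.
  m <- regexpr("\\[[^]]*\\]", parts)
  roles <- rep(NA_character_, length(parts))
  roles[m > 0] <- regmatches(parts, m)
  # Whatever is left after removing the roles is treated as the name.
  person <- trimws(sub("\\[[^]]*\\]", "", parts))
  data.frame(person = person, roles = roles, stringsAsFactors = FALSE)
}

extract_from_author_text("Martin Bladt [aut, cre],\n  Christian Furrer [aut]")
```

Even this small sketch shows why the field is error-prone: a name that itself contains a comma, such as "Posit Software, PBC", would be split into two people.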
Conversely, for the Authors@R field, all that’s needed is to parse and evaluate the R code stored there as a character string; this returns a person vector that has format() methods to get authors’ names and roles into an analysis-ready format. This removes the possibility of me introducing errors into the data, although it doesn’t solve things like authors using an inconsistent name across packages (e.g., sometimes including their middle initial and sometimes not, or just generally writing their name differently).
Because there are two fields, I’ll make two helper functions to get name and role data from each field. Regardless of the field, the end goal is to tidy cran_pkg_db into a data frame with three columns: package, person, and roles, with one package/person combination per row.
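The parse-and-evaluate approach for the Authors@R field can be sketched in base R as follows. The function name parse_authors_r() is hypothetical and this is not necessarily how the helper in this post is written; it only relies on person() and its format() method from the utils package.

```r
# Hypothetical sketch: turn one Authors@R string into a person/roles table.
# Note that this evaluates code taken from package metadata, so it should
# only be run on sources you trust.
parse_authors_r <- function(x) {
  # The field holds R code as text; parsing and evaluating it yields a
  # person vector.
  persons <- eval(parse(text = x))
  data.frame(
    # Given and family names only, e.g. "Hadley Wickham".
    person = format(persons, include = c("given", "family")),
    # Roles in the bracketed form used by the Author field, e.g. "[aut, cre]".
    roles = format(persons, include = "role"),
    stringsAsFactors = FALSE
  )
}

example_authors_r <- 'c(
  person("Hadley", "Wickham", role = c("aut", "cre")),
  person("Lionel", "Henry", role = "aut")
)'
parse_authors_r(example_authors_r)
```

Returning roles in the same bracketed format as the Author field keeps the output of both extraction routes comparable, which matters once the two are combined into a single column.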
Extracting roles
From the example dplyr output above, we can see that the roles column is currently a character string with the role codes, which isn’t super useful. Later on I’ll split these out into indicator columns with a TRUE or FALSE for whether someone had a given role. I also wanted the full names for the roles, since some of the codes aren’t very obvious.
Kurt Hornik, Duncan Murdoch and Achim Zeileis published a nice article in The R Journal explaining the roles of R package authors and where they come from. Briefly, they come from the “Relator and Role” codes and terms from MARC (MAchine-Readable Cataloging, Library of Congress, 2012) here: https://www.loc.gov/marc/relators/relaterm.html.
There are a lot of roles there; I just took the ones that were present in the data at the time I wrote this post.
marc_roles <- c(
  analyst = "anl",
architecht = "arc",
artist = "art",
author = "aut",
author_in_quotations = "aqt",
author_of_intro = "aui",
bibliographic_antecedent = "ant",
collector = "col",
compiler = "com",
conceptor = "ccp",
conservator = "con",
consultant = "csl",
consultant_to_project = "csp",
contestant_appellant = "cot",
contractor = "ctr",
contributor = "ctb",
copyright_holder = "cph",
corrector = "crr",
creator = "cre",
data_contributor = "dtc",
degree_supervisor = "dgs",
editor = "edt",
funder = "fnd",
illustrator = "ill",
inventor = "inv",
lab_director = "ldr",
lead = "led",
metadata_contact = "mdc",
musician = "mus",
owner = "own",
presenter = "pre",
programmer = "prg",
project_director = "pdr",
scientific_advisor = "sad",
second_party = "spy",
sponsor = "spn",
supporting_host = "sht",
teacher = "tch",
thesis_advisor = "ths",
translator = "trl",
research_team_head = "rth",
research_team_member = "rtm",
researcher = "res",
reviewer = "rev",
witness = "wit",
woodcutter = "wdc"
)
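Because the vector is named by full role name with the MARC code as the value, it can double as a lookup table. The following is a small sketch using a four-role subset; expand_roles() and marc_subset are hypothetical names introduced only for this example.

```r
# A subset of the role lookup, repeated here so the example is self-contained.
marc_subset <- c(
  author = "aut",
  creator = "cre",
  copyright_holder = "cph",
  funder = "fnd"
)

# Hypothetical helper: translate a bracketed role string such as "[cph, fnd]"
# into full role names. Codes missing from the lookup come back as NA.
expand_roles <- function(roles, lookup) {
  codes <- strsplit(gsub("\\[|\\]", "", roles), ",\\s*")[[1]]
  names(lookup)[match(codes, lookup)]
}

expand_roles("[cph, fnd]", marc_subset)
```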
Tidying the data
With all the explanations out of the way we can now tidy the data with our helper functions.
cran_authors <- cran_pkg_db |>
  mutate(
    # Letters with accents, etc. should be normalized so that names including
    # them are picked up by the regex.
    across(c(authors, authors_r), \(.x) stri_trans_general(.x, "latin-ascii")),
    # The extraction functions aren't vectorized so they have to be mapped over.
    # This creates a list column.
    persons = if_else(
      is.na(authors_r),
      map(authors, \(.x) authors(.x)),
      map(authors_r, \(.x) authors_r(.x))
    )
  ) |>
  select(-c(authors, authors_r)) |>
  unnest(persons) |>
  # If a package only has one author then they must be the author and creator,
  # so it's safe to impute this when it isn't there.
  group_by(package) |>
  mutate(roles = if_else(
    is.na(roles) & n() == 1, "[aut, cre]", roles
  )) |>
  ungroup()
Then add the indicator columns for roles. Note the use of the walrus operator (:=) here to create new columns from the full names of MARC roles on the left side of the walrus, while detecting the MARC codes with str_detect() on the right side. I’m mapping over this because the left side can’t be a vector.
cran_authors_tidy <- cran_authors |>
  # Add indicator columns for all roles.
  bind_cols(
    map2_dfc(
      names(marc_roles), marc_roles,
      function(.x, .y) {
        cran_authors |>
          mutate(!!.x := str_detect(roles, .y)) |>
          select(!!.x)
      }
    )
  ) |>
  # Not everyone's role is known.
  mutate(unknown = is.na(roles))
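For readers who want to see what the indicator columns look like without the tidyverse machinery, here is a rough base R equivalent on toy data. The roles vector and the three-role marc_subset are made up for illustration; missing roles are kept as NA to mirror what str_detect() returns for missing input.

```r
# Toy data: three known role strings and one missing value.
roles <- c("[aut, cre]", "[aut]", "[ctb, cph]", NA)
marc_subset <- c(author = "aut", creator = "cre", contributor = "ctb")

# One logical column per role; the column names come from the names of
# marc_subset.
indicators <- vapply(
  marc_subset,
  function(code) {
    hit <- grepl(code, roles, fixed = TRUE)
    # grepl() returns FALSE for NA input, so restore the NA explicitly.
    hit[is.na(roles)] <- NA
    hit
  },
  logical(length(roles))
)
indicators <- as.data.frame(indicators)

# Not everyone's role is known.
indicators$unknown <- is.na(roles)
indicators
```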
This all leaves us with a tidy (mostly error-free) data frame about R developers and their roles that is ready to explore:
glimpse(cran_authors_tidy)
#> Rows: 52,719
#> Columns: 50
#> $ package <chr> "A3", "AalenJohansen", "AalenJohansen", "AATt…
#> $ person <chr> "Scott Fortmann-Roe", "Martin Bladt", "Christ…
#> $ roles <chr> "[aut, cre]", "[aut, cre]", "[aut]", "[aut, c…
#> $ analyst <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
#> $ architecht <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
#> $ artist <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
#> $ author <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRU…
#> $ author_in_quotations <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
#> $ author_of_intro <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
#> $ bibliographic_antecedent <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
#> $ collector <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
#> $ compiler <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
#> $ conceptor <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
#> $ conservator <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
#> $ consultant <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
#> $ consultant_to_project <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
#> $ contestant_appellant <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
#> $ contractor <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
#> $ contributor <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
#> $ copyright_holder <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
#> $ corrector <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
#> $ creator <lgl> TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE, FA…
#> $ data_contributor <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
#> $ degree_supervisor <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
#> $ editor <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
#> $ funder <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
#> $ illustrator <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
#> $ inventor <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
#> $ lab_director <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
#> $ lead <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
#> $ metadata_contact <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
#> $ musician <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
#> $ owner <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
#> $ presenter <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
#> $ programmer <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
#> $ project_director <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
#> $ scientific_advisor <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
#> $ second_party <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
#> $ sponsor <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
#> $ supporting_host <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
#> $ teacher <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
#> $ thesis_advisor <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
#> $ translator <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
#> $ research_team_head <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
#> $ research_team_member <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
#> $ researcher <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
#> $ reviewer <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
#> $ witness <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
#> $ woodcutter <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
#> $ unknown <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
R developer statistics
I’ll start with person-level stats, mainly because some of the other stats are further summaries of these statistics. Nothing fancy here, just the number of packages a person has contributed to, role counts, and nominal and percentile rankings. Both the ranking methods used here give every tie the same (smallest) value, so if two people tied for second place both their ranks would be 2, and the next person’s rank would be 4.
cran_author_pkg_counts <- cran_authors_tidy |>
  group_by(person) |>
  summarise(
    n_packages = n(),
    across(analyst:unknown, function(.x) sum(.x, na.rm = TRUE))
  ) |>
  mutate(
    # Discretizing this for visualization purposes later on
    n_pkgs_fct = case_when(
      n_packages == 1 ~ "One",
      n_packages == 2 ~ "Two",
      n_packages == 3 ~ "Three",
      n_packages >= 4 ~ "Four+"
    ),
    n_pkgs_fct = factor(n_pkgs_fct, levels = c("One", "Two", "Three", "Four+")),
    rank = min_rank(desc(n_packages)),
    percentile = percent_rank(n_packages) * 100,
    .after = n_packages
  ) |>
  arrange(desc(n_packages))
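The tie handling in the two ranking columns can be checked on a toy vector. This sketch uses base R's rank() with ties.method = "min" as a stand-in for dplyr's min_rank(), and the documented percent_rank() definition of (min_rank(x) - 1) / (n - 1); the package counts are made up.

```r
# Made-up package counts for four people, two of them tied.
n_packages <- c(10, 7, 7, 3)

# Rank from most to fewest packages; ties share the smallest rank.
rank_desc <- rank(-n_packages, ties.method = "min")
rank_desc

# Percentile rank on a 0-100 scale, computed from ascending minimum ranks.
percentile <- (rank(n_packages, ties.method = "min") - 1) /
  (length(n_packages) - 1) * 100
percentile
```

The two people with seven packages both get rank 2 and the next person gets rank 4, which matches the tie behaviour described for the person-level stats.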
Here’s an interactive gt table of the person-level stats so you can find yourself, or ask silly questions like how many other authors share a name with you. If you page or search through it you can also get an idea of the data quality (e.g., try “Posit” under the person column and you’ll see that they don’t use a consistent organization name across all packages, which creates some measurement error here).
cran_author_pkg_counts |>
  select(-n_pkgs_fct) |>
  gt() |>
  tab_header(
    title = "R Developer Contributions",
    subtitle = "CRAN Package Authorships and Roles"
  ) |>
  text_transform(
    \(.x) str_to_title(str_replace_all(.x, "_", " ")),
    locations = cells_column_labels()
  ) |>
  fmt_number(
    columns = percentile
  ) |>
  fmt(
    columns = rank,
    fns = \(.x) label_ordinal()(.x)
  ) |>
  cols_width(everything() ~ px(120)) |>
  opt_interactive(use_sorting = FALSE, use_filters = TRUE)