The Pareto Principle in R package development

Exploring R package developer data on CRAN with the {tidyverse} and {gt}.

.Wrangle
.Visualize
{tidyverse}
{scales}
{gt}
Author

Michael McCarthy

Published

May 3, 2023

Overview

During my (ongoing) job search for a data science or developer-focused role where I get to do R programming, a question came to me: just how many R developers are there? That question inspired this post. The data needed to answer it can also answer other interesting questions about R developers, such as how many packages they’ve contributed to and what roles they play in package development, so we’ll look at those here too.

If you just want to see the stats, you can skip to the R developer statistics section. Otherwise, follow along to see how I retrieved and wrangled the data into a usable state.

Prerequisites

library(tidyverse)
library(stringi)
library(scales)
library(gt)

I’ll be using the CRAN package repository data returned by tools::CRAN_package_db() to get package and author metadata for the current packages available on CRAN. This returns a data frame with character columns containing most metadata from the DESCRIPTION file of a given R package.

Since this data will change over time, here’s when tools::CRAN_package_db() was run for reference: 2023-05-03.

cran_pkg_db <- tools::CRAN_package_db()

glimpse(cran_pkg_db)
#> Rows: 19,473
#> Columns: 67
#> $ Package                   <chr> "A3", "AalenJohansen", "AATtools", "ABACUS",…
#> $ Version                   <chr> "1.0.0", "1.0", "0.0.2", "1.0.0", "0.1", "0.…
#> $ Priority                  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ Depends                   <chr> "R (>= 2.15.0), xtable, pbapply", NA, "R (>=…
#> $ Imports                   <chr> NA, NA, "magrittr, dplyr, doParallel, foreac…
#> $ LinkingTo                 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "Rcp…
#> $ Suggests                  <chr> "randomForest, e1071", "knitr, rmarkdown", N…
#> $ Enhances                  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ License                   <chr> "GPL (>= 2)", "GPL (>= 2)", "GPL-3", "GPL-3"…
#> $ License_is_FOSS           <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ License_restricts_use     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ OS_type                   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ Archs                     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ MD5sum                    <chr> "027ebdd8affce8f0effaecfcd5f5ade2", "d7eb2a6…
#> $ NeedsCompilation          <chr> "no", "no", "no", "no", "no", "no", "no", "n…
#> $ Additional_repositories   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ Author                    <chr> "Scott Fortmann-Roe", "Martin Bladt [aut, cr…
#> $ `Authors@R`               <chr> NA, "c(person(\"Martin\", \"Bladt\", email =…
#> $ Biarch                    <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ BugReports                <chr> NA, NA, "https://github.com/Spiritspeak/AATt…
#> $ BuildKeepEmpty            <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ BuildManual               <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ BuildResaveData           <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ BuildVignettes            <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ Built                     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ ByteCompile               <chr> NA, NA, "true", NA, NA, NA, NA, NA, NA, NA, …
#> $ `Classification/ACM`      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ `Classification/ACM-2012` <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ `Classification/JEL`      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ `Classification/MSC`      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ `Classification/MSC-2010` <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ Collate                   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ Collate.unix              <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ Collate.windows           <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ Contact                   <chr> NA, NA, NA, NA, NA, NA, NA, NA, "Ian Morison…
#> $ Copyright                 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "Eli…
#> $ Date                      <chr> "2015-08-15", NA, NA, NA, "2021-12-12", NA, …
#> $ `Date/Publication`        <chr> "2015-08-16 23:05:52", "2023-03-01 10:42:09 …
#> $ Description               <chr> "Supplies tools for tabulating and analyzing…
#> $ Encoding                  <chr> NA, "UTF-8", "UTF-8", "UTF-8", "UTF-8", NA, …
#> $ KeepSource                <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ Language                  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ LazyData                  <chr> NA, NA, "true", "true", NA, "true", NA, NA, …
#> $ LazyDataCompression       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ LazyLoad                  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, "yes", N…
#> $ MailingList               <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ Maintainer                <chr> "Scott Fortmann-Roe <scottfr@berkeley.edu>",…
#> $ Note                      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ Packaged                  <chr> "2015-08-16 14:17:33 UTC; scott", "2023-02-2…
#> $ RdMacros                  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ StagedInstall             <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ SysDataCompression        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ SystemRequirements        <chr> NA, NA, NA, NA, NA, NA, NA, NA, "GNU make", …
#> $ Title                     <chr> "Accurate, Adaptable, and Accessible Error M…
#> $ Type                      <chr> "Package", "Package", "Package", NA, "Packag…
#> $ URL                       <chr> NA, NA, NA, "https://shiny.abdn.ac.uk/Stats/…
#> $ UseLTO                    <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ VignetteBuilder           <chr> NA, "knitr", NA, "knitr", NA, "knitr", NA, N…
#> $ ZipData                   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ Path                      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ `X-CRAN-Comment`          <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ Published                 <chr> "2015-08-16", "2023-03-01", "2022-08-12", "2…
#> $ `Reverse depends`         <chr> NA, NA, NA, NA, NA, NA, "abctools, EasyABC",…
#> $ `Reverse imports`         <chr> NA, NA, NA, NA, NA, NA, "ecolottery, poems",…
#> $ `Reverse linking to`      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ `Reverse suggests`        <chr> NA, NA, NA, NA, NA, NA, "coala", "abctools",…
#> $ `Reverse enhances`        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …

Wrangle

Since we only care about package and author metadata, a good first step is to remove everything else. This leaves us with a Package field and two author fields: Author and Authors@R. The difference between the two author fields is that Author is an unstructured text field that can contain any text in any format, and Authors@R is a structured text field containing R code that defines authors’ names and roles with the person() function.

cran_pkg_db <- cran_pkg_db |>
  select(package = Package, authors = Author, authors_r = `Authors@R`) |>
  as_tibble()

Here’s a comparison of the two fields, using the dplyr package as an example:

# Author
cran_pkg_db |>
  filter(package == "dplyr") |>
  pull(authors) |>
  cat()
#> Hadley Wickham [aut, cre] (<https://orcid.org/0000-0003-4757-117X>),
#>   Romain François [aut] (<https://orcid.org/0000-0002-2444-4226>),
#>   Lionel Henry [aut],
#>   Kirill Müller [aut] (<https://orcid.org/0000-0002-1416-3412>),
#>   Davis Vaughan [aut] (<https://orcid.org/0000-0003-4777-038X>),
#>   Posit Software, PBC [cph, fnd]
# Authors@R
cran_pkg_db |>
  filter(package == "dplyr") |>
  pull(authors_r) |>
  cat()
#> c(
#>     person("Hadley", "Wickham", , "hadley@posit.co", role = c("aut", "cre"),
#>            comment = c(ORCID = "0000-0003-4757-117X")),
#>     person("Romain", "François", role = "aut",
#>            comment = c(ORCID = "0000-0002-2444-4226")),
#>     person("Lionel", "Henry", role = "aut"),
#>     person("Kirill", "Müller", role = "aut",
#>            comment = c(ORCID = "0000-0002-1416-3412")),
#>     person("Davis", "Vaughan", , "davis@posit.co", role = "aut",
#>            comment = c(ORCID = "0000-0003-4777-038X")),
#>     person("Posit Software, PBC", role = c("cph", "fnd"))
#>   )

And a glimpse at the data:

cran_pkg_db
#> # A tibble: 19,473 × 3
#>    package       authors                                                 autho…¹
#>    <chr>         <chr>                                                   <chr>  
#>  1 A3            "Scott Fortmann-Roe"                                     <NA>  
#>  2 AalenJohansen "Martin Bladt [aut, cre],\n  Christian Furrer [aut]"    "c(per…
#>  3 AATtools      "Sercan Kahveci [aut, cre]"                             "perso…
#>  4 ABACUS        "Mintu Nath [aut, cre]"                                  <NA>  
#>  5 abbreviate    "Sigbert Klinke [aut, cre]"                             "\n  p…
#>  6 abbyyR        "Gaurav Sood [aut, cre]"                                "perso…
#>  7 abc           "Csillery Katalin [aut],\n  Lemaire Louisiane [aut],\n… "c( \n…
#>  8 abc.data      "Csillery Katalin [aut],\n  Lemaire Louisiane [aut],\n… "c( \n…
#>  9 ABC.RAP       "Abdulmonem Alsaleh [cre, aut], Robert Weeks [aut], Ia…  <NA>  
#> 10 ABCanalysis   "Michael Thrun, Jorn Lotsch, Alfred Ultsch"              <NA>  
#> # … with 19,463 more rows, and abbreviated variable name ¹​authors_r

From the output above you can see that every package uses the Author field, but not all packages use the Authors@R field. This is unfortunate, because it means that the names and roles of authors need to be extracted from the unstructured text in the Author field for a subset of packages, which is difficult to do and somewhat error-prone. Just for consideration, here’s how many packages don’t use the Authors@R field.

cran_pkg_db |>
  filter(is.na(authors_r)) |>
  nrow()
#> [1] 6361

So roughly one-third of all packages. From the output above it’s also clear that although there are similarities in how different packages populate the Author field, it does vary; so a simple rule like splitting the text on commas isn’t sufficient. These are fairly tame examples—some packages even use multiple sentences describing each author’s roles and affiliations, or contain other comments such as copyright disclaimers. All of these things make it more difficult to extract names and roles without errors.

Conversely, for the Authors@R field, all that’s needed is to parse and evaluate the R code stored there as a character string; this will return a person vector that has format() methods to get authors’ names and roles into an analysis-ready format. This removes the opportunity for me to introduce extraction errors into the data, although it doesn’t solve problems such as authors using an inconsistent name across packages (e.g., sometimes including their middle initial and sometimes not, or just generally writing their name differently).

Because there are two fields, I’ll make two helper functions to get name and role data from each field. Regardless of the field, the end goal is to tidy cran_pkg_db into a data frame with three columns: package, person, and roles, with one package/person combination per row.

Extracting from Authors@R

Getting the data we want from the Authors@R field is pretty straightforward. For the packages where this is used, each one has a vector of person objects stored as a character string like:

mm_string <- "person(\"Michael\", \"McCarthy\", , role = c(\"aut\", \"cre\"))"

mm_string
#> [1] "person(\"Michael\", \"McCarthy\", , role = c(\"aut\", \"cre\"))"

Which can be parsed and evaluated as R code like:

mm_eval <- eval(parse(text = mm_string))

class(mm_eval)
#> [1] "person"

Then the format() method for the person class can be used to get names and roles into the format I want simply and accurately.

mm_person <- format(mm_eval, include = c("given", "family"))
mm_roles  <- format(mm_eval, include = c("role"))
tibble(person = mm_person, roles = mm_roles)
#> # A tibble: 1 × 2
#>   person           roles     
#>   <chr>            <chr>     
#> 1 Michael McCarthy [aut, cre]

I’ve wrapped this up into a small helper function, authors_r(), that includes some light tidying steps to deal with a couple of small discrepancies I noticed in a subset of packages.

# Get names and roles from "person" objects in the Authors@R field
authors_r <- function(x) {
  # Some light preprocessing is needed to replace the unicode symbol for line
  # breaks with the regular "\n". This is an edge case from at least one
  # package.
  code <- str_replace_all(x, "\\<U\\+000a\\>", "\n")
  persons <- eval(parse(text = code))
  person <- str_trim(format(persons, include = c("given", "family")))
  roles <- format(persons, include = c("role"))
  tibble(person = person, roles = roles)
}

Here’s an example of it with dplyr:

cran_pkg_db |>
  filter(package == "dplyr") |>
  pull(authors_r) |>
  # Normalizing names leads to more consistent results with summary statistics
  # later on, since some people use things like umlauts and accents
  # inconsistently.
  stri_trans_general("latin-ascii") |>
  authors_r()
#> # A tibble: 6 × 2
#>   person              roles     
#>   <chr>               <chr>     
#> 1 Hadley Wickham      [aut, cre]
#> 2 Romain Francois     [aut]     
#> 3 Lionel Henry        [aut]     
#> 4 Kirill Muller       [aut]     
#> 5 Davis Vaughan       [aut]     
#> 6 Posit Software, PBC [cph, fnd]
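
Because this approach evaluates whatever code is stored in the field, a single malformed entry would stop the whole pipeline with an error. Here is a minimal base R sketch of one way to guard against that; safe_persons() is a hypothetical helper I'm introducing for illustration, not something used to produce the results in this post.

```r
# Hypothetical helper (not part of the original analysis): evaluate an
# Authors@R string, returning NULL instead of stopping when the code is
# malformed.
safe_persons <- function(x) {
  tryCatch(
    eval(parse(text = x)),
    error = function(e) NULL
  )
}

good <- safe_persons('person("Michael", "McCarthy", role = c("aut", "cre"))')
# The closing parenthesis is missing, so parse() fails and NULL is returned.
bad <- safe_persons('person("Michael", "McCarthy"')

class(good)
#> [1] "person"
is.null(bad)
#> [1] TRUE
```

Packages that come back as NULL could then be routed to the Author field instead, at the cost of the less reliable extraction described next.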

Extracting from Author

As I mentioned before, getting the data we want from the Author field is more complicated since there’s no common structure between all packages. I tried a few approaches, including:

  • ChatGPT
  • Named Entity Extraction
  • Regular expressions (regex)

ChatGPT worked excellently in the few examples I tried; however, OpenAI doesn’t provide free API access, so I had no way of using this with R without paying (which I didn’t want to do). Here’s the prompt I used (note that it would need to be expanded to deal with more edge cases):

Separate these names with commas and do not include any other information (including a response to the request); if any names are within person() they belong to one person:

Named Entity Extraction, which is a natural language processing (NLP) method that extracts entities (like people’s names) from text, didn’t work very well in the few examples I tried. It didn’t recognize certain names even when the only thing in a sentence was names separated by commas. This is probably my fault more than anything: I’ve never used this method before and didn’t want to spend too much time learning it just for this post, so I used a pre-trained model that probably wasn’t trained on a diverse set of names.

Fortunately, regular expressions actually worked pretty well, so this is the solution I settled on. I tried two approaches to this. First I tried to split the names (and roles) up by commas (and eventually other punctuation as I ran into edge cases). This worked alright; there were clearly errors in the data with this method, but since most packages use a simple structure in the Author field it correctly extracted names from most packages.
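
To make the weakness of the splitting approach concrete, here is a small base R sketch. The input is a shortened version of dplyr's Author field, and the split pattern is my own illustration rather than the exact one I used at the time.

```r
# Shortened version of dplyr's Author field (ORCID comments removed).
author_field <- "Hadley Wickham [aut, cre], Lionel Henry [aut], Posit Software, PBC [cph, fnd]"

# Split on commas, except commas inside square brackets (the role lists).
pieces <- strsplit(author_field, ",\\s*(?![^\\[]*\\])", perl = TRUE)[[1]]

pieces
#> [1] "Hadley Wickham [aut, cre]" "Lionel Henry [aut]"
#> [3] "Posit Software"            "PBC [cph, fnd]"
```

Three authors become four pieces because the comma in "Posit Software, PBC" looks the same as a separator between people, which is the kind of edge case that kept piling up with this method.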

Second, I tried to extract the names (and roles) directly with a regular expression that could match a variety of names. This is the approach I kept. It still isn’t perfect, but the data is cleaner than with the splitting method. Regardless, the difference in the number of observations between the two methods was only in the mid hundreds, so I think any statistics based on this data, although not completely accurate, are still sufficient to get a good idea of the R developer landscape on CRAN.

# This regex was adapted from <https://stackoverflow.com/a/7654214/16844576>.
# It's designed to capture a wide range of names, including those with
# punctuation in them. It's tailored to this data, so I don't know how well
# it would generalize to other situations, but feel free to try it.
persons_roles <- r"((\'|\")*[A-Z]([A-Z]+|(\'[A-Z])?[a-z]+|\.)(?:(\s+|\-)[A-Z]([a-z]+|\.?))*(?:(\'?\s+|\-)[a-z][a-z\-]+){0,2}(\s+|\-)[A-Z](\'?[A-Za-z]+(\'[A-Za-z]+)?|\.)(?:(\s+|\-)[A-Za-z]([a-z]+|\.))*(\'|\")*(?:\s*\[(.*?)\])?)"
# Some packages put the person() code in the wrong field, but it's also
# formatted incorrectly and throws an error when evaluated, so the best we can
# do is just extract the whole thing for each person.
person_objects <- r"(person\((.*?)\))"

# Get names and roles from character strings in the Author field
authors <- function(x) {
  # The Author field is unstructured and there are idiosyncrasies between
  # different packages. The steps here attempt to fix the idiosyncrasies so
  # authors can be extracted with as few errors as possible.
  persons <- x |>
    # Line breaks should be replaced with spaces in case they occur in the
    # middle of a name.
    str_replace_all("\\n|\\<U\\+000a\\>", " ") |>
    # Periods should always have a space after them so initials will be
    # recognized as part of a name.
    str_replace_all("\\.", "\\. ") |>
    # Commas before roles will keep them from being included in the regex.
    str_remove_all(",(?= \\[)") |>
    # Get persons and their roles.
    str_extract_all(paste0(persons_roles, "|", person_objects)) |>
    unlist() |>
    # Multiple spaces can be replaced with a single space for cleaner names.
    str_replace_all("\\s+", " ")

  tibble(person = persons) |>
    mutate(
      roles  = str_extract(person, "\\[(.*?)\\]"),
      person = str_remove(
        str_remove(person, "\\s*\\[(.*?)\\]"),
        "^('|\")|('|\")$" # Some names are wrapped in quotations
      )
    )
}
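
To show what the cleanup steps do before and after the regex runs, here they are applied one at a time to a made-up Author string. This is a base R sketch using gsub() in place of the stringr calls; note that in authors() the whitespace collapse happens after extraction rather than before it.

```r
# Made-up Author field with initials, a comma before the roles, and a line
# break between authors.
raw <- "J.R. Smith, [aut],\n  Ann Lee [ctb]"

step1 <- gsub("\n", " ", raw, fixed = TRUE)         # line breaks -> spaces
step2 <- gsub(".", ". ", step1, fixed = TRUE)       # space after every period
step3 <- gsub(",(?= \\[)", "", step2, perl = TRUE)  # drop commas before roles
step4 <- gsub("\\s+", " ", step3)                   # collapse repeated spaces

step4
#> [1] "J. R. Smith [aut], Ann Lee [ctb]"
```

After these steps each person is a name optionally followed by a bracketed role list, which is the shape the persons_roles regex expects.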

Here’s an example of it with dplyr. If you compare it to the output from authors_r() above you can see the data quality is still good enough for rock ‘n’ roll, but it isn’t perfect; Posit’s roles are no longer defined because the comma in their name cut off the regex before it captured the roles. So there are some edge cases like this that will create measurement error in the person or roles columns, but I don’t think it’s bad enough to invalidate the results.

cran_pkg_db |>
  filter(package == "dplyr") |>
  pull(authors) |>
  stri_trans_general("latin-ascii") |>
  authors()
#> # A tibble: 6 × 2
#>   person          roles     
#>   <chr>           <chr>     
#> 1 Hadley Wickham  [aut, cre]
#> 2 Romain Francois [aut]     
#> 3 Lionel Henry    [aut]     
#> 4 Kirill Muller   [aut]     
#> 5 Davis Vaughan   [aut]     
#> 6 Posit Software  <NA>

Extracting roles

From the example dplyr output above, we can see that the roles column is currently a character string with the role codes, which isn’t super useful. Later on I’ll split these out into indicator columns with a TRUE or FALSE for whether someone had a given role. I also wanted the full names for the roles, since some of the codes aren’t very obvious.

Kurt Hornik, Duncan Murdoch and Achim Zeileis published a nice article in The R Journal explaining the roles of R package authors and where they come from. Briefly, they come from the “Relator and Role” codes and terms from MARC (MAchine-Readable Cataloging, Library of Congress, 2012) here: https://www.loc.gov/marc/relators/relaterm.html.

There are a lot of roles there; I just took the ones that were present in the data at the time I wrote this post.

marc_roles <- c(
  analyst = "anl",
  architecht = "arc",
  artist = "art",
  author = "aut",
  author_in_quotations = "aqt",
  author_of_intro = "aui",
  bibliographic_antecedent = "ant",
  collector = "col",
  compiler = "com",
  conceptor = "ccp",
  conservator = "con",
  consultant = "csl",
  consultant_to_project = "csp",
  contestant_appellant = "cot",
  contractor = "ctr",
  contributor = "ctb",
  copyright_holder = "cph",
  corrector = "crr",
  creator = "cre",
  data_contributor = "dtc",
  degree_supervisor = "dgs",
  editor = "edt",
  funder = "fnd",
  illustrator = "ill",
  inventor = "inv",
  lab_director = "ldr",
  lead = "led",
  metadata_contact = "mdc",
  musician = "mus",
  owner = "own",
  presenter = "pre",
  programmer = "prg",
  project_director = "pdr",
  scientific_advisor = "sad",
  second_party = "spy",
  sponsor = "spn",
  supporting_host = "sht",
  teacher = "tch",
  thesis_advisor = "ths",
  translator = "trl",
  research_team_head = "rth",
  research_team_member = "rtm",
  researcher = "res",
  reviewer = "rev",
  witness = "wit",
  woodcutter = "wdc"
)
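
As a quick illustration of how a named vector like this maps codes back to readable names, here is a base R sketch on a role string. It uses a four-entry subset so the example is self-contained; marc_subset is just an illustrative name.

```r
# Subset of the lookup vector above, repeated so this runs on its own.
marc_subset <- c(
  author = "aut",
  creator = "cre",
  copyright_holder = "cph",
  funder = "fnd"
)

role_string <- "[cph, fnd]"

# Strip the brackets, split on commas, then match codes to their names.
codes <- strsplit(gsub("\\[|\\]", "", role_string), ",\\s*")[[1]]
full_names <- names(marc_subset)[match(codes, marc_subset)]

full_names
#> [1] "copyright_holder" "funder"
```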

Tidying the data

With all the explanations out of the way we can now tidy the data with our helper functions.

cran_authors <- cran_pkg_db |>
  mutate(
    # Letters with accents, etc. should be normalized so that names including
    # them are picked up by the regex.
    across(c(authors, authors_r), \(.x) stri_trans_general(.x, "latin-ascii")),
    # The extraction functions aren't vectorized so they have to be mapped over.
    # This creates a list column.
    persons = if_else(
      is.na(authors_r),
      map(authors, \(.x) authors(.x)),
      map(authors_r, \(.x) authors_r(.x))
    )
  ) |>
  select(-c(authors, authors_r)) |>
  unnest(persons) |>
  # If a package only has one author then they must be the author and creator,
  # so it's safe to impute this when it isn't there.
  group_by(package) |>
  mutate(roles = if_else(
    is.na(roles) & n() == 1, "[aut, cre]", roles
  )) |>
  ungroup()
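
The imputation rule at the end of that pipeline is easy to check on a toy example. Here is a base R sketch of the same logic, assuming a small made-up data frame where only the single-author package is missing its roles.

```r
# Toy data: one single-author package with no roles, one two-author package.
toy_authors <- data.frame(
  package = c("A3", "abc", "abc"),
  person = c("Scott Fortmann-Roe", "Csillery Katalin", "Lemaire Louisiane"),
  roles = c(NA, "[aut]", NA)
)

# Number of listed persons per package, repeated for each row.
n_per_pkg <- ave(seq_along(toy_authors$package), toy_authors$package, FUN = length)

# Only impute when the roles are missing AND the package has a single person.
toy_authors$roles <- ifelse(
  is.na(toy_authors$roles) & n_per_pkg == 1,
  "[aut, cre]",
  toy_authors$roles
)

toy_authors$roles
#> [1] "[aut, cre]" "[aut]"      NA
```

The missing roles in the two-author package stay missing, since there's no safe way to guess who did what.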

Then add the indicator columns for roles. Note the use of the walrus operator (:=) here to create new columns from the full names of MARC roles on the left side of the walrus, while detecting the MARC codes with str_detect() on the right side. I’m mapping over this because the left side can’t be a vector.

cran_authors_tidy <- cran_authors |>
  # Add indicator columns for all roles.
  bind_cols(
    map2_dfc(
      names(marc_roles), marc_roles,
      function(.x, .y) {
        cran_authors |>
          mutate(!!.x := str_detect(roles, .y)) |>
          select(!!.x)
      }
    )
  ) |>
  # Not everyone's role is known.
  mutate(unknown = is.na(roles))
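
For comparison, here is a base R sketch that builds the same kind of indicator columns without := or mapping over data frames. It isn't what I used above, and it differs in one way worth knowing about: grepl() returns FALSE for missing roles, whereas str_detect() returns NA.

```r
roles <- c("[aut, cre]", "[ctb]", NA)
role_codes <- c(author = "aut", creator = "cre", contributor = "ctb")

# One logical column per role; column names come from names(role_codes).
indicators <- vapply(
  role_codes,
  function(code) grepl(code, roles, fixed = TRUE),
  logical(length(roles))
)

as.data.frame(indicators)
#>   author creator contributor
#> 1   TRUE    TRUE       FALSE
#> 2  FALSE   FALSE        TRUE
#> 3  FALSE   FALSE       FALSE
```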

This all leaves us with a tidy (mostly error-free) data frame about R developers and their roles that is ready to explore:

glimpse(cran_authors_tidy)
#> Rows: 52,719
#> Columns: 50
#> $ package                  <chr> "A3", "AalenJohansen", "AalenJohansen", "AATt…
#> $ person                   <chr> "Scott Fortmann-Roe", "Martin Bladt", "Christ…
#> $ roles                    <chr> "[aut, cre]", "[aut, cre]", "[aut]", "[aut, c…
#> $ analyst                  <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
#> $ architecht               <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
#> $ artist                   <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
#> $ author                   <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRU…
#> $ author_in_quotations     <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
#> $ author_of_intro          <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
#> $ bibliographic_antecedent <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
#> $ collector                <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
#> $ compiler                 <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
#> $ conceptor                <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
#> $ conservator              <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
#> $ consultant               <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
#> $ consultant_to_project    <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
#> $ contestant_appellant     <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
#> $ contractor               <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
#> $ contributor              <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
#> $ copyright_holder         <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
#> $ corrector                <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
#> $ creator                  <lgl> TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE, FA…
#> $ data_contributor         <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
#> $ degree_supervisor        <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
#> $ editor                   <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
#> $ funder                   <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
#> $ illustrator              <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
#> $ inventor                 <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
#> $ lab_director             <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
#> $ lead                     <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
#> $ metadata_contact         <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
#> $ musician                 <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
#> $ owner                    <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
#> $ presenter                <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
#> $ programmer               <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
#> $ project_director         <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
#> $ scientific_advisor       <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
#> $ second_party             <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
#> $ sponsor                  <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
#> $ supporting_host          <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
#> $ teacher                  <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
#> $ thesis_advisor           <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
#> $ translator               <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
#> $ research_team_head       <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
#> $ research_team_member     <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
#> $ researcher               <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
#> $ reviewer                 <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
#> $ witness                  <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
#> $ woodcutter               <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
#> $ unknown                  <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…

R developer statistics

I’ll start with person-level stats, mainly because some of the other stats are further summaries of these statistics. Nothing fancy here, just the number of packages a person has contributed to, role counts, and nominal and percentile rankings. Both the ranking methods used here give every tie the same (smallest) value, so if two people tied for second place both their ranks would be 2, and the next person’s rank would be 4.

cran_author_pkg_counts <- cran_authors_tidy |>
  group_by(person) |>
  summarise(
    n_packages = n(),
    across(analyst:unknown, function(.x) sum(.x, na.rm = TRUE))
  ) |>
  mutate(
    # Discretizing this for visualization purposes later on
    n_pkgs_fct = case_when(
      n_packages == 1 ~ "One",
      n_packages == 2 ~ "Two",
      n_packages == 3 ~ "Three",
      n_packages >= 4 ~ "Four+"
    ),
    n_pkgs_fct = factor(n_pkgs_fct, levels = c("One", "Two", "Three", "Four+")),
    rank = min_rank(desc(n_packages)),
    percentile = percent_rank(n_packages) * 100,
    .after = n_packages
  ) |>
  arrange(desc(n_packages))
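
The tie handling is easiest to see on a toy vector. This base R sketch reproduces what min_rank(desc(x)) and percent_rank(x) compute, assuming dplyr's documented definition of percent_rank() as (min_rank(x) - 1) / (n - 1).

```r
# Made-up package counts for four people; the middle two are tied.
n_packages <- c(10, 7, 7, 3)

# Equivalent of min_rank(desc(n_packages)): ties share the smallest rank.
rank_desc <- rank(-n_packages, ties.method = "min")

# Equivalent of percent_rank(n_packages) * 100.
percentile <- (rank(n_packages, ties.method = "min") - 1) /
  (length(n_packages) - 1) * 100

rank_desc
#> [1] 1 2 2 4
round(percentile, 1)
#> [1] 100.0  33.3  33.3   0.0
```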

Here’s an interactive gt table of the person-level stats so you can find yourself, or ask silly questions like how many other authors share a name with you. If you page or search through it you can also get an idea of the data quality (e.g., try “Posit” under the person column and you’ll see that they don’t use a consistent organization name across all packages, which creates some measurement error here).

cran_author_pkg_counts |>
  select(-n_pkgs_fct) |>
  gt() |>
  tab_header(
    title = "R Developer Contributions",
    subtitle = "CRAN Package Authorships and Roles"
  ) |>
  text_transform(
    \(.x) str_to_title(str_replace_all(.x, "_", " ")),
    locations = cells_column_labels()
  ) |>
  fmt_number(
    columns = percentile
  ) |>
  fmt(
    columns = rank,
    fns = \(.x) label_ordinal()(.x)
  ) |>
  cols_width(everything() ~ px(120)) |>
  opt_interactive(use_sorting = FALSE, use_filters = TRUE)
R Developer Contributions
CRAN Package Authorships and Roles