Exploring R package contributor data on GitHub with the {tidyverse}, {gh}, and {tidygraph}.
.Wrangle
.Visualize
{tidyverse}
{gh}
{gt}
{ggrepel}
{scales}
{igraph}
{tidygraph}
{ggraph}
{graphlayouts}
Author
Michael McCarthy
Published
May 10, 2023
Overview
In my previous post I explored R package developer statistics on CRAN, focusing on developers with package authorship. However, because many of the packages published on CRAN are hosted on open source repositories (like GitHub), many R packages have additional contributors who do not appear in the list of authors.
@mccarthymg Very interesting (though not surprising)!
One caveat is that most big packages have many more contributors than declared authors, because people often only submit a handful of patches, and are consequently not added you the authors list (to be clear, I don't think that's a problem; just something to keep in mind).
Although this information isn’t available through CRAN, GitHub has a REST API that we can use to get data on the contributors of R packages hosted on GitHub. We can use this to answer new questions about R package contributions by the wider R community.
Again, if you just want to see the stats, you can skip to the R contributor statistics section. Otherwise follow along to see how I retrieved and wrangled the data into a usable state. I’d also like to give a big thanks to everyone who engaged with me on my previous post (and some of the stuff in-between); you all inspired this post and some of the approaches I chose to take below.
I’ll be using the CRAN package repository data returned by tools::CRAN_package_db() again in this post, this time to get the following DESCRIPTION fields from each package: BugReports and URL. Each of these fields can contain the URL for a package’s GitHub repository, so I’m taking both for better coverage.
Since this data will change over time, here’s when tools::CRAN_package_db() was run for reference: 2023-05-06.
I’ve also selected the Author and Authors@R fields again so I can rerun the author extraction code from the previous post. This will let us ask more interesting questions later on. I just need package and author names, so the code is slightly different from the previous post.
I’ve hidden the code since there’s no need to reexplain it—all that matters is that this returns a two column data frame called cran_authors, with columns package and person.
Code
# Helper functions ------------------------------------------------------------# Get names from "person" objects in the Authors@R fieldauthors_r <-function(x) { code <-str_replace_all(x, "\\<U\\+000a\\>", "\n") persons <-eval(parse(text = code)) person <-str_trim(format(persons, include =c("given", "family")))tibble(person = person)}# Get names from character strings in the Author fieldpersons_roles <- r"((\'|\")*[A-Z]([A-Z]+|(\'[A-Z])?[a-z]+|\.)(?:(\s+|\-)[A-Z]([a-z]+|\.?))*(?:(\'?\s+|\-)[a-z][a-z\-]+){0,2}(\s+|\-)[A-Z](\'?[A-Za-z]+(\'[A-Za-z]+)?|\.)(?:(\s+|\-)[A-Za-z]([a-z]+|\.))*(\'|\")*(?:\s*\[(.*?)\])?)"person_objects <- r"(person\((.*?)\))"authors <-function(x) { persons <- x |>str_replace_all("\\n|\\<U\\+000a\\>|\\n(?=[:upper:])", " ") |>str_replace_all("\\.", "\\. ") |>str_remove_all(",(?= \\[)") |>str_extract_all(paste0(persons_roles, "|", person_objects)) |>unlist() |>str_replace_all("\\s+", " ")tibble(person = persons) |>mutate(person =str_remove(str_remove(person, "\\s*\\[(.*?)\\]"),"^('|\")|('|\")$" ) )}# Get data used in the previous post ------------------------------------------cran_authors <- cran_pkgs_db |>select(package, authors, authors_r) |>as_tibble() |>mutate(across(c(authors, authors_r), \(.x) stri_trans_general(.x, "latin-ascii")),persons =if_else(is.na(authors_r),map(authors, \(.x) authors(.x)),map(authors_r, \(.x) authors_r(.x)) ) ) |>select(-c(authors, authors_r)) |>unnest(persons) |># Some unicode characters from authors_r were sneaking through despite the# first normalization call. Also making everything title case to improve# joining results later.mutate(person =str_to_title(stri_trans_general(person, "latin-ascii")))
Wrangle
I’ll be using the gh package to query GitHub’s REST API. However, before any queries can be done, I need to figure out which packages in cran_pkgs_db have GitHub repositories and what their URLs are.
Finding the packages with GitHub repositories is simple—it’s just a matter of filtering cran_pkgs_db down to the subset of packages with GitHub’s base URL.
Of the 19483 packages on CRAN, 9182 have a GitHub URL in the BugReports or URL field.
Next step: get the repository URLs. Again, regular expressions are our friend. This one is a simple search for the owner and repository name following the GitHub base URL in the form of owner/repo, following GitHub’s rules for the allowable characters in user and repository names. After getting that it’s just a matter of tidying the data up a bit more to make it query-ready.
owner_repo <- r"((?<=https:\/\/github.com\/)([A-Za-z0-9\-]{1,39}\/[A-Za-z0-9\-\_\.]+))"cran_gh_repos <- cran_gh_repos |>mutate(# Get the owner and repo name from a GitHub URL in the form: owner/repoacross(c(bug_reports, url), \(.x) str_extract(.x, owner_repo)),repo_url =case_when(# There shouldn't be a need to check for both columns being NA, since# at least one column should have the info we want due to the filter()# call at the start of this pipe.is.na(bug_reports) &!is.na(url) ~ url,!is.na(bug_reports) &is.na(url) ~ bug_reports, bug_reports == url ~ bug_reports ),# We just need everything before or after the / to get the owner and repo.owner =str_extract(repo_url, ".+(?=\\/)"),repo =str_extract(repo_url, "(?<=\\/).+") ) |>select(package, owner, repo) |># Despite the previous comment, some packages only include their username# or organization in the URL, which we can't make use of and which return# NA. These should be omitted. There aren't that many (< 100).na.omit()cran_gh_repos
The GitHub REST API has a rate limit of 5000 requests per hour for authenticated users, so the repositories have to be queried in chunks instead of doing them all in one go. To accomodate this, cran_gh_repos can be split into a list of data frames that can be mapped over when calling the API. I’ve chosen groups of 4000 here just to avoid accidentally hitting the limit.
n_per_group <-4000n_groups <-ceiling(nrow(cran_gh_repos)/n_per_group)query_groups <- cran_gh_repos |>mutate(# The result of the rep() call needs to be subset to the max number of# observations to avoid a recycling error with vctrs.query_group =paste0("group_", rep(1:n_groups, each = n_per_group)[1:nrow(cran_gh_repos)] ) ) |>group_by(query_group) |>nest() |>deframe()query_groups
We’re now ready to get the contributors for each package. The code here is based on the code used to thank contributors in the R for Data Science book, so if you want to understand the general idea I’d recommend running that code. Otherwise this is just an extension of that code, using list columns, rectangling, and error handling to make it work over multiple repositories.
contributors_all <- query_groups |>map_dfr( \(query_group) {# All packages will be queried within the group one-by-one. We just# need package names, usernames, and commit numbers; everything else# can be dropped. query_result <- query_group |>rowwise() |>mutate(contributors_json =list(# Some package repos don't exist anymore (404), which triggers an# error. This needs to be handled in order for the rest of the# queries to continue running.tryCatch(expr = { gh::gh("/repos/:owner/:repo/contributors",owner = owner,repo = repo,.limit =Inf ) },error = \(unused_argument) {# For consistency with `gh()`, return a list with the keys we# want but with NA values.list(list(login =NA_character_, contributions =NA_integer_)) } ) ),contributors =list(tibble(login =map_chr(contributors_json, "login"),n_commits =map_int(contributors_json, "contributions") )) ) |>select(-c(owner, repo, contributors_json)) |>unnest(contributors)# Once a group is done, rest for an hour to avoid the rate limit.Sys.sleep(3600) query_result } )contributors_all
As you can see from the output above, contributors are returned as logins rather than names; so the API will need to be queried a second time to get the names corresponding to each login. However, to avoid redundancy and reduce the total number of queries, we should only take the unique login names from the contributor data.
# We only want real people, not bots or other automated tools. Most bots have a# common naming scheme for their login, but there are exceptions. May as well# drop the NA entries here too, since it's convenient.contributors_all <- contributors_all |>filter(!( login %in%c("actions-user","cran-robot","fossabot","ImgBotApp","traviscibot",NA_character_ ) |str_detect(login, "\\[bot\\]$|\\-bot$") ) )contributors_unique <-tibble(login =unique(contributors_all$login))contributors_unique
We also need to make query groups for contributors_unique to avoid hitting the rate limit, since logins need to be queried one-by-one.
# We need to make query groups here too to avoid hitting the rate limit.n_per_group <-4000n_groups <-ceiling(nrow(contributors_unique)/n_per_group)query_groups <- contributors_unique |>mutate(query_group =paste0("group_", rep(1:n_groups, each = n_per_group)[1:nrow(contributors_unique)] ) ) |>group_by(query_group) |>nest() |>deframe()query_groups
Again, this query is just a generalization of the R for Data Science contributors code. Here we’re getting users’ real names from their login names, then tidying up.
contributors_tidy <- query_groups |>map_dfr( \(query_group) {# All logins will be queried within the group one-by-one. query_result <- query_group |>rowwise() |>mutate(users_json =list(# Some user login names don't exist anymore (404), which triggers# an error. This needs to be handled in order for the rest of the# queries to continue running.tryCatch(expr = { gh::gh("/users/:username", username = login) },error = \(unused_argument) {# For consistency with `gh()`, return a list with the keys we# want but with NA values.list(list(login = login, name =NA_character_)) } ) ),# Some users don't have a name filled out. To keep it from being# empty, use their login name instead.name =pluck(users_json, "name", .default = login) ) |># Rowwise evaluation is no longer needed and I don't want active after# this code chunk.ungroup() |># Letters with accents, etc. should be normalized so that they match# the normalized names in `cran_authors` from the previous post.mutate(name =str_trim(stri_trans_general(name, "latin-ascii"))) |>select(name, login)# Once a group is done, rest for an hour to avoid the rate limit.# Since this is our last query, there's a redundant hour of waiting# after the last query group is finished. This is easy to fix, but I'm# only running this code once *ever*, so it's even easier not to.Sys.sleep(3600) query_result } )contributors_tidy
#> # A tibble: 15,318 × 2
#> name login
#> <chr> <chr>
#> 1 Spiritspeak Spiritspeak
#> 2 Sigbert Klinke sigbertklinke
#> 3 Wenjie Wang wenjie2wang
#> 4 Dennis Prangle dennisprangle
#> 5 Kyle Bittinger kylebittinger
#> 6 Jin Zhu Mamba413
#> 7 Junhao Huang oooo26
#> 8 Jiang-Kangkang Jiang-Kangkang
#> 9 Liyuan Hu zaza0209
#> 10 bbayukari bbayukari
#> # … with 15,308 more rows
Tidying GitHub query results
With the names associated with each login in contributors_tidy, I can add those to contributors_all.
contributors_all <- contributors_all |>left_join(contributors_tidy) |>relocate(name, .before = login) |># Make all names title case to improve the success rates of joining names# with `cran_authors` in the next code chunk.mutate(name =str_to_title(name))contributors_all
Even better, since we know the authors of each package from the work I did in the previous post, we can identify which contributors have authorship in the DESCRIPTION of each package. This one is pretty simple: Just use filtering joins to get the subset of contributors with or without a matching package-name combination in cran_authors (using semi_join() and anti_join(), respectively), add an indicator column for package authorship, then join the two data frames back together.
As I discussed in the previous post, there is some degree of measurement error here—so at the person level there will be some errors, but they aren’t bad enough to prevent us from getting an idea of how things are at the population level. There’s also a small bit of new measurement error introduced here, because not every login has a name associated with it, and some people use different names on GitHub and in the DESCRIPTION of the R packages they’re authors on.
contributors_authors <- contributors_all |>semi_join(cran_authors, by =join_by(package, name == person)) |>mutate(package_author =TRUE)contributors_contributors <- contributors_all |>anti_join(cran_authors, by =join_by(package, name == person)) |>mutate(package_author =FALSE)contributors_all <- contributors_authors |>bind_rows(contributors_contributors) |>arrange(package) |># If a package has a single contributor it's reasonable to assume they have# authorship of the package. This fixes some misses in the joining method# where people use different names on GitHub and in the `DESCRIPTION`.group_by(package) |>mutate(package_author =if (n() ==1) TRUEelse package_author) |>ungroup()contributors_all
Going full circle, we can then update contributors_tidy to include some summary statistics for each contributor, including the number of packages they have contributed to that they have authorship on.
Finally, we can make a graph from contributors_all to look at collaboration networks between packages and people. This is fairly straightforward to do since contributors_all is already structured as a bipartite graph, where its nodes are divided into two disjoint and independent sets—in this case packages and people. A bipartite projection can then be applied to this graph to get (1) a network of people that have contributed to the same packages, and (2) a network of packages that have contributors in common.
Note that both these networks are undirected. For the packages network this makes sense, but the contributors network could theoretically be turned into a directed network based on the authorship status of a contributor. I’ve decided against doing that for this post as it would make the contributors network—which is already large and complex—even larger and more complex.
contributors_bipartite <- contributors_all |># The next three steps turn `contributors_all` into a bipartite graph, then# the fourth applies the bipartite projection. This returns a list of two# graphs, hence the use of `map()` in the fifth step.select(package, login) |>table() |>graph_from_incidence_matrix() |>bipartite_projection() |># The `as_tbl_graph()` function comes from the tidygraph package, which I# explain below.map(as_tbl_graph, directed =FALSE)contributors_bipartite
#> $proj1
#> # A tbl_graph: 8972 nodes and 259674 edges
#> #
#> # An undirected simple graph with 1936 components
#> #
#> # Node Data: 8,972 × 1 (active)
#> name
#> <chr>
#> 1 AATtools
#> 2 abbreviate
#> 3 abclass
#> 4 abctools
#> 5 abdiv
#> 6 abess
#> # … with 8,966 more rows
#> #
#> # Edge Data: 259,674 × 3
#> from to weight
#> <int> <int> <dbl>
#> 1 2 165 1
#> 2 2 5601 1
#> 3 2 6844 1
#> # … with 259,671 more rows
#>
#> $proj2
#> # A tbl_graph: 15318 nodes and 714831 edges
#> #
#> # An undirected simple graph with 1936 components
#> #
#> # Node Data: 15,318 × 1 (active)
#> name
#> <chr>
#> 1 001ben
#> 2 0liver0815
#> 3 0tertra
#> 4 0UmfHxcvx5J7JoaOhFSs5mncnisTJJ6q
#> 5 0x00b1
#> 6 0x26res
#> # … with 15,312 more rows
#> #
#> # Edge Data: 714,831 × 3
#> from to weight
#> <int> <int> <dbl>
#> 1 1 455 1
#> 2 1 652 1
#> 3 1 860 1
#> # … with 714,828 more rows
Some of the summary data from before can also be added to the nodes of the projections. Here I’m taking advantage of the tidygraph package so this can be done using the familiar functions from the dplyr package. The workflow is exactly the same, except that the activate() function needs to be called first to declare whether subsequent functions should be applied to the nodes or edges dataframe.
First the contributors network:
contributors_graph <- contributors_bipartite$proj2 |>activate(nodes) |>rename(login = name) |>left_join(contributors_tidy) |>select(-login) |># Get some descriptive measures of the nodes for use later.mutate(degree =centrality_degree(),strength =centrality_degree(weights = weight),neighbours =local_ave_degree(weights = weight),component =group_components() )contributors_graph
Phew, that was… a lot. A lot of wrangling. To summarize, we now have four different objects that cover different aspects about R package contributions on GitHub:
contributors_tidy: A data frame with summary statistics for each contributor.
packages_tidy: A data frame with summary statistics for each package.
packages_graph: A graph of packages that have contributors in common.
contributors_graph: A graph of people that have contributed to the same packages.
Like in the previous post, it’s worth reiterating that there is some degree of measurement error here. It’s going to vary by package and person; some of it can be attributable to my code not dealing with certain edge cases well, but other errors are just inherent to doing this analysis in an programmatic way.
For example, if we look at which individuals haven’t authored an R package on CRAN, we immediately see three obvious false negatives with Jenny, Daniel, and Will, who are all prolific authors who have most certainly published on CRAN, and whose packages I use regularly. The reason they all have zero authorships is because they use different names on GitHub and in the DESCRIPTION of the R packages they’re authors on.
filter(contributors_tidy, n_authorships ==0)
#> # A tibble: 9,485 × 5
#> name login n_packages n_commits n_authorships
#> <chr> <chr> <int> <int> <int>
#> 1 Jennifer (Jenny) Bryan jennybc 87 4877 0
#> 2 Salim B salim-b 65 136 0
#> 3 Daniel strengejacke 28 16006 0
#> 4 The Gitter Badger gitter-badger 22 23 0
#> 5 Eitsupi eitsupi 19 225 0
#> 6 Max Held maxheld83 19 159 0
#> 7 Steven (Siwei) Ye stevenysw 19 65 0
#> 8 Bgoodri bgoodri 18 2417 0
#> 9 Will Landau wlandau-lilly 16 3341 0
#> 10 Jonathan jmcphers 16 917 0
#> # … with 9,475 more rows
Given that almost two-thirds of contributors have no authorships, and there’s no way of knowing what proportion of these are true positives or false negatives, I wouldn’t take any stats about authorship too seriously at the individual or population level. I’ve kept these stats in the post because I think they make for a nice demonstration of taking data from different sources to answer new questions; but they’re obviously flawed. Measurement errors like this can be dealt with, but they require manual inspection, which has a high time cost that I wasn’t prepared to pay for this post.
It’s also worth a reminder that not every R package on CRAN has a GitHub repository associated with it. Here’s a quick survey of the difference in population sizes. About half the R packages on CRAN have a GitHub repository, and there are about half as many GitHub contributors as there are CRAN authors.
It’s hard to say exactly how much overlap there is between the people in the GitHub contributors group and the CRAN authors group (given the discussion above), but I’d guess it’s fairly high. For certain, about one-third of contributors are CRAN authors:
Anyways, keeping all that in mind, here’s my exploration of R package contributions in the wider R community.
Contributor statistics
Similar to my previous post, we can look at some basic person-level stats—this time the number of R packages a person has contributed to and their total commits across packages. Funnily enough, the distribution for the number of R packages a person has contributed to is very similar to the distribution of CRAN package authorships we saw in my previous post. Thus, around 90% have contributed to three packages, and only around 10% have contributed to four or more packages.
For both these subgroups, we can also see that the majority of people in them do not have any CRAN package authorships; but that as the number of commits increases so does the proportion of people who have CRAN package authorship. Moreover, from the histogram below we see that almost 25% have only made a single commit.
ggplot(contributors_tidy, aes(x = n_commits, fill = n_authorships >0)) +geom_histogram() +scale_x_log10() +scale_y_continuous(sec.axis =sec_axis(trans = \(.x) .x /nrow(contributors_tidy),name ="Percent of sample",labels =label_percent() ) ) +labs(x ="Commits across packages",y ="People",fill ="CRAN authorship" )
The number of R packages a person has contributed to and their total commits across packages are obviously related (e.g., every new package a person contributes to is a new commit), so it makes sense to look at these together too. To avoid overplotting, I’ve used a hex bin plot instead of a scatterplot; and to give an idea of the distributions for each variable I’ve added a jittered rug to both axes. There’s a high amount of variability here, but around 76% of people have made 100 or less commits across one to three packages.
contributors_tidy |>ggplot(aes(x = n_packages, y = n_commits)) +annotate("text",x =32,y =8,hjust =0,# I calculated this value using the same methods as above, but hard coded# it here just to avoid some ugly code.label ="Around 76% of people\nare in the shaded region." ) +annotate("curve",x =64, xend =3.5,y =3, yend =1.5,curvature =-0.1,arrow =arrow(type ="closed") ) +annotate("rect",xmin =0.9, xmax =3.45,ymin =0.8, ymax =100,alpha =0.5 ) +# Note that geom_hex was borked in ggplot2 v3.4.0, so make sure to have# v3.4.1 or later if you use this in your own projects. See:# https://ggplot2.tidyverse.org/news/index.html#bug-fixes-3-4-1geom_hex(bins =30) +geom_rug(alpha =0.01, position =position_jitter(width = .1, height = .1) ) +scale_x_continuous(trans ="log2", breaks =breaks_log(n =7, base =2) ) +scale_y_continuous(trans ="log10" ) +scale_fill_viridis_c(trans ="log10" ) +labs(x ="Packages contributed to",y ="Commits across packages",fill ="Count" ) +theme(panel.grid.minor =element_blank())
The previous plot used log scales to avoid almost all the data being compressed in the bottom-left corner. However, this makes the true scale of the data less visually intuitive, so here’s a scatterplot of this data without any scaling. I’ve shaded the 76% of people region here too (this time in yellow; it’s very small in the bottom-left corner), which makes it clear that the number of package contributions and commits made by a typical contributor are orders of magnitude smaller than those of the most prolific contributors.
I’ve also labelled some of the most prolific people and added the number of CRAN package authorships each person has to the colour scale (since it wasn’t be used otherwise). For the authorships, I’ve coloured them grey if the number of authorships was zero. I also stopped the highest bin at 30 or more authorships, since that seemed like a good cut point for this data.
contributors_tidy |>mutate(n_authorships =if_else(n_authorships ==0, NA_integer_, n_authorships)) |>ggplot(aes(x = n_packages, y = n_commits, colour = n_authorships)) +geom_point(size =0.5) +geom_text_repel(aes(label = name),data =filter(contributors_tidy, n_packages >=100| n_commits >=10000),min.segment.length =0,max.overlaps =Inf ) +annotate("rect",xmin =1, xmax =3,ymin =1, ymax =100,fill ="#FDE725", colour ="#FDE725" ) +# This is a slightly hack-ish way to provide a manual fill to a binned# colour scale: https://stackoverflow.com/a/67273672/16844576.binned_scale(aesthetics ="colour",scale_name ="coloursteps",palette =function(x) pal_ramp(viridis_palettes$viridis, 6),na.value ="darkgrey",limits =c(1, max(contributors_tidy$n_authorships)),breaks =c(1, 3, 10, 20, 30),guide ="coloursteps" ) +labs(x ="Packages contributed to",y ="Commits across packages",colour ="Package\nauthorships" )
Just as we found last time with CRAN package authorships, the vast range in package contributions and commits is mostly driven by people who do R package development as part of their job, which can be seen by browsing through the top entries in the interactive gt table below.
The graph of 15318 contributors that have contributed to the same packages is disconnected, meaning that it can be broken down into a number of subgraphs (called components) that are completely isolated from one another. This is the opposite of a connected graph, where every node is reachable from every other node.
is_connected(contributors_graph)
#> [1] FALSE
A census of the components shows that the contributor collaboration network is dominated by one giant component that contains around 80% of the contributors in the network, which indicates that the vast majority of people can be reached through common package contributions. However, around 8% of the nodes in the network consist of single contributors that don’t share common package contributions with any other contributors, meaning that there’s also a decent chunk of solo package authors who haven’t published or contributed to any other CRAN packages, or had anyone contribute to their package.
Given that the network is dominated by one giant component, and the next largest set of components don’t have any connections at all, it makes sense to focus on the giant component.
# Components are ordered by size, so the largest component is the first one.contributors_graph <-filter(contributors_graph, component ==1)
Here’s a plot of the giant component. It looks like a big hairball, which makes it hard to fully interpret, but we can at least see some structure. For example, it’s clear that not all contributors are connected, that there is a wide range in the number of common package contributions that exist between people (based on the node sizes here), and that there are possibly some natural communities of people who tend to contribute to the same packages.
contributors_graph |>ggraph() +# It's important to use geom_edge_link0() instead of geom_edge_link();# otherwise plotting takes forever since the latter draws 100 points# along each edge so it they be used make edges with gradients.geom_edge_link0(aes(edge_width = weight),colour ="grey",alpha =0.5 ) +geom_node_point(aes(size = strength),fill ="white",shape =21 ) +theme_graph() +theme(legend.position ="none")
We can use descriptive statistics to augment our understanding of the graph beyond what can be learned by plotting it.
First let’s look at the distributions of degree and strength. Here the degree of each node represents the total number of neighbours a contributor has; and the strength of each node represents the total number of common packages a person and their neighbours have contributed to.
The (log) degree range is quite wide, and it makes it clear that although most contributors have relatively few neighbours, there is also a segment that have a lot of neighbours.
To augment the strength distribution above, the scatterplot below shows the strength of each node plotted against the average strength of the neighbours of each node. This plot suggests that there is a tendency for contributors with higher strengths to link with similar contributors, while contributors with lower strengths tend to link with contributors of both lower and higher strengths.
Another way to make the giant component easier to interpret is to to try to detect communities of contributors who have contributed to the same packages, then contract and simplify the graph so that each node summarizes a community and each edge summarizes community interactions. There are a lot of algorithms available to do this; here I’ve chosen the Louvain algorithm, which was created in part to help visualize the structure of large networks using this “communities as nodes” approach.
#> # A tbl_graph: 139 nodes and 269 edges
#> #
#> # An undirected simple graph with 1 component
#> #
#> # Node Data: 139 × 2 (active)
#> community n_nodes
#> <int> <int>
#> 1 1 5531
#> 2 12 256
#> 3 28 26
#> 4 6 401
#> 5 2 843
#> 6 17 117
#> # … with 133 more rows
#> #
#> # Edge Data: 269 × 3
#> from to weight
#> <int> <int> <int>
#> 1 1 2 953
#> 2 1 3 28
#> 3 1 4 9088
#> # … with 266 more rows
This is now much easier to visualize. Here we can see there is one giant community with thousands of contributors in it that has connections to every other community, several large communities with hundreds of contributors in them that have relatively high connections with the giant community, and many more communities with less than 100 contributors whose members have relatively low amounts of common package contributions with other communities.
The summary statistics for packages are similar to those for people, so I’ll follow a similar exploration workflow here. Starting with the number of contributors to the 8972 packages with GitHub repositories, it’s apparent that around 45% of open source CRAN packages are single author packages with no contributors, and that around 82% of packages have one to four contributors.
It’s also apparent that as the number of contributors increases, the proportion of packages whose only contributors are their authors decreases. This isn’t surprising at all—but I will emphasize that the exact proportions shown here aren’t accurate due to the measurement error between people’s names on GitHub and DESCRIPTION I discussed earlier. I’ve added a tolerance of one to try to account for this, so the proportions here slightly favour false positives over false negatives, but to what extent it’s hard to say.
The number of contributors an R package has and the total commits to that package are obviously related (e.g., every new contributor is a new commit), so it makes sense to look at these together too. I’ve used a hex bin plot again to avoid overplotting, with jittered rugs to give an idea of the distributions for each variable. This one’s particularly aesthetically pleasing, with two salient points of interest: First, package commits follow a log-normal distribution; and second, although there’s a high amount of variability, around 75% of packages have between one and four contributors, with 10 to 1000 commits shared between them.
packages_tidy |># Around 35 packages have 0 commits, so they shouldn't be included here.# I tried following up on these and some of them led to 404 pages, which is# weird because those should have been skipped when getting data from the# GitHub REST API. Some others seemed to be caused by people changing their# usernames. filter(n_commits !=0) |>ggplot(aes(x = n_contributors, y = n_commits)) +annotate("text",x =32,y =8,hjust =0,label ="Around 75% of people\nare in the shaded region." ) +annotate("curve",x =64, xend =4,y =3, yend =6,curvature =-0.35,arrow =arrow(type ="closed") ) +annotate("rect",xmin =0.9, xmax =4,ymin =10, ymax =1000,alpha =0.5 ) +geom_hex(bins =30) +geom_rug(alpha =0.01, position =position_jitter(width = .1, height = .1) ) +scale_x_continuous(trans ="log2", breaks =breaks_log(n =7, base =2) ) +scale_y_continuous(trans ="log10") +scale_fill_viridis_c(trans ="log2") +labs(x ="Contributors",y ="Commits",fill ="Count" ) +theme(panel.grid.minor =element_blank())
And here’s the same thing as a scatterplot on the original scale. Unsurprisingly, the packages with the most activity here are largely made by Posit, with a few community favourites as well like brms. Notably, the extreme outliers here—like duckdb and xgboost—are so far apart from everything else because they are cross-language tools that use the monorepo philosophy for project management, so their GitHub activity goes beyond the confines of the R community.