Overview of forwards package

The forwards package provides anonymized data from surveys conducted by Forwards, the R Foundation task force on women and other under-represented groups. The package currently contains a single data set, useR2016, with results from a survey of participants at the useR! 2016 conference. The questions and form of responses are described in the help file (?useR2016). This vignette provides a few examples of how to obtain results equivalent to those presented in reports on this survey (note that in most cases it is not possible to reproduce the exact results, due to the aggregation necessary to protect respondents’ privacy).

This vignette uses the following packages:

library(dplyr)
library(FactoMineR)
library(forcats)
library(ggplot2)
library(knitr)
library(likert)
library(tidyr)
library(forwards)

Descriptive statistics

Q7 gives the highest education level of the respondents. We can cross-tabulate this by gender (Q2) as follows:

tab <- with(useR2016,
            prop.table(table(Q7, Q2), margin = 2))
kable(tab*100, digits = 1)
|                       |  Men| Non-Binary/Unknown| Women|
|:----------------------|----:|------------------:|-----:|
|Doctorate/Professional | 48.6|                NaN|  43.6|
|Masters or lower       | 51.4|                NaN|  56.4|

The results are missing for the non-binary/unknown group because all demographic variables have been suppressed for these individuals. The education levels have been aggregated into “Doctorate/Professional” and “Masters or lower”. This gives two groups of roughly similar size; in addition, the Doctorate and Professional qualification groups were observed to be separated from the lower education groups (Master, Undergraduate, and High School or lower) in the multivariate analyses (see useR! 2016 participants and R programming: a multivariate analysis and useR! 2016 participants and the R community: a multivariate analysis). Even with this heavy aggregation, we can still observe the high proportion of people with advanced qualifications and the tendency for men to have higher qualifications than women (as noted in our reports, women attendees were generally younger than men). For more discussion of the respondent demographics, see the blog post mapping useRs.
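The NaN entries in the table above follow from how prop.table() works: with margin = 2 each cell is divided by its column total, and a column with no observed responses has a total of zero. A minimal sketch on made-up data (the toy data frame and its values are hypothetical, not taken from useR2016) illustrates this:

```r
# Hypothetical toy data: the education answer is missing for the "Unknown" group
toy <- data.frame(
  edu = factor(c("Doctorate", "Masters", "Doctorate", "Masters", NA),
               levels = c("Doctorate", "Masters")),
  gender = factor(c("Men", "Men", "Women", "Women", "Unknown"),
                  levels = c("Men", "Unknown", "Women"))
)

# table() drops NA values by default, so the "Unknown" column holds only zeros
counts <- with(toy, table(edu, gender))

# margin = 2 divides each cell by its column total; 0 / 0 evaluates to NaN
tab <- prop.table(counts, margin = 2)
tab
```

Columns with at least one response sum to 1, while the empty column is NaN throughout.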

Q15 asked for respondents’ opinions on several statements about R. The following code collects these responses and shows the percentage in each opinion category for each statement:

ldat <- likert(useR2016[c("Q15", "Q15_B", "Q15_C", "Q15_D")])
plot(ldat) +
    scale_x_discrete(labels = 
                       rev(c("fun", "considered cool/interesting\n by my peers",
                             "difficult", "monotonous task"))) +
    ggtitle("useR! 2016 attendees' opinions on writing R")

This plot was presented in the blog post useRs relationship with R, which covers all the programming-related questions in the survey.

Q24 asked respondents whether certain options would make them more likely to participate in the R community, or improve their experience. The following code gathers all the responses together and summarizes the percentage selecting each category, for men and women separately.

dat <- useR2016 %>%
    filter(Q2 %in% c("Men", "Women")) %>%
    select(Q2, Q24, Q24_B, Q24_C, Q24_D, Q24_E, Q24_F, Q24_G, Q24_H, Q24_I, 
           Q24_J, Q24_K, Q24_L) %>%
    group_by(Q2) %>%
    summarize_all(list(Yes = ~ sum(!is.na(.)),
                       No = ~ sum(is.na(.)))) %>%
    gather(Response, Count, -Q2) %>%
    separate(Response, c("Q", "Answer"), sep = "_(?=[^_]+$)") %>%
    arrange(Q2, Q, Answer) %>%
    group_by(Q2, Q) %>%
    summarize(Yes = Count[2],
              Percentage = Count[2]/sum(Count) * 100) %>%
    ungroup() %>%
    filter(Yes > 4) %>%
    mutate(Q = factor(Q, labels = 
                        c("New R user group near me",#A
                          "New R user group near me aimed at my demographic",#B
                          "Free local introductory R workshops",#C
                          "Paid local advanced R workshops",#D
                          "R workshop at conference in my domain", #E
                          "R workshop aimed at my demographic",#F
                          "Mentoring (e.g. CRAN/useR! submission, GitHub contribution)", #G
                          #"Training in non-English language",
                          #"Training that accommodates my disability",
                          "Online forum to discuss R-related issues", #J
                          "Online support group for my demographic"#, #K
                          #"Special facilities at R conferences"
                          ))) 
kable(dat, digits = 1)
|Q2    |Q                                                           | Yes| Percentage|
|:-----|:-----------------------------------------------------------|---:|----------:|
|Men   |New R user group near me                                    |  78|       27.6|
|Men   |New R user group near me aimed at my demographic            |   9|        3.2|
|Men   |Free local introductory R workshops                         |  37|       13.1|
|Men   |Paid local advanced R workshops                             |  32|       11.3|
|Men   |R workshop at conference in my domain                       |  39|       13.8|
|Men   |R workshop aimed at my demographic                          |   7|        2.5|
|Men   |Mentoring (e.g. CRAN/useR! submission, GitHub contribution) |  46|       16.3|
|Men   |Online forum to discuss R-related issues                    |  28|        9.9|
|Men   |Online support group for my demographic                     |   5|        1.8|
|Women |New R user group near me                                    |  44|       26.0|
|Women |New R user group near me aimed at my demographic            |  22|       13.0|
|Women |Free local introductory R workshops                         |  24|       14.2|
|Women |Paid local advanced R workshops                             |  28|       16.6|
|Women |R workshop at conference in my domain                       |  33|       19.5|
|Women |R workshop aimed at my demographic                          |  13|        7.7|
|Women |Mentoring (e.g. CRAN/useR! submission, GitHub contribution) |  40|       23.7|
|Women |Online forum to discuss R-related issues                    |  34|       20.1|
|Women |Online support group for my demographic                     |  13|        7.7|
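The summary relies on the convention that a checkbox option counts as selected when its column is not missing. As a rough sketch of the same idea in a more compact form, the following uses pivot_longer() and summarize() on made-up data; the toy tibble and its values are hypothetical, and this is an alternative formulation rather than the code used to produce the table above:

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: a non-missing value means the option was ticked
toy <- tibble(
  Q2    = c("Men", "Men", "Women", "Women"),
  Q24   = c("x", NA, "x", "x"),
  Q24_B = c(NA, NA, "y", NA)
)

res <- toy %>%
  # one row per respondent and option
  pivot_longer(-Q2, names_to = "Q", values_to = "Answer") %>%
  group_by(Q2, Q) %>%
  # count the ticked boxes and express them as a percentage of the group
  summarize(Yes = sum(!is.na(Answer)),
            Percentage = mean(!is.na(Answer)) * 100,
            .groups = "drop")
res
```

Because each option is summarized independently, the percentages within a group need not sum to 100.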

Note that respondents could select multiple options, so the percentages do not add up to 100% for men and women. Also, options selected by fewer than five respondents in a group (including those not selected at all) are removed by the filter(Yes > 4) step and do not appear in the summary. The following code visualizes these percentages:

ggplot(dat, aes(x = fct_rev(Q),  y = Percentage, fill = Q2)) + 
  geom_bar(stat = "identity", position = "dodge") +
  coord_flip() +
  labs(x = NULL, y = "%", title = "Options encouraging participation in the R community", fill = NULL) +
  scale_y_continuous(breaks = seq(0, 100, 20), limits = c(0, 100)) +
  scale_fill_hue(h = c(110,250), direction = -1, breaks = c("Women", "Men"))

Men and women are about equally interested in general local user groups and free introductory workshops, but women are more interested than men in mentoring, online forums and support groups, groups aimed at their demographic, and the other types of workshop. For more on the community questions in the survey, see the blog post useRs participation in the R community.

Logistic regression analysis

Logistic regression analysis can be used to explore the relationships between contribution to the R project and other survey variables. For example, the following code creates a contributor response that equals 1 if the respondent has contributed to R packages on CRAN or elsewhere (Q13_D), has written their own R package (Q13_E), or has written their own R package and released it on CRAN or Bioconductor or shared it on GitHub, R-Forge or similar platforms (Q13_F), and equals 0 otherwise. A regression is then used to model this response by gender (Q2), length of R usage (Q11), employment status (Q8) and whether the respondent feels a part of the R community (Q18):

response <- with(useR2016,
    ifelse(!is.na(Q13_D) | !is.na(Q13_E) | !is.na(Q13_F), 1, 0))
summary(glm(response ~ Q2 + Q11 + Q8 + Q18, data = useR2016))
## 
## Call:
## glm(formula = response ~ Q2 + Q11 + Q8 + Q18, data = useR2016)
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    0.27420    0.05918   4.633 4.84e-06 ***
## Q2Women       -0.10034    0.04423  -2.269   0.0238 *  
## Q112-5 years   0.26890    0.06665   4.034 6.53e-05 ***
## Q115-10 years  0.37641    0.06501   5.790 1.40e-08 ***
## Q11> 10 years  0.43971    0.06849   6.420 3.77e-10 ***
## Q8Academic     0.21506    0.04329   4.968 9.95e-07 ***
## Q18No         -0.34397    0.05864  -5.866 9.19e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.1809193)
## 
##     Null deviance: 102.050  on 417  degrees of freedom
## Residual deviance:  74.358  on 411  degrees of freedom
##   (37 observations deleted due to missingness)
## AIC: 480.52
## 
## Number of Fisher Scoring iterations: 2

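This model suggests that women are slightly less likely to contribute. More important factors, however, are length of R usage (the longer the usage, the more likely the respondent is to have contributed), type of employment (academic, including retired, unemployed and student, more likely to contribute) and sense of belonging to the R community (people who do not feel part of the R community are less likely to contribute). Note that the glm() call above does not set family = binomial, so, as the output indicates (gaussian family, t values), the fitted model is a linear probability model: the coefficients are differences in the probability of contributing rather than log-odds. A working paper is in progress on this and related models.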
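For reference, glm() fits a logistic regression in the strict sense when family = binomial is supplied. The following is a minimal sketch on simulated data; the variable names, effect sizes and sample size are assumptions made for illustration and are not estimates from useR2016:

```r
set.seed(1)
n <- 200

# Hypothetical predictors, loosely mimicking two of the survey variables
toy <- data.frame(
  gender = factor(sample(c("Men", "Women"), n, replace = TRUE)),
  years  = factor(sample(c("< 2 years", "2-5 years"), n, replace = TRUE))
)

# Simulate a 0/1 response from an assumed log-odds model
eta <- -0.5 + 0.8 * (toy$years == "2-5 years") - 0.3 * (toy$gender == "Women")
toy$response <- rbinom(n, 1, plogis(eta))

# family = binomial uses the logit link by default
fit <- glm(response ~ gender + years, family = binomial, data = toy)

# exponentiated coefficients can be read as odds ratios
exp(coef(fit))
```

With this family, summary(fit) reports z values and coefficients on the log-odds scale.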

Multivariate analysis

A multiple correspondence analysis may be used to explore multivariate relationships between a set of questions. The following code considers questions relating to length of R usage (Q11), how the respondent uses R (Q13) and why they use R (Q14). The demographic variables gender (Q2), age (Q3), highest education level (Q7) and employment type (Q8), along with previous programming experience (Q12), are used as supplementary variables; that is, they are not used to build the dimensions of variability, but are projected a posteriori to aid interpretation.

demo <- c("Q2", "Q3", "Q7", "Q8")
suppl <- c(demo, "Q12")
ruses <- c("Q11", "Q13", "Q13_B", "Q13_C", "Q13_D", "Q13_E", "Q13_F", "Q14")
don.mca <- useR2016[, c(suppl, ruses)] %>%
    mutate(Q12 = factor(ifelse(Q12 == "Yes", "prg_exp_yes", "prg_exp_no")),
           Q13 = factor(ifelse(!is.na(Q13), "use_func_yes", "use_func_no")),
           Q13_B = factor(ifelse(!is.na(Q13_B), "wrt_code_yes", "wrt_code_no")),
           Q13_C = factor(ifelse(!is.na(Q13_C), "wrt_func_yes", "wrt_func_no")),
           Q13_D = factor(ifelse(!is.na(Q13_D), "ctb_pkg_yes", "ctb_pkg_no")),
           Q13_E = factor(ifelse(!is.na(Q13_E), "wrt_pkg_yes", "wrt_pkg_no")),
           Q13_F = factor(ifelse(!is.na(Q13_F), "rel_pkg_yes", "rel_pkg_no")))
rownames(don.mca) <- seq_len(nrow(don.mca))
# the first length(suppl) columns (demographics and Q12) are supplementary
res.mca <- MCA(don.mca, graph = FALSE, quali.sup = seq_along(suppl))
plot(res.mca, invisible = c("ind", "quali.sup"), cex = 0.8)

The plot above summarizes the main dimensions of variability in the responses to the programming experience questions. Two categories are close on the graph when individuals who selected one of them also tend to select the other. The main feature of this plot is the gradient from bottom right to top left, showing increasing experience and greater contribution, including in the respondents’ free time.
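To make the distinction between active and supplementary variables concrete, here is a small sketch using MCA() on simulated factors. The data, variable names and category labels are hypothetical; the category labels are kept distinct across variables, in the same spirit as the recoding above.

```r
library(FactoMineR)

set.seed(42)
n <- 100

# Hypothetical data: one supplementary variable and three active ones
toy <- data.frame(
  group = factor(sample(c("grp_a", "grp_b"), n, replace = TRUE)),
  q1 = factor(sample(c("q1_yes", "q1_no"), n, replace = TRUE)),
  q2 = factor(sample(c("q2_yes", "q2_no"), n, replace = TRUE)),
  q3 = factor(sample(c("q3_yes", "q3_no"), n, replace = TRUE))
)

# Column 1 is supplementary: it does not contribute to the dimensions
res <- MCA(toy, quali.sup = 1, graph = FALSE)

# coordinates of the active categories, which define the dimensions
res$var$coord

# coordinates of the supplementary categories, projected afterwards
res$quali.sup$coord
```

Only the six active categories appear in res$var$coord; the two categories of the supplementary variable are reported separately in res$quali.sup$coord.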

The following code then projects the demographic variables onto the same dimensions:

res.dimdesc <- dimdesc(res.mca)  
# demographic variables linked to the dimension 1 or 2 
varselect <- 
    demo[which(demo%in%unique(c(rownames(res.dimdesc$'Dim 1'$quali),
                                rownames(res.dimdesc$'Dim 2'$quali))))]
# vector with the categories for such demographic variables
modeselect <- unlist(sapply(don.mca[, varselect],levels))      
# discriminant categories for the position of the individuals on dimension 1 or 2
getlabel <- function(x) sub("[^=]+=(.*)", "\\1", x)
lab1 <- getlabel(rownames(res.dimdesc$'Dim 1'$category))
lab2 <- getlabel(rownames(res.dimdesc$'Dim 2'$category))
modeselect <- modeselect[modeselect %in% unique(c(lab1, lab2))]
plot(res.mca, invisible=c("ind", "var"), cex = 0.8,
     selectMod = modeselect, autoLab = "yes",
     xlim = c(-1.5,1.5), ylim = c(-1,1))
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_text_repel()`).

This shows that the more experienced programmers tend to be men working in academia. Further multivariate analysis of the programming questions can be found in useR! 2016 participants and R programming: a multivariate analysis, while a similar analysis of the community questions is reported in useR! 2016 participants and the R community: a multivariate analysis.