6  R workbench

Under construction

These notes are currently under construction and may contain errors. Feedback of any kind is very much welcome: lukas.soenning@uni-bamberg.de

This chapter shows how to apply the methods discussed in earlier chapters in R.

6.1 Descriptive statistics

6.1.1 Relative frequencies

We can obtain category-specific and cumulative proportions in R using the {dplyr} package. The first step is to obtain the category-specific counts, then the corresponding proportions, and finally aggregated proportions. We apply this sequence of steps to the heaps dataset.

d |> 
  group_by(rating) |> 
  dplyr::summarize(
    n = n()) |> 
  mutate(
    prop = prop.table(n),
    cum_prop = cumsum(prop)
  )

We can also form cumulative proportions for subgroups in the data, by adding a grouping variable to the group_by() function. Note that the grouping variable must be listed first.

d |> 
  group_by(gender, rating) |> 
  dplyr::summarize(
    n = n()) |> 
  mutate(
    prop = prop.table(n),
    cum_prop = cumsum(prop)
  )

6.1.2 Visualizing ordinal data

6.1.2.1 Categorical predictors

When drawing a bar chart as shown in Figure 2.2, we first need to decide on how to order subgroups (in this case: syntactic functions) along the x-axis. Before visualizing the data, we convert the variable synt_fun into and ordered factor manually:

d$synt_fun <- factor(
  d$synt_fun, 
  levels = c("p_adj", "verb", "c_adj", "quan"),
  ordered = TRUE)

We can now draw a grouped bar chart.

d |> 
  ggplot(aes(
    x = synt_fun, 
    fill = rating)) +
  geom_bar(position = "dodge") +
  scale_fill_brewer(
    palette = "RdGy")

To draw the line plot in Figure 2.2, we need more code. Note that the first part merely creates the category-specific proportions.

d |>                                    
  group_by(synt_fun, rating) |>  
  dplyr::summarize(                     
    n = n()) |>                         
  mutate(                               
    prop = prop.table(n),               
    cum_prop = cumsum(prop)             
  ) |>                                
  ggplot(aes(
    x = synt_fun, 
    y = prop, 
    color = rating, 
    group = rating)) +
  geom_line() +
  geom_point() +
  scale_colour_brewer(
    palette = "RdGy")

Here is the code to draw a bar chart showing cumulative proportions (see Figure 2.3).

d |> 
  ggplot(aes(
    x = synt_fun, 
    fill = rating)) +
  geom_bar(
    position = "fill") +
  scale_fill_brewer(
    palette = "RdGy")

To change the order of the categories (i.e. to have red segments at the bottom), we can reverse the y-axis using the additional function scale_y_reverse().

d |> 
  ggplot(aes(
    x = synt_fun, 
    fill = rating)) +
  geom_bar(
    position = "fill") +
  scale_fill_brewer(
    palette = "RdGy") +
  scale_y_reverse()

We may then want to change the tick mark labels manually, so that 0 marks the bottom of the graph and 1 the top.

d |> 
  ggplot(aes(
    x = synt_fun, 
    fill = rating)) +
  geom_bar(
    position = "fill") +
  scale_fill_brewer(
    palette = "RdGy") +
  scale_y_reverse(
    breaks = c(0, .5, 1), 
    labels = c("1", ".5", "1"))

Diverging bar charts require more work. Note that the first part only creates the appropriate summary statistics (like above).

# create summary
ds <- d |>                              
  group_by(synt_fun, rating) |>         
  dplyr::summarize(                      
    n = n()) |>                         
  mutate(                               
    prop = prop.table(n),               
    cum_prop = cumsum(prop)) 

# draw graph
ds |>  
  ggplot(aes(
    x = reorder(synt_fun, prop * as.numeric(rating), mean),
    y = prop, fill = rating)) +
  geom_col(data = dplyr::filter(
    ds, rating %in% c("4", "5", "6")),
    position = position_stack(reverse = TRUE)) +
  geom_col(data = dplyr::filter(
    ds, rating %in% c("1", "2", "3")),
    aes(y = -prop),
    position = position_stack(reverse = FALSE)) +
  scale_fill_brewer(palette = "RdGy", drop = FALSE)

The following code creates the line plot in Figure 2.5. We start by creating a summary table listing the relevant exceedance proportions. Then we exclude those for the highest category, as these are 0 and redundant. Then we draw the graph, where we explicitly define the limits of the y-axis.

# create summary
ds <- d |>                                    
  group_by(synt_fun, rating) |>  
  dplyr::summarize(                      
    n = n()) |>                         
  mutate(                               
    prop = prop.table(n),               
    exc_prop = 1 - cumsum(prop)) 

# exclude exceedance proportions for highest category
ds <- ds |> dplyr::filter(rating != "6")

# draw graph
ds |>  
  ggplot(aes(
    x = synt_fun, 
    y = exc_prop, 
    color = rating, 
    group = rating)) +
  geom_line() +
  geom_point(size = 1) +
  scale_colour_brewer(
    palette = "RdGy") +
  ylim(0,1)

6.1.2.2 Numeric predictors

Unfortunately, we don’t know how to use {ggplot2} to directly create a plot showing category-specific proportions across the variable Date of birth (see Figure 2.6). Our current workaround is to extract the category-specific proportions from an object created using the function cdplot() and then pass on these quantities to {ggplot}

# create smoothed cumulative proportions
tmp <- cdplot(
  rating ~ age, 
  data = d, 
  bw = 3, 
  plot = FALSE)

# change these into category-specific proportions and arrange as a data frame
ds <- data.frame(
  age = 18:80,
  rating_1 = tmp[[1]](18:80),
  rating_2 = tmp[[2]](18:80) - tmp[[1]](18:80),
  rating_3 = tmp[[3]](18:80) - tmp[[2]](18:80),
  rating_4 = tmp[[4]](18:80) - tmp[[3]](18:80),
  rating_5 = tmp[[5]](18:80) - tmp[[4]](18:80),
  rating_6 = 1 - tmp[[5]](18:80)
) |>  
  gather(rating_1:rating_6, key = "response", value = "prob")

# draw line plot
ds|> 
  ggplot(aes(
    x = age, 
    y = prob, 
    color = response)) +
  geom_line() +
  scale_colour_brewer(
    type = "div", 
    palette = "RdGy") 

The following code produces an area chart showing aggregated proportions (see Figure 2.7).

d |> 
  ggplot(aes(
    x = age, 
    after_stat(count), 
    fill = rating)) +
  geom_density(
    position = "fill", 
    adjust = 1.5, 
    colour = 1, 
    outline.type = "full") +
  scale_fill_brewer(
    palette = "RdGy") 

To create a spine plot (see Figure 2.8) using {ggplot2}, we need need the geom_mosaic() function in the {ggmosaic} package. The first step is to discretize the variable Age into 10-year bins. To produce our desired visual arrangement (the highest category “6” in grey at the bottom), we need to reverse the order of the response categories (using fct_rev()) and then alse reverse the order of the fill colors using the argument direction in the function scale_fill_brewer().

d |> mutate(
  age_bin = factor(
    cut(age,
        seq(10, 90, by = 10)))) |> 
  mutate(
    rating = fct_rev(rating)
  ) |> 
  count(rating, age_bin) %>%
    ggplot() +
    geom_mosaic(aes(
      weight = n,
      x = product(age_bin),
      fill = rating)) +
    scale_fill_brewer(
      palette = "RdGy",
      direction = -1)

The following code produces the line plot version of this area chart (see Figure 2.9).

d |> 
  ggplot(aes(
    x = age, 
    after_stat(count), 
    color = rating)) +
  geom_density(
    position = "fill",
    adjust = 1.5, 
    outline.type = "upper") +
  scale_colour_brewer(
    palette = "RdGy") +
  ylim(0, 1)

6.2 Ordered regression models

6.2.1 Model fitting

We can fit cumulative models using the function the {ordinal} package. It offers two functions for model fitting:

  • clm() for simple models
  • clmm() for mixed-effects models (more recent version)

The function clmm2() is an older version for mixed-effects models, which currently offers extended functionality but will eventually be superseded by clmm().

We start by loading the data and make sure that the outcome variable rating is represented as an ordered factor.

d_parcel  <- read_tsv(
  here("data/analysis_data",
       "malta_data_parcel.tsv"))

d_parcel$rating <- factor(d_parcel$rating)

The following code fits a model to the parcel-package data with a single predictor variable, Date of birth. We use the argument threshold to request symmetric thresholds.

m <- clm(
  rating ~ dob_centered, 
  data = d_parcel, 
  threshold = "symmetric")

We can print a regression table using summary():

summary(m)
formula: rating ~ dob_centered
data:    d_parcel

 link  threshold nobs logLik  AIC    niter max.grad cond.H 
 logit symmetric 193  -263.16 534.32 5(1)  5.40e-13 1.9e+01

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
dob_centered   -1.212      0.242  -5.008 5.49e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Threshold coefficients:
          Estimate Std. Error z value
central.1 -0.55013    0.17209  -3.197
central.2  0.44757    0.17158   2.609
spacing.1  0.38966    0.06784   5.744

The function Anova() in the {car} package (Fox and Weisberg 2019) produces an analysis of deviance table:

car::Anova(m, type="III")
Analysis of Deviance Table (Type III tests)

Response: rating
             Df  Chisq Pr(>Chisq)    
dob_centered  1 25.085  5.486e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

There are other functions for model comparison and evaluation. Nested models can be compared using anova(), and stepwise model selection can be done with drop1() and add1().

6.2.2 The parallel regression assumption

In R, we can run the LR test as follows. We first fit the two models using the clm() function from the package {ordinal} (Christensen 2023). To fit the non-parallel model, we use the additional argument nominal =, where we specify the predictor(s) for which we want to fit freely varying coefficients.

m_np <- clm(
  rating ~ 1, 
  nominal = ~ dob_centered, 
  data = d_parcel,
  threshold = "symmetric")

Then we compare the two models using the function anova():

anova(m, m_np)
Likelihood ratio tests of cumulative link models:
 
     formula:              nominal:      link: threshold:
m    rating ~ dob_centered ~1            logit symmetric 
m_np rating ~ 1            ~dob_centered logit symmetric 

     no.par    AIC  logLik LR.stat df Pr(>Chisq)  
m         4 534.32 -263.16                        
m_np      6 533.15 -260.58  5.1627  2    0.07567 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Non-parallel mixed models not implemented in {ordinal}.

It is currently not possible to fit non-parallel mixed-effects models using the package {ordinal} – the clmm() function does not have the nominal argument.

6.2.2.1 Model comparison using information criteria

We can use the functions AIC() and BIC() to obtain information measures for both models.

BIC(m)
[1] 547.3666
BIC(m_np)
[1] 552.7294

6.2.2.2 Comparison of coefficients across cutpoint equations

To compare coefficients across cutpoint equations, we need to create four new indicator variables that represent the binary splits underlying cumulative models (see Figure 3.1). We add these to the data frame.

d_parcel$split_1 <- ifelse(as.numeric(d_parcel$rating) <= 1, 0, 1)
d_parcel$split_2 <- ifelse(as.numeric(d_parcel$rating) <= 2, 0, 1)
d_parcel$split_3 <- ifelse(as.numeric(d_parcel$rating) <= 3, 0, 1)
d_parcel$split_4 <- ifelse(as.numeric(d_parcel$rating) <= 4, 0, 1)

Then we fit a logistic regression model to each cutpoint equation:

m_s1 <- glm(
  split_1 ~ dob_centered, 
  data = d_parcel, 
  family = binomial(link = "logit"))

m_s2 <- glm(
  split_2 ~ dob_centered, 
  data = d_parcel, 
  family = binomial(link = "logit"))

m_s3 <- glm(
  split_3 ~ dob_centered, 
  data = d_parcel, 
  family = binomial(link = "logit"))

m_s4 <- glm(
  split_4 ~ dob_centered, 
  data = d_parcel, 
  family = binomial(link = "logit"))

The coefficients can then be compared in tabular and/or graphical form. We can use the function coef() to access the coefficients of interest, and the function confint() to obtain confidence intervals for them. The following code produces a simpl table (see Table 3.3).

data.frame(
  split = 1:4,
  estimate = round(
    c(coef(m_s1)[2], 
      coef(m_s2)[2], 
      coef(m_s3)[2], 
      coef(m_s4)[2]), 
    2),
  ci_lower = round(
    c(confint(m_s1)[2,1], 
      confint(m_s2)[2,1],  
      confint(m_s3)[2,1], 
      confint(m_s4)[2,1]), 
    2),
  ci_upper = round(
    c(confint(m_s1)[2,2], 
      confint(m_s2)[2,2],
      confint(m_s3)[2,2], 
      confint(m_s4)[2,2]), 
    2))
  split estimate ci_lower ci_upper
1     1    -0.83    -1.43    -0.29
2     2    -1.32    -1.96    -0.76
3     3    -1.35    -1.92    -0.82
4     4    -1.38    -1.96    -0.84

6.2.2.3 Comparison of predicted probabilities

We use the emmeans() function to obtain average predicted response probabilities for models m and m_np.

ap_m <- emmeans(
  m, 
  ~ rating | dob_centered, 
  mode = "prob",
  at = list(
    dob_centered = seq(-1.4, 1, by = .1))
  ) |> 
  data.frame()

ap_m_np <- emmeans(
  m_np, 
  ~ rating | dob_centered, 
  mode = "prob",
  at = list(
    dob_centered = seq(-1.4, 1, by = .1))
  ) |> 
  data.frame()

Then we draw the graph using {ggplot2} (see Figure 3.6).

ap_m |> 
  ggplot(aes(
    x = dob_centered, 
    y = prob, 
    color = rating)) +
  geom_line() +
  geom_line(data = ap_m_np, linetype = "22") +
  scale_colour_brewer(palette = "RdBu")

Extract category-specific probabilities from model.

emmeans(
  m_np, 
  ~ rating | dob_centered, 
  mode = "prob",
  at = list(
    dob_centered = 0)
  )

Extract cumulative probabilities from model.

emmeans(
  m_np, 
  ~ dob_centered | cut, 
  mode = "cum.prob",
  at = list(
    dob_centered = 0)
  )

Extract exceedance probabilities from the model:

emmeans(
  m_np, 
  ~ dob_centered | cut, 
  mode = "exc.prob",
  at = list(
    dob_centered = 0)
  )

6.2.3 Model diagnostics

The presid() function in the {PResiduals} package can be used to calculate probability-scale residuals. Since presid() does not (yet) work on models fit using the package {ordinal}, we write a function that extracts probability-scale residuals from models fit using clm():

presid.clm <- function(object, ...) {
  pfun <- switch(object$link,
                 logit = plogis,
                 probit = pnorm,
                 cloglog = pGumbel,
                 cauchit = pcauchy)
  n <- object$nobs
  q <- length(object$Theta)
  lp <- as.numeric(model.matrix(object)$X[,-1] %*% coef(object)[-(1:length(object$alpha))])

  cumpr <- cbind(0, matrix(pfun(matrix(object$Theta, n, q, byrow = TRUE) - lp),, q), 1)
  y <- as.integer(object$y)
  lo <- cumpr[cbind(seq_len(n), y)]
  hi <- 1 - cumpr[cbind(seq_len(n), y+1L)]
  res <- lo - hi
  res <- naresid(object$na.action, res)
  return(res)
}

6.3 Methods of interpretation

We now look at how to apply different methods of interpretation using R. The following sections are structured identically to the ones in Chapter 5. The R code given here produces figures and tables similar to the ones in that earlier chapter. The relevant items are cross-referenced. To see the intended output, hover over the hyperlink – the image or table will then appear.

This is the model we will concentrate on:

m <- clm(
  rating ~ dob_c + gender, 
  data = d_parcel, 
  threshold = "symmetric")

6.3.1 Predictions on the probability scale (R)

6.3.1.1 Predictions for each observed unit (R)

In R, the in-sample predicted probabilities for the outcome categories can be computed with the function predictions() in the {marginaleffects} package.

predictions(m)

The following code produces the dot diagram in Figure 5.4. note how piping is used to pass on the output of the predictions() function to the ggplot() function. The function geom_dotplot() draws a dot diagram. We manually changed the argument binwidth, since the default setting (30 bins) is too coarse for the large number of dots that need to be displayed.

predictions(m)  |> 
  ggplot(aes(x = estimate, y = group)) +
  geom_dotplot(
    binwidth = .005)

To create a by-category summary table (Table 5.1), we use the function group_by() to identify the subgroups we wish to compare (here: the response categories, which are generically referred to as “group” by the predictions() function). Then we use the function summarize() in the {dplyr} package to obtain the summary statistics we need.

predictions(m) |> 
  group_by(group) |> 
  dplyr::summarize(
    Mean = mean(estimate),
    SD = sd(estimate),
    Min = min(estimate),
    Max = max(estimate)
  )

6.3.1.2 Unit-level predictions (R)

6.3.1.2.1 Predictions at specified values (R)

We can use the predictions() function to obtain predictions at specified values. We start by creating a “grid”, i.e. data frame which holds this specific profile we are interested in.

grid <- data.frame(
  dob_c = 0,
  gender = "f")

grid
  dob_c gender
1     0      f

By supplying this grid to the newdata argument in the predictions() function, we can predict the response probabilities for this profile:

predictions(m, newdata = grid)

The same can be done using the {emmeans} package:

emmeans(           # compute from
  m,               # model m
  ~ rating,        # category-specific predictions
  mode = "prob",   # on the probability scale
  at = list(       # for the following profile:
    dob_c = 0,     #  - date of birth = 0 (year 1975)
    gender = "f")) #  - gender = female

The datagrid() function in the {marginaleffects} package makes it easy to define different profiles. Note how the datagrid() function returns all combinations of the specified values:

grid <- datagrid(
  gender = c("m", "f"), 
  dob_c = c(-1.4, -0.6, 0.2, 1), 
  model = m)

grid
  gender dob_c rowid
1      m  -1.4     1
2      m  -0.6     2
3      m   0.2     3
4      m   1.0     4
5      f  -1.4     5
6      f  -0.6     6
7      f   0.2     7
8      f   1.0     8

We then pass the new grid to the predictions() function, to obtain the predicted probabilities summarized in Table 5.2.

predictions(
  m, 
  newdata = grid)

We can also use shortcut functions: range to define the endpoints of the date-of-birth distribution, and unique to include all attested levels (or values) of Gender:

grid <- datagrid(
  gender = unique, 
  dob_c = range, 
  model = m)

grid
  gender dob_c rowid
1      f  -1.4     1
2      f   1.0     2
3      m  -1.4     3
4      m   1.0     4
6.3.1.2.2 Predictions for the “average unit” (R)

When locating a central and/or representative point in the predictor space, {marginaleffects} uses the mean for all numeric variables, and the mode for all categorical variables. The package allows us to pass the "mean" shortcut to the newdata argument, which returns predictions for a (hypothetical) female speaker born in 1973.34 (Table 5.3):

predictions(
  m, 
  newdata = "mean")
6.3.1.2.3 Predictions for ideal types (R)

To form predictions for the two ideal types (Table 5.4), we specify the relevant profiles using the newdata argument of the predictions() function:

predictions(
  m, 
  newdata = datagrid(
    dob_c = -1.4,
    gender = "m"
  ))

predictions(
  m, 
  newdata = datagrid(
    dob_c = 1,
    gender = "f"
  ))
6.3.1.2.4 Extra: Plotting unit-level predictions using {marginaleffects}

Male and female speaker born in 2000:

plot_predictions(
  m,
  condition = list(
    "gender", "group", "dob_c" = 1)) + 
  scale_y_continuous(
    limits = c(0, 1), expand = c(0, 0),
    breaks = c(0, .5, 1), label = c("0", ".5", "1")) +
  scale_color_manual(values = fill_cols_label) +
  theme_ls() +
  ylab("Probability") + xlab(NULL) +
  labs(caption = "Adjusted to: Date of birth = 2000.") +
  theme(strip.text = element_blank())

Full range of conditions:

plot_predictions(
  m,
  condition = c("dob_c", "gender", "group")) + 
  scale_y_continuous(
    limits = c(0, 1), expand = c(0, 0),
    breaks = c(0, .5, 1), label = c("0", ".5", "1")) +
  scale_x_continuous(breaks = c(-1, 0, 1)) +
  facet_grid(. ~ group) +
  theme_ls() +
  ylab("Probability") +
  xlab("Date of birth (rescaled)")

6.3.2 Average predictions (R)

6.3.2.1 Unit-level predictions: as-observed vs. specified values (R)

By default, the function avg_predictions() uses the as-observed approach and calculates the average prediction over the sample. In {marginaleffects} terminology, such average predictions are said to have been obtained using an empirical grid.

avg_predictions(m)

Rather than use the observed distribution for the variable Gender, we may take control of the value(s) at which to evaluate predictions. For instance, to treat the two levels (male and female) equally, we need to supply this variable to the newdata argument. All variables that are not specified in the newdata argument will still be treated as-observed. The grid is said to be partly synthetic, i.e. partly manually specified.

avg_predictions(
  m,
  newdata = datagrid(
    gender = unique
  ))

We can also determine the value of Age at which unit-level prediction are to be formed. By also defining the values of Age we are specifying a fully synthetic grid.

avg_predictions(
  m,
  newdata = datagrid(
    dob_c = seq(-1.4, 1, by = .8),
    gender = unique
  ))
Tip

We have created two custom functions for handling quantitative predictors with datagrid(). These allow you to select specific points of the observed (empirical) distribution of numeric variables:

  • sd_mean_sd() returns three values: the mean surrounded by two values at +/- 1 standard deviation. The function mad() is used to obtain a robust version of the SD, and a 10% trimmed mean is used by default.
  • decile_midpoints() returns the midpoints of n bins, each of which contain roughly the same number of cases in the estimation sample.

6.3.2.2 Average predictions for a quick overview of predictor importance (R)

We can compute average predictions for a specific predictor using the avg_predictions() function. The argument variables specifies the predictor of interest, and the argument newdata controls the teatment of the peripheral variables, i.e. it defines the locations of the data space over which predictions are averaged. The average predictions reported in Table 5.6 for the variable Gender can be obtained as follows:

avg_predictions(
  m,
  variables = "gender",
  newdata = datagrid(
    dob_c = seq(-1.4, 1, length = 7)
  ))

For quantitative variables, we have to specify the values of interest in a list. Otherwise avg_predictions() will only return a single average prediction calculated at the in-sample mean of the variable.

avg_predictions(
  m,
  variables = list(
    "dob_c" = c(-1.4, -.2, 1)),
  newdata = datagrid(
    gender = unique
  ))

6.3.2.3 Average predictions for subgroups in the data (R)

To compute average predictions for the subgroups using the same as-observed values for all groups we need to use the argument variables in the avg_predictions() function. The “as-observed, identical” average predictions are obtained as follows:

avg_predictions(
  m,
  variables = "gender")

If we instead use the by argument, the peripheral variables are again treated as-observed, but a separate set of observed values is used for each subgroup defined by the by argument. The “as-observed, different” average predictions are obtained as follows:

avg_predictions(
  m,
  by = "gender")

For average predictions that give equal weight to cohorts, we can use the newdata argument to specify the age values at which to generate predictions.

avg_predictions(
  m,
  variables = "gender", 
  newdata = datagrid(
    dob_c = seq(-1.4, 1, by = .8)
  ))

6.3.2.4 Using local means for adjustment (R)

6.3.2.5 Extra: Plotting average predictions using {marginaleffects}

The {marginaleffects} package makes it easy to visualize average predictions.

Average predictions by Gender, treating Date of birth as-observed:

plot_predictions(
  m, 
  by = c("gender", "group"))

Average predictions by Gender, treating Date of birth as-balanced:

plot_predictions(
  m, 
  by = c("gender", "group"),
  newdata = datagrid(
    gender = unique,
    dob_c = seq(-1.4, 1, .8)))

Average predictions by Date of birth, treating Gender as-balanced:

plot_predictions(
  m, 
  by = c("dob_c", "group"),
  newdata = datagrid(
    gender = unique,
    dob_c = seq(-1.4, 1, .05)))

6.3.3 Comparisons (R)

6.3.3.1 Comparisons for each observed unit (R)

The function comparison() in the {marginaleffects} package can be used to make comparisons. The targeted variable is specified with the argument variable. By default, all other variables are treated as-observed, which means that a comparison is computed for each observation in the sample. The following function call therefore returns a large table of differences in predicted probability, which compare the predictions obtained for the male and the female version of each subject in the sample.

comparisons(
  m,
  variables = "female")

To draw a dot diagram (Figure 5.12), we run the following code:

comparisons(
  m,
  variables = "gender") |>  
  ggplot(aes(x = estimate, y = group)) +
  geom_dotplot(binwidth = .001) +
  geom_vline(xintercept = 0)

For numeric variables, the default behavior of the comparisons() function is to use the observed values of the targeted numeric variable: The variable is changed from its observed value to its observed value plus 1.

comparisons(
  m,
  variables = "dob_c") |>  
  ggplot(aes(x = estimate, y = group)) +
  geom_dotplot(binwidth = .001) +
  geom_vline(xintercept = 0)

6.3.3.2 Comparisons for numeric predictors (R)

We will consider the options shown in Figure 5.10 in turn, from top to bottom. The output is not shown here. Note that the first two strategies return a batch of predictions, one for each observation in the estimation sample:

  • Default: (observed + 1) – observed
comparisons(
  m,
  variables = "dob_c",
  newdata = datagrid(
    gender = unique))
  • (observed + 0.5) – observed
comparisons(
  m,
  variables = list(
    "dob_c" = 0.5),
  newdata = datagrid(
    gender = unique))
  • –1 -> 1
comparisons(
  m,
  variables = list(
    "dob_c" = c(-1, 1)),
  newdata = datagrid(
    gender = unique))
  • (mean + SD) – (mean – SD)
comparisons(
  m,
  variables = list(
    "dob_c" = "2sd"),
  newdata = datagrid(
    gender = unique))
  • (mean + SD/2) – (mean – SD/2)
comparisons(
  m,
  variables = list(
    "dob_c" = "sd"),
  newdata = datagrid(
    gender = unique))
  • upper quartile – lower quartile
comparisons(
  m,
  variables = list(
    "dob_c" = "iqr"),
  newdata = datagrid(
    gender = unique))
  • maximum – minimum
comparisons(
  m,
  variables = list(
    "dob_c" = "minmax"))

For the intermediate approach illustrated in Figure 5.11, we need to specify two things: (i) the span we want to consider, which is done in the same way as above; and (ii) the base values, i.e. values of the numeric variable to which the span will be added:

comparisons(
  m,
  variables = list(
    "dob_c" = 30/25),
  newdata = datagrid(
    dob_c = seq(-1.4, -.2, length = 4)
  )
)

To evaluate the comparison at custom locations, we can also use functions to supply specific values of the distribution. To apply the comparison to the restricted range, from the 5th to the 95th percentile (see Long and Freese 2014, 250–51), we could type the following:

comparisons(m,
  variables = list(
    "dob_c" = c(
      quantile(d$dob_c, .05),
      quantile(d$dob_c, .95)
    )
  ))

6.3.3.3 Comparisons at specified values (R)

comparisons(
  m,
  variables = "gender",
  newdata = datagrid(
    dob_c = 0))

6.3.3.4 Comparisons for the “average unit” (R)

We can set peripheral variables equal to their means (numeric predictors) or modes (categorical predictors) by specifying the argument newdata = "mean":

comparisons(
  m,
  variables = "gender",
  newdata = "mean")

6.3.3.5 Comparisons for ideal types (R)

6.3.4 Average comparisons (R)

The {marginaleffects} package computes average comparisons by first calculating unit-level comparisons on a specified scale and then averaging over these.

6.3.4.1 Average comparisons: Observed vs. specified values (R)

By default, the avg_comparisons() function treats peripheral variables as-observed, which means that the comparison is calculated for each observation in the data set and then averaged. For binary variables, it considers a change from one level to the other. The values we obtain are those in Table 5.11 (left-hand side).

avg_comparisons(
  m,
  variables = "gender")

We can graph these with the following code (see Figure 5.15).

avg_comparisons(
  m,
  variables = "gender") |>  
  ggplot(aes(x = estimate, y = group)) +
  geom_point() +
  geom_linerange(aes(xmin = conf.low, xmax = conf.high)) +
  geom_vline(xintercept = 0)

To produce average comparisons over specified values of the peripheral variable Date of birth, we need to specify its values using the newdata argument (see right-hand side of Table 5.11).

avg_comparisons(
  m,
  variables = "gender",
  newdata = datagrid(
    dob_c = seq(-1.4, 1, length = 7)
  ))

6.3.4.2 Average comparisons for a quick overview of predictor importance (R)

The summaries reported in Table 5.12 can be obtained as follows. To calculate average comparisons for the variable Gender, treating Date of birth as-balanced, we need to pass the nae of the tageted variable gender to the variables argument, and manually define the values of Date of birth using the newdata argument.

avg_comparisons(
  m,
  variables = "gender",
  newdata = datagrid(
    dob_c = seq(-1.4, 1, length = 7)
  ))

To calculate average comparisons for Date of birth using the intermediate approach outlined in Figure 5.11, we need to specify the step size in a list passed to the argument variables. In the newdata argument, we define the start locations of Date of birth.

avg_comparisons(
  m,
  variables = list(
    "dob_c" = 30/25),
  newdata = datagrid(
    dob_c = seq(-1.4, -.2, length = 4),
    gender = unique
  ))

Instead of treating a peripheral variable as-observed, we can determine its values manually:

avg_comparisons(
  m, 
  variables = "gender")
avg_comparisons(
  m, 
  by = "gender")
avg_comparisons(
  m, 
  variables = "female",
  newdata = datagrid(
    dob_c = quantile_midpoints
  ))

6.3.4.3 Average comparisons for subgroups in the data (R)

Average comparisons for Date of birth, for Gender subgroups

  • DOB: as-observed (+1)
  • Gender: treated identically
avg_comparisons(
  m,
  variables = "dob_c",
  by = "gender")

Average comparisons for Date of birth, for Gender subgroups

  • DOB: -1 => +1
  • Gender: treated identically
avg_comparisons(
  m,
  variables = list(
    "dob_c" = c(-1, 1)),
  by = "gender")

Average comparisons for Date of birth, for Gender subgroups

  • DOB: intermediate approach
  • Gender: treated identically
avg_comparisons(
  m,
  variables = list(
    "dob_c" = 30/25),
  newdata = datagrid(
    dob_c = seq(-1.4, -.2, length = 4),
    gender = unique),
  by = "gender") 

6.3.5 Predictions on the latent-variable scale (R)

Since model-based predictions on the latent-variable scale only make sense in the context of the thresholds, we have written two custom functions for adding thresholds and category labels to {ggplot} graphs:

  • add_thresholds()
  • add_threshold_labels()
add_thresholds <- function(
    thresholds, 
    col_lines = "black",
    vertical = FALSE){
  thresholds_c <- as.numeric(thresholds) - mean(as.numeric(thresholds))
  
  if (vertical == FALSE){
    ggplot2::geom_hline(
      yintercept = thresholds_c,
      color = col_lines
    )
  } else {
    ggplot2::geom_vline(
      xintercept = thresholds_c,
      color = col_lines
    )
  }
  
}

add_threshold_labels <- function(
    thresholds, 
    padding_labels = .1, 
    x_location = 1, 
    adj_label = 1,
    col_labels = "black",
    custom_labels = NULL,
    size_labels = 3,
    vertical = FALSE){
  
  thresholds_c <- as.numeric(thresholds) - mean(as.numeric(thresholds))
  
  n_thresholds <- length(thresholds)

  if (is.null(custom_labels)){
    category_labels <- c(
      str_split(dimnames(thresholds)[[2]][1:n_thresholds], pattern = "\\|", simplify = TRUE)[,1],
      str_split(dimnames(thresholds)[[2]][1:n_thresholds], pattern = "\\|", simplify = TRUE)[n_thresholds,2]
    )
  } else {
    category_labels <- custom_labels
  }
  
  location_labels <- c(
    min(thresholds_c) - diff(range(thresholds_c))*padding_labels,
    zoo::rollmean(thresholds_c, k = 2),
    max(thresholds_c) + diff(range(thresholds_c))*padding_labels)
  
  if (vertical == FALSE){
    ggplot2::annotate(
      "text", 
      x = x_location,
      y = location_labels,
      label = category_labels,
      adj = adj_label,
      color = col_labels,
      size = size_labels
    )
  } else {
    ggplot2::annotate(
      "text", 
      y = x_location,
      x = location_labels,
      label = category_labels,
      adj = adj_label,
      color = col_labels,
      size = size_labels
    )
  }  
}

6.3.5.1 {emmeans} package

Unfortunately, the {marginaleffects} package does not allow us to produce predictions on the latent-variable scale. We must therefore switch to the {emmeans} package (Lenth 2024), which offers similar functionality to {marginaleffects} package, but differs in a few key respects. The most important difference (apart from syntax) concerns the default treatment of peripheral variables.

As we have seen above, {marginaleffects} treats peripheral variables as-observed, unless we actively specify their values. {emmeans}, on the other hand, defaults to an as-balanced treatment of categorical variables. This means that it returns simple averages, giving equal weight to all levels of the variable. The default treatment of continuous predictors, on the other hand, is to hold them at their in-sample average. Importantly, {emmeans} does not support as-observed predictions. The key differences are summarized in Figure 6.1.

Figure 6.1

6.3.6 Predictions for each observed unit (R)

The {emmeans} package does not allow us to generate as-observed predictions. This means that we must construct them manually:

po_lv_m <- data.frame(
  estimate = as.numeric(
    t(as.matrix(coef(m)[4:5])) %*% t(model.matrix(m)$X[,-1]) - mean(m$Theta)))

Then we can plot them using a dot diagram. The custom functions (defined above) add_thresholds() and add_threshold_labels() are used to annotate the graph.

po_lv_m |> 
  ggplot(aes(x = estimate)) + 
  geom_dotplot(
    method = "histodot", 
    binwidth = .09) +
  add_thresholds(m$Theta, vertical = TRUE) +
  add_threshold_labels(m$Theta, vertical = TRUE)

6.3.7 Unit-level predictions (R)

While {emmeans} is not designed to generate unit-level predictions, we can nevertheless force it to avoid averaging by manually specifying value for all predictor variables in the model. Let’s say we wish to obtain predictions for a male speaker born in 2000. We pass the variables that are used to define the condition to the spec argument and then, critically, we specify the values for both in the at argument. To compute a prediction for a male speaker born in 2000, we run the following:

emmeans(
  m,
  spec = c("dob_c", "gender"),
  at = list(
    dob_c = 1,
    gender = "m"))

To add conditions, we specify additional values in the at argument. To obtain predictions for both a male and female speaker born in 2000:

emmeans(
  m,
  spec = c("dob_c", "gender"),
  at = list(
    dob_c = 1,
    gender = c("f", "m")))

And we can also add predictions for two additional years (1950 and 1975):

emmeans(
  m,
  spec = c("dob_c", "gender"),
  at = list(
    dob_c = c(-1, 0, 1),
    gender = c("f", "m")))

6.3.8 Average predictions (R)

We can use the emmeans() function to get average predictions for male and female speakers. If we do not specify a birth year, we get at-means predictions, based on the in-sample average.

emmeans(
  m,
  spec = "gender")

Since the distribution of this variable is not representative of the target population, we avoid sample statistics and instead ask for predictions for the year 1975:

emmeans(
  m,
  spec = "gender",
  at = list(
    dob_c = 0))

We can likewise obtain average predictions for the variable Date of birth. By default, {emmeans} treats Gender (the peripheral categorical variable) as-balanced, giving the same weight to male and female speakers. This is what we want. When requesting average predictions for numeric predictors, we also need to specify the locations we wish to consider. Otherwise we will only get a single estimate for the in-sample mean. Here, we choose 20-year steps:

emmeans(
  m,
  spec = "gender",
  at = list(
    dob_c = c(-1.4, -.6, .2, 1)))

If we wish to instead obtain weighted averages over peripheral categorical variables, we can use the weights argument. Setting it to "proportional", the levels are weighted according to their representation in the estimation sample.

emmeans(
  m,
  spec = "dob_c",
  at = list(
    dob_c = c(-1.4, -.6, .2, 1)),
  weights = "proportional")

6.3.9 Unit-level comparisons (R)

To obtain unit-level comparisons, the first step is to create unit-level predictions. For instance, to compare female and male speakers born in 2000, we use the emmeans() function to calculate unit-level predictions. We assign these to a new object ap_gender:

ap_gender <- emmeans(
  m,
  spec = "gender",
  at = list(
    dob_c = 1))

ap_gender
 gender emmean    SE  df asymp.LCL asymp.UCL
 f      -0.878 0.247 Inf     -1.36    -0.395
 m      -1.462 0.258 Inf     -1.97    -0.957

Confidence level used: 0.95 

To compare these predictions, use the function pairs():

pairs(ap_gender)
 contrast estimate   SE  df z.ratio p.value
 f - m       0.584 0.27 Inf   2.166  0.0303

The output of pairs() does not include confidence intervals. To obtain these, we have to wrap the function confint() around it:

confint(
  pairs(ap_gender))
 contrast estimate   SE  df asymp.LCL asymp.UCL
 f - m       0.584 0.27 Inf    0.0555      1.11

Confidence level used: 0.95 

By default, pairs() runs all pairwise comparisons among predictions contained in the objects created by emmeans(). We can simplify the output with the argument simple. For instance, if we wish to obtain differences between male and female speakers at different birth years, we start by forming a set of 6 predictions:

ap_gender <- emmeans(
  m,
  spec = c("dob_c", "gender"),
  at = list(
    dob_c = c(-1, 0, 1),
    gender = c("f", "m")))

ap_gender
 dob_c gender emmean    SE  df asymp.LCL asymp.UCL
    -1 f       1.490 0.363 Inf     0.778     2.201
     0 f       0.306 0.195 Inf    -0.076     0.688
     1 f      -0.878 0.247 Inf    -1.361    -0.395
    -1 m       0.906 0.380 Inf     0.161     1.650
     0 m      -0.278 0.217 Inf    -0.704     0.147
     1 m      -1.462 0.258 Inf    -1.968    -0.957

Confidence level used: 0.95 

We are not interested in all pairwise comparisons, only in the comparisons between the levels of Gender, but for each birth year. The argument simple allows us to specify the focal variable for the contrast, which gives us what we want:

pairs(
  ap_gender, 
  simple = "gender")
dob_c = -1:
 contrast estimate   SE  df z.ratio p.value
 f - m       0.584 0.27 Inf   2.166  0.0303

dob_c =  0:
 contrast estimate   SE  df z.ratio p.value
 f - m       0.584 0.27 Inf   2.166  0.0303

dob_c =  1:
 contrast estimate   SE  df z.ratio p.value
 f - m       0.584 0.27 Inf   2.166  0.0303

The following code produces an interaction plot.

emmip(
  m, 
  gender ~ dob_c,
  at = list(
    dob_c = seq(-1.4, 1, by = .1)
    ),
  CIs = TRUE, 
  col = NA) +
  add_thresholds(m$Theta, col_lines = "grey") +
  add_threshold_labels(m$Theta, x_location = 1.2, adj_label = 1, col_labels = "grey") +
  theme_classic()  +
  geom_ribbon(aes(ymin = LCL, ymax = UCL, fill = gender), alpha = 0.2, colour = NA) +
  geom_line(lty = 1)

6.3.10 Average comparisons (R)

Generating average comparisons works in a similar way. We first create an object containing average predictions and then use pairs() to obtain relevant contrasts.

ap_dob <- emmeans(
  m,
  spec = "dob_c",
  at = list(
    dob_c = c(-1.4, 1)))

ap_dob
 dob_c emmean    SE  df asymp.LCL asymp.UCL
  -1.4   1.67 0.437 Inf     0.815     2.527
   1.0  -1.17 0.213 Inf    -1.588    -0.752

Results are averaged over the levels of: gender 
Confidence level used: 0.95 
pairs(ap_dob)
 contrast             estimate   SE  df z.ratio p.value
 (dob_c-1.4) - dob_c1     2.84 0.58 Inf   4.903  <.0001

Results are averaged over the levels of: gender