|>
d group_by(rating) |>
::summarize(
dplyrn = n()) |>
mutate(
prop = prop.table(n),
cum_prop = cumsum(prop)
)
6 R workbench
These notes are currently under construction and may contain errors. Feedback of any kind is very much welcome: lukas.soenning@uni-bamberg.de
This chapter shows how to apply the methods discussed in earlier chapters in R.
6.1 Descriptive statistics
6.1.1 Relative frequencies
We can obtain category-specific and cumulative proportions in R using the {dplyr}
package. The first step is to obtain the category-specific counts, then the corresponding proportions, and finally aggregated proportions. We apply this sequence of steps to the heaps dataset.
We can also form cumulative proportions for subgroups in the data, by adding a grouping variable to the group_by()
function. Note that the grouping variable must be listed first.
|>
d group_by(gender, rating) |>
::summarize(
dplyrn = n()) |>
mutate(
prop = prop.table(n),
cum_prop = cumsum(prop)
)
6.1.2 Visualizing ordinal data
6.1.2.1 Categorical predictors
When drawing a bar chart as shown in Figure 2.2, we first need to decide on how to order subgroups (in this case: syntactic functions) along the x-axis. Before visualizing the data, we convert the variable synt_fun
into and ordered factor manually:
$synt_fun <- factor(
d$synt_fun,
dlevels = c("p_adj", "verb", "c_adj", "quan"),
ordered = TRUE)
We can now draw a grouped bar chart.
|>
d ggplot(aes(
x = synt_fun,
fill = rating)) +
geom_bar(position = "dodge") +
scale_fill_brewer(
palette = "RdGy")
To draw the line plot in Figure 2.2, we need more code. Note that the first part merely creates the category-specific proportions.
|>
d group_by(synt_fun, rating) |>
::summarize(
dplyrn = n()) |>
mutate(
prop = prop.table(n),
cum_prop = cumsum(prop)
|>
) ggplot(aes(
x = synt_fun,
y = prop,
color = rating,
group = rating)) +
geom_line() +
geom_point() +
scale_colour_brewer(
palette = "RdGy")
Here is the code to draw a bar chart showing cumulative proportions (see Figure 2.3).
|>
d ggplot(aes(
x = synt_fun,
fill = rating)) +
geom_bar(
position = "fill") +
scale_fill_brewer(
palette = "RdGy")
To change the order of the categories (i.e. to have red segments at the bottom), we can reverse the y-axis using the additional function scale_y_reverse()
.
|>
d ggplot(aes(
x = synt_fun,
fill = rating)) +
geom_bar(
position = "fill") +
scale_fill_brewer(
palette = "RdGy") +
scale_y_reverse()
We may then want to change the tick mark labels manually, so that 0 marks the bottom of the graph and 1 the top.
|>
d ggplot(aes(
x = synt_fun,
fill = rating)) +
geom_bar(
position = "fill") +
scale_fill_brewer(
palette = "RdGy") +
scale_y_reverse(
breaks = c(0, .5, 1),
labels = c("1", ".5", "1"))
Diverging bar charts require more work. Note that the first part only creates the appropriate summary statistics (like above).
# create summary
<- d |>
ds group_by(synt_fun, rating) |>
::summarize(
dplyrn = n()) |>
mutate(
prop = prop.table(n),
cum_prop = cumsum(prop))
# draw graph
|>
ds ggplot(aes(
x = reorder(synt_fun, prop * as.numeric(rating), mean),
y = prop, fill = rating)) +
geom_col(data = dplyr::filter(
%in% c("4", "5", "6")),
ds, rating position = position_stack(reverse = TRUE)) +
geom_col(data = dplyr::filter(
%in% c("1", "2", "3")),
ds, rating aes(y = -prop),
position = position_stack(reverse = FALSE)) +
scale_fill_brewer(palette = "RdGy", drop = FALSE)
The following code creates the line plot in Figure 2.5. We start by creating a summary table listing the relevant exceedance proportions. Then we exclude those for the highest category, as these are 0 and redundant. Then we draw the graph, where we explicitly define the limits of the y-axis.
# create summary
<- d |>
ds group_by(synt_fun, rating) |>
::summarize(
dplyrn = n()) |>
mutate(
prop = prop.table(n),
exc_prop = 1 - cumsum(prop))
# exclude exceedance proportions for highest category
<- ds |> dplyr::filter(rating != "6")
ds
# draw graph
|>
ds ggplot(aes(
x = synt_fun,
y = exc_prop,
color = rating,
group = rating)) +
geom_line() +
geom_point(size = 1) +
scale_colour_brewer(
palette = "RdGy") +
ylim(0,1)
6.1.2.2 Numeric predictors
Unfortunately, we don’t know how to use {ggplot2}
to directly create a plot showing category-specific proportions across the variable Date of birth (see Figure 2.6). Our current workaround is to extract the category-specific proportions from an object created using the function cdplot()
and then pass on these quantities to {ggplot}
# create smoothed cumulative proportions
<- cdplot(
tmp ~ age,
rating data = d,
bw = 3,
plot = FALSE)
# change these into category-specific proportions and arrange as a data frame
<- data.frame(
ds age = 18:80,
rating_1 = tmp[[1]](18:80),
rating_2 = tmp[[2]](18:80) - tmp[[1]](18:80),
rating_3 = tmp[[3]](18:80) - tmp[[2]](18:80),
rating_4 = tmp[[4]](18:80) - tmp[[3]](18:80),
rating_5 = tmp[[5]](18:80) - tmp[[4]](18:80),
rating_6 = 1 - tmp[[5]](18:80)
|>
) gather(rating_1:rating_6, key = "response", value = "prob")
# draw line plot
|>
dsggplot(aes(
x = age,
y = prob,
color = response)) +
geom_line() +
scale_colour_brewer(
type = "div",
palette = "RdGy")
The following code produces an area chart showing aggregated proportions (see Figure 2.7).
|>
d ggplot(aes(
x = age,
after_stat(count),
fill = rating)) +
geom_density(
position = "fill",
adjust = 1.5,
colour = 1,
outline.type = "full") +
scale_fill_brewer(
palette = "RdGy")
To create a spine plot (see Figure 2.8) using {ggplot2}
, we need need the geom_mosaic()
function in the {ggmosaic}
package. The first step is to discretize the variable Age into 10-year bins. To produce our desired visual arrangement (the highest category “6” in grey at the bottom), we need to reverse the order of the response categories (using fct_rev()
) and then alse reverse the order of the fill colors using the argument direction
in the function scale_fill_brewer()
.
|> mutate(
d age_bin = factor(
cut(age,
seq(10, 90, by = 10)))) |>
mutate(
rating = fct_rev(rating)
|>
) count(rating, age_bin) %>%
ggplot() +
geom_mosaic(aes(
weight = n,
x = product(age_bin),
fill = rating)) +
scale_fill_brewer(
palette = "RdGy",
direction = -1)
The following code produces the line plot version of this area chart (see Figure 2.9).
|>
d ggplot(aes(
x = age,
after_stat(count),
color = rating)) +
geom_density(
position = "fill",
adjust = 1.5,
outline.type = "upper") +
scale_colour_brewer(
palette = "RdGy") +
ylim(0, 1)
6.2 Ordered regression models
6.2.1 Model fitting
We can fit cumulative models using the function the {ordinal}
package. It offers two functions for model fitting:
clm()
for simple modelsclmm()
for mixed-effects models (more recent version)
The function clmm2()
is an older version for mixed-effects models, which currently offers extended functionality but will eventually be superseded by clmm()
.
We start by loading the data and make sure that the outcome variable rating
is represented as an ordered factor.
<- read_tsv(
d_parcel here("data/analysis_data",
"malta_data_parcel.tsv"))
$rating <- factor(d_parcel$rating) d_parcel
The following code fits a model to the parcel-package data with a single predictor variable, Date of birth. We use the argument threshold
to request symmetric thresholds.
<- clm(
m ~ dob_centered,
rating data = d_parcel,
threshold = "symmetric")
We can print a regression table using summary()
:
summary(m)
formula: rating ~ dob_centered
data: d_parcel
link threshold nobs logLik AIC niter max.grad cond.H
logit symmetric 193 -263.16 534.32 5(1) 5.40e-13 1.9e+01
Coefficients:
Estimate Std. Error z value Pr(>|z|)
dob_centered -1.212 0.242 -5.008 5.49e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Threshold coefficients:
Estimate Std. Error z value
central.1 -0.55013 0.17209 -3.197
central.2 0.44757 0.17158 2.609
spacing.1 0.38966 0.06784 5.744
The function Anova()
in the {car}
package (Fox and Weisberg 2019) produces an analysis of deviance table:
::Anova(m, type="III") car
Analysis of Deviance Table (Type III tests)
Response: rating
Df Chisq Pr(>Chisq)
dob_centered 1 25.085 5.486e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
There are other functions for model comparison and evaluation. Nested models can be compared using anova()
, and stepwise model selection can be done with drop1()
and add1()
.
6.2.2 The parallel regression assumption
In R, we can run the LR test as follows. We first fit the two models using the clm()
function from the package {ordinal}
(Christensen 2023). To fit the non-parallel model, we use the additional argument nominal =
, where we specify the predictor(s) for which we want to fit freely varying coefficients.
<- clm(
m_np ~ 1,
rating nominal = ~ dob_centered,
data = d_parcel,
threshold = "symmetric")
Then we compare the two models using the function anova()
:
anova(m, m_np)
Likelihood ratio tests of cumulative link models:
formula: nominal: link: threshold:
m rating ~ dob_centered ~1 logit symmetric
m_np rating ~ 1 ~dob_centered logit symmetric
no.par AIC logLik LR.stat df Pr(>Chisq)
m 4 534.32 -263.16
m_np 6 533.15 -260.58 5.1627 2 0.07567 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
It is currently not possible to fit non-parallel mixed-effects models using the package {ordinal} – the clmm()
function does not have the nominal
argument.
6.2.2.1 Model comparison using information criteria
We can use the functions AIC()
and BIC()
to obtain information measures for both models.
BIC(m)
[1] 547.3666
BIC(m_np)
[1] 552.7294
6.2.2.2 Comparison of coefficients across cutpoint equations
To compare coefficients across cutpoint equations, we need to create four new indicator variables that represent the binary splits underlying cumulative models (see Figure 3.1). We add these to the data frame.
$split_1 <- ifelse(as.numeric(d_parcel$rating) <= 1, 0, 1)
d_parcel$split_2 <- ifelse(as.numeric(d_parcel$rating) <= 2, 0, 1)
d_parcel$split_3 <- ifelse(as.numeric(d_parcel$rating) <= 3, 0, 1)
d_parcel$split_4 <- ifelse(as.numeric(d_parcel$rating) <= 4, 0, 1) d_parcel
Then we fit a logistic regression model to each cutpoint equation:
<- glm(
m_s1 ~ dob_centered,
split_1 data = d_parcel,
family = binomial(link = "logit"))
<- glm(
m_s2 ~ dob_centered,
split_2 data = d_parcel,
family = binomial(link = "logit"))
<- glm(
m_s3 ~ dob_centered,
split_3 data = d_parcel,
family = binomial(link = "logit"))
<- glm(
m_s4 ~ dob_centered,
split_4 data = d_parcel,
family = binomial(link = "logit"))
The coefficients can then be compared in tabular and/or graphical form. We can use the function coef()
to access the coefficients of interest, and the function confint()
to obtain confidence intervals for them. The following code produces a simpl table (see Table 3.3).
data.frame(
split = 1:4,
estimate = round(
c(coef(m_s1)[2],
coef(m_s2)[2],
coef(m_s3)[2],
coef(m_s4)[2]),
2),
ci_lower = round(
c(confint(m_s1)[2,1],
confint(m_s2)[2,1],
confint(m_s3)[2,1],
confint(m_s4)[2,1]),
2),
ci_upper = round(
c(confint(m_s1)[2,2],
confint(m_s2)[2,2],
confint(m_s3)[2,2],
confint(m_s4)[2,2]),
2))
split estimate ci_lower ci_upper
1 1 -0.83 -1.43 -0.29
2 2 -1.32 -1.96 -0.76
3 3 -1.35 -1.92 -0.82
4 4 -1.38 -1.96 -0.84
6.2.2.3 Comparison of predicted probabilities
We use the emmeans()
function to obtain average predicted response probabilities for models m
and m_np
.
<- emmeans(
ap_m
m, ~ rating | dob_centered,
mode = "prob",
at = list(
dob_centered = seq(-1.4, 1, by = .1))
|>
) data.frame()
<- emmeans(
ap_m_np
m_np, ~ rating | dob_centered,
mode = "prob",
at = list(
dob_centered = seq(-1.4, 1, by = .1))
|>
) data.frame()
Then we draw the graph using {ggplot2}
(see Figure 3.6).
|>
ap_m ggplot(aes(
x = dob_centered,
y = prob,
color = rating)) +
geom_line() +
geom_line(data = ap_m_np, linetype = "22") +
scale_colour_brewer(palette = "RdBu")
Extract category-specific probabilities from model.
emmeans(
m_np, ~ rating | dob_centered,
mode = "prob",
at = list(
dob_centered = 0)
)
Extract cumulative probabilities from model.
emmeans(
m_np, ~ dob_centered | cut,
mode = "cum.prob",
at = list(
dob_centered = 0)
)
Extract exceedance probabilities from the model:
emmeans(
m_np, ~ dob_centered | cut,
mode = "exc.prob",
at = list(
dob_centered = 0)
)
6.2.3 Model diagnostics
The presid()
function in the {PResiduals}
package can be used to calculate probability-scale residuals. Since presid()
does not (yet) work on models fit using the package {ordinal}
, we write a function that extracts probability-scale residuals from models fit using clm()
:
<- function(object, ...) {
presid.clm <- switch(object$link,
pfun logit = plogis,
probit = pnorm,
cloglog = pGumbel,
cauchit = pcauchy)
<- object$nobs
n <- length(object$Theta)
q <- as.numeric(model.matrix(object)$X[,-1] %*% coef(object)[-(1:length(object$alpha))])
lp
<- cbind(0, matrix(pfun(matrix(object$Theta, n, q, byrow = TRUE) - lp),, q), 1)
cumpr <- as.integer(object$y)
y <- cumpr[cbind(seq_len(n), y)]
lo <- 1 - cumpr[cbind(seq_len(n), y+1L)]
hi <- lo - hi
res <- naresid(object$na.action, res)
res return(res)
}
6.3 Methods of interpretation
We now look at how to apply different methods of interpretation using R. The following sections are structured identically to the ones in Chapter 5. The R code given here produces figures and tables similar to the ones in that earlier chapter. The relevant items are cross-referenced. To see the intended output, hover over the hyperlink – the image or table will then appear.
This is the model we will concentrate on:
<- clm(
m ~ dob_c + gender,
rating data = d_parcel,
threshold = "symmetric")
6.3.1 Predictions on the probability scale (R)
6.3.1.1 Predictions for each observed unit (R)
In R, the in-sample predicted probabilities for the outcome categories can be computed with the function predictions()
in the {marginaleffects}
package.
predictions(m)
The following code produces the dot diagram in Figure 5.4. note how piping is used to pass on the output of the predictions()
function to the ggplot()
function. The function geom_dotplot()
draws a dot diagram. We manually changed the argument binwidth
, since the default setting (30 bins) is too coarse for the large number of dots that need to be displayed.
predictions(m) |>
ggplot(aes(x = estimate, y = group)) +
geom_dotplot(
binwidth = .005)
To create a by-category summary table (Table 5.1), we use the function group_by()
to identify the subgroups we wish to compare (here: the response categories, which are generically referred to as “group” by the predictions()
function). Then we use the function summarize()
in the {dplyr}
package to obtain the summary statistics we need.
predictions(m) |>
group_by(group) |>
::summarize(
dplyrMean = mean(estimate),
SD = sd(estimate),
Min = min(estimate),
Max = max(estimate)
)
6.3.1.2 Unit-level predictions (R)
6.3.1.2.1 Predictions at specified values (R)
We can use the predictions()
function to obtain predictions at specified values. We start by creating a “grid”, i.e. data frame which holds this specific profile we are interested in.
<- data.frame(
grid dob_c = 0,
gender = "f")
grid
dob_c gender
1 0 f
By supplying this grid to the newdata
argument in the predictions()
function, we can predict the response probabilities for this profile:
predictions(m, newdata = grid)
The same can be done using the {emmeans}
package:
emmeans( # compute from
# model m
m, ~ rating, # category-specific predictions
mode = "prob", # on the probability scale
at = list( # for the following profile:
dob_c = 0, # - date of birth = 0 (year 1975)
gender = "f")) # - gender = female
The datagrid()
function in the {marginaleffects}
package makes it easy to define different profiles. Note how the datagrid()
function returns all combinations of the specified values:
<- datagrid(
grid gender = c("m", "f"),
dob_c = c(-1.4, -0.6, 0.2, 1),
model = m)
grid
gender dob_c rowid
1 m -1.4 1
2 m -0.6 2
3 m 0.2 3
4 m 1.0 4
5 f -1.4 5
6 f -0.6 6
7 f 0.2 7
8 f 1.0 8
We then pass the new grid
to the predictions()
function, to obtain the predicted probabilities summarized in Table 5.2.
predictions(
m, newdata = grid)
We can also use shortcut functions: range
to define the endpoints of the date-of-birth distribution, and unique
to include all attested levels (or values) of Gender:
<- datagrid(
grid gender = unique,
dob_c = range,
model = m)
grid
gender dob_c rowid
1 f -1.4 1
2 f 1.0 2
3 m -1.4 3
4 m 1.0 4
6.3.1.2.2 Predictions for the “average unit” (R)
When locating a central and/or representative point in the predictor space, {marginaleffects}
uses the mean for all numeric variables, and the mode for all categorical variables. The package allows us to pass the "mean"
shortcut to the newdata
argument, which returns predictions for a (hypothetical) female speaker born in 1973.34 (Table 5.3):
predictions(
m, newdata = "mean")
6.3.1.2.3 Predictions for ideal types (R)
To form predictions for the two ideal types (Table 5.4), we specify the relevant profiles using the newdata
argument of the predictions()
function:
predictions(
m, newdata = datagrid(
dob_c = -1.4,
gender = "m"
))
predictions(
m, newdata = datagrid(
dob_c = 1,
gender = "f"
))
6.3.1.2.4 Extra: Plotting unit-level predictions using {marginaleffects}
Male and female speaker born in 2000:
plot_predictions(
m,condition = list(
"gender", "group", "dob_c" = 1)) +
scale_y_continuous(
limits = c(0, 1), expand = c(0, 0),
breaks = c(0, .5, 1), label = c("0", ".5", "1")) +
scale_color_manual(values = fill_cols_label) +
theme_ls() +
ylab("Probability") + xlab(NULL) +
labs(caption = "Adjusted to: Date of birth = 2000.") +
theme(strip.text = element_blank())
Full range of conditions:
plot_predictions(
m,condition = c("dob_c", "gender", "group")) +
scale_y_continuous(
limits = c(0, 1), expand = c(0, 0),
breaks = c(0, .5, 1), label = c("0", ".5", "1")) +
scale_x_continuous(breaks = c(-1, 0, 1)) +
facet_grid(. ~ group) +
theme_ls() +
ylab("Probability") +
xlab("Date of birth (rescaled)")
6.3.2 Average predictions (R)
6.3.2.1 Unit-level predictions: as-observed vs. specified values (R)
By default, the function avg_predictions()
uses the as-observed approach and calculates the average prediction over the sample. In {marginaleffects}
terminology, such average predictions are said to have been obtained using an empirical grid.
avg_predictions(m)
Rather than use the observed distribution for the variable Gender, we may take control of the value(s) at which to evaluate predictions. For instance, to treat the two levels (male and female) equally, we need to supply this variable to the newdata
argument. All variables that are not specified in the newdata
argument will still be treated as-observed. The grid is said to be partly synthetic, i.e. partly manually specified.
avg_predictions(
m,newdata = datagrid(
gender = unique
))
We can also determine the value of Age at which unit-level prediction are to be formed. By also defining the values of Age we are specifying a fully synthetic grid.
avg_predictions(
m,newdata = datagrid(
dob_c = seq(-1.4, 1, by = .8),
gender = unique
))
We have created two custom functions for handling quantitative predictors with datagrid()
. These allow you to select specific points of the observed (empirical) distribution of numeric variables:
sd_mean_sd()
returns three values: the mean surrounded by two values at +/- 1 standard deviation. The functionmad()
is used to obtain a robust version of the SD, and a 10% trimmed mean is used by default.decile_midpoints()
returns the midpoints of n bins, each of which contain roughly the same number of cases in the estimation sample.
6.3.2.2 Average predictions for a quick overview of predictor importance (R)
We can compute average predictions for a specific predictor using the avg_predictions()
function. The argument variables
specifies the predictor of interest, and the argument newdata
controls the teatment of the peripheral variables, i.e. it defines the locations of the data space over which predictions are averaged. The average predictions reported in Table 5.6 for the variable Gender can be obtained as follows:
avg_predictions(
m,variables = "gender",
newdata = datagrid(
dob_c = seq(-1.4, 1, length = 7)
))
For quantitative variables, we have to specify the values of interest in a list. Otherwise avg_predictions()
will only return a single average prediction calculated at the in-sample mean of the variable.
avg_predictions(
m,variables = list(
"dob_c" = c(-1.4, -.2, 1)),
newdata = datagrid(
gender = unique
))
6.3.2.3 Average predictions for subgroups in the data (R)
To compute average predictions for the subgroups using the same as-observed values for all groups we need to use the argument variables
in the avg_predictions()
function. The “as-observed, identical” average predictions are obtained as follows:
avg_predictions(
m,variables = "gender")
If we instead use the by
argument, the peripheral variables are again treated as-observed, but a separate set of observed values is used for each subgroup defined by the by
argument. The “as-observed, different” average predictions are obtained as follows:
avg_predictions(
m,by = "gender")
For average predictions that give equal weight to cohorts, we can use the newdata
argument to specify the age values at which to generate predictions.
avg_predictions(
m,variables = "gender",
newdata = datagrid(
dob_c = seq(-1.4, 1, by = .8)
))
6.3.2.4 Using local means for adjustment (R)
6.3.2.5 Extra: Plotting average predictions using {marginaleffects}
The {marginaleffects}
package makes it easy to visualize average predictions.
Average predictions by Gender, treating Date of birth as-observed:
plot_predictions(
m, by = c("gender", "group"))
Average predictions by Gender, treating Date of birth as-balanced:
plot_predictions(
m, by = c("gender", "group"),
newdata = datagrid(
gender = unique,
dob_c = seq(-1.4, 1, .8)))
Average predictions by Date of birth, treating Gender as-balanced:
plot_predictions(
m, by = c("dob_c", "group"),
newdata = datagrid(
gender = unique,
dob_c = seq(-1.4, 1, .05)))
6.3.3 Comparisons (R)
6.3.3.1 Comparisons for each observed unit (R)
The function comparison()
in the {marginaleffects}
package can be used to make comparisons. The targeted variable is specified with the argument variable
. By default, all other variables are treated as-observed, which means that a comparison is computed for each observation in the sample. The following function call therefore returns a large table of differences in predicted probability, which compare the predictions obtained for the male and the female version of each subject in the sample.
comparisons(
m,variables = "female")
To draw a dot diagram (Figure 5.12), we run the following code:
comparisons(
m,variables = "gender") |>
ggplot(aes(x = estimate, y = group)) +
geom_dotplot(binwidth = .001) +
geom_vline(xintercept = 0)
For numeric variables, the default behavior of the comparisons()
function is to use the observed values of the targeted numeric variable: The variable is changed from its observed value to its observed value plus 1.
comparisons(
m,variables = "dob_c") |>
ggplot(aes(x = estimate, y = group)) +
geom_dotplot(binwidth = .001) +
geom_vline(xintercept = 0)
6.3.3.2 Comparisons for numeric predictors (R)
We will consider the options shown in Figure 5.10 in turn, from top to bottom. The output is not shown here. Note that the first two strategies return a batch of predictions, one for each observation in the estimation sample:
- Default: (observed + 1) – observed
comparisons(
m,variables = "dob_c",
newdata = datagrid(
gender = unique))
- (observed + 0.5) – observed
comparisons(
m,variables = list(
"dob_c" = 0.5),
newdata = datagrid(
gender = unique))
- –1 -> 1
comparisons(
m,variables = list(
"dob_c" = c(-1, 1)),
newdata = datagrid(
gender = unique))
- (mean + SD) – (mean – SD)
comparisons(
m,variables = list(
"dob_c" = "2sd"),
newdata = datagrid(
gender = unique))
- (mean + SD/2) – (mean – SD/2)
comparisons(
m,variables = list(
"dob_c" = "sd"),
newdata = datagrid(
gender = unique))
- upper quartile – lower quartile
comparisons(
m,variables = list(
"dob_c" = "iqr"),
newdata = datagrid(
gender = unique))
- maximum – minimum
comparisons(
m,variables = list(
"dob_c" = "minmax"))
For the intermediate approach illustrated in Figure 5.11, we need to specify two things: (i) the span we want to consider, which is done in the same way as above; and (ii) the base values, i.e. values of the numeric variable to which the span will be added:
comparisons(
m,variables = list(
"dob_c" = 30/25),
newdata = datagrid(
dob_c = seq(-1.4, -.2, length = 4)
) )
To evaluate the comparison at custom locations, we can also use functions to supply specific values of the distribution. To apply the comparison to the restricted range, from the 5th to the 95th percentile (see Long and Freese 2014, 250–51), we could type the following:
comparisons(m,
variables = list(
"dob_c" = c(
quantile(d$dob_c, .05),
quantile(d$dob_c, .95)
) ))
6.3.3.3 Comparisons at specified values (R)
comparisons(
m,variables = "gender",
newdata = datagrid(
dob_c = 0))
6.3.3.4 Comparisons for the “average unit” (R)
We can set peripheral variables equal to their means (numeric predictors) or modes (categorical predictors) by specifying the argument newdata = "mean"
:
comparisons(
m,variables = "gender",
newdata = "mean")
6.3.3.5 Comparisons for ideal types (R)
6.3.4 Average comparisons (R)
The {marginaleffects}
package computes average comparisons by first calculating unit-level comparisons on a specified scale and then averaging over these.
6.3.4.1 Average comparisons: Observed vs. specified values (R)
By default, the avg_comparisons()
function treats peripheral variables as-observed, which means that the comparison is calculated for each observation in the data set and then averaged. For binary variables, it considers a change from one level to the other. The values we obtain are those in Table 5.11 (left-hand side).
avg_comparisons(
m,variables = "gender")
We can graph these with the following code (see Figure 5.15).
avg_comparisons(
m,variables = "gender") |>
ggplot(aes(x = estimate, y = group)) +
geom_point() +
geom_linerange(aes(xmin = conf.low, xmax = conf.high)) +
geom_vline(xintercept = 0)
To produce average comparisons over specified values of the peripheral variable Date of birth, we need to specify its values using the newdata
argument (see right-hand side of Table 5.11).
avg_comparisons(
m,variables = "gender",
newdata = datagrid(
dob_c = seq(-1.4, 1, length = 7)
))
6.3.4.2 Average comparisons for a quick overview of predictor importance (R)
The summaries reported in Table 5.12 can be obtained as follows. To calculate average comparisons for the variable Gender, treating Date of birth as-balanced, we need to pass the nae of the tageted variable gender
to the variables
argument, and manually define the values of Date of birth using the newdata
argument.
avg_comparisons(
m,variables = "gender",
newdata = datagrid(
dob_c = seq(-1.4, 1, length = 7)
))
To calculate average comparisons for Date of birth using the intermediate approach outlined in Figure 5.11, we need to specify the step size in a list passed to the argument variables
. In the newdata
argument, we define the start locations of Date of birth.
avg_comparisons(
m,variables = list(
"dob_c" = 30/25),
newdata = datagrid(
dob_c = seq(-1.4, -.2, length = 4),
gender = unique
))
Instead of treating a peripheral variable as-observed, we can determine its values manually:
avg_comparisons(
m, variables = "gender")
avg_comparisons(
m, by = "gender")
avg_comparisons(
m, variables = "female",
newdata = datagrid(
dob_c = quantile_midpoints
))
6.3.4.3 Average comparisons for subgroups in the data (R)
Average comparisons for Date of birth, for Gender subgroups
- DOB: as-observed (+1)
- Gender: treated identically
avg_comparisons(
m,variables = "dob_c",
by = "gender")
Average comparisons for Date of birth, for Gender subgroups
- DOB: -1 => +1
- Gender: treated identically
avg_comparisons(
m,variables = list(
"dob_c" = c(-1, 1)),
by = "gender")
Average comparisons for Date of birth, for Gender subgroups
- DOB: intermediate approach
- Gender: treated identically
avg_comparisons(
m,variables = list(
"dob_c" = 30/25),
newdata = datagrid(
dob_c = seq(-1.4, -.2, length = 4),
gender = unique),
by = "gender")
6.3.5 Predictions on the latent-variable scale (R)
Since model-based predictions on the latent-variable scale only make sense in the context of the thresholds, we have written two custom functions for adding thresholds and category labels to {ggplot}
graphs:
add_thresholds()
add_threshold_labels()
<- function(
add_thresholds
thresholds, col_lines = "black",
vertical = FALSE){
<- as.numeric(thresholds) - mean(as.numeric(thresholds))
thresholds_c
if (vertical == FALSE){
::geom_hline(
ggplot2yintercept = thresholds_c,
color = col_lines
)else {
} ::geom_vline(
ggplot2xintercept = thresholds_c,
color = col_lines
)
}
}
<- function(
add_threshold_labels
thresholds, padding_labels = .1,
x_location = 1,
adj_label = 1,
col_labels = "black",
custom_labels = NULL,
size_labels = 3,
vertical = FALSE){
<- as.numeric(thresholds) - mean(as.numeric(thresholds))
thresholds_c
<- length(thresholds)
n_thresholds
if (is.null(custom_labels)){
<- c(
category_labels str_split(dimnames(thresholds)[[2]][1:n_thresholds], pattern = "\\|", simplify = TRUE)[,1],
str_split(dimnames(thresholds)[[2]][1:n_thresholds], pattern = "\\|", simplify = TRUE)[n_thresholds,2]
)else {
} <- custom_labels
category_labels
}
<- c(
location_labels min(thresholds_c) - diff(range(thresholds_c))*padding_labels,
::rollmean(thresholds_c, k = 2),
zoomax(thresholds_c) + diff(range(thresholds_c))*padding_labels)
if (vertical == FALSE){
::annotate(
ggplot2"text",
x = x_location,
y = location_labels,
label = category_labels,
adj = adj_label,
color = col_labels,
size = size_labels
)else {
} ::annotate(
ggplot2"text",
y = x_location,
x = location_labels,
label = category_labels,
adj = adj_label,
color = col_labels,
size = size_labels
)
} }
6.3.5.1 {emmeans}
package
Unfortunately, the {marginaleffects}
package does not allow us to produce predictions on the latent-variable scale. We must therefore switch to the {emmeans}
package (Lenth 2024), which offers similar functionality to {marginaleffects}
package, but differs in a few key respects. The most important difference (apart from syntax) concerns the default treatment of peripheral variables.
As we have seen above, {marginaleffects}
treats peripheral variables as-observed, unless we actively specify their values. {emmeans}
, on the other hand, defaults to an as-balanced treatment of categorical variables. This means that it returns simple averages, giving equal weight to all levels of the variable. The default treatment of continuous predictors, on the other hand, is to hold them at their in-sample average. Importantly, {emmeans}
does not support as-observed predictions. The key differences are summarized in Figure 6.1.
6.3.6 Predictions for each observed unit (R)
The {emmeans}
package does not allow us to generate as-observed predictions. This means that we must construct them manually:
<- data.frame(
po_lv_m estimate = as.numeric(
t(as.matrix(coef(m)[4:5])) %*% t(model.matrix(m)$X[,-1]) - mean(m$Theta)))
Then we can plot them using a dot diagram. The custom functions (defined above) add_thresholds()
and add_threshold_labels()
are used to annotate the graph.
|>
po_lv_m ggplot(aes(x = estimate)) +
geom_dotplot(
method = "histodot",
binwidth = .09) +
add_thresholds(m$Theta, vertical = TRUE) +
add_threshold_labels(m$Theta, vertical = TRUE)
6.3.7 Unit-level predictions (R)
While {emmeans}
is not designed to generate unit-level predictions, we can nevertheless force it to avoid averaging by manually specifying value for all predictor variables in the model. Let’s say we wish to obtain predictions for a male speaker born in 2000. We pass the variables that are used to define the condition to the spec
argument and then, critically, we specify the values for both in the at
argument. To compute a prediction for a male speaker born in 2000, we run the following:
emmeans(
m,spec = c("dob_c", "gender"),
at = list(
dob_c = 1,
gender = "m"))
To add conditions, we specify additional values in the at
argument. To obtain predictions for both a male and female speaker born in 2000:
emmeans(
m,spec = c("dob_c", "gender"),
at = list(
dob_c = 1,
gender = c("f", "m")))
And we can also add predictions for two additional years (1950 and 1975):
emmeans(
m,spec = c("dob_c", "gender"),
at = list(
dob_c = c(-1, 0, 1),
gender = c("f", "m")))
6.3.8 Average predictions (R)
We can use the emmeans()
function to get average predictions for male and female speakers. If we do not specify a birth year, we get at-means predictions, based on the in-sample average.
emmeans(
m,spec = "gender")
Since the distribution of this variable is not representative of the target population, we avoid sample statistics and instead ask for predictions for the year 1975:
emmeans(
m,spec = "gender",
at = list(
dob_c = 0))
We can likewise obtain average predictions for the variable Date of birth. By default, {emmeans}
treats Gender (the peripheral categorical variable) as-balanced, giving the same weight to male and female speakers. This is what we want. When requesting average predictions for numeric predictors, we also need to specify the locations we wish to consider. Otherwise we will only get a single estimate for the in-sample mean. Here, we choose 20-year steps:
emmeans(
m,spec = "gender",
at = list(
dob_c = c(-1.4, -.6, .2, 1)))
If we wish to instead obtain weighted averages over peripheral categorical variables, we can use the weights
argument. Setting it to "proportional"
, the levels are weighted according to their representation in the estimation sample.
emmeans(
m,spec = "dob_c",
at = list(
dob_c = c(-1.4, -.6, .2, 1)),
weights = "proportional")
6.3.9 Unit-level comparisons (R)
To obtain unit-level comparisons, the first step is to create unit-level predictions. For instance, to compare female and male speakers born in 2000, we use the emmeans()
function to calculate unit-level predictions. We assign these to a new object ap_gender
:
<- emmeans(
ap_gender
m,spec = "gender",
at = list(
dob_c = 1))
ap_gender
gender emmean SE df asymp.LCL asymp.UCL
f -0.878 0.247 Inf -1.36 -0.395
m -1.462 0.258 Inf -1.97 -0.957
Confidence level used: 0.95
To compare these predictions, use the function pairs()
:
pairs(ap_gender)
contrast estimate SE df z.ratio p.value
f - m 0.584 0.27 Inf 2.166 0.0303
The output of pairs()
does not include confidence intervals. To obtain these, we have to wrap the function confint()
around it:
confint(
pairs(ap_gender))
contrast estimate SE df asymp.LCL asymp.UCL
f - m 0.584 0.27 Inf 0.0555 1.11
Confidence level used: 0.95
By default, pairs()
runs all pairwise comparisons among predictions contained in the objects created by emmeans()
. We can simplify the output with the argument simple
. For instance, if we wish to obtain differences between male and female speakers at different birth years, we start by forming a set of 6 predictions:
<- emmeans(
ap_gender
m,spec = c("dob_c", "gender"),
at = list(
dob_c = c(-1, 0, 1),
gender = c("f", "m")))
ap_gender
dob_c gender emmean SE df asymp.LCL asymp.UCL
-1 f 1.490 0.363 Inf 0.778 2.201
0 f 0.306 0.195 Inf -0.076 0.688
1 f -0.878 0.247 Inf -1.361 -0.395
-1 m 0.906 0.380 Inf 0.161 1.650
0 m -0.278 0.217 Inf -0.704 0.147
1 m -1.462 0.258 Inf -1.968 -0.957
Confidence level used: 0.95
We are not interested in all pairwise comparisons, only in the comparisons between the levels of Gender, but for each birth year. The argument simple
allows us to specify the focal variable for the contrast, which gives us what we want:
pairs(
ap_gender, simple = "gender")
dob_c = -1:
contrast estimate SE df z.ratio p.value
f - m 0.584 0.27 Inf 2.166 0.0303
dob_c = 0:
contrast estimate SE df z.ratio p.value
f - m 0.584 0.27 Inf 2.166 0.0303
dob_c = 1:
contrast estimate SE df z.ratio p.value
f - m 0.584 0.27 Inf 2.166 0.0303
The following code produces an interaction plot.
emmip(
m, ~ dob_c,
gender at = list(
dob_c = seq(-1.4, 1, by = .1)
),CIs = TRUE,
col = NA) +
add_thresholds(m$Theta, col_lines = "grey") +
add_threshold_labels(m$Theta, x_location = 1.2, adj_label = 1, col_labels = "grey") +
theme_classic() +
geom_ribbon(aes(ymin = LCL, ymax = UCL, fill = gender), alpha = 0.2, colour = NA) +
geom_line(lty = 1)
6.3.10 Average comparisons (R)
Generating average comparisons works in a similar way. We first create an object containing average predictions and then use pairs()
to obtain relevant contrasts.
<- emmeans(
ap_dob
m,spec = "dob_c",
at = list(
dob_c = c(-1.4, 1)))
ap_dob
dob_c emmean SE df asymp.LCL asymp.UCL
-1.4 1.67 0.437 Inf 0.815 2.527
1.0 -1.17 0.213 Inf -1.588 -0.752
Results are averaged over the levels of: gender
Confidence level used: 0.95
pairs(ap_dob)
contrast estimate SE df z.ratio p.value
(dob_c-1.4) - dob_c1 2.84 0.58 Inf 4.903 <.0001
Results are averaged over the levels of: gender