5  Methods of interpretation

Under construction

These notes are currently under construction and may contain errors. Feedback of any kind is very much welcome: lukas.soenning@uni-bamberg.de

In this section, we discuss different methods of interpretation for ordered regression models. We only briefly touch upon the use of regression coefficients and instead focus mainly on two strategies: (i) the use of predicted probabilities, which applies to any type of ordinal regression model; and (ii) the interpretation on the latent-variable scale, which only applies to the parallel cumulative model.

While the methods of interpretation we discuss in this section are quite general, we apply them here to regression models without a random-effects structure. Mixed-effects ordinal regression models are the topic of the next section, which shows how methods of interpretation need to be adapted or extended when applied to such models.

Before we go further, we should note that model interpretation based on predicted probabilities involves a considerable number of additional complications, due to the non-linearity of the probability scale. The reasons for this are discussed at the beginning of Section 5.2. Model interpretation on the latent-variable scale, on the other hand, is more straightforward and readily extends to mixed-effects models.

For illustration, we will use the lexical preference ratings for the pair package – parcel. We start by loading the data:

d <- read_tsv(
  here("data/analysis_data", 
       "malta_data_parcel.tsv"))

Then we fit the model using the clm() function from the {ordinal} package. Given the layout of the response scale, we choose symmetric thresholds:

m <- clm(
  rating_int ~ dob_c + gender, 
  data = d, 
  threshold = "symmetric")

This chapter is structured as follows. We start by outlining different methods of interpretation, dealing in turn with regression coefficients (Section 5.1), predictions on the probability scale (Section 5.2) and predictions on the latent scale (Section 5.3). Chapter 6 then shows how to apply these methods in R.

5.1 Regression coefficients

The output of an ordered regression model is a table of coefficients. On their own, these typically provide limited and rather opaque information about the relations in the data. Here is the printout for our current model:

formula: rating_int ~ dob_c + gender
data:    d

 link  threshold nobs logLik  AIC    niter max.grad cond.H 
 logit symmetric 193  -260.79 531.59 5(1)  3.82e-13 2.8e+01

Coefficients:
        Estimate Std. Error z value Pr(>|z|)    
dob_c    -1.1840     0.2415  -4.903 9.44e-07 ***
genderm  -0.5843     0.2698  -2.166   0.0303 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Threshold coefficients:
          Estimate Std. Error z value
central.1 -0.81539    0.21223  -3.842
central.2  0.20360    0.20495   0.993
spacing.1  0.39671    0.06906   5.744

There are two ways of making coefficients of an ordered regression model more meaningful: standardization and, in the case of a logit link, re-expression as odds ratios. The aim of standardization is to establish comparability between coefficients. We will not cover standardized coefficients here. For more information, see Long (1997, 60–71, 127–30) and Long and Freese (2014, 180–83, 332–35).

When a logit link is used, odds ratios can be used for interpretation. Odds ratios are obtained through exponentiation of the regression coefficients. The coefficient for the variable Gender, for instance, is -0.58, where “female” is the reference level. This is a difference on the log odds scale. After exponentiation, this becomes an odds ratio of 0.56. This means that, for male speakers, the odds of responding above a particular threshold on the response scale are 0.56 times as high as for female speakers. Put differently, the odds for female speakers are higher by a factor of 1.79. For Date of birth, the difference in log odds associated with an increase of 25 years is -1.18, which corresponds to an odds ratio of 0.31.
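In R, these odds ratios can be obtained by exponentiating the estimated coefficients. A minimal sketch, assuming the model m fitted above:

```r
# Coefficients on the log odds scale
coef(m)

# Exponentiate the regression coefficients (not the thresholds)
# to obtain odds ratios
exp(coef(m)[c("dob_c", "genderm")])

# Profile-likelihood confidence intervals, re-expressed as odds ratios
exp(confint(m))
```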

Odds ratios can assist in the comparison of the relative strength of association between predictors and the outcome. It should be noted, however, that odds ratios are counterintuitive for many people, which is why we should perhaps give preference to other strategies of interpretation.

Let us briefly consider the limitations of odds ratios and assume we have two proportions we wish to compare numerically: A (.40) and B (.20). These proportions are shown visually in Figure 5.1. Two straightforward ways of comparing .40 and .20 are to state (i) that the difference is .20 (or 20 percentage points); or (ii) that the proportion in group A is twice as high, or greater by a factor of 2.

Figure 5.1: Two proportions that are to be compared.

To compare these proportions using an odds ratio, on the other hand, we first need to convert them into odds. We obtain the odds for group A by dividing the proportion (.40) by its complement, i.e. 1 – .40 = .60. The odds in group A are therefore .40/.60 = 0.67, and those in group B are .20/.80 = 0.25. The ratio of these odds is 0.67/0.25 = 2.67. This means that an odds ratio is a ratio of ratios of probabilities – a rather obscure way of quantifying the difference between the two conditions.1
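The arithmetic can be traced in a few lines of R:

```r
p_A <- .40
p_B <- .20

# Convert each proportion into odds by dividing it by its complement
odds_A <- p_A / (1 - p_A)   # .40 / .60 = 0.67
odds_B <- p_B / (1 - p_B)   # .20 / .80 = 0.25

# The odds ratio is the ratio of these two odds
odds_A / odds_B             # 2.67

# Compare with the two more transparent summaries
p_A - p_B                   # difference: .20
p_A / p_B                   # ratio: 2
```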

It therefore makes sense to translate regression coefficients into more meaningful quantities. The natural metric for understanding categorical outcomes is probabilities, to which we now turn.

Cumulative proportions play a special role for ordered variables. Here are examples of communicating the association between a predictor and the outcome using probabilities:

  • “On average, the proportion of neutral responses was .32 among 20-year-olds and .17 among 60-year-olds, a difference of .15.”
  • “The proportion of responses above the neutral midpoint for the adjectival and quantifier function was .07 and .89, a difference of .82 in absolute terms.”
  • “For an average Australian native speaker, the probability of being “likely” to use heaps in a conversation with friends is .30. This compares with .01 in a conversation with a superior.”
  • “For female respondents, the probability of agreement (i.e. the categories “agree” and “strongly agree” combined) exceeds that of male respondents by a factor of 1.8.”

5.2 Predictions on the probability scale

Translating tables of coefficients into probabilities/proportions requires some effort, as the model output needs to be processed to arrive at such quantities. The R packages {marginaleffects} (Arel-Bundock, Greifer, and Heiss Forthcoming) and {emmeans} (Lenth 2024) greatly facilitate this task.

Before we go further, we must consider a fundamental issue that complicates the use of predicted probabilities for the interpretation of categorical regression models (see Long and Freese 2014, 133–36). While the scale on which the model is fit – the link scale – is linear, the data scale on which we interpret the model – the probability scale – is non-linear. To understand what is meant by “linear” vs. “non-linear” in this case, consider Figure 5.2. It shows a hypothetical analysis with two predictors, Age and Gender. Each panel shows two trend lines, one for male and one for female speakers. In the left panel, they are shown on the logit scale, where they are straight. This is what is meant by “linear”. The right-hand panel shows them after back-transformation to the probability scale, where they appear as curves and are therefore “non-linear”.

Figure 5.2: Predictions and comparisons on (a) the linear logit scale and (b) the non-linear probability scale.

The non-linearity of the probability scale is an issue if we are looking for a single number to summarize the difference between male and female speakers. The dotted grey lines in Figure 5.2 show this difference at four locations (20, 40, 60, and 80 years). On the logit scale (left panel), they are constant: The difference between male and female speakers is 1.5 logits, irrespective of age. This greatly simplifies the summary of the model. On the probability scale (right panel), on the other hand, we note that the difference between male and female speakers depends on which age group we are considering: For 20-year-olds, the difference is small (.06), for 80-year-olds much larger (.35). On the probability scale, then, the difference between male and female speakers depends on where in the data space (here: for which age group) we evaluate it.

The same problem applies to the comparison of age groups. The triangles in the left panel of Figure 5.3 show that speaker groups that are 20 years apart (i.e. moving 20 years horizontally) differ by the same amount, i.e. a vertical step of 1.5 units on the logit scale. It does not matter which age groups we compare (80- vs. 60-year-olds or 40- vs. 20-year-olds) and it also does not matter whether we look at male or female speakers. All triangles look the same. The right-hand panel, on the other hand, shows that the difference in predicted probability associated with an age gap of 20 years depends on which part of the predictor space we look at. For male speakers aged 40 vs. 20, the difference is as small as .03. For female speakers aged 80 vs. 60, it is as large as .24. The difference therefore varies by gender and across the age range.
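This behavior is easy to reproduce with hypothetical numbers: a constant step of 1.5 logits maps onto quite different probability differences, depending on where on the logit scale it is located.

```r
# Hypothetical locations on the logit scale
baseline <- c(-4, -2, 0, 2)

# Back-transform to probabilities and take differences:
# the same 1.5-logit step yields unequal probability differences,
# largest near the middle of the scale and smallest in the tails
plogis(baseline + 1.5) - plogis(baseline)
```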

Figure 5.3: Predictions and comparisons on (a) the linear logit scale and (b) the non-linear probability scale.

The basic issue, then, is that when making comparisons on the probability scale, we must keep in mind that it matters where in the data space the comparison is evaluated. To get a sense of the average difference associated with certain predictor values, we therefore usually average over the distribution of differences on the probability scale.

The current section on the use of predicted probabilities for model interpretation is divided into four parts. We will first deal with predictions and then with comparisons between predictions. For each quantity, we distinguish two types: Predictions/comparisons made for individual units vs. averages over such unit-level estimates.

5.2.1 Predicted category-specific vs. cumulative probabilities

We can predict both types of probabilities for an ordinal outcome: category-specific probabilities Pr(Y = j), which refer to a single response category, and cumulative probabilities Pr(Y ≤ j), which refer to a category and all categories below it.
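With {ordinal}, both types can be obtained via predict(). The following sketch assumes the model m and data d from above; note that the response variable must be dropped from newdata to obtain the full set of probabilities:

```r
# Drop the response variable from the data
newdat <- subset(d, select = -rating_int)

# Category-specific probabilities P(Y = j): one column per category
predict(m, newdata = newdat, type = "prob")$fit

# Cumulative probabilities P(Y <= j)
predict(m, newdata = newdat, type = "cum.prob")
```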

5.2.2 Predictions for each observed unit

A useful first step is to compute predicted probabilities for the estimation sample, i.e. the data that were used to fit the model (see Long and Freese 2014, 138–39, 339–40). This means that we ask the model to generate predictions for each observation in the sample. An examination of this distribution of predictions is helpful for two reasons. First, the variation in predictions gives us an idea of the model’s capacity to capture variation in the response variable. If predictions vary little, the explanatory variables in our model are largely inert. If predicted values vary substantially, on the other hand, these variables effectively capture variation in the distribution of the responses. Further, the distribution of predictions may point to interesting or potentially problematic features in the data or model.
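With {marginaleffects}, unit-level predictions for the estimation sample are returned in long format, with one row per observation and response category. A sketch, assuming the model m from above:

```r
library(marginaleffects)

# One predicted probability per observation and response category;
# the column "group" identifies the response category
p <- predictions(m)
head(p)
```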

We can visualize the distribution of these predicted probabilities with a dot diagram. Note how the caption of Figure 5.4 clarifies how these predictions were obtained, i.e. how the predictor variables were handled. In the present case, the predictors are treated as-observed, which means that the observed values in the estimation sample are used to calculate predictions.

Figure 5.4: Predicted category-specific probabilities for the observations in the estimation sample.

This first assessment is useful for getting an impression of the predictive capacity of the model, i.e. the extent to which the observed variation in the predictor variables is linked to variation in the outcome probabilities. Figure 5.4 shows large variation in the extreme categories, where probabilities range from .15 to .70 (category 1) and .10 to .55 (category 5), suggesting that the predictors in our model are associated with considerable variation in the ordinal outcome. In such a dot diagram, we would also look for outliers, which may point to problems in the data (or model).

We can also provide summary statistics in a table. Table 5.1 reports the average predicted response probability as well as the in-sample variability of the response probabilities. Note again how the table caption clarifies how predictor variables were handled.

Response   Mean   SD    Min   Max
1          .38    .15   .05   .64
2          .09    .02   .02   .10
3          .21    .03   .11   .25
4          .07    .02   .04   .10
5          .25    .15   .09   .74
Note: Adjusted to: Date of birth (as-observed), Gender (as-observed).
Table 5.1: Summary of the distribution of the predicted category-specific probabilities for the observations in the estimation sample.
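Summary statistics of this kind can be computed from the unit-level predictions, e.g. with {dplyr}. A sketch, assuming the model m from above:

```r
library(marginaleffects)
library(dplyr)

# Summarize the unit-level predicted probabilities by response
# category; the column "group" holds the response category
predictions(m) |>
  summarise(
    Mean = mean(estimate),
    SD   = sd(estimate),
    Min  = min(estimate),
    Max  = max(estimate),
    .by  = group)
```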
Tip

If the distribution of predictor variables in our estimation sample is unbalanced, the data may not provide a good representation of the population of interest. In that case, obtaining predictions for the estimation sample is still of interest, though less informative. Predictions may instead be generated for a hypothetical set of observations, designed to provide a miniature version of the population of interest.

If the distribution of predictions for the observed data shows variation in the outcome probabilities, the next step is to obtain predictions at specific points in the predictor space. This allows us to understand the nature of this variation, i.e. which predictors show particularly strong association with the outcome.

5.2.3 Unit-level predictions

A prediction is made at the unit level if it is generated for a specific predictor profile, i.e. a combination of predictor values. While these predictor values may represent sample averages, the important point is that no averaging takes place for the predicted probabilities. This distinguishes unit-level predictions from average predictions.

The predictor profiles we use may be designed with different priorities in mind. We will outline three strategies for creating custom profiles: predictions (i) at specified values, (ii) for the “average unit”, and (iii) for ideal types.

5.2.3.1 Predictions at specified values

It is often informative to obtain predictions at substantively interesting values. We will follow Long and Freese (2014) and refer to a specified condition as a profile. A profile is a combination of predictor values and it describes a (hypothetical) unit – say, a female speaker born in 1975.

With only two predictors, we could decide to construct profiles that map out the predictor space, varying Gender across its observed levels and Date of birth from 1940 to 2000, in 20-year increments. These predictions are shown in Table 5.2, which shows that predicted probabilities vary considerably by Date of birth.

                         Female                           Male
Date of birth     1    2    3    4    5           1    2    3    4    5
1940             .05  .02  .11  .07  .74         .09  .04  .16  .09  .62
1960             .13  .05  .20  .10  .53         .21  .07  .24  .10  .38
1980             .27  .09  .25  .09  .30         .40  .10  .23  .07  .19
2000             .49  .10  .21  .06  .14         .64  .09  .16  .04  .09
Table 5.2: Predicted response probabilities for different combinations of Gender and Date of birth.
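Such a grid of predictions can be built with datagrid() from {marginaleffects}. The level codes for Gender and the dob_c values corresponding to the birth years 1940–2000 depend on how the variables were coded, so the values below are placeholders:

```r
library(marginaleffects)

# Predictions at all combinations of the specified predictor values
predictions(
  m,
  newdata = datagrid(
    gender = c("f", "m"),               # placeholder level codes
    dob_c  = c(-1.6, -0.8, 0, 0.8)))    # placeholders for 1940, 1960, 1980, 2000
```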

We can also visualize these predictions. Category-specific probabilities can be visualized with a grouped bar chart, which appears as Figure 5.5.

Figure 5.5: Grouped bar chart showing the predicted category-specific probabilities for different combinations of Gender and Date of birth.

Alternatively, we can show them using a line plot. As illustrated in Figure 5.6, we draw predictions for male and female speakers into different panels, or into the same panel.

Figure 5.6: Line plot showing the predicted category-specific probabilities for different combinations of Gender and Date of birth.

Cumulative probabilities can be shown using a stacked bar chart or a line plot.

Figure 5.7: Stacked bar chart showing the predicted cumulative probabilities for different combinations of Gender and Date of birth.

5.2.3.2 Predictions for the “average unit”

We may also wish to form predictions for what may be considered the “average unit” in the data, i.e. a central and/or representative point in the predictor space. The values at which predictors are being held are sometimes referred to as the base values or the specified values of the predictors. For quantitative variables, we may choose the mean (or median), for categorical variables the mode, i.e. the level with the highest frequency in the sample. In the current data, this would be a female speaker born in 1973.34. The predictions for this hypothetical “average” individual are shown in Table 5.3:

Response Pr(y) 95% CI
1 .30 [.22, .38]
2 .09 [.06, .12]
3 .25 [.18, .32]
4 .09 [.06, .12]
5 .28 [.20, .36]
Note: Adjusted to: Date of birth = −0.07 (mean), Gender = female (mode).
Table 5.3: Predicted response probabilities for the “average” speaker in the sample.
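This prediction is easy to reproduce, since datagrid() holds numeric predictors at their sample means and categorical predictors at their modes by default. A sketch, assuming the model m from above:

```r
library(marginaleffects)

# An empty call to datagrid() creates a single "average" profile:
# numeric predictors at their means, factors at their modes
predictions(m, newdata = datagrid())
```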

We may not be happy with setting categorical variables to their modal values, as we thereby essentially ignore all cases in the less well-represented group(s) (see Long and Freese 2014, 245). If we want to instead hold them at the average value they assume in the sample, we need to average over predictions made for male and female speakers. This is described in the section on aggregated predictions.2

It should be noted that the predicted probabilities for the “average unit” in the sample will differ from the average prediction for the units in the sample, i.e. the average over the predicted probabilities for all units.

5.2.3.3 Predictions for ideal types

If the predictor values at which predictions are calculated are chosen strategically, we may refer to them as ideal types. Long and Freese (2014, 270) note that ideal types are “particularly illustrative for interpretation when independent variables are substantially correlated”.3 This is because the predictor values may be chosen to represent realistic combinations. If predictions are formed for a particular subgroup in the data, one option is to use local means as base values for peripheral variables (see Long and Freese 2014, 270–80 for discussion).4

For illustration, we will define two ideal types:

  • A young female speaker (born in 2000)
  • An old male speaker (born in 1940)

Predicted category probabilities for these ideal types appear in Table 5.4.

           Young female speaker            Old male speaker
           (Gender = female,               (Gender = male,
            Date of birth = 2000)           Date of birth = 1940)
Response   Pr(y)   95% CI                  Pr(y)   95% CI
1          .09     [.01, .17]              .49     [.37, .61]
2          .04     [.01, .07]              .10     [.06, .13]
3          .16     [.07, .26]              .21     [.14, .27]
4          .09     [.05, .13]              .06     [.03, .08]
5          .62     [.40, .83]              .14     [.08, .21]
Table 5.4: Predicted response probabilities for two specific profiles (ideal types).

5.2.4 Average predictions

Instead of forming predictions for specific conditions, we often want to aggregate over predicted probabilities obtained at specified values. This aggregation is carried out on the probability scale. In this section, we will discuss two types of averages: An overall average that represents the estimation sample and/or the target population, and averages for subgroups in the data. Average predictions always aggregate over unit-level predictions, so a crucial question is how the underlying unit-level predictions are computed, i.e. which predictor profiles are considered.

5.2.4.1 Unit-level predictions: as-observed vs. specified values

There are two general strategies for designing the predictor profiles for which unit-level predictions are calculated and then averaged. For each predictor, we may either choose to specify values manually (specified values) or we may instead rely on its distribution in the estimation sample (as-observed values).

For predictors whose distribution in the sample closely corresponds to the distribution in the target population, it makes sense to consider using the as-observed approach, as this adds realism to predictions and their averages. Custom values can instead be specified for predictors whose distribution is unbalanced in the data set.

Consider, as an example, the distribution of the variables Date of birth and Gender in the current data set. The histograms in Figure 5.8 show that while subjects in the data are roughly balanced on Gender (53% female, 47% male), the variable Date of birth is distributed very unevenly: Younger speakers are overrepresented, with 70% of the individuals born 1980 or later. Since the sample is clearly not representative of the population, we should hesitate to treat Date of birth as-observed when calculating unit-level predictions that will feed into averages.

Figure 5.8: Distribution of the speakers in our sample by Date of birth and Gender.

In such cases, we may instead use specified values to control the representation of subgroups. While we often want to assign equal weights to all levels, there may be substantive reasons for weighting them differently. In the case of Gender in the current data, we would opt for equal weights.

For continuous predictors, there are a number of options, which may also be roughly divided into simple and weighted representations. In both cases, the aim is to find a handful of values that offer the representation we seek. If we are interested in assigning the same weight to values across the observed range, we may opt for equally-spaced locations. This approach makes sense for the variable Date of birth in the present setting, and we could use the set [1940, 1950, …, 2000] for adequate coverage of the observed age groups. Note that the specified values will usually not span the actual range of the numeric variable, since this span may be distorted by even a single outlier. Instead, a restricted range, say from the 5th to the 95th percentile makes sense (see Long and Freese 2014, p.).

For other numeric variables, it makes more sense to weight values in proportion to their representation in the data or the target population. A viable strategy is to use decile midpoints, which are illustrated in Figure 5.9. These locations are found by first dividing the target distribution into deciles, i.e. 10 bins with (nearly) the same number of cases. These deciles are delimited with grey vertical lines. The midpoints (i.e. the medians) of these bins are the decile midpoints – they appear as black dots below the histograms.
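Decile midpoints can be computed in a few lines of base R. A sketch, assuming the data d from above (the decile boundaries must be unique for cut() to work):

```r
decile_midpoints <- function(x) {
  # Split x into 10 equal-frequency bins ...
  breaks <- quantile(x, probs = seq(0, 1, by = .1))
  bins   <- cut(x, breaks = breaks, include.lowest = TRUE)
  # ... and return the median of each bin
  tapply(x, bins, median)
}

decile_midpoints(d$dob_c)
```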

Figure 5.9: Decile midpoints for a symmetric and a skewed distribution.

Let us apply the two strategies to our model. Table 5.5 shows that the choice of adjustment strategy does not matter much for Gender, but for Date of birth we note an appreciable effect on the category-specific probabilities.

            Gender: as-observed          Gender: specified values     Gender: specified values
            Date of birth: as-observed   Date of birth: as-observed   Date of birth: specified values
Response    Pr(y)  95% CI                Pr(y)  95% CI                Pr(y)  95% CI
1           .38    [.32, .44]            .37    [.30, .44]            .29    [.23, .34]
2           .09    [.06, .11]            .09    [.06, .13]            .07    [.04, .09]
3           .21    [.16, .27]            .24    [.17, .30]            .20    [.14, .25]
4           .07    [.05, .09]            .08    [.05, .10]            .08    [.05, .10]
5           .25    [.19, .31]            .23    [.17, .29]            .37    [.29, .46]
Note: Adjusted to: Gender = as-balanced (50% male, 50% female); Date of birth = as-balanced (1940, 1960, 1980, 2000).
Table 5.5: Predicted response probabilities using different adjustment strategies: as-observed vs. specified values.

We have seen that there are a number of different adjustment strategies, i.e. ways of handling those predictors over which we wish to average, or – more generally – which we wish to adjust for. Here is an overview:

as-observed

  • complete dataset
  • subset

specified values

  • single value
    • quantitative variables: mean/median (based on sample or population)
    • categorical variables: mode (based on sample or population)
  • multiple values
    • quantitative variables
      • balanced: at equal steps
      • weighted
        • at equal steps, then weighted (based on sample or population)
        • at quantiles (based on sample)
    • categorical variables
      • balanced: simple average
      • weighted: weighted average (based on sample or population)

When averaging over predictions on a non-linear scale, adjustment by using a single specified value for a peripheral variable (e.g. the sample mean or mode) is generally not recommended. This is because the predicted probability depends on the values of all predictors in the model. Holding them at their means (or modes) may not be representative of the characteristics of the observations in the sample. Many authors therefore prefer the as-observed approach (Long 1997, 74; Cameron 2005, 467; Greene and Hensher 2010, 143; Hanmer and Kalkan 2013, 3).

It should be noted that the recommendation of these scholars is to give preference to the as-observed approach over the specified-single-values approach. Our listing above shows that there is another option, the use of specified-multiple-values. This strategy retains the advantages of the as-observed approach by taking the entire data space into consideration. A further advantage of this strategy vis-à-vis observed values is that it does not force us to rely on sample characteristics when forming adjusted predictions. Instead, the specified values may be informed by external (or population) data. In settings where the sample does not provide a good representation of the distribution of certain variables in the target population, the specified-multiple-values approach is our preferred adjustment strategy.

5.2.4.2 Average predictions for a quick overview of predictor importance

Average predictions can also help us compare predictors in terms of their strength of association with the outcome. This may be done by forming two average predictions for each variable, which should give a good representation of their “effective range”. For binary inputs, it is natural to compute average predictions at the two levels. For continuous variables, we must find a span that may also be considered as reflecting the typical difference we observe between two units. This span may be determined empirically or on subject-matter grounds. Common choices for data-based spans are based on the standard deviation. Arguments can be made for a 1-SD or a 2-SD span (see Gelman 2007).

In our data, we use a hybrid approach. We will opt for a 30-year step, which may be considered as representing a generational shift in usage patterns. In addition, however, we are interested in how usage patterns differ across the entire range of birth years that are represented in our data. Accordingly, we will form average predictions at three locations: 1940, 1970, and 2000.
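Average predictions at specified focal values can be obtained with avg_predictions() from {marginaleffects}, which forms unit-level predictions at each specified value (holding the remaining predictors as-observed) and then averages them. The dob_c values corresponding to 1940, 1970, and 2000 depend on how the variable was centered and scaled, so the values below are placeholders:

```r
library(marginaleffects)

# Average predictions at three focal birth years
# (placeholder values for 1940, 1970, 2000)
avg_predictions(m, variables = list(dob_c = c(-1.6, -0.4, 0.8)))

# Average predictions at the two levels of Gender
avg_predictions(m, variables = "gender")
```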

Table 5.6 arranges these average predictions in a way that allows us to compare the two predictors in terms of their strength of association with the outcome.

                     Response category
Level             1     2     3     4     5
Date of birth
  1940           .07   .03   .14   .08   .68
  1970           .24   .08   .24   .09   .34
  2000           .56   .09   .18   .05   .11
Gender
  Female         .23   .07   .20   .08   .43
  Male           .33   .08   .21   .08   .31
Note: Adjusted to: Gender = as-balanced; Date of birth = as-balanced (1940, 1950, 1960, 1970, 1980, 1990, 2000).
Table 5.6: Average predictions for the variables in the model.

5.2.4.3 Average predictions for subgroups in the data

When averaging over predictions at the two levels of Gender, there are different options for handling the peripheral variable Date of birth. Two general strategies are (i) the as-observed approach, which uses the values recorded in the estimation sample, and (ii) the use of specified values. When dealing with subgroups of the data, a further question is whether the peripheral variables should be treated identically or differently in the two subgroups. When using specified values, for instance, custom values may be chosen within each subgroup. And for the as-observed approach, the peripheral variables can likewise be adjusted to the values they assume in the relevant subset of the data.

Table 5.7 compares the results of these different averaging strategies. In the first set of average predictions (as-observed, identical), the row “Female” contains the average predicted response probabilities for a scenario in which all respondents in our estimation sample are female (irrespective of their actual gender). The row “Male”, on the other hand, shows the average estimates supposing that all respondents are male.

In the second set of average predictions (as-observed, different), the row “Female” contains the average predicted response probabilities for a group of female speakers with a date-of-birth distribution that is identical to the one in the current sample of female speakers. The same applies to male speakers.

The final set of predictions averages over predictions made for specified values of Date of birth. Since these are 10-year increments, equal weight is given to each cohort.

Response category
Level 1 2 3 4 5
As-observed, identical
Female .32 [.24, .40] .08 [.06, .11] .23 [.16, .29] .08 [.05, .10] .29 [.22, .37]
Male .45 [.36, .54] .09 [.06, .12] .20 [.15, .26] .06 [.04, .08] .20 [.13, .26]
As-observed, different
Female .31 [.23, .39] .08 [.06, .11] .23 [.16, .29] .08 [.05, .11] .30 [.22, .37]
Male .46 [.37, .55] .09 [.06, .12] .20 [.14, .26] .06 [.04, .08] .19 [.13, .26]
Specified valuesᵃ, identical
Female .23 [.17, .29] .07 [.04, .09] .20 [.14, .26] .08 [.05, .11] .43 [.33, .52]
Male .33 [.25, .41] .08 [.05, .10] .21 [.15, .27] .08 [.05, .10] .31 [.21, .42]
Note: ᵃAdjusted to: Date of birth = as-balanced (1940, 1950, 1960, 1970, 1980, 1990, 2000).
Table 5.7: Average predictions for the variables in the model.

Whether the as-observed approach treats the subgroups identically or differently makes little difference. Using specified values of Date of birth, however, changes the predictions markedly. This is because the as-observed approach produces average predictions that are biased towards the response behavior of younger participants (who are overrepresented in the sample).
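The first two averaging strategies can be sketched with {marginaleffects}; the column "group" again identifies the response category:

```r
library(marginaleffects)

# "As-observed, identical": every respondent is set to female, then
# to male, and predictions are averaged over the whole sample
avg_predictions(m, variables = "gender")

# "As-observed, different": predictions are averaged within the
# observed gender subgroups
avg_predictions(m, by = c("group", "gender"))
```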

5.2.4.4 Using local means for adjustment

Long and Freese (2014, 273–74, 303–8) discuss a useful strategy for subgroup comparisons. Typically, subgroups are defined by varying the targeted variables and holding all peripheral variables at specified (or observed) values. Critically, however, the peripheral variables assume identical values across the subgroups. If this is unrealistic, peripheral variables may instead be set to their subgroup-specific values, or local means. By using such local means for adjustment, we may observe how robust our conclusions are to assumptions about the levels of other variables.

Note

To form average predictions for subgroups in the data, Stata offers the option “over”, which holds other variables at subgroup-specific means.

5.2.5 Comparisons

Comparisons are useful for expressing the strength of association between a predictor and the outcome. As such, a comparison is defined as a function of two (or more) model-based predictions. The term “function” refers to different ways of comparing two numbers: We can calculate a difference, a ratio, or other quantities. Since a model allows us to form different types of predictions, there is a corresponding variety of comparisons that may be drawn.

Typically, a comparison focuses on a single variable. This means that the two predictions that are contrasted differ on that variable only. When talking about a comparison, it therefore makes sense to distinguish between the targeted (focal) variable and peripheral (adjustment) variables.

A comparison can be conceptualized in two ways. First, the predictor profiles involved in the comparison may be understood as representing two different units. These hypothetical units are, however, identical with regard to all peripheral variables and only differ on the targeted variable. In that sense, it is an attempt to compare “like with like”. This way of thinking about comparisons is typical when the aims of a study are descriptive.

Alternatively, the predictor profiles may be considered as representing two versions of one and the same unit. Viewed in this way, a comparison is interpreted as expressing what would happen if we were able to change the value of the targeted variable for the same (hypothetical) unit. This is a thought experiment, and therefore often termed a “counterfactual comparison”. This way of thinking about comparisons is typical of causal explanation.

5.2.5.1 Choice of predictor values

The choice of predictor values for the targeted variable needs some thought. For binary predictors, it seems natural to compare the two levels of the variable. For continuous predictors, on the other hand, the choice of locations at which to compare predictions is less straightforward.

One strategy is to determine two values manually, which means that predictions are formed for two profiles only: one profile where the numeric variable assumes the higher value (for instance the upper quartile, or 1 SD above the mean), and one where it is held at the lower value (e.g. the lower quartile, or 1 SD below the mean). This way of computing a comparison answers the following question: What is the difference between individuals with a relatively high value on the continuous variable (e.g. 1 SD above the mean) and those with a relatively low value (e.g. 1 SD below the mean)? This is a descriptive question.

Another strategy is to use the observed values of the numeric variable and increase its value for each unit in the sample by a specific amount (e.g. by 1, or by 1 SD). This means that for the first set of profiles, the targeted variable assumes its observed distribution in the data. In the second set of profiles, it is shifted upwards by a fixed amount, as though the targeted variable had increased for the units in the estimation sample. This way of computing a comparison answers the following question: If we shift the value of the numeric variable by X units for each individual in the sample, what is the average difference we observe in the outcome probability? This maps onto a causal research question, which envisages some sort of intervention that may be applied to the units in the data.
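Both strategies are available in the {marginaleffects} package. A minimal sketch, assuming the model `m` fitted at the start of the chapter:

```r
library(marginaleffects)

# specified values: compare mean - SD against mean + SD of dob_c
avg_comparisons(m, variables = list(dob_c = "2sd"))

# observed values: shift each unit's observed dob_c upward by 1 unit,
# then average the unit-level differences
avg_comparisons(m, variables = list(dob_c = 1))
```

Other shortcuts for numeric variables (e.g. "sd", "iqr", "minmax") are illustrated below.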

Figure 5.10 illustrates these different strategies for the variable Date of birth in our analysis. The shifted histograms at the top represent the as-observed approach, which uses the distribution of the variable in the sample and shifts it upward by a certain amount. An increment of +1 is the default setting in the comparisons() function. The bars below the histogram illustrate different options for the approach relying on specified values. Shown in monospace font at the left margin of the plot are shortcuts that can be used with the {marginaleffects} package. These will be illustrated shortly.

Let us pause to consider these two strategies (observed vs. specified values). It appears that using the observed values to evaluate a comparison for a numeric predictor makes most sense if we intend to draw causal conclusions about the effect of a hypothetical intervention, which would be applied at the unit level, i.e. to the units whose predictor values are shifted. It should be kept in mind, however, that for some variables it does not make sense to envisage an intervention. Date of birth is an obvious example. If the focus is instead on description, it is likely that the comparison of two hypothetical individuals with different values on the targeted variable is more informative.

Figure 5.10: Different types of comparison for numeric variables.

Nevertheless, an advantage of the observed-values approach is that it gives us an idea of how much the difference varies across the span of the numeric variable. It therefore makes sense to consider an intermediate approach, which allows us to appreciate how much the difference associated with a given gap in Date of birth depends on where in the predictor space (including the target variable itself) it is evaluated. The approach is shown visually in Figure 5.11. We first settle on the span we wish to evaluate, in our case the gap in years. We could rely on sample statistics, but in the present data set this does not seem very sensible due to its imbalanced composition. We instead decide to consider a 30-year gap, which roughly corresponds to one generation. The comparison between birth dates that are one generation (i.e. 30 years) apart is then evaluated at a specified number of locations across the (possibly restricted) range of the numeric variable.

Figure 5.11: Comparisons for a numeric variable: An intermediate strategy that combines the strengths of the as-observed and the as-balanced approach.
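A sketch of this intermediate strategy, assuming the model `m` from above and the scaling of dob_c used throughout (one unit = 25 years, so a 30-year gap is 1.2 units; the grid values below are illustrative):

```r
library(marginaleffects)

# lower endpoints of the 30-year gap: birth years 1940, 1950, 1960, 1970
grid_lo <- datagrid(model = m, dob_c = c(-1.4, -1.0, -0.6, -0.2),
                    gender = c("f", "m"))
grid_hi <- transform(grid_lo, dob_c = dob_c + 1.2)  # upper endpoints

p_lo <- predictions(m, newdata = grid_lo)
p_hi <- predictions(m, newdata = grid_hi)

# the two prediction sets are aligned row by row, so their difference
# gives the 30-year contrast at each location
p_hi$estimate - p_lo$estimate
```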

Similar to the calculation of predictions, there are different ways of handling peripheral variables: They may be held at their means, treated in an as-balanced manner, or allowed to assume their observed values in the data.

5.2.5.2 Comparisons for each observed unit

We should bear in mind that by making comparisons for each unit in the estimation sample, we are essentially running a thought experiment – in comparing two versions of the same unit, we are imagining some kind of intervention. This might not make sense for certain variables, with Gender being a case in point. We can nevertheless think of the distribution of differences as reflecting the extent to which male and female speakers differ on the probability scale at different points in the predictor space, in our case across different birth years.
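These unit-level comparisons can be sketched with {marginaleffects} (assuming the model `m` from above; comparisons() returns one row per observation and response category):

```r
library(marginaleffects)

# counterfactual gender contrast for every unit in the estimation sample
cmp <- comparisons(m, variables = "gender")

# distribution of differences for the highest response category
# (the "group" column holds the response-category labels)
summary(subset(cmp, group == "5")$estimate)
```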

We can inspect this distribution with a dot diagram; the scaling of the x-axis is chosen in a way that permits direct comparison with Figure 5.13 below.

Figure 5.12: Gender: Distribution of differences between predicted response probabilities for the observations in the estimation sample.

For a numeric variable, we can also compare two versions of the same unit. For instance, we can add 25 years to the speaker’s actual date of birth. We should bear in mind that it does not really make sense to think of Date of birth as changing value within an individual. Moreover, for some speakers this shifts the year of birth into the future, and thereby extrapolates into an unobserved region of the data space. Nevertheless, the distribution of differences gives us some idea of how much the difference associated with a 1-unit (i.e. 25-year) shift in year of birth varies in the data space. Further below, we will use a more sensible set of specified values to reveal this feature of our data and model.

Figure 5.13: Date of birth: Distribution of differences between predicted response probabilities for the observations in the estimation sample.

5.2.5.3 Comparisons for numeric predictors

For numeric predictors, we illustrated the use of different locations and values of the numeric variable itself (see Figure 5.10). Table 5.8 lists the differences in response probabilities for each of the approaches illustrated in Figure 5.10, broken down by Gender.

Response category
Level 1 2 3 4 5
observed → observed + 1 (default)
Female +.28 [+.17, +.40] +.00 [−.02, +.02] −.07 [−.13, −.01] −.04 [−.06, −.02] −.17 [−.23, −.11]
Male +.28 [+.18, +.38] −.03 [−.05, −.00] −.10 [−.16, −.05] −.04 [−.06, −.02] −.11 [−.16, −.07]
observed → observed + 0.5
Female +.14 [+.08, +.19] +.01 [−.00, +.02] −.02 [−.05, +.01] −.02 [−.03, −.01] −.10 [−.14, −.06]
Male +.15 [+.09, +.20] −.01 [−.02, +.01] −.05 [−.08, −.02] −.02 [−.03, −.01] −.07 [−.10, −.04]
−1 → +1
Female +.41 [+.27, +.55] +.06 [+.03, +.09] +.06 [−.02, +.13] −.03 [−.06, +.00] −.50 [−.68, −.32]
Male +.50 [+.34, +.65] +.03 [−.00, +.06] −.05 [−.13, +.02] −.06 [−.09, −.03] −.41 [−.61, −.22]
mean − SD → mean + SD
Female +.29 [+.18, +.41] +.03 [+.01, +.06] −.01 [−.06, +.04] −.04 [−.06, −.01] −.28 [−.39, −.17]
Male +.34 [+.21, +.46] +.00 [−.02, +.03] −.08 [−.14, −.03] −.05 [−.08, −.02] −.21 [−.31, −.11]
mean − SD/2 → mean + SD/2
Female +.15 [+.09, +.21] +.02 [+.01, +.03] −.00 [−.03, +.02] −.02 [−.04, −.01] −.14 [−.20, −.08]
Male +.17 [+.11, +.24] +.00 [−.01, +.02] −.05 [−.08, −.01] −.03 [−.04, −.01] −.10 [−.15, −.05]
lower quartile → upper quartile
Female +.18 [+.11, +.26] +.02 [+.00, +.04] −.01 [−.05, +.02] −.03 [−.04, −.01] −.16 [−.23, −.10]
Male +.21 [+.13, +.29] +.00 [−.02, +.02] −.06 [−.10, −.02] −.03 [−.05, −.01] −.12 [−.18, −.06]
minimum → maximum
Female +.44 [+.30, +.58] +.07 [+.04, +.11] +.10 [+.02, +.17] −.01 [−.05, +.02] −.60 [−.79, −.40]
Male +.54 [+.40, +.69] +.05 [+.01, +.08] −.01 [−.10, +.08] −.05 [−.09, −.02] −.53 [−.75, −.31]
Table 5.8: Different options for comparisons for numeric predictors.

5.2.5.4 Comparisons at specified values

Comparisons can also be made at specified values. For speakers born in 1975:

  • Interpretation: “For speakers born in 1975, the average difference between male and female speakers is…”
Male − Female
Response Δ Pr(y) 95% CI
1 +.12 [+.01, +.23]
2 +.02 [+.00, +.03]
3 +.00 [−.02, +.02]
4 −.02 [−.03, +.00]
5 −.12 [−.23, −.01]
Note: Adjusted to: Date of birth = 0 (1975).
Table 5.9: Comparison between male and female speakers born in 1975.
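This comparison can be sketched with {marginaleffects}, assuming the model `m` from above and that a centered dob_c of 0 corresponds to the birth year 1975:

```r
library(marginaleffects)

# male - female contrast, holding Date of birth at 1975 (dob_c = 0)
comparisons(m, variables = "gender", newdata = datagrid(dob_c = 0))
```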

5.2.5.5 Comparisons for the “average unit”

We can set peripheral variables equal to their means (numeric predictors) or modes (categorical predictors).

  • Interpretation: “For speakers with the average date of birth in the sample (approximately 1973), the difference between male and female speakers is…”
Male − Female
Response Δ Pr(y) 95% CI
1 +.13 [+.01, +.26]
2 +.01 [−.00, +.02]
3 −.02 [−.05, +.01]
4 −.02 [−.04, −.00]
5 −.10 [−.19, −.01]
Note: Adjusted to: Date of birth = −0.07 (mean).
Table 5.10: Comparison between the male and female “average” speaker in the sample.
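In {marginaleffects}, the shortcut newdata = "mean" holds numeric peripheral variables at their means and categorical ones at their modes. A sketch, assuming the model `m` from above:

```r
library(marginaleffects)

# gender contrast for the "average unit" in the sample
comparisons(m, variables = "gender", newdata = "mean")
```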

5.2.5.6 Comparisons for ideal types

The calculation of comparisons for (or between) ideal types works accordingly.

5.2.6 Average comparisons

Since a comparison is based on two specific profiles, it is a conditional quantity. This means that its magnitude depends on where in the predictor space it is evaluated, i.e. on the setting(s) of the peripheral variable(s). Since the units in the estimation sample have different values on the predictor variables, there is a distribution of differences (or ratios, etc.) in the sample. It often makes sense to inspect this distribution, and to evaluate a comparison at different points in the predictor space, to consider both its average magnitude as well as its variability across conditions (see Maddala 1983, 24).

Before we go further, we should note that an average comparison may be calculated in two ways, depending on the point at which we average. The difference is illustrated in Figure 5.14. The starting point is two sets of predictions, one for each value of the targeted variable. These two sets are shown using grey tiles in Figure 5.14.

  • Compare and average: We start with a unit-level comparison of the two sets of predictions, which would result in a set of unit-level comparisons. This set is of the same length as each of the sets of predictions. We then take the average over these unit-level comparisons. This means that we first compare, and then average.
  • Average and compare: Alternatively, we could start by averaging over each set of predictions to obtain two average predictions, which we then compare. This means that we first average and then compare.
Figure 5.14: Forming average comparisons: Compare and average vs. average and compare.

Importantly, these two procedures yield different results in non-linear models. In general, the first approach (compare and average) is preferable. It also allows us to see how the magnitude of, say, the difference of interest varies across the conditions considered.
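The two orders of operation can be sketched with {marginaleffects} (assuming the model `m` from above); avg_comparisons() implements the preferred compare-and-average order:

```r
library(marginaleffects)

# compare and average: unit-level contrasts first, then their mean
avg_comparisons(m, variables = "gender")

# average and compare: average predictions per gender (and response
# category) first; the male - female differences must then be taken by hand
avg_predictions(m, by = c("group", "gender"))
```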

The compare-and-average approach has been variously referred to as the “average marginal effect” (Long 1997; Long and Freese 2014), the “average predictive comparison” (Gelman and Hill 2007), and the “average discrete change” (Long and Mustillo 2017).

5.2.6.1 Average comparisons: Observed vs. specified values

The points in the predictor space where the comparison is made can be determined in two ways. The values for the peripheral variables may be specified manually (specified values) or we may use the values they assume in the estimation sample (observed values). This means that we can average them over the sample or compute them at fixed values.

We can show these average comparisons graphically:

Figure 5.15

Recall that the distribution of Date of birth in our sample is very unbalanced. It therefore makes sense to specify its values manually, to obtain estimates that are more representative of the target population. Since the target population includes all speakers of Maltese English irrespective of their year of birth, we wish to treat birth years equally when reporting a population-level average. We will evaluate the comparison at seven locations that are 10 years apart (1940, 1950, …, 2000). Table 5.11 juxtaposes average comparisons for these different treatments of the peripheral variable Date of birth.
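The as-balanced treatment of Date of birth can be sketched by supplying an explicit grid (assuming the model `m` from above; the dob_c values correspond to the birth years 1940, 1950, …, 2000 under the scaling used in this chapter):

```r
library(marginaleffects)

# average the gender contrast over seven equally weighted birth years
avg_comparisons(
  m,
  variables = "gender",
  newdata = datagrid(dob_c = seq(-1.4, 1, by = 0.4))
)
```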

Date of birth = as-observed
Date of birth = as-balancedᵃ
Response Diff 95% CI Diff 95% CI
1 +.13 [+.01, +.24] +.10 [+.01, +.19]
2 +.00 [−.00, +.01] +.01 [−.00, +.02]
3 −.02 [−.04, +.00] +.01 [−.00, +.02]
4 −.02 [−.03, −.00] −.01 [−.01, +.00]
5 −.10 [−.18, −.01] −.11 [−.21, −.01]
Note: ᵃAdjusted to: Date of birth = (1940, 1950, 1960, 1970, 1980, 1990, 2000).
Table 5.11: Average comparisons for Gender using different adjustment methods for Date of birth.

Interpretation: “The average difference in predicted probability between male and female respondents, evaluated at (and averaged over) the observed values of Date of birth, is …”

5.2.6.2 Average comparisons for a quick overview of predictor importance

To be able to directly compare the two variables Gender and Date of birth, we must first decide what a “fair comparison” may be. A 30-year span, which represents a generational shift, seems sensible. Further, we must decide whether to evaluate it at two fixed points (e.g. 1990 vs. 1960) or across the observed range of the date-of-birth scale, to take into account that the difference associated with a 30-year gap depends on where in the predictor space it is evaluated (including which birth years are considered). We will use our intermediate approach (see Figure 5.11).

Table 5.12 shows average comparisons for the two predictors. We note that the predicted response probabilities vary more with the date of birth.

Date of birthᵃ
(Difference: 30 years)
Genderᵇ
(Female − Male)
Response Diff 95% CI Diff 95% CI
1 +.25 [+.18, +.32] +.10 [+.01, +.19]
2 +.04 [+.01, +.06] +.01 [−.00, +.02]
3 +.03 [−.02, +.07] +.01 [−.00, +.02]
4 −.02 [−.04, −.00] −.01 [−.01, +.00]
5 −.29 [−.41, −.18] −.11 [−.21, −.01]
a Adjusted to: Gender = as-balanced; Date of birth = as-balanced (1970 − 1940, 1980 − 1950, 1990 − 1960, 2000 − 1970).
b Adjusted to: Gender = as-balanced; Date of birth = as-balanced (1940, 1950, 1960, 1970, 1980, 1990, 2000).
Table 5.12: Average comparisons to compare the two predictors in terms of their strength of association with the outcome.

Long and Freese (2014, 344–51) show these differences graphically, which makes sense if there are a large number of predictors in the model. Figure 5.16 mimics the kind of display they use.

Figure 5.16: Visualization of average comparisons for side-by-side comparisons of predictors.

5.2.6.3 Average comparisons for subgroups in the data

So far we have used average comparisons to obtain a single summary for a given predictor, to quantify its overall strength of association with the outcome. As we saw in Figure 5.2 and Figure 5.3, the differences between predicted probabilities are bound to vary across the predictor space. This is because differences between predictions near the endpoints of the probability scale (i.e. near 0 or 1) will be smaller. Since this local compression is scale-induced, it occurs even if the model does not include any interaction terms.

It is therefore often informative to calculate average comparisons for subgroups in the data. These subgroups may be formed based on the peripheral variables, but also based on the predictor of interest, the targeted variable. This allows us to note how the predictive capacity of a variable varies across regions of the data space.

Let us first consider the change in outcome probabilities associated with a 30-year difference in year of birth. We will evaluate this 30-year step at three different locations of the predictor variable:

  • 1970 vs. 1940
  • 1985 vs. 1955
  • 2000 vs. 1970
Response category
Comparison 1 2 3 4 5
1970 vs. 1940 +.17 [+.14, +.20] +.05 [+.03, +.07] +.10 [+.03, +.18] +.01 [−.02, +.05] −.33 [−.46, −.21]
1985 vs. 1955 +.28 [+.19, +.37] +.03 [+.01, +.06] +.00 [−.04, +.04] −.03 [−.05, −.01] −.28 [−.41, −.16]
2000 vs. 1970 +.32 [+.20, +.44] +.01 [−.00, +.03] −.06 [−.10, −.02] −.05 [−.07, −.02] −.23 [−.32, −.14]
Table 5.13: Average comparisons for a 30-year gap in Date of birth at different locations of the Date-of-birth scale.

We can also evaluate it separately for male and female speakers. For a balanced representation of the focal variable Date of birth, we use the intermediate approach illustrated in Figure 5.11.

Response category
Gender 1 2 3 4 5
f +.22 [+.15, +.29] +.04 [+.02, +.06] +.06 [+.01, +.11] −.01 [−.03, +.01] −.31 [−.43, −.20]
m +.28 [+.20, +.36] +.03 [+.01, +.05] −.00 [−.06, +.05] −.03 [−.05, −.01] −.27 [−.40, −.15]
Table 5.14: Average comparisons for a 30-year gap in Date of birth at different levels of the variable Gender.
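A simpler observed-values analogue of this subgroup analysis can be sketched with the by argument (assuming the model `m` from above; a 30-year gap is 1.2 units on dob_c under the scaling used here):

```r
library(marginaleffects)

# average the 30-year contrast separately for male and female speakers
avg_comparisons(m, variables = list(dob_c = 1.2), by = "gender")
```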

5.3 Predictions on the latent-variable scale

Recall that an interpretation in terms of a latent variable is only valid for parallel cumulative models. If the assumptions of this model type hold, the latent variable allows for simple strategies for model interpretation.

Importantly, model-based predictions on the latent-variable scale only make sense in the context of the thresholds, which means that it will usually be necessary to visualize them. This is because we have to know the spacing of the thresholds to appreciate the meaning of differences on the latent scale. If a probit link is used, the coefficients on the latent-variable scale may be interpreted as standardized effect size measures, the benchmark being the residual variation. In multilevel models, there will be residuals at different levels, which means that the level-1 error variation may not be the appropriate yardstick for all predictors in the model.

When we interpret predictions on the latent scale, we no longer have to confront the fundamental problem of linear vs. non-linear response scales – all predictions are linear on the latent scale. This means that there are no scale-induced interactions. For a quantitative predictor, the as-observed and the specified-single-value approach therefore produce the same results. This means that the task of model interpretation is greatly simplified.

It should be noted, however, that the issue becomes relevant when the model includes interaction terms, and when a numeric variable is modeled using polynomials or flexible smooths.

We fit a model that includes an interaction:

m <- clm(
  rating_int ~ dob_c * gender, 
  data = d, 
  threshold = "symmetric")

This is the regression table:

formula: rating_int ~ dob_c * gender
data:    d

 link  threshold nobs logLik  AIC    niter max.grad cond.H 
 logit symmetric 193  -260.60 533.20 5(1)  5.96e-13 8.1e+01

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
dob_c          -1.3333     0.3443  -3.872 0.000108 ***
genderm        -0.6857     0.3151  -2.176 0.029559 *  
dob_c:genderm   0.2916     0.4681   0.623 0.533339    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Threshold coefficients:
          Estimate Std. Error z value
central.1 -0.86376    0.22847  -3.781
central.2  0.15823    0.21999   0.719
spacing.1  0.39761    0.06923   5.744

5.3.1 Predictions for each observed unit

We can obtain predictions for the estimation sample to obtain an overview of the predictive capacity of the variables in our model. Figure 5.17 shows the distribution of these predictions for the observations in our sample.

Figure 5.17: Predicted scores on the latent-variable scale for the observations in the estimation sample.

5.3.2 Unit-level predictions

We can calculate predictions on the latent scale for a male and a female speaker born in 2000.

95% CI
Gender Prediction Lower limit Upper limit
f −0.98 −1.56 −0.4
m −1.37 −1.95 −0.8
Note: Adjusted to: Date of birth = +1 (2000).
Table 5.15: Predictions for a male and a female speaker born in 2000.
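These latent-scale predictions can be sketched with {emmeans}, for which mode = "latent" is the default for clm objects (assuming the interaction model `m` fitted above, and that dob_c = 1 corresponds to the birth year 2000):

```r
library(emmeans)

# latent-scale predictions for male and female speakers born in 2000
emmeans(m, ~ gender, at = list(dob_c = 1))
```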

It is hard to interpret these predictions without reference to the thresholds. It is therefore best to consider them visually:

Figure 5.18: Predictions for a male and a female speaker born in 2000.

We can also look at how the predictions for male and female speakers vary across the entire date-of-birth range – that is, for each combination of Gender and Date of birth. Figure 5.19 shows two trend lines, one for male and one for female speakers.

Figure 5.19: Predictions for male and female speakers across the span of the variable Date of birth.

Since the model only includes two predictors, Figure 5.19 shows the complete predictor space: Every condition that is represented by the model appears in the figure. This means that no predictor was backgrounded, or averaged over.

5.3.3 Average predictions

We can also average predictions across the values of a peripheral variable. This allows us to compare predictors side by side, to assess their relative strength of association with the outcome.

We calculate average predictions for Date of birth by forming simple averages over the two levels of Gender. This means that the two trend lines in Figure 5.19 merge into one. Since we give the same weight to male and female speakers, they “meet” half way. When forming average predictions for Gender, on the other hand, we hold Date of birth at 1975, which is close to the center of the distribution. These two sets of average predictions can then be displayed side by side, as is done in Figure 5.20. This side-by-side arrangement allows us to compare them in terms of their average strength of association with the outcome. It is clear from Figure 5.20 that Date of birth has a much stronger association with the lexical preference for package over parcel.

Figure 5.20: Average predictions for the two predictors in the model.

5.3.4 Unit-level comparisons

Rather than compare unit-level predictions side by side as in Figure 5.18, we can express the comparison numerically, by calculating the difference between the two. In Figure 5.18, we saw predictions for male and female speakers born in 2000. Female speakers have a higher latent mean: −0.98, 95% CI [−1.56, −0.4] compared to −1.37, [−1.95, −0.8] for male speakers. The difference between female and male speakers is 0.39, [−0.4, 1.19].

Instead of comparing male and female speakers with the same birth year, we may also ask how the difference between the subgroups varies across levels of Date of birth. Since the model includes an interaction between these predictors, the difference is not constant across the scale. As Figure 5.19 showed, male and female speakers have different slopes. To see how the estimated difference between male and female speakers varies with birth year, we can obtain unit-level comparisons at several locations – i.e. for several birth years. The result is visualized in Figure 5.21: For 1940, the difference is about +1, for the birth year 2000 it is about +0.5.

Figure 5.21: Comparison between male and female speakers across the span of the variable Date of birth.

We can also form comparisons for the variable Date of birth: The difference between female speakers born in 1950 (1.69, 95% CI [0.73, 2.64]) vs. 2000 (−0.98, 95% CI [−1.56, −0.4]) is 2.67, 95% CI [1.32, 4.02].

5.3.5 Average comparisons

The comparisons so far involved no averaging – all predictors were held at specific values. In models that include a larger number of predictor variables, we often need to average over one or several peripheral variables. To generate average comparisons, we first create two or more average predictions which we then compare.

Figure 5.20 showed average predictions for both predictors. We can compare speakers born in 2000 and 1940, averaging over Gender (as-balanced, i.e. simple average).

Likewise, we can compare male and female speakers, averaging over Date of birth.
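A sketch of this computation with {emmeans}, assuming the model `m` from above (dob_c = −1.4 and 1 correspond to the birth years 1940 and 2000 under the scaling used here; since Gender is omitted from the formula, the predictions are averaged over its levels):

```r
library(emmeans)

# average latent-scale predictions for two birth years, averaged over gender
emm <- emmeans(m, ~ dob_c, at = list(dob_c = c(-1.4, 1)))
emm
```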

The emmeans() function computes these average predictions on the latent scale; passing the result to pairs() then yields the relevant contrast:

 dob_c emmean    SE  df asymp.LCL asymp.UCL
  -1.4   1.67 0.439 Inf     0.812     2.533
   1.0  -1.18 0.214 Inf    -1.596    -0.759

Results are averaged over the levels of: gender 
Confidence level used: 0.95 
 contrast             estimate    SE  df z.ratio p.value
 (dob_c-1.4) - dob_c1     2.85 0.582 Inf   4.896  <.0001

Results are averaged over the levels of: gender 

  1. As Osbourne (2015, 34) puts it, “odds ratios are problematic in that researchers, practitioners, and the lay public often don’t intuitively understand odds – although they often think they do […] [R]atios of things that people don’t understand are necessarily even more fraught with difficulty.”↩︎

  2. An alternative strategy may also be used if categorical predictors are represented with dummy variables, i.e. indicator variables coded as 0/1. Gender, for instance, may be expressed as an indicator variable Female, with a value of 1 representing a female speaker, and a value of 0 a male speaker. The in-sample average of this variable is 0.53 (i.e. 53% of the speakers are female). We can then form predictions for values between 0 and 1 and thereby give different weight to the two levels. Note, however, that the resulting predicted probabilities differ from the ones we would have obtained by averaging over predicted probabilities generated for male and female speakers. This is because the averaging is done on different scales (logit vs. probability scale).↩︎

  3. For instance, frequency and dispersion as indicators of usage patterns are interrelated and covary strongly enough to consider them jointly as reflecting prototypes: Current and pervasive items vs. infrequent and specialized items. Ideal types may even be modeled on the basis of existing exemplars.↩︎

  4. Interesting: Long and Freese (2014, 272) argue against mixing as-observed and as-specified approaches to defining ideal types.↩︎