4 A latent variable model
These notes are currently under construction and may contain errors. Feedback of any kind is very much welcome: lukas.soenning@uni-bamberg.de
For some ordinal outcomes, the idea of an underlying continuous variable makes sense. This underlying continuum is referred to as a latent variable.1 It turns out that the parallel cumulative model can be motivated using this idea, and this section discusses the latent-variable representation in more detail.
4.1 Relation between latent variable and ordinal response categories
In the latent-variable perspective, the ordinal variable is thought of as an imperfect, coarser measurement of an unobserved (latent) trait. Whenever the latent variable crosses a threshold, the observed category changes. Thresholds (also called cutpoints) “cut up” the latent scale into intervals that correspond to the ordinal categories. The latent variable (often denoted \(y^*\)) ranges from negative to positive infinity. This is illustrated in Figure 4.1, where three thresholds (\(\tau_1\), …, \(\tau_3\)) divide the latent scale into four intervals.
The thresholds (often denoted as \(\tau\)) translate the continuous variable into rank-ordered categories. The number of thresholds is always 1 less than the number of response categories. For four categories, there are three thresholds.
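As a minimal sketch of this mapping (with invented threshold locations), R's `cut()` function translates latent values into ordinal categories:

```r
# Three hypothetical thresholds cut the latent scale into four categories
tau <- c(-1, 0.5, 1)

# A few illustrative latent scores
ystar <- c(-2.3, -0.4, 0.7, 1.8)

# Each latent value is assigned to the interval it falls into
cut(ystar, breaks = c(-Inf, tau, Inf), labels = 1:4)
#> [1] 1 2 3 4
```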
It makes sense to assume that the trait being measured by an ordinal variable is distributed symmetrically in the population of interest – indeed, most ordinal regression models make this assumption. The distribution is then taken to have a bell shape, as illustrated in Figure 4.2. Depending on the specific form of model that is implemented, this may be a standard normal or a standard logistic distribution.
The area under the density curve represents all responses in the sample (i.e. 100%). The thresholds divide up this area into four parts, one for each response category. These are shown in Figure 4.2 using grey shading. The area below \(\tau_1\), which appears in dark grey, represents the proportional share of response category 1; in Figure 4.2, it is .16. The share of category 2 – .46 – is represented by the portion between \(\tau_1\) and \(\tau_2\). The response probabilities for categories 3 (light grey) and 4 (white) are .22 and .16, respectively.
Figure 4.2 illustrates how a parallel cumulative model represents the category probabilities from a latent-variable perspective. What determines this distribution is (i) the location of the threshold parameters, and (ii) the center of the bell curve. The center of the density curve can be thought of as the model intercept.
The predictor variables model the location of the density curve. For a subgroup that shows a larger proportion of higher responses, the density curve is shifted to the right, i.e. its center moves up the latent scale. This is illustrated in Figure 4.3, where the center is shifted by +0.7 – the average is now +1.7. The response probabilities for categories 1 to 4 are now .04, .30, .27, and .38. The dotted curve represents the density in Figure 4.2, which is centered at 1.
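These category shares follow directly from the normal CDF. The exact threshold locations are not stated above; the values in the following sketch are back-computed so that a density centered at 1 reproduces the proportions in Figure 4.2, and should be read as an illustrative assumption:

```r
tau <- c(0, 1.31, 2)   # assumed threshold locations (back-computed)

# Category shares = differences between cumulative probabilities at the thresholds
category_probs <- function(mu, tau) diff(c(0, pnorm(c(tau, Inf), mean = mu)))

round(category_probs(mu = 1.0, tau), 2)   # Figure 4.2: .16 .46 .22 .16
round(category_probs(mu = 1.7, tau), 2)   # Figure 4.3: .04 .30 .27 .38
```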
One way to model ordinal data is to assume that the underlying latent variable has a standard normal distribution, with a mean of 0 and a standard deviation of 1. The model then determines the location of the cutpoints: instead of assuming equal distances between categories (as is the case in MRMs), the data decide where to cut up the latent scale. Importantly, the mean of the normally distributed trait is allowed to vary as a function of the predictors in the model. In this regard, the latent variable model for ordered outcomes is similar to MRMs: if a predictor shows an association with the outcome, it shifts the mean of the latent trait upwards or downwards on the latent response scale.
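As a sketch of this estimation logic, we can simulate data from the latent-variable mechanism and let `ordinal::clm()` locate the cutpoints empirically (all numeric values below are invented for illustration):

```r
library(ordinal)

set.seed(1)
n <- 500
x <- rnorm(n)
ystar <- 0.8 * x + rnorm(n)   # latent trait: predictor effect + standard normal error
y <- cut(ystar, breaks = c(-Inf, -1, 0.5, 1, Inf),
         labels = 1:4, ordered_result = TRUE)

# Probit link = standard normal latent errors; thresholds estimated freely
m <- clm(y ~ x, data = data.frame(y, x), link = "probit")
coef(m)   # three threshold estimates (near -1, 0.5, 1) plus the slope for x
```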
4.2 The data-generating mechanism represented by an LVM
We can also think of the latent-variable interpretation of cumulative models as dividing the data-generating process into two parts (see Grilli and Rampichini 2012, 393). The first is the underlying continuous representation of the trait that is being measured, including an assumption about the shape of its distribution in the population. The most commonly used link functions (logit and probit) stipulate a symmetric, bell-shaped profile. The second part is a measurement model, which represents the way in which this continuous trait is being measured. In some sense, then, the first part represents the construct of interest, i.e. a sensation (or perception/attitude/opinion) that may vary between individuals and contexts, and the second part represents the instrument used to measure this construct, e.g. design features of the questionnaire (e.g. number and wording of response categories).
Figure 4.4 illustrates the difference between these two components. Let’s assume a researcher is interested in the extent to which respondents agree to a particular statement. We can think of agreement as a continuous variable – opinions vary, and some individuals will agree (or disagree) more strongly than others. The latent-variable model assumes that the continuous feature of interest (here: agreement) has a bell-shaped distribution. The density curve at the top of Figure 4.4 shows the assumed distribution of the underlying trait in a sample of speakers.
To measure the level of agreement, the researcher may use different response scales, which can vary in (i) the number of response categories and (ii) their wording. This means that the instrument may be designed in different ways to measure the variable of interest. Figure 4.4 shows three scenarios: a 3-point scale, a 5-point scale, and a 7-point scale. On the left side, the design of the ordinal response scale is shown. On the far right, the observed distribution of responses across the categories is visualized using a grouped bar chart. The density curves in the middle illustrate how the latent-variable model represents both components: the distribution of the latent trait in the sample of informants (bell-shaped curve), and the design of the ordinal scale, i.e. the number and perceived meaning of the verbal scale labels (number and location of thresholds).
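A small simulation makes the two components concrete: a single latent agreement trait, measured through three hypothetical instruments that differ only in their threshold sets (all threshold locations are invented for illustration):

```r
set.seed(42)
agreement <- rnorm(1000)   # part 1: latent trait, bell-shaped in the sample

# Part 2: measurement model = number and location of thresholds
tau3 <- c(-0.8, 0.8)                          # 3-point scale
tau5 <- c(-1.5, -0.5, 0.5, 1.5)               # 5-point scale
tau7 <- c(-1.8, -1.1, -0.4, 0.4, 1.1, 1.8)    # 7-point scale

measure <- function(ystar, tau) {
  cut(ystar, breaks = c(-Inf, tau, Inf), labels = 1:(length(tau) + 1))
}

# Same latent draws, three different observed response distributions
table(measure(agreement, tau3))
table(measure(agreement, tau5))
table(measure(agreement, tau7))
```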
4.3 Potential for confusion: Scaling of latent variable vs. cumulative proportions
A potentially confusing feature of cumulative models is the directionality of the model scale, and its relation to the cumulative probabilities. To recognize the issue, consider Figure 4.5. The horizontal axis represents the latent variable. Higher values correspond to a higher value on the underlying continuum (e.g. a higher level of agreement). The thresholds divide the scale into four intervals, which represent the response categories of the ordered scale. Two densities are shown – one for the average latent score of 0, and the second one with an average of 0.7, which is therefore displaced to the right. The averages are shown with filled circles (marked “A” and “B”). We note that group B shows a higher average on the latent scale. On the ordinal response scale, this means that group B will show a greater share of responses in the higher response categories.
Scaling issues can arise because the cumulative model is formulated in terms of cumulative probabilities, and for a given threshold, there are two probabilities we can foreground: the probability of observing a category below the threshold, or the probability of observing a category above the threshold. The probability of observing a category below the threshold is usually referred to as a cumulative probability, and the probability of observing a category above the threshold as an exceedance probability (i.e. 1 – cumulative probability). Both model formulations are found in the literature.
To see how this may cause confusion, let us consider a simple example. In Figure 4.5, threshold \(\tau_{1}\) divides the scale into those categories below \(\tau_{1}\) (1) and those above \(\tau_{1}\) (2, 3, 4). Panel (a) foregrounds the probability of observing a response below the threshold. This probability is shown using grey shading. We see that this shaded area is smaller in group B (.04) than in group A (.16). This means that the probability of observing a category below the threshold decreases as the latent score increases. The two types of quantity (cumulative probability, latent score) therefore have a reverse interpretation, which can cause confusion.
In a model that describes the distribution of cumulative probabilities, the latent score generated by the linear predictor is therefore subtracted from the thresholds. This ensures that the coefficients of the model will be consistent with the ordinal scale.
\[ \ln\left(\frac{\Pr(y \le m)}{\Pr(y > m)}\right) = \tau_m - \textbf{x}\beta \]
In panel (b), on the other hand, the probability of observing a response above the threshold is foregrounded. Now the shaded area is larger in group B (.96) than in group A (.84), which means that the probability of observing a category above the threshold increases as the latent score increases. The two types of quantity (exceedance probability, latent score) therefore have the same directionality of interpretation.
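Both panels can be reproduced with the normal CDF. The location of \(\tau_1\) is not stated above; the value –1 is back-computed from the cumulative probability of .16 in group A:

```r
tau1 <- -1      # assumed: back-computed from Pr(y <= 1) = .16 in group A
mu_A <- 0       # average latent score, group A
mu_B <- 0.7     # average latent score, group B

# Panel (a): cumulative probabilities Pr(y <= 1) -- decrease as the mean increases
round(c(pnorm(tau1 - mu_A), pnorm(tau1 - mu_B)), 2)          # .16 .04
# Panel (b): exceedance probabilities Pr(y >= 2) -- increase with the mean
round(c(1 - pnorm(tau1 - mu_A), 1 - pnorm(tau1 - mu_B)), 2)  # .84 .96
```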
In a model that describes the distribution of exceedance probabilities, the thresholds are therefore subtracted from the latent score generated by the linear predictor. This again ensures that the coefficients of the model will be consistent with the ordinal scale.
\[ \ln\left(\frac{\Pr(y \ge m)}{\Pr(y < m)}\right) = \textbf{x}\beta - \tau_m \]
This means that the choice of formulation affects the meaning of the thresholds. Let us illustrate this with a simple regression model for the heaps data, with Age as the only predictor variable. We fit two models to the data – one based on cumulative and one based on exceedance probabilities. The models return the exact same coefficient for the predictor Age; the thresholds, on the other hand, are mirror images of one another.
This is illustrated in Figure 4.6. First note that the y-axis is scaled identically in both panels. The trend lines for Age coincide. The threshold coefficients, however, run in opposite directions. For cumulative probabilities, thresholds increase with the latent scale – they increase from bottom to top. For exceedance probabilities, on the other hand, they decrease from bottom to top. In fact, the thresholds in the right panel are the exact mirror image of those in the left, with zero as the mirror axis.
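As a sketch of how the two fits can be obtained in R (assuming a data frame heaps with an ordered response likelihood and a numeric predictor Age; these variable names are hypothetical):

```r
library(ordinal)   # cumulative-probability formulation
library(rms)       # exceedance-probability formulation

m_cum <- clm(likelihood ~ Age, data = heaps)   # logit link is the default
m_exc <- orm(likelihood ~ Age, data = heaps)   # logistic family is the default

coef(m_cum)   # identical slope for Age; thresholds increase from bottom to top
coef(m_exc)   # identical slope for Age; intercepts mirror the thresholds at zero
```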
The thresholds for cumulative probabilities, which appear in the left panel, allow for a natural interpretation of the latent scale: Thresholds represent cutpoints that divide the continuous latent scale into an ordered set of categories.
Both formulations exist in the literature. The advantage of the exceedance-probability formulation is that the threshold coefficients are consistent with the intercepts we obtain from running a series of binary regressions on the cutpoint equations along the ordinal scale (Harrell 2015, 313). This makes it easier to run model checks. An advantage of the cumulative-probability formulation, on the other hand, is that it makes transparent the latent-variable perspective. This is because the threshold coefficients can be interpreted as the cutpoints that discretize the underlying continuous variable.
When using cumulative models (or, indeed, any other form of ordered regression), it is important to know which formulation is implemented in the software. Table 4.1 shows which type of probability is foregrounded in a number of R packages.
| R package | Function | Reference | Probability |
|-----------|----------|-----------|-------------|
| ordinal | clm(), clmm() | Christensen (2018) | Cumulative |
| MASS | polr() | Venables and Ripley (2002) | Cumulative |
| VGAM | vglm() | Yee (2015) | Cumulative |
| brms | brm() | Bürkner (2017) | Cumulative |
| rms | orm() | Harrell (2015) | Exceedance |
4.4 Latent-variable analysis of the heaps data
For illustration, we apply an LVM to the heaps data. Here, we focus on model interpretation and how the distribution of the ordinal responses is represented using thresholds and shifted density curves. We will therefore move directly to illustrative graphs of the fitted model.
4.4.1 Category proportions under the density curve
In the following diagrams, the (vertical) y-axis shows the underlying trait. For the current data set, this is the reported “likelihood of usage”, ranging from “never” to “likely”. The questionnaire offered six tick boxes (see Figure 1.4), so we have five thresholds. Let us consider each predictor in turn.
In Figure 4.7, the horizontal axis shows the predictor Age, which varies from around 80 years (left margin) to about 20 years (right margin). The distribution of the latent trait is shown for three discrete values of age (75, 50, and 25 years). There is a clear association between Age and the usage of heaps: The center of the density curve shifts upwards as we move towards younger speakers. For 75-year-olds, the response “never” has the largest share (red), while for 25-year-olds, the response “likely” has the highest probability (grey). The three snapshots we have chosen are points on a continuum, which is represented by the trend line.
The model-based density curves for the different levels of Register are shown in Figure 4.8. The black dots trace the mean of the latent variable across the levels of Register. We observe that while the proportional share of “likely” is highest when talking to friends, the different types of interlocutor account for relatively little variation in the probability of outcome responses.
Differences among syntactic functions are more pronounced. As Figure 4.9 shows, the bell-shaped distributions shift across the levels of this variable. The shaded area under the curve shows how the responses are distributed across the syntactic functions. heaps has the highest level of acceptability when used as a quantifier. As an intensifier of an adjective in the positive form, it is clearly dispreferred.
4.5 Condensed representation: Focus on the average
The latent variable model allows us to interpret and communicate an ordered regression model in ways very similar to MRMs. To this end, we can construct predictive margins on the latent-variable scale for each predictor in the model. Figure 4.10 shows the means of the latent-variable distribution across the levels or values of each predictor, adjusting for all other predictors in the model. These adjusted predictions should be compared to the MRM version in Figure 1.5.
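As a sketch, such margins can be computed with the emmeans package, which supports clm fits and, for ordinal models, reports estimates on the latent scale by default (variable names as assumed above, hypothetical for these data):

```r
library(ordinal)
library(emmeans)

m <- clm(likelihood ~ Age + Register + Function, data = heaps, link = "probit")

# Predictive margins on the latent scale, averaging over the other predictors
emmeans(m, ~ Register)
emmeans(m, ~ Function)
```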
4.5.0.1 A model with an interaction
We can include an interaction between Age and Syntactic function, as sketched below.
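A sketch of such a model (same hypothetical variable names as above):

```r
# Interaction between Age and Syntactic function; Register kept as a main effect
m_int <- clm(likelihood ~ Age * Function + Register, data = heaps, link = "probit")

# Likelihood-ratio test against the main-effects model
anova(m, m_int)
```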
4.5.1 Thresholds
The default approach to the estimation of thresholds is to grant the model full flexibility when locating them. This means that their spacing is determined empirically, based on the data. While this may at first seem like a sensible approach, it is also the most costly in terms of the number of parameters needed to record their location. For \(k\) response categories, we need \(k-1\) threshold parameters. It therefore makes sense to consider alternative threshold structures.
Whether it makes sense to use the default approach and fit flexible thresholds depends on the design of the ordinal scale. In many cases, additional information about this design – in particular, the symmetry of the scale or the equidistance of its categories – can be used when fitting thresholds. By building this information into the model, fewer parameters are needed to estimate the cutpoints.
Let us consider symmetry first. Some ordinal scales are constructed in a symmetric way, which is reflected in parallel wording to both sides of the (implicit) scale midpoint. The illustrative agreement scales in Figure 4.4, for instance, are symmetric. The thresholds are therefore spaced symmetrically around their midpoint. Further examples, which are taken from the linguistic research literature, appear in Figure 4.12. Note how each set of verbal scale point labels is folded at the center.
Verbalizations that create symmetric scales are relatively frequent in the literature. In a recent survey of the use of ordinal response scales in the linguistic research literature, Sönning (2024) looked at 4,441 articles published between 2012 and 2022 in 17 linguistic journals. In total, 473 ordinal response scales were identified. Of these, 89 (19%) showed a symmetric verbalization, the most typical variant being Likert-type agreement scales.
The estimation of thresholds may also be directed by information on equidistance. If an ordinal scale consists of a sequence of boxes which are labeled only at the endpoints, the spacing between thresholds can arguably be considered equidistant. This kind of layout is typical for semantic differential scales, for instance. Figure 4.13 shows a number of examples from the literature. Out of the 473 ordinal response scales surveyed in Sönning (2024), 245 (52%) permit an equidistant spacing of the thresholds.
Symmetric thresholds divide the latent scale into intervals that mirror each other around the scale midpoint. As Figure 4.14 illustrates, the distance between thresholds is the same on both sides of the midpoint. When we constrain the thresholds to be symmetric, we need fewer parameters to describe their location. For 6 outcome categories, for instance, we need 3 (instead of 5) parameters.
The equidistance assumption allows us to fit an even more parsimonious threshold model. Table 4.2 compares the three methods in terms of the number of parameters they require for response scales with 3 to 8 categories. The flexible approach always requires \(k - 1\) parameters, where \(k\) is the number of categories. The equidistant approach always requires 2 parameters, irrespective of the length of the ordinal scale. The symmetric approach is in between, and its parsimony advantage grows with the number of response categories.
| Method | Three | Four | Five | Six | Seven | Eight |
|--------|-------|------|------|-----|-------|-------|
| Flexible | 2 | 3 | 4 | 5 | 6 | 7 |
| Symmetric | 2 | 2 | 3 | 3 | 4 | 4 |
| Equidistant | 2 | 2 | 2 | 2 | 2 | 2 |
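In R, the brms package supports the flexible and equidistant structures directly via the threshold argument of the cumulative() family; symmetric thresholds are not a built-in option and would require custom constraints. A sketch (assuming a data frame d with an ordinal response y and a predictor x):

```r
library(brms)

# d: hypothetical data frame with an ordinal response y and a predictor x

# Flexible thresholds (the default): k - 1 free cutpoint parameters
m_flex <- brm(y ~ x, data = d, family = cumulative("probit"))

# Equidistant thresholds: 2 parameters (first cutpoint + constant spacing)
m_equi <- brm(y ~ x, data = d,
              family = cumulative("probit", threshold = "equidistant"))
```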
Let us briefly consider what threshold structures we may wish to use for our case studies. For the BrE-AmE lexical pairs, the response scale is verbalized symmetrically (see Figure 1.1). We therefore use symmetric thresholds when analyzing these data.
For the heaps data, the layout of the response scale is more complex (see Figure 4.16). Only three categories of the 6-point scale are labeled. The sequence from “likely” to “unlikely” can be considered equidistant. The distance between “unlikely” and “never”, on the other hand, should remain flexible.
This means that we would ideally like to use custom thresholds when modeling these data. The custom set, which is illustrated in Figure 4.17, includes a threshold parameter \(a\) for the equidistant sequence, and a threshold parameter \(b\) for the flexible spacing at the right end of the scale. Since the implementation of custom thresholds requires considerable additional work, we will instead use flexible thresholds when modeling these data.
4.5.2 Model identification: Thresholds and intercept
In a cumulative model, the intercept and the threshold parameters cannot be estimated simultaneously (see, e.g. Rabe-Hesketh and Skrondal 2021, 642–44). In a model with 5 outcome categories, for example, it is not possible to estimate an intercept as well as all four threshold parameters. Instead, the location of one of these parameters must be fixed to 0, which is equivalent to dropping it from the model. Since different options exist, there are different parameterizations of cumulative models.
The intercept can be set to 0, which means that all threshold parameters are estimated. This is the default approach in virtually all software packages. Alternatively, one of the thresholds can be set to 0, and the remaining thresholds and the intercept are estimated. This option is attractive if the ordinal scale is diverging (perhaps symmetric), with a logical (or natural) midpoint. What makes this option attractive for these kinds of scale layouts is the fact that we can give the latent scale a meaningful zero point: positive scores on the latent scale then reflect sentiments above the scale midpoint, while negative scores signal the opposite direction. With an even number of response categories, the middle threshold can be mapped to zero. With an uneven number of response categories, we can set the mean of the two central thresholds to 0. These two strategies are illustrated in Figure 4.14 above.
Figure 4.18 illustrates two basic parameterizations of cumulative models. The model does not include any predictors but instead models a set of observed category proportions: 1 (.16), 2 (.53), 3 (.15), 4 (.16). These probabilities correspond to the areas under the density curve between two thresholds (see Figure 4.2).
In the parameterization on the left, the location of the intercept is set equal to zero. It therefore appears in grey, with a dotted line. This means that all three thresholds are estimated (\(\tau_1\) = –1, \(\tau_2\) = +0.5 and \(\tau_3\) = +1). These three parameters then apportion the probability mass appropriately.
In the parameterization on the right, it is the location of the first threshold that is fixed to zero; the intercept is therefore grayed out. The other two cutpoints (\(\tau_2\) and \(\tau_3\)) are estimated (+1.5 and +2, respectively), and the intercept is also estimated as +1. This again yields the appropriate distribution of the probability mass.
If we compare these two graphs, we observe that the bell-shaped curves are not aligned. In panel (a), the density is centered at 0; in panel (b), it is centered at +1. The location of the density curve on the latent scale depends on the parameterization, i.e. which parameter is fixed to zero.
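The equivalence of the two parameterizations is easy to verify numerically from the parameter values reported above (the helper function repeats the one sketched in Section 4.1):

```r
category_probs <- function(mu, tau) diff(c(0, pnorm(c(tau, Inf), mean = mu)))

# Panel (a): intercept fixed at 0; thresholds estimated at -1, +0.5, +1
round(category_probs(mu = 0, tau = c(-1, 0.5, 1)), 2)
# Panel (b): first threshold fixed at 0; intercept +1, thresholds +1.5 and +2
round(category_probs(mu = 1, tau = c(0, 1.5, 2)), 2)
# Both return the same distribution: .16 .53 .15 .16
```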
4.6 Variance components
In mixed-effects models, a standard deviation parameter is estimated for each clustering variable that is represented with random intercepts. While the random-effects SD parameter is estimated from the data, the residual (level-1) SD parameter remains fixed at 1. This is a somewhat odd feature of mixed-effects LVMs for ordinal data. As a result, both measures of spread must be understood in the context of the threshold parameters, which usually change systematically when alternating between a simple and a mixed-effects model.
To see why this is so, consider a simple regression model that is applied to a hierarchically structured set of data with multiple observations from each subject. If the variable Subject is not included as a random effect in the model, the between-subject variability is soaked up by the residual (level-1) error term. Since this term is fixed to 1, however, the location of the thresholds must represent the surplus variation, i.e. the additional variability due to differences between subjects. The thresholds will be spaced more tightly, which results in the probability mass shifting to more extreme response categories. If a random intercept term is added to the model, the spacing of the threshold parameters will increase, reflecting the reduction in the residual (level-1) error term.
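A small simulation illustrates this tightening and widening of the thresholds (a sketch; the subject SD of 1 and the threshold locations are arbitrary choices):

```r
library(ordinal)

set.seed(7)
n_subj <- 100; n_obs <- 10
subject <- factor(rep(1:n_subj, each = n_obs))
u <- rnorm(n_subj, sd = 1)                               # between-subject variation
ystar <- u[as.integer(subject)] + rnorm(n_subj * n_obs)  # residual SD = 1
y <- cut(ystar, breaks = c(-Inf, -1, 0.5, 1, Inf),
         labels = 1:4, ordered_result = TRUE)
d <- data.frame(y, subject)

m_simple <- clm(y ~ 1, data = d)                   # between-subject variance ignored
m_mixed  <- clmm(y ~ 1 + (1 | subject), data = d)  # random intercepts for Subject

coef(m_simple)  # thresholds compressed (roughly the true values divided by sqrt(2))
coef(m_mixed)   # thresholds move back out towards -1, 0.5, 1
```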
(Largely) synonymous terms in the literature are underlying propensity/liability/sentiment/index function/utility.↩︎