3 Ordered regression models
These notes are currently under construction and may contain errors. Feedback of any kind is very much welcome: lukas.soenning@uni-bamberg.de
This chapter provides an overview of the most commonly used types of ordinal regression models. It is useful to know about different options, because the choice of model depends on aspects of the data. Unfortunately, the literature on ordered regression models can be confusing due to variation in terminology. Here, we follow Fullerton and Xu (2016), who propose a unified framework for classifying ordered regression models. Section 3.1 outlines the three building blocks that are combined to construct ordinal regression models. Section 3.2 then provides guidelines for choosing among different model types.
The most frequently used type of ordered regression model is the parallel cumulative model. This is the model that is consistent with the latent-variable motivation discussed in Chapter 4. It is worth emphasizing that the choice between models is not arbitrary, and in some cases another type is preferable. One reason may be that the assumptions of the parallel cumulative model are violated. Apart from that, different model types are adequate for different data-generating processes. For instance, if the categories of the ordinal outcome capture a sequence of stages with a logical starting point, where units progress through these stages in a fixed order, the cumulative model may be less adequate. In the following, we will first describe how the great majority of ordered regression models can be decomposed into three building blocks. Then we will consider guidelines that are helpful when choosing between these different types. Finally, some special cases are briefly mentioned.
3.1 Three building blocks
In general, ordered regression models consist of three building blocks. These can be combined to form different versions of ordered regression models. We will discuss the individual components in turn.
3.1.1 Model class
There are three major classes of ordinal regression models:
- cumulative models (Aitchison and Silvey 1957; McKelvey and Zavoina 1975),
- continuation-ratio models (Fienberg 1980), and
- adjacent-category models (Goodman 1983).
To understand the difference between these approaches, we must recognize that an ordered regression model decomposes the ordinal scale (with K categories) into K − 1 binary variables. A model for 5 outcome categories, for instance, breaks down the 5-point scale into 4 binary comparisons. Under the hood, then, every ordered regression model consists of a set of connected binary regressions, which are estimated simultaneously, with certain constraints on the parameters.
Since ordinal scales can be split in different ways, there are different classes of ordered regression models. The definition of a binary comparison (e.g. categories 1+2 vs. categories 3+4+5; or category 1 vs. category 2) is referred to as a cutpoint equation (Fullerton and Xu 2016, 5). For ordered outcomes, there are three basic possibilities. These are illustrated in Figure 3.1, where the horizontally aligned boxes represent a 5-point ordinal scale. Accordingly, 4 binary comparisons can be drawn, and these comparisons are shown as a stack of aligned boxes. In each comparison, the grey box(es) are compared to the white box(es). The cutpoint equation states which categories are involved in the comparison.
- For cumulative models, which appear at the far left in Figure 3.1, we split the ordinal scale at different cumulative probabilities, comparing in each case all categories below the split to all categories above the split. Each binary comparison therefore involves all response categories. This cumulative type of comparison asks: How do cases below a certain threshold compare to those above a certain threshold?
- For continuation-ratio models, the scale is partitioned in a different way: We compare each category (except for the highest one) to all categories above it. Thus, while the lowest category is contrasted with all other categories combined, only two categories remain for the final comparison. This continuation-ratio approach asks: How do cases in a specific category compare to those with a higher value on the ordinal scale?
- In adjacent-category models, neighboring categories are compared. Each comparison therefore involves only two categories. This strategy asks: How do cases in this specific category compare to those in the next-higher category?
While cumulative, continuation-ratio, and adjacent-category models use the ordinal information in the data, the baseline-category model does not. This type of model, which is shown at the far right in Figure 3.1, is therefore typically used for nominal outcomes.
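To make the three splitting schemes concrete, the following minimal R sketch constructs the binary comparisons for a hypothetical 5-point outcome y; the vector and the helper functions are purely illustrative and not taken from any package. Observations that do not take part in a given comparison are set to NA.

```r
# Hypothetical 5-point outcome (codes 1-5)
y <- c(1, 2, 3, 4, 5, 3, 2, 5)

# Cumulative:          categories <= k  vs.  categories > k
cum_split <- function(y, k) as.integer(y > k)

# Continuation ratio:  category k       vs.  categories > k
#   (only observations with y >= k take part in the comparison)
cr_split  <- function(y, k) ifelse(y >= k, as.integer(y > k), NA)

# Adjacent categories: category k       vs.  category k + 1
adj_split <- function(y, k) ifelse(y %in% c(k, k + 1), as.integer(y == k + 1), NA)

# The four binary comparisons (k = 1, ..., 4) under the cumulative scheme
sapply(1:4, function(k) cum_split(y, k))
```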
3.1.2 (Non-)reliance on the parallel regression assumption
The parallel regression assumption (sometimes called the proportional odds assumption) is an important building block of ordered regression models. It makes a fairly strict statement about the relationship between predictor and outcome. It is important to note that this assumption does not apply to the model as a whole, but rather to each predictor individually. Above, we saw that ordered regression models break up an ordinal scale with K categories into K − 1 binary splits. The parallel regression assumption states that the association between the predictor and the outcome is the same across all of these binary splits.
When we assume parallel regressions for a predictor, this means that the effect of the predictor is the same across all K − 1 binary comparisons. In other words, the relationship between predictor and outcome is assumed to be the same at each point along the ordinal scale. For instance, differences between age groups would be assumed to be identical at both ends of the ordinal scale. This is sometimes referred to as the assumption of symmetry.
We will discuss the parallel regression assumption in more detail below, including how to assess or test it. It is worth repeating that this assumption is made (or relaxed) for each predictor in the model individually.
Models where parallel regressions are assumed for all predictors are called parallel models. If the parallel regression assumption is relaxed for all predictors, we call it a non-parallel model (also referred to as a generalized model). In the intermediate case, where at least one (but not all) of the predictors are allowed to show non-parallel associations, the model is called a partial model.
3.1.3 Link function
Link functions describe the mathematical nuts and bolts of ordered regression models. The most commonly used link functions are the logit, probit, and complementary log-log link. The choice between the logit and probit is basically a question of convenience and/or convention. Results barely differ. The complementary log-log link function, on the other hand, should be handled with care since it is not symmetric. This means that results will differ depending on how you arrange the ordered scale before modeling (in increasing or decreasing order).
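The asymmetry of the complementary log-log link can be checked directly in R: for the logit and probit links, g(p) = −g(1 − p), so reversing the ordinal scale merely flips the signs of the coefficients, whereas for the complementary log-log link this identity fails. A small illustration (no particular data set assumed):

```r
p <- c(0.2, 0.5, 0.8)

# Logit, probit, and complementary log-log transformations of p
qlogis(p)            # log(p / (1 - p))
qnorm(p)
log(-log(1 - p))

# Symmetry g(p) = -g(1 - p) holds for logit and probit ...
all.equal(qlogis(p), -qlogis(1 - p))   # TRUE
all.equal(qnorm(p),  -qnorm(1 - p))    # TRUE

# ... but not for the complementary log-log link
log(-log(1 - p))
-log(-log(p))                          # not the same values
```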
3.1.4 Summary
Here is an overview of the three building blocks. Any combination of these features may be used to model ordinal data.
Building block | Options
---|---
Model type | Cumulative; Continuation-ratio; Adjacent-category
Parallel regressions | Yes (parallel); No (non-parallel); For some predictors (partial)
Link function | Logit; Probit; Complementary log-log
The most widely used form of ordered regression is the parallel cumulative model with a logit link, which is also referred to as a proportional odds model. Non-parallel models are sometimes referred to as generalized models.
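As a minimal sketch, a parallel cumulative logit model can be fitted in R with MASS::polr() or ordinal::clm(); the data frame d and the variables rating, age, and gender below are placeholders, and the outcome must be an ordered factor.

```r
# Minimal sketch of a proportional odds model; 'd', 'rating', 'age', and
# 'gender' are placeholder names, not an actual data set
library(MASS)        # provides polr()

d$rating <- factor(d$rating, ordered = TRUE)
m_po <- polr(rating ~ age + gender, data = d, method = "logistic")
summary(m_po)

# Equivalent model with the ordinal package, which also accommodates partial
# and non-parallel structures (see Section 3.2.2):
# library(ordinal)
# m_po <- clm(rating ~ age + gender, data = d, link = "logit")
```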
3.2 Selecting a model: Guidelines
The following recommendation by Long and Freese (2014, 310) is worth keeping in mind: “[…] we suggest that you always compare the results from ordinal models with those from a model that does not assume ordinality,” i.e. a baseline-category model.
In general, if the idea of a continuous latent variable is sensible, a cumulative model may be chosen. However, in order to permit simple interpretations based on means of the underlying latent variable, the parallel regression assumption must be met. This is to say that latent variable interpretations are only valid for parallel cumulative models.
3.2.1 Model type
The choice of model type depends on the research question and the assumed data-generating process.
3.2.1.1 Cumulative model
The cumulative model is useful when the focus is on identifying trends in the outcome, either upward or downward, for different values of the predictors. An advantage of the cumulative model is that it is (approximately) invariant to the choice and number of response categories. This means that the same results would have been obtained had the outcome been measured with fewer (or more) categories.
Potential problems. The cumulative model usually fits poorly if the variability of an underlying latent variable changes dramatically over the range of observed values. This is the case, say, when attitudes become more (or less) variable (or polarized) in certain sub-populations, e.g. in older respondents. However, it is possible to model the variability of the underlying variable as a function of (the) predictors in the model (see Section #).
3.2.1.2 Continuation-ratio model
A requirement for continuation-ratio models is that the ordinal variable reflects a sequential process. This means that there is a logical starting point and units have to proceed through earlier stages in the sequence in order to reach a higher category. This progression is also assumed to be irreversible. Typical examples are duration and developmental scales.
The focus of a continuation-ratio analysis is to understand the factors that distinguish observations that have reached a particular response level (but do not move on) from those that do advance to a higher level. Of course, by reversing the category order we can reverse the interpretation, i.e. compare those units that have “made it” to a certain stage to those that haven’t. Thus, the model and its interpretation depend on how the ordinal outcome is coded (in increasing vs. decreasing order).
Due to the way the ordinal scale is split into binary comparisons (see above), the model provides the probability that an observation moves beyond a stage once a particular stage has been reached (likelihood of advancing or the probability of a transition).
Potential problems. As illustrated above, the continuation-ratio model progressively narrows down the subset of observations involved in the binary comparisons. Observations with responses at the same end of the scale may be expected to be more homogeneous, on average. This creates the potential problem of sample selection bias, as the continuation-ratio model relies on the assumption that the subsamples do not differ systematically with respect to unobserved or omitted variables. Changes in coefficients may thus reflect changes in unobserved heterogeneity rather than “true” effects of the variables.
3.2.1.3 Adjacent-category model
An advantage of adjacent-category models is that they are also applicable to retrospective (case-control) studies, i.e. studies in which sampling depends on the outcome: the sample is composed so that there is a certain (prespecified) number of observations for each outcome category. As a result, the observed distribution of the outcome variable is deliberately manipulated by the researcher rather than representative of the population. This sampling procedure is common in medical research, for instance, where certain outcome categories are oversampled to achieve a more balanced distribution than random sampling would produce.
Adjacent-category and cumulative models usually fit equally well. Other things being equal, the choice between the two depends on whether you prefer effects to refer to individual response categories (adjacent-category model) or groupings of categories using the entire scale (cumulative model). Adjacent-category models can also clarify which explanatory variables might best predict a response being in the next-highest category, thus helping to identify differences between pairs of categories.
Potential problems. Adjacent-category models rely on the additional assumption of independence of irrelevant alternatives (see Long 1997, 182–83; 2014, 177; Cheng 2007).
Fullerton and Xu (2016) state that “adjacent models are better suited than cumulative models for certain types of ordinal variables, including Likert scales and other attitudinal scales that have “additional structure” (Sobel 1997, pp. 215–216).”
Read Sobel 1997.
3.2.2 The parallel regression assumption
As discussed above, the parallel regression assumption (also called proportional odds restriction) imposes a constraint on the regression coefficients for a specific predictor: They are forced to be constant across the ordinal scale. This means that the change in log odds associated with a change in the predictor is the same for each cutpoint equation. To understand the nature of this constraint, consider Figure 3.2. The graph shows the distribution of responses for female and male informants using a stacked bar chart. The cumulative proportions of the observed distribution are located on the probability scale using black dots. These sample proportions are then mapped onto the logit scale, the scale on which the model operates. These quantities are referred to as cumulative logits. It is on this scale that the two groups are compared. This means that a cumulative model looks at the differences between these cumulative logits. In Figure 3.2, the lines connecting the cumulative logits for male and female informants are nearly parallel, suggesting that the differences between cumulative logits are very similar across the ordinal scale. The parallel regression assumption states that these differences (or slopes) are constant across the scale. Figure 3.2 suggests that this assumption is tenable for the predictor Gender in the dataset at hand.
Let us also look at the parallel regression assumption for a continuous predictor. We start by creating five Age bins with roughly the same number of speakers. For each age subgroup we then calculate exceedance proportions and convert these into logits. The left-hand graph in Figure 3.3 shows exceedance proportions. The right-hand panel shows them on the logit scale. If the profiles are parallel on the logit scale, the parallel regression assumption is tenable. Clear deviation from a straight-line pattern would suggest that a simple regression line underfits the data.
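A descriptive check of this kind can be computed directly from the data. The sketch below assumes a placeholder data frame d with a numeric predictor age and an ordinal outcome rating coded numerically from 1 to 5; all names are illustrative.

```r
# Five age bins with roughly equal numbers of speakers
d$age_bin <- cut(d$age,
                 breaks = quantile(d$age, probs = seq(0, 1, 0.2)),
                 include.lowest = TRUE)

# Exceedance proportions P(rating > k) per age bin and cutpoint k,
# then mapped onto the logit scale
exceed       <- sapply(1:4, function(k) tapply(as.numeric(d$rating) > k, d$age_bin, mean))
exceed_logit <- qlogis(exceed)   # bins with proportions of 0 or 1 yield -Inf/+Inf

# Roughly parallel profiles across the four columns (cutpoints) support the
# parallel regression assumption
round(exceed_logit, 2)
```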
The parallel regression assumption can be assessed in different ways. Formal tests, for instance, can be used to compare two models: one which enforces the parallel constraint on a specific predictor (parallel model), and one where the coefficients may vary freely across cutpoint equations (non-parallel model) (see Fullerton and Xu 2016, Ch. 5). These tests often reject the parallel regression assumption and therefore suggest fitting a more complex model with non-parallel slopes.
Before we illustrate different ways of assessing the parallel regression assumption, let us emphasize that we should be somewhat hesitant to dismiss the parallel regression assumption. This is for several reasons (see Fullerton and Xu 2016, 9, 109).
- Tests are oversensitive: Sole reliance on tests is not recommended, as they are overly sensitive to departures from the parallel regression assumption. These tests are also sensitive to other types of model misspecification (see Greene and Hensher 2010, 187–88).
- Parsimony: Parallel models have the advantage of being parsimonious: They are less costly in terms of the number of parameters that are “spent” to describe patterns in the data. This advantage is especially important for analyses based on small samples. For reasons of parsimony, a simple model with parallel regression structure is sometimes preferable. Tutz (2012: 241) notes that “[…] with categorical data, parsimonious models are to be preferred because the information content in the response is always low”.
- Practical significance: While a formal test may suggest that the variation in coefficients across cutpoint equations is “statistically significant”, that a non-parallel model provides a better fit, or that it has greater predictive utility, this does not mean that the parallel model leads to substantively different conclusions.
- Latent-variable interpretation: A parallel model can be interpreted on a latent-variable scale, which allows us to effectively condense the patterns of variation (see Sönning et al. 2024). The non-parallel model rules out this interpretative strategy, and dissociates the model from the idea of an underlying continuous process (Greene and Hensher 2010, 190).
If a formal test rejects the proportional odds restriction, this should always be followed up with an inspection of predicted probabilities, as these give an idea of the importance of the difference (see Kim 2003; Long 2014, 183, 200; Fullerton and Xu 2016, 123).
We now consider formal tests and informal assessments of the parallel regression constraint.
3.2.2.1 Formal tests
Fullerton and Xu (2016, Ch. 5) describe a number of tests that can be used to assess the parallel regression assumption. Among these, the likelihood-ratio test (LR test) appears to show the most favorable performance (Peterson and Harrell 1990, 209). This test compares two models:
- M0: The reference model with the parallel regression constraint for a specific predictor
- M1: The more complex model, where coefficients for this predictor are free to vary across cutpoint equations
M0 is nested in (i.e. a special case of) M1, which is necessary for a sensible application of the LR test. The test then looks at which model provides a statistically better fit to the data, keeping in mind the number of parameters spent.
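With the ordinal package, such a comparison can be set up by fitting both models and passing them to anova(); the nominal argument releases the parallel constraint for the named predictor. The data frame d and the variables rating and date_of_birth are again placeholders.

```r
library(ordinal)

# M0: parallel constraint for date_of_birth
m0 <- clm(rating ~ date_of_birth, data = d, link = "probit")

# M1: coefficients for date_of_birth free to vary across cutpoint equations
#     (the 'nominal' argument releases the parallel constraint)
m1 <- clm(rating ~ 1, nominal = ~ date_of_birth, data = d, link = "probit")

# Likelihood-ratio test: M0 is nested in M1
anova(m0, m1)
```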
Table 3.1 shows the results of the LR test comparing M0 and M1.
Model | Parameters | AIC | Log likelihood | Test statistic | df | p-value
---|---|---|---|---|---|---
M0 | 4 | 534.32 | −263.16 | | |
M1 | 6 | 533.15 | −260.58 | 5.16 | 2 | .076
The comparison favors M1, the non-parallel model: its AIC is lower (lower values signal greater predictive utility) and its log likelihood is higher (higher values reflect a better statistical fit). Note, however, that the p-value falls short of the conventional .05 threshold, so the evidence against the parallel regression assumption is suggestive rather than conclusive.
3.2.2.2 Informal assessments of the parallel regression assumption
When there is indication of a violation of the parallel regression assumption, it is useful to also carry out informal checks, to inspect the nature of the departure and to gauge its seriousness (see Kim 2003). Fullerton and Xu (2016, 118–30) suggest three types of checks, which we will discuss next:
- Model comparison using information criteria
- Comparison of the varying coefficients
- Comparison of predicted probabilities
3.2.2.2.1 Model comparison using information criteria
It should be noted that the BIC is biased towards parsimony (see Fullerton and Xu 2016, 120), which means that it will be more hesitant to dismiss the parallel regression assumption. For the predictor Date of birth, the two information criteria point in different directions: while the AIC suggests slightly better predictive performance for M1, the BIC favors M0, the parallel model.
Model | AIC | BIC | Log likelihood | Deviance
---|---|---|---|---
M0 | 534.3 | 547.4 | −263.2 | 526.3
M1 | 533.2 | 552.7 | −260.6 | 521.2
3.2.2.2.2 Comparison of coefficients across cutpoint equations
Another informative assessment is to fit the relevant K − 1 binary regressions separately and compare the coefficients to check whether they vary greatly (or systematically) along the scale. The first step is to create K − 1 indicator variables, which divide the K-point ordinal scale at K − 1 points. For checking the parallel regression assumption of a cumulative model, these binary splits are cumulative, as illustrated in Figure 3.4. Note how the responses exceeding a particular cutpoint may be coded as 1 (yielding exceedance probabilities) or as 0 (yielding cumulative probabilities). The directionality must be chosen so that the coefficients from the binary regressions have the same sign as those from the ordinal model.
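A sketch of this procedure (using the same placeholder data frame as above): each cumulative split is coded as an exceedance indicator and modeled with a separate binary probit regression, matching the link of the ordinal model.

```r
# One exceedance indicator per cutpoint (1 if rating > k, 0 otherwise),
# each modeled with a separate binary probit regression
fits <- lapply(1:4, function(k) {
  glm(I(as.numeric(rating) > k) ~ date_of_birth,
      data = d, family = binomial("probit"))
})

# Coefficient for date_of_birth with Wald 95% confidence limits, per cutpoint
t(sapply(fits, function(m) {
  c(estimate = unname(coef(m)["date_of_birth"]),
    confint.default(m)["date_of_birth", ])
}))
```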
The coefficients are listed in Table 3.3.
Cutpoint equation | Estimate | Lower limit | Upper limit
---|---|---|---
1 vs. (2,3,4,5) | −0.83 | −1.43 | −0.29
(1,2) vs. (3,4,5) | −1.32 | −1.96 | −0.76
(1,2,3) vs. (4,5) | −1.35 | −1.92 | −0.82
(1,2,3,4) vs. 5 | −1.38 | −1.96 | −0.84
Figure 3.5 shows a graphical inspection of the parallel regression assumption in the heaps data. Each estimate is shown as a dot and the error bar indicates statistical uncertainty (95% CI).
The following questions should guide the interpretation of the figure:
- Do the coefficients for each predictor (or across the levels of a categorical predictor such as Function and Register) form a horizontal line? If they do, this means that they are of roughly equal magnitude across the binary splits, which in turn means that their effect is (nearly) constant across the ordinal scale. If the coefficients align horizontally, we can safely maintain the parallel regression assumption.
- If there is no horizontal pattern, how do the coefficients differ? Do they show an erratic or a systematic pattern? In Figure 3.5, we see a systematic pattern: The association between Date of birth and the ordinal outcome, for instance, grows in strength towards the “+2” end of the ordinal scale. This means that differences between cohorts are more pronounced at the upper end of the scale.
- Finally, we should also take into consideration the statistical uncertainty of the coefficient estimates. Estimates that show relatively wide margins of error should caution us against drawing strong conclusions about deviations from the parallel regression assumption.
Taken together, these checks provide information about the plausibility of the parallel regression assumption for the data at hand.
3.2.2.2.3 Comparison of predicted probabilities
Fullerton and Xu (2016, 123) argue that it is more informative to compare predicted probabilities since large differences in log odds (or on the probit scale) may not correspond to large differences in predicted response probabilities. To have an additional point of reference, we also consider a visual summary of the observed data distribution.
Let us now compare these sample proportions with the predicted probabilities based on the two models. Recall that we have fit two models: M0, which enforces the parallel regression assumption for the predictor Date of birth (parallel cumulative probit model), and M1, which relaxes the assumption for this predictor (non-parallel cumulative probit model).
In the left-hand panel in Figure 3.6, a line plot shows the observed distribution of response proportions by Date of birth. For this descriptive graph, it is important that we apply smoothing techniques to the data summary – otherwise it would be of little use as a benchmark.1 The right-hand panel in Figure 3.6 presents the predicted response probabilities: solid lines for the parallel model, dotted lines for the non-parallel model. The two models yield similar predictions, but we do observe differences in the extreme categories: the downward trend in apparent time for exclusively BrE parcel (dark blue) is steeper in the non-parallel model. The upward trend for AmE package (dark red), on the other hand, is attenuated in the non-parallel model.
Next, we compare the predicted probabilities using area charts. Figure 3.7 shows that the differences between M0 and M1 are relatively minor.
Figure 3.8 shows the cumulative probabilities using a line plot. This allows us to superpose the predicted probabilities from the models in the same display.
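For reference, the predicted probabilities underlying such comparisons can be obtained from both models with predict(); the sketch below reuses the placeholder models m0 and m1 from above, together with a purely hypothetical grid of dates of birth.

```r
# Hypothetical grid of dates of birth spanning the observed range
newdat <- data.frame(date_of_birth = seq(1920, 2000, by = 5))

# Predicted probabilities for each response category under both models
p_m0 <- predict(m0, newdata = newdat, type = "prob")$fit
p_m1 <- predict(m1, newdata = newdat, type = "prob")$fit

# Largest absolute discrepancy between the two sets of predictions
max(abs(p_m0 - p_m1))
```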
3.2.2.3 Parallel or non-parallel?
In our illustrative example, we have seen that the formal test points toward a violation of the parallel regression assumption. An inspection of regression coefficients across cutpoint equations showed that the association between Date of birth and the ordinal outcome deviates somewhat at cutpoint 1|2. When we looked at predicted probabilities, however, we noted that the parallel and non-parallel model produce largely similar results. We would therefore maintain the parallel regression assumption for this predictor.
In a model with multiple predictors, the decision whether or not to relax the parallel regression assumption must be made for each one individually. If at least one (but not all) predictors are “set free”, the model is called a partial model2. If all predictors are “set free”, the model is called a non-parallel model. Partial and non-parallel models allow for asymmetrical relationships (factors may have a stronger effect at one end of the ordered outcome scale) or erratic relationships. If possible, we should rely on theory to decide whether or not it is reasonable to expect (a)symmetrical relationships. Non-parallel models are equivalent to baseline-category models, which are commonly used for nominal data; this means they have no precision (or “power”) advantage, because no use is made of the ordinal information in the data.
Table 3.4 shows the trade-off between parsimony and flexibility that is tied to the extent to which the parallel regression assumption is relaxed:
The numeric columns give the number of model parameters for an outcome with K = 3, 4, 5, or 6 categories, assuming three predictors (X1 to X3):

Model | X1 | X2 | X3 | K = 3 | K = 4 | K = 5 | K = 6 | Flexibility | Parsimony
---|---|---|---|---|---|---|---|---|---
Parallel | Parallel | Parallel | Parallel | 5 | 6 | 7 | 8 | − | ++
Constrained (partial) | Parallel | Constrained | Free | 7 | 9 | 11 | 13 | o | +
Unconstrained (partial) | Parallel | Free | Free | 7 | 10 | 13 | 16 | + | o
Non-parallel | Free | Free | Free | 8 | 12 | 16 | 20 | ++ | −
3.2.3 Link functions
While the logit, probit, and complementary log-log links are applied most often, other link functions are possible. For instance, the cauchit link is less sensitive to outliers and thus provides a more robust fit to the data. However, it requires specialized software. The choice between logit and probit is largely discipline-specific and thus conventional. On the probability scale, the two link functions produce virtually indistinguishable results. The logit link has the (minor) advantage that exponentiated regression coefficients can be interpreted as odds ratios. However, as we will see below, odds ratios are not useful for understanding ordered regression models since they are an unintuitive, unnatural metric that is poorly understood by most audiences.
Symmetry: The logit and probit link functions are symmetrical, which means that the estimates are unaffected by a reversal of the outcome categories. This does not apply to the complementary log-log link. In the absence of any principled recommendations, the choice of link function can be based on information criteria, which indicate which link achieves the best fit to the data.
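As a rough sketch with the ordinal package, candidate link functions can be compared by refitting the same (placeholder) model and collecting the information criteria:

```r
library(ordinal)

links <- c("logit", "probit", "cloglog")
fits  <- lapply(links, function(l) clm(rating ~ date_of_birth, data = d, link = l))

# Lower values indicate a better trade-off between fit and complexity
data.frame(link = links,
           AIC  = sapply(fits, AIC),
           BIC  = sapply(fits, BIC))
```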
3.3 Model diagnostics
3.3.1 Residuals
Harrell (2015, 314–15) provides a brief discussion of residuals in parallel cumulative models. He notes that partial residuals based on binary regressions for all cutpoint equations may be calculated. For these, smoothed curves are then examined. If the proportional odds assumption holds, these should be parallel, and if a numeric predictor is represented appropriately in the model, there should be no indication of non-linearity. Partial residuals for binary models are discussed in Landwehr, Pregibon, and Shoemaker (1984).
Read Landwehr, Pregibon, and Shoemaker (1984)
Li and Shepherd (2012) proposed a probability-scale residual, which yields one score per unit (or case/observation).
Read Li and Shepherd (2012)
Fox (2010, 109–10) shows how to calculate residuals for cumulative-link models. Are these really Bayesian latent residuals?
Figure 3.9 illustrates the calculation of probability-scale residuals. Based on the fitted value for a particular observation, we can calculate the predicted response probabilities. The top panel in Figure 3.9 illustrates this for a case whose fitted value falls in between thresholds 1 and 2. The residual for this observation then depends on the observed response. If the observed response is 1, it is lower than expected (or predicted); the residual is then negative. The value of the residual is a simple difference of two predicted probabilities: the model-based probability of a response lower than the observed response, minus the model-based probability of a response higher than the observed one. These probabilities are shown in Figure 3.9 using grey shading. The fitted value remains the same across all graphs; they show how the residual is calculated for different observed responses. The subtraction of the two probabilities is done in a way that positive differences indicate that the observed response is higher than expected, and negative differences indicate that it is lower than expected.
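A bare-bones illustration of this calculation, assuming we already have the model-based probabilities for the K response categories of one observation (the function psr() is ad hoc, not taken from a package, and the numbers are made up):

```r
# Probability-scale residual for one observation:
#   r = P(Y < y_obs) - P(Y > y_obs), based on model-implied probabilities
psr <- function(p, y_obs) {
  # p:     predicted probabilities for categories 1, ..., K
  # y_obs: observed category (integer between 1 and K)
  sum(p[seq_len(y_obs - 1)]) - sum(p[-seq_len(y_obs)])
}

# Illustrative values: predicted probabilities on a 5-point scale, observed response 2
p <- c(0.10, 0.25, 0.35, 0.20, 0.10)
psr(p, 2)   # negative: the observed response is lower than the model expects
```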
The calculation of a latent-scale residual is demonstrated in Fox (2010, 109–10). If we denote the fitted value of observation i as \(\theta_i\) and the observed category as \(c\), the latent-scale residual is obtained as follows:
\[ E(\epsilon_i | Y_i = c, \theta_i) = \frac{\phi(\tau_{c-1} - \theta_i) - \phi(\tau_{c} - \theta_i)}{\Phi(\tau_{c} - \theta_i) - \Phi(\tau_{c-1} - \theta_i)} \]
To understand this formula, let us simplify it by representing the two quantities in the numerator as \(D_1\) and \(D_2\) (since they are densities); the difference between the two quantities in the denominator, on the other hand, is just the predicted response probability of the observed response.
\[ \begin{align} E(\epsilon_i | Y_i = c, \theta_i) &= \frac{\phi(\tau_{c-1} - \theta_i) - \phi(\tau_{c} - \theta_i)}{\Phi(\tau_{c} - \theta_i) - \Phi(\tau_{c-1} - \theta_i)} \\ &= \frac{D_1 - D_2}{P(Y_i = c)} \end{align} \]
Now we can look at these quantities in a graph. Figure 3.10 shows a normal density centered on \(\theta_i\), the fitted value (marked as a dot on the x-axis). The scenario we are looking at is for an observed response of 3. The expected (or predicted) probability of this response is represented by the shaded area between \(\tau_2\) and \(\tau_3\). It makes sense for this predicted probability to occur in the denominator: If it is small, the model gives a low probability to the response that was actually observed. In other words, the response is in some sense untypical. The smaller the predicted probability, the larger the residual (in absolute terms). The other two quantities, \(D_1\) and \(D_2\), are marked on the y-axis. Figure 3.10 shows that these are the normal densities at the two thresholds flanking the observed category, in our case \(\tau_2\) (\(D_1\)) and \(\tau_3\) (\(D_2\)). The subtraction of the two densities is done in a way that positive differences indicate that the observed response is higher than expected, and negative differences indicate that it is lower than expected.
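The formula translates directly into R. In the ad-hoc function below, tau is the vector of estimated thresholds, padded with −Inf and +Inf at the ends so that the expression also covers the lowest and highest categories; all input values are illustrative.

```r
# Latent-scale residual: mean of a standard normal variable truncated to the
# interval between the thresholds flanking the observed category
latent_resid <- function(theta, y_obs, tau) {
  # theta: fitted value; y_obs: observed category (1, ..., K)
  # tau:   estimated thresholds, padded with -Inf and +Inf at the ends
  lo <- tau[y_obs]        # tau_{c-1}
  hi <- tau[y_obs + 1]    # tau_c
  (dnorm(lo - theta) - dnorm(hi - theta)) /
    (pnorm(hi - theta) - pnorm(lo - theta))
}

# Illustrative thresholds for a 5-point scale, fitted value 0.4, observed response 3
tau <- c(-Inf, -1.5, -0.5, 0.5, 1.5, Inf)
latent_resid(theta = 0.4, y_obs = 3, tau = tau)
```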
Let us apply these two types of residual to the model for parcel. Figure 3.11 shows the distribution of the residuals with a histogram. Note that probability-scale residuals are bounded by −1 and +1. The latent-scale residuals assume a more bell-shaped profile.
Next, we inspect the residuals to detect potential problems with the way our model handles the continuous predictor Date of birth. To this end, Figure 3.12 graphs the residuals against the predictor Date of birth and overlays a smoother, to detect deviation from the assumed linearity. Both residuals seem to suggest that a straight-line trend may oversimplify the pattern in the data, but the deviation from this linear representation is relatively complex.
It should be possible to use residuals to check for heteroskedasticity by looking at their absolute values against predictor variables.
Residuals for mixed-effects models.
3.3.2 Ordinality assumption
Harrell (2015) talks about checking the ordinality assumption. A simple way of doing this is to calculate means of the predictor variable for different levels of the ordinal outcome. If the means for two neighboring categories on the ordinal scale are largely equivalent for many of the predictor variables, these categories may be collapsed.
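A quick way to compute such a summary (again with placeholder variable names):

```r
# Mean of a numeric predictor within each level of the ordinal outcome;
# adjacent levels with near-identical means across many predictors are
# candidates for collapsing
tapply(d$date_of_birth, d$rating, mean)

# The same summary for several predictors at once
aggregate(cbind(date_of_birth, age) ~ rating, data = d, FUN = mean)
```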
3.4 Other ordered regression models
The stereotype model can be considered as intermediate between multinomial and ordered regression models. It can be used when you are unsure about the ordering of levels, or when it is suspected that one or more levels can be collapsed. The stereotype model can therefore be used to evaluate if the levels are in their proper order. It provides the (data-based) ordering of levels and quantifies the “closeness” of categories (see Agresti 2010, 103–15; Long and Freese 2014, 445–54).
We would usually rely on the “SJ” method for determining the kernel density estimation bandwidth, but here we have set the smoothing parameter manually.↩︎
Partial models can be further distinguished into “constrained” and “unconstrained” models. In a constrained partial model, the non-parallel coefficients are assumed to vary systematically across the binary splits, i.e. to show a certain pattern (such as a stronger effect at the lower end of the scale, or a linear trend across splits); in an unconstrained partial model, they may vary freely.↩︎