1  Background

Under construction

These notes are currently under construction and may contain errors. Feedback of any kind is very much welcome: lukas.soenning@uni-bamberg.de

1.1 Ordinal variables

Ordinal variables consist of a set of ordered categories. Some examples are the following:

  • Likert-type response scales (strongly agree – agree – neutral – disagree – strongly disagree)
  • Semantic differential scales (friendly □ □ □ □ □ unfriendly)
  • Position on grammaticalization cline (content word – grammatical word – clitic – affix)
  • Grouped continuous variables, i.e. variables that could have been measured on a continuous scale (e.g. frequency categories: high – moderate – low)

In the taxonomy proposed by Stevens (1946), ordinal variables fall between the nominal and interval levels of measurement. In contrast to nominal variables, the categories are ordered. However, the distances between them are unknown, which sets ordinal variables apart from interval variables.
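This defining property (order without known distances) can be made explicit in code. The following is a minimal sketch in Python, with illustrative category labels:

```python
import pandas as pd

# A 5-point Likert-type scale stored as an ordered categorical:
# the encoding records the ordering of the categories, but says
# nothing about the distances between them.
responses = pd.Categorical(
    ["agree", "neutral", "strongly agree", "disagree"],
    categories=["strongly disagree", "disagree", "neutral",
                "agree", "strongly agree"],
    ordered=True,
)
print(responses.min(), "|", responses.max())  # 'disagree' | 'strongly agree'
```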

There are (many) different types of ordinal characteristics. In some cases, the assumption of an underlying (unobserved) latent variable makes sense. This means that the construct of interest can reasonably be thought of as a continuous (rather than a categorical) trait; what makes it an ordinal categorical variable is the way it is measured (its operationalization). This often applies to Likert-type and semantic differential scales. An ordinal scale may also be sequential, which means that units advance through the categories (or stages) and must start from the lowest. Examples are proficiency levels or the position of an item on the grammaticalization cline. Ordinal variables may also result from limitations of data availability and give a coarse categorization of a variable that could have been measured on an interval scale. Examples of such grouped continuous variables are frequency bands or income deciles. Finally, some ordinal variables may be considered to reflect two (or more) dimensions: Likert scale items, for instance, reflect direction (agreement vs. disagreement) and intensity of opinion (strong vs. “moderate” (dis-)agreement).

As we will see in Section 3.2, the nature of an ordinal variable has consequences for its statistical analysis.

1.2 Simplified analysis strategy: Mean response models

In practice, ordinal variables are commonly analyzed as though they had been measured on an interval scale (see Sönning 2024; Sönning et al. 2024). Researchers assign numbers to the categories and then calculate averages and/or fit ordinary linear regression models. This technique has received different labels in the literature:

  • Mean response model (MRM), because it relies on averages (means) of the assigned scores.
  • Interval-scale analysis, as the ordinal outcome is treated as an interval variable.
  • Numeric-conversion approach, since the responses are converted into numeric scores.

This approach to ordinal data analysis has a number of shortcomings compared to ordinal regression models. An orthodox approach to ordinal data analysis would therefore discard MRMs on the grounds that they are inadequate. This view does not seem constructive, however. In his seminal textbook on ordinal data analysis, Agresti (2010, 5) notes that “we do not take a rigid view about permissible methodology for ordinal variables”. It is much more helpful to make an effort to (i) understand the limitations of MRMs, which may at least to some extent be sidestepped and, equally importantly, (ii) learn from the appeal of MRMs to optimize our use (especially our interpretation and communication) of ordered regression models (see Sönning et al. 2024).

1.2.1 Limitations of mean response models

In general, MRMs should only be used if the assumption of an underlying continuous variable makes sense. Even then, a number of issues arise, which are briefly summarized in the current section. For further details, see Long (1997, 35–40, 116–19), Agresti (2010, 5–8, 137–40), Bürkner and Vuorre (2019), and Liddell and Kruschke (2018).

1.2.1.1 Assigning scores to categories

MRMs require us to specify the distances between categories. If these are unknown, the analysis can give misleading results. In many cases, there is no clear-cut choice for the scores. Most often researchers assume equal distances between categories (i.e. assign scores of, say, 1 to 5 for a 5-point response scale). In certain cases, however, a different set of scores (or, more precisely, a different set of distances between categories) can be motivated on substantive grounds (see Sönning 2024). A custom scoring system may then be used instead.1
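As a concrete illustration, the following sketch computes the mean response under equal-interval scores and under a custom scoring system; the counts and the custom scores are made up for this example:

```python
import numpy as np

counts = np.array([10, 25, 40, 20, 5])       # made-up counts for categories 1..5

equal_scores  = np.array([1, 2, 3, 4, 5])    # equal distances between categories
custom_scores = np.array([1, 2, 3, 4.5, 6])  # hypothetical substantively motivated scores

mean_equal  = np.dot(counts, equal_scores)  / counts.sum()   # 2.85
mean_custom = np.dot(counts, custom_scores) / counts.sum()   # 3.00
```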

1.2.1.2 Floor and ceiling effects

MRMs can also give misleading results due to floor and ceiling effects. A ceiling effect occurs when most observations (or observations in certain parts of the data space) cluster at the upper end of the scale; a floor effect is the reverse pattern. Due to the hard scale bounds, differences between groups shrink as the groups approach the upper or lower limit of the scale. The variability of the responses within a group likewise decreases as the group mean approaches the scale limits. To (at least partly) sidestep these distortions, we may design the scale with extreme categories at both ends, which makes it less likely that responses pile up at the boundaries.
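A small simulation makes this compression visible: the same shift on an underlying latent scale produces a smaller difference in observed means near the ceiling than in the middle of the scale. The thresholds and shift size below are our assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
thresholds = np.array([-1.5, -0.5, 0.5, 1.5])  # cut points mapping latent values to categories 1..5

def observed_mean(latent_mean, n=100_000):
    latent = rng.normal(latent_mean, 1.0, size=n)
    categories = np.searchsorted(thresholds, latent) + 1  # scores 1..5
    return categories.mean()

# The same latent shift of 0.5 ...
mid_diff     = observed_mean(0.5) - observed_mean(0.0)  # ... in the middle of the scale
ceiling_diff = observed_mean(2.5) - observed_mean(2.0)  # ... near the ceiling
# ceiling_diff is noticeably smaller than mid_diff
```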

1.2.1.3 Measurement error

A particular category on the ordinal scale is typically consistent with a range of values of the underlying attribute that is being measured. To use MRMs, each response category must be represented with a single numeric score. By assigning a single score, we therefore introduce measurement error, since the actual level of the underlying dimension may have been higher or lower than the numeric replacement. The issue of measurement error is more problematic if the number of response categories is rather small – it can be mitigated (somewhat) by using a larger number of categories (5 or more).
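This discretization step can be mimicked in code: every latent value falling into an interval is replaced by the same (midpoint) score, and the resulting error shrinks as the number of categories grows. The latent distribution and cut points below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
latent = rng.normal(0, 1, size=100_000)

def score(latent, k):
    """Cut the latent variable into k categories and return midpoint
    scores, mimicking a k-point response scale."""
    edges = np.linspace(-2, 2, k - 1)      # interior cut points
    midpoints = np.linspace(-2.5, 2.5, k)  # one score per category
    return midpoints[np.searchsorted(edges, latent)]

for k in (3, 5, 7, 11):
    err = np.abs(score(latent, k) - latent).mean()
    print(k, round(err, 3))  # mean absolute error decreases with more categories
```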

1.2.1.4 Other points

MRMs do not yield estimated probabilities for the response categories. This is not a problem for model interpretation if we are only interested in overall trends rather than individual category probabilities. However, it makes it more difficult to check the fit of a model. Further, predictions and estimates based on MRMs can extend beyond the scale limits, above the highest or below the lowest category.
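The last point is easy to demonstrate with a toy example: a linear model fitted to scores between 1 and 5 can happily predict values outside that range.

```python
import numpy as np

# Toy data: scores 1..5 regressed on a single predictor x
x = np.array([0, 1, 2, 3, 4, 5])
y = np.array([1, 2, 3, 4, 5, 5])   # responses capped at the top category

slope, intercept = np.polyfit(x, y, 1)
print(intercept + slope * 7)       # about 7.2: well above the maximum score of 5
```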

1.2.2 Advantages of mean response models

MRMs have a number of advantages over ordinal regression models. To run the analysis, the researcher requires less technical knowledge and no specialized software. MRM techniques are also more widely familiar. Further, MRM analyses are quicker (in terms of computation time), more flexible, and less prone to computational problems (e.g. non-convergence).

The critical advantage of MRMs, however, lies in model interpretation and communication. These models are particularly useful for quickly identifying important variables and trends. They also provide simple descriptions of the patterns in the data. Finally, the output of MRMs is easier to interpret.

1.3 Interpretation

The choice of data analysis strategy must strike a balance between adequacy and interpretability. While features of the data (such as the measurement scale) may suggest a particular class of procedures, these methods have little value if they produce output that is difficult to interpret and/or communicate. In later sections, we will see that the output of ordered regression models can be communicated just as effectively and transparently as that of mean response models.

Before we go further, let us briefly outline some priorities that may guide the analysis of data and the communication of results. Since regression tables, the typical output of statistical software, are difficult to process and interpret, we need additional methods of interpretation. Inferential measures such as p-values provide an incomplete picture, as our linguistic objectives will usually ask for information about the shape and magnitude of patterns in the data. This is also because we are routinely interested in the relative importance of predictors, i.e. we want to compare them in terms of their strength of association with (or “effect on”) the outcome. When interpreting and communicating results, we will therefore make an effort to use meaningful quantities, such as proportions/percentages, and visual means of communication, which are often superior to numeric and discursive modes of presentation. The greatest advantage of MRMs is the option to condense an ordinal scale into a single number (the average), which can be easily compared across subgroups and conditions. This is beneficial when drawing more complex comparisons.

We will consider two examples to illustrate the advantages of MRMs for data interpretation and communication.

1.3.1 Example 1: Lexical choices in Maltese English

The following data were collected in Malta using the Bamberg Survey of Language Variation and Change. For a total of 68 (near-)synonymous word pairs, respondents were asked to indicate which expression they (tend to) use. Figure 1.1 shows an excerpt from the lexical part of the questionnaire.

Figure 1.1: Excerpt from the lexical part of the BSLVC.

For illustration, we will concentrate on 10 lexical pairs (the second variant being the traditionally British one):

  • truck/lorry
  • sick/ill
  • package/parcel
  • chips/crisps
  • fall/autumn
  • trunk/boot
  • fries/chips
  • reservations/bookings
  • cell phone/mobile phone
  • soccer/football

We consider a subset of the data, which includes responses from 200 individuals. The overall distribution of the ratings is given in Table 1.1.

Cells give the proportion (count) of responses in each category; category 1 indicates a preference for the AmE variant, category 5 a preference for the BrE variant.

| AmE variant | 1 | 2 | 3 | 4 | 5 | BrE variant |
|---|---|---|---|---|---|---|
| truck | .69 (134) | .12 (23) | .12 (24) | .01 (2) | .06 (12) | lorry |
| sick | .47 (92) | .16 (31) | .32 (62) | .02 (3) | .04 (8) | ill |
| package | .38 (74) | .10 (20) | .21 (40) | .05 (10) | .25 (49) | parcel |
| reservations | .18 (35) | .03 (6) | .31 (61) | .18 (35) | .31 (61) | bookings |
| chips | .24 (47) | .07 (13) | .16 (32) | .07 (13) | .46 (89) | crisps |
| trunk | .19 (36) | .05 (9) | .16 (31) | .16 (31) | .45 (87) | boot |
| fries | .11 (21) | .02 (4) | .15 (30) | .15 (29) | .57 (112) | chips |
| fall | .06 (12) | .02 (3) | .07 (13) | .10 (20) | .75 (145) | autumn |
| cell phone | .03 (5) | .01 (1) | .10 (19) | .12 (23) | .75 (146) | mobile phone |
| soccer | .03 (6) | .00 (0) | .05 (9) | .09 (17) | .84 (163) | football |
Table 1.1: Overall distribution of the ratings for the selected items.

First, we would like to compare the usage preferences (AmE vs. BrE variant) across all 10 pairs. We have the following questions in mind:

  • For which lexical pairs do we see a trend toward the British variant? Which ones show equilibrium, i.e. considerable variation?
  • Which pairs suggest a preference for the American form?

To answer these questions, we need to compare 10 distributions. The figures below show two visual comparison strategies.

Figure 1.2: Visualization of the distribution of the responses using a bar chart and a dot plot showing averages based on a numeric-conversion approach.

A horizontal stacked bar chart can be used to show the relative frequency of each category for each item. This preserves information about the distribution of response categories for each item. For instance, we can see that the rate of “no preference” was highest for sick/ill and reservations/bookings.

A dot plot can be used to condense the information by showing an average calculated from the assigned scores −2, −1, 0, +1, +2. Positive values indicate a tendency towards BrE, negative values point to AmE. These averages are a simple version of an MRM.
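These averages can be computed directly from the proportions in Table 1.1; the snippet below reproduces this calculation for two of the items:

```python
import numpy as np

scores = np.array([-2, -1, 0, 1, 2])

# Proportions taken from Table 1.1
props = {
    "truck/lorry":     [.69, .12, .12, .01, .06],
    "soccer/football": [.03, .00, .05, .09, .84],
}
for item, p in props.items():
    print(item, round(np.dot(p, scores), 2))
# truck/lorry -1.37 (tendency towards AmE)
# soccer/football 1.71 (tendency towards BrE)
```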

Arguably, the dot plot answers our question more clearly. It uses one prominent symbol per item pair, so we can quickly assess differences between the 10 pairs, and answers to the questions posed above suggest themselves with ease. Note, however, that information about the relative frequency of the response categories is lost at this level of analysis.

Next, consider a more complex comparison. Assume that our primary interest is in differences between female and male speakers. We might expect women to take on a leading role in ongoing language change, that is, the drift of Maltese English away from its BrE roots towards more globalized or AmE usage. Now we have the following questions in mind:

  • Are there gender differences for any of the item pairs?
  • Where do female respondents show the expected trend?
  • Are gender differences notable or rather minor?

Thus, for each lexical pair we would like to compare the tendency (BrE vs. AmE) for male and female respondents. The figures below illustrate the merits of information condensation. By representing the distribution of ratings for each subgroup (i.e. item-gender combination) with a single value and symbol, we can quickly identify those item pairs that conform to the expected pattern, i.e. that slope downward from left to right.

Figure 1.3: Visualization of the distribution of the responses for male and female speakers using a bar chart and a line plot.

1.3.2 Example 2: Heaps in Australian English

As a second example, consider the usage of heaps in Australian English (data provided by Romina Buttafoco). Here, our interest is in whether the usage of heaps is sensitive to the following factors: the age of the respondent, gender, register, and syntactic function. Heaps is typically considered an informal marker and our engagement with the literature may lead us to expect the following trends:

  • The prevalence of heaps is higher among younger speakers.
  • Women are leading the change in progress (from below) and show higher usage rates.
  • As an informal structure, heaps is expected to surface more strongly in less formal registers.
  • Heaps is more likely to be used in prototypical functions, i.e. as a quantifier.

Informants were asked to indicate on an ordinal scale how likely they are to use a certain expression in a given social context (reflecting register differences). We will use the subsample of 370 Australian respondents. Figure 1.4 shows an excerpt from the questionnaire; in the stimulus sentence, heaps is used as a quantifier.

Figure 1.4: Excerpt from the questionnaire used to collect reported usage rates for heaps.

An important strategy in model interpretation is the inspection of what are variously referred to as “average predictions”, “partial effects”, “marginal effects”, or “predictive margins”. In models that include multiple predictor variables, we are often interested in understanding the relative importance of the individual variables. To this end, we need to assess the unique contribution of each variable to the variation in the outcome. Average predictions allow us to identify the association between a predictor and the outcome while adjusting for the other predictors in the model. This gives us a clear impression of the link between the focal predictor and the outcome. We can summarize the information provided by this procedure in graphical form. An example is given below. Such graphs are sometimes called partial (or marginal) effect displays. They allow us to make two types of comparisons:

  • Within-predictor comparisons: Focusing on each predictor individually, we assess how the outcome varies across its levels or values. We ask the following questions: Which predictor values are associated with higher outcomes, which show lower values? How do the levels of a categorical predictor vary? Which ones are rather similar, which ones stick out? How strong, or noticeable, are the patterns that may be discerned? How much statistical uncertainty is there in the estimates?
  • Between-predictor comparisons: Average predictions also allow us to compare the relative importance of predictors. This helps us identify variables that show a relatively strong or weak association with the outcome.
Figure 1.5: Predictive margins for a quick overview of predictor importance.

Within-predictor and between-predictor comparisons are essential for understanding the association structure suggested by a model. This makes average predictions a very useful strategy for understanding complex models: For each predictor, they facilitate the assessment of its specific pattern and its relative magnitude in comparison with other input variables. Further, they offer information about the statistical uncertainty of the detected patterns in the form of error bars and bands.
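One way such average predictions can be computed for an MRM (an ordinary linear model fitted to numeric scores) is sketched below. This is a sketch only, assuming a data frame `df` with hypothetical column names `rating`, `age_group`, `gender`, `register`, and `func`:

```python
import statsmodels.formula.api as smf

# Fit an MRM: numeric rating regressed on the four predictors
model = smf.ols("rating ~ C(age_group) + C(gender) + C(register) + C(func)",
                data=df).fit()

def average_prediction(model, df, focal, level):
    """Set the focal predictor to `level` for every observation, keep all
    other predictors at their observed values, and average the predictions."""
    counterfactual = df.copy()
    counterfactual[focal] = level
    return model.predict(counterfactual).mean()

# Average predictions across the levels of one focal predictor
margins = {level: average_prediction(model, df, "gender", level)
           for level in df["gender"].unique()}
```

The same logic (predict for each level of the focal predictor, then average over the sample) carries over to ordered regression models, where the predictions are category probabilities rather than a single mean.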

For ordered regression models, we would like to be able to make the same kinds of comparisons with the same level of ease. That is, we need a way of constructing average predictions that allows us to make comparisons within and between predictors.


  1. In cases where the choice of scores is ambiguous, researchers may decide to do a sensitivity analysis. Repeating the analysis using different scoring systems then reveals the degree to which our conclusions depend on (are sensitive to) the choice of scores.↩︎