Custom scoring systems in ordinal data analysis: A tribute to Sharoff (2018)

ordinal data
This blog post discusses the issue of representing an ordinal response scale using numeric scores. It shows how psychometric research may suggest deviations from the near-universal use of equally spaced integers.
Author
Lukas Sönning

Affiliation
University of Bamberg

Published

November 30, 2025

R setup
library(tidyverse)
library(lattice)
library(uls)         # pak::pak("lsoenning/uls")

directory_path <- "C:/Users/ba4rh5/Work Folders/My Files/R projects/_lsoenning.github.io/posts/2024-03-01_sharoff_2018/"

I recently carried out a systematic review of how ordinal rating scale data are handled in linguistic research (see Sönning 2024). It covered 4,441 publications from 16 linguistic journals (published between 2012 and 2022), spanning a broad range of sub-disciplines. It turned out that the vast majority of researchers take a numeric-conversion approach: They translate the response categories into numeric scores and then analyze the data as though ratings had actually been collected on a continuous scale. Further, almost all of these studies use a linear scoring system, i.e. equally spaced integers, to analyze their data. The current blog post is devoted to Sharoff (2018), the only paper in my survey that used a custom set of scale values for the ordered response set.

Numeric-conversion approach to ordinal data

In what follows, I will use the term scoring system (see Labovitz 1967) to refer to the set of values that are used to represent the ordinal responses. Analyses based on scoring systems involve the calculation of averages or the use of ordinary (mixed-effects) regression. This practice, which is widespread in linguistics (see Sönning et al. 2024; Sönning 2024), has sparked heated methodological debates. The widely accepted belief that an interval-level analysis of ordinal data is inappropriate goes back to an influential paper by Stevens (1946), who proposed a taxonomy of scale types (nominal, ordinal, interval, and ratio) along with “permissible statistics” for each. Among the caveats of the numeric-conversion approach is the fact that distances between consecutive categories are usually unknown. In particular, when all scale points are verbalized, the perceived distance between categories will depend on how informants interpret the labels.
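To make the numeric-conversion approach concrete, here is a minimal R sketch using made-up ratings on a hypothetical four-point agreement scale (the labels and data are invented for illustration):

```r
# Made-up ratings on a hypothetical four-point agreement scale
ratings <- c("agree", "strongly agree", "disagree", "agree", "agree")

# Linear scoring system: equally spaced integers
linear_scores <- c(
  "strongly disagree" = 0,
  "disagree"          = 1,
  "agree"             = 2,
  "strongly agree"    = 3
)

# Translate response labels into scores, then treat the result
# as interval-level data (e.g. compute the mean)
mean(linear_scores[ratings])  # → 2
```

The implicit assumption here is that the psychological distance between any two adjacent categories is the same, which is exactly what psychometric evidence may call into question.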

Psychometric research

Experimental research has produced insights into the perception of quantificational expressions that are frequently used to build graded scales. Psycholinguistic studies on intensifiers, for instance, have shown that English native speakers perceive similar increments in intensity between hardly-slightly and considerably-highly (Rohrmann 2007). Such insights can inform both the design and the analysis stage of a study. Earlier methodological work has mainly focused on scale construction, i.e. the selection of approximately equal-interval sequences (e.g. Friedman and Amoo 1999; Rohrmann 2007; Beckstead 2014). As discussed in Sönning (2024), psychometric scale values can also suggest more appropriate scoring systems for data analysis. I find it surprising that only a few methodological studies acknowledge this possibility (Labovitz 1967, 155; Worcester and Burns 1975, 191). In applied work, researchers almost universally assign equally spaced integers to the categories. This is also true for linguistics – in my systematic review, custom scale values are virtually never found. The only exception is the paper by Sharoff (2018), which inspired the present blog post.

Sharoff (2018)

The paper by Sharoff (2018), which appeared in the journal Corpora, presents an approach to classifying texts into genres. To this end, a text-external framework relying on Functional Text Dimensions is used. Examples of such dimensions are ‘informative reporting’ or ‘argumentation’. Raters were asked to indicate the extent to which a text represents a certain functional category. To answer questions such as “To what extent does the text appear to be an informative report of events recent at the time of writing (for example, a news item)?”, informants were provided with four response options:

  • “strongly or very much so”
  • “somewhat or partly”
  • “slightly”
  • “none or hardly at all”

As Sharoff (2018, 72) explains, the response scale was purposefully constructed to exhibit a notable gap between “strongly or very much so” and “somewhat or partly”. The following custom scoring system was then used to analyze the data:

  • 2.0 “strongly or very much so”
  • 1.0 “somewhat or partly”
  • 0.5 “slightly”
  • 0 “none or hardly at all”

The step from “somewhat or partly” to “strongly or very much so” was made twice as large as the other steps, in line with the deliberate scale design. Seeing that the scale is essentially composed of intensifying adverbs (e.g. strongly, somewhat, slightly), let us compare this scoring system to experimental findings on the perception of these expressions.
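In R, the difference between Sharoff's custom scoring system and the default linear alternative can be sketched as follows (the ratings vector is made up for illustration):

```r
# Sharoff's (2018) custom scoring system
custom_scores <- c(
  "none or hardly at all"    = 0,
  "slightly"                 = 0.5,
  "somewhat or partly"       = 1,
  "strongly or very much so" = 2
)

# Default linear alternative: equally spaced integers
linear_scores <- c(
  "none or hardly at all"    = 0,
  "slightly"                 = 1,
  "somewhat or partly"       = 2,
  "strongly or very much so" = 3
)

# Hypothetical ratings for a single text
ratings <- c("slightly", "somewhat or partly",
             "strongly or very much so", "slightly")

mean(custom_scores[ratings])  # custom spacing → 1
mean(linear_scores[ratings])  # linear spacing → 1.75
```

Note how the two scoring systems can lead to appreciably different summary values for the same set of responses, since the custom scores compress the lower end of the scale.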

The psychometrics of intensifiers

A number of studies have looked into how speakers interpret intensifying adverbs. Here, I collect results from three studies that used similar methods to scale the meaning of relevant adverbs (Matthews et al. 1978; Krsacok 2001; Rohrmann 2007). To measure the relative level of intensity assigned to a specific expression, subjects are typically asked to locate it on an equally-apportioned 11-point scale. I map this scale to the [0,10] interval. Figure 1 gives an overview of the findings reported in the three studies:

  • The grey dot diagrams indicate the ratings for four speaker groups (Krsacok 2001 studied two groups: male vs. female subjects).
  • The black dots denote the average across the groups, which is recorded at the left end of the graph.
  • The expressions used in Sharoff (2018) are highlighted in grey.
Figure 1: Psychometric scale values for the relevant intensity adverbs, based on the findings of three experimental studies

Comparison of Sharoff’s (2018) scoring system with experimental findings

Let us attempt to roughly pin down the psychometric scale values that may be considered good approximations to Sharoff's (2018) response categories. We start by locating the relevant averages in Figure 1:

  • 8.7 very much
  • 3.5 somewhat
  • 3.5 partly
  • 2.4 slightly
  • 1.5 hardly
  • 0.4 not

Then we average across double designations (e.g. hardly/not; somewhat/partly). This allows us to establish an empirically grounded spacing between the response options. Figure 2 compares these relative distances to the ones used by Sharoff (2018). It lends empirical support to the custom scores used in that study: Three (roughly) equally-spaced categories at the lower end of the scale, with a disproportionate gap to the highest response option. In fact, the psychometric evidence would even have licensed a more pronounced numeric gap between very much and somewhat/partly, roughly:

  • 2.0 “strongly or very much so”
  • 0.7 “somewhat or partly”
  • 0.4 “slightly”
  • 0 “none or hardly at all”
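These rescaled values can be reproduced from the averages listed above. The following R sketch averages across the double designations and then maps the result to the 0-to-2 range of Sharoff's scoring system:

```r
# Psychometric averages read off Figure 1 (on the 0-10 scale)
psy <- c(very_much = 8.7, somewhat = 3.5, partly = 3.5,
         slightly = 2.4, hardly = 1.5, not = 0.4)

# Average across double designations (somewhat/partly; hardly/not)
category_values <- c(
  "strongly or very much so" = psy[["very_much"]],
  "somewhat or partly"       = mean(psy[c("somewhat", "partly")]),
  "slightly"                 = psy[["slightly"]],
  "none or hardly at all"    = mean(psy[c("hardly", "not")])
)

# Rescale so that the lowest category maps to 0 and the highest to 2
rescaled <- (category_values - min(category_values)) /
  diff(range(category_values)) * 2

round(rescaled, 1)
# → 2.0 (strongly or very much so), 0.7 (somewhat or partly),
#   0.4 (slightly), 0.0 (none or hardly at all)
```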

More importantly, however, it is clear that it was appropriate for Sharoff (2018) to use a custom scoring system – the default linear set (e.g. 0, 1, 2, 3) would have misrepresented the way speakers interpret the response labels.

Figure 2: Comparison of Sharoff's (2018) scoring system to the psychometric spacing of relevant intensity adverbs

Conclusion

We have seen how experimental findings may inform the construction of custom scoring systems for the analysis of ordinal rating scale data. The fact that researchers almost exclusively rely on equally spaced integers may be considered unsatisfactory. Following the good example of Sharoff (2018), more frequent use should be made of custom scoring systems. This methodological topic is discussed in much more detail in Sönning (2024), where further psychometric insights are summarized and the inherent limitations of the numeric-conversion approach to ordinal data are given due consideration.

References

Beckstead, Jason W. 2014. “On Measurements and Their Quality. Paper 4: Verbal Anchors and the Number of Response Options in Rating Scales.” International Journal of Nursing Studies 51 (5): 807–14. https://doi.org/10.1016/j.ijnurstu.2013.09.004.
Friedman, Hershey H., and Taiwo Amoo. 1999. “Rating the Rating Scales.” The Journal of Marketing Management 9 (3): 114–23.
Krsacok, Stephen J. 2001. “Quantification of Adverb Intensifiers for Use in Ratings of Acceptability, Adequacy, and Relative Goodness.” PhD thesis, University of Dayton.
Labovitz, Sanford. 1967. “Some Observations on Measurement and Statistics.” Social Forces 46 (2): 151–60. https://doi.org/10.2307/2574595.
Matthews, Josephine L., Calvin E. Wright, Kenneth L. Yudowitch, James C. Geddie, and R. L. Palmer. 1978. “The Perceived Favorableness of Selected Scale Anchors and Response Alternatives.” Technical Report 319. U. S. Army Research Institute for the Behavioral and Social Sciences.
Rohrmann, Bernd. 2007. “Verbal Qualifiers for Rating Scales: Sociolinguistic Considerations and Psychometric Data.” Technical report. University of Melbourne.
Sharoff, Serge. 2018. “Functional Text Dimensions for the Annotation of Web Corpora.” Corpora 13 (1): 65–95. https://doi.org/10.3366/cor.2018.0136.
Sönning, Lukas. 2024. “Ordinal Rating Scales: Psychometric Grounding for Design and Analysis.” OSF Preprints. https://doi.org/10.31219/osf.io/jhv6b.
Sönning, Lukas, Manfred Krug, Fabian Vetter, Timo Schmid, Anne Leucht, and Paul Messer. 2024. “Latent-Variable Modeling of Ordinal Outcomes in Language Data Analysis.” OSF Preprints. https://doi.org/10.31219/osf.io/jhv6b.
Stevens, S. S. 1946. “On the Theory of Scales of Measurement.” Science 103 (2684): 677–80. https://doi.org/10.1126/science.103.2684.677.
Worcester, Robert M., and Timothy R. Burns. 1975. “A Statistical Examination of the Relative Precision of Verbal Scales.” Journal of the Market Research Society 17 (3): 181–97.

Citation

BibTeX citation:
@online{sönning2025,
  author = {Sönning, Lukas},
  title = {Custom Scoring Systems in Ordinal Data Analysis: {A} Tribute
    to {Sharoff} (2018)},
  date = {2025-11-30},
  url = {https://lsoenning.github.io/posts/2024-03-01_sharoff_2018/},
  langid = {en}
}
For attribution, please cite this work as:
Sönning, Lukas. 2025. “Custom Scoring Systems in Ordinal Data Analysis: A Tribute to Sharoff (2018).” November 30, 2025. https://lsoenning.github.io/posts/2024-03-01_sharoff_2018/.