# R setup
library(tlda)
library(lattice)
library(knitr)
library(kableExtra)
library(gtools)
library(tidyverse)
source("C:/Users/ba4rh5/Work Folders/My Files/R projects/my_utils_website.R")
October 27, 2025
In his dissertation, Lyne (1985) presents a corpus-based analysis of The vocabulary of French business correspondence. The book includes a chapter that examines the behavior of three dispersion measures: D, D2, and S. The question of main interest is whether Carroll’s D2 (Carroll 1970) and Rosengren’s S (Rosengren 1971) can really be considered an improvement over Juilland’s D, as claimed by these authors. To shed light on the way these measures respond to different subfrequency patterns, Lyne (1985) considers an item that occurs 10 times in a corpus that is divided into five parts. He then looks at all possible ways in which this item could be distributed over the five parts.
In R, we can use the function combinations() in the package {gtools} (Warnes et al. 2023) to enumerate the ways in which 10 occurrences can be distributed across 5 corpus parts. If we disregard which part receives which count, there are 30 distinct patterns, which are listed here from the most uneven distribution (dispersion pessimum) to the most even distribution (dispersion optimum).
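# All ways of assigning 10 occurrences to 5 corpus parts (order ignored)
comb <- gtools::combinations(
    n = 5, 
    r = 10, 
    v = 1:5, 
    repeats.allowed = TRUE)
# One row per assignment, holding the 5 subfrequencies in decreasing order
comb_tbl <- matrix(
    NA, 
    nrow = nrow(comb), 
    ncol = 5)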
for(i in 1:nrow(comb)){
    comb_tbl[i,] <- sort(table(factor(comb[i,], levels = 1:5)), decreasing = TRUE)
}
combs <- str_split(
    unique(paste(
        comb_tbl[,1],
        comb_tbl[,2],
        comb_tbl[,3],
        comb_tbl[,4],
        comb_tbl[,5],
        sep = "-")),
    "-", simplify = TRUE)
combs <- data.frame(combs)
for(i in 1:5){
    combs[,i] <- as.numeric(combs[,i])
}
combs$n_zeros <- rowSums(combs[,1:5] == 0)
# print all combinations
distr_matrix <- (t(combs[,-6]))
write.table(
  format(
    distr_matrix, 
    justify="right"),
  row.names=FALSE, col.names=FALSE, quote=FALSE)
10  9  8  8  7  7  7  6  6  6  6  6  5  5  5  5  5  5  4  4  4  4  4  4  4  3  3  3  3  2
 0  1  2  1  3  2  1  4  3  2  2  1  5  4  3  3  2  2  4  4  3  3  3  2  2  3  3  3  2  2
 0  0  0  1  0  1  1  0  1  2  1  1  0  1  2  1  2  1  2  1  3  2  1  2  2  3  2  2  2  2
 0  0  0  0  0  0  1  0  0  0  1  1  0  0  0  1  1  1  0  1  0  1  1  2  1  1  2  1  2  2
 0  0  0  0  0  0  0  0  0  0  0  1  0  0  0  0  0  1  0  0  0  0  1  0  1  0  0  1  1  2
To reproduce Lyne's (1985) graphical analyses of the three indices, we calculate dispersion scores for these 30 distributions. We will make use of the function disp() in the R package {tlda} (Soenning 2025). Further below, we will extend Lyne's (1985) approach to a total of six dispersion measures:
results <- matrix(
  NA, 
  nrow = nrow(combs),
  ncol = 6
)
for(i in 1:nrow(combs)){
  results[i,] <- disp(
    subfreq = as.numeric(combs[i, 1:5]), 
    partsize = rep(1000, 5),
    verbose = FALSE,
    print_score = FALSE)[-1]
}
colnames(results) <- c("D", "D2", "S", "DP", "DA", "DKL")
combs <- cbind(combs, results)
combs$D_x_location <- as.numeric(factor(rank(-combs$D, ties.method = "min")))
combs$D2_x_location <- as.numeric(factor(rank(-combs$D2, ties.method = "min")))
combs$S_x_location <- as.numeric(factor(rank(-combs$S, ties.method = "min")))
combs$DP_x_location <- as.numeric(factor(rank(-combs$DP, ties.method = "min")))
combs$DA_x_location <- as.numeric(factor(rank(-combs$DA, ties.method = "min")))
combs$DKL_x_location <- as.numeric(factor(rank(-combs$DKL, ties.method = "min")))
combs$disp_avg <- rowMeans(combs[,7:12])
combs$disp_avg_x_location <- as.numeric(factor(rank(-combs$disp_avg, ties.method = "min")))
Lyne (1985) uses an elegant graphical technique to evaluate the behavior of D, D2 and S. In his study, Juilland’s D serves as a point of reference, so he starts by ordering the 30 possible combinations according to D. He then plots this sequence of D scores in a scatterplot. In Figure 1 it appears as a curved grey reference line, which runs from the top left corner (dispersion optimum: 2 2 2 2 2) to the bottom right corner (dispersion pessimum: 10 0 0 0 0). In a few cases, two combinations produce the same D score (e.g. 4 2 2 1 1 and 3 3 2 2 0), so there are only 22 distinct D scores for the total set of 30 possible combinations.
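The tie mentioned above can be checked by hand. The following is a minimal sketch that assumes equal-sized corpus parts and the textbook definition of Juilland's D: one minus the coefficient of variation (based on the population standard deviation), divided by the square root of the number of parts minus one. The helper name `juilland_d()` is hypothetical and not part of {tlda}; scores returned by `disp()` may differ in rounding.

```r
# Hypothetical helper (not part of {tlda}): Juilland's D for equal-sized parts
juilland_d <- function(subfreq) {
  n_parts <- length(subfreq)
  # population standard deviation (divide by n, not n - 1)
  pop_sd <- sqrt(mean((subfreq - mean(subfreq))^2))
  cv <- pop_sd / mean(subfreq)
  1 - cv / sqrt(n_parts - 1)
}

pattern_1 <- c(4, 2, 2, 1, 1)
pattern_2 <- c(3, 3, 2, 2, 0)

# Both patterns have the same sum of squared subfrequencies (26),
# so they receive the same D score
round(juilland_d(pattern_1), 2)  # 0.73
round(juilland_d(pattern_2), 2)  # 0.73
```

Because D depends on the subfrequencies only through their mean and their sum of squares, any two patterns that agree on both are indistinguishable for D, regardless of how many zeros they contain.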
In a second step, Lyne (1985) adds the D2 scores for all 30 distributions to the scatterplot. He groups them visually according to the number of zeros they contain. The D2 scores therefore form five traces, which are connected using lines. The top-most trace in Figure 1, for instance, connects D2 scores for those patterns that do not contain any zeros. This graphical arrangement allows Lyne (1985, 107) to make the following observations when comparing D2 against D:
combs <- combs[order(combs$D),]
zeros_labels <- c("no zeros",
                  "one zero",
                  "two zeros",
                  "three zeros",
                  "four zeros")
p1 <- xyplot(1~1, xlim = c(0, length(unique(combs$D))), type = "n", ylim = c(0,1),
       par.settings = my_settings, axis = axis_L,
       scales = list(x = list(draw = FALSE),
                     y = list(at = c(0, .25, .5, .75, 1), label = c("(uneven) 0", ".25", ".50", ".75", "(even) 1"))),
       ylab = "\n\nDispersion", xlab = list(label = "\n\n\nDistribution\n(ranked from most to least even according to D)", lineheight = .85, cex = .9),
       panel = function(x,y){
         panel.segments(x0 = 9, x1 = 9, y0 = 0, y1 = .75, col = "grey", lty = "13")
        panel.points(x = combs$D_x_location, y = combs$D, type = "l", col = "grey")
        panel.points(x = combs$D_x_location, y = combs$D, pch = 21, col = "grey", fill = "white")
        panel.points(x = combs$D_x_location, y = combs$D2, cex = .8)
        panel.text(x = 3, y = .6, label = "D", col = "grey40")
        panel.text(x = 10, y = .95, label = expression(D[2]))
        
        for ( i in 0:4){
            data_subset <- subset(combs, n_zeros == i)
            panel.points(x = data_subset$D_x_location, y = data_subset$D2, type = "l")
            panel.text(x = max(data_subset$D_x_location + 1), 
                       y = data_subset$D2[which.max(data_subset$D_x_location)]+.02,
                       label = zeros_labels[i+1],
                       adj = 0, cex = .8)
            }
        })
print(p1, position = c(-.04,.08,.8,1)) 
The fact that D2 penalizes zeros is evident from the layered structure of the traces: where two distributions receive the same value of D, D2 assigns a higher score to the pattern with fewer zeros. The dotted vertical line in Figure 1 marks one such pair of distributions: A 5 3 1 1 0 and B 4 4 2 0 0. Both have a D score of .55, but A (one zero) has a D2 score of .73 compared with .66 for B (two zeros).
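The D2 scores for this pair can be recomputed from the textbook definition of Carroll's D2, i.e. the entropy of the proportional subfrequencies divided by the maximum possible entropy. This is only a sketch under the assumption of equal-sized corpus parts; `carroll_d2()` is a hypothetical helper name and not a function in {tlda}.

```r
# Hypothetical helper (not part of {tlda}): Carroll's D2 for equal-sized parts
carroll_d2 <- function(subfreq) {
  p <- subfreq / sum(subfreq)
  p_nonzero <- p[p > 0]  # terms with p = 0 contribute nothing to the entropy
  entropy <- -sum(p_nonzero * log2(p_nonzero))
  entropy / log2(length(subfreq))
}

pattern_A <- c(5, 3, 1, 1, 0)  # one zero
pattern_B <- c(4, 4, 2, 0, 0)  # two zeros

round(carroll_d2(pattern_A), 2)  # 0.73
round(carroll_d2(pattern_B), 2)  # 0.66
```

An empty part contributes nothing to the entropy, while the normalizing constant still counts all five parts. Patterns that concentrate the occurrences in fewer parts therefore tend to score lower on D2 than patterns with the same D that spread them over more parts.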
This leads Lyne (1985, 107) to consider the question of whether it is appropriate for a dispersion measure to be overly sensitive to the presence of zeros. He compares two sets of subfrequencies,
A   8  3  3  3  3
B   0  5  5  5  5
for which D yields the same score (.75) while D2 produces different scores (A .95, B .86). Since, in his eyes, the two distributions are “mirror images of each other”, he argues that D treats zeros impartially, while D2 penalizes them.
Lyne (1985, 110) makes the same graphical comparison for S (vs. D) and observes similar patterns. Figure 2 reproduces his Figure 4, which shows that “S penalizes zeros to an even greater extent than D2”.
p1 <- xyplot(1~1, xlim = c(0, length(unique(combs$D))), type = "n", ylim = c(0,1),
       par.settings = my_settings, axis = axis_L,
       scales = list(x = list(draw = FALSE),
                     y = list(at = c(0, .25, .5, .75, 1), label = c("(uneven) 0", ".25", ".50", ".75", "(even) 1"))),
       ylab = "\n\nDispersion", xlab = list(label = "\n\n\nDistribution\n(ranked from most to least even according to D)", lineheight = .85, cex = .9),
       panel = function(x,y){
        panel.points(x = combs$D_x_location, y = combs$D, type = "l", col = "grey")
        panel.points(x = combs$D_x_location, y = combs$D, pch = 21, col = "grey", fill = "white")
        panel.points(x = combs$D_x_location, y = combs$S, cex = .8)
        panel.text(x = 3, y = .6, label = "D", col = "grey40")
        panel.text(x = 10, y = .95, label = "S")
        
        for ( i in 0:4){
            data_subset <- subset(combs, n_zeros == i)
            panel.points(x = data_subset$D_x_location, y = data_subset$S, type = "l")
            panel.text(x = max(data_subset$D_x_location + 1), 
                       y = data_subset$S[which.max(data_subset$D_x_location)]+.02,
                       label = zeros_labels[i+1],
                       adj = 0, cex = .8)
            }
        })
print(p1, position = c(-.04,.08,.8,1)) 
In the course of this evaluation, Lyne (1985, 107) raises an important question:
[I]s it desirable that a dispersion measure for word frequency counts should penalize zeros in this way? This question is of a different order […], since it is about the real world rather than the world of statistical models.
He later returns to this question, when discussing an example given by Rosengren (1971, 115), for which, she argues, it is inappropriate for a dispersion measure to give the same score:
A   3  2  1  1  1
B   2  2  2  2  0
As Lyne (1985, 115) rightly notes, it is essential to state in the first place why B should receive a lower score than A. To him, the reason why D2 and S give lower scores to B is the fact that they penalize zeros. And here, Lyne (1985, 115) clearly states his point of view:
But surely it is not the business of a dispersion measure to discriminate against particular distributions in this way. In the above example, […] it seems to us quite proper that B should be rated as highly as A, because the presence of a single low sub-frequency, 0, in B is balanced by the perfectly even distribution across the remaining four sections.
Even though I will not pursue this point further here, I am not sure whether we are able to answer this question so confidently. Rather, the answer would seem to depend primarily on the linguistic purpose underlying a dispersion analysis. It is easy to imagine settings where the question of whether an item is used at all (i.e. the difference between 0 and 1 occurrence) is more important than the question of whether the item occurs repeatedly (i.e. the difference between 1 and 2 occurrences). In such settings, the feature of main interest may be pervasiveness rather than evenness of distribution, and penalization of zeros may be an attractive feature of a dispersion measure. Further, it seems that the answer would also depend on the kinds of linguistic units that are represented by the corpus parts (e.g. texts, genres, or arbitrary corpus chunks of the same length). The smaller the corpus parts, the more zeros there will be in the resulting set of subfrequencies, which could make penalization, if it is indeed considered an issue, even more of a concern.
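To make the scores behind Rosengren's example concrete, the sketch below computes S for the two patterns from its textbook definition: the squared sum of the square roots of the size-weighted subfrequencies, divided by the total frequency. It assumes equal-sized corpus parts by default; `rosengren_s()` is a hypothetical helper name, not a {tlda} function, and the package may apply additional scaling.

```r
# Hypothetical helper (not part of {tlda}): Rosengren's S
rosengren_s <- function(subfreq, partsize = rep(1, length(subfreq))) {
  w <- partsize / sum(partsize)  # relative size of each corpus part
  sum(sqrt(w * subfreq))^2 / sum(subfreq)
}

pattern_A <- c(3, 2, 1, 1, 1)  # no zeros
pattern_B <- c(2, 2, 2, 2, 0)  # one zero

round(rosengren_s(pattern_A), 2)  # 0.94
round(rosengren_s(pattern_B), 2)  # 0.8
```

Under this definition, the square root of a subfrequency of 0 is 0, so an empty part adds nothing to the sum, whereas a single occurrence already contributes a full unit before weighting. This is one way of seeing why the step from 1 to 0 weighs more heavily for S than the step from 2 to 1.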
Seeing that Lyne's (1985) technique provides insights into the behavior of dispersion measures, I will now apply it to the other parts-based measures listed above. I will modify the strategy in one important regard: Instead of using D as a baseline, I will rely on a consensus opinion, ordering the 30 combinations according to the average score they receive across the six dispersion measures. For illustration, Figure 3 compares S against the consensus profile, which is again shown in grey. I also introduce color coding to signal the number of zeros in a pattern, ranging from dark blue (no zeros) to dark red (four zeros).
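The consensus ordering can be illustrated with a toy example. The scores below are invented purely for illustration (they are not dispersion scores from this analysis), and the column names `measure_1` and `measure_2` are hypothetical. The sketch averages the scores per pattern and then converts the averages into consecutive plotting positions, so that tied patterns share a position.

```r
# Toy scores for four patterns on two measures (invented for illustration)
toy <- data.frame(
  measure_1 = c(0.9, 0.6, 0.6, 0.1),
  measure_2 = c(0.8, 0.7, 0.7, 0.3)
)

# Consensus score: average across measures
toy$avg <- rowMeans(toy[, c("measure_1", "measure_2")])

# Dense ranks: highest average gets position 1, ties share a position,
# and positions are consecutive (1, 2, 3, ...) without gaps
toy$x_location <- as.numeric(factor(rank(-toy$avg, ties.method = "min")))

toy$x_location  # 1 2 2 3
```

Wrapping `rank()` in `factor()` is what removes the gaps that `ties.method = "min"` would otherwise leave after a tie (here, position 4 becomes position 3).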
combs <- combs[order(combs$disp_avg),]
my_cols <- rev(scales::pal_brewer(palette = "RdBu")(8)[c(1,2,3,6,7)])
p1 <- xyplot(1~1, xlim = c(0, length(unique(combs$disp_avg_x_location))), type = "n", ylim = c(0,1),
       par.settings = my_settings, axis = axis_L,
       scales = list(x = list(draw = FALSE),
                     y = list(at = c(0, .25, .5, .75, 1), label = c("(uneven) 0", ".25", ".50", ".75", "(even) 1"))),
       ylab = "\n\nDispersion", xlab = list(label = "\n\n\n\nDistribution\n(ranked from most to least even based\non the average across all measures)", lineheight = .85, cex = .9),
       panel = function(x,y){
        panel.points(x = combs$disp_avg_x_location, y = combs$disp_avg, type = "l", col = "grey")
        panel.points(x = combs$disp_avg_x_location, y = combs$disp_avg, pch = 21, col = "grey", fill = "white")
        #panel.points(x = combs$disp_avg_x_location, y = combs$D2, cex = .8)
        panel.text(x = 8, y = .5, label = "All six\ndispersion\nmeasures", col = "grey40", lineheight = .85, cex = .8)
        for ( i in 0:4){
            data_subset <- subset(combs, n_zeros == i)
            panel.points(x = data_subset$disp_avg_x_location, y = data_subset$S, type = "l", col = my_cols[i+1])
            panel.points(x = data_subset$disp_avg_x_location, y = data_subset$S, col = my_cols[i+1], cex = .8, pch = 16)
            panel.text(x = max(data_subset$disp_avg_x_location + 1), 
                       y = data_subset$S[which.max(data_subset$disp_avg_x_location)]+.02,
                       label = zeros_labels[i+1],
                       adj = 0, cex = .8, col = my_cols[i+1])
            }
        panel.text(x = 15.5, y = 1.1, label = expression(S))
        })
print(p1, position = c(-.04,.15,.8,.925)) 
Figure 4 compares each of the six dispersion measures against the consensus benchmark. A number of instructive insights emerge:
combs <- combs[order(combs$disp_avg),]
# Helper: one panel comparing a single measure (colored traces, grouped by
# number of zeros) against the consensus profile (grey)
plot_measure <- function(measure, label){
  xyplot(1~1, xlim = c(0, length(unique(combs$disp_avg_x_location))), type = "n", ylim = c(0,1),
       par.settings = my_settings, axis = axis_L,
       scales = list(x = list(draw = FALSE),
                     y = list(at = c(0, 1))),
       ylab = " ", xlab = "",
       panel = function(x,y){
        panel.points(x = combs$disp_avg_x_location, y = combs$disp_avg, type = "l", col = "grey")
        panel.points(x = combs$disp_avg_x_location, y = combs$disp_avg, pch = 21, col = "grey", fill = "white")
        for ( i in 0:4){
            data_subset <- subset(combs, n_zeros == i)
            panel.points(x = data_subset$disp_avg_x_location, y = data_subset[[measure]], type = "l", col = my_cols[i+1])
            panel.points(x = data_subset$disp_avg_x_location, y = data_subset[[measure]], cex = .8, pch = 16, col = my_cols[i+1])
            }
        panel.text(x = 15.5, y = 1.1, label = label)
        })
}
# One panel per dispersion measure
p_D   <- plot_measure("D",   expression(D))
p_D2  <- plot_measure("D2",  expression(D[2]))
p_S   <- plot_measure("S",   expression(S))
p_DP  <- plot_measure("DP",  expression(D[P]))
p_DA  <- plot_measure("DA",  expression(D[A]))
p_DKL <- plot_measure("DKL", expression(D[KL]))
cowplot::plot_grid(
  NULL, NULL, NULL,
  p_D, p_D2, p_S, 
  NULL, NULL, NULL,
  p_DP, p_DA, p_DKL, nrow = 4, 
  rel_heights = c(1,8,1,8)) 
The graphical technique developed by Lyne (1985) proves to be a useful tool for studying the behavior of indices. The fact that Anthony Lyne drew all of his graphs by hand makes his methodological contribution even more impressive. In general, this examination is instructive for understanding how dispersion measures behave in settings where there are few corpus parts. Unfortunately, the technique does not scale well, meaning that it cannot be applied to study distributions across a much larger number of corpus parts. This is because the number of possible combinations grows very rapidly as we increase the number of corpus parts and/or the number of occurrences of the item. Nevertheless, Lyne's (1985) graphical technique has managed to shed more light on the performance of indices and, perhaps more importantly, it has confronted us with a deeper question, which is “beyond the world of statistical models” (p. 107) and needs to be answered from a linguistic perspective: Should dispersion measures deal with subfrequencies of 0 in a special way, i.e. should a drop from 1 to 0 depress a score more noticeably than a drop from 2 to 1?
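The scaling problem can be made concrete by counting the distinct patterns without enumerating them. The sketch below counts the ways of splitting a number of occurrences into at most a given number of parts, ignoring which part receives which count. It relies on the standard result that this equals the number of integer partitions whose parts are all no larger than the number of corpus parts. `count_patterns()` is a hypothetical helper written for this illustration.

```r
# Hypothetical helper: number of distinct subfrequency patterns for
# n_tokens occurrences spread over n_parts corpus parts (order ignored)
count_patterns <- function(n_tokens, n_parts) {
  # ways[j + 1]: number of partitions of j using the part sizes seen so far
  ways <- c(1, rep(0, n_tokens))
  for (k in seq_len(min(n_parts, n_tokens))) {
    for (j in k:n_tokens) {
      ways[j + 1] <- ways[j + 1] + ways[j - k + 1]
    }
  }
  ways[n_tokens + 1]
}

count_patterns(10, 5)  # 30, the setting studied by Lyne (1985)

# How the count develops for larger settings
sapply(c(10, 20, 50, 100), function(n) count_patterns(n, n_parts = 5))
sapply(c(5, 10, 20, 50), function(k) count_patterns(100, n_parts = k))
```

Even for moderate numbers of occurrences and parts, the number of traces that would have to be drawn and visually compared quickly becomes unmanageable.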
@online{sönning2025,
  author = {Sönning, Lukas},
  title = {Lyne’s (1985) Graphical Technique for the Evaluation of
    Dispersion Measures},
  date = {2025-10-27},
  url = {https://lsoenning.github.io/posts/2025-10-25_dispersion_lyne/},
  langid = {en}
}