R setup

```r
library(here)
library(tidyverse)
library(lattice)
library(tictoc)
library(knitr)
library(kableExtra)
library(uls) # pak::pak("lsoenning/uls")
```

December 11, 2023
The dispersion measure DA was proposed by Burch, Egbert, and Biber (2017) as a way of quantifying how evenly an item is distributed across corpus parts. The authors attribute this measure to Wilcox (1973), a nice and very readable paper that compares different indices of qualitative variation, i.e. measures of variability for nominal-scale variables. While Wilcox (1973) focuses on categorical variables (with 10 or fewer levels), the measures discussed in that paper are also relevant for quantifying what lexicographers and corpus linguists refer to as “dispersion”. Interestingly, as Burch, Egbert, and Biber (2017, 193) note, a measure equivalent to DP (Gries 2008) can be found in the 1973 paper (the average deviation analog ADA). The index on which DA is based appears in Wilcox (1973) as the mean difference analog (MDA). Both Wilcox (1973) and Burch, Egbert, and Biber (2017) argue that DA (or MDA) has a number of advantages over DP (or ADA). An intuitive explanation of the rationale underlying DA can be found in Sönning (2023).
Gries (2020, 116) has pointed out, however, that DA is computationally expensive. This is because the measure relies on pairwise differences between corpus parts. To calculate DA, we first obtain the occurrence rate (or normalized frequency) of a given item in each corpus part. These occurrence rates can then be compared, to see how evenly the item is distributed. The basic formula for DA requires pairwise comparisons between all corpus parts. If there are 10 corpus parts, the number of pairwise comparisons is 45; for 20 corpus parts, this number climbs to 190. In general, if there are n corpus parts, the number of pairwise comparisons is \((n(n-1))/2\). This number (and hence the computational task) grows quadratically: If we measure dispersion across 500 texts (e.g. ICE or Brown Corpus), 124,750 comparisons are involved. For the BNC2014, which has around 90,000 text files, there are more than 4 billion comparisons to compute.
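These pair counts are easy to verify with a one-liner (the helper `n_pairs()` is mine, not part of the post):

```r
# Number of pairwise comparisons among n corpus parts: n(n-1)/2
n_pairs <- function(n) n * (n - 1) / 2

n_pairs(10)     # 45
n_pairs(20)     # 190
n_pairs(500)    # 124750
n_pairs(90000)  # 4049955000, i.e. more than 4 billion
```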
The purpose of this blog post is to draw attention to a shortcut formula Wilcox (1973) gives in the Appendix of his paper. There, he distinguishes between “basic formulas” and “computational formulas”, the latter of which run faster. The formula we will use here is the one listed in the rightmost column (Computational Formulas: Proportions). We will give R code for both the basic and the computational procedure and then compare them in terms of speed. Note that both versions of DA are now available in the function disp_DA() in the {tlda} package (Sönning 2025).
For the purpose of the following evaluation, we start by writing two R functions:
- `DA_basic()`, which uses the basic, slow formula; and
- `DA_quick()`, which implements the shortcut given in Wilcox (1973).

These functions also work if corpus parts differ in length. They take two arguments:
- `subfreq`: A vector of length n, giving the subfrequencies, i.e. the number of occurrences of the item in each of the n corpus parts
- `partsize`: A vector of length n, giving the length of each corpus part (number of running words)

For the rationale underlying the intermediate quantities R_i and r_i, please refer to Sönning (2023). We first define the basic formula:
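The function definition itself is folded away in the original post, so what follows is a minimal sketch of what a pairwise implementation could look like, assuming the argument names given above; the author's actual code may differ in its details.

```r
# Sketch of the basic (pairwise) procedure for DA -- not the author's
# original code. R_i are the occurrence rates per corpus part; r_i are
# the rates rescaled to proportions that sum to 1.
DA_basic <- function(subfreq, partsize){
  n   <- length(subfreq)
  R_i <- subfreq / partsize
  r_i <- R_i / sum(R_i)
  # Sum of absolute differences over all n(n-1)/2 pairs of corpus parts
  pairwise_sum <- sum(dist(r_i, method = "manhattan"))
  # 1 = perfectly even; 0 = maximally uneven (all tokens in one part)
  1 - pairwise_sum / (n - 1)
}
```

The division by (n - 1) normalizes the score: with proportions summing to 1, the pairwise sum reaches its maximum of n - 1 exactly when all occurrences fall into a single corpus part.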
And now the computational formula:
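This chunk is likewise collapsed in the post. The sketch below implements the shortcut idea: once the proportions are sorted, the sum of all pairwise absolute differences reduces to a single weighted sum, so no explicit pairwise comparisons are needed. The names and exact normalization are my assumptions and may deviate from Wilcox's printed formula.

```r
# Sketch of the computational (shortcut) procedure for DA -- not the
# author's original code. For proportions sorted in ascending order,
#   sum over all pairs of |r_i - r_j|  =  sum_i (2*i - n - 1) * r_i,
# which replaces O(n^2) comparisons with a sort and one weighted sum.
DA_quick <- function(subfreq, partsize){
  n   <- length(subfreq)
  R_i <- subfreq / partsize
  r_i <- sort(R_i / sum(R_i))   # proportions, ascending order
  pairwise_sum <- sum((2 * seq_len(n) - n - 1) * r_i)
  1 - pairwise_sum / (n - 1)
}
```

As written, this shortcut is algebraically identical to the basic pairwise formula; the small numerical discrepancies reported in Table 1 suggest that the two implementations compared in the post differ slightly in such details.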
Let’s now compare them in two settings: 4,000 texts (about 8 million pairwise comparisons) and 20,000 texts (about 200 million comparisons). We will go directly to the results; the commented script behind the comparison is shown below.
```r
# We start by creating synthetic data. We use the Poisson distribution to
# generate token counts for the smaller corpus (subfreq_4000) and the
# larger corpus (subfreq_20000)
set.seed(1)
subfreq_4000  <- rpois(n = 4000,  lambda = 2)
subfreq_20000 <- rpois(n = 20000, lambda = 2)

# Then we create corresponding vectors giving the length of the texts (each is
# 2,000 words long):
partsize_4000  <- rep(2000, length(subfreq_4000))
partsize_20000 <- rep(2000, length(subfreq_20000))

# Next, we use the R package {tictoc} to compare the two functions (i.e.
# computational procedures) in terms of speed, starting with the 4,000-text
# setting. We start with the basic formula:
tic()
DA_basic_4000 <- DA_basic(
  subfreq  = subfreq_4000,
  partsize = partsize_4000)
time_basic_4000 <- toc()

# And now we use the computational formula:
tic()
DA_quick_4000 <- DA_quick(
  subfreq  = subfreq_4000,
  partsize = partsize_4000)
time_quick_4000 <- toc()

# Next, we compare the 20,000-text setting:
tic()
DA_basic_20000 <- DA_basic(
  subfreq  = subfreq_20000,
  partsize = partsize_20000)
time_basic_20000 <- toc()

tic()
DA_quick_20000 <- DA_quick(
  subfreq  = subfreq_20000,
  partsize = partsize_20000)
time_quick_20000 <- toc()
```

Table 1 shows the results; let us first consider computation time. For 4,000 texts, the basic procedure takes 0.75 seconds to run. The computational formula is quicker – it completes the calculations so fast that {tictoc} registers 0 seconds. For the 20,000-text corpus, the difference is much more dramatic: The basic formula takes 18.29 seconds to run; the shortcut procedure, on the other hand, again registers 0 seconds. This is an impressive improvement in efficiency.
```r
tibble(
  Formula = c("Basic", "Computational"),
  `4,000 parts`   = c(time_basic_4000$toc - time_basic_4000$tic,
                      time_quick_4000$toc - time_quick_4000$tic),
  `20,000 parts`  = c(time_basic_20000$toc - time_basic_20000$tic,
                      time_quick_20000$toc - time_quick_20000$tic),
  `4,000 parts `  = round(c(DA_basic_4000,  DA_quick_4000),  4),
  `20,000 parts ` = round(c(DA_basic_20000, DA_quick_20000), 4)) |>
  kbl() |>
  add_header_above(c(" " = 1, "Time (seconds)" = 2, "Dispersion score" = 2))
```
Table 1: Computation time and dispersion scores for the basic and the computational formula.

| Formula | Time (s): 4,000 parts | Time (s): 20,000 parts | Score: 4,000 parts | Score: 20,000 parts |
|---|---|---|---|---|
| Basic | 0.75 | 18.29 | 0.6003 | 0.6139 |
| Computational | 0.00 | 0.00 | 0.6005 | 0.6140 |
Table 1 also shows the dispersion scores that the functions return. We note that the two procedures do not yield identical results. The approximation offered by the computational shortcut is nevertheless very close, especially since dispersion measures are usually (and quite sensibly) reported to two decimal places only.
@online{sönning2023,
author = {Sönning, Lukas},
  title = {A Computational Shortcut for the Dispersion Measure {$D_A$}},
date = {2023-12-11},
url = {https://lsoenning.github.io/posts/2023-12-11_computation_DA/},
langid = {en}
}