A computational shortcut for the dispersion measure DA

corpus linguistics
methodology
dispersion
This short blog post draws attention to the computational shortcut given in Wilcox (1973) for calculating the dispersion measure DA.
Author
Lukas Sönning

Affiliation
University of Bamberg

Published
December 11, 2023

R setup
library(here)
library(tidyverse)
library(lattice)
library(tictoc)
library(knitr)
library(kableExtra)

source("C:/Users/ba4rh5/Work Folders/My Files/R projects/my_utils_website.R")

The dispersion measure DA was proposed by Burch, Egbert, and Biber (2017) as a way of quantifying how evenly an item is distributed across the texts (or, more generally, the units) in a corpus. The authors attribute this measure to Wilcox (1973), a nice and very readable paper that compares different indices of qualitative variation, i.e. measures of variability for nominal-scale variables. While Wilcox (1973) focuses on categorical variables (with 10 or fewer levels), the measures discussed in that paper are also relevant for quantifying what lexicographers and corpus linguists refer to as “dispersion”. Interestingly, as Burch, Egbert, and Biber (2017, 193) note, a measure equivalent to DP (Gries 2008) can be found in the 1973 paper (the average deviation analog ADA). The index on which DA is based appears in Wilcox (1973) as the mean difference analog (MDA). Both Wilcox (1973) and Burch, Egbert, and Biber (2017) argue that DA (or MDA) has a number of advantages over DP (or ADA). An intuitive explanation of the rationale underlying DA can be found in Sönning (2023).

Gries (2020, 116) has pointed out, however, that DA is computationally expensive. This is because the measure relies on pairwise differences between texts. To calculate DA, we first obtain the occurrence rate (or normalized frequency) of a given item in each text. These occurrence rates can then be compared, to see how evenly the item is distributed across texts. The basic formula for DA requires pairwise comparisons between all texts. If we have 10 texts, the number of pairwise comparisons is 45; for 20 texts, this number climbs to 190. In general, if there are n texts (or units), the number of pairwise comparisons is \(n(n-1)/2\). This number (and hence the computational task) grows quadratically: for 500 texts (e.g. ICE or Brown Corpus), 124,750 comparisons are involved; for the BNC2014, with 88,171 texts, there are almost 4 billion comparisons to compute.
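These figures are easy to verify in R: choose(n, 2) returns the number of unordered pairs that can be formed from n units (the vector n_texts below is introduced purely for this check):

n_texts <- c(10, 20, 500, 88171)
choose(n_texts, 2)   # 45, 190, 124750, and 3887018535 pairwise comparisons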

The purpose of this blog post is to draw attention to a shortcut formula Wilcox (1973) gives in the Appendix of his paper. There, he distinguishes between “basic formulas” and “computational formulas”, which run faster. The formula we will use here is the one listed in the rightmost column (Computational Formulas: Proportions). We will give R code for both the basic and the computational procedure and then compare them in terms of speed.
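Expressed in the notation of the R code below – with \(k\) the number of texts and \(r_i\) the proportion of the item's length-normalized occurrences that falls into text \(i\) (so that the \(r_i\) sum to 1) – the basic formula is

\[ D_A = 1 - \frac{\overline{|r_i - r_j|}}{2/k}, \]

where the numerator is the mean over all \(k(k-1)/2\) pairwise absolute differences, and the computational shortcut is

\[ D_A \approx \frac{2 \sum_{i=1}^{k} i \, r_{(i)} - 1}{k - 1}, \]

where \(r_{(1)} \geq r_{(2)} \geq \dots \geq r_{(k)}\) are the proportions sorted in decreasing order. As Table 1 below shows, the shortcut is a very close (though not exact) match for the basic formula.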

We start by writing two R functions: DA_basic() implements the basic formula, and DA_quick() the computational shortcut.

These functions also work if texts differ in length. They take two arguments:

n_tokens: a vector giving the number of occurrences of the item in each text
word_count: a vector giving the length of each text (in words)

For the rationale underlying the intermediate quantities R_i and r_i, please refer to Sönning (2023). We first define the basic formula:

DA_basic <- function(n_tokens, word_count){
  
    # Occurrence rate of the item in each text
    R_i <- n_tokens / word_count
    
    # Proportion of the length-normalized occurrences located
    # in each text; the r_i sum to 1
    r_i <- R_i / sum(R_i)
    
    # Number of texts
    k   <- length(n_tokens)

    # All pairwise absolute differences between the r_i
    dist_r <- as.matrix(dist(r_i))
    
    # Mean pairwise difference, scaled against its maximum (2/k),
    # which is reached when all occurrences fall into a single text
    DA <- 1 - ( mean(dist_r[lower.tri(dist_r)]) / (2/k) )

    names(DA) <- "DA"
    return(DA)
}
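Note that dist() computes the full matrix of pairwise differences, so both run time and memory use grow quadratically with the number of texts – this is precisely the bottleneck the shortcut avoids.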

And now the computational formula:

DA_quick <- function(n_tokens, word_count){
  
    # Occurrence rate of the item in each text
    R_i <- n_tokens / word_count
    
    # Proportion of the length-normalized occurrences located
    # in each text
    r_i <- R_i / sum(R_i)
    
    # Number of texts
    k   <- length(n_tokens)

    # Wilcox's computational formula: sort the proportions in
    # decreasing order, weight each by its rank, and rescale
    DA <- (2 * sum(sort(r_i, decreasing = TRUE) * (1:k)) - 1) / (k - 1)

    names(DA) <- "DA"
    return(DA)
}
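As a brief usage illustration, we can apply both functions to made-up data (the object names and values below are invented for this example): 100 texts of 2,000 words each, with token counts drawn from a Poisson distribution.

set.seed(42)
n_tokens_ex   <- rpois(n = 100, lambda = 2)   # invented token counts
word_count_ex <- rep(2000, 100)               # each text is 2,000 words long

DA_basic(n_tokens_ex, word_count_ex)
DA_quick(n_tokens_ex, word_count_ex)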

Let’s now compare them in two settings: 4,000 texts (about 8 million pairwise comparisons) and 20,000 texts (about 200 million comparisons). We will go directly to the results; to see the background code, click on the triangle below (“R code for comparison of computation time”), which unfolds the commented script.

R code for comparison of computation time
# We start by creating synthetic data. We use the Poisson distribution to 
# generate token counts for the smaller corpus (n_tokens_4000) and the 
# larger corpus (n_tokens_20000)

set.seed(1)

n_tokens_4000 <- rpois(n = 4000, lambda = 2)
n_tokens_20000 <- rpois(n = 20000, lambda = 2)

# Then we create corresponding vectors giving the length of the texts (each is 
# 2,000 words long):

word_count_4000 <- rep(2000, length(n_tokens_4000))
word_count_20000  <- rep(2000, length(n_tokens_20000))

# Next, we use the R package {tictoc} to compare the two functions (i.e. 
# computational procedures) in terms of speed, starting with the 4,000-text 
# setting. We start with the basic formula:

tic()
DA_basic_4000 <- DA_basic(n_tokens_4000, word_count_4000)
time_basic_4000 <- toc()

# And now we use the computational formula:

tic()
DA_quick_4000 <- DA_quick(n_tokens_4000, word_count_4000)
time_quick_4000 <- toc()

# Next, we compare the 20,000-text setting:

tic()
DA_basic_20000 <- DA_basic(n_tokens_20000, word_count_20000)
time_basic_20000 <- toc()

tic()
DA_quick_20000 <- DA_quick(n_tokens_20000, word_count_20000)
time_quick_20000 <- toc()

Table 1 shows the results. Let us first consider computation time: for 4,000 texts, the basic procedure takes 1.5 seconds to run. The computational formula is quicker – it finishes so fast that {tictoc} reports 0 seconds. For the 20,000-text corpus, the difference is much more dramatic: the basic formula takes 35.8 seconds to run; the shortcut procedure, on the other hand, is done after 0.02 seconds. This is an impressive improvement in efficiency.
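As an aside, a single pair of tic()/toc() calls gives only one measurement, which can fluctuate from run to run. For more stable timings, one could use a benchmarking package; the sketch below uses {bench}, which is not part of this post's setup, so treat it as an optional alternative:

# Averaged timings over repeated runs; check = FALSE because the two
# formulas return slightly different scores (see Table 1)
bench::mark(
  basic = DA_basic(n_tokens_4000, word_count_4000),
  quick = DA_quick(n_tokens_4000, word_count_4000),
  check = FALSE
)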

R code for Table 1
tibble(
  Formula = c("Basic", "Computational"),
  `4,000 texts` = c((time_basic_4000$toc - time_basic_4000$tic), 
                    (time_quick_4000$toc - time_quick_4000$tic)) ,
  `20,000 texts` = c((time_basic_20000$toc - time_basic_20000$tic), 
                     (time_quick_20000$toc - time_quick_20000$tic)),
  `4,000 texts ` = round(c(DA_basic_4000, DA_quick_4000), 4) ,
  `20,000 texts ` = round(c(DA_basic_20000, DA_quick_20000), 4)) |> 
  kbl() |> 
  add_header_above(c(" " = 1, "Time (seconds)" = 2, "Dispersion score" = 2))
Table 1: Computation time (in seconds) and dispersion scores

                     Time (seconds)            Dispersion score
Formula          4,000 texts  20,000 texts  4,000 texts  20,000 texts
Basic                    1.5         35.80       0.6003        0.6139
Computational            0.0          0.02       0.6005        0.6140

Table 1 also shows the dispersion scores that the functions return. We note that the two procedures do not yield identical results. However, the approximation offered by the computational shortcut is pretty good, especially considering the fact that dispersion measures are usually (and quite sensibly) reported to two decimal places only.
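To make this concrete: rounding the scores from the 20,000-text setting to two decimal places makes the discrepancy disappear entirely.

round(c(DA_basic_20000, DA_quick_20000), 2)
#   DA   DA 
# 0.61 0.61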

References

Burch, Brent, Jesse Egbert, and Douglas Biber. 2017. “Measuring and Interpreting Lexical Dispersion in Corpus Linguistics.” Journal of Research Design and Statistics in Linguistics and Communication Science 3 (2): 189–216. https://doi.org/10.1558/jrds.33066.
Gries, Stefan Th. 2008. “Dispersions and Adjusted Frequencies in Corpora.” International Journal of Corpus Linguistics 13 (4): 403–37. https://doi.org/10.1075/ijcl.13.4.02gri.
———. 2020. “Analyzing Dispersion.” In A Practical Handbook of Corpus Linguistics, edited by Magali Paquot and Stefan Th. Gries, 99–118. Springer. https://doi.org/10.1007/978-3-030-46216-1_5.
Sönning, Lukas. 2023. “Advancing Our Understanding of Dispersion Measures in Corpus Research.” PsyArXiv preprint. https://doi.org/10.31234/osf.io/ns4q9.
Wilcox, Allen R. 1973. “Indices of Qualitative Variation and Political Measurement.” The Western Political Quarterly 26 (2): 325–43. https://doi.org/10.2307/446831.

Citation

BibTeX citation:
@online{sönning2023,
  author = {Sönning, Lukas},
  title = {A Computational Shortcut for the Dispersion Measure {$D_A$}},
  date = {2023-12-11},
  url = {https://lsoenning.github.io/posts/2023-12-11_computation_DA/},
  langid = {en}
}
For attribution, please cite this work as:
Sönning, Lukas. 2023. “A Computational Shortcut for the Dispersion Measure DA.” December 11, 2023. https://lsoenning.github.io/posts/2023-12-11_computation_DA/.