R setup
# install development version directly from GitHub
#pak::pak("lsoenning/tlda")
#pak::pak("lsoenning/uls")
library(tlda) # for access to datasets
library(tidyverse) # for data wrangling
library(uls) # for plotting themes
Bootstrapping is a resampling method that has many different uses in data analysis. One of its key functions is to aid in the construction of confidence intervals. For helpful introductions to the use of bootstrapping in corpus linguistics, see Egbert and Plonsky (2020) and Gries (2022).
In dispersion analysis, bootstrapping can be used to assess the statistical uncertainty surrounding a dispersion score returned by our analysis. This information is provided by a confidence interval (CI). As there is no statistical theory that would permit us to mathematically construct CIs for any of the commonly used parts-based dispersion indices, we need to rely on bootstrapping as a work-around strategy.
The technique works as follows. We draw a large number of bootstrap samples (say, 800) from the original data and then calculate, for each bootstrap sample, the quantity of interest (e.g. a dispersion score). This yields a distribution of bootstrap estimates. Each bootstrap sample has the same size as the original data and is obtained by randomly sampling with replacement from the original data. This means that some observations will occur multiple times in a bootstrap sample while others will not appear in it at all.
The distribution of the bootstrap estimates reflects the statistical variation associated with the sample statistic of interest. The average (mean or median) of this distribution is sometimes referred to as the bootstrap estimator. A percentile bootstrap confidence interval is then obtained by finding the appropriate quantiles of the distribution: For a 90% percentile bootstrap CI, we locate the .05 and the .95 quantile of the distribution, which enclose the middle 90% of the bootstrap estimates.
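To make the procedure concrete, here is a minimal generic sketch in R; it uses the sample mean (rather than a dispersion measure) as the statistic of interest, and the data are simulated for illustration:

# simulate some stand-in data (the 'original sample')
set.seed(1)
x <- rnorm(100, mean = 10, sd = 2)

# draw 800 bootstrap samples and compute the statistic for each
boot_est <- replicate(800, {
  x_star <- sample(x, size = length(x), replace = TRUE)  # same size, with replacement
  mean(x_star)                                           # statistic of interest
})

# 90% percentile bootstrap CI: the .05 and .95 quantiles of the bootstrap estimates
quantile(boot_est, probs = c(.05, .95))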
In general, the resampling scheme that is used to sample from the original data with replacement should align with the data-generating process. In corpus linguistics, this means that the resampling units should correspond to the sampling units in corpus compilation. These are the text files and speakers making up the corpus. Thus, corpus construction usually starts with a sampling frame, which delineates the range (and proportional share) of text categories, socio-demographic strata, or language varieties in general that the corpus is meant to represent. Then, within each category, texts, text excerpts, or speakers are selected or (randomly) sampled. Since the bootstrapping scheme should generally be coherent with the original sampling scheme, bootstrapping corpus data usually involves resampling texts, text files, or speakers (see Gries 2022).
For dispersion analysis, this means that it should always be these kinds of units (or corpus parts) that are selected with replacement. If it is across these units that dispersion is measured, bootstrapping is particularly straightforward. If corpus parts represent higher-level categories such as registers, genres, or socio-demographic groups, the bootstrapping procedure becomes a bit more complex. We will deal with both scenarios in turn.
When measuring dispersion across texts within a single text category, we learn about the generality of the item in this domain of language use (see this blog post). In this analysis setting, bootstrapping is simple. This is because, within a specific text category, corpus compilation can be considered as proceeding by simple random sampling: Each unit in the population has the same probability of being selected. Bootstrap samples are therefore drawn in the same way: By simple random sampling (with replacement) from the pool of text files in the data. This most basic type of bootstrap sampling is sometimes called case resampling.
Let us consider, as an example, the item thirteen in the Spoken BNC2014. There are 668 speakers in this corpus, and speakers will be our corpus parts. We measure dispersion using Gries’s (2008) deviation of proportions (specifically, the modification proposed by Egbert, Burch, and Biber 2020). Scores will be scaled the conventional way, with 0 representing a maximally uneven, and 1 a maximally even distribution.
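For orientation, the basic (unmodified) version of this measure is easy to compute by hand. The following sketch uses made-up counts and conventional scaling; it does not include the Egbert et al. (2020) modification or the frequency adjustment reported in the output below:

# basic deviation of proportions (Gries 2008), conventional scaling,
# computed from made-up counts for four corpus parts
subfreq  <- c(3, 0, 1, 6)             # occurrences of the item per part
partsize <- c(1000, 800, 1200, 1000)  # size of each part (in words)

t_i <- subfreq / sum(subfreq)         # observed share of occurrences
s_i <- partsize / sum(partsize)       # expected share (relative part size)

1 - sum(abs(t_i - s_i)) / 2           # 1 = maximally even, 0 = maximally uneven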
We will draw 500 bootstrap samples, each drawn randomly with replacement from the pool of 668 speakers. Recall that each bootstrap sample consists of 668 “speakers”, a considerable number of whom will appear (i) multiple times or (ii) not at all in the bootstrap sample. The figure below shows the distribution of the 500 bootstrap estimates for DP.
Based on 500 bootstrap samples (random sampling with replacement)
Dispersion score is the median over the 500 bootstrap samples
Unweighted estimate (all corpus parts weighted equally)
The dispersion score is adjusted for frequency using the min-max
transformation (see Gries 2024: 196-208); please note that the
method implemented here does not work well if corpus parts differ
considerably in size; see vignette('frequency-adjustment')
For 0 bootstrap samples, the frequency-adjusted score exceeds the limits
of the unit interval [0,1]; these scores were replaced by 0 or 1
Scores follow conventional scaling:
0 = maximally uneven/bursty/concentrated distribution (pessimum)
1 = maximally even/dispersed/balanced distribution (optimum)
Computed using the modification suggested by Egbert et al. (2020)
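The plotting code below assumes that the bootstrap estimates are stored in a numeric vector DP_boot. A minimal sketch of how such a vector might be obtained with disp_DP_boot() is given here; the argument settings are assumptions (return_distribution and print_score are introduced further below, and it is assumed that the function then returns the estimates as a plain numeric vector):

# assumed call: return the 500 bootstrap estimates as a numeric vector
DP_boot <- disp_DP_boot(
  subfreq = biber150_spokenBNC2014[130, ],  # occurrences of 'thirteen' per speaker
  partsize = biber150_spokenBNC2014[1, ],   # word count per speaker
  n_boot = 500,
  freq_adjust = TRUE,          # assumed, given the min-max note in the print-out above
  return_distribution = TRUE,  # return the bootstrap estimates
  print_score = FALSE          # do not print them to the console
)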
# dotplot of the 500 bootstrap estimates, with median and 90% percentile CI
data.frame(DP = DP_boot) |> 
  ggplot(aes(x = DP)) +
  geom_dotplot(
    method = "histodot",
    binwidth = .007,
    dotsize = .6,
    stackratio = 1.2) +
  scale_x_continuous(
    limits = c(0, 1), expand = c(0, 0),
    labels = c("0", ".25", ".50", ".75", "1")) +
  scale_y_continuous(limits = c(0, 1.1), expand = c(.005, .005)) +
  annotate("point", x = median(DP_boot), y = 1.05) +  # median of the distribution
  annotate("errorbar",                                # 90% percentile interval
    xmin = quantile(DP_boot, .05),
    xmax = quantile(DP_boot, .95),
    y = 1.05, width = 0.03) +
  theme_dotplot() +  # plotting theme from the {uls} package loaded in the setup
  xlab(expression(D[P]))
Above the pile of dots, a filled circle marks the median of this distribution, and the error bar extends from the .05 to the .95 quantile, marking a 90% percentile bootstrap CI.
When dispersion is measured across text categories, the goal is to quantify the register-specificity of an item (see this blog post). At this level of analysis, the corpus parts no longer coincide with the sampling units. Since it is preferable for bootstrapping to mimic the data-collection procedure as closely as possible, we need to resort to a different sampling scheme. The variant we will adopt is referred to as stratified bootstrapping. This resampling scheme takes into account the way in which the texts are organized, which means that it draws on those variables that informed data collection. A helpful blog post on stratified bootstrapping can be found here.
Before we go further, we should note that it does not make sense, in general, to resample text categories (i.e. sub-corpora). This is because text categories are (usually) not sampled during corpus compilation. Instead, they are delineated prior to data collection, sometimes in the form of a sampling frame.
In keeping with corpus creation, then, we must resample text files using what is referred to as stratified bootstrapping. The text categories then form the strata, and a bootstrap sample is composed of subsamples, one per stratum. From within each text category, text files are then drawn randomly with replacement.
To illustrate, let’s assume our corpus consists of four genres (A, B, C, and D), which form the corpus parts for a dispersion analysis:
For a stratified bootstrap sample, observations are randomly sampled (with replacement) within each stratum (here: sub-corpus). The number of observations sampled from each stratum is equal to its size. A bootstrap sample therefore looks as follows:
To measure dispersion across the four genres, the bootstrap sample must then be aggregated: For each corpus part (i.e. genre), the subfrequency and partsize are determined by summing the relevant counts across all text files that ended up in the respective bootstrap subsample. The dispersion score is calculated based on these aggregated tallies, to obtain the bootstrap estimate for this bootstrap sample.
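The following toy sketch illustrates one stratified bootstrap sample and the aggregation step, using made-up counts for twelve text files spread over the four genres; the dplyr verbs come with the {tidyverse} loaded above:

# made-up text-level data: three text files per genre
toy <- data.frame(
  text_file   = paste0("t", 1:12),
  genre       = rep(c("A", "B", "C", "D"), each = 3),
  subfreq     = c(0, 2, 1,  0, 0, 3,  1, 0, 0,  4, 1, 0),
  text_length = c(2000, 2100, 1900,  2200, 2050, 1980,
                  2010, 1990, 2080,  2150, 2020, 1940)
)

# stratified resampling: within each genre, draw as many text files
# as the genre contains, randomly and with replacement
boot_sample <- toy |> 
  group_by(genre) |> 
  slice_sample(prop = 1, replace = TRUE) |> 
  ungroup()

# aggregation: sum subfrequencies and text lengths per genre
boot_parts <- boot_sample |> 
  group_by(genre) |> 
  summarise(
    subfreq  = sum(subfreq),      # subfrequency of the item in this corpus part
    partsize = sum(text_length)   # size of this corpus part
  )

# the dispersion score for this bootstrap sample is then computed from
# boot_parts$subfreq and boot_parts$partsize (e.g. as in the DP sketch above)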
In general, dispersion analysis needs to resort to stratified (instead of simple) bootstrapping if the corpus parts represent higher-level categories, above the level of the sampling units (texts and speakers). We can therefore safely talk about corpus parts (instead of strata), which simplifies the terminology.
The {tlda} package includes a set of functions suffixed by _boot, which implement simple bootstrapping. To repeat the analysis of thirteen in the Spoken BNC2014, we use the function disp_DP_boot(). The item sits in row 130 of the term-document matrix biber150_spokenBNC2014 (available in the {tlda} package). We supply the following arguments:
n_boot = 1000 for increasing the number of bootstrap samples (default: 500)
boot_ci = TRUE for a percentile bootstrap confidence interval
conf_level = .90 for a 90% CI (default: 95%)

disp_DP_boot(
  subfreq = biber150_spokenBNC2014[130, ],
  partsize = biber150_spokenBNC2014[1, ],
  n_boot = 1000,
  boot_ci = TRUE,
  conf_level = .90
)

       DP  conf.low conf.high
0.5860616 0.5393471 0.6275054
Based on 1000 bootstrap samples (random sampling with replacement)
Median and 90% percentile confidence interval limits
Unweighted estimate (all corpus parts weighted equally)
For 0 bootstrap samples, the frequency-adjusted score exceeds the limits
of the unit interval [0,1]; these scores were replaced by 0 or 1
Scores follow conventional scaling:
0 = maximally uneven/bursty/concentrated distribution (pessimum)
1 = maximally even/dispersed/balanced distribution (optimum)
Computed using the modification suggested by Egbert et al. (2020)
We can also instruct the function to return the bootstrap estimates:
return_distribution = TRUE to obtain the distribution of bootstrap estimates
print_score = FALSE to stop the function from printing all 1,000 estimates to the console
verbose = FALSE to suppress the information print-out (which is identical to the one above)

We can then graph this distribution:
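A minimal sketch of such a call and a quick plot of the result is given below; it assumes, as before, that the estimates come back as a plain numeric vector (the object name DP_boot_1000 is illustrative):

# assumed call: obtain the 1,000 bootstrap estimates as a numeric vector
DP_boot_1000 <- disp_DP_boot(
  subfreq = biber150_spokenBNC2014[130, ],
  partsize = biber150_spokenBNC2014[1, ],
  n_boot = 1000,
  return_distribution = TRUE,  # return the bootstrap estimates
  print_score = FALSE,         # do not print all 1,000 estimates
  verbose = FALSE              # suppress the information print-out
)

# quick look at the distribution
hist(DP_boot_1000, breaks = 30, xlab = expression(D[P]), main = NULL)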
The function disp_DP_sboot() in the {tlda} package implements stratified bootstrapping. For illustration, let us consider the item methods in the Brown Corpus. We will look at its dispersion across the genres (n = 15) in the corpus. To implement this scheme, we need three pieces of information about each text file in the corpus: how often the item occurs in it, how long it is, and which genre it belongs to.
Relevant text-level tallies are available in row 87 of the data set biber150_brown:
word_count methods
A01 2242 0
A02 2277 2
A03 2275 0
A04 2216 0
A05 2244 0
A06 2263 0
A07 2270 1
A08 2187 0
A09 2234 0
A10 2282 2
What is missing, however, is information about genre, i.e. the corpus parts across which we want to measure dispersion. Metadata for the Brown Corpus is available in the dataset metadata_brown in the {tlda} package:
'data.frame': 500 obs. of 4 variables:
$ text_file : chr "A01" "A02" "A03" "A04" ...
$ macro_genre: Ord.factor w/ 4 levels "press"<"general_prose"<..: 1 1 1 1 1 1 1 1 1 1 ...
$ genre : Ord.factor w/ 15 levels "press_reportage"<..: 1 1 1 1 1 1 1 1 1 1 ...
$ word_count : int 2258 2242 2277 2275 2216 2244 2263 2270 2187 2234 ...
The following code adds genre information to the text-level tallies above:
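A minimal sketch of such a merging step is shown below; it assumes that, as in the Spoken BNC2014 data, row 1 of biber150_brown contains the word counts, that its columns correspond to the text files, and that the combined data frame is stored in an object d (the name used in the code further below):

# combine text-level tallies with the genre metadata (assumed layout of biber150_brown)
d <- data.frame(
  word_count = biber150_brown[1, ],    # text length
  methods    = biber150_brown[87, ],   # occurrences of 'methods' per text
  text_file  = colnames(biber150_brown)
) |> 
  left_join(
    metadata_brown[, c("text_file", "macro_genre", "genre")],
    by = "text_file"
  )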
Let’s see whether this was successful:
'data.frame': 500 obs. of 5 variables:
$ word_count : num 2242 2277 2275 2216 2244 ...
$ methods : num 0 2 0 0 0 0 1 0 0 2 ...
$ text_file : chr "A01" "A02" "A03" "A04" ...
$ macro_genre: Ord.factor w/ 4 levels "press"<"general_prose"<..: 1 1 1 1 1 1 1 1 1 1 ...
$ genre : Ord.factor w/ 15 levels "press_reportage"<..: 1 1 1 1 1 1 1 1 1 1 ...
Now we are good to go. Here is a brief run-down of the ensuing analysis:
We use the function disp_DP_sboot() to bootstrap Gries's deviation of proportions, supplying the following arguments:
text_freq: the number of times the item occurs in each text
text_size: the length of the text files
corpus_parts: the corpus parts for the dispersion analysis (and strata for stratified resampling)

disp_DP_sboot(
  text_freq = d$methods,
  text_size = d$word_count,
  corpus_parts = d$genre,
  n_boot = 1000,
  boot_ci = TRUE,
  freq_adjust = TRUE,
  formula = "egbert_etal2020",
  directionality = "conventional"
)

DP_nofreq  conf.low conf.high
0.6221049 0.5370112 0.6970551
Based on 1000 stratified bootstrap samples (sampled with replacement)
Median and 95% percentile confidence interval limits
Unweighted estimate (all corpus parts weighted equally)
The dispersion score is adjusted for frequency using the min-max
transformation (see Gries 2024: 196-208); please note that the
method implemented here does not work well if corpus parts differ
considerably in size; see vignette('frequency-adjustment')
For 0 bootstrap samples, the frequency-adjusted score exceeds the limits
of the unit interval [0,1]; these scores were replaced by 0 or 1
Scores follow conventional scaling:
0 = maximally uneven/bursty/concentrated distribution (pessimum)
1 = maximally even/dispersed/balanced distribution (optimum)
Computed using the modification suggested by Egbert et al. (2020)
@online{sönning2025,
author = {Sönning, Lukas},
title = {Bootstrapping Dispersion Measures},
date = {2025-11-18},
url = {https://lsoenning.github.io/posts/2025-11-15_bootstrapping_dispersion/},
langid = {en}
}