Modeling clustered frequency data I: Texts of similar length

corpus linguistics

regression

clustered data

frequency data

negative binomial

This blog post illustrates a number of strategies for modeling clustered count data. It describes how they handle the non-independence among observations and what kind of estimates they return. The focus is on a situation where texts have roughly the same length.

Author

Affiliation

Lukas Sönning

University of Bamberg

Published

May 14, 2025

When describing or modeling corpus-based frequency counts, the fact that a corpus is divided into texts has consequences for statistical modeling. For count variables, there are different options for modeling such data. This blog post contrasts approaches that differ in the way they represent (or account for) the non-independence of data points and looks at a setting where texts are very similar in length.

R setup

library(tidyverse)          # for data wrangling and visualization
library(dataverse)          # for downloading data from TROLLing
library(marginaleffects)    # to compute model-based estimates
library(MASS)               # to fit a negative binomial regression model
library(corpora)            # to calculate a log-likelihood score
library(kableExtra)         # for drawing html tables
library(lme4)               # to fit mixed-effects regression models
library(lattice)            # for data visualization
library(gamlss)             # for drawing gamma densities

# pak::pak("lsoenning/uls") # install package "uls"
library(uls)                # for ggplot2 dotplot theme

source("C:/Users/ba4rh5/Work Folders/My Files/R projects/my_utils_website.R")

Case study: The frequency of should in written AmE of the 1960s and 1990s

Our focus will be on the frequency of the modal verb should in written American English, and we will rely on data from the Brown and Frown Corpus. This allows us to work with straightforward research questions about normalized frequencies and their comparison, which are quite common in corpus work. The following questions guide our analysis:

What is the frequency of should in written American English of the early 1960s and early 1990s?
Has its frequency changed over time?

We will consider the imbalance across genres in the Brown Family of corpora a meaningful feature of the population of interest and therefore not adjust our estimates for the differential representation of these text categories. For an alternative approach, see this blog post.

We start by downloading the data from the TROLLing archive:

dat <- get_dataframe_by_name(
    filename  = "modals_freq_form.tsv",
    dataset   = "10.18710/7LNWJX",
    server    = "dataverse.no",
    .f        = read_tsv,
    original  = TRUE
  )

The table we have downloaded contains text-level frequencies for nine modal verbs from six members of the Brown Family (Brown, Frown, LOB, FLOB, BE06, AmE06). It includes the following variables:

text_id: The text ID used in the Brown Family corpora (“A01”, “A02”, …)
modal: the modal verb
n_tokens: number of occurrences of the modal verb in the text
corpus: member of the Brown Family
genre: broad genre (Fiction, General prose, Learned, Press)
text_category: subgenre
n_words: length of the text (number of word tokens)
time_period: time period represented by the corpus
variety: variety of English represented by the corpus

str(dat)

'data.frame':   27000 obs. of  9 variables:
 $ text_id      : chr  "A01" "A01" "A01" "A01" ...
 $ modal        : chr  "can" "could" "may" "might" ...
 $ n_tokens     : num  1 0 1 1 3 0 6 14 9 4 ...
 $ corpus       : chr  "Brown" "Brown" "Brown" "Brown" ...
 $ genre        : chr  "press" "press" "press" "press" ...
 $ text_category: chr  "press_reportage" "press_reportage" "press_reportage" "press_reportage" ...
 $ n_words      : num  2206 2206 2206 2206 2206 ...
 $ time_period  : num  1961 1961 1961 1961 1961 ...
 $ variety      : chr  "AmE" "AmE" "AmE" "AmE" ...

We extract the data for should in Brown and Frown:

should_data <- dat |> 
  filter(
    corpus %in% c("Brown", "Frown"),
    modal == "should"
  )

should_Brown <- should_data |> 
  filter(
    corpus == "Brown")

should_Frown <- should_data |> 
  filter(
    corpus == "Frown")

Data description

Let us start by summarizing key features of the data. There are 500 texts in each corpus:

table(should_data$corpus)


Brown Frown 
  500   500

Let’s also take a look at the distribution of occurrence rates across texts. We first add a new variable that expresses the text-level frequency of should as a normalized frequency (per thousand words):

should_data$rate_ptw <- should_data$n_tokens/should_data$n_words*1000

Then we draw a histogram showing the distribution of these rates by Corpus.

draw figure

should_data |> 
  ggplot(aes(x = rate_ptw)) +
  geom_histogram(binwidth = .15) +
  facet_grid(corpus ~ .) +
  theme_classic_ls() +
  xlab("Normalized frequency of should (per thousand words)") +
  ylab("Number of texts")

Figure 1: Histogram showing the distribution of text-level occurrence rates by Corpus.

Seeing that the normalized frequencies are skewed, we try a square-root-transformation, which somewhat mitigates the skew:

draw figure

should_data |> 
  ggplot(aes(x = rate_ptw)) +
  geom_histogram(binwidth = .05) +
  facet_grid(corpus ~ .) +
  theme_classic_ls() +
  xlab("Normalized frequency of should (per thousand words)") +
  ylab("Number of texts") +
  scale_x_sqrt()

Figure 2: Histogram showing the distribution of text-level occurrence rates by Corpus, using a square-root-scale trnasformation to mitigate the skew.

Then we compare the distributions with a boxplot:

draw figure

should_data |> 
  ggplot(aes(x = rate_ptw, y = corpus)) +
  geom_boxplot() +
  theme_classic_ls() +
  xlab("Normalized frequency of should (per thousand words)") +
  ylab(NULL) +
  scale_x_sqrt()

Figure 3: Boxplot comparing the distribution of text-level occurrence rates in Brown and Frown, using a square-root trnasformation.

As for numerical summaries, a quick measure of the frequency of should in Brown can be calculated by dividing its corpus frequency by the size of the corpus. We can do the same for Frown. We will multiply these rates by 1,000, to get normalized frequencies ‘per thousand words’.

freq_should_Brown <- sum(should_Brown$n_tokens) / sum(should_Brown$n_words) * 1000
freq_should_Frown <- sum(should_Frown$n_tokens) / sum(should_Frown$n_words) * 1000

And here they are, rounded to two decimal places:

round(freq_should_Brown, 2)

[1] 0.79

round(freq_should_Frown, 2)

[1] 0.68

For Brown, we get a rate of 0.79 per thousand words, and for Frown the rate is 0.68 per thousand words.

For a quick answer to the second question, we divide the rate in Frown by that in Brown, which gives us a rate ratio of 0.86. This tells us that the frequency of should in the 1990s was only 86% as large as that in the 1960s:

round(freq_should_Frown / freq_should_Brown, 2)

[1] 0.86

Poisson regression

We start with a Poisson regression model, which does not take into account the fact that each corpus breaks down into 500 texts. Rather, the corpus is treated as an unstructured bag of words.

We can fit a Poisson model with the glm() function:

m <- glm(
  n_tokens ~ corpus + offset(log(n_words)),
  data = should_data,
  family = "poisson")

A model summary appears in the following table:

summary(m)


Call:
glm(formula = n_tokens ~ corpus + offset(log(n_words)), family = "poisson", 
    data = should_data)

Coefficients:
            Estimate Std. Error  z value Pr(>|z|)    
(Intercept) -7.14048    0.03315 -215.416  < 2e-16 ***
corpusFrown -0.14774    0.04864   -3.037  0.00239 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 2671.7  on 999  degrees of freedom
Residual deviance: 2662.5  on 998  degrees of freedom
AIC: 4352.5

Number of Fisher Scoring iterations: 5

We use the {marginaleffects} package (Arel-Bundock, Greifer, and Heiss 2024) to calculate model-based predictions for the frequency of should in each corpus. These coincide with the plain corpus frequencies reported above. We specify the n_words = 1000 to get normalized frequencies ‘per thousand words’.

avg_predictions(
  m, 
  variables = "corpus",
  newdata = datagrid(
    n_words = 1000)) |> 
  tidy() |> 
  dplyr::select(corpus, estimate, conf.low, conf.high) |> 
  mutate(across(2:4, \(x) round(x, 2))) |> 
  kable()

corpus	estimate	conf.low	conf.high
Brown	0.79	0.74	0.84
Frown	0.68	0.64	0.73

The function comparisons() in the {marginaleffects} package allows us to compare the two corpora in relative terms, in the form of a frequency ratio. The rate of should in Frown is only 86% of that in Brown, suggesting a decrease of 14 percentage points.

avg_comparisons(
  m, 
  variables = "corpus",
  newdata = datagrid(
    n_words = 1000),
  comparison = "ratio") |> 
  tidy() |> 
  dplyr::select(contrast, estimate, conf.low, conf.high) |> 
  mutate(across(2:4, \(x) round(x, 2))) |> 
  kable()

contrast	estimate	conf.low	conf.high
mean(Frown) / mean(Brown)	0.86	0.78	0.94

Quasi-Poisson regression

A Quasi-Poisson model introduces a dispersion parameter to adjust inferences for the non-independence of the data points. This parameter, \(\phi\), is estimated on the basis of a global \(\chi^2\) statistic of model (mis)fit, and it is then used to adjust the standard errors returned by the model, which are multiplied by \(\sqrt{\phi}\). For some more background on this way of accounting for overdispersion, see this blog post.

We can run a Quasi-Poisson model as follows:

m <- glm(
  n_tokens ~ corpus + offset(log(n_words)),
  data = should_data,
  family = "quasipoisson")

The model is summarized in the following table:

summary(m)


Call:
glm(formula = n_tokens ~ corpus + offset(log(n_words)), family = "quasipoisson", 
    data = should_data)

Coefficients:
            Estimate Std. Error  t value Pr(>|t|)    
(Intercept) -7.14048    0.06479 -110.212   <2e-16 ***
corpusFrown -0.14774    0.09507   -1.554    0.121    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for quasipoisson family taken to be 3.820299)

    Null deviance: 2671.7  on 999  degrees of freedom
Residual deviance: 2662.5  on 998  degrees of freedom
AIC: NA

Number of Fisher Scoring iterations: 5

The regression table tells us that the dispersion parameter is about 3.8, which means that the standard errors for the Quasi-Poisson model should be 1.95 times (\(\sqrt{3.8}\)) larger than in the Poisson model.

The regression coefficients themselves do not change, and neither do the model-based predictions. We get the same point estimates, though with (appropriately) wider confidence intervals:

avg_predictions(
  m, 
  variables = "corpus",
  newdata = datagrid(
    n_words = 1000)) |> 
  tidy() |> 
  dplyr::select(corpus, estimate, conf.low, conf.high) |> 
  mutate(across(2:4, \(x) round(x, 2))) |> 
  kable()

corpus	estimate	conf.low	conf.high
Brown	0.79	0.69	0.89
Frown	0.68	0.59	0.78

The Quasi-Poisson model also returns the same relative difference between the corpora:

avg_comparisons(
  m, 
  variables = "corpus",
  newdata = datagrid(
    n_words = 1000),
  comparison = "ratio") |> 
  tidy() |> 
  dplyr::select(contrast, estimate, conf.low, conf.high) |> 
  mutate(across(2:4, \(x) round(x, 2))) |> 
  kable()

contrast	estimate	conf.low	conf.high
mean(Frown) / mean(Brown)	0.86	0.7	1.02

Negative binomial regression

Negative binomial regression explicitly takes into account the texts in the data, and represents the observed text-to-text variability in the frequency of should using a probability distribution. The model therefore has an additional parameter that represents the variability of text-level frequencies. As discussed in more detail in this blog post, this parameter controls the shape of a gamma distribution, which in turn describes the multiplicative variation in speaker-specific rates. More background is provided in this blog post.

We can fit a negative binomial model using the function glm.nb() in the {MASS} package (Venables and Ripley 2002):

m <- MASS::glm.nb(
    n_tokens ~ corpus + offset(log(n_words)), 
    data = should_data)

Here is the regression table for this model:

summary(m)


Call:
MASS::glm.nb(formula = n_tokens ~ corpus + offset(log(n_words)), 
    data = should_data, init.theta = 0.9278339214, link = log)

Coefficients:
            Estimate Std. Error  z value Pr(>|z|)    
(Intercept) -7.13436    0.05699 -125.175   <2e-16 ***
corpusFrown -0.15053    0.08166   -1.843   0.0653 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for Negative Binomial(0.9278) family taken to be 1)

    Null deviance: 1043.2  on 999  degrees of freedom
Residual deviance: 1039.8  on 998  degrees of freedom
AIC: 3569

Number of Fisher Scoring iterations: 1

              Theta:  0.9278 
          Std. Err.:  0.0720 

 2 x log-likelihood:  -3562.9510

Model-based predictions are again virtually identical to the ones from the Poisson and Quasi-Poisson model, with wider uncertainty intervals:

avg_predictions(
  m, 
  variables = "corpus",
  newdata = datagrid(
    n_words = 1000)) |> 
  tidy() |> 
  dplyr::select(corpus, estimate, conf.low, conf.high) |> 
  mutate(across(2:4, \(x) round(x, 2))) |> 
  kable()

corpus	estimate	conf.low	conf.high
Brown	0.80	0.71	0.89
Frown	0.69	0.61	0.76

The negative binomial model also returns the same relative difference between the corpora:

avg_comparisons(
  m, 
  variables = "corpus",
  newdata = datagrid(
    n_words = 1000),
  comparison = "ratio") |> 
  tidy() |> 
  dplyr::select(contrast, estimate, conf.low, conf.high) |> 
  mutate(across(2:4, \(x) round(x, 2))) |> 
  kable()

contrast	estimate	conf.low	conf.high
mean(Frown) / mean(Brown)	0.86	0.72	1

Poisson regression with random intercepts

Another way of accounting for the structure in the data is to use a Poisson regression model with random intercepts on Speaker. This model also represents the observed variation among speakers using a probability distribution. Between-speaker variation is modeled on the scale of natural logarithms using a normal distribution. On the scale of the actual occurrence rates, this translates into a log-normal distribution. For a more detailed discussion of the structure of this model, see this blog post.

We can fit this model using the function glmer() in the R package {lme4} (Bates et al. 2015).

m <- lme4::glmer(
    n_tokens ~ corpus + offset(log(n_words)) + (1 | text_id), 
    data = should_data,
    family = "poisson",
    control = glmerControl(optimizer="bobyqa"))

We print a condensed regression table:

arm::display(m)

lme4::glmer(formula = n_tokens ~ corpus + offset(log(n_words)) + 
    (1 | text_id), data = should_data, family = "poisson", control = glmerControl(optimizer = "bobyqa"))
            coef.est coef.se
(Intercept) -7.43     0.05  
corpusFrown -0.15     0.05  

Error terms:
 Groups   Name        Std.Dev.
 text_id  (Intercept) 0.76    
 Residual             1.00    
---
number of obs: 1000, groups: text_id, 500
AIC = 3820.2, DIC = -1122.2
deviance = 1346.0

The table tells us that the standard deviation of the random intercepts, i.e. the parameter describing the spread of the normal distribution representing text-to-text variation in the occurrence rate of should, is 0.76.

As discussed in detail in this blog post, there are two types of predictions we can calculate for the random-intercept Poisson model. Seeing that we are interested in the normalized frequency of should rather than its natural logarithm, we will want to back-transform model-based predictions to the scale of normalized frequencies. Since there is between-speaker variation, our model-based estimate will have to somehow average over speakers. The question is whether we want to average over speakers on the scale of natural logarithms (the model scale) or on the scale of normalized frequencies (the data scale).

By averaging on the data scale of normalized frequencies, we obtain the mean occurrence rate across texts.
By averaging on the model scale of log normalized frequencies, and then back-transforming this mean log rate, we obtain the median occurrence rate across texts.

This blog post provides a detailed illustration of these two types of frequency estimates.

Through appropriate combination of the regression coefficients for the fixed effects, we get averages over texts on the model scale. This is the estimated mean log rate of should in the population of interest. We can back-transform this into a normalized frequency. This summary measure, however, does not represent the mean over normalized frequencies, since the averaging was done on another scale (the model scale).

In our model, the intercept represents the mean log rate for Brown. If we add the coefficient for the predictor Corpus, we get the mean log rate for Frown. Back-transforming these values gives us:

# Brown
round(exp(fixef(m)[1]) * 1e3, 2)

(Intercept) 
       0.59

# Frown
round(exp(fixef(m)[1] + fixef(m)[2]) * 1e3, 2)

(Intercept) 
       0.51

These frequency estimates are lower than the ones we have obtained above. This is because they represent the median occurrence rate of should, and in a distribution that is skewed toward large values (see Figure 1), the median is always smaller than the mean.

We can also calculate these two types of estimates using the {marginaleffects} package. To get the mean log rate of should, back-transformed to the normalized frequency scale, we run the following code. The argument re.form = NA tells the function to ignore the between-speaker variation:

avg_predictions(
  m, 
  variables = "corpus",
  newdata = datagrid(
    n_words = 1000,
    text_id = NA),
  re.form = NA) |> 
  tidy() |> 
  dplyr::select(corpus, estimate, conf.low, conf.high) |> 
  mutate(across(2:4, \(x) round(x, 2))) |> 
  kable()

corpus	estimate	conf.low	conf.high
Brown	0.59	0.53	0.65
Frown	0.51	0.46	0.56

The corresponding relative difference between the corpora can be retrieved as follows:

avg_comparisons(
  m, 
  variables = "corpus",
  newdata = datagrid(
    n_words = 1000,
    text_id = NA),
  re.form = NA,
  comparison = "ratio") |> 
  tidy() |> 
  dplyr::select(contrast, estimate, conf.low, conf.high) |> 
  mutate(across(2:4, \(x) round(x, 2))) |> 
  kable()

contrast	estimate	conf.low	conf.high
mean(Frown) / mean(Brown)	0.86	0.78	0.94

To get (something close to) the mean normalized frequencies we calculated above, we can ask the function avg_predictions() to average predictions over the speakers in the sample. This means that the by-speaker random intercepts are incorporated into the model predictions. The model-based speaker intercepts are used to get a predicted normalized frequency for each speaker, and these are then averaged.

avg_predictions(
  m, 
  variables = "corpus",
  newdata = datagrid(
    n_words = 1000,
    text_id = unique)) |> 
  tidy() |> 
  dplyr::select(corpus, estimate, conf.low, conf.high) |> 
  mutate(across(2:4, \(x) round(x, 2))) |> 
  kable()

corpus	estimate	conf.low	conf.high
Brown	0.77	0.69	0.85
Frown	0.66	0.59	0.73

The relative difference between the groups remains the same:

avg_comparisons(
  m, 
  variables = "corpus",
  newdata = datagrid(
    n_words = 1000,
    text_id = unique),
  comparison = "ratio") |> 
  tidy() |> 
  dplyr::select(contrast, estimate, conf.low, conf.high) |> 
  mutate(across(2:4, \(x) round(x, 2))) |> 
  kable()

contrast	estimate	conf.low	conf.high
mean(Frown) / mean(Brown)	0.86	0.78	0.94

Comparison

Figure 4 compares the estimated average predictions we have collected in this blog post. Two points are noteworthy:

The uncertainty intervals suggested by the Poisson model are narrower than the ones based on the other models. This is because the other model explicitly take into account the non-independence of observations.
Except for the median occurrence rate estimate based on the random-intercept Poisson model, all models return nearly identical estimates of the normalized frequency of should in Brown and Frown.

draw figure

pred_models <- tibble(
  model = rep(c("Poisson", "Quasi-Poisson", "Negative\nbinomial", 
                "Random-intercept\nPoisson\n(mean rate)",
                "Random-intercept\nPoisson\n(median rate)"), each = 2),
  corpus = rep(c("Brown", "Frown"), 5),
  estimate = c(pred_poisson$estimate,
               pred_quaspoi$estimate,
               pred_negbin$estimate,
               pred_ranef_m$estimate,
               pred_ranef_c$estimate),
  ci_lower = c(pred_poisson$conf.low,
               pred_quaspoi$conf.low,
               pred_negbin$conf.low,
               pred_ranef_m$conf.low,
               pred_ranef_c$conf.low),
  ci_upper = c(pred_poisson$conf.high,
               pred_quaspoi$conf.high,
               pred_negbin$conf.high,
               pred_ranef_m$conf.high,
               pred_ranef_c$conf.high)
)

pred_models$model <- factor(
  pred_models$model,
  levels = c("Poisson", "Quasi-Poisson", "Negative\nbinomial", 
                "Random-intercept\nPoisson\n(mean rate)",
                "Random-intercept\nPoisson\n(median rate)"),
  ordered = TRUE
)


pred_models |> 
  ggplot(aes(x = corpus, y = estimate, group = model)) +
  geom_point() +
  geom_line() +
  geom_linerange(aes(ymin = ci_lower, ymax = ci_upper)) +
  facet_grid(. ~ model) +
  theme_classic_ls() +
  scale_y_continuous(limits = c(.4, .95), expand = c(0,0)) +
  ylab("Occurrence rate of should\n(per thousand words)\n") +
  xlab(NULL) +
  coord_cartesian(clip="off")

Figure 4: Comparison of model-based predictions for the average occurrence rate of *should* in the two corpora.

Summary

This blog post contrasted different approaches to modeling clustered frequency counts. If count data are grouped by text (or speaker), a Poisson regression model is usually too restrictive. This is because it is insensitive to the very likely possibility that the normalized frequency of interest varies among texts (or speakers). We looked at different alternatives, and observed that all of these yielded very similar results. This will be the case in situations where texts (or speakers) are similar in length. We also noted that the random-intercept Poisson model is capable of producing two different average frequency estimates: The mean or the median occurrence rate (i.e. normalized frequency) of the item across texts (or speakers).

References

Arel-Bundock, Vincent, Noah Greifer, and Andrew Heiss. 2024. “How to Interpret Statistical Models Using marginaleffects for R and Python.” Journal of Statistical Software 111 (9): 1–32. https://doi.org/10.18637/jss.v111.i09.

Bates, Douglas, Martin Mächler, Ben Bolker, and Steve Walker. 2015. “Fitting Linear Mixed-Effects Models Using lme4.” Journal of Statistical Software 67 (1): 1–48. https://doi.org/10.18637/jss.v067.i01.

Venables, W. N., and B. D. Ripley. 2002. Modern Applied Statistics with s. Fourth. New York: Springer. https://www.stats.ox.ac.uk/pub/MASS4/.

Citation

BibTeX citation:

@online{sönning2025,
  author = {Sönning, Lukas},
  title = {Modeling Clustered Frequency Data {I:} {Texts} of Similar
    Length},
  date = {2025-05-14},
  url = {https://lsoenning.github.io/posts/2025-05-09_counts_overdispersion/},
  langid = {en}
}

For attribution, please cite this work as:

Sönning, Lukas. 2025. “Modeling Clustered Frequency Data I: Texts of Similar Length.” May 14, 2025. https://lsoenning.github.io/posts/2025-05-09_counts_overdispersion/.