Modeling clustered binomial data

corpus linguistics
regression
clustered data
binary data
This blog post illustrates a number of strategies for modeling clustered binomial data. It describes how they handle the non-independence among observations and what kind of estimates they return.
Author
Affiliation

University of Bamberg

Published

May 9, 2025

A typical feature of corpus data is their hierarchical layout. Observations are usually clustered, which is the case if multiple data points are from the same text (or speaker). Observations from the same source are usually more similar to one another, reflecting idiosyncracies of the author/speaker or particularities of the context of language use. For binary outcome variables, there are different options for modeling such data. This blog post builds on a paper by Anderson (1988) and contrasts approaches that differ in the way they represent (or account for) the non-independence of data points.

R setup
library(tidyverse)         # for data wrangling and visualization
library(marginaleffects)   # to compute model-based estimates
library(corpora)           # for data on passives
library(kableExtra)        # for drawing html tables
library(lattice)           # for data visualization
library(likelihoodExplore) # for drawing the binomial likelihood
library(gamlss)            # to fit a variant of the quasi-binomial model
library(aod)               # to fit a beta-binomial model
library(PropCIs)           # to calculate Wilson score CIs
library(doBy)              # to convert data from short to long format
library(lme4)              # to fit mixed-effects regression models


source("C:/Users/ba4rh5/Work Folders/My Files/R projects/my_utils_website.R")

Data: Passives in academic writing

We use data on the frequency of the passive in the Brown Family of corpora, which is part of the {corpora} package (Evert 2023). We concentrate on the genre Learned and consider texts from Brown and Frown.

d <- PassiveBrownFam |> 
  filter(
    genre == "learned",
    corpus %in% c("Brown", "Frown")) |> 
  select(id, corpus, act, pass, verbs)

This leaves us with 160 texts:

str(d)
'data.frame':   160 obs. of  5 variables:
 $ id    : chr  "brown_J01" "brown_J02" "brown_J03" "brown_J04" ...
 $ corpus: Factor w/ 5 levels "BLOB","Brown",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ act   : int  88 66 134 117 119 65 95 108 40 192 ...
 $ pass  : int  43 73 61 46 53 65 46 67 84 26 ...
 $ verbs : int  131 139 195 163 172 130 141 175 124 218 ...

There is one row per text and the following variables are relevant for our analyses:

  • id text identifier
  • corpus source corpus (“Brown” vs. “Frown”)
  • act number of active verb phrases in the text
  • pass number of passive verb phrases in the text
  • verbs total number of verb phrases in the text

For each text, the frequency of the passive can be expressed as a proportion: the proportion of verb phrases that are in the passive voice. We add this variable to the data frame:

d$prop_passive <- d$pass/d$verbs

We use a dot diagram to inspect the distribution of these proportions across the 160 texts. In Figure 1, each dot represents a text, and the scores reflect the share of passive verb phrases among all verb phrases in the text document. We will refer to this quantity as the text-specific proportion of passive verb phrases.

draw figure
d |> 
  ggplot(aes(x = prop_passive)) +
  geom_dotplot(method = "histodot", binwidth = .015, dotsize=.8) +
  theme_dotplot() +
  scale_x_continuous(
    limits = c(0,1), expand = c(0,0),
    breaks = c(0, .25, .5, .75, 1),
    labels = c("0", ".25", ".50", ".75", "1")) +
  xlab("Proportion of passive verb phrases")
Figure 1: Dot diagram showing the proportion of passive verb phrases in the 160 texts.

The 160 texts also differ in the number of verb phrases they contain, so let us also look at this distribution. Figure 2 shows that this count varies between roughly 100 and 250.

draw figure
d |> 
  ggplot(aes(x = verbs)) +
  geom_dotplot(method = "histodot", binwidth = 3, dotsize = .7) +
    theme_dotplot() +
    xlab("Number of verb phrases")
Figure 2: Dot diagram showing the distirbution of the number of verb phrases per text file.

The 160 texts can be considered a sample of academic prose from the language variety of interest, written American English in the second half of the 20th century. In selecting (or sampling) these 160 academic texts, the corpus compilers essentially selected a set of authors, or speakers, of this language variety. In some sense, these individuals represent the primary sampling units: Our sample size for making inference about a larger population of speakers is 160.

Each text in the Brown Family of corpora is around 2,000 words long. A text excerpt, and the verb phrases it contains, can be considered as a sample from a (hypothetical) population, the academic prose produced by a specific author. At this level, the language use (or writing style) of this individual is the population of interest. The 2,000 words, (or, e.g., 160 verb phrases) then represent the secondary sampling units.

This means that we can use the information in the text to make inferences about the underlying propensity of the author(s) to use the passive voice in their academic writing. Texts with fewer verb phrases provide less information, and – due to sampling variation – we would expect smaller samples to yield more variable proportions (see Sönning and Schlüter 2022 for an illustration).

This is indeed the case for the present data. The point cloud in the Figure 3 below shows a trumpet-like shape: the highest proportions are from the texts with the fewest verb phrases. We should note, however, that other factors may contribute to this pattern: Thus, texts with fewer verb phrases necessarily feature longer (and presumably more elaborate) sentences, an indicator of abstract writing style that is also associated with passive usage.

draw figure
d |> ggplot(aes(x = verbs, y = prop_passive)) +
  geom_point() +
  theme_classic() +
  scale_y_continuous(
    limits = c(0,1), expand = c(0,0),
    breaks = c(0, .5, 1),
    labels = c("0", ".5", "1")) +
  ylab("Proportion of passives") +
  xlab("Number of verb phrases in the text")
Figure 3: Scatterplot showing the relation between the proportion of passives and the sample size (number of verb phrases in the text).

To emphasize the two-stage sampling design involved in corpus compilation, let us make visual inferences about the language use of the individual authors. To this end, we can construct a 95% confidence interval for each text-specific estimate. We will use the package {PropCIs} (Scherer 2018) to calculate 95% Wilson score confidence intervals for each of the 160 texts.

?@fig-dotplot-text-cis presents text-level estimates of the proportion of passives with a 95% confidence interval. The degree of overlap among the 160 intervals can be interpreted as giving an indication of the heterogeneity of the individual authors. If the 160 authors showed a similar inclination toward the passive, we would observe considerable overlap among the error bars. Judging from the figure below, however, there seems to be appreciably heterogeneity.

draw figure
ci_upper <- NA
ci_lower <- NA

for(i in 1:160){
  ci_lower[i] <- scoreci(x = d$pass[i], n = d$verbs[i], conf.level = .95)$conf.int[1]
  ci_upper[i] <- scoreci(x = d$pass[i], n = d$verbs[i], conf.level = .95)$conf.int[2]
}

p1 <- xyplot(1~1, type = "n", ylim=c(0,1), xlim = c(0,163),
       par.settings = my_settings, axis = axis_left,
       scales = list(
         y = list(
               at = c(0, .5, 1),
               label = c("0", ".5", "1"))),
       ylab = "Proportion of passives", xlab = NULL,
       panel = function(x,y){
         panel.points(x=c(1:80, 83:162), y = d$prop_passive, pch=19)
         panel.segments(x0=c(1:80, 83:162), x1 = c(1:80, 83:162), y0 = ci_lower, y1 = ci_upper)
         panel.segments(x0=1, x1=80, y0=0, y1=0)
         panel.segments(x0=83, x1=162, y0=0, y1=0)
         panel.text(x=c(40.5, 122.5), y = -.1, label = c("Brown", "Frown"))
         panel.text(x=162, y = -.25, label="Error bars: 95% CIs", col = "grey40", cex = .8, adj=1)
       })

cairo_pdf("fig_passives_text_cis.pdf", width = 7, height = 2)
p1
dev.off()
png 
  2 

Binomial model

We start with a simple binomial model, which basically ignores the structure of the data. It uses a single parameter to express the mean proportion of the passive in the dataset. This means that it essentially treats all verb phrases in the data (n = 26772) as an unstructured sample from the population of interest, each one drawn independently of the other ones. The way the data are presented to the model, with one row per text, does not matter to the binomial model – it produces the same result if we supply just one row, with verbs representing the total number of verb phrases and pass the total number of passives in the data.

Since the clustering variable Text is not taken into account, the model rests on the assumption that the 160 texts share the same underlying relative frequency of the passive. This means that the authors are assumed to be perfectly homogeneous with respect to their stylistic preferences. Under this model, the observed variability in proportions is merely a result of sampling variation, which in turn depends on (i) the number of verb phrases in the text, and (ii) the overall proportion of the passive.

Point (ii) deserves some more comment. For binomial data, sampling variation is smaller the closer we get to 0 and 1. This is due to the boundedness of the scale – near 0 or 1, there is less room for variation. Statistically speaking, the variance of the binomial distribution depends on its mean. As Figure 4 illustrates, it is greatest at .50 and decreases toward the endpoints of the proportion scale. This means that the variability of observed proportions depends on the mean of the binomial distribution.

draw figure
xyplot(
  1~1, type = "n", xlim=c(0,1), ylim = c(0,.25),
  par.settings = my_settings, axis = axis_L,
  scales = list(y = list(
    at = c(0, .1, .2, .3, .4, .5),
    label = c("0", ".1", ".2", ".3", ".4", ".5")),
    x = list(
      at = c(0, .25, .5, .75, 1),
      labels = c("0", ".25", ".50", ".75", "1"))),
  ylab = "Variance",
  xlab = "Mean",
  panel = function(x,y){
    panel.points(x = seq(0, 1, .01),
                 y = (seq(0, 1, .01)*(1-seq(0, 1, .01))),
                 type="l")
       })
Figure 4: Mean-variance relationship in the binomial model.

We can fit this model in R using the glm() function:

m <- glm(
  cbind(pass, act) ~ 1, 
  data = d, 
  family = "binomial")

The model intercept is -1.37, which is the mean probability of the passive, expressed on the log odds scale. We can use the function plogis() to back-transform to the proportion scale:

round(
  plogis(coef(m)), 3)
(Intercept) 
      0.203 

A 95% CI can be constructed using the function confint():

plogis(confint(m))
    2.5 %    97.5 % 
0.1981468 0.2077818 

Model-based estimates on the proportion scale are easy to obtain using the {marginaleffects} package (Arel-Bundock, Greifer, and Heiss 2024). The function avg_predictions() returns a model-based prediction of the mean probability of a passive verb phrase in the population of interest, along with a 95% CI.

avg_predictions(m) |> 
  tidy() |> 
  select(estimate, conf.low, conf.high) |> 
  round(3)
# A tibble: 1 × 3
  estimate conf.low conf.high
     <dbl>    <dbl>     <dbl>
1    0.203    0.198     0.208

The estimated proportion of .203 comes with a crisp 95% CI, which ranges from .198 to .208. Apparently, the model is very confident in its predicted probability of the passive voice.

To check how well the binomial model fits the data, we can ask it to “retrodict” the data, i.e. to tell us what it thinks the distribution of the 160 text-level proportions looks like. For this task, the binomial model uses the estimated overall proportion (.203) and the sample size for each text, i.e. the number of verb phrases it contains. It can then produce a density curve for each text, which is centered on .203 and spread out in accordance with the sample size: for a text with more verb phrases, the density curve is more peaked, and for a text with fewer verb phrases it is spread more widely around .203. Importantly, however, all density curves are centered on .203, the estimate of the population proportion.

Figure 5 draws the 160 density curves against the observed distribution of text-specific proportions. Apparently, the model fails to capture the heterogeneity among texts.

draw figure
xyplot(
  1 ~ 1, type = "n", xlim=c(0,1), ylim = c(0,.16),
  par.settings = my_settings, axis = axis_bottom,
  scales = list(
    x = list(
      at = c(0, .25, .5, .75, 1),
      label = c("0", ".25", ".50", ".75", "1"))),
  xlab.top = "Binomial model\n",
  xlab = "Text-specific proportion of passive verb phrases", ylab = NULL,
  panel = function(x,y){
    panel.dotdiagram(d$prop_passive, scale_y = .006, n_bins=50, set_col="grey")
    for(i in 1:160){
      panel.points(
        x = seq(0, 1, length = 100),
        y = exp(likbinom(
          x = round(d$verbs[i]*.203), 
          size = d$verbs[i], 
          prob = seq(0,1, length = 100), log = TRUE)) /
          sum(exp(likbinom(
            x = round(d$verbs[i]*.203), 
            size = d$verbs[i], 
            prob = seq(0,1, length = 100), log = TRUE))),
        type = "l", alpha = .1)
      }
    panel.text(x = .4, y = .06, label = "Observed distribution", col = "grey40", cex = .9, adj = 0)
    panel.text(x = .25, y = .13, label = "Expected distribution", col = 1, cex = .9, adj = 0)
    })
Figure 5: Fit between the binomial model and the data.

If the observed data show greater variation than anticipated by a statistical model, the data are said to be overdispersed relative to this model. Alternative modeling approaches take into account this overdispersion, or heterogeneity. As we will see, however, they do so in different ways.

Quasi-binomial model including a heterogeneity parameter

A quasi-binomial model includes a second parameter that explicitly captures the excess variation in the data (see Agresti 2013, 150–51). This dispersion parameter adjusts the variance of the binomial distribution and is often denoted as \(\phi\). It is estimated on the basis of a global \(\chi^2\) fit statistic for the model.

In this way, the quasi-binomial model allows the standard deviation of observed proportions to be greater than anticipated by the simple dependency on the mean of the distribution, which we saw in Figure 4 above. Essentially, the dispersion parameter is a multiplicative factor that adjusts the variance of the binomial distribution upwards. If the dispersion parameter is 1, the model reduces to the binomial model discussed above. The dispersion parameter also affects inferences from the model: standard errors are multiplied by \(\sqrt{\phi}\).

We can fit a quasi-binomial model using the glm() function in R:

m <- glm(
  cbind(pass, act) ~ 1, 
  data = d, 
  family = "quasibinomial")

Then we use the {marginaleffects} package to obtain the model-based predicted probability of a passive verb phrase (+ 95% CI):

avg_predictions(m) |> 
  tidy() |> 
  select(estimate, conf.low, conf.high) |> 
  round(3)
# A tibble: 1 × 3
  estimate conf.low conf.high
     <dbl>    <dbl>     <dbl>
1    0.203    0.186      0.22

This produces the same estimate as above, but with a wider uncertainty interval. The model intercept, once back-transformed to the probability scale, yields the same estimate:

round(
  plogis(coef(m)), 3)
(Intercept) 
      0.203 

The heterogeneity factor adjusts the variance of the binomial distribution to align the model with the excess variability in the observed proportions. However, the model still assumes a constant underlying proportion for all authors. The heterogeneity parameter basically states that some perturbation, perhaps caused by omitted predictors or positively correlated (i.e. non-independent) observations in each row of the table, leads to greater variability of the observed proportions. The model does not point to a specific source of the overdispersion.

Since the variance is increased proportionally to the overdisersion parameter, the bell-shaped curves are spread out more widely. This is illustrated in the figure below. We see that the fit between model and data is much better now.

draw figure
xyplot(
  1 ~ 1, type = "n", xlim=c(0,1), ylim = c(0,.06),
  par.settings = my_settings, axis = axis_bottom,
  scales = list(
    x = list(
      at = c(0, .25, .5, .75, 1),
      label = c("0", ".25", ".50", ".75", "1"))),
  xlab.top = "Quasi-binomial model\n",
  xlab = "Text-specific proportion of passive verb phrases", ylab = NULL,
  panel = function(x,y){
    panel.dotdiagram(d$prop_passive, scale_y = .004, n_bins=50, set_col="grey")
    for(i in 1:160){
      interpolated_to_100_steps <- approx(
        x = (0:d$verbs[i])/d$verbs[i],
        y = dDBI(0:d$verbs[i], mu=.203, sigma=11, bd=d$verbs[i]),
        xout = seq(0, 1, length=100))
      
      panel.points(x = interpolated_to_100_steps$x,
                   y = interpolated_to_100_steps$y/sum(interpolated_to_100_steps$y),
                   type = "l", alpha = .1)
    }
    panel.text(x = .48, y = .03, label = "Observed distribution", 
               col = "grey40", cex = .9, adj = 0)
    panel.text(x = .25, y = .055, label = "Expected distribution", 
               col = 1, cex = .9, adj = 0)
  })
Figure 6: Fit between the quasi-binomial model and the data.

However, the quasi-binomial model does not represent the structure in our data. It does not attribute the excess variation in proportions to the fact that we are looking at 160 different texts, from speakers who may very well show different stylistic attitudes toward passive usage. Rather, it states that some noise variable increased the sampling variation when drawing verb phrases from each speaker, with speakers nevertheless being actually homogeneous with respect to their underlying usage rate of the passive.

While the quasi-binomial model effectively adjusts inferences for the non-independence of observations, the way in which this is achieved may not be appropriate in all situations (see Finney 1971, 72). In particular, if the data are clustered, this information should be explicitly taken into account. The models we consider next embrace the data structure and introduce parameters that describe between-cluster variation, thereby linking overdispersion to a specific source. As noted by Agresti (2013, 151), this approach is preferable, because it actually models the observed heterogeneity.

Beta-binomial model

The beta-binomial model also includes a second parameter, but this parameter has a different function (and interpretation). It explicitly allows for the possibility that the texts in the data differ in the underlying probability of passive usage. This parameter aims to represent the distribution of text-specific proportions, which means that it actively takes into account the clustering variable Text. If texts vary considerably, reflecting large overdispersion relative to the binomial model, the parameter describing the text-to-text variation will be large. If there is no evidence for surplus variation among texts, the beta-binomial model reduces to the binomial model. The relationship between the binomial mean and variance (see Figure 4) therefore remains unaltered.

Since the text-specific proportions are bounded between 0 and 1, a distribution that respects these limits must be used. In the case of the beta-binomial model, this is the beta distribution. As discussed in more detail in this blog post, the beta distribution has two parameterizations. It can be defined using two so-called shape parameters, or it can be defined using a mean and a standard deviation parameter. The mean, in our case, is the model-based overall mean proportion of passive verb phrases.

We can fit a beta-binomial model using the function betabin() in the R package {aod} (Lesnoff et al. 2012):

m <- betabin(
  cbind(pass, act) ~ 1, 
  ~ 1, 
  data = d)

The parameter controlling the spread of the text-specific proportions is termed \(\phi\). It can be extracted from the model object as follows:

m@random.param
phi.(Intercept) 
      0.0649405 

The \(\phi\) parameter returned by aod::betabin() is the reciprocal of the standard deviation of the beta distribution, so we convert it:

sd_beta <- 1/m@random.param

This gives us the mean and standard deviation of the beta distribution that describes the variability among texts. To graph this distribution, we need to translate these parameters into shape parameters (see this blog post):

muphi_to_shapes <- function(mu, phi) {
  shape1 <- mu * phi
  shape2 <- (1 - mu) * phi
  return(list(shape1 = shape1, shape2 = shape2))
}

shape_parameters <- muphi_to_shapes(
  mu = plogis(coef(m)), 
  phi = 1/m@random.param)

We can now graph the beta distribution:

draw figure
xyplot(
  1~1, type = "n", xlim=c(0,1), ylim = c(0,5),
  par.settings = my_settings, axis = axis_bottom,
  scales = list(x = list(
      at = c(0, .25, .5, .75, 1),
      labels = c("0", ".25", ".50", ".75", "1"))),
  xlab = "Text-specific proportion of passive verb phrases", ylab = NULL,
  panel = function(x,y){
    panel.points(x = seq(0, 1, .01),
                 y = dbeta(seq(0, 1, .01), 
                           shape1 = shape_parameters$shape1,
                           shape2 = shape_parameters$shape2),
                 type="l")
    panel.text(x=.4, y=5, label="Beta distribution with parameters:", 
               col = "grey40", cex=.8, adj=0)
    panel.text(x=.45, y = 3.5, label = "\u2022 Mean = 0.21; SD = 15.4",
               col = "grey40", cex=.8, adj=0)
    panel.text(x=.45, y = 2.5, label = "\u2022 Shape 1 = 3.3; Shape 2 = 12.1",
               col = "grey40", cex=.8, adj=0)
       })
Figure 7: Beta density describing the distirbution of text-specific proportions.

The intercept of the beta-binomial model translates into a slightly higher mean probability of the passive:

plogis(coef(m))
(Intercept) 
  0.2137407 

The function avg_predictions() returns the same estimate, along with an appropriately wide confidence interval:

avg_predictions(m) |> 
  tidy() |> 
  select(estimate, conf.low, conf.high) |> 
  round(3)
# A tibble: 1 × 3
  estimate conf.low conf.high
     <dbl>    <dbl>     <dbl>
1    0.214    0.197     0.231

Finally, we check the fit between model and data visually (see Figure 8). The beta distribution appears to capture the spread of the observed text-specific proportions quite well. We should note, however, that the density curve does not capture the additional variation among the observed proportions that is due to sampling variation. The dots are therefore expected to be spread out more widely, albeit only slightly so due to the large sample sizes (see Figure 2).

draw figure
xyplot(1 ~ 1, type = "n", xlim=c(0,1), ylim = c(0,.03),
       par.settings = my_settings, axis = axis_bottom,
       scales = list(
         x = list(
               at = c(0, .25, .5, .75, 1),
               label = c("0", ".25", ".50", ".75", "1"))),
       xlab.top = "Beta-binomial model\n",
       xlab = "Text-specific proportion of passive verb phrases", ylab = NULL,
       panel = function(x,y){
         panel.dotdiagram(d$prop_passive, scale_y = .002, n_bins=80, set_col="grey60", seq_min = -.125,, seq_max = 1)
         panel.points(x = seq(0, 1, length=100),
                      y = dbeta(x = seq(0, 1, length=100),
                                shape1 = shape_parameters$shape1, 
                                shape2 = shape_parameters$shape2)/200, type = "l")
         panel.text(x = .48, y = .015, label = "Observed distribution", 
                    col = "grey40", cex = .8, adj = 0)
         panel.text(x = .25, y = .03, label = "Expected distribution", 
                    col = 1, cex = .8, adj = 0)
       })
Figure 8: Fit between the beta-binomial model and the data.

Since the variation in text-specific rates is represented using a probability distribution, we can use the beta-binomial model to describe this variation in informative ways. For instance, we may be interested in the interquartile range, i.e. the spread of the central 50% of the proportions.

qbeta(
  c(.25, .75), 
  shape1 = shape_parameters$shape1, 
  shape2 = shape_parameters$shape2) |> 
  round(3)
[1] 0.138 0.277

Or, seeing that the the model estimate is the mean over the text-specific proportions, we may instead be interested in the median proportion:

qbeta(
  .5, 
  shape1 = shape_parameters$shape1, 
  shape2 = shape_parameters$shape2) |> 
  round(3)
[1] 0.201

We now move on to a class of models that may be more familiar to many researchers: Mixed-effects regression models.

Comparison

Figure 12 brings together the estimates from the different models. The first thing we note is that, in terms of proposed statistical precision, they form two groups: The binomial model is the odd one out, with a very narrow confidence interval on the estimated proportion. All other models produce confidence intervals that are very similar in length.

The second thing to note is that estimates form three groups:

  • The lowest estimate is the conditional mean based on the logistic random-effects model. It is the mean over cluster-specific logits, back-transformed to the proportion scale.
  • The marginal mean from the same model, which is the mean over cluster-specific proportions (nearly) coincides with the mean proportion returned by the beta-binomial model, and (interestingly) the conditional estimate from the ordinary random-effects model.
  • The third group, which yields an intermediate predicted proportion, is formed by the binomial, the quasi-binomial, and the marginal estimate from the ordinary random-effects model.
draw figure
comp_models <- tibble(
  model = c("Binomial", "Quasi-binomial", "Beta-binomial", 
                "Ordinary random-effects (conditional)",
                "Ordinary random-effects (marginal)",
                "Logistic random-effects (conditional)",
                "Logistic random-effects (marginal)"),
  estimate = c(pred_binomial$estimate,
               pred_quasibin$estimate,
               pred_betabin$estimate,
               pred_ord_ranef_c$estimate,
               pred_ord_ranef_m$estimate,
               pred_log_ranef_c$estimate,
               pred_log_ranef_m$estimate),
  ci_lower = c(pred_binomial$conf.low,
               pred_quasibin$conf.low,
               pred_betabin$conf.low,
               pred_ord_ranef_c$conf.low,
               pred_ord_ranef_m$conf.low,
               pred_log_ranef_c$conf.low,
               pred_log_ranef_m$conf.low),
  ci_upper = c(pred_binomial$conf.high,
               pred_quasibin$conf.high,
               pred_betabin$conf.high,
               pred_ord_ranef_c$conf.high,
               pred_ord_ranef_m$conf.high,
               pred_log_ranef_c$conf.high,
               pred_log_ranef_m$conf.high)
)

comp_models$model <- factor(
  comp_models$model,
  levels = c("Binomial", "Quasi-binomial", "Beta-binomial", 
                "Ordinary random-effects (conditional)",
                "Ordinary random-effects (marginal)",
                "Logistic random-effects (conditional)",
                "Logistic random-effects (marginal)"),
  ordered = TRUE
)

comp_models |> 
  ggplot(aes(y = model, x = estimate)) +
  geom_vline(xintercept = c(.1935, .203, .213), color = "grey95", lwd=3) +
  geom_point() +
  geom_linerange(aes(xmin = ci_lower, xmax = ci_upper)) +
  scale_x_continuous(breaks = seq(.18, .23, .01), 
                     labels = c(".18", ".19", ".20", ".21", ".22", ".23")) +
  theme_classic_ls() +
  xlab("Estimated proportion of passives") +
  ylab(NULL)
Figure 12: Comparison of model-based mean predictions.

Summary

We discussed different approaches to modeling clustered binomial data. These differ in the way they address the resulting non-independence of observations in the data. If a clustering variable is present, it is generally preferable to use a model that links the observed non-independence to this source. These (proper) modeling approaches represent the observed variation across clusters in different ways, and they yield different types of estimates for the mean proportion in the population of interest. We saw how the {marginaleffects} package can be used to construct these mean predictions, which mainly differ in terms of the scale on which they average over cluster-specific quantities.

References

Agresti, Alan. 2013. Categorical Data Analysis. Hoboken, NJ: Wiley.
Anderson, Dorothy A. 1988. “Some Models for Overdispersed Data.” Australian Journal of Statistics 30 (2): 125–48. https://doi.org/10.1111/j.1467-842x.1988.tb00844.x.
Arel-Bundock, Vincent, Noah Greifer, and Andrew Heiss. 2024. “How to Interpret Statistical Models Using marginaleffects for R and Python.” Journal of Statistical Software 111 (9): 1–32. https://doi.org/10.18637/jss.v111.i09.
Bates, Douglas, Martin Mächler, Ben Bolker, and Steve Walker. 2015. “Fitting Linear Mixed-Effects Models Using lme4.” Journal of Statistical Software 67 (1): 1–48. https://doi.org/10.18637/jss.v067.i01.
Evert, Stephanie. 2023. Corpora: Statistics and Data Sets for Corpus Frequency Data. https://doi.org/10.32614/CRAN.package.corpora.
Finney, David J. 1971. Probit Analysis. New York: Cambridge University Press.
Lesnoff, M., Lancelot, and R. 2012. Aod: Analysis of Overdispersed Data. https://cran.r-project.org/package=aod.
Scherer, Ralph. 2018. PropCIs: Various Confidence Interval Methods for Proportions. https://doi.org/10.32614/CRAN.package.PropCIs.
Sönning, Lukas, and Julia Schlüter. 2022. “Comparing Standard Reference Corpora and Google Books Ngrams: Strengths, Limitations and Synergies in the Contrastive Study of Variable h- in British and American English.” In Data and Methods in Corpus Linguistics, 17–45. Cambridge University Press. https://doi.org/10.1017/9781108589314.002.

Citation

BibTeX citation:
@online{sönning2025,
  author = {Sönning, Lukas},
  title = {Modeling Clustered Binomial Data},
  date = {2025-05-09},
  url = {https://lsoenning.github.io/posts/2025-05-05_binomial_overdispersion/},
  langid = {en}
}
For attribution, please cite this work as:
Sönning, Lukas. 2025. “Modeling Clustered Binomial Data.” May 9, 2025. https://lsoenning.github.io/posts/2025-05-05_binomial_overdispersion/.