Statistics for linguist(ic)s blog

Benchmarking sample-based estimates to population values: Two broad strategies

corpus linguistics

representativeness

regression

bias

imbalance

random forests

This blog post outlines two strategies that can be used to adjust model-based data summaries for known differences between sample and target population. These involve the use of weights, which allow us to account for mismatches either at the stage of model fitting (sampling weights) or when post-processing model predictions (poststratification weights).
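As a rough illustration of the second strategy, poststratification amounts to averaging stratum-level model predictions using known population shares instead of the shares found in the sample. The sketch below is in Python rather than R, and every name and number in it (`sample_n`, `population_share`, `predicted`) is invented for illustration; it is not taken from the post.

```python
# Hypothetical strata with invented values (not from the blog post).
strata = ["spoken", "written"]
sample_n = {"spoken": 80, "written": 20}            # observations per stratum in the sample
population_share = {"spoken": 0.5, "written": 0.5}  # assumed known shares in the target population
predicted = {"spoken": 0.30, "written": 0.60}       # assumed model-based estimate per stratum

total_n = sum(sample_n.values())

# Averaging with sample shares reproduces the imbalance of the sample.
sample_weighted = sum(predicted[s] * sample_n[s] / total_n for s in strata)

# Averaging with population shares benchmarks the estimate to the population.
poststratified = sum(predicted[s] * population_share[s] for s in strata)

print(round(sample_weighted, 3), round(poststratified, 3))
```

The two summaries differ only in the weights attached to the same stratum-level predictions, which is why the adjustment can be made after the model has been fitted.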

Causally informed regression modeling: The dative alternation

corpus linguistics

regression

bias

causal inference

binary data

This blog post describes how assumed causal relations among variables may inform the analysis and interpretation of corpus-linguistic datasets. For illustration, I will rely on Bresnan et al.’s (2007) data on the English dative alternation.

Issues in random-forest modeling: Interaction predictors

corpus linguistics

binary data

random forests

partial dependence plots

In this blog post, I demonstrate that including manually specified interaction predictors in a random-forest model yields biased (or uninterpretable) partial dependence plots.

Issues in random-forest modeling: Treatment of clustering variables

corpus linguistics

clustered data

bias

binary data

random forests

Corpus data often show a hierarchical structure, where observations are grouped by text/speaker or item. In this blog post, I comment on the practice of including clustering variables as predictors in a random-forest model. I demonstrate that this undermines the predictive utility of associated cluster-level predictors.

Custom scoring systems in ordinal data analysis: A tribute to Sharoff (2018)

ordinal data

This blog post discusses the issue of representing an ordinal response scale using numeric scores. It shows how psychometric research may suggest deviations from the near-universal use of equally spaced integers.

Drawing grouped dot plots in R

data visualization

dot plot

corpus linguistics

frequency data

binary data

In this blog post, I describe how to draw grouped dot plots as an alternative to grouped bar charts in R, using the {ggplot2} package.

Drawing panel charts in R

data visualization

dot plot

corpus linguistics

frequency data

binary data

In this blog post, I describe how to draw panel charts as an alternative to stacked bar charts in R, using the {ggplot2} package.

Bootstrapping dispersion measures

corpus linguistics

dispersion

This blog post discusses bootstrapping for dispersion measures and illustrates how two particularly useful variants, simple and stratified bootstrapping, may be applied with the {tlda} package in R.
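The difference between the two variants can be sketched in a few lines: simple bootstrapping resamples text files without regard to corpus structure, whereas stratified bootstrapping resamples within each stratum (here, genre), so stratum sizes stay fixed. The sketch below is in Python rather than R and does not use or imitate the {tlda} interface; the toy corpus, the function names, and the choice of range (the share of texts containing the item) as the dispersion score are assumptions made for illustration.

```python
import random
from collections import defaultdict

# Invented toy corpus: (genre, count of the item in the text). Not from the post.
texts = [("news", 3), ("news", 0), ("news", 1),
         ("fiction", 0), ("fiction", 0), ("fiction", 5), ("fiction", 2)]

def text_range(sample):
    """Proportion of texts containing the item at least once (a simple dispersion score)."""
    return sum(1 for _, count in sample if count > 0) / len(sample)

def simple_resample(data, rng):
    """Draw len(data) texts with replacement, ignoring genre."""
    return [rng.choice(data) for _ in data]

def stratified_resample(data, rng):
    """Resample with replacement within each genre, keeping genre sizes fixed."""
    by_genre = defaultdict(list)
    for row in data:
        by_genre[row[0]].append(row)
    resampled = []
    for rows in by_genre.values():
        resampled.extend(rng.choice(rows) for _ in rows)
    return resampled

rng = random.Random(1)  # fixed seed so the sketch is reproducible
simple_scores = [text_range(simple_resample(texts, rng)) for _ in range(1000)]
stratified_scores = [text_range(stratified_resample(texts, rng)) for _ in range(1000)]
```

The spread of `simple_scores` and `stratified_scores` can then be summarized, for instance with percentile intervals, to gauge the sampling variation of the score under each scheme.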

Dispersion: Levels of analysis

corpus linguistics

dispersion

This blog post argues that there are three basic levels at which dispersion analysis isolates a particular distributional feature and thereby returns clear and interpretable scores. In corpus-based work, these levels are often compounded, and dispersion scores therefore represent a mix of different types of distributional information.

Tukey’s folded power transformation in R

corpus linguistics

dispersion

data visualization

tlda

binary data

This blog post demonstrates a new functionality of the {tlda} package: The implementation of Tukey’s folded power transformation for proportions (and percentages), including its use in data visualization with {ggplot2}.
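For orientation, one common unscaled form of the folded power transformation maps a proportion p to p^λ − (1 − p)^λ, with the folded log (the logit) used in place of λ = 0. The Python sketch below follows that form; the function name `folded_power` is hypothetical, and the {tlda} implementation may use a different name, arguments, or scaling.

```python
import math

def folded_power(p, lam):
    """Unscaled folded power of a proportion p; lam = 0 is treated as the folded log (logit)."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a proportion between 0 and 1")
    if lam == 0:
        # The folded log is undefined at p = 0 and p = 1 (math.log raises ValueError there).
        return math.log(p) - math.log(1.0 - p)
    return p ** lam - (1.0 - p) ** lam

# The transformation is antisymmetric around p = 0.5, where it equals zero.
for p in (0.1, 0.5, 0.9):
    print(p, round(folded_power(p, 0.5), 3))
```

With λ = 1 the transformation is linear in p (2p − 1); smaller values of λ stretch the scale near 0 and 1 more strongly.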

Drawing spike graphs to examine dispersion across text files

corpus linguistics

dispersion

data visualization

This blog post describes how to draw spike graphs that visualize the dispersion of an item across the text files in a corpus. These graphs are enriched with information about corpus design (and structure).

Drawing structured dispersion plots in R

corpus linguistics

dispersion

data visualization

This short note describes how to draw dispersion plots that include information about the design (and structure) of a corpus in R.

Nelson’s (2025) Poisson-based dispersion measure

corpus linguistics

dispersion

This blog post discusses the dispersion measure MB, which was recently proposed by Nelson (2025). The motivation behind MB – to build an index that tolerates sampling variation – is highly commendable. To extend the applicability of this measure, I describe how it can be applied to linguistically grounded corpus parts (e.g. genres/registers, text files), which may also differ in length. I then study the behavior of MB and D_MB and observe that both seem to show an overall bias toward evenness.

Lyne’s (1985) graphical technique for the evaluation of dispersion measures

corpus linguistics

dispersion

data visualization

In this blog post, I revisit the graphical strategy used by Lyne (1985) to study the behavior of the dispersion measures D, D₂, and S, and then apply it to a number of other parts-based indices that have been proposed more recently.

Color-coded dendrograms using the R function `A2Rplot()`

data visualization

In this blog post, I illustrate how to use Romain Francois’ R function A2Rplot() to draw dendrograms with visually distinct clusters.

Exporting R graphics: A basic workflow

data visualization

In this blog post, I describe my workflow for exporting and polishing graphs drawn in R.

Modeling clustered frequency data II: Texts of disproportionate length

corpus linguistics

regression

clustered data

frequency data

bias

imbalance

negative binomial

This blog post illustrates a number of strategies for modeling clustered count data. It describes how they handle the non-independence among observations and what kind of estimates they return. The focus is on a situation where texts have very different lengths.

Modeling clustered frequency data I: Texts of similar length

corpus linguistics

regression

clustered data

frequency data

negative binomial

This blog post illustrates a number of strategies for modeling clustered count data. It describes how they handle the non-independence among observations and what kind of estimates they return. The focus is on a situation where texts have roughly the same length.

Frequency estimates based on random-intercept Poisson models

corpus linguistics

regression

clustered data

frequency data

negative binomial

Clustered count data can be modeled using a Poisson regression model including random intercepts. This blog post describes how this model represents the data and the different kinds of frequency estimates it produces.

Modeling clustered binomial data

corpus linguistics

regression

clustered data

binary data

This blog post illustrates a number of strategies for modeling clustered binomial data. It describes how they handle the non-independence among observations and what kind of estimates they return.

Imbalance across predictor levels affects data summaries

Obstacles to replication in corpus linguistics

corpus linguistics

replication crisis

regression

bias

imbalance

This blog post is part of a small series on obstacles to replication in corpus linguistics. It deals with problems that can arise if the observations drawn from a corpus are unbalanced across relevant subgroups in the data. I show how simple and comparative data summaries can vary depending on whether we (unintentionally) calculate weighted averages, or adjust estimates for imbalances by taking a simple average across subgroups. As these are two different estimands, the choice affects the comparability of studies – including an original study and its direct replication.

Clustering in the data affects statistical uncertainty intervals

Obstacles to replication in corpus linguistics

corpus linguistics

replication crisis

regression

clustered data

This blog post is part of a small series on obstacles to replication in corpus linguistics. It deals with a prevalent issue in corpus data analysis: the non-independence of data points that results from clustered (or hierarchical) data layouts. I show how an inadequate analysis can produce unduly narrow expectations of a replication study.

Unbalanced distributions and their consequences: Speakers in the Spoken BNC2014

corpus linguistics

clustered data

negative binomial

imbalance

This blog post illustrates how the disproportionate representation of speakers in a corpus can lead to distorted results if the source of data points (i.e. the speaker ID) is not taken into account in the analysis.

Modeling the interpretation of quantifiers using beta regression

regression

distributional modeling

This blog post shows how to use beta regression to model the proportional interpretation of the quantifiers few, some, many, and most. We consider variable-dispersion and mixed-effects structures as well as diagnostics for frequentist and Bayesian models.

Different parameterizations of the negative binomial distribution

corpus linguistics

dispersion

negative binomial

This blog post discusses two different parameterizations of the negative binomial distribution and groups R packages (and functions) based on the version they implement.

The negative binomial distribution: A visual explanation

corpus linguistics

dispersion

negative binomial

This blog post uses a visual approach to explain how the negative binomial distribution works.

A computational shortcut for the dispersion measure D_A

corpus linguistics

dispersion

This short blog post draws attention to the computational shortcut given in Wilcox (1973) for calculating the dispersion measure D_A.
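Assuming D_A rests on the absolute differences between all pairs of corpus parts, the costly step is the sum over pairs, which grows quadratically with the number of parts. For values sorted in ascending order, that sum has a closed form: the i-th of k sorted values receives the coefficient 2i − k − 1. The Python sketch below checks this identity on invented rates. Whether it matches the exact form given in Wilcox (1973) is an assumption, and the function names are hypothetical.

```python
from itertools import combinations

def pairwise_abs_sum_naive(values):
    """Sum of |a - b| over all unordered pairs: quadratic in the number of values."""
    return sum(abs(a - b) for a, b in combinations(values, 2))

def pairwise_abs_sum_sorted(values):
    """Same sum via sorting: the i-th smallest of k values gets coefficient 2i - k - 1."""
    k = len(values)
    return sum((2 * i - k - 1) * x for i, x in enumerate(sorted(values), start=1))

rates = [0.0, 1.5, 2.0, 4.5, 7.0]  # invented normalized frequencies, one per corpus part
naive_total = pairwise_abs_sum_naive(rates)
sorted_total = pairwise_abs_sum_sorted(rates)
print(naive_total, sorted_total)
```

The sorted version needs one sort and one pass over the data, which matters once the corpus parts are individual text files and number in the thousands.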

The replication crisis: Implications for myself

replication crisis

open science

In this blog post, I reflect on the ways in which learning about the replication crisis in science has affected my own work.

Structured down-sampling: Implementation in R

corpus linguistics

down-sampling

This blog post shows how to implement structured down-sampling in R.

Two types of down-sampling in corpus-based work

corpus linguistics

down-sampling

This short blog post contrasts the different ways in which the term down-sampling is used in corpus-based work.

‘Dispersion’ in corpus linguistics and statistics

corpus linguistics

dispersion

This blog post clarifies the different ways in which the term dispersion is used in corpus linguistics and statistics.