Drawing structured dispersion plots in R

corpus linguistics

dispersion

data visualization

This short note describes how to draw dispersion plots that include information about the design (and structure) of a corpus in R.

Author

Affiliation

Lukas Sönning

University of Bamberg

Published

November 11, 2025

R setup

# install development version directly from Github
#pak::pak("lsoenning/tlda")
#pak::pak("lsoenning/wls")

library(tlda)      # for access to datasets
library(wls)       # for custom ggplot theme
library(tidyverse) # for data wrangling
library(ggh4x)     # for drawing nested facets in ggplot
library(uls)       # pak::pak("lsoenning/uls")

Dispersion plots were introduced into corpus linguistics in the mid-1990s by Mike Scott and his popular software WordSmith Tools. They are useful for examining the distribution of an item in a text or corpus. The typical layout of a dispersion plot shows a layered arrangement, with each row showing a text file in the corpus. Thin vertical bars then show where in the document a word occurs. If you want to draw such a plot in R, the {quanteda} package (Benoit et al. 2018) is worth checking out (see Section Lexical dispersion plot on the package website).

If we are working with a corpus that consists of different text categories (such as registers, genres, etc.), it makes sense to enrich dispersion plots with information about the structure of the corpus. In this blog post, we look at one way of doing so, which I will refer to as a structured dispersion plot. It strings together all text files in the corpus and locates the occurrences of an item throughout the entire corpus. The structure of the text files is reflected through one or several annotation layers above the dispersion plot; this way, we can see in which sub-corpus a particular instance occurred.

Information about corpus position and source text of item

To draw a dispersion plot, we need information about the location (or position) of each occurrence of the item in the corpus. For illustration, we will consider the distribution of two word forms, methods and thirteen, in ICE-GB (Nelson, Wallis, and Aarts 2002). Biber et al. (2016) expect methods to be writing-skewed and thirteen to be speech-skewed. Both word forms are part of a list of items that was compiled by Biber et al. (2016) to study the behavior of dispersion measures in different distributional settings. The 150 items are intended to cover a broad range of frequency and dispersion levels (see Sönning 2025a and help("biber150_ice_gb")).

We start by loading a data frame that lists every word and non-word token in the corpus and contains 1,072,393 rows in total. The table has three columns:

text_file the text files in which the item occurs
item the item, i.e. word form
corpus_position the corpus position of the item

ice_gb <- readRDS("ice_gb.rds")
str(ice_gb)

'data.frame':   1072393 obs. of  3 variables:
 $ text_file      : chr  "s1a-001" "s1a-001" "s1a-001" "s1a-001" ...
 $ item           : chr  NA NA NA NA ...
 $ corpus_position: int  1 2 3 4 5 6 7 8 9 10 ...

The column item only identifies the word forms that are part of Biber et al.’s (2016) list; other tokens have been replaced by NA. The item methods occurs 63 times in ICE-GB, thirteen 27 times:

sum(ice_gb$item == "methods", 
    na.rm = TRUE)

[1] 63

sum(ice_gb$item == "thirteen", 
    na.rm = TRUE)

[1] 27

Information about corpus structure

The ICE family corpora rely on a standardized sampling frame (see here). Relevant metadata about the 500 text files in ICE-GB is provided in the data object metadata_ice_gb in the {tlda} package (Sönning 2025b). See help("metadata_ice_gb") for more information about this data table.

str(metadata_ice_gb)

'data.frame':   500 obs. of  7 variables:
 $ text_file    : chr  "s1a-001" "s1a-002" "s1a-003" "s1a-004" ...
 $ mode         : chr  "spoken" "spoken" "spoken" "spoken" ...
 $ text_category: Ord.factor w/ 4 levels "dialogues"<"monologues"<..: 1 1 1 1 1 1 1 1 1 1 ...
 $ macro_genre  : Ord.factor w/ 12 levels "private_dialogues"<..: 1 1 1 1 1 1 1 1 1 1 ...
 $ genre        : Ord.factor w/ 32 levels "face_to_face_conversations"<..: 1 1 1 1 1 1 1 1 1 1 ...
 $ genre_short  : Ord.factor w/ 32 levels "con"<"ph"<"les"<..: 1 1 1 1 1 1 1 1 1 1 ...
 $ word_count   : int  2195 2159 2287 2290 2120 2060 2025 2177 2063 2146 ...

Let us briefly explore the structure of ICE-GB. Starting at the broadest level of classification, it covers two modes of production. The number of spoken and written text files differs:

ftable(
  metadata_ice_gb$mode, 
  row.vars = 1)

            
spoken   300
written  200

At the next level, the corpus is divided into four text categories:

ftable(
  metadata_ice_gb$text_category, 
  row.vars = 1)

                
dialogues    180
monologues   120
non_printed   50
printed      150

And these break down further into 12 macro genres:

ftable(
  metadata_ice_gb$macro_genre, 
  row.vars = 1)/500

                           
private_dialogues      0.20
public_dialogues       0.16
unscripted_monologues  0.14
scripted_monologues    0.10
student_writing        0.04
letters                0.06
academic_writing       0.08
popular_writing        0.08
reportage              0.04
instructional_writing  0.04
persuasive_writing     0.02
creative_writing       0.04

When analyzing the dispersion of an item in ICE-GB, this structure needs to be taken into account.

Data preparation

Importantly, the classification variables denoting text varieties (text_category, macro_genre, and genre) are already ordered based on the sampling frame that informs the design of the ICE family of corpora. In metadata_ice_gb, they are represented as ordered factors (see above). This is important for visualization later on, because we want to order the text files (and higher-level text categories) in a sensible way.

We now need to combine the information contained in the two tables. The linking column is text_file, which allows us to join ice_gb with metadata_ice_gb:

ice_gb <- full_join(
  ice_gb, 
  metadata_ice_gb)

This yields a data frame with more information about each token in the corpus:

str(ice_gb)

'data.frame':   1072393 obs. of  9 variables:
 $ text_file      : chr  "s1a-001" "s1a-001" "s1a-001" "s1a-001" ...
 $ item           : chr  NA NA NA NA ...
 $ corpus_position: int  1 2 3 4 5 6 7 8 9 10 ...
 $ mode           : chr  "spoken" "spoken" "spoken" "spoken" ...
 $ text_category  : Ord.factor w/ 4 levels "dialogues"<"monologues"<..: 1 1 1 1 1 1 1 1 1 1 ...
 $ macro_genre    : Ord.factor w/ 12 levels "private_dialogues"<..: 1 1 1 1 1 1 1 1 1 1 ...
 $ genre          : Ord.factor w/ 32 levels "face_to_face_conversations"<..: 1 1 1 1 1 1 1 1 1 1 ...
 $ genre_short    : Ord.factor w/ 32 levels "con"<"ph"<"les"<..: 1 1 1 1 1 1 1 1 1 1 ...
 $ word_count     : int  2195 2195 2195 2195 2195 2195 2195 2195 2195 2195 ...

Preparation for plotting

Three more preparatory steps are necessary. First, we replace the NA code in the column item with a blank space (” “), as this will make it easier to plot the data:

ice_gb$item <- ifelse(
  is.na(ice_gb$item), 
  " ",
  ice_gb$item)

Then we convert corpus_position into a factor (position_fct):

ice_gb <- ice_gb |> 
    mutate(position_fct = factor(
        corpus_position, levels = 1:nrow(ice_gb)
    ))

And finally, to have nicer labels in the plot, I will replace the “_” symbols in the macro genre labels with line breaks. This is optional.

ice_gb$macro_genre_nice <- factor(
    ice_gb$macro_genre,
    labels = str_replace(
      levels(ice_gb$macro_genre), 
      pattern = "_", replacement = "\n"))

ice_gb$text_category_nice <- factor(
    ice_gb$text_category,
    labels = str_replace(
      levels(ice_gb$text_category), 
      pattern = "_", replacement = "-"))

Drawing the structured dispersion plot

Now we are ready for plotting. The following annotated code draws a structured dispersion plot. It uses the function facet_nested() from the {ggh4x} package (van den Brand 2024) to draw nested facets. The function theme_dispersion_plot() from the {wls} package (Sönning 2025c) adjusts the ggplot2 theme for a nicer appearance. We start by drawing a structured dispersion plot for methods, with two structural annotation layers: text category (4 categories) nested within mode (2 categories):

ice_gb |> 
  mutate(item_location = ifelse(          # create new column indicating the
    item == "methods",                    #   occurrence of the item (for color 
    item, "other")) |>                    #   coding further below)
  ggplot(aes(x = position_fct)) +         # corpus position as the x-variable
  geom_segment(aes(                       # draw vertical bars at location of
    x = position_fct,                     #   occurrence (position_fct),
    xend = position_fct,                  #   extending from 0 to 1 vertically  
    y = 0, yend = 1,                      #
    color = item_location)) +             # only draw bars where item occurs
  facet_nested(                           # nested facets with the {ggh4x} package:
    . ~ mode + text_category_nice,        #   text categ. facets nested within mode
    scales = "free",                      #   allow x-scale to vary across facets
    space = "free_x") +                   #   facet width proportional to length
  scale_color_manual(                     # set color manually so bars appear 
    values = c("black", "transparent")) + #   only where item occurs
  scale_x_discrete(expand = c(0,0)) +     # avoid left/right padding in facets 
  scale_y_continuous(expand = c(.1,.1)) + # add some top/bottom padding
  theme_bw() +                            # specify theme_bw() as basis
  theme_dispersion_plot() +               # custom theme for dispersion plot
  xlab(NULL) + ylab(NULL)                 # no axis labels

Figure 1: Structured dispersion plot showing the distribution of *methods* in ICE-GB.

In line with Biber et al. (2016)’s expectation, methods is skewed toward written usage. The vertical bars gravitate toward the right, where written texts are located. Let us zoom in to the written part, to examine whether this item is associated with specific genres. Using the filter() function in the {dplyr} package, we can isolate the written part of the corpus. We also change the facetting scheme – annotation layers now show macro genre (8 categories) nested within text category (2 categories), nested within mode (here only 1 category):

Draw Figure

ice_gb |> 
    mutate(item_location = ifelse(
            item == "methods",
            item, "other")) |>
    filter(mode == "written") |>             # filter: written texts only
    ggplot(aes(x = position_fct)) +
    geom_segment(aes(
        x = position_fct,
        xend = position_fct,
        y = 0, yend = 1,
        color = item_location)) +
    facet_nested(
      . ~ mode + text_category_nice + macro_genre_nice,  # new facetting scheme
      scales = "free",
      space = "free_x") +
    scale_color_manual(
      values = c("black", "transparent")) +
    scale_x_discrete(expand = c(0,0)) +
    scale_y_continuous(expand = c(.1,.1)) +
    theme_bw() +
    theme_dispersion_plot() +
    xlab(NULL) + ylab(NULL)

Figure 2: Structured dispersion plot showing the distribution of *methods* in the written part of ICE-GB.

Perhaps unsurprisingly, occurrences of methods are most densely clustered in the macro genre academic writing.

Let’s also look at a structured dispersion plot for thirteen, which shows the reverse pattern – it is more common in speech. Note that the annotation layers above the plot now include mode (2 categories) and text category (4 categories).

Draw Figure

ice_gb |> 
    mutate(item_location = ifelse(          # 
            item == "thirteen",                 # select new item 
            item, "other")) |>                  #
    ggplot(aes(x = position_fct)) +
    geom_segment(aes(
        x = position_fct,
        xend = position_fct,
        y = 0, yend = 1,
        color = item_location)) +
    facet_nested(
      . ~ mode + text_category_nice,
      scales = "free",
      space = "free_x") +
    scale_color_manual(
      values = c("transparent", "black")) +
    scale_x_discrete(expand = c(0,0)) +
    scale_y_continuous(expand = c(.1,.1)) +
    theme_bw() +
    theme_dispersion_plot() +
    xlab(NULL) + ylab(NULL)

Figure 3: Structured dispersion plot showing the distribution of *thirteen* in ICE-GB.

While I think structured dispersion plots can be useful in corpus analysis, I am not really happy with the way they are implemented here: Slow and lots of (unelegant) code. If you have a better idea of how to draw such plots in R, please let me know!

References

Benoit, Kenneth, Kohei Watanabe, Haiyan Wang, Paul Nulty, Adam Obeng, Stefan Müller, and Akitaka Matsuo. 2018. “Quanteda: An r Package for the Quantitative Analysis of Textual Data.” Journal of Open Source Software 3 (30): 774. https://doi.org/10.21105/joss.00774.

Biber, Douglas, Randi Reppen, Erin Schnur, and Romy Ghanem. 2016. “On the (Non)utility of Juilland’sDto Measure Lexical Dispersion in Large Corpora.” International Journal of Corpus Linguistics 21 (4): 439–64. https://doi.org/10.1075/ijcl.21.4.01bib.

Nelson, Gerald, Sean Wallis, and Bas Aarts. 2002. Exploring Natural Language: Working with the British Component of the International Corpus of English. John Benjamins. https://doi.org/10.1075/veaw.g29.

Sönning, Lukas. 2025a. “Biber et al.’s (2016) set of 150 BNC items for the analysis of dispersion measures: Dataset for ‘Evaluation of text-level measures of lexical dispersion’.” DataverseNO. https://doi.org/10.18710/ATCQZW.

———. 2025b. Tlda: Tools for Language Data Analysis. https://github.com/lsoenning/tlda.

———. 2025c. Wls: R Utilities for Workshops Taught by Lukas Soenning. https://github.com/lsoenning/wls.

van den Brand, Teun. 2024. Ggh4x: Hacks for ’Ggplot2’. https://doi.org/10.32614/CRAN.package.ggh4x.

Citation

BibTeX citation:

@online{sönning2025,
  author = {Sönning, Lukas},
  title = {Drawing Structured Dispersion Plots in {R}},
  date = {2025-11-11},
  url = {https://lsoenning.github.io/posts/2025-11-11_structured_dispersion_plot/},
  langid = {en}
}

For attribution, please cite this work as:

Sönning, Lukas. 2025. “Drawing Structured Dispersion Plots in R.” November 11, 2025. https://lsoenning.github.io/posts/2025-11-11_structured_dispersion_plot/.