Two types of down-sampling in corpus-based work

corpus linguistics
methodology
down-sampling
terminology
This short blog post contrasts the different ways in which the term down-sampling is used in corpus-based work.
Author: Lukas Sönning
Affiliation: University of Bamberg
Published: November 17, 2023

Corpora often yield more data than certain types of linguistic analysis can handle. Researchers must then select a subset of the data, a process that can be referred to as “down-sampling”. Currently, the term is used for two very different types of down-sizing.

The first type deals with lists of occurrences extracted from a corpus and is used in studies that start out with a corpus query and a body of hits (often in the form of concordance lines). If the structure of interest is relatively frequent and/or the source corpus is large, the researcher may need to reduce the number of data points studied. This is particularly likely in variationist-type research, which often involves considerable manual work (e.g. disambiguation and annotation). In this form of down-sampling, the selection of elements usually proceeds (to some extent) at random, i.e. it involves a chance component. Simple techniques are implemented in corpus software, which allows users to draw a random sample from a list of hits. In CQPweb (Hardie 2012), for instance, this option is referred to as “thinning”. Depending on our research goals and the structure of our data, however, other strategies may be more efficient (e.g. structured down-sampling, see Sönning and Krug 2022).
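To make the contrast between simple and structured selection concrete, here is a minimal Python sketch. It rests on several assumptions: hits are represented as dictionaries with a hypothetical `text_id` field, and the function names `thin` and `thin_by_group` are made up for illustration. The first function draws a plain random sample, in the spirit of thinning. The second is only one possible reading of a structured strategy, which spreads the sample as evenly as possible across texts or speakers; it is not necessarily the procedure described in Sönning and Krug (2022), nor how CQPweb works internally.

```python
import random
from collections import defaultdict


def thin(hits, n, seed=None):
    """Simple random down-sampling: draw n hits without replacement."""
    rng = random.Random(seed)  # a fixed seed makes the sample reproducible
    if n >= len(hits):
        return list(hits)
    # Sample positions, then sort them so the original corpus order is kept.
    keep = sorted(rng.sample(range(len(hits)), n))
    return [hits[i] for i in keep]


def thin_by_group(hits, n, key, seed=None):
    """Down-sample so that groups (e.g. texts or speakers) contribute
    as evenly as possible. Illustrative assumption, not a published method."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for hit in hits:
        groups[key(hit)].append(hit)
    pools = list(groups.values())
    # Shuffle within each group so the hits taken from it are random.
    for pool in pools:
        rng.shuffle(pool)
    # Shuffle the groups so chance decides who fills any leftover slots.
    rng.shuffle(pools)
    sample = []
    while len(sample) < n and pools:
        # One round: take one hit from every group that still has hits.
        for pool in pools:
            if len(sample) == n:
                break
            sample.append(pool.pop())
        pools = [pool for pool in pools if pool]
    return sample
```

With `key=lambda hit: hit["text_id"]`, a text that contributes hundreds of hits can no longer dominate the sample: each text supplies one hit per round until it runs out of hits or the target size is reached.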

The second type of down-sampling is concerned with the selection of texts for close reading. Here, the objective is to pick from a corpus those texts that are likely to be most informative for a thorough qualitative analysis. This method, which Gabrielatos et al. (2012) refer to as “targeted down-sampling”, uses surface-level features (such as the occurrence rate of certain forms) to detect relevant documents for a critical discourse analysis (see also Baker et al. 2008, 285). A procedure much in the same spirit is discussed in Anthony and Baker (2015), where prototypical exemplars, i.e. texts that are most representative of their corpus of origin, are selected based on keyword profiles.
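The general idea of using a surface-level feature to shortlist texts can be sketched as follows. This is a deliberately simplified illustration, not the procedure of Gabrielatos et al. (2012) or the keyword-based method of Anthony and Baker (2015): it assumes a corpus stored as a dictionary from text ids to raw text, single-word target forms, crude regex tokenization, and a ranking by occurrences per 1,000 tokens. The names `occurrence_rate` and `select_texts` are hypothetical.

```python
import re


def occurrence_rate(text, forms):
    """Occurrences of the target forms per 1,000 word tokens."""
    tokens = re.findall(r"\w+", text.lower())
    if not tokens:
        return 0.0
    targets = {form.lower() for form in forms}
    count = sum(1 for token in tokens if token in targets)
    return 1000 * count / len(tokens)


def select_texts(corpus, forms, k):
    """Return the ids of the k texts with the highest occurrence rate.

    Ties are broken alphabetically by text id so the result is deterministic.
    """
    rates = {
        text_id: occurrence_rate(text, forms)
        for text_id, text in corpus.items()
    }
    ranked = sorted(rates, key=lambda text_id: (-rates[text_id], text_id))
    return ranked[:k]
```

Unlike the first type of down-sampling, nothing here is left to chance: the texts are chosen precisely because they score highest on the feature of interest.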

It may therefore be helpful to distinguish the two types of down-sampling explicitly: we could call the first type “selection of concordance lines for annotation” and the second type “selection of texts for close reading”.

References

Anthony, Laurence, and Paul Baker. 2015. “ProtAnt: A Tool for Analysing the Prototypicality of Texts.” International Journal of Corpus Linguistics 20 (3): 273–92. https://doi.org/10.1075/ijcl.20.3.01ant.
Baker, Paul, Costas Gabrielatos, Majid KhosraviNik, Michał Krzyżanowski, Tony McEnery, and Ruth Wodak. 2008. “A Useful Methodological Synergy? Combining Critical Discourse Analysis and Corpus Linguistics to Examine Discourses of Refugees and Asylum Seekers in the UK Press.” Discourse & Society 19 (3): 273–306. https://doi.org/10.1177/0957926508088962.
Gabrielatos, Costas, Tony McEnery, Peter J. Diggle, and Paul Baker. 2012. “The Peaks and Troughs of Corpus-Based Contextual Analysis.” International Journal of Corpus Linguistics 17 (2): 151–75. https://doi.org/10.1075/ijcl.17.2.01gab.
Hardie, Andrew. 2012. “CQPweb — Combining Power, Flexibility and Usability in a Corpus Analysis Tool.” International Journal of Corpus Linguistics 17 (3): 380–409. https://doi.org/10.1075/ijcl.17.3.04har.
Sönning, Lukas, and Manfred Krug. 2022. “Comparing Study Designs and Down-Sampling Strategies in Corpus Analysis: The Importance of Speaker Metadata in the BNCs of 1994 and 2014.” In Data and Methods in Corpus Linguistics, 127–60. Cambridge University Press. https://doi.org/10.1017/9781108589314.006.

Citation

BibTeX citation:
@online{sönning2023,
  author = {Sönning, Lukas},
  title = {Two Types of Down-Sampling in Corpus-Based Work},
  date = {2023-11-17},
  url = {https://lsoenning.github.io/posts/2023-11-17_downsampling_two_types/},
  langid = {en}
}
For attribution, please cite this work as:
Sönning, Lukas. 2023. “Two Types of Down-Sampling in Corpus-Based Work.” November 17, 2023. https://lsoenning.github.io/posts/2023-11-17_downsampling_two_types/.