1 Introduction
The focus of the present study is on corpus-based variationist research, where corpus data are used to shed light on some type of alternation phenomenon. Classic examples of this type of research include the dative, genitive, and comparative alternations in Present-day English.
The goal is to work toward a unified approach to the statistical analysis of this type of corpus data. The framework I have chosen is multilevel (mixed-effects) regression, which is one of the most widely used tools for modeling variationist data. As I will try to lay out in the following chapters, it is also a remarkably useful tool, since it allows us to summarize data in an insightful way while closely integrating the researcher’s objectives and our knowledge about the structure of the data. The current treatment elaborates on earlier accounts of multilevel modeling of language data by giving special consideration to these two guideposts of language data modeling.
One of the key themes, which is dealt with in Chapter 2, is the role of the researcher’s objectives when analyzing variationist corpus data. It turns out that scientific (and therefore statistical) inferences can be categorized along two dimensions, in terms of their type and scope. Being clear about how the goals of a study relate to these dimensions helps us with the specification and interpretation of statistical models.
Chapter 3 then describes the way in which statistical models can help us pursue our linguistic goals, and how their form may vary depending on the type and scope of inference. The discussion is supported by illustrative demonstrations that allow us to recognize how alterations to a model affect the meaning and interpretation of statistical uncertainty estimates such as confidence intervals. I will also deal with two fundamental concepts in statistical theory – the methodological device of random sampling and the notion of a population – and lay out their role and relevance in variationist corpus research.
Chapter 4 turns to the second key feature of corpus data, their structural layout. By this, I mean what may be considered a variationist data universal: The tokens, or instances of the structure of interest, are always clustered in the sense that there will be multiple tokens from the same text (speaker or author). Depending on the linguistic structure studied, there may also be multiple tokens per item (word form, lexical item, or lemma). I will refer to this as the structural component of the data and contrast it with the systematic component, which includes the set of predictor variables that are assumed to show an association with the choices speakers make. Borrowing heavily from the literature on the design and analysis of experiments, I will present a simple template that researchers may fill in to recognize the relation between the variables (structural and systematic) in their data. This template then provides guidance for model specification, as it brings into view the components that could in principle be included in a mixed-effects model, and whose exclusion brings with it an additional assumption that the model is forced to make.
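To give a first impression of this layout, the following minimal sketch simulates variationist data with crossed text and item clusters. All names, sizes, and parameter values are purely illustrative and are not drawn from any of the case studies in this book; the systematic component enters through a single token-level predictor, the structural component through text- and item-level intercepts.

```python
# Minimal sketch (illustrative values only): clustered variationist data with a
# systematic component (token-level predictor) and a structural component
# (crossed text and item intercepts).
import numpy as np

rng = np.random.default_rng(1)
n_texts, n_items, n_tokens = 40, 25, 2000

text = rng.integers(0, n_texts, n_tokens)   # which text each token comes from
item = rng.integers(0, n_items, n_tokens)   # which lexical item each token involves
x = rng.normal(size=n_tokens)               # systematic component: a token-level predictor

u_text = rng.normal(0, 1.0, n_texts)        # structural component: text-level intercepts
u_item = rng.normal(0, 0.5, n_items)        # structural component: item-level intercepts

logit = -0.5 + 0.8 * x + u_text[text] + u_item[item]
p = 1 / (1 + np.exp(-logit))
y = rng.binomial(1, p)                      # observed variant choice (0/1)
```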
Chapter 5 elaborates on the dialogue between model specification and research objectives and pays particular attention to the use of random effects in data analysis. I will discuss the distinction between fixed and random effects, which remains a source of confusion and dispute throughout the language sciences. We will lay out the dimensions along which these two classes have been contrasted, which relate to very different features of empirical research, including study design, characteristics of the variable as such, and the researcher’s objectives. A variable’s status as fixed vs. random may therefore vary across dimensions, which helps us make sense of the confusion in the literature. I will consider fixed effects and random effects as prototypes, which allows us to appreciate where on a continuum a specific variable may be located.
Chapter 6 then turns to the concrete task of specifying a regression model for variationist corpus data. The implications of the template presented in Chapter 4 will be illustrated using a case study on the variable (ING), sometimes referred to as “g-dropping”.
Chapter 7 will deal in more detail with the treatment of what I will refer to as token-level predictors. These are variables that are coded at the level of the individual corpus hits (rather than attributes of higher-level units such as texts or lexical items). Such token-level predictors play a special role in multilevel modeling, since their association with the outcome can be partitioned into what are referred to as between-cluster and within-cluster components. While this partition is discussed prominently in other fields of study (e.g. research on education), it has – to my knowledge – so far been largely neglected in language research. As I will illustrate, however, this partitioning of the variation of token-level predictors allows us to address informative research questions.
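As a small illustration of this partitioning (using a hypothetical predictor and made-up values, not data from the book), cluster-mean centering splits a token-level variable into a between-text component (the text means) and a within-text component (each token’s deviation from its text mean):

```python
# Minimal sketch (hypothetical variable names): splitting a token-level predictor
# into between-cluster and within-cluster components by cluster-mean centering.
import pandas as pd

df = pd.DataFrame({
    "text": ["a", "a", "a", "b", "b", "c"],
    "speech_rate": [4.1, 5.0, 4.6, 3.2, 3.8, 5.5],   # token-level predictor (illustrative)
})

df["rate_between"] = df.groupby("text")["speech_rate"].transform("mean")  # text means
df["rate_within"] = df["speech_rate"] - df["rate_between"]                # deviations from text mean

# Entering rate_between and rate_within as separate predictors lets a model estimate
# distinct between-text and within-text associations with the outcome.
print(df)
```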
Chapter 8 discusses the potential of mixed-effects models to fruitfully combine two fundamentally different approaches to empirical work, namely what has been referred to as the nomothetic vs. idiographic orientation. The former seeks generalizations; the latter represents a case-study-type approach that looks in depth at a (much) smaller number of subjects. A brief historical digression will show that the relevance of both perspectives has found its way into variationist research, including the more recent use of mixed-effects regression analysis. I will demonstrate how this modeling framework may be used to this end and will draw attention to an important pitfall that does not seem to have received much attention from practitioners.
Chapter 9 is practical in nature and deals with various modeling tactics.
Chapter 10 turns to an important topic in mixed-effects modeling of categorical outcome variables: The different types of model-based predictions (or estimates) on the proportion scale. It discusses the important distinction between what are often referred to as conditional (or cluster-specific) and marginal (or population-averaged) predictions, which is arguably discussed too rarely in the methodological literature on language data analysis. I will also lay out and contrast different ways of adjusting for variables in the model when forming predictions for model interpretation.
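The following sketch illustrates this distinction with assumed parameter values (an intercept-only logistic mixed model; the numbers are not taken from any analysis in this book): the conditional prediction evaluates the inverse logit at the fixed intercept, while the marginal prediction averages the inverse logit over the distribution of cluster effects, which pulls the predicted proportion toward .5.

```python
# Minimal sketch (assumed parameter values): conditional vs. marginal predicted
# proportions in a logistic mixed model with a random intercept u ~ N(0, sigma^2).
import numpy as np
from scipy.special import expit

beta0, sigma = 1.0, 1.5                     # fixed intercept and random-intercept SD (illustrative)
u = np.random.default_rng(0).normal(0, sigma, 100_000)

conditional = expit(beta0)                  # prediction for a typical cluster (u = 0)
marginal = expit(beta0 + u).mean()          # prediction averaged over the cluster distribution

print(round(conditional, 3), round(marginal, 3))   # roughly 0.73 vs. roughly 0.68
```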
Chapters 11 and 12 present two case studies.
…
- How can language data be thought about in a systematic way?
- Try to identify some sort of unity that will be helpful and constructive
- How can language data be studied quantitatively in a way that advances understanding?
- Corpus data are analyzed with a variety of methodological approaches
- Mixed-effects regression modeling has emerged as one of the chief methodological approaches.
- Why research on alternation phenomena is useful [Arppe_etal2010, pp. 13-15]
- causal inference ought to be an increasingly prominent concern, to which corpus-based analyses do not always pay sufficient heed
- recognize archetypes
- consider approaches with a common language
- prototypical analyses
- “too often, frameworks are siloed within specific disciplines, clouded by domain-specific language”
- “this makes methods harder to discover, and hides their general applicability”
- “it can be hard for practitioners outside of these fields to recognize when the problem they are facing fits one of these paradigms”