2 Descriptive statistics

Under construction

These notes are currently under construction and may contain errors. Feedback of any kind is very much welcome: lukas.soenning@uni-bamberg.de

This chapter provides an overview of strategies for describing the distribution of ordinal variables. Since the categories are ordered, there are two types of relative frequency that are important for summarizing their distribution. These are discussed in Section 2.1. Section 2.2 then deals with data visualization.

2.1 Relative frequencies: Category-specific vs. aggregated

The distribution of categorical variables is described using relative frequencies such as proportions or percentages. The choice between the two is arbitrary, as they express the same information. We will usually use proportions (see Harrell 2018), and we report them without a zero before the decimal point (i.e. “.10” instead of “0.10”) (APA 2020, 182).

Another metric that is often used for categorical data is probability. While probabilities can be understood as meaning the same thing as proportions, they are often used to describe quantities derived from a model, i.e. predicted (or estimated) relative frequencies. Here, we will use the term proportion for data description (observed distributions), and the term probability for model-based predictions or estimates (predicted distributions).

For ordinal data, two types of probabilities (or proportions/percentages) are important: Category-specific probabilities (also called response probabilities) and aggregated probabilities. Aggregated probabilities can be broken down into cumulative probabilities, which collapse categories starting from the lower end of the scale, and exceedance probabilities, which collapse categories starting from the upper end of the scale. The distinction between category-specific and aggregated probabilities is crucial for the interpretation and communication of results.

To illustrate, let’s assume we have measured an outcome on a 5-point Likert-type agreement scale. The observed distribution of responses is shown in Figure 2.1.

Figure 2.1: Relative frequencies for ordinal variables: Category-specific, cumulative, and exceedance proportions.

2.1.1 Category-specific proportions

Category-specific (or response) proportions give the relative frequency of each of the 5 outcome categories. For instance, we can see that most of the respondents “disagreed” (a share of .40). Listing all individual proportions gives an exhaustive account of the distribution of the outcome. This is useful if we are interested in one (or each) category in its own right, or when detailed information is required about the distribution across categories.

2.1.2 Aggregated frequencies: Cumulative and exceedance proportions

Aggregated proportions add up the individual proportions along the ordinal scale: They express the share of observations that are in or below a particular category (cumulative proportions) or above a particular category (exceedance proportions). For instance, looking at the cumulative proportions in Figure 2.1 (middle set of stacked bars), we can state that 30% of respondents either “agreed” or “strongly agreed”. In other words, 30% of the respondents at least “agreed”. This could be used as a measure of agreement.

Exceedance proportions, on the other hand, are the logical counterpart to cumulative proportions. They express the share of observations that exceed a particular category. The exceedance proportions in Figure 2.1 indicate that 50% of respondents either “disagreed” or “strongly disagreed”. For the ordinal scale at hand, exceedance proportions can therefore be used as a measure of disagreement.
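The three types of proportion can be computed directly from category counts. The following Python sketch is illustrative only: the counts are hypothetical, chosen to reproduce the shares mentioned in the text (.30 agreement, .40 for “disagree”, .50 disagreement). The split between “strongly agree” and “agree” and the function name `ordinal_proportions` are assumptions. The scale is assumed to run from “strongly agree” at the lower end to “strongly disagree” at the upper end, so that cumulative proportions reflect agreement, as in the text.

```python
from itertools import accumulate

def ordinal_proportions(counts):
    """Return category-specific, cumulative, and exceedance proportions.

    `counts` lists the number of responses per category, ordered from the
    lower to the upper end of the ordinal scale.
    """
    n = sum(counts)
    running = list(accumulate(counts))            # running totals from the lower end
    specific = [c / n for c in counts]            # share in each category
    cumulative = [r / n for r in running]         # share in or below each category
    exceedance = [(n - r) / n for r in running]   # share above each category
    return specific, cumulative, exceedance

# Hypothetical counts for 100 respondents (assumed, not taken from the data)
labels = ["strongly agree", "agree", "neutral", "disagree", "strongly disagree"]
counts = [10, 20, 20, 40, 10]

specific, cumulative, exceedance = ordinal_proportions(counts)
for label, p, c, e in zip(labels, specific, cumulative, exceedance):
    print(f"{label:18s} {p:.2f} {c:.2f} {e:.2f}")
```

The cumulative proportion at “agree” is the share of respondents who at least agreed, and the exceedance proportion at “neutral” is the share who disagreed or strongly disagreed.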

Cumulative and exceedance proportions can be helpful in certain settings, since category-specific proportions may offer too much information for some purposes. Individual response proportions also do not take into account the fact that the categories are ordered. Often, we need to condense the distribution into a single number to report on, say, the extent to which respondents agreed, on average. For instance, we may be interested in comparing the level of agreement in three groups – A, B and C. Comparing 15 response probabilities is harder than comparing three numbers summarizing the level of agreement. Thus, we could state that the percentage of respondents who agreed with the statement (i.e. responded “agree” or “strongly agree”) was 60% in group A, 40% in group B, and 35% in group C. Agreement was therefore highest in group A. For more complex comparisons, this condensation to a single number is very helpful.

Cumulative and exceedance proportions have advantages and disadvantages: On the one hand, they discard information in the data. Thus, a cumulative share of .30 does not tell us how many informants agreed strongly (only 5% or as many as 25%?). On the other hand, this reduction in information is beneficial for more complex comparisons, because single scores can be visualized and compared much more easily and effectively. This is a trade-off and each type serves different purposes.

When using cumulative and exceedance proportions for the interpretation and communication of results, two important choices need to be made:

Cognitive fit: When talking about an ordinal trait, you should make a decision about how you want (your audience) to think about the outcome. In our example, we could talk about the level of disagreement or the level of agreement. We should choose one, stick to it, and use the matching type of proportion. Here, cumulative proportions reflect agreement, while disagreement is indicated by exceedance proportions. It is also helpful for the y-axis to have an informative, transparent label (e.g. “Agreement”).
Split: If we are looking for a single cumulative proportion to represent the distribution of the ordinal variable, we need to decide where to split the ordinal scale. In our example, we chose to collapse “agree” and “strongly agree” to arrive at a proportion reflecting agreement (e.g. .30). There are different options for splitting an ordinal scale, and the split we choose should be informative for addressing the question we have in mind. While it usually makes sense to split at the midpoint of the scale, the number of response categories may be odd, in which case one category sits exactly at the midpoint. In our present data, for instance, we have a neutral category in the middle. In such cases we can divide the middle category in half and assign one half to each side. In a sense, the split is then still made at the midpoint of the ordinal scale.
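Dividing the middle category in half can be written as a small rule. The Python sketch below uses a hypothetical helper (`share_below_midpoint` is not part of these notes) and assumed proportions that match the running example (.30 agreement, .20 neutral), with the agreement categories at the lower end of the scale.

```python
def share_below_midpoint(proportions):
    """Aggregate the lower half of an ordinal scale.

    `proportions` are category-specific proportions ordered from the lower
    to the upper end. With an odd number of categories, half of the middle
    category is assigned to the lower half.
    """
    k = len(proportions)
    half = k // 2
    share = sum(proportions[:half])        # categories strictly below the midpoint
    if k % 2 == 1:
        share += proportions[half] / 2     # half of the middle category
    return share

# Assumed proportions: strongly agree, agree, neutral, disagree, strongly disagree
five_point = [.10, .20, .20, .40, .10]
agreement = share_below_midpoint(five_point)
print(round(agreement, 2))
```

With these assumed values the agreement score is .40: .30 for the two agreement categories plus half of the .20 neutral share.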

In short, category-specific proportions should be used when interest lies (i) in the detailed distribution of an ordinal outcome or (ii) in individual categories. Aggregated proportions should be used when a condensed summary measure is needed for more complex comparisons (such as trends across a continuous predictor or the comparison of several groups). The choice between cumulative and exceedance proportions, and the choice of split, should aim at providing a substantively meaningful summary score.

Table 2.1 gives the different relative frequencies of the ordinal response for each level of the variable Syntactic function in the heaps data.

	Category-specific proportions				Cumulative proportions				Exceedance proportions
Response	p_adj	verb	c_adj	quant	p_adj	verb	c_adj	quant	p_adj	verb	c_adj	quant
6	.11	.27	.39	.47	1.0	1.0	1.0	1.0	.00	.00	.00	.00
5	.08	.14	.14	.18	.89	.73	.61	.53	.11	.27	.39	.47
4	.07	.13	.12	.12	.81	.58	.46	.34	.19	.42	.54	.66
3	.05	.06	.07	.06	.74	.45	.34	.22	.26	.55	.66	.78
2	.18	.17	.11	.08	.69	.38	.27	.16	.31	.62	.73	.84
1	.52	.22	.17	.08	.52	.22	.17	.08	.48	.78	.83	.92

Table 2.1: Category-specific, cumulative, and exceedance proportions for the variable Syntactic function in the heaps data.
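As a quick check on how the three blocks of Table 2.1 relate to each other, the following Python sketch uses the p_adj column (values copied from the table and expressed in hundredths to avoid floating-point noise). The variable names are illustrative. Exceedance proportions are the complement of cumulative proportions, and running sums of the rounded category-specific proportions match the cumulative column up to rounding error.

```python
from itertools import accumulate

# p_adj column of Table 2.1, responses 1 to 6, expressed in hundredths
specific   = [52, 18, 5, 7, 8, 11]
cumulative = [52, 69, 74, 81, 89, 100]
exceedance = [48, 31, 26, 19, 11, 0]

# Exceedance is the complement of the cumulative proportion
complement_ok = all(c + e == 100 for c, e in zip(cumulative, exceedance))

# Running sums of the rounded category-specific values; these can differ
# from the tabulated cumulative values by a rounding error of about .01
running = list(accumulate(specific))
max_gap = max(abs(r - c) for r, c in zip(running, cumulative))

print(complement_ok)
print(running)
print(max_gap)
```

The discrepancy of one hundredth presumably arises because the tabulated cumulative values were computed from unrounded proportions.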

2.2 Visualizing ordinal data

The visualization of ordinal data is a surprisingly difficult task. In the following, we will survey different display types and options that may be considered. Given the importance of visual means of interpretation and communication, we will reflect on the relative merits of the available options and offer some recommendations.

There are a number of points we need to think about when graphing ordinal variables. Much hinges on the decision of whether to show category-specific or aggregated frequencies. The distinction was discussed above. An attractive feature of stacked bar charts and area charts is that they show both types of frequency in the same display. A simplification strategy that should be considered for more complex visualization tasks, where the distribution of an ordinal variable is to be compared across an array of conditions, is to choose a single representative aggregated probability. Such a condensed representation can help us avoid an overly cluttered display, and draw complex comparisons with more ease.

When graphing ordinal data, we need tools for categorical and continuous predictors. Categorical predictors are less problematic. For continuous predictors, on the other hand, we must choose between (i) retaining the variable as a continuous feature, to plot “smooth” trends; or (ii) using binning to create a discretized version of the variable. For strategy (ii), we can make use of the same tools as for categorical predictors.

The following plot types will be useful:

Bar charts (stacked and grouped)
Area charts
Line plots
Dot plots

Let us now consider different plot types and options for ordinal data visualization.

2.2.1 Categorical predictors

To show the distribution of an ordinal variable across the levels of a categorical predictor, we may decide to show category-specific or aggregated proportions.

2.2.1.1 Category-specific proportions

When showing individual response proportions for the levels of a categorical predictor, it makes sense to order the groups along the x-axis to produce a smooth pattern (i.e. monotonically increasing/decreasing probabilities). This will also make it easier to compare the strength of association across predictors including numeric ones. Figure 2.2 shows two graph types that may be used to show response proportions for discrete variables. The color scale runs from red (“never”) to grey (“likely”) – red therefore signals dispreference, and grey reflects acceptance/usage. This choice of fill colors facilitates the interpretation of the ordinal outcome variable.

The left-hand plot uses a grouped bar chart. The proportion of grey increases from left to right, which shows that the quantifier use of heaps is most acceptable. While grouped bar charts show the distribution quite well, they quickly become crowded.

The right-hand graph is a line plot, which represents individual response proportions as dots and connects them across the graph with lines. This graphical arrangement allows us to see more easily how the proportion of the extreme categories varies across subgroups.

Figure 2.2: Grouped bar chart and line plot showing response probabilities for a categorical predictor.

2.2.1.2 Cumulative and exceedance proportions

When showing the distribution of an ordinal variable using aggregated proportions, we must decide on the order of categories within each stack of bars: Should the highest or lowest category appear at the bottom? This decision is critical, since it is the segments that are aligned at the bottom of the display that will be most readily compared. In Figure 2.3, for instance, the grey segments appear at the bottom of the graph, so the primary visual signal is that the share of “grey” increases from left to right. In other words, the acceptability of heaps increases from left to right – it is highest for the quantifier use (“quan”). With this ordering of the categories (grey at the bottom, red at the top), the y-axis in the graph shows acceptability (rather than unacceptability).

Stacked bar charts

When using stacked bar charts, make sure you pay attention to the arrangement of ordinal categories, i.e. which end of the scale appears at the bottom and which at the top. This choice affects the message conveyed by the graph.

On our ordinal scale, red represents the lowest category (“never”) and grey the highest one (“likely”). This means that Figure 2.3 shows the highest category at the bottom, and the lowest at the top. The lines that divide the bars into segments therefore represent exceedance proportions. For the leftmost syntactic function (“p_adj”), the cut at about .20 means that about 20% of the responses exceed category 4 – that is, 20% of the responses are in category 5 or 6. Because the order of the response categories is reversed in Figure 2.3, this is quite challenging to recognize.
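The link between stacking order and the type of aggregated proportion can be made concrete. The Python sketch below stacks the p_adj proportions from Table 2.1 (in hundredths) and lists the height of each cut for both stacking orders. The helper name `cut_heights` is hypothetical, and small deviations from the columns of Table 2.1 reflect rounding.

```python
from itertools import accumulate

def cut_heights(specific, highest_at_bottom=True):
    """Heights of the lines dividing one stacked bar into segments.

    `specific` holds category-specific proportions ordered from the lowest
    to the highest category. The final cut (the top of the bar) is dropped.
    """
    order = specific[::-1] if highest_at_bottom else specific
    return list(accumulate(order))[:-1]

# p_adj column of Table 2.1, responses 1 to 6, in hundredths
p_adj = [52, 18, 5, 7, 8, 11]

# Highest category at the bottom: cuts are exceedance proportions
exceed_cuts = cut_heights(p_adj, highest_at_bottom=True)
# Lowest category at the bottom: cuts are cumulative proportions
cumul_cuts = cut_heights(p_adj, highest_at_bottom=False)

print(exceed_cuts)
print(cumul_cuts)
```

With the highest category at the bottom, the second cut from the bottom lies at 19 hundredths, which is the share of responses exceeding category 4.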

A variant of this display type uses diverging bars (see Heiberger and Robbins 2014): Rather than aligning bars at the baseline of 0, they are aligned at (what may be considered) the midpoint of the ordinal scale. Segments above this midpoint extend upward, and segments below the midpoint extend downward. Diverging bar charts add another visual cue for interpretation: The vertical (or horizontal) position of each stack.

The diverging version of Figure 2.3 appears in Figure 2.4. Note that, to preserve the interpretation of the graph as showing the level of acceptability, the order of the ordinal categories has been reversed: The red segments now appear at the bottom.

Diverging bar charts

Attention must again be paid to the arrangement of the ordered categories, i.e. which end of the scale appears at the bottom and which at the top. To encourage the same interpretation as an ordinary stacked bar chart, the order of the categories must be reversed.
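One way to compute the segment positions for a diverging bar is sketched below in Python. It assumes an even number of categories, with the baseline placed between the two middle categories (between responses 3 and 4 on the six-point scale). The function name `diverging_segments` is hypothetical, and the proportions are the p_adj column of Table 2.1 in hundredths.

```python
def diverging_segments(specific):
    """Return (start, end) positions for each category of a diverging bar.

    `specific` holds category-specific proportions ordered from the lowest
    to the highest category; the number of categories is assumed to be even.
    The lower half is stacked downward from 0, the upper half upward from 0.
    """
    k = len(specific)
    half = k // 2
    segments = [None] * k
    edge = 0
    for i in range(half - 1, -1, -1):      # middle-low category first, then outward
        segments[i] = (edge - specific[i], edge)
        edge -= specific[i]
    edge = 0
    for i in range(half, k):               # middle-high category first, then outward
        segments[i] = (edge, edge + specific[i])
        edge += specific[i]
    return segments

# p_adj column of Table 2.1, responses 1 to 6, in hundredths
p_adj = [52, 18, 5, 7, 8, 11]
segments = diverging_segments(p_adj)
for response, (start, end) in enumerate(segments, start=1):
    print(response, start, end)
```

In this layout the lowest category ends up at the bottom of the stack, and the upper end of the stack gives the share of responses above the midpoint.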

We can also use a line plot to show this distribution. The points then mark the exceedance proportions, which means that they show the location of the cutpoints in the stacked bar charts (see Figure 2.3). The points are then connected with lines. An example appears in Figure 2.5. The exceedance proportion for the highest category is redundant (it is always 0) and can be left out of the graph. This means that line plots of cumulative or exceedance proportions represent the distribution of \(k\) ordered categories with \(k-1\) linear profiles.

Figure 2.5: Line plot showing how the distribution of the ordinal variable varies across the levels of the categorical predictor Syntactic function. Since the highest category appears at the bottom (grey: ‘likely’), the points mark exceedance proportions.

Add alluvial diagram

Alluvial diagrams discourage the quantitative interpretation of the x-variable. Perhaps write a ggplot function to create them.

2.2.2 Numeric predictors

For numeric predictors, we can again look at either category-specific or aggregated proportions.

2.2.2.1 Category-specific proportions

Individual response proportions are perhaps best shown using a line plot. Figure 2.6 shows the category-specific sample proportions, which have been smoothed.

Figure 2.6: Line plot showing category-specific sample proportions.

2.2.2.2 Cumulative and exceedance proportions

Aggregated proportions can be graphed using an area chart. Again, the way in which the ordinal categories are arranged from bottom to top is critical for the immediate message conveyed by the graph. Since we would like the y-axis in our graph to show the acceptability of heaps, the highest category, shown in grey, appears at the bottom. This means that the order of the categories is reversed (as in Figure 2.3), since we need our viewers to see how the share of grey increases for younger speakers. Figure 2.7 shows the smoothed distribution of the sample proportions; the amount of smoothing influences the perception of the trends in the data. The lines that divide the rectangle into colored areas are again exceedance proportions, as in Figure 2.3.

Figure 2.7: Area chart showing how the distribution of the ordinal variable varies with the numeric predictor Age. Since the highest category appears at the bottom (grey: ‘likely’), the lines that divide up the rectangle show exceedance proportions.
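To illustrate how the amount of smoothing affects such a display, the Python sketch below smooths a single exceedance proportion across a numeric predictor with a Gaussian kernel. This smoother is a stand-in chosen for simplicity; the notes do not state which method produced the figures. The data, the bandwidth values, and the function name `smoothed_exceedance` are all hypothetical.

```python
import math

def smoothed_exceedance(x, y, threshold, grid, bandwidth):
    """Kernel-weighted share of responses above `threshold` at each grid point.

    `x` is the numeric predictor, `y` the ordinal response coded as integers.
    Larger `bandwidth` values give smoother (flatter) curves.
    """
    above = [1 if resp > threshold else 0 for resp in y]
    curve = []
    for g in grid:
        weights = [math.exp(-0.5 * ((xi - g) / bandwidth) ** 2) for xi in x]
        curve.append(sum(w * a for w, a in zip(weights, above)) / sum(weights))
    return curve

# Hypothetical data: age and a rating on a six-point scale
age    = [18, 22, 25, 31, 36, 40, 47, 52, 58, 63, 69, 74]
rating = [ 6,  5,  6,  4,  5,  3,  4,  2,  3,  1,  2,  1]

grid = [20, 40, 60]
rough  = smoothed_exceedance(age, rating, threshold=3, grid=grid, bandwidth=5)
smooth = smoothed_exceedance(age, rating, threshold=3, grid=grid, bandwidth=30)
print([round(p, 2) for p in rough])
print([round(p, 2) for p in smooth])
```

With the small bandwidth the curve follows local fluctuations closely, whereas the large bandwidth pulls all values toward the overall share, which flattens the apparent age trend.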

We can also use what is sometimes referred to as a spine plot, which is a stacked bar chart based on a binned version of a continuous variable on the x-axis. The width of the bars in this type of display is proportional to the number of observations in the bin. Such a graph appears in Figure 2.8, where 10-year bins are used. Apart from the change of response proportions across age, the graph also shows that the current dataset includes few respondents younger than 20 or older than 70.

Figure 2.8: Spine plot showing cumulative proportions for a numeric predictor.
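The quantities behind a spine plot can be computed as follows. The Python sketch bins a numeric predictor into 10-year intervals, then derives the relative bar width (the share of observations in each bin) and the category-specific proportions within each bin. The data, the three-point scale, and the helper name `spine_plot_data` are hypothetical.

```python
from collections import Counter, defaultdict

def spine_plot_data(x, y, bin_width, categories):
    """Bar widths and within-bin response proportions for a spine plot.

    `x` is the numeric predictor, `y` the ordinal response. Returns a dict
    mapping each bin's lower bound to (relative width, list of proportions),
    where proportions follow the order given in `categories`.
    """
    bins = defaultdict(list)
    for xi, yi in zip(x, y):
        lower = int(xi // bin_width) * bin_width   # lower bound of the bin
        bins[lower].append(yi)
    n = len(x)
    result = {}
    for lower in sorted(bins):
        counts = Counter(bins[lower])
        size = len(bins[lower])
        proportions = [counts[c] / size for c in categories]
        result[lower] = (size / n, proportions)    # width is proportional to bin size
    return result

# Hypothetical data: age and a rating on a three-point scale
age    = [18, 23, 27, 29, 34, 35, 38, 41, 44, 52]
rating = [ 3,  3,  2,  3,  2,  2,  1,  1,  2,  1]

data = spine_plot_data(age, rating, bin_width=10, categories=[1, 2, 3])
for lower, (width, proportions) in data.items():
    print(lower, round(width, 2), [round(p, 2) for p in proportions])
```

Narrow bars then flag bins with few observations, which is how a spine plot reveals sparsely covered regions of the predictor.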

We can also use a line plot to show a reduced version of the area chart in Figure 2.7. To this end, only the smoothed exceedance proportions are shown visually. Similar to Figure 2.5, the distribution of \(k\) categories is shown using \(k-1\) linear profiles.

Figure 2.9: Line plot showing exceedance proportions for a numeric predictor.

2.2.3 Complexity

Illustrate how showing individual aggregated proportions can simplify data description.

Figure 2.10 shows the distribution of responses for each of the 370 informants in the data.

Figure 2.10: Stacked bar chart showing response proportions for each speaker in the data.