Session 3: Practice
Continue working in the quarto document “vocab_data_analysis.qmd”.
Load the {tidyverse} package.
Use the functions group_by() and summarize() to obtain data summaries by year:
- Number of participants
- Average score on test
- Average number of years of education
Render the document and make sure everything looks OK.
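The summaries by year could be computed roughly as follows. This is a minimal sketch, not the reference solution: the data frame name vocab, its columns (year, sex, education, vocabulary) and the simulated values are assumptions made only so that the example runs on its own.

```r
library(tidyverse)

# Placeholder data (NOT the course data): 200 simulated respondents.
set.seed(1)
vocab <- tibble(
  year       = rep(c(1974, 1984, 1994, 2004), each = 50),
  sex        = rep(c("Female", "Male"), times = 100),
  education  = sample(8:20, 200, replace = TRUE),
  vocabulary = sample(0:10, 200, replace = TRUE)
)

# group_by() splits the data by year; summarize() then returns
# one row per year with the three requested summaries.
vocab_by_year <- vocab %>%
  group_by(year) %>%
  summarize(
    n_participants  = n(),                            # number of participants
    mean_vocabulary = mean(vocabulary, na.rm = TRUE), # average test score
    mean_education  = mean(education, na.rm = TRUE)   # average years of education
  )

vocab_by_year
```

In your own document, replace the simulated vocab with the data you have already loaded.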
Copy this code, then use piping to draw the following graphs:
- A line plot showing the number of respondents across years. If you need help with this, refer to the R graphics cookbook (Section 2.2), which is available for free online.
- A line plot showing the average number of years of education across years.
- A line plot showing the average test score across years.
Render the document and make sure everything looks OK.
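One way to pipe a summary straight into a line plot is sketched below for the number of respondents. The data frame vocab and its columns are placeholders assumed for illustration; the other two plots would follow the same pattern with a different summary.

```r
library(tidyverse)

# Placeholder data (NOT the course data): 200 simulated respondents.
set.seed(1)
vocab <- tibble(
  year       = rep(c(1974, 1984, 1994, 2004), each = 50),
  sex        = rep(c("Female", "Male"), times = 100),
  education  = sample(8:20, 200, replace = TRUE),
  vocabulary = sample(0:10, 200, replace = TRUE)
)

# Summarize by year, then pipe the result into ggplot().
# Note: the pipe (%>%) connects data steps, "+" adds plot layers.
p_respondents <- vocab %>%
  group_by(year) %>%
  summarize(n_participants = n()) %>%
  ggplot(aes(x = year, y = n_participants)) +
  geom_line()

p_respondents
```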
Add a line of code after each code chunk to save the graph as a PDF file in the subfolder “figures”. Use informative names for the figure PDFs.
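Saving a plot with ggsave() might look like the sketch below. The file name respondents_by_year.pdf, the plot size and the simulated data are assumptions; the example also assumes that the “figures” subfolder sits next to the Quarto document and creates it if it is missing.

```r
library(tidyverse)

# Placeholder data (NOT the course data): 200 simulated respondents.
set.seed(1)
vocab <- tibble(
  year       = rep(c(1974, 1984, 1994, 2004), each = 50),
  sex        = rep(c("Female", "Male"), times = 100),
  education  = sample(8:20, 200, replace = TRUE),
  vocabulary = sample(0:10, 200, replace = TRUE)
)

p_respondents <- vocab %>%
  group_by(year) %>%
  summarize(n_participants = n()) %>%
  ggplot(aes(x = year, y = n_participants)) +
  geom_line()

# Create the subfolder if it does not exist yet, then save the plot.
# ggsave() picks the PDF format from the file extension;
# width and height are given in inches.
dir.create("figures", showWarnings = FALSE)
ggsave("figures/respondents_by_year.pdf", plot = p_respondents,
       width = 6, height = 4)
```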
We can use chunk options to improve the appearance of figures in the HTML and PDF files that we generate from the Quarto notebook. Use the chunk options fig-height: and fig-width: to shrink the plots you have drawn to an appropriate size. For illustration, refer to the Quarto guide.
Render the document and make sure the graphs look better.
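As an illustration, the body of a code chunk with figure options could look like the sketch below. The lines starting with #| must come first in the chunk; the sizes are in inches, and the values 5 and 3 are only a starting point to experiment with. The simulated data frame vocab is a placeholder.

```r
#| fig-width: 5
#| fig-height: 3

library(tidyverse)

# Placeholder data (NOT the course data): 200 simulated respondents.
set.seed(1)
vocab <- tibble(
  year       = rep(c(1974, 1984, 1994, 2004), each = 50),
  sex        = rep(c("Female", "Male"), times = 100),
  education  = sample(8:20, 200, replace = TRUE),
  vocabulary = sample(0:10, 200, replace = TRUE)
)

# Average test score by year, drawn at the size set in the chunk options.
p_score <- vocab %>%
  group_by(year) %>%
  summarize(mean_vocabulary = mean(vocabulary, na.rm = TRUE)) %>%
  ggplot(aes(x = year, y = mean_vocabulary)) +
  geom_line()

p_score
```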
Additional task (use the R graphics cookbook for help)
- Copy the code for the line plots and modify it so that the figure shows separate trend lines for male and female respondents (Section 4.3)
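A possible approach to the separate trend lines is sketched here: group by both year and sex, then map sex to the line colour. The column name sex and its values are assumptions, as is the simulated data frame vocab.

```r
library(tidyverse)

# Placeholder data (NOT the course data): 200 simulated respondents.
set.seed(1)
vocab <- tibble(
  year       = rep(c(1974, 1984, 1994, 2004), each = 50),
  sex        = rep(c("Female", "Male"), times = 100),
  education  = sample(8:20, 200, replace = TRUE),
  vocabulary = sample(0:10, 200, replace = TRUE)
)

# One summary row per combination of year and sex;
# mapping sex to colour draws one line per group.
p_score_by_sex <- vocab %>%
  group_by(year, sex) %>%
  summarize(mean_vocabulary = mean(vocabulary, na.rm = TRUE),
            .groups = "drop") %>%
  ggplot(aes(x = year, y = mean_vocabulary, colour = sex)) +
  geom_line()

p_score_by_sex
```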
Further study: Dealing with overplotting
Let’s look at the relationship between education (years of education) and vocabulary (test score, out of 10 points).
- Create a scatterplot showing the two variables. If you need help with this, refer to the R graphics cookbook (Section 2.1).
- Make sure education is assigned to the x-axis and vocabulary to the y-axis. Why?
- In its current shape, the graph is almost useless. Let’s see what we can do about overplotting.
- For each of the following steps, create a new graph and add text to the Quarto file explaining what the new code implements.
- As a first step, use jittering. Google “ggplot jittering”. On the official ggplot2 documentation website, it is usually most informative to scroll down to the “Examples” section, where you can see what the code does.
- Apparently, jittering doesn’t get us too far. It’s a useful first step, but we need to do more.
- Let’s use a different plotting symbol: Open (instead of filled) circles.
- For an overview of different plotting symbols, google “R pch”. Open circles are denoted by “1”.
- In ggplot2, we can change the plotting symbol by adding the argument shape = inside the function geom_jitter(). For open circles, supply the value “1” (shape = 1).
- Better, but still not there.
- Another technique to avoid overplotting is called “alpha blending”, which makes plotting symbols partly transparent. To find the right argument, place the cursor inside the geom_jitter() function and hit the tab key. This will list potential arguments in a drop-down menu. Look for something that seems like it will implement alpha blending and click on it; it will then appear in your code. Then try different values between 1 (no transparency) and 0 (full transparency, i.e. invisible).
- The default theme in ggplot2 makes it difficult to see light grey symbols, so we must use a different theme. Google “ggplot theme” and scroll down to the “Examples” section to find a suitable one. Be sure to try theme_minimal().
- Additional tasks (use Google for help):
- Change the label of the x-axis and y-axis.
- Add an informative title to the graph.
- Additional tasks (use R graphics cookbook for help):
- Use different plotting symbols for male and female respondents (Section 5.3).
- Add a regression line (Section 5.6).
- Have a look at other techniques that allow you to deal with overplotting (Section 5.5).
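The overplotting steps above (jittering, open circles, alpha blending, a lighter theme, and labels) can be combined as in the sketch below. It is only one possible outcome: the jitter amounts, the alpha value and the label texts are illustrative guesses, and the simulated data frame vocab stands in for the course data.

```r
library(tidyverse)

# Placeholder data (NOT the course data): 200 simulated respondents.
set.seed(1)
vocab <- tibble(
  year       = rep(c(1974, 1984, 1994, 2004), each = 50),
  sex        = rep(c("Female", "Male"), times = 100),
  education  = sample(8:20, 200, replace = TRUE),
  vocabulary = sample(0:10, 200, replace = TRUE)
)

p_overplot <- ggplot(vocab, aes(x = education, y = vocabulary)) +
  geom_jitter(
    width  = 0.3,  # horizontal jitter (illustrative value)
    height = 0.3,  # vertical jitter (illustrative value)
    shape  = 1,    # open circles
    alpha  = 0.3   # alpha blending: 1 = opaque, 0 = invisible
  ) +
  labs(
    x     = "Years of education",
    y     = "Vocabulary test score (0 to 10)",
    title = "Education and vocabulary test score"
  ) +
  theme_minimal()  # white background makes faint symbols easier to see

p_overplot
```

Building the plot up one argument at a time, as the steps suggest, makes it easier to see what each change contributes.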