Session 1: R project setup and Quarto documents
Reproducibility
Your data analysis project should be self-contained, which means that everything that is needed to reproduce the results is stored in the project folder.
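One way to keep a project self-contained is to build every file path from the project folder instead of using absolute paths. The sketch below is only an illustration: the folder name `my_project`, the `data` subfolder, and the file `toy.csv` are hypothetical, and a temporary directory stands in for a real project folder.

```r
# Stand-in for the project folder (assumption: in practice this is your
# own project directory, not a temporary folder).
project_dir <- file.path(tempdir(), "my_project")
dir.create(file.path(project_dir, "data"), recursive = TRUE, showWarnings = FALSE)

# Store a small made-up file inside the project ...
toy <- data.frame(id = 1:3, value = c(2.5, 3.1, 4.8))
write.csv(toy, file.path(project_dir, "data", "toy.csv"), row.names = FALSE)

# ... and read it back with a path built from the project folder,
# rather than an absolute path that only exists on one computer.
toy_again <- read.csv(file.path(project_dir, "data", "toy.csv"))
toy_again
```

Because the data file lives inside the project folder, copying that folder to another computer is enough to rerun the analysis.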
**bold**
*italics*
install.packages()
library()
## R setup
The section where you load the packages you need.
<- (assignment operator; shortcut: Alt + -)
function(argument = "...")
str(), head(), mean(), median()
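A minimal sketch of assignment and function calls, using only base R. The vector `x` and its values are made up for illustration.

```r
# Assign a numeric vector to the object x with <- (RStudio shortcut: Alt + -)
x <- c(2, 4, 4, 10)

# Call functions in the form function(argument = value)
m   <- mean(x = x)      # arithmetic mean: 5
med <- median(x = x)    # middle value: 4

str(x)                  # compact description of the object
head(x = x, n = 2)      # first two elements: 2 4
```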
Dataset: PhDPublications, from the AER package (that's where the data is).
Save it as d (= copy it) using <-.
Type Ph and hit the Tab key to autocomplete the dataset name.
Inspect it with str() and head().
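The steps for getting the data can be sketched as follows. This assumes the AER package is already installed; the commented-out `install.packages()` line is only needed once per computer.

```r
# install.packages("AER")   # run once, then leave commented out
library(AER)                # load the package in every new R session

data("PhDPublications")     # make the dataset available
d <- PhDPublications        # copy it into a shorter-named object

str(d)    # structure: variables, types, first values
head(d)   # first six rows
```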
articles: # articles published during last 3 years of PhD
gender
married
kids: # of children less than 6 years old
prestige: prestige of graduate program
mentor: # articles published by mentor

'data.frame': 915 obs. of 6 variables:
$ articles: int 0 0 0 0 0 0 0 0 0 0 ...
$ gender : Factor w/ 2 levels "male","female": 1 2 2 1 2 2 2 1 1 2 ...
$ married : Factor w/ 2 levels "no","yes": 2 1 1 2 1 2 1 2 1 2 ...
$ kids : int 0 0 0 1 0 2 0 2 0 0 ...
$ prestige: num 2.52 2.05 3.75 1.18 3.75 ...
$ mentor : int 7 6 6 3 26 2 3 4 6 0 ...
- attr(*, "datalabel")= chr "Academic Biochemists / S Long"
- attr(*, "time.stamp")= chr "30 Jan 2001 10:49"
- attr(*, "formats")= chr [1:6] "%9.0g" "%9.0g" "%9.0g" "%9.0g" ...
- attr(*, "types")= int [1:6] 98 98 98 98 102 98
- attr(*, "val.labels")= chr [1:6] "" "sexlbl" "marlbl" "" ...
- attr(*, "var.labels")= chr [1:6] "Articles in last 3 yrs of PhD" "Gender: 1=female 0=male" "Married: 1=yes 0=no" "Number of children < 6" ...
- attr(*, "version")= int 6
- attr(*, "label.table")=List of 6
..$ marlbl: Named num [1:2] 0 1
.. ..- attr(*, "names")= chr [1:2] "Single" "Married"
..$ sexlbl: Named num [1:2] 0 1
.. ..- attr(*, "names")= chr [1:2] "Men" "Women"
..$ : NULL
..$ : NULL
..$ : NULL
..$ : NULL
### R setup
(load packages)
### Data
(load data)
### Descriptive analysis
(tables, graphs)
dplyr (part of the tidyverse)
|> (pipe operator)
group_by(): divide the dataset
summarize(): summarize subsets
n(): count how many there are
mean(), median(), sd(), max()
d |>
  group_by(kids) |>
  summarise(
    N = n(),
    mean_pubs = mean(articles),
    median_pubs = median(articles),
    sd_pubs = sd(articles),
    max_pubs = max(articles)
  )
# A tibble: 4 x 6
   kids     N mean_pubs median_pubs sd_pubs max_pubs
  <int> <int>     <dbl>       <dbl>   <dbl>    <int>
1     0   599     1.72            1   1.93        19
2     1   195     1.76            1   2.05        12
3     2   105     1.54            1   1.74        11
4     3    16     0.812           1   0.911        3
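To see what the grouped summary does mechanically, here is a sketch on a tiny data frame that can be checked by hand. The object `toy` and its values are invented for illustration and are not taken from PhDPublications.

```r
library(dplyr)

# Toy data, invented for illustration
toy <- data.frame(
  kids     = c(0, 0, 1, 1, 1),
  articles = c(1, 3, 0, 2, 7)
)

res <- toy |>
  group_by(kids) |>                 # split the rows by value of kids
  summarise(
    N = n(),                        # rows per group
    mean_pubs = mean(articles),     # mean within each group
    median_pubs = median(articles)  # median within each group
  )

res
```

The group with `kids == 0` has 2 rows with mean 2 and median 2; the group with `kids == 1` has 3 rows with mean 3 and median 2.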
ggplot2 (part of the tidyverse)
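As a possible starting point for the graphs, the sketch below draws a bar chart of the number of articles. The choice of plot and the axis labels are assumptions made for illustration; the code assumes that the ggplot2 and AER packages are installed.

```r
library(ggplot2)

# Load the data directly from the AER package
data("PhDPublications", package = "AER")
d <- PhDPublications

# Bar chart: how many observations have 0, 1, 2, ... articles
p <- ggplot(d, aes(x = articles)) +
  geom_bar() +
  labs(
    x = "Articles published during last 3 years of PhD",
    y = "Number of observations"
  )

p  # printing the object draws the plot
```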