Vignette for Data Analysis at The Mark USA, Inc.:
This file is the starting point for any data analysis task at The Mark. The examples provided cover a baseline-annual (pre-post) survey comparison, but this workflow can be adapted to many other data tasks. This vignette corresponds to the available template titled: “Data Analysis for The Mark USA, Inc.”. Please share any feedback from using this vignette so we can determine whether there are more tasks to include or other templates to construct.
Loading baseline and annual data from .csv and .xlsx:
How to read in .xlsx data:
# readxl is a package that has the most up-to-date function for reading in Excel files (either .xls or .xlsx), read_excel(); if you know the file is .xlsx you can use
# the function read_xlsx(). Both are shown below.
# For this example, the baseline data is in a folder called "extdata" inside "inst", one folder up from the working directory; ../ moves up one folder.
# Read in the baseline.xlsx data using read_excel():
baseline <- readxl::read_excel("../inst/extdata/baseline.xlsx")
# Read in the annual.xlsx data using read_xlsx():
annual <- readxl::read_xlsx("../inst/extdata/annual.xlsx")
# Remove so the names can be reused below:
rm(annual, baseline)
How to read in .csv data:
# readr is a package included in the "tidyverse" that has the most up-to-date function for reading in .csv data, read_csv():
# For this example, the baseline data is in a folder called "extdata" inside "inst", one folder up from the working directory; ../ moves up one folder.
# Read in the baseline.csv data:
baseline <- readr::read_csv("../inst/extdata/baseline.csv")
# Read in the annual.csv data:
annual <- readr::read_csv("../inst/extdata/annual.csv")
Info about example data in baseline and annual datasets:
The baseline dataset contains 9 variables and the annual dataset contains 6 (the unique ID plus the 5 scale items); both have 20 observations.
- Unique Identifier: a unique ID (1 to 20)
- role: 1= “Undergraduate student”, 2 =“Graduate student”, 3= “Postdoc”, 4 = “Faculty”
- Gender: 1 = “male”, 2 = “female”, 3 = “other”
- Institution: 1 = “University of Place”, 2 = “State University of Another Place”, 3 = “Technical State”, 4 = “University of One More Place”
- 5 variables that make up a composite scale: Organization, Source, Publish, Write, Research
- these are all on a 5-point Likert scale of 1 to 5 and need to be recoded to: c(“Minimal”, “Slight”, “Moderate”, “Good”, “Extensive”) (a toy simulation of this structure is sketched below)
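If you want to experiment without the real files, a minimal sketch like the following can simulate data with this structure (the values are made up for illustration; this is not the actual example data):
# A hypothetical simulation of the baseline structure, for illustration only:
set.seed(42) # reproducible fake data
fake_baseline <- tibble::tibble(
  `Unique Identifier` = 1:20,
  Role = sample(1:4, 20, replace = TRUE),
  Gender = sample(1:3, 20, replace = TRUE),
  Institution = sample(1:4, 20, replace = TRUE),
  Organization = sample(1:5, 20, replace = TRUE),
  Source = sample(1:5, 20, replace = TRUE),
  Publish = sample(1:5, 20, replace = TRUE),
  Write = sample(1:5, 20, replace = TRUE),
  Research = sample(1:5, 20, replace = TRUE)
)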
Merging data:
The R package dplyr, which is included in tidyverse, includes many functions to merge datasets; the most common of the “outer join” functions are left_join() and full_join(). Use full_join() if you want to merge two datasets and keep all the observations of both, filling in NA where a row has no match in the other dataset; see dplyr’s documentation for more examples, and the toy sketch below.
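For instance, a quick toy sketch (made-up tibbles, not the vignette data) of what full_join() keeps:
# Toy example: full_join() keeps rows from both sides, filling NA where there is no match:
a <- tibble::tibble(id = c(1, 2), x = c("a", "b"))
b <- tibble::tibble(id = c(2, 3), y = c("B", "C"))
dplyr::full_join(a, b, by = "id")
# id 2 matches in both tibbles; id 1 gets NA for y, and id 3 gets NA for x.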
For most uses at The Mark, data will be merged using left_join(). This function merges datasets that share a common unique identifier. Specifically, left_join() keeps all of the observations in the first supplied dataset and merges in the observations from the second dataset that match on the user-supplied “by” variable in the first dataset.
# Merge the baseline and annual datasets using the variable `Unique Identifier` as the by argument inside join_by(); this is the variable that will match the observations from both datasets.
# The suffix argument is a c() of length two that is appended to overlapping variable names from the respective datasets,
# so for this example variables from baseline will have a suffix of "{variable_name}_pre" and variables from annual will have "{variable_name}_post":
merged_data <- baseline %>% left_join(., annual, by = join_by("Unique Identifier"), suffix = c("_pre","_post"))
# can also be written: merged_data <- left_join(baseline, annual, by = join_by("Unique Identifier"), suffix = c("_pre","_post"))
# the new merged data has 14 variables: the original 9 from baseline and the 5 that match from the annual dataset.
# head(merged_data, n = 4)
# If you would rather use a prefix for pre and post, this is one way to do that:
# Add the prefix before joining; for baseline skip the first 4 vars, and for annual skip the first var:
names(baseline)[5:9] <- paste0("pre_", names(baseline)[5:9])
names(annual)[2:6] <- paste0("post_", names(annual)[2:6])
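As an alternative to the positional names()<- assignments above (a sketch; run instead of those lines, not in addition), dplyr::rename_with() adds the same prefixes:
# Same prefixing with rename_with(); column positions assumed to match the data described above:
baseline <- baseline %>% rename_with(~ paste0("pre_", .x), .cols = 5:9)
annual <- annual %>% rename_with(~ paste0("post_", .x), .cols = 2:6)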
# in join, use names with prefixes
merged_data <- baseline %>% left_join(annual, by = join_by("Unique Identifier"))
# check variables from the merged_data using str() and summary():
str(merged_data)
#> spc_tbl_ [20 × 14] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
#> $ Unique Identifier: num [1:20] 1 2 3 4 5 6 7 8 9 10 ...
#> $ Gender : num [1:20] 1 2 2 2 2 2 1 2 1 1 ...
#> $ Role : num [1:20] 1 2 2 4 4 1 2 2 1 2 ...
#> $ Institution : num [1:20] 2 1 2 1 2 2 3 1 2 4 ...
#> $ pre_Organization : num [1:20] 5 3 3 1 2 2 5 2 5 2 ...
#> $ pre_Source : num [1:20] 1 2 4 5 4 4 2 5 4 3 ...
#> $ pre_Publish : num [1:20] 2 2 2 3 4 1 5 4 3 5 ...
#> $ pre_Write : num [1:20] 4 1 3 3 2 2 5 2 2 2 ...
#> $ pre_Research : num [1:20] 5 2 2 2 5 1 2 2 5 2 ...
#> $ post_Organization: num [1:20] 5 4 4 2 3 3 5 3 5 3 ...
#> $ post_Source : num [1:20] 3 4 4 5 4 4 4 5 4 5 ...
#> $ post_Publish : num [1:20] 3 3 3 4 5 2 5 5 4 5 ...
#> $ post_Write : num [1:20] 4 3 5 5 4 4 5 4 4 4 ...
#> $ post_Research : num [1:20] 5 3 3 3 5 2 3 3 5 3 ...
#> - attr(*, "spec")=
#> .. cols(
#> .. `Unique Identifier` = col_double(),
#> .. Gender = col_double(),
#> .. Role = col_double(),
#> .. Institution = col_double(),
#> .. Organization = col_double(),
#> .. Source = col_double(),
#> .. Publish = col_double(),
#> .. Write = col_double(),
#> .. Research = col_double()
#> .. )
#> - attr(*, "problems")=<externalptr>
# merged_data contains all numeric variables, with 14 variables and 20 observations.
Clean up variable names:
# The janitor package has a lot of functions to help clean data; the code below will do the following:
# Clean up column names and remove empty/constant columns (not necessary here, but shown so you know it's available):
merged_data <- merged_data %>% janitor::clean_names() %>% janitor::remove_empty() %>% janitor::remove_constant()
# Get a list of all columns names:
colnames(merged_data)
#> [1] "unique_identifier" "gender" "role"
#> [4] "institution" "pre_organization" "pre_source"
#> [7] "pre_publish" "pre_write" "pre_research"
#> [10] "post_organization" "post_source" "post_publish"
#> [13] "post_write" "post_research"
# All the column names are pretty easy to use, but one change may be helpful.
# Rename column names; this renames unique_identifier as unique_id:
merged_data <- merged_data %>% rename(unique_id = unique_identifier)
How to recode numeric variables to factor/categorical variables:
First, in order to use the variable “role” from merged_data, we need to change it to a factor/categorical variable:
# For all of the data analysis in this template, the tidyverse is the basis of the data manipulation; mutate() and case_when() from "dplyr" are a simple way to create a new variable in R.
# A prefix of "cat_" is added to signify that the new variable is a category, not numeric: "cat_{variable_name}".
# role = role of respondent, on a scale of 1 to 4; needs to be recoded to: 1 = "Undergraduate student", 2 = "Graduate student", 3 = "Postdoc", 4 = "Faculty"
merged_data <- merged_data %>% mutate(cat_role = factor(case_when(
role == 1 ~ "Undergraduate student",
role == 2 ~ "Graduate student",
role == 3 ~ "Postdoc",
role == 4 ~ "Faculty"
), levels = c("Undergraduate student", "Graduate student", "Postdoc", "Faculty"))
)
# case_when() takes a statement on the left side of the ~ and, when it is true, returns the value on the right side of the ~ for the new variable, in this case cat_role;
# so when role == 1, cat_role will be "Undergraduate student", and so on.
# TheMarkUSA package also has a function that works similarly to the code above, recodeCat(). This function takes in a df; use the scale_labels argument to pass the new labels along with the original values.
# It returns the original variable(s) plus new factor variable(s) named "cat_{variable(s)}":
cat_role <- merged_data %>% select(role) %>%
recodeCat(scale_labels = c("Undergraduate student" = "1", "Graduate student" = "2", "Postdoc" = "3", "Faculty" = "4")) %>% select(cat_role)
# Add new cat_role variable back to merged data:
merged_data <- merged_data %>% mutate(cat_role)
Next, in order to use the variable “gender” from merged_data, we need to change it to a factor/categorical variable:
# recode gender with recodeCat():
cat_gender <- merged_data %>% select(gender) %>%
recodeCat(scale_labels = c("male"= "1", "female"= "2", "other"= "3")) %>% select(cat_gender)
# Add new cat_gender variable back to merged data:
merged_data <- merged_data %>% mutate(cat_gender)
Recode institution from the merged_data and change it to a factor/categorical variable:
# recode institution with recodeCat():
cat_institution <- merged_data %>% select(institution) %>%
recodeCat(scale_labels = c("University of Place" = "1", "State University of Another Place"= "2", "Technical State"= "3", "University of One More Place"= "4")) %>%
select(cat_institution)
# Add new cat_institution variable back to merged data:
merged_data <- merged_data %>% mutate(cat_institution)
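If you would rather avoid three separate recodeCat() calls, a base-R sketch with factor() inside one mutate() produces the same three factor variables (labels taken from the codebook above):
# One mutate() call using factor() levels/labels instead of recodeCat():
merged_data <- merged_data %>%
  mutate(cat_role = factor(role, levels = 1:4,
                           labels = c("Undergraduate student", "Graduate student", "Postdoc", "Faculty")),
         cat_gender = factor(gender, levels = 1:3, labels = c("male", "female", "other")),
         cat_institution = factor(institution, levels = 1:4,
                                  labels = c("University of Place", "State University of Another Place",
                                             "Technical State", "University of One More Place")))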
Frequency Tables for all demographics:
A common task at The Mark for data analysis is creating frequency tables; the next examples show how to do this with the TheMarkUSA package.
# The first step is to use dataSumm() from "TheMarkUSA" to calculate frequencies and percentages of the variable; be sure to use the recoded factor variable with the "cat_" prefix:
role_summ <- merged_data %>% select(cat_role) %>% TheMarkUSA::dataSumm()
role_summ
#> # A tibble: 4 × 5
#> question response n_answers percent_answers percent_answers_label
#> <chr> <fct> <int> <dbl> <chr>
#> 1 cat_role Undergraduate student 6 0.3 30%
#> 2 cat_role Graduate student 7 0.35 35%
#> 3 cat_role Postdoc 3 0.15 15%
#> 4 cat_role Faculty 4 0.2 20%
# Next, tblSumm() from "TheMarkUSA" uses flextable() to create a nicely formatted table that can be rendered to html and also works well in .pptx and .docx:
tbl_role <- role_summ %>% tblSumm()
tbl_role
Response              | Percent | Count
----------------------|---------|------
Undergraduate student | 30%     | 6
Graduate student      | 35%     | 7
Postdoc               | 15%     | 3
Faculty               | 20%     | 4
Total                 | -       | 20
Now, do the same thing for gender and institution:
# dataSumm() and tblSumm() from "TheMarkUSA" can be used in one line to calculate frequency and percentages of the variable and make flextable output:
# gender
tbl_gender <- merged_data %>% select(cat_gender) %>% dataSumm() %>% tblSumm()
tbl_gender
Response | Percent | Count
---------|---------|------
male     | 50%     | 10
female   | 45%     | 9
other    | 5%      | 1
Total    | -       | 20
# institution
tbl_institution <- merged_data %>% select(cat_institution) %>% dataSumm() %>% tblSumm()
tbl_institution
Response                          | Percent | Count
----------------------------------|---------|------
University of Place               | 15%     | 3
State University of Another Place | 45%     | 9
Technical State                   | 25%     | 5
University of One More Place      | 15%     | 3
Total                             | -       | 20
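To avoid repeating the select()/dataSumm()/tblSumm() chain for each demographic, here is a sketch using purrr::map(), assuming dataSumm() and tblSumm() behave as in the single-variable calls above:
# Build all three demographic tables in one pass; returns a named list of tables:
demo_vars <- c("cat_role", "cat_gender", "cat_institution")
demo_tables <- demo_vars %>%
  purrr::map(~ merged_data %>% select(all_of(.x)) %>% TheMarkUSA::dataSumm() %>% TheMarkUSA::tblSumm()) %>%
  purrr::set_names(demo_vars)
demo_tables$cat_institution # pull any one table out by name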
How to recode Likert scale items:
In the current The Mark workflow, the most common data analysis involves survey items with Likert scales, usually 1-5 point scales. In the example data provided, the 5 variables named Organization, Source, Publish, Write, and Research represent items with 5-point Likert scales. The following code creates new factor variables from the numeric 1-5 variables with the correct labels: “Minimal”, “Slight”, “Moderate”, “Good”, “Extensive”.
# recodeCat() from TheMarkUSA package takes in a tibble/data frame of as many variables as share the same desired scale_labels; use the scale_labels argument to pass the new labels in the order of the number_levels, which you can also pass as an argument.
# It returns the original variable(s) and the new factor variable(s) named "cat_{variable(s)}".
# Use the : symbol to select variables that are consecutive in the tibble, i.e. select(pre_organization:post_research) selects all the Likert scale items;
# the second select() call selects all the new variables created with the "cat_" prefix:
cat_likert_items <- merged_data %>% select(pre_organization:post_research) %>%
recodeCat(scale_labels = c("Minimal" = "1", "Slight" = "2", "Moderate" = "3", "Good" = "4", "Extensive" = "5")) %>% select(contains("cat"))
# Add the new cat_items df to merged data:
merged_data <- merged_data %>% mutate(cat_likert_items)
# Check the data:
str(merged_data)
#> tibble [20 × 27] (S3: tbl_df/tbl/data.frame)
#> $ unique_id : num [1:20] 1 2 3 4 5 6 7 8 9 10 ...
#> $ gender : num [1:20] 1 2 2 2 2 2 1 2 1 1 ...
#> $ role : num [1:20] 1 2 2 4 4 1 2 2 1 2 ...
#> $ institution : num [1:20] 2 1 2 1 2 2 3 1 2 4 ...
#> $ pre_organization : num [1:20] 5 3 3 1 2 2 5 2 5 2 ...
#> $ pre_source : num [1:20] 1 2 4 5 4 4 2 5 4 3 ...
#> $ pre_publish : num [1:20] 2 2 2 3 4 1 5 4 3 5 ...
#> $ pre_write : num [1:20] 4 1 3 3 2 2 5 2 2 2 ...
#> $ pre_research : num [1:20] 5 2 2 2 5 1 2 2 5 2 ...
#> $ post_organization : num [1:20] 5 4 4 2 3 3 5 3 5 3 ...
#> $ post_source : num [1:20] 3 4 4 5 4 4 4 5 4 5 ...
#> $ post_publish : num [1:20] 3 3 3 4 5 2 5 5 4 5 ...
#> $ post_write : num [1:20] 4 3 5 5 4 4 5 4 4 4 ...
#> $ post_research : num [1:20] 5 3 3 3 5 2 3 3 5 3 ...
#> $ cat_role : Factor w/ 4 levels "Undergraduate student",..: 1 2 2 4 4 1 2 2 1 2 ...
#> $ cat_gender : Factor w/ 3 levels "male","female",..: 1 2 2 2 2 2 1 2 1 1 ...
#> $ cat_institution : Factor w/ 4 levels "University of Place",..: 2 1 2 1 2 2 3 1 2 4 ...
#> $ cat_pre_organization : Factor w/ 5 levels "Minimal","Slight",..: 5 3 3 1 2 2 5 2 5 2 ...
#> $ cat_pre_source : Factor w/ 5 levels "Minimal","Slight",..: 1 2 4 5 4 4 2 5 4 3 ...
#> $ cat_pre_publish : Factor w/ 5 levels "Minimal","Slight",..: 2 2 2 3 4 1 5 4 3 5 ...
#> $ cat_pre_write : Factor w/ 5 levels "Minimal","Slight",..: 4 1 3 3 2 2 5 2 2 2 ...
#> $ cat_pre_research : Factor w/ 5 levels "Minimal","Slight",..: 5 2 2 2 5 1 2 2 5 2 ...
#> $ cat_post_organization: Factor w/ 5 levels "Minimal","Slight",..: 5 4 4 2 3 3 5 3 5 3 ...
#> $ cat_post_source : Factor w/ 5 levels "Minimal","Slight",..: 3 4 4 5 4 4 4 5 4 5 ...
#> $ cat_post_publish : Factor w/ 5 levels "Minimal","Slight",..: 3 3 3 4 5 2 5 5 4 5 ...
#> $ cat_post_write : Factor w/ 5 levels "Minimal","Slight",..: 4 3 5 5 4 4 5 4 4 4 ...
#> $ cat_post_research : Factor w/ 5 levels "Minimal","Slight",..: 5 3 3 3 5 2 3 3 5 3 ...
The new items are added to the merged_data tibble and are now factor variables ready for use.
Table with Mean and Standard Deviation (SD) for Numeric Items:
The following code will create three new variables: a mean for the “pre_” Likert items, a mean for the “post_” Likert items, and a difference score of the “post_” mean minus the “pre_” mean. Then, it will create a table with the mean and SD using the function tbl_summary() from the gtsummary package:
# use the original numeric variables: pre Likert items = pre_organization:pre_research, post Likert items = post_organization:post_research
# First, if we needed to change a value to NA (e.g., a 9 used as a missing-data code), this is one way to do it quickly; not needed for this data, so it is commented out:
# merged_data <- merged_data %>% mutate(across(pre_organization:pre_research, ~ ifelse(. == 9, NA, .)),
#                                       across(post_organization:post_research, ~ ifelse(. == 9, NA, .)))
# Create composite means for the 5 pre Likert items (pre_items) and 5 post Likert items (post_items).
# Also, create a new variable for the difference between pre and post, diff_items (this will be used in a test for normality, the Shapiro-Wilk test).
# rowwise() is a function from the dplyr package that lets you compute new variables row by row; this returns means for pre and post for each individual row, not overall:
merged_data <- merged_data %>% rowwise() %>% mutate(pre_items = mean(c_across(pre_organization:pre_research), na.rm = TRUE),
post_items = mean(c_across(post_organization:post_research), na.rm = TRUE),
diff_items = post_items - pre_items)
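An equivalent sketch without rowwise(), using base rowMeans() over the selected columns; this is vectorized (useful on larger data) and gives the same composites:
# Same composite means via rowMeans(); across() selects the item columns:
merged_data <- merged_data %>%
  ungroup() %>% # drop the rowwise grouping if the chunk above was run
  mutate(pre_items = rowMeans(across(pre_organization:pre_research), na.rm = TRUE),
         post_items = rowMeans(across(post_organization:post_research), na.rm = TRUE),
         diff_items = post_items - pre_items)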
# Table for M and SD for pre-post composites:
merged_data %>%
select(pre_items, post_items) %>%
gtsummary::tbl_summary(statistic = list(gtsummary::all_continuous() ~ "{mean}, ({sd})"),
type = list(pre_items ~ 'continuous', post_items ~ 'continuous'),
label = list(pre_items ~ 'Pre Likert Items',
post_items ~ 'Post Likert Items'),
missing_text = "Missing") %>%
gtsummary::add_stat_label() %>%
gtsummary::bold_labels() %>%
gtsummary::modify_header(label ~ "**Variable**") %>%
gtsummary::modify_caption("Mean and SD of Likert Items")
Variable                      | N = 20
------------------------------|-------------
Pre Likert Items, Mean, (SD)  | 3.07, (0.55)
Post Likert Items, Mean, (SD) | 4.02, (0.36)
How to Test for Normality using pre-post difference scores:
If we want to check whether pre-post means/composite scores are normally distributed (in order to use parametric tests like t-tests), we should use the Shapiro-Wilk test. This requires a difference score of the pre-post means. The Shapiro-Wilk test has a null hypothesis that the data are normally distributed, so if the test is significant the data are not normally distributed and parametric tests should not be used (use non-parametric tests like the Wilcoxon test instead). If the Shapiro-Wilk test is not significant, the data can be treated as normally distributed and parametric tests (like t-tests) can be used.
# test for normality on the differences score of the Likert items:
shapiro.test(merged_data$diff_items)
#>
#> Shapiro-Wilk normality test
#>
#> data: merged_data$diff_items
#> W = 0.91, p-value = 0.06
The test was not significant, so the data can be treated as normally distributed; we can use a t-test for the pre-post analysis.
How to use a t-test with pre-post mean/composite scores:
If the data are pre-post, then a paired t-test should be used; the code below shows how:
# t-tests for parametric analysis
t_test_items <- t.test(merged_data$pre_items, merged_data$post_items, paired = TRUE)
t_test_items
#>
#> Paired t-test
#>
#> data: merged_data$pre_items and merged_data$post_items
#> t = -15, df = 19, p-value = 0.000000000003
#> alternative hypothesis: true mean difference is not equal to 0
#> 95 percent confidence interval:
#> -1.0784 -0.8216
#> sample estimates:
#> mean difference
#> -0.95
The t-test has a null hypothesis that the true mean difference is zero, so if the test is significant we can reject the null hypothesis and conclude that the pre-post means are meaningfully different. The t-test above was significant, as shown by the very small p-value (p < .001).
How to use a Wilcoxon test with pre-post mean/composite scores:
If the data were not normally distributed, then use a Wilcoxon test; if the data are pre-post, then a paired Wilcoxon test should be used. The code below shows how:
# Wilcoxon test for non-parametric analysis
wilcox_test_items <- wilcox.test(merged_data$pre_items, merged_data$post_items, paired = TRUE)
wilcox_test_items
#>
#> Wilcoxon signed rank test with continuity correction
#>
#> data: merged_data$pre_items and merged_data$post_items
#> V = 0, p-value = 0.00008
#> alternative hypothesis: true location shift is not equal to 0
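If this check-normality-then-test decision comes up often, one way to wrap it in a small helper (a sketch; the .05 cutoff is an assumption matching the sections above):
# Run Shapiro-Wilk on the difference scores, then pick the appropriate paired test:
paired_pre_post_test <- function(pre, post, alpha = 0.05) {
  normality <- shapiro.test(post - pre)
  if (normality$p.value > alpha) {
    t.test(pre, post, paired = TRUE) # differences look normal: parametric test
  } else {
    wilcox.test(pre, post, paired = TRUE) # not normal: non-parametric test
  }
}
paired_pre_post_test(merged_data$pre_items, merged_data$post_items)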
t-tests for individual items:
Here is how you would run t-tests for all the Likert items, again paired t-tests since the data are pre-post from the same individuals:
# t-tests for parametric analysis: Organization, Source, Publish, Write, Research
# organization:
t.test(merged_data$pre_organization, merged_data$post_organization, paired = TRUE)
#>
#> Paired t-test
#>
#> data: merged_data$pre_organization and merged_data$post_organization
#> t = -8.7, df = 19, p-value = 0.00000005
#> alternative hypothesis: true mean difference is not equal to 0
#> 95 percent confidence interval:
#> -0.9921 -0.6079
#> sample estimates:
#> mean difference
#> -0.8
# Source:
t.test(merged_data$pre_source, merged_data$post_source, paired = TRUE)
#>
#> Paired t-test
#>
#> data: merged_data$pre_source and merged_data$post_source
#> t = -4.4, df = 19, p-value = 0.0003
#> alternative hypothesis: true mean difference is not equal to 0
#> 95 percent confidence interval:
#> -1.4802 -0.5198
#> sample estimates:
#> mean difference
#> -1
# Publish:
t.test(merged_data$pre_publish, merged_data$post_publish, paired = TRUE)
#>
#> Paired t-test
#>
#> data: merged_data$pre_publish and merged_data$post_publish
#> t = -7.5, df = 19, p-value = 0.0000004
#> alternative hypothesis: true mean difference is not equal to 0
#> 95 percent confidence interval:
#> -0.9579 -0.5421
#> sample estimates:
#> mean difference
#> -0.75
# Write:
t.test(merged_data$pre_write, merged_data$post_write, paired = TRUE)
#>
#> Paired t-test
#>
#> data: merged_data$pre_write and merged_data$post_write
#> t = -6.7, df = 19, p-value = 0.000002
#> alternative hypothesis: true mean difference is not equal to 0
#> 95 percent confidence interval:
#> -1.8401 -0.9599
#> sample estimates:
#> mean difference
#> -1.4
# Research:
t.test(merged_data$pre_research, merged_data$post_research, paired = TRUE)
#>
#> Paired t-test
#>
#> data: merged_data$pre_research and merged_data$post_research
#> t = -8.7, df = 19, p-value = 0.00000005
#> alternative hypothesis: true mean difference is not equal to 0
#> 95 percent confidence interval:
#> -0.9921 -0.6079
#> sample estimates:
#> mean difference
#> -0.8
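As a compact alternative to the five separate calls above, here is a sketch that loops over the item stems and collects tidy results in one tibble (broom is an extra dependency assumed here, not part of the vignette's setup):
# Paired t-tests for every Likert item, collected into one tibble:
items <- c("organization", "source", "publish", "write", "research")
item_tests <- purrr::map_dfr(items, function(item) {
  res <- t.test(merged_data[[paste0("pre_", item)]],
                merged_data[[paste0("post_", item)]],
                paired = TRUE)
  broom::tidy(res) %>% mutate(item = item, .before = 1)
})
item_tests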