-
Notifications
You must be signed in to change notification settings - Fork 0
/
index.Rmd
320 lines (223 loc) · 11.1 KB
/
index.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
---
title: "Index"
author: '220250711'
date: "2023-05-16"
output:
html_document: default
pdf_document: default
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## Research Question
The aim of the study from which the data was taken (Simmons et al., 2011) was to express that psychology papers contain a lot of false positive results, where analyses are statistically significant but have no real effect. They wanted to prove that by strategically analysing data they could create significance from variables that definitely have no real interaction- that song listened to by participants significantly affected their age, which is obviously impossible. My visualisation aims to address the question of whether choice of covariate in an ancova model has an effect on the significance, by analysing whether song listened to affects participant age or participant’s father’s age.
``` {r messages= FALSE, echo=FALSE, include =FALSE}
# Call all libraries
library(tidyverse)
library(readxl)
library(dplyr)
library(car)
library(multcomp)
library(readr)
library(ggplot2)
library(here)
library(patchwork)
library(ggpubr)
## Data retrieval ##
# Site of data in Github project repo
url <- "https://github.com/molly1902/psy_6422/raw/main/s2.xlsx"
# Download the Excel file from GitHub
download.file(url, destfile = "s2.xlsx", mode = "wb")
# Read in the Excel file
s2 <- read_excel("s2.xlsx")
```
## Data Origins
The data came from open source repository published by the following study: Simmons, J P, Nelson, L D and Simonsohn, U (2011). False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant. *Psychological Science 22*(11): 1359–1366, DOI: https://doi.org/10.1177/0956797611417632.
Variable names are responses to nuisance questions, such as ‘what is your favourite football player’, because the original study was just interested in age of participants (‘?’ column) and what song they listened to (‘potato’, when64, or ‘kalimba’).
The data from the article downloads in .txt form, which I copied and pasted into an Excel spreadsheet as it appeared neater in R. I then uploaded the data in Excel form to GitHub, which is where the data is pulled from by R.
Here are the first few lines of data before processing:
```{r echo=TRUE}
head(s2)
```
``` {r messages= FALSE, echo=FALSE, include =FALSE}
## Data wrangling ##
# Remove the columns not used in analysis
# Leaves father and participant age in years, 3 song conditions
st2 <- subset(s2, select=-c(aged, mom, female, root, bird, political, quarterback, olddays, feelold, computer, diner, cond))
# The 3 song conditions (potato, when64, kalimba) are coded '1' if listened to
# and 0 if not listened to
# Replace all '1' values with the participant age value in the same row
# This also retains corresponding father age per song
stu2 <- st2 %>%
mutate(potato = replace(potato, potato == "1", aged365[potato=="1"])) %>%
mutate(when64 = replace(when64, when64 == "1", aged365[when64=="1"])) %>%
mutate(kalimba = replace(kalimba, kalimba == "1", aged365[kalimba=="1"])) %>%
mutate(dad)
#Remove all 0 values from song columns
new1 <- stu2 %>%
filter(!if_any(starts_with("p"), ~ . == 0))
new2 <- stu2 %>%
filter(!if_any(starts_with("w"), ~ . == 0))
new3 <- stu2 %>%
filter(!if_any(starts_with("k"), ~ . == 0))
# Index to further remove unwanted columns
new1pot <- new1[,-3:-5]
new2when <- new2[,-2,-4,-5]
new3kal <- new3[,-2,-3,-5]
# Create new column song type with the same number of values
# as are in the existing column, for each song
song1 <- c("Potato", "Potato", "Potato", "Potato", "Potato",
"Potato", "Potato", "Potato", "Potato", "Potato",
"Potato", "Potato", "Potato", "Potato")
song2 <- c("When", "When", "When", "When", "When",
"When", "When", "When", "When", "When",
"When")
song3 <- c("Kalimba", "Kalimba", "Kalimba", "Kalimba",
"Kalimba", "Kalimba", "Kalimba", "Kalimba",
"Kalimba")
# Add new column to each existing corresponding data frame
# containing participant age and father age
newpot <- new1pot %>%
mutate(song1)
newwhen <- new2when %>%
mutate(song2)
newkal <- new3kal %>%
mutate(song3)
# Create a data frame containing columns: father age, participant age, song type
datafra <- data.frame(dadage = c(newpot$dad, newwhen$dad, newkal$dad),
pptage = c(newpot$potato, newwhen$when64, newkal$kalimba),
song = rep(c(newpot$song1, newwhen$song2, newkal$song3)))
# Convert song type column from a character to a factor for statistical analyses
factor_song <- as.factor(datafra$song)
class(factor_song)
# Replace the character column with the new factor column in a data frame
df1 <- datafra
df1$factor_song <- factor_song
df1 <- df1[, -3]
```
## Data Preparation
The first step was to remove all the nuisance variables, leaving only the necessary ones for analysis- participant age, father age, potato, kalimba, when64. The song variables were coded as 1 or 0 depending on if they listened to that song or not. These variables were then recoded to only keep rows which were listened to (1) for each song. Ultimately, this left a dataframe with 3 columns; participant age, father age, and song listened to.
```{r echo=TRUE}
head(df1)
```
## Statistical analyses
```{r messages= FALSE, echo=TRUE, include =TRUE}
## Statistical analyses ##
# Descriptive statistics for participant and father age by song
descriptive <- df1 %>%
group_by(factor_song) %>%
summarise(mean_age = mean(pptage),
sd_age = sd(pptage),
mean_dad = mean(dadage),
sd_dad = sd(dadage))
summary(descriptive)
# ANCOVA model
# Response variable = participant age
# Group variable = song
# Covariate = father age
ancova_ppt <- aov(pptage ~ factor_song + dadage, data = df1)
ancova_pptage <- Anova(ancova_ppt, type="III")
summary(ancova_ppt)
summary(ancova_pptage)
# Test for Homogeneity
leveneTest(pptage~factor_song, data = df1)
# p=0.56, test was not significant so assumption met
# Test for Independence of covariate and group
m1 <- lm(pptage ~ factor_song + dadage, data=df1)
m2 <- lm(pptage ~ factor_song * dadage, data=df1)
anova(m1, m2)
# p=0.13, test was not significant so assumption met
# This ANCOVA meets statistical assumptions
# ANCOVA model
# Response variable = father age
# Group variable = song
# Covariate = participant age
ancova_dad <- aov(dadage ~ factor_song + pptage, data = df1)
ancova_dadage <- Anova(ancova_dad, type="III")
summary(ancova_dad)
summary(ancova_dadage)
# Test for Homogeneity
leveneTest(dadage~factor_song,data= df1)
# p=0.37, test was not significant so assumption met
# Test for Independence of covariate and group
n1 <- lm(dadage ~ factor_song + pptage, data=df1)
n2 <- lm(dadage ~ factor_song * pptage, data=df1)
anova(n1, n2)
# p=0.46, test was not significant so assumption met
# This ANCOVA meets statistical assumptions
# Post Hoc analyses on both ANCOVA models
# Anlayses within group differences for significance
posthoc_ppt <- glht(ancova_ppt, linfct = mcp(factor_song = "Tukey"))
summary(posthoc_ppt)
posthoc_dad <- glht(ancova_dad, linfct = mcp(factor_song = "Tukey"))
summary(posthoc_dad)
```
```{r Data wrangling of statistical analyses, echo=FALSE, include=FALSE, messages=FALSE}
## Data wrangling of statistical analyses results ##
# Extract p values from Post Hoc analyses
# Round to 3 significant figures
summary_posthoc_ppt <- summary(posthoc_ppt)
p_values_ppt <- summary_posthoc_ppt$test$pvalues
pptpvalues <- round(p_values_ppt, 3)
summary_posthoc_dad <- summary(posthoc_dad)
p_values_dad <- summary_posthoc_dad$test$pvalues
dadpvalues <- round(p_values_dad, 3)
# Calculate mean age of participant per song
mean_age_ppt <- aggregate(df1$pptage, by=list(df1$factor_song), FUN=mean)
# Create columns containing the difference in mean age between songs
kalpot <- 21.17169 - 20.57143
whenkal <- 21.17169 - 20.33524
whenpot <- 20.57143 - 20.33524
songdiffs <- c(kalpot, whenkal, whenpot)
songs <- c("Kalimba-Potato", "Kalimba-When", "Potato-When")
# Add columns to a data frame for participant age
mean_age_ppt <- mean_age_ppt %>%
mutate(songdiffs) %>%
mutate(songs)
# Calculate mean age of father per song
mean_age_dad <- aggregate(df1$dadage, by=list(df1$factor_song), FUN=mean)
# Create columns containing the difference in mean age between songs
kalpot2 <- 55.07143 - 49.88889
whenkal2 <- 52.09091 - 49.88889
whenpot2 <- 55.07143 - 52.09091
songdiffs2 <- c(kalpot2, whenkal2, whenpot2)
# Add columns to a data frame for father age
mean_age_dad <- mean_age_dad %>%
mutate(songdiffs2) %>%
mutate(songs)
# Create columns for whether the age is of father or participant,
# which songs the difference comes from,
# the difference in mean age for participant and father,
# and the p values for each within group difference from participant and father
Who<- c('Participant','Participant', 'Participant', 'Father','Father','Father')
songminussong <- c('Kalimba-Potato','Kalimba-When64','Potato-When64',
'Kalimba-Potato','Kalimba-When64','Potato-When64')
songsdiffs <- c(mean_age_ppt$songdiffs, mean_age_dad$songdiffs2)
p_values <- c(pptpvalues,dadpvalues)
# Create a data frame combining these columns
df2 <- data.frame(songsdiffs,Who,songminussong,p_values)
```
```{r messages= FALSE, echo=FALSE, include =FALSE}
plot1 <- ggplot(df2, aes(x = songminussong, y = songsdiffs, color = Who)) +
geom_bar(stat = "identity", color = 'black') +
geom_text(aes(label = "" ))+
labs(x = "Post Hoc analysis of songs", y = "Difference in age in years",
title = "ANCOVA results of participant or father age by song listened to",
subtitle = "Covariate was father or participant age",
caption = "Source: Simmons et al., (2011)") +
guides(fill = none) +
theme(panel.background = element_rect(fill = 'white'),
axis.line = element_line(color = "grey"))+
scale_y_continuous(breaks = seq(0,6,1))
```
## Visualisation: Difference in age, because of song listened to, or covariate?
``` {r messages= FALSE, echo=FALSE, include =TRUE}
# Add the p values to the plot for participant and dad
# Save the plot
plot1+ annotate("text", x = c(1, 2, 3), y = c(6.1, 3.4, 3.5),
label = c("p = 0.011*", "p = 0.248", "p = 0.317"), color= 'firebrick2')+
annotate("text", x = c(1, 2, 3), y = c(0.9, 1.2, 0.5),
label = c("p = 0.118", "p = 0.254", "p = 0.867"), color= 'cyan2')
```
## Summary
This vizualisation shows that false significance can be created from nonesense variables through careful manipulation of statistical analyses. With more time, the study could have been stretched to include more data, such as from study 1 dataset of the original paper. It is limited in that it only shows data from 42 participants, with each group having an unequal number of participants, but this information is not displayed on the graph. This kind of data being shown would have enriched the vizualisation by further expressing how statistical analyses can hide the meaningful origins of the data. Future research could investigate the effect of unequal group size on post hoc comparisons in ANCOVAs.