---
title: "P8105_hw5_mbc2178"
author: "Melvin Coleman"
date: "`r format(Sys.time(), '%Y-%m-%d')`"
output: github_document
---
```{r setup, include=FALSE, message=FALSE, warning=FALSE}
library(tidyverse)
library(dplyr)
library(stringr)
library(plotly)

set.seed(1)

knitr::opts_chunk$set(
  echo = TRUE,
  warning = FALSE,
  fig.width = 6,
  fig.asp = .6,
  out.width = "90%"
)

theme_set(theme_minimal() + theme(legend.position = "bottom"))

options(
  ggplot2.continuous.colour = "viridis",
  ggplot2.continuous.fill = "viridis"
)

scale_colour_discrete = scale_colour_viridis_d
scale_fill_discrete = scale_fill_viridis_d
```
### Functions
Let's define a function that uses `prop.test` to estimate a proportion with its 95% confidence interval, returning results on the percentage scale. This function will be used in Problem 2.
```{r}
# Prop.test function
prop_fx = function(x, n) {
  prop.test(x, n) %>%
    broom::tidy() %>%
    select(estimate, conf.low, conf.high) %>%
    mutate(
      estimate = round(estimate * 100, 2),
      conf.low = round(conf.low * 100, 2),
      conf.high = round(conf.high * 100, 2))
}
```
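As a quick sanity check, we can call the function directly. The counts below are made up for illustration and are not taken from the homework data.

```{r}
# Hypothetical counts: 95 unsolved homicides out of 300 total
prop_fx(x = 95, n = 300)
```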
Let's define a function that draws a random sample, performs a one-sample t-test of the null hypothesis that the mean equals 0 (alpha = 0.05), and returns the estimated mean and corresponding p-value. We use `broom::tidy` to clean the output and round values to 3 digits. This function will be used in Problem 3.
```{r}
# T-test function: simulate n = 30 draws from N(true_mu, sd = 5) and test H0: mu = 0
ttest_fx = function(true_mu) {
  sample = rnorm(30, mean = true_mu, sd = 5)
  
  t.test(x = sample, mu = 0) %>%
    broom::tidy() %>%
    select(estimate, p.value) %>%
    # Round estimate & p-value to 3 digits
    mutate(
      estimate = round(estimate, 3),
      p.value = round(p.value, 3)
    )
}
```
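As a quick check (the exact values will vary from run to run), a single call returns a one-row tibble with the estimated mean and p-value. The object name below is only for illustration.

```{r}
# One simulated dataset with true mean 0
example_test = ttest_fx(true_mu = 0)
example_test
```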
### Problem 1
Let's import the data stored in individual csv files in `data/problem_1`. We create a dataframe with the list of all files, `map` over each file's path to import the data with `read_csv`, and `unnest` the result.
```{r message=FALSE, warning=FALSE}
study_df = 
  tibble(
    files = list.files("data/problem_1/"),
    path = str_c("data/problem_1/", files)
  ) %>%
  mutate(data = map(path, read_csv)) %>%
  unnest(data)
```
We tidy the resulting dataset and select the variables we need.
```{r}
tidy_study = 
  study_df %>%
  mutate(
    files = str_replace(files, ".csv", ""),
    group = str_sub(files, 1, 3)) %>%
  pivot_longer(
    week_1:week_8,
    names_to = "week",
    values_to = "outcome",
    names_prefix = "week_") %>%
  mutate(week = as.numeric(week)) %>%
  select(group, subjects = files, week, outcome)
```
Let's plot the data for each individual over time, faceted by group.
```{r}
tidy_study %>%
  ggplot(aes(x = week, y = outcome, group = subjects, color = group)) +
  geom_point() +
  geom_path() +
  facet_grid(~group)
```
### Problem 2
Let's examine the Washington Post's data on homicides in 50 large U.S. cities.
We will refer to this raw dataset as `homicide_df`.
```{r}
homicide_df = 
  read_csv('data/homicide-data.csv', col_names = TRUE)
```
The `homicide_df` dataset contains data from 50 large U.S. cities, with `r ncol(homicide_df)` variables and `r nrow(homicide_df)` observations. It includes the first and last names and race of homicide victims (`victim_last`, `victim_first`, `victim_race`), the date each homicide was reported (`reported_date`), and the status of each case (`disposition`): closed by arrest, closed without arrest, or open with no arrest. The dataset also contains the longitude (`lon`) and latitude (`lat`) of each homicide location.
Let's perform some cleaning and manipulation on our dataset. We first create a new variable, `city_state`, that combines the city and state names.
```{r}
homicide_df = 
  homicide_df %>%
  janitor::clean_names() %>%
  mutate(
    city_state = str_c(city, ',', state))
```
We now summarize within cities to obtain the total number of homicides in each city. We create a new dataset called `tot_homicides` containing just two variables: `city_state`, identifying each city and state, and `tot_number_homicides`, the total number of homicides per city.
```{r}
tot_homicides = 
  homicide_df %>%
  group_by(city_state) %>%
  summarise(tot_number_homicides = n()) %>%
  arrange(desc(tot_number_homicides))
```
From our analysis, Chicago has the largest number of total homicides in the United States, with 5,535 homicides. Tulsa,AL has the fewest, with just 1 homicide. This is most likely a data entry error: there is no city named Tulsa in Alabama, and it is implausible that a city would record only a single homicide in total.
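As a quick sanity check on the claims above, we can look at the first and last rows of the sorted dataset:

```{r}
# Cities with the most and fewest recorded homicides
tot_homicides %>% slice_head(n = 1)
tot_homicides %>% slice_tail(n = 1)
```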
We now summarize within cities to obtain the total number of unsolved homicides (those
for which the disposition was "Closed without arrest" or "Open/No arrest") across the
U.S. from our dataset.
We replace missing values with 0 for cities that did not have any cases closed without arrest and/or open with no arrest. We use `pivot_wider` to create a table that is easier to read, with one column per disposition status, and create a new variable, `total_unsolved_homicides`, that sums the two disposition counts to give the total unsolved homicides per city. We call this new dataset `unsolved_homicides`.
```{r}
unsolved_homicides = 
  homicide_df %>%
  group_by(city_state, disposition) %>%
  filter(disposition %in% c('Closed without arrest', 'Open/No arrest')) %>%
  summarize(n = n()) %>%
  mutate(
    city_state = as.factor(city_state)) %>%
  arrange(desc(n)) %>%
  pivot_wider(
    names_from = disposition,
    values_from = n
  ) %>%
  janitor::clean_names() %>%
  replace(is.na(.), 0) %>%
  mutate(total_unsolved_homicides = open_no_arrest + closed_without_arrest)
```
From our analysis, Chicago has the largest number of unsolved homicides in the US (4,073) with
3,686 cases still open without any arrest and 387 cases closed without arrest. On
the other hand, Tampa has the least number of unsolved homicides in the US (95) with
87 cases still open without arrest and 8 cases closed without arrest.
We now merge the two datasets created above, `tot_homicides` and `unsolved_homicides`, giving one row per city for 50 US cities (Tulsa,AL is omitted from the merged dataset). This allows us to perform additional manipulations and answer specific questions using the variables created above. We keep the following variables in the newly created dataset, `homicide_mrg`: `city_state`, `total_unsolved_homicides`, and `tot_number_homicides`.
```{r}
homicide_mrg = 
  merge(unsolved_homicides, tot_homicides) %>%
  select(city_state, total_unsolved_homicides, tot_number_homicides) %>%
  arrange(desc(tot_number_homicides))
```
Now we examine the proportion of homicides that are unsolved in the city of Baltimore using the `prop.test` function. We apply the function defined above with `purrr::map2`, creating a list column, `homicide_prop`, that holds the proportion estimate and confidence interval, and then `unnest` it to produce a tidy dataframe. The output is saved as `prop_baltimore`.
```{r}
prop_baltimore = 
  homicide_mrg %>%
  filter(city_state %in% c('Baltimore,MD')) %>%
  mutate(
    homicide_prop = 
      map2(.x = total_unsolved_homicides, .y = tot_number_homicides, .f = prop_fx)) %>%
  unnest(homicide_prop) %>%
  select(estimate, conf.low, conf.high)

prop_baltimore %>%
  rename('Lower level CI' = conf.low,
         'Upper Level CI' = conf.high,
         Proportion = estimate)
```
In the city of Baltimore, MD, `r prop_baltimore %>% pull(estimate)`% of homicides are unsolved, with a 95% confidence interval of (`r prop_baltimore %>% pull(conf.low)`%, `r prop_baltimore %>% pull(conf.high)`%).
Now we run `prop.test` for each city in our dataset and extract both the proportion of unsolved homicides and the confidence interval for each city. We apply the function created above with `purrr::map2` and `unnest` to produce a tidy dataframe; the `homicide_prop` variable is a list column containing the proportion estimate and confidence interval for each city. The output is saved as `prop_cities`.
```{r}
prop_cities = 
  homicide_mrg %>%
  mutate(
    homicide_prop = 
      map2(.x = total_unsolved_homicides, .y = tot_number_homicides,
           .f = ~prop_fx(x = .x, n = .y))) %>%
  unnest(homicide_prop) %>%
  select(city_state, estimate, conf.low, conf.high)

prop_cities %>%
  rename('Lower level CI' = conf.low,
         'Upper Level CI' = conf.high,
         Proportion = estimate,
         City = city_state)
```
Let's create a plot that shows the estimate for each city, with error bars based on the upper and lower limits of the confidence interval. Cities are ordered by the proportion of unsolved homicides.
```{r}
cities_plot = 
  prop_cities %>%
  mutate(city_state = fct_reorder(city_state, estimate)) %>%
  ggplot(aes(x = city_state, y = estimate)) +
  geom_bar(stat = "identity", color = "black", fill = "lightblue",
           position = position_dodge()) +
  scale_y_continuous(breaks = c(0, 20, 40, 60, 80, 100)) +
  geom_errorbar(aes(ymin = conf.low, ymax = conf.high), width = .2,
                position = position_dodge(.9)) +
  labs(title = "Proportion of unsolved homicides across 50 US cities") +
  xlab("Cities") +
  ylab("Proportions") +
  theme(axis.text.x = element_text(angle = 60, hjust = 1))

cities_plot
```
### Problem 3
In this problem, we conduct a simulation to explore power in a one-sample t-test. We defined a function above with fixed design elements: sample size (n) = 30 and sigma (sd) = 5.

First, we set the true mean to 0 and generate 5,000 normally distributed datasets, building a tidy dataframe with `expand_grid`, `map`, and `unnest`. For each dataset we perform a t-test of the null hypothesis that the mean is 0 at the 0.05 significance level, use `broom::tidy` to clean the output, and save the estimated mean and corresponding p-value.
```{r, cache=TRUE}
sim_results_zero_df = 
  expand_grid(
    true_mu = 0,
    iter = 1:5000
  ) %>%
  mutate(
    estimate_df = 
      map(.x = true_mu, ~ttest_fx(true_mu = .x))
  ) %>%
  unnest(estimate_df)
```
We repeat the same procedure for true means of 1, 2, 3, 4, 5, and 6.
```{r}
sim_multiple_df = 
  expand_grid(
    true_mu = 1:6,
    iter = 1:5000
  ) %>%
  mutate(
    estimate_df = 
      map(.x = true_mu, ~ttest_fx(true_mu = .x))
  ) %>%
  unnest(estimate_df)
```
Now, we make a plot showing the proportion of times the null was rejected (the power of the test) on the y-axis and the true value of the mean (`true_mu`) on the x-axis. A new variable, `power`, is created with `mutate` and `ifelse` to indicate whether the null was rejected. The number of rejected tests is divided by the number of simulations (5,000) and multiplied by 100 to obtain the percentage rejected.
```{r}
null_plot = 
  sim_multiple_df %>%
  mutate(
    power = ifelse(p.value < 0.05, "rejected", "not_rejected"),
    true_mu = as.factor(true_mu)
  ) %>%
  group_by(true_mu, power) %>%
  count(power) %>%
  pivot_wider(
    names_from = power,
    values_from = n
  ) %>%
  mutate(
    percent_rejected = (rejected / 5000) * 100
  ) %>%
  ggplot(aes(x = true_mu, y = percent_rejected)) +
  geom_bar(stat = 'identity', color = "lightblue", fill = "grey",
           position = position_dodge()) +
  labs(title = "Effect size and Power") +
  xlab("True Mean") +
  ylab("Power (%)")

null_plot
```
From the graph above, as the true mean increases, the proportion of times the null is rejected also increases. From the simulation, we can conclude that as effect size increases, so does power. Furthermore, beyond a true mean of about 5 to 6, the power plateaus near 100%, so further increases in effect size no longer change the rejection rate.
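The same pattern can be read from a small summary table: within each true mean, the proportion of p-values below 0.05 is the estimated power. This is an equivalent check, not part of the plotting code above.

```{r}
# Estimated power: proportion of simulations rejecting H0 at alpha = 0.05
sim_multiple_df %>%
  group_by(true_mu) %>%
  summarize(power = mean(p.value < 0.05))
```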
Now, we make a plot showing the average estimate of the mean on the y-axis and the true value of the mean on the x-axis, grouping by the true mean (`true_mu`) and averaging the estimates.

In the same code chunk, we make a second plot showing the average estimate only for samples in which the null was rejected. For this plot, we use `ifelse` inside `mutate` to create a variable, `power`, indicating whether the null was rejected based on the p-value, filter to the rejected samples, and then apply the same summarization used for the first plot.
```{r}
## Plot 1
sim1_plot = 
  sim_multiple_df %>%
  mutate(
    true_mu = as.factor(true_mu)
  ) %>%
  group_by(true_mu) %>%
  summarize(n_obs = n(),
            avg_mean = mean(estimate)) %>%
  ggplot(aes(x = true_mu, y = avg_mean, color = true_mu)) +
  geom_point() +
  labs(title = "Estimate of mean vs True mean") +
  xlab("True mean") +
  ylab("Estimate of mean") +
  theme(legend.position = "none")

## Plot 2
sim2_plot = 
  sim_multiple_df %>%
  mutate(
    power = ifelse(p.value < 0.05, "rejected", "not rejected"),
    true_mu = as.factor(true_mu)
  ) %>%
  filter(power %in% c("rejected")) %>%
  group_by(true_mu) %>%
  summarize(n_obs = n(),
            avg_mean = mean(estimate)) %>%
  ggplot(aes(x = true_mu, y = avg_mean, color = true_mu)) +
  geom_point() +
  labs(title = "Estimate of mean vs True mean (Samples null rejected Only)") +
  xlab("True mean") +
  ylab("Estimate of mean") +
  theme(legend.position = "none")

## Output plots
sim1_plot
sim2_plot
```
The sample average of the estimated mean across tests for which the null is rejected is approximately equal to the true mean only for true means of 3, 4, 5, and 6. When the effect size is small, only samples with unusually large estimates reject the null, so the average among rejected samples overestimates the true mean. As the effect size grows, nearly all samples reject the null, and the average among rejected samples approaches the true mean. In addition, with a large effect size we expect to reject the null hypothesis more often than with a small effect size.
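A compact numeric check of this claim (not part of the plots above) compares, for each true mean, the average estimate across all samples with the average among rejected samples only:

```{r}
# Average estimate overall vs. among samples where H0 was rejected
sim_multiple_df %>%
  group_by(true_mu) %>%
  summarize(
    avg_all = mean(estimate),
    avg_rejected = mean(estimate[p.value < 0.05])
  )
```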