---
output: rmarkdown::github_document
---
# CookBook
```{r warning=FALSE}
library(readxl)
library(ggplot2)
```
## Statistical Inference
Notes on statistical inference, compiled while learning statistics for data science.
## Estimation
### Confidence Interval for the true mean
**Assumptions:**
1. Random Sampling from DGP
2. Stability in DGP
3. If n < 30, the population must be normally distributed
```{r}
Congress <- read_excel("Piracy.xlsx", col_names = TRUE)
# Given summary statistics for a sample of size 12
x_bar <- 30.7833
s <- 1.7862
n <- 12
se <- s / sqrt(n)                       # standard error of the mean
lower <- x_bar + qt(0.05, n - 1) * se   # qt(0.05, df) is negative
higher <- x_bar - qt(0.05, n - 1) * se
lower
higher
```
### Confidence Interval estimation for proportion
**Assumptions:**
1. Random Sampling from DGP
2. Stability in DGP
```{r}
poll <- read_excel('Poll.xlsx', col_names = TRUE)
table(poll)
```
#### Wilson score interval method - generally preferred
prop.test can be used for testing the null that the proportions (probabilities of success) in several groups are the same, or that they equal certain given values. For a single sample it also reports a Wilson score confidence interval for the proportion.
```{r}
prop.test(358, (358+407), conf.level = .95)
```
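For intuition, here is a minimal manual sketch of the simpler Wald (normal-approximation) interval, using the same counts:
```{r}
p_hat <- 358 / (358 + 407)                        # sample proportion
se <- sqrt(p_hat * (1 - p_hat) / (358 + 407))     # standard error of p_hat
p_hat + c(-1, 1) * qnorm(0.975) * se              # 95% Wald interval
```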
#### Exact binomial estimation
binom.test performs an exact test of a simple null hypothesis about the probability of success in a Bernoulli experiment.
The hypothesized probability of success is provided as a parameter, 0.5 in this case.
```{r}
binom.test(358, (358+407), 0.5, conf.level = .9)
```
### Calculating CI directly from sample
The file WaitTime.xlsx contains data of wait times (in minutes) for a random sample of a
bank’s customers. Find a 90% confidence interval for the mean wait time at the bank.
**t.test** - Performs one and two sample t-tests on vectors of data and also gives a CI.
There are two kinds of hypotheses for a one sample t-test: the null hypothesis and the alternative hypothesis. The alternative hypothesis assumes that some difference exists between the true mean (μ) and the comparison value (m0), whereas the null hypothesis assumes that no difference exists. The comparison value (m0) can be specified with the mu parameter below; the default is 0. The value of mu does not affect the CI, as the CI estimates the true mean from the given sample alone.
```{r}
waiting = read_excel('WaitTime.xlsx')
t.test(waiting$WaitTime, mu = 0, conf.level = .90)
t.test(waiting$WaitTime, mu = mean(waiting$WaitTime), conf.level = .90)
```
An equivalent alternative to t.test() is the manual calculation below.
```{r}
sample_mean <- mean(waiting$WaitTime)
sample_sd <- sd(waiting$WaitTime)
n <- length(waiting$WaitTime)
se <- sample_sd / sqrt(n)          # standard error of the mean
sample_mean + qt(0.05, n - 1) * se
sample_mean - qt(0.05, n - 1) * se
```
**Interpretation**: There is a 90% chance that the interval we got captures the true mean waiting time.
## Hypothesis Testing
### Errors
Rejecting the null hypothesis when it is true -> Type 1
Failing to reject the null hypothesis when the alternative is true -> Type 2
The probability of making a Type 1 error is alpha (accepting Ha when Ho is true)
The probability of making a Type 2 error is beta (failing to reject Ho when Ha is true)
The closer the true mean is to the hypothesized mean, the higher the value of beta.
We can only control the Type 1 error, through the desired level of significance. Hence, design the hypotheses such that the error we want to control is the Type 1 error.
**Statistician's dodge:** Since we cannot control the Type 2 error (i.e., accepting Ho when Ha is true), never draw the conclusion of accepting Ho; instead conclude that we do not reject Ho, i.e., we failed to reject the null hypothesis.
The p-value is the probability of getting a sample result at least as extreme as the one observed if Ho were true.
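As a minimal simulation sketch (all values below are made up for illustration), beta can be estimated by repeatedly testing samples drawn from a known DGP; note how beta grows as the true mean approaches the hypothesized one:
```{r}
# Hypothetical setup: Ho says mu >= 5; estimate beta = P(fail to reject Ho).
set.seed(42)
beta_for <- function(true_mean, mu0 = 5, n = 25, sigma = 1, alpha = 0.05) {
  rejected <- replicate(2000, {
    s <- rnorm(n, true_mean, sigma)
    t.test(s, mu = mu0, alternative = "less")$p.value < alpha
  })
  1 - mean(rejected)  # proportion of samples in which Ho was not rejected
}
beta_for(4.2)  # true mean far from 5 -> low beta
beta_for(4.8)  # true mean close to 5 -> high beta
```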
### T - Test
The one sample t-test is a statistical procedure used to determine whether a sample of observations could have been generated by a process with a specific mean. Suppose you are interested in determining whether the waiting time at a clinic is less than 5 minutes. To test this hypothesis, you could collect a sample of waiting times and compare the sample with the value five using a one-sample t-test.
**Manually calculating the T value using the formula**
$$
t = \frac{\bar{x} - \mu_0}{SE} \\
~\\
SE = \frac{s}{\sqrt{n}} \\
~\\
\text{where, }~~~df = n - 1
$$
The p-value of a t statistic can then be calculated as
```{r}
se <- sd(waiting$WaitTime) / sqrt(length(waiting$WaitTime))
t <- (mean(waiting$WaitTime) - 5) / se   # test statistic for mu0 = 5
df <- length(waiting$WaitTime) - 1
pt(t, df)   # lower-tail p-value, for Ha: true mean < 5
```
**Assumptions:**
1. Random Sampling from DGP
2. Stability in DGP
3. If n < 30, the population must be normally distributed
4. Null hypothesis is true
Test whether the waiting time is less than 5 minutes:
Null hypothesis -> true mean is greater than or equal to 5
Alt. hypothesis -> true mean is less than 5
```{r}
t.test(waiting$WaitTime, alternative = 'less', mu=5)
```
The likelihood of getting a sample statistic as extreme as 1.85 is 0.966 if the true mean waiting time is greater than or equal to 5.
The **significance level** is the probability of obtaining a result as extreme as, or more extreme than, the result actually obtained when the null hypothesis is true.
The significance level and the confidence level are complementary: a 90% confidence level corresponds to a 10% significance level.
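For instance, with 12 observations (df = 11, as in the estimation example above), the t cutoff behind a 90% CI is the same one a two-sided test at the 10% significance level uses:
```{r}
qt(1 - 0.10 / 2, df = 11)  # critical t value shared by both procedures
```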
### Testing normality
Test normality of randomly generated numbers.
**Assumptions:**
1. Random Sampling from DGP
2. Stability in DGP
#### Using histogram
```{r}
x <- rnorm(50, 5, 1)  # 50 draws from Normal(mean = 5, sd = 1)
hist(x)
```
#### Using qq plot
```{r}
qqnorm(x)
qqline(x, col = "blue")
```
#### Using Shapiro Wilk test
Null hypothesis -> distribution is normal
Alt. hypothesis -> distribution is not normal
```{r}
shapiro.test(x)
```
With the p-value this high, we fail to reject the null hypothesis.
Hence, our assumption that the distribution is normal is not rejected.
Definition of p-value: Given a null hypothesis and sample evidence of size n, the p-value is the probability of getting sample evidence that is equally or more unfavorable to the null hypothesis while the null hypothesis is actually true.
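A small simulation sketch of this definition (illustrative values): when the null hypothesis is actually true, p-values are roughly uniform, so a p-value below alpha occurs with probability about alpha.
```{r}
set.seed(7)
# Draw many normal samples (so the Ho of normality is true) and collect p-values.
pvals <- replicate(2000, shapiro.test(rnorm(30))$p.value)
mean(pvals < 0.05)  # should be close to 0.05
```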
## Tables
### Chi-Square distribution
The Chi-Square distribution is the distribution of the sum of squared standard normal deviates. The degrees of freedom of the distribution equal the number of standard normal deviates being summed.
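A quick simulation sketch of this fact (illustrative values):
```{r}
set.seed(1)
k <- 3  # degrees of freedom = number of squared standard normals summed
sims <- replicate(10000, sum(rnorm(k)^2))
mean(sims <= 5)    # empirical P(X <= 5)
pchisq(5, df = k)  # theoretical P(X <= 5)
```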
The probability of getting a Chi-Square value less than or equal to 5 (with 1 degree of freedom) is
```{r}
pchisq(5,1)
```
The Chi-Square value associated with a cumulative probability of 0.9746527 is
```{r}
qchisq(0.9746527,1)
```
*The **chi-squared test** is used to determine whether there is a significant difference between the expected frequencies and the observed frequencies in one or more categories.*
### One-Sample Goodness of fit test (Using Chi Square distribution) / Multinomial Test
The Chi-Square goodness of fit test is a non-parametric test used to find out whether the observed values of a given phenomenon differ significantly from the expected values. The term goodness of fit refers to comparing the observed sample distribution with the expected probability distribution (see the sketch after the assumptions below).
Null hypothesis -> There is no significant difference between the observed and the expected values.
Alt. hypothesis -> There is a significant difference for at least one category.
**Assumptions:**
1. Random Sampling from DGP
2. Stability in DGP
3. Expected value for each group is at least 5
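A minimal one-sample sketch (made-up die-roll counts, not from the course data): chisq.test can compare observed counts against hypothesized proportions.
```{r}
observed <- c(18, 22, 16, 14, 19, 31)  # hypothetical die-roll counts
chisq.test(observed, p = rep(1/6, 6))  # test against a fair die
```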
```{r}
gas<- read_excel("Gasoline.xlsx", col_names=TRUE)
head(gas)
table(gas$`Second-last`, gas$Last)
result <- chisq.test(table(gas$`Second-last`, gas$Last))
result
```
Check that the expected values are greater than or equal to 5
```{r}
result$expected
```
Cell-by-cell contributions to observed Chi-Square value
```{r}
(result$residuals)^2
```
## Relationships
Null hypothesis -> No Relationship
Alt. hypothesis -> Relationship
Looking at relationships between variables to understand and predict what is going on.
### Twoway Table
Response | Predictor
------------- | -------------
Nominal | Nominal
Ho: No relationship exists; probabilistic independence
Ha: Relationship exists; probabilistic dependence
Ex. Transit Railroads is interested in the relationship between travel distance and the ticket class purchased.
A random sample of 200 passengers is taken.
```{r}
transit<-read_excel("Transit.xlsx", na="NA", col_names = TRUE)
table(transit)
result <- chisq.test(table(transit))
result
# cell-by-cell contributions to observed Chi-Square value: ((O-E)^2)/E
(result$residuals)^2
```
### Oneway ANOVA
Response | Predictor
------------- | -------------
Interval | Nominal
**F - Distribution**: Generally arises from a statistic that involves a ratio of variances. The pf and qf functions in R handle the F distribution, as sketched below.
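A quick usage sketch, with 2 and 15 degrees of freedom to match the ANOVA example further down:
```{r}
pf(3.68, df1 = 2, df2 = 15, lower.tail = FALSE)  # upper-tail probability
qf(0.95, df1 = 2, df2 = 15)                      # 5% critical value (~3.68)
```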
The F statistic is the value we receive when we run an ANOVA test on different groups to understand the differences between them. It is given by the ratio of between-group variability to within-group variability.
If there are two predictors, it becomes a Twoway ANOVA.
Oneway ANOVA is a test of the relationship between an interval-level response variable and a nominal-level predictor variable. It can also be expressed as a test of multiple means.
**Assumptions:**
1. Random Sampling from DGP
2. Stability in DGP
3. For each level of the predictor variable, if n < 30 then the response must be normally distributed
4. Standard deviation is the same for all groups (evaluate with Levene's test)
Ho: No relationship exists; distribution is identical across all categories
Ha: Relationship exists; difference in means across at least two categories
```{r}
chol<- read_excel("Cholesterol.xlsx", col_names=TRUE)
head(chol)
fit <- aov(CholReduction ~ Drug, data=chol)
summary(fit)
```
DF = (k-1), (n-k) = 2, 15
Mean Sq gives MST (above) and MSE (below)
Sum Sq gives SST (above) and SSE (below)
The probability of getting a sample statistic as or more extreme than 40.79 is the p-value, which is extremely low. Hence, there is a significant relationship between the means of at least two drugs.
```{r}
boxplot(CholReduction ~ Drug, data=chol)
```
Use the Tukey HSD test to determine which pairs of drugs differ.
#### Tukey HSD (Honest Significance Difference) test
The studentised range distribution is used to test each pairwise difference.
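R exposes this distribution through ptukey and qtukey; for example, the 10% critical value for 3 means and 15 error degrees of freedom (matching the drug example and conf.level = .9 below):
```{r}
qtukey(0.90, nmeans = 3, df = 15)  # studentised range critical value
```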
**Assumptions:** Same as ANOVA
1. Random Sampling from DGP
2. Stability in DGP
3. For each level of the predictor variable, if n < 30 then the response must be normally distributed
4. Standard deviation is the same for all groups (evaluate with Levene's test)
Ho: No relationship exists: μi = μj for each i, j being checked
Ha: Relationship exists: μi ≠ μj
```{r}
TukeyHSD(fit, conf.level = .9)
```
For example, here B is greater than A by 15.4.
The probability of getting a sample difference of 15.4 or more extreme if Ho were true is 0.00028.
Hence, all three of them differ significantly.
#### Levene's test
Used whenever we wish to determine whether variances are equal across different groups.
**Assumptions:**
1. Random Sampling from DGP
2. Stability in DGP
Ho: σ1 = σ2 = … = σk ; k independent samples/sets of responses
Ha: At least one pair is not equal
```{r warning=FALSE}
library('car')
leveneTest(chol$CholReduction, chol$Drug)
```
### Linear Regression
Response | Predictor
------------- | -------------
Interval | At least one Interval
**Assumptions:**
1. Random Sampling from DGP
2. Stability in DGP
3. εi ~ Normal(mean 0, SD σ), with σ constant over all possible values of the predictors. We can use the following to evaluate this assumption:
+ Standardized Residuals are normally distributed
+ Mean 0, SD fixed across predictors. Plot St. residual plots for individual predictors (Xi) and predicted values (Y hat)
We are interested in predicting the amount of carbohydrates (in grams) for a menu item based on its calorie content (measured in 100s).
```{r}
starbucks <- read_excel('starbucks.xlsx')
head(starbucks)
linefit <- lm(carb ~ calories, data = starbucks)
plot(starbucks$calories, starbucks$carb)
abline(linefit, lty = 2, col = "blue")
```
#### Testing the model as a whole
Ho: No relationship exists; β1 = β2 = … = βk = 0
Ha: Relationship exists; at least one predictor has βi ≠ 0
ANOVA is used to test this:
```{r}
anova(linefit)
```
#### Testing Individual Predictors
Ho: No relationship exists; βi = 0, 1 ≤ i ≤ k
Ha: Relationship exists; βi ≠ 0
For each predictor, a two-tailed t-test is used:
```{r}
summary(linefit)
```
#### Assumption Evaluation
##### Standardized residuals normality check
**Standardized residuals plot**
```{r}
linefit.stres <- rstandard(linefit)
plot(starbucks$calories, linefit.stres, pch = 16, main = "Standardized Residual Plot", xlab = "Calories", ylab = "Standardized Residuals")
abline(0, 0, lty = 2, col = "blue")
```
**Normal probability plot (QQPlot)**
```{r}
qqnorm(linefit.stres, main = "Normal Probability Plot", xlab = "Normal Scores", ylab = "Standardized Residuals")
qqline(linefit.stres, col = "blue")
```
**Shapiro Wilk Normality test**
```{r}
shapiro.test(linefit.stres)
```
#### Standardized Residual Plot - on fitted values
```{r}
plot(linefit$fitted.values, linefit.stres, pch = 16, main = "Standardized Residual Plot", xlab = "Fitted Carbs", ylab = "Standardized Residuals")
abline(0,0, lty=2, col="red")
```
#### Standardized Residual Plot - on calories
```{r}
plot(starbucks$calories, linefit.stres, pch = 16, main = "Standardized Residual Plot", xlab = "Calories", ylab = "Standardized Residuals")
abline(0, 0, lty = 2, col = "red")
```
#### Confidence Interval for Estimates (βi)
```{r}
confint(linefit, level=.9)
```
#### Confidence Interval Estimation of Mean of Y
```{r}
predict(linefit,data.frame(calories=4.5), interval ="confidence", level = .80)
```
#### Prediction Interval Estimation of Y
A prediction interval for an individual response is wider than the confidence interval for the mean, since it also accounts for the scatter of individual observations.
```{r}
predict(linefit, data.frame(calories = 4.5), interval = "prediction", level = .80)
```