---
output: rmarkdown::github_document
---
# CookBook
```{r warning=FALSE}
library(readxl)
library(ggplot2)
```
## Statistical Inference
Notes on statistical inference, compiled while learning statistics for data science.
## Estimation
### Confidence Interval for the true mean
**Assumptions:**
1. Random Sampling from DGP
2. Stability in DGP
3. If n < 30, the population must be normally distributed
```{r}
Congress <- read_excel("Piracy.xlsx", col_names = TRUE)
# Given summary statistics for a sample of size 12
x_bar <- 30.7833
s <- 1.7862
n <- 12
se <- s / sqrt(n)                       # standard error of the mean
lower <- x_bar + qt(0.05, n - 1) * se   # qt(0.05, df) is negative
higher <- x_bar - qt(0.05, n - 1) * se
lower
higher
```
### Confidence Interval estimation for proportion
**Assumptions:**
1. Random Sampling from DGP
2. Stability in DGP
```{r}
poll <- read_excel('Poll.xlsx', col_names = TRUE)
table(poll)
```
#### Wilson score interval method - generally preferred
prop.test can be used for testing the null that the proportions (probabilities of success) in several groups are the same, or that they equal certain given values. For a single sample it also reports a Wilson score confidence interval for the proportion.
```{r}
prop.test(358, (358+407), conf.level = .95)
```
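For intuition, here is a minimal manual sketch of the simpler Wald (normal-approximation) interval, using the same counts:
```{r}
p_hat <- 358 / (358 + 407)                        # sample proportion
se <- sqrt(p_hat * (1 - p_hat) / (358 + 407))     # standard error of p_hat
p_hat + c(-1, 1) * qnorm(0.975) * se              # 95% Wald interval
```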
#### Exact binomial estimation
binom.test performs an exact test of a simple null hypothesis about the probability of success in a Bernoulli experiment.
The hypothesized probability of success is provided as a parameter, 0.5 in this case.
```{r}
binom.test(358, (358+407), 0.5, conf.level = .9)
```
### Calculating CI directly from sample
The file WaitTime.xlsx contains data of wait times (in minutes) for a random sample of a
bank’s customers. Find a 90% confidence interval for the mean wait time at the bank.
**t.test** - Performs one and two sample t-tests on vectors of data and also gives a CI.
There are two kinds of hypotheses for a one sample t-test: the null hypothesis and the alternative hypothesis. The alternative hypothesis assumes that some difference exists between the true mean (μ) and the comparison value (m0), whereas the null hypothesis assumes that no difference exists. The comparison value (m0) can be specified with the mu parameter below; the default is 0. The value of mu does not affect the CI, as the CI estimates the true mean from the given sample alone.
```{r}
waiting = read_excel('WaitTime.xlsx')
t.test(waiting$WaitTime, mu = 0, conf.level = .90)
t.test(waiting$WaitTime, mu = mean(waiting$WaitTime), conf.level = .90)
```
An equivalent alternative to t.test() is the manual calculation below.
```{r}
sample_mean <- mean(waiting$WaitTime)
sample_sd <- sd(waiting$WaitTime)
n <- length(waiting$WaitTime)
se <- sample_sd / sqrt(n)          # standard error of the mean
sample_mean + qt(0.05, n - 1) * se
sample_mean - qt(0.05, n - 1) * se
```
**Interpretation**: There is a 90% chance that the interval we got captures the true mean waiting time.
## Hypothesis Testing
### Errors
Rejecting the null hypothesis when it is true -> Type 1
Failing to reject the null hypothesis when the alternative is true -> Type 2
The probability of making a Type 1 error is alpha (accepting Ha when Ho is true)
The probability of making a Type 2 error is beta (failing to reject Ho when Ha is true)
The closer the true mean is to the hypothesized mean, the higher the value of beta.
We can only control the Type 1 error, through the desired level of significance. Hence, design the hypotheses such that the error we want to control is the Type 1 error.
**Statistician's dodge:** Since we cannot control the Type 2 error (i.e., accepting Ho when Ha is true), never draw the conclusion of accepting Ho; instead conclude that we do not reject Ho, i.e., we failed to reject the null hypothesis.
The p-value is the probability of getting a sample result at least as extreme as the one observed if Ho were true.
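As a minimal simulation sketch (all values below are made up for illustration), beta can be estimated by repeatedly testing samples drawn from a known DGP; note how beta grows as the true mean approaches the hypothesized one:
```{r}
# Hypothetical setup: Ho says mu >= 5; estimate beta = P(fail to reject Ho).
set.seed(42)
beta_for <- function(true_mean, mu0 = 5, n = 25, sigma = 1, alpha = 0.05) {
  rejected <- replicate(2000, {
    s <- rnorm(n, true_mean, sigma)
    t.test(s, mu = mu0, alternative = "less")$p.value < alpha
  })
  1 - mean(rejected)  # proportion of samples in which Ho was not rejected
}
beta_for(4.2)  # true mean far from 5 -> low beta
beta_for(4.8)  # true mean close to 5 -> high beta
```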
### T - Test
The one sample t-test is a statistical procedure used to determine whether a sample of observations could have been generated by a process with a specific mean. Suppose you are interested in determining whether the waiting time at a clinic is less than 5 minutes. To test this hypothesis, you could collect a sample of waiting times and compare the sample with the value five using a one-sample t-test.
**Manually calculating the T value using the formula**
$$
t = \frac{\bar{x} - \mu_0}{SE} \\
~\\
SE = \frac{s}{\sqrt{n}} \\
~\\
\text{where, }~~~df = n - 1
$$
The p-value of a t statistic can then be calculated as
```{r}
se <- sd(waiting$WaitTime) / sqrt(length(waiting$WaitTime))
t <- (mean(waiting$WaitTime) - 5) / se   # test statistic for mu0 = 5
df <- length(waiting$WaitTime) - 1
pt(t, df)   # lower-tail p-value, for Ha: true mean < 5
```
**Assumptions:**
1. Random Sampling from DGP
2. Stability in DGP
3. If n < 30, the population must be normally distributed
4. Null hypothesis is true
Test whether the waiting time is less than 5 minutes:
Null hypothesis -> true mean is greater than or equal to 5
Alt. hypothesis -> true mean is less than 5
```{r}
t.test(waiting$WaitTime, alternative = 'less', mu=5)
```
The likelihood of getting a sample statistic as extreme as 1.85 is 0.966 if the true mean waiting time is greater than or equal to 5.
The **significance level** is the probability of obtaining a result as extreme as, or more extreme than, the result actually obtained when the null hypothesis is true.
The significance level and the confidence level are complementary: a 90% confidence level corresponds to a 10% significance level.
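For instance, with 12 observations (df = 11, as in the estimation example above), the t cutoff behind a 90% CI is the same one a two-sided test at the 10% significance level uses:
```{r}
qt(1 - 0.10 / 2, df = 11)  # critical t value shared by both procedures
```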
### Testing normality
Test normality of randomly generated numbers.
**Assumptions:**
1. Random Sampling from DGP
2. Stability in DGP
#### Using histogram
```{r}
x <- rnorm(50, 5, 1)  # 50 draws from Normal(mean = 5, sd = 1)
hist(x)
```
#### Using qq plot
```{r}
qqnorm(x)
qqline(x, col = "blue")
```
#### Using Shapiro Wilk test
Null hypothesis -> distribution is normal
Alt. hypothesis -> distribution is not normal
```{r}
shapiro.test(x)
```
With the p-value this high, we fail to reject the null hypothesis.
Hence, our assumption that the distribution is normal is not rejected.
Definition of p-value: Given a null hypothesis and sample evidence of size n, the p-value is the probability of getting sample evidence that is equally or more unfavorable to the null hypothesis while the null hypothesis is actually true.
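A small simulation sketch of this definition (illustrative values): when the null hypothesis is actually true, p-values are roughly uniform, so a p-value below alpha occurs with probability about alpha.
```{r}
set.seed(7)
# Draw many normal samples (so the Ho of normality is true) and collect p-values.
pvals <- replicate(2000, shapiro.test(rnorm(30))$p.value)
mean(pvals < 0.05)  # should be close to 0.05
```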
## Tables
### Chi-Square distribution
The Chi-Square distribution is the distribution of the sum of squared standard normal deviates. The degrees of freedom of the distribution equal the number of standard normal deviates being summed.
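A quick simulation sketch of this fact (illustrative values):
```{r}
set.seed(1)
k <- 3  # degrees of freedom = number of squared standard normals summed
sims <- replicate(10000, sum(rnorm(k)^2))
mean(sims <= 5)    # empirical P(X <= 5)
pchisq(5, df = k)  # theoretical P(X <= 5)
```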
The probability of getting a Chi-Square value less than or equal to 5 (with 1 degree of freedom) is
```{r}
pchisq(5,1)
```
The Chi-Square value associated with a cumulative probability of 0.9746527 is
```{r}
qchisq(0.9746527,1)
```
*The **chi-squared test** is used to determine whether there is a significant difference between the expected frequencies and the observed frequencies in one or more categories.*
### One-Sample Goodness of fit test (Using Chi Square distribution) / Multinomial Test
The Chi-Square goodness of fit test is a non-parametric test used to find out whether the observed values of a given phenomenon differ significantly from the expected values. The term goodness of fit refers to comparing the observed sample distribution with the expected probability distribution (see the sketch after the assumptions below).
Null hypothesis -> There is no significant difference between the observed and the expected values.
Alt. hypothesis -> There is a significant difference for at least one category.
**Assumptions:**
1. Random Sampling from DGP
2. Stability in DGP
3. Expected value for each group is at least 5
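A minimal one-sample sketch (made-up die-roll counts, not from the course data): chisq.test can compare observed counts against hypothesized proportions.
```{r}
observed <- c(18, 22, 16, 14, 19, 31)  # hypothetical die-roll counts
chisq.test(observed, p = rep(1/6, 6))  # test against a fair die
```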
```{r}
gas<- read_excel("Gasoline.xlsx", col_names=TRUE)
head(gas)
table(gas$`Second-last`, gas$Last)
result <- chisq.test(table(gas$`Second-last`, gas$Last))
result
```
Check that the expected values are greater than or equal to 5
```{r}
result$expected
```
Cell-by-cell contributions to observed Chi-Square value
```{r}
(result$residuals)^2
```
## Relationships
Null hypothesis -> No Relationship
Alt. hypothesis -> Relationship
Looking at relationships between variables to understand and predict what is going on.
### Twoway Table
Response | Predictor
------------- | -------------
Nominal | Nominal
Ho: No relationship exists; probabilistic independence
Ha: Relationship exists; probabilistic dependence
Ex. Transit Railroads is interested in the relationship between travel distance and the ticket class purchased.
A random sample of 200 passengers is taken.
```{r}
transit<-read_excel("Transit.xlsx", na="NA", col_names = TRUE)
table(transit)
result <- chisq.test(table(transit))
result
# cell-by-cell contributions to observed Chi-Square value: ((O-E)^2)/E
(result$residuals)^2
```
### Oneway ANOVA
Response | Predictor
------------- | -------------
Interval | Nominal
**F - Distribution**: Generally arises from a statistic that involves a ratio of variances. The pf and qf functions in R handle the F distribution, as sketched below.
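A quick usage sketch, with 2 and 15 degrees of freedom to match the ANOVA example further down:
```{r}
pf(3.68, df1 = 2, df2 = 15, lower.tail = FALSE)  # upper-tail probability
qf(0.95, df1 = 2, df2 = 15)                      # 5% critical value (~3.68)
```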
The F statistic is the value we receive when we run an ANOVA test on different groups to understand the differences between them. It is given by the ratio of between-group variability to within-group variability.
If there are two predictors, it becomes a Twoway ANOVA.
Oneway ANOVA is a test of the relationship between an interval-level response variable and a nominal-level predictor variable. It can also be expressed as a test of multiple means.
**Assumptions:**
1. Random Sampling from DGP
2. Stability in DGP
3. For each level of the predictor variable, if n < 30 then the response must be normally distributed
4. Standard deviation is the same for all groups (evaluate with Levene's test)
Ho: No relationship exists; distribution is identical across all categories
Ha: Relationship exists; difference in means across at least two categories
```{r}
chol<- read_excel("Cholesterol.xlsx", col_names=TRUE)
head(chol)
fit <- aov(CholReduction ~ Drug, data=chol)
summary(fit)
```
DF = (k-1), (n-k) = 2, 15
Mean Sq gives MST (above) and MSE (below)
Sum Sq gives SST (above) and SSE (below)
The probability of getting a sample statistic as or more extreme than 40.79 is the p-value, which is extremely low. Hence, there is a significant relationship between the means of at least two drugs.
```{r}
boxplot(CholReduction ~ Drug, data=chol)
```
Use the Tukey HSD test to determine which pairs of drugs differ.
#### Tukey HSD (Honest Significance Difference) test
The studentised range distribution is used to test each pairwise difference.
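R exposes this distribution through ptukey and qtukey; for example, the 10% critical value for 3 means and 15 error degrees of freedom (matching the drug example and conf.level = .9 below):
```{r}
qtukey(0.90, nmeans = 3, df = 15)  # studentised range critical value
```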
**Assumptions:** Same as ANOVA
1. Random Sampling from DGP
2. Stability in DGP
3. For each level of the predictor variable, if n < 30 then the response must be normally distributed
4. Standard deviation is the same for all groups (evaluate with Levene's test)
Ho: No relationship exists: μi = μj for each i, j being checked
Ha: Relationship exists: μi ≠ μj
```{r}
TukeyHSD(fit, conf.level = .9)
```
For example, here B is greater than A by 15.4.
The probability of getting a sample difference of 15.4 or more extreme if Ho were true is 0.00028.
Hence, all three of them differ significantly.
#### Levene's test
Used whenever we wish to determine whether variances are equal across different groups.
**Assumptions:**
1. Random Sampling from DGP
2. Stability in DGP
Ho: σ1 = σ2 = … = σk ; k independent samples/sets of responses
Ha: At least one pair is not equal
```{r warning=FALSE}
library('car')
leveneTest(chol$CholReduction, chol$Drug)
```
### Linear Regression
Response | Predictor
------------- | -------------
Interval | At least one Interval
**Assumptions:**
1. Random Sampling from DGP
2. Stability in DGP
3. εi ~ Normal(mean 0, SD σ), with σ constant over all possible values of the predictors. We can use the following to evaluate this assumption:
+ Standardized Residuals are normally distributed
+ Mean 0, SD fixed across predictors. Plot St. residual plots for individual predictors (Xi) and predicted values (Y hat)
We are interested in predicting the amount of carbohydrates (in grams) for a menu item based on its calorie content (measured in 100s).
```{r}
starbucks <- read_excel('starbucks.xlsx')
head(starbucks)
linefit <- lm(carb ~ calories, data = starbucks)
plot(starbucks$calories, starbucks$carb)
abline(linefit, lty = 2, col = "blue")
```
#### Testing the model as a whole
Ho: No relationship exists; β1 = β2 = … = βk = 0
Ha: Relationship exists; at least one predictor has βi ≠ 0
ANOVA is used to test this:
```{r}
anova(linefit)
```
#### Testing Individual Predictors
Ho: No relationship exists; βi = 0, 1 ≤ i ≤ k
Ha: Relationship exists; βi ≠ 0
For each predictor, a two-tailed t-test is used:
```{r}
summary(linefit)
```
#### Assumption Evaluation
##### Standardized residuals normality check
**Standardized residuals plot**
```{r}
linefit.stres <- rstandard(linefit)
plot(starbucks$calories, linefit.stres, pch = 16, main = "Standardized Residual Plot", xlab = "Calories", ylab = "Standardized Residuals")
abline(0, 0, lty = 2, col = "blue")
```
**Normal probability plot (QQPlot)**
```{r}
qqnorm(linefit.stres, main = "Normal Probability Plot", xlab = "Normal Scores", ylab = "Standardized Residuals")
qqline(linefit.stres, col = "blue")
```
**Shapiro Wilk Normality test**
```{r}
shapiro.test(linefit.stres)
```
#### Standardized Residual Plot - on fitted values
```{r}
plot(linefit$fitted.values, linefit.stres, pch = 16, main = "Standardized Residual Plot", xlab = "Fitted Carbs", ylab = "Standardized Residuals")
abline(0,0, lty=2, col="red")
```
#### Standardized Residual Plot - on calories
```{r}
plot(starbucks$calories, linefit.stres, pch = 16, main = "Standardized Residual Plot", xlab = "Calories", ylab = "Standardized Residuals")
abline(0, 0, lty = 2, col = "red")
```
#### Confidence Interval for Estimates (βi)
```{r}
confint(linefit, level=.9)
```
#### Confidence Interval Estimation of Mean of Y
```{r}
predict(linefit,data.frame(calories=4.5), interval ="confidence", level = .80)
```
#### Prediction Interval Estimation of Y
A prediction interval for an individual response is wider than the confidence interval for the mean, since it also accounts for the scatter of individual observations.
```{r}
predict(linefit, data.frame(calories = 4.5), interval = "prediction", level = .80)
```