-
Notifications
You must be signed in to change notification settings - Fork 0
/
lecture4_data_exploration_and_normalization.Rmd
1137 lines (824 loc) · 37 KB
/
lecture4_data_exploration_and_normalization.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
---
title: "BCB420 - Computational Systems Biology"
subtitle: "Lecture 4 - Exploring the data and basics of Normalization"
author: "Ruth Isserlin"
date: "`r Sys.Date()`"
output:
xaringan::moon_reader:
lib_dir: libs
css: [default, ./libs/example.css]
nature:
highlightStyle: github
highlightLines: true
highlightSpans: true
countIncrementalSlides: false
---
```{r include=FALSE}
library(GEOquery)
library(kableExtra)
library(edgeR)
```
## Before we start
## Assignment #1
* <font size=5> Due February 4th, 2020 @ 20:00 </font>
## What to hand in?
* **html rendered RNotebook** - you should be able to submit this through quercus
* Make sure the notebook and all associated code is checked into your github repo as I will be pulling all the repos at the deadline and using them to compile your code. - Your checked in code must replicate the handed in notebook.
* **Do not check the data file into your repo!** - your code should download the data from GEO and generate a new, cleaned data file.
* Document your work and your code directly in the notebook.
* **Read the paper associated with your data!**
* You are allowed to use helper functions or methods but make sure when you source those files the paths to them are relative and that they are checked into your repo as well.
---
## Lecture Overview
### Data Exploration
* Data standards - what are they and what are they good for?
### Data Normalization
* Why do we need to normalize out data?
* What different types of distributions are there and why does this matter?
### Identifier Mapping
* BiomaRt example
---
## Recap of Last class
* We used the GEOmetadb Bioc package to query the GEO meta data to help find:
* RNASeq data
* human
* dataset from within 5 years
* related to ovarian cancer
* supplementary file is counts
* [GEOmetadb](https://www.bioconductor.org/packages/release/bioc/html/GEOmetadb.html) is a great package that basically indexes all the meta data in GEO which is very helpful when it isn't based on the strongest of controlled vocabularies.
* I narrowed my search down to the GEO identifier - GSE70072
* So where do we start?
--
<font size=8> Read the paper! </font>
---
### Standards
* [Functional Genomics Data Society](http://fged.org/)
* MIAME - **M**inimum **I**nformation **A**bout a **M**icroarray **E**xperiment
* originally published in [Nature genetics in 2001](https://www.nature.com/articles/ng1201-365) and represented the guiding principles of the minimal information needed for submitting a dataset
* MINSEQE - **M**inimum **I**nformation about a high-throughput **SEQ**uencing **E**xperiment
--
* **GEO representation of MIAME and MINSEQE**
<img src=./images/img_lecture4/MIAME_MINSEQE.png>
---
### Standards
* [Functional Genomics Data Society](http://fged.org/)
* MIAME - **M**inimum **I**nformation **A**bout a **M**icroarray **E**xperiment
* originally published in [Nature genetics in 2001](https://www.nature.com/articles/ng1201-365) and represented the guiding principles of the minimal information needed for submitting a dataset
* MINSEQE - **M**inimum **I**nformation about a high-throughput **SEQ**uencing **E**xperiment
* **GEO representation of MIAME and MINSEQE**
<img src=./images/img_lecture4/MIAME_MINSEQE_highlighted.png>
---
## NCBI GEO standards and services for microarray data
* An interesting read! [Nat Biotechnol. 2006 Dec; 24(12): 1471–1472.](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2270403/)
<img src=https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2270403/bin/nihms41580f1.jpg>
---
### Highlights from NCBI GEO standards and services for microarray data
the MIAME requirements have been criticized recently. The criticism stems, in part, from **different interpretations of the level of detail required to adequately report a microarray experiment**, and debates as to whether there is a genuine benefit to making microarray data public.
"In June 2005, we released major database revisions that included specific provisions for all MIAME data elements. In 2006, mechanisms for provision of raw data were further streamlined, and several MIAME elements that were previously optional became mandatory. Yet, even with these advances, it is still possible for a submitter to supply data that do not strictly adhere to the MIAME requirements. The difficulty lies in the fact that MIAME is a **subjective set of guidelines** where the level of detail to report is open to interpretation and, thus, cannot be unequivocally validated or enforced by computational means."
"each submission is inspected by curators for content integrity. GEO curators employ a pragmatic approach; we aim to ensure that sufficient information has been supplied to allow general interpretation of the experiment. Although encouraged, we have been less dogmatic with regards to provision of all-inclusive experimental protocols that would possibly permit practical replication of the entire experiment. Our reasoning is that provision of granulated experimental details adds a significant burden on the submitter, for (arguably) minimal real benefit for most end-users who are usually less concerned with this level of detail."
???
It more of a user driven resource.
A little bit of the honour system - these are the guidelines and we hope that submitters will abide by them but no one is forcing them to follow them.
If a reviewer has comments then submitter can easily make adjustments - but how often do you think reviewers go and download the data.
---
## Standards - Proteomics
[Proteomics Standards Initiative](http://www.psidev.info/) - PSI
<img src=./images/img_lecture4/psi.png>
???
point out: the number of different branches of PSI that there are.
they have standards and guidelines as well as controlled vocabularies.
As a previous submitter the process is not easy but as you will see later in the course when we look at protein protein interaction networks the psi-mi format is very reach and easy to use computational.
---
## PSI-MI controlled vocabulary example
<img src=./images/img_lecture4/psi_cv.png>
???
Active community that is still adding and adapting their controlled vocab and formats. The base is mostly stable but as new techonologies arise or evolve they are incorporated.
Why is the proteomics community so different from the RNA community? not sure. There is a different amount of work and coordination associated with the different frameworks.
---
Dataset = GSE70072
--
Title of the paper - Apoptosis enhancing drugs overcome innate platinum resistance in CA125 negative tumor initiating populations of high grade serous ovarian cancer
--
Very briefly in my own words:
* This study investigates two cell populations that they isolated from high-grade serous ovarian cancer, CA125+ve cells and CA125-ve cells.
* They hypothesize the CA125-ve cells are stem-like cells and are the cause for resurgence of cancer following treatment as the chemo drugs kills all the CA125+ve cells enriching for CA125-ve cells.
* They further demonstrate how a combination of two chemo drugs given together could help to kill both CA125+ve and CA125-ve cells.
<font size=8>What is the first thing that you might want to investigate or ask with the above summary?</font>
---
## CA125
From the paper:
* CA125 (Muc16) is a cell surface glycoprotein
* highly expressed in high grade serous ovarian cancer and found in blood of effected patients.
* a biomarker of the disease
From Uniprot:
* [Muc16](https://www.uniprot.org/uniprot/Q8WXI7) is not effective for stage I of the diseases so it isn't a good general marker of the disease.
* not specific enough for population screening
* used for monitoring the disease.
???
Highlight the common problem of different names. CA125 is not the recognized symbol of the gene. Why are they using it? Probably historical.
---
## General experimental design
<img src=https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4532886/bin/ncomms8956-f1.jpg width="80%">
<font size=1>Janzen DM, Tiourin E, Salehi JA, Paik DY, Lu J, Pellegrini M, Memarzadeh S. [An
apoptosis-enhancing drug overcomes platinum resistance in a tumour-initiating
subpopulation of ovarian cancer](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4532886/). Nat Commun. 2015 Aug 3;6:7956. - Figure 1</font>
---
## Specific RNASeq experimental design
<img src=./images/img_lecture4/RNASeq_design_fig3c.png>
---
### Specific RNASeq experimental design - Corrected 6 months after initial publication
* Original publication - August 2015
<img src=./images/img_lecture4/RNASeq_design_fig3c.png width="75%">
* Corrigendum - February 2016
<img src=./images/img_lecture4/RNASeq_design_fig3c_redo.png width="75%">
---
### Error!
* 341,056 of the 730,620 expression values were inadvertently paired with incorrect gene names.
* the DNA repair and apoptosis genes did **not** show a statistically significant expression difference as first reported.
* Although, this error did not effect their conclusions on the different treatment models they introduced just the possible mechanism that they proposed.
<font size=7>Lesson Learned - Identifer mapping is important and the basis for all down stream analysis so we need to get it right!</font>
---
## Data Exploration
* First things first get the GEO description of your dataset.
```{r echo=TRUE, message=FALSE, warning=FALSE}
gse <- getGEO("GSE70072",GSEMatrix=FALSE)
kable(data.frame(head(Meta(gse))), format = "html")
```
???
GSEMatrix parameter will only work for microarray data
---
## Information about Platform - `r names(GPLList(gse))`
```{r message=FALSE, warning=FALSE}
current_gpl <- names(GPLList(gse))[1]
current_gpl_info <- Meta(getGEO(current_gpl))
```
--
```{r}
current_gpl_info$title
current_gpl_info$last_update_date
current_gpl_info$organism
```
---
## Information about Platform - `r names(GPLList(gse))`
### Represent the same as above but using markdown
* with embedded r (expression) surrounded by [grave accent](https://en.wikipedia.org/wiki/Grave_accent). Also see: [R markdown cheatsheet](https://rstudio.com/wp-content/uploads/2015/02/rmarkdown-cheatsheet.pdf) for good tips<br><br>
**Platform title** : `r current_gpl_info$title`<br>
**Submission data** : `r current_gpl_info$submission_date`<br>
**Last update data** : `r current_gpl_info$last_update_date`<br>
**Organims** : `r current_gpl_info$organism` (taxid: `r current_gpl_info$taxid`)<br>
**Number of GEO datasets that use this techology** : `r length(current_gpl_info$series_id)` <br>
(code: length(current_gpl_info$series_id)) <br>
**Number of GEO samples that use this technology** : `r length(current_gpl_info$sample_id)` <br>
(code: length(current_gpl_info$sample_id)) <br>
---
### Get the expression Data
```{r message=FALSE, warning=FALSE}
sfiles = getGEOSuppFiles('GSE70072')
fnames = rownames(sfiles)
# there is only one supplemental file
ca125_exp = read.delim(fnames[1],header=TRUE,
check.names = FALSE) #<<
```
--
Difference between:
```{r}
read.table(fnames[1],header=TRUE,check.names=FALSE,nrows = 1)[1,1:4]
```
And
```{r}
read.table(fnames[1],header=TRUE,check.names=TRUE,nrows = 1)[1,1:4]
```
---
```{r}
kable(ca125_exp[1:15,1:5], format = "html")
```
---
## Cleaning the data
* How many unique genes do we have?
* Are there any non-genes in our dataset? If so what are they?
* Can we exclude them?
### How many genes do we have measurments for?
```{r}
dim(ca125_exp)
```
--
Define the groups
```{r}
#get the 2 and third token from the column names
samples <- data.frame(lapply(colnames(ca125_exp)[3:22],
FUN=function(x){unlist(strsplit(x, split = "\\."))[c(2,3)]}))
colnames(samples) <- colnames(ca125_exp)[3:22]
rownames(samples) <- c("patients","cell_type")
samples <- data.frame(t(samples))
```
---
## Are any of genes duplicated? and Why?
Get the summarized counts for each gene
```{r}
summarized_gene_counts <- sort(table(ca125_exp$gname),decreasing = TRUE)
```
Only output those that are greater than 1
```{r}
kable(summarized_gene_counts[which(summarized_gene_counts>1)[1:10]], format="html")
```
---
## What are those?
* [Y_RNA](https://en.wikipedia.org/wiki/Y_RNA) - small non-coding RNAs
* [SNOR](https://en.wikipedia.org/wiki/Small_nucleolar_RNA) - small nucleolar RNA
* U - small nuclear RNA
## Do I need to filter them out?
--
* At this point - **No**
* Difference of opinion.
* If we base our analysis on ensembl gene ids then they are unique elements and we don't have to worry about them.
* Collapsing them might increase their overall counts
* **But we do need to filter out genes that have low counts!**
---
* According to the edgeR protocol - Filter weakly expressed and noninformative (e.g., non-aligned) features.
* **In edgeR, it is recommended to remove features without at least 1 read per million in n of the samples, where n is the size of the smallest group of replicates.**
* For our example dataset - there are 10 samples of each group so n=10
```{r}
#translate out counts into counts per million using the edgeR package function cpm
cpms = cpm(ca125_exp[,3:22])
rownames(cpms) <- ca125_exp[,1]
# get rid of low counts
keep = rowSums(cpms >1) >=3 #<<
ca125_exp_filtered = ca125_exp[keep,]
```
What does that do to the dataset?
--
```{r}
dim(ca125_exp_filtered)
```
???
It is not important that the 10 samples that have at least 1 count per million are from the same condition, just that there are 10 of them.
---
### Does that solve some of duplicate issues?
Get the summarized counts for each gene
```{r}
summarized_gene_counts_filtered <- sort(table(ca125_exp_filtered$gname),decreasing = TRUE)
```
Only output those that are greater than 1
```{r}
kable(summarized_gene_counts_filtered[which(summarized_gene_counts_filtered>1)[1:10]], format="html")
```
---
## Normalization
* First steps in any analysis.
* Why is it important? Why do we need to normalize our data?
--
<br>How we think our experimental data should look - Bull's eye every time!
<img src="images/img_lecture4/bullseye_1.png">
---
## Normalization
* First steps in any analysis.
* Why is it important? Why do we need to normalize our data?
<br>How the data actually looks:
<img src="images/img_lecture4/bullseye_2.png">
---
## Variation
--
.pull-left[
**Technical Variation**
]
.pull-right[
**Biological Variation**
]
---
## Variation
.pull-left[
**Technical Variation**
- caused by instrumental, experimental variation
- samples run on different days by different people with slightly different reagents
- Want to control for these factors as best we can but with larger experiments there will always be some level of variation
- read depth, gene length, ...
- This is what we want to correct for.
]
.pull-right[
**Biological Variation**
- came from a different sample or a different condition
- These are the things we are **interested in**
]
---
## Data distributions
* a sampling of data will form some sort of distribution, a mathematical means of representing it.
* in our experimental methods, we are sampling a set of genes/proteins and associating some metric with that sampling.
* There are many different types of well charecterized distributions, including:
1. Normal distribution
1. Bimodal distribution
1. Poisson distribution
1. Power log distribution
---
### Normal distribution
Generate a set of 1000 randomly selected numbers from the normal distribution.
```{r}
r <- rnorm(1000, mean=0, sd=1)
```
If we graph this random distribution it will look like:
--
```{r eval=FALSE}
hist(r,freq = FALSE,breaks = 30,
xlim = c(-4, 4),ylim = c(0, 1),
main = "Normal Distribution",
xlab = "x",ylab = "f(x)", col = "yellow")
```
---
--
```{r echo=FALSE}
hist(r,freq = FALSE,breaks = 30,
xlim = c(-4, 4),ylim = c(0, 1),
main = "Normal Distribution",
xlab = "x",ylab = "f(x)", col = "yellow")
```
---
If we then grab 100 values equally spaced between -4 and 4
```{r eval=FALSE}
hist(r,freq = FALSE,breaks = 30,
xlim = c(-4, 4),ylim = c(0, 1),
main = "Normal Distribution",
xlab = "x",ylab = "f(x)", col = "yellow")
x <- seq(-4, 4, length.out = 100)
#add the density distribution
points(x, dnorm(x), type = "l", lwd = 2, col="firebrick")
```
---
If we then grab 100 values equally spaced between -4 and 4
```{r echo=FALSE}
hist(r,freq = FALSE,breaks = 30,
xlim = c(-4, 4),ylim = c(0, 1),
main = "Normal Distribution",
xlab = "x",ylab = "f(x)", col = "yellow")
x <- seq(-4, 4, length.out = 100)
#add the density distribution
points(x, dnorm(x), type = "l", lwd = 2, col="firebrick")
```
---
### Distribution of our data - Boxplot
```{r message=FALSE, warning=FALSE, eval=FALSE}
data2plot <- log2(cpm(ca125_exp_filtered[,3:22]))
boxplot(data2plot, xlab = "Samples", ylab = "log2 CPM",
las = 2, cex = 0.5, cex.lab = 0.5,
cex.axis = 0.5, main = "CA125 RNASeq Samples")
#draw the median on each box plot
abline(h = median(apply(data2plot, 2, median)), col = "green", lwd = 0.6, lty = "dashed")
```
---
### Distribution of our data - Boxplot
```{r message=FALSE, warning=FALSE, echo=FALSE}
data2plot <- log2(cpm(ca125_exp_filtered[,3:22]))
boxplot(data2plot, xlab = "Samples", ylab = "log2 CPM",
las = 2, cex = 0.5, cex.lab = 0.5,
cex.axis = 0.5, main = "CA125 RNASeq Samples")
#draw the median on each box plot
abline(h = median(apply(data2plot, 2, median)), col = "green", lwd = 0.6, lty = "dashed")
```
---
### Distribution of our data - Density plot
```{r eval=FALSE}
counts_density <- apply(log2(cpm(ca125_exp_filtered[,3:22])), 2, density)
#calculate the limits across all the samples
xlim <- 0; ylim <- 0
for (i in 1:length(counts_density)) {
xlim <- range(c(xlim, counts_density[[i]]$x));
ylim <- range(c(ylim, counts_density[[i]]$y))
}
cols <- rainbow(length(counts_density))
ltys <- rep(1, length(counts_density))
#plot the first density plot to initialize the plot
plot(counts_density[[1]], xlim=xlim, ylim=ylim, type="n",
ylab="Smoothing density of log2-CPM", main="", cex.lab = 0.85)
#plot each line
for (i in 1:length(counts_density)) lines(counts_density[[i]], col=cols[i], lty=ltys[i])
#create legend
legend("topright", colnames(data2plot),
col=cols, lty=ltys, cex=0.75,
border ="blue", text.col = "green4",
merge = TRUE, bg = "gray90")
```
---
### Distribution of our data - Density plot
```{r echo=FALSE}
counts_density <- apply(log2(cpm(ca125_exp_filtered[,3:22])), 2, density)
#calculate the limits across all the samples
xlim <- 0; ylim <- 0
for (i in 1:length(counts_density)) {
xlim <- range(c(xlim, counts_density[[i]]$x));
ylim <- range(c(ylim, counts_density[[i]]$y))
}
cols <- rainbow(length(counts_density))
ltys <- rep(1, length(counts_density))
#plot the first density plot to initialize the plot
plot(counts_density[[1]], xlim=xlim, ylim=ylim, type="n",
ylab="Smoothing density of log2-CPM", main="", cex.lab = 0.85)
#plot each line
for (i in 1:length(counts_density)) lines(counts_density[[i]], col=cols[i], lty=ltys[i])
#create legend
legend("topright", colnames(data2plot),
col=cols, lty=ltys, cex=0.75,
border ="blue", text.col = "green4",
merge = TRUE, bg = "gray90")
```
---
## Common Normalization methods
### Z-scored normalization
* centres the data to zero and the standard deviation to 1.
* assumes that the data is normally distributed
### Quantile normalization
* sort each column by values
* compute the average for each row
* replace all values along the row with the computed average
* put values back in original rows.
* forces the distribution of each sample to be the same, the sum of each column is the same
* **Most commonly used normalization method for microarray based technologies**
<font size=6> What do these normalization methods lack with regards to RNASeq?</font>
---
## Specialized Normalization Techniques for RNASeq
* both methods are very similar and give comparable results (according to in depth comparison that can be found [here](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0206312))
### Trimmed Mean of M-values (TMM)
* implemented in the edgeR package
* based on the hypothesis that most genes are not differentially expressed
* Sample based
### Relative Log Expression (RLE)
* implemented in the DESeq package
* based on the hypothesis that most genes are not differentially expressed
* gene based
---
## M vs A plot
* As it is the premise of the TMM method, just briefly go over it.
* Originally used in microarray experiments with two dye (Red vs green) to assay the technical variability between the intensities of the two dyes.
* M - ratio, ratio Cy5(red) to Cy3(green), A - average of Cy5 and Cy3
* For RNASeq experiments it can be used to compare two samples or two groups of samples.
```{r}
plotMA(log2(ca125_exp[,c(3,4)]), ylab="M - ratio log expression", main="CA125 + vs - - example")
```
---
## Trimmed Mean of M-values
* Trimmed mean - is the average after removing the upper and lower percentage of the data points. (By default, 30% of the M values and 5% of the A values)
<img src=./images/img_lecture4/TMM_def.png>
* N is the total reads for sample K, Y is the observed counts
* TMM compares each sample to a reference.
* Major difference to previous methods - the data does not need to be modified prior to normalization.
---
## Applying TMM to our dataset
### Create an edgeR container for RNASeq count data
* use the filtered counts and make sure it is not a data.frame but is a matrix
* give the defined groups that we are going to be comparing - cell type (CA125+ vs CA125-)
```{r}
filtered_data_matrix <- as.matrix(ca125_exp_filtered[,3:22])
rownames(filtered_data_matrix) <- ca125_exp_filtered$ensembl75_id
d = DGEList(counts=filtered_data_matrix, group=samples$cell_type)
```
## Calculate the normalization factors
```{r}
d = calcNormFactors(d)
```
---
* How did normalization change our data
```{r}
#get the normalized data
normalized_counts <- cpm(d)
```
```{r echo=FALSE}
par(mfrow=c(1,2))
# add the original distribution
counts_density <- apply(log2(filtered_data_matrix), 2, density)
#calculate the limits across all the samples
xlim <- 0; ylim <- 0
for (i in 1:length(counts_density)) {
xlim <- range(c(xlim, counts_density[[i]]$x));
ylim <- range(c(ylim, counts_density[[i]]$y))
}
cols <- rainbow(length(counts_density))
ltys <- rep(1, length(counts_density))
#plot the first density plot to initialize the plot
plot(counts_density[[1]], xlim=xlim, ylim=ylim, type="n",
ylab="Smoothing density of log2-CPM", main="", cex.lab = 0.85)
#plot each line
for (i in 1:length(counts_density)) lines(counts_density[[i]], col=cols[i], lty=ltys[i])
normalized_counts_density <- apply(log2(normalized_counts), 2, density)
#calculate the limits across all the samples
xlim <- 0; ylim <- 0
for (i in 1:length(normalized_counts_density)) {
xlim <- range(c(xlim, normalized_counts_density[[i]]$x));
ylim <- range(c(ylim, normalized_counts_density[[i]]$y))
}
cols <- rainbow(length(normalized_counts_density))
ltys <- rep(1, length(normalized_counts_density))
#plot the first density plot to initialize the plot
plot(normalized_counts_density[[1]], xlim=xlim, ylim=ylim, type="n",
ylab="Smoothing density of log2-CPM", main="", cex.lab = 0.85)
#plot each line
for (i in 1:length(normalized_counts_density)) lines(normalized_counts_density[[i]], col=cols[i], lty=ltys[i])
```
---
* What happens when we use the same scale
```{r echo=FALSE}
par(mfrow=c(1,2))
counts_density <- apply(log2(filtered_data_matrix), 2, density)
normalized_counts_density <- apply(log2(normalized_counts), 2, density)
#calculate the limits across all the samples
xlim <- 0; ylim <- 0
for (i in 1:length(counts_density)) {
xlim <- range(c(xlim, counts_density[[i]]$x));
ylim <- range(c(ylim, counts_density[[i]]$y))
}
#add limits from the second plot
for (i in 1:length(normalized_counts_density)) {
xlim <- range(c(xlim, normalized_counts_density[[i]]$x));
ylim <- range(c(ylim, normalized_counts_density[[i]]$y))
}
cols <- rainbow(length(counts_density))
ltys <- rep(1, length(counts_density))
#plot the first density plot to initialize the plot
plot(counts_density[[1]], xlim=xlim, ylim=ylim, type="n",
ylab="Smoothing density of log2-CPM", main="", cex.lab = 0.85)
#plot each line
for (i in 1:length(counts_density)) lines(counts_density[[i]], col=cols[i], lty=ltys[i])
cols <- rainbow(length(normalized_counts_density))
ltys <- rep(1, length(normalized_counts_density))
#plot the first density plot to initialize the plot
plot(normalized_counts_density[[1]], xlim=xlim, ylim=ylim, type="n",
ylab="Smoothing density of log2-CPM", main="", cex.lab = 0.85)
#plot each line
for (i in 1:length(normalized_counts_density)) lines(normalized_counts_density[[i]], col=cols[i], lty=ltys[i])
```
---
### Distribution of our data - Boxplot
```{r message=FALSE, warning=FALSE, echo=FALSE}
par(mfrow=c(1,2))
data2plot_prior <- log2(cpm(ca125_exp_filtered[,3:22]))
boxplot(data2plot_prior, xlab = "Samples", ylab = "log2 CPM",
las = 2, cex = 0.5, cex.lab = 0.5,
cex.axis = 0.5, main = "Original Counts")
#draw the median on each box plot
abline(h = median(apply(data2plot_prior, 2, median)), col = "orange", lwd = 0.6, lty = "dashed")
data2plot <- log2(normalized_counts)
boxplot(data2plot, xlab = "Samples", ylab = "log2 CPM",
las = 2, cex = 0.5, cex.lab = 0.5,
cex.axis = 0.5, main = "Normalized")
#draw the median on each box plot
abline(h = median(apply(data2plot, 2, median)), col = "green", lwd = 0.6, lty = "dashed")
```
---
Inspect the sample separation using a multidimenstional scaling plot or MDS plot post normalization.
* A MDS plot represents the distances between samples
```{r}
plotMDS(d, labels=rownames(samples),
col = c("darkgreen","blue")[factor(samples$cell_type)])
```
---
Next, estimate common and tagwise dispersion
* dispersion is a parameter that describes how much your variance deviates from the mean.
* this is specific to the edgeR pacakge and is used downstream when calculating differential expression.
* we can estimate both the common and tagwise dispersion. Common dispersion calculates a common dispersion value for all genes, while the tagwise method calculates gene-specific dispersions.
* Genes with more counts will have smaller variations between samples than genes with few counts.
* dispersion squared is biological coeffient of variation (BCV) so dispersion is a measure of how much variation there is in your samples.
using:
```{r}
model_design <- model.matrix(~samples$patients + samples$cell_type+0)
d <- estimateDisp(d, model_design)
```
---
Graphing the BCV
* tagwise = genewise, each dot represents the BCV for a gene
```{r}
plotBCV(d,col.tagwise = "black",col.common = "red")
```
---
Create a visual representation of the mean-variance relationship
```{r message=FALSE, warning=FALSE}
plotMeanVar(d, show.raw.vars = TRUE, #<<
show.tagwise.vars=FALSE, NBline=FALSE,
show.ave.raw.vars = FALSE,show.binned.common.disp.vars = FALSE)
```
---
```{r message=FALSE, warning=FALSE}
plotMeanVar(d, show.raw.vars = TRUE,
show.tagwise.vars=TRUE, #<<
NBline=FALSE, show.ave.raw.vars = FALSE,
show.binned.common.disp.vars = FALSE)
```
---
```{r message=FALSE, warning=FALSE}
plotMeanVar(d, show.raw.vars = TRUE, show.tagwise.vars=TRUE,
show.ave.raw.vars = TRUE, #<<
NBline=FALSE, show.binned.common.disp.vars = FALSE)
```
---
```{r message=FALSE, warning=FALSE}
plotMeanVar(d, show.raw.vars = TRUE, show.tagwise.vars=TRUE,
show.ave.raw.vars = TRUE, NBline=FALSE,
show.binned.common.disp.vars = TRUE) #<<
```
---
```{r message=FALSE, warning=FALSE}
plotMeanVar(d, show.raw.vars = TRUE, show.tagwise.vars=TRUE,
show.ave.raw.vars = TRUE,
NBline=TRUE, #<<
show.binned.common.disp.vars = TRUE)
```
---
<img src="https://media.springernature.com/lw685/springer-static/image/art%3A10.1038%2Fnmeth.3963/MediaObjects/41592_2016_Article_BFnmeth3963_Fig1_HTML.jpg">
<font size=2>Wadi, L., Meyer, M., Weiser, J. et al. [Impact of outdated gene annotations on pathway enrichment analysis](https://www.nature.com/articles/nmeth.3963). Nat Methods 13, 705–706 (2016).</font>
---
## Identifier mapping
**Why can't the data just contain the identifiers that are interested in, i.e genes?**
#### Techonology
* Often data generation works with different identifiers than what we need for the analysis.
* Technology itself works on a different level and tracks different information so often aren't working gene level data. For example -
* RNASeq deals with read counts of short fragments of RNA that don't necessarily map to an individual gene
* Microarray deals with probes which are the base unit of analysis. At design phase probes are associated with different region of the genome but are not necessarily mapped to a gene.
* Proteomics deals with peptides similar to the above examples.
#### Changing annotations
* Thankfully our knowledge of genes, gene regulation are constantly expanding.
* New genes, regulatory mechnisms are being discovered
* RNASeq data can be re analyzed given new information to uncover new things. (microarray not to same degree)
---
## Tools for Identifier Mapping
* Depends on the data you are trying to map.
* Often going to the source will result in the best mappping, for example, for best mapping of ensembl ids go to [Ensembl](https://useast.ensembl.org/index.html). Or if you want mapping for mouse genes go to Mouse Genome Informatics [MGI](http://www.informatics.jax.org/)
* Popular Tools:
* [Ensembl Biomart](https://www.ensembl.org/biomart/martview/5d0da2ec3ccb52f6596c568a5b574402) - has web interface as well as multiple packages - the bioc package [biomaRt](https://bioconductor.org/packages/release/bioc/html/biomaRt.html) and the R package [biomartr](https://academic.oup.com/bioinformatics/article/33/8/1216/2931816) which actually depends on the biomaRt bioc package
* biomartr is an extension on the biomaRt package and is optimized for those trying to get annotation information out of biomart.
* [BridgeDB](https://openrisknet.org/e-infrastructure/services/119/) - restful service with a bioc interface [tutorial](https://bioconductor.org/packages/devel/bioc/vignettes/BridgeDbR/inst/doc/tutorial.html) - flaky though
* Wherever you get your mappings from it is important that you are working with up to date annotations and the right versions.
---
## Biomart
* Let's check how many of my ensembl gene ids map to the same gene names associated with them.
## Get all the available marts
```{r}
library(biomaRt)
listMarts()
```
---
## What if you need a different or older verion.
```{r}
listEnsemblArchives()[1:10,]
```
---
Connect to the desired mart
```{r}
ensembl <- useMart("ensembl")
```
Get the set of datasets available
```{r}
datasets <- listDatasets(ensembl)
kable(head(datasets),format = "html")
```
---
Limit to the human datasets available
```{r}
kable(head(datasets[grep(datasets$dataset,pattern = "sapiens"),]),format = "html")
```
```{r}
ensembl = useDataset("hsapiens_gene_ensembl",mart=ensembl)
```
---
## Building a Biomart Query
Function that we will be using is: **getBM()**
```{r}
help(getBM)
```
What are we trying to convert:
* Human Ensembl Gene Ids to HGNC symbols
* A few things we need:
1. attributes - the type of identifiers that you want to retrieve. **So in our use case this is Ensembl gene Ids and the HGNC symbols** - What would happen if we just specified the HGNC symbols here?
1. filters - how you are filtering your results. Not intuitive. **For our use case this is Ensembl gene ids**
1. values - the values associated with the filters. **So in our use case this is the set of Ensembl gene ids that we have**
---
### Filters
* unfortunately we have to figure out the correct name for all the identifiers that we are using as well as converting to.
how many filters are there?
--
```{r}
dim(listFilters(ensembl))
```
--
What do those filters look like?
```{r}
kable(listFilters(ensembl)[1:10,1:2], type="html")
```
---
### filters - cont'd
```{r}
biomart_human_filters <- listFilters(ensembl)
kable(biomart_human_filters[
grep(biomart_human_filters$name,pattern="ensembl"),],
format="html") %>%
row_spec(3, background = "yellow")
```
---
### Attributes
```{r}
kable(listAttributes(ensembl)[1:10,1:2], type="html")
```
---
### Attributes
* you can also use the function searchAttribtutes to find the attribute you are looking for.
```{r}
kable(searchAttributes(mart = ensembl, 'hgnc') , format="html") %>%
row_spec(2, background = "yellow")
```
---
* if you wanted to look for all of them at once