-
Notifications
You must be signed in to change notification settings - Fork 0
/
lecture6_annotation_resources_and_prelim_ORA.Rmd
521 lines (364 loc) · 16.4 KB
/
lecture6_annotation_resources_and_prelim_ORA.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
---
title: "BCB420 - Computational Systems Biology"
subtitle: "Lecture 6 - Annotation Dataset and Intro to Pathway analysis"
author: "Ruth Isserlin"
date: "`r Sys.Date()`"
output:
xaringan::moon_reader:
lib_dir: libs
css: [default, ./libs/example.css]
nature:
highlightStyle: github
highlightLines: true
highlightSpans: true
countIncrementalSlides: false
bibliography: lecture6.bib
---
```{r include=FALSE}
library(edgeR)
library(GEOquery)
library(kableExtra)
```
## Before we start
Please fill this out:
[<font size=8>Mid course Feedback : https://forms.gle/maGA529V7pxvgBXC6</font>](https://forms.gle/maGA529V7pxvgBXC6)
---
## Features of Markdown
Three **must** features to include in your next assignment:
* Table of contents
* Code chunk specifications.
* References using bibtex
For more information on Rmarkdown see the following useful resources:
* See [Rmarkdown cheatsheet](https://rstudio.com/wp-content/uploads/2015/02/rmarkdown-cheatsheet.pdf) for the basics
* For more in depth on all the **great** features of markdown chekc out [this page](https://yihui.org/knitr/)
---
## Table of contents
.pull-left[
<img src=./images/img_lecture6/rmarkdown_toc.png>
]
.pull-right[
* TOC will be generated by taking all entries specified as a heading using "\#"
* Depending on what depth you specify this will only include the number of "\#" specified
]
---
## Code chunk specifications
.pull-left[
<img src=./images/img_lecture6/rmarkdown_code_chunks.png>
]
.pull-right[
* message = FALSE - to get rid of all those info messages that make the report un-readable.
* warning = FALSE - same as messages but it is important that you pay attention to these when you are iterating over your report. Some warning may indicate that a function is not being used correctly. You don't want these showing up in the final report but you do want to address all of them.
* echo and eval are both useful for reports where you don't want to show the code you are using. If you are giving this report to a biologist or experimentalist they don't want to see the code that generated the data or figures.
]
---
## References using bibtext
* Create a set of references that you use for your Rmarkdown notebook, extension .bib
* To get the bibtex go to google scholar and search for your article. Click on the quotation marks associated with your article and click on Bibtex. Copy it to .bib file.
* Example bibtex to put in your bib file:
```{r eval=FALSE}
@article{cytoscape2003,
title={Cytoscape: a software environment for integrated models of biomolecular interaction networks},
author={Shannon, Paul and Markiel, Andrew and Ozier, Owen and Baliga, Nitin S and Wang, Jonathan T and Ramage, Daniel and Amin, Nada and Schwikowski, Benno and Ideker, Trey},
journal={Genome research},
volume={13},
number={11},
pages={2498--2504},
year={2003},
publisher={Cold Spring Harbor Lab}
}
```
---
## References using bibtex - step 2
Add your bibliography file to your Rmarkdown file
```{r eval=FALSE}
---
title: "BCB420 - Computational Systems Biology"
subtitle: "Lecture 6 - Annotation Dataset and Intro to Pathway analysis"
author: "Ruth Isserlin"
date: "`r Sys.Date()`"
output:
html_document:
toc: true
toc_depth: 2
bibliography: lecture6.bib #<<
---
```
---
## References using bibtex - step3
If I want to add the following sentence to add the reference use the @ sign followed by the name you called the reference.
```{r eval=FALSE}
* [Cytoscape](www.cytoscape.org)[@cytoscape_2003] is a freely available open-source, cross platform network analysis software.
```
--
In your markdown file the above will get rendered to look like:
* [Cytoscape](www.cytoscape.org)(Shannon et al. 2003) is a freely available open-source, cross platform network analysis software.
And the following will appear at the end of your document:
```{r eval=FALSE}
Shannon, Paul, Andrew Markiel, Owen Ozier, Nitin S Baliga, Jonathan T Wang, Daniel Ramage, Nada Amin, Benno Schwikowski, and Trey Ideker. 2003. “Cytoscape: A Software Environment for Integrated Models of Biomolecular Interaction Networks.” Genome Research 13 (11). Cold Spring Harbor Lab: 2498–2504.
```
---
## References using bibtex - controlling the format
* In order to change the style of your references you can specify a citation style.
* For a set of predefined styles see [here](https://github.com/citation-style-language/styles)
```{r eval=FALSE}
---
title: "BCB420 - Computational Systems Biology"
subtitle: "Lecture 6 - Annotation Dataset and Intro to Pathway analysis"
author: "Ruth Isserlin"
date: "`r Sys.Date()`"
output:
html_document:
toc: true
toc_depth: 2
bibliography: lecture6.bib
csl: biomed-central.csl #<<
```
---
## Annotation Sources
[<font size=8>Fill in this short survey on your annotation source: https://forms.gle/shyJiddEKMk9qnU76</font>](https://forms.gle/shyJiddEKMk9qnU76)
---
## Annotation Sources
.pull-left[
<img src="images/img_lecture6/annotation_resources.png">
]
.pull-right[
* There are a lot of sources that try and collate a lot of the smaller databases
* [Pathway commons](https://www.pathwaycommons.org/)
* [MsigDb](https://www.gsea-msigdb.org/gsea/msigdb/collections.jsp)
* [iRefIndex](https://irefindex.vib.be/wiki/index.php/iRefIndex)
* [GeneCards](https://www.genecards.org/)
* There are hundreds of smaller databases!
]
---
## Information that is important to know about our annotation source
.pull-left[
* Where did it come from?
* is it from a primary publication?
* Is it curated by a consortium?
* Is it computationally defined?
]
.pull-right[
* How is it released and how often is it released?
* Is it a custom format? gmt format?
* available for download from a website or ftp site?
* queriable and accessibly through a REST interface?
* Is it versioned?
]
???
REST - REpresentational State Transfer
---
## GMT file format
* File format defined by the GSEA tool when it was originally released in 2003
* Each row represents a pathway or gene set
* Format is: pathway name <tab> pathway description <tab> tab delimited list of genes associated with this pathway.
* A lot of enrichment analysis tools can take this format in as input.
<img src="https://software.broadinstitute.org/cancer/software/gsea/wiki/images/3/3d/Gmt_format_snapshot.gif">
---
## Where are we?
So, what are **we** going to be covering during this course?<br>
![](./images/course_overview_flowchart.png)
---
## Recap from previous weeks
<img src=https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4532886/bin/ncomms8956-f1.jpg width="80%">
<font size=1>Janzen DM, Tiourin E, Salehi JA, Paik DY, Lu J, Pellegrini M, Memarzadeh S. [An
apoptosis-enhancing drug overcomes platinum resistance in a tumour-initiating
subpopulation of ovarian cancer](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4532886/). Nat Commun. 2015 Aug 3;6:7956. - Figure 1</font>
---
First things first,
* Load the data, filter it
```{r message=FALSE, warning=FALSE}
library(GEOquery)
sfiles = getGEOSuppFiles('GSE70072')
fnames = rownames(sfiles)
# there is only one supplemental file
ca125_exp = read.delim(fnames[1],header=TRUE,
check.names = FALSE)
cpms = edgeR::cpm(ca125_exp[,3:22])
rownames(cpms) <- ca125_exp[,1]
# get rid of low counts
keep = rowSums(cpms >1) >=3 #<<
ca125_exp_filtered = ca125_exp[keep,]
filtered_data_matrix <- as.matrix(ca125_exp_filtered[,3:22])
rownames(filtered_data_matrix) <- ca125_exp_filtered$ensembl75_id
```
---
* Define the groups
```{r}
#get the 2 and third token from the column names
samples <- data.frame(
lapply(colnames(ca125_exp_filtered)[3:22],
FUN=function(x){
unlist(strsplit(x, split = "\\."))[c(2,3)]}))
colnames(samples) <- colnames(ca125_exp_filtered)[3:22]
rownames(samples) <- c("patients","cell_type")
samples <- data.frame(t(samples))
```
```{r}
d = DGEList(counts=filtered_data_matrix, group=samples$cell_type)
```
* create the model
```{r}
model_design_pat <- model.matrix(
~ samples$patients + samples$cell_type)
```
---
* estimate the dispersion,
* calculate normalization factors,
* fit the model
* calculate differentail expression
```{r}
#estimate dispersion
d <- estimateDisp(d, model_design_pat)
#calculate normalization factors
d <- calcNormFactors(d)
#fit model
fit <- glmQLFit(d, model_design_pat)
#calculate differential expression
qlf.pos_vs_neg <- glmQLFTest(fit, coef='samples$cell_typeCA125+')
```
---
Get all the results
```{r}
qlf_output_hits <- topTags(qlf.pos_vs_neg,sort.by = "PValue",
n = nrow(filtered_data_matrix))
```
How many gene pass the threshold p-value < 0.05?
```{r}
length(which(qlf_output_hits$table$PValue < 0.05))
```
How many genes pass correction?
--
```{r}
length(which(qlf_output_hits$table$FDR < 0.05))
```
---
Output our top hits
```{r}
kable(topTags(qlf.pos_vs_neg), type="html")
```
---
How many genes are up regulated?
```{r}
length(which(qlf_output_hits$table$PValue < 0.05
& qlf_output_hits$table$logFC > 0))
```
How many genes are down regulated?
```{r}
length(which(qlf_output_hits$table$PValue < 0.05
& qlf_output_hits$table$logFC < 0))
```
---
Create **thresholded** lists of genes.
```{r}
#merge gene names with the top hits
qlf_output_hits_withgn <- merge(ca125_exp[,1:2],qlf_output_hits, by.x=1, by.y = 0)
qlf_output_hits_withgn[,"rank"] <- -log(qlf_output_hits_withgn$PValue,base =10) * sign(qlf_output_hits_withgn$logFC)
qlf_output_hits_withgn <- qlf_output_hits_withgn[order(qlf_output_hits_withgn$rank),]
upregulated_genes <- qlf_output_hits_withgn$gname[
which(qlf_output_hits_withgn$PValue < 0.05
& qlf_output_hits_withgn$logFC > 0)]
downregulated_genes <- qlf_output_hits_withgn$gname[
which(qlf_output_hits_withgn$PValue < 0.05
& qlf_output_hits_withgn$logFC < 0)]
write.table(x=upregulated_genes,
file=file.path("data","ca125_upregulated_genes.txt"),sep = "\t",
row.names = FALSE,col.names = FALSE,quote = FALSE)
write.table(x=downregulated_genes,
file=file.path("data","ca125_downregulated_genes.txt"),sep = "\t",
row.names = FALSE,col.names = FALSE,quote = FALSE)
write.table(x=data.frame(genename= qlf_output_hits_withgn$gname,F_stat= qlf_output_hits_withgn$rank),
file=file.path("data","ca125_ranked_genelist.txt"),sep = "\t",
row.names = FALSE,col.names = FALSE,quote = FALSE)
```
---
## Thresholded list?
* We have gone from a large datasets of 1000s of measurements to a list of differentially expressed genes. Why do we look at a subset of this list?
* Aren't thresholds arbitrary? Why define a threshold and limit our list to just that set?
--
Some reasons for starting with the thresholded list:
* Where else can we start?
* Get the low hanging fruit. Strongest signal might be the most important
* The strongest signals might be the best place to start.
* Easy.
---
## Is there an alternative to a Thresholded list?
### Thresholded list - genes with p-value < threshold
* Asks the question - Are there any gene sets or pathways that are **enriched/over-represented** or **depleted/under-represented** in my list
* Uses stastics like fisher exact test, hypergeometric test to assess this.
* Tools like: [DAVID](https://david.ncifcrf.gov/),[g:profiler](https://biit.cs.ut.ee/gprofiler/),[ErichR](https://amp.pharm.mssm.edu/Enrichr/),[GREAT](http://great.stanford.edu/public/html/)
### non-thresholded list - ranked gene list
* Asks the question - Are there any gene sets or pathways that are ranked surprisingly high or low in my ranked list.
* Uses stastics like KS test, linear models
* Tools and packages like:[GSEA - Gene set enrichment anlaysis](https://www.gsea-msigdb.org/gsea/index.jsp),[GSVA - Gene Set Variation Analysis](https://bioconductor.org/packages/release/bioc/html/GSVA.html)
---
## Pros and Cons of Thresholded list
.pull-left[
**Cons**
* No set value good for threshold - pvalue<0.05 is accepted but ultimately it is still arbitrary
* Weak signals are ignored (many weak signals can contribute to the whole)
* different results at different thresholds
]
.pull-right[
**Pros**
* No learning curve! Plug and play.
* Very accessible.
* Quick and fast results
* Endless of online tools available
* No requirement for a bioinformatician or computational biologist.
]
---
## Gene List Enrichment Analysis
* sometimes called over-representation analysis
### What do we start with:
* Gene list: eg: FUCA2, GCLC, NFYA, NIPAL3, SEMA3F, KRIT1
* Gene sets or pathways: come from any one of the annotation databases that we mentioned at the beginning of the day today.
### What are we looking for:
* Are any of the genelists surprisingly or statistically over-represented in our set of genes.
### What do we need in order to answer this question:
* A way to associate a statistic with the above question
* A way to correct for the multiple hypothesis that we are going to be asking.
---
<img src="./images/img_lecture6/two_class_design.png">
<font size=2>Voisin V, Bader GD, Pathway and network Analysis course - slide 12 [Bioinformatics.ca](https://bioinformatics.ca/workshops/2018-pathway-and-network-analysis-of-omics-data/#course-material)</font>
---
<img src="./images/img_lecture6/simple_enrichment_test.png">
<font size=2>Voisin V, Bader GD, Pathway and network Analysis course - slide 14 [Bioinformatics.ca](https://bioinformatics.ca/workshops/2018-pathway-and-network-analysis-of-omics-data/#course-material)</font>
---
<img src="./images/img_lecture6/fisher_exact1.png">
<font size=2>Voisin V, Bader GD, Pathway and network Analysis course - slide 16 [Bioinformatics.ca](https://bioinformatics.ca/workshops/2018-pathway-and-network-analysis-of-omics-data/#course-material)</font>
---
<img src="./images/img_lecture6/fisher_exact2.png">
<font size=2>Voisin V, Bader GD, Pathway and network Analysis course - slide 17 [Bioinformatics.ca](https://bioinformatics.ca/workshops/2018-pathway-and-network-analysis-of-omics-data/#course-material)</font>
---
<img src="./images/img_lecture6/fisher_exact3.png">
<font size=2>Voisin V, Bader GD, Pathway and network Analysis course - slide 18 [Bioinformatics.ca](https://bioinformatics.ca/workshops/2018-pathway-and-network-analysis-of-omics-data/#course-material)</font>
---
<img src="./images/img_lecture6/fisher_exact4.png">
<font size=2>Voisin V, Bader GD, Pathway and network Analysis course - slide 19 [Bioinformatics.ca](https://bioinformatics.ca/workshops/2018-pathway-and-network-analysis-of-omics-data/#course-material)</font>
---
## More about fisher exact test
* Video using M & M example with fisher exact test : https://www.youtube.com/watch?v=udyAvvaMjfM
* Great explanation on pathway commons site: https://www.pathwaycommons.org/guide/primers/statistics/fishers_exact_test/
---
<img src="./images/img_lecture6/difference_ora_gsea.png">
<font size=2>Voisin V, Bader GD, Pathway and network Analysis course - slide 21 [Bioinformatics.ca](https://bioinformatics.ca/workshops/2018-pathway-and-network-analysis-of-omics-data/#course-material)</font>
---
Check out this publication:
* Geistlinger L, Csaba G, Santarelli M, Ramos M, Schiffer L, Turaga N, Law C,
Davis S, Carey V, Morgan M, Zimmer R, Waldron L. Toward a gold standard for
benchmarking gene set enrichment analysis. Brief Bioinform. 2020 Feb 6 [PMID](https://www.ncbi.nlm.nih.gov/pubmed/32026945)
---
<img src="./images/img_lecture6/waldron_ora_methods.png">
<font size=2>Geistlinger L, Csaba G, Santarelli M, Ramos M, Schiffer L, Turaga N, Law C,Davis S, Carey V, Morgan M, Zimmer R, Waldron L. Toward a gold standard for
benchmarking gene set enrichment analysis. Brief Bioinform. 2020 Feb 6 [PMID](https://www.ncbi.nlm.nih.gov/pubmed/32026945)</font>
---
## Assignment #2
* differentail gene expression and preliminary ORA
* <font size=5> Due March 3, 2020! @ 20:00 </font>
## What to hand in?
* **html rendered RNotebook** - you should submit this through quercus
* Make sure the notebook and all associated code is checked into your github repo as I will be pulling all the repos at the deadline and using them to compile your code. - Your checked in code must replicate the handed in notebook.
* Document your work and your code directly in the notebook.
* **Reference the paper associated with your data!**
* **Introduce your paper and your data again**
* You are allowed to use helper functions or methods but make sure when you source those files the paths to them are relative and that they are checked into your repo as well.