This repository has been archived by the owner on May 29, 2019. It is now read-only.
-
Notifications
You must be signed in to change notification settings - Fork 53
/
260_Safikhani_Pharmacogenomics.Rmd
462 lines (376 loc) · 26.2 KB
/
260_Safikhani_Pharmacogenomics.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
# 260: Biomarker discovery from large pharmacogenomics datasets
## Instructors:
* Zhaleh Safikhani (<zhaleh.safikhani@utoront.ca>)
* Petr Smirnov (<petr.smirnov@mail.utoronto.ca>)
* Benjamin Haibe-Kains (<benjamin.haibe.kains@utoronto.ca>)
## Workshop Description
This workshop will focus on the challenges encountered when applying machine learning techniques in complex, high dimensional biological data. In particular, we will focus on biomarker discovery from pharmacogenomic data, which consists of developing predictors of response of cancer cell lines to chemical compounds based on their genomic features. From a methodological viewpoint, biomarker discovery is strongly linked to variable selection, through methods such as Supervised Learning with sparsity inducing norms (e.g., ElasticNet) or techniques accounting for the complex correlation structure of biological features (e.g., mRMR). Yet, the main focus of this talk will be on sound use of such methods in a pharmacogenomics context, their validation and correct interpretation of the produced results. We will discuss how to assess the quality of both the input and output data. We will illustrate the importance of unified analytical platforms, data and code sharing in bioinformatics and biomedical research, as the data generation process becomes increasingly complex and requires high level of replication to achieve robust results. This is particularly relevant as our portfolio of machine learning techniques is ever enlarging, with its set of hyperparameters that can be tuning in a multitude of ways, increasing the risk of overfitting when developing multivariate predictors of drug response.
### Pre-requisites
* Basic knowledge of R syntax
* Familiarity with the machine learning concept and at least a few approaches
Following resources might be useful to read:
* https://academic.oup.com/bioinformatics/article-lookup/doi/10.1093/bioinformatics/btv723
* https://academic.oup.com/nar/article/46/D1/D994/4372597
* https://web.stanford.edu/~hastie/Papers/ESLII.pdf
### Workshop Participation
Participants expected to have the following required packages installed on their machines to be able to run the commands along with the instructors.
* PharmacoGx and Biobase from Bioconductor
* xtable, Hmisc, foreach, devtools, mRMRe, caret, glmnet, randomForest from cran
* bhklab/mci and bhklab/PharmacoGx-ML from github
### _R_ / _Bioconductor_ packages used
* https://bioconductor.org/packages/release/bioc/html/PharmacoGx.html
### Time outline
An example for a 45-minute workshop:
| Activity | Time |
|---------------------------------------------|------|
| Introduction | 10m |
| Basic functionalities of PharmacoGx | 15m |
| Consistency assessment between datasets | 15m |
| Machine learning and biomarker discovery | 20m |
## Workshop goals and objectives
### Learning goals
* describe the pharmacogenomic datasets and their usefulness
* learn how to extract information from these datasets and to intersect them over their common features
* identify functionalities available in PharmacoGx package to work with the high dimensional pharmacogenomics data
* assess reproducibility and replication of pharmacogenomics studies
* understand how to handle the biomarker discovery as a pattern recognition problem in the domain of pharmacogenomics studies
### Learning objectives
* list available standardized pharmacogenomic datasets and download them
* understand the structure of these datasets and how to access the features and response quantifications
* create drug-dose response plots
* Measure the consistency across multiple datasets and how to improve such measurements
* Assess whether known biomarkers are reproduced within these datasets
* Predict new biomarkers by applying different machine learning methods
## Abstract
This course will focus on the challenges encountered when applying machine learning techniques in complex, high dimensional biological data. In particular, we will focus on biomarker discovery from pharmacogenomic data, which consists of developing predictors of response of cancer cell lines to chemical compounds based on their genomic features. From a methodological viewpoint, biomarker discovery is strongly linked to variable selection, through methods such as Supervised Learning with sparsity inducing norms (e.g., ElasticNet) or techniques accounting for the complex correlation structure of biological features (e.g., mRMR). Yet, the main focus of this talk will be on sound use of such methods in a pharmacogenomics context, their validation and correct interpretation of the produced results. We will discuss how to assess the quality of both the input and output data. We will illustrate the importance of unified analytical platforms, data and code sharing in bioinformatics and biomedical research, as the data generation process becomes increasingly complex and requires high level of replication to achieve robust results. This is particularly relevant as our portfolio of machine learning techniques is ever enlarging, with its set of hyperparameters that can be tuning in a multitude of ways, increasing the risk of overfitting when developing multivariate predictors of drug response.
## Introduction
Pharmacogenomics holds much potential to aid in discovering drug response
biomarkers and developing novel targeted therapies, leading to development of
precision medicine and working towards the goal of personalized therapy.
Several large experiments have been conducted, both to molecularly
characterize drug dose response across many cell lines, and to examine the
molecular response to drug administration. However, the experiments lack a
standardization of protocols and annotations, hindering meta-analysis across
several experiments.
*PharmacoGx* was developed to address these challenges, by providing a
unified framework for downloading and analyzing large pharmacogenomic datasets
which are extensively curated to ensure maximum overlap and consistency.
*PharmacoGx* is based on a level of abstraction from the raw
experimental data, and allows bioinformaticians and biologists to work with
data at the level of genes, drugs and cell lines. This provides a more
intuitive interface and, in combination with unified curation, simplifies
analyses between multiple datasets.
Load `r BiocStyle::Biocpkg("PharmacoGx")` into your current workspace:
```{r loadlib, eval=TRUE, results='hide'}
suppressPackageStartupMessages({
library(PharmacoGx, verbose=FALSE)
library(mCI, verbose=FALSE)
library(PharmacoGxML, verbose=FALSE)
library(Biobase, verbose=FALSE)
})
```
### Downloading PharmacoSet objects
We have made the PharmacoSet objects of the curated datasets available for download using functions provided in the package. A table of available PharmacoSet objects can be obtained by using the *availablePSets* function. Any of the PharmacoSets in the table can then be downloaded by calling *downloadPSet*, which saves the datasets into a directory of the users choice, and returns the data into the R session.
```{r download_psets, eval=TRUE, results='hide'}
availablePSets(saveDir=file.path(".", "Safikhani_Pharmacogenomics"))
GDSC <- downloadPSet("GDSC", saveDir=file.path(".", "Safikhani_Pharmacogenomics"))
CCLE <- downloadPSet("CCLE", saveDir=file.path(".", "Safikhani_Pharmacogenomics"))
```
## Reproducibility
*PharmacoGx* can be used to process pharmacogenomic datasets. First we want to check the heterogenity of cell lines in one of the available psets, CCLE.
```{r piechart, fig.cap="Tissue of origin of cell lines in CCLE study"}
mycol <- c("#8dd3c7","#ffffb3","#bebada","#fb8072","#80b1d3","#fdb462",
"#b3de69","#fccde5","#d9d9d9","#bc80bd","#ccebc5","#ffed6f",
"#a6cee3","#1f78b4","#b2df8a","#33a02c","#fb9a99","#e31a1c",
"#fdbf6f","#ff7f00","#cab2d6","#6a3d9a","#ffff99","#b15928")
pie(table(CCLE@cell[,"tissueid"]),
col=mycol,
main="Tissue types",
radius=1,
cex=0.8)
```
### Plotting Drug-Dose Response Data
Drug-Dose response data included in the PharmacoSet objects can be conviniently plotted using the *drugDoseResponseCurve* function. Given a list of PharmacoSets, a drug name and a cell name, it will plot the drug dose response curves for the given cell-drug combination in each dataset, allowing direct comparisons of data between datasets.
```{r curves, fig.cap="Cells response to lapatinib in CCLE"}
CCLE.auc <- summarizeSensitivityProfiles(
pSet=CCLE,
sensitivity.measure="auc_published",
summary.stat="median",
verbose=FALSE)
lapatinib.aac <- CCLE.auc["lapatinib",]
cells <- names(lapatinib.aac)[
c(which.min(lapatinib.aac),
which((lapatinib.aac > 0.2) & (lapatinib.aac < 0.4))[1],
which.max(lapatinib.aac))]
par(mfrow=c(2, 2))
drugDoseResponseCurve(drug="lapatinib", cellline=cells[1],
pSets=CCLE, plot.type="Fitted",
legends.label="auc_published")
drugDoseResponseCurve(drug="lapatinib", cellline=cells[2],
pSets=CCLE, plot.type="Fitted",
legends.label="auc_published")
drugDoseResponseCurve(drug="lapatinib", cellline=cells[3],
pSets=CCLE, plot.type="Fitted",
legends.label="auc_published")
```
### Pharmacological profiles
In pharmacogenomic studies, cancer cell lines were also tested for their response to increasing concentrations of various compounds, and form this the IC50 and AAC were computed. These pharmacological profiles are available for all the psets in *PharmacoGx*.
```{r ccleauc, fig.cap="Cells response to drugs in CCLE"}
library(ggplot2, verbose=FALSE)
library(reshape2, verbose=FALSE)
melted_data <- melt(CCLE.auc)
NA_rows <- unique(which(is.na(melted_data), arr.ind=T)[,1])
melted_data <- melted_data[-NA_rows,]
ggplot(melted_data, aes(x=Var1,y=value)) +
geom_boxplot(fill="gray") +
theme(axis.text.x=element_text(angle=90,hjust=1)) +
xlab("Drugs") +
ylab("AAC")
#hist(CCLE.auc["lapatinib",], xlab="Cells response to lapatinib(AAC)",
# col="gray", main="")
```
## Replication
In this section we will investigate the consistency between the GDSC and CCLE datasets. In both CCLE and GDSC, the transcriptome of cells was profiled using an Affymatrix microarray chip. Cells were also tested for their response to increasing concentrations of various compounds, and form this the IC50 and AUC were computed. However, the cell and drugs names used between the two datasets were not consistent. Furthermore, two different microarray platforms were used. However, *PharmacoGx* allows us to overcome these differences to do a comparative study between these two datasets.
GDSC was profiled using the hgu133a platform, while CCLE was profiled with the expanded hgu133plus2 platform. While in this case the hgu133a is almost a strict subset of hgu133plus2 platform, the expression information in *PharmacoSet* objects is summarized by Ensemble Gene Ids, allowing datasets with different platforms to be directly compared. The probe to gene mapping is done using the BrainArray customCDF for each platform \cite{sabatti_thresholding_2002}.
To begin, you would load the datasets from disk or download them using the *downloadPSet* function above.
We want to investigate the consistency of the data between the two datasets. The common intersection between the datasets can then be found using *intersectPSet*. We create a summary of the gene expression and drug sensitivity measures for both datasets, so we are left with one gene expression profile and one sensitivity profile per cell line within each dataset. We can then compare the gene expression and sensitivity measures between the datasets using a standard correlation coefficient.
```{r Replicationcurves, fig.cap="Consistency of drug response curves across studies", results='hide'}
common <- intersectPSet(pSets = list("CCLE"=CCLE, "GDSC"=GDSC),
intersectOn = c("cell.lines", "drugs"),
strictIntersect = TRUE)
drugs <- drugNames(common$CCLE)
##Example of concordant and discordant drug curves
cases <- rbind(
c("CAL-85-1", "17-AAG"),
c("HT-29", "PLX4720"),
c("COLO-320-HSR", "AZD6244"),
c("HT-1080", "PD-0332991"))
par(mfrow=c(2, 2))
for (i in seq_len(nrow(cases))) {
drugDoseResponseCurve(pSets=common,
drug=cases[i,2],
cellline=cases[i,1],
legends.label="ic50_published",
plot.type="Fitted",
ylim=c(0,130))
}
```
### Consistency of pharmacological profiles
```{r sensitivityscatterplots, fig.height=5, fig.width=10, fig.cap="Concordance of AAC values"}
##AAC scatter plot
GDSC.aac <- summarizeSensitivityProfiles(
pSet=common$GDSC,
sensitivity.measure='auc_recomputed',
summary.stat="median",
verbose=FALSE)
CCLE.aac <- summarizeSensitivityProfiles(
pSet=common$CCLE,
sensitivity.measure='auc_recomputed',
summary.stat="median",
verbose=FALSE)
GDSC.ic50 <- summarizeSensitivityProfiles(
pSet=common$GDSC,
sensitivity.measure='ic50_recomputed',
summary.stat="median",
verbose=FALSE)
CCLE.ic50 <- summarizeSensitivityProfiles(
pSet=common$CCLE,
sensitivity.measure='ic50_recomputed',
summary.stat="median",
verbose=FALSE)
drug <- "lapatinib"
#par(mfrow=c(1, 2))
myScatterPlot(x=GDSC.aac[drug,],
y=CCLE.aac[drug,],
method=c("transparent"),
transparency=0.8, pch=16, minp=50,
xlim=c(0, max(max(GDSC.aac[drug,], na.rm=T), max(CCLE.aac[drug,], na.rm=T))),
ylim=c(0, max(max(GDSC.aac[drug,], na.rm=T), max(CCLE.aac[drug,], na.rm=T))),
main="cells response to lapatinib",
cex.sub=0.7,
xlab="AAC in GDSC",
ylab="AAC in CCLE")
legend("topright",
legend=sprintf("r=%s\nrs=%s\nCI=%s",
round(cor(GDSC.aac[drug,],
CCLE.aac[drug,],
method="pearson",
use="pairwise.complete.obs"),
digits=2),
round(cor(GDSC.aac[drug,],
CCLE.aac[drug,],
method="spearman",
use="pairwise.complete.obs"),
digits=2),
round(paired.concordance.index(GDSC.aac[drug,],
CCLE.aac[drug,],
delta.pred=0,
delta.obs=0)$cindex,
digits=2)),
bty="n")
```
### consistency assessment improved by Modified Concordance Index
To better assess the concordance of multiple pharmacogenomic studies we introduced the modified concordance index (mCI). Recognizing that the noise in the drug screening assays is high and may yield to inaccurate sensitive-based ranking of cell lines with close AAC values, the mCI only considers cell line pairs with drug sensitivity (AAC) difference greater than $\delta$ .
```{r mci, eval=TRUE, results='hide'}
c_index <- mc_index <- NULL
for(drug in drugs){
tt <- mCI::paired.concordance.index(GDSC.aac[drug,], CCLE.aac[drug,], delta.pred=0, delta.obs=0, alternative="greater")
c_index <- c(c_index, tt$cindex)
tt <- mCI::paired.concordance.index(GDSC.aac[drug,], CCLE.aac[drug,], delta.pred=0.2, delta.obs=0.2, alternative="greater", logic.operator="or")
mc_index <- c(mc_index, tt$cindex)
}
mp <- barplot(as.vector(rbind(c_index, mc_index)), beside=TRUE, col=c("blue", "red"), ylim=c(0, 1), ylab="concordance index", space=c(.15,.85), border=NA, main="mCI")
text(mp, par("usr")[3], labels=as.vector(rbind(drugs, rep("", 15))), srt=45, adj=c(1.1,1.1), xpd=TRUE, cex=.8)
abline(h=.7, lty=2)
```
### Known Biomarkers
The association between molecular features and response to a given drug is modelled using a linear regression model adjusted for tissue source:
$$Y = \beta_{0} + \beta_{i}G_i + \beta_{t}T + \beta_{b}B$$
where $Y$ denotes the drug sensitivity variable, $G_i$, $T$ and $B$ denote the expression of gene $i$, the tissue source and the experimental batch respectively, and $\beta$s are the regression coefficients. The strength of gene-drug association is quantified by $\beta_i$, above and beyond the relationship between drug sensitivity and tissue source. The variables $Y$ and $G$ are scaled (standard deviation equals to 1) to estimate standardized coefficients from the linear model. Significance of the gene-drug association is estimated by the statistical significance of $\beta_i$ (two-sided t test). P-values are then corrected for multiple testing using the false discovery rate (FDR) approach.
As an example of the reproducibility of biomarker discovery across pharmacogenomic studies, we can model the significance of the association between two drugs and their known biomarkers in CCLE and GDSC. We examine the association between drug *17-AAG* and gene *NQO1*, as well as drug *PD-0325901* and gene *BRAF*:
``` {r biomarker_discovery, results='hide'}
features <- PharmacoGx::fNames(CCLE, "rna")[
which(featureInfo(CCLE,
"rna")$Symbol == "NQO1")]
ccle.sig.rna <- drugSensitivitySig(pSet=CCLE,
mDataType="rna",
drugs=c("17-AAG"),
features=features,
sensitivity.measure="auc_published",
molecular.summary.stat="median",
sensitivity.summary.stat="median",
verbose=FALSE)
gdsc.sig.rna <- drugSensitivitySig(pSet=GDSC,
mDataType="rna",
drugs=c("17-AAG"),
features=features,
sensitivity.measure="auc_published",
molecular.summary.stat="median",
sensitivity.summary.stat="median",
verbose=FALSE)
ccle.sig.mut <- drugSensitivitySig(pSet=CCLE,
mDataType="mutation",
drugs=c("PD-0325901"),
features="BRAF",
sensitivity.measure="auc_published",
molecular.summary.stat="and",
sensitivity.summary.stat="median",
verbose=FALSE)
gdsc.sig.mut <- drugSensitivitySig(pSet=GDSC,
mDataType="mutation",
drugs=c("PD-0325901"),
features="BRAF",
sensitivity.measure="auc_published",
molecular.summary.stat="and",
sensitivity.summary.stat="median",
verbose=FALSE)
ccle.sig <- rbind(ccle.sig.rna, ccle.sig.mut)
gdsc.sig <- rbind(gdsc.sig.rna, gdsc.sig.mut)
known.biomarkers <- cbind("GDSC effect size"=gdsc.sig[,1],
"GDSC pvalue"=gdsc.sig[,6],
"CCLE effect size"=ccle.sig[,1],
"CCLE pvalue"=ccle.sig[,6])
rownames(known.biomarkers) <- c("17-AAG + NQO1","PD-0325901 + BRAF")
library(xtable, verbose=FALSE)
xtable(known.biomarkers, digits=c(0, 2, -1, 2, -1), caption='Concordance of biomarkers across stuudies')
par(mfrow=c(2, 2))
CCLE_expr <- t(exprs(summarizeMolecularProfiles(CCLE, mDataType="rna", fill.missing=FALSE)))
CCLE_cells <- intersect(rownames(CCLE_expr), colnames(CCLE.aac))
plot(CCLE.aac["17-AAG", CCLE_cells], CCLE_expr[CCLE_cells, features],
main="CCLE + 17-AAG + NQO1",
cex.main=1, ylab="Predictions", xlab="drug sensitivity", pch=20, col="gray40")
GDSC_expr <- t(exprs(summarizeMolecularProfiles(GDSC, mDataType="rna", fill.missing=FALSE)))
GDSC_cells <- intersect(rownames(GDSC_expr), colnames(GDSC.aac))
plot(GDSC.aac["17-AAG", GDSC_cells], GDSC_expr[GDSC_cells, features],
main="GDSC + 17-AAG + NQO1",
cex.main=1, ylab="Predictions", xlab="drug sensitivity", pch=20, col="gray40")
CCLE_mut <- t(exprs(summarizeMolecularProfiles(CCLE, mDataType="mutation", fill.missing=FALSE, summary.stat="or")))
CCLE_cells <- intersect(rownames(CCLE_mut), colnames(CCLE.aac))
boxplot(CCLE.aac["PD-0325901", CCLE_cells]~ CCLE_mut[CCLE_cells, "BRAF"], col="gray80", pch=20, main="CCLE + PD-0325901 + BRAF",
cex.main=1, xlab="mutation", ylab="drug sensitivity")
GDSC_mut <- t(exprs(summarizeMolecularProfiles(GDSC, mDataType="mutation", fill.missing=FALSE, summary.stat="or")))
GDSC_cells <- intersect(rownames(GDSC_mut), colnames(GDSC.aac))
boxplot(GDSC.aac["PD-0325901", GDSC_cells]~ GDSC_mut[GDSC_cells, "BRAF"], col="gray80", pch=20, main="GDSC + PD-0325901 + BRAF",
cex.main=1, xlab="mutation", ylab="drug sensitivity")
```
## Machine Learning and Biomarker Discovery
Some of the widely used multivariate machine learning methods such as elastic net, Random Forest (RF) and Support Vector Machine (SVM) have been already implemented in the MLWorkshop. It optimizes hyperparameters of these methods in the training phase. To assess the performance of the predictive models, it implements *m* number of sampling with *n-fold* cross validations (CV). The performance will then be assessed by multiple metrics including pearson correlation coefficient, concordance index and modified concordance index.
```{r machine_learning, results='hide'}
suppressPackageStartupMessages({
library(mRMRe, verbose=FALSE)
library(Biobase, verbose=FALSE)
library(Hmisc, verbose=FALSE)
library(glmnet, verbose=FALSE)
library(caret, verbose=FALSE)
library(randomForest, verbose=FALSE)
})
##Preparing trainig dataset
train_expr <- t(exprs(summarizeMolecularProfiles(GDSC, mDataType="rna", fill.missing=FALSE, verbose=FALSE)))
aac <- summarizeSensitivityProfiles(GDSC, sensitivity.measure="auc_recomputed", drug="lapatinib", fill.missing=FALSE, verbose=FALSE)
cells <- intersect(rownames(train_expr), names(aac))
df <- as.matrix(cbind(train_expr[cells,], "lapatinib"=aac[cells]))
##Preparing validation dataset
validation_expr <- summarizeMolecularProfiles(CCLE, mDataType="rna", fill.missing=FALSE, verbose=FALSE)
actual_labels <- summarizeSensitivityProfiles(CCLE, sensitivity.measure="auc_recomputed", drug="lapatinib", fill.missing=FALSE, verbose=FALSE)
for(method in c("ridge", "lasso", "random_forest", "svm")){
par(mfrow=c(1, 2))
res <- optimization(train=df[, -ncol(df), drop=F],
labels=t(df[, ncol(df), drop=F]),
method=method,
folds.no=5,
sampling.no=1,
features.no=10,
feature.selection="mRMR",
assessment=c("corr", "mCI"))
validation_labels <- validation(model=res$model$lapatinib,
validation.set=t(exprs(validation_expr)),
validation.labels=actual_labels,
method=method,
assessment="mCI")
}
```
### Bonus: Using the Connectivity Map for drug repurposing
We show here how to use *PharmacoGx* for linking drug perturbation signatures inferred from CMAP to independent signatures of HDAC inhibitors published in Glaser et al. (2003). We therefore sought to reproduce the HDAC analysis in Lamb et al. (2006) using the latest version of CMAP that can be downloaded using downloadPSet. The connectivityScore function enables the computation of the connectivity scores between the 14-gene HDAC signature from (Glaser et al., 2003) and over 1000 CMAP drugs. This analysis results in the four HDAC inhibitors in CMAP being ranked at the top of the drug list (Fig. 2), therefore concurring with the original CMAP analysis (Lamb et al., 2006).
```{r runCMAP, message=FALSE, warning=FALSE}
## download and process the HDAC signature
mydir <- "1132939s"
downloader::download(paste("http://www.sciencemag.org/content/suppl/2006/09/29/313.5795.1929.DC1/", mydir, ".zip", sep=""), destfile=paste(mydir,".zip",sep=""))
unzip(paste(mydir,".zip", sep=""))
library(hgu133a.db)
library(PharmacoGx)
HDAC_up <- gdata::read.xls(paste(mydir, paste(mydir, "sigS1.xls", sep="_"),sep="/"), sheet=1, header=FALSE, as.is=TRUE)
HDAC_down <- gdata::read.xls(paste(mydir, paste(mydir, "sigS1.xls", sep="_"),sep="/"), sheet=2, header=FALSE, as.is=TRUE)
HDAC <- as.data.frame(matrix(NA, nrow=nrow(HDAC_down)+nrow(HDAC_up), ncol=2))
annot <- AnnotationDbi::select(hgu133a.db, keys = c(HDAC_up[[1]], HDAC_down[[1]]), columns=c("ENSEMBL"), keytype="PROBEID")
gene_up <- unique(annot[match(HDAC_up[[1]], annot[,1]),2])
gene_down <- na.omit(unique(annot[match(HDAC_down[[1]], annot[,1]),2]))
HDAC_genes <- as.data.frame(matrix(NA, nrow=length(gene_down)+length(gene_up), ncol=2))
HDAC_genes[ , 2] <- c(rep(1, times=length(gene_up)), rep(-1, times=length(gene_down)))
HDAC_genes[ , 1] <- c(gene_up, gene_down)
rownames(HDAC_genes) <- HDAC_genes[ , 1]
HDAC <- HDAC_genes[ , 2]
names(HDAC) <- rownames(HDAC_genes)
drug.perturbation <- PharmacoGx::downloadPertSig("CMAP")
dimnames(drug.perturbation)[[1]] <- gsub("_at", "", dimnames(drug.perturbation)[[1]])
message("Be aware that computing sensitivity will take some time...")
cl <- parallel::makeCluster(2)
res <- parApply(drug.perturbation[ , , c("tstat", "fdr")], 2, function(x, HDAC){
return(PharmacoGx::connectivityScore(x=x, y=HDAC, method="gsea", nperm=100))
}, cl=cl, HDAC=HDAC)
stopCluster(cl)
rownames(res) <- c("Connectivity", "P Value")
res <- t(res)
res <- apply(drug.perturbation[ , , c("tstat", "fdr")], 2, function(x, HDAC){
return(PharmacoGx::connectivityScore(x=x, y=HDAC, method="gsea", nperm=100))
}, HDAC=HDAC)
rownames(res) <- c("Connectivity", "P Value")
res <- t(res)
HDAC_inhibitors <- c("vorinostat", "trichostatin A", "HC toxin", "valproic acid")
res <- res[order(res[,1], decreasing=T), ]
HDAC_ranks <- which(rownames(res) %in% HDAC_inhibitors)
```
## Session Info
This document was generated with the following R version and packages loaded:
```{r sessionInfo}
sessionInfo()
```