Merge pull request #14 from pasmopy/update-r
Update transcriptomic data integration
Hiroaki Imoto committed Jul 16, 2021
2 parents bf636ca + f47f9f5 commit 1d06839
Showing 6 changed files with 395 additions and 169 deletions.
98 changes: 92 additions & 6 deletions README.md
@@ -10,7 +10,7 @@ This repository contains analysis code for the following paper:
| ------------- | -------------------------------------------------------------- |
| Python >= 3.7 | See [`requirements.txt`](requirements.txt) |
| Julia >= 1.5 | [`BioMASS.jl==0.5.0`](https://github.com/biomass-dev/BioMASS.jl) |
| R >= 4.0 | TCGAbiolinks, sva, biomaRt, edgeR, ComplexHeatmap, viridisLite |
| R >= 4.0 | TCGAbiolinks, sva, biomaRt, ComplexHeatmap, viridisLite, dplyr, edgeR, tibble, data.table, stringr |

## Table of contents

@@ -26,14 +26,100 @@ This repository contains analysis code for the following paper:

## Integration of TCGA and CCLE data

- Run [`transcriptomic_data_integration.R`](transcriptomic_data/transcriptomic_data_integration.R)

```bash
$ cd transcriptomic_data
$ Rscript transcriptomic_data_integration.R
### Download the packages required to run the following R script
- Run `code.R`

```R
source("code.R")
```

### Download TCGA clinical/subtype information

- Run `outputclinical()` or `outputsubtype()`
**outputclinical()** : You can select any cancer type listed in [TCGA Study Abbreviations](https://gdc.cancer.gov/resources-tcga-users/tcga-code-tables/tcga-study-abbreviations).
**outputsubtype()** : You can select "LGG", "LUAD", "STAD", "BRCA", "COAD", "READ".
```R
outputclinical("BRCA")
outputsubtype("BRCA")
```
Output: `TPM_RLE_postComBat.csv`
Output: `"TCGA Study Abbreviation"_clinic.csv` or `"TCGA Study Abbreviation"_subtype.csv`
### Select samples in reference to clinical or subtype data
- You can select patients based on the clinical or subtype data obtained above.
```R
sampleselection(type = subtype,
ID = "patient",
pathologic_stage %in% c("Stage_I", "Stage_II"),
age_at_initial_pathologic_diagnosis < 60)
```
**type** :
You can choose `clinic` or `subtype`. With `clinic`, patients are selected from `"TCGA Study Abbreviation"_clinic.csv`; with `subtype`, they are selected from `"TCGA Study Abbreviation"_subtype.csv`. Run outputclinical() or outputsubtype() beforehand so that the corresponding file exists.
**ID** :
Column name that contains the patient's ID (ex. TCGA-E2-A14U, TCGA-E9-A1RC, ...) in the .csv file referenced by "type".
**Third and subsequent arguments** :
Conditions for selecting samples. You can set multiple conditions.
| Condition | How to write |
| ------------- | -------------------------------------------------------------- |
| Patients whose value in column x is "A" | x == "A" |
| Patients whose value in column x is "A", "B" or "C" | x %in% c("A", "B", "C") |
| Patients whose value in column x is greater than A | x > A |
| Patients whose value in column x is less than A | x < A |
### Download TCGA gene expression data (HTSeq-Counts)
- Download the gene expression data for the specified sample types [(Sample Type Codes)](https://gdc.cancer.gov/resources-tcga-users/tcga-code-tables/sample-type-codes) in the cancer type passed to `outputclinical()` or `outputsubtype()`. Only data from the patients selected by `sampleselection()` are downloaded.
```R
downloadTCGA(cancertype = "BRCA",
sampletype = c("01", "06"))
```
Output: Number of selected samples
### Download CCLE transcriptomic data
- Download CCLE transcriptomic data. You can select cell lines derived from [`one specific cancer type`](CCLE_cancertype.txt).
```R
downloadCCLE(cancertype = "BREAST")
```
Output: Number of selected samples
### Merge TCGA and CCLE data
1. Merge the TCGA data downloaded with `downloadTCGA()` and the CCLE data downloaded with `downloadCCLE()`
2. Run ComBat-seq to remove batch effects between the TCGA and CCLE datasets
3. Output the total read counts of all samples so that you can decide the cutoff values of total read counts for `Normalization()`
```R
MergeTCGAandCCLE()
```
Output : totalreadcounts.csv
### Normalization of RNA-seq counts data
- Normalize the RNA-seq count data.
- You can specify `min` and `max` values for truncation of total read counts.
- If you do not want to specify a value for truncation, set `min = F` or `max = F`.
```R
Normalization(min = 40000000, max = 140000000)
```
Output : TPM_RLE_postComBat.csv
## Construction of a comprehensive model of the ErbB signaling network
2 changes: 1 addition & 1 deletion requirements.txt
@@ -1,4 +1,4 @@
biomass>=0.5
biomass==0.5.0
matplotlib
numpy
pandas>=0.23
23 changes: 23 additions & 0 deletions transcriptomic_data/CCLE_cancertype.txt
@@ -0,0 +1,23 @@
PROSTATE
STOMACH
URINARY_TRACT
CENTRAL_NERVOUS_SYSTEM
OVARY
HAEMATOPOIETIC_AND_LYMPHOID_TISSUE
KIDNEY
THYROID
LUNG
BONE
PLEURA
ENDOMETRIUM
PANCREAS
BREAST
UPPER_AERODIGESTIVE_TRACT
LARGE_INTESTINE
AUTONOMIC_GANGLIA
OESOPHAGUS
FIBROBLAST
CERVIX
LIVER
BILIARY_TRACT
SMALL_INTESTINE
119 changes: 119 additions & 0 deletions transcriptomic_data/README.md
@@ -0,0 +1,119 @@
# Create transcriptomic data

Workflow for creating transcriptomic data

## Requirements

| Language | Dependent packages |
| ------------- | -------------------------------------------------------------- |
| R | dplyr, edgeR, sva, tibble, data.table, stringr, TCGAbiolinks, biomaRt |



## Move to /transcriptomic_data and run R
- Move to /transcriptomic_data and run R

```bash
$ cd transcriptomic_data
$ R
```

- Run `transcriptomic_data.R`

```R
source("transcriptomic_data.R")
```

## Download TCGA clinical/subtype information

- Run `outputClinical()` or `outputSubtype()`
**outputClinical()** : You can select any cancer type listed in [TCGA Study Abbreviations](https://gdc.cancer.gov/resources-tcga-users/tcga-code-tables/tcga-study-abbreviations).
**outputSubtype()** : You can select "ACC", "BRCA", "BLCA", "CESC", "CHOL", "COAD", "ESCA", "GBM", "HNSC", "KICH", "KIRC", "KIRP", "LGG", "LIHC", "LUAD", "LUSC", "PAAD", "PCPG", "PRAD", "READ", "SKCM", "SARC", "STAD", "THCA", "UCEC", "UCS", "UVM".
```R
outputClinical("BRCA")
outputSubtype("BRCA")
```
Output: `"TCGA Study Abbreviation"_clinic.csv` or `"TCGA Study Abbreviation"_subtype.csv`
## Select samples in reference to clinical or subtype data
- You can select patients based on the clinical or subtype data obtained above.
```R
patientSelection(type = subtype,
ID = "patient",
pathologic_stage %in% c("Stage_I", "Stage_II"),
age_at_initial_pathologic_diagnosis < 60)
```
**type** :
You can choose `clinical` or `subtype`. With `clinical`, patients are selected from `"TCGA Study Abbreviation"_clinical.csv`; with `subtype`, they are selected from `"TCGA Study Abbreviation"_subtype.csv`. Run outputClinical() or outputSubtype() beforehand so that the corresponding file exists.
**ID** :
Column name that contains the patient's ID (ex. TCGA-E2-A14U, TCGA-E9-A1RC, ...) in the .csv file referenced by "type".
**Third and subsequent arguments** :
Conditions for selecting samples. You can set multiple conditions.
| Condition | How to write |
| ------------- | -------------------------------------------------------------- |
| Patients whose value in column x is "A" | x == "A" |
| Patients whose value in column x is "A", "B" or "C" | x %in% c("A", "B", "C") |
| Patients whose value in column x is greater than A | x > A |
| Patients whose value in column x is less than A | x < A |
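
The condition syntax in the table is ordinary R logical expressions on data-frame columns. As a rough illustration only (this is not the implementation of `patientSelection()`), the sketch below applies the same kinds of conditions to a small made-up table with base R `subset()`. The column values and the third patient ID are invented, and it assumes that multiple conditions are combined with AND.

```R
# Toy subtype table; all values are hypothetical, for illustration only.
subtype <- data.frame(
  patient = c("TCGA-E2-A14U", "TCGA-E9-A1RC", "TCGA-XX-0000"),
  pathologic_stage = c("Stage_I", "Stage_III", "Stage_II"),
  age_at_initial_pathologic_diagnosis = c(55, 48, 67),
  stringsAsFactors = FALSE
)

# Assumption: every condition must hold (logical AND).
selected <- subset(
  subtype,
  pathologic_stage %in% c("Stage_I", "Stage_II") &
    age_at_initial_pathologic_diagnosis < 60
)

# IDs of the patients that satisfy both conditions.
selected_ids <- selected$patient
print(selected_ids)
```

Only the first toy patient satisfies both the stage and the age condition, so a single ID remains.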
## Download TCGA gene expression data (HTSeq-Counts)
- Download the gene expression data for the specified sample types [(Sample Type Codes)](https://gdc.cancer.gov/resources-tcga-users/tcga-code-tables/sample-type-codes) in the cancer type passed to `outputClinical()` or `outputSubtype()`. Only data from the patients selected by `patientSelection()` are downloaded.
```R
downloadTCGA(cancertype = "BRCA",
sampletype = c("01", "06"),
outputresult = FALSE)
```
Output: Number of selected samples
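
For orientation, the two-digit sample type code is embedded in each TCGA barcode at characters 14 and 15. The sketch below shows how a `sampletype` filter of this kind could be expressed in base R; the full barcodes are hypothetical (built from the patient IDs quoted above), and this is not necessarily how `downloadTCGA()` does it internally.

```R
# Hypothetical aliquot barcodes; only the patient-ID prefixes come from this README.
barcodes <- c(
  "TCGA-E2-A14U-01A-11R-A115-07",
  "TCGA-E2-A14U-11A-11R-A115-07",
  "TCGA-E9-A1RC-06A-11R-A157-07"
)

# Characters 14-15 of a TCGA barcode hold the sample type code.
sample_type <- substr(barcodes, 14, 15)

# Keep primary solid tumor ("01") and metastatic ("06") samples.
keep <- barcodes[sample_type %in% c("01", "06")]
print(keep)
```

The second barcode carries code "11" (solid tissue normal) and is therefore dropped.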
## Download CCLE transcriptomic data
- Download CCLE transcriptomic data. You can select cell lines derived from [`one specific cancer type`](CCLE_cancertype.txt).
```R
downloadCCLE(cancertype = "BREAST",
outputresult = FALSE)
```
Output: Number of selected samples
## Merge TCGA and CCLE data
1. Merge the TCGA data downloaded with `downloadTCGA()` and the CCLE data downloaded with `downloadCCLE()`
2. Run ComBat-seq to remove batch effects between the TCGA and CCLE datasets
3. Output the total read counts of all samples so that you can decide the cutoff values of total read counts for `normalization()`
```R
mergeTCGAandCCLE(outputresult = FALSE)
```
Output : totalreadcounts.csv
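
As a minimal sketch of what "total read counts" means here, the example below sums a made-up gene-by-sample count matrix column by column. The sample names, the counts, and the two-column layout of the result are assumptions for illustration; the actual layout of `totalreadcounts.csv` may differ.

```R
# Toy count matrix: rows are genes, columns are samples (values are made up).
counts <- matrix(
  c(10, 20, 30,
     5, 15, 25,
     0, 40, 60),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("geneA", "geneB", "geneC"),
                  c("TCGA_1", "TCGA_2", "CCLE_1"))
)

# Total read counts per sample are the column sums of the count matrix.
total <- data.frame(
  sample = colnames(counts),
  total_read_counts = unname(colSums(counts))
)
print(total)
```

Inspecting the distribution of these totals (for example with `summary(total$total_read_counts)`) is one way to pick the `min` and `max` cutoffs used in the next step.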
## Normalization of RNA-seq counts data
- Normalize the RNA-seq count data.
- You can specify `min` and `max` values for truncation of total read counts.
- If you do not want to specify a value for truncation, set `min = F` or `max = F`.
```R
normalization(min = 40000000, max = 140000000)
```
Output : TPM_RLE_postComBat.csv
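
The truncation step can be pictured as dropping samples whose total read counts fall outside the chosen range before normalizing. The sketch below does this on toy data and then computes TPM in base R. The counts, gene lengths, cutoffs, and the inclusive bounds are all assumptions, and the RLE step implied by the output file name is not shown; this is not the code behind `normalization()`.

```R
# Toy inputs; counts and gene lengths are invented for illustration.
counts <- matrix(
  c(100, 200, 10,
    300, 600, 30),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("geneA", "geneB"), c("s1", "s2", "s3"))
)
gene_length_kb <- c(geneA = 1, geneB = 3)

# Truncation (assumed inclusive): keep samples with totals within [min, max].
min_total <- 100
max_total <- 1000
total <- colSums(counts)
kept <- counts[, total >= min_total & total <= max_total, drop = FALSE]

# TPM: divide counts by gene length, then scale each sample to one million.
rate <- kept / gene_length_kb
tpm <- t(t(rate) / colSums(rate)) * 1e6
print(tpm)
```

Sample `s3` has a total of 40 reads and is removed by the lower cutoff; each remaining column of `tpm` sums to one million.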