
Commit

Added post about retrieving access-controlled data from dbGAP
tomsing1 committed Sep 17, 2023
1 parent 77e3d1f commit 97880f1
Showing 7 changed files with 1,784 additions and 3,042 deletions.
82 changes: 48 additions & 34 deletions docs/index.html

3,381 changes: 375 additions & 3,006 deletions docs/index.xml

1 change: 1 addition & 0 deletions docs/listings.json
@@ -2,6 +2,7 @@
{
"listing": "/index.html",
"items": [
"/posts/dbgap/index.html",
"/posts/ParquetMatrix/index.html",
"/posts/parquetArray/index.html",
"/posts/parquet/index.html",
1,071 changes: 1,071 additions & 0 deletions docs/posts/dbgap/index.html

30 changes: 29 additions & 1 deletion docs/search.json

6 changes: 5 additions & 1 deletion docs/sitemap.xml
@@ -70,7 +70,7 @@
</url>
<url>
<loc>https://tomsing1.github.io/blog/index.html</loc>
- <lastmod>2023-09-14T03:44:36.887Z</lastmod>
+ <lastmod>2023-09-17T23:09:50.717Z</lastmod>
</url>
<url>
<loc>https://tomsing1.github.io/blog/posts/fujita_2022/index.html</loc>
@@ -140,4 +140,8 @@
<loc>https://tomsing1.github.io/blog/posts/ParquetMatrix/index.html</loc>
<lastmod>2023-09-14T03:44:36.189Z</lastmod>
</url>
+ <url>
+ <loc>https://tomsing1.github.io/blog/posts/dbgap/index.html</loc>
+ <lastmod>2023-09-17T23:09:50.020Z</lastmod>
+ </url>
</urlset>
255 changes: 255 additions & 0 deletions posts/dbgap/index.qmd
@@ -0,0 +1,255 @@
---
title: "Retrieving access-controlled data from NCBI's dbGAP repository"
author: "Thomas Sandmann"
date: "2023-09-17"
freeze: true
categories: [NCBI, dbGAP, TIL]
editor:
markdown:
wrap: 72
format:
html:
toc: true
toc-depth: 4
code-tools:
source: true
toggle: false
caption: none
editor_options:
chunk_output_type: console
---

## tl;dr:

Today I learned about different ways to retrieve access-controlled
short-read data from
[NCBI's dbGAP repository](https://www.ncbi.nlm.nih.gov/gap/).
dbGAP hosts both publicly available and _access-controlled_ data. The
latter is typically used to disseminate data from individual human
participants and requires a data access application.

After the data access request has been granted, it is time to retrieve
the actual data from dbGAP and - in the case of short-read data - from
its sister repository, the
[Sequence Read Archive (SRA)](https://www.ncbi.nlm.nih.gov/sra).

## Authenticating with JWT or NGC files

The path to authenticating and downloading dbGAP data differs depending
on whether you are working on the AWS or GCP cloud platforms, or on
local compute infrastructure (or another, unsupported cloud provider).

### Authentication within AWS or GCP cloud environments

On these two platforms, you have two paths to access the data:

1. With a `JWT` file: A JWT[^1] file,
[introduced with `sra-tools` version 2.10](https://github.com/ncbi/sra-tools/wiki/First-help-on-decryption-dbGaP-data#using-jwt-tokens),
allows you to transfer data from dbGAP's cloud buckets into your own
cloud instance. Because your instance and dbGAP's buckets share the
same cloud environment, this is faster than a regular transfer, e.g.
via https or ftp.[^2] (See the sketch after this list.)

[^1]: [JSON Web Token](https://jwt.io/)
[^2]: [dbGAP's official JWT instructions are here.](https://www.ncbi.nlm.nih.gov/sra/docs/sra-dbGAP-cloud-download/)

2. Via `fusera`: Alternatively, you can mount dbGAP's buckets as
read-only volumes on your cloud instances via
[fusera](https://github.com/mitre/fusera)[^3].

[^3]: [dbGAP's official fusera instructions are here.](https://www.ncbi.nlm.nih.gov/sra/docs/dbgap-cloud-access/#access-by-fusera)
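
For the `JWT` route, the download itself still goes through `prefetch`.
A minimal sketch using `prefetch`'s `--perm` option, assuming a JWT
cart file named `cart.jwt` downloaded from the SRA Run Selector (the
file name and accession are placeholders):

```bash
# retrieve a controlled-access run, authenticating with a JWT cart file
prefetch --perm cart.jwt SRR0000001
```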

::: {.callout-note collapse=true}

### The nf-core/fetchngs workflow

The
[nf-core/fetchngs](https://nf-co.re/fetchngs)
workflow supports the retrieval of dbGAP data via `JWT` file authentication,
e.g. when it is executed on AWS or GCP compute instances (see above). As all
nf-core workflows, it is easily parallelized, e.g. across an HPC or via an AWS
Batch queue. (Highly recommended when you need to retrieve large amounts of
data.)
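
For reference, a minimal invocation might look like the sketch below
(the id file and output directory are placeholders; consult the
workflow documentation for how to supply your dbGAP credentials):

```bash
# ids.csv lists one SRA run accession (e.g. SRR0000001) per line
nextflow run nf-core/fetchngs \
  -profile docker \
  --input ids.csv \
  --outdir results
```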

:::

### Authenticating outside AWS / GCP

On all other compute platforms, including your laptop or your local
high-performance computing (HPC) cluster, you need to authenticate with
an `NGC` file (containing your _repository key_) instead[^4][^5].

[^4]: [dbGAP's official NGC instructions are here.](https://www.ncbi.nlm.nih.gov/sra/docs/sra-dbgap-download/)

[^5]: `NGC` file authentication also works on cloud instances, e.g. an AWS EC2
instance, but it is slower as it doesn't take advantage of the fact that your
instance and dbGAP's data bucket are co-located.

In this blog post, I will outline how to use `NGC` authentication, but make
sure to read
[dbGAP's official documentation](https://www.ncbi.nlm.nih.gov/sra/docs/sra-dbgap-download/)
as well.

### Retrieving dbGAP data with NGC authentication

If you are _not_ working on AWS or GCP, and need to rely on `NGC`
authentication, the following steps might be useful.

#### 1. Log into dbGAP

- Navigate to
[the dbGAP login page for controlled access](https://dbgap.ncbi.nlm.nih.gov/aa/wga.cgi?page=login)
and log in with your eRA credentials.

#### 2. Install sra-tools from GitHub

I usually download the latest binary of the `sra-tools` suite for my
operating system from
[GitHub](https://github.com/ncbi/sra-tools/wiki/01.-Downloading-SRA-Toolkit).
Alternatively, you can install it via
[Bioconda](https://anaconda.org/bioconda/sra-tools).

::: {.callout-note}

Please note that the `sra-tools` package is frequently updated, so make
sure you have the latest version.

:::

For example, this code snippet retrieves and decompresses the latest
version for Ubuntu Linux into the `~/bin` directory:

```bash
mkdir -p ~/bin
pushd ~/bin
wget https://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/3.0.7/sratoolkit.3.0.7-ubuntu64.tar.gz
tar xfz sratoolkit.3.0.7-ubuntu64.tar.gz
rm sratoolkit.3.0.7-ubuntu64.tar.gz
popd
```

Afterward, I add the toolkit's `bin` directory to my `PATH` and verify
that it's the version I expected:

```bash
export PATH=~/bin/sratoolkit.3.0.7-ubuntu64/bin:$PATH
prefetch --version # verify that it's the version you downloaded
```
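
To make this change permanent, I append the same `export` statement to
my shell's startup file, e.g. for bash:

```bash
echo 'export PATH=~/bin/sratoolkit.3.0.7-ubuntu64/bin:$PATH' >> ~/.bashrc
```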

::: {.callout-note}

The best source of information about the various tools included in the
`sra-tools` suite is the
[sra-tools wiki](https://github.com/ncbi/sra-tools/wiki).

:::

#### 3. Configure the sra toolkit

Next, I configure the toolkit, especially the location of the `cache` directory.
The `prefetch` command stores all `SRA` files it downloads in this location, so
I make sure it is on a volume that is large enough to hold the expected amount
of data.

```bash
vdb-config -i
```

- In the `Cache` section, I specify an existing directory as the
  `public user-repository`. This is where `prefetch` will download
  files to (and where they will be kept until the cache is cleared!)

My settings are stored in the `${HOME}/.ncbi/user-settings.mkfg` file. For more
information and other ways to configure the toolkit, please see
[its wiki page](https://github.com/ncbi/sra-tools/wiki/08.-prefetch-and-fasterq-dump).
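
The same setting can also be scripted, e.g. when provisioning a fresh
instance. This is a sketch based on my reading of `vdb-config`'s
`--set` option; treat the exact configuration node name as an
assumption, and the cache path as a placeholder:

```bash
# point the public user-repository at a large volume (non-interactive)
mkdir -p /data/sra-cache
vdb-config --set /repository/user/main/public/root=/data/sra-cache
```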

#### 4. Log into dbGAP to retrieve the repository key

- Back on the dbGAP website, navigate to the “My Projects” tab.
- Choose “get dbGaP repository key” in the “Actions” column.
- Download the _repository key_ file with the `.ngc` extension to your system.

#### 5. Choose the files to download from SRA

- In your dbGAP account, next navigate to the "My requests" tab.
- Click on "Request files" on the right side of the table.
- Navigate to the `SRA data (reads and reference alignments)` tab.
- Click on the `SRA Run Selector` button.
- Select all of the files you would like to download in the table at the bottom
of the page.
- Toggle the `Selected` radio button.

#### 6. Download the `.krt` file that specifies which files to retrieve

- Download the `.krt` file by clicking on the green `Cart file` button.

#### 7. Initiate the download of the files in `SRA` format

- Now, with both the `.ngc` and `.krt` files in hand, we can trigger
  the download with sra-tools' `prefetch` command. We need to provide
  paths to _both_
  - the repository key (via `--ngc`) and
  - the cart file (via `--cart`).

For example, this code snippet assumes the two files are in my home
directory. (The exact names of your `.ngc` and `.krt` files will be
different.)

```bash
mkdir -p ~/dbgap
pushd ~/dbgap
# 'u' removes prefetch's default maximum file size limit
prefetch \
  --max-size u \
  --ngc ~/prj_123456.ngc \
  --cart ~/cart_DAR12345_2023081212345.krt
popd
```
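
Before extracting any reads, it is worth verifying that the downloads
are intact with `vdb-validate`, which ships with `sra-tools`. (For
access-controlled data you may need to pass the same `.ngc` file; the
cache path below matches the `public user-repository` configured
above.)

```bash
# check the structural integrity of each downloaded run
for f in ~/cache/sra/*.sra; do
  vdb-validate "$f"
done
```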

Note: The files are downloaded (and cached) in SRA format in the
directory I specified when configuring the sra-toolkit (i.e. the
`public user-repository`). Extracting reads and generating FASTQ files
is a separate step.

#### 8. Decrypt SRA files and extract reads in FASTQ format

🚨 The final FASTQ files will be approximately 7 times the size of the
accession. The `fasterq-dump` tool needs temporary (scratch) space of
about 1.5 times the size of the final FASTQ files during the
conversion. Overall, the space you need during the conversion is
approximately 17 times the size of the accession.
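
As a quick sanity check, I compare this rule of thumb against the free
space on the scratch volume before starting the conversion (assuming
the cache directory configured above):

```bash
# total size of the cached .sra files, in kilobytes
SRA_KB=$(du -sck ~/cache/sra/*.sra | awk 'END {print $1}')
# ~17x is the rule of thumb quoted above
echo "Roughly $((SRA_KB * 17 / 1024 / 1024)) GB needed during conversion"
df -h .
```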

🚨 The extraction and recompression steps are very CPU intensive, and it is
recommended to use multiple cores. (The code below uses _all_ available cores,
as determined via the `nproc --all` command.)

The `fasterq-dump` tool extracts the reads into FASTQ files. It only accepts a
single accession at a time, and expects to find the corresponding SRA file in
the cache directory. Like the `prefetch` command above, it requires the `.ngc`
file to verify that I am permitted to decrypt the data.

To save disk space, I only extract a single SRA file at a time and then
compress the FASTQ files with `pigz`. Afterward, I copy the compressed
FASTQ files to an AWS S3 bucket and delete the local files before
processing the next accession.

```bash
#!/usr/bin/env bash
set -e
set -o nounset

# the cache directory holding the .sra files; note that the tilde does
# not expand inside quotes, so ${HOME} is used instead of "~"
declare -r CACHEDIR="${HOME}/cache/sra"
declare -r BUCKET="s3://my-s3-bucket-for-storing-dbGAP-data"

for SRA_FILE in "${CACHEDIR}"/*.sra
do
  # extract FASTQ files for one accession into the current directory
  fasterq-dump -p \
    --threads "$(nproc --all)" \
    --ngc ~/prj_123456.ngc \
    "$(basename "${SRA_FILE}" .sra)"
  # compress all extracted FASTQ files in parallel
  pigz \
    --processes "$(nproc --all)" \
    *.fastq
  # copy the compressed files to S3 and remove the local copies
  aws s3 sync \
    --exclude "*" \
    --include "*.fastq.gz" \
    . \
    "${BUCKET}"
  rm *.fastq.gz
done
```
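
Note that for paired-end runs, `fasterq-dump`'s default mode writes
separate `_1.fastq` and `_2.fastq` files (plus a third file for any
unpaired mates), so expect two or three compressed files per accession
in the bucket. Once the uploads have been verified, the cached `.sra`
files can be deleted to reclaim space:

```bash
rm ~/cache/sra/*.sra
```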
