From 97880f1fe8f0c6f496ce40e4a21c82d310d539bf Mon Sep 17 00:00:00 2001 From: Thomas Sandmann Date: Sun, 17 Sep 2023 16:10:06 -0700 Subject: [PATCH] Added post about retrieving access-controlled data from dbGAP --- docs/index.html | 82 +- docs/index.xml | 3381 ++++------------------------------- docs/listings.json | 1 + docs/posts/dbgap/index.html | 1071 +++++++++++ docs/search.json | 30 +- docs/sitemap.xml | 6 +- posts/dbgap/index.qmd | 255 +++ 7 files changed, 1784 insertions(+), 3042 deletions(-) create mode 100644 docs/posts/dbgap/index.html create mode 100644 posts/dbgap/index.qmd diff --git a/docs/index.html b/docs/index.html index a6cf384..9d85cfe 100644 --- a/docs/index.html +++ b/docs/index.html @@ -171,7 +171,7 @@

Welcome

+
Categories
All (34)
AWS (3)
Bioconductor (2)
CSS (1)
Literature (1)
NCBI (1)
NGS (4)
R (17)
SQL (2)
TIL (27)
conda (1)
conference (1)
databases (1)
dbGAP (1)
news (1)
nextflow (5)
parquet (3)
postgres (2)
python (2)
quarto (2)
scRNAseq (1)
streamlit (1)
tidyverse (1)
visualization (2)
@@ -203,7 +203,21 @@
Categories
- + + +Sep 17, 2023 + + +Retrieving access-controlled data from NCBI’s dbGAP repository + + +Thomas Sandmann + + +6 min + + + Sep 13, 2023 @@ -217,7 +231,7 @@
Categories
11 min - + Sep 5, 2023 @@ -231,7 +245,7 @@
Categories
11 min - + Aug 31, 2023 @@ -245,7 +259,7 @@
Categories
7 min - + Aug 30, 2023 @@ -259,7 +273,7 @@
Categories
10 min - + Aug 28, 2023 @@ -273,7 +287,7 @@
Categories
6 min - + Jul 31, 2023 @@ -287,7 +301,7 @@
Categories
1 min - + Jul 24, 2023 @@ -301,7 +315,7 @@
Categories
3 min - + Jul 21, 2023 @@ -315,7 +329,7 @@
Categories
1 min - + Jul 21, 2023 @@ -329,7 +343,7 @@
Categories
3 min - + Jun 26, 2023 @@ -343,7 +357,7 @@
Categories
3 min - + May 6, 2023 @@ -357,7 +371,7 @@
Categories
6 min - + Mar 12, 2023 @@ -372,7 +386,7 @@
Categories
+ Feb 25, 2023 @@ -386,7 +400,7 @@
Categories
4 min - + Jan 21, 2023 @@ -400,7 +414,7 @@
Categories
12 min - + Jan 16, 2023 @@ -414,7 +428,7 @@
Categories
13 min - + Jan 16, 2023 @@ -428,7 +442,7 @@
Categories
4 min - + Jan 16, 2023 @@ -442,7 +456,7 @@
Categories
14 min - + Jan 16, 2023 @@ -456,7 +470,7 @@
Categories
13 min - + Jan 2, 2023 @@ -471,7 +485,7 @@
Categories
+ Dec 27, 2022 @@ -485,7 +499,7 @@
Categories
27 min - + Dec 24, 2022 @@ -499,7 +513,7 @@
Categories
8 min - + Dec 22, 2022 @@ -513,7 +527,7 @@
Categories
2 min - + Dec 12, 2022 @@ -527,7 +541,7 @@
Categories
11 min - + Dec 11, 2022 @@ -541,7 +555,7 @@
Categories
2 min - + Dec 10, 2022 @@ -555,7 +569,7 @@
Categories
1 min - + Dec 8, 2022 @@ -569,7 +583,7 @@
Categories
1 min - + Nov 17, 2022 @@ -583,7 +597,7 @@
Categories
1 min - + Nov 15, 2022 @@ -597,7 +611,7 @@
Categories
8 min - + Nov 14, 2022 @@ -611,7 +625,7 @@
Categories
1 min - + Nov 13, 2022 @@ -625,7 +639,7 @@
Categories
3 min - + Nov 13, 2022 @@ -639,7 +653,7 @@
Categories
3 min - + Nov 12, 2022 @@ -653,7 +667,7 @@
Categories
1 min - + Nov 12, 2022 diff --git a/docs/index.xml b/docs/index.xml index 493c062..0183713 100644 --- a/docs/index.xml +++ b/docs/index.xml @@ -11,7 +11,381 @@ Thomas' personal thoughts, opinions, learnings and code examples quarto-1.3.450 -Wed, 13 Sep 2023 07:00:00 GMT +Sun, 17 Sep 2023 07:00:00 GMT + + Retrieving access-controlled data from NCBI’s dbGAP repository + Thomas Sandmann + https://tomsing1.github.io/blog/posts/dbgap/index.html + +

tl;dr:

+

Today I learned about different ways to retrieve access-controlled short read data from NCBI’s dbGAP repository. dbGAP hosts both publicly available and access-controlled data. The latter is typically used to disseminate data from individual human participants and requires a data access application.

+

After the data access request has been granted, it is time to retrieve the actual data from dbGAP and - in the case of short read data - its sister repository, the Sequence Read Archive (SRA).

+ +
+

Authenticating with JWT or NGC files

+

The path to authenticating and downloading dbGAP data differs depending on whether you are using the AWS or GCP cloud providers, or local compute infrastructure (or another, unsupported cloud provider) instead.

+
+

Authentication within AWS or GCP cloud environments

+

On these two platforms, you have two paths to access the data:

+
    +
  1. With a JWT file: A JWT1 file, introduced with sra-tools version 2.10, allows users to transfer data from dbGAP’s cloud buckets into your own cloud instance. (Because your systems and dbGAP’s share the same cloud environment, this is faster than a regular transfer, e.g. via HTTPS or FTP.)2
  2. Via fusera: Alternatively, you can mount dbGAP’s buckets as read-only volumes on your cloud instances via fusera3
+
+ +
+
+
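As a concrete sketch of the first path, a JWT-based download boils down to pointing prefetch at the permission file. The flag names below are taken from the sra-tools help output, and the file names are placeholders — confirm both against your installed version before running anything.

```shell
# Hypothetical sketch only: JWT-authenticated download on an AWS or GCP
# instance. File names are placeholders; check `prefetch --help` for the
# JWT-related flags in your sra-tools version.
JWT_FILE="prj_123456.jwt"        # JWT permission file issued by dbGAP
KRT_FILE="cart_DAR12345.krt"     # cart file listing the selected runs

cmd=(prefetch --perm "${JWT_FILE}" --cart "${KRT_FILE}")
echo "would run: ${cmd[*]}"
# "${cmd[@]}"                    # uncomment to execute on the cloud instance
```

Building the command as an array and echoing it first makes it easy to review the exact invocation before triggering a potentially large transfer.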

The nf-core/fetchngs workflow supports the retrieval of dbGAP data via JWT file authentication, e.g. when it is executed on AWS or GCP compute instances (see above). Like all nf-core workflows, it is easily parallelized, e.g. across an HPC or via an AWS Batch queue. (Highly recommended when you need to retrieve large amounts of data.)
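For orientation, a generic launch of the pipeline might look like the sketch below. Only the standard `--input` and `--outdir` parameters are shown; fetchngs also takes a parameter for the dbGAP key file when using JWT authentication, but I have not verified its exact name here, so check the current pipeline documentation.

```shell
# Hypothetical nf-core/fetchngs launch (sketch only).
# ids.csv lists one SRA run accession per line.
cmd=(nextflow run nf-core/fetchngs
     -profile docker
     --input ids.csv
     --outdir results)
echo "would run: ${cmd[*]}"
# "${cmd[@]}"   # uncomment to launch the pipeline
```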

+
+
+
+
+
+

Authenticating outside AWS / GCP

+

On all other compute platforms, including your laptop or your local high-performance compute cluster (HPC), you need to authenticate with an NGC file (containing your repository key) instead45.

+

In this blog post, I will outline how to use NGC authentication, but make sure to read dbGAP’s official documentation as well.

+
+
+

Retrieving dbGAP data with NGC authentication

+

If you are not working on AWS or GCP, and need to rely on NGC authentication, the following steps might be useful.

+
+

1. Log into dbGAP

+ +
+
+

2. Install sra-tools from github

+

I usually download the latest binary of the sra-tools suite for my operating system from github (https://github.com/ncbi/sra-tools/wiki/01.-Downloading-SRA-Toolkit). Alternatively, you can install it via Bioconda

+
+
+
+ +
+
+Note +
+
+
+

Please note that the sra-tools package is frequently updated, so make sure you have the latest version.

+
+
+

For example, this code snippet retrieves and decompresses the latest version for Ubuntu Linux into the ~/bin directory:

+
mkdir -p ~/bin
+pushd ~/bin
+wget https://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/3.0.7/sratoolkit.3.0.7-ubuntu64.tar.gz
+tar xfz sratoolkit.3.0.7-ubuntu64.tar.gz
+rm sratoolkit.3.0.7-ubuntu64.tar.gz
+popd
+

Afterward, I add the sra-tools bin directory to my PATH and verify that it’s the version I expected:

+
export PATH=~/bin/sratoolkit.3.0.7-ubuntu64/bin:$PATH
+prefetch --version  # verify that it's the version you downloaded
+
+
+
+ +
+
+Note +
+
+
+

The best source of information about using the various tools included in the sra-tools suite is the sra-tools wiki.

+
+
+
+
+

3. Configure the sra toolkit

+

Next, I configure the toolkit, especially the location of the cache directory. The prefetch command stores all SRA files it downloads in this location, so I make sure it is on a volume that is large enough to hold the expected amount of data.

+
vdb-config -i
+
    +
  • In the Cache section, I specify an existing directory as the public user-repository. This is where prefetch will download files to (and where they will be kept until the cache is cleared!)
+

My settings are stored in the ${HOME}/.ncbi/user-settings.mkfg file. For more information and other ways to configure the toolkit, please see its wiki page.
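If you prefer to script the setup rather than step through the interactive screens, vdb-config can also be driven non-interactively. The node name below is an assumption on my part, based on the layout of the user-settings file — verify it against the sra-tools wiki for your version before relying on it.

```shell
# Hypothetical non-interactive alternative to `vdb-config -i` (sketch only;
# the configuration node name is an assumption -- verify before use).
CACHE_ROOT="/data/sra-cache"    # pick a volume with plenty of free space
cmd=(vdb-config --set "/repository/user/main/public/root=${CACHE_ROOT}")
echo "would run: ${cmd[*]}"
# mkdir -p "${CACHE_ROOT}" && "${cmd[@]}"   # uncomment to apply
```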

+
+
+

4. Log into dbGAP to retrieve the repository key

+
    +
  • Back on the dbGAP website, navigate to the “My Projects” tab.
  • Choose “get dbGaP repository key” in the “Actions” column.
  • Download the repository key file with the .ngc extension to your system.
+
+
+
+

5. Choose the files to download from SRA

+
    +
  • In your dbGAP account, navigate to the “My requests” tab.
  • Click on “Request files” on the right side of the table.
  • Navigate to the “SRA data (reads and reference alignments)” tab.
  • Click on “SRA Run Selector”.
  • Select all of the files you would like to download in the table at the bottom of the page.
  • Toggle the “Selected” radio button.
+
+
+
+

6. Download the .krt file that specifies which files to retrieve

+
    +
  • Download the .krt file by clicking on the green “Cart file” button.
+
+
+
+

7. Initiate the download of the files in SRA format

+
    +
  • Now, with both the .ngc and .krt files in hand, we can trigger the download with sra-tools’ prefetch command. We need to provide the paths to both
      • the repository key (via --ngc) and
      • the cart file (via --cart).
+

For example, this code snippet assumes the two files are in my home directory. (The exact names of your .ngc and .krt files will be different.)

+
mkdir -p ~/dbgap
+pushd ~/dbgap
+prefetch \
+  --max-size u \
+  --ngc ~/prj_123456.ngc \
+  --cart ~/cart_DAR12345_2023081212345.krt
+popd
+

Note: The files are downloaded (and cached) in SRA format into the directory I specified when configuring the sra-toolkit (i.e. the public user-repository). Extracting reads and generating FASTQ files is a separate step.
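Before spending CPU time on extraction, it can be worth confirming that every cached run downloaded intact. The vdb-validate tool ships with sra-tools; the loop below is a sketch that assumes the cache path chosen during configuration (a placeholder here), and access-controlled runs may additionally require your .ngc key — check `vdb-validate --help`.

```shell
# Sketch: check each cached SRA file before conversion (placeholder path).
CACHEDIR="${HOME}/cache/sra"
checked=0
for f in "${CACHEDIR}"/*.sra; do
  [ -e "$f" ] || continue          # empty glob: nothing to check
  # vdb-validate "$f"              # uncomment to run the actual check
  checked=$((checked + 1))
done
echo "runs to validate: ${checked}"
```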

+
+
+

8. Decrypt SRA files and extract reads in FASTQ format

+

🚨 The final FASTQ files will be approximately 7 times the size of the accession. The fasterq-dump tool needs temporary space (scratch space) of about 1.5 times the size of the final FASTQ files during the conversion. Overall, the space you need during the conversion is approximately 17 times the size of the accession.

+

🚨 The extraction and recompression steps are very CPU intensive, and it is recommended to use multiple cores. (The code below uses all available cores, as determined via the nproc --all command.)
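To put those multipliers into numbers before starting a large conversion, a quick back-of-the-envelope calculation helps; the helper functions below simply encode the 7x / 1.5x / 17x figures quoted above.

```shell
# Rough disk budget for fasterq-dump, per the multipliers quoted above.
fastq_gb()   { echo $(( $1 * 7 )); }          # final FASTQ size (GB)
scratch_gb() { echo $(( $1 * 7 * 3 / 2 )); }  # ~1.5x the FASTQ size (GB)
total_gb()   { echo $(( $1 * 17 )); }         # overall requirement (GB)

total_gb 100   # a 100 GB accession needs roughly 1700 GB free
```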

+

The fasterq-dump tool extracts the reads into FASTQ files. It only accepts a single accession at a time, and expects to find the corresponding SRA file in the cache directory. Like the prefetch command above, it requires the .ngc file to verify that I am permitted to decrypt the data.

+

To save disk space, I extract only a single SRA file at a time and then compress the FASTQ files with pigz. Afterward, I copy the compressed FASTQ files to an AWS S3 bucket and delete the local files before processing the next accession.

+
#!/usr/bin/env bash
+set -e
+set -o nounset
+
+declare -r CACHEDIR="${HOME}/cache/sra"  # the cache directory with .sra files (a tilde would not expand inside quotes)
+declare -r BUCKET="s3://my-s3-bucket-for-storing-dbGAP-data"
+
+for SRA_FILE in ${CACHEDIR}/*.sra
+do
+  fasterq-dump -p \
+    --threads $(nproc --all) \
+    --ngc ~/prj_123456.ngc \
+    "$(basename "${SRA_FILE}" .sra)"
+  pigz \
+    --processes $(nproc --all) \
+    *.fastq
+  aws s3 sync \
+    --exclude "*" \
+    --include "*.fastq.gz" \
+    . \
+    ${BUCKET}
+  rm *.fastq.gz
+done
+ + + + +
+
+
+ + +

Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 International License.

Footnotes

+ +
    +
  1. JSON Web Token↩︎
  2. dbGAP’s official JWT instructions are here.↩︎
  3. dbGAP’s official fusera instructions are here.↩︎
  4. dbGAP’s official NGC instructions are here.↩︎
  5. NGC file authentication also works on cloud instances, e.g. an AWS EC2 instance, but it is slower as it doesn’t take advantage of the fact that your instance and dbGAP’s data bucket are co-located.↩︎
+
]]>
+ NCBI + dbGAP + TIL + https://tomsing1.github.io/blog/posts/dbgap/index.html + Sun, 17 Sep 2023 07:00:00 GMT +
Adventures with parquet III: single-cell RNA-seq data and comparison with HDF5-backed arrays Thomas Sandmann @@ -17631,3010 +18005,5 @@ SessionInfo https://tomsing1.github.io/blog/posts/geneset-sqlite-db/index.html Mon, 02 Jan 2023 08:00:00 GMT - - Interactive GSEA results: visualizations with reactable & plotly - Thomas Sandmann - https://tomsing1.github.io/blog/posts/interactive-gene-set-results/index.html - As a Computational Biologist, I frequently analyze data from high throughput experiments, including transcriptomics, proteomics or metabolomics results. As a first step, I usually examine the behavior of individual analysis - genes, proteins or metabolites - and obtain a long list of effect sizes, p- or q-values.

-

Frequently, another layer of analysis focuses on the behavior of predefined gene sets, e.g. groups of genes whose up- or down-regulation reflects the activity of a biological process, a metabolic pathway or is indicative of a cellular state.

-

There are numerous methods to perform gene set enrichment (GSEA), over-representation (ORA) or pathway analysis, with more than 140 R packages on Bioconductor alone.

-
-
-
- -
-
-Note -
-
-
-

How the gene sets are actually defined, e.g. whether they reflect the underlying biology of the system under study, is likely more critical than the exact choice of algorithm - but that’s a discussion for another day.)

-
-
-
-

Sharing analysis results

-

Regardless of the chosen statistical approach, GSEA or ORA analyses typically produce set-level statistics, e.g. a summary of the effect size across all members of a gene set alongside a statistic, p-value, etc.

-

To share results with my collaborators, I would like to enable them to

-
    -
  1. Browse set-level results to hone in on specific pathways / processes of interest.
  2. -
  3. Visualize the behavior of the analytes in the set.
  4. -
  5. Drill down to a subset of analytes and export e.g. gene-level results
  6. -
-

The pioneering ReportingTools Bioconductor package creates static web pages for gene-set enrichment results, including gene- and set-level plots and statistics. But all of the plots are generated in advance, and interactivity is limited.

-

In this blog post, I take advantage of the reactable, plotly, crosstalk and htmlwidgets R packages to create a stand-alone interactive HTML report, allowing my collaborators to explore the results without the need for a server.

-

I learned a lot about these incredibly useful packages!

-
-
-
- -
-
-Note -
-
-
-

At the time of writing, the current release of the reactable R package (v0.4.1) is not compatible with the latest release of the htmlwidgets (v1.6.0). This issue has already been fixed in reactable’s development version, which is available from github Alternatively, you can use the previous release of htmlwidgets (v1.5.4), e.g.  by installing it with remotes::install_version("htmlwidgets", version = "1.5.4").

-
-
-
-

Features

-

Here, I am combining several interactive elements, linked through SharedData objects via crosstalk:

-
    -
  • At the top, an interactive volcano plot showing the effects sizes (mean trimmed log2 fold changes) and nominal p-values for each tested gene set.
  • -
  • Below, a nested reactable table displays the results for each set. When a row is expanded -
      -
    • It shows a volcano plot with gene-level results, as well as a linked table with the corresponding statistics.
    • -
    • The reader can hone in on specific genes by selecting points in the volcano plot, or by searching the table.
    • -
  • -
-

First, we define a set of helper functions are, which are composed into the main gene_set_report() function.

-
-
-Code -
library(Biobase)
-library(crosstalk)
-library(dplyr)
-library(htmltools)
-library(plotly)
-library(reactable)
-library(sparrow)
-library(stringr)
-library(htmlwidgets)  
-library(V8)  # to create static HTML
-
-#' Retrieve gene-level statistics for a single gene set
-#' 
-#' @param stats named list of data.frames with gene-level statistics, one for
-#' each gene set 
-#' @gene_set_name Scalar character, the name of an element of `stats`.
-#' in `data`.
-#' @return A data.frame with results for a single gene set
-.get_gene_data <- function(mg, gene_set_name, keep.cols = c(
-  "symbol", "entrez_id", "logFC", "pval", "CI.L", "CI.R", "pval", "padj")) {
-  sparrow::geneSet(mg, name = gene_set_name) %>%
-    dplyr::select(tidyselect::any_of(keep.cols)) %>%
-    dplyr::arrange(pval)
-}
-
-#' @importFrom htmltools tags
-.entrez_url <- function(value) {
-  if(!is.na(value) & nzchar(value)) {
-    url <- sprintf("http://www.ncbi.nlm.nih.gov/gene/%s",
-                   value)
-    return(htmltools::tags$a(href = url, target = "_blank", 
-                             as.character(value)))
-  } else {
-    return(value)
-  }
-}
-
-#' @importFrom htmltools tags
-.symbol_url <- function(value) {
-  if(!is.na(value) & nzchar(value)) {
-    url <- sprintf(
-      "https://www.genenames.org/tools/search/#!/?query=%s",
-      value)
-    return(
-      htmltools::tags$a(href = url, target = "_blank", as.character(value))
-    )
-  } else {
-    return(value)
-  }
-}
-
-#' @importFrom htmltools tags
-.msigdb_url <- function(value) {
-  if(!is.na(value) & nzchar(value)) {
-    url <- sprintf(
-      "https://www.gsea-msigdb.org/gsea/msigdb/human/geneset/%s.html",
-      value)
-    return(
-      htmltools::tags$a(href = url, target = "_blank", as.character(value))
-    )
-  } else {
-    return(value)
-  }
-}
-
-#' Create a reactable table with gene-level results
-#' 
-#' @param data A data.frame or a `SharedData` object.
-#' @param defaultColDef A list that defines the default configuration for a 
-#' column, typically the output of the [reactable::colDef] function.
-#' @param columns A list of column definitions, each generated with the 
-#' [reactable::colDef] function.
-#' @param theme A `reactableTheme` object, typically generated with a call to
-#' the [reactable::reactableTheme] function.
-#' @param striped Scalar flag, display stripes?
-#' @param bordered Scalar flag, display borders?
-#' @param highlight Scalar flag, highlight selected rows?
-#' @param searchable Scalar flag, add search box?
-#' @param defaultPageSize Scalar integer, the default number of rows to display.
-#' @param elementId Scalar character, an (optional) element identifier
-#' @param ... Additional arguments for the [reactable::reactable] function.
-#' @return A `reactable` object.
-#' @export
-#' @importFrom reactable colDef reactable colFormat
-#' @examples
-#' \dontrun{
-#' df <- data.frame(
-#'    symbol = c("TP53", "KRAS", "PIK3CA"),
-#'    pval = runif(3, 0, 1),
-#'    logFC = rnorm(3)
-#' )
-#' stats_table(df)
-#' }
-stats_table <- function(
-    data, 
-    defaultColDef = reactable::colDef(
-      align = "center",
-      minWidth = 100,
-      sortNALast = TRUE
-    ),
-    columns = list(
-      symbol = reactable::colDef(
-        name = "Symbol",
-        cell = .symbol_url
-      ),
-      entrezid = reactable::colDef(
-        name = "EntrezId",
-        cell = .entrez_url
-      ),
-      entrez_id = reactable::colDef(
-        name = "EntrezId",
-        cell = .entrez_url
-      ),
-      entrez = reactable::colDef(
-        name = "EntrezId",
-        cell = .entrez_url
-      ),
-      pval = reactable::colDef(
-        name = "P-value",
-        format = reactable::colFormat(digits = 4)),
-      padj = reactable::colDef(
-        name = "P-value",
-        format = reactable::colFormat(digits = 4)),
-      t = reactable::colDef(
-        name = "t-statistic",
-        format = reactable::colFormat(digits = 2)),
-      B = reactable::colDef(
-        name = "log-odds",
-        format = reactable::colFormat(digits = 2)),
-      AveExpr = reactable::colDef(
-        name = "Mean expr",
-        format = reactable::colFormat(digits = 2)),
-      CI.L = reactable::colDef(
-        name = "Lower 95% CI",
-        format = reactable::colFormat(digits = 2)),
-      CI.R = reactable::colDef(
-        name = "Upper 95% CI",
-        format = reactable::colFormat(digits = 2)),
-      logFC = reactable::colDef(
-        name = "logFC", 
-        format = reactable::colFormat(digits = 2),
-        style = function(value) {
-          if (value > 0) {
-            color <- "firebrick"
-          } else if (value < 0) {
-            color <- "navy"
-          } else {
-            color <- "lightgrey"
-          }
-          list(color = color, fontWeight = "bold")
-        }
-      )
-    ),
-    theme = reactable::reactableTheme(
-      stripedColor = "#f6f8fa",
-      highlightColor = "#f0f5f9",
-      cellPadding = "8px 12px",
-      style = list(
-        fontFamily = "-apple-system, BlinkMacSystemFont, Segoe UI, Helvetica, 
-        Arial, sans-serif")
-    ),
-    striped = FALSE,
-    bordered = FALSE,
-    highlight = TRUE,
-    searchable = TRUE,
-    defaultPageSize = 10L,
-    elementId = NULL,
-    ...
-) {
-  reactable::reactable(
-    data = data,
-    searchable = searchable,
-    striped = striped,
-    bordered = bordered,
-    highlight = highlight,
-    selection = "multiple",
-    onClick = "select",
-    rowStyle = list(cursor = "pointer"),
-    theme = theme,
-    defaultPageSize = defaultPageSize,
-    defaultColDef = defaultColDef,
-    columns = columns,
-    elementId = elementId,
-    ...
-  )
-}
-
-#' Wrap stats_table() output in a div html tag
-#' 
-#' @param style Scalar character, the style tag for the tag
-#' @param elementId Scalar character, the element identifier
-#' @param ... Arguments passed on to the `stats_table` function.
-#' @return A `shiny.tag` object.
-#' @importFrom htmltools div tags
-.stats_table_div <- function(
-    style = paste0(
-      "width: 50%;",
-      "float: right;",
-      "padding-top: 1rem;"
-    ),
-    elementId = NULL,
-    ...) {
-  if (is.null(elementId)) {
-    elementId <- basename(tempfile(pattern = "id"))
-  }    
-  htmltools::div(
-    style = style,
-    htmltools::tagList(
-      stats_table(..., elementId = elementId),
-      # download button
-      htmltools::tags$button(
-        "\u21E9 Download as CSV",
-        onclick = sprintf("Reactable.downloadDataCSV('%s', 'gene-results.csv')",
-                          elementId)
-      )
-    )    
-  )
-}
-
-#' Create an interactive volcano plot
-#' 
-#' @param data A data.frame or a `SharedData` object.
-#' @param x A `formula` defining the column of `data` mapped to the x-axis.
-#' @param y A `formula` defining the column of `data` mapped to the y-axis.
-#' @param text A `formula` defining the column of `data` mapped to the tooltip.
-#' @param title Scalar character, the title of the plot.
-#' @param xlab Scalar character, the title of the x-axis
-#' @param ylab Scalar character, the title pf the y-axis
-#' @param title.width Scalar integer, the target line width (passed on to the
-#' [stringr::str_wrap] function.
-#' @param opacity Scalar numeric between 0 and 1, the opacity of the points.
-#' @param marker A list defining the size, line and color limits of the points.
-#' @param colors Character vector of colors used to shade the points.
-#' @param highlight.color Scalar character, the color used to highlight selected
-#' points.
-#' @param webGL Scalar flag, use webGL to render the plot?
-#' @param width Scalar numeric or scalar character, width of the plot
-#' @param height Scalar numeric or scalar character, height of the plot
-#' @param ... Additional arguments passed to the [plotly::plot_ly] function.
-#' @return A `plotly` object.
-#' @importFrom plotly plot_ly add_trace config layout highlight toWebGL
-#' @importFrom grDevices colorRampPalette
-#' @export
-#' @examples
-#' \dontrun{
-#' df <- data.frame(
-#'    symbol = letters,
-#'    pval = runif(length(letters), 0, 1),
-#'    logFC = rnorm(length(letters))
-#' )
-#' volcano_plot(df)
-#' }
-volcano_plot <- function(
-    data, 
-    x = ~logFC,
-    y = ~-log10(pval),
-    text = ~symbol,
-    title = "",
-    xlab = "Fold change (log2)",
-    ylab = "-log10(pval)",
-    title.width = 35L,
-    opacity = 0.5,
-    marker = list(
-      color = ~logFC,
-      size = 10, 
-      cmax = 3,
-      cmid = 0,
-      cmin = -3,
-      line = list(color = "grey", width = 1)),
-    colors = grDevices::colorRampPalette(
-      c('navy', 'lightgrey', 'firebrick'))(15),
-    highlight.color = "red",
-    webGL = FALSE,
-    width = NULL,
-    height = NULL,
-    ...) {
-  p <- plotly::plot_ly(
-    width = width,
-    height = height
-  ) %>% 
-    plotly::add_trace(
-      data = data, 
-      name = "",
-      type = 'scatter',
-      mode = 'markers',
-      x = x,
-      y = y,
-      text = text,
-      hoverinfo ="text",
-      opacity = opacity,
-      colors = colors,
-      marker = marker,
-      ...
-    ) %>%
-    plotly::config(displaylogo = FALSE) %>%
-    plotly::layout(
-        xaxis = list(title = xlab),
-        yaxis = list(title = ylab),
-        title = stringr::str_wrap(
-          stringr::str_replace_all(title, "_", " "),
-          width = title.width)
-    ) %>%
-    plotly::highlight(
-      color = highlight.color,
-      on = "plotly_selected",
-      off = "plotly_deselect"
-    )
-  if (isTRUE(webGL)) p <- plotly::toWebGL(p)
-  return(p)
-}
-
-#' Create an interactive volcano plot for gene-set results
-#' 
-#' @param data A data.frame or a `SharedData` object.
-#' @param x A `formula` defining the column of `data` mapped to the x-axis.
-#' @param y A `formula` defining the column of `data` mapped to the y-axis.
-#' @param text A `formula` defining the column of `data` mapped to the tooltip.
-#' @param xlab Scalar character, the title of the x-axis
-#' @param text.width Scalar integer, the target line width (passed on to the
-#' [stringr::str_wrap] function.
-#' @param hovertemplate Scalar character defining the tooltip template.
-#' @param marker A list defining the size, line and color limits of the points.
-#' @param width Scalar numeric or scalar character, width of the plot
-#' @param height Scalar numeric or scalar character, height of the plot
-#' @param ... Additional arguments passed to the [volcano_plot] function.
-#' @return A `plotly` object.
-#' @importFrom grDevices colorRampPalette
-#' @importFrom stringr str_wrap str_replace_all
-#' @export
-#' @examples
-#' \dontrun{
-#' df <- data.frame(
-#'    name = paste("Set", letters),
-#'    pval = runif(length(letters), 0, 1),
-#'    mean.logFC.trim = rnorm(length(letters)),
-#'    n = sample(1:100, size = length(letters))
-#' )
-#' volcano_gene_set_plot(df)
-#' }
-volcano_gene_set_plot <- function(
-    data,
-    text = ~stringr::str_wrap(
-          stringr::str_replace_all(name, "_", " "),
-          width = text.width),
-    text.width = 25,
-    x = ~mean.logFC.trim,
-    y = ~-log10(pval),
-    marker = list(
-      color = ~mean.logFC.trim,
-      size = ~n, 
-      sizemode = 'area', 
-      cmax = 2,
-      cmid = 0,
-      cmin = -2,
-      line = list(color = "grey", width = 1)
-    ), 
-    hovertemplate = paste(
-            '<b>%{text}</b>',
-            '<br><i>logFC</i>: %{x:.2f}',
-            '<br><i>-log10(pval)</i>: %{y:.2f}',
-            '<br><i>n</i>: %{marker.size}',
-            '<br>'),
-    xlab = "Fold change (log2)",
-    width = NULL,
-    height = NULL,
-    ...)
-{
-  volcano_plot(
-    data = data, 
-    text = text, 
-    x = x, 
-    y = y,
-    marker = marker, 
-    xlab = xlab,
-    hovertemplate = hovertemplate,
-    width = width,
-    height = height,
-    ...)
-}
-
-#' Wrap volcano_plot() output in a div html tag
-#' 
-#' @param helptext Scalar character, text to display below the plot.
-#' @param style Scalar character, the style tag for the tag
-#' @param ... Arguments passed on to the `volcano_plot` function.
-#' @return A `shiny.tag` object.
-#' @importFrom htmltools div tagList p
-.volcano_plot_div <- function(
-    helptext = paste("Draw a rectangle / use the lasso tool to select points,",
-                     "double-click to deselect all."), 
-    style = paste0(
-      "width: 50%;",
-      "float: left;",
-      "padding-right: 1rem;",
-      "padding-top: 4rem;"
-    ), 
-    ...) {
-  htmltools::div(
-    style = style, {
-      htmltools::tagList(
-        volcano_plot(...),
-        htmltools::p(helptext)
-      )
-    }
-  )
-}
-
-#' Helper function to combine gene-level outputs into a single div
-#' 
-#' @param data A data.frame with gene-set results.
-#' @param stats A named list of data.frames whose names much match the `name`
-#' column of `data`.
-#' @param index Scalar count, the row of `data` to plot.
-#' @return A `shiny.tag` object containing the output of the 
-#' `.volcano_plot_div()` and `.stats_table_div()` functions.
-#' @importFrom crosstalk SharedData
-#' @importfrom htmltools tagList div
-.row_details <- function(data, mg, index) {
-  gene_data <- .get_gene_data(mg = mg, gene_set_name = data$name[index])
-  gd <- crosstalk::SharedData$new(gene_data)
-  htmltools::div(
-    htmltools::tagList(
-      # volcano plot
-      .volcano_plot_div(data = gd, title = data$name[index]),
-      # interactive gene-stat table
-      .stats_table_div(data = gd)
-    )
-  )
-}
-
-#' Create a nested gene set result table
-#' 
-#' @param mg A `SparrowResult` object
-#' @param max.pval Scalar numeric, the largest (uncorrected) p-value for which
-#' to return results.
-#' @param max.results Scalar integer, the top number of rows to return
-#' (ordered by p-value).
-#' @param color.up Scalar character, the color for positive log2 fold changes.
-#' @param color.down Scalar character, the color for negative log2 fold changes.
-#' @param color.ns Scalar character, the color for zero log2 fold change.
-#' @param theme A `reactableTheme` object, typically generated with a call to
-#' the [reactable::reactableTheme] function.
-#' @param defaultColDef A list that defines the default configuration for a 
-#' column, typically the output of the [reactable::colDef] function.
-#' @param columns A list of column definitions, each generated with the 
-#' [reactable::colDef] function.
-#' @param bordered Scalar flag, display borders?
-#' @param highlight Scalar flag, highlight selected rows?
-#' @param searchable Scalar flag, add search box?
-#' @param striped Scalar flag, alternate row shading?
-#' @param defaultPageSize Scalar integer, the default number of rows to display.
-#' @param pageSizeOptions Integer vector that will populate the pagination menu.
-#' @param paginationType Scalar character, the pagination control to use. Either
-#' `numbers` for page number buttons (the default), `jump` for a page jump, or 
-#' `simple` to show 'Previous' and 'Next' buttons only.
-#' @param elementId Scalar character, an (optional) element identifier
-#' @param defaultSorted Character vector of column names to sort by default. Or
-#' to customize sort order, a named list with values of `asc` or `desc`.
-#' @param name_url A function that returns a `shiny.tag` (usually an 
-#' `<a href></a>` tag) for each element of the `name` column of `data` to link
-#' to more information about the gene set (e.g. on the MSigDb website, etc).
-#' @param ... Additional arguments for the [reactable::reactable] function.
-#' @importFrom reactable reactable reactableTheme colDef colFormat
-#' @return A `reactable` object with one row for each row in `data`, each of
-#' which can be expanded into the output of the `.row_details()` function
-#' for that specific gene set.
-#' @export
-#' @examples
-#' \dontrun{
-#' library(sparrow)
-#' vm <- sparrow::exampleExpressionSet()
-#' gdb <- sparrow::exampleGeneSetDb()
-#' mg <- sparrow::seas(vm, gdb, c("camera"), design = vm$design, 
-#'                     contrast = 'tumor')
-#' gene_set_table(mg, max.results = 10)
-#' }
-gene_set_table <- function(
-    mg,
-    max.pval = 0.05,
-    max.results = Inf,
-    keep.cols = c("collection", "name", "n", "pval", "padj", 
-                      "mean.logFC.trim"),
-    method = resultNames(mg)[1],
-    color.up = "firebrick", 
-    color.down = "navy",
-    color.ns = "grey50",
-    theme = reactable::reactableTheme(
-      stripedColor = "grey95",
-      highlightColor = "grey80",
-      cellPadding = "8px 12px",
-      style = list(
-        fontFamily = "-apple-system, BlinkMacSystemFont, Segoe UI, Helvetica, 
-        Arial, sans-serif")
-    ),
-    defaultColDef = reactable::colDef(
-      header = function(value) value,
-      align = "center",
-      minWidth = 100,
-      headerStyle = list(background = "#f7f7f8"),
-      sortNALast = TRUE
-    ),
-    name_url = function(value) {value},
-    columns = list(
-      collection = reactable::colDef(
-        name = "Collection"),
-      name = reactable::colDef(
-        name = "Gene set",
-        cell = name_url,
-        minWidth = 150),
-      pval = reactable::colDef(
-        name = "P-value", aggregate = "min",
-        format = reactable::colFormat(digits = 4)),
-      padj = reactable::colDef(
-        name = "FDR", aggregate = "min",
-        format = reactable::colFormat(digits = 4)),
-      Direction =  reactable::colDef(
-        name = "dir", minWidth = 45, 
-        cell = function(value) {
-          if (value == "Up")  "\u2B06" else "\u2B07"
-        }),
-      logFC = reactable::colDef(
-        name = "logFC", format = reactable::colFormat(digits = 2),
-        style = function(value) {
-          if (value > 0) {
-            color <- color.up
-          } else if (value < 0) {
-            color <- color.down
-          } else {
-            color <- color.ns
-          }
-          list(color = color, fontWeight = "bold")
-        }
-      ),
-      mean.logFC.trim = reactable::colDef(
-        name = "logFC", format = reactable::colFormat(digits = 2),
-        style = function(value) {
-          if (value > 0) {
-            color <- color.up
-          } else if (value < 0) {
-            color <- color.down
-          } else {
-            color <- color.ns
-          }
-          list(color = color, fontWeight = "bold")
-        }
-      )
-    ),
-    elementId = "expansion-table",
-    static = TRUE,
-    filterable = TRUE,
-    searchable = TRUE,
-    bordered = TRUE,
-    striped = FALSE,
-    highlight = TRUE,
-    defaultPageSize = 25L,
-    showPageSizeOptions = TRUE,
-    pageSizeOptions = sort(unique(c(25, 50, 100, nrow(data)))),
-    paginationType = "simple",
-    defaultSorted = list(pval = "asc")
-) {
-  data = sparrow::result(mg, method) %>%
-    dplyr::slice_min(n = max.results, order_by = pval) %>%
-    dplyr::filter(pval <= max.pval) %>%
-    dplyr::select(tidyselect::any_of(keep.cols))
-  if (nrow(data) == 0) {
-    warning("None of the gene sets pass the `max.pval` threshold.")
-    return(NULL)
-  }
-  reactable::reactable(
-    data,
-    elementId = elementId,
-    defaultColDef = defaultColDef,
-    static = static,
-    filterable = filterable,
-    searchable = searchable,
-    bordered = bordered,
-    highlight = highlight,
-    theme = theme,
-    defaultPageSize = defaultPageSize,
-    showPageSizeOptions = showPageSizeOptions,
-    pageSizeOptions = pageSizeOptions,
-    paginationType = paginationType,
-    defaultSorted = defaultSorted,
-    columns = columns,
-    details = function(index) {
-      .row_details(data = data, mg = mg, index)
-    }
-  )
-}
-
-#' Create a gene set enrichment report wrapped in a div HTML tag
-#' @param mg A `SparrowResult` object
-#' @param method Scalar character, which results to return from `mg`.
-#' @param max.pval Scalar numeric, return only results with an (uncorrected)
-#' p-value <= `max.pval`.
-#' @param verbose Scalar flag, show messages?
-#' @param title Scalar character, the `h1` title for the element
-#' @param elementId Scalar character, the element identifier for the interactive
-#' table.
-#' @param style Scalar character, the style attribute for the div tag
-#' @param ... Additional arguments passed on to the `gene_set_table` function.
-#' @return A `shiny.tag` object containing the output of the 
-#' `gene_set_table()` function.
-#' @importFrom htmltools div h1 tagList tags
-#' @export
-gene_set_report <- function(
-    mg,
-    method = resultNames(mg)[1], 
-    max.pval = 0.05,
-    max.results = Inf,
-    verbose = TRUE,
-    title = "Gene set enrichment analysis",
-    elementId = "expansion-table",
-    style = "",
-    ...
-) {
-  if (!is.finite(max.results)) {
-    message.log <- sprintf(
-      paste("Reporting all '%s' results with (uncorrected)",
-            "p-value <= %s"), 
-      method, max.pval)
-  } else {
-    message.log <- sprintf(
-      paste("Reporting up to %s '%s' results with (uncorrected)",
-            "p-value <= %s"), 
-      max.results, method, max.pval)
-  }
-  if (isTRUE(verbose)) {
-    message(message.log)
-  }
-  htmltools::div(
-    style = style, 
-    {
-      htmltools::tagList(
-        htmltools::h1(title),
-        htmltools::p(message.log),
-        # volcano plot
-        sparrow::result(mg, method) %>%
-          dplyr::slice_min(n = max.results, order_by = pval) %>%
-          volcano_gene_set_plot(width = "50%"),
-        htmltools::br(),
-        # expansion button
-        htmltools::tags$button(
-          "Expand/collapse all rows",
-          onclick = sprintf("Reactable.toggleAllRowsExpanded('%s')", elementId)
-        ),
-        gene_set_table(mg = mg, max.pval = max.pval, max.results = max.results,
-                       ...)
-      )
-    })
-}
-
-
-

Next, we perform a gene set enrichment analysis with sparrow’s seas function. It returns a convenient SparrowResult S4 object with both gene-set statistics and gene-level differential expression results.

-
-
# gene set enrichment analysis with the sparrow Bioconductor package
-vm <- exampleExpressionSet()
-gdb <- exampleGeneSetDb()
-mg <- seas(vm, gdb, methods = "camera", design = vm$design, contrast = 'tumor')
-
-

Finally, we pass the mg object to our gene_set_report() function, together with arguments requesting that all gene-set results passing a p-value threshold of < 0.05 are included. We also pass the .msigdb_url helper function to the name_url argument, to link the name of each gene set to its description on the msigdb website.

-
-
# create an interactive report
-htmltools::browsable(
-  gene_set_report(mg, method = "camera", max.pval = 0.05, max.results = Inf, 
-                  name_url = .msigdb_url)
-)
-
-
Reporting all 'camera' results with (uncorrected) p-value <= 0.05
-
-
- -
-

Gene set enrichment analysis

-

Reporting all 'camera' results with (uncorrected) p-value <= 0.05

-
- -
- -
Collection
Gene set
n
P-value
FDR
logFC
c2
14
0.0000
0.0000
1.92
c7
175
0.0003
0.0031
0.33
c2
170
0.0013
0.0106
0.17
c2
50
0.0015
0.0106
-0.74
c2
52
0.0022
0.0126
-0.76
c6
33
0.0133
0.0632
-0.67
c2
47
0.0152
0.0662
-0.44
c6
89
0.0157
0.0662
-0.41
c6
98
0.0227
0.0823
-0.43
c2
11
0.0227
0.0823
-0.50
c6
96
0.0227
0.0823
-0.34
c6
90
0.0249
0.0848
-0.31
c6
189
0.0256
0.0848
-0.26
c6
50
0.0296
0.0888
0.17
c6
183
0.0306
0.0888
-0.23
1–25 of 31 rows
Show
1 of 2
- -
-
-
-



-
-
-

Details

-
-

Gene set enrichment analysis

-

In this example, I am using Steve Lianoglou’s sparrow Bioconductor package to perform gene set enrichment analysis. But any other method could be used, as long as both set-level and gene-level differential expression statistics can be obtained.

-

sparrow supports numerous GSEA and ORA methods. Here, I am using the camera algorithm from the limma Bioconductor package for illustration.

-
-
-

Reproducibility

-
- -SessionInfo - -
-
-
─ Session info ───────────────────────────────────────────────────────────────
- setting  value
- version  R version 4.3.1 (2023-06-16)
- os       macOS Ventura 13.5.1
- system   aarch64, darwin20
- ui       X11
- language (EN)
- collate  en_US.UTF-8
- ctype    en_US.UTF-8
- tz       America/Los_Angeles
- date     2023-08-30
- pandoc   3.1.1 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/ (via rmarkdown)
-
-─ Packages ───────────────────────────────────────────────────────────────────
- ! package          * version    date (UTC) lib source
- P annotate           1.78.0     2023-05-08 [?] Bioconductor
- P AnnotationDbi      1.62.2     2023-07-02 [?] Bioconductor
- P askpass            1.1        2019-01-13 [?] CRAN (R 4.3.0)
- P babelgene          22.9       2022-09-29 [?] CRAN (R 4.3.0)
- P backports          1.4.1      2021-12-13 [?] CRAN (R 4.3.0)
- P Biobase          * 2.60.0     2023-05-08 [?] Bioconductor
- P BiocGenerics     * 0.46.0     2023-06-04 [?] Bioconductor
- P BiocIO             1.10.0     2023-05-08 [?] Bioconductor
- P BiocManager        1.30.22    2023-08-08 [?] CRAN (R 4.3.0)
- P BiocParallel       1.34.2     2023-05-28 [?] Bioconductor
- P BiocSet            1.14.0     2023-05-08 [?] Bioconductor
- P Biostrings         2.68.1     2023-05-21 [?] Bioconductor
- P bit                4.0.5      2022-11-15 [?] CRAN (R 4.3.0)
- P bit64              4.0.5      2020-08-30 [?] CRAN (R 4.3.0)
- P bitops             1.0-7      2021-04-24 [?] CRAN (R 4.3.0)
- P blob               1.2.4      2023-03-17 [?] CRAN (R 4.3.0)
- P cachem             1.0.8      2023-05-01 [?] CRAN (R 4.3.0)
- P checkmate          2.2.0      2023-04-27 [?] CRAN (R 4.3.0)
- P circlize           0.4.15     2022-05-10 [?] CRAN (R 4.3.0)
- P cli                3.6.1      2023-03-23 [?] CRAN (R 4.3.0)
- P clue               0.3-64     2023-01-31 [?] CRAN (R 4.3.0)
- P cluster            2.1.4      2022-08-22 [?] CRAN (R 4.3.0)
- P codetools          0.2-19     2023-02-01 [?] CRAN (R 4.3.1)
- P colorspace         2.1-0      2023-01-23 [?] CRAN (R 4.3.0)
- P ComplexHeatmap     2.16.0     2023-05-08 [?] Bioconductor
- P crayon             1.5.2      2022-09-29 [?] CRAN (R 4.3.0)
- R credentials        1.3.2      <NA>       [?] <NA>
- P crosstalk        * 1.2.0      2021-11-04 [?] CRAN (R 4.3.0)
- P curl               5.0.2      2023-08-14 [?] CRAN (R 4.3.0)
- P data.table         1.14.8     2023-02-17 [?] CRAN (R 4.3.0)
- P DBI                1.1.3      2022-06-18 [?] CRAN (R 4.3.0)
- P digest             0.6.33     2023-07-07 [?] CRAN (R 4.3.0)
- P doParallel         1.0.17     2022-02-07 [?] CRAN (R 4.3.0)
- P dplyr            * 1.1.2      2023-04-20 [?] CRAN (R 4.3.0)
- P edgeR              3.42.4     2023-06-04 [?] Bioconductor
- P ellipsis           0.3.2      2021-04-29 [?] CRAN (R 4.3.0)
- P evaluate           0.21       2023-05-05 [?] CRAN (R 4.3.0)
- P fansi              1.0.4      2023-01-22 [?] CRAN (R 4.3.0)
- P fastmap            1.1.1      2023-02-24 [?] CRAN (R 4.3.0)
- P foreach            1.5.2      2022-02-02 [?] CRAN (R 4.3.0)
- P generics           0.1.3      2022-07-05 [?] CRAN (R 4.3.0)
- P GenomeInfoDb       1.36.1     2023-07-02 [?] Bioconductor
- P GenomeInfoDbData   1.2.10     2023-08-23 [?] Bioconductor
- P GetoptLong         1.0.5      2020-12-15 [?] CRAN (R 4.3.0)
- P ggplot2          * 3.4.3      2023-08-14 [?] CRAN (R 4.3.0)
- P GlobalOptions      0.1.2      2020-06-10 [?] CRAN (R 4.3.0)
- P glue               1.6.2      2022-02-24 [?] CRAN (R 4.3.0)
- P graph              1.78.0     2023-05-08 [?] Bioconductor
- P gridExtra          2.3        2017-09-09 [?] CRAN (R 4.3.0)
- P GSEABase           1.62.0     2023-05-08 [?] Bioconductor
- P gtable             0.3.4      2023-08-21 [?] CRAN (R 4.3.1)
- P htmltools        * 0.5.6      2023-08-10 [?] CRAN (R 4.3.0)
- P htmlwidgets      * 1.6.2      2023-03-17 [?] CRAN (R 4.3.0)
- P httpuv             1.6.11     2023-05-11 [?] CRAN (R 4.3.0)
- P httr               1.4.7      2023-08-15 [?] CRAN (R 4.3.0)
- P IRanges            2.34.1     2023-07-02 [?] Bioconductor
- P irlba              2.3.5.1    2022-10-03 [?] CRAN (R 4.3.0)
- P iterators          1.0.14     2022-02-05 [?] CRAN (R 4.3.0)
- P jsonlite           1.8.7      2023-06-29 [?] CRAN (R 4.3.0)
- P KEGGREST           1.40.0     2023-05-08 [?] Bioconductor
- P knitr              1.43       2023-05-25 [?] CRAN (R 4.3.0)
- P later              1.3.1      2023-05-02 [?] CRAN (R 4.3.0)
- P lattice            0.21-8     2023-04-05 [?] CRAN (R 4.3.1)
- P lazyeval           0.2.2      2019-03-15 [?] CRAN (R 4.3.0)
- P lifecycle          1.0.3      2022-10-07 [?] CRAN (R 4.3.0)
- P limma              3.56.2     2023-06-04 [?] Bioconductor
- P locfit             1.5-9.8    2023-06-11 [?] CRAN (R 4.3.0)
- P magrittr           2.0.3      2022-03-30 [?] CRAN (R 4.3.0)
- P Matrix             1.5-4.1    2023-05-18 [?] CRAN (R 4.3.1)
- P matrixStats        1.0.0      2023-06-02 [?] CRAN (R 4.3.0)
- P memoise            2.0.1      2021-11-26 [?] CRAN (R 4.3.0)
- P mime               0.12       2021-09-28 [?] CRAN (R 4.3.0)
- P munsell            0.5.0      2018-06-12 [?] CRAN (R 4.3.0)
- P ontologyIndex      2.11       2023-05-30 [?] CRAN (R 4.3.0)
- P openssl            2.1.0      2023-07-15 [?] CRAN (R 4.3.0)
- P pillar             1.9.0      2023-03-22 [?] CRAN (R 4.3.0)
- P pkgconfig          2.0.3      2019-09-22 [?] CRAN (R 4.3.0)
- P plotly           * 4.10.2     2023-06-03 [?] CRAN (R 4.3.0)
- P plyr               1.8.8      2022-11-11 [?] CRAN (R 4.3.0)
- P png                0.1-8      2022-11-29 [?] CRAN (R 4.3.0)
- P promises           1.2.1      2023-08-10 [?] CRAN (R 4.3.0)
- P purrr              1.0.2      2023-08-10 [?] CRAN (R 4.3.0)
- P R6                 2.5.1      2021-08-19 [?] CRAN (R 4.3.0)
- P RColorBrewer       1.1-3      2022-04-03 [?] CRAN (R 4.3.0)
- P Rcpp               1.0.11     2023-07-06 [?] CRAN (R 4.3.0)
- P RCurl              1.98-1.12  2023-03-27 [?] CRAN (R 4.3.0)
- P reactable        * 0.4.4.9000 2023-08-25 [?] Github (glin/reactable@86bd276)
- P reactR             0.4.4      2021-02-22 [?] CRAN (R 4.3.0)
-   renv               1.0.2      2023-08-15 [1] CRAN (R 4.3.0)
- P rjson              0.2.21     2022-01-09 [?] CRAN (R 4.3.0)
- P rlang              1.1.1      2023-04-28 [?] CRAN (R 4.3.0)
- P rmarkdown          2.24       2023-08-14 [?] CRAN (R 4.3.0)
- P RSQLite            2.3.1      2023-04-03 [?] CRAN (R 4.3.0)
- P rstudioapi         0.15.0     2023-07-07 [?] CRAN (R 4.3.0)
- P S4Vectors          0.38.1     2023-05-08 [?] Bioconductor
- P scales             1.2.1      2022-08-20 [?] CRAN (R 4.3.0)
- P sessioninfo        1.2.2      2021-12-06 [?] CRAN (R 4.3.0)
- P shape              1.4.6      2021-05-19 [?] CRAN (R 4.3.0)
- P shiny              1.7.5      2023-08-12 [?] CRAN (R 4.3.0)
- P sparrow          * 1.6.0      2023-05-08 [?] Bioconductor
- P stringi            1.7.12     2023-01-11 [?] CRAN (R 4.3.0)
- P stringr          * 1.5.0      2022-12-02 [?] CRAN (R 4.3.0)
- P sys                3.4.2      2023-05-23 [?] CRAN (R 4.3.0)
- P tibble             3.2.1      2023-03-20 [?] CRAN (R 4.3.0)
- P tidyr              1.3.0      2023-01-24 [?] CRAN (R 4.3.0)
- P tidyselect         1.2.0      2022-10-10 [?] CRAN (R 4.3.0)
- P utf8               1.2.3      2023-01-31 [?] CRAN (R 4.3.0)
- P V8               * 4.3.3      2023-07-18 [?] CRAN (R 4.3.0)
- P vctrs              0.6.3      2023-06-14 [?] CRAN (R 4.3.0)
- P viridis            0.6.4      2023-07-22 [?] CRAN (R 4.3.0)
- P viridisLite        0.4.2      2023-05-02 [?] CRAN (R 4.3.0)
- P withr              2.5.0      2022-03-03 [?] CRAN (R 4.3.0)
- P xfun               0.40       2023-08-09 [?] CRAN (R 4.3.0)
- P XML                3.99-0.14  2023-03-19 [?] CRAN (R 4.3.0)
- P xtable             1.8-4      2019-04-21 [?] CRAN (R 4.3.0)
- P XVector            0.40.0     2023-05-08 [?] Bioconductor
- P yaml               2.3.7      2023-01-23 [?] CRAN (R 4.3.0)
- P zlibbioc           1.46.0     2023-05-08 [?] Bioconductor
-
- [1] /Users/sandmann/repositories/blog/renv/library/R-4.3/aarch64-apple-darwin20
- [2] /Users/sandmann/Library/Caches/org.R-project.R/R/renv/sandbox/R-4.3/aarch64-apple-darwin20/ac5c2659
-
- P ── Loaded and on-disk path mismatch.
- R ── Package was removed from disk.
-
-──────────────────────────────────────────────────────────────────────────────
-
-
-
- - - - -
-
-
- -

Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 International License.

]]>
- TIL - R - visualization - https://tomsing1.github.io/blog/posts/interactive-gene-set-results/index.html - Tue, 27 Dec 2022 08:00:00 GMT -
diff --git a/docs/listings.json b/docs/listings.json index 8a00fbe..5538483 100644 --- a/docs/listings.json +++ b/docs/listings.json @@ -2,6 +2,7 @@ { "listing": "/index.html", "items": [ + "/posts/dbgap/index.html", "/posts/ParquetMatrix/index.html", "/posts/parquetArray/index.html", "/posts/parquet/index.html", diff --git a/docs/posts/dbgap/index.html b/docs/posts/dbgap/index.html new file mode 100644 index 0000000..b10e3cb --- /dev/null +++ b/docs/posts/dbgap/index.html @@ -0,0 +1,1071 @@ + + + + + + + + + + + +Thomas Sandmann’s blog - Retrieving access-controlled data from NCBI’s dbGAP repository + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+
+ +
+ +
+
+
+

Retrieving access-controlled data from NCBI’s dbGAP repository

+
+
NCBI
+
dbGAP
+
TIL
+
+
+
+ + +
+ +
+
Author
+
+

Thomas Sandmann

+
+
+ +
+
Published
+
+

September 17, 2023

+
+
+ + +
+ + +
+ + + + +
+ + + + +
+

tl;dr:

+

Today I learned about different ways to retrieve access-controlled short read data from NCBI’s dbGAP repository. dbGAP hosts both publicly available and Access Controlled data. The latter is usually used to disseminate data from individual human participants and requires a data access application.

+

After the data access request has been granted, it is time to retrieve the actual data from dbGAP and - in case of short read data - its sister repository, the Sequence Read Archive (SRA).

+
+
+

Authenticating with JWT or NGC files

+

The path to authenticating and downloading dbGAP data differs depending on whether you are using the AWS or GCP cloud providers, or a local compute infrastructure (or another, unsupported cloud provider) instead.

+
+

Authentication within AWS or GCP cloud environments

+

On these two platforms, you have two paths to access the data:

+
    +
  1. With a JWT file: A JWT1 file, introduced with sra-tools version 2.10, allows you to transfer data from dbGAP’s cloud buckets into your own cloud instance. (Because your system and dbGAP’s share the same cloud environment, this is faster than a regular transfer, e.g. via https or ftp.)2
  2. Via fusera: Alternatively, you can mount dbGAP’s buckets as read-only volumes on your cloud instances via fusera3
+
+ +
+
+

The nf-core/fetchngs workflow supports the retrieval of dbGAP data via JWT file authentication, e.g. when it is executed on AWS or GCP compute instances (see above). Like all nf-core workflows, it is easily parallelized, e.g. across an HPC or via an AWS Batch queue. (Highly recommended when you need to retrieve large amounts of data.)

+
+
+
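Retrieval with nf-core/fetchngs starts from a plain-text sheet of run accessions. As a minimal sketch (the SRR identifiers below are placeholders, not accessions from an actual dbGAP project; the launch command assumes Nextflow and a container engine are installed - see the pipeline documentation for the dbGAP/JWT-specific parameters):

```shell
# Stage a plain-text input sheet of SRA run accessions, one per line.
# The SRR identifiers below are placeholders.
cat > ids.csv <<'EOF'
SRR000001
SRR000002
EOF

# Launching the pipeline then looks roughly like this (requires Nextflow
# and a container engine; consult the nf-core/fetchngs docs for the
# dbGAP / JWT configuration details):
#   nextflow run nf-core/fetchngs --input ids.csv --outdir results -profile docker
```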
+
+
+

Authenticating outside AWS / GCP

+

On all other compute platforms, including your laptop or your local high-performance cluster (HPC), you need to authenticate with an NGC file (containing your repository key) instead45.

+

In this blog post, I will outline how to use NGC authentication, but make sure to read dbGAP’s official documentation as well.

+
+
+

Retrieving dbGAP data with NGC authentication

+

If you are not working on AWS or GCP, and need to rely on NGC authentication, the following steps might be useful.

+
+

1. Log into dbGAP

+ +
+
+

2. Install sra-tools from github

+

I usually download the latest binary of the sra-tools suite for my operating system from [github](https://github.com/ncbi/sra-tools/wiki/01.-Downloading-SRA-Toolkit). Alternatively, you can also install it using Bioconda

+
+
+
+ +
+
+Note +
+
+
+

Please note that the sra-tools package is frequently updated, so make sure you have the latest version.

+
+
+

For example, this code snippet retrieves and decompresses the latest version for Ubuntu Linux into the ~/bin directory:

+
mkdir -p ~/bin
+pushd ~/bin
+wget https://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/3.0.7/sratoolkit.3.0.7-ubuntu64.tar.gz
+tar xfz sratoolkit.3.0.7-ubuntu64.tar.gz
+rm sratoolkit.3.0.7-ubuntu64.tar.gz
+popd
+

Afterward, I add the toolkit’s bin directory to my PATH and verify that it’s the version I expected:

+
export PATH=~/bin/sratoolkit.3.0.7-ubuntu64/bin:$PATH
+prefetch --version  # verify that it's the version you downloaded
+
+
+
+ +
+
+Note +
+
+
+

The best source of information about using the various tools included in the sra-tools is the sra-tools wiki.

+
+
+
+
+

3. Configure the sra toolkit

+

Next, I configure the toolkit, especially the location of the cache directory. The prefetch command stores all SRA files it downloads in this location, so I make sure it is on a volume that is large enough to hold the expected amount of data.

+
vdb-config -i
+
    +
  • In the Cache section, I specify an existing directory as the public user-repository. This is where prefetch will download files to (and where they will be kept until the cache is cleared!)
  • +
+

My settings are stored in the ${HOME}/.ncbi/user-settings.mkfg file. For more information and other ways to configure the toolkit, please see its wiki page.

+
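The choices made in the interactive dialog end up in that settings file, so they can also be inspected (or scripted) without opening the dialog. A rough sketch, assuming the default settings path and the `repository/user/main/public/root` key name used by recent sra-tools releases:

```shell
# Print the configured public user-repository root (the prefetch cache
# directory), without opening the interactive vdb-config dialog.
sra_cache_root() {
  local settings="${1:-${HOME}/.ncbi/user-settings.mkfg}"
  if [ -f "${settings}" ]; then
    # this key holds the directory where prefetch stores .sra files
    grep 'repository/user/main/public/root' "${settings}" || echo "cache root not set"
  else
    echo "no settings file"
  fi
}

sra_cache_root
```

If there is no settings file yet, run `vdb-config -i` first (recent releases also offer non-interactive edits via `vdb-config --set`).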
+
+

4. Log into dbGAP to retrieve the repository key

+
    +
  • Back on the dbGAP website, navigate to the “My Projects” tab
  • +
  • Choose “get dbGaP repository key” in the “Actions” column.
  • +
  • Download the repository key file with the .ngc extension to your system.
  • +
+
+
+

5. Choose the files to download from SRA

+
    +
  • In your dbGAP account, next navigate to the “My requests” tab.
  • +
  • Click on “Request files” on the right side of the table.
  • +
  • Navigate to the SRA data (reads and reference alignments) tab.
  • +
  • Click on SRA Run Selector
  • +
  • Select all of the files you would like to download in the table at the bottom of the page.
  • +
  • Toggle the Selected radio button.
  • +
+
+
+

6. Download the .krt file that specifies which files to retrieve

+
    +
  • Download the .krt file by clicking on the green Cart file button.
  • +
+
+
+

7. Initiate the download of the files in SRA format

+
    +
  • Now, with both the .ngc and .krt files in hand, we can trigger the download with the sra-toolkit’s prefetch command. We need to provide paths to both
      +
    • the repository key (via --ngc) and
    • +
    • the cart file (via --cart)
    • +
  • +
+

For example, this code snippet assumes the two files are in my home directory. (The exact names of your .ngc and .krt files will be different.)

+
mkdir -p ~/dbgap
+pushd ~/dbgap
+prefetch \
+  --max-size u \
+  --ngc ~/prj_123456.ngc \
+  --cart ~/cart_DAR12345_2023081212345.krt
+popd
+

Note: The files are downloaded (and cached) in SRA format into the directory I specified when configuring the sra-toolkit (i.e. the public user-repository). Extracting reads and generating FASTQ files is a separate step.

+
+
+

8. Decrypt SRA files and extract reads in FASTQ format

+

🚨 The final fastq files will be approximately 7 times the size of the accession. The fasterq-dump tool needs temporary (scratch) space of about 1.5 times the size of the final fastq files during the conversion. Overall, the space you need during the conversion is approximately 17 times the size of the accession.
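To turn those multipliers into concrete numbers, a quick back-of-the-envelope helper (the multipliers are the ones quoted above; the 10 GB accession size is just an example):

```shell
# Back-of-the-envelope disk estimate for fasterq-dump, using the
# multipliers above: final FASTQ ~ 7x the accession size, plus
# ~1.5x the FASTQ size as temporary scratch space during conversion.
estimate_space_gb() {
  local acc_gb="$1"
  local fastq_gb=$(( acc_gb * 7 ))
  local scratch_gb=$(( fastq_gb * 3 / 2 ))   # 1.5x the FASTQ size
  echo $(( fastq_gb + scratch_gb ))          # ~17x the accession size
}

estimate_space_gb 10   # a (hypothetical) 10 GB accession needs ~175 GB
```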

+

🚨 The extraction and recompression steps are very CPU intensive, and it is recommended to use multiple cores. (The code below uses all available cores, as determined via the nproc --all command.)

+

The fasterq-dump tool extracts the reads into FASTQ files. It only accepts a single accession at a time, and expects to find the corresponding SRA file in the cache directory. Like the prefetch command above, it requires the .ngc file to verify that I am permitted to decrypt the data.

+

To save disk space I only extract a single SRA file at a time and then compress the FASTQ files with pigz. Afterward I copy the compressed FASTQ files to an AWS S3 bucket and delete the local files before processing the next accession.

+
#!/usr/bin/env bash
+set -e
+set -o nounset
+
+declare -r CACHEDIR="${HOME}/cache/sra"  # the cache directory with .sra files
+declare -r BUCKET="s3://my-s3-bucket-for-storing-dbGAP-data"
+
+for SRA_FILE in ${CACHEDIR}/*.sra
+do
+  fasterq-dump -p \
+    --threads $(nproc --all) \
+    --ngc ~/prj_123456.ngc \
+    $(basename $SRA_FILE .sra)
+  pigz \
+    --processes $(nproc --all) \
+    *.fastq
+  aws s3 sync \
+    --exclude "*" \
+    --include "*.fastq.gz" \
+    . \
+    ${BUCKET}
+  rm *.fastq.gz
+done
+ + + + +
+
+
+ + +

Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 International License.

Footnotes

+ +
    +
  1. JSON Web Token↩︎
  2. dbGAP’s official JWT instructions are here.↩︎
  3. dbGAP’s official fusera instructions are here.↩︎
  4. dbGAP’s official NGC instructions are here.↩︎
  5. NGC file authentication also works on cloud instances, e.g. an AWS EC2 instance, but it is slower as it doesn’t take advantage of the fact that your instance and dbGAP’s data bucket are co-located.↩︎
+
+ + +
+ + + + \ No newline at end of file diff --git a/docs/search.json b/docs/search.json index 26fdef2..8a20d05 100644 --- a/docs/search.json +++ b/docs/search.json @@ -424,7 +424,7 @@ "href": "index.html", "title": "Welcome", "section": "", - "text": "Date\n\n\nTitle\n\n\nAuthor\n\n\nReading Time\n\n\n\n\n\n\nSep 13, 2023\n\n\nAdventures with parquet III: single-cell RNA-seq data and comparison with HDF5-backed arrays\n\n\nThomas Sandmann\n\n\n11 min\n\n\n\n\nSep 5, 2023\n\n\nAdventures with parquet II: Implementing the parquetArraySeed S4 class\n\n\nThomas Sandmann\n\n\n11 min\n\n\n\n\nAug 31, 2023\n\n\nAdventures with parquet: Storing & querying gene expression data\n\n\nThomas Sandmann\n\n\n7 min\n\n\n\n\nAug 30, 2023\n\n\nOrganizing sequencing metadata: experimenting with S7\n\n\nThomas Sandmann\n\n\n10 min\n\n\n\n\nAug 28, 2023\n\n\ntourrr: Exploring multi-dimensional data\n\n\nThomas Sandmann\n\n\n6 min\n\n\n\n\nJul 31, 2023\n\n\nCustomizing my Quarto website\n\n\nThomas Sandmann\n\n\n1 min\n\n\n\n\nJul 24, 2023\n\n\nGuess the correlation - a first streamlit app\n\n\nThomas Sandmann\n\n\n3 min\n\n\n\n\nJul 21, 2023\n\n\nGrav - a lightweight content management system\n\n\nThomas Sandmann\n\n\n1 min\n\n\n\n\nJul 21, 2023\n\n\nGreg Wilson: Late Night Thoughts on Listening to Ike Quebec (2018)\n\n\nThomas Sandmann\n\n\n3 min\n\n\n\n\nJun 26, 2023\n\n\nDocumenting data wrangling with the dtrackr R package\n\n\nThomas Sandmann\n\n\n3 min\n\n\n\n\nMay 6, 2023\n\n\nQuerying parquet files with duckdb\n\n\nThomas Sandmann\n\n\n6 min\n\n\n\n\nMar 12, 2023\n\n\nLemur: analyzing multi-condition single-cell data\n\n\nThomas Sandmann\n\n\n14 min\n\n\n\n\n\nFeb 25, 2023\n\n\nSimultaneously inserting records into two tables with Postgres CTEs\n\n\nThomas Sandmann\n\n\n4 min\n\n\n\n\nJan 21, 2023\n\n\nDistributing R packages with a drat repository hosted on AWS S3\n\n\nThomas Sandmann\n\n\n12 min\n\n\n\n\nJan 16, 2023\n\n\nQuantSeq RNAseq analysis (1): configuring the 
nf-core/rnaseq workflow\n\n\nThomas Sandmann\n\n\n13 min\n\n\n\n\nJan 16, 2023\n\n\nQuantSeq RNAseq analysis (2): Exploring nf-core/rnaseq output\n\n\nThomas Sandmann\n\n\n4 min\n\n\n\n\nJan 16, 2023\n\n\nQuantSeq RNAseq analysis (3): Validating published results (no UMIs)\n\n\nThomas Sandmann\n\n\n14 min\n\n\n\n\nJan 16, 2023\n\n\nQuantSeq RNAseq analysis (4): Validating published results (with UMIs)\n\n\nThomas Sandmann\n\n\n13 min\n\n\n\n\nJan 2, 2023\n\n\nSQL and noSQL approaches to creating & querying databases (using R)\n\n\nThomas Sandmann\n\n\n13 min\n\n\n\n\n\nDec 27, 2022\n\n\nInteractive GSEA results: visualizations with reactable & plotly\n\n\nThomas Sandmann\n\n\n27 min\n\n\n\n\nDec 24, 2022\n\n\nUpSet plots: comparing differential expression across contrasts\n\n\nThomas Sandmann\n\n\n8 min\n\n\n\n\nDec 22, 2022\n\n\nFigure size, layout & tabsets with Quarto\n\n\nThomas Sandmann\n\n\n2 min\n\n\n\n\nDec 12, 2022\n\n\nFull text search in Postgres - the R way\n\n\nThomas Sandmann\n\n\n11 min\n\n\n\n\nDec 11, 2022\n\n\nUpdating R the easy way: using rig command line tool\n\n\nThomas Sandmann\n\n\n2 min\n\n\n\n\nDec 10, 2022\n\n\n2022 normconf: lightning talks\n\n\nThomas Sandmann\n\n\n1 min\n\n\n\n\nDec 8, 2022\n\n\nThe rlist R package\n\n\nThomas Sandmann\n\n\n1 min\n\n\n\n\nNov 17, 2022\n\n\nCreating custom badges for your README\n\n\nThomas Sandmann\n\n\n1 min\n\n\n\n\nNov 15, 2022\n\n\nLearning nextflow: blasting multiple sequences\n\n\nThomas Sandmann\n\n\n8 min\n\n\n\n\nNov 14, 2022\n\n\nPython type hints\n\n\nThomas Sandmann\n\n\n1 min\n\n\n\n\nNov 13, 2022\n\n\nFujita et al: Cell-subtype specific effects of genetic variation in the aging and Alzheimer cortex\n\n\nThomas Sandmann\n\n\n3 min\n\n\n\n\nNov 13, 2022\n\n\nRefreshing & exporting temporary AWS credentials\n\n\nThomas Sandmann\n\n\n3 min\n\n\n\n\nNov 12, 2022\n\n\nInstalling pyroe with conda\n\n\nThomas Sandmann\n\n\n1 min\n\n\n\n\nNov 12, 2022\n\n\nWelcome To My Blog\n\n\nThomas 
Sandmann\n\n\n1 min\n\n\n\n\n\n\nNo matching items\n\n\n \n\nThis work is licensed under a Creative Commons Attribution 4.0 International License." + "text": "Date\n\n\nTitle\n\n\nAuthor\n\n\nReading Time\n\n\n\n\n\n\nSep 17, 2023\n\n\nRetrieving access-controlled data from NCBI’s dbGAP repository\n\n\nThomas Sandmann\n\n\n6 min\n\n\n\n\nSep 13, 2023\n\n\nAdventures with parquet III: single-cell RNA-seq data and comparison with HDF5-backed arrays\n\n\nThomas Sandmann\n\n\n11 min\n\n\n\n\nSep 5, 2023\n\n\nAdventures with parquet II: Implementing the parquetArraySeed S4 class\n\n\nThomas Sandmann\n\n\n11 min\n\n\n\n\nAug 31, 2023\n\n\nAdventures with parquet: Storing & querying gene expression data\n\n\nThomas Sandmann\n\n\n7 min\n\n\n\n\nAug 30, 2023\n\n\nOrganizing sequencing metadata: experimenting with S7\n\n\nThomas Sandmann\n\n\n10 min\n\n\n\n\nAug 28, 2023\n\n\ntourrr: Exploring multi-dimensional data\n\n\nThomas Sandmann\n\n\n6 min\n\n\n\n\nJul 31, 2023\n\n\nCustomizing my Quarto website\n\n\nThomas Sandmann\n\n\n1 min\n\n\n\n\nJul 24, 2023\n\n\nGuess the correlation - a first streamlit app\n\n\nThomas Sandmann\n\n\n3 min\n\n\n\n\nJul 21, 2023\n\n\nGrav - a lightweight content management system\n\n\nThomas Sandmann\n\n\n1 min\n\n\n\n\nJul 21, 2023\n\n\nGreg Wilson: Late Night Thoughts on Listening to Ike Quebec (2018)\n\n\nThomas Sandmann\n\n\n3 min\n\n\n\n\nJun 26, 2023\n\n\nDocumenting data wrangling with the dtrackr R package\n\n\nThomas Sandmann\n\n\n3 min\n\n\n\n\nMay 6, 2023\n\n\nQuerying parquet files with duckdb\n\n\nThomas Sandmann\n\n\n6 min\n\n\n\n\nMar 12, 2023\n\n\nLemur: analyzing multi-condition single-cell data\n\n\nThomas Sandmann\n\n\n14 min\n\n\n\n\n\nFeb 25, 2023\n\n\nSimultaneously inserting records into two tables with Postgres CTEs\n\n\nThomas Sandmann\n\n\n4 min\n\n\n\n\nJan 21, 2023\n\n\nDistributing R packages with a drat repository hosted on AWS S3\n\n\nThomas Sandmann\n\n\n12 min\n\n\n\n\nJan 16, 2023\n\n\nQuantSeq RNAseq analysis 
(1): configuring the nf-core/rnaseq workflow\n\n\nThomas Sandmann\n\n\n13 min\n\n\n\n\nJan 16, 2023\n\n\nQuantSeq RNAseq analysis (2): Exploring nf-core/rnaseq output\n\n\nThomas Sandmann\n\n\n4 min\n\n\n\n\nJan 16, 2023\n\n\nQuantSeq RNAseq analysis (3): Validating published results (no UMIs)\n\n\nThomas Sandmann\n\n\n14 min\n\n\n\n\nJan 16, 2023\n\n\nQuantSeq RNAseq analysis (4): Validating published results (with UMIs)\n\n\nThomas Sandmann\n\n\n13 min\n\n\n\n\nJan 2, 2023\n\n\nSQL and noSQL approaches to creating & querying databases (using R)\n\n\nThomas Sandmann\n\n\n13 min\n\n\n\n\n\nDec 27, 2022\n\n\nInteractive GSEA results: visualizations with reactable & plotly\n\n\nThomas Sandmann\n\n\n27 min\n\n\n\n\nDec 24, 2022\n\n\nUpSet plots: comparing differential expression across contrasts\n\n\nThomas Sandmann\n\n\n8 min\n\n\n\n\nDec 22, 2022\n\n\nFigure size, layout & tabsets with Quarto\n\n\nThomas Sandmann\n\n\n2 min\n\n\n\n\nDec 12, 2022\n\n\nFull text search in Postgres - the R way\n\n\nThomas Sandmann\n\n\n11 min\n\n\n\n\nDec 11, 2022\n\n\nUpdating R the easy way: using rig command line tool\n\n\nThomas Sandmann\n\n\n2 min\n\n\n\n\nDec 10, 2022\n\n\n2022 normconf: lightning talks\n\n\nThomas Sandmann\n\n\n1 min\n\n\n\n\nDec 8, 2022\n\n\nThe rlist R package\n\n\nThomas Sandmann\n\n\n1 min\n\n\n\n\nNov 17, 2022\n\n\nCreating custom badges for your README\n\n\nThomas Sandmann\n\n\n1 min\n\n\n\n\nNov 15, 2022\n\n\nLearning nextflow: blasting multiple sequences\n\n\nThomas Sandmann\n\n\n8 min\n\n\n\n\nNov 14, 2022\n\n\nPython type hints\n\n\nThomas Sandmann\n\n\n1 min\n\n\n\n\nNov 13, 2022\n\n\nFujita et al: Cell-subtype specific effects of genetic variation in the aging and Alzheimer cortex\n\n\nThomas Sandmann\n\n\n3 min\n\n\n\n\nNov 13, 2022\n\n\nRefreshing & exporting temporary AWS credentials\n\n\nThomas Sandmann\n\n\n3 min\n\n\n\n\nNov 12, 2022\n\n\nInstalling pyroe with conda\n\n\nThomas Sandmann\n\n\n1 min\n\n\n\n\nNov 12, 2022\n\n\nWelcome To My 
Blog\n\n\nThomas Sandmann\n\n\n1 min\n\n\n\n\n\n\nNo matching items\n\n\n \n\nThis work is licensed under a Creative Commons Attribution 4.0 International License."
  },
  {
    "objectID": "posts/fujita_2022/index.html",
@@ -957,5 +957,33 @@
     "title": "Adventures with parquet III: single-cell RNA-seq data and comparison with HDF5-backed arrays",
     "section": "Footnotes",
     "text": "Footnotes\n\n\nBy default, the location of the cache is set via ExperimentHub::getExperimentHubOption(\"CACHE\"). For example, on my system the data is located at /Users/sandmann/Library/Caches/org.R-project.R/R/ExperimentHub.↩︎\nThe performance will depend on the network connection. With my home internet connection, I was able to read the full dataset from the parquet file into memory or extract counts for 50 random genes x cells in 1.8 seconds on average.↩︎"
+  },
+  {
+    "objectID": "posts/dbgap/index.html",
+    "href": "posts/dbgap/index.html",
+    "title": "Retrieving access-controlled data from NCBI’s dbGAP repository",
+    "section": "",
+    "text": "Today I learned about different ways to retrieve access-controlled short read data from NCBI’s dbGAP repository. dbGAP hosts both publicly available and Access Controlled data. The latter is usually used to disseminate data from individual human participants and requires a data access application.\nAfter the data access request has been granted, it is time to retrieve the actual data from dbGAP and - in the case of short read data - its sister repository the Sequence Read Archive.\nThis work is licensed under a Creative Commons Attribution 4.0 International License."
+  },
+  {
+    "objectID": "posts/dbgap/index.html#tldr",
+    "href": "posts/dbgap/index.html#tldr",
+    "title": "Retrieving access-controlled data from NCBI’s dbGAP repository",
+    "section": "",
+    "text": "Today I learned about different ways to retrieve access-controlled short read data from NCBI’s dbGAP repository. dbGAP hosts both publicly available and Access Controlled data. 
The latter is usually used to disseminate data from individual human participants and requires a data access application.\nAfter the data access request has been granted, it is time to retrieve the actual data from dbGAP and - in the case of short read data - its sister repository the Sequence Read Archive."
+  },
+  {
+    "objectID": "posts/dbgap/index.html#authenticating-with-jwt-or-ngc-files",
+    "href": "posts/dbgap/index.html#authenticating-with-jwt-or-ngc-files",
+    "title": "Retrieving access-controlled data from NCBI’s dbGAP repository",
+    "section": "Authenticating with JWT or NGC files",
+    "text": "Authenticating with JWT or NGC files\nThe path to authenticating and downloading dbGAP data differs depending on whether you are using the AWS or GCP cloud providers, or a local compute infrastructure (or another, unsupported cloud provider) instead.\n\nAuthentication within AWS or GCP cloud environments\nOn these two platforms, you have two paths to access the data:\n\nWith a JWT file: A JWT1 file, introduced with sra-tools version 2.10, allows users to transfer data from dbGAP’s cloud buckets into your own cloud instance. (Because both your and dbGAP’s systems share the same cloud environment, this is faster than a regular transfer, e.g. via https or ftp.) 2\n\n\nVia fusera: Alternatively, you can mount dbGAP’s buckets as read-only volumes on your cloud instances via fusera.3\n\n\n\n\n\n\n\nThe nf-core/fetchngs workflow\n\n\n\n\n\nThe nf-core/fetchngs workflow supports the retrieval of dbGAP data via JWT file authentication, e.g. when it is executed on AWS or GCP compute instances (see above). Like all nf-core workflows, it is easily parallelized, e.g. across an HPC or via an AWS Batch queue. 
(Highly recommended when you need to retrieve large amounts of data.)\n\n\n\n\n\nAuthenticating outside AWS / GCP\nOn all other compute platforms, including your laptop or your local high-performance cluster (HPC), you need to authenticate with an NGC file (containing your repository key) instead45.\nIn this blog post, I will outline how to use NGC authentication, but make sure to read dbGAP’s official documentation as well.\n\n\nRetrieving dbGAP data with NGC authentication\nIf you are not working on AWS or GCP, and need to rely on NGC authentication, the following steps might be useful.\n\n1. Log into dbGAP\n\nNavigate to the dbGAP login page for controlled access and log in with your eRA credentials.\n\n\n\n2. Install sra-tools from github\nI usually download the latest binary of the sra-tools suite for my operating system from github (https://github.com/ncbi/sra-tools/wiki/01.-Downloading-SRA-Toolkit). Alternatively, you can also install it using Bioconda.\n\n\n\n\n\n\nNote\n\n\n\nPlease note that the sra-tools package is frequently updated, so make sure you have the latest version.\n\n\nFor example, this code snippet retrieves and decompresses version 3.0.7 for Ubuntu Linux into the ~/bin directory:\nmkdir -p ~/bin\npushd ~/bin\nwget https://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/3.0.7/sratoolkit.3.0.7-ubuntu64.tar.gz\ntar xfz sratoolkit.3.0.7-ubuntu64.tar.gz\nrm sratoolkit.3.0.7-ubuntu64.tar.gz\npopd\nAfterward, I add the sratoolkit bin directory to my PATH and verify that it’s the version I expected:\nexport PATH=~/bin/sratoolkit.3.0.7-ubuntu64/bin:$PATH\nprefetch --version # verify that it's the version you downloaded\n\n\n\n\n\n\nNote\n\n\n\nThe best source of information about using the various tools included in the sra-tools suite is the sra-tools wiki.\n\n\n\n\n3. Configure the sra toolkit\nNext, I configure the toolkit, especially the location of the cache directory. 
The prefetch command stores all SRA files it downloads in this location, so I make sure it is on a volume that is large enough to hold the expected amount of data.\n./vdb-config -i\n\nIn the Cache section, I specify an existing directory as the public user-repository. This is where prefetch will download files to (and they will be kept until the cache is cleared!)\n\nMy settings are stored in the ${HOME}/.ncbi/user-settings.mkfg file. For more information and other ways to configure the toolkit, please see its wiki page.\n\n\n4. Log into dbGAP to retrieve the repository key\n\nBack on the dbGAP website, navigate to the “My Projects” tab.\nChoose “get dbGaP repository key” in the “Actions” column.\nDownload the repository key file with the .ngc extension to your system.\n\n\n\n5. Choose the files to download from SRA\n\nNext, in your dbGAP account, navigate to the “My requests” tab.\nClick on “Request files” on the right side of the table.\nNavigate to the SRA data (reads and reference alignments) tab.\nClick on SRA Run Selector.\nSelect all of the files you would like to download in the table at the bottom of the page.\nToggle the Selected radio button.\n\n\n\n6. Download the .krt file that specifies which files to retrieve\n\nDownload the .krt file by clicking on the green Cart file button.\n\n\n\n7. Initiate the download of the files in SRA format\n\nNow, with both the .ngc and .krt files in hand, we can trigger the download with the sra-tools’ prefetch command. We need to provide paths to both\n\nthe repository key (via --ngc) and\nthe cart file (via --cart)\n\n\nFor example, this code snippet assumes the two files are in my home directory. 
(The exact names of your .ngc and .krt files will be different.)\nmkdir -p ~/dbgap\npushd ~/dbgap\nprefetch \\\n --max-size u \\\n --ngc ~/prj_123456.ngc \\\n --cart ~/cart_DAR12345_2023081212345.krt\npopd\nNote: The files are downloaded (and cached) in SRA format into the directory I specified when configuring the sra-toolkit (e.g. the public user-repository). Extracting reads and generating FASTQ files is a separate step.\n\n\n8. Decrypt SRA files and extract reads in FASTQ format\n🚨 The final FASTQ files will be approximately 7 times the size of the accession. The fasterq-dump tool needs temporary space (scratch space) of about 1.5 times the amount of the final FASTQ files during the conversion. Overall, the space you need during the conversion is approximately 17 times the size of the accession.\n🚨 The extraction and recompression steps are very CPU intensive, and it is recommended to use multiple cores. (The code below uses all available cores, as determined via the nproc --all command.)\nThe fasterq-dump tool extracts the reads into FASTQ files. It only accepts a single accession at a time, and expects to find the corresponding SRA file in the cache directory. Like the prefetch command above, it requires the .ngc file to verify that I am permitted to decrypt the data.\nTo save disk space, I only extract a single SRA file at a time and then compress the FASTQ files with pigz. Afterward, I copy the compressed FASTQ files to an AWS S3 bucket and delete the local files before processing the next accession.\n#!/usr/bin/env bash\nset -e\nset -o nounset\n\ndeclare -r CACHEDIR=\"${HOME}/cache/sra\" # the cache directory with .sra files\ndeclare -r BUCKET=\"s3://my-s3-bucket-for-storing-dbGAP-data\"\n\nfor SRA_FILE in ${CACHEDIR}/*.sra\ndo\n fasterq-dump -p \\\n --threads $(nproc --all) \\\n --ngc ~/prj_123456.ngc \\\n $(basename $SRA_FILE .sra)\n pigz \\\n --processes $(nproc --all) \\\n *.fastq\n aws s3 sync \\\n --exclude \"*\" \\\n --include \"*.fastq.gz\" \\\n . 
\\\n ${BUCKET}\n rm *.fastq.gz\ndone" + }, + { + "objectID": "posts/dbgap/index.html#footnotes", + "href": "posts/dbgap/index.html#footnotes", + "title": "Retrieving access-controlled data from NCBI’s dbGAP repository", + "section": "Footnotes", + "text": "Footnotes\n\n\nJSON Web Token↩︎\ndbGAP’s official JWT instructions are here.↩︎\ndbGAP’s official fusera instructions are here.↩︎\ndbGAP’s official NGC instructions are here.↩︎\nNGC file authentication also works on cloud instances, e.g. an AWS EC2 instance, but it is slower as it doesn’t take advantage of the fact that your instance and dbGAP’s data bucket are co-located.↩︎" } ] \ No newline at end of file diff --git a/docs/sitemap.xml b/docs/sitemap.xml index 69273c5..6f0edda 100644 --- a/docs/sitemap.xml +++ b/docs/sitemap.xml @@ -70,7 +70,7 @@ https://tomsing1.github.io/blog/index.html - 2023-09-14T03:44:36.887Z + 2023-09-17T23:09:50.717Z https://tomsing1.github.io/blog/posts/fujita_2022/index.html @@ -140,4 +140,8 @@ https://tomsing1.github.io/blog/posts/ParquetMatrix/index.html 2023-09-14T03:44:36.189Z + + https://tomsing1.github.io/blog/posts/dbgap/index.html + 2023-09-17T23:09:50.020Z + diff --git a/posts/dbgap/index.qmd b/posts/dbgap/index.qmd new file mode 100644 index 0000000..ae652b9 --- /dev/null +++ b/posts/dbgap/index.qmd @@ -0,0 +1,255 @@ +--- +title: "Retrieving access-controlled data from NCBI's dbGAP repository" +author: "Thomas Sandmann" +date: "2023-09-17" +freeze: true +categories: [NCBI, dbGAP, TIL] +editor: + markdown: + wrap: 72 +format: + html: + toc: true + toc-depth: 4 + code-tools: + source: true + toggle: false + caption: none +editor_options: + chunk_output_type: console +--- + +## tl;dr: + +Today I learned about different ways to retrieve access-controlled short read +data from +[NCBI's dbGAP repository](https://www.ncbi.nlm.nih.gov/gap/). +dbGAP hosts both publicly available and _Access Controlled_ data. 
The latter is
+usually used to disseminate data from individual human participants and requires
+a data access application.
+
+After the data access request has been granted, it is time to retrieve the
+actual data from dbGAP and - in the case of short read data - its sister repository
+[the Sequence Read Archive](https://www.ncbi.nlm.nih.gov/sra).
+
+## Authenticating with JWT or NGC files
+
+The path to authenticating and downloading dbGAP data differs depending on
+whether you are using the AWS or GCP cloud providers, or a local compute
+infrastructure (or another, unsupported cloud provider) instead.
+
+### Authentication within AWS or GCP cloud environments
+
+On these two platforms, you have two paths to access the data:
+
+1. With a `JWT` file: A JWT[^1] file,
+[introduced with `sra-tools` version 2.10](https://github.com/ncbi/sra-tools/wiki/First-help-on-decryption-dbGaP-data#using-jwt-tokens),
+allows users to transfer data from dbGAP's
+cloud buckets into your own cloud instance. (Because both your and dbGAP's
+systems share the same cloud environment, this is faster than a regular
+transfer, e.g. via https or ftp.) [^2]
+
+[^1]: [JSON Web Token](https://jwt.io/)
+[^2]: [dbGAP's official JWT instructions are here.](https://www.ncbi.nlm.nih.gov/sra/docs/sra-dbGAP-cloud-download/)
+
+2. Via `fusera`: Alternatively, you can mount dbGAP's buckets as
+read-only volumes on your cloud instances via
+[fusera](https://github.com/mitre/fusera).[^3]
+
+[^3]: [dbGAP's official fusera instructions are here.](https://www.ncbi.nlm.nih.gov/sra/docs/dbgap-cloud-access/#access-by-fusera)
+
+::: {.callout-note collapse=true}
+
+### The nf-core/fetchngs workflow
+
+The
+[nf-core/fetchngs](https://nf-co.re/fetchngs)
+workflow supports the retrieval of dbGAP data via `JWT` file authentication,
+e.g. when it is executed on AWS or GCP compute instances (see above). Like all
+nf-core workflows, it is easily parallelized, e.g. across an HPC or via an AWS
+Batch queue. 
(Highly recommended when you need to retrieve large amounts of
+data.)
+
+:::
+
+### Authenticating outside AWS / GCP
+
+On all other compute platforms, including your laptop or your local
+high-performance cluster (HPC), you need to authenticate with an `NGC` file
+(containing your _repository key_) instead[^4][^5].
+
+[^4]: [dbGAP's official NGC instructions are here.](https://www.ncbi.nlm.nih.gov/sra/docs/sra-dbgap-download/)
+
+[^5]: `NGC` file authentication also works on cloud instances, e.g. an AWS EC2
+instance, but it is slower as it doesn't take advantage of the fact that your
+instance and dbGAP's data bucket are co-located.
+
+In this blog post, I will outline how to use `NGC` authentication, but make
+sure to read
+[dbGAP's official documentation](https://www.ncbi.nlm.nih.gov/sra/docs/sra-dbgap-download/)
+as well.
+
+### Retrieving dbGAP data with NGC authentication
+
+If you are _not_ working on AWS or GCP, and need to rely on `NGC`
+authentication, the following steps might be useful.
+
+#### 1. Log into dbGAP
+
+- Navigate to
+  [the dbGAP login page for controlled access](https://dbgap.ncbi.nlm.nih.gov/aa/wga.cgi?page=login)
+  and log in with your eRA credentials.
+
+#### 2. Install sra-tools from github
+
+I usually download the latest binary of the `sra-tools` suite for my operating
+system from
+[github](https://github.com/ncbi/sra-tools/wiki/01.-Downloading-SRA-Toolkit).
+Alternatively, you can also install it using
+[Bioconda](https://anaconda.org/bioconda/sra-tools).
+
+::: {.callout-note}
+
+Please note that the `sra-tools` package is frequently updated, so make sure
+you have the latest version. 

+
+:::
+
+For example, this code snippet retrieves and decompresses version 3.0.7
+for Ubuntu Linux into the `~/bin` directory:
+
+```bash
+mkdir -p ~/bin
+pushd ~/bin
+wget https://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/3.0.7/sratoolkit.3.0.7-ubuntu64.tar.gz
+tar xfz sratoolkit.3.0.7-ubuntu64.tar.gz
+rm sratoolkit.3.0.7-ubuntu64.tar.gz
+popd
+```
+
+Afterward, I add the sratoolkit `bin` directory to my `PATH` and verify that
+it's the version I expected:
+
+```bash
+export PATH=~/bin/sratoolkit.3.0.7-ubuntu64/bin:$PATH
+prefetch --version # verify that it's the version you downloaded
+```
+
+::: {.callout-note}
+
+The best source of information about using the various tools included in the
+`sra-tools` suite is the
+[sra-tools wiki](https://github.com/ncbi/sra-tools/wiki).
+
+:::
+
+#### 3. Configure the sra toolkit
+
+Next, I configure the toolkit, especially the location of the `cache` directory.
+The `prefetch` command stores all `SRA` files it downloads in this location, so
+I make sure it is on a volume that is large enough to hold the expected amount
+of data.
+
+```bash
+./vdb-config -i
+```
+
+- In the `Cache` section, I specify an existing directory as the
+  `public user-repository`. This is where `prefetch` will download files
+  to (and they will be kept until the cache is cleared!)
+
+My settings are stored in the `${HOME}/.ncbi/user-settings.mkfg` file. For more
+information and other ways to configure the toolkit, please see
+[its wiki page](https://github.com/ncbi/sra-tools/wiki/08.-prefetch-and-fasterq-dump).
+
+#### 4. Log into dbGAP to retrieve the repository key
+
+- Back on the dbGAP website, navigate to the “My Projects” tab.
+- Choose “get dbGaP repository key” in the “Actions” column.
+- Download the _repository key_ file with the `.ngc` extension to your system.
+
+#### 5. Choose the files to download from SRA
+
+- Next, in your dbGAP account, navigate to the "My requests" tab.
+- Click on "Request files" on the right side of the table. 
+- Navigate to the `SRA data (reads and reference alignments)` tab.
+- Click on SRA Run Selector.
+- Select all of the files you would like to download in the table at the bottom
+  of the page.
+- Toggle the `Selected` radio button.
+
+#### 6. Download the `.krt` file that specifies which files to retrieve
+
+- Download the `.krt` file by clicking on the green `Cart file` button.
+
+#### 7. Initiate the download of the files in `SRA` format
+
+- Now, with both the `.ngc` and `.krt` files in hand, we can trigger the
+  download with the sra-tools' `prefetch` command. We need to provide
+  paths to _both_
+  - the repository key (via `--ngc`) and
+  - the cart file (via `--cart`)
+
+For example, this code snippet assumes the two files are in my home
+directory. (The exact names of your `.ngc` and `.krt` files will be different.)
+
+```bash
+mkdir -p ~/dbgap
+pushd ~/dbgap
+prefetch \
+  --max-size u \
+  --ngc ~/prj_123456.ngc \
+  --cart ~/cart_DAR12345_2023081212345.krt
+popd
+```
+
+Note: The files are downloaded (and cached) in SRA format into the directory I
+specified when configuring the sra-toolkit (e.g. the `public user-repository`).
+Extracting reads and generating FASTQ files is a separate step.
+
+#### 8. Decrypt SRA files and extract reads in FASTQ format
+
+🚨 The final FASTQ files will be approximately 7 times the size of the accession.
+The `fasterq-dump` tool needs temporary space (scratch space) of about 1.5 times
+the amount of the final FASTQ files during the conversion. Overall, the space you
+need during the conversion is approximately 17 times the size of the accession.
+
+🚨 The extraction and recompression steps are very CPU intensive, and it is
+recommended to use multiple cores. (The code below uses _all_ available cores,
+as determined via the `nproc --all` command.)
+
+The `fasterq-dump` tool extracts the reads into FASTQ files. It only accepts a
+single accession at a time, and expects to find the corresponding SRA file in
+the cache directory. 
Like the `prefetch` command above, it requires the `.ngc`
+file to verify that I am permitted to decrypt the data.
+
+To save disk space, I only extract a single SRA file at a time and then compress
+the FASTQ files with `pigz`. Afterward, I copy the compressed FASTQ files to
+an AWS S3 bucket and delete the local files before processing the next
+accession.
+
+```bash
+#!/usr/bin/env bash
+set -e
+set -o nounset
+
+# Use ${HOME} instead of "~": a tilde does not expand inside double quotes
+declare -r CACHEDIR="${HOME}/cache/sra" # the cache directory with .sra files
+declare -r BUCKET="s3://my-s3-bucket-for-storing-dbGAP-data"
+
+for SRA_FILE in ${CACHEDIR}/*.sra
+do
+  fasterq-dump -p \
+    --threads $(nproc --all) \
+    --ngc ~/prj_123456.ngc \
+    $(basename $SRA_FILE .sra)
+  pigz \
+    --processes $(nproc --all) \
+    *.fastq
+  aws s3 sync \
+    --exclude "*" \
+    --include "*.fastq.gz" \
+    . \
+    ${BUCKET}
+  rm *.fastq.gz
+done
+```
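The space warning above can be turned into a quick back-of-the-envelope check before running `fasterq-dump`. This is a minimal sketch using the stated rules of thumb (FASTQ output roughly 7x the accession, scratch space roughly 1.5x the FASTQ output); the 10 GB accession size is a hypothetical example, not a value from the post.

```shell
# Rough disk-space budget for fasterq-dump, based on the rules of thumb above.
sra_gb=10                              # hypothetical accession size, in GB
fastq_gb=$(( sra_gb * 7 ))             # final FASTQ output: ~7x the accession
scratch_gb=$(( fastq_gb * 3 / 2 ))     # temporary scratch space: ~1.5x the FASTQ
total_gb=$(( fastq_gb + scratch_gb ))  # peak usage during conversion: ~17x
echo "FASTQ: ${fastq_gb} GB, scratch: ${scratch_gb} GB, peak: ${total_gb} GB"
```

For a 10 GB accession this yields 70 GB of FASTQ output, 105 GB of scratch space, and a peak of 175 GB, i.e. roughly the 17x factor quoted above.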