Skip to content

Commit

Permalink
Added post about rasterizing volcano plots
Browse files Browse the repository at this point in the history
  • Loading branch information
tomsing1 committed Apr 4, 2024
1 parent ae4b593 commit e47b4db
Show file tree
Hide file tree
Showing 17 changed files with 2,930 additions and 352 deletions.
16 changes: 16 additions & 0 deletions _freeze/posts/volcano/index/execute-results/html.json
Original file line number Diff line number Diff line change
@@ -0,0 +1,16 @@
{
"hash": "8d5832a73f06d67f65eb82c8f140ef55",
"result": {
"markdown": "---\ntitle: \"Collaborator-friendly volcano plots with ggplot2\"\nauthor: \"Thomas Sandmann\"\ndate: \"2024-04-03\"\nfreeze: true\ncategories: [R, ggplot2, TIL]\neditor:\n markdown:\n wrap: 72\nformat:\n html:\n toc: true\n toc-depth: 4\n code-tools:\n source: true\n toggle: false\n caption: none\neditor_options: \n chunk_output_type: console\n---\n\n\n## tl;dr\n\n[volcano plots](https://en.wikipedia.org/wiki/Volcano_plot_(statistics))\nare a staple of genomics papers. The `ggrastr::geom_point_rast()` function\nenables collaborators to post-process plots in \n[inkscape](https://inkscape.org/)\nor\nAdobe illustrator - without overwhelming the application with tens of thousands\nof individual points.\n\n## Introduction\n\nAs a scientific collaborator, I often contribute visualizations of high\nthroughput experiments to publications. Try as I might, the final touches of a\ncomplex figure usually need to be added manually, e.g. using a vector graphics\napplication like inkscape or Adobe Illustrator.\n\nUnfortunately, a plot with tens of thousands of points exported from R as a PDF\nis hard for my collaborators to edit, as these applications struggle to render\nall of the individual points.\n\nLuckily, I can modify my plots to make their life easier.\n\n### Example data from Mattila et al, 2015\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(fs)\nlibrary(ggplot2)\nlibrary(ggrastr)\nlibrary(poorman)\nlibrary(readxl)\nggplot2::theme_set(theme_linedraw(base_size = 14))\n```\n:::\n\n\nLet's retrieve a table with the results from a differential gene expression\nanalysis by downloading an excel file published as supplementary table S2\nby\n[Mattila et al, 2015](https://doi.org/10.1016/j.celrep.2015.08.081)\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf <- local({\n kUrl <- paste0(\n \"https://drive.google.com/uc?export=download&\",\n \"id=1xWVyoSSrs4hoqf5zVRgGhZPRjNY_fx_7\"\n ) \n temp_file <- tempfile(fileext = \".xlsx\")\n download.file(kUrl, temp_file)\n readxl::read_excel(temp_file, sheet = \"mlx1 mutant LSD vs. HSD\", skip = 3)\n})\n```\n:::\n\n\nThe authors compared the transcriptomes of _Drosophila melanogaster_ (fruifly)\nlarvae carrying mutant form of the _Mlx1_ gene that were\nraised either on a high- or low-suger diet. The results of their analysis\n(with the\n[limma Bioconductor package](https://bioconductor.org/packages/release/bioc/html/limma.html)\n) revealed numerous significantly differentially expressed genes (adjusted \np-value < 0.05).\n\nLet's visualize their results with a volcano plot! [^1]\n\n[^1]: The limma package includes the `limma::volcanoplot()` function, which\ncreates a volcano plot using base R. Here, I am generating custom plots with\n`ggplot2` instead.\n\nTo highlight statistically significant differences in expression, let's add\nthe `direction` column, labeling up- and down-regulated genes that pass an\nadjusted p-value threshold of 0.05\n\n\n::: {.cell}\n\n```{.r .cell-code}\nkFdr <- 0.05\ndf$direction <- with(df, poorman::case_when(\n logFC > 0 & adj.P.Val < kFdr ~ \"up\",\n logFC < 0 & adj.P.Val < kFdr ~ \"down\",\n TRUE ~ \"n.s.\"\n))\ndf$direction = factor(df$direction, levels = c(\"down\", \"up\", \"n.s.\"))\ntable(df$direction)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n\ndown up n.s. \n1869 1181 7966 \n```\n:::\n:::\n\n\n### A first volcano plot\n\nLet's plot the full dataset, and highlight significantly up- and down-regulated\ngenes in red and blue, respectively.\n\n\n::: {.cell}\n\n```{.r .cell-code}\np1 <- ggplot() +\n geom_point(data = df,\n aes(x = logFC, y = -log10(P.Value), color = direction),\n alpha = 0.4) +\n scale_color_manual(values = c(\"up\" = \"#E41A1C\",\n \"down\" = \"#377EB8\", \n \"n.s.\" = \"lightgrey\"),\n guide = \"none\") +\n labs(\n x = \"Fold change (log2)\",\n y = \"-log10(p-value)\"\n ) +\n theme(panel.grid = element_blank())\nprint(p1)\n```\n\n::: {.cell-output-display}\n![](index_files/figure-html/unnamed-chunk-4-1.png){width=432}\n:::\n:::\n\n\nNot too bad! We might want to label a few genes (e.g. with the \n[ggrepel R package](https://ggrepel.slowkow.com/)\n), but this is a great start.\n\nUnfortunately, when I save this plot as a PDF file and share it with my \ncollaborators, they let me know that they have trouble editing the axis labels\nin Adobe Illustrator - there are simply too many individual points to render!\n\nLet's see what we can do to help.\n\n### Summarizing points as densities\n\nIn most datasets, including this one, only a small fraction of features is\nof interest. E.g. 7966 genes are _not_ differentially\nexpressed and are plotted in grey. (Because there are so many genes, they are\nplotted on top of each other.)\n\nInstead of plotting all of these points individually, we can summarize them\nas a 2D density instead:\n\n\n::: {.cell}\n\n```{.r .cell-code}\np2 <- ggplot() +\n stat_density2d(\n data = poorman::filter(df, adj.P.Val >= kFdr),\n aes(x = logFC, y = -log10(P.Value), fill = after_stat(density ^0.5)), \n geom = \"tile\", contour = FALSE, n = 200) + \n scale_fill_gradient2(low = \"white\", high = \"black\", guide = \"none\") +\n geom_point(data = poorman::filter(df, adj.P.Val < kFdr),\n aes(x = logFC, y = -log10(P.Value), \n color = direction),\n alpha = 0.4) +\n scale_color_manual(values = c(\"up\" = \"#E41A1C\",\n \"down\" = \"#377EB8\", \n \"n.s.\" = \"lightgrey\"),\n guide = \"none\") +\n labs(\n x = \"Fold change (log2)\",\n y = \"-log10(p-value)\"\n ) +\n theme(panel.grid = element_blank())\nprint(p2)\n```\n\n::: {.cell-output-display}\n![](index_files/figure-html/unnamed-chunk-5-1.png){width=432}\n:::\n:::\n\n\nThis reduces the number of points dramatically, as only the \n3050 significantly changed (e.g. red and blue)\npoints need to be plotted individually. \n\nCan we do even better?\n\n### Rastering points\n\nThe \n[ggrastr R package](https://cran.r-project.org/package=ggrastr)\nallows us to \n[raster](https://en.wikipedia.org/wiki/Rasterisation)\nour ggplot - either specific layers or the entire plot. Instead of plotting\nevery point (often on top of other points), a rasterized image records the\nintensity of its pixels.\n\nThe `ggrastr::geom_point_rast()` function is a drop-in replacement for\n`ggplot2::geom_point()`.\n\nThe results are indistinguishable from our first plot. But the underlying\nlayer is now rasterized: the points cannot be edited individually in a vector\ngraphics application, as they are now encoded as a single image.\n\n\n::: {.cell}\n\n```{.r .cell-code}\np3 <- ggplot() +\n ggrastr::geom_point_rast(data = df,\n aes(x = logFC, y = -log10(P.Value), color = direction),\n alpha = 0.4) +\n scale_color_manual(values = c(\"up\" = \"#E41A1C\",\n \"down\" = \"#377EB8\", \n \"n.s.\" = \"lightgrey\"),\n guide = \"none\") +\n labs(\n x = \"Fold change (log2)\",\n y = \"-log10(p-value)\"\n ) +\n theme(panel.grid = element_blank())\nprint(p3)\n```\n\n::: {.cell-output-display}\n![](index_files/figure-html/unnamed-chunk-6-1.png){width=432}\n:::\n:::\n\n\nIn addition to making our collaborators' lifes easier, the rasterized files\nare also slightly smaller than the original vectorized versions:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nvectorized <- file.path(tempdir(), \"vectorized.pdf\")\nggsave(plot = p1, filename = vectorized, width = 4.5, height = 4)\nrasterized <- file.path(tempdir(), \"rasterized.pdf\")\nggsave(plot = p3, filename = rasterized, width = 4.5, height = 4)\n\nfs::dir_info(tempdir()) |>\n poorman::filter(grepl(\"vectorized|rasterized\", path)) |>\n poorman::select(path, size) |>\n poorman::mutate(path = basename(path))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 2 × 2\n path size\n <chr> <fs::bytes>\n1 rasterized.pdf 339K\n2 vectorized.pdf 550K\n```\n:::\n:::\n\n\n## Reproducibility\n\n<details>\n<summary>\nSession Information\n</summary>\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsessioninfo::session_info(\"attached\")\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n─ Session info ───────────────────────────────────────────────────────────────\n setting value\n version R version 4.3.2 (2023-10-31)\n os Debian GNU/Linux 12 (bookworm)\n system x86_64, linux-gnu\n ui X11\n language (EN)\n collate en_US.UTF-8\n ctype en_US.UTF-8\n tz America/Los_Angeles\n date 2024-04-03\n pandoc 3.1.1 @ /usr/lib/rstudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)\n\n─ Packages ───────────────────────────────────────────────────────────────────\n ! package * version date (UTC) lib source\n P fs * 1.6.3 2023-07-20 [?] CRAN (R 4.3.1)\n P ggplot2 * 3.4.3 2023-08-14 [?] CRAN (R 4.3.1)\n P ggrastr * 1.0.2 2023-06-01 [?] CRAN (R 4.3.1)\n P poorman * 0.2.7 2023-10-30 [?] RSPM\n P readxl * 1.4.3 2023-07-06 [?] CRAN (R 4.3.1)\n\n [1] /home/sandmann/repositories/blog/renv/library/R-4.3/x86_64-pc-linux-gnu\n [2] /home/sandmann/.cache/R/renv/sandbox/R-4.3/x86_64-pc-linux-gnu/9a444a72\n\n P ── Loaded and on-disk path mismatch.\n\n──────────────────────────────────────────────────────────────────────────────\n```\n:::\n:::\n\n\n</details>\n ",
"supporting": [
"index_files"
],
"filters": [
"rmarkdown/pagebreak.lua"
],
"includes": {},
"engineDependencies": {},
"preserve": {},
"postProcess": true
}
}
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading

0 comments on commit e47b4db

Please sign in to comment.