Skip to content

Commit

Permalink
Merge pull request #150 from IARCbioinfo/v1.0b
Browse files Browse the repository at this point in the history
V1.0 merge
  • Loading branch information
mfoll committed Feb 17, 2017
2 parents d90c275 + 00a1340 commit 9d116db
Show file tree
Hide file tree
Showing 13 changed files with 1,495 additions and 737 deletions.
28 changes: 24 additions & 4 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
@@ -1,21 +1,42 @@
# Change Log

## [v1.0](https://github.com/IARCbioinfo/needlestack/tree/v1.0) (2016-08-03)
[Full Changelog](https://github.com/IARCbioinfo/needlestack/compare/v0.3...v1.0)

**Implemented enhancements:**

- Manage the three possible genotypes in vcf [\#130](https://github.com/IARCbioinfo/needlestack/issues/130)
- The graph showing AF vs log10\(qval\) should show phred-scaled qvalues [\#121](https://github.com/IARCbioinfo/needlestack/issues/121)

**Fixed bugs:**

- Contours seem to be incorrect [\#128](https://github.com/IARCbioinfo/needlestack/issues/128)
- correct file name extraction for sample name [\#126](https://github.com/IARCbioinfo/needlestack/issues/126)
- Let min\_qval be equal to 0 [\#119](https://github.com/IARCbioinfo/needlestack/issues/119)
- plot improved error rate confidence interval [\#117](https://github.com/IARCbioinfo/needlestack/issues/117)

**Closed issues:**

- QUAL should not be reported as Inf in VCF when q-value=0 [\#125](https://github.com/IARCbioinfo/needlestack/issues/125)
- Add pipeline execution DAG in README [\#123](https://github.com/IARCbioinfo/needlestack/issues/123)

## [v0.3](https://github.com/IARCbioinfo/needlestack/tree/v0.3) (2016-05-03)
[Full Changelog](https://github.com/IARCbioinfo/needlestack/compare/v0.2...v0.3)

**Implemented enhancements:**

- color points by qvalues in regression plot [\#85](https://github.com/IARCbioinfo/needlestack/issues/85)
- Add an option to directly input a region for calling in the command line [\#71](https://github.com/IARCbioinfo/needlestack/issues/71)
- Improve the bed split method [\#47](https://github.com/IARCbioinfo/needlestack/issues/47)
- Change the number of entry in the INFO and FORMAT VCF fields [\#108](https://github.com/IARCbioinfo/needlestack/issues/108)
- Add contour lines for a set of qvalues in the plot [\#100](https://github.com/IARCbioinfo/needlestack/issues/100)
- color points by qvalues in regression plot [\#85](https://github.com/IARCbioinfo/needlestack/issues/85)
- Add an option to choose output VCF file name \(--out\_vcf?\) [\#81](https://github.com/IARCbioinfo/needlestack/issues/81)
- Add an option to directly input a region for calling in the command line [\#71](https://github.com/IARCbioinfo/needlestack/issues/71)
- Change the way we publish new version [\#69](https://github.com/IARCbioinfo/needlestack/issues/69)
- Make the stable docker file more stable [\#68](https://github.com/IARCbioinfo/needlestack/issues/68)
- Add more tests in CircleCI [\#55](https://github.com/IARCbioinfo/needlestack/issues/55)
- Remove unnecessary intermediate outputs [\#51](https://github.com/IARCbioinfo/needlestack/issues/51)
- Improve the bed split method [\#47](https://github.com/IARCbioinfo/needlestack/issues/47)
- In the absence of a bed file the pipeline should run on the full reference genome [\#39](https://github.com/IARCbioinfo/needlestack/issues/39)
- Improve R script command line parsing [\#38](https://github.com/IARCbioinfo/needlestack/issues/38)
- Add version numbers in VCF output [\#20](https://github.com/IARCbioinfo/needlestack/issues/20)

**Fixed bugs:**
Expand All @@ -34,7 +55,6 @@
**Implemented enhancements:**

- Add logo image [\#62](https://github.com/IARCbioinfo/needlestack/issues/62)
- Improve R script command line parsing [\#38](https://github.com/IARCbioinfo/needlestack/issues/38)
- add --no\_indel option [\#56](https://github.com/IARCbioinfo/needlestack/issues/56)
- Correct english typos in readme, help and log [\#53](https://github.com/IARCbioinfo/needlestack/issues/53)
- The pipeline randomly crashes with java.nio.file.NoSuchFileException: XXX\_empty.pdf [\#49](https://github.com/IARCbioinfo/needlestack/issues/49)
Expand Down
53 changes: 41 additions & 12 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -16,8 +16,9 @@ Contact: follm@iarc.fr
Needlestack is an ultra-sensitive multi-sample variant caller for Next Generation Sequencing (NGS) data. It is based on the idea that analysing several samples together can help estimate the distribution of sequencing errors to accurately identify variants. It has been initially developed for somatic variant calling using very deep NGS data from circulating free DNA, but is also applicable to lower coverage data like Whole Exome Sequencing (WES) or even Whole Genome Sequencing (WGS). It is a highly scalable and reproducible pipeline thanks to the use of [nextflow](http://www.nextflow.io/) and [docker](https://www.docker.com) technologies.

Here is a summary of the method:

- At each position and for each candidate variant, we model sequencing errors using a negative binomial regression with a linear link and a zero intercept. The data is extracted from the BAM files using [samtools](http://www.htslib.org).
- Genetic variants are detected as being outliers from the error model. To avoid these outliers biasing the regression we use robust estimator for the negative binomial regression (published [here](http://www.ncbi.nlm.nih.gov/pubmed/25156188) with code available [here](https://github.com/williamaeberhard/glmrob.nb)).
- Genetic variants are detected as being outliers from the error model. To avoid these outliers biasing the regression we use a robust estimator for the negative binomial regression (published [here](http://www.ncbi.nlm.nih.gov/pubmed/25156188) with code available [here](https://github.com/williamaeberhard/glmrob.nb)).
- We calculate for each sample a p-value for being a variant (outlier from the regression) that we further transform into q-values to account for multiple testing.

## Input
Expand Down Expand Up @@ -93,29 +94,30 @@ Needlestack works under most Linux distributions and Apple OS X.

## Detailed instructions

### Nextflow and Docker

If you can't install [docker](https://www.docker.com) or don't want to use it, the pipeline will also work if you install [perl](https://www.perl.org), [bedtools](http://bedtools.readthedocs.org/en/latest/), [samtools](http://www.htslib.org) and Rscript from [R](https://www.r-project.org) and put them in your path (executables are assumed to be respectively called `perl`, `bedtools`, `samtools` and `Rscript`). In this case, remove the `-with-docker` option from step 5 above.

The exact same pipeline can be run on your computer or on a HPC cluster, by adding a [nextflow configuration file](http://www.nextflow.io/docs/latest/config.html) to choose an appropriate [executor](http://www.nextflow.io/docs/latest/executor.html). For example to work on a cluster using [SGE scheduler](https://en.wikipedia.org/wiki/Oracle_Grid_Engine), simply add a file named `nextflow.config` in the current directory (or `~/.nextflow/config` to make global changes) containing:
```java
process.executor = 'sge'
```

Other popular schedulers such as LSF, SLURM, PBS, TORQUE etc. are also compatible. See the nextflow documentation [here](http://www.nextflow.io/docs/latest/executor.html) for more details. Also have a look at the [other parameters for the executors](http://www.nextflow.io/docs/latest/config.html#scope-executor), in particular `queueSize` that defines the number of tasks the executor will handle in a parallel manner. Parallelism in needlestack is managed by splitting the genomic regions in pieces of equal sizes (`--nsplit`). Note that dealing with very large regions can take a large amount of memory, therefore splitting more is more memory-efficient. In nextflow the default number of tasks the executor will handle in a parallel is 100, which is certainly too high if you are executing it on your local machine (as if you use `--nsplit 100` the 100 pieces will run in parallel). In this case a good idea is to set it to the number of computing cores your local machine has. You can add this as an option at run time, by adding for example `-queue-size 4` in the `nextflow run` command if you have a machine with four cores. You can also permanently set it in the config file, here is a way to automatically obtain and add this information (works on Linux and Mac OS X):
```bash
echo "executor.\$local.queueSize = "`getconf _NPROCESSORS_ONLN` >> ~/.nextflow/config
```
Other popular schedulers such as LSF, SLURM, PBS, TORQUE etc. are also compatible. See the nextflow documentation [here](http://www.nextflow.io/docs/latest/executor.html) for more details. Also have a look at the [other parameters for the executors](http://www.nextflow.io/docs/latest/config.html#scope-executor), in particular `queueSize` that defines the number of tasks the executor will handle in a parallel manner. Parallelism in needlestack is managed by splitting the genomic regions in pieces of equal sizes (`--nsplit`). Note that dealing with very large regions can take a large amount of memory, therefore splitting more is more memory-efficient.

### Parameters

`--bam_folder` and `--fasta_ref` are compulsary. The optional parameters with default values are:
Type `--help` to get the full list of options. `--bam_folder` and `--fasta_ref` are compulsary. The optional parameters with default values are:

| Parameter | Default value | Description |
|-----------|--------------:|-------------|
| min_dp | 50 | Minimum coverage in at least one sample to consider a site |
| min_ao | 5 | Minimum number of non-ref reads in at least one sample to consider a site |
| min_dp | 30 | Minimum median coverage to consider a site. In addition, at least 10 samples have to be covered by min_dp. |
| min_ao | 3 | Minimum number of non-ref reads in at least one sample to consider a site |
| nsplit | 1 | Split the bed file in nsplit pieces and run in parallel |
| min_qval | 50 | qvalue threshold in [Phred scale](https://en.wikipedia.org/wiki/Phred_quality_score) to consider a variant |
| sb_type | SOR | Strand bias measure, either SOR or RVSB |
| sb_snv | 100 | Strand bias threshold for SNVs (100 =no filter) |
| sb_indel | 100 | Strand bias threshold for indels (100 = no filter)|
| sb_type | SOR | Strand bias measure, either SOR, RVSB or FS |
| sb_snv | 100 or 1000 | Strand bias threshold for SNVs (100 (1000 if FS) = no filter) |
| sb_indel | 100 or 1000 | Strand bias threshold for indels (100 (1000 if FS) = no filter)|
| map_qual | 20 | Min mapping quality (passed to samtools) |
| base_qual | 20 | Min base quality (passed to samtools) |
| max_DP | 30000 | Downsample coverage per sample (passed to samtools) |
Expand All @@ -129,12 +131,39 @@ echo "executor.\$local.queueSize = "`getconf _NPROCESSORS_ONLN` >> ~/.nextflow/c
| out_vcf | all_variants.vcf | File name of final VCF |
| bed | | BED file containing a list of regions (or positions) where needlestack should be run |
| region | | A region in format CHR:START-END where calling should be done |
| pairs_file | | A tab-delimited file containing two columns (normal and tumor sample names) for each sample in line. This enables matched tumor/normal pair calling features (see below) |
| power_min_af |  | Allelic fraction used to classify genotypes to 0/0 or ./. depending of the power to detect a variant at this fraction (see below) |
| extra_robust_gl | | Add this argument to perform extra-robust regression (useful for common germline SNPs, see below) |
| sigma_normal | 0.1 | Sigma parameter for negative binomial modeling of expected germline allelic fraction. We strongly recommend not to change this parameter unless you really know what it means |
| input_vcf | | A VCF file (from GATK) where calling should be done. Needlestack will extract DP and AO from this VCF (DP and AD fields) and annotate it with phred q-value score (`FORMAT/QVAL` field), error rate (`INFO/ERR`) and over-dispersion sigma parameter (`INFO/SIG`). WARNING: by default, only work with split (coreutils) version > 8.13 |

By default, if neither `--bed` nor `--region` are provided, needlestack would run on whole reference, building a bed file from fasta index inputed.
If `--bed` and `--region` are both provided, it should run on the region only.

Simply add the parameters you want in the command line like `--min_dp 1000` for example to change the min coverage or `--all_SNVs` to output all sites.

[Recommended values](http://gatkforums.broadinstitute.org/discussion/5533/strandoddsratio-computation) for SOR strand bias are SOR < 4 for SNVs and < 10 for indels. For RVSB, a good starting point is to filter out variant with RVSB>0.85. There is no hard filter by default as this is easy to do afterward using [bcftools filter](http://samtools.github.io/bcftools/bcftools.html#filter) command.
### Germline, somatic, matched Tumor-Normal pairs calling and contamination

When using matched tumor/normal, Needlestack can classify variants (VCF `FORMAT/STATUS` field) according to the following table:
![Status Table](STATUS_TABLE.png)

For this one need to provide a tab-delimited file containing two columns with normal and tumor sample names using the `--pairs_file` option. The first line of this file is a header with `TUMOR` and `NORMAL` keywords. When one normal or one tumor is missing, one can write `NA`. In this mode, the parameter `power_min_af` defines the allelic fraction in the tumor one is trying to detect to classify genotypes as `./.` or `0/0` depending on the power to detect this allelic fraction. Variants found as somatic in a tumor, but germline in another sample of the series will be flagged as `POSSIBLE_CONTAMINATION`. We found this particularly important, as needlestack is very sensitive to low allelic fractions, to filter out contamination among samples for pooled exome capture.

In other cases (when there is no `--pairs_file` parameter defined), genotypes are defined as `./.` or `0/0` assuming one is looking for allelic fractions expected for germline variants (negative binomial distribution centered at 0.5 with over-dispersion parameter sigma=`sigma_normal`, with `sigma_normal=0.1` by default). If you are looking for somatic variants without matched-normal and assuming you are interesting to correctly distinguish `./.` and `0/0`genotypes, you can set the `power_min_af` parameter to the lowest allelic fraction of somatic variants you are interested with (and your coverage allows you to find). Note that this is by far not the most common situation, and that in most cases you don't have to worry about the `power_min_af` parameter.

## Notes

### Common variants

Needlestack is made to identify rare variants (i.e. only a few samples in your set of samples have a particular variant), because of the robust regression used. Therefore, common SNPs (>10%) or strong somatic hotspots will be missed. The optional `extra_robust_gl` can overcome partially this issue for common germline mutations: it first discard high allelic fraction (>20%, assuming these are likely true variants) before fitting the regression model when between 10% and 50% of sample have such a high allelic fraction. A flag is written in the VCF `INFO/WARN` field when this happened (`EXTRA_ROBUST_GL`). Additionally, when an other allele than the reference allele is the most common, it is taken as the reference allele and a flag is also written in the VCF (`INFO/WARN=INV_REF`).

### Strand bias

For conventional variant callers, GATK WES/WGS [recommended values](http://gatkforums.broadinstitute.org/discussion/5533/strandoddsratio-computation) for SOR strand bias are SOR < 4 for SNVs and < 10 for indels. We haven't found this particularly useful, but a more detailed evaluation is necessary. For amplicon based targeted sequencing, RVSB>0.85 seems to reduce some erros. There is no hard filter by default as this is easy to do afterward using [bcftools filter](http://samtools.github.io/bcftools/bcftools.html#filter) command.

### Reproducibility

A good practice is to keep (and publish) the `.nextflow.log` file create during the pipeline process, as it contains useful information for reproducibility (and for debugging in case of problem). You should keep the `trace.txt` file containing even more information to keep for records. Nextflow also creates a nice processes execution timeline file (a web page) in `timeline.html`.

## Pipeline execution DAG
<img align="center" src="https://cloud.githubusercontent.com/assets/3366818/15250619/eb6afcea-1925-11e6-9f91-e1d6ceecbe19.jpg" width="600">
Binary file added STATUS_TABLE.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
130 changes: 130 additions & 0 deletions STATUS_TABLE.tex
Original file line number Diff line number Diff line change
@@ -0,0 +1,130 @@
%& -shell-escape
\documentclass[11pt]{article}

\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}
\usepackage{lmodern}

%\usepackage[latin1]{inputenc}
\usepackage[english]{babel}
\usepackage[paperwidth=18cm, paperheight=10cm,margin=0.5cm]{geometry}
\usepackage{helvet}
\usepackage{times}
%\usepackage{subfigure}
\usepackage{enumitem}

%\usepackage{unnumberedcaption}
\usepackage{multirow}
\usepackage{graphics,graphicx,epsfig,rotating}
\usepackage{amsfonts,amsmath,supertabular,tabularx,amstext}
\usepackage[table]{xcolor}
\usepackage{color}
\usepackage{verbatim}
\usepackage{stmaryrd}
\usepackage{array}
\usepackage{subfig}
%\usepackage[noae]{Sweave}

\usepackage{longtable}
\usepackage{threeparttable}

\usepackage[]{natbib}
\bibpunct{(}{)}{;}{author-year}{}{,}

\RequirePackage{fancyvrb}
\RequirePackage{listings}

\usepackage{hyperref}
\usepackage[right]{lineno}

\rmfamily

\definecolor{lightgray}{rgb}{0.75,0.75,0.75}

%\oddsidemargin 0cm
%\evensidemargin 0cm
%%\textwidth 17.2cm
%\topmargin 0cm
%\textheight 10.2cm
%\textwidth 18cm
%\marginparsep 0pt
\renewcommand{\baselinestretch}{1}
%color

\usepackage{titlesec}
\titleformat{\section}{\Large\bfseries}{}{0pt}{}
\titleformat{\subsection}{\Large\itshape}{}{0pt}{}




\begin{document}
\definecolor{red2}{rgb}{1,0.5,0.5}
\definecolor{lightred}{rgb}{1,0.8,0.8}

\definecolor{lightblue}{rgb}{0,0.60,0.86}

\definecolor{redblue}{rgb}{0.3,0.7,0.43}

\definecolor{darkgreen}{rgb}{0.1,0.6,0.1}
\definecolor{Lightgray}{rgb}{0.95,0.95,0.95}
\definecolor{lightgray}{rgb}{0.75,0.75,0.75}
\definecolor{darkgray}{rgb}{0.35,0.35,0.35}
\definecolor{darkgray2}{rgb}{0.90,0.90,0.90}
\definecolor{darkred}{rgb}{0.6,0.1,0.1}

%\begin{titlepage}




\setcounter{page}{1}

%\tableofcontents
%\newpage


%\newpage
%%%%%%%%%%%%%%%%% citations %%%%%%%%%%%%%%%%%%%
\setcounter{equation}{0}
\setcounter{figure}{0}
\setcounter{section}{0}
\renewcommand{\thefigure}{S\arabic{figure}}
\renewcommand{\thetable}{S\arabic{table}}
\renewcommand{\theequation}{S\arabic{equation}}

%\newpage


\begin{footnotesize}
\begin{threeparttable}
\begin{tabular}[htp!]{m{1.4cm} m{1.7cm} m{1.4cm} m{1.7cm} m{4.5cm} m{1.4cm} m{1.4cm}}
\caption{\textbf{Table of variant STATUS and GT (genotype), as a function of variant detection and the power to detect variants in matched normal and tumor samples.}}\\
\hline
\textbf{Variant in Normal} & \textbf{Power to detect variant in Normal} & \textbf{Variant in Tumor} & \textbf{Power to detect variant in Tumor} & \textbf{STATUS} & \textbf{Normal GT} & \textbf{Tumor GT}\\
\hline
\rowcolor{lightgray}no & no & no& no& . & ./. & ./. \tnote{*}\\
no & no & no& \textbf{YES}& . & ./. & 0/0 \tnote{*}\\
\rowcolor{lightgray}no & no & \textbf{YES}& \textbf{YES} or no & UNKNOWN & ./. & 0/1 or 1/1 \\
no & \textbf{YES} & no& no& . & 0/0 & ./. \tnote{*}\\
\rowcolor{lightgray}no & \textbf{YES} & no& \textbf{YES}& . & 0/0 & 0/0 \tnote{*}\\
no & \textbf{YES} & \textbf{YES}& \textbf{YES} or no & SOMATIC \tnote{\textdagger} & 0/0 & 0/1 or 1/1 \\
\rowcolor{lightgray}\textbf{YES} & \textbf{YES} or no & no& no& GERMLINE\_UNCONFIRMABLE & 0/1 or 1/1 & ./. \\
\textbf{YES} & \textbf{YES} or no & no& \textbf{YES}& GERMLINE\_UNCONFIRMED & 0/1 or 1/1 & 0/0 \\
\rowcolor{lightgray}\textbf{YES} & \textbf{YES} or no & \textbf{YES}& \textbf{YES} or no & GERMLINE\_CONFIRMED & 0/1 or 1/1 & 0/1 or 1/1\\
\hline
\label{tab:sumstats}
\end{tabular}
\begin{tablenotes}
\item[*] power is computed using a binomial distribution with mean $power\_min\_af$ (default value is 0.01)
\item[\textdagger] only in Tumor; Normal STATUS is ``."
\end{tablenotes}

\end{threeparttable}

\end{footnotesize}



\end{document}

Loading

0 comments on commit 9d116db

Please sign in to comment.