Skip to content

Commit

Permalink
Update the README.md
Browse files Browse the repository at this point in the history
  • Loading branch information
chtsai0105 committed Feb 27, 2024
1 parent d3f49f8 commit 0f26bd8
Showing 1 changed file with 116 additions and 29 deletions.
145 changes: 116 additions & 29 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -26,6 +26,7 @@ The marker sets developed for this approach in fungi are available as part of th
- Implement all stuff in python. The entire program will be more readable and maintainable.
- Simplify some steps and reduce the intermediate files as much as possible.
- [Muscle] is now available for alternative alignment method.
- Use [PhyKIT] to remove uninformative orthologs.
- [FastTree] is now available for tree construction.
- [ASTER], a C++ version of [ASTRAL] is now integrated to resolve consensus among trees built upon individual genes.

Expand All @@ -43,7 +44,7 @@ positional arguments:
align Run multiple sequence alignments against orthologs found among samples
tree Build a phylogenetic tree based on multiple sequence alignment results
optional arguments:
options:
-h, --help show this help message and exit
-V, --version show program's version number and exit
```
Expand Down Expand Up @@ -76,8 +77,6 @@ positional arguments:
options:
-h, --help show this help message and exit
-v, --verbose Verbose mode for debug
-o OUTPUT, --output OUTPUT
Output directory to save HMM markerset (default="./HMM")
```

Firstly, use `download list` to show the available BUSCO markersets.
Expand All @@ -86,7 +85,11 @@ Firstly, use `download list` to show the available BUSCO markersets.
phyling download list
```

Copy and paste the markerset to download it. Here we use `fungi_odb10` as example.
By default the downloaded markersets will be saved to the `~/.phyling/HMM`.
The `Downloaded datasets` section listed the markersets that have already been downloaded.

To download the markerset, copy the name from the list and paste it to the download module directly.
Here we use `fungi_odb10` as example.

```
phyling download fungi_odb10
Expand All @@ -112,34 +115,36 @@ See all the options with `phyling align --help`.
options:
-h, --help show this help message and exit
-v, --verbose Verbose mode for debug
-i INPUTS [INPUTS ...], --inputs INPUTS [INPUTS ...]
-i file [files ...], --inputs file [files ...]
Query pepetide/cds fasta or gzipped fasta
-I INPUT_DIR, --input_dir INPUT_DIR
-I directory, --input_dir directory
Directory containing query pepetide/cds fasta or gzipped fasta
-o OUTPUT, --output OUTPUT
Output directory of the alignment results (default="./align")
-m MARKERSET, --markerset MARKERSET
-o directory, --output directory
Output directory of the alignment results (default: phyling-align-20240227-103913-0800
[current timestamp])
-m directory, --markerset directory
Directory of the HMM markerset
-E EVALUE, --evalue EVALUE
Hmmsearch reporting threshold (default=1e-10)
-E float, --evalue float
Hmmsearch reporting threshold (default: 1e-10)
-M {hmmalign,muscle}, --method {hmmalign,muscle}
Program used for multiple sequence alignment (default="hmmalign")
Program used for multiple sequence alignment (default: hmmalign)
--non_trim Report non-trimmed alignment results
--from_checkpoint Load previous hmmsearch results from .checkpoint.pkl in the output directory
-t THREADS, --threads THREADS
Threads for hmmsearch and the number of parallelized jobs in MSA step (default=1)
Threads for hmmsearch and the number of parallelized jobs in MSA step. Better be multiple of 4
if using more than 8 threads (default: 1)
```

Run the align module with all the fasta files under folder `pep`.

```
phyling align -I pep -m HMM/fungi_odb10/hmms
phyling align -I pep -o align -m HMM/fungi_odb10/hmms
```

An equivalent way to send inputs.

```
phyling align -i pep/*.fasta.gz -m HMM/fungi_odb10/hmms
phyling align -i pep/*.fasta.gz -o align -m HMM/fungi_odb10/hmms
```

Or if you're just interested in part of the fasta, you can specify the inputs one-by-one.
Expand All @@ -148,6 +153,7 @@ Or if you're just interested in part of the fasta, you can specify the inputs on
phyling align -i pep/Pilobolus_umbonatus_NRRL_6349.aa.fasta.gz \
pep/Rhizopus_homothallicus_CBS_336.62.aa.fasta.gz \
pep/Rhizopus_rhizopodiformis_NRRL_2570.aa.fasta.gz \
-o align \
-m HMM/fungi_odb10/hmms
```

Expand All @@ -156,13 +162,15 @@ phyling align -i pep/Pilobolus_umbonatus_NRRL_6349.aa.fasta.gz \
Accelerate by using 16 cpus.

```
phyling align -I pep -m HMM/fungi_odb10/hmms -t 16
phyling align -I pep -o align -m HMM/fungi_odb10/hmms -t 16
```

According to [pyhmmer benchmark](https://pyhmmer.readthedocs.io/en/stable/benchmarks.html) the multithread acceleration drop dramatically when more cpu is used.
#### Multithreading strategy

According to [pyhmmer benchmark](https://pyhmmer.readthedocs.io/en/stable/benchmarks.html), the acceleration benefits from multithreading drop significantly as more CPUs are utilized.
When less then 8 cpus are given, the hmmsearch step will run in single-thread manner and all cpus will be used for each round of hmmsearch.
When 8 or more cpus are given, the hmmsearch step will use 4 cpus for each parallel job.
In this example, 4 hmmsearch jobs will run parallelly and each job utilize 4 cpus.
In the example above, 4 hmmsearch jobs will run parallelly and each job utilize 4 cpus.
For the alignment step, 16 parallel jobs will be launched and each parallel job is running in single-thread manner.

Highly recommended if **muscle** is chosen for alignment. (**muscle** is much slower than **hmmalign**!!)
Expand All @@ -174,46 +182,55 @@ The `--from_checkpoint` option allow you to retrieve already completed hmmsearch
Only the newly added/changed inputs will do the hmmsearch when `--from_checkpoint` is enabled.
Meanwhile, the samples not specified in the command with `--from_checkpoint` will be removed from the checkpoint file.

#### Use coding sequence instead of peptide sequence
#### Use coding sequences instead of peptide sequences

In some circumstances, the highly shared peptide sequences make it difficult to resolve the relationship among closely related species.
To address the issue, one can use DNA coding sequences (CDS), which contain more evolutionary traces, instead of peptide sequences for phylogeny analysis.

Run the align module with cds fasta files under folder `cds`.

```
phyling align -I cds -m HMM/fungi_odb10/hmms -t 16
phyling align -I cds -o align_cds -m HMM/fungi_odb10/hmms -t 16
```

The CDS inputs will be translated into peptide sequences in the first steps.
The translated peptide sequences will be used for hmmsearch and the alignment steps.
The peptide alignment results will then being back-translated according to the original CDS inputs.
And the back-translated DNA version alignments will be output.

### Build tree on multiple sequence alignment results
### Build tree from multiple sequence alignment results

Finally, we can run the tree module to use the multiple sequence alignment results to build a phylogenetic tree.
By default, it uses the _consensus tree_ strategy (conclude the majority of trees which was built upon each single gene)
But you can choose to use _concatenated alignment_ strategy by specifying `-c/--concat`.
Currently, 3 methods (upgma, Neighbor Joining and FastTree) are available for tree building.
We further calculate the [treeness/RCV] by [PhyKit] and select the top 50 most informative markers for final tree building.
See all the options with `phyling tree --help`.

```
options:
-h, --help show this help message and exit
-v, --verbose Verbose mode for debug
-i INPUTS [INPUTS ...], --inputs INPUTS [INPUTS ...]
Multiple sequence alignment fasta
-I INPUT_DIR, --input_dir INPUT_DIR
Directory containing multiple sequence alignment fasta
-o OUTPUT, --output OUTPUT
Output directory of the newick treefile (default=".")
-i file [files ...], --inputs file [files ...]
Multiple sequence alignment fasta of the markers
-I directory, --input_dir directory
Directory containing multiple sequence alignment fasta of the markers
-o directory, --output directory
Output directory of the newick treefile (default: phyling-tree-20240227-103934-0800 [current
timestamp])
-M {upgma,nj,ft}, --method {upgma,nj,ft}
Algorithm used for tree building (default="upgma")
Algorithm used for tree building. (default: upgma)
Available options:
UPGMA: upgma
Neighbor Joining: nj
FastTree: ft
-n TOP_N_TOVERR, --top_n_toverr TOP_N_TOVERR
Select the top n markers based on their treeness/RCV for final tree building (default: 50.
Specify 0 to use all markers)
-c, --concat Concatenated alignment results
-f, --figure Generate a matplotlib tree figure
-t THREADS, --threads THREADS
Threads for tree construction (default=1)
Threads for tree construction (default: 1)
```

Run the tree module with all the alignment results under folder `align`.
Expand All @@ -228,6 +245,8 @@ You can also use only part of the alignment results to build tree.
phyling tree -i align/100957at4751.aa.mfa align/174653at4751.aa.mfa align/255412at4751.aa.mfa
```

**Note: the checkpoint file .checkpoint.pkl is required in the same directory to provide the sample names and sequence type for tree building.**

Use FastTree instead of the default UPGMA method for tree building and running with 16 threads.

```
Expand All @@ -240,13 +259,57 @@ Use matplotlib to generate a tree figure.
phyling tree -I align -f
```

#### Use concatenate strategy

By default the consensus strategy is used but users can choose to concatenate the markers and generate a single tree on it.

```
phyling tree -I align -c
```

When enable the concatenate option, the concatenated fasta will be output to the designated output folder.
Meanwhile, a partition file compatible with RAxML and IQTree will also being generated to support users who are interested in using RAxML and IQTree.

**Note: with default setting PHYling will only concatenate top 50 markers based on their treeness/RCV score.
See more [details](#use-top_n_toverr-to-filter-the-markers-by-treenessrcv) in the section below.**

#### Use top_n_toverr to filter the markers by treeness/RCV

The align module usually report a lot of markers.
We integrate [PhyKIT] to calculate the [treeness/RCV] value to determine how informative a marker is.
The higher the [treeness/RCV] values, the marker are more informative and least susceptible to composition bias.
By doing so, we're able to select the most informative and rubost markers to build the final tree.

By default the top 50 markers are selected for the final tree building.
You can specify your own desired number.
Say we want only the top 20:

```
phyling tree -I align -n 20 -t 16
```

You can also enable the -c/--concat at the same time to concatenate only the top ranked markers.
The following example will concatenate the top 20 markers.
```
phyling tree -I align -c -n 20 -t 16
```

If you want to use all the markers without filtering, you can specify `-n/--top_n_toverr 0`.

```
phyling tree -I align -n 0 -t 16
```

**Note: when -n/--top_n_toverr is enable (>0), a file `top_toverr_trees.tsv` that record the treeness/RCV of selected markers will be generate as well.**

## Requirements

- Python >= 3.9
- [Biopython](https://biopython.org/)
- [pyhmmer], a HMMER3 implementation on python3.
- [muscle] for alternative method for multiple sequence alignment. (Optional)
- [ClipKIT] for removing sites that are poor of phylogenetic signal.
- [PhyKIT] for calculating treeness/RCV to filter uninformative orthologs.
- [FastTree], use approximately maximum-likelihood to build trees. (Optional)
- [ASTER], a C++ re-implementation of [ASTRAL] to resolve consensus among trees.

Expand All @@ -269,6 +332,27 @@ Install the package through pip in the PHYling folder.
pip install .
```

### Install additional package for developing (developer only)

Developer should clone the GitHub project directly instead of downloading from the releases.
Some of the files for developing purpose only are not included in the releases.

In addition to the requirements listed above, the following packages are required for developing environment.

- [pre-commit] >= 3.4.0

Developer can install it with conda by:

```
conda install pre-commit>=3.4.0
```

For convenience we also provide the additional conda env file. Please use the dev_additional_packages.yml to install the additional packages.

```
conda env update -f dev_additional_packages.yml
```

## Notes

- Training your own marker set is also possible but most busco sets are good starting place.
Expand All @@ -278,6 +362,9 @@ pip install .
[pyhmmer]: https://pyhmmer.readthedocs.io/en/stable/index.html
[muscle]: https://drive5.com/muscle5/
[ClipKIT]: https://jlsteenwyk.com/ClipKIT/
[PhyKIT]: https://jlsteenwyk.com/PhyKIT/
[Treeness/RCV]: https://jlsteenwyk.com/PhyKIT/usage/index.html#treeness-over-rcv
[FastTree]: http://www.microbesonline.org/fasttree/
[ASTER]: https://github.com/chaoszhang/ASTER
[ASTRAL]: https://github.com/smirarab/ASTRAL
[pre-commit]: https://pre-commit.com/

0 comments on commit 0f26bd8

Please sign in to comment.