mmseqs taxonomy based on GTDB + NR viruses + NR eukaryotes #849

Open
pbelmann opened this issue Jun 2, 2024 · 1 comment

pbelmann commented Jun 2, 2024

Hi

I would like to taxonomically classify my protein sequences using the GTDB taxonomy combined with the NCBI taxonomy for NR viruses and NR eukaryotes.

Do you have any suggestions on how I could build an mmseqs database consisting of these three databases and two taxonomies?

My current approach would be to create *.dmp files for the GTDB according to your description and merge them with the *.dmp files of an NR subset containing only viruses and eukaryotes.

@milot-mirdita (Member)

Essentially you need:

  • three FASTA files whose headers carry unique accessions, followed by the amino acid sequences.
  • one TSV file that maps each accession to its numeric taxonomy ID.
  • combined *.dmp files.
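
For the combined *.dmp files, one thing that may be worth checking before merging is whether the two taxonomies reuse the same numeric tax IDs. The sketch below is only a pre-merge check, not a merge procedure. It assumes both dumps follow the NCBI layout, where the tax ID is the first tab-delimited field of each nodes.dmp line; `colliding_taxids` is a hypothetical helper name, and the paths in the example call are placeholders.

```shell
# Print tax IDs that appear in both nodes.dmp files.
# Assumption: NCBI-style dump layout, i.e. fields separated by "\t|\t",
# so the tax ID is the first tab-delimited field of each line.
colliding_taxids() {
    first=$(mktemp)
    second=$(mktemp)
    cut -f1 "$1" | sort -u > "$first"
    cut -f1 "$2" | sort -u > "$second"
    # comm -12 keeps only the lines common to both sorted inputs.
    comm -12 "$first" "$second"
    rm -f "$first" "$second"
}

# Example call (placeholder paths):
#   colliding_taxids gtdb_dmp/nodes.dmp nr_dmp/nodes.dmp
```

NCBI's root node has tax ID 1, so an overlap on that ID may be intentional. Any other ID printed here probably needs renumbering in one of the dumps, and in the matching rows of the TSV, before the files are combined.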

With all of that you can call:

mmseqs createdb gtdb.fasta virus.fasta euks.fasta seqdb
mmseqs createtaxdb seqdb tmp --tax-mapping-file accession_to_taxid.tsv --ncbi-tax-dump path-to-dmp-files-dir/

seqdb will then be a normal taxonomy database.

For the TSV file, check that the accessions in the second column of the seqdb.lookup file (created by createdb) match the accessions in the first column of your TSV file.
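
A minimal sketch of that check, assuming both files are tab-separated. The helper names `duplicate_accessions` and `unmapped_accessions` are made up for illustration; the file names in the example calls are the ones used in the commands above.

```shell
# Both helpers read the seqdb.lookup file written by `mmseqs createdb`.
# Its second column holds the accessions.

# Print accessions that occur more than once in the lookup file.
duplicate_accessions() {
    cut -f2 "$1" | sort | uniq -d
}

# Print lookup accessions that have no row in the mapping TSV
# (accession expected in column 1 of the TSV).
unmapped_accessions() {
    lookup_file=$1
    mapping_tsv=$2
    awk -F '\t' '
        FILENAME == ARGV[1] { mapped[$1] = 1; next }
        !($2 in mapped)     { print $2 }
    ' "$mapping_tsv" "$lookup_file"
}

# Example calls:
#   duplicate_accessions seqdb.lookup
#   unmapped_accessions seqdb.lookup accession_to_taxid.tsv
```

Empty output from both calls suggests the accessions are unique and every sequence has a row in the TSV. Anything printed by the second call is an accession as createdb parsed it, which makes it easier to see how the TSV entries would need to be spelled.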
