Skip to content

KNNCNV: A k-nearest neighbor based method for detection of copy number variations using NGS data

Notifications You must be signed in to change notification settings

mk-runner/KNNCNV

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

23 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

KNNCNV

KNNCNV: A k-nearest neighbor based method for detection of copy number variations using NGS data


Usage

1. Input

  • bam_path (str, for the six real blood samples in Section 3.2.1, the *.bam file can be obtained from the 1000 Genomes Project. The three cancer samples in Section 3.2.2 can be downloaded from the European Genome-Phenome Archive. Additionally, these *.bam files can be download from the Baidu Netdisk, and its extraction code is 33dd).
  • fa_path (str, the reference genome that can be in the fa format or the fasta format).
  • gt_path (str, optional, the confirmed CNVs file of the *.bam file. if this file is provided, some performance metrics including precision and sensitivity can be calculated. Note that the file can be obtained from the database of genomic variants).

2. Output of the entire process

The experimental result on the NA12878 is NA12878.txt that is a txt text delimited by commas and represents the detailed descriptions of CNVs predicted by KNNCNV. More specifically, the first line of the text is the column descriptions including chr, start, end, variant type, and RD, where chr denotes the chromosome ID, and start and end represent the start and end positions of declared CNVs, respectively. The variant type contains deletion and duplication, and the RD is the read depth of declared CNVs.

3. API

1) API for the entire process

  • open the file knncnv.py and modify the variables bam_path, fa_path inside;
if __name__ == '__main__':
    # Local path of the *.bam file
    bam_path = r"/real_data/NA12878.chrom21.SLX.maq.SRP000032.2009_07.bam"
    
    # Local path of the *.fasta file or the *.fa file
    fa_path = r"./data/chr21.fa"
    
    # Local path of the ground truth (i.e., confirmed CNVs) for the *.bam file.
    gt_path = r"./data/NA12878.gt"
    
    # parameter setting of the preprocessing
    bin_size = 1000  # the bin size ('1000' by default)

    knncnv(bam_path, fa_path, gt_path=gt_path, iter_num=20, bin_size=bin_size)
  • run the knncnv.py;

  • output of the entire process is NA12878.txt.

2) API only for the VBGMM

from knncnv import vbgmm

labels = vbgmm(scores)
# 0 stands for inliers and 1 for outliers(CNVs).

Required Dependencies

Python 3.8

  • biopython 1.78
  • numpy 1.18.5
  • pandas 1.0.5
  • pysam 0.16.0.1
  • pyod 0.8.4
  • rpy2 3.4.2
  • scikit-learn 0.23.1
  • scipy 1.5.0

R 3.4.4

  • DNAcopy

Citing KNNCNV

KNNCNV is published in Frontiers in Cell and Developmental Biology. If you use this code in your work, we would like to cite the following paper.

@article {PMID:35004691,
	Title = {KNNCNV: A K-Nearest Neighbor Based Method for Detection of Copy Number Variations Using NGS Data},
	Author = {Xie, Kun and Liu, Kang and Alvi, Haque A K and Chen, Yuehui and Wang, Shuzhen and Yuan, Xiguo},
	DOI = {10.3389/fcell.2021.796249},
	Volume = {9},
	Year = {2021},
	Journal = {Frontiers in cell and developmental biology},
	ISSN = {2296-634X},
	Pages = {796249},
	URL = {https://europepmc.org/articles/PMC8728060},
}


About

KNNCNV: A k-nearest neighbor based method for detection of copy number variations using NGS data

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published