KNNCNV: A k-nearest neighbor based method for detection of copy number variations using NGS data
- bam_path (str, the *.bam file to analyze. For the six real blood samples in Section 3.2.1, the *.bam files can be obtained from the 1000 Genomes Project. The three cancer samples in Section 3.2.2 can be downloaded from the European Genome-Phenome Archive. These *.bam files can also be downloaded from the Baidu Netdisk; the extraction code is 33dd).
- fa_path (str, the reference genome, in either the fa format or the fasta format).
- gt_path (str, optional, the file of confirmed CNVs for the *.bam file. If this file is provided, performance metrics including precision and sensitivity can be calculated. Note that the file can be obtained from the Database of Genomic Variants).
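To make the two metrics mentioned for gt_path concrete, here is a minimal sketch of how precision and sensitivity could be computed from predicted and confirmed CNV intervals. It is not the KNNCNV implementation: the function names, the tuple layout `(chrom, start, end)`, and the "any overlap counts as a hit" criterion are all assumptions made for illustration, and the coordinates are made up.

```python
def overlaps(a, b):
    """Return True if two (chrom, start, end) intervals share at least one base."""
    return a[0] == b[0] and a[1] <= b[2] and b[1] <= a[2]


def precision_sensitivity(predicted, confirmed):
    """Interval-level metrics under an assumed any-overlap matching rule.

    precision   = predicted CNVs that overlap a confirmed CNV / all predicted
    sensitivity = confirmed CNVs that overlap a predicted CNV / all confirmed
    """
    hit_pred = sum(any(overlaps(p, c) for c in confirmed) for p in predicted)
    hit_conf = sum(any(overlaps(c, p) for p in predicted) for c in confirmed)
    precision = hit_pred / len(predicted) if predicted else 0.0
    sensitivity = hit_conf / len(confirmed) if confirmed else 0.0
    return precision, sensitivity


# Made-up intervals, for illustration only.
predicted = [("21", 1000, 5000), ("21", 20000, 24000), ("21", 90000, 95000)]
confirmed = [("21", 4000, 6000), ("21", 50000, 60000)]

precision, sensitivity = precision_sensitivity(predicted, confirmed)
print(f"precision={precision:.3f} sensitivity={sensitivity:.3f}")
```

A base-pair level comparison (counting overlapping bases rather than overlapping intervals) would be an equally reasonable choice; the README does not say which definition the tool uses.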
The experimental result for NA12878 is NA12878.txt, a comma-delimited text file that describes the CNVs predicted by KNNCNV. The first line holds the column names: chr, start, end, variant type, and RD. Here chr denotes the chromosome ID, start and end are the start and end positions of each declared CNV, the variant type is either deletion or duplication, and RD is the read depth of the declared CNV.
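The result file described above can be loaded with the standard library. The sketch below assumes the header spells the columns exactly as `chr,start,end,variant type,RD`; check the first line of your own output file and adjust the keys if it differs. The helper name `read_cnvs` is hypothetical and the sample rows are made up, not real KNNCNV output.

```python
import csv
import io

# Made-up rows for illustration only; the header spelling is an assumption.
sample = io.StringIO(
    "chr,start,end,variant type,RD\n"
    "21,1000,5000,deletion,12.5\n"
    "21,20000,24000,duplication,61.0\n"
)


def read_cnvs(handle):
    """Parse a comma-delimited CNV table into a list of typed dictionaries."""
    reader = csv.DictReader(handle, skipinitialspace=True)
    return [
        {
            "chr": row["chr"],
            "start": int(row["start"]),
            "end": int(row["end"]),
            "type": row["variant type"],
            "rd": float(row["RD"]),
        }
        for row in reader
    ]


cnvs = read_cnvs(sample)
deletions = [c for c in cnvs if c["type"] == "deletion"]
print(len(cnvs), "CNVs,", len(deletions), "deletion(s)")
```

For a real run, pass an open file instead of the in-memory sample, for example `with open("NA12878.txt", newline="") as fh: cnvs = read_cnvs(fh)`.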
- open the file knncnv.py and modify the variables bam_path and fa_path inside;
if __name__ == '__main__':
    # Local path of the *.bam file
    bam_path = r"/real_data/NA12878.chrom21.SLX.maq.SRP000032.2009_07.bam"
    # Local path of the *.fasta file or the *.fa file
    fa_path = r"./data/chr21.fa"
    # Local path of the ground truth (i.e., confirmed CNVs) for the *.bam file
    gt_path = r"./data/NA12878.gt"
    # Parameter setting of the preprocessing
    bin_size = 1000  # the bin size ('1000' by default)
    knncnv(bam_path, fa_path, gt_path=gt_path, iter_num=20, bin_size=bin_size)
- run knncnv.py;
- the output of the entire process is NA12878.txt.
from knncnv import vbgmm
labels = vbgmm(scores)
# In labels, 0 stands for inliers and 1 for outliers (CNVs).
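Once a 0/1 label vector is available, runs of consecutive outliers can be merged into candidate regions. The sketch below is only an illustration: it assumes one label per consecutive fixed-size bin, which the README does not state, and the helper `outlier_intervals` and its `offset` parameter are hypothetical names rather than part of KNNCNV.

```python
def outlier_intervals(labels, bin_size=1000, offset=0):
    """Merge runs of label 1 into half-open (start, end) coordinate intervals.

    Assumes labels[i] describes the bin that starts at offset + i * bin_size.
    """
    intervals = []
    run_start = None
    for i, label in enumerate(labels):
        if label == 1 and run_start is None:
            run_start = i  # a new run of outlier bins begins here
        elif label != 1 and run_start is not None:
            intervals.append((offset + run_start * bin_size, offset + i * bin_size))
            run_start = None
    if run_start is not None:  # close a run that reaches the last bin
        intervals.append((offset + run_start * bin_size, offset + len(labels) * bin_size))
    return intervals


# Made-up label vector standing in for the result of vbgmm(scores).
example_labels = [0, 1, 1, 0, 0, 1]
intervals = outlier_intervals(example_labels)
print(intervals)
```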
Python 3.8
- biopython 1.78
- numpy 1.18.5
- pandas 1.0.5
- pysam 0.16.0.1
- pyod 0.8.4
- rpy2 3.4.2
- scikit-learn 0.23.1
- scipy 1.5.0
R 3.4.4
- DNAcopy
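As a convenience, the Python pins listed above can be compared with what is installed in the current environment. This is a small sketch using the standard-library `importlib.metadata` module (available from Python 3.8); the `PINNED` dictionary and the helper `check_versions` are illustrative names. It does not check R or the DNAcopy package.

```python
from importlib import metadata

# Versions copied from the requirements list in this README.
PINNED = {
    "biopython": "1.78",
    "numpy": "1.18.5",
    "pandas": "1.0.5",
    "pysam": "0.16.0.1",
    "pyod": "0.8.4",
    "rpy2": "3.4.2",
    "scikit-learn": "0.23.1",
    "scipy": "1.5.0",
}


def check_versions(pinned):
    """Return {name: (wanted, installed_or_None, matches)} for each package."""
    report = {}
    for name, wanted in pinned.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        report[name] = (wanted, installed, installed == wanted)
    return report


report = check_versions(PINNED)
for name, (wanted, installed, ok) in report.items():
    status = "ok" if ok else "differs"
    print(f"{name}: wanted {wanted}, found {installed} ({status})")
```

A differing version is not necessarily a problem; the list records the versions the authors name, not a tested compatibility range.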
KNNCNV is published in Frontiers in Cell and Developmental Biology. If you use this code in your work, please cite the following paper.
@article {PMID:35004691,
Title = {KNNCNV: A K-Nearest Neighbor Based Method for Detection of Copy Number Variations Using NGS Data},
Author = {Xie, Kun and Liu, Kang and Alvi, Haque A K and Chen, Yuehui and Wang, Shuzhen and Yuan, Xiguo},
DOI = {10.3389/fcell.2021.796249},
Volume = {9},
Year = {2021},
Journal = {Frontiers in Cell and Developmental Biology},
ISSN = {2296-634X},
Pages = {796249},
URL = {https://europepmc.org/articles/PMC8728060},
}