Partition data into a separate VCF file per chromosome

The first preprocessing step partitions the data into independent, manageable VCF files (one file per chromosome) in order to speed up data retrieval, simplify the management of raw data, and allow chromosomes to be processed in parallel. Otherwise, accessing and manipulating large-scale multidimensional biological data can be excessively costly in both execution time and memory consumption.

Before the partitioned VCF files are written, the chromosome notation in the variants/CHROM field can optionally be modified. The reference and query datasets must use the same chromosome notation before they are merged, and standardizing it at this step is more efficient than doing so in a separate step. To rename the chromosome notation, use the --rename-chr flag. By default, when --rename-chr is set, the variants/CHROM format changes from <chrom_number> to chr<chrom_number> or vice versa. A chromosome whose notation matches neither format keeps its notation. To apply a different change, also pass the --rename-map flag with a dictionary whose keys are the current chromosome notation and whose values are the new chromosome notation.
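The default renaming rule and the mapping-based rule described above can be sketched in a few lines of Python. This is only an illustration, not MergeGenome's own code: the function name `rename_chrom` is hypothetical, and it assumes that <chrom_number> means a purely numeric name and that names missing from the mapping are left unchanged.

```python
import re


def rename_chrom(chrom, rename_map=None):
    """Return the new notation for a single CHROM value (illustrative sketch)."""
    if rename_map is not None:
        # Explicit mapping: keys are current names, values are new names.
        # Assumption: names absent from the mapping stay as they are.
        return rename_map.get(chrom, chrom)
    if re.fullmatch(r"\d+", chrom):
        # <chrom_number> -> chr<chrom_number>
        return "chr" + chrom
    match = re.fullmatch(r"chr(\d+)", chrom)
    if match:
        # chr<chrom_number> -> <chrom_number>
        return match.group(1)
    # Neither format: keep the notation unchanged.
    return chrom
```

Under these assumptions, `rename_chrom("1")` gives `"chr1"`, `rename_chrom("chr22")` gives `"22"`, and a name such as `"scaffold_7"` is returned as is.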

Usage

$ python3 MergeGenome.py partition -q <query_file> -o <output_folder>

Input flags include:

  • -q, --query PATH, Path to input .vcf file with data for multiple chromosomes (required).
  • -o, --output-folder PATH, Path to output folder (required). Note: make sure a '/' appears at the end of the output folder.
  • -r, --rename-chr, To rename chromosome notation (optional).
  • -m, --rename-map DICT, Mapping from current to new chromosome notation (optional).
  • -d, --debug PATH, Path to .log/.txt file to store info/debug messages (optional).
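For readers who want to see how such flags fit together, the following is a rough argparse sketch of an equivalent command-line interface. It is not taken from MergeGenome.py: the program name `partition-sketch` is made up, and parsing the --rename-map value with `json.loads` is an assumption about how a dictionary passed as a quoted string could be read.

```python
import argparse
import json

# Hypothetical parser mirroring the flags listed above.
parser = argparse.ArgumentParser(prog="partition-sketch")
parser.add_argument("-q", "--query", required=True,
                    help="input .vcf file with data for multiple chromosomes")
parser.add_argument("-o", "--output-folder", required=True,
                    help="output folder, ending with '/'")
parser.add_argument("-r", "--rename-chr", action="store_true",
                    help="rename chromosome notation")
# Assumption: the dictionary is given as a JSON string on the command line.
parser.add_argument("-m", "--rename-map", type=json.loads, default=None,
                    help="mapping from current to new chromosome notation")
parser.add_argument("-d", "--debug", default=None,
                    help=".log/.txt file for info/debug messages")

# Parse an argument list equivalent to a full command-line call.
args = parser.parse_args([
    "-q", "query.vcf", "-o", "./output/", "-r",
    "-m", '{"1":"chr_1", "2":"chr_2"}', "-d", "partition.log",
])
```

Wrapping the dictionary in single quotes on the shell, as in `'{"1":"chr_1", "2":"chr_2"}'`, keeps the inner double quotes intact so the value arrives as one argument.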

Output

  • One .vcf file for each chromosome in <query_file>. Each new .vcf file keeps the base name of the input file, followed by the name of the chromosome it contains.
  • If --debug is given, a .log or .txt file with the dimensions of the data (number of samples and number of SNPs), the number of chromosomes found and the dimensions of each and, when applicable, the changes made to the chromosome notation.
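The partitioning itself can be pictured with a minimal, dependency-free Python sketch that streams a plain-text VCF and routes each record to a per-chromosome file. This is not how MergeGenome is implemented; the function name `partition_vcf` and the `<base>_<chrom>.vcf` naming pattern are assumptions made for illustration, and compressed (.vcf.gz) input is not handled.

```python
import os


def partition_vcf(query_path, output_folder):
    """Split a plain-text VCF into one file per CHROM value.

    Returns a dict mapping each chromosome name to the path written.
    """
    base = os.path.splitext(os.path.basename(query_path))[0]
    header = []   # meta-information lines and the #CHROM column line
    handles = {}  # chromosome name -> open output file
    paths = {}    # chromosome name -> output path
    try:
        with open(query_path) as src:
            for line in src:
                if line.startswith("#"):
                    header.append(line)
                    continue
                # CHROM is the first tab-separated column of a record.
                chrom = line.split("\t", 1)[0]
                if chrom not in handles:
                    # Hypothetical naming pattern: <base>_<chrom>.vcf
                    path = os.path.join(output_folder, f"{base}_{chrom}.vcf")
                    handles[chrom] = open(path, "w")
                    # Every partition gets a full copy of the header.
                    handles[chrom].writelines(header)
                    paths[chrom] = path
                handles[chrom].write(line)
    finally:
        for handle in handles.values():
            handle.close()
    return paths
```

Keeping one open handle per chromosome means the input is read only once and records do not need to be sorted by chromosome.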

Examples

  1. Partition .vcf data into a separate .vcf file per chromosome:
$ python3 MergeGenome.py partition -q query.vcf -o ./output/
  2. Partition .vcf data into a separate .vcf file per chromosome and change chromosome notation from <chrom_number> to chr<chrom_number>:
$ python3 MergeGenome.py partition -q query.vcf -o ./output/ -r
  3. Partition .vcf data into a separate .vcf file per chromosome and change chromosome notation from chr<chrom_number> to <chrom_number>:
$ python3 MergeGenome.py partition -q query.vcf -o ./output/ -r
  4. Partition .vcf data into a separate .vcf file per chromosome and change chromosome notation from "1" to "chr_1", and from "2" to "chr_2". Also save debug info in a .log file:
$ python3 MergeGenome.py partition -q query.vcf -o ./output/ -r -m '{"1":"chr_1", "2":"chr_2"}' -d partition.log