Project author: hall-lab

Project description:
Bayesian genotyper for structural variants
Language: Python
Project URL: git://github.com/hall-lab/svtyper.git
Created: 2014-08-14T18:06:49Z
Project community: https://github.com/hall-lab/svtyper

License: MIT License


SVTyper

Bayesian genotyper for structural variants

Overview

SVTyper performs breakpoint genotyping of structural variants (SVs) using whole genome sequencing data. Users must supply a VCF file of sites to genotype (which may be generated by LUMPY) as well as a BAM/CRAM file of Illumina paired-end reads aligned with BWA-MEM. SVTyper assesses discordant and concordant reads from paired-end and split-read alignments to infer genotypes at each site. Algorithm details and benchmarking are described in Chiang et al., 2015.
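To make the idea of inferring a genotype from read evidence concrete, here is a minimal sketch of a binomial genotype model with a flat prior. It is not SVTyper's implementation: the function names and the per-genotype probabilities in `P_ALT` are assumptions chosen for illustration, and the published algorithm in Chiang et al., 2015 should be consulted for the actual method.

```python
from math import lgamma, log, exp

# Illustrative per-genotype probabilities that a read supports the
# alternate allele (hom-ref, het, hom-alt). These values are assumptions
# chosen for the example, not parameters taken from SVTyper.
P_ALT = (0.001, 0.5, 0.9)
GENOTYPES = ("0/0", "0/1", "1/1")


def log_binom_pmf(k, n, p):
    """Natural log of the binomial probability of k successes in n trials."""
    log_choose = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return log_choose + k * log(p) + (n - k) * log(1.0 - p)


def genotype_posteriors(ref_reads, alt_reads):
    """Posterior probability of each genotype under a flat prior."""
    n = ref_reads + alt_reads
    log_liks = [log_binom_pmf(alt_reads, n, p) for p in P_ALT]
    # Normalize in log space for numerical stability.
    peak = max(log_liks)
    weights = [exp(ll - peak) for ll in log_liks]
    total = sum(weights)
    return dict(zip(GENOTYPES, (w / total for w in weights)))


def call_genotype(ref_reads, alt_reads):
    """Return the most probable genotype string."""
    post = genotype_posteriors(ref_reads, alt_reads)
    return max(post, key=post.get)
```

With these illustrative values, a site with 10 reference-supporting and 10 alternate-supporting reads is called heterozygous, because the 0.5 model explains an even split far better than either homozygous model.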

Figure: NA12878 heterozygous deletion

Installation

Requirements:

  • Python 2.7.x

Install via pip

    pip install git+https://github.com/hall-lab/svtyper.git

svtyper depends on pysam (version 0.15.0 or newer), numpy, and scipy; svtyper-sso additionally depends on cytoolz. If the dependencies aren’t already available on your system, pip will attempt to download and install them.

svtyper vs svtyper-sso

svtyper is the original implementation of the genotyping algorithm and works with multiple samples. svtyper-sso is an alternative, parallelized implementation optimized for genotyping a single sample; it uses the multiprocessing module to spread work across multiple CPU cores and can offer a 2x or greater speedup, depending on how many cores are used.

NOTE: svtyper-sso is not yet stable. There are minor logging differences between the two implementations, and svtyper-sso may exit prematurely with an error when processing CRAM files.
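The batching-plus-worker-pool pattern described above can be sketched as follows. This is a generic illustration of the multiprocessing module, not svtyper-sso's code: `make_batches`, `genotype_batch`, and `genotype_in_parallel` are hypothetical names, and `genotype_batch` is a stand-in that returns a fixed placeholder call for each site.

```python
from multiprocessing import Pool


def make_batches(items, batch_size):
    """Split a list into consecutive batches of at most batch_size items."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


def genotype_batch(batch):
    """Hypothetical stand-in for the per-batch genotyping work."""
    return [(site, "0/1") for site in batch]


def genotype_in_parallel(sites, cores=2, batch_size=1000):
    """Process batches on a worker pool and return results in input order."""
    batches = make_batches(sites, batch_size)
    pool = Pool(processes=cores)
    try:
        # Pool.map returns results in the same order as the input batches.
        results = pool.map(genotype_batch, batches)
    finally:
        pool.close()
        pool.join()
    # Flatten the per-batch result lists back into one list.
    return [item for batch in results for item in batch]
```

When trying this out, call `genotype_in_parallel` from under an `if __name__ == "__main__":` guard so that worker processes can import the module safely. Larger batches reduce scheduling overhead, while smaller batches balance load more evenly across cores.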

Example Usage

svtyper

As a Command Line Python Script

    svtyper \
        -i sv.vcf \
        -B sample.bam \
        -l sample.bam.json \
        > sv.gt.vcf

As a Python Library

    import svtyper.classic as svt

    input_vcf = "/path/to/input.vcf"
    input_bam = "/path/to/input.bam"
    library_info = "/path/to/library_info.json"
    output_vcf = "/path/to/output.vcf"

    with open(input_vcf, "r") as inf, open(output_vcf, "w") as outf:
        svt.sv_genotype(bam_string=input_bam,
                        vcf_in=inf,
                        vcf_out=outf,
                        min_aligned=20,
                        split_weight=1,
                        disc_weight=1,
                        num_samp=1000000,
                        lib_info_path=library_info,
                        debug=False,
                        alignment_outpath=None,
                        ref_fasta=None,
                        sum_quals=False,
                        max_reads=None)
    # Results will be inside the /path/to/output.vcf file
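Once the output VCF has been written, a quick sanity check is to tally the genotype calls it contains. The helper below is a minimal sketch that relies only on the standard tab-delimited VCF layout (FORMAT in column 9, samples from column 10 onward); `tally_genotypes` is a hypothetical name and is not part of the svtyper package.

```python
from collections import Counter


def tally_genotypes(vcf_lines, sample_index=0):
    """Count GT values for one sample across the records of a VCF."""
    counts = Counter()
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # skip meta-information and header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 10 + sample_index:
            continue  # record has no column for the requested sample
        format_keys = fields[8].split(":")
        if "GT" not in format_keys:
            continue  # no genotype field on this record
        sample_values = fields[9 + sample_index].split(":")
        counts[sample_values[format_keys.index("GT")]] += 1
    return counts
```

For example, `tally_genotypes(open("/path/to/output.vcf"))` returns a `Counter` keyed by genotype string such as `0/1` or `1/1`.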

svtyper-sso

As a Command Line Python Script

    # --core: number of CPU cores to use
    # --batch_size: number of SVs to process in a single batch (default: 1000)
    # --max_reads: skip genotyping if the SV contains more valid reads than this threshold (default: 1000)
    svtyper-sso \
        --core 2 \
        --batch_size 1000 \
        --max_reads 1000 \
        -i sv.vcf \
        -B sample.bam \
        -l sample.bam.json \
        > sv.gt.vcf

As a Python Library

    import svtyper.singlesample as sso

    input_vcf = "/path/to/input.vcf"
    input_bam = "/path/to/input.bam"
    library_info = "/path/to/library_info.json"
    output_vcf = "/path/to/output.vcf"

    with open(input_vcf, "r") as inf, open(output_vcf, "w") as outf:
        sso.sso_genotype(bam_string=input_bam,
                         vcf_in=inf,
                         vcf_out=outf,
                         min_aligned=20,
                         split_weight=1,
                         disc_weight=1,
                         num_samp=1000000,
                         lib_info_path=library_info,
                         debug=False,
                         alignment_outpath=None,
                         ref_fasta=None,
                         sum_quals=False,
                         max_reads=1000,
                         cores=2,
                         batch_size=1000)
    # Results will be inside the /path/to/output.vcf file

Development

Requirements:

Setting Up a Development Environment

Using virtualenv

    git clone https://github.com/hall-lab/svtyper.git
    cd svtyper
    virtualenv myvenv
    source myvenv/bin/activate
    pip install -e .
    # <add, edit, or delete code>
    make test
    # when you're finished with development
    git push <remote-name> <branch>
    deactivate
    cd .. && rm -rf svtyper

Using conda

    git clone https://github.com/hall-lab/svtyper.git
    cd svtyper
    # type 'y' when prompted with "proceed ([y]/n)?"
    conda create --channel bioconda --name mycenv pysam numpy scipy cytoolz
    source activate mycenv
    pip install -e .
    # <add, edit, or delete code>
    make test
    # when you're finished with development
    git push <remote-name> <branch>
    source deactivate
    cd .. && rm -rf svtyper
    conda remove --name mycenv --all

Troubleshooting

Many common issues are related to abnormal insert size distributions in the BAM file. SVTyper provides methods to assess and visualize the characteristics of sequencing libraries.

Running SVTyper with the -l flag creates a JSON file with essential metrics on a BAM file. SVTyper samples the first N reads from the file (1 million by default) to parse the libraries, read groups, and insert size histograms. This can be done without a VCF file.

    svtyper \
        -B my.bam \
        -l my.bam.json

The lib_stats.R script produces insert size histograms from the JSON file:

    scripts/lib_stats.R my.bam.json my.bam.json.pdf
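When a plot is not convenient, an insert size histogram can also be summarized numerically. The sketch below computes the mean and standard deviation of a histogram given as a mapping from insert size to read count. The layout of the JSON file is not described here, so the input mapping and the name `histogram_stats` are assumptions; extracting the histogram from the actual file is left to the reader.

```python
from math import sqrt


def histogram_stats(histogram):
    """Return (mean, standard deviation) of a {insert_size: count} mapping."""
    # JSON object keys are strings, so coerce insert sizes to int.
    pairs = [(int(size), count) for size, count in histogram.items()]
    total = sum(count for _, count in pairs)
    if total == 0:
        raise ValueError("histogram is empty")
    mean = sum(size * count for size, count in pairs) / float(total)
    variance = sum(count * (size - mean) ** 2 for size, count in pairs) / float(total)
    return mean, sqrt(variance)
```

An unusually wide standard deviation, or a mean far from the expected fragment length, is the kind of abnormal distribution that tends to cause genotyping problems.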

Figure: Insert size histogram

Citation

C Chiang, R M Layer, G G Faust, M R Lindberg, D B Rose, E P Garrison, G T Marth, A R Quinlan, and I M Hall. SpeedSeq: ultra-fast personal genome analysis and interpretation. Nat Meth 12, 966–968 (2015). doi:10.1038/nmeth.3505.

http://www.nature.com/nmeth/journal/vaop/ncurrent/full/nmeth.3505.html