项目作者: merlyescalona

项目描述 :
GACMSA: Get Allele counts from Multiple Sequence Alignment files
高级语言: Python
项目地址: git://github.com/merlyescalona/gacmsa.git
创建时间: 2017-11-16T11:31:34Z
项目社区:https://github.com/merlyescalona/gacmsa

开源协议:GNU General Public License v3.0

下载


GACMSA: Get Allelic Count from Multiple Sequence Alignments

Build Status

Coverage Status

© 2017 Merly Escalona (merlyescalona@uvigo.es)

University of Vigo, Spain, http://darwin.uvigo.es

Assumptions

  • Assumes the input file is a Multiple Sequence Alignment file, meaning, all
    sequences have the same length.

Install

Clone this repository and install it with user privileges

  1. git clone https://github.com/merlyescalona/gacmsa.git
  2. cd gacmsa
  3. python setup.py instal --user

Usage:

  1. usage: gacmsa.py -i <filepath> -o <filepath> [-l <log_level>] [-v] [-h]

Required arguments:

  • -i <filepath>, --input <filepath>: Multiple-alignment sequence file (MSA).
  • -o <filepath>, --output <filepath>: Output file

Optional arguments:

  • -l <log_level>, --log <log_level>: Specified level of log that will be shown through the standard output. Entire log will be stored in a separate file.
    • Values:[‘DEBUG’, ‘INFO’, ‘WARNING’, ‘ERROR’]. Default: INFO.

Information arguments:

  • -v, --version : Show program’s version number and exit
  • -h, --help: Show this help message and exit

Input

  • a Multiple Sequence alignment file path (MSA)

Output

  • Output file (plain text), opened with cat output.txt | less -S will give you:
  1. POSITION 1 2 3 4 5 6 7 8 9 10
  2. A 1 0 0 0 0 0 0 0 0 0
  3. C 0 0 0 0 0 31 0 0 31 0
  4. T 30 0 30 0 0 0 0 6 0 30
  5. G 0 31 1 31 31 0 31 25 0 1
  6. N 0 0 0 0 0 0 0 0 0 0
  7. GAP 0 0 0 0 0 0 0 0 0 0
  8. TOTAL 2 1 2 1 1 1 1 2 1 2
  9. 31 (10)

Where:

  • Line 1: POSITION corresponds to the base position, and will have as many columns as the length of the given sequences.
  • Line 2: A, corresponds to the count of A alleles in POSITION, for all the sequences.
  • Line 3: C, corresponds to the count of C alleles in POSITION, for all the sequences.
  • Line 4: G, corresponds to the count of G alleles in POSITION, for all the sequences.
  • Line 5: T, corresponds to the count of T alleles in POSITION, for all the sequences.
  • Line 6: N, corresponds to the count of Ns in POSITION, for all the sequences.
  • Line 7: GAP, corresponds to the count of GAPs (-) in POSITION, for all the sequences.
  • Line 8: TOTAL, corresponds to the number of alleles for the specific POSITION.
  • Line 9: gives the number of sequences and the size of the sequences in the given file. Format: NumberOfSequences (LengthOfSequence)

Quick start guide

There’s a test file under test/data.fasta with 63 individuals, and sequences of 10.000 bp.

  1. $gacmsa -i data.fasta -o gacmsa.output
  2. 22/11/2017 01:08:46 PM - INFO: Start...
  3. 22/11/2017 01:08:46 PM - INFO: Parsing MSA file: /home/user/test/data.fasta
  4. 22/11/2017 01:08:46 PM - INFO: Calculating Allelic counts
  5. 22/11/2017 01:08:47 PM - WARNING: Sequence size: 10000 and matrix cols: 10000
  6. 22/11/2017 01:08:48 PM - INFO: Writing output into: /home/user/test/gacmsa.output
  7. 22/11/2017 01:08:48 PM - INFO: Process finished
  8. 22/11/2017 01:08:48 PM - INFO: GACMSA finished properly.
  9. 22/11/2017 01:08:48 PM - INFO: Elapsed time (ETA): 0:00:01.440117
  10. 22/11/2017 01:08:48 PM - INFO: Ending at: Wed, Nov 22 2017. 01:08:48 PM

Then we can user R, with the following code to obtain the frequencies of the monoallelic, di-allelic, tri-allelic and tetra-allelic sites (if they exist in the MSA).

  1. library(ggplot2)
  2. mydata=read.table("gacmsa.notail.output", header=T, nrow=7)
  3. rowNames=mydata[,1]
  4. rownames(mydata)=rowNames
  5. mydata=mydata[,2:ncol(mydata)]
  6. mydata=data.frame(t(mydata))
  7. rownames(mydata)=paste0("POS",1:nrow(mydata))
  8. ggplot(mydata, aes(x=mydata$TOTAL))+geom_histogram(aes(y=..density..), bins=4) + theme_classic() +
  9. stat_bin(aes(label=..count.., y=..density..), bins=4, geom="text", col="black", vjust=-1.5) +
  10. xlab("Allele count") + ylab("Density - Sites")

GACMSA output test