Project author: luislorenzom

Project description:
K-mer spectrum corrector based on Hadoop
Language: Java
Repository: git://github.com/luislorenzom/hmusket.git
Created: 2018-02-08T18:35:32Z
Project page: https://github.com/luislorenzom/hmusket

License: GNU General Public License v3.0


hmusket

K-mer spectrum corrector based on Hadoop

Requisites

  • Java Development Kit (JDK) version 1.6 or above
  • A working Apache Maven distribution, version 3 or above
  • A version of hsp (Hadoop Sequence Parser) in your Maven repository
  • Hadoop 2.8.0 or above
  • g++ (development version: 6.3.0)
  • GNU Make (development version: 4.1)
  • Commons CLI upgraded from version 1.2 to 1.4 in hadoop/share/hadoop/common (TODO: improve this)
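
Before building, it can help to confirm that the tools in this list are installed. The following is a minimal sketch that only tests whether each command is on the PATH. It does not compare versions, and the command names (java, mvn, hadoop, g++, make) are the usual ones rather than names this README specifies.

```shell
#!/bin/sh
# Report which of the given tools are on PATH.
# Only presence is checked; versions still have to be verified by hand.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "missing: $tool"
            missing=$((missing + 1))
        fi
    done
    # Exit status is the number of missing tools (0 means all present).
    return "$missing"
}

check_tools java mvn hadoop g++ make || echo "Install the missing tools before building."
```

Each tool's own version flag (for example `mvn -v` or `g++ --version`) can then be used to compare against the versions listed above.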

Authors

Do you want to compile hmusket from scratch?

  • Clone this project
  • Generate the header file for the Java-side call (resource/make.sh)
  • In src/main/native/MakeFile.common, set the location of the musket source files.
    • If you do not have a copy of the musket source code, download it first
  • In src/main/native, run make to compile musket and create the shared library
  • Once the shared library has been created, copy it from the lib folder (in the project root) to $HADOOP_HOME/lib/native
  • Additionally, replace the commons-cli library in your Hadoop cluster
    • Copy commons-cli-1.4 (from .m2/) to $HADOOP_HOME/share/hadoop/common/lib
  • Finally, compile hmusket with Maven: mvn clean package
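
The steps above can be sketched as a small script, to be run from the root of the cloned project. This is an illustration under several assumptions, not the author's build script: make.sh is invoked with sh from the project root, the local Maven repository is in its default location (~/.m2/repository), MakeFile.common has already been edited by hand, and /path/to/hadoop is a placeholder used when HADOOP_HOME is unset. The name of the shared library is not given in this README, so everything in lib/ is copied. By default the script only prints the commands; set DRY_RUN=0 to execute them.

```shell
#!/bin/sh
# Print each command by default; execute it only when DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

build_hmusket() {
    hadoop_home=${HADOOP_HOME:-/path/to/hadoop}   # placeholder when unset
    # Default Maven repository layout for commons-cli 1.4 (assumed location).
    cli_jar="$HOME/.m2/repository/commons-cli/commons-cli/1.4/commons-cli-1.4.jar"

    run sh resource/make.sh || return 1            # header for the Java-side call
    run make -C src/main/native || return 1        # musket + shared library
    for lib in lib/*; do                           # library name not given in README
        run cp "$lib" "$hadoop_home/lib/native/" || return 1
    done
    run cp "$cli_jar" "$hadoop_home/share/hadoop/common/lib/" || return 1
    run mvn clean package
}

build_hmusket
```

Reviewing the printed commands before running with DRY_RUN=0 is a cheap way to catch a wrong HADOOP_HOME or a missing jar before anything is copied into the Hadoop installation.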

How to run hmusket?

Flags cheat sheet

usage: hmusket -fileIn -fileOut -fileType [-inorder] [-k ] [-lowercase]
       [-maxbuff ] [-maxerr ] [-maxiter ] [-maxtrim ] [-minmulti ]
       [-multik ] [-o ] [-omulti ] [-p ] [-zlib ]


-fileIn  Input file containing the sequences

-fileOut  File where the output is saved

-fileType  File type: a for FASTA files and q for FASTQ files

-k  Takes two parameters: the k-mer size and the estimated total number of k-mers for that k-mer size

-lowercase  Write corrected bases in lowercase, default=0

-maxbuff  Capacity of the message buffer for each worker, default=1024

-maxerr  Maximal number of mutations in any region of length k, default=4

-maxiter  Maximal number of correcting iterations per k-mer size, default=2

-maxtrim  Maximal number of bases that can be trimmed, default=0

-minmulti  Minimum multiplicity for correct k-mers [only applicable when not using multiple k-mer sizes], default=0

-multik  Enable the use of multiple k-mer sizes, default=0

-o  The single output file name

-omulti  Prefix of output file names, one output per input

-p  Number of threads [>=2], default=2

-zlib  Zlib-compressed output, default=0

Some run examples

    # Single-end dataset
    user@host:~$ hadoop jar Hmusket-1.0.jar es.udc.gac.hmusket.HMusket -fileIn ~/datasets/single-end.fastq -fileOut output1 -fileType q
    # Paired-end dataset
    user@host:~$ hadoop jar Hmusket-1.0.jar es.udc.gac.hmusket.HMusket -fileIn ~/datasets/pair-end_1.fasta ~/datasets/pair-end_2.fasta -fileOut output2 -fileType a -p 4
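
In the examples above, -fileType is q for the FASTQ input and a for the FASTA inputs. A wrapper script could derive that value from the file name, as in the sketch below. The helper hmusket_file_type is hypothetical and not part of hmusket, and the accepted extensions (.fastq/.fq and .fasta/.fa) are an assumption. The script prints the resulting hadoop command instead of running it.

```shell
#!/bin/sh
# Hypothetical helper: choose the -fileType value from the file extension.
# q and a are taken from the run examples; the extension list is assumed.
hmusket_file_type() {
    case "$1" in
        *.fastq|*.fq) echo q ;;
        *.fasta|*.fa) echo a ;;
        *) echo "unknown extension: $1" >&2; return 1 ;;
    esac
}

input="$HOME/datasets/single-end.fastq"
file_type=$(hmusket_file_type "$input") &&
    echo hadoop jar Hmusket-1.0.jar es.udc.gac.hmusket.HMusket \
        -fileIn "$input" -fileOut output1 -fileType "$file_type"
```

Failing on an unrecognized extension, rather than guessing a default, keeps a mistyped file name from being submitted to the cluster with the wrong parser.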

License

This software is distributed as free software and is publicly available under the GPLv3 license (see the LICENSE file for more details).