Project author: luislorenzom
Project description:
K-mer spectrum corrector based on Hadoop
Programming language: Java
Project URL: git://github.com/luislorenzom/hmusket.git
hmusket
K-mer spectrum corrector based on Hadoop
Requisites
- Java Development Kit (JDK) version 1.6 or above
- A working Apache Maven distribution, version 3 or above
- A version of hsp (Hadoop Sequence Parser) in your Maven repository
- Hadoop 2.8.0 or above
- g++ (version used in development: 6.3.0)
- GNU Make (version used in development: 4.1)
- commons-cli upgraded from version 1.2 to 1.4 in hadoop/share/hadoop/common (TODO: improve this)
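The tools in the list above can be checked from a shell before building. The following is a minimal sketch and not part of hmusket: `check_tool` is a hypothetical helper name, the command names are the usual ones for each tool, and the script only tests that each command is on the PATH. Version numbers still have to be compared by hand.

```shell
#!/bin/sh
# Hypothetical helper (not part of hmusket): report whether a command is on the PATH.
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "found: $1"
  else
    echo "missing: $1"
    return 1
  fi
}

# Tools named in the requisites list (assumption: they use their usual command names).
missing=0
for tool in java mvn hadoop g++ make; do
  check_tool "$tool" || missing=$((missing + 1))
done
echo "$missing required tool(s) missing"
```

To compare versions, each tool can print its own: `java -version`, `mvn -version`, `hadoop version`, `g++ --version` and `make --version`.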
Authors
Do you want to compile hmusket from scratch?
- Clone this project
- Generate the header file for the Java-side call (resource/make.sh)
- In src/main/native/MakeFile.common, set the location of the musket source files
- Also in src/main/native, run make to compile musket and create the shared library
- Once the shared library has been created, copy it from the lib folder (in the project root) to $HADOOP_HOME/lib/native
- Additionally, you have to replace the commons-cli library in your Hadoop cluster
- Copy commons-cli-1.4 (from .m2/) to $HADOOP_HOME/share/hadoop/common/lib
- Finally, compile hmusket with Maven: mvn clean package
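The build steps above can be strung together roughly as follows. This is a dry-run sketch under several assumptions, not the authors' build script: `run` and `build_hmusket` are hypothetical helpers, resource/make.sh is assumed to be launched from the project root, the shared library is assumed to end in .so, and the commons-cli jar is assumed to sit in the default Maven local repository layout. By default the commands are only printed; set RUN=1 to execute them.

```shell
#!/bin/sh
# Placeholder location; export HADOOP_HOME to point at the real installation.
HADOOP_HOME="${HADOOP_HOME:-/path/to/hadoop}"

# Hypothetical helper: print the command, or execute it when RUN=1.
run() {
  if [ "${RUN:-0}" = "1" ]; then
    "$@"
  else
    echo "+ $*"
  fi
}

build_hmusket() {
  run git clone git://github.com/luislorenzom/hmusket.git
  run cd hmusket
  # Generate the header file (assumption: the script is launched from the project root).
  run sh resource/make.sh
  # Manual step first: edit src/main/native/MakeFile.common to point at the musket sources.
  run make -C src/main/native
  # Assumption: the shared library ends in .so; adjust the pattern to the real file name.
  run cp lib/*.so "$HADOOP_HOME/lib/native/"
  # Assumption: default Maven layout; the jar must already be in the local repository.
  run cp "$HOME/.m2/repository/commons-cli/commons-cli/1.4/commons-cli-1.4.jar" \
    "$HADOOP_HOME/share/hadoop/common/lib/"
  run mvn clean package
}

build_hmusket
```

Reading the printed commands before running with RUN=1 is a cheap way to confirm that the paths match the local cluster.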
How to run hmusket?
Flags cheat sheet
usage: hmusket -fileIn -fileOut -fileType
       [-inorder] [-k ] [-lowercase] [-maxbuff ] [-maxerr ]
       [-maxiter ] [-maxtrim ] [-minmulti ] [-multik ]
       [-o ] [-omulti ] [-p ] [-zlib ]
-fileIn     Input file containing the sequences
-fileOut    File where the output will be saved
-fileType   File type: a for FASTA files and q for FASTQ files
-k          Specify two parameters: k-mer size and estimated total number of k-mers for this k-mer size
-lowercase  Write corrected bases in lowercase, default=0
-maxbuff    Capacity of the message buffer for each worker, default=1024
-maxerr     Maximal number of mutations in any region of length k, default=4
-maxiter    Maximal number of correcting iterations per k-mer size, default=2
-maxtrim    Maximal number of bases that can be trimmed, default=0
-minmulti   Minimum multiplicity for correct k-mers [only applicable when not using multiple k-mer sizes], default=0
-multik     Enable the use of multiple k-mer sizes, default=0
-o          The single output file name
-omulti     Prefix of output file names, one output file per input file
-p          Number of threads [>=2], default=2
-zlib       Zlib-compressed output, default=0
Some run examples
# Single-end dataset
user@host:~$ hadoop jar Hmusket-1.0.jar es.udc.gac.hmusket.HMusket -fileIn ~/datasets/single-end.fastq -fileOut output1 -fileType q
# Paired-end dataset
user@host:~$ hadoop jar Hmusket-1.0.jar es.udc.gac.hmusket.HMusket -fileIn ~/datasets/pair-end_1.fasta ~/datasets/pair-end_2.fasta -fileOut output2 -fileType a -p 4
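Both examples above pass -fileType by hand: q for the FASTQ input and a for the FASTA inputs. A small wrapper could derive the value from the input file name instead. This is only a sketch: `file_type_for` and `hmusket_cmd` are hypothetical helper names, and the extension lists are an assumption about common naming conventions. Only the jar name, main class and flags come from the examples above. The wrapper prints the command rather than running it.

```shell
#!/bin/sh
# Hypothetical helper: map a file name to the -fileType value (a = FASTA, q = FASTQ).
file_type_for() {
  case "$1" in
    *.fastq|*.fq) echo q ;;
    *.fasta|*.fa) echo a ;;
    *) echo "unknown extension: $1" >&2; return 1 ;;
  esac
}

# Hypothetical helper: print the hadoop command for an output name and its input files.
# The file type is taken from the first input file.
hmusket_cmd() {
  out="$1"; shift
  ftype="$(file_type_for "$1")" || return 1
  echo "hadoop jar Hmusket-1.0.jar es.udc.gac.hmusket.HMusket -fileIn $* -fileOut $out -fileType $ftype"
}

hmusket_cmd output1 single-end.fastq
hmusket_cmd output2 pair-end_1.fasta pair-end_2.fasta
```

Further flags from the cheat sheet, such as -p 4, can be appended to the printed command before running it.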
License
This software is distributed as free software and is publicly available under the GPLv3 license (see the LICENSE file for more details)