Burrows-Wheeler Aligner for short-read alignment (see minimap2 for long-read alignment)
Note: minimap2 has replaced BWA-MEM for PacBio and Nanopore read
alignment. It retains all major BWA-MEM features, but is ~50 times as fast,
more versatile, more accurate and produces better base-level alignment.
BWA-MEM2 is 50-100% faster than BWA-MEM and outputs identical alignments.
git clone https://github.com/lh3/bwa.git
cd bwa; make
./bwa index ref.fa
./bwa mem ref.fa read-se.fq.gz | gzip -3 > aln-se.sam.gz
./bwa mem ref.fa read1.fq read2.fq | gzip -3 > aln-pe.sam.gz
BWA is a software package for mapping DNA sequences against a large reference
genome, such as the human genome. It consists of three algorithms:
BWA-backtrack, BWA-SW and BWA-MEM. The first algorithm is designed for Illumina
sequence reads up to 100bp, while the other two are for longer sequences ranging from
70bp to a few megabases. BWA-MEM and BWA-SW share similar features such as the
support of long reads and chimeric alignment, but BWA-MEM, which is the latest,
is generally recommended as it is faster and more accurate. BWA-MEM also has
better performance than BWA-backtrack for 70-100bp Illumina reads.
For all the algorithms, BWA first needs to construct the FM-index for the
reference genome (the index command). Alignment algorithms are invoked with
different sub-commands: aln/samse/sampe for BWA-backtrack,
bwasw for BWA-SW and mem for the BWA-MEM algorithm.
BWA is released under GPLv3. The latest source code is freely
available at GitHub. Released packages can be downloaded at
SourceForge. After you acquire the source code, simply use make
to compile and copy the single executable bwa to the destination you want.
The only dependency required to build BWA is zlib.
Since 0.7.11, a precompiled binary for x86_64-linux is available in bwakit.
In addition to BWA, this self-contained package also comes with bwa-associated
and 3rd-party tools for proper BAM-to-FASTQ conversion, mapping to ALT contigs,
adapter trimming, duplicate marking, HLA typing and associated data files.
The detailed usage is described in the man page available together with the
source code. You can use man ./bwa.1 to view the man page in a terminal. The
HTML version of the man page can be found at the BWA website. If you
have questions about BWA, you may sign up for the mailing list and then send
the questions to bio-bwa-help@sourceforge.net. You may also ask questions
in forums such as BioStar and SEQanswers.
Li H. and Durbin R. (2009) Fast and accurate short read alignment with
Burrows-Wheeler transform. Bioinformatics, 25, 1754-1760. [PMID:
19451168]. (if you use the BWA-backtrack algorithm)
Li H. and Durbin R. (2010) Fast and accurate long-read alignment with
Burrows-Wheeler transform. Bioinformatics, 26, 589-595. [PMID:
20080505]. (if you use the BWA-SW algorithm)
Li H. (2013) Aligning sequence reads, clone sequences and assembly contigs
with BWA-MEM. arXiv:1303.3997v2 [q-bio.GN]. (if you use the BWA-MEM
algorithm or the fastmap command, or want to cite the whole BWA package)
Please note that the last reference is a preprint hosted at arXiv.org. I
have no plan to submit it to a peer-reviewed journal in the near future.
BWA works with a variety of DNA sequence data types, though the optimal
algorithm and settings may vary. The following list gives the recommended
settings:
Illumina/454/IonTorrent single-end reads longer than ~70bp or assembly
contigs up to a few megabases mapped to a closely related reference genome:
bwa mem ref.fa reads.fq > aln.sam
Illumina single-end reads shorter than ~70bp:
bwa aln ref.fa reads.fq > reads.sai; bwa samse ref.fa reads.sai reads.fq > aln-se.sam
Illumina/454/IonTorrent paired-end reads longer than ~70bp:
bwa mem ref.fa read1.fq read2.fq > aln-pe.sam
Illumina paired-end reads shorter than ~70bp:
bwa aln ref.fa read1.fq > read1.sai; bwa aln ref.fa read2.fq > read2.sai
bwa sampe ref.fa read1.sai read2.sai read1.fq read2.fq > aln-pe.sam
PacBio subreads or Oxford Nanopore reads to a reference genome:
bwa mem -x pacbio ref.fa reads.fq > aln.sam
bwa mem -x ont2d ref.fa reads.fq > aln.sam
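The choice between BWA-MEM and BWA-backtrack in the list above can be sketched as a small shell helper. This is only an illustration: `bwa_recommend` is a hypothetical function, not part of BWA, the file names are the placeholders used above, and the 70bp cut-off is the approximate threshold quoted in the list, not a hard limit. The PacBio and Nanopore presets are left out.

```sh
# Hypothetical helper (not part of BWA): print the command line suggested
# by the recommendations above.
# Usage: bwa_recommend <read_length_bp> <se|pe>
bwa_recommend() {
    len=$1
    layout=$2
    if [ "$len" -ge 70 ]; then
        # Reads of roughly 70bp or longer: BWA-MEM takes one or two FASTQ files.
        if [ "$layout" = pe ]; then
            echo "bwa mem ref.fa read1.fq read2.fq > aln-pe.sam"
        else
            echo "bwa mem ref.fa reads.fq > aln.sam"
        fi
    elif [ "$layout" = pe ]; then
        # Shorter paired-end reads: BWA-backtrack (aln on each end, then sampe).
        echo "bwa aln ref.fa read1.fq > read1.sai; bwa aln ref.fa read2.fq > read2.sai; bwa sampe ref.fa read1.sai read2.sai read1.fq read2.fq > aln-pe.sam"
    else
        # Shorter single-end reads: BWA-backtrack (aln, then samse).
        echo "bwa aln ref.fa reads.fq > reads.sai; bwa samse ref.fa reads.sai reads.fq > aln-se.sam"
    fi
}
```

For example, `bwa_recommend 150 pe` prints the paired-end BWA-MEM command, while `bwa_recommend 36 se` prints the aln/samse pipeline.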
BWA-MEM is recommended for query sequences longer than ~70bp for a variety of
error rates (or sequence divergence). Generally, BWA-MEM is more tolerant of
errors given longer query sequences as the chance of missing all seeds is small.
As is shown above, with non-default settings, BWA-MEM works with Oxford Nanopore
reads with a sequencing error rate over 20%.
BWA-SW and BWA-MEM perform local alignments. If there is a translocation, a gene
fusion or a long deletion, a read bridging the break point may have two hits,
occupying two lines in the SAM output. With the default setting of BWA-MEM, one
and only one line is primary and is soft clipped; other lines are tagged with
the 0x800 SAM flag (supplementary alignment) and are hard clipped.
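To tell the lines of such a split read apart, the FLAG column can be tested bit by bit. The sketch below assumes a decimal FLAG value as it appears in the second SAM column. `sam_record_kind` is a hypothetical helper name. The 0x100 (secondary) bit is not discussed above; it is taken from the SAM specification.

```sh
# Hypothetical helper: classify a decimal SAM FLAG value.
# 2048 = 0x800 (supplementary alignment), 256 = 0x100 (secondary alignment).
sam_record_kind() {
    flag=$1
    if [ $(( $flag & 2048 )) -ne 0 ]; then
        echo supplementary
    elif [ $(( $flag & 256 )) -ne 0 ]; then
        echo secondary
    else
        echo primary
    fi
}
```

If samtools is available, `samtools view -f 0x800 aln.bam` lists only the supplementary lines and `samtools view -F 0x800 aln.bam` excludes them.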
Since 0.6.x, all BWA algorithms work with a genome with a total length over
4GB. However, an individual chromosome should not be longer than 2GB.
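A reference can be checked against the per-chromosome limit before indexing. The sketch below reads a FASTA index (the `.fai` file written by `samtools faidx`, with the sequence name in column 1 and its length in column 2). `check_ref_lengths` is a hypothetical helper, and reading "2GB" as 2^31 bases is an assumption made here, not something the text states.

```sh
# Hypothetical check: read a .fai index on stdin, report sequences longer
# than 2^31 bases (assumed meaning of "2GB") and print the total length.
# Exits non-zero if any sequence exceeds the limit.
check_ref_lengths() {
    awk -F '\t' '
        BEGIN { bad = 0 }
        { total += $2 }
        $2 > 2147483648 { printf "TOO_LONG\t%s\t%.0f\n", $1, $2; bad = 1 }
        END { printf "TOTAL\t%.0f\n", total; exit bad }
    '
}
```

Typical use would be `samtools faidx ref.fa` followed by `check_ref_lengths < ref.fa.fai`.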
The two reads in a pair can legitimately have different mapping qualities.
Mapping quality is assigned to each individual read, not to a read
pair. It is possible that one read can be mapped unambiguously, but its mate
falls in a tandem repeat and thus its accurate position cannot be determined.
Internally BWA concatenates all reference sequences into one long sequence. A
read may be mapped to the junction of two adjacent reference sequences. In this
case, BWA-backtrack will flag the read as unmapped (0x4), but you will see
position, CIGAR and all the tags. A similar issue may occur with BWA-SW alignments
as well. BWA-MEM does not have this problem.
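Such records can be spotted by looking for the unmapped flag combined with a real CIGAR string. The following is a minimal sketch that reads SAM text on stdin. `find_flagged_unmapped_with_cigar` is a hypothetical helper name, and the column positions (FLAG in column 2, RNAME in 3, POS in 4, CIGAR in 6) follow the SAM specification.

```sh
# Hypothetical filter: print name, reference, position and CIGAR for records
# that carry the unmapped flag (0x4) but still report a CIGAR string.
# Bit 0x4 is tested with integer arithmetic because POSIX awk has no bitwise AND.
find_flagged_unmapped_with_cigar() {
    awk -F '\t' '
        /^@/ { next }                         # skip header lines
        int($2 / 4) % 2 == 1 && $6 != "*" {   # 0x4 set and CIGAR present
            print $1 "\t" $3 "\t" $4 "\t" $6
        }
    '
}
```

It can be run as `find_flagged_unmapped_with_cigar < aln-se.sam`.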
Since 0.7.11, BWA-MEM officially supports mapping to GRCh38+ALT.
BWA-backtrack and BWA-SW don’t properly support ALT mapping as of now. Please
see README-alt.md for details. Briefly, it is recommended to use
bwakit, the binary release of BWA, for generating the reference genome
and for mapping.
If you are not interested in hits to ALT contigs, it is okay to run BWA-MEM
without post-processing. The alignments produced this way are very close to
alignments against GRCh38 without ALT contigs. Nonetheless, applying
post-processing helps to reduce false mappings caused by reads from the
diverged part of ALT contigs and also enables HLA typing. It is recommended to
run the post-processing script.
Unusually high memory use or long running time on some batches is typically caused by a FASTQ generated from a coordinate-sorted BAM.
BWA uses a lot more memory for centromeric reads than for unique reads.
In a FASTQ file generated from a sequencing run, centromeric reads are rare in each batch and rarely cause trouble.
However, in a coordinate-sorted FASTQ file, a whole batch could consist of centromeric reads.
Such a batch will take a lot more memory and time to map; the insert size estimate will be distorted as well.
General rule: NEVER use Picard SamToFastq on a coordinate-sorted BAM;
use samtools collate+fastq instead.
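The rule can be sketched in shell as follows. `is_coordinate_sorted` and `bam_to_fastq` are hypothetical helper names, and `read1.fq` and `read2.fq` are placeholder output names. The check relies on the `SO:` field of the `@HD` header line defined by the SAM specification, and the conversion pipeline follows the example in the samtools fastq manual. A BAM without an `SO:` field is simply not detected, so treat this as a convenience check only.

```sh
# Hypothetical check: succeed if the SAM header on stdin declares
# coordinate sorting in its @HD line.
is_coordinate_sorted() {
    grep -q '^@HD.*SO:coordinate'
}

# Sketch: group reads by name with collate, then write paired FASTQ.
# Usage: bam_to_fastq in.bam   (requires samtools)
bam_to_fastq() {
    samtools collate -u -O "$1" |
        samtools fastq -1 read1.fq -2 read2.fq -0 /dev/null -s /dev/null -n
}
```

For example, `samtools view -H in.bam | is_coordinate_sorted && echo "coordinate-sorted: collate before converting"` warns about the problematic case, and `bam_to_fastq in.bam` performs the conversion.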