RNA-Sequencing Read-Alignment
In this project, we have implemented some basic BWT functions that we covered in class, and then implemented a
simplified version of an aligner for RNA sequencing reads.
project.py
. All the python files can be found inproject.zip
; We have created and imported other files for helper functions. The util files are in the same directoryproject.py
; Lastly, we included a brief write-up describing the approach we took for the project inwrite-up.pdf
.Algorithm.ipynb
.['$', 'A', 'C', 'T', 'G']
, where “$” will always only be the terminator.s
parameter in all the function is already terminated by “$”.We have provided a simple Python doctest as sanity check for the BWT functions. You can run this by running
python -m doctest project.py
on the command line.
get_suffix_array(s)
Naive implementation of suffix array generation (0-indexed). This code is fast enough so we have enough time inAligner.__init__
(see bottom).
Input:
s
: a string of the alphabet ['A', 'C', 'G', 'T']
already terminated by a unique delimiter ‘$’
Output:
>>> get_suffix_array('GATAGACA$')
[8, 7, 5, 3, 1, 6, 4, 0, 2]
get_bwt(s, sa)
Input:
s
: a string terminated by a unique delimiter ‘$’sa
: the suffix array of s
Output:
L
: BWT of s
as a stringget_F(L)
Input:
L = get_bwt(s)
Output:
F
, first column in Pi_sorted
get_M(F)
Returns the helper data structure M
(using the notation from class). M
is a dictionary that maps character
strings to start indices. i.e. M[c]
is the first occurrence of "c"
in F
.
If a character "c"
does not exist in F
, we set M[c] = -1
get_occ(L)
Returns the helper data structure OCC
(using the notation from class). OCC
should be a dictionary that maps
string character to a list of integers. If c
is a string character and i
is an integer, then OCC[c][i]
gives
the number of occurrences of character "c"
in the bwt string up to and including index i
.
exact_suffix_matches(p, M, occ)
Find the positions within the suffix array sa of the longest possible suffix of p
that is a substring of s
(the original string).
Note that such positions must be consecutive, so we want the range of positions.
Input:
p
: the pattern stringM
, occ
: buckets and repeats information used by sp
, ep
Output:
a tuple (range, length)
range
: a tuple (start inclusive, end exclusive) of the indices in sa
that containsp
as a prefix. range=None
if no indices matches any suffix of p
length
: length of the longest suffix of p
found in s
. length=0
if no indices matches any suffix of p
An example return value would be ((2, 5), 7)
. This means that p[len(p) - 7 : len(p)]
is
found in s
and matches positions 2
, 3
, and 4
in the suffix array.
>>> s = 'ACGT' * 10 + '$'
>>> sa = get_suffix_array(s)
>>> sa
[40, 36, 32, 28, 24, 20, 16, 12, 8, 4, 0, 37, 33, 29, 25, 21, 17, 13, 9, 5, 1, 38, 34, 30, 26, 22, 18, 14, 10, 6, 2, 39, 35, 31, 27, 23, 19, 15, 11, 7, 3]
>>> L = get_bwt(s, sa)
>>> L
'TTTTTTTTTT$AAAAAAAAAACCCCCCCCCCGGGGGGGGGG'
>>> F = get_F(L)
>>> F
'$AAAAAAAAAACCCCCCCCCCGGGGGGGGGGTTTTTTTTTT'
>>> M = get_M(F)
>>> sorted(M.items())
[('$', 0), ('A', 1), ('C', 11), ('G', 21), ('T', 31)]
>>> occ = get_occ(L)
>>> type(occ) == dict, type(occ['$']) == list, type(occ['$'][0]) == int
(True, True, True)
>>> occ['$']
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
>>> exact_suffix_matches('ACTGA', M, occ)
((1, 11), 1)
>>> exact_suffix_matches('$', M, occ)
((0, 1), 1)
>>> exact_suffix_matches('AA', M, occ)
((1, 11), 1)
In this part of the project, we implemented some method to align RNA reads to genome. We are given a reduced size genome
of roughly 11 million bases with the location of genes, isoforms, and exons. We used an alignment technique discussed in
lecture 13 (TODO).
Our reads are generated from the transcriptome of the given genome. The given genome sequence corresponds to the
forward strand; We assumed the genes/isoforms/exons are all on the forward strand, and all reads match to the forward
strand. There are no insertions or deletions (but there could be mismatches) in the reads (i.e. each position in the
read corresponds to some position in the genome). In addition, the genes/isoforms/exons that some reads are generated
from have been hidden, i.e. we assumed these “unknown” genes will not be passed to our Aligner class inAligner.__init__
. In addition, some reads could be randomly generated. We have not tested against being able to align
reads to the transcriptome if we are also not able to align these to the transcriptome. See
Evaluation/Scoring for more details on what we have been evaluated on. In any read generated from
the genome (visible or unknown), there are at most 6 mismatches, so any alignment with more than 6 mismatches is
invalid.
Aligner
ClassWe implemented the Aligner
class in project.py
. It will be initialized with genome information, callingAligner.align
on a number of sequences to get our outputs, we can, then, evaluate the outputs.
In Aligner.__init__
, we are given the genome sequence as well as genes
, which is a list of Gene
container objects
that we have defined in shared.py
. Each Gene
contains a list of Isoform
objects that correspond to the gene,
and each Isoform
contains a list of Exon
objects. You could look at how these classes are defined in the file,
but you cannot modify these classes.
Since it has been specified that there is no insertion or deletion, an alignment of a read to the genome can be thought
as k
(usually k
will be 1 or 2) separate ungapped alignments between the read and the genome (with some start index
in the read and start index in the genome). Thus, we specified the alignment as a python list of k
tuples of(<read_start_index>, <genome_start_index>, <length>)
. For example, an alignment of [(0, 2, 3), (3, 6, 10)]
specifies
that the 0th position of the read aligns to the 2nd position of the genome for 3 bases, and then the 3rd position of the
read aligns to the 6th positon of the genome for 10 bases. If we can’t find an alignment for a read, we return an empty
list, []
.
Warning: if the ranges of two consecutive alignment pieces overlap in the read, i.e. if<read_start_1> + <length_1> > <read_start_2>
, the second piece of the alignment will be discarded and the new
alignment will be scored accordingly. This is checked for with the provided functions in evaluation.py
(see
Evaluation/Scoring).
Our approach to the RNA sequencing project involved first alignment to the annotated transcriptome and then for the un-aligned reads, alignment to the genome. We created all the isoforms of all the genes and tried to align to the reads without any gaps, greedily choosing the mismatches, and bounding them at MAX_NUM_MISMATCHES
. This meant that we checked the equality of the read sequence and each isoform element-by-element starting at index 0 and counted every inequality as a mismatch and terminated if the number of mismatches exceeded MAXNUM_MISMATCHES
. This gave us a fast runtime that was on average less than 0.5s
.
For the reads that did not align to the annotated transcriptome, we tried to align to the genome. After experimentation and evaluating our alignments with the alignment to hidden gene transcriptome, we found it best to limit the number of parts of our alignment to the genome to 1 as that would cover more than half of the hidden alignments. However, to find a single ungapped alignment for the reads to genome, we had to consider all the permutations of the read sequence. This meant that we went through the read sequence element by element and considered all the possible permutations of the bases at every index that would result in an exact match in the genome sequence. We achieved this by using a recursive function (find_max_so_far
) that is documented in project.py
along with all the helper functions used in the alignment.
We are given files containing examples of what kind of genome and read sequences our alignment might be tested on.
We parsed the files and tried our algorithm on these examples since they were representative of the evaluation set.
genome.fa
is a FASTA file with the genome sequence.reads.fa
is a FASTA files with read sequences. Note that the file does not include base quality (PHRED) scores, asgenes.tab
is a tab-separated file containing three types of rows:
gene_id
then a semicolon-separated list of isoform_id
isoform_id
then a sorted semicolon-separated list ofexon_id
an exon row begins with “exon” and specifies the exon_id
, start
, and end
of the exon.
Moreover, the genomic elements that are “unknown” (hidden when the Aligner
is tested) are prefixed with “unknown”.
If a gene is unknown, its corresponding isoforms and exons will also be marked as unknown. We parse this file to
construct our python set of genes that we feed into our `Aligner.init`. We made sure to construct each Gene
with all of its corresponding Isoform
objects, and etc.
We always prioritize aligning reads to known isoforms (in genes.tab
) with 6 or less mismatches. Only if we can’t
align with 6 or less mismatches, we try to align reads to other parts of the genome with 6 or less mismatches (as these
regions may represent unknown genes; there are no genes with unknown exons). We also minimize the number of mismatches,
but do not choose an alignment to an unknown isoform with less mismatches over an alignment to a known isoform with
more mismatches (but still 6 or less).
If we don’t align some indices of a read, those indices will be counted as mismatches (so it’s always best to align all
indices of the read). Since there is no insertion or deletion, our alignment will be scored based on one-to-one and
consecutive correspondence to the transcriptome. i.e. if we match a read to an isoform, we make sure that the
read aligns consecutively to the transcript generated by concatenating the exons within the isoform, there are no
gaps when aligning to the transcriptome.
evaluation.py
To make sure we are able to evaluate the aligner’s alignments, the function used to evaluate the alignments is provided
in evaluation.py
. We, first, need to construct a python set of known Isoform
and a python set of unknown Isoform
objects. We can, then, build an index by calling the index_isoform_locations
function, and use this index to runevaluate_alignment
on the alignment that we generate with Aligner.align
.
For a 11 million base genome, our Aligner.__init__
method takes less than 500 seconds to run. Our Aligner.align
method takes on average less than 0.5 seconds per read. In addition, we were penalized if our Aligner.align
was
more than 2 times slower than the mean runtime of everyone in the class.
This assignment was due on Wednesday 11/28 11:59pm PST. We created an Instructional account for CS 176 and submitted
our project there. Only one person per our pair submitted, but we both submitted as well just to be safe. We, each,
indicated the name of our partner when prompted by Glookup. As stated above, our submission keeps all required
(skeleton) functions/classes/methods in project.py
. All the python files can be found in project.zip
; We have
created and imported other files for helper functions. The util files are in the same directory structure asproject.py
. We may, depending on the reason, have another chance to submit with a penalty if our first submission
crashes. Lastly, we included a brief write-up describing the approach we took for the project in write-up.pdf
.
Once the assignment was graded, a scores table for all students was released, with the follow information:
[1 - 0.5 * (the aligner’s rank - our rank)]
. Notice we get extraour rank > the aligner's rank
.0.75 * gene_alignment + 0.2 * hidden_gene_alignment + 0.05 * fake_alignment
Aligner.__init__
. We got full credit if our_time < 500 seconds
0.15 * SA + 0.15 * match + 0.5 * weighted_alignment + 0.1 * (setup_time_under_threshold) + 0.1 * (alignment_time_under_threshold)
correctness_score
if we’re not late, else 0.85 * correctness_score