Freiburg RNA Tools
Teaching RNA algorithms

# Teaching RNA algorithms

In order to enable an example-driven learning and teaching of RNA structure related algorithms, we provide here Javascript-based implementations for various algorithms for RNA structure and RNA-RNA interaction prediction. To reduce the level of complexity, all algorithms use a simple Nussinov-like energy scoring scheme, i.e. the energy of an RNA structure is directly related to its number of base pairs without further distinction.
Furthermore, we provide interactive implementations for general sequence alignment algorithms that are taught in our bioinformatics courses.
The source code of all implementations is available @github.com/BackofenLab/RNA-Playground.

# RNA structure

Using an adapted umambigous version of the Nussinov algorithm it is possible to count the number of all possible nested secondary structures an RNA molecule can form.

The Nussinov algorithm enables the efficient computation of the structure with the maximal number of base pairs for a given RNA sequence.

The McCaskill algorithm enables the efficient computation of RNA structure probabilities as well as probabilities that a certain base pair is formed. Furthermore, unpaired probabilities for subsequences can be computed that reflect the accessibility of RNA parts for other interactions.

The maximum expected accuracy (MEA) algorithm uses base pair probabilities as well as unpaired probabilites (e.g. computed via the McCaskill algorithm) to find the structure that is 'maximally accurate' in its structural elements.

# RNA-RNA interaction

RNA-RNA interactions can be predicted by linking the two interacting RNAs into one pseudo-RNA and use a variant of the Nussinov algorithm to predict the fused structure. Such approaches are called 'co-folding'.

Given two RNA molecules, the RNA-RNA interaction with the maximal number of intermolecular base pairs can be predicted. Such algorithms ignore the competition between intra- and intermolecular base pair formation, e.g. considered in 'co-folding' or 'accessibilty-based' approaches.

Since a nucleotide can form at most one base pair, there is a competition between intra- and intermolecular base pair formation. This can be incorporated into RNA-RNA interaction prediction by also scoring the probability that an interacting subsequence is not involved in intramolecular base pairs, i.e. it is accessible.

# Sequence alignment

S. B. Needleman and C. D. Wunsch introduced 1970 the first dynamic programming based algorithm for the efficient global alignment of two sequences. The approach uses a linear gap scoring scheme.

The global pairwise alignment scheme by Needleman and Wunsch requires quadratic memory to store the dynamic programming data structures. D. S. Hirschberg introduced 1975 an adaption of the computation strategy in order to reduce the memory requirement to a linear value while keeping the original runtime.

M. S. Waterman, T. F. Smith and W. A. Beyer introduced 1976 a variant of the global Needleman-Wunsch alignment approach that allows for arbitrary gap scoring. While more general, the approach has a higher computational complexity compared to the Needleman-Wunsch method.

O. Gotoh introduced 1982 an efficient global alignment approach that enables a more realistic affine gap cost model without changing the computational complexity compared to the Needleman-Wunsch approach.

T. F. Smith and M. S. Waterman adapted 1981 the Needleman-Wunsch approach to local alignment with linear gap scoring. This enables the identification of most similar subsequences.

O. Gotoh introduced 1982 an efficient global alignment approach that enables a more realistic affine gap cost model without changing the computational complexity compared to the Needleman-Wunsch approach. Here, we extend the original approach to local alignments by applying the ideas of the Smith-Waterman algorithm.

A. N. Arslan, Ö. Eğecioğlu and P. A. Pevzner introduced 2001 a fractional-programming-based approach to compute normalized local alignments. That is, it identifies the local alignment that is best preserved with respect to the length of the aligned subsequences. This counters the problem of the Smith-Waterman approach where often longer alignments hide shorter alignments with less gaps since they enable higher scores.

The progressive multi-sequence-alignment (MSA) scheme introduced 1987 by D.-F. Feng and R. F. Doolittle is a heuristic approach to compute close-to-optimal MSAs in reasonable time. This is achieved by progressively joining subsets of sequences to MSAs based on pairwise alignments, which is following by a distance-based clustering tree. To this end, similarity scores of pairwise alignments are converted into distance scores needed for the clustering.

Progressive multi-sequence-alignment (MSA) schemes are fast but only heuristics and thus are not always providing the optimal MSA. Their result quality can be improved when iterative refinement techniques are applied.

Progressive alignment schemes like the Feng-Doolittle approach can show suboptimal alignment steps since they can be misguided by the pairwise alignments that define how the progressive alignment fusions are performed. This is tackled by the t-coffee approach introduced 2000 by C. Notredame, D. G. Higgins and J. Heringa. Instead of using single pairwise alignments, the t-coffee approach derives an input-specific scoring using an arbitrarily large set of input alignments. This scoring scheme is sequence-position-specific and thus encodes what sequence positions are (most often) aligned in the input and thus indirectly also encodes unaligned positions, i.e. gaps. Furthermore, 'indirect' position alignment identified by combinations of alignment triples are incorporated and provide a consistency-focusing scoring extension. The final scoring is used in a adapted progressive alignment scheme to compute a final multiple sequence alignment.

# Evol. Clustering

Different methods can be applied to derive an evolutionary / clustering tree from given pairwise distance data.