Freiburg RNA Tools
CopraRNA - Help

Introduction

CopraRNA is a tool for sRNA target prediction. It computes whole-genome predictions by combining distinct whole-genome IntaRNA predictions. As input, CopraRNA requires at least 3 homologous sRNA sequences from 3 distinct organisms in FASTA format. Furthermore, each organism's genome has to be part of the NCBI Reference Sequence (RefSeq) database (i.e. its ID must have either the NZ_* or the NC_XXXXXX format, where * stands for any character and X stands for a digit between 0 and 9). Depending on sequence length (target and sRNA), number of input organisms and genome sizes, CopraRNA can take 24h or longer to compute (in most cases it is significantly faster). We suggest you supply your email address and return when the job has finished. As output, CopraRNA produces a list of putative targets sorted by CopraRNA p-value. Results can be viewed in the browser, but closer examination of the downloadable data is suggested.

Precomputed results for Enterobacteria: ArcZ, ChiX, CyaR, DsrA, FnrS, GcvB, MicA, MicC, MicF, OxyS, RprA, RybB, RyhB, SgrS, Spot42, and for Non-enteric bacteria: FsrA, LhrA2, PrrF1, SR1, IhtA.

Note that, in contrast to this server, the stand-alone CopraRNA software does not limit the problem size, provides enhanced functionality, and offers a batch-processing-friendly command line interface. For these reasons, you might consider installing CopraRNA locally.

When using CopraRNA please cite :

Results are computed with CopraRNA version 2.1.4 using IntaRNA 2.4.1

Overview

The following parameters are used to control the execution of CopraRNA

Furthermore, additional information is available

Sequence input

?  sRNA sequences

The central CopraRNA parameter is the selection of the species, which has a direct impact on the prediction results. A small evolutionary distance between the species favors sensitivity and a large distance favors specificity. Hence, we suggest selecting as many sRNA homologs as possible from species with varying evolutionary distance, as far as the conservation of the respective sRNA allows. For the benchmark we used a blend of closely, moderately and more remotely related species (based on the 16S rDNA sequence, see Fig. S4 in the accompanying publication). In general, the maximal evolutionary distance is given by the conservation of the sRNA, which is often restricted to a phylum or a class.

CopraRNA accepts input in form of a multiple FASTA file. A simple example looks like this:
>NC_000913
cccagagguauugauuggugaagucucucaugcgcagguuuuuuuu
>NC_011740
cccagagguauugauucggcacccgcggaugcgcagguuuuuuuu
>NC_003197
cccagagguauugauuggugagauuaggaugcgcagguuuuuuuu

Note that each FASTA header has to be a RefSeq ID of the corresponding organism!

In order to be CopraRNA compatible, an entered organism must be part of the NCBI Reference Sequence (RefSeq) database. Such an organism has one or several RefSeq IDs (depending on the existence of further replicons such as plasmids) in one of the following formats:

NC_XXXXXX where X stands for a digit between 0 and 9 (NC_000913 for E. coli)

NZ_* where * stands for any non-whitespace characters (NZ_CP007542 for Synechocystis sp. PCC 6714)

Only one RefSeq ID has to be supplied for each organism. If you supply the ID of a plasmid, the prediction will also be executed on all other replicons of the organism. Vice versa, if you supply the ID of the major replicon, the prediction will also be carried out on all additionally available replicons. IDs such as NS_000191 are not valid.

To check if the organisms you selected are CopraRNA compatible, check this list of RefSeq IDs. The list is regularly updated.

Please contact us if you know your organism is part of the RefSeq database and has an ID in the NZ_* or NC_XXXXXX format but is not present in this list, or is missing IDs. Then we can run an update.

Input can be given either as direct text input or by uploading a FASTA file. The sequences you upload should be homologous to each other. If you have an sRNA sequence and are trying to find homologs, you can start by using BLAST. If you don't find anything with BLAST, there are more sophisticated methods for this task, such as GotohScan. It is also possible that there are no homologs for your sequence. In this case we suggest you use the IntaRNA whole genome target prediction webserver instead.
The parameter constraints are: The input has to be in valid FASTA format. The number of sequences has to be at least 3 and at most 50. Sequence lengths have to be in the range 7-750. The allowed sequence alphabet is 'ACGUTacgut'. Each FASTA sequence header/name has to match the regular expression '^>\s*N[CZ]_\S+\s*'. Supported IDs can be found in CopraRNA_available_organisms.txt. Access to the NCBI server is needed.
Defaults to ()
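
The constraints listed above can be checked locally before submitting a job. The following is a minimal sketch in Python, not the server's actual validation code: the function names parse_fasta and check_input are hypothetical, and the check against the list of supported RefSeq IDs is left out because it needs the CopraRNA_available_organisms.txt file.

```python
import re

# Header pattern and alphabet as quoted in the constraint list above.
HEADER_RE = re.compile(r"^>\s*N[CZ]_\S+\s*")
ALPHABET = set("ACGUTacgut")


def parse_fasta(text):
    """Return a list of (header, sequence) pairs from multi-FASTA text."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            records.append([line, ""])
        elif records:
            records[-1][1] += line
        else:
            raise ValueError("sequence data found before the first FASTA header")
    return [(header, seq) for header, seq in records]


def check_input(text):
    """Return a list of constraint violations (an empty list means no problem found)."""
    problems = []
    records = parse_fasta(text)
    if not 3 <= len(records) <= 50:
        problems.append(f"expected 3-50 sequences, got {len(records)}")
    for header, seq in records:
        if not HEADER_RE.match(header):
            problems.append(f"header is not an NC_/NZ_ RefSeq ID: {header}")
        if not 7 <= len(seq) <= 750:
            problems.append(f"sequence length {len(seq)} outside 7-750: {header}")
        bad = set(seq) - ALPHABET
        if bad:
            problems.append(f"invalid characters {sorted(bad)}: {header}")
    return problems


# The example input shown earlier in this section.
EXAMPLE = """>NC_000913
cccagagguauugauuggugaagucucucaugcgcagguuuuuuuu
>NC_011740
cccagagguauugauucggcacccgcggaugcgcagguuuuuuuu
>NC_003197
cccagagguauugauuggugagauuaggaugcgcagguuuuuuuu
"""

if __name__ == "__main__":
    print(check_input(EXAMPLE))
```

An empty list only means that the formal constraints hold; it does not guarantee that each ID is present in the list of supported organisms.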

?  Extract sequences around

This option allows you to select the region of the mRNAs from which the putative target sequences are retrieved. Selecting "start codon" selects regions upstream and downstream (see nt up, nt down) relative to the start codon. The same logic holds if you select "stop codon".

?  nt up (1-300)

This parameter specifies the number of nucleotides (nt) upstream of your start or stop codon (depending on which one you selected). If you selected start codon and have prior knowledge about average 5'UTR lengths in your input organisms, then it is sensible to set nt up to this number in order to increase prediction quality. The sum of nt up and nt down has to reach a minimal threshold; see the constraint list.
The parameter constraints are: Input value has to be parsable as Integer. The value must be smaller than or equal to 300 and must be greater than or equal to 1. The sum of nt up (1-300) plus nt down (1-300) has to be at least 150.
Defaults to (200)

?  nt down (1-300)

This parameter specifies the number of nucleotides (nt) downstream of your start or stop codon (depending on which one you selected). If you selected stop codon and have prior knowledge about average 3'UTR lengths in your input organisms, then it is sensible to set nt down to this number in order to increase prediction quality. The sum of nt up and nt down has to reach a minimal threshold; see the constraint list.
The parameter constraints are: Input value has to be parsable as Integer. The value must be smaller than or equal to 300 and must be greater than or equal to 1.
Defaults to (100)
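
The constraints on nt up and nt down interact through their sum, which is easy to overlook. As a rough illustration (the function name check_window is hypothetical and this is not the server's own code), the rules stated above can be written as:

```python
def check_window(nt_up=200, nt_down=100):
    """Validate nt up / nt down and return the resulting target sequence length.

    Rules taken from the constraint lists above: each value is an integer
    in 1-300, and nt up + nt down has to be at least 150.
    """
    for name, value in (("nt up", nt_up), ("nt down", nt_down)):
        if isinstance(value, bool) or not isinstance(value, int):
            raise ValueError(f"{name} has to be an integer")
        if not 1 <= value <= 300:
            raise ValueError(f"{name} has to be in the range 1-300, got {value}")
    if nt_up + nt_down < 150:
        raise ValueError("nt up + nt down has to be at least 150")
    return nt_up + nt_down


if __name__ == "__main__":
    print(check_window())        # server defaults
    print(check_window(100, 50))  # smallest allowed sum
```

With the defaults (200 and 100), the extracted target sequences are 300 nt long in most cases.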

?  Organism of interest

Usually, users have a specific organism they are especially interested in. The selected organism of interest takes a prime position in the output display and post-processing. However, it does not change the internal computations of the core CopraRNA algorithm. The online output can only be viewed for the organism of interest. In the downloadable data, all organisms are incorporated, but the functional enrichment of the top candidates is only computed for the organism of interest.

Putative target sequences

These are the putative target sequences, extracted from the organism of interest's RefSeq file(s). In most cases their length is nt up + nt down.

CopraRNA parameters

?  Consensus prediction

Perform a CopraRNA 2 consensus prediction searching interaction regions overlapping with the organism of interest's predicted interaction site.

?  p-value combination

Whether to do a dynamic p-value combination as introduced by CopraRNA-v2 or to perform a non-dynamic CopraRNA-v1 p-value combination.
The parameter constraints are: Input value has to be parsable as Boolean.
Defaults to (dynamic)

?  p-value filtering (0=off)

Post-processing filter for organism of interest p-value (0=off).
The parameter constraints are: Input value has to be parsable as Double. The value must be greater than or equal to 0 and must be smaller than or equal to 1.
Defaults to (0)

IntaRNA parameters

?  Target folding window size

Size of the averaging window in the local target RNA folding (RNAplfold -W) for the computation of accessibilities.
Local folding is key to reasonable folding results when facing larger RNA molecules, since it minimizes effects of incorrect long-range predictions (see local folding article).
Note, the folding window size should be about 50nt larger than the max. basepair distance.
If set to 0, no sliding window is used and the full sequence length is considered.
The parameter constraints are: Input value has to be parsable as Integer. The value must be greater than or equal to 0.
Defaults to (150)

?  Target max. basepair distance

Maximal distance of two paired bases in the local target RNA folding (RNAplfold -L) for computation of accessibilities.
Local folding is key to reasonable folding results when facing larger RNA molecules, since it minimizes effects of incorrect long-range predictions (see local folding article).
Note, the max. basepair distance should be about 50nt smaller than the folding window size.
If set to 0, the sliding window size value is also used for base pair span restrictions.
The parameter constraints are: Input value has to be parsable as Integer. The value must be greater than or equal to 0.
Defaults to (100)
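
The special meaning of 0 for both folding parameters can be summarized in a few lines. This is only a sketch of the rules described above, under the assumption that the two settings are resolved independently of anything else; the function name effective_folding is hypothetical and the code does not call RNAplfold.

```python
def effective_folding(seq_len, window=150, max_bp_dist=100):
    """Resolve the 0 = "not set" convention for the two folding parameters.

    Returns (window size, max. basepair distance) as they would apply to a
    target sequence of length seq_len, following the description above.
    """
    if window < 0 or max_bp_dist < 0:
        raise ValueError("both values have to be greater than or equal to 0")
    # Window size 0: no sliding window, the full sequence length is considered.
    w = seq_len if window == 0 else window
    # Max. basepair distance 0: the window size also restricts the base pair span.
    span = w if max_bp_dist == 0 else max_bp_dist
    return w, span


if __name__ == "__main__":
    print(effective_folding(300))          # server defaults
    print(effective_folding(300, 0, 0))    # global folding of a 300 nt target
```

The defaults (150 and 100) follow the recommendation that the window is about 50nt larger than the max. basepair distance.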

Output Description

Main result:

The main CopraRNA result is a table of target candidates for the entered homologous sRNAs, sorted by CopraRNA p-value. The data displayed on the output page of the webserver is limited compared to the downloadable data. For this reason we suggest you download the results for closer inspection.

Positions of interactions:

The positions of the interactions are not relative to start or stop codon, but rather absolute positions with respect to the lengths of your sRNA/mRNA sequence. For example, if you were to extract sequences 200 upstream and 100 downstream of the start codon, the location of your start codon is 201,202,203.
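
If positions relative to the start codon are needed, they can be derived from the absolute positions. The sketch below assumes that the full nt up region was extracted and uses the common convention without a position 0 (+1 is the first nucleotide of the start codon, -1 the nucleotide directly upstream); both the convention and the function names are assumptions made for this illustration.

```python
def start_codon_positions(nt_up=200):
    """1-based positions of the start codon within the extracted sequence."""
    return (nt_up + 1, nt_up + 2, nt_up + 3)


def to_relative(pos, nt_up=200):
    """Convert a 1-based absolute position into a start-codon-relative one.

    Convention (an assumption, see above): +1 is the first nucleotide of the
    start codon, -1 the nucleotide directly upstream; there is no position 0.
    """
    rel = pos - nt_up
    return rel if rel > 0 else rel - 1


if __name__ == "__main__":
    print(start_codon_positions())  # (201, 202, 203)
    print(to_relative(201))         # 1
    print(to_relative(200))         # -1
    print(to_relative(1))           # -200
```

For sequences shorter than nt up + nt down, for example near the end of a replicon, this simple offset may not apply.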

Annotation:

The annotation is retrieved from the RefSeq genome files.

Additional homologs:

In some cases, genes from the same organism can be part of the same cluster of targets. In these cases only the sequence with the best IntaRNA energy score participates in the calculation of the CopraRNA p-value. To ensure that no potential targets are lost because of this, the additional homologs are added for the organism of interest.

Regions plots:

These plots are meant to give you an overview of the regions in the target and sRNA sequences that play predominant roles in the statistically significant interactions. The density plot at the top of the image is calculated from all predicted interactions with a CopraRNA p-value <= 0.01, while the interactions displayed at the bottom of the image are shown for the top 20 predicted targets. The different coloring contains no information and is purely intended to increase contrast between different genes.

Functional Annotation Chart:

The top 100 targets of the comparative CopraRNA prediction that have homologs in the organism of interest have been subjected to functional enrichment. The heatmap shows all members of clusters with a DAVID enrichment score >= 1 in a specific color. Each row represents a gene and each column a specific functional term. If the gene can be assigned to a term, the corresponding square is filled/colored. Closely related terms are assigned to a cluster and have the same color. The opacity of the color depends on the p-value of the CopraRNA prediction. A more intense color represents a more significant p-value. The "Fold enrichment" is given in front of the term descriptions. It gives the enrichment of a term in the prediction group in relation to the whole genome background (e.g. a term with an enrichment of 10 contains 10 times more genes belonging to the respective term than the background). The enrichment scores give a measure of the biological significance of the cluster. A higher score represents a more statistically significant enrichment. The publication on the DAVID webserver suggests investigating clusters with an enrichment score of >= 1.3.
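
To make the two numbers more tangible, here is a small worked example. It assumes the usual DAVID definitions (fold enrichment as the ratio of the term's fraction in the gene list to its fraction in the background, and the cluster enrichment score as the negative log10 of the geometric mean of the member terms' p-values); the function names and the counts are invented for illustration.

```python
import math


def fold_enrichment(hits_in_list, list_size, hits_in_background, background_size):
    """Fraction of term members in the list divided by the fraction in the background."""
    # Written as one division so that integer counts give an exact result.
    return (hits_in_list * background_size) / (list_size * hits_in_background)


def enrichment_score(term_pvalues):
    """-log10 of the geometric mean of the p-values of the terms in one cluster."""
    return -sum(math.log10(p) for p in term_pvalues) / len(term_pvalues)


if __name__ == "__main__":
    # Invented counts: 10 of 100 top targets carry a term that 40 of 4000
    # background genes carry.
    print(fold_enrichment(10, 100, 40, 4000))        # 10.0
    # A cluster whose terms all have p = 0.05 scores about 1.3.
    print(round(enrichment_score([0.05, 0.05]), 2))  # 1.3
```

Under these definitions, the suggested cutoff of 1.3 corresponds to a geometric mean p-value of about 0.05.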

Interactions:

The interaction you see on the webserver is the interaction calculated by IntaRNA for the specific candidate you are viewing (the highlighted line in the table). Single interactions can be downloaded for further use. For additional information on how the RNA interactions are computed, please refer to the IntaRNA publication.

Downloadable files:

Main CopraRNA result:

This is a comma-separated table (*.csv), sorted by CopraRNA p-value, containing all the results for all organisms entered in the analysis. Each column named by a RefSeq ID represents the prediction for one organism. The column 'sampled p-values' reports how many IntaRNA p-values were sampled for a specific gene cluster. A high number in this cell can be an indicator of a false positive. The other columns should be self-explanatory. See the explanation of additional homologs further up in this help. Each line represents one cluster of homologous genes within the organisms entered in the analysis. The content of the cells follows this scheme:
locus_tag(gene name|IntaRNA energy score|IntaRNA p-value|pos. start mRNA|pos. stop mRNA |pos. start sRNA|pos. stop sRNA|Entrez GeneID)
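
For downstream analysis, a cell following this scheme can be split into its fields. The sketch below is one possible way to do this; the class and function names are hypothetical, and the example cell is made up to match the scheme rather than taken from a real result file.

```python
import re
from typing import NamedTuple

# locus_tag(field|field|...): everything before the first "(" is the locus tag.
CELL_RE = re.compile(r"^(?P<locus>[^(]+)\((?P<fields>.*)\)$")


class Hit(NamedTuple):
    locus_tag: str
    gene_name: str
    energy: float      # IntaRNA energy score
    pvalue: float      # IntaRNA p-value
    mrna_start: int
    mrna_stop: int
    srna_start: int
    srna_stop: int
    gene_id: str       # Entrez GeneID, kept as text


def parse_cell(cell):
    """Parse one organism cell of the main CopraRNA result table."""
    match = CELL_RE.match(cell.strip())
    if match is None:
        raise ValueError(f"cell does not follow the scheme: {cell!r}")
    fields = [f.strip() for f in match.group("fields").split("|")]
    if len(fields) != 8:
        raise ValueError(f"expected 8 fields, got {len(fields)}: {cell!r}")
    name, energy, pvalue, m_start, m_stop, s_start, s_stop, gene_id = fields
    return Hit(match.group("locus").strip(), name, float(energy), float(pvalue),
               int(m_start), int(m_stop), int(s_start), int(s_stop), gene_id)


if __name__ == "__main__":
    # Made-up cell content for illustration only.
    print(parse_cell("b0001(thrL|-9.87|0.0123|180|195 |40|55|1234567)"))
```

Cells of organisms without a homolog in a cluster are presumably empty and need to be skipped before calling such a parser.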

Functional Enrichment:

This file contains the DAVID functional enrichment result for the top 100 target candidates of CopraRNA. A term appears as enriched if it is significantly overrepresented in the top list when compared to the background. The background in this case is the set of all genes for which there is a prediction (not the entire set of genes of an organism). Enrichment scores of 1.3 and higher suggest statistical significance. However, enrichments also strongly depend on the quality of the annotation of the entered organism of interest. The file is tab-delimited. This result is only calculated for the organism of interest.

Auxiliary enrichment file:

Due to strong organism specificity, some targets may not be detectable using the comparative approach. In order to alleviate this problem, functional enrichments are computed for the top 100 CopraRNA predictions and the top 100 IntaRNA predictions for the organism of interest. In order to detect as many targets as possible, the CopraRNA enrichment is compared with the IntaRNA enrichment. Identical functional terms are compared, and putative targets that are not reported in the CopraRNA enrichment but are reported in the IntaRNA enrichment for the same term are stored in the auxiliary enrichment file.
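
The comparison described above amounts to a per-term set difference. The following sketch illustrates the idea with invented term and gene names; it assumes each enrichment is available as a mapping from functional term to the set of genes reported for it, and the function name auxiliary_targets is hypothetical.

```python
def auxiliary_targets(copra_terms, intarna_terms):
    """For terms found in both enrichments, collect genes reported only by IntaRNA.

    Both arguments map a functional term to the set of genes reported for it.
    """
    auxiliary = {}
    for term in copra_terms.keys() & intarna_terms.keys():
        extra = intarna_terms[term] - copra_terms[term]
        if extra:
            auxiliary[term] = extra
    return auxiliary


if __name__ == "__main__":
    # Invented example data.
    copra = {"iron ion binding": {"geneA", "geneB"}, "transport": {"geneC"}}
    intarna = {"iron ion binding": {"geneA", "geneD"}, "motility": {"geneE"}}
    print(auxiliary_targets(copra, intarna))  # {'iron ion binding': {'geneD'}}
```

Terms that appear only in the IntaRNA enrichment ("motility" in this example) are not considered, since only identical terms are compared.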

Phylogenetic conservation:

Conservation of the sRNA/target interactions for the top 25 predicted targets. The columns of the heatmap represent the investigated organisms, the rows the respective targets. The columns are ordered according to a UPGMA tree based on the sRNA sequences. The cells are colored based on the respective IntaRNA p-value. Shades of green are used for p-values ≤ 0.3 and salmon for p-values > 0.3. If, according to the domclust clustering of CopraRNA, an organism has no homolog of a given target, the cell is colored white. The cell labeling gives information about the position of the respective interaction site in mRNA and sRNA. If both sites are in agreement with the consensus interaction, the cell is not labeled. If the mRNA site, the sRNA site, or both do not match the consensus site, the cell is labeled in red with '-/+', '+/-' or '-/-', respectively. If the optimal prediction does not match the mRNA consensus site, but the first suboptimal prediction does, the p-value of the suboptimal prediction is used for the cell coloring. The usage of the suboptimal prediction is indicated by a '2' in the cell labeling: '2' means that both the mRNA and the sRNA interaction site of the suboptimal prediction match the consensus, '2+/-' means that only the interaction site in the mRNA matches the consensus.
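
The coloring and labeling rules of the heatmap can be read as a small decision procedure. The sketch below restates them in code to make the label semantics explicit; it is an interpretation of the description above, not the plotting code itself. The function name cell_style is hypothetical and the shades of green are reduced to a single "green".

```python
def cell_style(has_homolog, pvalue=None, mrna_ok=False, srna_ok=False, suboptimal=None):
    """Return (color, label) for one heatmap cell.

    mrna_ok / srna_ok: whether the optimal prediction's sites match the consensus.
    suboptimal: optional (pvalue, mrna_ok, srna_ok) of the first suboptimal prediction.
    """
    if not has_homolog:
        return "white", ""
    prefix = ""
    # Fall back to the first suboptimal prediction only if the optimal one
    # misses the mRNA consensus site and the suboptimal one matches it.
    if not mrna_ok and suboptimal is not None and suboptimal[1]:
        pvalue, mrna_ok, srna_ok = suboptimal
        prefix = "2"
    color = "green" if pvalue <= 0.3 else "salmon"
    if mrna_ok and srna_ok:
        return color, prefix
    label = ("+" if mrna_ok else "-") + "/" + ("+" if srna_ok else "-")
    return color, prefix + label


if __name__ == "__main__":
    print(cell_style(True, 0.1, True, True))                       # ('green', '')
    print(cell_style(True, 0.5, False, True))                      # ('salmon', '-/+')
    print(cell_style(True, 0.5, False, True, (0.2, True, False)))  # ('green', '2+/-')
    print(cell_style(False))                                       # ('white', '')
```

In each label, the first sign refers to the mRNA site and the second to the sRNA site.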

Regions plots:

These are the same as the ones displayed on the webserver. They can be downloaded in postscript, pdf and png format.

16S rDNA tree:

The tree is constructed with the neighbor-joining method and is based on the 16S rDNA sequences of the respective organisms. Based on the annotation, the 16S sequences are retrieved from each organism's RefSeq genome file. The tree is provided in both the NEWICK text format and the SVG image format. The guide tree view is generated using the Newick Utilities.

Input Examples

?  5 ChiX sequences (~2h)

The example's result can be directly accessed here

List of Changes