Freiburg RNA Tools
CopraRNA - Help

Introduction

CopraRNA is a tool for sRNA target prediction. It computes whole genome predictions by combining distinct whole genome IntaRNA predictions. As input, CopraRNA requires at least 3 homologous sRNA sequences from 3 distinct organisms in FASTA format. Furthermore, each organism's genome has to be part of the NCBI Reference Sequence (RefSeq) database (i.e. its ID should have exactly the NZ_* or the NC_XXXXXX format, where * stands for any character and X stands for a digit between 0 and 9). Depending on sequence length (target and sRNA), number of input organisms and genome sizes, CopraRNA can take 24h or longer to compute (in most cases it is significantly faster). We suggest you supply your email address and return when the job has finished. As output, CopraRNA produces a list of putative targets sorted by CopraRNA p-value. Results can be viewed in the browser, but closer examination of the downloadable data is suggested.

Precomputed results for Enterobacteria: ArcZ, ChiX, CyaR, DsrA, FnrS, GcvB, GlmZ, MicA, MicC, MicF, OmrA, OmrB, OxyS, RprA, RybB, RyhB, SgrS, Spot42, and for Non-enteric bacteria: FsrA, LhrA2, PrrF1, SR1, IhtA.
Furthermore, IsaR target predictions on 19 cyanobacterial genomes.

When using CopraRNA please cite :

Results are computed with CopraRNA version 2.0.3.2

Input Parameters

?  sRNA sequences

The central CopraRNA parameter is the selection of the species, which has a direct impact on the prediction results. A small evolutionary distance between the species favors sensitivity and a large distance favors specificity. Hence, we suggest selecting as many sRNA homologs as possible from species with varying evolutionary distance, as far as the conservation of the respective sRNA allows. For the benchmark we used a blend of closely, moderately and more remotely related species (based on the 16S rDNA sequence, see Fig. S4 in the accompanying publication). In general, the maximal evolutionary distance is limited by the conservation of the sRNA, which is often restricted to a phylum or a class.

CopraRNA accepts input in the form of a multiple FASTA file. A simple example looks like this:
>NC_000913
cccagagguauugauuggugaagucucucaugcgcagguuuuuuuu
>NC_011740
cccagagguauugauucggcacccgcggaugcgcagguuuuuuuu
>NC_003197
cccagagguauugauuggugagauuaggaugcgcagguuuuuuuu

Note that each FASTA header has to be a RefSeq ID of the corresponding organism!

In order to be CopraRNA compatible, an entered organism must be part of the NCBI Reference Sequence (RefSeq) database. Such an organism has one or several (depending on the existence of further replicons such as plasmids) RefSeq ID(s) in one of the following formats:

NC_XXXXXX where X stands for a digit between 0 and 9 (NC_000913 for E. coli)

NZ_* where * stands for any character except whitespace (NZ_CP007542 for Synechocystis sp. PCC 6714)

Only one RefSeq ID has to be supplied for each organism. If you supply the ID of a plasmid, the prediction will also be executed on all other replicons of the organism. Conversely, if you supply the ID of the major replicon, the prediction will also be carried out on all additionally available replicons. IDs such as NS_000191 are not valid.

To check if the organisms you selected are CopraRNA compatible, check this list of RefSeq IDs. Currently, more than 4000 organisms are CopraRNA compatible. The list is regularly updated.

Please contact us if you know your organism is part of the RefSeq database and has an ID in the NZ_* or NC_XXXXXX format but is not present in this list, or is missing IDs. Then we can run an update.

Input can be given either as direct text input or by uploading a FASTA file. The sequences you upload should be homologous to each other. If you have an sRNA sequence and are trying to find homologs, you can start by using BLAST. If you don't find anything with BLAST, there are more sophisticated methods for this task, such as GotohScan. It is also possible that there are no homologs for your sequence. In this case we suggest you resort to the IntaRNA whole genome target prediction webserver.

Note: if you want to use iterative enrichment maximization, a minimum of 4 sequences is required.
The parameter constraints are: The input has to be in valid FASTA format. The number of sequences has to be at least 3 and at most 20. Sequence lengths have to be in the range 7-750. The allowed sequence alphabet is 'ACGUTacgut'. Each FASTA sequence header/name has to match the regular expression '^>\s*N[CZ]_\S+\s*'. Access to the NCBI server is needed. The provided RefSeq ID has to be part of the supported organism list (see the corresponding file).
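The locally checkable constraints listed above can be sketched in Python before a job is submitted. This is only an illustration and not the webserver's own validation code: the function names read_fasta and check_input are hypothetical, and the sketch checks neither NCBI access nor membership in the supported organism list.

```python
import re

# Header pattern and alphabet as quoted in the constraint list above.
HEADER_RE = re.compile(r"^>\s*N[CZ]_\S+\s*")
ALLOWED = set("ACGUTacgut")


def read_fasta(text):
    """Split multi-FASTA text into (header, sequence) pairs."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            records.append([line, ""])
        elif records:
            records[-1][1] += line  # sequences may span several lines
        else:
            raise ValueError("sequence data found before the first FASTA header")
    return [(header, seq) for header, seq in records]


def check_input(text):
    """Return a list of constraint violations (empty if none were found)."""
    try:
        records = read_fasta(text)
    except ValueError as error:
        return [str(error)]
    problems = []
    if not 3 <= len(records) <= 20:
        problems.append(f"expected 3 to 20 sequences, found {len(records)}")
    for header, seq in records:
        if HEADER_RE.match(header) is None:
            problems.append(f"header does not start with an NC_/NZ_ ID: {header}")
        if not 7 <= len(seq) <= 750:
            problems.append(f"sequence length {len(seq)} outside 7-750: {header}")
        if set(seq) - ALLOWED:
            problems.append(f"characters outside 'ACGUTacgut': {header}")
    return problems
```

An empty list from check_input means only that these local checks passed; the server may still reject an ID that is missing from its organism list.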

?  Extract sequences around

This option allows you to select the region of the mRNAs from which the putative target sequences are extracted. Selecting "start codon" selects regions upstream and downstream (see nt up, nt down) relative to the start codon. The same logic holds if you select "stop codon".

?  nt up (1-300)

This parameter specifies the number of nucleotides (nt) upstream of your start or stop codon (depending on which one you selected). If you selected start codon and have prior knowledge about average 5'UTR lengths in your input organisms, then it is sensible to set nt up to this number in order to increase prediction quality. The sum of nt up and nt down has to be above a minimal threshold; see constraint list.
The parameter constraints are: Input value has to be parsable as an Integer. The value must be smaller than or equal to 300 and must be greater than or equal to 1. The sum of nt up (1-300) plus nt down (1-300) has to be at least 150.

?  nt down (1-300)

This parameter specifies the number of nucleotides (nt) downstream of your start or stop codon (depending on which one you selected). If you selected stop codon and have prior knowledge about average 3'UTR lengths in your input organisms, then it is sensible to set nt down to this number in order to increase prediction quality. The sum of nt up and nt down has to be above a minimal threshold; see constraint list.
The parameter constraints are: Input value has to be parsable as an Integer. The value must be smaller than or equal to 300 and must be greater than or equal to 1.
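The constraints on nt up and nt down can be summarized in a few lines of Python. This is a minimal sketch with a hypothetical function name (check_window), assuming both values have already been parsed as integers.

```python
def check_window(nt_up, nt_down, min_total=150):
    """Return a list of violated constraints for the extraction window."""
    problems = []
    for name, value in (("nt up", nt_up), ("nt down", nt_down)):
        if not 1 <= value <= 300:
            problems.append(f"{name} must be between 1 and 300, got {value}")
    if nt_up + nt_down < min_total:
        problems.append(
            f"nt up + nt down must be at least {min_total}, got {nt_up + nt_down}"
        )
    return problems
```

For example, check_window(200, 100) reports no problems, while check_window(50, 50) reports that the sum is below 150.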

?  Minimal rel. cluster size

The minimal relative cluster size parameter adjusts how many genes must at least be present in every putative target cluster with respect to the total number of participating organisms. For a cutoff of 0.5 this means that at least half of the organisms in a single prediction set need to have a homolog belonging to a specific gene cluster. If this is not the case, the cluster is not considered in the prediction. Setting higher stringency (i.e. bigger values) may reduce noise, but can also cause loss of real targets.
The parameter constraints are: Input value has to be parsable as a Double. The value must be greater than or equal to 0.5 and must be smaller than or equal to 1.
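The cutoff can be read as a simple ratio test. The sketch below assumes a cluster is represented by the set of organisms that contribute at least one gene to it; both this representation and the function name keep_cluster are illustrative and not taken from the CopraRNA source.

```python
def keep_cluster(organisms_with_homolog, total_organisms, min_rel_size=0.5):
    """Decide whether a gene cluster is large enough to be considered.

    organisms_with_homolog: IDs of organisms that have a gene in the cluster.
    total_organisms: number of organisms participating in the prediction.
    """
    if not 0.5 <= min_rel_size <= 1:
        raise ValueError("minimal relative cluster size must be within 0.5-1")
    fraction = len(set(organisms_with_homolog)) / total_organisms
    return fraction >= min_rel_size
```

With four input organisms and the default cutoff of 0.5, a cluster with genes from two organisms is kept, while with three input organisms a cluster with a gene from only one organism is discarded.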

?  Iterative organism subset analysis (beta)

If this option is selected, CopraRNA will be run iteratively several times for subsets of the initial input (at least 4 sequences). With each iteration one sequence/organism is discarded. The organism to be discarded is always the one which is most distant to the organism of interest in the 16S rDNA tree. In total, n-2 different predictions are computed (n = number of sequences in the input). The first prediction contains n sequences, the last prediction contains 3 sequences. All iterations are subsequently employed to create a consensus prediction that represents a restricted union of the individual predictions. This method increases the robustness of the prediction.

Note, this feature is not rigorously tested yet (beta stage).
The parameter constraints are: Input value has to be parsable as a Boolean.
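The sequence of organism subsets described above can be sketched as follows. The distances to the organism of interest are assumed to be given as a plain dictionary (in CopraRNA they come from the 16S rDNA tree), the function name iterative_subsets is hypothetical, and the construction of the consensus prediction is not covered.

```python
def iterative_subsets(organisms, distance_to_ooi, ooi):
    """List the organism sets used in each iteration.

    organisms: all input organism IDs (at least 4).
    distance_to_ooi: maps each ID to its distance from the organism of interest.
    ooi: ID of the organism of interest, which is never discarded.
    """
    if len(organisms) < 4:
        raise ValueError("at least 4 sequences are required")
    current = list(organisms)
    subsets = []
    while len(current) >= 3:
        subsets.append(list(current))
        # Discard the organism that is most distant to the organism of interest.
        candidates = [o for o in current if o != ooi]
        farthest = max(candidates, key=lambda o: distance_to_ooi[o])
        current.remove(farthest)
    return subsets
```

For n input organisms the returned list has n-2 entries, starting with all n organisms and ending with the 3 organisms closest to the organism of interest (including itself).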

?  Organism of interest

Usually there is one specific organism you are especially interested in. The selected organism of interest takes a prime position in the output display and post processing. However, it does not change the internal computations of the core CopraRNA algorithm. The online output can only be viewed for the organism of interest. In the downloadable data, all organisms are incorporated, but the functional enrichment of the top candidates is only computed for the organism of interest.

Putative target sequences

These are the putative target sequences, extracted from the organism of interest's RefSeq file(s). In most cases their length is nt up + nt down.

Output Description

Main result:

The main CopraRNA result is a table of target candidates for the entered homologous sRNAs, sorted by CopraRNA p-value. The data displayed on the output page of the webserver is limited compared to the downloadable data. For this reason we suggest you download the results for closer inspection.

Positions of interactions:

The positions of the interactions are not relative to the start or stop codon, but are absolute positions within your sRNA/mRNA sequences. For example, if you extract sequences 200 nt upstream and 100 nt downstream of the start codon, the start codon is located at positions 201, 202 and 203.
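The arithmetic of this example can be written down as a small Python sketch. It assumes 1-based positions and that the full nt up region could be extracted (sequences at the edge of a replicon may be shorter); the function names are hypothetical.

```python
def start_codon_positions(nt_up):
    """1-based positions of the start codon within the extracted sequence."""
    return (nt_up + 1, nt_up + 2, nt_up + 3)


def offset_from_start_codon(position, nt_up):
    """Offset of a reported position from the first start codon nucleotide.

    0 is the first nucleotide of the start codon, negative values lie upstream.
    """
    return position - (nt_up + 1)
```

With nt up = 200, start_codon_positions(200) gives (201, 202, 203), and a reported interaction position of 190 lies 11 nt upstream of the first start codon nucleotide.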

Annotation:

The annotation is retrieved from the RefSeq genome files.

Additional homologs:

In some cases, several genes from the same organism can be part of the same cluster of targets. In these cases only the sequence with the best IntaRNA energy score participates in the calculation of the CopraRNA p-value. To ensure that no potential targets are lost because of this, the additional homologs are added for the organism of interest.

Regions plots:

These plots are meant to give you an overview of the regions in the target and sRNA sequences that play predominant roles in the statistically significant interactions. The density plot at the top of the image is calculated from all predicted interactions with a CopraRNA p-value <= 0.01, while the interactions displayed at the bottom of the image are shown for the top 20 predicted targets. The different coloring contains no information and is purely intended to increase contrast between different genes.

Functional Annotation Chart:

The top 100 targets of the comparative CopraRNA prediction that have homologs in the organism of interest have been subjected to functional enrichment. The heatmap shows all members of clusters with a DAVID enrichment score >= 1 in a specific color. Each row represents a gene and each column a specific functional term. If the gene can be assigned to a term, the corresponding square is filled/colored. Closely related terms are assigned to a cluster and have the same color. The opacity of the color depends on the p-value of the CopraRNA prediction. A more intense color represents a more significant p-value. The "Fold enrichment" is given in front of the term descriptions. It gives the enrichment of a term in the prediction group in relation to the whole genome background (e.g. a term with an enrichment of 10 contains 10 times more genes belonging to the respective term than the background). The enrichment scores give a measure of the biological significance of the cluster. A higher score represents a more statistically significant enrichment. The publication on the DAVID webserver suggests investigating clusters with an enrichment score of >= 1.3.

Interactions:

The interaction you see on the webserver is the interaction calculated by IntaRNA for the specific candidate you are viewing (the highlighted line in the table). Single interactions can be downloaded for further use. For additional information on how the RNA interactions are computed, please refer to the IntaRNA publication.

Downloadable files:

Main CopraRNA result:

This is a comma separated table (*.csv), sorted by CopraRNA p-value, containing all the results for all organisms entered in the analysis. Each column named by a RefSeq ID represents the prediction for one organism. The column 'sampled p-values' reports how many IntaRNA p-values were sampled for a specific gene cluster. A high number in this cell can be an indicator of a false positive. The other columns should be self-explanatory. See the explanation of additional homologs further up in this help. Each line represents one cluster of homologous genes within the organisms entered in the analysis. The content of the cells follows this scheme:
locus_tag(gene name|IntaRNA energy score|IntaRNA p-value|pos. start mRNA|pos. stop mRNA|pos. start sRNA|pos. stop sRNA|Entrez GeneID)
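A cell that follows this scheme can be split into its fields with a few lines of Python. This is a sketch only: the class and function names are hypothetical, the example values below are made up, and the sketch assumes that gene names contain neither "|" nor parentheses.

```python
import re
from typing import NamedTuple


class TargetCell(NamedTuple):
    locus_tag: str
    gene_name: str
    energy: float       # IntaRNA energy score
    p_value: float      # IntaRNA p-value
    mrna_start: int
    mrna_stop: int
    srna_start: int
    srna_stop: int
    gene_id: str        # Entrez GeneID, kept as text


CELL_RE = re.compile(r"^\s*([^()]+)\((.*)\)\s*$")


def parse_cell(cell):
    """Parse one organism cell of the main result table."""
    match = CELL_RE.match(cell)
    if match is None:
        raise ValueError(f"cell does not follow the scheme: {cell!r}")
    locus_tag = match.group(1).strip()
    fields = [field.strip() for field in match.group(2).split("|")]
    if len(fields) != 8:
        raise ValueError(f"expected 8 fields, found {len(fields)}: {cell!r}")
    name, energy, p_value, m_start, m_stop, s_start, s_stop, gene_id = fields
    return TargetCell(locus_tag, name, float(energy), float(p_value),
                      int(m_start), int(m_stop), int(s_start), int(s_stop),
                      gene_id)
```

For instance, parse_cell("abc_0001(geneX|-12.5|0.004|180|201|10|31|1234567)") yields a record whose energy field is -12.5 and whose mrna_start field is 180. Cells for organisms without a gene in the cluster need to be handled by the caller before parsing.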

Functional Enrichment:

This file contains the DAVID functional enrichment result for the top 100 target candidates of CopraRNA. A term appears as enriched if it is significantly overrepresented in the top list when compared to the background. The background in this case consists of all genes for which there is a prediction (not the entire set of genes of an organism). Enrichment scores of 1.3 and higher suggest statistical significance. However, enrichments also strongly depend on the quality of the annotation of the entered organism of interest. The file is tab delimited. This result is only calculated for the organism of interest.

Auxiliary enrichment file:

Due to strong organism specificity, some targets may not be detectable using the comparative approach. In order to alleviate this problem, functional enrichments are computed for the top 100 CopraRNA predictions and the top 100 IntaRNA predictions for the organism of interest. In order to detect as many targets as possible, the CopraRNA enrichment is compared with the IntaRNA enrichment. Identical functional terms are compared, and putative targets that are not reported in the CopraRNA enrichment but are reported in the IntaRNA enrichment for the same term are stored in the auxiliary enrichment file.
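The comparison described above amounts to a per-term set difference. The following Python sketch assumes each enrichment result has already been reduced to a mapping from functional term to the set of genes reported for that term; this data layout and the function name are illustrative and do not reflect the actual file formats.

```python
def auxiliary_targets(copra_enrichment, intarna_enrichment):
    """Collect genes reported only in the IntaRNA enrichment, per shared term.

    Both arguments map a functional term to the set of genes reported for it.
    """
    auxiliary = {}
    # Only terms that occur in both enrichments are compared.
    for term in copra_enrichment.keys() & intarna_enrichment.keys():
        extra = intarna_enrichment[term] - copra_enrichment[term]
        if extra:
            auxiliary[term] = extra
    return auxiliary
```

Terms that appear in only one of the two enrichments do not contribute to the result in this sketch.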

Regions plots:

These are the same as the ones displayed on the webserver. They can be downloaded in PostScript, PDF and PNG format.

16S rDNA tree:

The tree is constructed with the neighbor-joining method and based on the 16S rDNA sequences of the respective organisms. Based on the annotation, the 16S sequences are retrieved from each organism's RefSeq genome file. The tree is provided both in the NEWICK text format and the SVG image format. The guide tree view is generated using the Newick Utilities.

Input Examples

?  5 ChiX sequences (~2h)

This is an example with five homologous ChiX sequences that takes about two hours of computation.
The example's result can be directly accessed here

Frequently Asked Questions

If your question is not listed, please send it to us!

? Other tools for whole genome sRNA target prediction are much faster and do not require previous assembly of homologs. Why should I use CopraRNA?

Truthfully, the runtime of CopraRNA is not excellent and sequence assembly can be tedious. However, the quality of the results outcompetes all other state of the art sRNA target prediction algorithms. Our results show that CopraRNA is even competitive with the insights gained from microarray analyses. The cost of additional runtime and previous data assembly is justified by the results being several orders of magnitude better than those computed by other algorithms. Furthermore, CopraRNA is free and fast when compared with microarrays. In some cases (e.g. GcvB) it allows a complete in silico characterization of a certain sRNA's function within the organism.

? Why are only organisms supported that are part of the RefSeq database?

In order to guarantee easy usability, CopraRNA requires a certain degree of consistency within the files that it accesses. RefSeq is in most cases a very reliable and consistent database that meets sensible consistency requirements. Find all CopraRNA compatible organisms in this list. Already more than 2000 organisms are CopraRNA compatible.

? Why does the target on rank 1 have a p-value = 0 ?

In some cases one of the putative target sequences is encoded on the complementary strand at the same genomic location as the sRNA. In these cases, the complementarity is perfect, which leads to extremely low IntaRNA energy scores and consequently to a p-value of 0. Usually this can be discarded as an artifact. However, in some cases it has been shown that sRNAs not only act in trans but also have cis regulatory effects, in which case a putative target with a p-value of 0 should not be disregarded.

? What are the fdr values and how to interpret them?

The fdr (false discovery rate) values are most easily explained with an example. Assume an fdr cutoff of 0.5. Statistically speaking, about 50% of all predicted targets in the list up to this cutoff are expected to be false positives. The fdr gives you an impression of how many incorrect predictions to expect up to a certain threshold. The fdr values are computed using the R function p.adjust with the method of Benjamini & Hochberg (1995).
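For readers who want to reproduce such values outside of R, the Benjamini-Hochberg adjustment can be sketched in plain Python. This is an illustrative re-implementation of the procedure, not the code CopraRNA runs (CopraRNA uses R's p.adjust); the function name is hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    n = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * n / rank)
        adjusted[index] = running_min
    return adjusted
```

Each adjusted value is the smallest fdr cutoff at which the corresponding target would still be included in the list.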

? When are sRNAs homologous? or Are the sequences I am inputting feasible for CopraRNA?

This is not a trivial question and is a subject of research in itself. Usually, if you find similar sequences of similar lengths with a BLAST search, it is highly likely that the sequences you found are homologous. Yet, if you don't find anything with BLAST, this doesn't mean there is nothing to find. In these cases we suggest that you resort to more sophisticated methods to find sRNA homologs, such as GotohScan. Nevertheless, there are cases in which no sRNA homologs exist. In these cases we suggest you resort to an IntaRNA whole genome target prediction.

? What are additional homologs?

Sometimes the clustering of homologous genes assigns several genes from one organism to the same cluster. In this case the analysis is only executed on the candidate with the best IntaRNA energy score. In order to prevent losing the other putative targets, they are added at the end as additional homologs.

? Are the predictions always good?

Even though we could show that CopraRNA predictions are mostly reliable for Enterobacteria, it is still an in silico method and not flawless. You should look at and think about the output and try to make sense of it, instead of blindly trusting the top list (p-value <= 0.01).

? Which putative targets should I take a closer look at?

Statistically speaking, all putative targets with a CopraRNA p-value <= 0.01 are interesting. Furthermore, putative targets that belong to an enriched functional term are interesting.

? Do single outlier organisms affect my results?

Even though we have included a root function in order to prevent overly strong effects of outlier organisms in the prediction, it is advisable not to use sets of organisms in which single organisms are very distant from all other organisms participating in the prediction.

? My prediction list contains a putative target gene more than once. Why?

During the process of clustering homologous genes, it sometimes happens that extremely similar clusters are generated which may only differ in one gene which is not part of your organism of interest. This may look like a duplication when only considering the organism of interest, while it really isn't a duplication. Having a closer look at the "cluster.tab" file from the results archive should clarify when and how this happens.

? Does CopraRNA work for all bacterial and archaeal phyla?

Extensive testing of CopraRNA predictions has so far only been done for enteric bacteria. However, the basic idea is not limited to this branch of microorganisms. It is highly likely that CopraRNA can produce predictions of the same quality for other phyla but it has not yet been experimentally proven.

? Is CopraRNA deterministic? It appears your precalculated results are not identical to the results presented in the publication. Why?

Due to the p-value sampling for clusters that do not contain genes from each participating organism, CopraRNA is not a deterministic algorithm. However, usually only slight differences between distinct analyses are to be expected.

? The putative targets are sorted in the reverse order in the regions plot when compared to the main result table. Which sorting should I trust?

The reverse sorting in the regions plots is due to our plotting script. This means that you should trust the initial sorting of the main result table.

? Can I download CopraRNA to run batch jobs on my local machine?

The source code for CopraRNA is available from our Software page.

List of Changes