CopraRNA is a tool for sRNA target prediction. It computes
whole genome predictions by combination of distinct whole genome IntaRNA
predictions. As input, CopraRNA requires at least 3 homologous sRNA
sequences from 3 distinct organisms in FASTA format. Furthermore, each
organism's genome has to be part of the NCBI Reference Sequence (RefSeq)
database (i.e. its ID has to be in the NZ_* or the NC_XXXXXX
format, where * stands for any character and X
stands for a digit between 0 and 9). Depending on sequence length (target
and sRNA), number of input organisms and genome sizes, CopraRNA can take
up to 24h or longer to compute (in most cases it is significantly faster). It is
suggested that you supply your email address and return when the job has finished.
As output, CopraRNA produces a CopraRNA p-value sorted list of putative
targets. Results can be viewed in the browser, but closer examination
of the downloadable data is suggested.
Precomputed results for Enterobacteria:
ArcZ,
ChiX,
CyaR,
DsrA,
FnrS,
GcvB,
MicA,
MicC,
MicF,
OxyS,
RprA,
RybB,
RyhB,
SgrS,
Spot42,
and for Non-enteric bacteria:
FsrA,
LhrA2,
PrrF1,
SR1,
IhtA.
Note, in contrast to this server, the stand-alone CopraRNA software
does not limit the problem size, provides enhanced functionality, and
offers a batch processing-friendly command line interface. For these reasons,
you might consider installing CopraRNA locally.
Introduction
When using CopraRNA please cite:
- Patrick R. Wright, Jens Georg, Martin Mann, Dragos A. Sorescu, Andreas S. Richter, Steffen Lott, Robert Kleinkauf, Wolfgang R. Hess, and Rolf Backofen
CopraRNA and IntaRNA: predicting small RNA targets, networks and interaction domains
Nucleic Acids Research, 2014, 42 (W1), W119-W123.
- Patrick R. Wright, Andreas S. Richter, Kai Papenfort, Martin Mann, Joerg Vogel, Wolfgang R. Hess, Rolf Backofen and Jens Georg
Comparative genomics boosts target prediction for bacterial small RNAs
Proc Natl Acad Sci USA, 2013, 110 (37), E3487-E3496.
- Martin Raden, Syed M Ali, Omer S Alkhnbashi, Anke Busch, Fabrizio Costa, Jason A Davis, Florian Eggenhofer, Rick Gelhausen, Jens Georg, Steffen Heyne, Michael Hiller, Kousik Kundu, Robert Kleinkauf, Steffen C Lott, Mostafa M Mohamed, Alexander Mattheis, Milad Miladi, Andreas S Richter, Sebastian Will, Joachim Wolff, Patrick R Wright, and Rolf Backofen
Freiburg RNA tools: a central online resource for RNA-focused research and teaching
Nucleic Acids Research, 2018, 46 (W1), W25-W29.
Results are computed with CopraRNA version 2.1.4 using IntaRNA 2.4.1
Overview
The following parameters are used to control the execution of CopraRNA
Furthermore, additional information is available
Sequence input
sRNA sequences
The central CopraRNA parameter is the selection of species, which
directly affects the prediction results. A small evolutionary
distance between the species favors sensitivity and a large distance favors
specificity. Hence, we suggest selecting as many sRNA homologs as possible from species with varying
evolutionary distance, as far as the set of species in which the
respective sRNA is conserved allows. For the benchmark we used a
blend of close, medium and more remotely conserved species (based on the
16S rDNA sequence, see Fig. S4 in the accompanying publication). In general,
the maximal evolutionary distance is given by the conservation of the sRNA,
which is often restricted to a phylum or a class.
CopraRNA accepts input in the form of a multiple FASTA file. A simple example looks like this:
>NC_000913
cccagagguauugauuggugaagucucucaugcgcagguuuuuuuu
>NC_011740
cccagagguauugauucggcacccgcggaugcgcagguuuuuuuu
>NC_003197
cccagagguauugauuggugagauuaggaugcgcagguuuuuuuu
Note that each FASTA header has to be a RefSeq ID of the corresponding organism!
To be CopraRNA compatible, an organism must be part of the NCBI Reference Sequence (RefSeq) database. Such an organism has one or several RefSeq IDs (depending on the existence of further replicons such as plasmids) in one of the following formats:
NC_XXXXXX where X stands for a digit between 0 and 9 (NC_000913 for E. coli)
NZ_* where * stands for any non-whitespace character (NZ_CP007542 for Synechocystis sp. PCC 6714)
Only one RefSeq ID has to be supplied for each organism. If you supply the ID of a plasmid, the prediction will also be executed on all other replicons of the organism. Vice versa, if you supply the ID of the major replicon, the prediction will also be carried out on all additionally available replicons. IDs such as NS_000191 are not valid.
To check whether the organisms you selected are CopraRNA compatible, check this list of RefSeq IDs. The list is regularly updated.
Please contact us if you know your organism is part of the RefSeq database and has an ID in the NZ_* or NC_XXXXXX format but is not present in this list, or is missing IDs. We can then run an update.
Input can be given either as direct text input or by uploading a FASTA file. The sequences you upload should be homologous to each other. If you have an sRNA sequence and are trying to find homologs, you can start by using BLAST. If you don't find anything with BLAST, there are more sophisticated methods for this task, such as GotohScan. It is also possible that there are no homologs for your sequence. In this case we suggest you resort to the IntaRNA whole genome target prediction webserver.
The parameter constraints are: The input has to be in valid FASTA format. The number of sequences has to be at least 3 and at most 50. Sequence lengths have to be in the range 7-750. The allowed sequence alphabet is 'ACGUTacgut'. Each FASTA sequence header/name has to match against the regular expression '^>\s*N[CZ]_\S+\s*'. Supported IDs can be found in CopraRNA_available_organisms.txt. Access to the NCBI server is needed.
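The formal constraints above can be checked locally before submitting a job. The following is a minimal Python sketch, not the server's own validation code; the function names parse_fasta and check_input are made up for illustration. It does not check whether an ID is in the list of supported organisms.

```python
import re

# Constraints as stated in the constraint list above.
HEADER_RE = re.compile(r"^>\s*N[CZ]_\S+\s*")
ALPHABET = set("ACGUTacgut")
MIN_SEQS, MAX_SEQS = 3, 50
MIN_LEN, MAX_LEN = 7, 750


def parse_fasta(text):
    """Return a list of (header, sequence) pairs from multiple FASTA text."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            records.append([line, ""])
        elif records:
            records[-1][1] += line
        else:
            raise ValueError("sequence data before the first FASTA header")
    return [(header, seq) for header, seq in records]


def check_input(text):
    """Return a list of problems; an empty list means all checks passed."""
    problems = []
    records = parse_fasta(text)
    if not MIN_SEQS <= len(records) <= MAX_SEQS:
        problems.append(
            f"{len(records)} sequences given, need {MIN_SEQS}-{MAX_SEQS}"
        )
    for header, seq in records:
        if not HEADER_RE.match(header):
            problems.append(f"header is not an NC_/NZ_ RefSeq ID: {header}")
        if not MIN_LEN <= len(seq) <= MAX_LEN:
            problems.append(
                f"{header}: length {len(seq)} outside {MIN_LEN}-{MAX_LEN}"
            )
        if set(seq) - ALPHABET:
            problems.append(f"{header}: characters outside 'ACGUTacgut'")
    return problems
```

Calling check_input on the three-sequence example above returns an empty list, while a header such as >NS_000191 is reported as a problem.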
Extract sequences around
This option allows you to select from which region of the mRNAs
you would like to retrieve your putative target sequences. Selecting
"start codon" selects regions upstream and downstream (see nt up, nt down)
relative to the start codon. The same logic holds if you select "stop codon".
nt up (1-300)
This parameter specifies the number of nucleotides (nt) upstream of your
start or stop codon (depending which one you selected). If you selected
start codon, and have prior knowledge about average 5'UTR lengths in your
input organisms then it is sensible to set nt up to this number in order
to increase prediction quality. The sum of nt up and nt down has to reach a
minimal threshold; see constraint list.
The parameter constraints are: Input value has to be parsable as Integer. The value must be smaller than or equal to 300 and must be greater than or equal to 1. The sum of nt up (1-300) plus nt down (1-300) has to be at least 150.
Defaults to (200)
nt down (1-300)
This parameter specifies the number of nucleotides (nt) downstream of your
start or stop codon (depending which one you selected). If you selected
stop codon, and have prior knowledge about average 3'UTR lengths in your
input organisms then it is sensible to set nt down to this number in order
to increase prediction quality. The sum of nt up and nt down has to reach a
minimal threshold; see constraint list.
The parameter constraints are: Input value has to be parsable as Integer. The value must be smaller than or equal to 300 and must be greater than or equal to 1.
Defaults to (100)
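The constraints on nt up and nt down interact through their sum. The following Python sketch illustrates the stated rules (each value in 1-300, sum at least 150); the function name check_window is hypothetical and the code is not taken from CopraRNA.

```python
def check_window(nt_up=200, nt_down=100):
    """Validate nt up / nt down and return the extracted window length."""
    for name, value in (("nt up", nt_up), ("nt down", nt_down)):
        # bool is a subclass of int in Python, so exclude it explicitly.
        if not isinstance(value, int) or isinstance(value, bool):
            raise ValueError(f"{name} must be an integer")
        if not 1 <= value <= 300:
            raise ValueError(f"{name} must be in 1-300, got {value}")
    if nt_up + nt_down < 150:
        raise ValueError("nt up + nt down must be at least 150")
    return nt_up + nt_down
```

With the defaults (200 and 100) the window is 300 nt long; a choice such as 50 and 50 is rejected because the sum is below 150.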
Organism of interest
Usually a user is especially interested in one specific organism. The
organism of interest that is selected takes a prime position in
the output display and post-processing. However, it does not change the internal
computations of the core CopraRNA algorithm. The online output can only be
viewed for the organism of interest. In the downloadable data, all organisms
are incorporated, but the functional enrichment of the top candidates is only
computed for the organism of interest.
Putative target sequences
These are the putative target sequences, extracted from the organism of interest's RefSeq file(s). In most cases their length is nt up + nt down.
CopraRNA parameters
Consensus prediction
Perform a CopraRNA 2 consensus prediction searching interaction regions
overlapping with the organism of interest's predicted interaction site.
p-value combination
Whether to do a dynamic p-value combination as introduced by CopraRNA-v2
or to perform a non-dynamic CopraRNA-v1 p-value combination.
The parameter constraints are: Input value has to be parsable as Boolean.
Defaults to (dynamic)
p-value filtering (0=off)
Post-processing filter for organism of interest p-value (0=off).
The parameter constraints are: Input value has to be parsable as Double. The value must be greater than or equal to 0 and must be smaller than or equal to 1.
Defaults to (0)
IntaRNA parameters
Target folding window size
Size of the averaging window in the local target
RNA folding (RNAplfold -W) for the computation of accessibilities.
Local folding is key to reasonable folding results when facing larger RNA molecules, since it minimizes effects of incorrect long-range predictions (see local folding article).
Note, the folding window size should be about 50nt higher than the max. basepair distance.
If set to 0, no sliding window is used and the full sequence length is considered.
The parameter constraints are: Input value has to be parsable as Integer. The value must be greater than or equal to 0.
Defaults to (150)
Target max. basepair distance
Maximal distance of two paired bases in the local
target RNA folding (RNAplfold -L) for computation of accessibilities.
Local folding is key to reasonable folding results when facing larger RNA molecules, since it minimizes effects of incorrect long-range predictions (see local folding article).
Note, max. basepair distance should be about 50nt less than the folding window size.
If set to 0, the sliding window size value is also used for base pair span restrictions.
The parameter constraints are: Input value has to be parsable as Integer. The value must be greater than or equal to 0.
Defaults to (100)
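The special meaning of 0 for both folding parameters can be summarized in a small Python sketch. It only restates the rules described above; the function name effective_folding is hypothetical, and how RNAplfold itself treats windows longer than the sequence is not modeled here.

```python
def effective_folding(seq_len, window=150, max_bp_dist=100):
    """Resolve the special value 0 for both local folding parameters.

    window == 0      -> no sliding window, the full sequence length is used
    max_bp_dist == 0 -> the (resolved) window size is used as base pair span
    """
    resolved_window = seq_len if window == 0 else window
    resolved_span = resolved_window if max_bp_dist == 0 else max_bp_dist
    return resolved_window, resolved_span
```

For a 300 nt target with the defaults this gives a window of 150 and a span of 100, which matches the suggested gap of about 50 nt between the two values.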
Output Description
Main result:
The main CopraRNA result is a table of target candidates for the entered
homologous sRNAs, sorted by CopraRNA p-value. The data displayed
on the output page of the webserver is limited
compared to the downloadable data. For this reason we
suggest you download the results for closer inspection.
Positions of interactions:
The positions of the interactions are not relative to the start or stop
codon, but rather absolute positions within
your sRNA/mRNA sequences. For example, if you were to extract sequences
200 nt upstream and 100 nt downstream of the start codon, your start codon
is located at positions 201, 202 and 203.
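To relate a reported mRNA position to the codon, subtract the upstream length. The Python sketch below assumes 1-based positions (as implied by the example positions 201 to 203) and assumes that the full nt up region could be extracted; the function name relative_to_codon is made up for illustration.

```python
def relative_to_codon(position, nt_up=200):
    """Convert a 1-based position in the extracted sequence to an offset
    from the first nucleotide of the start/stop codon.

    0 is the first codon nucleotide; negative values lie upstream.
    """
    return position - (nt_up + 1)
```

With nt up set to 200, positions 201, 202 and 203 map to offsets 0, 1 and 2, and position 181 lies 20 nt upstream of the codon (offset -20).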
Annotation:
The annotation is retrieved from the RefSeq genome files.
Additional homologs:
In some cases, genes from the same organism can be part of the same
cluster of targets. In these cases only the sequence with the best IntaRNA
energy score participates in the calculation of the CopraRNA p-value.
To ensure that no potential targets are lost because of this,
the additional homologs are added for the organism of interest.
Regions plots:
These plots are meant to give you an overview of the regions in the
target and sRNA sequences that play predominant roles in the statistically
significant interactions. The density plot at the top of the image is
calculated from all predicted interactions with a CopraRNA p-value <= 0.01,
while the interactions displayed at the bottom of the image are shown for
the top 20 predicted targets. The different coloring contains no
information and is purely intended to increase contrast between different
genes.
Functional Annotation Chart:
The top 100 targets of the comparative CopraRNA prediction, which
have homologs in the organism of interest, have been subjected to functional
enrichment. The heatmap shows all members of clusters with a
DAVID
enrichment score >= 1 in a specific color. Each row represents a gene and
each column a specific functional term. If the gene can be assigned to a
term, the corresponding square is filled/colored. Closely related terms are
assigned to a cluster and have the same color. The opacity of the
color depends on the p-value of the CopraRNA prediction. A more intense
color represents a more significant p-value. The "Fold enrichment" is
given in front of the term descriptions. It gives the enrichment of a
term in the prediction group in relation to the whole genome background
(e.g. a term with an enrichment of 10 contains 10 times more genes
belonging to the respective term than the background). The enrichment
scores give a measure of the biological significance of the cluster. A
higher score represents a more statistically significant enrichment. The
publication on the
DAVID
webserver suggests investigating clusters with an
enrichment score of >= 1.3.
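As a worked illustration of the "Fold enrichment" value, the Python sketch below uses the usual ratio of proportions: the share of genes with a term among the top targets divided by the share in the background. This formula is an assumption about how the value is derived, and the function and parameter names are hypothetical.

```python
def fold_enrichment(list_hits, list_total, background_hits, background_total):
    """Ratio of the term's frequency in the target list to its frequency
    in the background (assumed definition, see lead-in)."""
    if min(list_total, background_hits, background_total) <= 0:
        raise ValueError("totals and background hits must be positive")
    # Same as (list_hits / list_total) / (background_hits / background_total),
    # written with a single division to limit rounding error.
    return (list_hits * background_total) / (list_total * background_hits)
```

For example, if 10 of the top 100 targets carry a term that 40 of 4000 background genes carry, the term is 10-fold enriched.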
Interactions:
The interaction you see on the webserver is the interaction calculated
by IntaRNA for the specific candidate you are viewing (the highlighted
line in the table). Single interactions can be downloaded for further use.
For additional information on how the RNA interactions are computed,
please refer to the IntaRNA publication.
Downloadable files:
Main CopraRNA result:
This is a CopraRNA p-value sorted, comma separated table (*.csv),
containing all the results for all organisms entered in
the analysis. Each column, named by a RefSeq ID, represents
the prediction for one organism.
The column 'sampled p-values' reports how many IntaRNA p-values were sampled for a specific
gene cluster. A high number in this cell can be an indicator of a false positive.
The other columns should be self-explanatory.
See explanation of additional homologs
further up in this help. Each line represents one cluster of
homologous genes within the organisms entered in the analysis.
The content of the cells follows this scheme:
locus_tag(gene name|IntaRNA energy score|IntaRNA p-value|pos. start mRNA|pos. stop mRNA |pos. start sRNA|pos. stop sRNA|Entrez GeneID)
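A cell following this scheme can be split into its fields with a few lines of Python. This is a sketch under the assumption that the locus tag contains no parentheses and that no field contains a "|" character; the function name parse_cell and the dictionary keys are made up for illustration.

```python
import re

# locus_tag(...) with eight '|'-separated fields inside the parentheses.
CELL_RE = re.compile(r"^(?P<locus_tag>[^()]+)\((?P<fields>.*)\)$")
FIELDS = (
    "gene_name", "energy", "p_value",
    "mrna_start", "mrna_stop", "srna_start", "srna_stop", "gene_id",
)


def parse_cell(cell):
    """Split one table cell into a dictionary of typed fields."""
    match = CELL_RE.match(cell.strip())
    if match is None:
        raise ValueError(f"cell does not follow the scheme: {cell!r}")
    values = [value.strip() for value in match.group("fields").split("|")]
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    record = dict(zip(FIELDS, values))
    record["locus_tag"] = match.group("locus_tag").strip()
    record["energy"] = float(record["energy"])
    record["p_value"] = float(record["p_value"])
    for key in ("mrna_start", "mrna_stop", "srna_start", "srna_stop"):
        record[key] = int(record[key])
    return record  # gene_id is kept as text


# Invented values, for illustration only (not a real prediction).
example = parse_cell("tag1(geneA|-10.5|0.0012|180|195 |40|55|GeneID:1)")
```

Empty cells (organisms without a homolog in the cluster) would need to be skipped before calling the parser.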
Functional Enrichment:
This file contains the
DAVID
functional enrichment result
for the top 100 target candidates of CopraRNA.
A certain term appears as enriched, if it is significantly
overrepresented in the top list when compared to the background.
The background in this case consists of all genes for which there is
a prediction (not the entire set of genes of an organism).
Enrichment scores of 1.3 and higher suggest
statistical significance. However, enrichments also strongly
depend on the quality of the annotation of the entered organism of interest.
The file is tab delimited. This result is only calculated for
the organism of interest.
Auxiliary enrichment file:
Due to strong organism specificity, some targets may not be detectable
using the comparative approach. In order to alleviate this problem,
functional enrichments are computed for the top 100 CopraRNA predictions
and the top 100 IntaRNA predictions for the organism of interest. In order
to detect as many targets as possible, the CopraRNA enrichment is compared
with the IntaRNA enrichment. Identical functional terms are compared and
putative targets that are not reported in the CopraRNA enrichment but are
reported in the IntaRNA enrichment for the same term, are stored in the
auxiliary enrichment file.
Phylogenetic conservation:
Conservation of the sRNA/target interactions
for the top 25 predicted targets. The columns of the heatmap represent the
investigated organisms, the rows the respective targets. The
columns are ordered according to a UPGMA tree based on the sRNA
sequences. The cells are colored based on the respective
IntaRNA p-value. Shades of green are used for p-values ≤ 0.3 and
salmon for p-values > 0.3. If an organism has, according to the
domclust clustering of CopraRNA, no homolog of a given target, the
cell is colored white. The cell labeling gives information about
the position of the respective interaction site in mRNA and sRNA.
If both sites are in agreement with the consensus interactions, the
cell is not labeled. If either the mRNA site, the sRNA site, or
both do not match with the consensus site, the cell is labeled in
red with '-/+', '+/-' or '-/-', respectively. If the optimal
prediction does not match the mRNA consensus site, but the first
suboptimal prediction does match, the p-value of the suboptimal
prediction is used for the cell coloring. The usage of the
suboptimal prediction is indicated by a '2' in the cell labeling.
'2' means both the interaction in mRNA and sRNA of the suboptimal
prediction match the consensus, '2+/-' means that only the
interaction site in the mRNA matches the consensus.
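The coloring and labeling rules of the heatmap can be restated as a small Python sketch. It is an illustration of the description above, not the plotting code of CopraRNA: the function name heatmap_cell is hypothetical, the shades of green are reduced to a single "green", and None is used to signal a missing homolog.

```python
def heatmap_cell(p_value, mrna_match=True, srna_match=True, suboptimal=False):
    """Return (color, label) for one heatmap cell.

    p_value    -- IntaRNA p-value used for coloring, or None if the
                  organism has no homolog of the target
    mrna_match -- mRNA site agrees with the consensus site
    srna_match -- sRNA site agrees with the consensus site
    suboptimal -- the first suboptimal prediction was used
    """
    if p_value is None:
        return "white", ""
    color = "green" if p_value <= 0.3 else "salmon"
    if mrna_match and srna_match:
        label = ""
    else:
        # The label order is mRNA/sRNA: '-/+' marks an mRNA mismatch.
        label = f"{'+' if mrna_match else '-'}/{'+' if srna_match else '-'}"
    if suboptimal:
        label = "2" + label
    return color, label
```

For instance, a suboptimal prediction with p-value 0.2 that matches only the mRNA consensus site yields a green cell labeled '2+/-'.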
Regions plots:
These are the same as the ones displayed on the webserver.
They can be downloaded in PostScript, PDF and PNG format.
16S rDNA tree:
The tree is constructed with the
neighbor-joining method and based on the 16S rDNA sequences of the
respective organisms. Based on the annotation, the 16S sequences
are retrieved from each organism's RefSeq genome file.
The tree is provided both in the NEWICK text and the SVG image format.
The guide tree view is generated using the
Newick Utilities.
Input Examples
5 ChiX sequences (~2h)
The example's result can be directly accessed here
List of Changes
- 4.5.6 : supported organisms updated (2018-07-20)
- 4.5.2 : CopraRNA v2.1.2 online : automated clustering
- 4.1.1 : CopraRNA v2.1.0 online : new CopraRNA2 features enabled
- 4.0.9 : CopraRNA v2.0.3.2 online : fixed issue where RefSeq IDs that were longer than 11 characters caused jobs to fail
- 4.0.7 : CopraRNA v2.0.3.1 online : changed DAVID-WS from perl to python client
- 4.0.0 : CopraRNA v2.0.3 online : using local mirrors of old NCBI ID system for compatibility if available
- 3.4.3 : CopraRNA v2.0.2 online : support of new NCBI ID system
- 3.4.1 : CopraRNA v2.0.1 online : Iterative organism subset analysis enabled. Auxiliary enrichment output added. Minimal relative cluster size parameter added. IntaRNA parameters changed to -w 150 -L 100.
- 3.3.8 : CopraRNA v1.3.0 online : Potential outlier detection; evolutionary tree visualization; minor bugfix in weight calculation.
- 3.3.5 : CopraRNA v1.2.9 online : Now using (Benjamini&Hochberg, 1995) for false discovery rate (fdr) estimation. Fixed issue where trees with branch lengths of zero would cause job failures.
- 3.3.2 : CopraRNA v1.2.8 online : Fixed the issue where jobs with input organisms with exactly the same 16S sequences would fail
- 3.2.3 : CopraRNA v1.2.7 online : Reimplementation of p-value joining (runtime reduction); Minor bugfix for heatmap drawing and regions plots
- 3.2.2 : CopraRNA v1.2.6 online : Added heatmap pdf output
- 3.2.0 : CopraRNA v1.2.5 online : Added functional enrichment heatmaps
- 3.1.3 : CopraRNA v1.2.4 online : Changed DomClust parameters to standard MBGD parameters
- 3.1.2 : CopraRNA v1.2.3 online : BLAST speedup
- 3.0.2 : CopraRNA v1.2.2 online : Fixed issue with organism: 'sfd'
- 3.0.1 : CopraRNA v1.2.1 online : RefSeq files now being downloaded from NCBI FTP
- 2.7.5 : non-enteric examples added