SINEBase

SINEBase is a database of short interspersed elements and tools for their identification and analysis. It is intended to:

provide data about SINE families known to date;
attribute individual SINE sequences to known SINE families;
analyse sequences of known and unknown SINE families.

SINEBase is not intended to:

identify LINEs and other classes of repetitive genomic elements; at the same time, it can be used to discriminate between SINE and non-SINE sequences;
identify new (unknown) SINE families; instead, it can confirm that the sequence analyzed does not belong to known SINE families;
identify (most) SINE subfamilies.

It is not an automated tool and requires an understanding of the basic concepts of SINEs. We also recommend following the protocol for the analysis of putative SINE sequences.

Definitions

Retro(trans)posons are genetic elements that can amplify themselves in eukaryotic genomes via an RNA intermediate, which requires their transcription and reverse transcription. Retroposons are divided into three classes: LTR elements, LINEs, and SINEs. The elements that encode reverse transcriptase (RT), an enzyme providing for the reverse transcription and integration of the DNA copy into the genome, are called autonomous transposons. Nonautonomous retroposons rely on the RTs of autonomous transposons. LTR transposons and LINEs can be autonomous or nonautonomous; and their genomic copies are transcribed by the cellular RNA polymerase II.

Length distribution for 175 eukaryotic SINE families (without tail)

Short interspersed elements (SINEs) are defined as relatively short (< 700 bp) nonautonomous retroposons transcribed by the cellular RNA polymerase III (pol III) from an internal promoter, while their reverse transcription depends on the RT of partner LINEs. Eukaryotic genomes can harbor hundreds of thousands (sometimes more) of SINE copies; copies originating from a common ancestral SINE can differ from each other by single-nucleotide alterations as well as by longer internal deletions or duplications (SINEs with such duplications are called quasidimeric). Some of these copies can become founders of new SINE subfamilies.

SINEs consist of two or more modules; typically, head, body, and tail. The 5'-terminal head originates from the cellular RNAs synthesized by pol III: tRNA, 7SL RNA, or 5S rRNA. The origin of the body is either unknown or it descends from a partner LINE. SINEs with a LINE-derived region mimic LINE RNA in the reverse transcription (such SINEs belong to the stringent group). The body can also contain a domain shared by distant SINE families (CORE and similar domains). The 3'-terminal tail is a sequence of variable length consisting of simple (often degenerate) repeats. In addition, two SINEs can combine into a dimeric SINE, thus giving rise to a new SINE family. SINEs consisting of the head and tail only are called simple, while dimeric, trimeric, etc. SINEs are called complex.

We consider SINEs as:

short (<1 kb) interspersed (nontandem) genomic repeats
present in at least 100 copies per genome (except certain genomes where repetitive elements are not abundant, e.g., Arabidopsis thaliana)
with at least 60% identity with a tRNA species, 5S rRNA, or 7SL RNA in at least 60-nt overlap (unless the element transcription by pol III was confirmed experimentally). The identification of pol III promoters (e.g., boxes A and B) can serve only as an indication (but not a proof) that the sequence belongs to SINEs.

SINEs should be distinguished from RNA pseudogenes: the pseudogenes are generated by the reverse transcription of the cellular RNAs (e.g., 5S rRNA) rather than of SINE RNAs transcribed from their genomic copies. In practical terms, most SINEs have extra (body) sequences, while simple SINEs have characteristic substitutions/indels shared with their source gene but not with the cellular RNA gene. In addition, SINEs significantly outnumber RNA pseudogenes.

The notion of ‘SINE family’ is widely used but not clearly defined. We consider a SINE family to be a set of SINEs

of a common origin and
consisting of the same modules in the same order (except the tail, which can vary even in the same species).

Thus, similar SINEs with different LINE-derived regions belong to different families. Long insertions are considered as modules. At the same time, internal deletions or duplications within modules do not give rise to a new family, although a combination of complete or almost complete SINEs (complex SINEs) is considered a new family (thus, pB1 and quasidimeric B1 are subfamilies of the same family, while dimeric Alu represents a distinct family). Finally, there are a few SINEs with quite similar structure but of independent origin (certain simple SINEs), which are considered different families.

The database of SINE families is presented as a Table integrating the following data:

SINE is the SINE family name linking to the consensus sequence; synonyms, previous names, and RepBase IDs (if different) are given in parentheses.
Length is the length of SINE family consensus without tail.
Taxon is the high-rank taxon (~class) where SINE occurs.
Distribution is the taxon limiting SINE distribution.
Copy number is the number of SINE copies per haploid genome.
Structure is the schematic structure of SINEs (tRNA, 5S, and 7SL are SINE heads derived from tRNA, 5S rRNA, and 7SL RNA, respectively; CORE is CORE, Deu-, V-, Ceph-, α-, or β-domains; ??? corresponds to body parts of unknown origin; LINE is a long interspersed element-derived body region; and ‘~~~’ denotes the tail).
tRNA shows human tRNA genes with >70% (black), >75% (green), or >80% (red) identity with SINE family consensus in at least 60-nt overlap; in complex SINEs, similar tRNAs for each tRNA-derived monomer are separated by dotted lines.
LINE indicates the putative partner LINE; and the LINE clade is specified in square brackets (the partners were identified by the similarity with the 3’-terminal regions of SINEs and LINEs except mammalian L1, which was identified by the A-rich tail).
Tail is the repeat unit of the tail (sometimes degenerate; in particular, ‘A’ and ‘AT’ correspond to A- and AT-rich sequences).
Features describe specific features of SINE structure (CORE, Deu-, V-, Ceph-, α-, or β- domains; complex or simple structure; T⁺ class SINEs; TC-stretches; absence of target site duplications (TSD^–); etc.).
Refs include links to the original/key publications.

The Table contents can be filtered and sorted in many ways. For example, you can limit it to mammalian SINE families with a LINE-derived region and sort them by length (see Tips & Tricks).

Recommended protocol

Make sure that the genomic element analyzed is repetitive and nontandem. Try to evaluate the number of copies per genome if long genomic sequences are available. This may be less important when the sequence analyzed belongs to a species where presumably all SINEs have been described (e.g., mouse).
Define the boundaries of the element. Usually, these boundaries are clearly seen on SINE multiple alignments where similarity ends. Another way to define the limits of an individual SINE sequence is to find (degenerate) short direct repeats (commonly 8-16 nt) generated in the course of SINE reverse transcription/integration. The SINE sequence should lie between these repeats called target site duplications (TSDs). TSDs can be identified using our TSDSearch tool. Exclude the flanking sequence from further analysis. Truncate very long tails; 10-20 nt is enough. Whenever possible, use consensus rather than individual sequences.
If the element is longer than 1 kb, it is not a SINE. You can try to identify it by searching for similarities with other transposons.
Run SINESearch against the SINEBase using ~90% of the element length as the Min overlap length. If the search was successful and (i) there were no long gaps in the alignment and (ii) the lengths of the query element and the hit consensus sequence in the SINEBase are similar, the genomic element analyzed can be assigned to the found SINE family. If the search was not successful, try to slightly decrease the Min overlap length. If this does not help, proceed to module analysis.
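The choice of the starting Min overlap length and its gradual relaxation can be sketched as follows. This is only an illustration: the function name, the number of attempts, and the 5% step are assumptions (the protocol only says to decrease the value "slightly"), while the 30-1000 nt limits are those of the SINESearch input field.

```python
def overlap_schedule(element_length, attempts=3, step_percent=5,
                     lowest=30, highest=1000):
    """Return Min overlap values to try, starting at ~90% of the element length.

    `attempts` and `step_percent` are hypothetical choices made for this sketch.
    """
    values = []
    for i in range(attempts):
        percent = 90 - i * step_percent
        value = element_length * percent // 100
        # Keep the value inside the range accepted by SINESearch (30-1000 nt).
        values.append(max(lowest, min(highest, value)))
    return values


print(overlap_schedule(300))  # values to try for a 300-nt element
```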

If the studied element consists of an RNA-derived region and a tail only, it may be an RNA pseudogene. Simple SINEs can be identified by characteristic substitutions/indels shared with their source SINE copy but not with the cellular RNA gene. In addition, SINEs significantly outnumber RNA pseudogenes.

Module analysis aims to identify the individual modules of a putative SINE.
- Run SINESearch against the RNABase with 60% identity and 60 nt overlap. The absence of hits strongly indicates that the element analyzed is not a SINE.
- Exclude the whole RNA-derived region and run SINESearch with the remainder sequence against RNABase (complex SINEs contain two or more RNA-derived regions), COREBase, and LINEBase in an attempt to identify known SINE modules. A search against SINEBase can also give a clue to the module nature. Adjust the search parameters to correspond to the query sequence and bank; try to decrease the values if the search was negative.
- Exclude identified module(s) and repeat the previous step.
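The iterative logic of module analysis can be sketched as a loop. This is a rough sketch under stated assumptions: `search` is a hypothetical callable standing in for a manual SINESearch run (it is not a SINEBase API) that returns `(name, start, end)` for the best hit or `None`; modules are assumed to be found in 5' to 3' order, so everything up to the end of a hit is excluded before the next round.

```python
def module_analysis(sequence, search,
                    banks=("RNABase", "COREBase", "LINEBase"), max_rounds=10):
    """Identify modules one by one; return (modules, unassigned remainder).

    `search(seq, bank)` is a hypothetical stand-in for a SINESearch run.
    """
    modules = []
    remainder = sequence
    for _ in range(max_rounds):
        found = None
        for bank in banks:
            hit = search(remainder, bank)
            if hit is not None:
                found = (bank, hit)
                break
        if found is None:
            break  # nothing recognizable is left
        bank, (name, _start, end) = found
        modules.append((bank, name))
        # Exclude the identified module (and anything upstream) and repeat.
        remainder = remainder[end:]
        if not remainder:
            break
    return modules, remainder
```

An empty `modules` list after the first round corresponds to the case where the RNABase search gives no results, that is, the element is most likely not a SINE.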

Note that SINEs of the same family have the same modules in the same order; at the same time, they can have relatively small deletions or internal duplications. Neither the tail length nor even the tail sequence is a marker of SINE families.

SINESearch

SINESearch is a FASTA-based search tool that uses simple parameters to select sequences of interest instead of FASTA's internal statistical significance test. This obviates two limitations of FASTA (as well as BLAST etc.) in the case of relatively short and degenerate similarities between nucleotide sequences of SINEs:

bias to short (almost) perfect matches, while the goal is to find full-length and significant similarities, and
missing significant hits when the bank includes many sequences similar to query.

Finally, the parameters used, overlap length and sequence identity, are biologically sensible and allow easy adjustment of hit selection.
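The idea of selecting hits by identity and overlap rather than by a statistical score can be illustrated with a few lines of code. This is a sketch only, not the SINESearch implementation: the `Hit` type and `select_hits` function are hypothetical, and inclusive thresholds are assumed. Sorting by the best fit coefficient is omitted because its formula is not given here.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    name: str
    identity: float  # percent identity within the overlap
    overlap: int     # overlap length, nt


def select_hits(hits, min_identity=65.0, min_overlap=70):
    """Keep hits that satisfy both thresholds (SINESearch default values)."""
    return [h for h in hits
            if h.identity >= min_identity and h.overlap >= min_overlap]


hits = [
    Hit("short_perfect", 100.0, 25),   # short near-perfect match: rejected
    Hit("long_degenerate", 71.0, 210), # long degenerate similarity: kept
    Hit("too_divergent", 55.0, 180),   # below the identity threshold: rejected
]
print([h.name for h in select_hits(hits)])
```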

SINESearch is simple to use and fast. Specify the search parameters, bank to search, and query sequence, and press the ' Submit Query ' button to start the search. If an error message appears, press the ' Back to Previous Page ' button, enter correct data (the ' Reset All ' button can be used to reset all fields), and press the ' Submit Query ' button again. The results are sorted by the best fit coefficient (reflecting correspondence between the total lengths of the sequences and the overlap length; note that it does not directly depend on the sequence identity). However, the results can be sorted by other parameters (sequence name, identity, or overlap) by clicking on the marked column headers. In the case of the SINEBase bank, the output contains links to the SINE Table, where you can find details about the SINE families found. If you are not satisfied with the results, try to adjust the parameters or redefine the query sequence limits.

SINESearch input fields:

Sequence identity. Allowed range: 40-100%. Generally, it is not recommended to decrease this value below 65%. The default value is 65% but it automatically changes to 60% for the RNA banks.

Min overlap length. Allowed range: 30-1000 nt. Use 90% of the query sequence length as the starting point; 60 nt is recommended for the RNABase bank; use common sense: some modules can be as short as ~30 nt, while others are longer. The default value is 70 nt but it automatically decreases to 60 and 40 nt for the RNA and CORE banks, respectively.

Sequence Banks.

SINEBank is our bank of consensus sequences of SINE families. The consensus sequences specify the source and some other significant information (such as the previous name) in square brackets, which is followed by the distribution range (as in the SINETable).
RNABank is the database of human tRNA species (tRNAdb 2009) plus 7SL RNA and 5S rRNA.
Plant tRNABank is the database of Arabidopsis thaliana tRNA species (tRNAdb 2009).
LINEBank is our bank of SINE consensus sequences derived from partner LINEs.
COREBank is our bank of consensus sequences of central domains (CORE, Deu-, V-, Ceph-, α-, and β-domains).
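The default values and allowed ranges described above can be summarized in code. This is a minimal sketch; the function names are hypothetical, and treating Plant tRNABank as one of "the RNA banks" is an assumption.

```python
IDENTITY_RANGE = (40, 100)   # percent
OVERLAP_RANGE = (30, 1000)   # nt
RNA_BANKS = {"RNABank", "Plant tRNABank"}  # assumption: both count as RNA banks


def default_parameters(bank):
    """Return (sequence identity %, min overlap nt) defaults for a bank."""
    identity, overlap = 65, 70
    if bank in RNA_BANKS:
        identity, overlap = 60, 60
    elif bank == "COREBank":
        overlap = 40
    return identity, overlap


def validate_parameters(identity, overlap):
    """Raise ValueError if a value is outside the allowed range."""
    if not IDENTITY_RANGE[0] <= identity <= IDENTITY_RANGE[1]:
        raise ValueError("sequence identity must be 40-100%")
    if not OVERLAP_RANGE[0] <= overlap <= OVERLAP_RANGE[1]:
        raise ValueError("min overlap length must be 30-1000 nt")
    return identity, overlap
```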

Sequence Entry. The query sequence can be entered manually (typed or pasted) or uploaded from a local file. The sequence must be in FASTA format. Only IUPAC nucleotide symbols are allowed. The maximum sequence length is 2 kb. Multiple query sequences are not supported. Either enter or upload a query sequence; when a sequence file is uploaded the manual sequence entry field is disabled and vice versa (use the ' Clear ' buttons to reset the file upload box or the manual sequence entry fields; ' Reset All ' resets all fields). The query sequence can be shortened from both ends using the Offset parameters (notice that the numbering of the full-length sequence is preserved).
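A query can be pre-checked locally before it is submitted. The sketch below is not the server-side validation; the function names are hypothetical, "2 kb" is assumed to mean 2000 symbols, and the offsets are assumed to be the numbers of nucleotides trimmed from each end.

```python
IUPAC = set("ACGTURYSWKMBDHVN")


def parse_single_fasta(text, max_length=2000):
    """Parse one FASTA record and check the SINESearch input requirements."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("the sequence must be in FASTA format")
    if any(ln.startswith(">") for ln in lines[1:]):
        raise ValueError("multiple query sequences are not supported")
    name = lines[0][1:].strip()
    seq = "".join(lines[1:]).upper()
    bad = set(seq) - IUPAC
    if bad:
        raise ValueError("non-IUPAC symbols: " + "".join(sorted(bad)))
    if len(seq) > max_length:
        raise ValueError("the sequence is longer than 2 kb")
    return name, seq


def apply_offsets(seq, left=0, right=0):
    """Trim both ends; also return 1-based coordinates in the full-length numbering."""
    end = len(seq) - right
    return seq[left:end], left + 1, end
```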

TSDSearch

TSDSearch is a tool to search for relatively short target site duplications of genomic DNA that commonly frame retrotransposons including SINEs. Found repeats are shown as arrows below the sequence ruler and as sequences with coordinates. TSDs are sorted by a compromise between total length and length of matches, so that the 'best' TSDs are listed first.

Technically, TSDSearch is implemented in JavaScript, which means that the calculation is performed by your web browser/computer. Accordingly, JavaScript must be enabled in your browser. Execution time varies significantly between different browsers. Overall, more recent browsers execute TSDSearch faster. Chrome showed the best performance among the popular browsers tested.

Search Area. A typical task in SINE analysis is to identify TSDs framing SINE sequence(s). In this context, the region where TSDs are searched should include neither the SINE sequence proper nor areas too distant from it. Blind analysis of the whole region not only substantially increases the computation time but also complicates data interpretation. Setting the 5' and 3' offset values as well as the lengths of regions to analyze (ranges) makes it possible to focus on the desired areas. Bear in mind that the tails can substantially vary in length, so it is a good practice to increase the 3' range relative to the 5' range. On the other hand, the 5' range has the greatest impact on calculation time, so do not increase this value unless necessary.

TSD parameters. The search algorithm considers a TSD as three blocks of nucleotides (subrepeats) identical between the 3' and 5' TSDs (i.e., subrepeat 1 in 5' TSD is identical to subrepeat 1 in 3' TSD etc.). Subrepeats can be separated by variable spacers. For instance, 5' TSD: (ACCT)a(GGG)(TAC) and 3' TSD: (ACCT)(GGG)ac(TAC); subrepeats are shown in parentheses and spacers are in lowercase. Subrepeats cannot be shorter than the subrepeat min lengths specified, while spacers cannot be longer than the max length of spacers. Min length of total match (without gaps) and max total mismatch length allow fine tuning of TSD length and similarity, respectively. Finally, the number of displayed TSDs for a query sequence is limited by the max number of TSDs (specify '1' to show the best TSD only).
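To make the roles of the search area and of direct repeats concrete, here is a deliberately simplified sketch. It is not the TSDSearch algorithm: it looks for a single exact repeat (one block, no spacers or mismatches) instead of three subrepeats. The function names are hypothetical, and it is assumed that offsets are measured from the sequence ends and that each range immediately follows its offset.

```python
def search_windows(seq, offset5, range5, offset3, range3):
    """Return ((start, end) of the 5' window, (start, end) of the 3' window)."""
    n = len(seq)
    if offset5 + range5 + offset3 + range3 >= n:
        raise ValueError("sequence must be longer than the sum of offsets and ranges")
    return (offset5, offset5 + range5), (n - offset3 - range3, n - offset3)


def best_exact_tsd(seq, offset5, range5, offset3, range3, min_length=8):
    """Longest exact repeat shared by the two windows: (repeat, pos5, pos3) or None."""
    (l0, l1), (r0, r1) = search_windows(seq, offset5, range5, offset3, range3)
    left, right = seq[l0:l1], seq[r0:r1]
    best = None  # (length, 0-based start in seq of the 5' copy, of the 3' copy)
    prev = [0] * (len(right) + 1)
    for i, a in enumerate(left, 1):
        # cur[j] = length of the common run ending at left[i-1] and right[j-1]
        cur = [0] * (len(right) + 1)
        for j, b in enumerate(right, 1):
            if a == b:
                cur[j] = prev[j - 1] + 1
                if best is None or cur[j] > best[0]:
                    best = (cur[j], l0 + i - cur[j], r0 + j - cur[j])
        prev = cur
    if best is None or best[0] < min_length:
        return None
    length, pos5, pos3 = best
    return seq[pos5:pos5 + length], pos5, pos3
```

The nested loop runs over every pair of positions in the two windows, which illustrates why the window sizes directly affect the calculation time.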

Sequence Entry. The query sequence can be entered manually (typed or pasted) or uploaded from a local file (if you use a recent browser that supports HTML5 file operations). The sequence must be in FASTA format. Only A, C, G, and T nucleotide symbols are allowed in the search area. Notice that U, N, and X are not allowed. Gaps ('~', '–', & ' ') are allowed and ignored. Multiple query sequences can be analyzed, but all sequences must have unique names. For obvious reasons, all sequences should be longer than the sum of the left and right offsets and ranges. Either enter or upload a query sequence; when a sequence file is uploaded the manual sequence entry field is cleared and vice versa (use the ' Clear ' buttons to reset the file upload box or the manual sequence entry fields; ' Reset All ' resets all fields).
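These input rules can be checked before running the search. The sketch below is an illustration rather than the actual TSDSearch code: the function name is hypothetical, the ASCII hyphen is assumed to be an accepted gap symbol alongside those listed, and, for simplicity, the whole sequence (not only the search area) is required to consist of A, C, G, and T.

```python
GAP_SYMBOLS = ("~", "–", "-", " ")  # the ASCII hyphen is an assumption


def parse_tsd_queries(text):
    """Parse a multi-FASTA text into {name: gap-free uppercase sequence}."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            if name in records:
                raise ValueError("duplicate sequence name: " + name)
            records[name] = []
        elif name is None:
            raise ValueError("the sequence must be in FASTA format")
        else:
            records[name].append(line)
    if not records:
        raise ValueError("no sequences found")
    cleaned = {}
    for name, chunks in records.items():
        seq = "".join(chunks).upper()
        for gap in GAP_SYMBOLS:
            seq = seq.replace(gap, "")  # gaps are allowed and ignored
        bad = set(seq) - set("ACGT")
        if bad:
            raise ValueError(name + ": symbols not allowed: " + "".join(sorted(bad)))
        cleaned[name] = seq
    return cleaned
```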

All parameters and sequences are checked prior to TSD search. Analysis will not start if any parameter or sequence does not conform to the requirements. In this case, balloon message(s) appear near the field to be corrected.

Data Submission

We encourage the submission of new data on SINE families. Please, make sure that your SINE complies with the requirements and provide all necessary information, which includes the submitter's data (name, affiliation etc.), SINE data, and publication (if any). SINE data includes the SINE family name, consensus sequence, taxonomic distribution, copy number, tail repeat unit, and comments.

Please, avoid 'SINE' in the SINE family name; we recommend the first letters of the taxon limiting its distribution (e.g., Gli-1 for a SINE found in dormice (Gliridae) rather than GliSINE1, SINE2_DOR, etc.).
Only IUPAC nucleotide symbols are allowed in the consensus and tail sequences. The consensus sequence should contain from 60 to 1000 symbols.
Please, evaluate the taxonomic range and copy number of the family. Even rough estimates are better than nothing. Reports based on a single sequence are unacceptable and will not be considered.
Specify any details you wish in the comments field.
You may send any supplemental data (e.g., a multiple alignment or a PDF of the publication) as an attachment file. Please, provide a description of the attachment in the comments field. Do not send files larger than 5 MB or executables (.exe, .com, etc.; potentially dangerous files will cause the submission to be treated as spam).
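Some of these requirements can be checked before filling in the form. The following sketch is only an illustration of the rules listed above; the function name is hypothetical and it does not reproduce the actual form validation.

```python
import re

IUPAC_RE = re.compile(r"[ACGTURYSWKMBDHVN]+", re.IGNORECASE)


def check_submission(name, consensus):
    """Return a list of problems found in the family name and consensus."""
    problems = []
    if "sine" in name.lower():
        problems.append("avoid 'SINE' in the family name")
    if not IUPAC_RE.fullmatch(consensus):
        problems.append("only IUPAC nucleotide symbols are allowed")
    if not 60 <= len(consensus) <= 1000:
        problems.append("the consensus should contain 60 to 1000 symbols")
    return problems


print(check_submission("GliSINE1", "ACGT" * 10))  # two problems: name and length
```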

As you proceed to the next field (as well as when you press the ' Validate & Send Data ' button), error prompts may appear (e.g., 'This field is required' or 'Invalid email address'). You won't be able to send data without fulfilling all the requirements. Please, contact us if you find some requirements irrelevant (e.g., you have discovered a 55-nt SINE). If there are no (more) errors, pressing the ' Validate & Send Data ' button sends your data to the SINEBase. You will promptly receive an automatic confirmation e-mail. Please, allow some time for us to review your submission.

Tips & Tricks

Place the mouse cursor over elements of interest (e.g., references in SINETable or abbreviations or nucleotides in consensus sequences) for additional information.
SINETable contents can be filtered using boxes below certain fields. E.g., enter 7SL into the Structure box to view only SINEs with a 7SL RNA-derived region. The filters are case-sensitive and support Perl-like regular expressions. For instance, ^7SL in the Structure box will show all SINEs with a 7SL RNA-derived region at the 5' end; tRNA.+LINE will show SINEs with a tRNA-derived region and a downstream LINE-derived one (more examples can be found here). Empty the box to remove filtering.
Similarly, SINETable contents can also be filtered by a high-rank taxon (or taxa) in the Taxon box; both Latin and common names can be used (e.g., Aves or birds ).
Click on the marked column headers in SINETable to sort by the column; a second click reverses the order.
SINEs can be selected manually by ticking the checkbox left of the family name in the SINETable and clicking the SINE header to show selected items first.
SINETable filtering and sorting can be combined; e.g., you can view all bony fish SINEs with an L2-derived region sorted by taxon. All filtering/sorting can be reset by reloading the page (F5 in many web browsers). The number (and proportion) of selected elements is immediately shown below the SINE header.
Clicking a SINE family name or a reference in the Table will redirect you to the consensus sequence or the reference, respectively. Clicking items in the Features column will redirect you to their descriptions.
Rainbow animation in the top frame can be stopped by clicking the animated text.
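The behavior of the regular-expression filters mentioned in the tips above can be tried offline. The sketch below uses Python's `re` module, whose syntax agrees with Perl-like expressions for these simple patterns; the family names and Structure strings are made-up placeholders, not entries from the SINETable, and the exact text matched by the real filter may differ.

```python
import re

# Hypothetical rows: family name -> Structure column text.
rows = {
    "fam1": "7SL ~~~",
    "fam2": "tRNA 7SL ~~~",
    "fam3": "tRNA CORE LINE ~~~",
    "fam4": "tRNA ??? ~~~",
}


def filter_rows(rows, pattern):
    """Return names whose Structure text matches the case-sensitive pattern."""
    rx = re.compile(pattern)
    return [name for name, structure in rows.items() if rx.search(structure)]


print(filter_rows(rows, "7SL"))         # any 7SL RNA-derived region
print(filter_rows(rows, "^7SL"))        # 7SL RNA-derived region at the 5' end
print(filter_rows(rows, "tRNA.+LINE"))  # tRNA-derived region with a downstream LINE-derived one
```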