- Section I: ECM Overview
- Top panel shows the predefined species tree.
- Bottom panel shows the partition of input genes into Evolutionary Conserved Modules (ECMs), ordered by ECM strength (shown at right), and separated by horizontal lines.
- Each row shows one gene, where the phylogenetic profile indicates presence (blue) or absence (gray) of homologs in each species (column).
- Gene symbols are shown at left. Gray color indicates that the gene is a paralog to a higher scoring gene within the same ECM (based on BLASTP E < 1e-3).
- The next PDF section shows separate details for each ECM, where ECMs that contain two or more non-paralogous genes are indicated by an asterisk appended to the name (e.g. ECM2*).
- Section II: Details of each ECM and its expansion ECM+
- Top panel shows the ECM’s inferred evolutionary history on the predefined species tree. Branch color indicates gene gain (blue), loss (red color, with brighter color indicating higher confidence in loss), or inheritance (black); otherwise branches are gray.
- Bottom panel shows the input genes that are within the ECM (blue/white rows) as well as all genes in the expanded ECM+ (green/gray rows). The ECM+ includes genes likely to have arisen under the inferred model of evolution relative to a background model, and scored using a log likelihood ratio (LLR). Green color hue indicates LLR score.
- PG indicates "paralog group". Paralog groups are labeled alphabetically (e.g., A, B). The first gene within each paralog group is shown in black text. All other genes sharing sequence similarity (BLAST E < 1e-3) are assigned the same PG label and displayed in gray text.
The interactive visualizer provides an interactive HTML representation of the CLIME PDF output with the following added features:
- Your input gene set is partitioned into ECMs. You select the ECM of interest via a drop-down menu at the top left. Unlike the PDF output, there is no summary of the entire partition in the Interactive Visualizer.
- Zoom & Pan
- Click and drag: Clicking and dragging inside the visualizer will pan around your results.
- + (plus): Zoom in by 10% (maximum: 400%).
- - (minus): Zoom out by 10%.
- R: Reset zoom level. You can also reset the zoom level by clicking the 'Reset Zoom' button at the top of the page or selecting a different ECM from the dropdown if you have multiple ECMs.
- Other Features
- Gene symbols (left column of result matrix) are hyperlinked to the relevant gene page in NCBI Entrez Gene or another database.
- Gray color for a gene name indicates it’s a paralog of another higher-scoring gene in the ECM (see “Paralog Group” above).
- Organism names (column headers of result matrix) are hyperlinked to a Google search for that organism.
- Mouse Hover: Hovering on a cell in the results table will provide a tooltip with the gene symbol and organism name for your convenience.
Downloaded results include a text file in addition to the PDF file. The text file contains the same information as the PDF file (except that it lacks the tree gain/loss events shown in PDF). This text file can be useful to view in Excel and to link gene symbols to other annotations.
If you wish to visualize the results in Excel rather than PDF, the following instructions will let you make the phylogenetic profiles look similar to the blue/white matrices shown in PDF: 1) open text file in Excel; 2) select columns I:EU; 3) choose “Conditional Formatting | New Rule” and select gray as the minimum color and blue as the maximum color; 4) with columns I:EU still selected, choose “Column Width” and set to ~4 pixels.
- What types of gene name/identifiers does CLIME recognize?
The online version of CLIME takes the user's input genes and compares them to a lookup table of gene symbols and aliases, downloaded from NCBI Gene (ftp://ftp.ncbi.nih.gov/gene/) and PlasmoDB. The alias lookup table is static (initially downloaded December 2013) but will be updated at regular intervals. The input gene symbols and aliases are case-insensitive, and we discard a gene’s alias if it conflicts with another gene’s official symbol. If your gene ID or symbol is not recognized, please try the Entrez Gene ID or the current official gene symbol according to NCBI Gene (http://www.ncbi.nlm.nih.gov/gene). Below are examples of gene names and symbols for each of the 10 model organisms:
| Organism | Gene Symbol | Entrez ID | Aliases |
| --- | --- | --- | --- |
| H. sapiens | MICU1 | 10367 | CALC; EFHA3; CBARA1 |
| M. musculus | Micu1 | 216001 | Calc; Cbara1; C730016L05Rik |
| A. thaliana | CSP3 | 816297 | ATCSP3; T13L16.11; T13L16_11; AT2G17870 |
| C. elegans | homt-1 | 171590 | Y74C9A.3 |
| C. merolae | CMA001C | 16992110 | |
| D. melanogaster | plexB | 43766 | CG17245; Dmel_CG17245 |
| N. crassa | NCU09980 | 3874110 | NCU09980.1 |
| P. falciparum | VAR | 813139 | PFA_0005w |
| S. cerevisiae | NTG1 | 851218 | ogg2; FUN33; YAL015C |
| T. brucei | Tb10.61.2410 | 3662557 | |
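The lookup rules described above (case-insensitive matching, and dropping an alias that collides with another gene's official symbol) can be illustrated with a minimal Python sketch. This is not CLIME's actual code: the function names, the `HYPO1` record, and the first-listed-wins rule for aliases shared by two genes are all assumptions made for illustration.

```python
def build_lookup(records):
    """Map lower-cased symbols and aliases to Entrez IDs.

    records: list of (official_symbol, entrez_id, aliases) tuples for ONE organism.
    """
    # Official symbols always win; lower-casing makes matching case-insensitive.
    official = {symbol.lower(): entrez_id for symbol, entrez_id, _ in records}
    lookup = dict(official)
    for _, entrez_id, aliases in records:
        for alias in aliases:
            key = alias.lower()
            if key in official:
                # Alias collides with an official symbol, so it is discarded.
                continue
            # Assumption: if two genes share an alias, the first one listed is kept.
            lookup.setdefault(key, entrez_id)
    return lookup


def resolve(lookup, name):
    """Return the Entrez ID for a symbol or alias, or None if unrecognized."""
    return lookup.get(name.strip().lower())


records = [
    ("MICU1", 10367, ["CALC", "EFHA3", "CBARA1"]),  # human example from the list above
    ("HYPO1", 999999, ["micu1", "HYPOX"]),          # hypothetical gene with a conflicting alias
]
lookup = build_lookup(records)
print(resolve(lookup, "cbara1"))  # alias of MICU1, matched case-insensitively
print(resolve(lookup, "micu1"))   # official symbol wins over HYPO1's alias
print(resolve(lookup, "hypox"))   # non-conflicting alias is kept
```

In this sketch an unrecognized name resolves to `None`, which corresponds to the case where you would fall back to the Entrez Gene ID or the official symbol.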
- What does “PG” stand for?
PG stands for Paralog Group and is used to label genes within the ECM that share sequence similarity with each other (BLASTP E<1e-3). Paralog groups are labeled sequentially (A, B, C…) and the first member in the group is shown with black text, whereas the remaining genes are shown with gray text.
Why do we do this? Genes sharing sequence similarity will usually (but not always) have the same phylogenetic profile – thus CLIME will usually group them within the same ECM. Some users will only be interested in non-paralogs and thus can ignore genes with gray text. Other users will be interested in ALL genes within the ECM and will thus want to consider both paralogous and non-paralogous genes.
- What does the * indicate in names like ECM2* (PDF output only)?
The asterisk indicates that the ECM contains two or more non-paralogous genes.
- What does “Strength” in the Interactive Viewer mean?
ECM Strength measures the coherence of the evolutionary histories of all genes in the ECM. It is formally defined as the logarithm of the Bayes Factor for two models normalized by the number of genes in that ECM, where the first model assumes that all genes were derived from the ECM model of evolution (θk) and the second model assumes that each gene was derived independently from the null model (θ0).
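Written out, the definition above corresponds to something like the following expression. The notation is an assumption introduced here for illustration: n is the number of genes in the ECM, x_g is the phylogenetic profile of gene g, and X is the set of all n profiles. The base of the logarithm is not stated in this document.

```latex
\mathrm{Strength}(\mathrm{ECM}_k)
  = \frac{1}{n}\,
    \log \frac{P(X \mid \theta_k)}{\prod_{g=1}^{n} P(x_g \mid \theta_0)}
```

Under this reading, a positive strength means the shared ECM model explains the member profiles better, per gene, than treating each gene independently under the null model.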
- How are the output ECMs ordered?
ECMs are ordered by ECM Strength (see above). Singleton ECMs are ordered by total number of species.
- How long should it take for my gene set to run?
CLIME takes several minutes for small gene sets (under ten minutes for sets of <25 genes), about one hour for medium-sized gene sets (50-100 genes), and several hours for larger gene sets (100-200 genes). However, the online tool launches jobs on a compute cluster, and if the cluster is busy there may be a delay of several minutes to hours before the CLIME job is launched. If you want to run CLIME with more than 200 genes or get your results sooner, please download and use the command-line CLIME software package from the home page.
- Is there a limit on input gene set size?
The online version of CLIME accepts gene sets of up to 200 genes. Larger sets must be run using the command-line version of the program downloaded onto your computer.
- How can I run CLIME on more than 200 genes?
You need to download and install the CLIME software on your computer, and then run it from the command line on your input gene set. The software package includes the 138-species tree and pre-computed matrices for 10 model organisms for your convenience.
- Why does my browser download a PDF file when I click “View Results”?
The results of a CLIME analysis are reported in a PDF file. In browsers with a PDF viewer plugin, the PDF is displayed automatically within the browser. In browsers without such a plugin, the PDF file is downloaded to your disk and can be opened with a local PDF viewer.
- How can I view CLIME results with more informative gene names?
For some of the model organisms, the official gene symbols are not very informative. If you prefer different names/descriptions then you have several options:
- Run CLIME on the website, and download the results (both PDF and Text). Open the text output file in Excel. There is a column that shows “gene symbol”. In another excel tab you can enter a mapping of gene symbols from precomputed ones to names/descriptions that you prefer. Then you can add a column in the main results file, and using the Excel vlookup function, you can insert the description from your mapping.
- Run CLIME on the command line, but input a phylogenetic matrix using the gene descriptions that you prefer rather than the precomputed ones.
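For users who prefer a script to Excel, the VLOOKUP step in the first option can be approximated in Python. This is only a sketch: the tab-delimited layout, the `Gene Symbol` and `ECM` column names, and the descriptions are assumptions, so check them against your downloaded text file before reusing it.

```python
import csv
import io

# Hypothetical stand-in for the downloaded text output; the real file's
# column names and layout may differ.
results_text = "Gene Symbol\tECM\nNCU09980\tECM1\nCMA001C\tECM2\n"

# Your own mapping from gene symbol to a preferred name/description (made up here).
descriptions = {"NCU09980": "my preferred description"}


def annotate(text, mapping, symbol_column="Gene Symbol"):
    """Add a 'Description' column by looking up each row's gene symbol."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    rows = []
    for row in reader:
        # Like VLOOKUP with a fallback: unmapped symbols get an empty description.
        row["Description"] = mapping.get(row[symbol_column], "")
        rows.append(row)
    return rows


annotated = annotate(results_text, descriptions)
for row in annotated:
    print(row["Gene Symbol"], row["ECM"], row["Description"], sep="\t")
```

To use it on a real download, read the file contents into `results_text` and adjust `symbol_column` to the header used in that file.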
- How was the 138-species phylogenetic tree created?
The tree of 138 species was generated using PhyML based on multiple alignments of 16 highly conserved proteins, as described in PMID:22605770 and paraphrased below:
The following 16 highly conserved mouse proteins (present in E. coli and in ≥98% of the 138 eukaryote genomes) were selected: Hars, Hspd1, Mlh1, Nthl1, Sars, Sod2, Top3a, Vars, Rps3, Uba2, Rpl23, Farsa, Ola1, Ddx47, Elp3, and Qars. Separate multiple sequence alignments were generated using MUSCLE (100 iterations) and then concatenated. Regions of poor sequence alignment were removed using Gblocks (maximum non-conserved positions=8; minimum block length=2; allowed gap positions=Half). A phylogenomic tree was constructed using PhyML 3.0 (parameters: -m JTT -f e -v e -c 4 -a e) with 100 bootstrap replicates. The tree was visualized with iTOL and manually labeled with approximate taxonomic branch names (e.g. Metazoa) for clarity. Such tree construction methodologies have known limitations, including long-branch attraction. The reconstructed tree was consistent with published reconstructions or NCBI taxonomy with a few exceptions: mammals (Bos taurus, Canis lupus familiaris, Mus musculus, Sus scrofa), primitive metazoans (Nematostella vectensis, Hydra magnipapillata, Strongylocentrotus purpuratus, Trichoplax adhaerens), one insect (Pediculus humanus corporis), and three highly divergent fungal/protist species (Encephalitozoon cuniculi, Entamoeba histolytica, Entamoeba dispar). The alignment and tree are available in TreeBASE (purl.org/phylo/treebase/phylows/study/TB2:S12416).
- Can I use a different tree?
In the command line version of CLIME, users can input any phylogenetic tree in newick format. There is no way to input a different tree into the online version of the program.
- How was the phylogenetic matrix generated?
The pre-computed phylogenetic matrices are generated using a very simple homolog analysis: a gene is deemed to have a homolog if a species contains any protein with sequence similarity based on BLASTP (Expect < 1e-3). This approach is applied to all annotated genes in a reference species (e.g. human, mouse), based on gene sequences downloaded from KEGG, which selects one representative transcript for each gene model.
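The presence/absence rule above can be sketched in a few lines of Python. This is an illustration of the thresholding step only, not the pipeline used to build the pre-computed matrices; the gene names, species subset, E-values, and the `(gene, species, evalue)` input layout are hypothetical.

```python
E_VALUE_CUTOFF = 1e-3  # threshold stated in the text


def build_profiles(genes, species, hits, cutoff=E_VALUE_CUTOFF):
    """Build binary phylogenetic profiles from BLASTP hits.

    hits: iterable of (gene, species, evalue) tuples, one per BLASTP hit.
    Returns {gene: [0/1 per species, in the order of `species`]}.
    """
    # A gene is "present" in a species if ANY hit there passes the cutoff.
    present = {(gene, sp) for gene, sp, evalue in hits if evalue < cutoff}
    return {
        gene: [1 if (gene, sp) in present else 0 for sp in species]
        for gene in genes
    }


species = ["H. sapiens", "M. musculus", "S. cerevisiae"]
hits = [
    ("geneA", "H. sapiens", 0.0),
    ("geneA", "M. musculus", 1e-50),
    ("geneA", "S. cerevisiae", 0.5),    # too weak: counted as absent
    ("geneB", "H. sapiens", 1e-80),
    ("geneB", "S. cerevisiae", 1e-4),
]
profiles = build_profiles(["geneA", "geneB"], species, hits)
print(profiles)
```

Because any single passing hit counts as presence, this rule cannot distinguish members of a multi-gene family from one another.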
- Can I use a different phylogenetic matrix?
In the command line version of CLIME, users can input their own phylogenetic matrix. There is no way to input a different phylogenetic matrix into the online version of the program.
- What should I consider when deciding which species to add to my user-defined species tree?
The most useful eukaryotic species trees will contain (i) diverse species, (ii) a large number of species, and (iii) only species with well annotated and complete genomes. Metrics such as CEGMA are useful to estimate whether a genome is complete. For our 138 selected species, we performed a CEGMA-like analysis to exclude species that had poor assembly/annotation.
- Can I use just a subset of the 138 species?
In the command line version of CLIME, users can input any phylogenetic tree in newick format and any matching phylogenetic matrix. There is no way to input just a subset of the 138 species into the online version of the program. The best way to proceed would be to create new phylogenetic matrices and a tree with the species of interest and input them into the command line version of the program.
- What does “H. sapiens 111” mean?
“H. sapiens 111” refers to a phylogenetic matrix centered on human genes and their homologs across 111 species. These 111 species are a subset of the 138 eukaryotic genomes, excluding 27 species for which mitochondrial genomes are not available or annotated. Analysis using this matrix also uses a subset of the species tree containing just these 111 species.
- Why is the phylogenetic profile for my gene incorrect?
The phylogenetic profiles displayed here are based on a very simple homolog analysis: a gene is deemed to have a homolog if a species contains any protein with sequence similarity based on BLASTP (Expect < 1e-3). Below are some limitations of this simplistic analysis that may cause problems for your gene of interest:
- BLASTP often fails to detect true homologs for short or quickly evolving proteins. In these cases, the existing phylogenetic matrix will contain 0 (absence) even for species that truly contain a homolog.
- BLASTP can have spurious low-quality matches, even with expect < 1e-3. In these cases, the phylogenetic matrix will incorrectly contain a 1 (presence) in species that do not truly have a homolog. Examining the BLASTP scores of all hits, and the alignments or multiple alignments themselves, can determine if a presence call represents a true homolog or a spurious hit.
- The matrix will incorrectly contain absences for genes not present in a genome’s draft assembly, or for genes not properly annotated in a genome.
- Most common, and most pernicious, is the problem of determining homology for proteins that are part of large multi-gene families. The current phylogenetic matrix will contain a 1 (presence) if a species contains any member of this multi-gene family (based on sequence similarity) – and will not resolve if it contains the exact homolog. To resolve the relationship within a multi-gene family, a different approach must be taken – e.g. best bi-directional hits, gene trees, SYNERGY.
If you manually fix a phylogenetic profile, you can add this edited “gene” to the phylogenetic matrix for the species, and then run command-line CLIME using this updated matrix.
- Why do I get different results when I add/remove one gene to an input gene set?
CLIME partitions the entire input gene set into ECMs using a Bayesian method to perform clustering. Addition or deletion of a single gene from a gene set can alter all of the clustering results. Due to the stochastic nature of the underlying Markov chain Monte Carlo (MCMC) algorithm, for large gene sets CLIME could actually return slightly different partitioning results just from running the program multiple times. However, the default behavior of CLIME is to use a fixed seed so that all results are reproducible (although the user can set a command line argument to use a random seed instead).
- Why doesn’t the ECM order from the online tool match the order shown in the CLIME publication?
For display purposes, ECM results in the CLIME publication were ordered by the total number of species containing a homolog, rather than by ECM strength.
- Why are there no ECM+ results for my gene(s)?
By default, CLIME returns genes with LLR scores greater than 0. For inferred histories with no loss events (e.g., present in all species, or present in just one clade/species), the LLR will be very low, and thus no results will be returned as part of the ECM+.
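The default cutoff can be illustrated with a short Python sketch. It assumes the LLR is the log-likelihood of a gene's profile under the ECM's inferred model minus its log-likelihood under the background model; the function names, the sort order, and the numbers are hypothetical and are not taken from CLIME itself.

```python
def llr(loglik_ecm, loglik_background):
    """Log likelihood ratio: ECM model versus background model."""
    return loglik_ecm - loglik_background


def expand_ecm(candidates, min_llr=0.0):
    """Keep candidate genes whose LLR exceeds min_llr (default 0, as in the text).

    candidates: iterable of (gene, loglik_ecm, loglik_background).
    Returns (gene, llr) pairs; sorting by descending LLR is an assumption.
    """
    scored = [(gene, llr(a, b)) for gene, a, b in candidates]
    kept = [(gene, score) for gene, score in scored if score > min_llr]
    return sorted(kept, key=lambda item: item[1], reverse=True)


# Hypothetical log-likelihoods for three candidate genes.
candidates = [
    ("geneC", -30.0, -35.5),  # LLR = 5.5
    ("geneB", -50.0, -48.0),  # LLR = -2.0, filtered out
    ("geneA", -40.0, -65.0),  # LLR = 25.0
]
ecm_plus = expand_ecm(candidates)
print(ecm_plus)
```

If every candidate scores at or below the cutoff, the returned list is empty, which matches the situation where no ECM+ results are shown.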
- How can I tell if a particular ECM or phylogenetic profile is meaningful?
If an ECM contains multiple genes:
- The “ECM strength” score indicates the coherence of the different member genes. High scores indicate ECMs where member genes are highly similar to each other based on inferred evolutionary history.
- In addition, the most meaningful ECMs encompass independent loss events (shown visually by red branches in the tree). Loss events are most convincing if they (a) are supported by multiple species under the red branch and (b) occur in parts of the phylogenetic tree that are well supported (i.e., have short branch lengths; very long branch lengths indicate high levels of divergence from other species and are thus more prone to tree-building artifacts such as long-branch attraction).
- ECMs are strongest when they contain multiple non-paralogous genes, since the algorithm can better distinguish true loss events from phylogenetic profile errors.
If the ECM contains only a single gene, then there is no ECM strength score and interpretation is more difficult. This difficulty is one of the main motivators for CLIME’s clustering approach, where an evolutionary history is more robust if supported by multiple genes. However, the profile from a single gene is more convincing if:
- The phylogenetic profile is evolutionarily coherent (i.e., few loss events within tight clades, few presence calls in distant parts of the tree). If there are many BLASTP errors, then the resulting phylogenetic profile tends to be incoherent, with many presence/absence calls scattered across the tree.
- The single gene is not part of a large multi-gene family (e.g., kinases, trypsins), where the presence of the shared domain complicates identifying true homologs across species.
In our experience, an ECM is very informative if it satisfies the following two conditions: (1) ECM strength > 5; (2) most of the ECM genes are not paralogs of each other. Furthermore, ECM+ genes with LLR > 20 can be confidently predicted to share a close evolutionary history with the ECM.
- Why is there no outstanding ECM in my input gene set?
This can be either because of true underlying biology or a technical limitation of the input matrix/tree/algorithm. The most common reasons are:
- The input genes show similar profiles, but no loss events across evolution. For example, some input genes may simply be present in all 138 species. Even though they have identical evolutionary histories, there are no informative loss events with which to predict genes sharing that history.
- The input gene profiles are dissimilar from each other. Many genes have quite different evolutionary histories even though they are in the same complex/pathway in some organisms.
- The input genes are part of multi-gene families. This complicates the procedure to determine homology (see above question) and thus the phylogenetic profiles tend to be uninformative. Examples of this are mitochondrial transporters, which share the same Mito_carr PFAM domain but each transport a different cargo, all of which have the exact same profile in our input matrix.
- What does "Informative" mean in ECM interactive visualizer?
Informative ECMs are those with strength > 2, with >= 50% non-homologous genes and at least 2 non-homologous genes total.
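The three criteria can be expressed as a single check. The sketch below is an illustration only: the function and argument names are hypothetical, and it assumes you already have the count of non-homologous genes (for example, the number of distinct paralog groups) for the ECM.

```python
def is_informative(strength, n_genes, n_nonhomologous):
    """Apply the three 'Informative' criteria stated in the text.

    strength: ECM strength, or None for a single-gene ECM with no score.
    n_genes: total number of genes in the ECM.
    n_nonhomologous: number of non-homologous genes (assumed pre-computed).
    """
    if strength is None or n_genes == 0:
        return False
    return (
        strength > 2
        and n_nonhomologous >= 2
        and n_nonhomologous / n_genes >= 0.5
    )


print(is_informative(5.1, 4, 3))   # passes all three criteria
print(is_informative(5.1, 6, 2))   # only 33% non-homologous genes
print(is_informative(1.5, 4, 4))   # strength too low
print(is_informative(None, 1, 1))  # single-gene ECM has no strength score
```

Note that these thresholds are looser than the "very informative" rule of thumb (strength > 5) given in the previous answer.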