Similar articles
 Found 20 similar articles (search time: 31 ms)
1.
We introduce a new method for identifying optimal incomplete data sets from large sequence databases based on the graph theoretic concept of alpha-quasi-bicliques. The quasi-biclique method searches large sequence databases to identify useful phylogenetic data sets with a specified amount of missing data while maintaining the necessary amount of overlap among genes and taxa. The utility of the quasi-biclique method is demonstrated on large simulated sequence databases and on a data set of green plant sequences from GenBank. The quasi-biclique method greatly increases the taxon and gene sampling in the data sets while adding only a limited amount of missing data. Furthermore, under the conditions of the simulation, data sets with a limited amount of missing data often produce topologies nearly as accurate as those built from complete data sets. The quasi-biclique method will be an effective tool for exploiting sequence databases for phylogenetic information and also may help identify critical sequences needed to build large phylogenetic data sets.
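To make the quasi-biclique idea concrete, the sketch below checks whether a chosen set of taxa and genes tolerates a given fraction of missing data. It assumes the usual graph-theoretic definition (every taxon has sequences for at least (1 - alpha) of the selected genes, and every gene is sampled in at least (1 - alpha) of the selected taxa); the abstract does not spell the definition out, and the function name and toy data are hypothetical. It only verifies a candidate set and does not search for an optimal one.

```python
from typing import Dict, Iterable, Set


def is_alpha_quasi_biclique(
    has_sequence: Dict[str, Set[str]],
    taxa: Iterable[str],
    genes: Iterable[str],
    alpha: float,
) -> bool:
    """Check the (assumed) alpha-quasi-biclique condition on a taxon/gene subset.

    has_sequence maps each taxon to the set of genes sequenced for it.
    """
    taxa = list(taxa)
    genes = list(genes)
    if not taxa or not genes:
        return False
    min_genes_per_taxon = (1 - alpha) * len(genes)
    min_taxa_per_gene = (1 - alpha) * len(taxa)
    # Every selected taxon must cover enough of the selected genes.
    for taxon in taxa:
        covered = sum(1 for g in genes if g in has_sequence.get(taxon, set()))
        if covered < min_genes_per_taxon:
            return False
    # Every selected gene must be sampled in enough of the selected taxa.
    for gene in genes:
        sampled = sum(1 for t in taxa if gene in has_sequence.get(t, set()))
        if sampled < min_taxa_per_gene:
            return False
    return True


# Hypothetical toy database: which genes are available for which taxon.
toy_db = {
    "A": {"rbcL", "matK", "atpB"},
    "B": {"rbcL", "matK"},
    "C": {"rbcL", "atpB"},
    "D": {"rbcL"},
}
all_genes = ["rbcL", "matK", "atpB"]
```

With `alpha = 0` the check reduces to requiring a complete matrix, so raising `alpha` is what lets additional taxa and genes into the data set.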

2.
Many biological databases that provide comparative genomics information and tools are now available on the internet. While certainly quite useful, to our knowledge none of the existing databases combine results from multiple comparative genomics methods with manually curated information from the literature. Here we describe the Princeton Protein Orthology Database (P-POD, http://ortholog.princeton.edu), a user-friendly database system that allows users to find and visualize the phylogenetic relationships among predicted orthologs (based on the OrthoMCL method) to a query gene from any of eight eukaryotic organisms, and to see the orthologs in a wider evolutionary context (based on the Jaccard clustering method). In addition to the phylogenetic information, the database contains experimental results manually collected from the literature that can be compared to the computational analyses, as well as links to relevant human disease and gene information via the OMIM, model organism, and sequence databases. Our aim is for the P-POD resource to be extremely useful to typical experimental biologists wanting to learn more about the evolutionary context of their favorite genes. P-POD is based on the commonly used Generic Model Organism Database (GMOD) schema and can be downloaded in its entirety for installation on one's own system. Thus, bioinformaticians and software developers may also find P-POD useful because they can use the P-POD database infrastructure when developing their own comparative genomics resources and database tools.

3.
We present a web service that automatically assigns sequences to homologous gene families from a set of databases. After the gene family most similar to the query sequence has been identified, the sequence is added to the whole alignment and the phylogenetic tree of the family is rebuilt. Thus, the phylogenetic position of the query sequence in its gene family can be easily identified. AVAILABILITY: http://pbil.univ-lyon1.fr/software/HoSeqI/.

4.
5.
Phylogenetic diversity and ecology of environmental Archaea   Total citations: 1, self-citations: 0, citations by others: 1
On the basis of culture studies, Archaea were thought to be synonymous with extreme environments. However, the large numbers of environmental rRNA gene sequences currently flooding into databases such as GenBank show that these organisms are present in almost all environments examined to date. Large sequence databases and new fast phylogenetic software allow more precise determination of the archaeal phylogenetic tree, but also indicate that our knowledge of archaeal diversity is incomplete. Although it is apparent that Archaea can be found in all environments, the chemistry of their ecological context is mostly unknown.

6.
As the genome sequences of multiple strains of a given bacterial species are obtained, more generalized bacterial genome databases may be complemented by databases that are focused on providing more information geared for a distinct bacterial phylogenetic group and its associated research community. The Burkholderia Genome Database represents a model for such a database, providing a powerful, user-friendly search and comparative analysis interface that contains features not found in other genome databases. It contains continually updated, curated and tracked information about Burkholderia cepacia complex genome annotations, plus other Burkholderia species genomes for comparison, providing a high-quality resource for its targeted cystic fibrosis research community. AVAILABILITY: http://www.burkholderia.com. Source code: GNU GPL.

7.

Background  

Phylogenetic analysis of large, multiple-gene datasets, assembled from public sequence databases, is rapidly becoming a popular way to approach difficult phylogenetic problems. Supermatrices (concatenated multiple sequence alignments of multiple genes) can yield more phylogenetic signal than individual genes. However, manually assembling such datasets for a large taxonomic group is time-consuming and error-prone. Additionally, sequence curation, alignment and assessment of the results of phylogenetic analysis are made particularly difficult by the potential for a given gene in a given species to be unrepresented, or to be represented by multiple or partial sequences. We have developed a software package, TaxMan, that largely automates the processes of sequence acquisition, consensus building, alignment and taxon selection to facilitate this type of phylogenetic study.

8.
Increasingly large datasets of 16S rRNA gene sequences reveal new information about the extent of microbial diversity and the surprising extent of the rare biosphere. Currently, many of the largest datasets are represented by short and variable ribosomal sequence tags (RSTs) that are limited in their ability to accurately assign sequences to broad-scale phylogenetic trees. In this study, we selected 30 rare RSTs from existing sequence datasets and designed primers to amplify c. 1400 bases of the 16S rRNA gene to determine whether these sequences were represented by existing databases or if they might reveal new lineages within the Bacteria. Approximately one-third of the RST primers successfully amplified longer portions of these low-abundance 16S rRNA genes in a specific manner. Subsequent phylogenetic analysis demonstrated that most of these sequences were (1) distantly related to existing cultivated microorganisms and (2) closely related to uncultivated clone sequences that were recently deposited in GenBank. The presence of so many recently collected 16S rRNA gene reference sequences in existing databases suggests that progress is being made quickly towards a microbial census, one which has begun scratching the surface of the 'rare biosphere'.

9.
MOTIVATION: Biological sequence databases are highly redundant for two main reasons: (1) various databanks keep redundant sequences, with many identical and nearly identical sequences; (2) natural sequences often have high sequence identities due to gene duplication. We wanted to know how many sequences can be removed before the databases start losing homology information. Can a database of sequences with mutual sequence identity of 50% or less provide us with the same amount of biological information as the original full database? RESULTS: Comparisons of nine representative sequence databases (RSDB) derived from full protein databanks showed that the information content of sequence databases is not linearly proportional to their size. An RSDB reduced to mutual sequence identity of around 50% (RSDB50) was equivalent to the original full database in terms of the effectiveness of homology searching. It was a third of the size of the full database, which made iterative profile searching six times faster. The RSDBs are produced at different granularities for efficient homology searching. AVAILABILITY: All the RSDB files generated and the full analysis results are available on the Internet: ftp://ftp.ebi.ac.uk/pub/contrib/jong/RSDB/ and http://cyrah.ebi.ac.uk:1111/Proj/Bio/RSDB
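The reduction to a representative set can be illustrated with a greedy filter: keep a sequence only if its identity to every already-kept representative is at or below a threshold. This is a minimal sketch, not the procedure used to build the RSDB files. It assumes pre-aligned, equal-length toy sequences and a plain per-position identity, whereas a real pipeline would compute identity from pairwise alignments; the function names and example sequences are hypothetical.

```python
from typing import Dict


def identity(a: str, b: str) -> float:
    """Fraction of identical positions between two equal-length sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def representative_set(seqs: Dict[str, str], max_identity: float = 0.5) -> Dict[str, str]:
    """Greedily keep sequences whose identity to all kept ones is <= max_identity."""
    kept: Dict[str, str] = {}
    for name, seq in seqs.items():
        if all(identity(seq, rep) <= max_identity for rep in kept.values()):
            kept[name] = seq
    return kept


# Hypothetical toy input: seq_b is a near-duplicate of seq_a, seq_c is unrelated.
toy_seqs = {
    "seq_a": "MKVLAAGIVG",
    "seq_b": "MKVLAAGIVA",
    "seq_c": "QWERTYHNPD",
}
reduced = representative_set(toy_seqs, max_identity=0.5)
```

The result depends on input order, because the first member of each redundant group becomes its representative; sorting by length or annotation quality beforehand is a common way to control that choice.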

10.
11.
Recently, as genome-scale data have become available for more organisms, the development of phylogenetic markers from nuclear protein-coding loci (NPCL) has become more tractable. However, new methods are needed to efficiently sort the large number of genes from genomic databases into more limited sets appropriate for particular phylogenetic questions, while avoiding introns and paralogs. Here we describe a general methodology for identifying candidate single-copy NPCL from genomic databases. Our method uses information from reference genomes to identify genes with relatively large continuous protein-coding regions (i.e., 700bp). BLAST comparisons are used to help avoid genes with paralogous copies or close relatives (i.e., gene families) that might confound phylogenetic analyses. Exon boundary information is used to identify appropriately spaced potential priming sites. Using this method, we have developed over 25 novel NPCL, which span a variety of desirable evolutionary rates for phylogenetic analyses. Although targeted for higher-level phylogenetics of squamate reptiles, many of these loci appear to be useful across and within other vertebrate clades (e.g., amphibians), and some are relatively rapidly evolving and may be useful for closely-related species (e.g., within genera). This general method can be used whenever large-scale genomic data are available for an appropriate reference species (not necessarily within the focal clade). The method is also well suited for the development of intron regions for lower-level phylogenetic and phylogeographic studies. We provide an online database of alignments and suggested primers for approximately 85 NPCL that should be useful across vertebrates.

12.
The bioinformatics software Geneious provides a useful platform for researchers to retrieve and analyse genomic and functional genomics information. However, the main databases that the software is able to access are hosted by NCBI (National Center for Biotechnology Information). The databases of EuPathDB (Eukaryotic Pathogen Database Resources), such as PlasmoDB and PiroplasmaDB, collect more specific and detailed information about eukaryotic pathogens than those kept in NCBI databases. Two plugins for Geneious, one for PlasmoDB and one for PiroplasmaDB, were developed. Once the plugins are installed, users can use search facilities to find and import gene and protein sequences from the EuPathDB databases. Users can then use the functions of Geneious to process the sequence information. When information unique to PlasmoDB and PiroplasmaDB is required, the user can access results linked with the gene/protein sequence via the default web browser. The plugins are freely available from the Victorian Bioinformatics Consortium website. The plugins can be modified to access any of the databases of EuPathDB.

13.
A common request of proteomics core facilities is protein identification. However, in some instances primary sequence information for the protein in question is not present in public databases. In other cases, the amino acid sequence of a protein may differ in some way from the sequence predicted from the gene sequence in a database as a result of gene mutation, gene splicing, and/or multiple posttranslational modifications. Thus, it may be necessary to determine the sequence of one or more peptides de novo in order to identify and/or adequately characterize the protein of interest. The primary goal of this study was to give participating laboratories an opportunity to evaluate their proficiency in sequencing unknown peptides that are not included in any published database. Samples containing 3–6 pmol each of five synthetic peptides with amino acid sequences that were not present in public databases were sent to 106 laboratories. One nonstandard amino acid was present in one of the peptides. From a comparison of the results obtained by different strategies, participating laboratories will be able to gauge their own capabilities and establish realistic expectations for the approaches that can be used for this determination.

14.
MOTIVATION: Comparative sequence analysis is widely used to study genome function and evolution. This approach first requires the identification of homologous genes and then the interpretation of their homology relationships (orthology or paralogy). To provide help in this complex task, we developed three databases of homologous genes containing sequences, multiple alignments and phylogenetic trees: HOBACGEN, HOVERGEN and HOGENOM. In this paper, we present two new tools for automating the search for orthologs or paralogs in these databases. RESULTS: First, we have developed and implemented an algorithm to infer speciation and duplication events by comparison of gene and species trees (tree reconciliation). Second, we have developed a general method to search in our databases the gene families for which the tree topology matches a peculiar tree pattern. This algorithm of unordered tree pattern matching has been implemented in the FamFetch graphical interface. With the help of a graphical editor, the user can specify the topology of the tree pattern, and set constraints on its nodes and leaves. Then, this pattern is compared with all the phylogenetic trees of the database, to retrieve the families in which one or several occurrences of this pattern are found. By specifying ad hoc patterns, it is therefore possible to identify orthologs in our databases.
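Tree reconciliation can be sketched with the classical least-common-ancestor (LCA) mapping: each gene-tree node is mapped to the smallest species-tree clade that contains all species below it, and a node is called a duplication when it maps to the same clade as one of its children. The abstract does not say which reconciliation algorithm the authors implemented, so this is an illustrative stand-in; trees are written as nested tuples, and all names and the toy gene family are hypothetical.

```python
from typing import Dict, FrozenSet, List, Tuple, Union

Tree = Union[str, tuple]  # a leaf name, or a tuple of subtrees


def collect_clades(tree: Tree, out: List[FrozenSet[str]]) -> FrozenSet[str]:
    """Append the leaf set of every node in the species tree to `out`."""
    if isinstance(tree, str):
        clade = frozenset([tree])
    else:
        clade = frozenset().union(*(collect_clades(child, out) for child in tree))
    out.append(clade)
    return clade


def lca(clades: List[FrozenSet[str]], species: FrozenSet[str]) -> FrozenSet[str]:
    """Smallest species-tree clade containing all the given species."""
    return min((c for c in clades if species <= c), key=len)


def reconcile(
    gene_tree: Tree,
    gene_to_species: Dict[str, str],
    clades: List[FrozenSet[str]],
    events: List[Tuple[Tuple[str, ...], str]],
) -> Tuple[FrozenSet[str], List[str]]:
    """Label each internal gene-tree node as 'speciation' or 'duplication'."""
    if isinstance(gene_tree, str):
        return frozenset([gene_to_species[gene_tree]]), [gene_tree]
    child_maps, genes = [], []
    for child in gene_tree:
        mapped, below = reconcile(child, gene_to_species, clades, events)
        child_maps.append(mapped)
        genes.extend(below)
    here = lca(clades, frozenset().union(*child_maps))
    # Duplication if this node maps to the same species clade as a child.
    event = "duplication" if here in child_maps else "speciation"
    events.append((tuple(genes), event))
    return here, genes


# Hypothetical toy family: paralogs a and b arose before the human/mouse split.
species_tree = (("human", "mouse"), "chicken")
gene_tree = ((("a_human", "a_mouse"), ("b_human", "b_mouse")), "a_chicken")
gene_to_species = {g: g.split("_")[1] for g in
                   ["a_human", "a_mouse", "b_human", "b_mouse", "a_chicken"]}

species_clades: List[FrozenSet[str]] = []
collect_clades(species_tree, species_clades)
events: List[Tuple[Tuple[str, ...], str]] = []
reconcile(gene_tree, gene_to_species, species_clades, events)
```

Genes separated by a speciation node are candidate orthologs and genes separated by a duplication node are candidate paralogs, which is the distinction the database search tools rely on.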

15.
MOTIVATION: Genome projects have produced large amounts of data on the sequences of new genes whose functions are as yet unknown. The functions of new genes are usually inferred by comparing their sequences with those of known genes, but evaluation of the sequence homology of individual genes does not make the most of the available sequence information. Therefore, new methods and tools for extracting more biological information from homology searches would be advantageous. RESULTS: We have developed a computational tool, ORI-GENE, to analyze the results of sequence homology searches from the perspective of the evolution of selected sets of new genes. ORI-GENE has a graphical interface and accomplishes two important tasks: first, based on the output of homology searches, it identifies species with similar genes and displays their pattern of distribution on the phylogenetic tree. This function enables one to infer the way in which a given gene may have propagated among species over time. Second, from the distribution patterns, it predicts the point at which a given gene may have been first acquired (i.e. its 'origin'), then classifies the gene on that basis. Because it makes use of available evolutionary information to show the way in which genes cluster among species, ORI-GENE should be an effective tool for the screening and classification of new genes revealed by genome analysis. AVAILABILITY: ORI-GENE is retrievable via the Internet at: http://www.rtc.riken.go.jp/jouhou/ORI-GENE.

16.
Comprehensively sampled phylogenetic trees provide the most compelling foundations for strong inferences in comparative evolutionary biology. Mismatches are common, however, between the taxa for which comparative data are available and the taxa sampled by published phylogenetic analyses. Moreover, many published phylogenies are gene trees, which cannot always be adapted immediately for species level comparisons because of discordance, gene duplication, and other confounding biological processes. A new database, STBase, lets comparative biologists quickly retrieve species level phylogenetic hypotheses in response to a query list of species names. The database consists of 1 million single- and multi-locus data sets, each with a confidence set of 1000 putative species trees, computed from GenBank sequence data for 413,000 eukaryotic taxa. Two bodies of theoretical work are leveraged to aid in the assembly of multi-locus concatenated data sets for species tree construction. First, multiply labeled gene trees are pruned to conflict-free singly-labeled species-level trees that can be combined between loci. Second, impacts of missing data in multi-locus data sets are ameliorated by assembling only decisive data sets. Data sets overlapping with the user’s query are ranked using a scheme that depends on user-provided weights for tree quality and for taxonomic overlap of the tree with the query. Retrieval times are independent of the size of the database, typically a few seconds. Tree quality is assessed by a real-time evaluation of bootstrap support on just the overlapping subtree. Associated sequence alignments, tree files and metadata can be downloaded for subsequent analysis. STBase provides a tool for comparative biologists interested in exploiting the most relevant sequence data available for the taxa of interest. 
It may also serve as a prototype for future species tree oriented databases and as a resource for assembly of larger species phylogenies from precomputed trees.

17.
Expressed sequence tags (ESTs) in public databases and cross-species transferable markers are considered to be a cost-effective means for developing sequence-based markers for less-studied species. In this study, EST-simple sequence repeat (SSR) markers developed from Lathyrus sativus L. EST sequences and cross-transferable EST-SSRs derived from Medicago truncatula L. were utilized to investigate the genetic diversity among grass pea populations from Ethiopia. A total of 45 alleles were detected using eleven EST-SSRs with an average of four alleles per locus. The average polymorphism information content for all primers was 0.416. The average gene diversity was 0.477, ranging from 0.205 for marker Ls942 to 0.804 for MtBA32F05. F(ST) values estimated by analysis of molecular variance were 0.01, 0.15, and 0.84 for among regions, among accessions and within accessions respectively, indicating that most of the variation (84%) resides within accessions. Model-based cluster analysis grouped the accessions into three clusters, grouping accessions irrespective of their collection regions. Among the regions, high levels of diversity were observed in Gojam, Gonder, Shewa and Welo regions, with Gonder region showing a higher number of different alleles. From breeding and conservation aspects, conducting a close study on a specific population would be advisable for genetic improvement in the crop, and it would be appropriate if future collection and conservation plans give due attention to under-represented regions. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1007/s11032-011-9662-y) contains supplementary material, which is available to authorized users.
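The two per-locus statistics reported above can be computed from allele frequencies. The sketch below assumes the standard textbook definitions, gene diversity as 1 minus the sum of squared allele frequencies and polymorphism information content (PIC) in the Botstein form; the abstract does not state which estimators or sample-size corrections the authors used, and the function names are hypothetical.

```python
from typing import Sequence


def gene_diversity(freqs: Sequence[float]) -> float:
    """Expected heterozygosity: 1 - sum(p_i^2), without sample-size correction."""
    return 1.0 - sum(p * p for p in freqs)


def polymorphism_information_content(freqs: Sequence[float]) -> float:
    """PIC = 1 - sum(p_i^2) - sum over i<j of 2 * p_i^2 * p_j^2."""
    n = len(freqs)
    homozygosity = sum(p * p for p in freqs)
    cross = sum(
        2.0 * freqs[i] ** 2 * freqs[j] ** 2
        for i in range(n)
        for j in range(i + 1, n)
    )
    return 1.0 - homozygosity - cross


# Hypothetical locus with two equally frequent alleles.
two_alleles = [0.5, 0.5]
he = gene_diversity(two_alleles)                      # 0.5
pic = polymorphism_information_content(two_alleles)   # 0.375
```

PIC is never larger than gene diversity for the same locus, which matches the ordering of the reported averages (0.416 versus 0.477).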

18.
Abstract— Amino acid encoding genes contain character state information that may be useful for phylogenetic analysis on at least two levels. The nucleotide sequence and the translated amino acid sequences have both been employed separately as character states for cladistic studies of various taxa, including studies of the genealogy of genes in multigene families. In essence, amino acid sequences and nucleic acid sequences are two different ways of character coding the information in a gene. Silent positions in the nucleotide sequence (first or third positions in codons that can accrue change without changing the identity of the amino acid that the triplet codes for) may accrue change relatively rapidly and become saturated, losing the pattern of historical divergence. On the other hand, non-silent nucleotide alterations and their accompanying amino acid changes may evolve too slowly to reveal relationships among closely related taxa. In general, the dynamics of sequence change in silent and non-silent positions in protein coding genes result in homoplasy and lack of resolution, respectively. We suggest that the combination of nucleic acid and the translated amino acid coded character states into the same data matrix for phylogenetic analysis addresses some of the problems caused by the rapid change of silent nucleotide positions and overall slow rate of change of non-silent nucleotide positions and slowly changing amino acid positions. One major theoretical problem with this approach is the apparent non-independence of the two sources of characters. However, there are at least three possible outcomes when comparing protein coding nucleic acid sequences with their translated amino acids in a phylogenetic context on a codon by codon basis. First, the two character sets for a codon may be entirely congruent with respect to the information they convey about the relationships of a certain set of taxa. 
Second, one character set may display no information concerning a phylogenetic hypothesis while the other character set may impart information to a hypothesis. These two possibilities are cases of non-independence; however, we argue that congruence in such cases can be thought of as increasing the weight of the particular phylogenetic hypothesis that is supported by those characters. In the third case, the two sources of character information for a particular codon may be entirely incongruent with respect to phylogenetic hypotheses concerning the taxa examined. In this last case the two character sets are independent in that information from neither can predict the character states of the other. Examples of these possibilities are discussed and the general applicability of combining these two sources of information for protein coding genes is presented using sequences from the homeobox region of 46 homeobox genes from Drosophila melanogaster to develop a hypothesis of genealogical relationship of these genes in this large multigene family.

19.
Ren F, Tanaka H, Yang Z. Gene 2009, 441(1-2): 119-125
Supermatrix and supertree methods are two strategies advocated for phylogenetic analysis of sequence data from multiple gene loci, especially when some species are missing at some loci. The supermatrix method concatenates sequences from multiple genes into a data supermatrix for phylogenetic analysis, and ignores differences in evolutionary dynamics among the genes. The supertree method analyzes each gene separately and assembles the subtrees estimated from individual genes into a supertree for all species. Most algorithms suggested for supertree construction lack statistical justifications and ignore uncertainties in the subtrees. Instead of supermatrix or supertree, we advocate the use of likelihood function to combine data from multiple genes while accommodating their differences in the evolutionary process. This combines the strengths of the supermatrix and supertree methods while avoiding their drawbacks. We conduct computer simulation to evaluate the performance of the supermatrix, supertree, and maximum likelihood methods applied to two phylogenetic problems: molecular-clock dating of species divergences and reconstruction of species phylogenies. The results confirm the theoretical superiority of the likelihood method. Supertree or separate analyses of data of multiple genes may be useful in revealing the characteristics of the evolutionary process of multiple gene loci, and the information may be used to formulate realistic models for combined analysis of all genes by likelihood.
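The concatenation step of the supermatrix method can be sketched as follows: per-gene alignments are joined end to end, and a species missing at a locus is padded with `?` characters. This is only an illustration of how missing loci enter the matrix, not code from the paper. The function name, the `?` missing-data symbol and the toy alignments are assumptions; the returned partition coordinates are what a partitioned likelihood analysis would use to give each gene its own model.

```python
from typing import Dict, List, Tuple


def build_supermatrix(
    alignments: Dict[str, Dict[str, str]],
) -> Tuple[Dict[str, str], List[Tuple[str, int, int]]]:
    """Concatenate per-gene alignments; pad species missing at a locus with '?'.

    alignments maps gene name -> {species: aligned sequence}.
    Returns the supermatrix and 1-based inclusive (gene, start, end) partitions.
    """
    species = sorted({s for aln in alignments.values() for s in aln})
    rows: Dict[str, List[str]] = {s: [] for s in species}
    partitions: List[Tuple[str, int, int]] = []
    start = 1
    for gene, aln in alignments.items():
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError(f"alignment for {gene} is empty or ragged")
        length = lengths.pop()
        for s in species:
            rows[s].append(aln.get(s, "?" * length))
        partitions.append((gene, start, start + length - 1))
        start += length
    return {s: "".join(parts) for s, parts in rows.items()}, partitions


# Hypothetical toy alignments: species B lacks matK, species C lacks rbcL.
toy_alignments = {
    "rbcL": {"A": "ACGT", "B": "ACGA"},
    "matK": {"A": "GG", "C": "GC"},
}
matrix, parts = build_supermatrix(toy_alignments)
```

Here `matrix` holds `ACGTGG` for A, `ACGA??` for B and `????GC` for C, with partitions `rbcL` at columns 1 to 4 and `matK` at columns 5 to 6.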

20.
Babnigg G, Giometti CS. Proteomics 2006, 6(16): 4514-4522
In proteome studies, identification of proteins requires searching protein sequence databases. The public protein sequence databases (e.g., NCBInr, UniProt) each contain millions of entries, and private databases add thousands more. Although much of the sequence information in these databases is redundant, each database uses distinct identifiers for the identical protein sequence and often contains unique annotation information. Users of one database obtain a database-specific sequence identifier that is often difficult to reconcile with the identifiers from a different database. When multiple databases are used for searches or the databases being searched are updated frequently, interpreting the protein identifications and associated annotations can be problematic. We have developed a database of unique protein sequence identifiers called Sequence Globally Unique Identifiers (SEGUID) derived from primary protein sequences. These identifiers serve as a common link between multiple sequence databases and are resilient to annotation changes in either public or private databases throughout the lifetime of a given protein sequence. The SEGUID Database can be downloaded (http://bioinformatics.anl.gov/SEGUID/) or easily generated at any site with access to primary protein sequence databases. Since SEGUIDs are stable, predictions based on the primary sequence information (e.g., pI, Mr) can be calculated just once; we have generated approximately 500 different calculations for more than 2.5 million sequences. SEGUIDs are used to integrate MS and 2-DE data with bioinformatics information and provide the opportunity to search multiple protein sequence databases, thereby providing a higher probability of finding the most valid protein identifications.
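A sequence-derived identifier of this kind can be generated locally with a cryptographic hash. The sketch below follows the commonly used SEGUID recipe (SHA-1 digest of the upper-cased sequence, Base64-encoded with the trailing padding removed); the abstract itself does not give the recipe, so treat it as an assumption rather than a statement of the authors' exact implementation. The function name and example peptide are hypothetical.

```python
import base64
import hashlib


def seguid_like(sequence: str) -> str:
    """Identifier derived only from the primary sequence (assumed SEGUID recipe)."""
    normalized = sequence.strip().upper().encode("ascii")
    digest = hashlib.sha1(normalized).digest()   # 20 bytes
    encoded = base64.b64encode(digest).decode("ascii")
    return encoded.rstrip("=")                    # 27 characters after stripping padding


# The same sequence yields the same identifier regardless of source database
# or letter case; a single residue change yields a different identifier.
id_upper = seguid_like("MKVLAAGIVGLLA")
id_lower = seguid_like("mkvlaagivgllA")
id_other = seguid_like("MKVLAAGIVGLLG")
```

Because the identifier depends only on the residues, annotation changes in any source database leave it untouched, which is the stability property the abstract relies on for caching per-sequence calculations.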
