Similar literature
20 similar documents found.
1.

Background

An important task in a metagenomic analysis is the assignment of taxonomic labels to sequences in a sample. Most widely used methods for taxonomy assignment compare a sequence in the sample to a database of known sequences. Many approaches use the best BLAST hit(s) to assign the taxonomic label. However, it is known that the best BLAST hit may not always correspond to the best taxonomic match. An alternative approach involves phylogenetic methods, which take into account alignments and a model of evolution in order to more accurately define the taxonomic origin of sequences. Similarity-search based methods typically run faster than phylogenetic methods and work well when the organisms in the sample are well represented in the database. In contrast, phylogenetic methods have the capability to identify new organisms in a sample but are computationally quite expensive.

Results

We propose a two-step approach for metagenomic taxon identification; i.e., use a rapid method that accurately classifies sequences using a reference database (this is a filtering step) and then use a more complex phylogenetic method for the sequences that were unclassified in the previous step. In this work, we explore whether and when using top BLAST hit(s) yields a correct taxonomic label. We develop a method to detect outliers among BLAST hits in order to separate the phylogenetically most closely related matches from matches to sequences from more distantly related organisms. We used modified BILD (Bayesian Integral Log-Odds) scores, a multiple-alignment scoring function, to define the outliers within a subset of top BLAST hits and assign taxonomic labels. We compared the accuracy of our method to the RDP classifier and show that our method yields fewer misclassifications while properly classifying organisms that are not present in the database. Finally, we evaluated the use of our method as a pre-processing step before more expensive phylogenetic analyses (in our case TIPP) in the context of real 16S rRNA datasets.

Conclusion

Our experiments make a good case for using a two-step approach for accurate taxonomic assignment. We show that our method can be used as a filtering step before using phylogenetic methods and provides a way to interpret BLAST results using more information than provided by E-values and bit-scores alone.
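
The two-step control flow described above can be sketched in a few lines. This is only an illustration under stated assumptions: `two_step_assign`, `fast_lookup` and `slow_placeholder` are hypothetical names, the exact-match dictionary stands in for the BLAST-based filtering step, and the placeholder stands in for a phylogenetic method such as TIPP. None of it reproduces the authors' BILD-based outlier detection.

```python
def two_step_assign(sequences, fast_classify, slow_classify):
    """Label each sequence, calling the slow method only as a fallback.

    fast_classify returns a label, or None when it cannot classify.
    Returns the labels plus the ids that were deferred to the slow step.
    """
    labels, deferred = {}, []
    for seq_id, seq in sequences.items():
        label = fast_classify(seq)
        if label is None:
            deferred.append(seq_id)
            label = slow_classify(seq)
        labels[seq_id] = label
    return labels, deferred


# Toy stand-ins (assumptions, not real classifiers).
reference = {"ACGTACGT": "Escherichia", "TTGACCAA": "Bacillus"}


def fast_lookup(seq):
    # Exact-match lookup playing the role of the rapid filtering step.
    return reference.get(seq)


def slow_placeholder(seq):
    # Placeholder for an expensive phylogenetic placement.
    return "needs phylogenetic placement"


sequences = {"read1": "ACGTACGT", "read2": "GGGGCCCC", "read3": "TTGACCAA"}
labels, deferred = two_step_assign(sequences, fast_lookup, slow_placeholder)
print(labels)
print(deferred)
```

Only the sequences the fast step leaves unlabelled reach the expensive step, which is where the cost saving of such a pipeline would come from.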

2.
Correct orthology assignment is a critical prerequisite of numerous comparative genomics procedures, such as function prediction, construction of phylogenetic species trees and genome rearrangement analysis. We present an algorithm for the detection of non-orthologs that arise by mistake in current orthology classification methods based on genome-specific best hits, such as the COGs database. The algorithm works with pairwise distance estimates, rather than computationally expensive and error-prone tree-building methods. The accuracy of the algorithm is evaluated through verification of the distribution of predicted cases, case-by-case phylogenetic analysis and comparisons with predictions from other projects using independent methods. Our results show that a very significant fraction of the COG groups include non-orthologs: using conservative parameters, the algorithm detects non-orthology in a third of all COG groups. Consequently, sequence analysis sensitive to correct orthology assignments will greatly benefit from these findings.

3.

Background

Searching for the orthologs of a given protein or DNA sequence is one of the most important and most commonly used bioinformatics methods in biology. Programs like BLAST or the orthology search engine Inparanoid can be used to find orthologs when the similarity between two sequences is sufficiently high. However, they fail when the level of conservation is low. The detection of remotely conserved proteins often involves sophisticated manual intervention that is difficult to automate.

Results

Here, we introduce morFeus, a search program to find remotely conserved orthologs. Based on relaxed sequence similarity searches, morFeus selects sequences based on the similarity of their alignments to the query, tests for orthology by iterative reciprocal BLAST searches and calculates a network score for the resulting network of orthologs that is a measure of orthology independent of the E-value. Detecting remotely conserved orthologs of a protein using morFeus thus requires no manual intervention. We demonstrate the performance of morFeus by comparing it to state-of-the-art orthology resources and methods. We provide an example of remotely conserved orthologs, which were experimentally shown to be functionally equivalent in the respective organisms and therefore meet the criteria of the orthology-function conjecture.

Conclusions

Based on our results, we conclude that morFeus is a powerful and specific search method for detecting remotely conserved orthologs. morFeus is freely available at http://bio.biochem.mpg.de/morfeus/. Its source code is available from Sourceforge.net (https://sourceforge.net/p/morfeus/).

Electronic supplementary material

The online version of this article (doi:10.1186/1471-2105-15-263) contains supplementary material, which is available to authorized users.

4.

Background  

Homology is a crucial concept in comparative genomics. The algorithm probably most widely used for homology detection in comparative genomics is BLAST. Usually a stringent score cutoff is applied to distinguish putative homologs from possible false positive hits. As a consequence, some BLAST hits are discarded that are in fact homologous.

5.
6.

Background  

Seeded alignment is an important component of algorithms for fast, large-scale DNA similarity search. A good seed matching heuristic can reduce the execution time of genomic-scale sequence comparison without degrading sensitivity. Recently, many types of seed have been proposed to improve on the performance of traditional contiguous seeds as used in, e.g., NCBI BLASTN. Choosing among these seed types, particularly those that use information besides the presence or absence of matching residue pairs, requires practical guidance based on a rigorous comparison, including assessment of sensitivity, specificity, and computational efficiency. This work performs such a comparison, focusing on alignments in DNA outside widely studied coding regions.

7.
Tekaia F  Yeramian E 《Gene》2012,492(1):199-211
The proper detection of orthologs is crucial for evolutionary studies of genes and species. Despite large efforts to solve this problem the methodological situation appears unsettled to a large extent and the “quest for orthologs” is still an ongoing task in large-scale genome comparisons. Here, we introduce a simple operational framework for the detection of orthologs and their classification. The operational framework relies on well-established principles, optimizing their implementation for the considered purposes, and chaining components in coherent procedures: 1) We take advantage of the efficiency and simplicity of the Reciprocal Best Hit (RBH) detections, remedying (by design) the drawback concerning the limitations in terms of 1:1 detections. The procedure is based on the partitioning of Reciprocal Best Hits, with the further merging of partitions including members of the same paralogous classes (“SuperPartition of Orthologs” (SPOs)). 2) We then resort to the conservation profiles of the obtained clusters, allowing simple detection of SPOs containing duplicated members. Based on accepted evolutionary principles, such members can be further tagged as in-paralogs (co-orthologs) or out-paralogs. The method is illustrated and validated by extensive genomic analyses. The performances of the overall approach are characterized in global terms for three sets of species (Chlamydiae, Mycobacteria, Aspergilli), showing that at least 75% of the sets of orthologs contain at most one protein from a given species. The sets including more than one protein from a given species are shown to contain in-paralogs in proportions varying from 28% to 58%. The characterizations also show that the large majority of SPOs are associated with ancestral motifs, and accordingly not prone to chaining effects that might be triggered by multi-domain proteins. Further, the SPO formulation is compared to other similarity based ortholog detection methods. Beyond core common results, significant differences are observed between various methods, which can be accounted for to a large extent on conceptual grounds, relative to the different merging schemes involved. Such comparisons highlight a major advantage of the SPO approach concerning the proper clustering of associated paralogs, which appear to be often dispatched spuriously into distinct orthologous classes. Finally, the perspectives for future applications and elaborations of SPO-based compositional analyses are discussed.
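
The first step described above, partitioning reciprocal best hits, can be read as grouping proteins into the connected components of the RBH graph. The sketch below follows that reading with a small union-find structure. It is an assumption about the general idea rather than the authors' procedure: `rbh_partitions` and the `species|gene` identifiers are hypothetical, and the later merging of partitions that share paralogous classes into SPOs is not attempted.

```python
def rbh_partitions(pairs):
    """Group proteins linked by reciprocal best hits into partitions.

    `pairs` is an iterable of (protein_x, protein_y) RBH pairs; the
    result is a sorted list of sorted member lists, one per partition.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for node in list(parent):
        groups.setdefault(find(node), set()).add(node)
    return sorted(sorted(members) for members in groups.values())


# Toy RBH pairs across three hypothetical species A, B and C.
pairs = [("A|g1", "B|g1"), ("B|g1", "C|g1"), ("A|g2", "C|g2")]
partitions = rbh_partitions(pairs)
print(partitions)
```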

8.
Identification of orthologs is an important task in understanding a novel genome. It helps to transfer functional annotations from one organism to another. To identify putative orthologs, the Reciprocal Best BLAST Hit (RBBH) method is known to be an efficient approach. OrFin uses this approach to identify pairs of orthologous proteins for a given set of sequences from two species. It is a user-friendly web tool that works with user-defined parameters to search for RBBHs. Results are produced in both HTML and text formats.

Availability

This web tool is freely available at http://bifl.uohyd.ac.in/orfin
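
The reciprocal best hit idea behind this kind of tool can be sketched as follows. This is a minimal illustration, not OrFin's implementation: the function names and the `(query, subject, bit_score)` tuple layout are assumptions, and the hits are toy data standing in for parsed BLAST output.

```python
def best_hits(hits):
    """Map each query to its highest-scoring subject.

    `hits` is an iterable of (query, subject, bit_score) tuples, a
    hypothetical layout standing in for parsed BLAST tabular output.
    """
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs that are each other's best hit."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Toy hits: species A searched against B, and B searched against A.
a_vs_b = [("a1", "b1", 250.0), ("a1", "b2", 90.0),
          ("a2", "b2", 180.0), ("a3", "b1", 120.0)]
b_vs_a = [("b1", "a1", 245.0), ("b2", "a2", 175.0), ("b2", "a1", 88.0)]
pairs = reciprocal_best_hits(a_vs_b, b_vs_a)
print(pairs)
```

In the toy data, `a3` has `b1` as its best hit, but `b1` points back to `a1`, so `a3` is left unpaired. This is the 1:1 behaviour that makes the method conservative.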

9.
Videla M  Valladares GR  Salvo A 《Oecologia》2012,169(3):743-751
Insect preferences for particular plant species might be subjected to trade-offs among several selective forces. Here, we evaluated, through laboratory and field experiments, the feeding and ovipositing preferences of the polyphagous leafminer Liriomyza huidobrensis (Diptera: Agromyzidae) in relation to adult and offspring performance and enemy-free space. Female leafminers preferred laying their eggs on Vicia faba (Fabaceae) over Beta vulgaris var. cicla (Chenopodiaceae), in both laboratory and field choice experiments, although no oviposition preference was observed in no-choice tests. Females fed more often on B. v. var. cicla (no-choice test) or showed no feeding preference (choice test), even when their realized fecundity was remarkably higher on V. faba. Offspring developed faster, tended to survive better, and attained bigger adult size on the preferred host plant. Also, a field experiment showed higher overall parasitism rates for leafminers developing on B. v. var. cicla, with a nonsignificant similar tendency in field surveys. According to these results, host plant selection by L. huidobrensis appears to be driven mainly by variation in host quality. Moreover, the consistent oviposition choices for the best host and the labile feeding preferences observed here suggest that host plant selection might be driven by maximization of offspring fitness even without a conflict of interest between parents and offspring. Overall, these results highlight the complexity of decisions made by phytophagous insects regarding their host plants, and the importance of simultaneous evaluation of the various driving forces involved, in order to unravel the adaptive significance of female choices.

10.
11.
12.
Seed predation can cause significant losses of weed seeds in agricultural systems and can, thus, contribute to weed control. The removal of Lolium multiflorum and Vicia villosa seeds by harvester ants, Messor barbarus, and granivorous rodents, Mus spretus, in six cereal fields in NE Spain was separated into three sequential processes, namely (1) the probability of finding a seed cache (cache encounter rate), (2) the percentage of seeds utilized once a seed cache has been found (seed exploitation rate) and (3) seed selection if multiple species are present (preference). Identifying the most important behavioural component and factors that drive it may help to better understand and manage seed predation. Seed cache encounter rate correlated well with overall seed removal rate caused by harvester ants (r² = 0.91), or rodents (r² = 0.82). Once found, seed exploitation rates were high and fairly constant from spring to autumn for harvester ants, and low throughout the season for rodents. Harvester ants removed almost all L. multiflorum seeds from caches found, while the exploitation of V. villosa seeds varied across the season. In the case of rodents, cache encounter rate, but not exploitation rate, could be explained by canopy cover provided by the crop. L. multiflorum seemed to be preferred in early 2007, whereas V. villosa was in 2008. The adoption of no-till or minimum tillage systems together with the establishment of field edge vegetation are likely to encourage seed cache encounter and exploitation rates by both harvester ants and rodents, thus leading to increased weed control in semi-arid cereals.

13.
BLAST is the most popular bioinformatics tool and is used to run millions of queries each day. However, evaluating such queries is slow, taking typically minutes on modern workstations. Therefore, continuing evolution of BLAST--by improving its algorithms and optimizations--is essential to improve search times in the face of exponentially increasing collection sizes. We present an optimization to the first stage of the BLAST algorithm specifically designed for protein search. It produces the same results as NCBI-BLAST but in around 59% of the time on Intel-based platforms; we also present results for other popular architectures. Overall, this is a saving of around 15% of the total typical BLAST search time. Our approach uses a deterministic finite automaton (DFA), inspired by the original scheme used in the 1990 BLAST algorithm. The techniques are optimized for modern hardware, making careful use of cache-conscious approaches to improve speed. Our optimized DFA approach has been integrated into a new version of BLAST that is freely available for download at http://www.fsa-blast.org/.

14.
The bioassessment and monitoring of the ecological status of rivers using macrophytes has gained new momentum since macrophytes were recognised as biological quality elements for the implementation of the European Water Framework Directive (WFD; EU/2000/60). Our objectives were to test the suitability of two predictive modelling approaches to macrophyte communities as a tool for water quality assessment, and to compare their performance with other more common approaches: the use of macrophytes as indicators of the trophic status of rivers and multimetric indices. We used floristic and environmental data that were collected in the spring of 2004 and 2005 from around 400 sites on rivers across mainland Portugal, western Iberia. We built two predictive models: MACPACS (MACrophyte Prediction And Classification System) and MAC (Macrophyte Assessment and Classification), based on the RIVPACS and BEAST methods, respectively. Whereas MACPACS is derived from taxa occurrence data, MAC uses a quantitative measure of taxa abundance. Both models showed good performance in predicting reference sites to the correct group and a low rate of misclassification errors. However, they performed differently. MAC depicts a reliable response to the overall human-mediated degradation of fluvial systems, as does the multimetric index (RVI, Riparian Vegetation Index), but MACPACS presented only a poor correlation with the Global Human Disturbance Index and with the nutrients input. The incorporation of abundance data in vegetation predictive models appears to be particularly important to the detection of high levels of degradation. The values for correlations with physical–chemical pressure variables were lower than expected for MTR (Mean Trophic Rank) due to an insufficient number of scoring species found in Portuguese fluvial systems. Our results suggest that the most effective methods for bioassessment in Mediterranean-type rivers are either the RVI or the MAC predictive model.

15.
Clustering of main orthologs for multiple genomes   Total citations: 1, self-citations: 0, citations by others: 1
The identification of orthologous genes shared by multiple genomes is critical for both functional and evolutionary studies in comparative genomics. While it is usually done by sequence similarity search and reconciled tree construction in practice, recently a new combinatorial approach and high-throughput system MSOAR for ortholog identification between closely related genomes based on genome rearrangement and gene duplication has been proposed in Fu et al. MSOAR assumes that orthologous genes correspond to each other in the most parsimonious evolutionary scenario, minimizing the number of genome rearrangement and (postspeciation) gene duplication events. However, the parsimony approach used by MSOAR limits it to pairwise genome comparisons. In this paper, we extend MSOAR to multiple (closely related) genomes and propose an ortholog clustering method, called MultiMSOAR, to infer main orthologs in multiple genomes. As a preliminary experiment, we apply MultiMSOAR to rat, mouse, and human genomes, and validate our results using gene annotations and gene function classifications in the public databases. We further compare our results to the ortholog clusters predicted by MultiParanoid, which is an extension of the well-known program InParanoid for pairwise genome comparisons. The comparison reveals that MultiMSOAR gives more detailed and accurate orthology information, since it can effectively distinguish main orthologs from inparalogs.

16.
Pyrosequencing is a DNA sequencing technique based on sequencing-by-synthesis enabling rapid and real-time sequence determination. Although ample genomic research has been undertaken using pyrosequencing, the requirement for a relatively high amount of DNA template and the difficulty in sequencing homopolymeric regions limit its key advantages in applications directed towards clinical research. In this study, we demonstrate that pyrosequencing of homopolymeric regions with 10 identical nucleotides can be successfully performed with an optimal amount of DNA (0.3125-5 pmol) immobilized on conventional non-porous Sepharose beads. We also show that, with porous silica beads, the sequencing signal increased 3.5-fold compared to that produced from the same amount of DNA immobilized on solid Sepharose beads. Our results strongly indicate that, with an optimized quantity of DNA and a suitable solid support, the performance of pyrosequencing on homopolymeric regions and its detection limit are significantly improved.

17.
Orthologs are genes from different genomes that originate from a common ancestral gene through a speciation event. They are most similar in the structure of the encoded proteins and therefore should have a similar function. Here I apply the principle used for detection of structural orthology to a genome-wide analysis of gene expression. For this purpose, I determine the mutual similarity rank in an all-by-all comparison of among-tissues expression patterns. The expression of most human–mouse orthologs in homologous tissues is poorly correlated (average mutual coexpression rank is only 4835 out of 18,092). Genes from evolutionarily labile gene families, which experience rapid turnover of family composition, are among those with the strongest expression change. However, the revealed phenomenon is not limited to them. There is no or very weak relationship between protein sequence divergence and mutual coexpression rank. Also, generally there is no relationship between the ratio of nonsynonymous to synonymous nucleotide substitutions and coexpression rank. This relationship is tangible only within evolutionarily labile gene families. These results indicate that despite a similar biochemical function of orthologs reflected in the conserved protein sequence, the physiological (systemic) context of this function can be changed. Also, these results suggest that gene biochemical function and its physiological role in the organism can evolve independently.
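
One way to read the mutual coexpression rank described above is sketched below. The abstract does not give the exact definition, so this is an assumption: each gene's among-tissues profile is compared with every gene of the other species by Pearson correlation, the ortholog's rank is taken in both directions, and the two ranks are averaged. All function names and the toy profiles are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def rank_of(target, query_profile, candidates):
    """1-based rank of `target` when candidates are sorted by
    decreasing correlation with `query_profile`."""
    order = sorted(candidates,
                   key=lambda gene: pearson(query_profile, candidates[gene]),
                   reverse=True)
    return order.index(target) + 1


def mutual_rank(gene_a, gene_b, expr_a, expr_b):
    """Average of the two directional ranks (assumed definition)."""
    rank_ab = rank_of(gene_b, expr_a[gene_a], expr_b)
    rank_ba = rank_of(gene_a, expr_b[gene_b], expr_a)
    return (rank_ab + rank_ba) / 2


# Toy profiles over four homologous tissues (made-up numbers).
expr_a = {"h1": [1, 2, 3, 4], "h2": [4, 3, 2, 1], "h3": [1, 3, 2, 4]}
expr_b = {"m1": [2, 4, 6, 8], "m2": [8, 6, 4, 2], "m3": [1, 2, 1, 3]}
print(mutual_rank("h1", "m1", expr_a, expr_b))
print(mutual_rank("h1", "m3", expr_a, expr_b))
```

Under this reading, a rank of 1 means two genes are each other's closest expression match, and larger values mean that many non-orthologous genes are more similarly expressed.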

18.
Orthologs generally are under selective pressure against loss of function, while paralogs usually accumulate mutations and finally die or deviate in terms of function or regulation. Most ortholog detection methods contaminate the resulting datasets with a substantial amount of paralogs. Therefore we aimed to implement a straightforward method that allows the detection of ortholog clusters with a reduced amount of paralogs from completely sequenced genomes. The described cross-species expansion of the reciprocal best BLAST hit method is a time-effective method for ortholog detection, which results in 68% truly orthologous clusters, and the procedure specifically enriches single-copy orthologs. The detection of true orthologs can provide a phylogenetic toolkit to better understand evolutionary processes. In a study across six photosynthetic eukaryotes, nuclear genes of putative mitochondrial origin were shown to be over-represented among single-copy orthologs. These orthologs are involved in fundamental biological processes like amino acid metabolism or translation. Molecular clock analyses based on this dataset yielded divergence time estimates for the red/green algae (1,142 MYA), green algae/land plant (725 MYA), mosses/seed plant (496 MYA), gymno-/angiosperm (385 MYA) and monocotyledons/core eudicotyledons (301 MYA) splits. Electronic supplementary material: The online version of this article (doi:) contains supplementary material, which is available to authorized users.

19.
The genus Mycobacterium comprises major human pathogens such as the causative agent of tuberculosis, Mycobacterium tuberculosis (Mtb), and many environmental species. Tuberculosis claims ~1.5 million lives every year, and drug resistant strains of Mtb are rapidly emerging. To aid the development of new tuberculosis drugs, major efforts are currently under way to determine crystal structures of Mtb drug targets and proteins involved in pathogenicity. However, a major obstacle to obtaining crystal structures is the generation of well-diffracting crystals. Proteins from thermophiles can have better crystallization and diffraction properties than proteins from mesophiles, but their sequences and structures are often divergent. Here, we establish a thermophilic mycobacterial model organism, Mycobacterium thermoresistibile (Mth), for the study of Mtb proteins. Mth tolerates higher temperatures than Mtb or other environmental mycobacteria such as M. smegmatis. Mth proteins are on average more soluble than Mtb proteins, and comparison of the crystal structures of two pairs of orthologous proteins reveals nearly identical folds, indicating that Mth structures provide good surrogates for Mtb structures. This study introduces a thermophile as a source of protein for the study of a closely related human pathogen and marks a new approach to solving challenging mycobacterial protein structures.

20.