Similar Documents
20 similar documents found (search time: 15 ms)
1.
Numerous algorithms have been developed to analyze ChIP-Seq data. However, the complexity of analyzing diverse patterns of ChIP-Seq signals, especially for epigenetic marks, still calls for the development of new algorithms and objective comparisons of existing methods. We developed Qeseq, an algorithm to detect regions of increased ChIP read density relative to background. Qeseq employs critical novel elements, such as iterative recalibration and neighbor joining of reads, to identify enriched regions of any length. To objectively assess its performance relative to 14 other ChIP-Seq peak finders, we designed a novel protocol based on Validation Discriminant Analysis (VDA) to optimally select validation sites and generated two validation datasets, which are the most comprehensive to date for algorithmic benchmarking of key epigenetic marks. In addition, we systematically explored a total of 315 diverse parameter configurations from these algorithms and found that parameters optimal in one dataset typically do not generalize to other datasets. Nevertheless, default parameters show the most stable performance, suggesting that they should be used. This study also provides a reproducible and generalizable methodology for unbiased comparative analysis of high-throughput sequencing tools that can facilitate future algorithmic development.
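The core task the abstract describes — flagging windows of improbably high read density against a background model and joining nearby windows into enriched regions — can be sketched as follows. This is an illustrative toy, not the Qeseq algorithm; the window size, Poisson background, and merge gap are all assumptions for the example.

```python
import math

def enriched_regions(coverage, window=5, p_cut=1e-3, merge_gap=2):
    """Toy peak finder: flag windows whose read count is improbably high
    under a Poisson background, then join nearby windows into regions.
    Illustrative only -- not the Qeseq algorithm."""
    n = len(coverage)
    lam = sum(coverage) / n * window  # expected background reads per window

    def poisson_sf(k, lam):
        # P(X >= k) for X ~ Poisson(lam), via the complement of the CDF
        cdf = sum(math.exp(-lam) * lam**i / math.factorial(i) for i in range(k))
        return 1.0 - cdf

    hits = []
    for start in range(0, n - window + 1):
        count = sum(coverage[start:start + window])
        if poisson_sf(count, lam) < p_cut:
            hits.append((start, start + window))

    # join enriched windows separated by at most merge_gap positions,
    # so regions of any length can emerge from fixed-size windows
    regions = []
    for s, e in hits:
        if regions and s - regions[-1][1] <= merge_gap:
            regions[-1][1] = e
        else:
            regions.append([s, e])
    return [tuple(r) for r in regions]
```

A spike of high coverage yields a single merged region, while flat coverage yields none.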

2.
3.
Docking of hydrophobic ligands with interaction-based matching algorithms
MOTIVATION: Matching of chemical interacting groups is a common concept for docking and fragment placement algorithms in computer-aided drug design. These algorithms have been proven to be reliable and fast if at least a certain number of hydrogen bonds or salt bridges occur. However, the algorithms typically run into problems if hydrophobic fragments or ligands should be placed. In order to dock hydrophobic fragments without significant loss of computational efficiency, we have extended the interaction model and placement algorithms in our docking tool FlexX. The concept of multi-level interactions is introduced into the algorithms for automatic selection and placement of base fragments. RESULTS: With the multi-level interaction model and the corresponding algorithmic extensions, we were able to improve the overall performance of FlexX significantly. We tested the approach with a set of 200 protein-ligand complexes taken from the Brookhaven Protein Data Bank (PDB). The fraction of test cases which can be docked within 1.5 Å RMSD of the crystal structure increased from 58% to 64%. The performance gain is paid for by an increase in computation time from 73 to 91 s on average per protein-ligand complex. AVAILABILITY: The FlexX molecular docking software is available for UNIX platforms IRIX, Solaris and Linux. See http://cartan.gmd.de/FlexX for additional information.

4.
The identification of MHC class II restricted peptide epitopes is an important goal in immunological research. A number of computational tools have been developed for this purpose, but there is a lack of large-scale systematic evaluation of their performance. Herein, we used a comprehensive dataset consisting of more than 10,000 previously unpublished MHC-peptide binding affinities, 29 peptide/MHC crystal structures, and 664 peptides experimentally tested for CD4+ T cell responses to systematically evaluate the performance of publicly available MHC class II binding prediction tools. While in selected instances the best tools were associated with AUC values up to 0.86, in general, class II predictions did not perform as well as historically noted for class I predictions. It appears that the ability of MHC class II molecules to bind variable length peptides, which requires the correct assignment of peptide binding cores, is a critical factor limiting the performance of existing prediction tools. To improve performance, we implemented a consensus prediction approach that combines methods with top performances. We show that this consensus approach achieved the best overall performance. Finally, we make the large datasets used publicly available as a benchmark to facilitate further development of MHC class II binding peptide prediction methods.
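A common way to build such a consensus — combining predictors by converting each method's scores to percentile ranks and taking the median — can be sketched as below. The scoring convention (higher score = stronger predicted binding) and the peptide names in the test are assumptions for illustration; real tools differ in their output scales, which is exactly why rank-based combination is used.

```python
def consensus_rank(score_tables):
    """Combine several predictors by median percentile rank.
    score_tables: list of dicts {peptide: score}, higher score = stronger
    predicted binding. Returns peptides sorted by median rank (best first).
    Illustrative sketch of the consensus idea."""
    peptides = list(score_tables[0])
    ranks = {p: [] for p in peptides}
    for table in score_tables:
        ordered = sorted(peptides, key=lambda p: -table[p])
        n = len(ordered)
        for i, p in enumerate(ordered):
            ranks[p].append(100.0 * (i + 1) / n)  # percentile rank, low = best

    def median(xs):
        xs = sorted(xs)
        m = len(xs) // 2
        return xs[m] if len(xs) % 2 else 0.5 * (xs[m - 1] + xs[m])

    return sorted(peptides, key=lambda p: median(ranks[p]))
```

A peptide ranked near the top by most methods rises to the top of the consensus even if one method scores it poorly.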

5.
Genome comparison poses important computational challenges, especially in CPU time, memory allocation and I/O operations. Although parallel approaches to multiple sequence comparison algorithms already exist, they face a significant limitation on the input sequence length. GECKO appeared as a computationally and memory-efficient method to overcome this limitation. However, its performance could be greatly increased by applying parallel strategies and I/O optimisations. We have applied two different strategies to accelerate GECKO while producing the same results. The first is a two-level parallel approach, parallelising each independent internal pairwise comparison at the first level and the GECKO modules at the second level. The second is a complete rewrite of the original code to reduce I/O. Both strategies outperform the original code, which was already faster than equivalent software. Thus, much faster pairwise and multiple genome comparisons can be performed, which is increasingly important given the ever-growing list of available genomes.
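The first-level strategy — running every independent pairwise comparison as its own task — maps naturally onto a process pool, as in the sketch below. The `compare_pair` kernel (shared k-mer counting) is a hypothetical stand-in chosen for brevity; GECKO's actual comparison modules are far more involved.

```python
from itertools import combinations
from multiprocessing import Pool

def compare_pair(pair):
    """Stand-in for one pairwise comparison: count shared 4-mers.
    (Hypothetical kernel, for illustration only.)"""
    (name_a, seq_a), (name_b, seq_b) = pair
    k = 4
    def kmers(s):
        return {s[i:i + k] for i in range(len(s) - k + 1)}
    return name_a, name_b, len(kmers(seq_a) & kmers(seq_b))

def all_vs_all(genomes, workers=2):
    """First-level parallelism: each independent pairwise comparison
    becomes a separate task in a process pool."""
    pairs = list(combinations(genomes.items(), 2))
    with Pool(workers) as pool:
        return pool.map(compare_pair, pairs)
```

Because the n(n-1)/2 comparisons share no state, this level parallelises embarrassingly; the second level (parallelising within each comparison) would live inside `compare_pair`.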

6.
OSCAR: one-class SVM for accurate recognition of cis-elements

7.
8.
The increasing availability of time series expression datasets, although promising, raises a number of new computational challenges. Accordingly, the development of suitable classification methods to make reliable and sound predictions is becoming a pressing issue. We propose here a new method to classify time series gene expression via integration of biological networks. We evaluated our approach on 2 different datasets and showed that the use of a hidden Markov model/Gaussian mixture model hybrid exploits the time-dependence of the expression data, thereby leading to better prediction results. We demonstrated that the biclustering procedure identifies function-related genes as a whole, giving rise to high accordance in prognosis prediction across independent time series datasets. In addition, we showed that integration of biological networks into our method significantly improves prediction performance. Moreover, we compared our approach with several state-of-the-art algorithms and found that our method outperformed previous approaches with regard to various criteria. Finally, our approach achieved better prediction results on early-stage data, implying the potential of our method for practical prediction.
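The HMM side of such a hybrid classifies a time series by comparing its likelihood under class-specific models, computed with the forward algorithm. The sketch below uses single-Gaussian emissions and hand-set parameters purely for illustration (the paper's model, with GMM emissions and trained parameters, is more elaborate).

```python
import math

def forward_loglik(obs, pi, A, means, var):
    """Log-likelihood of a 1-D series under an HMM with Gaussian emissions,
    via a scaled forward pass. Sketch of the idea, not the paper's model."""
    def logpdf(x, mu):
        return -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)
    n_states = len(pi)
    alpha = [pi[s] * math.exp(logpdf(obs[0], means[s])) for s in range(n_states)]
    ll = 0.0
    for t, x in enumerate(obs):
        if t > 0:
            alpha = [sum(alpha[r] * A[r][s] for r in range(n_states))
                     * math.exp(logpdf(x, means[s])) for s in range(n_states)]
        scale = sum(alpha)          # rescale to avoid underflow
        ll += math.log(scale)
        alpha = [a / scale for a in alpha]
    return ll

def classify(obs, models):
    """Assign the series to the class whose HMM gives it the highest likelihood."""
    return max(models, key=lambda name: forward_loglik(obs, *models[name]))
```

Here a "rising" model (state means 0 then 5) wins on a rising series precisely because the transition structure captures the time-dependence that a static classifier would ignore.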

9.

Background

With advances in next generation sequencing technologies and genomic capture techniques, exome sequencing has become a cost-effective approach for mutation detection in genetic diseases. However, computational prediction of copy number variants (CNVs) from exome sequence data is a challenging task. Whilst numerous programs are available, their sensitivities differ, and all have low sensitivity for smaller CNVs (1–4 exons). Additionally, exonic CNV discovery using standard aCGH is limited by the low probe density over exonic regions. The goal of our study was to develop a protocol to detect exonic CNVs (including shorter CNVs that cover 1–4 exons), combining computational prediction algorithms and a high-resolution custom CGH array.

Results

We used six published CNV prediction programs (ExomeCNV, CONTRA, ExomeCopy, ExomeDepth, CoNIFER, XHMM) and an in-house modification to ExomeCopy and ExomeDepth (ExCopyDepth) for computational CNV prediction on 30 exomes from the 1000 genomes project and 9 exomes from primary immunodeficiency patients. CNV predictions were tested using a custom CGH array designed to capture all exons (exaCGH). After this validation, we next evaluated the computational prediction of shorter CNVs. ExomeCopy and the in-house modified algorithm, ExCopyDepth, showed the highest capability in detecting shorter CNVs. Finally, the performance of each computational program was assessed by calculating the sensitivity and false positive rate.

Conclusions

In this paper, we assessed the ability of 6 computational programs to predict CNVs, focussing on short (1–4 exon) CNVs. We also tested these predictions using a custom array targeting exons. Based on these results, we propose a protocol to identify and confirm shorter exonic CNVs combining computational prediction algorithms and custom aCGH experiments.

Electronic supplementary material

The online version of this article (doi:10.1186/1471-2164-15-661) contains supplementary material, which is available to authorized users.
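The depth-ratio principle shared by the programs compared above can be sketched in a few lines: normalise per-exon depths, take log2 ratios against a matched reference, and call runs of consecutive outlier exons as one CNV. This is a naive illustration of the general idea, not any of the six tools; the thresholds are assumptions roughly corresponding to heterozygous deletion and single-copy duplication.

```python
import math

def call_exon_cnvs(sample_depth, reference_depth, del_cut=-0.8, dup_cut=0.6):
    """Naive read-depth CNV caller over exons. Returns a list of
    (first_exon, last_exon, 'DEL'|'DUP') calls, merging consecutive
    outlier exons so 1-exon and multi-exon CNVs are both representable."""
    s_total = sum(sample_depth)
    r_total = sum(reference_depth)
    calls, current = [], None
    for i, (s, r) in enumerate(zip(sample_depth, reference_depth)):
        ratio = math.log2((s / s_total) / (r / r_total))
        state = 'DEL' if ratio < del_cut else 'DUP' if ratio > dup_cut else None
        if state and current and current[2] == state and current[1] == i - 1:
            current = (current[0], i, state)   # extend the open call
        else:
            if current:
                calls.append(current)
            current = (i, i, state) if state else None
    if current:
        calls.append(current)
    return calls
```

A sample whose exons 3-4 carry half the reference depth yields a single two-exon deletion call — the short-CNV case (1-4 exons) that the study highlights as hardest.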

10.
11.
Inference of protein functions is one of the most important aims of modern biology. To fully exploit the large volumes of genomic data typically produced in modern-day genomic experiments, automated computational methods for protein function prediction are urgently needed. Established methods use sequence or structure similarity to infer functions, but those types of data do not suffice to determine the biological context in which proteins act. Current high-throughput biological experiments produce large amounts of data on the interactions between proteins. Such data can be used to infer interaction networks and to predict the biological process that a protein is involved in. Here, we develop a probabilistic approach for protein function prediction using network data, such as protein-protein interaction measurements. We take a Bayesian approach to an existing Markov Random Field method by performing simultaneous estimation of the model parameters and prediction of protein functions. We use an adaptive Markov Chain Monte Carlo algorithm that leads to more accurate parameter estimates and consequently to improved prediction performance compared to the standard Markov Random Field method. We tested our method using a high-quality S. cerevisiae validation network with 1622 proteins against 90 Gene Ontology terms of different levels of abstraction. Compared to three other protein function prediction methods, our approach shows very good prediction performance. Our method can be directly applied to protein-protein interaction or coexpression networks, but can also be extended to use multiple data sources. We apply our method to physical protein interaction data from S. cerevisiae and provide novel predictions, using 340 Gene Ontology terms, for 1170 unannotated proteins, and we evaluate the predictions using the available literature.
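The Markov Random Field idea — an unannotated protein's label depends on how many neighbours share it, sampled by MCMC — can be illustrated with a toy Gibbs sampler for a single binary function label. This sketch fixes the coupling parameter `beta`, whereas the paper's Bayesian approach samples the model parameters jointly with the labels.

```python
import math, random

def gibbs_predict(edges, labels, beta=1.0, sweeps=200, burn=50, seed=0):
    """Toy MRF predictor: each unannotated node's label is resampled from
    P(y_i = 1 | neighbours) = sigmoid(beta * (#neighbours=1 - #neighbours=0)).
    Returns the post-burn-in posterior frequency of label 1 per hidden node.
    Illustrative only; the paper's method also estimates the parameters."""
    rng = random.Random(seed)
    nbrs = {}
    for a, b in edges:
        nbrs.setdefault(a, []).append(b)
        nbrs.setdefault(b, []).append(a)
    state = {n: labels.get(n, rng.choice([0, 1])) for n in nbrs}
    hidden = [n for n in nbrs if n not in labels]
    counts = {n: 0 for n in hidden}
    for sweep in range(sweeps):
        for n in hidden:
            agree1 = sum(1 for m in nbrs[n] if state[m] == 1)
            agree0 = len(nbrs[n]) - agree1
            p1 = 1.0 / (1.0 + math.exp(-beta * (agree1 - agree0)))
            state[n] = 1 if rng.random() < p1 else 0
            if sweep >= burn:                 # tally post-burn-in samples
                counts[n] += state[n]
    return {n: counts[n] / (sweeps - burn) for n in hidden}
```

A node whose neighbours all carry the function ends up with a posterior near 1; one surrounded by negatives ends near 0.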

12.
Key issues in protein science and computational biology are the design and evaluation of algorithms aimed at detection of proteins that belong to a specific family, as defined by structural, evolutionary, or functional criteria. In this context, several validation techniques are often used to compare different parameter settings of the detector, and to subsequently select the setting that yields the smallest error rate estimate. A frequently overlooked problem associated with this approach is that this smallest error rate estimate may have a large optimistic bias. Based on computer simulations, we show that a detector's error rate estimate can be overly optimistic and propose a method to obtain unbiased performance estimates of a detector design procedure. The method is founded on an external 10-fold cross-validation (CV) loop that embeds an internal validation procedure used for parameter selection in detector design. The detector designed in each of the 10 iterations is evaluated on held-out examples exclusively available in the external CV iterations. Notably, the average of these 10 performance estimates is not associated with a final detector, but rather with the average performance of the design procedure used. We apply the external CV loop to the particular problem of detecting potentially allergenic proteins, using a previously reported design procedure. Unbiased performance estimates of the allergen detector design procedure are presented together with information about which algorithms and parameter settings are most frequently selected.
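The nested structure described above — an inner CV loop that selects the detector's parameter, wrapped in an outer loop that scores the *design procedure* on data the selection never saw — can be sketched with a 1-D threshold detector standing in for the real one. The detector, data, and threshold grid are all hypothetical; the point is the two-loop structure.

```python
import random

def nested_cv_estimate(xs, ys, thresholds, outer_k=10, inner_k=5, seed=0):
    """Nested CV: inner folds pick the threshold, outer folds score that
    selection procedure on held-out examples. The returned average
    estimates the design procedure, not any single final detector."""
    rng = random.Random(seed)
    idx = list(range(len(xs)))
    rng.shuffle(idx)

    def accuracy(thr, ids):
        return sum((xs[i] > thr) == ys[i] for i in ids) / len(ids)

    def folds(ids, k):
        return [ids[i::k] for i in range(k)]

    outer_scores = []
    for test in folds(idx, outer_k):
        train = [i for i in idx if i not in test]
        # inner CV: threshold with the best mean validation accuracy
        best = max(thresholds, key=lambda t: sum(
            accuracy(t, val) for val in folds(train, inner_k)) / inner_k)
        # evaluate that choice on examples unseen by the selection step
        outer_scores.append(accuracy(best, test))
    return sum(outer_scores) / len(outer_scores)
```

Reporting only the best inner-loop accuracy would reproduce exactly the optimistic bias the paper warns about; the outer loop removes it.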

13.

Background

Early and accurate identification of adverse drug reactions (ADRs) is critically important for drug development and clinical safety. Computer-aided prediction of ADRs has attracted increasing attention in recent years, and many computational models have been proposed. However, because of the lack of systematic analysis and comparison of the different computational models, there remain limitations in designing more effective algorithms and selecting more useful features. There is therefore an urgent need to review and analyze previous computation models to obtain general conclusions that can provide useful guidance to construct more effective computational models to predict ADRs.

Principal Findings

In the current study, the main work is to compare and analyze the performance of existing computational methods to predict ADRs, by implementing and evaluating additional algorithms that have earlier been used for predicting drug targets. Our results indicated that topological and intrinsic features were complementary to an extent, and that the Jaccard coefficient had an important and general effect on the prediction of drug-ADR associations. By comparing the structure of each algorithm, we found that their final formulas could all be converted to linear models in form; based on this finding, we propose a new algorithm, called the general weighted profile method, which yielded the best overall performance among the algorithms investigated in this paper.

Conclusion

Several meaningful conclusions and useful findings regarding the prediction of ADRs are provided for selecting optimal features and algorithms.
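A linear, Jaccard-weighted neighbourhood model of the kind the study converges on can be sketched as follows: score a candidate drug-ADR pair as the similarity-weighted average of whether similar drugs cause that ADR. This illustrates the weighted-profile family in spirit only; the drug names, feature sets, and exact formula are assumptions for the example, not the paper's.

```python
def jaccard(a, b):
    """Jaccard coefficient between two feature sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def weighted_profile_score(drug, adr, known, features):
    """Linear neighbourhood score for a drug-ADR pair:
    similarity-weighted average over drugs with known ADR profiles.
    known: {drug: set of ADRs}; features: {drug: set of feature tokens}."""
    num = den = 0.0
    for other, adrs in known.items():
        if other == drug:
            continue
        w = jaccard(features[drug], features[other])
        num += w * (adr in adrs)
        den += w
    return num / den if den else 0.0
```

A drug inherits an ADR from feature-similar drugs and not from dissimilar ones, which is why the Jaccard coefficient has such a direct effect on prediction quality.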

14.
In search of lost introns
Many fundamental questions concerning the emergence and subsequent evolution of eukaryotic exon-intron organization are still unsettled. Genome-scale comparative studies, which can shed light on crucial aspects of eukaryotic evolution, require adequate computational tools. We describe novel computational methods for studying spliceosomal intron evolution. Our goal is to give a reliable characterization of the dynamics of intron evolution. Our algorithmic innovations address the identification of orthologous introns, and the likelihood-based analysis of intron data. We discuss a compression method for the evaluation of the likelihood function, which is noteworthy for phylogenetic likelihood problems in general. We prove that after O(n l) preprocessing time, subsequent evaluations take O(n l/log l) time almost surely in the Yule-Harding random model of n-taxon phylogenies, where l is the input sequence length. We illustrate the practicality of our methods by compiling and analyzing a data set involving 18 eukaryotes, which is more than in any other study to date. The study yields the surprising result that ancestral eukaryotes were fairly intron-rich. For example, the bilaterian ancestor is estimated to have had more than 90% as many introns as vertebrates do now. AVAILABILITY: The Java implementations of the algorithms are publicly available from the corresponding author's site http://www.iro.umontreal.ca/~csuros/introns/. Data are available on request.

15.
16.
MOTIVATION: With the advent of microarray chip technology, large data sets are emerging containing the simultaneous expression levels of thousands of genes at various time points during a biological process. Biologists are attempting to group genes based on the temporal pattern of their expression levels. While the use of hierarchical clustering (UPGMA) with correlation 'distance' has been the most common in microarray studies, there are many more choices of clustering algorithms in the pattern recognition and statistics literature. At the moment there do not seem to be any clear-cut guidelines regarding the choice of a clustering algorithm to be used for grouping genes based on their expression profiles. RESULTS: In this paper, we consider six clustering algorithms (of various flavors!) and evaluate their performance on a well-known publicly available microarray data set on sporulation of budding yeast and on two simulated data sets. Among other things, we formulate three reasonable validation strategies that can be used with any clustering algorithm when temporal observations or replications are present. We evaluate each of these six clustering methods with these validation measures. While the 'best' method depends on the exact validation strategy and the number of clusters used, overall Diana appears to be a solid performer. Interestingly, the performance of correlation-based hierarchical clustering and model-based clustering (another method that has been advocated by a number of researchers) appear to be on opposite extremes, depending on which validation measure one employs. Next it is shown that the group means produced by Diana are the closest, and those produced by UPGMA the farthest, from a model profile based on a set of hand-picked genes. AVAILABILITY: S+ codes for the partial least squares based clustering are available from the authors upon request. All other clustering methods considered have S+ implementations in the library MASS. S+ codes for calculating the validation measures are available from the authors upon request. The sporulation data set is publicly available at http://cmgm.stanford.edu/pbrown/sporulation

17.
Reverse engineering of gene regulatory networks has been an intensively studied topic in bioinformatics, since it constitutes an intermediate step from explorative to causative gene expression analysis. Many methods have been proposed in recent years, covering a wide range of mathematical approaches. In practice, different mathematical approaches will generate different resulting network structures, so it is very important for users to assess the performance of these algorithms. We have conducted a comparative study with six different reverse engineering methods, including relevance networks, neural networks, and Bayesian networks. Our approach consists of the generation of defined benchmark data, the analysis of these data with the different methods, and the assessment of algorithmic performance by statistical analyses. Performance was judged across network sizes and noise levels. The results of the comparative study highlight the neural network approach as the best performing method among those under study.

18.
Vaccine development efforts will be guided by algorithms that predict immunogenic epitopes. Such prediction methods rely on classification-based algorithms that are trained against curated data sets of known B and T cell epitopes. It is unclear whether this empirical approach can be applied prospectively to predict epitopes associated with protective immunity for novel antigens. We present a comprehensive comparison of in silico B and T cell epitope predictions with in vivo validation using a previously uncharacterized malaria antigen, CelTOS. CelTOS shares no known conserved structural elements with any known proteins, and thus is not represented in any epitope databases used to train prediction algorithms. This analysis represents a blind assessment of this approach in the context of a novel, immunologically relevant antigen. The limited accuracy of the tested algorithms in predicting the in vivo immune responses emphasizes the need to improve their predictive capabilities for use as tools in vaccine design.

19.
A model-based rational strategy for the selection of chromatographic resins is presented. The main question being addressed is that of selecting the optimal chromatographic resin from a few promising alternatives. The methodology starts with chromatographic modeling, parameter acquisition, and model validation, followed by model-based optimization of the chromatographic separation for the resins of interest. Finally, the resins are rationally evaluated based on their optimized operating conditions and performance metrics such as product purity, yield, concentration, throughput, productivity, and cost. Resin evaluation proceeds by two main approaches. In the first approach, Pareto frontiers from multi-objective optimization of conflicting objectives are overlaid for different resins, enabling direct visualization and comparison of resin performances based on the feasible solution space. The second approach involves the transformation of the resin performances into weighted resin scores, enabling the simultaneous consideration of multiple performance metrics and the setting of priorities. The proposed model-based resin selection strategy was illustrated by evaluating three mixed mode adsorbents (ADH, PPA, and HEA) for the separation of a ternary mixture of bovine serum albumin, ovalbumin, and amyloglucosidase. In order of decreasing weighted resin score or performance, the top three resins for this separation were ADH > PPA > HEA. The proposed model-based approach could be a suitable alternative to column scouting during process development, the main strengths being that minimal experimentation is required and resins are evaluated under their ideal working conditions, enabling a fair comparison. This work also demonstrates the application of column modeling and optimization to mixed mode chromatography.
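Both evaluation approaches — overlaying Pareto frontiers and collapsing metrics into weighted scores — have simple computational cores, sketched below. The metric values and weights in the test are invented for illustration; only the ADH > PPA > HEA ordering echoes the abstract.

```python
def pareto_front(points):
    """Maximising Pareto frontier of metric tuples (all higher-is-better):
    keep the points not dominated by any other point."""
    def dominated(p, q):
        # q dominates p if it is at least as good everywhere and differs
        return all(qi >= pi for pi, qi in zip(p, q)) and q != p
    return [p for p in points if not any(dominated(p, q) for q in points)]

def weighted_score(metrics, weights):
    """Collapse several performance metrics into one score per resin.
    Metrics are rescaled to [0, 1] across resins so units cancel;
    weights encode the user's priorities. Sketch of the scoring idea."""
    names = list(weights)
    lo = {n: min(m[n] for m in metrics.values()) for n in names}
    hi = {n: max(m[n] for m in metrics.values()) for n in names}
    def scaled(v, n):
        return (v - lo[n]) / (hi[n] - lo[n]) if hi[n] > lo[n] else 1.0
    return {resin: sum(weights[n] * scaled(m[n], n) for n in names)
            for resin, m in metrics.items()}
```

Overlaying the frontiers answers "which resin's feasible trade-offs dominate?", while the weighted score answers "which resin is best under these priorities?" — the two complementary views the strategy combines.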

20.
MOTIVATION: Gene expression profiling is a powerful approach to identify genes that may be involved in a specific biological process on a global scale. For example, gene expression profiling of mutant animals that lack or contain an excess of certain cell types is a common way to identify genes that are important for the development and maintenance of given cell types. However, it is difficult for traditional computational methods, including unsupervised and supervised learning methods, to detect relevant genes from a large collection of expression profiles with high sensitivity and specificity. Unsupervised methods group similar gene expressions together while ignoring important prior biological knowledge. Supervised methods utilize training data from prior biological knowledge to classify gene expression. However, for many biological problems, little prior knowledge is available, which limits the prediction performance of most supervised methods. RESULTS: We present a Bayesian semi-supervised learning method, called BGEN, that improves upon supervised and unsupervised methods by both capturing relevant expression profiles and using prior biological knowledge from literature and experimental validation. Unlike currently available semi-supervised learning methods, this new method trains a kernel classifier based on labeled and unlabeled gene expression examples. The semi-supervised trained classifier can then be used to efficiently classify the remaining genes in the dataset. Moreover, we model the confidence of microarray probes and probabilistically combine multiple probe predictions into gene predictions. We apply BGEN to identify genes involved in the development of a specific cell lineage in the C. elegans embryo, and to further identify the tissues in which these genes are enriched. Compared to K-means clustering and SVM classification, BGEN achieves higher sensitivity and specificity. We confirm certain predictions by biological experiments. 
AVAILABILITY: The results are available at http://www.csail.mit.edu/~alanqi/projects/BGEN.html.
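The semi-supervised principle behind training on labeled and unlabeled examples together can be illustrated with label propagation over a correlation-based similarity graph: seed labels spread to unlabeled genes with similar expression profiles. This stands in for BGEN's Bayesian kernel classifier only in spirit; the gene names, profiles, and mixing weight are assumptions for the example.

```python
def propagate_labels(profiles, labels, alpha=0.8, iters=50):
    """Semi-supervised sketch: spread seed labels over a graph whose edge
    weights are positive Pearson correlations between expression profiles.
    labels: {gene: 0.0 or 1.0} seeds; returns a score per gene in [0, 1].
    Illustrative only -- not BGEN's Bayesian kernel classifier."""
    genes = list(profiles)

    def corr(a, b):
        n = len(a)
        ma, mb = sum(a) / n, sum(b) / n
        cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
        va = sum((x - ma) ** 2 for x in a) ** 0.5
        vb = sum((y - mb) ** 2 for y in b) ** 0.5
        return cov / (va * vb) if va and vb else 0.0

    sim = {g: {h: max(0.0, corr(profiles[g], profiles[h]))
               for h in genes if h != g} for g in genes}
    score = {g: labels.get(g, 0.5) for g in genes}
    for _ in range(iters):
        new = {}
        for g in genes:
            total = sum(sim[g].values())
            nb = (sum(sim[g][h] * score[h] for h in sim[g]) / total
                  if total else score[g])
            # labeled genes stay clamped; unlabeled mix neighbours and prior
            new[g] = labels[g] if g in labels else alpha * nb + (1 - alpha) * 0.5
        score = new
    return score
```

Unlabeled genes co-expressed with a positive seed move toward 1, those co-expressed with a negative seed toward 0 — the same leverage over unlabeled data that distinguishes semi-supervised methods from plain K-means or SVM.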
