Similar literature
 Found 20 similar articles (search time: 15 ms)
1.

Background

Recent advances in genome technologies and the subsequent collection of genomic information at various molecular resolutions hold promise to accelerate the discovery of new therapeutic targets. A critical step in achieving these goals is to develop efficient clinical prediction models that integrate these diverse sources of high-throughput data. This step is challenging due to the presence of high-dimensionality and complex interactions in the data. For predicting relevant clinical outcomes, we propose a flexible statistical machine learning approach that acknowledges and models the interaction between platform-specific measurements through nonlinear kernel machines and borrows information within and between platforms through a hierarchical Bayesian framework. Our model has parameters with direct interpretations in terms of the effects of platforms and data interactions within and across platforms. The parameter estimation algorithm in our model uses a computationally efficient variational Bayes approach that scales well to large high-throughput datasets.

Results

We apply our methods to the glioblastoma multiforme (GBM) dataset from The Cancer Genome Atlas (TCGA), integrating gene/mRNA expression and microRNA profiles to predict patient survival times. In terms of prediction accuracy, we show that our non-linear and interaction-based integrative methods perform better than linear alternatives and non-integrative methods that do not account for interactions between the platforms. We also find several prognostic mRNAs and microRNAs that are related to tumor invasion and are known to drive tumor metastasis and severe inflammatory response in GBM. In addition, our analysis reveals several interesting mRNA and microRNA interactions that have known implications in the etiology of GBM.

Conclusions

Our approach gains its flexibility and power by modeling the non-linear interaction structures between and within the platforms. Our framework is a useful tool for biomedical researchers, since clinical prediction using multi-platform genomic information is an important step towards personalized treatment of many cancers. Our software is freely available at: http://odin.mdacc.tmc.edu/~vbaladan.

2.

3.
We describe a method for extracting Boolean implications (if-then relationships) in very large amounts of gene expression microarray data. A meta-analysis of data from thousands of microarrays for humans, mice, and fruit flies finds millions of implication relationships between genes that would be missed by other methods. These relationships capture gender differences, tissue differences, development, and differentiation. New relationships are discovered that are preserved across all three species.
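
To make the idea of an if-then relationship between two genes concrete, here is a minimal sketch. It is not the authors' procedure: the abstract does not say how expression levels are discretized or how an implication is tested, so the fixed thresholds, the `max_violation` tolerance, and the function name `implies_high_high` are all assumptions made for illustration.

```python
def implies_high_high(a_vals, b_vals, a_thr, b_thr, max_violation=0.05):
    """Check the candidate rule 'gene A high => gene B high'.

    Assumption: a gene is 'high' when its value exceeds a fixed threshold,
    and the rule holds when at most `max_violation` of the A-high samples
    have B low. Both choices are illustrative, not taken from the source.
    """
    a_high = 0
    violations = 0
    for a, b in zip(a_vals, b_vals):
        if a > a_thr:
            a_high += 1
            if b <= b_thr:
                violations += 1  # A is high but B is not
    if a_high == 0:
        return False  # no evidence either way
    return violations / a_high <= max_violation
```

An implication is directional: the rule can hold from A to B while failing from B to A, which is one way such a relationship differs from a symmetric correlation.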

4.
5.
Michael Nute, Tandy Warnow. BMC Genomics, 2016, 17(10): 764-144

Background

Multiple sequence alignment is an important task in bioinformatics, and alignments of large datasets containing hundreds or thousands of sequences are increasingly of interest. While many alignment methods exist, the most accurate alignments are likely to be based on stochastic models where sequences evolve down a tree with substitutions, insertions, and deletions. While some methods have been developed to estimate alignments under these stochastic models, only the Bayesian method BAli-Phy has been able to run on even moderately large datasets, containing 100 or so sequences. A technique to extend BAli-Phy to enable alignments of thousands of sequences could potentially improve alignment and phylogenetic tree accuracy on large-scale data beyond the best-known methods today.

Results

We use simulated data with up to 10,000 sequences representing a variety of model conditions, including some that are significantly divergent from the statistical models used in BAli-Phy and elsewhere. We give a method for incorporating BAli-Phy into PASTA and UPP, two strategies for enabling alignment methods to scale to large datasets, and give alignment and tree accuracy results measured against the ground truth from simulations. Comparable results are also given for other methods capable of aligning this many sequences.

Conclusions

Extensions of BAli-Phy using PASTA and UPP produce significantly more accurate alignments and phylogenetic trees than the current leading methods.

6.

Background  

Predicting protein residue-residue contacts is an important 2D prediction task. It is useful for ab initio structure prediction and for understanding protein folding. In spite of steady progress over the past decade, contact prediction remains largely unsolved.

7.
8.
Data with a large p (number of covariates) and/or a large n (sample size) are now commonly encountered. For many problems, regularization, especially penalization, is adopted for estimation and variable selection. The straightforward application of penalization to large datasets demands a “big computer” with high computational power. To improve computational feasibility, we develop bootstrap penalization, which dissects a big penalized estimation into a set of small ones that can be executed in a highly parallel manner, each demanding only a “small computer”. The proposed approach takes different strategies for data with different characteristics. For data with a large p but a small to moderate n, covariates are first clustered into relatively homogeneous blocks. The proposed approach then consists of two sequential steps. In each step and for each bootstrap sample, we select blocks of covariates and run penalization. The results from multiple bootstrap samples are pooled to generate the final estimate. For data with a large n but a small to moderate p, we bootstrap a small number of subjects, apply penalized estimation, and then conduct a weighted average over multiple bootstrap samples. For data with a large p and a large n, the natural marriage of the previous two methods is applied. Numerical studies, including simulations and data analysis, show that the proposed approach has computational and numerical advantages over the straightforward application of penalization. An R package has been developed to implement the proposed methods.
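
The large-n, small-p strategy described above (bootstrap a few subjects, fit a penalized model, take a weighted average) can be sketched roughly as follows. This is an illustration, not the authors' R package: the single-covariate ridge penalty, the inverse-residual weights, and the names `ridge_1d` and `bootstrap_ridge` are assumptions, since the abstract specifies neither the penalty nor the weighting scheme.

```python
import random


def ridge_1d(xs, ys, lam):
    """Closed-form ridge estimate for one covariate, no intercept."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


def bootstrap_ridge(xs, ys, lam=1.0, n_boot=50, m=100, seed=0):
    """Average small penalized fits over bootstrap subsamples of size m.

    Assumption: each fit is weighted by the inverse of its residual sum of
    squares; the source only says that a weighted average is used.
    """
    rng = random.Random(seed)
    n = len(xs)
    estimates, weights = [], []
    for _ in range(n_boot):
        # Draw m subjects with replacement; each fit only sees m << n rows.
        idx = [rng.randrange(n) for _ in range(m)]
        bx = [xs[i] for i in idx]
        by = [ys[i] for i in idx]
        beta = ridge_1d(bx, by, lam)
        rss = sum((y - beta * x) ** 2 for x, y in zip(bx, by))
        estimates.append(beta)
        weights.append(1.0 / (rss + 1e-12))
    total = sum(weights)
    return sum(w * b for w, b in zip(weights, estimates)) / total
```

Because the fits inside the loop are independent of one another, they could be distributed across workers, which is the property the abstract emphasizes.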

9.
In this paper, an autonomic performance management approach is introduced that can be applied to a general class of web services deployed in a large-scale distributed environment. The proposed approach applies traditional large-scale control-based algorithms, using the interaction balance approach in a web service environment, to manage the response time and the system-level power consumption. This approach is developed in a generic fashion that makes it suitable for web service deployments where web service performance can be adjusted by using a finite set of control inputs. This approach maintains the service level agreements, maximizes the revenue, and minimizes the infrastructure operating cost. Additionally, the proposed approach is fault-tolerant with respect to failures of the computing nodes inside the distributed deployment. Moreover, the computational overhead of the proposed approach can be managed by choosing appropriate values of the configuration parameters during deployment.

10.
11.
Due to advances in high-throughput biotechnologies, biological information is being collected in databases at an amazing rate, requiring novel computational approaches that process collected data into new knowledge in a timely manner. In this study, we propose a computational framework for discovering modular structure, relationships and regularities in complex data. The framework utilizes a semantic-preserving vocabulary to convert records of biological annotations of an object, such as an organism, gene, chemical or sequence, into networks (Anets) of the associated annotations. An association between a pair of annotations in an Anet is determined by the similarity of their co-occurrence patterns with all other annotations in the data. This feature captures associations between annotations that do not necessarily co-occur with each other and facilitates discovery of the most significant relationships in the collected data through clustering and visualization of the Anet. To demonstrate this approach, we applied the framework to the analysis of metadata from the Genomes OnLine Database and produced a biological map of sequenced prokaryotic organisms with three major clusters of metadata that represent pathogens, environmental isolates and plant symbionts.
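
The association rule described above, where two annotations are linked by the similarity of their co-occurrence patterns with all other annotations, can be illustrated with a small sketch. The cosine measure, the toy annotation labels in the usage note, and the function names are assumptions; the abstract does not name the similarity measure used for Anets.

```python
from itertools import combinations
from math import sqrt


def cooccurrence(records):
    """Count, for each ordered annotation pair, the records containing both."""
    counts = {}
    for rec in records:
        for a, b in combinations(sorted(set(rec)), 2):
            counts[(a, b)] = counts.get((a, b), 0) + 1
            counts[(b, a)] = counts.get((b, a), 0) + 1
    return counts


def profile_similarity(a, b, vocab, counts):
    """Cosine similarity of the co-occurrence profiles of a and b.

    The profiles are taken over every annotation other than a and b, so two
    annotations can be similar without ever appearing in the same record.
    """
    others = [t for t in vocab if t not in (a, b)]
    va = [counts.get((a, t), 0) for t in others]
    vb = [counts.get((b, t), 0) for t in others]
    dot = sum(x * y for x, y in zip(va, vb))
    na = sqrt(sum(x * x for x in va))
    nb = sqrt(sum(x * x for x in vb))
    if na == 0 or nb == 0:
        return 0.0
    return dot / (na * nb)
```

With toy records such as `[{"pathogen", "host-associated", "37C"}, {"symbiont", "host-associated", "37C"}, {"soil", "aerobic"}]`, the labels "pathogen" and "symbiont" never share a record, yet their profiles coincide, which is the kind of indirect association the abstract refers to.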

12.
Deregulated mTOR signaling drives the growth of various human cancers, making mTOR a major target for development of cancer chemotherapeutics. The role of mTOR in carcinogenesis is thought to be largely a consequence of its activity in the cytoplasm resulting in increased translation of pro-tumorigenic genes. However, emerging data locate mTOR in various subcellular compartments including Golgi, mitochondria, endoplasmic reticulum, and the nucleus, implying the presence of compartment-specific mTOR substrates and functions. Efforts to identify mTOR substrates in these compartments, and the mechanisms by which mTOR recruits these substrates and affects downstream cellular processes, will add to our understanding of the diversity of roles played by mTOR in carcinogenesis.

13.
Merging robust statistical methods with complex simulation models is a frontier for improving ecological inference and forecasting. However, bringing these tools together is not always straightforward. Matching data with model output, determining starting conditions, and addressing high dimensionality are some of the complexities that arise when attempting to combine ecological field data directly with mechanistic models using sophisticated statistical methods. To illustrate these complexities and pragmatic paths forward, we present an analysis using tree-ring basal area reconstructions in Denali National Park and Preserve (DNPP) to constrain successional trajectories of two spruce species (Picea mariana and Picea glauca) simulated by a forest gap model, the University of Virginia Forest Model Enhanced (UVAFME). Through this process, we provide preliminary ecological inference about the long-term competitive dynamics between slow-growing P. mariana and relatively faster-growing P. glauca. Incorporating tree-ring data into UVAFME allowed us to estimate a bias correction for stand age with improved parameter estimates. We found that higher parameter values for P. mariana minimum growth under stress and P. glauca maximum growth rate were key to improving simulations of coexistence, agreeing with recent research that faster-growing P. glauca may outcompete P. mariana under climate change scenarios. The implementation challenges we highlight are a crucial part of the conversation about how to bring models together with data to improve ecological inference and forecasting.

14.
15.
Bio-support vector machines for computational proteomics   Total citations: 2, self-citations: 0, citations by others: 2
MOTIVATION: One of the most important issues in computational proteomics is to produce a prediction model for the classification or annotation of the biological function of novel protein sequences. To improve prediction accuracy, much attention has been paid to the performance of the algorithms used, but little to the fundamental issue of amino acid encoding, which arises because most existing pattern recognition algorithms are unable to recognize amino acids in protein sequences. Importantly, the most commonly used amino acid encoding method has a flaw that leads to large computational cost and recognition bias. RESULTS: By replacing kernel functions of support vector machines (SVMs) with amino acid similarity measurement matrices, we have modified SVMs, a new type of pattern recognition algorithm for analysing protein sequences, particularly for proteolytic cleavage site prediction. We refer to the modified SVMs as bio-support vector machines. When applied to the prediction of HIV protease cleavage sites, the new method has shown a remarkable advantage in reducing model complexity and enhancing model robustness.
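
One way to read "replacing kernel functions with amino acid similarity measurement matrices" is a kernel that compares two equal-length peptides residue by residue through a similarity lookup, with no numeric encoding step. The sketch below follows that reading only. The position-wise sum, the helper names, and the similarity values in the usage note are hypothetical; they are not taken from the paper or from any published substitution matrix.

```python
def make_similarity(pairs, match=1.0, default=0.0):
    """Build a symmetric residue-similarity lookup from a dict of pair scores."""
    table = {}
    for (a, b), score in pairs.items():
        table[(a, b)] = score
        table[(b, a)] = score

    def sim(a, b):
        if a == b:
            return match  # identical residues get the full score
        return table.get((a, b), default)

    return sim


def peptide_kernel(p, q, sim):
    """Sum residue similarities position by position for two peptides."""
    if len(p) != len(q):
        raise ValueError("peptides must have equal length")
    return sum(sim(a, b) for a, b in zip(p, q))
```

For instance, with `sim = make_similarity({("L", "I"): 0.5})`, the call `peptide_kernel("ALK", "AIK", sim)` scores two exact matches and one partial match. A function built this way is not guaranteed to be positive semidefinite, so whether it can serve as a valid SVM kernel depends on the matrix chosen.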

16.
In recent years, the study of species' occurrence has benefited from the increased availability of large-scale citizen-science data. While abundance data from standardized monitoring schemes are biased toward well-studied taxa and locations, opportunistic data are available for many taxonomic groups, from a large number of locations and across long timescales. Hence, these data provide opportunities to measure species' changes in occurrence, particularly through the use of occupancy models, which account for imperfect detection. These opportunistic datasets can be very large, numbering hundreds of thousands of sites, and hence present a challenge from a computational perspective, especially within a Bayesian framework. In this paper, we develop a unifying framework for Bayesian inference in occupancy models that account for both spatial and temporal autocorrelation. We make use of the Pólya-Gamma scheme, which allows for fast inference, and incorporate spatio-temporal random effects using Gaussian processes (GPs), for which we consider two efficient approximations: subset of regressors and nearest neighbor GPs. We apply our model to data on two UK butterfly species, one common and widespread and one rare, using records from the Butterflies for the New Millennium database, producing occupancy indices spanning 45 years. Our framework can be applied to a wide range of taxa, providing measures of variation in species' occurrence, which are used to assess biodiversity change.

17.
Identifying biomarkers that are indicative of a phenotypic state is difficult because of the amount of natural variability that exists in any population. While there are many different algorithms to select biomarkers, previous investigation shows that the sensitivity and flexibility of support vector machines (SVMs) make them an attractive candidate. Here we evaluate the ability of support vector machine recursive feature elimination (SVM-RFE) to identify potential metabolic biomarkers in liquid chromatography mass spectrometry untargeted metabolite datasets. Two separate experiments are considered: a low variance (low biological noise) prokaryotic stress experiment, and a high variance (high biological noise) mammalian stress experiment. For each experiment, the phenotypic response to stress is metabolically characterized. SVM-based classification and metabolite ranking are undertaken using a systematically reduced number of biological replicates to evaluate the impact of sample size on biomarker reproducibility and robustness. Our results indicate that the highest ranked 1% of metabolites, the most predictive of the physiological state, were identified by SVM-RFE even when the number of training examples was small (≥3) and the coefficient of variation was high (>0.5). An accuracy analysis shows that filtering with recursive feature elimination measurably improves SVM classification accuracy, an effect that is pronounced when the number of training examples is small. These results indicate that SVM-RFE can be successful at biomarker identification even in challenging scenarios where the training examples are noisy and the number of biological replicates is low.
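
The elimination loop at the heart of recursive feature elimination can be sketched without an SVM library. In the sketch below, the per-feature weight is a stand-in (the difference of class means), not a trained SVM weight vector, and the names `mean_diff_weights` and `rfe_ranking` are hypothetical. Only the loop structure (fit, drop the feature with the smallest squared weight, repeat) reflects the general RFE idea.

```python
def mean_diff_weights(X, y, features):
    """Stand-in for linear SVM weights: difference of class means per feature.

    Assumes binary labels 0 and 1 with at least one sample of each class.
    """
    pos = [row for row, label in zip(X, y) if label == 1]
    neg = [row for row, label in zip(X, y) if label == 0]
    weights = {}
    for j in features:
        mean_pos = sum(row[j] for row in pos) / len(pos)
        mean_neg = sum(row[j] for row in neg) / len(neg)
        weights[j] = mean_pos - mean_neg
    return weights


def rfe_ranking(X, y, weight_fn=mean_diff_weights):
    """Rank feature indices from most to least important by elimination."""
    remaining = list(range(len(X[0])))
    eliminated = []
    while remaining:
        # Refit on the surviving features, then drop the weakest one.
        weights = weight_fn(X, y, remaining)
        worst = min(remaining, key=lambda j: weights[j] ** 2)
        remaining.remove(worst)
        eliminated.append(worst)
    return eliminated[::-1]  # last eliminated is ranked first
```

With this stand-in scorer the weights do not change between rounds, so the refit is trivial. With a real linear SVM passed in as `weight_fn`, the weights are re-estimated after each removal, which is what makes the procedure recursive.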

18.
Microarray gene expression data usually have a large number of dimensions, e.g., over ten thousand genes, and a small number of samples, e.g., a few tens of patients. In this paper, we use the support vector machine (SVM) for cancer classification with microarray data. Dimensionality reduction methods, such as principal components analysis (PCA), class-separability measure, Fisher ratio, and t-test, are used for gene selection. A voting scheme is then employed to do multi-group classification by k(k - 1) binary SVMs. We are able to obtain the same classification accuracy with far fewer features than other published results.
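
A voting scheme over binary classifiers can be sketched as below. Two points are assumptions: the sketch uses one decision rule per unordered pair of classes (the common one-vs-one layout), whereas the abstract counts k(k - 1) binary SVMs, and the decision rules are nearest-center stand-ins on a single scalar feature rather than trained SVMs. The names `nearest_center_rule` and `vote` are hypothetical.

```python
from collections import Counter
from itertools import combinations


def nearest_center_rule(a, b, centers):
    """Stand-in for a trained binary classifier that separates a from b."""
    def decide(x):
        return a if abs(x - centers[a]) <= abs(x - centers[b]) else b
    return decide


def vote(sample, classes, pairwise):
    """Return the class that wins the most pairwise decisions.

    pairwise[(a, b)] is a callable returning either a or b for a sample.
    Ties are broken in favour of the class listed first in `classes`.
    """
    tally = Counter()
    for a, b in combinations(classes, 2):
        tally[pairwise[(a, b)](sample)] += 1
    return max(classes, key=lambda c: (tally[c], -classes.index(c)))
```

Swapping the stand-in rules for real binary classifiers leaves `vote` unchanged, since it only needs each pairwise rule to name a winner.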

19.
20.

Background  

PDZ domains mediate protein-protein interactions involved in important biological processes through the recognition of short linear motifs in their target proteins. Two recent independent studies have used protein microarray or phage display technology to detect PDZ domain interactions with peptide ligands on a large scale. Several computational predictors of PDZ domain interactions have been developed; however, they are trained using only protein microarray data and focus on limited subsets of PDZ domains. An accurate predictor of genomic PDZ domain interactions would allow the proteomes of organisms to be scanned for potential binders. Such an application would require an accurate and precise predictor to avoid generating too many false positive hits, given the large number of possible interactors in a given proteome. Once validated, these predictions will help to increase the coverage of current PDZ domain interaction networks and further our understanding of the roles that PDZ domains play in a variety of biological processes.
