Similar articles
20 similar articles found (search time: 15 ms)
1.
Summary: In functional data classification, functional observations are often contaminated by various systematic effects, such as random batch effects caused by device artifacts, or fixed effects caused by sample-related factors. These effects may lead to classification bias and thus should not be neglected. Another issue of concern is the selection of functions when predictors consist of multiple functions, some of which may be redundant. The above issues arise in a real data application where we use fluorescence spectroscopy to detect cervical precancer. In this article, we propose a Bayesian hierarchical model that takes into account random batch effects and selects effective functions among multiple functional predictors. Fixed effects or predictors in nonfunctional form are also included in the model. The dimension of the functional data is reduced through orthonormal basis expansion or functional principal components. For posterior sampling, we use a hybrid Metropolis–Hastings/Gibbs sampler, which suffers from slow mixing. An evolutionary Monte Carlo algorithm is applied to improve the mixing. Simulation and real data application show that the proposed model provides accurate selection of functional predictors as well as good classification.

2.
Jiang X, Gold D, Kolaczyk ED. Biometrics 2011, 67(3): 958-966
Predicting the functional roles of proteins based on various genome-wide data, such as protein-protein association networks, has become a canonical problem in computational biology. Approaching this task as a binary classification problem, we develop a network-based extension of the spatial auto-probit model. In particular, we develop a hierarchical Bayesian probit-based framework for modeling binary network-indexed processes, with a latent multivariate conditional autoregressive Gaussian process. The latter allows for the easy incorporation of protein-protein association network topologies, either binary or weighted, in modeling protein functional similarity. We use this framework to predict protein functions, for functions defined as terms in the Gene Ontology (GO) database, a popular rigorous vocabulary for biological functionality. Furthermore, we show how a natural extension of this framework can be used to model and correct for the high percentage of false negative labels in training data derived from GO, a serious shortcoming endemic to biological databases of this type. Our method's performance is evaluated and compared with standard algorithms on weighted yeast protein-protein association networks, extracted from a recently developed integrative database called Search Tool for the Retrieval of INteracting Genes/proteins (STRING). Results show that our basic method is competitive with these other methods, and that the extended method, which incorporates the uncertainty in negative labels among the training data, can yield nontrivial improvements in predictive accuracy.

3.
Predictive species distribution models (SDMs) are becoming increasingly important in ecology, in the light of rapid environmental change. However, the predictions of most current SDMs are specific to the habitat composition of the environments in which they were fitted. This may limit SDM predictive power because species may respond differently to a given habitat depending on the availability of all habitats in their environment, a phenomenon known as a functional response in resource selection. The Generalised Functional Response (GFR) framework captures this dependence by formulating the SDM coefficients as functions of habitat availability. The original GFR implementation used global polynomial functions of habitat availability to describe the functional responses. In this study, we develop several refinements of this approach and compare their predictive performance using two simulated and two real datasets. We first use local radial basis functions (RBF), a more flexible approach than global polynomials, to represent the habitat selection coefficients, and balance bias with precision via regularization to prevent overfitting. Second, we use the RBF-GFR and GFR models in combination with classification and regression trees (CART), which have more flexibility and better predictive power for non-linear modelling. As further extensions, we use random forests (RFs) and extreme gradient boosting (XGBoost), ensemble approaches that consistently lead to variance reduction in generalization error. We find that the different methods are ranked consistently across the datasets for out-of-data prediction. The traditional stationary approach to SDMs and the GFR model consistently perform at the bottom of the ranking (simple SDMs underfit, and polynomial GFRs overfit the data). The best methods in our list provide non-negligible improvements in predictive performance, in some cases taking the out-of-sample R² from 0.3 up to 0.7 across datasets.
In times of rapid environmental change and spatial non-stationarity, ignoring the effects of functional responses on SDMs results in two different types of prediction bias (under-prediction or mis-positioning of distribution hotspots). However, not all functional response models perform equally well. The more volatile polynomial GFR models can generate biases through over-prediction. Our results indicate that there are consistently robust GFR approaches that achieve impressive gains in transferability across very different datasets.
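As a rough illustration of the RBF idea in the abstract above, the sketch below writes a single habitat selection coefficient as a weighted sum of local Gaussian basis functions of habitat availability. The function name `rbf_coefficient`, the Gaussian form of the basis, and the centers, weights and width are all assumptions made for illustration; the abstract does not specify them, and in practice the weights would be estimated with regularization rather than set by hand.

```python
import math
from typing import Sequence


def rbf_coefficient(availability: float,
                    centers: Sequence[float],
                    weights: Sequence[float],
                    width: float) -> float:
    """Selection coefficient as a weighted sum of Gaussian bumps
    centred at fixed values of habitat availability (hypothetical form)."""
    return sum(
        w * math.exp(-((availability - c) ** 2) / (2.0 * width ** 2))
        for c, w in zip(centers, weights)
    )


# Hypothetical basis: preference when the habitat is scarce,
# avoidance when it is abundant.
centers = [0.2, 0.5, 0.8]
weights = [1.5, 0.0, -1.0]
beta_scarce = rbf_coefficient(0.2, centers, weights, width=0.1)
beta_abundant = rbf_coefficient(0.8, centers, weights, width=0.1)
```

Because each basis function is local, changing one weight only alters the coefficient near that center, which is the flexibility the abstract contrasts with global polynomials.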

4.
Summary: In studies involving functional data, it is commonly of interest to model the impact of predictors on the distribution of the curves, allowing flexible effects on not only the mean curve but also the distribution about the mean. Characterizing the curve for each subject as a linear combination of a high-dimensional set of potential basis functions, we place a sparse latent factor regression model on the basis coefficients. We induce basis selection by choosing a shrinkage prior that allows many of the loadings to be close to zero. The number of latent factors is treated as unknown through a highly efficient, adaptive-blocked Gibbs sampler. Predictors are included at the latent variable level, while allowing different predictors to impact different latent factors. This model induces a framework for functional response regression in which the distribution of the curves is allowed to change flexibly with predictors. The performance is assessed through simulation studies and the methods are applied to data on blood pressure trajectories during pregnancy.

5.
6.
Regional association analysis is one of the most powerful tools for gene mapping because, instead of analyzing individual variants, it simultaneously considers all variants in the region. Recent development of models for regional association analysis involves the functional data analysis approach. In the framework of this approach, genotypes of variants within a region, as well as their effects, are described by continuous functions. Such an approach allows us to use information about both linkage and linkage disequilibrium and to reduce the influence of noise and/or observation errors. Here we define a functional linear mixed model to test association on independent and structured samples. We demonstrate how to test fixed and random effects of a set of genetic variants in the region on a quantitative trait. Estimation of the statistical properties of the new methods shows that type I errors are in accordance with declared values and that power is high, especially for models with fixed effects of genotypes. We expect that the new functional linear regression models will facilitate identification of rare genetic variants controlling complex human and animal traits. The new methods are implemented in the software FREGAT, which is available for free download at http://mga.bionet.nsc.ru/soft/FREGAT/.

7.
Summary: In this article, we apply the recently developed Bayesian wavelet-based functional mixed model methodology to analyze MALDI-TOF mass spectrometry proteomic data. By modeling mass spectra as functions, this approach avoids reliance on peak detection methods. The flexibility of this framework in modeling nonparametric fixed and random effect functions enables it to model the effects of multiple factors simultaneously, allowing one to perform inference on multiple factors of interest using the same model fit, while adjusting for clinical or experimental covariates that may affect both the intensities and locations of peaks in the spectra. For example, this provides a straightforward way to account for systematic block and batch effects that characterize these data. From the model output, we identify spectral regions that are differentially expressed across experimental conditions, in a way that takes both statistical and clinical significance into account and controls the Bayesian false discovery rate to a prespecified level. We apply this method to two cancer studies.
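One common way to control a Bayesian false discovery rate, sketched here under assumptions, is to rank locations by their posterior probability of being differentially expressed and flag the largest set whose average posterior error probability stays at or below the target level. The function name `flag_with_bayesian_fdr` and the input probabilities are hypothetical, and this rule is not necessarily the exact procedure used in the article above.

```python
from typing import List, Sequence


def flag_with_bayesian_fdr(post_prob: Sequence[float], alpha: float) -> List[int]:
    """Return indices flagged so that the mean of (1 - posterior probability)
    over the flagged set does not exceed alpha."""
    # Visit locations from most to least probable.
    order = sorted(range(len(post_prob)), key=lambda i: post_prob[i], reverse=True)
    flagged: List[int] = []
    error_sum = 0.0
    for i in order:
        error_sum += 1.0 - post_prob[i]
        # The running mean is non-decreasing, so stop at the first violation.
        if error_sum / (len(flagged) + 1) > alpha:
            break
        flagged.append(i)
    return sorted(flagged)


# Hypothetical posterior probabilities for five spectral locations.
flagged = flag_with_bayesian_fdr([0.99, 0.95, 0.90, 0.50, 0.10], alpha=0.10)
```

With these made-up inputs the first three locations are flagged; adding the fourth would raise the expected proportion of false discoveries above 0.10.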

8.
Generalized linear models are a widely used method to obtain parametric estimates for the mean function. They have been further extended to allow the relationship between the mean function and the covariates to be more flexible via generalized additive models. However, the fixed variance structure can in many cases be too restrictive. The extended quasilikelihood (EQL) framework allows for estimation of both the mean and the dispersion/variance as functions of covariates. As for other maximum likelihood methods though, EQL estimates are not resistant to outliers: we need methods to obtain robust estimates for both the mean and the dispersion function. In this article, we obtain functional estimates for the mean and the dispersion that are both robust and smooth. The performance of the proposed method is illustrated via a simulation study and some real data examples.

9.
MOTIVATION: Assigning functions for unknown genes based on diverse large-scale data is a key task in functional genomics. Previous work on gene function prediction has addressed this problem using independent classifiers for each function. However, such an approach ignores the structure of functional class taxonomies, such as the Gene Ontology (GO). Over a hierarchy of functional classes, a group of independent classifiers where each one predicts gene membership to a particular class can produce a hierarchically inconsistent set of predictions, where for a given gene a specific class may be predicted positive while its inclusive parent class is predicted negative. Taking the hierarchical structure into account resolves such inconsistencies and provides an opportunity for leveraging all classifiers in the hierarchy to achieve higher specificity of predictions. RESULTS: We developed a Bayesian framework for combining multiple classifiers based on the functional taxonomy constraints. Using a hierarchy of support vector machine (SVM) classifiers trained on multiple data types, we combined predictions in our Bayesian framework to obtain the most probable consistent set of predictions. Experiments show that over a 105-node subhierarchy of the GO, our Bayesian framework improves predictions for 93 nodes. As an additional benefit, our method also provides implicit calibration of SVM margin outputs to probabilities. Using this method, we make function predictions for multiple proteins, and experimentally confirm predictions for proteins involved in mitosis. SUPPLEMENTARY INFORMATION: Results for the 105 selected GO classes and predictions for 1059 unknown genes are available at: http://function.princeton.edu/genesite/ CONTACT: ogt@cs.princeton.edu.
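The hierarchical inconsistency described above (a class predicted positive while its parent is predicted negative) can be made concrete with a small check. This is only a sketch of the consistency constraint, not the Bayesian combination method of the article; the function name `inconsistent_nodes`, the single-parent tree and the class names are hypothetical simplifications (GO itself is a directed acyclic graph in which a term may have several parents).

```python
from typing import Dict, List, Optional


def inconsistent_nodes(parent: Dict[str, Optional[str]],
                       predicted: Dict[str, bool]) -> List[str]:
    """Return classes predicted positive whose parent class is predicted negative."""
    bad = []
    for node, is_positive in predicted.items():
        p = parent.get(node)
        if is_positive and p is not None and not predicted.get(p, False):
            bad.append(node)
    return sorted(bad)


# Hypothetical three-class hierarchy: root -> cell_cycle -> mitosis.
parent = {"root": None, "cell_cycle": "root", "mitosis": "cell_cycle"}
# Independent classifiers say "mitosis" but not "cell_cycle".
predicted = {"root": True, "cell_cycle": False, "mitosis": True}
violations = inconsistent_nodes(parent, predicted)
```

Here `violations` contains only `mitosis`, the kind of prediction a hierarchy-aware method would have to revise in one direction or the other.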

10.
MOTIVATION: Protein families evolve a multiplicity of functions through gene duplication, speciation and other processes. As a number of studies have shown, standard methods of protein function prediction produce systematic errors on these data. Phylogenomic analysis (combining phylogenetic tree construction, integration of experimental data and differentiation of orthologs and paralogs) has been proposed to address these errors and improve the accuracy of functional classification. The explicit integration of structure prediction and analysis in this framework, which we call structural phylogenomics, provides additional insights into protein superfamily evolution. RESULTS: Results of protein functional classification using phylogenomic analysis show fewer expected false positives overall than when pairwise methods of functional classification are employed. We present an overview of the motivations and fundamental principles of phylogenomic analysis, new methods developed for the key tasks, and benchmark datasets for these tasks (when available), and we suggest procedures to increase accuracy. We also discuss some of the methods used in the Celera Genomics high-throughput phylogenomic classification of the human genome. AVAILABILITY: Software tools from the Berkeley Phylogenomics Group are available at http://phylogenomics.berkeley.edu

11.
The primary objective of this paper is to provide a guide on implementing Bayesian generalized kernel regression methods for genomic prediction in the statistical software R. Such methods are quite efficient for capturing complex non-linear patterns that conventional linear regression models cannot. Furthermore, these methods are also powerful for leveraging environmental covariates, such as genotype × environment (G×E) prediction, among others. In this study we provide the building process of seven kernel methods: linear, polynomial, sigmoid, Gaussian, Exponential, Arc-cosine 1 and Arc-cosine L. Additionally, we highlight illustrative examples for implementing exact kernel methods for genomic prediction under a single-environment, a multi-environment and multi-trait framework, as well as for the implementation of sparse kernel methods under a multi-environment framework. These examples are followed by a discussion on the strengths and limitations of kernel methods and, subsequently, by conclusions about the main contributions of this paper. Subject terms: Genomics, Plant sciences
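To give a sense of what building a kernel involves, the sketch below computes three of the kernels named above (linear, polynomial and Gaussian) and assembles a Gram matrix from marker vectors. The paper works in R; Python is used here only for illustration, and the function names, default hyperparameters and toy marker codes are assumptions rather than the paper's own parameterizations.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def linear_kernel(x: Vector, z: Vector) -> float:
    """Inner product of two marker vectors."""
    return sum(a * b for a, b in zip(x, z))


def polynomial_kernel(x: Vector, z: Vector, degree: int = 2,
                      gamma: float = 1.0, coef0: float = 1.0) -> float:
    """(gamma * <x, z> + coef0) ** degree; one common parameterization."""
    return (gamma * linear_kernel(x, z) + coef0) ** degree


def gaussian_kernel(x: Vector, z: Vector, gamma: float = 1.0) -> float:
    """exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)


def gram_matrix(X: Sequence[Vector],
                kernel: Callable[[Vector, Vector], float]) -> List[List[float]]:
    """Pairwise kernel values between all rows of X."""
    return [[kernel(xi, xj) for xj in X] for xi in X]


# Hypothetical marker codes for three lines at two loci.
X = [[0.0, 1.0], [1.0, 2.0], [2.0, 0.0]]
K = gram_matrix(X, gaussian_kernel)
```

The resulting matrix `K` plays the role that a genomic relationship matrix plays in a linear model; swapping the kernel function changes which similarity structure the regression sees.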

12.
A comparison of neural network methods and Bayesian statistical methods is presented for prediction of the secondary structure of proteins given their primary sequence. The Bayesian method makes the unphysical assumption that the probability of an amino acid occurring in each position in the protein is independent of the amino acids occurring elsewhere. However, we find the predictive accuracy of the Bayesian method to be only minimally less than the accuracy of the most sophisticated methods used to date. We present the relationship of neural network methods to Bayesian statistical methods and show that, in principle, neural methods offer considerable power, although apparently they are not particularly useful for this problem. In the process, we derive a neural formalism in which the output neurons directly represent the conditional probabilities of structure class. The probabilistic formalism allows introduction of a new objective function, the mutual information, which translates the notion of correlation as a measure of predictive accuracy into a useful training measure. Although a similar accuracy to other approaches (utilizing a mean-square error) is achieved using this new measure, the accuracy on the training set is significantly and tantalizingly higher, even though the number of adjustable parameters remains the same. The mutual information measure predicts a greater fraction of helix and sheet structures correctly than the mean-square error measure, at the expense of coil accuracy, precisely as it was designed to do. By combining the two objective functions, we obtain a marginally improved accuracy of 64.4%, with Matthews coefficients C_α, C_β and C_coil of 0.40, 0.32 and 0.42, respectively.
However, since all methods to date perform only slightly better than the Bayes algorithm, which entails the drastic assumption of independence of amino acids, one is forced to conclude that little progress has been made on this problem, despite the application of a variety of sophisticated algorithms such as neural networks, and that further advances will require a better understanding of the relevant biophysics.
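The Matthews coefficients quoted above are per-class correlation coefficients between predicted and observed states. A minimal sketch of the standard one-class-versus-rest formula follows; the function name `matthews_coefficient`, the zero-denominator convention and the toy sequences are assumptions for illustration and are unrelated to the study's data.

```python
import math
from typing import Sequence


def matthews_coefficient(true: Sequence[str], pred: Sequence[str], cls: str) -> float:
    """Matthews correlation for one structure class against all others."""
    tp = tn = fp = fn = 0
    for t, p in zip(true, pred):
        if t == cls and p == cls:
            tp += 1
        elif t != cls and p != cls:
            tn += 1
        elif p == cls:
            fp += 1
        else:
            fn += 1
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention assumed here: report 0.0 when the coefficient is undefined.
    return (tp * tn - fp * fn) / denom if denom > 0 else 0.0


# Toy residue-level states: H = helix, E = sheet, C = coil.
true_states = "HHEECC"
pred_states = "HHECCC"
c_alpha = matthews_coefficient(true_states, pred_states, "H")
c_beta = matthews_coefficient(true_states, pred_states, "E")
```

Unlike raw accuracy, the coefficient is penalized for both missed and spurious assignments of a class, which is why it is reported separately for helix, sheet and coil.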

13.
14.
MOTIVATION: A global view of the protein space is essential for functional and evolutionary analysis of proteins. In order to achieve this, a similarity network can be built using pairwise relationships among proteins. However, existing similarity networks employ a single similarity measure and therefore their utility depends highly on the quality of the selected measure. A more robust representation of the protein space can be realized if multiple sources of information are used. RESULTS: We propose a novel approach for analyzing multi-attribute similarity networks by combining random walks on graphs with Bayesian theory. A multi-attribute network is created by combining sequence and structure based similarity measures. For each attribute of the similarity network, one can compute a measure of affinity from a given protein to every other protein in the network using random walks. This process makes use of the implicit clustering information of the similarity network, and we show that it is superior to naive, local ranking methods. We then combine the computed affinities using a Bayesian framework. In particular, when we train a Bayesian model for automated classification of a novel protein, we achieve high classification accuracy and outperform single attribute networks. In addition, we demonstrate the effectiveness of our technique by comparison with a competing kernel-based information integration approach.
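The affinity computation described above can be sketched with a random walk with restart on a weighted similarity graph. The abstract only says "random walks", so the restart variant, the function name `random_walk_affinity`, the restart probability and the toy weights are all assumptions made for illustration. The sketch also assumes every node has at least one positive-weight edge.

```python
from typing import List, Sequence


def random_walk_affinity(weights: Sequence[Sequence[float]], start: int,
                         restart: float = 0.3, iterations: int = 100) -> List[float]:
    """Visiting probabilities of a walker that follows edges in proportion
    to their weight and jumps back to `start` with probability `restart`."""
    n = len(weights)
    # Row-normalize similarities into transition probabilities.
    trans = []
    for row in weights:
        total = sum(row)
        trans.append([w / total if total > 0 else 0.0 for w in row])
    p = [0.0] * n
    p[start] = 1.0
    for _ in range(iterations):
        nxt = [0.0] * n
        for i in range(n):
            for j in range(n):
                nxt[j] += (1.0 - restart) * p[i] * trans[i][j]
        nxt[start] += restart
        p = nxt
    return p


# Toy network: proteins 0-1 and 2-3 are tightly linked, 1-2 only weakly.
W = [
    [0.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 0.1, 0.0],
    [0.0, 0.1, 0.0, 1.0],
    [0.0, 0.0, 1.0, 0.0],
]
affinity = random_walk_affinity(W, start=0)
```

Starting from protein 0, the walker spends far more time on protein 1 than on proteins 2 and 3, so the affinities reflect cluster structure rather than only direct pairwise similarity.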

15.

Background

The goal of personalized medicine is to provide patients optimal drug screening and treatment based on individual genomic or proteomic profiles. Reverse-Phase Protein Array (RPPA) technology offers proteomic information of cancer patients which may be directly related to drug sensitivity. For cancer patients with different drug sensitivity, the proteomic profiling reveals important pathophysiologic information which can be used to predict chemotherapy responses.

Results

The goal of this paper is to present a framework for personalized medicine using both RPPA and drug sensitivity (drug resistance or intolerance). In the proposed personalized medicine system, the prediction of drug sensitivity is obtained by a proposed augmented naive Bayesian classifier (ANBC), in which edges between attributes are added to the network structure of the naive Bayesian classifier. For discriminative structure learning of ANBC, local classification rate (LCR) is used to score augmented edges, and a greedy search algorithm is used to find the discriminative structure that maximizes classification rate (CR). Once a classifier is trained by RPPA and drug sensitivity using cancer patient samples, the classifier is able to predict the drug sensitivity given RPPA information from a patient.

Conclusion

In this paper we proposed a framework for personalized medicine where a patient is profiled by RPPA and drug sensitivity is predicted by ANBC and LCR. Experimental results with lung cancer data demonstrate that RPPA can be used to profile patients for drug sensitivity prediction by a Bayesian network classifier, and that the proposed ANBC for personalized cancer medicine achieves, on average, better prediction accuracy than the naive Bayes classifier on small-sample data and outperforms other state-of-the-art classifiers in terms of classification accuracy.

16.
17.
An inexpensive, noninvasive system that could accurately classify flying insects would have important implications for entomological research, and allow for the development of many useful applications in vector and pest control for both medical and agricultural entomology. Given this, the last sixty years have seen many research efforts devoted to this task. To date, however, none of this research has had a lasting impact. In this work, we show that pseudo-acoustic optical sensors can produce superior data; that additional features, both intrinsic and extrinsic to the insect's flight behavior, can be exploited to improve insect classification; that a Bayesian classification approach allows efficient learning of classification models that are very robust to over-fitting; and that a general classification framework makes it easy to incorporate an arbitrary number of features. We demonstrate the findings with large-scale experiments that dwarf all previous works combined, as measured by the number of insects and the number of species considered.

18.
19.
This article presents a framework to evaluate emerging systems in life cycle assessment (LCA). Current LCA methods are effective for established systems; however, lack of data often inhibits robust analysis of future products or processes that may benefit the most from life cycle information. In many cases the life cycle inventory (LCI) of a system can change depending on its development pathway. Modeling emerging systems allows insights into probable trends and a greater understanding of the effect of future scenarios on LCA results. The proposed framework uses Bayesian probabilities to model technology adoption. The method presents a unique approach to modeling system evolution and can be used independently or within the context of an agent-based model (ABM). LCA can be made more robust and dynamic by using this framework to couple scenario modeling with life cycle data, analyzing the effect of decision-making patterns over time. Potential uses include examining the changing urban metabolism of growing cities, understanding the development of renewable energy technologies, identifying transformations in material flows over space and time, and forecasting industrial networks for developing products. A switchgrass-to-energy case demonstrates the approach.

20.
This paper presents a new statistical technique, Bayesian Generalized Associative Functional Networks (GAFN), to model the dynamic plant growth process of greenhouse crops. GAFNs are able to incorporate domain knowledge and data to model complex ecosystems. By use of functional networks and the Bayesian framework, prior knowledge can be naturally embedded into the model, and the functional relationship between inputs and outputs can be learned during the training process. Our main interest is focused on GAFNs, which are appropriate for modeling multiple-variable processes. Three main advantages are obtained by applying Bayesian GAFN methods to modeling the dynamic process of plant growth. Firstly, this approach provides a powerful tool for revealing useful relationships between the greenhouse environmental factors and the plant growth parameters. Secondly, Bayesian GAFN can model Multiple-Input Multiple-Output (MIMO) systems from the given data and shows good generalization capability, with a single final model successfully fitting all 12 data sets from 5 years of field experiments. Thirdly, the Bayesian GAFN method can also serve as an optimization tool to estimate parameters of interest in the agro-ecosystem. In this work, two algorithms are proposed for the statistical inference of parameters in GAFNs. Both are based on variational inference, also called variational Bayes (VB), which can provide probabilistic interpretations for the built models. VB-based learning methods are able to yield estimates of the full posterior distribution of the model parameters. Synthetic and real-world examples are implemented to confirm the validity of the proposed methods.
