Similar Articles
20 similar articles found
1.
Yi G, Shi JQ, Choi T. Biometrics. 2011;67(4):1285-1294.
A model based on a Gaussian process (GP) prior with a kernel covariance function can be used to fit nonlinear data with multidimensional covariates. It has served as a flexible nonparametric approach to curve fitting, classification, clustering, and other statistical problems, and has been widely applied to complex nonlinear systems, particularly in machine learning. However, the model becomes problematic for large-scale, high-dimensional data, for example the meat data discussed in this article, which comprise 100 highly correlated covariates. For such data it suffers from large variance in parameter estimation, high predictive error, and numerically unstable computation. In this article, a penalized likelihood framework is applied to GP-based models. Different penalties are investigated, and their suitability to the characteristics of GP models is discussed. Asymptotic properties are also established, with the relevant proofs. Several applications to real biomechanical and bioinformatics data sets are reported.
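To make the idea concrete, here is a minimal Python sketch of a penalized GP marginal likelihood; the ARD squared-exponential kernel and the L1 penalty on the per-covariate weights are illustrative assumptions, not necessarily the authors' exact formulation.

```python
import numpy as np

def ard_kernel(X, w, sigma_f):
    """Squared-exponential kernel with one inverse length-scale per covariate."""
    D = (X[:, None, :] - X[None, :, :]) ** 2        # pairwise squared diffs, (n, n, p)
    return sigma_f**2 * np.exp(-0.5 * (D * w**2).sum(axis=2))

def penalized_nll(params, X, y, lam):
    """Negative log marginal likelihood plus an L1 penalty on the kernel weights."""
    n, p = X.shape
    w, sigma_f, sigma_n = params[:p], params[p], params[p + 1]
    K = ard_kernel(X, w, sigma_f) + sigma_n**2 * np.eye(n)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    nll = 0.5 * y @ alpha + np.log(np.diag(L)).sum() + 0.5 * n * np.log(2 * np.pi)
    return nll + lam * np.abs(w).sum()              # LASSO-type shrinkage on w
```

Minimizing this objective (e.g., with scipy.optimize.minimize) shrinks the weights of irrelevant or highly redundant covariates toward zero, which is the kind of regularization a penalized GP framework targets.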

2.
In quantitative proteomics work, the differences in expression of many separate proteins are routinely examined to test for significant differences between treatments. This leads to the multiple hypothesis testing problem: when many separate tests are performed, many will be significant by chance, yielding false positive results. Statistical methods that deal with this problem, such as the false discovery rate method, have been available for more than a decade. However, a survey of proteomics journals shows that such tests are not widely implemented in one commonly used technique: quantitative proteomics using two-dimensional electrophoresis. We outline a selection of multiple hypothesis testing methods, some well known and some less so, and present a simple strategy for their use by the experimental scientist in quantitative proteomics work generally. The strategy focuses on the desirability of using several different methods simultaneously, with the choice and emphasis dependent on research priorities and the results in hand. This approach is demonstrated using case scenarios with experimental and simulated model data.
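As one concrete instance of the methods surveyed, here is a minimal sketch of the Benjamini-Hochberg step-up procedure (the function name is ours); the strategy described above would run several such procedures side by side.

```python
import numpy as np

def benjamini_hochberg(pvals, q=0.05):
    """Return a boolean mask of rejected hypotheses at FDR level q."""
    p = np.asarray(pvals)
    m = p.size
    order = np.argsort(p)
    thresholds = q * np.arange(1, m + 1) / m        # BH critical values i*q/m
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])            # largest i with p_(i) <= i*q/m
        reject[order[: k + 1]] = True
    return reject
```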

3.
Auxiliary covariate data are often collected in biomedical studies when the primary exposure variable is assessed on only a subset of the study subjects. In this study, we investigate a semiparametric estimated-likelihood method for generalized linear mixed models (GLMMs) in the presence of a continuous auxiliary variable, using a kernel smoother to handle the continuous auxiliary data. The method can deal with missing or mismeasured covariate data in a variety of applications when an auxiliary variable is available and cluster sizes are not too small. Simulation results show that the proposed method performs better than an approach that ignores the random effects in the GLMM and one that uses only the validation data set. We illustrate the proposed method with a real data set from a recent environmental epidemiology study on maternal serum 1,1-dichloro-2,2-bis(p-chlorophenyl) ethylene levels in relation to preterm births.
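The kernel smoother at the heart of such an approach can be illustrated with a minimal Nadaraya-Watson sketch (a standard building block; the paper's full estimated-likelihood machinery for the GLMM is not reproduced here).

```python
import numpy as np

def nw_smooth(x0, x, y, h):
    """Nadaraya-Watson estimate of E[y | x = x0] with a Gaussian kernel of bandwidth h."""
    w = np.exp(-0.5 * ((x - x0[:, None]) / h) ** 2)   # kernel weights, one row per query point
    return (w * y).sum(axis=1) / w.sum(axis=1)
```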

4.
5.
It has been known since the days when relatively few structures had been solved that longer protein chains often contain multiple domains, which may fold separately and play the role of reusable functional modules found in many contexts. In many structural biology tasks, in particular structure prediction, it is of great use to identify domains within the structure and analyze these regions separately. However, when using sequence data alone this task has proven exceptionally difficult, with relatively little improvement over the naive method of choosing boundaries based on the size distributions of observed domains. The recent significant improvement in contact prediction provides a new source of information for domain prediction. We test several methods for using this information, including a kernel smoothing-based approach and methods based on building alpha-carbon models, and compare performance with a length-based predictor, a homology search method, and four published sequence-based predictors: DOMCUT, DomPRO, DLP-SVM, and SCOOBY-DOmain. We show that the kernel-smoothing method is significantly better than the other ab initio predictors when both single-domain and multidomain targets are considered, and is not significantly different from the homology-based method. Considering only multidomain targets, the kernel-smoothing method outperforms all of the published methods except DLP-SVM. The kernel-smoothing method therefore represents a potentially useful improvement to ab initio domain prediction.
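In the same spirit, a minimal sketch of a kernel-smoothing step for boundary placement: a per-residue signal derived from predicted contacts is smoothed with a Gaussian kernel, and an interior minimum is taken as a candidate boundary. The signal definition and the margin parameter are our assumptions, not the paper's exact procedure.

```python
import numpy as np

def smooth_profile(signal, h):
    """Gaussian-kernel smoothing of a per-residue signal."""
    idx = np.arange(signal.size)
    w = np.exp(-0.5 * ((idx[:, None] - idx[None, :]) / h) ** 2)
    return (w * signal).sum(axis=1) / w.sum(axis=1)

def candidate_boundary(contact_counts, h=5.0, margin=30):
    """Pick the interior minimum of the smoothed profile as a boundary guess.

    Assumes the chain is longer than 2*margin residues.
    """
    s = smooth_profile(np.asarray(contact_counts, dtype=float), h)
    return margin + int(np.argmin(s[margin:-margin]))
```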

6.
A mathematical model for the spatial computations performed by simple cells in the mammalian visual cortex is derived. The construction uses as organizing principles the experimentally observed simple cell linearity and rotational symmetry breaking, together with the constraint that simple cell inputs must effectively be ganglion cell outputs. This leads to a closed-form expression for the simple cell kernel in terms of Jacobi theta functions. Using a theta-function identity, it is also shown how Gabor sampling arises as an approximation to this exact kernel for most cells. In addition, the model provides a natural mechanism for introducing the type of nonlinearity observed in some simple cells. The cell's responses to a variety of visual stimuli are calculated using the exact kernel and compared to single-cell recordings. In all cases, the model's predictions are in agreement with available experimental data.
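For reference, a minimal sketch of the standard 2-D Gabor form that the exact theta-function kernel is shown to approximate; the parameterization here is ours.

```python
import numpy as np

def gabor(x, y, sigma_x, sigma_y, freq, phase):
    """2-D Gabor function: a Gaussian envelope times a sinusoidal carrier along x."""
    envelope = np.exp(-0.5 * ((x / sigma_x) ** 2 + (y / sigma_y) ** 2))
    return envelope * np.cos(2 * np.pi * freq * x + phase)
```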

7.
Analysis of heterosis by a direct method using the concept of heritability
Wu ZX, Li MD. Genetica. 2002;114(2):163-170.
The presence of heterosis has been observed in many species at both the phenotypic and gene levels; strangely, its genetic basis was and still is largely unknown. In this study, we extend and simplify some formulas that we reported previously. The foundation of our model is the partitioning of the F1 phenotypic variance of the cross between two pure lines into additive, dominance, and epistasis components, which leads to the estimation of effective factors, cross-heritability in the broad and narrow sense, and heterotic power. In the model, we assume that all polygenes controlling a quantitative trait have an equal genetic effect and are independent of each other. By extending heritability to a cross population, new features appear: cross-heritability acquires the status of a new genetic parameter that suffices to deal with the problem of crossbreeding and clarifies the picture of heterosis. Lastly, an example of applying the proposed method to crossing data from Drosophila melanogaster is given to illustrate its use.

8.
Batch marking is common and useful for many capture–recapture studies where individual marks cannot be applied due to various constraints such as timing, cost, or marking difficulty. When batch marks are used, observed data are not individual capture histories but a set of counts including the numbers of individuals first marked, marked individuals that are recaptured, and individuals captured but released without being marked (applicable to some studies) on each capture occasion. Fitting traditional capture–recapture models to such data requires one to identify all possible sets of capture–recapture histories that may lead to the observed data, which is computationally infeasible even for a small number of capture occasions. In this paper, we propose a latent multinomial model to deal with such data, where the observed vector of counts is a non-invertible linear transformation of a latent vector that follows a multinomial distribution depending on model parameters. The latent multinomial model can be fitted efficiently through a saddlepoint approximation based maximum likelihood approach. The model framework is very flexible and can be applied to data collected with different study designs. Simulation studies indicate that reliable estimation results are obtained for all parameters of the proposed model. We apply the model to analysis of golden mantella data collected using batch marks in Central Madagascar.

9.
Population-Based Reversible Jump Markov Chain Monte Carlo
We present an extension of population-based Markov chain Monte Carlo to the transdimensional case. A major challenge is that of simulating from high- and transdimensional target measures. In such cases, Markov chain Monte Carlo methods may not adequately traverse the support of the target; the simulation results will be unreliable. We develop population methods to deal with such problems, and give a result proving the uniform ergodicity of these population algorithms under mild assumptions. This result is used to demonstrate the superiority, in terms of convergence rate, of a population transition kernel over a reversible jump sampler for a Bayesian variable selection problem. We also give an example of a population algorithm for a Bayesian multivariate mixture model with an unknown number of components. This is applied to gene expression data of 1000 data points in six dimensions, and it is demonstrated that our algorithm outperforms some competing Markov chain samplers. In this example, we show how to combine the methods of parallel chains (Geyer, 1991), tempering (Geyer & Thompson, 1995), snooker algorithms (Gilks et al., 1994), constrained sampling, and delayed rejection (Green & Mira, 2001).
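A minimal sketch of the population idea on a toy fixed-dimension target: several chains run at different temperatures and occasionally swap states, which is what helps traversal of multimodal supports. The transdimensional reversible jump moves of the paper are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(1)

def log_target(x):
    """Toy bimodal target; the paper's targets are transdimensional."""
    return np.logaddexp(-0.5 * (x + 3) ** 2, -0.5 * (x - 3) ** 2)

temps = np.array([1.0, 0.5, 0.25, 0.1])            # one chain per temperature
x = rng.normal(size=temps.size)
for _ in range(10000):
    for i, t in enumerate(temps):                  # within-chain random-walk updates
        prop = x[i] + rng.normal(scale=1.0)
        if np.log(rng.uniform()) < t * (log_target(prop) - log_target(x[i])):
            x[i] = prop
    i = rng.integers(temps.size - 1)               # propose swapping adjacent chains
    dlog = (temps[i] - temps[i + 1]) * (log_target(x[i + 1]) - log_target(x[i]))
    if np.log(rng.uniform()) < dlog:
        x[i], x[i + 1] = x[i + 1], x[i]
# samples from the t = 1.0 chain (index 0) approximate the target
```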

10.
We consider the problem of estimating the marginal mean of an incompletely observed variable and develop a multiple imputation approach. Using fully observed predictors, we first establish two working models: one predicts the missing outcome variable, and the other predicts the probability of missingness. The predictive scores from the two models are used to measure the similarity between the incomplete and observed cases. Based on the predictive scores, we construct a set of kernel weights for the observed cases, with higher weights indicating more similarity. Missing data are imputed by sampling from the observed cases with probability proportional to their kernel weights. The proposed approach can produce reasonable estimates for the marginal mean and has a double robustness property, provided that one of the two working models is correctly specified. It also shows some robustness against misspecification of both models. We demonstrate these patterns in a simulation study. In a real-data example, we analyze the total helicopter response time from injury in the Arizona emergency medical service data.
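A minimal sketch of the sampling step, assuming the predictive scores from the two working models have already been computed; all names here are ours.

```python
import numpy as np

rng = np.random.default_rng(0)

def impute_one(score_missing, scores_obs, y_obs, h):
    """Draw one donor value for an incomplete case with predictive score `score_missing`.

    Observed cases closer in predictive score get larger Gaussian kernel
    weights, and the donor is sampled with probability proportional to weight.
    """
    w = np.exp(-0.5 * ((scores_obs - score_missing) / h) ** 2)
    return rng.choice(y_obs, p=w / w.sum())
```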

11.
The kernel density estimator is commonly used for estimating animal utilization distributions from location data. This technique requires estimation of a bandwidth, for which ecologists often use least-squares cross-validation (LSCV). However, LSCV has large variance and a tendency to under-smooth data, and it fails to generate a bandwidth estimate in some situations. We compared the performance of 2 new bandwidth estimators (root-n) versus that of LSCV using simulated data and location data from sharp-shinned hawks (Accipiter striatus) and red wolves (Canis rufus). With simulated data containing no repeat locations, LSCV often produced a better fit between estimated and true utilization distributions than did root-n estimators on a case-by-case basis. On average, LSCV also provided lower positive relative error in home-range areas with small sample sizes of simulated data. However, root-n estimators tended to produce a better fit than LSCV on average because of the extremely poor estimates generated on occasion by LSCV. Furthermore, the relative performance of LSCV decreased substantially as the number of repeat locations in the data increased. Root-n estimators also generally provided a better fit between utilization distributions generated from subsamples of hawk data and the local densities of locations from the full data sets. Least-squares cross-validation generated more unrealistically disjointed estimates of home ranges using real location data from red wolf packs. Most importantly, LSCV failed to generate home-range estimates for >20% of red wolf packs due to the presence of repeat locations. We conclude that root-n estimators are superior to LSCV for larger data sets with repeat locations or other extreme clumping of data. In contrast, LSCV may be superior where the primary interest is in generating animal home ranges (rather than the utilization distribution) and data sets are small with limited clumping of locations.
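A minimal sketch of LSCV for a one-dimensional Gaussian kernel density estimate, with Silverman's reference rule standing in for the "root-n" style estimators compared in the paper (which differ in detail).

```python
import numpy as np

def lscv_score(h, x):
    """Least-squares cross-validation criterion for a Gaussian KDE with bandwidth h."""
    n = x.size
    d = (x[:, None] - x[None, :]) / h
    # closed-form integral of fhat^2 for the Gaussian kernel (K*K is N(0, 2))
    int_f2 = np.exp(-0.25 * d**2).sum() / (n**2 * h * np.sqrt(4 * np.pi))
    K = np.exp(-0.5 * d**2) / np.sqrt(2 * np.pi)
    loo = (K.sum(axis=1) - K[np.diag_indices(n)]) / ((n - 1) * h)  # leave-one-out fhat
    return int_f2 - 2 * loo.mean()

def silverman(x):
    """Reference-rule bandwidth, a simple stand-in for the paper's root-n rules."""
    return 1.06 * x.std(ddof=1) * x.size ** (-0.2)

# grid-search LSCV; with many tied (repeat) locations the zero pairwise
# distances can drive the criterion toward h -> 0, the failure mode noted above
x = np.random.default_rng(2).normal(size=200)
grid = np.linspace(0.05, 1.5, 60)
h_lscv = grid[np.argmin([lscv_score(h, x) for h in grid])]
h_ref = silverman(x)
```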

12.
Choosing an appropriate kernel is critical when classifying a new problem with a Support Vector Machine. So far, more attention has been paid to constructing new kernels and choosing suitable parameter values for a specific kernel function than to kernel selection. Furthermore, most current kernel selection methods focus on seeking the single best kernel with the highest cross-validated classification accuracy; they are time-consuming and ignore the differences among the number of support vectors and the CPU time of SVMs with different kernels. Considering the tradeoff between classification success ratio and CPU time, there may be multiple kernel functions performing equally well on the same classification problem. Aiming to automatically select the appropriate kernel functions for a given data set, we propose a multi-label learning based kernel recommendation method built on data characteristics. For each data set, a meta-knowledge data base is first created by extracting the feature vector of data characteristics and identifying the corresponding applicable kernel set. A kernel recommendation model is then constructed on the generated meta-knowledge data base with a multi-label classification method. Finally, appropriate kernel functions are recommended for a new data set by the recommendation model according to the characteristics of the new data set. Extensive experiments over 132 UCI benchmark data sets, with five different types of data set characteristics, eleven typical kernels (Linear, Polynomial, Radial Basis Function, Sigmoidal, Laplace, Multiquadric, Rational Quadratic, Spherical, Spline, Wave, and Circular), and five multi-label classification methods demonstrate that, compared with existing kernel selection methods and the most widely used RBF kernel, SVM with the kernel function recommended by our method achieved the highest classification performance.
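A minimal sketch, using scikit-learn, of identifying an "applicable kernel set" rather than a single winner: every kernel whose cross-validated accuracy falls within a tolerance of the best. The tolerance and data set are illustrative; the paper's meta-learning recommender is not reproduced here.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
scores = {k: cross_val_score(SVC(kernel=k), X, y, cv=5).mean()
          for k in ("linear", "poly", "rbf", "sigmoid")}
best = max(scores.values())
applicable = [k for k, s in scores.items() if s >= best - 0.02]  # 2% tolerance
```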

13.
A functional expansion was used to model the relationship between a Gaussian white noise stimulus current and the resulting action potential output in the single sensory neuron of the cockroach femoral tactile spine. A new precise procedure was used to measure the kernels of the functional expansion. Very similar kernel estimates were obtained from separate sections of the data produced by the same neuron with the same input noise power level, although some small time-varying effects were detectable in moving through the data. Similar kernel estimates were measured using different input noise power levels for a given cell, or when comparing different cells under similar stimulus conditions. The kernels were used to identify a model for sensory encoding in the neuron, comprising a cascade of dynamic linear, static nonlinear, and dynamic linear elements. Only a single slice of the estimated experimental second-order kernel was used in identifying the cascade model. However, the complete second-order kernel of the cascade model closely resembled the estimated experimental kernel. Moreover, the model could closely predict the experimental action potential train obtained with novel white noise inputs.
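A minimal sketch of the classical Lee-Schetzen cross-correlation estimate of the first-order kernel from white-noise input and output records; the paper's "new precise procedure" and its second-order kernels are not reproduced, and the names here are ours.

```python
import numpy as np

def first_order_kernel(x, y, max_lag, dt):
    """Estimate k1(tau) ~ E[y(t) x(t - tau)] / P for equal-length records x, y.

    P is the input power density of the discretized Gaussian white noise.
    """
    P = x.var() * dt                        # white-noise power density
    y0 = y - y.mean()
    k1 = np.array([np.mean(y0[lag:] * x[: x.size - lag])
                   for lag in range(max_lag)])
    return k1 / P
```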

14.
A spatial open-population capture-recapture model is described that extends both the non-spatial open-population model of Schwarz and Arnason and the spatially explicit closed-population model of Borchers and Efford. The superpopulation of animals available for detection at some time during a study is conceived as a two-dimensional Poisson point process. Individual probabilities of birth and death follow the conventional open-population model. Movement between sampling times may be modeled with a dispersal kernel using a recursive Markovian algorithm. Observations arise from distance-dependent sampling at an array of detectors. As in the closed-population spatial model, the observed data likelihood relies on integration over the unknown animal locations; maximization of this likelihood yields estimates of the birth, death, movement, and detection parameters. The models were fitted to data from a live-trapping study of brushtail possums (Trichosurus vulpecula) in New Zealand. Simulations confirmed that spatial modeling can greatly reduce the bias of capture-recapture survival estimates and that there is a degree of robustness to misspecification of the dispersal kernel. An R package is available that includes various extensions.

15.
16.
The selection of a specific statistical distribution as a model for the population behavior of a given variable is seldom a simple problem. One strategy is to test different distributions (normal, lognormal, Weibull, etc.) and select the one that provides the best fit to the observed data while being the most parsimonious. Alternatively, one can make a choice based on theoretical arguments and simply fit the corresponding parameters to the observed data. In either case, different distributions can give similar results and provide almost equivalent models for a given data set. Model selection can be more complicated when the goal is to describe a trend in the distribution of a given variable; in such cases, changes in shape and skewness are difficult to represent by a single distributional form. As an alternative to complicated families of distributions as models for data, the S-distribution [Voit, E. O. (1992) Biom. J. 7, 855-878] provides a highly flexible mathematical form in which the density is defined as a function of the cumulative. S-distributions can accurately approximate many known continuous and unimodal distributions, preserving the well-known limit relationships between them. Besides representing well-known distributions, S-distributions provide an infinity of new possibilities that do not correspond to known classical distributions. Although the utility and performance of this general form has been clearly proved in different applications, its definition as a differential equation is a potential drawback for some problems. In this paper we obtain an analytical solution for the quantile equation that greatly simplifies the use of S-distributions, and we show the utility of this solution in different applications. After classifying the qualitative behaviors of the S-distribution in parameter space, we show how to obtain S-distributions that satisfy specific constraints; one of the most interesting cases is the possibility of obtaining distributions with P(X ≤ Xc) = 0. We then demonstrate that the quantile solution facilitates the use of S-distributions in Monte Carlo experiments through the generation of random samples. Finally, we show how to fit an S-distribution to actual data, so that the resulting distribution can be used as a statistical model for them.
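A minimal sketch of the random-sample generation the quantile solution enables, here with numerical quadrature of the quantile ODE dX/dF = 1 / (alpha (F^g - F^h)) standing in for the paper's closed-form solution. The parameters must satisfy g < h with alpha > 0, and the uniform draws are kept away from the tails for quadrature stability in this sketch.

```python
import numpy as np
from scipy.integrate import quad

def s_quantile(u, alpha, g, h, x0):
    """Quantile X(u) of the S-distribution dF/dX = alpha*(F**g - F**h), F(x0) = 0.5,
    obtained by integrating dX/dF from F = 0.5 to F = u."""
    integrand = lambda F: 1.0 / (alpha * (F**g - F**h))
    return x0 + quad(integrand, 0.5, u)[0]

rng = np.random.default_rng(3)
# inverse-CDF sampling; u restricted to (0.05, 0.95) to avoid tail singularities
sample = [s_quantile(u, alpha=1.0, g=0.5, h=2.0, x0=0.0)
          for u in rng.uniform(0.05, 0.95, size=1000)]
```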

17.
Analyses of time-to-event data in clinical and epidemiological studies often encounter missing covariate values, and the missing at random assumption is commonly adopted: missingness is assumed to depend on the observed data, including the observed outcome, which is the minimum of the survival and censoring times. However, it is conceivable that in certain settings missingness of covariate values is related to the survival time but not to the censoring time. This is especially so when covariate missingness is related to an unmeasured variable affected by the patient's illness and prognosis factors at baseline. If this is the case, then the covariate missingness is not at random when the survival time is censored, which creates a challenge in data analysis. In this article, we propose an approach to deal with such survival-time-dependent covariate missingness based on the well-known Cox proportional hazards model. Our method is based on inverse propensity weighting, with the propensity estimated by nonparametric kernel regression. Our estimators are consistent and asymptotically normal, and their finite-sample performance is examined through simulation. An application to a real-data example is included for illustration.
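A minimal sketch of the weighting step: the probability that the covariate is observed is estimated by Gaussian-kernel (Nadaraya-Watson) regression on a fully observed variable, and complete cases receive inverse-propensity weights. Passing the weights to a Cox fitter (e.g., a package that accepts case weights) is left out, and all names here are ours.

```python
import numpy as np

def kernel_propensity(z, observed, h):
    """Estimate P(covariate observed | z) by Gaussian-kernel regression."""
    w = np.exp(-0.5 * ((z[:, None] - z[None, :]) / h) ** 2)
    return (w * observed).sum(axis=1) / w.sum(axis=1)

rng = np.random.default_rng(4)
z = rng.normal(size=500)                            # fully observed regressor
observed = rng.uniform(size=500) < 0.7              # covariate-missingness indicator
pi_hat = kernel_propensity(z, observed.astype(float), h=0.3)
ipw = observed.astype(float) / np.clip(pi_hat, 1e-3, None)  # zero for incomplete cases
```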

18.
Two-part regression models are frequently used to analyze longitudinal count data with excess zeros, where the same set of subjects is repeatedly observed over time. In this context, several sources of individual-level heterogeneity may affect the observed process. Further, longitudinal studies often suffer from missing values: individuals drop out of the study before its completion and thus present incomplete data records. In this paper, we propose a finite mixture of hurdle models to handle the heterogeneity, introducing random effects with a discrete distribution, and specify a pattern-mixture approach to deal with non-ignorable missing values. This approach lets us accommodate overdispersed counts while allowing for association between the two parts of the model and for non-ignorable dropout. The effectiveness of the proposal is tested through a simulation study. Finally, an application to real data on skin cancer is provided.

19.
King R, Brooks SP, Coulson T. Biometrics. 2008;64(4):1187-1195.
We consider the issue of analyzing complex ecological data in the presence of covariate information and model uncertainty. Several issues can arise when analyzing such data, not least the need to account for missing covariate values; this is most acute in the presence of time-varying covariates. We consider mark-recapture-recovery data, where the recapture probabilities are less than unity, so that individuals are not always observed at each capture event. This often leads to a large amount of missing time-varying individual covariate information, because the covariate usually cannot be recorded if an individual is not observed. In addition, we address the problem of model selection over these covariates with missing data. We consider a Bayesian approach in which we are able to deal with large amounts of missing data by essentially treating the missing values as auxiliary variables. This approach also allows a quantitative comparison of different models via posterior model probabilities, obtained via the reversible jump Markov chain Monte Carlo algorithm. To demonstrate this approach we analyze data relating to Soay sheep, which pose several statistical challenges in fully describing the intricacies of the system.

20.
Predicting protein localization in budding yeast
MOTIVATION: Most existing methods for predicting protein subcellular localization deal with cases limited to between two and five locations, and only a few can be effectively extended to cover 12-14 localizations, because the more locations involved, the poorer the success rate. Besides, some proteins may occur in several different subcellular locations, i.e., they bear the feature of 'multiplex locations'. So far there is no method that can effectively treat this difficult multiplex-location problem. The present study was initiated in an attempt to address (1) how to efficiently identify the localization of a query protein among many possible subcellular locations, and (2) how to deal with the case of multiplex locations. RESULTS: By hybridizing gene ontology, functional domain, and pseudo amino acid composition approaches, a new method has been developed that can predict the subcellular localization of proteins with the multiplex-location feature. A global analysis of the proteins in budding yeast, classified into 22 locations, was performed by jackknife cross-validation with the new method. The overall success rate thus obtained is 70%. In contrast, the corresponding rates obtained by some other existing methods were only 13-14%, indicating that the new method is powerful and promising. Furthermore, predictions were made for the four proteins whose localizations could not be determined by experiments, as well as for the 236 proteins whose localizations in budding yeast were ambiguous according to experimental observations. According to our predicted results, many of these 'ambiguous proteins' were found to have the same score and ranking for several different subcellular locations, implying that they may simultaneously exist, or move around, in these locations. This finding is intriguing because it reflects the dynamic feature of these proteins in a cell, which may be associated with some special biological functions.
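A minimal sketch of one ingredient, a Chou-style pseudo amino acid composition built from a single standardized hydrophobicity scale (Kyte-Doolittle); the scale choice, the weight w, and lam are illustrative simplifications, and the gene ontology and functional domain features are not reproduced.

```python
import numpy as np

AA = "ACDEFGHIKLMNPQRSTVWY"
KD = np.array([1.8, 2.5, -3.5, -3.5, 2.8, -0.4, -3.2, 4.5, -3.9, 3.8,
               1.9, -3.5, -1.6, -3.5, -4.5, -0.8, -0.7, 4.2, -0.9, -1.3])
KD = (KD - KD.mean()) / KD.std()          # standardize the property scale
HYDRO = dict(zip(AA, KD))

def pse_aac(seq, lam=5, w=0.05):
    """20 composition terms plus `lam` sequence-order correlation factors.

    Assumes len(seq) > lam.
    """
    f = np.array([seq.count(a) for a in AA]) / len(seq)
    h = np.array([HYDRO[a] for a in seq])
    theta = np.array([np.mean((h[j:] - h[:-j]) ** 2) for j in range(1, lam + 1)])
    denom = f.sum() + w * theta.sum()
    return np.concatenate([f, w * theta]) / denom
```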

