Similar literature (20 results)
1.
The identification and assessment of prognostic factors is one of the major tasks in clinical research. A single prognostic factor can be assessed with recently established methods based on optimal cutpoints. Here, we suggest a method for considering an optimally selected prognostic factor from a set of prognostic factors of interest. This can be viewed as a variable selection method and is the underlying decision problem at each node of various tree-building algorithms. We propose to use maximally selected statistics where the selection is defined over the set of prognostic factors and over all cutpoints in each prognostic factor. We demonstrate that it is feasible to compute the approximate null distribution. We illustrate the new variable selection test with data from the German Breast Cancer Study Group and from a small study on patients with diffuse large B-cell lymphoma. Using the null distribution for a p-value adjusted regression trees algorithm, we adjust for the number of variables analysed at each node as well. (© 2004 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim)
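To make the idea of selecting over both factors and cutpoints concrete, here is a minimal sketch. It is not the authors' procedure: the abstract describes an approximate null distribution, whereas this sketch swaps in a plain permutation test, and it uses a standardized difference in means on an uncensored outcome. The function names `max_selected_stat` and `select_variable`, and the `min_frac` parameter, are hypothetical.

```python
import random
from statistics import pstdev


def max_selected_stat(x, y, min_frac=0.1):
    """Largest absolute standardized mean difference in y over all cutpoints of x."""
    n = len(x)
    order = sorted(range(n), key=lambda i: x[i])
    smallest_group = max(1, int(n * min_frac))  # avoid extreme, tiny groups
    sd = pstdev(y)
    if sd == 0:
        return 0.0
    total = sum(y)
    left_sum = 0.0
    best = 0.0
    for k in range(1, n):
        left_sum += y[order[k - 1]]
        if k < smallest_group or n - k < smallest_group:
            continue
        if x[order[k - 1]] == x[order[k]]:
            continue  # tied x values cannot be separated by a cutpoint
        mean_left = left_sum / k
        mean_right = (total - left_sum) / (n - k)
        se = sd * (1 / k + 1 / (n - k)) ** 0.5
        best = max(best, abs(mean_left - mean_right) / se)
    return best


def select_variable(columns, y, n_perm=200, seed=0):
    """Pick the factor with the largest maximally selected statistic.

    The p-value is adjusted for the search over all factors and all
    cutpoints by permuting y (an assumption of this sketch).
    """
    rng = random.Random(seed)
    stats = {name: max_selected_stat(x, y) for name, x in columns.items()}
    best_name = max(stats, key=stats.get)
    observed = stats[best_name]
    y_perm = list(y)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(y_perm)
        perm_max = max(max_selected_stat(x, y_perm) for x in columns.values())
        if perm_max >= observed:
            exceed += 1
    return best_name, observed, (exceed + 1) / (n_perm + 1)
```

Because each permutation takes the maximum over every factor and every cutpoint, the resulting p-value accounts for the whole search rather than for a single pre-specified split.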

2.
Statistical models support medical research by facilitating individualized outcome prognostication conditional on independent variables or by estimating effects of risk factors adjusted for covariates. Theory of statistical models is well-established if the set of independent variables to consider is fixed and small. Hence, we can assume that effect estimates are unbiased and the usual methods for confidence interval estimation are valid. In routine work, however, it is not known a priori which covariates should be included in a model, and often we are confronted with the number of candidate variables in the range 10–30. This number is often too large to be considered in a statistical model. We provide an overview of various available variable selection methods that are based on significance or information criteria, penalized likelihood, the change-in-estimate criterion, background knowledge, or combinations thereof. These methods were usually developed in the context of a linear regression model and then transferred to more generalized linear models or models for censored survival data. Variable selection, in particular if used in explanatory modeling where effect estimates are of central interest, can compromise stability of a final model, unbiasedness of regression coefficients, and validity of p-values or confidence intervals. Therefore, we give pragmatic recommendations for the practicing statistician on application of variable selection methods in general (low-dimensional) modeling problems and on performing stability investigations and inference. We also propose some quantities based on resampling the entire variable selection process to be routinely reported by software packages offering automated variable selection algorithms.

3.
A computer program has been written which performs a stepwise selection of variables for logistic regression using maximum likelihood estimation. The selection procedure is based on likelihood ratio tests for the coefficients. These tests are used in a forward selection and a backward elimination at each step. The use of the program is illustrated by several examples.

4.

Background  

The main problem in many model-building situations is to choose from a large set of covariates those that should be included in the "best" model. A decision to keep a variable in the model might be based on the clinical or statistical significance. There are several variable selection algorithms in existence. Those methods are mechanical and as such carry some limitations. Hosmer and Lemeshow describe a purposeful selection of covariates within which an analyst makes a variable selection decision at each step of the modeling process.

5.
The effect of environmental conditions on river macrobenthic communities was studied using a dataset consisting of 343 sediment samples from unnavigable watercourses in Flanders, Belgium. Artificial neural network models were used to analyse the relationship between river characteristics and macrobenthic communities. The dataset included presence or absence of macroinvertebrate taxa and 12 physicochemical and hydromorphological variables for each sampling site. The abiotic variables served as input for the artificial neural networks to predict the macrobenthic community. The effects of the input variables on model performance were assessed in order to identify the most diagnostic river characteristics for macrobenthic community composition. This was done by consecutively eliminating the least important variables and, when beneficial for model performance, adding previously removed ones again. This stepwise input variable selection procedure was tested not only on a model predicting the entire macrobenthic community, but also on three models, each predicting an individual taxon. Additionally, during each step of the stepwise leave-one-out procedure, a sensitivity analysis was performed to determine the response of the predicted macroinvertebrate taxa to the input variables applied. This research illustrated that a combination of input variable selection with sensitivity analyses can contribute to the development of reliable and ecologically relevant ANN models. The river characteristics that best predicted presence or absence of the benthic macroinvertebrates were the Julian day, conductivity, and dissolved oxygen content. These conditions reflect the importance of discharges of untreated wastewater that occurred during the period of investigation in nearly all Flemish rivers.

6.
The identification of biomarkers is one of the leading research areas in proteomics. When biomarkers have to be searched for in spot volume datasets produced by 2D gel-electrophoresis, problems may arise related to the large number of spots present in each map and the small number of samples available in each class (control/pathological). In such cases multivariate methods are usually exploited together with variable selection procedures to provide a set of possible biomarkers; however, these are usually aimed at selecting the smallest set of variables (spots) providing the best performance in prediction. This approach seems unsuitable for the identification of potential biomarkers, since in this case all the possible candidate biomarkers have to be identified to provide a general picture of the "pathological state": exhaustiveness has to be preferred in order to provide a complete understanding of the mechanisms underlying the pathology. We propose here a ranking and classification method, "Ranking-PCA", based on Principal Component Analysis and variable selection in forward search: the method selects one variable at a time as the one providing the best separation of the two classes investigated in the space given by the relevant PCs. The method was applied to an artificial dataset and a real case-study: Ranking-PCA exhaustively identified the potential biomarkers and provided reliable and robust results.

7.
Gene selection and classification of microarray data using random forest (total citations: 9; self-citations: 0; citations by others: 9)

Background  

Selection of relevant genes for sample classification is a common task in most gene expression studies, where researchers try to identify the smallest possible set of genes that can still achieve good predictive performance (for instance, for future use with diagnostic purposes in clinical practice). Many gene selection approaches use univariate (gene-by-gene) rankings of gene relevance and arbitrary thresholds to select the number of genes, can only be applied to two-class problems, and use gene selection ranking criteria unrelated to the classification algorithm. In contrast, random forest is a classification algorithm well suited for microarray data: it shows excellent performance even when most predictive variables are noise, can be used when the number of variables is much larger than the number of observations and in problems involving more than two classes, and returns measures of variable importance. Thus, it is important to understand the performance of random forest with microarray data and its possible use for gene selection.

8.
Marker pair selection for mapping quantitative trait loci (total citations: 10; self-citations: 0; citations by others: 10)
Piepho HP, Gauch HG. Genetics 2001, 157(1): 433-444
Mapping of quantitative trait loci (QTL) for backcross and F(2) populations may be set up as a multiple linear regression problem, where marker types are the regressor variables. It has been shown previously that flanking markers absorb all information on isolated QTL. Therefore, selection of pairs of markers flanking QTL is useful as a direct approach to QTL detection. Alternatively, selected pairs of flanking markers can be used as cofactors in composite interval mapping (CIM). Overfitting is a serious problem, especially if the number of regressor variables is large. We suggest a procedure denoted as marker pair selection (MPS) that uses model selection criteria for multiple linear regression. Markers enter the model in pairs, which reduces the number of models to be considered, thus alleviating the problem of overfitting and increasing the chances of detecting QTL. MPS entails an exhaustive search per chromosome to maximize the chance of finding the best-fitting models. A simulation study is conducted to study the merits of different model selection criteria for MPS. On the basis of our results, we recommend the Schwarz Bayesian criterion (SBC) for use in practice.
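For orientation, the Schwarz Bayesian criterion for a Gaussian multiple linear regression is commonly written, up to an additive constant, as shown below. The abstract does not state the exact form used by the authors, so the notation here is an assumption: $n$ is the number of observations, $p$ the number of fitted regression parameters, and $\mathrm{RSS}_p$ the residual sum of squares of the candidate model.

```latex
\mathrm{SBC}(p) \;=\; n \,\ln\!\left(\frac{\mathrm{RSS}_p}{n}\right) \;+\; p \,\ln n
```

The model with the smallest value is preferred. Since $\ln n > 2$ once $n \geq 8$, the penalty per added parameter is heavier than that of AIC, which is one reason SBC tends to select smaller models when many regressors are available.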

9.
Spatial randomization of clones across a seed orchard's grid is commonly applied to promote cross-fertilization and minimize selfing. The high selection differential attained from advanced-generation breeding programs places a high premium on the genetic gain and diversity delivered by seed orchards; clonal allocation is therefore important, and even more challenging when clones share common ancestry. Evidence of low selfing in many conifer seed orchards, resulting from their high genetic load, inbreeding depression, and polyembryony, is abundant and calls for a re-evaluation of orchard design, specifically when randomization is associated with added managerial burden. Clonal rows represent a viable option for simplifying orchard management; however, they are often associated with elevated correlated matings between adjacent clones. Here, we propose a modified clonal-row design that replicates, staggers, and randomizes the rows, thus doubling the number of adjacent clones and providing a different set of neighboring clones at each replication, which accommodates related parents more readily than any single-tree arrangement. We present a novel algorithm packaged in user-friendly software for executing various seed orchard designs. The developed program is interactive and suitable for any orchard size and configuration, and accommodates any number of clones allocated to rows of variable length (ranging from a single tree to any even number) with a pre-set separation zone between ramets of the same clone. The program offers three deployment modes (equal, linear, and custom), each with multiple layouts determined by the number of iterations requested. The resulting layouts are ranked based on four criteria: (1) the number of empty positions, (2) deviation between expected and observed clone size, (3) minimum inbreeding, and (4) a neighborhood index that expresses the efficiency of clonal distribution.

10.
Nonomura M. PLoS ONE 2012, 7(4): e33501
A model of multicellular systems with several types of cells is developed from the phase field model. The model is presented as a set of partial differential equations of the field variables, each of which expresses the shape of one cell. The dynamics of each cell is based on the criteria for minimizing the surface area and retaining a certain volume. The effects of cell adhesion and excluded volume are also taken into account. The proposed model can be used to find the position of the membrane and/or the cortex of each cell without the need to adopt extra variables. This model is suitable for numerical simulations of a system having a large number of cells. The two-dimensional results of cell division, cell adhesion, rearrangement of a cell cluster, chemotaxis, and cell sorting as well as the three-dimensional results of cell clusters on the substrate are presented.

11.
The MCS/SEL/BAS program provides a method for group recognition, based on a criterion of homogeneity within the groups. The basic aim of this clustering method is not to 'force' data into a number of separate groups, as it allows the possibility that a given element in the data set can be assigned to more than one group. Moreover, a parsimonious path through the groups is sought by selecting groups on the basis of two suitably chosen, peak-ordered criteria. This selection continues until a covering of the data set is obtained (i.e., until each element in the data set is assigned to at least one group). Then relationships occurring among the set of selected groups are investigated by means of two coefficients, called the overlapping and cohesion coefficients, respectively. The utility of this program has been demonstrated here in elaborating large sets of data derived from mating type interactions of ciliates, but it can also be used for analyzing data derived from a wide spectrum of compatibility phenomena exhibited by other living organisms. Algorithms of this program are written in BASIC and formulated in a conversational mode for processing on a Macintosh. A computer program (MCS/SEL/BAS) is available from G. Mancini upon request. Received on September 18, 1990; accepted on January 21, 1991

12.
This paper discusses the challenges of setting process validation acceptance criteria for biotech products for cases where using statistical tools is appropriate. Data are analyzed under three different scenarios that are frequently encountered in biotech applications. Scenario A represents the case when a small data set around center point conditions is available for setting acceptance criteria. Scenario B represents the case when a larger data set within normal operation conditions is available for setting acceptance criteria. Scenario C represents the case when a large characterization data set is available for setting acceptance criteria and it is possible to accurately model the impact of operation conditions on performance of the step. Statistical approaches including mean ± 3SD, tolerance interval analysis, prediction profiler, and Monte Carlo simulation are applied to the different scenarios. Strengths and shortcomings of the different statistical tools are discussed, and the best approach for each scenario is recommended. It is shown that selection of the right statistical approach is a critical first step toward setting appropriate acceptance criteria.
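As a rough illustration of why the size of the data set matters for the mean ± 3SD approach, the sketch below computes such limits and then uses a small Monte Carlo simulation to show how much their width varies from one data set to the next. It assumes normally distributed process data, which the abstract does not state, and the function names `three_sd_limits` and `simulate_limit_spread` are hypothetical. It does not reproduce the paper's tolerance interval or prediction profiler analyses.

```python
import random
from statistics import mean, stdev


def three_sd_limits(values):
    """Acceptance limits as sample mean +/- 3 sample standard deviations."""
    m, s = mean(values), stdev(values)
    return m - 3 * s, m + 3 * s


def simulate_limit_spread(true_mean, true_sd, n_runs, n_sim=2000, seed=0):
    """5th and 95th percentile of the limit width over simulated data sets.

    Each simulated data set holds n_runs normally distributed results
    (the normality is an assumption of this sketch).
    """
    rng = random.Random(seed)
    widths = []
    for _ in range(n_sim):
        sample = [rng.gauss(true_mean, true_sd) for _ in range(n_runs)]
        low, high = three_sd_limits(sample)
        widths.append(high - low)
    widths.sort()
    return widths[int(0.05 * n_sim)], widths[int(0.95 * n_sim)]
```

Comparing `simulate_limit_spread(100.0, 1.0, n_runs=5)` with `simulate_limit_spread(100.0, 1.0, n_runs=30)` shows a much wider range of possible limit widths for the small data set, which corresponds to the difficulty described for Scenario A.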

13.

Background

The identification of new diagnostic or prognostic biomarkers is one of the main aims of clinical cancer research. Technologies like mass spectrometry are commonly used in proteomic research. Mass spectrometry signals show the proteomic profiles of the individuals under study at a given time. These profiles correspond to the recording of a large number of proteins, much larger than the number of individuals. These variables supplement or complete the classical clinical variables. The objective of this study is to evaluate and compare the predictive ability of new and existing models combining mass spectrometry data and classical clinical variables. This study was conducted in the context of binary prediction.

Results

To achieve this goal, simulated data as well as a real dataset dedicated to the selection of proteomic markers of steatosis were used to evaluate the methods. The proposed methods meet the challenge of high-dimensional data and the selection of predictive markers by using penalization methods (Ridge, Lasso) and dimension reduction techniques (PLS), as well as a combination of both strategies through sparse PLS in the context of a binary class prediction. The methods were compared in terms of mean classification rate and their ability to select the true predictive values. These comparisons were done on clinical-only models, mass-spectrometry-only models and combined models.

Conclusions

It was shown that models which combine both types of data can be more efficient than models that use only clinical or mass spectrometry data when the sample size of the dataset is large enough.

14.
In many attitudinal investigations, particularly those involving free-choice profiling, a very large list of variables or features can emerge. Ordination using generalized Procrustes analysis provides a common base for comparing assessors, but the derived configurations are often high-dimensional and difficult to summarize. This problem can be rectified by selecting a small subset of the original set of variables. Methods of variable selection in principal component analysis can be adapted easily for such purposes, but there is no guarantee with these methods that overall data structure is preserved. A recently introduced variable selection procedure that does aim to preserve the data structure as much as possible would seem to be more appropriate. All methods are described and applied to a set of data arising from an attitudinal investigation of meat products. The results indicate that variable selection should be more widely encouraged.

15.
Groups of variables are commonly seen in diagnostic medicine when multiple prognostic factors are aggregated into a composite score to represent the risk profile. A model selection method considers these covariates as all-in or all-out types. Model selection procedures for grouped covariates and their applications have thrived in recent years, in part because of the development of genetic research in which gene–gene or gene–environment interactions and regulatory network pathways are considered groups of individual variables. However, little has been discussed on how to utilize grouped covariates to grow a classification tree. In this paper, we propose a nonparametric method to address the selection of split variables for grouped covariates and the subsequent selection of split points. Comprehensive simulations were implemented to show the superiority of our procedures compared to a commonly used recursive partition algorithm. The practical use of our method is demonstrated through a real data analysis that uses a group of prognostic factors to classify the successful mobilization of peripheral blood stem cells.

16.
Automated variable selection procedures, such as backward elimination, are commonly employed to perform model selection in the context of multivariable regression. The stability of such procedures can be investigated using a bootstrap-based approach. The idea is to apply the variable selection procedure on a large number of bootstrap samples successively and to examine the obtained models, for instance, in terms of the inclusion of specific predictor variables. In this paper, we aim to investigate a particular important problem affecting this method in the case of categorical predictor variables with different numbers of categories and to give recommendations on how to avoid it. For this purpose, we systematically assess the behavior of automated variable selection based on the likelihood ratio test using either bootstrap samples drawn with replacement or subsamples drawn without replacement from the original dataset. Our study consists of extensive simulations and a real data example from the NHANES study. Our main result is that if automated variable selection is conducted on bootstrap samples, variables with more categories are substantially favored over variables with fewer categories and over metric variables even if none of them have any effect. Importantly, variables with no effect and many categories may be (wrongly) preferred to variables with an effect but few categories. We suggest the use of subsamples instead of bootstrap samples to bypass these drawbacks.
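The two resampling schemes compared above can be sketched as follows. This is only an illustration of the mechanics: the selection rule is a simple correlation threshold swapped in for the likelihood-ratio-based backward elimination studied in the paper, and it handles metric predictors only. The names `select`, `inclusion_frequencies`, `threshold`, and `frac` (with its default of 0.632) are assumptions of this sketch.

```python
import random
from statistics import mean


def correlation(a, b):
    """Pearson correlation; returns 0.0 if either input is constant."""
    ma, mb = mean(a), mean(b)
    sab = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    saa = sum((u - ma) ** 2 for u in a)
    sbb = sum((v - mb) ** 2 for v in b)
    if saa == 0 or sbb == 0:
        return 0.0
    return sab / (saa * sbb) ** 0.5


def select(columns, y, idx, threshold=0.3):
    """Stand-in selection rule: keep predictors with |r| >= threshold."""
    ys = [y[i] for i in idx]
    return {
        name
        for name, x in columns.items()
        if abs(correlation([x[i] for i in idx], ys)) >= threshold
    }


def inclusion_frequencies(columns, y, n_rep=200, scheme="bootstrap",
                          frac=0.632, seed=0):
    """Share of resamples in which each predictor is selected."""
    rng = random.Random(seed)
    n = len(y)
    counts = dict.fromkeys(columns, 0)
    for _ in range(n_rep):
        if scheme == "bootstrap":
            # n draws with replacement
            idx = [rng.randrange(n) for _ in range(n)]
        elif scheme == "subsample":
            # a fraction of the data drawn without replacement
            idx = rng.sample(range(n), int(frac * n))
        else:
            raise ValueError("scheme must be 'bootstrap' or 'subsample'")
        for name in select(columns, y, idx):
            counts[name] += 1
    return {name: c / n_rep for name, c in counts.items()}
```

Running both schemes on the same data and comparing the inclusion frequencies of predictors known to have no effect is one way to look for the kind of bias the abstract reports for bootstrap samples.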

17.
Biomarkers are of increasing importance for personalized medicine, with applications including diagnosis, prognosis, and selection of targeted therapies. Their use is extremely diverse, ranging from pharmacodynamics to treatment monitoring. Following a concise review of terminology, we provide examples and current applications of three broad categories of biomarkers: DNA biomarkers, DNA tumor biomarkers, and other general biomarkers. We outline clinical trial phases for identifying and validating diagnostic and prognostic biomarkers. Predictive biomarkers, more generally termed companion diagnostic tests, predict treatment response in terms of efficacy and/or safety. We consider suitability of clinical trial designs for predictive biomarkers, including a detailed discussion of validation study designs, with emphasis on interpretation of study results. We specifically discuss the interpretability of treatment effects if a large set of DNA biomarker profiles is available and the number of therapies is identical to the number of different profiles.

18.
mdclust--exploratory microarray analysis by multidimensional clustering (total citations: 1; self-citations: 0; citations by others: 1)
MOTIVATION: Unsupervised clustering of microarray data may detect potentially important, but not obvious, characteristics of samples, for instance subgroups of diagnoses with distinct gene profiles or systematic errors in experimentation. RESULTS: Multidimensional clustering (mdclust) is a method that identifies sets of sample clusters and associated genes. It iteratively applies two-means clustering and score-based gene selection. For any phenotype variable, the best matching sets of clusters can be selected. This provides a method to identify gene-phenotype associations, suited even for settings with a large number of phenotype variables. An optional model-based discriminant step may further reduce the number of selected genes.

19.
The Cox proportional hazards model has become the standard for the analysis of survival time data in cancer and other chronic diseases. In most studies, proportional hazards (PH) are assumed for covariate effects. With long-term follow-up, the PH assumption may be violated, leading to poor model fit. To accommodate non-PH effects, we introduce a new procedure, MFPT, an extension of the multivariable fractional polynomial (MFP) approach, to do the following: (1) select influential variables; (2) determine a sensible dose-response function for continuous variables; (3) investigate time-varying effects; (4) model such time-varying effects on a continuous scale. Assuming PH initially, we start with a detailed model-building step, including a search for possible non-linear functions for continuous covariates. Sometimes a variable with a strong short-term effect may appear weak or non-influential if 'averaged' over time under the PH assumption. To protect against omitting such variables, we repeat the analysis over a restricted time-interval. Any additional prognostic variables identified by this second analysis are added to create our final time-fixed multivariable model. Using a forward-selection algorithm we search for possible improvements in fit by adding time-varying covariates. The first part to create a final time-fixed model does not require the use of MFP. A model may be given from 'outside' or a different strategy may be preferred for this part. This broadens the scope of the time-varying part. To motivate and illustrate the methodology, we create prognostic models from a large database of patients with primary breast cancer. Non-linear time-fixed effects are found for progesterone receptor status and number of positive lymph nodes. Highly statistically significant time-varying effects are present for progesterone receptor status and tumour size.

20.
A new method for choosing the variables with the greatest discriminatory power in the location model for mixed variable discriminant analysis is presented in the paper. The procedure, based on a multivariate discriminatory measure, enables a simultaneous reduction of the number of discrete and continuous variables. The introduced criterion can be used for either optimal or stepwise selection of a variable subset. As an example, the results of stepwise variable selection for some medical data are presented in the paper.
