Similar articles
20 similar articles found
1.
    
Ma S  Huang J 《Biometrics》2007,63(3):751-757
In biomedical studies, it is of great interest to develop methodologies for combining multiple markers for the purpose of disease classification. The receiver operating characteristic (ROC) technique has been widely used, where classification performance can be measured with the area under the ROC curve (AUC). In this article, we study an ROC-based method for effectively combining multiple markers for disease classification. We propose a sigmoid AUC (SAUC) estimator that maximizes the sigmoid approximation of the empirical AUC. The SAUC estimator is computationally affordable, n^(1/2)-consistent, and achieves the same asymptotic efficiency as the AUC estimator. Inference based on the weighted bootstrap is investigated. We also propose Monte Carlo methods to assess the overall prediction performance and the relative importance of individual markers. Finite sample performance is evaluated using simulation studies and two public data sets.
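To make the sigmoid approximation concrete, the following is a minimal sketch that compares the empirical AUC of a linear marker combination with a sigmoid-smoothed version of it. The function names, the bandwidth `sigma`, the coefficient vector `beta`, and the toy data are all assumptions made for illustration; they are not taken from the article, and the sketch evaluates the objective only, without the maximization or the bootstrap inference.

```python
import math


def _sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def combine(beta, x):
    """Linear combination of the marker values x with coefficients beta."""
    return sum(b * v for b, v in zip(beta, x))


def empirical_auc(beta, diseased, healthy):
    """Share of (diseased, healthy) pairs ranked correctly; ties count one half."""
    total = 0.0
    for d in diseased:
        for h in healthy:
            diff = combine(beta, d) - combine(beta, h)
            total += 1.0 if diff > 0 else (0.5 if diff == 0 else 0.0)
    return total / (len(diseased) * len(healthy))


def sigmoid_auc(beta, diseased, healthy, sigma=0.1):
    """Smooth surrogate: the indicator of diff > 0 is replaced by a sigmoid."""
    total = 0.0
    for d in diseased:
        for h in healthy:
            diff = combine(beta, d) - combine(beta, h)
            total += _sigmoid(diff / sigma)
    return total / (len(diseased) * len(healthy))


# Hypothetical toy data: two markers per subject.
diseased = [(2.0, 1.5), (1.8, 0.9), (2.5, 1.1)]
healthy = [(1.0, 1.0), (0.5, 1.4), (1.2, 0.2)]
beta = (1.0, 0.5)  # first coefficient fixed at 1 (an assumed normalization)

print(empirical_auc(beta, diseased, healthy))
print(sigmoid_auc(beta, diseased, healthy))
```

Because the smoothed objective is differentiable in `beta`, it can in principle be handed to a gradient-based optimizer, whereas the empirical AUC is a step function of the coefficients.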

2.
Cited by: 1 (self-citations: 0, citations by others: 1)
Bandos AI  Rockette HE  Song T  Gur D 《Biometrics》2009,65(1):247-256
Summary. Free-response assessment of diagnostic systems continues to gain acceptance in areas related to the detection, localization, and classification of one or more "abnormalities" within a subject. A free-response receiver operating characteristic (FROC) curve is a tool for characterizing the performance of a free-response system at all decision thresholds simultaneously. Although the importance of a single index summarizing the entire curve over all decision thresholds is well recognized in ROC analysis (e.g., area under the ROC curve), currently there is no widely accepted summary of a system being evaluated under the FROC paradigm. In this article, we propose a new index of the free-response performance at all decision thresholds simultaneously, and develop a nonparametric method for its analysis. Algebraically, the proposed summary index is the area under the empirical FROC curve penalized for the number of erroneous marks, rewarded for the fraction of detected abnormalities, and adjusted for the effect of the target size (or "acceptance radius"). Geometrically, the proposed index can be interpreted as a measure of average performance superiority over an artificial "guessing" free-response process and it represents an analogy to the area between the ROC curve and the "guessing" or diagonal line. We derive the ideal bootstrap estimator of the variance, which can be used for a resampling-free construction of asymptotic bootstrap confidence intervals and for sample size estimation using standard expressions. The proposed procedure is free from any parametric assumptions and does not require an assumption of independence of observations within a subject. We provide an example with a dataset sampled from a diagnostic imaging study and conduct simulations that demonstrate the appropriateness of the developed procedure for the considered sample sizes and ranges of parameters.

3.
In lightning-induced fire risk prediction models, the number of potential predictors is usually high, with some redundancy among them. It is therefore important to select the subset of predictors that yields the model with the greatest discrimination capacity. With this aim in mind, the logistic generalized linear model was used to estimate lightning-induced fire occurrence using a case study of the province of León (northwest Spain). A bootstrap-based test was used to obtain the optimal number of predictors and to select, for that number of predictors, the model with the largest area under the receiver operating characteristic curve. The results show that of the 16 variables initially considered, only three were necessary to obtain the model with the best discriminatory capacity for estimating lightning-induced fire occurrence. Moreover, this model can be considered equivalent to another nine alternative models with three covariates. Both the optimal and the equivalent models are useful in the spatially explicit assessment of fire risk, the planning and coordination of regional efforts to identify areas at greatest risk, and the design of long-term wildfire management strategies. The methodology used for this case study can be applied to other wildfire risk assessment situations where multiple and interconnected covariates are available.

4.
    
The receiver operating characteristic (ROC) curve is a tool commonly used to evaluate biomarker utility in clinical diagnosis of disease. Often, multiple biomarkers are developed to evaluate the discrimination for the same outcome. Levels of multiple biomarkers can be combined via best linear combination (BLC) such that their overall discriminatory ability is greater than any of them individually. Biomarker measurements frequently have undetectable levels below a detection limit, sometimes denoted as the limit of detection (LOD). Ignoring observations below the LOD or substituting some replacement value as a method of correction has been shown to lead to negatively biased estimates of the area under the ROC curve for some distributions of single biomarkers. In this paper, we develop asymptotically unbiased estimators, via the maximum likelihood technique, of the area under the ROC curve of the BLC of two bivariate normally distributed biomarkers affected by LODs. We also propose confidence intervals for this area under the curve. Point and confidence interval estimates are examined in a simulation study, recording bias and root mean square error for the former and coverage probability for the latter. An example using polychlorinated biphenyl (PCB) levels to classify women with and without endometriosis illustrates the potential benefits of our methods.
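For orientation, the quantity being estimated has a standard closed form when both groups are bivariate normal: the BLC coefficients are proportional to (Σ_D + Σ_H)^(-1)(μ_D − μ_H), and the resulting AUC is Φ(√(δ'(Σ_D + Σ_H)^(-1)δ)) with δ = μ_D − μ_H. The sketch below evaluates that formula for known parameters. It deliberately ignores the limit of detection and is not the maximum likelihood estimator described in the abstract; the function names and the parameter values are hypothetical.

```python
import math


def normal_cdf(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def blc_auc(mu_d, mu_h, cov_d, cov_h):
    """Coefficients and AUC of the best linear combination of two markers.

    mu_d, mu_h: mean vectors (length 2) for diseased and healthy subjects.
    cov_d, cov_h: 2x2 covariance matrices given as nested sequences.
    """
    delta = (mu_d[0] - mu_h[0], mu_d[1] - mu_h[1])
    # Entries of the summed covariance matrix S = cov_d + cov_h.
    a = cov_d[0][0] + cov_h[0][0]
    b = cov_d[0][1] + cov_h[0][1]
    c = cov_d[1][0] + cov_h[1][0]
    d = cov_d[1][1] + cov_h[1][1]
    det = a * d - b * c
    if det == 0:
        raise ValueError("summed covariance matrix is singular")
    # w = S^-1 delta, using the explicit inverse of a 2x2 matrix.
    w = ((d * delta[0] - b * delta[1]) / det,
         (-c * delta[0] + a * delta[1]) / det)
    quad = w[0] * delta[0] + w[1] * delta[1]  # delta' S^-1 delta
    return w, normal_cdf(math.sqrt(quad))


# Hypothetical parameters: unit-variance, uncorrelated markers.
identity = ((1.0, 0.0), (0.0, 1.0))
weights, auc_blc = blc_auc((1.0, 1.0), (0.0, 0.0), identity, identity)
auc_single = normal_cdf(1.0 / math.sqrt(2.0))  # either marker on its own

print(weights, auc_blc, auc_single)
```

In this toy setting the combination separates the groups better than either marker alone, which is the motivation the abstract gives for using a BLC.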

5.
6.
    
The receiver operating characteristic (ROC) curve is commonly used to evaluate and compare the accuracy of classification methods or markers. Estimating ROC curves has been an important problem in various fields including biometric recognition and diagnostic medicine. In real applications, classification markers are often developed under two or more ordered conditions, such that a natural stochastic ordering exists among the observations. Incorporating such a stochastic ordering into estimation can improve statistical efficiency (Davidov and Herman, 2012). In addition, clustered and correlated data arise when multiple measurements are gleaned from the same subject, making estimation of ROC curves complicated due to within-cluster correlations. In this article, we propose to model the ROC curve using a weighted empirical process to jointly account for the order constraint and within-cluster correlation structure. The algebraic properties of resulting summary statistics of the ROC curve such as its area and partial area are also studied. The algebraic expressions reduce to those of Davidov and Herman (2012) for independent observations. We derive asymptotic properties of the proposed order-restricted estimators and show that they have smaller mean-squared errors than the existing estimators. Simulation studies also demonstrate better performance of the newly proposed estimators over existing methods for finite samples. The proposed method is further exemplified with the fingerprint matching data from the National Institute of Standards and Technology Special Database 4.

7.
    
In this work, we extend our previous ligand shape-based virtual screening approach by using the scoring function Hamza–Wei–Zhan (HWZ) score and an enhanced molecular shape-density model for the ligands. The performance of the method has been tested against the 40 targets in the Database of Useful Decoys and compared with the performance of our previous HWZ score method. The virtual screening results using the novel ligand shape-based approach demonstrated a favorable improvement (area under the receiver operating characteristic curve AUC = 0.89 ± 0.02) and effectiveness (hit rate HR1% = 53.0% ± 6.3 and HR10% = 71.1% ± 4.9). The comparison of the overall performance of our ligand shape-based method with the highest-performing ligand shape-based virtual screening approach using the data fusion of multiple queries showed that our strategy takes fuller account of the chemical information of the set of active ligands. Furthermore, the results indicated that our method is suitable for virtual screening and yields higher prediction accuracy than the other study derived from the data fusion using five queries. Therefore, our novel ligand shape-based screening method constitutes a robust and efficient approach to the 3D similarity screening of small compounds and opens the door to a whole new approach to drug design by implementing the method in structure-based virtual screening.

8.
Combining diagnostic test results to increase accuracy   Cited by: 4 (self-citations: 0, citations by others: 4)
When multiple diagnostic tests are performed on an individual or multiple disease markers are available it may be possible to combine the information to diagnose disease. We consider how to choose linear combinations of markers in order to optimize diagnostic accuracy. The accuracy index to be maximized is the area or partial area under the receiver operating characteristic (ROC) curve. We propose a distribution-free rank-based approach for optimizing the area under the ROC curve and compare it with logistic regression and with classic linear discriminant analysis (LDA). It has been shown that the latter method optimizes the area under the ROC curve when test results have a multivariate normal distribution for diseased and non-diseased populations. Simulation studies suggest that the proposed non-parametric method is efficient when data are multivariate normal. The distribution-free method is generalized to a smooth distribution-free approach to: (i) accommodate some reasonable smoothness assumptions; (ii) incorporate covariate effects; and (iii) yield optimized partial areas under the ROC curve. This latter feature is particularly important since it allows one to focus on a region of the ROC curve which is of most relevance to clinical practice. Neither logistic regression nor LDA necessarily maximizes partial areas. The approaches are illustrated on two cancer datasets, one involving serum antigen markers for pancreatic cancer and the other involving longitudinal prostate specific antigen data.
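One simple way to picture a rank-based search for two markers is a grid search over a single coefficient, scoring each candidate combination by the Mann-Whitney estimate of the AUC. The sketch below does this over scores of the form y1 + alpha * y2 and alpha * y1 + y2 with alpha in [-1, 1]. The grid, the function names, and the toy data are assumptions for illustration; the abstract does not describe the search procedure, and the smooth extension, covariate effects, and partial areas are not covered.

```python
def mann_whitney_auc(scores_d, scores_h):
    """Rank-based AUC estimate; tied pairs count one half."""
    wins = sum((d > h) + 0.5 * (d == h) for d in scores_d for h in scores_h)
    return wins / (len(scores_d) * len(scores_h))


def best_combination(diseased, healthy, steps=201):
    """Grid search for the two-marker linear score with the largest AUC.

    Returns (auc, (w1, w2)) for the first grid point attaining the maximum.
    """
    best_auc, best_weights = -1.0, None
    for i in range(steps):
        alpha = -1.0 + 2.0 * i / (steps - 1)
        # Two parameterizations so that either marker may carry the larger weight.
        for w1, w2 in ((1.0, alpha), (alpha, 1.0)):
            scores_d = [w1 * y1 + w2 * y2 for y1, y2 in diseased]
            scores_h = [w1 * y1 + w2 * y2 for y1, y2 in healthy]
            value = mann_whitney_auc(scores_d, scores_h)
            if value > best_auc:
                best_auc, best_weights = value, (w1, w2)
    return best_auc, best_weights


# Hypothetical toy data: neither marker separates the groups on its own.
diseased = [(1.0, 3.0), (2.0, 2.0), (3.0, 1.0)]
healthy = [(1.0, 1.0), (0.0, 2.5), (2.5, 0.0)]

auc_first_marker = mann_whitney_auc([y1 for y1, _ in diseased],
                                    [y1 for y1, _ in healthy])
auc_combined, weights = best_combination(diseased, healthy)
print(auc_first_marker, auc_combined, weights)
```

Because only the ranks of the scores enter the criterion, the result is unchanged by any positive rescaling of the weights, which is why one coefficient can be fixed at 1.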

9.
    
Ding S  Zhang S  Li Y  Wang T 《Biochimie》2012,94(5):1166-1171
Knowledge of structural classes plays an important role in understanding protein folding patterns. In this paper, features based on the predicted secondary structure sequence and the corresponding E–H sequence are extracted. Then, an 11-dimensional feature vector is selected based on a wrapper feature selection algorithm and a support vector machine (SVM). Among the 11 selected features, 4 novel features are newly designed to model the differences between the α/β class and the α + β class, and the other 7 are rational features proposed by previous researchers. To examine the performance of our method, a total of 5 datasets are used to design and test the proposed method. The results show that the proposed method achieves prediction accuracies competitive with existing methods (SCPRED, RKS-PPSC and MODAS), and the 4 new features are shown to be essential for differentiating the α/β and α + β classes. A standalone version of the proposed method is written in Java and can be downloaded from http://web.xidian.edu.cn/slzhang/paper.html.

10.
    
Rizopoulos D 《Biometrics》2011,67(3):819-829
In longitudinal studies it is often of interest to investigate how a marker that is repeatedly measured in time is associated with a time to an event of interest. This type of research question has given rise to a rapidly developing field of biostatistics research that deals with the joint modeling of longitudinal and time-to-event data. In this article, we consider this modeling framework and focus particularly on the assessment of the predictive ability of the longitudinal marker for the time-to-event outcome. In particular, we start by presenting how survival probabilities can be estimated for future subjects based on their available longitudinal measurements and a fitted joint model. We then derive accuracy measures under the joint modeling framework and assess how well the marker is capable of discriminating between subjects who experience the event within a medically meaningful time frame and subjects who do not. We illustrate our proposals on a real data set on human immunodeficiency virus infected patients for which we are interested in predicting the time-to-death using their longitudinal CD4 cell count measurements.

11.
    
We present a method of data reduction using a wavelet transform in discriminant analysis when the number of variables is much greater than the number of observations. The method is illustrated with a prostate cancer study, where the sample size is 248, and the number of variables is 48,538 (generated using the ProteinChip technology). Using a discrete wavelet transform, the 48,538 data points are represented by 1271 wavelet coefficients. Information criteria identified 11 of the 1271 wavelet coefficients with the highest discriminatory power. The linear classifier with the 11 wavelet coefficients detected prostate cancer in a separate test set with a sensitivity of 97% and specificity of 100%.
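The general idea of representing a long signal by a small number of wavelet coefficients can be sketched as follows. The Haar wavelet, the magnitude-based selection rule, and the toy signal are assumptions chosen for brevity; the abstract does not state which wavelet was used, and it selects coefficients by information criteria rather than by magnitude.

```python
import math


def haar_dwt(signal):
    """Full orthonormal Haar decomposition; the length must be a power of two.

    Returns [approximation, coarsest detail, ..., finest details].
    """
    n = len(signal)
    if n == 0 or n & (n - 1):
        raise ValueError("signal length must be a power of two")
    approx = [float(v) for v in signal]
    details = []
    while len(approx) > 1:
        pairs = list(zip(approx[0::2], approx[1::2]))
        level_detail = [(a - b) / math.sqrt(2.0) for a, b in pairs]
        approx = [(a + b) / math.sqrt(2.0) for a, b in pairs]
        details = level_detail + details  # coarser levels go in front
    return approx + details


def keep_largest(coeffs, k):
    """Indices and values of the k coefficients with the largest magnitude."""
    ranked = sorted(enumerate(coeffs), key=lambda item: abs(item[1]), reverse=True)
    return ranked[:k]


# Hypothetical toy signal standing in for one spectrum.
signal = [4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0]
coeffs = haar_dwt(signal)
top = keep_largest(coeffs, 3)
print(coeffs)
print(top)
```

The transform is orthonormal, so the sum of squared coefficients equals the sum of squared signal values, and discarding small coefficients loses correspondingly little of the signal's energy.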

12.
The thermostability of proteins is particularly relevant for enzyme engineering. Developing a computational method to identify mesophilic proteins would be helpful for protein engineering and design. In this work, we developed a support vector machine based method to predict thermophilic proteins using the information of amino acid distribution and selected amino acid pairs. A reliable benchmark dataset including 915 thermophilic proteins and 793 non-thermophilic proteins was constructed for training and testing the proposed models. Results showed that 93.8% of thermophilic proteins and 92.7% of non-thermophilic proteins could be correctly predicted using jackknife cross-validation. The high prediction success rate indicates that this model can be applied to the design of stable proteins.

13.
We developed an accurate method to predict nucleosome positioning from genome sequences by refining the previously developed method of Peckham et al. (2007) [19]. Here, we used the relative fragment frequency index we developed and a support vector machine to screen for nucleosomal and linker DNA sequences. Our twofold cross-validation revealed that the accuracy of our method based on the area under the receiver operating characteristic curve was 81%, whereas that of Peckham’s method was 75% when both nucleosomal sequence data sets obtained from independent experiments were used for validation. We suggest that our method is more effective in predicting nucleosome positioning.

14.
15.
Development of glutamate non-competitive antagonists of mGluR1 (metabotropic glutamate receptor subtype 1) has attracted increasing attention in recent years due to their potential therapeutic application for various nervous disorders. Since there is no crystal structure reported for mGluR1, ligand-based virtual screening (VS) methods, typically pharmacophore-based VS (PB-VS), are often used for the discovery of mGluR1 antagonists. Nevertheless, PB-VS usually suffers from a lower hit rate and enrichment factor. In this investigation, we established a multistep ligand-based VS approach that is based on a support vector machine (SVM) classification model and a pharmacophore model. Performance evaluation of these methods in virtual screening against a large independent test set, M-MDDR, shows that the multistep VS approach significantly increases the hit rate and enrichment factor compared with the individual SB-VS and PB-VS methods. The multistep VS approach was then used to screen several large chemical libraries including PubChem, Specs, and Enamine. Finally, a total of 20 compounds were selected from the top-ranking compounds and passed on to subsequent in vitro and in vivo studies, the results of which will be reported in the near future.

16.
As one of the most common post-translational modifications, ubiquitination regulates the quantity and function of a variety of proteins. Experimental and clinical investigations have also suggested the crucial roles of ubiquitination in several human diseases. The complicated sequence context of human ubiquitination sites revealed by proteomic studies highlights the need to develop effective computational strategies to predict human ubiquitination sites. Here we report the establishment of a novel human-specific ubiquitination site predictor through the integration of multiple complementary classifiers. Firstly, a Support Vector Machine (SVM) classifier was constructed based on the composition of k-spaced amino acid pairs (CKSAAP) encoding, which has been utilized in our previous yeast ubiquitination site predictor. To further exploit the pattern and properties of the ubiquitination sites and their flanking residues, three additional SVM classifiers were constructed using the binary amino acid encoding, the AAindex physicochemical property encoding and the protein aggregation propensity encoding, respectively. Through an integration that relied on logistic regression, the resulting predictor, termed hCKSAAP_UbSite, achieved an area under the ROC curve (AUC) of 0.770 in a 5-fold cross-validation test on a class-balanced training dataset. When tested on a class-balanced independent testing dataset that contains 3419 ubiquitination sites, hCKSAAP_UbSite also achieved a robust performance with an AUC of 0.757. Specifically, it has consistently performed better than the predictor using the CKSAAP encoding alone and two other publicly available predictors which are not human-specific. Given its promising performance in our large-scale datasets, hCKSAAP_UbSite has been made publicly available at our server (http://protein.cau.edu.cn/cksaap_ubsite/).

17.
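This study investigates the risk features and classification prediction models for hepatitis B virus (HBV) reactivation in patients with primary liver cancer after precise radiotherapy. A feature selection method based on a genetic algorithm is proposed to select the optimal feature subset for HBV reactivation from the initial feature set of the primary liver cancer data. Bayesian and support vector machine classification prediction models for HBV reactivation are built and used to assess the classification performance of the optimal feature subset and of the initial feature set. The experimental results show that feature selection based on the genetic algorithm improves the classification performance for HBV reactivation, and that the classification performance of the optimal feature subset is markedly better than that of the initial feature set. The optimal feature subset affecting HBV reactivation comprises HBV DNA level, TNM tumor stage, Child-Pugh class, outer margin, and maximum dose to the whole liver. The classification accuracy reaches up to 82.89% for the Bayesian classifier and up to 83.34% for the support vector machine.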

18.
    

19.
Remote homology detection refers to the detection of structure homology in evolutionarily related proteins with low sequence similarity. Supervised learning algorithms such as support vector machine (SVM) are currently the most accurate methods. In most of these SVM-based methods, efforts have been dedicated to developing new kernels to better use the pairwise alignment scores or sequence profiles. Moreover, amino acids’ physicochemical properties are not generally used in the feature representation of protein sequences. In this article, we present a remote homology detection method that incorporates two novel features: (1) a protein's primary sequence is represented using amino acids' physicochemical properties and (2) the similarity between two proteins is measured using recurrence quantification analysis (RQA). An optimization scheme was developed to select different amino acid indices (up to 10 for a protein family) that best characterize the given protein family. The selected amino acid indices may enable us to draw a better biological explanation of the protein family classification problem than other alignment-based methods allow. An SVM-based classifier then works on the space described by the RQA metrics. The classification scheme is named SVM-RQA. Experiments at the superfamily level of the SCOP1.53 dataset show that, without using alignment or sequence profile information, the features generated from amino acid indices are able to produce results that are comparable to those obtained by the published state-of-the-art SVM kernels. In the future, better prediction accuracies can be expected by combining the alignment-based features with our amino acid property-based features. Supplementary information including the raw dataset, the best-performing amino acid indices for each protein family and the computed RQA metrics for all protein sequences can be downloaded from http://ym151113.ym.edu.tw/svm-rqa.

20.
Cancers are regarded as malignant proliferations of tumor cells present in many tissues and organs, which can severely curtail the quality of human life. The potential of using plasma DNA for cancer detection has been widely recognized, leading to the need to map the tissue-of-origin through the identification of somatic mutations. With cutting-edge technologies, such as next-generation sequencing, numerous somatic mutations have been identified, and the mutation signatures have been uncovered across different cancer types. However, somatic mutations are not independent events in carcinogenesis but exert functional effects. In this study, we applied a pan-cancer analysis to five types of cancers: (I) breast cancer (BRCA), (II) colorectal adenocarcinoma (COADREAD), (III) head and neck squamous cell carcinoma (HNSC), (IV) kidney renal clear cell carcinoma (KIRC), and (V) ovarian cancer (OV). Based on the mutated genes of patients suffering from one of the aforementioned cancer types, patients were encoded into a large number of numerical values based upon the enrichment theory of gene ontology (GO) terms and the Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways. We analyzed these features with the Monte-Carlo Feature Selection (MCFS) method, followed by the incremental feature selection (IFS) method to identify functional alteration features that could be used to build the support vector machine (SVM)-based classifier for distinguishing the five types of cancers. Our results showed that the optimal classifier with the selected 344 features had the highest Matthews correlation coefficient value of 0.523. Sixteen decision rules produced by the MCFS method can yield an overall accuracy of 0.498 for the classification of the five cancer types. Further analysis indicated that some of these features and rules were supported by previous experiments.
This study not only presents a new approach to mapping the tissue-of-origin for cancer detection but also unveils the specific functional alterations of each cancer type, providing insight into cancer-specific functional aberrations as potential therapeutic targets. This article is part of a Special Issue entitled: Accelerating Precision Medicine through Genetic and Genomic Big Data Analysis edited by Yudong Cai & Tao Huang.
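For readers unfamiliar with the Matthews correlation coefficient in a setting with more than two classes, the sketch below computes one common multiclass generalization from a confusion matrix. The abstract does not say which variant was used, so the formula choice, the function name, and the toy matrices are assumptions for illustration only.

```python
import math


def multiclass_mcc(confusion):
    """Multiclass Matthews correlation coefficient.

    confusion[i][j] is the number of samples of true class i predicted as class j.
    For two classes this reduces to the usual binary coefficient.
    """
    k = len(confusion)
    total = sum(sum(row) for row in confusion)               # all samples
    correct = sum(confusion[i][i] for i in range(k))          # diagonal
    true_counts = [sum(confusion[i]) for i in range(k)]       # row sums
    pred_counts = [sum(confusion[i][j] for i in range(k)) for j in range(k)]
    numerator = correct * total - sum(p * t for p, t in zip(pred_counts, true_counts))
    denominator = math.sqrt(
        (total * total - sum(p * p for p in pred_counts))
        * (total * total - sum(t * t for t in true_counts))
    )
    # Convention assumed here: return 0 when the coefficient is undefined.
    return numerator / denominator if denominator else 0.0


# Hypothetical confusion matrices.
perfect = [[5, 0, 0], [0, 4, 0], [0, 0, 3]]
binary = [[6, 1], [2, 3]]

print(multiclass_mcc(perfect))
print(multiclass_mcc(binary))
```

Unlike overall accuracy, this coefficient accounts for how the errors are distributed across classes, so it is less flattering to a classifier that mostly predicts the largest class.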
