Similar literature
1.
Advances in DNA microarray technologies have made gene expression profiles a significant candidate in identifying different types of cancers. Traditional learning-based cancer identification methods utilize labeled samples to train a classifier, but they are inconvenient for practical application because labels are quite expensive in the clinical cancer research community. This paper proposes a semi-supervised projective non-negative matrix factorization method (Semi-PNMF) to learn an effective classifier from both labeled and unlabeled samples, thus boosting subsequent cancer classification performance. In particular, Semi-PNMF jointly learns a non-negative subspace from concatenated labeled and unlabeled samples and indicates classes by the positions of the maximum entries of their coefficients. Because Semi-PNMF incorporates statistical information from the large volume of unlabeled samples in the learned subspace, it can learn more representative subspaces and boost classification performance. We developed a multiplicative update rule (MUR) to optimize Semi-PNMF and proved its convergence. The experimental results of cancer classification for two multiclass cancer gene expression profile datasets show that Semi-PNMF outperforms the representative methods.

2.
In 2020, there were 105 different statutory insurance companies in Germany with heterogeneous regional coverage. Obtaining data from all insurance companies is challenging, so projects will likely have to rely on data that do not cover the whole population. Consequently, the study of epidemic spread in hospital referral networks using data-driven models may be biased. We studied this bias using data from three German regional insurance companies covering four federal states: AOK (historically “general local health insurance company”, but currently only the abbreviation is used) Lower Saxony (in the Federal State of Lower Saxony), AOK Bavaria (in Bavaria), and AOK PLUS (in Thuringia and Saxony). To understand how incomplete data influence network characteristics and related epidemic simulations, we created sampled datasets by randomly dropping a proportion of patients from the full datasets, and then replaced the dropped patients with random copies of the remaining patients to obtain scale-up datasets of the original size. For the sampled and scale-up datasets, we calculated several commonly used network measures and compared them to those derived from the original data. We found that the network measures (degree, strength and closeness) were rather sensitive to incompleteness. Infection prevalence as an outcome from the applied susceptible-infectious-susceptible (SIS) model was fairly robust against incompleteness. At incompleteness levels as high as 90% of the original datasets, the prevalence estimation bias was below 5% in scale-up datasets. Consequently, a coverage as low as 10% of the federal state population was sufficient to maintain the relative bias in prevalence below 10% for a wide range of transmission parameters as encountered in clinical settings.
Our findings offer reassurance that, despite incomplete coverage of the population, German health insurance data can be used to study effects of patient traffic between institutions on the spread of pathogens within healthcare networks.
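The sampling and scale-up step described in this abstract can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' code: the function name `sample_and_scale_up`, the `drop_fraction` parameter, and the representation of patients as a plain list of identifiers are all hypothetical.

```python
import random


def sample_and_scale_up(patients, drop_fraction, seed=0):
    """Return (sampled, scaled_up) patient lists.

    sampled   : patients left after randomly dropping `drop_fraction` of them
    scaled_up : sampled patients plus random copies of them (drawn with
                replacement) so that the list has the original size again
    """
    if not 0.0 <= drop_fraction < 1.0:
        raise ValueError("drop_fraction must be in [0, 1)")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n_total = len(patients)
    n_keep = n_total - int(round(drop_fraction * n_total))
    if n_keep < 1:
        raise ValueError("at least one patient must remain after dropping")
    # Sampled dataset: a random subset without replacement.
    sampled = rng.sample(patients, n_keep)
    # Scale-up dataset: refill to the original size with random copies.
    copies = [rng.choice(sampled) for _ in range(n_total - n_keep)]
    return sampled, sampled + copies


if __name__ == "__main__":
    patient_ids = list(range(100))  # hypothetical patient identifiers
    sampled, scaled_up = sample_and_scale_up(patient_ids, drop_fraction=0.9)
    print(len(sampled), len(scaled_up))
```

In a full analysis, each patient identifier would carry that patient's hospital referral records, so copying a patient duplicates their transfers in the network; that part is omitted here.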

3.
Kim S, Kon M, Delisi C. Biology Direct, 2012, 7(1): 21-22
ABSTRACT: BACKGROUND: Molecular markers based on gene expression profiles have been used in experimental and clinical settings to distinguish cancerous tumors in stage, grade, survival time, metastasis, and drug sensitivity. However, most significant gene markers are unstable (not reproducible) among data sets. We introduce a standardized method for representing cancer markers as 2-level hierarchical feature vectors, with a basic gene level as well as a second level of (more stable) pathway markers, for the purpose of discriminating cancer subtypes. This extends standard gene expression arrays with new pathway-level activation features obtained directly from off-the-shelf gene set enrichment algorithms such as GSEA. Such so-called pathway-based expression arrays are significantly more reproducible across datasets. Such reproducibility will be important for clinical usefulness of genomic markers, and augment currently accepted cancer classification protocols. RESULTS: The present method produced more stable (reproducible) pathway-based markers for discriminating breast cancer metastasis and ovarian cancer survival time. Between two datasets for breast cancer metastasis, the intersection of standard significant gene biomarkers totaled 7.47% of selected genes, compared to 17.65% using pathway-based markers; the corresponding percentages for ovarian cancer datasets were 20.65% and 33.33% respectively. Three pathways, consisting of Type_1_diabetes mellitus, Cytokine-cytokine_receptor_interaction and Hedgehog_signaling (all previously implicated in cancer), are enriched in both the ovarian long survival and breast non-metastasis groups. In addition, integrating pathway and gene information, we identified five (ID4, ANXA4, CXCL9, MYLK, FBXL7) and six (SQLE, E2F1, PTTG1, TSTA3, BUB1B, MAD2L1) known cancer genes significant for ovarian and breast cancer respectively. 
CONCLUSIONS: Standardizing the analysis of genomic data in the process of cancer staging, classification and analysis is important as it has implications for both pre-clinical and clinical studies. The paradigm of diagnosis and prediction using pathway-based biomarkers as features can be an important part of the process of biomarker-based cancer analysis, and the resulting canonical (clinically reproducible) biomarkers can be important in standardizing genomic data. We expect that identification of such canonical biomarkers will improve clinical utility of high-throughput datasets for diagnostic and prognostic applications. REVIEWERS: This article was reviewed by John McDonald (nominated by I. King Jordon), Eugene Koonin, Nathan Bowen (nominated by I. King Jordon), and Ekaterina Kotelnikova (nominated by Mikhail Gelfand).

4.
Jing Qin, Yu Shen. Biometrics, 2010, 66(2): 382-392
Summary: Length‐biased time‐to‐event data are commonly encountered in applications ranging from epidemiological cohort studies or cancer prevention trials to studies of labor economy. A longstanding statistical problem is how to assess the association of risk factors with survival in the target population given the observed length‐biased data. In this article, we demonstrate how to estimate these effects under the semiparametric Cox proportional hazards model. In general, the structure of the Cox model is changed under length‐biased sampling. Although the existing partial likelihood approach for left‐truncated data can be used to estimate covariate effects, it may not be efficient for analyzing length‐biased data. We propose two estimating equation approaches for estimating the covariate coefficients under the Cox model. We use modern stochastic process and martingale theory to develop the asymptotic properties of the estimators. We evaluate the empirical performance and efficiency of the two methods through extensive simulation studies. We use data from a dementia study to illustrate the proposed methodology, and demonstrate the computational algorithms for point estimates, which can be directly linked to the existing functions in S‐PLUS or R.

5.
MOTIVATION: Cancer diagnosis is one of the most important emerging clinical applications of gene expression microarray technology. We are seeking to develop a computer system for powerful and reliable cancer diagnostic model creation based on microarray data. To keep a realistic perspective on clinical applications we focus on multicategory diagnosis. To equip the system with the optimum combination of classifier, gene selection and cross-validation methods, we performed a systematic and comprehensive evaluation of several major algorithms for multicategory classification, several gene selection methods, multiple ensemble classifier methods and two cross-validation designs using 11 datasets spanning 74 diagnostic categories, 41 cancer types and 12 normal tissue types. RESULTS: Multicategory support vector machines (MC-SVMs) are the most effective classifiers in performing accurate cancer diagnosis from gene expression data. The MC-SVM techniques by Crammer and Singer, Weston and Watkins and one-versus-rest were found to be the best methods in this domain. MC-SVMs outperform other popular machine learning algorithms, such as k-nearest neighbors, backpropagation and probabilistic neural networks, often to a remarkable degree. Gene selection techniques can significantly improve the classification performance of both MC-SVMs and other non-SVM learning algorithms. Ensemble classifiers do not generally improve performance of the best non-ensemble models. These results guided the construction of a software system GEMS (Gene Expression Model Selector) that automates high-quality model construction and enforces sound optimization and performance estimation procedures. This is the first such system to be informed by a rigorous comparative analysis of the available algorithms and datasets. AVAILABILITY: The software system GEMS is available for download from http://www.gems-system.org for non-commercial use. CONTACT: alexander.statnikov@vanderbilt.edu.

6.
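Molecular-level studies of pan-cancer have made great progress, but the molecular classification of cervical squamous cell carcinoma still requires further exploration. To find potential subtypes of cervical squamous cell carcinoma, this paper proposes a cancer subtype classification pipeline based on multi-omics data. Statistical methods were used to screen the molecules in four dimensions of The Cancer Genome Atlas (TCGA) cervical squamous cell carcinoma data: mRNA expression data, microRNA (miRNA) expression data, DNA methylation data and copy number variation data. The screened classification features were then integrated and clustered, the key classification features able to distinguish the subtypes were further selected, and these key features were used to build a classification model for cervical squamous cell carcinoma. This study provides an analysis pipeline for identifying molecular-level subtypes of cervical squamous cell carcinoma, obtains two subtypes with significantly different clinical survival, and identifies eight key classification features. The subtypes and key classification features identified here provide an important reference for the early classification of cervical squamous cell carcinoma and the identification of classification markers.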

7.
There is increasing evidence that long non‐coding RNAs (lncRNAs) are involved in tumour initiation and progression. Here, we analysed RNA‐sequencing data from the Cancer Genome Atlas (TCGA) datasets. In total, 1176 lncRNAs, 245 miRNAs and 2081 mRNAs were identified as differentially expressed (DE) in colon cancer tissues compared with normal tissues. CASC21, a novel lncRNA located in the 8q24.21 locus, was significantly overexpressed in 30 colon cancer tissues compared with matched normal tissues by qRT‐PCR assay. CASC21 expression tended to be higher with increasing tumour‐node‐metastasis (TNM) classification. Functionally, CASC21 promoted cell proliferation by regulating the cell cycle and enhanced tumour metastasis through epithelial‐mesenchymal transition (EMT) in colon cancer. A mechanism study indicated that CASC21 might be involved in activating the WNT/β‐catenin pathway in colon cancer. In addition, we built a competing endogenous RNA (ceRNA) network by bioinformatic analysis using TCGA datasets. Together, our results not only provide novel lncRNAs as potential candidates for further study but also prove that CASC21 is an oncogenic regulator through activating WNT/β‐catenin signalling in colon cancer.

8.
9.
We investigate the multiclass classification of cancer microarray samples. In contrast to classification of two cancer types from gene expression data, multiclass classification of more than two cancer types is a relatively hard and less studied problem. We used class-wise optimized genes with the corresponding one-versus-all support vector machine (OVA-SVM) classifier to maximize the utilization of selected genes. The final prediction was made using probability scores from all classifiers. We used three different methods of estimating probability from the decision value. Among the three probability methods, Platt's approach was more consistent, whereas the isotonic approach performed better for datasets with unequal proportions of samples in different classes. A probability-based decision not only gives a true and fair comparison between different one-versus-all (OVA) classifiers but also makes it possible to use them for any post analysis. Several ensemble experiments, an example of post analysis, of the three probability methods were implemented to study their effect in improving the classification accuracy. We observe that the ensemble did help in improving the predictive accuracy on cancer datasets, especially those involving unbalanced samples. A four-fold external stratified cross-validation experiment was performed on the six multiclass cancer datasets to obtain unbiased estimates of prediction accuracies. Analysis of class-wise frequently selected genes on two cancer datasets demonstrated that the approach was able to select important and relevant genes consistent with the literature. This study demonstrates successful implementation of the framework of class-wise feature selection and multiclass classification for prediction of cancer subtypes on six datasets.
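As a rough illustration of how probability scores let one-versus-all classifiers be compared on a common scale, the sketch below maps each classifier's decision value to a probability with the sigmoid form used in Platt's approach, p = 1 / (1 + exp(A·f + B)), and predicts the class with the highest probability. The function names, class labels and the (A, B) values are hypothetical; in practice A and B would be fitted per classifier on held-out data, which is not shown here.

```python
import math


def platt_probability(decision_value, a, b):
    """Platt-style sigmoid: map an SVM decision value f to a probability.

    a and b are assumed to have been fitted beforehand (a is typically
    negative, so larger decision values give larger probabilities).
    """
    return 1.0 / (1.0 + math.exp(a * decision_value + b))


def predict_class(decision_values, platt_params):
    """Pick the class whose one-versus-all classifier is most confident.

    decision_values : dict mapping class label -> raw decision value
    platt_params    : dict mapping class label -> (a, b) sigmoid parameters
    """
    probabilities = {
        label: platt_probability(value, *platt_params[label])
        for label, value in decision_values.items()
    }
    best_label = max(probabilities, key=probabilities.get)
    return best_label, probabilities


if __name__ == "__main__":
    # Hypothetical decision values for one sample from three OVA classifiers.
    values = {"subtype_A": 1.2, "subtype_B": -0.3, "subtype_C": 0.4}
    params = {label: (-2.0, 0.0) for label in values}  # illustrative only
    label, probs = predict_class(values, params)
    print(label, probs)
```

Because every score is on the same [0, 1] scale, the per-class probabilities can also be averaged across methods, which is the kind of post analysis (ensembling) the abstract mentions.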

10.
MOTIVATION: In the genomics setting, an increasingly common data configuration consists of a small set of sequences possessing a targeted property (positive instances) amongst a large set of sequences for which class membership is unknown (unlabeled instances). Traditional two-class classification methods do not effectively handle such data. RESULTS: Here, we develop a novel method, likely positive-iterative classification (LP-IC), for this problem, and contrast its performance with the few existing methods, most of which were devised and utilized in the text classification context. LP-IC employs an iterative classification scheme and introduces a class dispersion measure, adopted from unsupervised clustering approaches, to monitor the model selection process. Using two case studies (prediction of HLA binding, and alternative splicing conservation between human and mouse), we show that LP-IC provides superior performance to existing methodologies in terms of: (i) combined accuracy and precision in positive identification from the unlabeled set; and (ii) predictive performance of the resultant classifiers on independent test data.

11.
Microarray techniques provide new insights into molecular classification of cancer types, which is critical for cancer treatments and diagnosis. Recently, an increasing number of supervised machine learning methods have been applied to cancer classification problems using gene expression data. Support vector machines (SVMs), in particular, have become one of the most effective and leading methods. However, there exist few studies on the application of other kernel methods in the literature. We apply a kernel subspace (KS) method to multiclass cancer classification problems, and assess its validity by comparing it with multiclass SVMs. Our comparative study using seven multiclass cancer datasets demonstrates that the KS method has high performance that is comparable to multiclass SVMs. Furthermore, we propose an effective criterion for kernel parameter selection, which is shown to be useful for the computation of the KS method.

12.
Pre-clinical studies provide compelling evidence that Eph family receptor tyrosine kinases (RTKs) and ligands promote cancer growth, neovascularization, invasion, and metastasis. Tumor suppressive roles have also been reported for the receptors, however, creating a potential barrier for clinical application. Determining how these observations relate to clinical outcome is a crucial step for translating the biological and mechanistic data into new molecularly targeted therapies. We investigated eph and ephrin expression in human breast cancer relative to endpoints of overall and/or recurrence-free survival in large microarray datasets. We also investigated protein expression in commercial human breast tissue microarrays (TMA) and Stage I prognostic TMAs linked to recurrence outcome data. We found significant correlations between ephA2, ephA4, ephA7, ephB4, and ephB6 and overall and/or recurrence-free survival in large microarray datasets. Protein expression in TMAs supported these trends. While we observed no correlation between ephrin ligand expression and clinical outcome in microarray datasets, ephrin-A1 and EphA2 protein co-expression was significantly associated with recurrence in Stage I prognostic breast cancer TMAs. Our data suggest that several Eph family members are clinically relevant and tractable targets for intervention in human breast cancer. Moreover, profiling Eph receptor expression patterns in the context of relevant ligands and in the context of stage may be valuable in terms of diagnostics and treatment.

13.

Aim

Species distribution data play a pivotal role in the study of ecology, evolution, biogeography and biodiversity conservation. Although large amounts of location data are available and accessible from public databases, data quality remains problematic. Of the potential sources of error, positional errors are critical for spatial applications, particularly where these errors place observations beyond the environmental or geographical range of species. These outliers need to be identified, checked and removed to improve data quality and minimize the impact on subsequent analyses. Manually checking all species records within large multispecies datasets is prohibitively costly. This work investigates algorithms that may assist in the efficient vetting of outliers in such large datasets.

Location

We used real, spatially explicit environmental data derived from the western part of Victoria, Australia, and simulated species distributions within this same region.

Methods

By adapting species distribution modelling (SDM), we developed a pseudo‐SDM approach for detecting outliers in species distribution data, which was implemented with random forest (RF) and support vector machine (SVM) resulting in two new methods: RF_pdSDM and SVM_pdSDM. Using virtual species, we compared eight existing multivariate outlier detection methods with these two new methods under various conditions.

Results

The two new methods based on the pseudo‐SDM approach had higher true skill statistic (TSS) values than other approaches, with TSS values always exceeding 0. More than 70% of the true outliers in datasets for species with a low and intermediate prevalence can be identified by checking 10% of the data points with the highest outlier scores.

Main conclusions

Pseudo‐SDM‐based methods were more effective than other outlier detection methods. However, this outlier detection procedure can only be considered as a screening tool, and putative outliers must be examined by experts to determine whether they are actual errors or important records within an inherently biased set of data.
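For readers unfamiliar with the true skill statistic reported in the Results, a minimal sketch of its usual definition (sensitivity + specificity − 1) is given below. The function name and the boolean-list inputs are hypothetical choices for illustration; how the study thresholded outlier scores into flags is not specified in the abstract.

```python
def true_skill_statistic(is_true_outlier, is_flagged):
    """Compute TSS = sensitivity + specificity - 1.

    is_true_outlier : list of bools, True where the record really is an outlier
    is_flagged      : list of bools, True where the method flagged the record
    TSS ranges from -1 to 1; 0 means no better than random flagging.
    """
    if len(is_true_outlier) != len(is_flagged):
        raise ValueError("inputs must have the same length")
    tp = fn = tn = fp = 0
    for truth, flagged in zip(is_true_outlier, is_flagged):
        if truth and flagged:
            tp += 1  # outlier correctly flagged
        elif truth and not flagged:
            fn += 1  # outlier missed
        elif not truth and not flagged:
            tn += 1  # valid record correctly left alone
        else:
            fp += 1  # valid record wrongly flagged
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one outlier and one non-outlier")
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity + specificity - 1.0


if __name__ == "__main__":
    truth = [True, True, False, False, False, False]
    flags = [True, False, False, False, False, True]
    print(true_skill_statistic(truth, flags))
```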

14.
Microarray technology is becoming a powerful tool for clinical diagnosis, as it has the potential to discover gene expression patterns that are characteristic of a particular disease. To date, this possibility has received much attention in the context of cancer research, especially in tumor classification. However, most published articles have concentrated on the development of binary classification methods while neglecting ubiquitous multiclass problems. Unfortunately, the few existing multiclass classification approaches have had poor predictive accuracy. In an effort to improve classification accuracy, we developed a novel multiclass microarray data classification method. First, we applied a "one versus rest-support vector machine" to classify the samples. Then the classification confidence of each testing sample was evaluated according to its distribution in feature space, and samples with poor confidence were extracted. Next, a novel strategy, which we named the "class priority estimation method based on centroid distance", was used to decide the categories of those poor-confidence samples. This approach was tested on seven benchmark multiclass microarray datasets, with encouraging results, demonstrating effectiveness and feasibility.

15.
Anomaly detection is the process of identifying unexpected items or events in datasets, which differ from the norm. In contrast to standard classification tasks, anomaly detection is often applied to unlabeled data, taking only the internal structure of the dataset into account. This challenge is known as unsupervised anomaly detection and is addressed in many practical applications, for example in network intrusion detection, fraud detection, and the life science and medical domains. Dozens of algorithms have been proposed in this area, but unfortunately the research community still lacks a comparative universal evaluation as well as common publicly available datasets. These shortcomings are addressed in this study, where 19 different unsupervised anomaly detection algorithms are evaluated on 10 different datasets from multiple application domains. By publishing the source code and the datasets, this paper aims to be a new well-founded basis for unsupervised anomaly detection research. Additionally, this evaluation reveals the strengths and weaknesses of the different approaches for the first time. Besides the anomaly detection performance, the computational effort, the impact of parameter settings, and the global/local anomaly detection behavior are outlined. In conclusion, we give advice on algorithm selection for typical real-world tasks.

16.
The Maximal Margin (MAMA) linear programming classification algorithm has recently been proposed and tested for cancer classification based on expression data. It demonstrated sound performance on publicly available expression datasets. We developed a web interface to allow potential users easy access to the MAMA classification tool. Basic and advanced options provide flexibility in exploitation. The input data format is the same as that used in most publicly available datasets. This makes the web resource particularly convenient for non-expert machine learning users working in the field of expression data analysis.

17.
In clinical study reports (CSRs), adverse events (AEs) are commonly summarized using the incidence proportion (IP). IPs can be calculated for all types of AEs and are often interpreted as the probability that a treated patient experiences specific AEs. Exposure time can be taken into account with time-to-event methods. Using one minus Kaplan–Meier (1-KM) is known to overestimate the AE probability in the presence of competing events (CEs). The use of a nonparametric estimator of the cumulative incidence function (CIF) has therefore been advocated as more appropriate. In this paper, we compare different methods to estimate the probability of one selected AE. In particular, we investigate whether the proposed methods provide a reasonable estimate of the AE probability at an interim analysis (IA). The characteristics of the methods in the presence of a CE are illustrated using data from a breast cancer study and we quantify the potential bias in a simulation study. At the final analysis performed for the CSR, 1-KM systematically overestimates and in most cases IP slightly underestimates the given AE probability. CIF has the lowest bias in most simulation scenarios. All methods might lead to biased estimates at the IA except for AEs with early onset. The magnitude of the bias varies with the time-to-AE and/or CE occurrence, the selection of event-specific hazards and the amount of censoring. In general, reporting AE probabilities for prespecified fixed time points is recommended.
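The three estimators compared in this abstract can be illustrated on toy data as below. This is a sketch under stated assumptions rather than the study's analysis code: the function name, the `(time, status)` record layout with status codes 0 (censored), 1 (AE) and 2 (competing event), and the example data are all hypothetical. The CIF is computed with the standard nonparametric (Aalen-Johansen type) estimator, and 1-KM treats competing events as censoring.

```python
def ae_probability_estimates(records, horizon):
    """Estimate the AE probability up to `horizon` in three ways.

    records : list of (time, status) with status 0 = censored,
              1 = adverse event (AE), 2 = competing event (CE)
    Returns (incidence_proportion, one_minus_km, cumulative_incidence).
    """
    n = len(records)
    # Incidence proportion: patients with an observed AE over all patients.
    ip = sum(1 for t, s in records if s == 1 and t <= horizon) / n

    event_times = sorted({t for t, s in records if s != 0 and t <= horizon})
    km_ae_free = 1.0      # Kaplan-Meier for AE, with CEs treated as censored
    event_free = 1.0      # probability of no event of any type just before t
    cif = 0.0             # cumulative incidence of the AE
    for t in event_times:
        at_risk = sum(1 for u, _ in records if u >= t)
        d_ae = sum(1 for u, s in records if u == t and s == 1)
        d_ce = sum(1 for u, s in records if u == t and s == 2)
        # CIF increment: still event-free just before t, then AE at t.
        cif += event_free * d_ae / at_risk
        km_ae_free *= 1.0 - d_ae / at_risk
        event_free *= 1.0 - (d_ae + d_ce) / at_risk
    return ip, 1.0 - km_ae_free, cif


if __name__ == "__main__":
    # Hypothetical follow-up data: (time, status).
    data = [(1, 1), (2, 2), (3, 1), (4, 0), (5, 2), (6, 1)]
    ip, one_minus_km, cif = ae_probability_estimates(data, horizon=6)
    print(ip, one_minus_km, cif)
```

On data with competing events, 1-KM is never smaller than the CIF, because it behaves as if patients removed by a competing event could still go on to experience the AE.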

18.
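Objective: To investigate the expression of cyclooxygenase-2 (COX-2) and Ki-67 in prostate cancer, the infection rate of L-form Mycobacterium tuberculosis, and their clinical significance. Methods: Immunohistochemistry, in situ hybridization and acid-fast staining were used to detect COX-2 and Ki-67 protein and mRNA expression, and the detection rate of L-form Mycobacterium tuberculosis, in 65 cases of carcinoma of prostate (PCa) and 30 cases of benign prostatic hyperplasia (BPH). The main clinical data and pathological grading parameters of the prostate tumours were compared, and the χ² test was used for statistical analysis. Results: Positive expression of COX-2 and Ki-67 protein and mRNA, and the detection rate of L-form Mycobacterium tuberculosis, were significantly higher in PCa than in BPH (P < 0.001 to 0.05). Positive expression of COX-2 and Ki-67 protein and mRNA and the L-form detection rate differed significantly with the clinical stage and pathological grade of PCa (P < 0.01 to 0.05). The positive expression rates of COX-2 and Ki-67 protein and mRNA were significantly higher in the lymph node metastasis group than in the non-metastasis group (P < 0.01). The L-form detection rate was significantly higher in the lymph node metastasis group than in the non-metastasis group (P < 0.05). Conclusion: The varying degrees of abnormal expression of COX-2 and Ki-67 protein and mRNA in prostate tumours, and the L-form detection rate, are positively correlated with clinical stage, pathological grade and metastasis, suggesting that both genes can serve as reference indicators of the biological behaviour of PCa and of patient prognosis. L-form Mycobacterium tuberculosis infection is very likely to cause gene mutation or overexpression and to become one of the factors inducing tumours; these factors may act synergistically in tumorigenesis.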

19.

Background  

Microarray experiments are becoming a powerful tool for clinical diagnosis, as they have the potential to discover gene expression patterns that are characteristic of a particular disease. To date, this problem has received most attention in the context of cancer research, especially in tumor classification. Various feature selection methods and classifier design strategies have also been widely used and compared. However, most published articles on tumor classification have applied a certain technique to a certain dataset, and only recently have several researchers compared these techniques on several public datasets. It has nevertheless been verified that differently selected features reflect different aspects of the dataset, and that some selected features can yield better solutions on certain problems. At the same time, faced with a large amount of microarray data and little prior knowledge, it is difficult to find the intrinsic characteristics using traditional methods. In this paper, we attempt to introduce a combinational feature selection method in conjunction with ensemble neural networks to generally improve the accuracy and robustness of sample classification.

20.
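Objective: To investigate the expression of CyclinD1, CDK4 and P16 in prostate cancer, the infection rate of L-form Mycobacterium tuberculosis, and their clinical significance. Methods: Immunohistochemistry and acid-fast staining were used to detect the expression of CyclinD1, CDK4 and P16, and the detection rate of L-form Mycobacterium tuberculosis, in 65 cases of carcinoma of prostate (PCa) and 30 cases of benign prostatic hyperplasia (BPH). The main clinical data and pathological grading parameters of the prostate tumours were compared, and the χ² test was used for statistical analysis. Results: Positive expression of CyclinD1 and CDK4 was significantly higher in PCa than in BPH (P < 0.01), and was positively correlated with the clinical stage, pathological grade and lymph node metastasis of PCa, with statistically significant differences (P < 0.01 to 0.05). Positive expression of P16 was significantly higher in BPH than in PCa (P < 0.01), and was negatively correlated with the clinical stage, pathological grade and lymph node metastasis of PCa, with statistically significant differences (P < 0.01 to 0.05). The L-form detection rate was significantly higher in PCa than in BPH, and differed significantly with the clinical stage, pathological grade and lymph node metastasis of PCa (P < 0.05). Conclusion: The varying degrees of abnormal expression of CyclinD1, CDK4 and P16 in prostate tumours, and the L-form detection rate, are related to clinical stage, pathological grade and metastasis. Positive expression of CyclinD1, CDK4 and P16 and L-form Mycobacterium tuberculosis infection may therefore act synergistically in the development and progression of PCa, and studying them has important clinical value.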
