Similar articles
 Found 20 similar articles (search time: 281 ms)
1.
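In classification studies of mass spectrometry-based proteomics data, dimensionality reduction lowers the number of variables by extracting features, which helps machine learning methods classify accurately and efficiently. To study and compare the performance of classification models that combine machine learning classifiers with dimensionality reduction on proteomics data, and to provide a reference for related classification studies, linear discriminant analysis (LDA), k-nearest neighbors, decision trees, support vector machines (SVM) and artificial neural networks (ANN) were combined with principal component analysis (PCA) and partial least squares (PLS) dimensionality reduction and applied to the classification of public mass spectrometry data. Among the combined classification models used in this paper, PLS-LDA, PLS-SVM and PLS-ANN achieved the highest classification accuracy. To further improve classification, an expert classification system was built from these three best combined methods using majority voting. In 10-fold cross-validation, the majority voting model reached 100% classification accuracy using only the first 5 principal components. The methods and conclusions of this study provide a useful reference for the analysis of mass spectra and other types of fingerprint spectra.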
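The majority voting step described in this abstract can be illustrated with a minimal sketch. It assumes each component classifier has already produced one label per sample; the function name `majority_vote`, the label strings and the tie-breaking rule are illustrative assumptions, not details taken from the paper.

```python
from collections import Counter

def majority_vote(predictions):
    """Combine per-classifier label lists into one label list by majority vote.

    predictions: list of equal-length label lists, one list per classifier.
    Ties go to the tied label that appears first in classifier order
    (an assumption made here; the abstract does not specify a tie rule).
    """
    n_samples = len(predictions[0])
    if any(len(p) != n_samples for p in predictions):
        raise ValueError("all classifiers must predict the same number of samples")
    combined = []
    for sample_votes in zip(*predictions):
        counts = Counter(sample_votes)
        top = max(counts.values())
        # Walk the votes in classifier order so ties resolve deterministically.
        winner = next(v for v in sample_votes if counts[v] == top)
        combined.append(winner)
    return combined

# Hypothetical outputs of three component classifiers on four samples.
pls_lda = ["cancer", "normal", "cancer", "normal"]
pls_svm = ["cancer", "cancer", "cancer", "normal"]
pls_ann = ["normal", "normal", "cancer", "normal"]
print(majority_vote([pls_lda, pls_svm, pls_ann]))
# ['cancer', 'normal', 'cancer', 'normal']
```

With an odd number of binary classifiers, as here, no ties can occur, which is one plausible reason for voting over exactly three models.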

2.
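Feature selection is widely used in bioinformatics. Here, partial least squares (PLS) is applied to feature selection by repeatedly extracting principal components with PLS and successively selecting the genes that carry large weights in those components. The method was used to select feature genes from tumor gene expression profile data, and the selected genes were then used for classification; classification with 8 feature genes reached an accuracy of 92.5%.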
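As a rough illustration of weight-based gene selection, the sketch below computes the weight vector of the first PLS component for a single response, w = Xᵀy / ‖Xᵀy‖ after mean-centring, and keeps the genes with the largest absolute weights. It covers only one component and one selection pass; the repeated extraction described in the abstract, the function names and the toy data are assumptions made here.

```python
import math

def first_pls_weights(X, y):
    """Weight vector of the first PLS component for a single response.

    X: list of samples, each a list of gene expression values.
    y: list of numeric class codes (for example 0/1).
    Genes (columns) and y are mean-centred before computing X^T y.
    """
    n, p = len(X), len(X[0])
    col_means = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    w = [
        sum((X[i][j] - col_means[j]) * (y[i] - y_mean) for i in range(n))
        for j in range(p)
    ]
    norm = math.sqrt(sum(v * v for v in w))
    if norm == 0:
        raise ValueError("X^T y is zero; the weights are undefined")
    return [v / norm for v in w]

def top_genes(X, y, k):
    """Indices of the k genes with the largest absolute weight."""
    w = first_pls_weights(X, y)
    return sorted(range(len(w)), key=lambda j: abs(w[j]), reverse=True)[:k]

# Toy data (hypothetical): 4 samples x 3 genes, labels coded 0/1.
X = [[1.0, 5.0, 0.2],
     [1.2, 5.1, 0.1],
     [3.0, 5.0, 0.3],
     [3.1, 4.9, 0.2]]
y = [0, 0, 1, 1]
print(top_genes(X, y, 1))  # gene 0 separates the two classes most strongly
```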

3.
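Random forests: an important method for selecting tumor feature genes   Total citations: 2, self-citations: 0, citations by others: 2
Feature selection has been widely applied in bioinformatics, and random forests (RF) are an important feature selection method. RF was used to select feature genes from 5 gene expression profile data sets, including gastric, colon and lung cancer; the selected genes were combined with a support vector machine (SVM) to classify the original data sets, and the gene selection and classification results were given a preliminary analysis. RF was also compared with significance analysis of microarrays (SAM) and ReliefF; the results show that the feature genes selected by RF carry more classification information and yield higher classification accuracy. Together with the method's own strengths as a classifier, this makes random forests a reliable tool for analyzing gene expression profile data.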

4.
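Liver cancer is one of the most common malignant tumors in China. The analysis of tumor gene expression profile data is a current research focus and is of great importance for the early diagnosis and treatment of cancer. To address the severe collinearity among variables and the nonlinear relationship between the class variable and the predictors that appear in high-dimensional, small-sample gene expression profile data, a new partial least squares regression technique based on spline transformation was adopted. First, redundant information is removed from the gene expression profile data by filtering; next, a cubic B-spline transformation produces a linearized reconstruction of the nonlinear gene expression profile data; the reconstructed matrix is then passed to partial least squares to build a model relating the class variable to the predictors. Finally, analysis of liver cancer gene expression profile data shows that this classification model reconstructs the data robustly, effectively resolves overfitting and collinearity among variables in high-dimensional, small-sample gene expression profile data, and achieves high fitting and classification accuracy.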

5.
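Visualization of gene expression profile data based on manifold learning   Total citations: 2, self-citations: 0, citations by others: 2
Visualizing gene expression profiles is essentially a dimensionality reduction problem for high-dimensional data. Manifold learning algorithms are used to reduce gene expression profiles for visualization, and the suitability of typical manifold learning algorithms (Isomap and LLE) for expression profile dimensionality reduction is discussed. The quality of the reduction is evaluated quantitatively through within-class/between-class distances. Dimensionality reduction analysis of two typical gene chip data sets (a colon cancer and an acute leukemia gene expression profile data set) finds that the intrinsic dimension of both data sets is below 3, so both can be visualized in a low-dimensional projection space using manifold learning. Compared with the projections of traditional dimensionality reduction methods (such as PCA and MDS), the Isomap manifold learning method gives better visualization.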
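One way to make a within-class/between-class distance criterion concrete is sketched below: the mean pairwise distance between same-class points divided by the mean pairwise distance between different-class points, where a smaller ratio means tighter, better separated classes. The abstract does not give its exact formula, so this particular ratio, the function name and the toy coordinates are assumptions.

```python
import math
from itertools import combinations

def within_between_ratio(points, labels):
    """Mean within-class distance divided by mean between-class distance.

    points: low-dimensional coordinates, one tuple per sample.
    labels: class label of each sample.
    """
    within, between = [], []
    for (p, lp), (q, lq) in combinations(zip(points, labels), 2):
        d = math.dist(p, q)  # Euclidean distance (Python 3.8+)
        (within if lp == lq else between).append(d)
    if not within or not between:
        raise ValueError("need at least one same-class and one cross-class pair")
    return (sum(within) / len(within)) / (sum(between) / len(between))

# Hypothetical 2-D embedding of four samples.
embedding = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0)]
good = within_between_ratio(embedding, ["a", "a", "b", "b"])   # classes form two clusters
mixed = within_between_ratio(embedding, ["a", "b", "a", "b"])  # classes are interleaved
print(good < mixed)  # True
```

Computing the same ratio on the outputs of different reduction methods gives a single number per method to compare.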

6.
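夏遥 (Xia Yao), 孔薇 (Kong Wei). 《生物磁学》, 2011, (Z1): 4742-4747
Objective: To study new and effective methods for preprocessing microarray gene expression data, based on Alzheimer's disease microarray gene expression data. Methods: The microarray gene expression data were first preprocessed with standard deviation filtering, FSC (feature score criterion) and WPT-SAM (wavelet packet transform combined with significance analysis of microarrays), and the number of genes retained and the FDR values after processing were compared; classification and clustering methods were then applied to the processed data for classification clustering and hierarchical decision clustering, and the clustering results were compared. Results: Standard deviation filtering and FSC retained more genes at the initial screening stage than WPT-SAM, but their FDR values were also higher and their subsequent clustering results were worse than those of WPT-SAM. Conclusion: WPT-SAM is a flexible and well-suited method for preprocessing microarray gene expression data.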

7.
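Objective: To study new and effective methods for preprocessing microarray gene expression data, based on Alzheimer's disease microarray gene expression data. Methods: The microarray gene expression data were first preprocessed with standard deviation filtering, FSC (feature score criterion) and WPT-SAM (wavelet packet transform combined with significance analysis of microarrays), and the number of genes retained and the FDR values after processing were compared; classification and clustering methods were then applied to the processed data for classification clustering and hierarchical decision clustering, and the clustering results were compared. Results: Standard deviation filtering and FSC retained more genes at the initial screening stage than WPT-SAM, but their FDR values were also higher and their subsequent clustering results were worse than those of WPT-SAM. Conclusion: WPT-SAM is a flexible and well-suited method for preprocessing microarray gene expression data.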

8.
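Identifying gastric cancer from autofluorescence spectra by PLS-ANN discriminant analysis   Total citations: 4, self-citations: 2, citations by others: 2
Autofluorescence spectra excited at 308 nm were measured on cancerous and normal serosa from ex vivo specimens of 58 gastric cancer patients, and multivariate analysis was used to extract spectral information in order to identify gastric cancer. Discriminant analysis with partial least squares combined with an artificial neural network (PLS-ANN) diagnosed gastric cancer with a sensitivity of 86%, a specificity of 100% and an accuracy of 93%, and is a promising method for rapidly identifying the extent of gastric cancer infiltration in the stomach wall during surgery.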
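For readers less familiar with the three reported figures, the sketch below shows how sensitivity, specificity and accuracy follow from confusion-matrix counts. The counts are hypothetical: they assume one cancerous and one normal spectrum per patient (58 of each) and were chosen only because they round to the reported percentages; the abstract does not state the actual counts.

```python
def diagnostic_metrics(tp, fn, tn, fp):
    """Return (sensitivity, specificity, accuracy) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)          # fraction of cancer samples detected
    specificity = tn / (tn + fp)          # fraction of normal samples cleared
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return sensitivity, specificity, accuracy

# Hypothetical counts: 58 cancerous and 58 normal serosa spectra.
sens, spec, acc = diagnostic_metrics(tp=50, fn=8, tn=58, fp=0)
print(f"{sens:.0%} {spec:.0%} {acc:.0%}")  # 86% 100% 93%
```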

9.
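A study on discriminating rice classes based on Raman spectroscopy and chemometrics   Total citations: 2, self-citations: 0, citations by others: 2
Raman spectroscopy and chemometric methods are used to build a rapid classification model for distinguishing rice. The fluorescence background is removed by least-squares polynomial fitting of the discrete Raman spectra; on that basis, large Raman peaks are removed and the noise level is estimated in the first iteration, while the data dimension is kept below 50% of the original, yielding an accurate Raman signal. Principal component analysis (PCA) is then used to reduce the dimension of the full-band Raman spectra of 3 kinds of rice, and linear discriminant analysis (LDA) is used to classify the samples. The first two principal components give 93.8% correct classification and the first three give 97.9%. The optimized model performs well in the discriminant analysis of rice.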

10.
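Given the clinical and biomedical importance of classifying NSCLC (non-small cell lung cancer) subtypes, genome-wide microarray data on gene expression (GE) and methylation (ME) levels are used to identify genome-wide feature genes for NSCLC subtype classification. To cope with the high noise and the ultra-high-dimensional, small-sample nature of genome-wide microarray data, an elastic orthogonal Bayesian algorithm is used to screen genome-wide genes recursively and identify the feature gene set with the best classification accuracy. Using 490 gene expression samples and 378 methylation samples from TCGA as an example, 52 GE feature genes and 25 ME feature genes were identified, with classification accuracies of 99% and 98%, respectively. A multivariate Cox model built from the feature genes and clinical data demonstrates the importance of the feature genes in patient survival analysis: the corresponding gene expression and methylation data alone suffice to classify patient samples correctly as "high/low risk", with significance levels below 0.05 in both cases. The close relationship between the metabolic pathways in which the feature genes participate and important metabolic pathways of cancer classification and development such as p53, TGF-beta and Wnt further confirms the importance of the feature genes for NSCLC classification.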

11.
MOTIVATION: One important application of gene expression microarray data is classification of samples into categories, such as the type of tumor. The use of microarrays allows simultaneous monitoring of thousands of gene expressions per sample. This ability to measure gene expression en masse has resulted in data with the number of variables p (genes) far exceeding the number of samples N. Standard statistical methodologies in classification and prediction do not work well or even at all when N < p. Modification of existing statistical methodologies or development of new methodologies is needed for the analysis of microarray data. RESULTS: We propose a novel analysis procedure for classifying (predicting) human tumor samples based on microarray gene expressions. This procedure involves dimension reduction using Partial Least Squares (PLS) and classification using Logistic Discrimination (LD) and Quadratic Discriminant Analysis (QDA). We compare PLS to the well known dimension reduction method of Principal Components Analysis (PCA). Under many circumstances PLS proves superior; we illustrate a condition when PCA particularly fails to predict well relative to PLS. The proposed methods were applied to five different microarray data sets involving various human tumor samples: (1) normal versus ovarian tumor; (2) Acute Myeloid Leukemia (AML) versus Acute Lymphoblastic Leukemia (ALL); (3) Diffuse Large B-cell Lymphoma (DLBCLL) versus B-cell Chronic Lymphocytic Leukemia (BCLL); (4) normal versus colon tumor; and (5) Non-Small-Cell-Lung-Carcinoma (NSCLC) versus renal samples. The stability of the classification results and methods was further assessed by re-randomization studies.

12.
Microarray gene expression data usually have a large number of dimensions, e.g., over ten thousand genes, and a small number of samples, e.g., a few tens of patients. In this paper, we use the support vector machine (SVM) for cancer classification with microarray data. Dimensionality reduction methods, such as principal components analysis (PCA), class-separability measure, Fisher ratio, and t-test, are used for gene selection. A voting scheme is then employed to do multi-group classification by k(k - 1) binary SVMs. We are able to obtain the same classification accuracy but with far fewer features compared to other published results.
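Of the gene selection criteria listed in this abstract, the Fisher ratio is easy to sketch. The version below scores one gene across two classes as (mean_a - mean_b)² / (var_a + var_b) using population variances; this is one common form, and the paper may define it differently. The function names and the toy data are hypothetical.

```python
from statistics import mean, pvariance

def fisher_ratio(a, b):
    """Fisher ratio of one gene: squared mean difference over summed variances.

    a, b: expression values of the gene in class A and class B.
    """
    denom = pvariance(a) + pvariance(b)
    if denom == 0:
        # Constant within both classes: perfectly separating or useless.
        return float("inf") if mean(a) != mean(b) else 0.0
    return (mean(a) - mean(b)) ** 2 / denom

def rank_genes(class_a, class_b):
    """Gene indices sorted from most to least discriminative."""
    n_genes = len(class_a[0])
    scores = [
        fisher_ratio([row[j] for row in class_a], [row[j] for row in class_b])
        for j in range(n_genes)
    ]
    return sorted(range(n_genes), key=lambda j: scores[j], reverse=True)

# Toy data (hypothetical): rows are samples, columns are genes.
class_a = [[1.0, 4.0], [2.0, 6.0]]
class_b = [[8.0, 5.0], [9.0, 5.0]]
print(rank_genes(class_a, class_b))  # [0, 1]
```

Keeping only the top-ranked genes before training a classifier is the sense in which such scores reduce dimensionality.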

13.
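Analysis of gene expression profile data to advance tumor diagnosis and treatment is becoming a research focus in biomedicine. A method combining entropy information processing with principal component analysis (PCA) is therefore proposed. Entropy information is first used for a coarse selection from the ultra-high-dimensional gene expression profile data, giving a subset of feature genes; because this subset still contains correlated genes, PCA is then used to remove the remaining redundancy; finally, the resulting non-redundant, orthogonal gene features are evaluated in experiments on real data. The experimental results show that the method effectively removes irrelevant and redundant information from tumor samples while retaining as much tumor classification information as possible. Compared with other tumor classification methods it shows a clear advantage in accuracy, which confirms that the method is effective and feasible.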

14.
15.
Cancer diagnosis based on microarray technology has drawn more and more attention in the past few years. Accurate and fast diagnosis results make gene expression profiling produced from microarrays widely used by a large range of researchers. Much research work highlights the importance of gene selection and reports good results. However, the minimum sets of genes derived from different methods seldom overlap and are often inconsistent even for the same set of data, partially because of the complexity of cancer. In this paper, cancer classification was attempted in an alternative way, using the whole gene expression profile for all samples instead of partial gene sets. Here, three common data sets, for acute leukemia, prostate cancer and lung cancer respectively, were tested by the NIPALS-KPLS method. Compared to other conventional methods, the results showed wide improvement in classification accuracy. This paper indicates that the sample profile of gene expression may be explored as a better indicator for cancer classification, which deserves further investigation.

16.
Linear regression and two-class classification with gene expression data   Total citations: 3, self-citations: 0, citations by others: 3
MOTIVATION: Using gene expression data to classify (or predict) tumor types has received much research attention recently. Due to some special features of gene expression data, several new methods have been proposed, including the weighted voting scheme of Golub et al., the compound covariate method of Hedenfalk et al. (originally proposed by Tukey), and the shrunken centroids method of Tibshirani et al. These methods look different and are more or less ad hoc. RESULTS: We point out a close connection of the three methods with a linear regression model. Casting the classification problem in the general framework of linear regression naturally leads to new alternatives, such as partial least squares (PLS) methods and penalized PLS (PPLS) methods. Using two real data sets, we show the competitive performance of our new methods when compared with the other three methods.

17.
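A tumor recognition model for colon cancer gene expression data was built on wavelet denoising and support vector machines. The experimental data were decomposed with wavelets, and the mean classification accuracy of the experimental samples, computed by cross-validation, was used to choose the wavelet function and the number of decomposition levels; an energy threshold method was introduced to threshold the wavelet decomposition coefficients for denoising; a method combining gene classification contribution rate with principal component analysis was proposed to extract features from the colon cancer sample data; and the strong nonlinear mapping ability of support vector machines was used to classify the colon cancer sample data nonlinearly. To reduce the influence of how the sample set is partitioned on classification accuracy, the support vector classifier was evaluated with the Jackknife test, giving a classification accuracy of 96.77%. The experimental results demonstrate the effectiveness of the method, which has some reference value for the recognition of colon cancer.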
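The Jackknife test mentioned in this abstract is commonly run as leave-one-out evaluation: each sample is held out once, the classifier is trained on the rest, and the held-out sample is predicted. The sketch below shows that loop under that assumption. To stay self-contained it swaps in a nearest-centroid classifier instead of the support vector machine used in the paper; all names and the toy data are hypothetical.

```python
import math

def nearest_centroid_predict(train_X, train_y, x):
    """Stand-in classifier (not an SVM): label of the closest class centroid."""
    sums, counts = {}, {}
    for row, label in zip(train_X, train_y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, v in enumerate(row):
            acc[j] += v
        counts[label] = counts.get(label, 0) + 1
    centroids = {lab: [s / counts[lab] for s in sums[lab]] for lab in sums}
    return min(centroids, key=lambda lab: math.dist(centroids[lab], x))

def jackknife_accuracy(X, y, predict=nearest_centroid_predict):
    """Leave-one-out accuracy: every sample is tested exactly once."""
    correct = 0
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]   # all samples except the i-th
        train_y = y[:i] + y[i + 1:]
        if predict(train_X, train_y, X[i]) == y[i]:
            correct += 1
    return correct / len(X)

# Toy data (hypothetical): two well separated classes.
X = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2], [1.0, 1.1], [0.9, 1.0], [1.1, 0.9]]
y = ["normal"] * 3 + ["tumor"] * 3
print(jackknife_accuracy(X, y))  # 1.0
```

Because every sample serves as the test set exactly once, the result does not depend on a particular random train/test split, which is the motivation the abstract gives for this test.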

18.
Fung ES, Ng MK. 《Bioinformation》, 2007, 2(5): 230-234
One of the applications of discriminant analysis on microarray data is to classify patient and normal samples based on gene expression values. The analysis is especially important in medical trials and diagnosis of cancer subtypes. The main contribution of this paper is to propose a simple Fisher-type discriminant method for gene selection in microarray data. In the new algorithm, we calculate a weight for each gene and use the weight values as an indicator to identify the subsets of relevant genes that categorize patient and normal samples. An l(2) - l(1) norm minimization method is applied to the discriminant process to automatically compute the weights of all genes in the samples. The experiments on two microarray data sets have shown that the new algorithm can generate classification results as good as other classification methods, and effectively determine relevant genes for classification purposes. In this study, we demonstrate the gene selection ability and the computational effectiveness of the proposed algorithm. Experimental results are given to illustrate the usefulness of the proposed model.

19.
MOTIVATION: Since DNA microarray experiments provide us with a huge amount of gene expression data, they should be analyzed with statistical methods to extract the meanings of experimental results. Some dimensionality reduction methods such as Principal Component Analysis (PCA) are used to roughly visualize the distribution of high dimensional gene expression data. However, in the case of binary classification of gene expression data, PCA does not utilize class information when choosing axes. Thus clearly separable data in the original space may not be so in the reduced space used in PCA. RESULTS: For visualization and class prediction of gene expression data, we have developed a new SVM-based method called multidimensional SVMs, which generates multiple orthogonal axes. This method projects high dimensional data into lower dimensional space to exhibit properties of the data clearly and to visualize a distribution of the data roughly. Furthermore, the multiple axes can be used for class prediction. The basic properties of conventional SVMs are retained in our method: solutions of mathematical programming are sparse, and nonlinear classification is implemented implicitly through the use of kernel functions. The application of our method to the experimentally obtained gene expression datasets for patients' samples indicates that our algorithm is efficient and useful for visualization and class prediction. CONTACT: komura@hal.rcast.u-tokyo.ac.jp.

20.
Rapid methods for characterizing biomass for energy use are fundamental. In this work, near infrared spectroscopy is used to measure the ash and char content of various types of biomass. Very strong models were developed, independently of the type of biomass, to predict ash and char content by near infrared spectroscopy and multivariate analysis. Several statistical approaches such as principal component analysis (PCA), orthogonal signal correction (OSC) treated PCA and partial least squares (PLS), Kernel PCA and PLS were tested in order to find the best method to deal with near infrared data to classify and predict these biomass characteristics. The model with the highest coefficient of correlation and the lowest RMSEP was obtained with the OSC-treated Kernel PLS method.
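The two model-selection criteria named in this abstract, RMSEP (root mean square error of prediction) and the coefficient of correlation, can be computed as in the sketch below. It assumes the correlation meant is Pearson's r between measured and predicted values; the function names and the example numbers are hypothetical and not taken from the paper.

```python
import math

def rmsep(y_true, y_pred):
    """Root mean square error of prediction on a held-out set."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)

def pearson_r(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

# Hypothetical measured vs. predicted ash content (% dry mass).
measured = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]
print(round(rmsep(measured, predicted), 4))      # 0.1581
print(pearson_r(measured, predicted) > 0.99)     # True
```

A model is preferred when it pairs a low RMSEP with a high correlation, since correlation alone does not penalize a systematic offset.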
