Similar articles
 20 similar articles found (search time: 125 ms)
1.
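This article studies gastric cancer subtype classification based on microarray gene expression data. Because microarray data have few samples, high dimensionality, and heavy noise, dimensionality reduction is key to successful classification. The authors apply two dimensionality reduction methods, principal component analysis (PCA) and partial least squares (PLS), and classify subtypes in two gastric cancer data sets using support vector machines (SVM) and k-nearest neighbors (KNN) as classifiers. Classification performance was slightly better than that of traditional medical diagnosis, with a maximum accuracy of 100%. The results show that PCA and PLS effectively extract the features relevant to classification and greatly reduce the dimensionality of gene expression data while maintaining high classification accuracy.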
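As a rough illustration of the PCA step described in this abstract, the sketch below finds the leading principal component of a small data matrix by power iteration and projects the samples onto it. The function name `first_principal_component`, the starting vector, and the toy data are assumptions made for illustration; the abstract does not say how the authors computed their components, and real microarray matrices would call for a numerical library.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return (direction, scores) for the leading principal component.

    rows: list of samples, each a list of feature values.
    Uses power iteration on the sample covariance matrix. This is a toy
    sketch, not the method used in the abstract.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Arbitrary non-uniform start vector (an assumption of this sketch).
    v = [float(j + 1) for j in range(d)]
    length = math.sqrt(sum(x * x for x in v))
    v = [x / length for x in v]
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:  # start vector orthogonal to all variance; give up
            break
        v = [x / norm for x in w]
    # Project each centered sample onto the component direction.
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, scores


direction, scores = first_principal_component([[1, 2], [2, 4], [3, 6], [4, 8]])
```

The scores could then be passed to any classifier, such as the KNN or SVM models the abstract mentions.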

2.
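Application of principal component analysis of the principal-component residual image set to protein mass spectrometry data   Total citations: 1, self-citations: 1, citations by others: 0
Cancer protein mass spectrometry data contain many unknown internal structures and variables. To address these characteristics, and building on a review of applications of principal component analysis of the principal-component residual image set (secondary PCA), the authors propose a new method: select a feature subset with the t-test, extract features with residual-image-set PCA, and classify with linear discriminant analysis. Classification experiments on typical cancer protein mass spectrometry data show that the method achieves a high recognition rate, needs only a small feature subset, and classifies quickly, improving both accuracy and speed.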

3.
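A new method, distinct from phylogenetic trees, is proposed for classifying species from 16S rRNA gene sequences. The bases of each gene are first converted from letters to numbers to build a multidimensional vector. Principal component analysis then projects this vector onto the directions of greatest variance, so that the original data are expressed linearly by a few "principal components" without losing the information in the original data; a three-dimensional projection of the principal component features is plotted to perform the classification. The method performed well in distinguishing Bifidobacterium from Enterococcus.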

4.
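This study addresses how to extract information from gene expression profiles and proposes a new method for extracting gene "tags". The method selects gene "tags" with the rank-sum test and obtains the number of tags at different significance levels. The rank-sum test is then combined with a classification information index and, after comparison against an SVM-based classification model, the most suitable gene "tags" are extracted. The classification information index is then applied repeatedly to these tags to select the principal genes. Finally, an SVM is used to check the accuracy of the principal gene selection and of the classification; accuracy improved, which indicates that the method is effective.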
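The rank-sum screening step in this abstract can be sketched as follows. The code computes the Wilcoxon rank-sum statistic for one gene across two sample groups and converts it to a z score with the normal approximation. The helper names `average_ranks` and `rank_sum_z` are hypothetical, the tie correction to the variance is omitted, and the abstract's classification information index is not reproduced because its definition is not given.

```python
import math


def average_ranks(values):
    """Return 1-based ranks of values, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def rank_sum_z(group_a, group_b):
    """z score of the rank sum of group_a (normal approximation, no tie correction)."""
    n1, n2 = len(group_a), len(group_b)
    ranks = average_ranks(list(group_a) + list(group_b))
    w = sum(ranks[:n1])                      # rank sum of the first group
    mean = n1 * (n1 + n2 + 1) / 2            # expected rank sum under H0
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return (w - mean) / sd


# One gene's expression in two hypothetical groups of samples.
z = rank_sum_z([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
```

Genes could then be ranked by the absolute value of `z`, with the cutoff set by the chosen significance level.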

5.
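Covariance-matrix principal component analysis of human population genetic structure   Total citations: 3, self-citations: 0, citations by others: 3
Objective: To examine the applicability and soundness of principal component analysis (PCA) based on the covariance matrix of the centered (or mean-scaled) gene frequency matrix in studies of human population genetic structure. Methods: Starting from the structural characteristics of the gene frequency matrix, we analyze how covariance-matrix PCA after centering or mean-scaling differs from correlation-matrix PCA after standardization in eigenvalues, eigenvectors, and dimensionality reduction, and compare, through a worked example, how reasonably each method explains population genetic structure. Results: The principal components of the centered (or mean-scaled) covariance matrix reflect both the "variance information weight", which measures how much each gene varies, and the "correlation information weight", which measures how strongly genes influence one another; the principal components of the standardized correlation matrix reflect only the "correlation information weight" and not the "variance information weight". A comparison of the two PCA results for the HLA-A locus in 26 Chinese Han populations confirms that centered covariance-matrix PCA is superior to standardized correlation-matrix PCA in eigenvalues and eigenvectors, in the number of principal components retained, and in the population-genetic interpretability of the components. Conclusion: When applying PCA to population genetic structure, the gene frequency matrix should first be centered (or mean-scaled) to remove the effect of magnitude, and the principal components should then be extracted from its covariance matrix.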
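The distinction this abstract draws between the two variants of PCA can be written out in standard notation. The symbols below (X, S, R, D and the eigenpairs) are illustrative choices, not the authors' own notation.

```latex
% X: n x p gene frequency matrix (n populations, p alleles); symbols are illustrative.
\tilde{X} = X - \mathbf{1}\bar{x}^{\top}, \qquad
S = \frac{1}{n-1}\,\tilde{X}^{\top}\tilde{X}, \qquad
R = D^{-1/2} S D^{-1/2}, \quad D = \operatorname{diag}(S)

% Covariance-matrix PCA: the eigenvalues sum to the total variance.
S u_k = \lambda_k u_k, \qquad
\sum_{k=1}^{p} \lambda_k = \operatorname{tr}(S) = \sum_{j=1}^{p} s_{jj}

% Correlation-matrix PCA: every diagonal entry of R is 1, so the eigenvalues sum to p.
R v_k = \mu_k v_k, \qquad
\sum_{k=1}^{p} \mu_k = \operatorname{tr}(R) = p
```

Because standardization forces every variable to unit variance, the eigenvalues of R cannot carry information about how much each allele frequency varies. This is consistent with the abstract's statement that the correlation matrix reflects only the "correlation information weight".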

6.
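Analyzing leaf hyperspectral information to diagnose ramie brown spot disease quickly and non-destructively is important for improving ramie yield and quality. Hyperspectral data were collected for 430 ramie leaves, diseased and healthy, using a FieldSpec3 portable field spectroradiometer and a handheld leaf clip. A sub-band principal component analysis (PCA) method based on the coefficient of variation is proposed for extracting feature variables. To examine how the number of principal components affects the model, 1 to 10 principal components were used in turn as feature variables, and support vector classification (SVC) was used to build recognition models for brown spot disease in ramie leaves. The results show that: 1) band A (511~636 nm), band B (690~714 nm), band C (1406~1511 nm), and band D (1870~2450 nm) have large coefficients of variation and are the sensitive bands for building the recognition model; 2) of the four sub-bands, band C gives the best model: with 5 to 10 PCA components as feature variables, its accuracy exceeds 90% and, for the same number of components, is generally higher than that of the full spectrum and the other sub-bands. Screening sensitive sub-bands by coefficient of variation, applying PCA to them, and choosing a suitable number of principal components as feature variables is a feasible way to build an SVC model for recognizing ramie brown spot disease, and it provides technical support for a new diagnostic method.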
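The band screening idea in this abstract can be sketched in a few lines. The code assumes that the dispersion measure is the ordinary coefficient of variation (sample standard deviation divided by the mean), computed per wavelength across leaves. Both that reading and the names `coefficient_of_variation` and `band_dispersion` are assumptions, since the abstract gives no formula.

```python
import statistics


def coefficient_of_variation(values):
    """Sample standard deviation divided by the absolute mean."""
    return statistics.stdev(values) / abs(statistics.mean(values))


def band_dispersion(spectra):
    """Coefficient of variation at each wavelength.

    spectra: list of leaf spectra, each a list of reflectance values
    sampled at the same wavelengths.
    """
    # zip(*spectra) iterates over wavelengths instead of over leaves.
    return [coefficient_of_variation(column) for column in zip(*spectra)]


# Three hypothetical leaves measured at two wavelengths.
cv = band_dispersion([[0.10, 0.50], [0.20, 0.50], [0.30, 0.50]])
```

Wavelengths with a large value would be candidates for the sensitive sub-bands on which PCA is then run.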

7.
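In mass-spectrometry-based proteomics classification studies, dimensionality reduction lowers the number of variables by extracting features, which helps machine learning methods classify accurately and efficiently. To study and compare the performance of classification models that combine machine learning classifiers with dimensionality reduction on proteomics data, and to provide a reference for related classification studies, linear discriminant analysis (LDA), k-nearest neighbors, decision trees, support vector machines (SVM), and artificial neural networks (ANN) were combined with principal component analysis and partial least squares (PLS) dimensionality reduction and applied to the classification of public mass spectrometry data. Among the combined models used here, PLS-LDA, PLS-SVM, and PLS-ANN achieved the highest classification accuracy. To improve classification further, an expert classification system was built from these three best combinations by majority voting. In 10-fold cross-validation, the majority-voting model reached 100% classification accuracy using only the first 5 principal components. The methods and conclusions of this study offer a useful reference for the analysis of mass spectra and other kinds of fingerprint spectra.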
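The majority-voting step in this abstract is simple to sketch. The example below combines the label predictions of three classifiers sample by sample. The function name `majority_vote` and the example predictions are hypothetical, and no actual PLS-LDA, PLS-SVM, or PLS-ANN models are trained here.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-classifier label lists into one list of voted labels.

    predictions: list of label lists, one per classifier, all of equal length.
    With an odd number of classifiers and two classes there are no ties;
    otherwise a tie goes to the label that was seen first.
    """
    combined = []
    for votes in zip(*predictions):
        label, _count = Counter(votes).most_common(1)[0]
        combined.append(label)
    return combined


# Hypothetical outputs of three classifiers on four samples.
pls_lda = ["a", "b", "a", "b"]
pls_svm = ["a", "a", "a", "b"]
pls_ann = ["b", "b", "a", "a"]
voted = majority_vote([pls_lda, pls_svm, pls_ann])
```

Each classifier makes one error in positions where the other two agree, so the vote can correct individual mistakes.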

8.
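Extraction of wetland aquatic vegetation from medium-resolution TM data   Total citations: 8, self-citations: 0, citations by others: 8
Lin Chuan, Gong Zhaoning, Zhao Wenji. Acta Ecologica Sinica (《生态学报》), 2010, 30(23): 6460-6469
Aquatic vegetation in the Yeyahu (Wild Duck Lake) wetland was extracted with an object-oriented classification method from medium-resolution Landsat TM and ETM+ multispectral images acquired in August, when wetland aquatic vegetation grows vigorously, reflects strongly, and carries rich spectral information. The study shows that applying principal component and tasseled cap transforms to the original images separates the main information from noise, which reduces data redundancy and inter-band correlation and increases the spectral and spatial differences between aquatic vegetation and other land-cover types. Guided by an analysis of field-measured aquatic vegetation spectra, the normalized difference vegetation index (NDVI) and normalized difference water index (NDWI) were chosen to assist classification and to construct feature bands or band combinations. Suitable membership functions and threshold ranges were then determined, and a classification decision tree was built to classify wetland aquatic vegetation automatically. This improved the accuracy of image segmentation and object-oriented classification and gave satisfactory extraction results. Overall classification accuracy for the 2002 and 2008 images reached 86.5% and 85.44%, respectively, which shows that medium-resolution TM imagery meets the needs of wetland aquatic vegetation extraction. Because it also offers relatively high spectral resolution, rich information content, relatively low cost, and a long time series, it can serve as the main data source for extracting wetland aquatic vegetation and monitoring its changes over the past 20 years.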
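The two indices named in this abstract are both normalized differences of band reflectances. The sketch below uses the common definitions, NDVI = (NIR - Red) / (NIR + Red) and the green/NIR form of NDWI. Which NDWI variant the authors used is not stated, so that choice, like the function names and sample reflectances, is an assumption.

```python
def normalized_difference(a, b):
    """Return (a - b) / (a + b), or 0.0 when both inputs are zero."""
    total = a + b
    return 0.0 if total == 0 else (a - b) / total


def ndvi(nir, red):
    """Normalized difference vegetation index from NIR and red reflectance."""
    return normalized_difference(nir, red)


def ndwi(green, nir):
    """Normalized difference water index, green/NIR form (assumed variant)."""
    return normalized_difference(green, nir)


# Hypothetical reflectances for a vegetated pixel and an open-water pixel.
vegetation_ndvi = ndvi(nir=0.5, red=0.1)
water_ndwi = ndwi(green=0.10, nir=0.05)
```

Thresholds on such indices are the kind of rule that a classification decision tree like the one in the abstract could apply to each image object.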

9.
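Random forests: an important feature gene selection method for tumors   Total citations: 2, self-citations: 0, citations by others: 2
Feature selection techniques are widely used in bioinformatics, and random forests (RF) are one important feature selection method. RF was used to select feature genes from five gene expression profile data sets, including gastric, colon, and lung cancer; the selected genes were combined with a support vector machine (SVM) to classify the original data sets, and the feature gene selection and classification results were given a preliminary analysis. RF was also compared with significance analysis of microarrays (SAM) and ReliefF. The results show that the feature genes selected by random forests carry more classification information and yield higher classification accuracy. Given the method's own advantages as a classifier, random forests can be widely used as a reliable tool for analyzing gene expression profile data.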

10.
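A tumor recognition model for colon cancer gene expression data was built on wavelet denoising and support vector machines. The experimental data were decomposed with wavelets, and the wavelet function and number of decomposition levels were chosen by computing the mean classification accuracy of the samples under cross-validation. An energy threshold was applied to the wavelet decomposition coefficients to remove noise. A method combining gene classification contribution rate with principal component analysis was proposed to extract features from the colon cancer samples. The strong nonlinear mapping ability of support vector machines was then used to classify the samples nonlinearly. To reduce the influence of how the sample set is partitioned on classification accuracy, the support vector classifier was tested with the Jackknife method, giving a classification accuracy of 96.77%. The results demonstrate that the method is effective and of reference value for recognizing colon cancer.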
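The Jackknife test mentioned in this abstract holds out each sample in turn, trains on the rest, and counts correct predictions. The sketch below shows that loop. A nearest-centroid rule stands in for the support vector classifier so that the example stays self-contained; it is a swapped-in classifier, not the authors' model, and all names and data are hypothetical.

```python
def nearest_centroid_predict(train_x, train_y, sample):
    """Predict the label whose class mean is closest to sample (stand-in classifier)."""
    sums, counts = {}, {}
    for x, y in zip(train_x, train_y):
        acc = sums.setdefault(y, [0.0] * len(x))
        for j, value in enumerate(x):
            acc[j] += value
        counts[y] = counts.get(y, 0) + 1

    def squared_distance(label):
        return sum((s / counts[label] - v) ** 2
                   for s, v in zip(sums[label], sample))

    return min(sums, key=squared_distance)


def jackknife_accuracy(xs, ys, predict=nearest_centroid_predict):
    """Leave each sample out once, train on the rest, and return the hit rate."""
    hits = 0
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        if predict(train_x, train_y, xs[i]) == ys[i]:
            hits += 1
    return hits / len(xs)


# Hypothetical one-feature samples: "n" for normal, "t" for tumor.
xs = [[0.0], [0.2], [0.1], [1.0], [1.2], [1.1]]
ys = ["n", "n", "n", "t", "t", "t"]
accuracy = jackknife_accuracy(xs, ys)
```

Because every sample is tested exactly once, the result does not depend on a particular train/test split, which is the motivation the abstract gives for using this test.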

11.
The investigation of associations between rare genetic variants and diseases or phenotypes has two goals: first, identifying which genes or genomic regions are associated, and second, discriminating associated variants from background noise within each region. Over the last few years, many new methods have been developed which associate genomic regions with phenotypes. However, classical methods for high-dimensional data have received little attention. Here we investigate whether several classical statistical methods for high-dimensional data, namely ridge regression (RR), principal components regression (PCR), partial least squares regression (PLS), a sparse version of PLS (SPLS), and the LASSO, are able to detect associations with rare genetic variants. These approaches have been extensively used in statistics to identify the true associations in data sets containing many predictor variables. Using genetic variants identified in three genes that were Sanger sequenced in 1998 individuals, we simulated continuous phenotypes under several different models, and we show that these feature selection and feature extraction methods can substantially outperform several popular methods for rare variant analysis. Furthermore, these approaches can identify which variants are contributing most to the model fit, and therefore both goals of rare variant analysis can be achieved simultaneously with the use of regression regularization methods. These methods are briefly illustrated with an analysis of adiponectin levels and variants in the ADIPOQ gene.
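For orientation, two of the penalized regressions named in this abstract differ only in the penalty term. The textbook objectives are shown below. The symbols are the usual ones (y the phenotype vector, X the variant matrix, λ ≥ 0 the penalty weight), not notation taken from the paper.

```latex
% Ridge regression: squared L2 penalty shrinks all coefficients toward zero.
\hat{\beta}^{\mathrm{ridge}} = \arg\min_{\beta}\;
  \lVert y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_2^2

% LASSO: L1 penalty can set some coefficients exactly to zero.
\hat{\beta}^{\mathrm{lasso}} = \arg\min_{\beta}\;
  \lVert y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_1
```

The exact zeros produced by the L1 penalty are what allow the LASSO to point to individual variants, which matches the second goal the abstract describes.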

12.
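Flavor is one of the key factors that determine beer quality. Metabolomics was used to analyze the relationship between intracellular yeast metabolites and beer flavor compounds during industrial beer fermentation, and to study at the metabolic level the key factors that influence flavor compound formation. During fermentation, changes in flavor compound content and in intracellular yeast metabolites were measured at the same time, and the resulting large, multidimensional metabolic data set was processed with two multivariate statistical methods, principal component analysis (PCA) and partial least squares (PLS). The PCA results show that phosphoric acid, trehalose, succinic acid, glutamic acid, aspartic acid, and alanine contribute most to the principal components, which indicates that the levels of these metabolites change markedly across fermentation stages. The PLS results show that the substances with the greatest influence on beer flavor are mainly amino acids, including serine, valine, threonine, lysine, alanine, leucine, and asparagine, which provides some theoretical guidance for regulating flavor compounds in beer.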

13.
Predicting survival from microarray data--a comparative study   Total citations: 1, self-citations: 0, citations by others: 1
MOTIVATION: Survival prediction from gene expression data and other high-dimensional genomic data has been the subject of much research in recent years. These kinds of data are associated with the methodological problem of having many more gene expression values than individuals. In addition, the responses are censored survival times. Most of the proposed methods handle this by using Cox's proportional hazards model and obtain parameter estimates by some dimension reduction or parameter shrinkage estimation technique. Using three well-known microarray gene expression data sets, we compare the prediction performance of seven such methods: univariate selection, forward stepwise selection, principal components regression (PCR), supervised principal components regression, partial least squares regression (PLS), ridge regression and the lasso. RESULTS: Statistical learning from subsets should be repeated several times in order to get a fair comparison between methods. Methods using coefficient shrinkage or linear combinations of the gene expression values have much better performance than the simple variable selection methods. For our data sets, ridge regression has the overall best performance. AVAILABILITY: Matlab and R code for the prediction methods are available at http://www.med.uio.no/imb/stat/bmms/software/microsurv/.

14.
15.
We consider the problem of predicting survival times of cancer patients from the gene expression profiles of their tumor samples via linear regression modeling of log-transformed failure times. The partial least squares (PLS) and least absolute shrinkage and selection operator (LASSO) methodologies are used for this purpose, where we first modify the data to account for censoring. Three approaches to handling right-censored data (reweighting, mean imputation, and multiple imputation) are considered. Their performances are examined in a detailed simulation study and compared with that of full-data PLS and LASSO had there been no censoring. A major objective of this article is to investigate the performances of PLS and LASSO in the context of microarray data, where the number of covariates is very large and there are extremely few samples. We demonstrate that LASSO outperforms PLS in terms of prediction error when the list of covariates includes a moderate to large percentage of useless or noise variables; otherwise, PLS may outperform LASSO. For a moderate sample size (100 with 10,000 covariates), LASSO performed better than a no-covariate model (or noise-based prediction). The mean imputation method appears to best track the performance of the full-data PLS or LASSO. The mean imputation scheme is used on an existing data set on lung cancer. This reanalysis using the mean-imputed PLS and LASSO identifies a number of genes that were known to be related to cancer or tumor activities from previous studies.

16.
The diagnosis of cancer by examination of the urine has the potential to improve patient outcomes by means of earlier detection. Because the urine contains metabolic signatures of many biochemical pathways, this biofluid is ideally suited for metabolomic analysis, especially involving diseases of the kidney and urinary system. In this pilot study, we test three independent analytical techniques for suitability for detection of renal cell carcinoma (RCC) in urine of affected patients. Hydrophilic interaction chromatography (HILIC-LC-MS), reversed-phase ultra performance liquid chromatography (RP-UPLC-MS), and gas chromatography time-of-flight mass spectrometry (GC-TOF-MS) all were used as complementary separation techniques. The combination of these techniques is best suited to cover a very large part of the urine metabolome by enabling the detection of both lipophilic and hydrophilic metabolites present therein. In this study, it is demonstrated that sample pretreatment with urease dramatically alters the metabolome composition apart from removal of urea. Two new freely available peak alignment methods, MZmine and XCMS, are used for peak detection and retention time alignment. The results are analyzed by a feature selection algorithm with subsequent univariate analysis of variance (ANOVA) and a multivariate partial least squares (PLS) approach. From more than 2000 mass spectral features detected in the urine, we identify several significant components that lead to discrimination between RCC patients and controls despite the relatively small sample size. A feature selection process condensed the significant features to fewer than 30 components in each of the data sets. In future work, these potential biomarkers will be further validated with a larger patient cohort. Such investigation will likely lead to clinically applicable assays for earlier diagnosis of RCC, as well as other malignancies, and thereby improved patient prognosis.

17.
18.
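Methods for identifying the echolocation calls of four species of horseshoe bats in flight were studied. Wavelet packet decomposition was used to obtain the energy in each frequency band as the feature vector for identification, and principal component analysis was used to optimize the feature space. A small number of principal components were extracted; because they are mutually uncorrelated, they satisfy the requirements of feature optimization. The principal component vectors were used as input to a BP (back-propagation) neural network to identify the bat species. Individual recognition accuracy exceeded 80%, which shows that identifying bat echolocation calls with wavelet packet decomposition and neural network recognition is feasible.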
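The band-energy features described in this abstract can be illustrated with a full wavelet packet tree. The sketch below uses the Haar wavelet because it needs no filter tables. The wavelet, the decomposition depth, and the function names are all assumptions, since the abstract does not say which were used. Nodes are returned in natural (tree) order, not sorted by frequency.

```python
import math


def haar_step(signal):
    """Split a signal of even length into Haar approximation and detail halves."""
    s = 1 / math.sqrt(2)
    approx = [(signal[i] + signal[i + 1]) * s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) * s for i in range(0, len(signal), 2)]
    return approx, detail


def wavelet_packet_energies(signal, levels):
    """Relative energy of each terminal node of a full Haar wavelet packet tree.

    Assumes len(signal) is divisible by 2**levels and the signal is not all zero.
    """
    nodes = [list(signal)]
    for _ in range(levels):
        # Unlike the plain wavelet transform, split every node, not just the approximation.
        nodes = [child for node in nodes for child in haar_step(node)]
    energies = [sum(v * v for v in node) for node in nodes]
    total = sum(energies)
    return [e / total for e in energies]


# A constant signal puts all energy in the first node; an alternating one does not.
flat = wavelet_packet_energies([1.0] * 8, levels=2)
alternating = wavelet_packet_energies([1.0, -1.0] * 4, levels=2)
```

The resulting energy vector, with 2**levels entries, is the kind of feature vector that could be compressed with PCA before being passed to a neural network.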

19.
MOTIVATION: Many methods have been developed for selecting small informative feature subsets in large noisy data. However, unsupervised methods are scarce. Examples are using the variance of data collected for each feature, or the projection of the feature on the first principal component. We propose a novel unsupervised criterion, based on SVD-entropy, selecting a feature according to its contribution to the entropy (CE) calculated on a leave-one-out basis. This can be implemented in four ways: simple ranking according to CE values (SR); forward selection by accumulating features according to which set produces highest entropy (FS1); forward selection by accumulating features through the choice of the best CE out of the remaining ones (FS2); backward elimination (BE) of features with the lowest CE. RESULTS: We apply our methods to different benchmarks. In each case we evaluate the success of clustering the data in the selected feature spaces, by measuring Jaccard scores with respect to known classifications. We demonstrate that feature filtering according to CE outperforms the variance method and gene-shaving. There are cases where the analysis, based on a small set of selected features, outperforms the best score reported when all information was used. Our method calls for an optimal size of the relevant feature set. This turns out to be just a few percent of the number of genes in the two Leukemia datasets that we have analyzed. Moreover, the most favored selected genes turn out to have significant GO enrichment in relevant cellular processes.
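The abstract does not state the formulas behind its criterion. One common way to write an SVD-based entropy and a leave-one-out contribution is given below as a sketch under that assumption. The symbols (A, s_j, V_j, E, CE_i) are illustrative and may differ from the paper's definitions.

```latex
% A: n x m data matrix (n features, m samples) with singular values s_1, ..., s_N.
V_j = \frac{s_j^{2}}{\sum_{k=1}^{N} s_k^{2}}, \qquad
E(A) = -\frac{1}{\log N} \sum_{j=1}^{N} V_j \log V_j

% Contribution of feature i: entropy of the full matrix minus the entropy
% of the matrix with row i removed (leave-one-out).
CE_i = E\!\left(A_{[n \times m]}\right) - E\!\left(A_{[(n-1) \times m]}^{(-i)}\right)
```

Under this reading, E ranges from 0 (one dominant singular value) to 1 (a flat spectrum), and features with large CE are the ones the ranking and selection schemes in the abstract would favor.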

20.
Phosphodiesterase type-5 (PDE-5) is a key enzyme involved in the erection process. PDE-5 inhibitors, such as Sildenafil (Viagra™), Vardenafil (Levitra™) and Tadalafil (Cialis™), are used for the treatment of erectile dysfunction. Computer-assisted modelling of biological activities of PDE-5 inhibitors may make quantitative structure–activity relationship (QSAR) models useful for the development of safer (low side effects) and more potent drugs. The multivariate image analysis applied to QSAR (MIA-QSAR) method, coupled to partial least-squares (PLS) regression, has provided highly predictive QSAR models. Nevertheless, regression methods which take into account nonlinearity, such as least-squares support-vector machines (LS-SVMs), are supposed to predict biological activities more accurately than the usual linear methods. Thus, together with prior variable selection using principal component analysis ranking, MIA-QSAR and LS-SVM regression were applied to model the bioactivities of a series of cyclic guanine derivatives (PDE-5 inhibitors), and the results were compared with those based on linear methodologies. MIA-QSAR/LS-SVM was found to greatly improve the prediction performance when compared with MIA-QSAR/PLS, MIA-QSAR/N-PLS, CoMFA/PLS and CoMSIA/PLS models.

