Similar literature
 17 similar articles found.
1.
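Analysis of gene expression profile data to advance tumor diagnosis and treatment is becoming a hot topic in biomedicine. This paper therefore proposes a method that combines entropy-based information processing with principal component analysis (PCA). Entropy information is first used for a coarse selection over the ultra-high-dimensional gene expression profile data, yielding a subset of feature genes. Because the genes in this subset are still correlated, PCA is then applied to remove the remaining redundancy. Finally, the resulting non-redundant, orthogonal gene features are evaluated in experiments on real data. The results show that the method effectively removes irrelevant and redundant information from tumor samples while retaining as much tumor classification information as possible. Compared with other tumor classification methods, it shows a fairly clear advantage in accuracy, which supports the method's effectiveness and feasibility.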
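
The coarse, entropy-based selection step described above can be sketched as follows. The abstract does not say which entropy criterion is used, so this sketch assumes information gain of the class labels after equal-width binning of each gene; the function names, the bin count and the toy data are all hypothetical, and the subsequent PCA step is not shown.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a sequence of discrete labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels, bins=4):
    """Label entropy minus label entropy after equal-width binning of one gene.

    Assumption: the abstract's "entropy information" is modelled as information gain.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # constant gene: every sample lands in bin 0
    groups = {}
    for v, lab in zip(values, labels):
        b = min(int((v - lo) / width), bins - 1)
        groups.setdefault(b, []).append(lab)
    conditional = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - conditional

def coarse_select(genes, labels, k):
    """Indices of the k genes with the highest information gain, best first."""
    gains = [information_gain(g, labels) for g in genes]
    return sorted(range(len(genes)), key=lambda j: gains[j], reverse=True)[:k]

# Hypothetical toy data: one list of expression values per gene, six samples.
genes = [
    [0.10, 0.20, 0.15, 0.90, 0.95, 1.00],  # separates the two classes
    [0.50, 0.10, 0.90, 0.40, 0.80, 0.20],  # unrelated to the classes
    [1.00, 1.00, 1.00, 1.00, 1.00, 1.00],  # constant, carries no information
]
labels = [0, 0, 0, 1, 1, 1]
top = coarse_select(genes, labels, k=1)
```

Ranking by a per-gene score ignores correlation between the retained genes, which is why the abstract follows this step with PCA.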

2.
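Feature selection techniques are widely used in bioinformatics. Here, principal components are extracted by repeatedly applying the partial least squares (PLS) method, and the genes with the largest weights in each component are selected in turn, so that PLS serves as a feature selection method. The approach is applied to feature gene selection on tumor gene expression profile data, and the selected feature genes are used for classification; with 8 feature genes, a classification accuracy of 92.5% is reached.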
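
One way to read "select the genes with the largest weights, component by component" is sketched below. For centered data with a single response, the first PLS weight of a gene is proportional to its covariance with the response, so the sketch picks the gene with the largest absolute weight and then deflates. The deflation rule (orthogonalizing against the selected gene's own column rather than the full PLS score) is a simplifying assumption, and all names and data are hypothetical; this is not claimed to be the authors' procedure.

```python
def center(col):
    """Subtract the mean from a list of numbers."""
    m = sum(col) / len(col)
    return [v - m for v in col]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def pls_weight_selection(columns, y, n_select):
    """Greedy gene selection by largest first-PLS weight, with deflation.

    columns: one list of expression values per gene; y: numeric class coding.
    """
    cols = [center(c) for c in columns]
    resid = center(y)
    selected = []
    for _ in range(n_select):
        candidates = [j for j in range(len(cols)) if j not in selected]
        if not candidates:
            break
        # First PLS weight of gene j is proportional to <x_j, y> on centered data.
        best = max(candidates, key=lambda j: abs(dot(cols[j], resid)))
        selected.append(best)
        t = cols[best][:]
        tt = dot(t, t)
        if tt == 0.0:
            break
        # Deflate: remove the part of every gene and of y explained by t (assumption).
        new_cols = []
        for c in cols:
            coef = dot(c, t) / tt
            new_cols.append([x - coef * ti for x, ti in zip(c, t)])
        cols = new_cols
        coef_y = dot(resid, t) / tt
        resid = [r - coef_y * ti for r, ti in zip(resid, t)]
    return selected

# Hypothetical toy data: gene 1 is a scaled copy of gene 0, gene 2 adds new signal.
a = [1, -1, 1, -1, 1, -1, 1, -1]
b = [1, 1, -1, -1, 1, 1, -1, -1]
c = [1, 1, 1, 1, -1, -1, -1, -1]
columns = [a, [0.9 * v for v in a], b, c]
y = [2 * ai + bi for ai, bi in zip(a, b)]
picked = pls_weight_selection(columns, y, n_select=2)
```

In this toy case the deflation step is what keeps the redundant copy (gene 1) from being chosen second.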

3.
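To explore the feasibility of artificial neural networks (ANN) for insect classification, this paper proposes improving the ANN by combining principal component analysis with mathematical modeling, and validates the approach on six moth species of the family Noctuidae (Lepidoptera). First, the feature extraction software Bugshape1.0 was used to obtain 13 mathematical morphological features from 180 right forewing samples of the six species. Principal component analysis was then used to recombine the wing morphological feature variables into new composite variables, and a BP neural network classifier was built on top of the principal component analysis. The analysis showed that the first 5 principal components had a cumulative contribution rate of 85.52%, essentially covering the information in all the feature variables. On this basis, a three-layer BP neural network classifier with 5 input nodes, 12 hidden nodes and 1 output node was built. A total of 120 feature records (20 samples per species) were used to train and simulate the classifier, and the remaining 60 records were used for validation. The correlation coefficient between the simulated outputs and the target values was R = 0.997, and the classification accuracy reached 93.33%. Compared with a classifier built with a BP neural network alone, without principal component analysis, the PCA-based BP neural network classifier performed better and classified more accurately. The results indicate that the proposed method classifies and discriminates well and offers a feasible way to identify moth species.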
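
The "cumulative contribution rate" criterion mentioned above (keep the leading components until their share of total variance passes a threshold) can be sketched as below. The abstract reports only the outcome (5 components, 85.52%), not the eigenvalues, so the eigenvalues and the 0.85 threshold here are hypothetical.

```python
def components_for_threshold(eigenvalues, threshold):
    """Smallest number of leading components whose cumulative share of the
    total variance reaches the threshold, together with that share."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    running = 0.0
    for k, value in enumerate(ordered, start=1):
        running += value
        if running / total >= threshold:
            return k, running / total
    return len(ordered), 1.0

# Hypothetical eigenvalues of a covariance (or correlation) matrix.
k, rate = components_for_threshold([5.0, 3.0, 1.0, 1.0], threshold=0.85)
```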

4.
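Application of a genetic optimization algorithm to gene data classification   Cited by: 1 (self-citations: 0; citations by others: 1)
This paper proposes a genetic-algorithm-based feature extraction method for gene microarray data. The raw data are first standardized, then reduced in dimensionality using analysis of variance, and finally optimized with a genetic algorithm. The genetic operators and the fitness function are configured for gene data, and feature genes are selected from the optimized data set to obtain a small feature subset. To validate the selected features, a classifier is built by discriminant analysis using a sample-splitting scheme. Experiments show that the method achieves good classification results and that the algorithm is stable and efficient.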

5.
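Extraction of human tumor informative genes based on SVM and mean impact value   Cited by: 1 (self-citations: 0; citations by others: 1)
Selecting informative genes for tumor classification from gene expression profiles is an important way to discover tumor-specific expressed genes and to explore tumor gene expression patterns. Tumor diagnosis based on classification information obtained from gene expression profiles is an important research direction in bioinformatics and may become a fast and effective method of molecular tumor diagnosis in clinical medicine. Because tumor gene expression profile data are high-dimensional, small in sample size and noisy, this paper proposes an algorithm that combines a support vector machine with the mean impact value to find tumor informative genes. Its advantage is that it can find multiple informative gene subsets with as few genes and as much classification power as possible. Binary tumor data sets are used to verify the feasibility and effectiveness of the algorithm; for the colon cancer sample set, only 3 genes are needed to reach 100% recognition accuracy under leave-one-out cross-validation. To avoid the influence of different sample splits on classification performance, full-fold cross-validation is further used to evaluate each informative gene subset and to pick out the more reliable ones. Compared with other tumor classification methods, the experimental results show clear advantages in both the number of informative genes and classification performance.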

6.
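Ai Liang (艾亮), Feng Jie (冯杰). 《生物信息学》, 2023, 21(3): 179-186
This paper proposes a new, fast, alignment-free method for protein sequence similarity and evolutionary analysis. To characterize a protein sequence, 10 physicochemical properties of the amino acids are first condensed into 6 principal components by principal component analysis, and the number of each amino acid in the sequence is used as a weight to compute a weighted average of the principal component scores. Positional information of the amino acids is then merged in to form a 26-dimensional feature vector for the sequence. Finally, Euclidean distance is used to measure the similarity and evolutionary relationships between protein sequences. Tests on 3 protein sequence data sets show that the method clusters every protein sequence correctly and is simple and fast, which demonstrates its effectiveness.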

7.
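A new method, distinct from phylogenetic trees, is proposed for classifying species from 16S rRNA gene sequences. The bases of the gene are first converted from letters to numbers to build a multidimensional vector. Principal component analysis then projects this vector onto the directions of greatest data spread, so that the original data are expressed linearly by a few "principal components" without losing the information in the original data. A three-dimensional projection view of the principal component features is then plotted to achieve classification. The method was applied with good results to the classification and identification of Bifidobacterium and Enterococcus.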

8.
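Liver data are diagnosed on the basis of 31P magnetic resonance spectra (31Phosphorus Magnetic Resonance Spectroscopy, 31P-MRS) and divided into three types: hepatocellular carcinoma, cirrhosis and normal liver. In this paper, a genetic algorithm is used for feature selection before classification with a linear classifier, in order to pick the optimal feature subset. In the experiments, the linear classifier is applied both to the optimal feature subset chosen by the genetic algorithm and to the full extracted spectral data. The results show that the former not only clearly improves classification accuracy but also reduces classifier running time; the diagnostic accuracy of 31P-MR spectra for in vivo hepatocellular carcinoma rises from 62.50% to 89.35%.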

9.
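Early diagnosis of cancer can markedly improve patient survival, and this is especially true for hepatocellular carcinoma. Machine learning is an effective tool for cancer classification. A central difficulty is selecting a low-dimensional feature subset with high classification accuracy from complex, high-dimensional cancer data sets. This paper proposes a two-stage feature selection method, SC-BPSO. A new filter, the SC filter, is designed by combining the Spearman correlation coefficient and the chi-square test of independence as the filter's evaluation function; the SC filter is then combined with a wrapper based on binary particle swarm optimization (BPSO) to achieve two-stage feature selection. The method is applied to cancer classification on high-dimensional data, distinguishing normal samples from hepatocellular carcinoma samples. First, 130 liver tissue microRNA sequence data sets (64 hepatocellular carcinoma, 66 normal liver tissue) from the US National Center for Biotechnology Information (NCBI) and the European Bioinformatics Institute (EBI) are preprocessed, and the MiRME algorithm is used to extract three types of features from the raw sequence files: microRNA expression levels, editing levels and post-editing expression levels. Next, the parameters of the SC-BPSO algorithm are tuned for the hepatocellular carcinoma classification setting and a key feature subset is selected. Finally, classification models are built and used for prediction, and the results are compared with those from feature subsets chosen by an information gain filter, an information gain ratio filter and a BPSO wrapper, using four classifiers (random forest, support vector machine, decision tree and KNN) with identical parameters. With the feature subset selected by SC-BPSO, classification accuracy reaches 98.4%. The results show that, compared with the other three feature selection algorithms, SC-BPSO effectively finds feature subsets that are smaller and more accurate. This may be significant for cancer classification on high-dimensional data with few samples.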
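
The wrapper stage of such a two-stage scheme can be illustrated with a generic binary particle swarm optimizer, in which each particle is a 0/1 mask over features and a sigmoid of the velocity gives the probability that a bit is set. This is a minimal sketch of textbook BPSO, not the SC-BPSO implementation: the SC filter is omitted, the parameter values are arbitrary, and the fitness function is a hypothetical stand-in for classifier accuracy.

```python
import math
import random

def bpso(fitness, n_features, n_particles=12, n_iter=40,
         w=0.7, c1=1.5, c2=1.5, seed=0):
    """Maximize fitness(mask) over 0/1 masks with binary PSO."""
    rng = random.Random(seed)
    pos = [[rng.randint(0, 1) for _ in range(n_features)] for _ in range(n_particles)]
    vel = [[0.0] * n_features for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_fit = [fitness(p) for p in pos]
    g = max(range(n_particles), key=lambda i: pbest_fit[i])
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]
    for _ in range(n_iter):
        for i in range(n_particles):
            for d in range(n_features):
                v = (w * vel[i][d]
                     + c1 * rng.random() * (pbest[i][d] - pos[i][d])
                     + c2 * rng.random() * (gbest[d] - pos[i][d]))
                v = max(-6.0, min(6.0, v))  # clamp so the sigmoid never saturates fully
                vel[i][d] = v
                # Sigmoid transfer: velocity becomes the probability of bit = 1.
                pos[i][d] = 1 if rng.random() < 1.0 / (1.0 + math.exp(-v)) else 0
            f = fitness(pos[i])
            if f > pbest_fit[i]:
                pbest[i], pbest_fit[i] = pos[i][:], f
                if f > gbest_fit:
                    gbest, gbest_fit = pos[i][:], f
    return gbest, gbest_fit

# Hypothetical fitness: agreement with a known "useful feature" mask.
TARGET = [1, 0, 1, 0, 0, 1]

def toy_fitness(mask):
    return sum(1 for m, t in zip(mask, TARGET) if m == t)

best_mask, best_score = bpso(toy_fitness, n_features=len(TARGET))
```

In a real wrapper the fitness would train and score a classifier on the masked features, which is why a cheap filter stage is usually run first to shrink the search space.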

10.
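In the analysis of high-dimensional protein spectral data, feature extraction has long been a problem that many researchers have focused on. This paper proposes a feature extraction method based on high-frequency wavelet coefficients and principal component analysis (PCA): wavelet analysis is first used to denoise the data and the high-frequency coefficients are extracted as features, after which PCA is used for dimensionality reduction. Experiments show that the proposed method reaches recognition rates of 100% and 99.45% on the 8-7-02 and 4/3/02 data sets respectively, and can effectively improve the classification recognition rate.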

11.
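Protein mass spectrometry is an important research tool in proteomics and has been applied successfully in areas such as early cancer diagnosis, but the curse of dimensionality in protein mass spectrometry data makes dimensionality reduction a necessary step in the analysis. This paper first preprocesses the high-resolution SELDI-TOF ovarian mass spectrometry data provided by the US National Cancer Institute. The feature selection problem is then cast as a combinatorial optimization model solved by simulated annealing: the objective function is built from the classification error rate of linear discriminant analysis and the posterior probabilities of the samples, the new-solution generator is based on a uniform distribution and a control parameter, and a memory function is added to the annealing process. 10-fold cross-validation is then used to choose training and test samples, and a linear discriminant analysis classifier evaluates the reduced data. Experiments show that when simulated annealing selects 6 or more features, all of the high-resolution SELDI-TOF ovarian data are classified correctly, indicating that simulated annealing is well suited to feature selection for protein mass spectrometry data.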
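
A fixed-size feature subset search by simulated annealing with a "memory" of the best solution seen can be sketched as follows. The cost function here is a hypothetical toy, whereas the abstract builds its objective from the LDA error rate and sample posterior probabilities; the swap-one-feature neighbourhood, the geometric cooling schedule and all parameter values are likewise assumptions.

```python
import math
import random

def anneal_select(cost, n_features, k, t0=1.0, cooling=0.95, steps=200, seed=0):
    """Minimize cost(subset) over subsets of size k (requires k < n_features)."""
    rng = random.Random(seed)
    current = rng.sample(range(n_features), k)
    current_cost = cost(current)
    best, best_cost = current[:], current_cost  # memory: best subset seen so far
    t = t0
    for _ in range(steps):
        # Neighbour: swap one selected feature for a uniformly chosen unselected one.
        candidate = current[:]
        outside = [j for j in range(n_features) if j not in candidate]
        candidate[rng.randrange(k)] = rng.choice(outside)
        c = cost(candidate)
        # Metropolis rule: always accept improvements, sometimes accept worse moves.
        if c <= current_cost or rng.random() < math.exp((current_cost - c) / t):
            current, current_cost = candidate, c
            if c < best_cost:
                best, best_cost = candidate[:], c
        t *= cooling
    return sorted(best), best_cost

# Hypothetical cost: number of selected features outside a "useful" set.
USEFUL = {1, 4, 7}

def toy_cost(subset):
    return sum(1 for j in subset if j not in USEFUL)

subset, subset_cost = anneal_select(toy_cost, n_features=10, k=3)
```

Keeping the best subset separately matters because the Metropolis rule can walk away from a good solution while the temperature is still high.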

12.
MOTIVATION: Protein expression profiling for differences indicative of early cancer holds promise for improving diagnostics. Due to their high dimensionality, statistical analysis of proteomic data from mass spectrometers is challenging in many aspects such as dimension reduction, feature subset selection as well as construction of classification rules. Search of an optimal feature subset, commonly known as the feature subset selection (FSS) problem, is an important step towards disease classification/diagnostics with biomarkers. METHODS: We develop a parsimonious threshold-independent feature selection (PTIFS) method based on the concept of area under the curve (AUC) of the receiver operating characteristic (ROC). To reduce computational complexity to a manageable level, we use a sigmoid approximation to the empirical AUC as the criterion function. Starting from an anchor feature, the PTIFS method selects a feature subset through an iterative updating algorithm. Highly correlated features that have similar discriminating power are precluded from being selected simultaneously. The classification rule is then determined from the resulting feature subset. RESULTS: The performance of the proposed approach is investigated by extensive simulation studies, and by applying the method to two mass spectrometry data sets of prostate cancer and of liver cancer. We compare the new approach with the threshold gradient descent regularization (TGDR) method. The results show that our method can achieve comparable performance to that of the TGDR method in terms of disease classification, but with fewer features selected. AVAILABILITY: Supplementary Material and the PTIFS implementations are available at http://staff.ustc.edu.cn/~ynyang/PTIFS. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
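
The idea of replacing the empirical AUC by a sigmoid approximation can be illustrated as follows. The empirical AUC counts the fraction of (diseased, healthy) score pairs that are ordered correctly, which is a step function of the scores; swapping the step for a sigmoid gives a smooth criterion. The bandwidth `h`, the function names and the example scores are assumptions for illustration, not details taken from the PTIFS paper.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def empirical_auc(pos, neg):
    """Fraction of (pos, neg) pairs with pos > neg; ties count one half."""
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def smoothed_auc(pos, neg, h=0.1):
    """Sigmoid approximation: the indicator 1[p > q] becomes sigmoid((p - q) / h)."""
    total = sum(sigmoid((p - q) / h) for p in pos for q in neg)
    return total / (len(pos) * len(neg))

# Hypothetical classifier scores for diseased (pos) and healthy (neg) samples.
pos_scores = [3.0, 4.0, 5.0]
neg_scores = [0.0, 1.0, 2.0]
exact = empirical_auc(pos_scores, neg_scores)
smooth = smoothed_auc(pos_scores, neg_scores, h=0.1)
```

As `h` shrinks the smoothed value approaches the empirical AUC, while a larger `h` gives a criterion that is easier to optimize with gradient-style updates.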

13.
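Analysis of high-dimensional protein spectral cancer data has long been hampered by the dimensionality of the data. To address the problems that arise when reducing the dimensionality of such data, a feature extraction method based on wavelet analysis and principal component analysis is proposed, with a support vector machine used for classification after feature extraction. For 2-level wavelet decomposition of the 8-7-02 data set with the db1, db3, db4, db6, db8, db10 and haar wavelet bases, classification with the support vector machine reached accuracies of 98.18%, 98.35%, 98.04%, 98.36%, 97.89%, 97.96% and 98.20% respectively. The method further improves classification accuracy while also improving time efficiency.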
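
A 2-level decomposition with the haar basis, the simplest of the wavelets listed above, can be written in a few lines. This sketch uses the orthonormal Haar filters without any boundary handling, so the signal length must be divisible by 2 to the power of the number of levels; the toy signal is hypothetical, and the PCA and support vector machine stages are not shown.

```python
import math

def haar_step(signal):
    """One level of the orthonormal Haar transform: (approximation, detail)."""
    s = math.sqrt(2.0)
    pairs = range(0, len(signal) - 1, 2)
    approx = [(signal[i] + signal[i + 1]) / s for i in pairs]
    detail = [(signal[i] - signal[i + 1]) / s for i in pairs]
    return approx, detail

def haar_decompose(signal, levels=2):
    """Multi-level Haar decomposition.

    Returns the final approximation (low-frequency) coefficients and a list of
    detail (high-frequency) coefficient lists, finest level first.
    """
    if len(signal) % (2 ** levels) != 0:
        raise ValueError("signal length must be divisible by 2**levels")
    approx, details = list(signal), []
    for _ in range(levels):
        approx, d = haar_step(approx)
        details.append(d)
    return approx, details

# Hypothetical 8-point intensity trace.
trace = [4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0]
approx2, details = haar_decompose(trace, levels=2)
```

Either set of coefficients can then be fed to a dimensionality reduction step; because the transform is orthonormal, the total energy of the coefficients equals that of the input.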

14.
Early detection of cancer can greatly improve prognosis. Identification of proteins or peptides in the circulation, at different stages of cancer, would greatly enhance treatment decisions. Mass spectrometry (MS) is emerging as a powerful tool to identify proteins from complex mixtures such as plasma that may help identify novel sets of markers that may be associated with the presence of tumors. To examine this feature we have used a genetically modified mouse model, Apc(Min), which develops intestinal tumors with 100% penetrance. Utilizing liquid chromatography-tandem mass spectrometry (LC-MS/MS), we identified total plasma proteome (TPP) and plasma glycoproteome (PGP) profiles in tumor-bearing mice. Principal component analysis (PCA) and agglomerative hierarchical clustering analysis revealed that these protein profiles can be used to distinguish between tumor-bearing Apc(Min) and wild-type control mice. Leave-one-out cross-validation analysis established that global TPP and global PGP profiles can be used to correctly predict tumor-bearing animals in 17/19 (89%) and 19/19 (100%) of cases, respectively. Furthermore, leave-one-out cross-validation analysis confirmed that the significant differentially expressed proteins from both the TPP and the PGP were able to correctly predict tumor-bearing animals in 19/19 (100%) of cases. A subset of these proteins was independently validated by antibody microarrays using detection by two color rolling circle amplification (TC-RCA). Analysis of the significant differentially expressed proteins indicated that some might derive from the stroma or the host response. These studies suggest that mass spectrometry-based approaches to examine the plasma proteome may prove to be a valuable method for determining the presence of intestinal tumors.
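
Leave-one-out cross-validation, as used for the 17/19 and 19/19 figures above, holds out each sample in turn, fits on the rest, and counts correct predictions. The sketch below pairs it with a nearest-centroid classifier purely for illustration; the abstract does not state which classifier was used, and the feature vectors are hypothetical.

```python
def nearest_centroid_predict(train_x, train_y, x):
    """Predict the label whose class mean is closest to x (squared Euclidean)."""
    sums, counts = {}, {}
    for row, lab in zip(train_x, train_y):
        acc = sums.setdefault(lab, [0.0] * len(row))
        for d, v in enumerate(row):
            acc[d] += v
        counts[lab] = counts.get(lab, 0) + 1

    def dist2(lab):
        return sum((s / counts[lab] - v) ** 2 for s, v in zip(sums[lab], x))

    return min(sums, key=dist2)

def loocv_accuracy(xs, ys, predict=nearest_centroid_predict):
    """Fraction of samples predicted correctly when each is held out in turn."""
    correct = 0
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        if predict(train_x, train_y, xs[i]) == ys[i]:
            correct += 1
    return correct / len(xs)

# Hypothetical two-feature profiles for wild-type and tumor-bearing animals.
profiles = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
            [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]]
groups = ["wt", "wt", "wt", "tumor", "tumor", "tumor"]
accuracy = loocv_accuracy(profiles, groups)
```

With only 19 animals, each misclassified sample moves the reported accuracy by about 5 percentage points, which is worth keeping in mind when comparing 89% with 100%.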

15.
Proteomics is a promising approach for molecular understanding of neoplastic processes including response to treatment. Widely used 2D‐gel electrophoresis/Liquid chromatography coupled with mass spectrometry (LC‐MS) are time consuming and not cost effective. We have developed a high‐sensitivity (femto/subfemtomoles of protein/20 μl) High Performance Liquid Chromatography‐Laser Induced Fluorescence (HPLC‐LIF) instrument for studying protein profiles of biological samples. In this study, we have explored the feasibility of classifying breast tissues by multivariate analysis of chromatographic data. We have analyzed 13 normal, 17 malignant, 5 benign and 4 post‐treatment breast‐tissue homogenates. Data were analyzed by Principal Component Analysis (PCA) in both unsupervised and supervised modes on derivative and baseline‐corrected chromatograms. Our findings suggest that PCA of derivative chromatograms gives better classification. Thus, the HPLC‐LIF instrument is not only suitable for generation of chromatographic data using femto/subfemto moles of proteins but the data can also be used for objective diagnosis via multivariate analysis. Prospectively, identified fractions can be collected and analyzed by biochemical and/or MS methods. (© 2009 WILEY‐VCH Verlag GmbH & Co. KGaA, Weinheim)

16.
We investigated the potential use of gas chromatography mass spectrometry (GC-MS), in combination with multivariate statistical data processing, to build a model for the classification of various tuberculosis (TB) causing, and non-TB Mycobacterium species, on the basis of their characteristic metabolite profiles. A modified Bligh-Dyer extraction procedure was used to extract lipid components from Mycobacterium tuberculosis, Mycobacterium avium, Mycobacterium bovis, and Mycobacterium kansasii cultures. Principal component analysis (PCA) of the GC-MS generated data showed a clear differentiation between all the Mycobacterium species tested. Subsequently, the 12 compounds best describing the variation between the sample groups were identified as potential metabolite markers, using PCA and partial least-squares discriminant analysis (PLS-DA). These metabolite markers were then used to build a discriminant classification model based on Bayes' theorem, in conjunction with multivariate kernel density estimation. This model subsequently correctly classified 2 "unknown" samples for each of the Mycobacterium species analysed, with probabilities ranging from 72 to 100%. Furthermore, Mycobacterium species classification could be achieved in less than 16 h, and the detection limit for this approach was 1×10^3 bacteria mL^-1. This study proves the capacity of a GC-MS, metabolomics pattern recognition approach for its possible use in TB diagnostics and disease characterisation.

17.

Background  

The use of mass spectrometry as a proteomics tool is poised to revolutionize early disease diagnosis and biomarker identification. Unfortunately, before standard supervised classification algorithms can be employed, the "curse of dimensionality" needs to be solved. Due to the sheer amount of information contained within the mass spectra, most standard machine learning techniques cannot be directly applied. Instead, feature selection techniques are used to first reduce the dimensionality of the input space and thus enable the subsequent use of classification algorithms. This paper examines feature selection techniques for proteomic mass spectrometry.
