Similar literature
1.
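This article studies gastric cancer subtype classification based on microarray gene expression data. Because such data have few samples, high dimensionality, and heavy noise, dimensionality reduction is key to successful classification. The authors apply two dimensionality reduction methods, principal component analysis (PCA) and partial least squares (PLS), to gastric cancer subtype classification, using support vector machines (SVM) and k-nearest neighbors (KNN) as classifiers on two gastric cancer datasets. Classification performance is slightly better than that of traditional medical diagnosis, with a highest accuracy of 100%. The results show that PCA and PLS extract classification features effectively and greatly reduce the dimensionality of gene expression data while keeping classification accuracy high.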

2.
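Because gene expression data have many attributes and few samples, the Fisher classifier performs relatively poorly on them. This paper proposes Fisher-List, an improved Fisher algorithm. Its distinctive feature is that it determines a decision threshold for each class; each threshold contains both overall sample information and information from certain individual samples that are critical to classification. Experiments show that the new algorithm outperforms Fisher, LogitBoost, AdaBoost, k-nearest neighbors, decision trees, and support vector machines in classifying gene expression data.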

3.
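Visualization of gene expression profile data based on manifold learning   Total citations: 2, self-citations: 0, citations by others: 2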
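Visualizing gene expression profiles is essentially a problem of reducing the dimensionality of high-dimensional data. Manifold learning algorithms are used to reduce gene expression profiles for visualization, and the suitability of typical manifold learning algorithms (Isomap and LLE) for reducing expression profiles is discussed. Using within-class/between-class distance to evaluate the reduction quantitatively, two typical microarray datasets (a colon cancer gene expression dataset and an acute leukemia gene expression dataset) are analyzed. The intrinsic dimensionality of both datasets is found to be below 3, so manifold learning methods can visualize them in a low-dimensional projection space. Compared with projections from traditional dimensionality reduction methods (such as PCA and MDS), the Isomap manifold learning method gives better visualization.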

4.
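Protein mass spectrometry is an important research tool in proteomics and has been applied successfully in areas such as early cancer diagnosis, but the curse of dimensionality in protein mass spectrometry data makes dimensionality reduction a necessary step in spectral analysis. This paper first preprocesses the high-resolution SELDI-TOF ovarian mass spectrometry data provided by the U.S. National Cancer Institute. It then casts feature selection on the mass spectrometry data as a combinatorial optimization model solved by simulated annealing: the objective function is built from the classification error rate and sample posterior probabilities of linear discriminant analysis, the new-solution generator is built from a uniform distribution and a control parameter, and a memory function is added to the annealing process. Finally, 10-fold cross-validation is used to select training and test samples, and a linear discriminant analysis classifier evaluates the reduced data. Experiments show that when simulated annealing selects six or more features, the high-resolution SELDI-TOF ovarian mass spectrometry data are all classified correctly, indicating that simulated annealing is well suited to feature selection on protein mass spectrometry data.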
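
The annealing loop described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `cost` callable is hypothetical (in the paper it would combine the LDA error rate and posterior probabilities), the neighbour move is a uniform swap of one selected feature for one unselected feature, and the "memory" is simply the best subset seen so far. It assumes `subset_size < n_features`.

```python
import math
import random


def anneal_feature_subset(n_features, subset_size, cost, steps=2000,
                          t_start=1.0, cooling=0.995, seed=0):
    """Search for a low-cost feature subset by simulated annealing.

    cost: hypothetical callable mapping a sorted tuple of feature indices
    to a number to minimise (e.g. a cross-validated error rate).
    """
    rng = random.Random(seed)
    current = set(rng.sample(range(n_features), subset_size))
    current_cost = cost(tuple(sorted(current)))
    # "Memory": remember the best subset ever visited, not just the current one.
    best, best_cost = set(current), current_cost
    temperature = t_start
    for _ in range(steps):
        # Neighbour move: swap one selected feature for one unselected feature,
        # both drawn uniformly at random.
        out_feature = rng.choice(sorted(current))
        in_feature = rng.choice(sorted(set(range(n_features)) - current))
        candidate = (current - {out_feature}) | {in_feature}
        candidate_cost = cost(tuple(sorted(candidate)))
        delta = candidate_cost - current_cost
        # Metropolis rule: always accept improvements, sometimes accept worse moves.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current, current_cost = candidate, candidate_cost
            if current_cost < best_cost:
                best, best_cost = set(current), current_cost
        temperature *= cooling  # geometric cooling schedule (an assumption)
    return sorted(best), best_cost
```

With a toy cost that counts how many of three "informative" features are missing from the subset, the search settles on exactly those three features; a real use would plug in a classifier-based error instead.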

5.
孙远帅  陈垚  玄萍  江弋 《生物信息学》2013,11(3):161-166
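The development of gene chip technology has brought opportunities to bioinformatics and made cancer diagnosis at the level of gene expression possible. However, the high-dimensional, small-sample nature of gene chip data also challenges traditional machine learning methods. Using real gene expression data, this paper tests the main current classification and dimensionality reduction methods for cancer diagnosis. The experimental comparison shows that a support vector machine with a linear kernel effectively separates tumor from non-tumor gene expression, providing a reference for cancer diagnosis.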

6.
王蕊平  王年  苏亮亮  陈乐 《生物信息学》2011,9(2):164-166,170
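The existence of massive data is a major feature of the modern information society, and effectively selecting sample classification features from among thousands of genes is important for cancer diagnosis and treatment. Local non-negative matrix factorization is used to extract features from cancer gene expression profile data. The gene expression profile data are first screened; a local non-negative matrix is then constructed and factorized to obtain low-dimensional feature vectors that fully characterize the samples; finally, a support vector machine classifies the feature vectors. The results demonstrate the feasibility and effectiveness of the method.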

7.
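Application of genetic optimization algorithms to gene data classification   Total citations: 1, self-citations: 0, citations by others: 1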
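This paper proposes a feature extraction method for gene microarray data based on a genetic algorithm. The raw data are first standardized, analysis of variance is then used to reduce their dimensionality, and finally a genetic algorithm optimizes the data. The genetic operators and fitness function are configured for gene data, and feature genes are selected from the optimized dataset to obtain a small feature subset. To validate the selected features, a classifier is built by discriminant analysis using sample partitioning. Experiments show that the method achieves good classification results and that the algorithm is stable and efficient.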

8.
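Fast algorithm and data structures for constructing phylogenetic trees for microbial molecular classification   Total citations: 1, self-citations: 1, citations by others: 0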
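This paper introduces the algorithms and data structures involved in the NJ method (Neighbor-Joining Method) for constructing phylogenetic trees. It presents an algorithmic improvement based on data reuse, yielding a fast algorithm, the FNJ algorithm, which reduces the time complexity from O(N^5) to O(N^3). It also gives an algorithm for automatically drawing the evolutionary branching diagram.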
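
The abstract does not spell out the FNJ algorithm, so the following is only a sketch of textbook neighbor joining, showing one place where data reuse matters: the row sums are computed once per iteration and reused for every candidate pair, which keeps each iteration at O(N^2) and the whole run at O(N^3). The function name and the nested-label format for internal nodes are illustrative choices, not taken from the paper.

```python
def neighbor_joining(names, dist):
    """Classic neighbor joining; returns a list of (child_a, child_b, new_node) joins.

    names: list of taxon labels; dist: symmetric matrix (list of lists) of distances.
    """
    nodes = list(names)
    d = {(a, b): dist[i][j]
         for i, a in enumerate(names) for j, b in enumerate(names)}
    joins = []
    while len(nodes) > 2:
        n = len(nodes)
        # Row sums are computed once per iteration and reused for every pair.
        row_sum = {a: sum(d[a, b] for b in nodes) for a in nodes}
        # Pick the pair minimising Q(a, b) = (n - 2) * d(a, b) - r(a) - r(b).
        a, b = min(
            ((x, y) for i, x in enumerate(nodes) for y in nodes[i + 1:]),
            key=lambda p: (n - 2) * d[p] - row_sum[p[0]] - row_sum[p[1]],
        )
        new = f"({a},{b})"
        # Distance from the new internal node to every remaining node.
        for c in nodes:
            if c not in (a, b):
                d[new, c] = d[c, new] = 0.5 * (d[a, c] + d[b, c] - d[a, b])
        d[new, new] = 0.0
        nodes = [c for c in nodes if c not in (a, b)] + [new]
        joins.append((a, b, new))
    return joins
```

The function stops when two nodes remain, since joining them is the only remaining choice; branch lengths are omitted to keep the sketch short.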

9.
Extraction of human tumor informative genes based on SVM and mean impact value   Total citations: 1, self-citations: 0, citations by others: 1
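Selecting informative genes for tumor classification from gene expression profiles is an important means of discovering tumor-specific genes and exploring tumor gene expression patterns. Tumor diagnosis using classification information derived from gene expression profiles is an important research direction in bioinformatics and may become a fast and effective method of molecular tumor diagnosis in clinical medicine. Because tumor gene expression profile data are high-dimensional, small in sample size, and noisy, an algorithm is proposed that combines a support vector machine with the mean impact value to find tumor informative genes. Its advantage is that it can find multiple informative gene subsets that contain as few genes as possible while classifying as well as possible. The feasibility and effectiveness of the algorithm are verified on two-class tumor datasets; for the colon cancer sample set, only 3 genes are needed to reach 100% recognition accuracy under leave-one-out cross-validation. To avoid the influence of different partitions of the sample set on classification performance, full-fold cross-validation is further used to evaluate each informative gene subset and select more reliable subsets. Compared with other tumor classification methods, the experimental results show clear advantages in both the number of informative genes and classification performance.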
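
The abstract does not define the mean impact value (MIV). As commonly described, each input feature is perturbed up and down by a fixed fraction, and the resulting change in model output is averaged over the samples. The sketch below follows that description; the 10% perturbation, the `predict` callable, and the function name are all assumptions for illustration, not the authors' code.

```python
def mean_impact_values(predict, samples, delta=0.10):
    """Return one mean impact value (MIV) per feature.

    predict: hypothetical callable mapping a feature list to a numeric model
    output (for an SVM this could be the decision value).
    """
    n_features = len(samples[0])
    mivs = []
    for j in range(n_features):
        total = 0.0
        for x in samples:
            up = list(x)
            down = list(x)
            up[j] = x[j] * (1 + delta)    # feature j raised by delta
            down[j] = x[j] * (1 - delta)  # feature j lowered by delta
            total += predict(up) - predict(down)
        mivs.append(total / len(samples))
    return mivs
```

Features can then be ranked by the absolute value of their MIV, and the top-ranked ones treated as candidate informative genes.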

10.
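The gut microbiota is associated with many major human diseases, so studying gut microbiota data under different conditions is important. Because microbiota data are zero-inflated, they are normalized with the geometric mean of pairwise ratios (GMPR) method. Taking a type 2 diabetes dataset as an example, this study proposes an improved Spectrum algorithm. First, a feature-weighted similarity matrix is used so that the weight each feature value carries within its sample is not ignored. Second, the Laplacian matrix is replaced with a Hessian matrix to avoid the sensitivity problem of traditional spectral clustering, and the ISODATA clustering algorithm replaces the original K-means algorithm so that the number of cluster centers K is adjusted effectively. Experimental results show that, on type 2 diabetes, GMPR plus the improved Spectrum algorithm achieves a normalized mutual information (NMI) of 0.423, a Davies-Bouldin index (DBI) of 4.751, a Calinski-Harabasz index (CH) of 25.541, a Rand index (RI) of 0.835, and an adjusted Rand index (ARI) of 0.019, an improvement over the unmodified algorithm. The algorithm can also identify structural differences in the gut microbiota of different patient groups and uncover key bacteria in the gut microbiome.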
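
A minimal sketch of GMPR size factors, following the commonly described two-step definition: for each pair of samples, take the median count ratio over taxa that are non-zero in both, then take the geometric mean of those medians for each sample. The function name and input layout are assumptions, and the sketch assumes every sample has at least one non-zero count; it is not the code used in the study.

```python
import math
from statistics import median


def gmpr_size_factors(counts):
    """counts[i][k] is the count of taxon k in sample i; returns one size factor per sample."""
    n = len(counts)
    factors = []
    for i in range(n):
        log_ratios = []
        for j in range(n):
            # Use only taxa observed (non-zero) in both samples,
            # so zeros never enter a ratio.
            ratios = [ci / cj for ci, cj in zip(counts[i], counts[j])
                      if ci > 0 and cj > 0]
            if ratios:
                log_ratios.append(math.log(median(ratios)))
        # Geometric mean of the pairwise median ratios (computed in log space).
        factors.append(math.exp(sum(log_ratios) / len(log_ratios)))
    return factors
```

Dividing each sample's counts by its size factor gives the normalized abundances; a sample sequenced twice as deeply as another, with the same zero pattern, receives a size factor twice as large.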

11.
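The fitting accuracy and generalization ability of a support vector regression (SVR) model depend on the choice of its parameters, and parameter selection is essentially an optimization search. Exploiting the efficiency of the heuristic breadth-first search (HBFS) algorithm on optimization problems, this work proposes an SVR parameter selection method that minimizes k-fold cross-validation error, with HBFS as the search strategy. Simulation experiments on 3 benchmark datasets show that the method greatly shortens training and modeling time while preserving prediction accuracy, offering a new and effective solution for SVR parameter selection on large samples.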
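
The objective being minimized here, k-fold cross-validation error as a function of a model parameter, can be sketched as below. Two substitutions are made and should be read as assumptions: the search is a plain exhaustive scan over candidates rather than HBFS, and `fit` is a hypothetical callable standing in for SVR training (it returns a prediction function).

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds."""
    fold_sizes = [n_samples // k + (1 if f < n_samples % k else 0) for f in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, test
        start += size


def cv_error(fit, xs, ys, param, k=5):
    """Mean squared error of fit(train_x, train_y, param) averaged over k folds."""
    total, count = 0.0, 0
    for train, test in k_fold_indices(len(xs), k):
        model = fit([xs[i] for i in train], [ys[i] for i in train], param)
        for i in test:
            total += (model(xs[i]) - ys[i]) ** 2
            count += 1
    return total / count


def select_param(fit, xs, ys, candidates, k=5):
    """Exhaustive stand-in for HBFS: return the candidate with the lowest CV error."""
    return min(candidates, key=lambda p: cv_error(fit, xs, ys, p, k))
```

A smarter search strategy such as HBFS would replace only `select_param`; the cross-validation objective stays the same.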

12.
The recent explosion in procurement and availability of high-dimensional gene- and protein-expression profile datasets for cancer diagnostics has necessitated the development of sophisticated machine learning tools with which to analyze them. A major limitation in the ability to accurately classify these high-dimensional datasets stems from the 'curse of dimensionality', occurring in situations where the number of genes or peptides significantly exceeds the total number of patient samples. Previous attempts at dealing with this issue have mostly centered on the use of a dimensionality reduction (DR) scheme, Principal Component Analysis (PCA), to obtain a low-dimensional projection of the high-dimensional data. However, linear PCA and other linear DR methods, which rely on Euclidean distances to estimate object similarity, do not account for the inherent underlying nonlinear structure associated with most biomedical data. The motivation behind this work is to identify the appropriate DR methods for analysis of high-dimensional gene- and protein-expression studies. Towards this end, we empirically and rigorously compare three nonlinear (Isomap, Locally Linear Embedding, Laplacian Eigenmaps) and three linear DR schemes (PCA, Linear Discriminant Analysis, Multidimensional Scaling) with the intent of determining a reduced subspace representation in which the individual object classes are more easily discriminable.

13.
《Genomics》2020,112(2):1173-1181
Gene selection is the process of selecting the optimal feature subset in an arbitrary dataset. Gene selection matters most in high-dimensional datasets, in which the number of samples is low and the number of features is high. The major goals of gene selection are increasing the accuracy, finding the minimal effective feature subset, and increasing the performance of evaluations. This paper proposes two heuristic methods for gene selection, namely Xvariance and Mutual Congestion. Xvariance tries to classify labels using internal attributes of features, whereas Mutual Congestion is frequency based. The proposed methods have been evaluated on eight binary medical datasets. Results reveal that Xvariance works well with standard datasets, whereas Mutual Congestion improves the accuracy of high-dimensional datasets considerably.

14.
Multi-atlas segmentation has been widely used to segment various anatomical structures. The success of this technique partly relies on the selection of atlases that are best mapped to a new target image after registration. Recently, manifold learning has been proposed as a method for atlas selection. Each manifold learning technique seeks to optimize a unique objective function. Therefore, different techniques produce different embeddings even when applied to the same data set. Previous studies used a single technique in their method and gave neither a reason for the choice of the manifold learning technique employed nor the theoretical grounds for the choice of the manifold parameters. In this study, we compare side by side the results given by 3 manifold learning techniques (Isomap, Laplacian Eigenmaps and Locally Linear Embedding) on the same data set. We assess the ability of those 3 different techniques to select the best atlases to combine in the framework of multi-atlas segmentation. First, a leave-one-out experiment is used to optimize our method on a set of 110 manually segmented atlases of hippocampi and find the manifold learning technique and associated manifold parameters that give the best segmentation accuracy. Then, the optimal parameters are used to automatically segment 30 subjects from the Alzheimer's Disease Neuroimaging Initiative (ADNI). For our dataset, the selection of atlases with Locally Linear Embedding gives the best results. Our findings show that selection of atlases with manifold learning leads to segmentation accuracy close to or significantly higher than the state-of-the-art method and that accuracy can be increased by fine tuning the manifold learning process.

15.

Background  

Classification studies using gene expression datasets are usually based on small numbers of samples and tens of thousands of genes. Selecting the genes that are important for distinguishing the sample classes being compared poses a challenging problem in high-dimensional data analysis. We describe a new procedure for selecting significant genes, recursive cluster elimination (RCE), as an alternative to recursive feature elimination (RFE). We have tested this algorithm on six datasets and compared its performance with that of two related classification procedures with RFE.

16.
The function of a protein is closely correlated with its subcellular localization. Probing into the mechanism of protein sorting and predicting protein subcellular location can provide important clues or insights for understanding the function of proteins. In this paper, we introduce a new PseAAC approach to encode the protein sequence based on the physicochemical properties of amino acid residues. Each protein sample was defined as a 146-dimensional (146D) vector comprising the 20 amino acid composition components and 126 adjacent triune residue contents. To evaluate the effectiveness of this encoding scheme, we ran jackknife tests on three datasets using the support vector machine algorithm. The total prediction accuracies are 84.9%, 91.2%, and 92.6%, respectively. The satisfactory results indicate that our method could be a useful tool in the area of bioinformatics and proteomics.

17.
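Mealiness in apples refers to a series of physical and physiological changes such as softening flesh and reduced juice. Hyperspectral scattering imaging combined with a sparse representation classification algorithm (SRSA) was used to study apple mealiness classification. First, the mean reflectance algorithm (MEAN) extracted hyperspectral scattering image features over 600-1000 nm, and a genetic algorithm (GA) was introduced to address the imbalance of classification samples. On this basis, apple mealiness classification was recast as the problem of finding the sparse representation of a sample to be identified over the whole training set. Simulation results show a classification accuracy of 79.8% for the sparse representation classification algorithm, higher than the 74.8% of partial least squares discriminant analysis (PLSDA), providing a new and effective method for classifying apple mealiness.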

18.
FastJoin, an improved neighbor-joining algorithm   Total citations: 1, self-citations: 0, citations by others: 1
Reconstructing the evolutionary history of a set of species is an elementary problem in biology, and methods for solving this problem are evaluated based on two characteristics: accuracy and efficiency. Neighbor-joining reconstructs phylogenetic trees by iteratively picking a pair of nodes to merge as a new node until only one node remains; due to its good accuracy and speed, it has been embraced by the phylogeny research community. With the advent of large amounts of data, improved fast and precise methods for reconstructing evolutionary trees have become necessary. We improved the neighbor-joining algorithm by iteratively picking two pairs of nodes and merging them as two new nodes, until only one node remains. We found that, in each iteration of the clustering procedure for a purely additive tree, another pair of true neighbors could be chosen to merge as a new node besides the pair of true neighbors chosen by the neighbor-joining criterion. These new neighbors would otherwise be selected by a later iteration of the neighbor-joining method, so the improved algorithm, which iteratively picks two pairs of nodes to merge as two new nodes until only one node remains, constructs the same phylogenetic tree as the neighbor-joining algorithm for the same input data. By combining the improved neighbor-joining algorithm with the upper-bound computation optimization of RapidNJ and the external storage of ERapidNJ, a new method of reconstructing phylogenetic trees, FastJoin, was proposed. Experiments with sets of data showed that this new neighbor-joining algorithm yields a significant speed-up compared to classic neighbor-joining, showing empirically that FastJoin is superior to almost all other neighbor-joining implementations.

19.
Wang X 《Genomics》2012,99(2):90-95
Two-gene classifiers have attracted broad interest for their simplicity and practicality. Most existing two-gene classification algorithms involve exhaustive search, which leads to low time efficiency. In this study, we proposed two new two-gene classification algorithms that use a simple univariate gene selection strategy and construct simple classification rules based on optimal cut-points for the two genes selected. We detected the optimal cut-point with the information entropy principle. We applied the two-gene classification models to eleven cancer gene expression datasets and compared their classification performance to that of some established two-gene classification models, such as the top-scoring pairs model and the greedy pairs model, as well as standard methods including Diagonal Linear Discriminant Analysis, k-Nearest Neighbor, Support Vector Machine and Random Forest. These comparisons indicated that the performance of our two-gene classifiers was comparable to or better than that of the compared models.
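
The abstract only says that cut-points are found "with the information entropy principle". One common reading, assumed here, is to choose the threshold on a gene's expression values that minimizes the size-weighted class entropy of the two resulting groups. The function names below are illustrative, and this is a sketch of that general idea rather than the paper's exact rule.

```python
import math


def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    total = len(labels)
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def best_cut_point(values, labels):
    """Return (cut, weighted_entropy) minimising class entropy after the split."""
    pairs = sorted(zip(values, labels))
    best = (None, float("inf"))
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold can separate equal expression values
        cut = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        score = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if score < best[1]:
            best = (cut, score)
    return best
```

Applying `best_cut_point` to each of the two selected genes gives two thresholds, from which a simple rule (for example, "tumor if gene A is above its cut and gene B is below its cut") could be built.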

20.
Linear mixed effects models are widely used to analyze a clustered response variable. Motivated by a recent study to examine and compare the hospital length of stay (LOS) between patients undergoing percutaneous coronary intervention (PCI) and coronary artery bypass graft (CABG) from several international clinical trials, we proposed a bivariate linear mixed effects model for the joint modeling of clustered PCI and CABG LOSs, where each clinical trial is considered a cluster. Due to the large number of patients in some trials, commonly used commercial statistical software for fitting (bivariate) linear mixed models failed to run, since it could not allocate enough memory to invert large dimensional matrices during the optimization process. We consider ways to circumvent the computational problem in maximum likelihood (ML) inference and restricted maximum likelihood (REML) inference. In particular, we developed an expectation-maximization (EM) algorithm for REML inference and presented an ML implementation using existing software. The new REML EM algorithm is easy to implement and computationally stable and efficient. With this REML EM algorithm, we could analyze the LOS data and obtain meaningful results.
