Similar literature
20 similar articles found (search time: 583 ms)
1.
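An improved genetic algorithm (GA) automatically optimizes the parameters of a support vector machine (SVM) while simultaneously selecting the optimal feature subset. A novel grouped multi-gene crossover technique preserves the information held within gene groups and allows offspring to inherit more genetic information from the parent chromosomes. The algorithm promotes the exchange of information among high-quality chromosomes in the feasible solution set and improves the ability to search the solution space. Experimental results show that the improved GA-SVM not only identifies important disease-related feature variables and optimizes the SVM parameters, but also improves classification performance. A comparison with two other learning algorithms, a feedforward BP neural network and an adaptive fuzzy inference system, shows that the improved GA-SVM performs better.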
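
The abstract does not specify how the grouped multi-gene crossover works, so the following is only a minimal sketch of one plausible reading: genes are partitioned into groups, and each group is exchanged between the two parents as a unit so that the information inside a group stays intact. The chromosome layout (bits for the SVM parameters C and gamma, then a feature mask), the 0.5 swap probability, and the name `grouped_crossover` are all assumptions made for illustration.

```python
import random


def grouped_crossover(parent_a, parent_b, group_sizes, rng):
    """Swap whole gene groups between two parents with probability 0.5 each."""
    assert len(parent_a) == len(parent_b) == sum(group_sizes)
    child_a, child_b = [], []
    start = 0
    for size in group_sizes:
        group_a = parent_a[start:start + size]
        group_b = parent_b[start:start + size]
        if rng.random() < 0.5:  # exchange this group as a unit
            group_a, group_b = group_b, group_a
        child_a.extend(group_a)
        child_b.extend(group_b)
        start += size
    return child_a, child_b


# Hypothetical layout: 4 bits for C, 4 bits for gamma, 6 feature-mask bits.
groups = [4, 4, 6]
a = [1] * 14
b = [0] * 14
child_a, child_b = grouped_crossover(a, b, groups, random.Random(0))
print(child_a)
print(child_b)
```

Because groups are never split, a parameter encoding or a block of related feature bits is always inherited whole from one parent.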

2.
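Prediction of membrane protein folding types based on a fuzzy support vector machine   Total citations: 1 (self-citations: 0, citations by others: 1)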
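Existing methods that use support vector machines (SVMs) to predict membrane protein folding types do not make full use of protein sequence features, and they leave unclassifiable regions when handling multi-class protein classification. To address these two problems, the amino acid composition and dipeptide composition of each protein sequence are extracted, weighted multi-order correlation coefficients of amino acid residue indices are computed, and the three types of features are fused into the input feature vector of the classifier. A fuzzy SVM (FSVM) algorithm is then used to classify the data that a conventional SVM cannot separate. Tests on a non-redundant dataset show that, with the same classification algorithm, the improved feature extraction method outperforms existing feature extraction methods, and that, with the same feature extraction method, FSVM outperforms the conventional SVM. The classification strategy that combines the two reaches a prediction accuracy of 96.6% on an independent test dataset, which is better than several existing prediction methods, and it can serve as an effective tool for predicting the folding types of membrane proteins and other proteins.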
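
Two of the three feature types named above, amino acid composition and dipeptide composition, have standard definitions and can be sketched briefly. The function names and the toy sequence are illustrative assumptions; the weighted residue-index correlation features and the fuzzy SVM itself are not shown because the abstract gives no details.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def amino_acid_composition(seq):
    """Fraction of each of the 20 standard residues in seq (20 values)."""
    seq = seq.upper()
    return [seq.count(a) / len(seq) for a in AMINO_ACIDS]


def dipeptide_composition(seq):
    """Fraction of each of the 400 ordered residue pairs among adjacent pairs."""
    seq = seq.upper()
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    counts = {a + b: 0 for a, b in product(AMINO_ACIDS, repeat=2)}
    for pair in pairs:
        if pair in counts:  # skip pairs containing non-standard symbols
            counts[pair] += 1
    return [counts[a + b] / len(pairs) for a, b in product(AMINO_ACIDS, repeat=2)]


seq = "MKTAYIAKQR"  # toy sequence, not taken from the paper
features = amino_acid_composition(seq) + dipeptide_composition(seq)
print(len(features))
```

Concatenating the two vectors gives 20 + 400 values per sequence; the sketch assumes sequences of at least two residues.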

3.
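Objective: Different patients may respond differently to the same anticancer drug, and understanding these differences is of great value for precision cancer medicine. Methods: High-throughput sequencing data provide strong support for building classification models that predict anticancer drug response. For two classic datasets, the Cancer Cell Line Encyclopedia (CCLE) and Genomics of Drug Sensitivity in Cancer (GDSC), this paper proposes mRMR-SVM, a computational model based on the max-relevance min-redundancy (mRMR) algorithm and the support vector machine (SVM). Using gene expression data, feature genes are extracted by variance ranking followed by the mRMR algorithm, and an SVM performs the binary "sensitive-inhibited" classification of drug response in cell lines. Results: For the 22 drugs in CCLE, mRMR-SVM reached an average accuracy of 0.904; for the 11 drugs in GDSC, the average accuracy was 0.851. Conclusion: mRMR-SVM not only outperforms the conventional SVM, random forest, deep response forest, deep neural network, and cell line-drug complex network models in predictive performance, but also generalizes well, giving satisfactory results for anticancer drug response classification in three specific tissue types. In addition, mRMR-SVM can identify biomarkers closely related to cancer onset and progression.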
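
The greedy selection at the heart of mRMR can be sketched as follows. mRMR is usually defined with mutual information; this sketch substitutes absolute Pearson correlation for both relevance and redundancy to stay dependency-free, which is an assumption and not necessarily what the paper does. The gene names and values are toy data.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def mrmr(features, labels, k):
    """Greedily pick k feature names maximizing relevance minus mean redundancy."""
    relevance = {name: abs(pearson(vals, labels)) for name, vals in features.items()}
    selected, remaining = [], set(features)
    while remaining and len(selected) < k:
        def score(name):
            if not selected:
                return relevance[name]
            redundancy = sum(abs(pearson(features[name], features[s]))
                             for s in selected) / len(selected)
            return relevance[name] - redundancy
        best = max(sorted(remaining), key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


labels = [0, 0, 0, 1, 1, 1]
genes = {
    "g1": [1.0, 2.0, 1.5, 5.0, 6.0, 5.5],  # strongly related to the label
    "g2": [1.1, 2.1, 1.4, 5.2, 5.9, 5.4],  # near-duplicate of g1
    "g3": [3.0, 1.0, 2.0, 2.5, 4.0, 3.5],  # weaker but less redundant
}
print(mrmr(genes, labels, 2))
```

In the toy data the second pick skips the near-duplicate gene in favor of the less redundant one, which is the behavior that distinguishes mRMR from plain relevance ranking.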

4.
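Early diagnosis significantly improves the survival of cancer patients, and this is especially evident in hepatocellular carcinoma. Machine learning is an effective tool for cancer classification. Selecting a low-dimensional feature subset with high classification accuracy from complex, high-dimensional cancer datasets is a difficult problem in cancer classification. This paper proposes a two-stage feature selection method, SC-BPSO. A new filter method, the SC filter, is designed by combining the Spearman correlation coefficient and the chi-square test of independence as the filter's evaluation function; the SC filter is then combined with a wrapper method based on binary particle swarm optimization (BPSO) to achieve two-stage feature selection. The method is applied to cancer classification on high-dimensional data to distinguish normal samples from hepatocellular carcinoma samples. First, 130 liver tissue microRNA sequencing datasets (64 hepatocellular carcinoma, 66 normal liver tissue) from the US National Center for Biotechnology Information (NCBI) and the European Bioinformatics Institute (EBI) are preprocessed, and the MiRME algorithm is used to extract three types of features from the raw sequence files: microRNA expression level, editing level, and post-editing expression level. Next, the parameters of SC-BPSO are tuned for the hepatocellular carcinoma classification scenario and a key feature subset is selected. Finally, classification models are built, and the results are compared with the feature subsets selected by an information gain filter, an information gain ratio filter, and a BPSO wrapper, using four classifiers (random forest, support vector machine, decision tree, and KNN) with identical parameters. The feature subset selected by SC-BPSO reaches a classification accuracy of 98.4%. The results show that, compared with the other three feature selection algorithms, SC-BPSO effectively finds feature subsets that are smaller and more accurate. This may be of considerable significance for cancer classification on high-dimensional data with few samples.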

5.
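Hepatitis B virus (HBV) reactivation is a common complication in patients with primary liver cancer (PLC) after precise radiotherapy, and timely prediction and prevention can reduce morbidity and mortality. Studies have shown that redundant feature variables reduce the accuracy of HBV reactivation prediction. A feature selection method based on neighborhood component analysis (NCA) is proposed to identify the risk factors and feature combinations for HBV reactivation. Support vector machine (SVM) models, before and after Bayesian optimization, are then built to classify these key feature subsets and the initial feature set. The experimental results show that HBV DNA level, KPS score, fractionation scheme, outer margin, V25, TNM tumor stage, and Child-Pugh are all risk factors for HBV reactivation. Among them, V25, found after NCA feature selection, is a risk factor proposed for the first time in the study of HBV reactivation. Under 10-fold cross-validation, the feature combination of HBV DNA level, outer margin, and V25 reaches a prediction accuracy of 86.11%. The SVM classifier is well suited to the study of HBV reactivation, and the key feature combination obtained after feature selection has better classification performance.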
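
The 10-fold cross-validation used to report accuracy can be sketched generically. The fold assignment (interleaved indices), the `fit_predict` callback, and the majority-class baseline below are illustrative assumptions; the paper's SVM and its data are not reproduced, and the sketch assumes at least `k` samples.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train, test) index lists for k-fold cross-validation."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def cross_validated_accuracy(labels, fit_predict, k=10):
    """Mean per-fold accuracy; fit_predict(train, test) returns predicted labels."""
    accuracies = []
    for train, test in k_fold_indices(len(labels), k):
        predicted = fit_predict(train, test)
        hits = sum(p == labels[i] for p, i in zip(predicted, test))
        accuracies.append(hits / len(test))
    return sum(accuracies) / len(accuracies)


# Toy labels and a majority-class baseline standing in for a real classifier.
labels = [0] * 15 + [1] * 5


def majority_baseline(train, test):
    train_labels = [labels[i] for i in train]
    majority = max(set(train_labels), key=train_labels.count)
    return [majority] * len(test)


print(cross_validated_accuracy(labels, majority_baseline))
```

Any classifier can be plugged in through `fit_predict`, as long as it is trained only on the `train` indices of each fold.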

6.
Random forests: an important method for selecting tumor feature genes   Total citations: 2 (self-citations: 0, citations by others: 2)
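Feature selection techniques have been widely applied in bioinformatics, and random forests (RF) are one important feature selection method. RF was used to select feature genes from five gene expression profile datasets, including gastric, colon, and lung cancer. The selected genes were combined with a support vector machine (SVM) to classify the original datasets, and the feature gene selection and classification results were given a preliminary analysis. Significance analysis of microarrays (SAM) and ReliefF were also compared with RF. The results show that the feature genes selected by random forests contain more classification information and give higher classification accuracy. Together with the method's own advantages in classification, random forests can be widely used as a reliable means of analyzing gene expression profile data.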

7.
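[Objective] Phytophagous scarab beetles are important agricultural and forest pests in China. This study explores a fast and accurate new method for identifying phytophagous scarabs, laying a research foundation for extending the method to the identification of other Coleoptera. [Methods] Near-infrared spectroscopy was used to identify scarabs, and a support vector machine (SVM) algorithm was proposed to analyze the near-infrared spectra of 15 species of phytophagous scarabs. After noisy bands were removed, the spectra were preprocessed by smoothing, differentiation, and standardization. A total of 150 scarab specimens were selected. For different taxonomic ranks and taxa, 66% of the sample spectra were used as the calibration set to build SVM identification models, which were then self-validated, and the remaining spectra were used as the prediction set to validate the models. [Results] Self-validation showed that, in the identification model for four subfamilies of Scarabaeidae, the correct identification rate for Melolonthinae was 86% and that for the other samples exceeded 95%. In the models for different genera within a subfamily and different species within a genus, the identification accuracy was 100% for all samples except Protaetia cathaica (Bates). Validation on the prediction set showed that, across the models for different taxonomic ranks and taxa, Polyphylla laticollis Lewis was not correctly identified because of its small sample size, while the identification accuracy for the other samples was 100%. The overall results were satisfactory, indicating good model performance. [Conclusion] Models built from previously identified scarabs can identify most samples well, and the identification model for phytophagous scarabs obtained by combining near-infrared spectral scanning with SVM generalizes well.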

8.
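Objective: To classify ventricular fibrillation (VF) and non-VF signals and thereby detect VF. Methods: This paper introduces a new algorithm based on a support vector machine (SVM) and an improved threshold crossing interval (TCI) algorithm; SVMs have considerable advantages in classification and pattern recognition. The algorithm uses a 4 s sliding window and extracts ECG features with the improved TCI method. It works as follows: within each sliding window, an improved absolute-value threshold is applied and the mean threshold crossing interval over the middle 2 s is computed. This TCI value is used as the feature parameter and fed into a pre-designed binary SVM to perform the classification. Results: VF signals were detected successfully. Sensitivity, precision, predictivity, and accuracy were computed and compared with other methods, showing that the overall reliability of the new algorithm is better than that of the others. Conclusion: The algorithm enables real-time monitoring of VF signals, is simple and easy to implement, and is well suited to real-time ECG monitoring and defibrillators.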
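
The windowed feature described above can be illustrated with a simplified sketch. It is not the paper's improved TCI: the threshold rule (a fixed fraction `ratio` of the window's peak absolute amplitude), the way crossings are counted, and the synthetic sine signals are all assumptions chosen to show the general idea of a mean crossing interval computed over the middle 2 s of a 4 s window.

```python
import math


def mean_crossing_interval(window, fs, ratio=0.2):
    """Mean time (s) between upward crossings of an absolute threshold,
    measured over the middle half of the window."""
    threshold = ratio * max(abs(v) for v in window)
    n = len(window)
    start, stop = n // 4, 3 * n // 4  # middle 2 s of a 4 s window
    crossings = 0
    for i in range(start + 1, stop):
        if abs(window[i - 1]) < threshold <= abs(window[i]):
            crossings += 1
    duration = (stop - start) / fs
    return duration / crossings if crossings else float("inf")


fs = 250  # assumed sampling rate in Hz
fast = [math.sin(2 * math.pi * 5 * i / fs) for i in range(4 * fs)]  # 5 Hz toy signal
slow = [math.sin(2 * math.pi * 1 * i / fs) for i in range(4 * fs)]  # 1 Hz toy signal
print(mean_crossing_interval(fast, fs), mean_crossing_interval(slow, fs))
```

In a full pipeline this value would be computed for each sliding window and passed as the feature to the binary classifier.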

9.
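Study of tumor-specific gene expression patterns based on gene expression profiles   Total citations: 1 (self-citations: 1, citations by others: 0)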
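Based on tumor gene expression profiles and using bioinformatics methods, this work starts from the classification of tumor and normal tissue samples to analyze the discovery of tumor-specific expressed genes and their expression patterns, and then discusses the characteristics of tumors at the level of gene expression. First, after analyzing the characteristics of tumor gene expression profiles, a strategy based on the Relief algorithm is proposed for selecting feature genes for sample classification. Next, a support vector machine is used as the classifier to identify sample types, feature genes are selected using the classification error rate as the criterion, and tissue-specific expressed genes that merely reflect the tissue composition of tumor versus normal samples are excluded in order to highlight the true class characteristics of the tumor samples. Finally, statistical methods are used to demonstrate, from an informatics perspective, that the specific expression of the classification feature genes in tumor tissue is real and widespread, and the specific expression patterns of these genes in tumor tissue are analyzed.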

10.
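This study investigates the risk features and classification prediction models for hepatitis B virus (HBV) reactivation in patients with primary liver cancer after precise radiotherapy. A feature selection method based on a genetic algorithm is proposed to select the optimal feature subset for HBV reactivation from the initial feature set of primary liver cancer data. Bayesian and support vector machine classification models for HBV reactivation are built, and the classification performance of the optimal feature subset and the initial feature set is evaluated. The experimental results show that genetic-algorithm-based feature selection improves HBV reactivation classification, and the optimal feature subset performs markedly better than the initial feature set. The optimal feature subset affecting HBV reactivation comprises HBV DNA level, TNM tumor stage, Child-Pugh, outer margin, and maximum whole-liver dose. The classification accuracy reaches up to 82.89% for the Bayesian classifier and up to 83.34% for the support vector machine.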

11.
Mechanisms through which tissues are formed and maintained remain unknown but are fundamental aspects in biology. Tissue-specific gene expression is a valuable tool to study such mechanisms. But in many biomedical studies, cell lines, rather than human body tissues, are used to investigate biological mechanisms. Whether or not cell lines maintain their tissue-specific characteristics after they are isolated and cultured outside the human body remains to be explored. In this study, we applied a novel computational method to identify core genes that contribute to the differentiation of cell lines from various tissues. Several advanced computational techniques, such as the Monte Carlo feature selection method, the incremental feature selection method, and the support vector machine (SVM) algorithm, were incorporated in the proposed method, which extensively analyzed the gene expression profiles of cell lines from different tissues. As a result, we extracted a group of functional genes that can indicate the differences of cell lines in different tissues and built an optimal SVM classifier for identifying cell lines in different tissues. In addition, a set of rules for classifying cell lines was also reported, which can give a clearer picture of cell lines in different tissues although its performance was not better than the optimal SVM classifier. Finally, we compared such genes with the tissue-specific genes identified by the Genotype-Tissue Expression project. Results showed that most expression patterns between tissues remained in the derived cell lines despite some uniqueness that some genes show tissue specificity.
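
Incremental feature selection, as named above, evaluates a classifier on the top 1, top 2, ..., top n features of a ranked list and keeps the best-scoring prefix. The sketch below assumes the ranking is already available (the Monte Carlo step is not shown) and swaps the SVM for a leave-one-out nearest-centroid classifier so that it runs without external libraries; both function names and the toy data are illustrative.

```python
def loo_accuracy(rows, labels, cols):
    """Leave-one-out accuracy of a nearest-centroid classifier on the given columns."""
    correct = 0
    for i, row in enumerate(rows):
        centroids = {}
        for y in set(labels):
            members = [r for j, (r, l) in enumerate(zip(rows, labels))
                       if l == y and j != i]
            centroids[y] = [sum(m[c] for m in members) / len(members) for c in cols]

        def dist(y):
            return sum((row[c] - v) ** 2 for c, v in zip(cols, centroids[y]))

        if min(sorted(centroids), key=dist) == labels[i]:
            correct += 1
    return correct / len(rows)


def incremental_feature_selection(rows, labels, ranked_cols, evaluate=loo_accuracy):
    """Return the shortest ranked prefix with the best evaluation score."""
    best_cols, best_score = [], -1.0
    for k in range(1, len(ranked_cols) + 1):
        cols = ranked_cols[:k]
        score = evaluate(rows, labels, cols)
        if score > best_score:  # strict: ties keep the smaller subset
            best_cols, best_score = cols, score
    return best_cols, best_score


rows = [[1.0, 5.0, 2.0], [1.2, 1.0, 9.0], [0.9, 7.0, 4.0],
        [3.0, 2.0, 3.0], [3.1, 6.0, 8.0], [2.9, 4.0, 1.0]]
labels = [0, 0, 0, 1, 1, 1]
print(incremental_feature_selection(rows, labels, ranked_cols=[0, 1, 2]))
```

Passing a different `evaluate` callback, for example a cross-validated SVM, would bring the sketch closer to the setup described in the abstract.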

12.
Microarray data provide information on the expression levels of thousands of genes in a cell in a single experiment. Numerous efforts have been made to use gene expression profiles to improve the precision of tumor classification. In the present study we used the benchmark colon cancer dataset for analysis. Feature selection is done using the t-statistic. A comparative study of the class prediction accuracy of three different classifiers, namely support vector machine (SVM), neural nets, and logistic regression, was performed using the top 10 genes ranked by the t-statistic. SVM turned out to be the best classifier for this dataset based on the area under the receiver operating characteristic curve (AUC) and total accuracy. Logistic regression ranks as the next best classifier, followed by the multilayer perceptron (MLP). The top 10 genes selected by us for classification are all well documented for their variable expression in colon cancer. We conclude that SVM together with t-statistic-based feature selection is an efficient and viable alternative to popular techniques.
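
Ranking genes by a two-sample t-statistic and keeping the top k can be sketched as below. The abstract does not say which form of the t-statistic was used; Welch's unequal-variance form is assumed here, and the gene names and expression values are toy data.

```python
import math
from statistics import mean, variance


def t_statistic(a, b):
    """Welch's two-sample t-statistic for two lists of expression values."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se


def top_genes(expr, labels, k):
    """Return the k gene names with the largest absolute t-statistic."""
    scores = {}
    for gene, values in expr.items():
        tumor = [v for v, y in zip(values, labels) if y == 1]
        normal = [v for v, y in zip(values, labels) if y == 0]
        scores[gene] = abs(t_statistic(tumor, normal))
    return sorted(scores, key=scores.get, reverse=True)[:k]


labels = [1, 1, 1, 0, 0, 0]
expr = {
    "geneA": [9.0, 9.5, 10.0, 1.0, 1.5, 2.0],  # large shift between classes
    "geneB": [5.0, 6.0, 4.0, 5.0, 4.0, 6.0],   # no shift
    "geneC": [6.0, 7.0, 8.0, 4.0, 5.0, 6.0],   # moderate shift
}
print(top_genes(expr, labels, 2))
```

The selected genes would then be the only inputs passed to the classifiers being compared.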

13.
Genomics, 2020, 112(3): 2524-2534
The development of embryonic cells involves several continuous stages, and some genes are related to embryogenesis. To date, few studies have systematically investigated changes in gene expression profiles during mammalian embryogenesis. In this study, a computational analysis using machine learning algorithms was performed on the gene expression profiles of mouse embryonic cells at seven stages. First, the profiles were analyzed through a powerful Monte Carlo feature selection method for the generation of a feature list. Second, incremental feature selection was applied on the list by incorporating two classification algorithms: support vector machine (SVM) and repeated incremental pruning to produce error reduction (RIPPER). Through SVM, we extracted several latent gene biomarkers, indicating the stages of embryonic cells, and constructed an optimal SVM classifier that produced a nearly perfect classification of embryonic cells. Furthermore, some interesting rules were obtained with the RIPPER algorithm, suggesting different expression patterns for different stages.

14.
Ovarian cancer recurs at the rate of 75% within a few months or several years after therapy. Early recurrence, though responding better to treatment, is difficult to detect. Surface-enhanced laser desorption/ionization time-of-flight (SELDI-TOF) mass spectrometry has shown the potential to accurately identify disease biomarkers to help early diagnosis. A major challenge in the interpretation of SELDI-TOF data is the high dimensionality of the feature space. To tackle this problem, we have developed a multi-step data processing method composed of t-test, binning, and backward feature selection. A new algorithm, support vector machine-Markov blanket/recursive feature elimination (SVM-MB/RFE), is presented for the backward feature selection. This method is an integration of minimum weight feature elimination by SVM-RFE and information-theory-based redundant/irrelevant feature removal by Markov blanket. Subsequently, SVM was used for classification. We applied the biomarker selection algorithm to 113 serum samples to identify early relapse in ovarian cancer patients after primary therapy. To validate the performance of the proposed algorithm, experiments were carried out in comparison with several other feature selection and classification algorithms.

15.
Microarrays have thousands to tens-of-thousands of gene features, but only a few hundred patient samples are available. The fundamental problem in microarray data analysis is identifying genes whose disruption causes congenital or acquired disease in humans. In this paper, we propose a new evolutionary method that can efficiently select a subset of potentially informative genes for support vector machine (SVM) classifiers. The proposed evolutionary method uses SVM with a given subset of gene features to evaluate the fitness function, and new subsets of features are selected based on the estimates of generalization error of SVMs and frequency of occurrence of the features in the evolutionary approach. Thus, in theory, selected genes reflect to some extent the generalization performance of SVM classifiers. We compare our proposed method with several existing methods and find that the proposed method can obtain better classification accuracy with a smaller number of selected genes than the existing methods.

16.
MOTIVATION: Feature (gene) selection can dramatically improve the accuracy of gene expression profile based sample class prediction. Many statistical methods for feature (gene) selection such as stepwise optimization and Monte Carlo simulation have been developed for tissue sample classification. In contrast to class prediction, few statistical and computational methods for feature selection have been applied to clustering algorithms for pattern discovery. RESULTS: An integrated scheme and corresponding program SamCluster for automatic discovery of sample classes based on gene expression profile is presented in this report. The scheme incorporates the feature selection algorithms based on the calculation of CV (coefficient of variation) and t-test into hierarchical clustering and proceeds as follows. At first, the genes with their CV greater than the pre-specified threshold are selected for cluster analysis, which results in two putative sample classes. Then, significantly differentially expressed genes in the two putative sample classes with p-values ≤ 0.01, 0.05, or 0.1 from the t-test are selected for further cluster analysis. The above processes were iterated until the two stable sample classes were found. Finally, the consensus sample classes are constructed from the putative classes that are derived from the different CV thresholds, and the best putative sample classes that have the minimum distance between the consensus classes and the putative classes are identified. To evaluate the performance of the feature selection for cluster analysis, the proposed scheme was applied to four expression datasets COLON, LEUKEMIA72, LEUKEMIA38, and OVARIAN. The results show that there are only 5, 1, 0, and 0 samples that have been misclassified, respectively. We conclude that the proposed scheme, SamCluster, is an efficient method for discovery of sample classes using gene expression profile.
AVAILABILITY: The related program SamCluster is available upon request or from the web page http://www.sph.uth.tmc.edu:8052/hgc/Downloads.asp.
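
The first step of the scheme, keeping genes whose coefficient of variation exceeds a threshold, can be sketched as follows. This is not SamCluster's code: the use of the sample standard deviation, the threshold value, and the toy expression values are assumptions, and the clustering and t-test iterations are not shown.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Sample standard deviation divided by the absolute mean."""
    m = mean(values)
    return stdev(values) / abs(m) if m != 0 else float("inf")


def filter_by_cv(expr, threshold):
    """Keep the genes whose CV is greater than the threshold."""
    return [gene for gene, values in expr.items()
            if coefficient_of_variation(values) > threshold]


expr = {
    "flat": [10.0, 10.5, 9.5, 10.0],     # nearly constant across samples
    "variable": [2.0, 10.0, 3.0, 12.0],  # varies strongly across samples
}
print(filter_by_cv(expr, threshold=0.3))
```

Repeating the filter with several thresholds yields the different putative class sets from which the abstract's consensus classes are built.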

17.
Analysis of recursive gene selection approaches from microarray data   Total citations: 1 (self-citations: 0, citations by others: 1)
MOTIVATION: Finding a small subset of most predictive genes from microarray for disease prediction is a challenging problem. Support vector machines (SVMs) have been found to be successful with a recursive procedure in selecting important genes for cancer prediction. However, it is not well understood how much of the success depends on the choice of the specific classifier and how much on the recursive procedure. We answer this question by examining multiple classifiers [SVM, ridge regression (RR) and Rocchio] with feature selection in recursive and non-recursive settings on three DNA microarray datasets (ALL-AML Leukemia data, Breast Cancer data and GCM data). RESULTS: We found recursive RR most effective. On the AML-ALL dataset, it achieved zero error rate on the test set using only three genes (selected from over 7000), which is more encouraging than the best published result (zero error rate using 8 genes by recursive SVM). On the Breast Cancer dataset and the two largest categories of the GCM dataset, the results achieved by recursive RR are also very encouraging. A further analysis of the experimental results shows that different classifiers penalize redundant features to different extents and this property plays an important role in the recursive feature selection process. The RR classifier tends to penalize redundant features to a much larger extent than the SVM does. This may be the reason why recursive RR has a better performance in selecting genes.

18.
Chronic obstructive pulmonary disease (COPD) is a complex human disease with a high mortality rate. So far, the studies of COPD have not been well organized despite the well-documented role of cigarette smoking in the genesis of COPD. In recent years, microarray analyses have helped to identify some potential disease-related genes. However, the low reproducibility of many published gene signatures has been criticized. It has therefore been suggested that incorporating network or pathway information into prognostic biomarker discovery might improve prediction performance. In this analysis, we combined protein-protein interaction (PPI) information with the support vector machine (SVM) method to identify potential COPD-related genes that would allow one to accurately distinguish severe emphysema from non-/mildly emphysematous lung tissue. We identified 8 COPD-related feature genes. When compared with another SVM method that did not use the prior PPI information, the prediction accuracy was significantly enhanced (AUC increased from 0.513 to 0.909). These results suggest that incorporating prior network knowledge into gene selection methods significantly improves classification accuracy. Consequently, the gene expression profiles from human emphysematous lung tissue may provide insight into the pathogenesis, and a good classification prediction algorithm based on prior biological knowledge can further strengthen this performance.

19.
20.
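Tumor classification based on tumor gene expression profiles is an important research topic in bioinformatics. Most traditional methods for extracting tumor information features rely on informative gene selection, but filtering genes inevitably loses classification information. A tumor subtype feature extraction method based on adjacency matrix decomposition is proposed: a Gaussian-weighted adjacency matrix is first constructed from the tumor gene expression profile data, singular value decomposition is then applied to the adjacency matrix, and finally the row vectors of the resulting orthogonal matrix are fed into a support vector machine as classification features. Leave-one-out experiments on a gene expression profile dataset of two leukemia subtypes demonstrate the feasibility and effectiveness of the method.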
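
The first step of this pipeline, building a Gaussian-weighted adjacency matrix over samples, can be sketched with the common form w_ij = exp(-||x_i - x_j||^2 / (2 * sigma^2)). The abstract does not give the exact kernel or bandwidth, so this form, the parameter `sigma`, and the toy samples are assumptions; the singular value decomposition and SVM steps would normally use a linear algebra library and are not shown.

```python
import math


def gaussian_adjacency(samples, sigma=1.0):
    """Matrix of Gaussian weights between all pairs of sample vectors."""
    n = len(samples)
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            d2 = sum((a - b) ** 2 for a, b in zip(samples[i], samples[j]))
            weights[i][j] = math.exp(-d2 / (2 * sigma ** 2))
    return weights


samples = [[0.0, 0.0], [1.0, 0.0], [3.0, 0.0]]  # toy expression vectors
for row in gaussian_adjacency(samples):
    print([round(w, 3) for w in row])
```

The matrix is symmetric with ones on the diagonal, and its size depends on the number of samples, not the number of genes, which keeps the later decomposition small for typical expression datasets.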
