Similar articles
 Found 20 similar articles; search took 31 ms
1.
Most pyruvoyl-dependent proteins observed in prokaryotes and eukaryotes are critical regulatory enzymes and are primary targets of inhibitors for anti-cancer and anti-parasitic therapy. These proteins undergo an autocatalytic, intramolecular self-cleavage reaction that generates a covalently bound pyruvoyl group on a conserved serine residue. The modified serine sites have traditionally been detected experimentally, which is often labor-intensive and time-consuming. In this study, we made an initial attempt at computational prediction of such serine sites using feature selection based on a random forest. Because only a small number of experimentally verified pyruvoyl-modified proteins are in the current version of the protein database, the dataset is small: after removing proteins with sequence identity >60%, the non-redundant dataset contained only 46 proteins, each with one pyruvoyl serine site. The features considered included PSSM conservation scores, disorder, secondary structure, solvent accessibility, amino acid factors, and amino acid occurrence frequencies. The method reached 100.00% accuracy and an MCC of 1.0000 on the training dataset, and 93.75% accuracy and an MCC of 0.8441 on the testing dataset. The optimal feature set contained 9 features. Analysis of the optimal feature set indicated that some specific features play important roles in determining pyruvoyl serine sites, consistent with several earlier experimental results. These selected features may shed light on the mechanism of the post-translational self-maturation process and provide guidelines for experimental validation. As more pyruvoyl-modified proteins are found, the method should be evaluated on larger datasets. The prediction software can be downloaded from http://www.nkbiox.com/sub/pyrupred/index.html.
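The abstract above reports both accuracy and the Matthews correlation coefficient (MCC). As a minimal sketch of how the two metrics are computed from confusion-matrix counts, the following may help; the function names and the example counts are assumptions for illustration and are not taken from the paper.

```python
import math


def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts.

    Returns 0.0 when any marginal total is zero (the usual convention),
    because the coefficient is undefined in that case.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


if __name__ == "__main__":
    # Hypothetical counts: 16 test sites with one false negative.
    print(accuracy(7, 8, 0, 1), mcc(7, 8, 0, 1))
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, which is why it is often reported alongside accuracy when positive and negative sites are unbalanced.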

2.
《Genomics》2021,113(6):3864-3871
RNA editing has critical effects on numerous biological processes. Millions of RNA editing sites have been identified in humans, and many more are expected to be discovered. In this work, we constructed convolutional neural network (CNN) models to predict human RNA editing events in both Alu and non-Alu regions. On a validation dataset derived from CRISPR/Cas9 knockout of the ADAR1 enzyme, validation accuracy reached 99.5% for Alu regions and 93.6% for non-Alu regions. We deployed the CNN models as a web service named EditPredict. EditPredict works on reference genome sequences and can also take into account single nucleotide variants in personal genomes. In addition to the human genome, EditPredict handles other model organisms, including the bumblebee, fruit fly, mouse, and squid genomes. EditPredict can be used stand-alone to predict novel RNA editing, and it can help filter candidate RNA editing sites detected from RNA-Seq data.
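The abstract does not describe how sequences are fed to the CNN. A common choice for sequence models, shown here purely as an assumption, is to cut a fixed-length window around each candidate site and one-hot encode it. The names `window` and `one_hot`, the padding character, and the flank length are all hypothetical.

```python
ALPHABET = "ACGT"


def window(seq, center, flank):
    """Return the 2*flank+1 bases centered on index `center`, padded with N."""
    padded = "N" * flank + seq + "N" * flank
    return padded[center:center + 2 * flank + 1]


def one_hot(seq):
    """Encode a nucleotide string as rows of 0/1 in the order A, C, G, T.

    Any other character (for example the padding N) becomes an all-zero row.
    """
    return [[1 if base == letter else 0 for letter in ALPHABET]
            for base in seq.upper()]


if __name__ == "__main__":
    # Hypothetical candidate site at index 2 of a short sequence.
    print(one_hot(window("ACGTA", 2, 1)))
```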

3.
4.
The gamma function is the standard method for comparing dose distributions. It is calculated in dedicated software, and its results are not verified. We therefore developed an automatic tool that verifies patient-specific QA results with high-accuracy machine learning (ML) models based on radiomics features extracted from gamma images. We used 158 patient-specific QA tests and extracted 105 radiomics features from each gamma image. Three random forest models were developed (ML I, ML II, and ML III). ML I and ML II verified gamma image approval under the 2%/2mm criteria with a 15% threshold and the 3%/3mm criteria with a 15% threshold, respectively. ML III verified whether the protocol recommended by the gamma analysis software had been followed, detecting whether the TPS grid modification step was done. The models were built on the most important features, selected by mean decrease in impurity, and their performance was evaluated. ML I included 25 features; its accuracy was 0.85 on the test set and 0.84 on dataset B. ML II included 10 features, and its accuracy was 0.98 on both the test set and the previously unseen data (dataset B). The first-order 10th percentile feature was strongly related to the approved classification. ML III selected 23 features, with an accuracy of 0.99 on the test set and 0.98 on dataset B. An automatic workflow for verifying gamma analysis QA results could combine the models: first detect grid inconsistencies in the software evaluation, then classify test approval.
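For readers unfamiliar with the gamma function mentioned above, here is a minimal one-dimensional sketch. It assumes the criteria are read as dose difference / distance to agreement / low-dose threshold with global normalization to the maximum reference dose; that reading, the function names, and the sample profiles are assumptions, and clinical software works on 2D or 3D grids with interpolation.

```python
import math


def gamma_1d(ref, ev, spacing_mm, dose_pct=3.0, dta_mm=3.0, cutoff_pct=15.0):
    """Gamma value for each reference point above the low-dose cutoff.

    ref, ev: dose profiles sampled on the same grid (reference, evaluated).
    """
    d_max = max(ref)
    dose_tol = dose_pct / 100.0 * d_max  # global dose-difference criterion
    gammas = []
    for i, d_ref in enumerate(ref):
        if d_ref < cutoff_pct / 100.0 * d_max:
            continue  # skip points below the low-dose threshold
        gammas.append(min(
            math.sqrt(((j - i) * spacing_mm / dta_mm) ** 2
                      + ((d_ev - d_ref) / dose_tol) ** 2)
            for j, d_ev in enumerate(ev)
        ))
    return gammas


def pass_rate(gammas):
    """Fraction of evaluated points with gamma <= 1."""
    if not gammas:
        return 0.0
    return sum(g <= 1.0 for g in gammas) / len(gammas)


if __name__ == "__main__":
    reference = [10.0, 50.0, 100.0, 50.0, 10.0]
    evaluated = [10.0, 50.0, 110.0, 50.0, 10.0]
    print(pass_rate(gamma_1d(reference, evaluated, spacing_mm=10.0)))
```

The models in the abstract work one level above this: they take the resulting gamma image as input and check the approval decision.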

5.
6.
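Early diagnosis significantly improves the survival of cancer patients, and this is especially pronounced in hepatocellular carcinoma (HCC). Machine learning is an effective tool for cancer classification. A central difficulty is selecting, from complex high-dimensional cancer datasets, a low-dimensional feature subset that still gives high classification accuracy. This paper proposes a two-stage feature selection method, SC-BPSO. It first builds a new filter, the SC filter, whose evaluation function combines the Spearman correlation coefficient with the chi-square test of independence, and then combines the SC filter with a wrapper based on binary particle swarm optimization (BPSO), giving two-stage feature selection. The method is applied to a high-dimensional cancer classification problem: distinguishing normal samples from HCC samples. First, 130 liver-tissue microRNA sequencing datasets (64 HCC, 66 normal liver) from the US National Center for Biotechnology Information (NCBI) and the European Bioinformatics Institute (EBI) were preprocessed, and the MiRME algorithm was used to extract three types of features from the raw sequence files: microRNA expression level, editing level, and post-editing expression level. Next, the parameters of SC-BPSO were tuned for the HCC classification setting and a key feature subset was selected. Finally, classification models were built and their predictions were compared against feature subsets chosen by an information-gain filter, an information-gain-ratio filter, and a BPSO wrapper, using four classifiers with identical parameters: random forest, support vector machine, decision tree, and KNN. The feature subset selected by SC-BPSO reached a classification accuracy of 98.4%. The results show that, compared with the other three feature selection algorithms, SC-BPSO effectively finds smaller feature subsets with higher accuracy. This may be important for cancer classification problems with few samples and high-dimensional data.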
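The abstract does not give the exact form of the SC filter's evaluation function. As a minimal sketch of only the Spearman half of such a filter, the following ranks features by the absolute Spearman correlation between each feature column and the class label. The chi-square part and the BPSO wrapper are omitted, and all names here are hypothetical.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no rank information
    return cov / math.sqrt(var_x * var_y)


def rank_features(columns, labels):
    """Feature names sorted by |Spearman correlation with the label|, best first."""
    scores = {name: abs(spearman(col, labels)) for name, col in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)
```

A filter of this kind is cheap because it scores each feature independently, which is why it is typically run before a more expensive wrapper search.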

7.
8.
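Chimpanzee RNA editing sites were identified from transcriptome sequencing (RNA-Seq) data, and the recognition mechanism and potential functional impact of RNA editing were explored. RNA-DNA mismatch sites were found from alignments of chimpanzee RNA-Seq reads to the genome sequence and used to build a candidate set of editing sites. Sites with low genomic or transcriptomic sequencing quality were filtered out; other filters covered unreliable base calls at the 3′ end of reads, coverage, SNP sites, and the estimated editing level. A binomial statistical model with Bonferroni correction for multiple testing was used to remove random errors from the candidate set, yielding the RNA editing sites. Editing sites located in known genes were selected for functional analysis, and Two Sample Logo was used to analyze sequence features upstream and downstream of the editing sites. In total, 8,334 chimpanzee RNA editing sites covering 12 base-substitution types were identified; 41 of them change the encoded amino acid, and 3 lie in the seed-binding region of potential microRNA (miRNA) target genes. Statistical analysis showed that 640 and 872 RNA editing sites differ by tissue and by sex, respectively. Base-frequency analysis of the flanking sequences showed that the bases immediately adjacent to several types of editing sites have significant preferences. The results show that RNA editing is widespread in chimpanzees and potentially has important biological functions, laying a foundation for further study of RNA editing mechanisms in primates.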
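The binomial filtering step described above can be sketched as follows. The idea is to ask how likely it would be to see at least the observed number of mismatching reads at a site if mismatches arose only from sequencing error, then keep a site only if that probability falls below a Bonferroni-corrected threshold. The error rate, the significance level, and the function names are assumptions; the abstract does not state the values used.

```python
from math import comb


def binom_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k, n + 1))


def call_edit_sites(candidates, error_rate=0.001, alpha=0.05):
    """Keep candidate sites whose mismatch count is unlikely under pure error.

    candidates: list of (site_name, mismatching_reads, total_reads).
    The per-site threshold is alpha divided by the number of sites tested
    (Bonferroni correction).
    """
    if not candidates:
        return []
    threshold = alpha / len(candidates)
    return [name for name, k, n in candidates
            if binom_tail(k, n, error_rate) < threshold]


if __name__ == "__main__":
    # Hypothetical sites: (name, mismatching reads, coverage).
    sites = [("chr1:1000", 10, 20), ("chr1:2000", 0, 20), ("chr2:500", 1, 400)]
    print(call_edit_sites(sites))
```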

9.
Predicting thermophilic and mesophilic proteins plays a crucial role in both biochemistry and bioengineering. In this study, a different mode of pseudo amino acid composition (PseAAC) was proposed to formulate the protein samples by integrating the amino acid composition, physicochemical features, and composition, transition, and distribution features, so that each protein sample was represented by a numerical vector derived from its sequence. Using the support vector machine algorithm, an accurate and reliable classifier was constructed to predict thermophilic and mesophilic proteins. In addition, three feature reduction algorithms were applied to locate the most important features and reduce the size of the feature space. Of the three, the genetic algorithm performed best. With the reduced feature set obtained from the genetic algorithm, the new classifier achieved an accuracy of 95.93% and a Matthews correlation coefficient of 0.9187 on the selected dataset.
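The simplest of the sequence-derived components named above is the amino acid composition. A minimal sketch, assuming the standard 20-letter alphabet in alphabetical order (the ordering and the function name are assumptions), turns a protein sequence into a 20-element frequency vector.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence):
    """Return the 20 amino acid frequencies of a protein sequence.

    Characters outside the standard alphabet (for example X) are ignored.
    """
    residues = [aa for aa in sequence.upper() if aa in AMINO_ACIDS]
    if not residues:
        return [0.0] * len(AMINO_ACIDS)
    return [residues.count(aa) / len(residues) for aa in AMINO_ACIDS]


if __name__ == "__main__":
    # Hypothetical short peptide.
    print(amino_acid_composition("MKVLAAGIV"))
```

PseAAC extends this vector with extra terms that capture sequence-order and physicochemical information, which the plain composition discards.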

10.
11.
To identify non-coding RNA (ncRNA) signals within genomic regions, a classification tool was developed based on a hybrid of random forest (RF) and a logistic regression model, to efficiently discriminate both short ncRNA sequences and long, complex ncRNA sequences. This RF-based classifier was trained on a well-balanced dataset with a discriminative set of features and achieved an accuracy, sensitivity, and specificity of 92.11%, 90.7%, and 93.5%, respectively. The selected feature set includes a newly proposed feature, SCORE. This feature is generated by a logistic regression function that combines five significant features (structure, sequence, modularity, structural robustness, and coding potential) to improve the characterization of long ncRNA (lncRNA) elements. The use of SCORE improved the performance of the RF-based classifier in identifying Rfam lncRNA families. A genome-wide ncRNA classification framework was applied to a wide variety of organisms, with an emphasis on those of economic, social, public health, environmental, and agricultural significance, such as various bacterial genomes, the Arthrospira (Spirulina) genome, and rice and human genomic regions. The framework identified known ncRNAs with sensitivities of greater than 90% and 77.7% for prokaryotic and eukaryotic sequences, respectively. The classifier is available at http://ncrna-pred.com/HLRF.htm.
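A logistic regression function that merges several features into one value, as SCORE does, has the general shape sketched below. The weights and bias shown are placeholders, not the fitted coefficients from the paper, and the function name is hypothetical.

```python
import math


def logistic_score(features, weights, bias):
    """Combine feature values into a single score in (0, 1).

    Computes sigmoid(bias + sum(w_i * x_i)) in a numerically stable way.
    """
    z = bias + sum(w * x for w, x in zip(weights, features))
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


if __name__ == "__main__":
    # Placeholder values for the five inputs named in the abstract:
    # structure, sequence, modularity, structural robustness, coding potential.
    x = [0.8, 0.3, 0.5, 0.6, 0.1]
    w = [1.0, 1.0, 1.0, 1.0, -1.0]
    print(logistic_score(x, w, bias=-1.0))
```

The combined value can then be passed to the random forest as one more input feature alongside the others.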

12.
Because DNA-binding proteins play vital roles in gene regulation, an efficient method for identifying them is highly desirable and would advance our understanding of protein function. In this study, we propose a new method for predicting DNA-binding proteins that ranks features with a random forest and then performs wrapper-based feature selection with a forward best-first search strategy. The features comprise information from the primary sequence, predicted secondary structure, predicted relative solvent accessibility, and the position-specific scoring matrix. The proposed method, called DBPPred, uses Gaussian naïve Bayes as the underlying classifier because it outperformed five other classifiers: decision tree, logistic regression, k-nearest neighbor, support vector machine with polynomial kernel, and support vector machine with radial basis function. DBPPred yields the highest average accuracy of 0.791 and average MCC of 0.583 in ten runs of five-fold cross-validation on the training benchmark dataset PDB594. Blind tests on the independent dataset PDB186 were then performed with the proposed model trained on the entire PDB594 dataset and with five existing methods (iDNA-Prot, DNA-Prot, DNAbinder, DNABIND, and DBD-Threader); DBPPred yielded the highest accuracy of 0.769, MCC of 0.538, and AUC of 0.790. Independent tests of DBPPred on a large non-DNA-binding protein dataset and two RNA-binding protein datasets also showed improved or comparable quality relative to the relevant prediction methods. Moreover, for the majority of the selected features, the mean feature values differ with statistical significance between DNA-binding and non-DNA-binding proteins. Together, the experimental results indicate that DBPPred can serve as an alternative predictor for large-scale identification of DNA-binding proteins.

13.
This paper presents an alternative technique for sleep stage classification based on heart rate variability (HRV). A simple subject-specific scheme and a more practical subject-independent scheme were designed to classify wake, rapid eye movement (REM) sleep, and non-REM (NREM) sleep. Forty-one HRV features extracted from the RR sequences of 45 healthy subjects were used to train and test a random forest (RF). Of these features, 25 were newly proposed or applied to sleep study for the first time. For the subject-independent classifier, all features were normalized with a method we developed based on fractile values. The importance of each feature for sleep staging was also assessed by RF, and the appropriate number of features was explored. The subject-specific classifier achieved a mean accuracy of 88.67% with a Cohen's kappa (κ) of 0.7393. For the subject-independent classifier, accuracy and κ dropped to 72.58% and 0.4627, respectively. Some of the newly proposed HRV features performed better than the conventional ones. The proposed method could be used as an alternative or supporting technique for rough and convenient sleep stage classification.
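Cohen's kappa, reported above next to accuracy, corrects the observed agreement for the agreement expected by chance. A minimal sketch computed from two label sequences follows; the stage labels in the example are hypothetical.

```python
from collections import Counter


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e).

    p_o is the observed agreement; p_e is the agreement expected by chance
    from the marginal label frequencies of the two sequences.
    """
    n = len(y_true)
    p_o = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    p_e = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    if p_e == 1.0:
        return 1.0  # both sequences use a single identical label
    return (p_o - p_e) / (1.0 - p_e)


if __name__ == "__main__":
    # Hypothetical epochs labeled wake (W), REM (R), or NREM (N).
    truth = ["W", "W", "R", "N", "N", "N"]
    pred = ["W", "N", "R", "N", "N", "W"]
    print(cohen_kappa(truth, pred))
```

Because NREM usually dominates a night of sleep, a classifier can reach a high accuracy by favoring that class; kappa penalizes this, which explains why it drops more sharply than accuracy in the subject-independent setting.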

14.
Non-synonymous SNPs (nsSNPs), also known as single amino acid polymorphisms (SAPs), account for the majority of human inherited diseases. It is important to distinguish deleterious SAPs from neutral ones. Most traditional computational methods for classifying SAPs are based on sequence or structural features. However, these features cannot fully explain the association between a SAP and the observed pathophysiological phenotype. We believe a better rationale for deleterious SAP prediction is this: if a SAP lies in a protein with important functions and severely changes the protein sequence and structure, it is more likely to be related to disease. We therefore established a method to predict deleterious SAPs based on both the protein interaction network and traditional hybrid properties. Each SAP is represented by 472 features, including sequence features, structural features, and network features. The Maximum Relevance Minimum Redundancy (mRMR) method and Incremental Feature Selection (IFS) were applied to obtain the optimal feature set, and the prediction model was the Nearest Neighbor Algorithm (NNA). In jackknife cross-validation, 83.27% of SAPs were correctly predicted with the optimized set of 263 features. The optimized predictor with 263 features was also tested on an independent dataset, where the accuracy was 80.00%. In contrast, SIFT, a widely used predictor of deleterious SAPs based on sequence features, has a prediction accuracy of 71.05% on the same dataset. In our study, network features were found to be the most important for accurate prediction and significantly improved prediction performance. Our results suggest that the protein interaction context provides important clues for explaining a SAP's functional association. This research will facilitate post-genome-wide association studies.
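Incremental feature selection with a nearest-neighbor model and jackknife (leave-one-out) evaluation can be sketched as below. The feature ranking is taken as an input here; in the paper it comes from mRMR, which this sketch does not implement. The squared Euclidean distance and all names are assumptions.

```python
def loo_accuracy(X, y, feature_idx):
    """Leave-one-out accuracy of a 1-nearest-neighbor classifier.

    Each sample is classified by its closest other sample, using only the
    features listed in feature_idx (squared Euclidean distance).
    """
    correct = 0
    for i, xi in enumerate(X):
        best_j, best_d = None, float("inf")
        for j, xj in enumerate(X):
            if j == i:
                continue
            d = sum((xi[f] - xj[f]) ** 2 for f in feature_idx)
            if d < best_d:
                best_j, best_d = j, d
        correct += y[best_j] == y[i]
    return correct / len(X)


def incremental_feature_selection(X, y, ranked_features):
    """Add ranked features one at a time and keep the best-scoring prefix.

    Returns (number_of_features, accuracy); ties favor the smaller set.
    """
    best_k, best_acc = 0, -1.0
    for k in range(1, len(ranked_features) + 1):
        acc = loo_accuracy(X, y, ranked_features[:k])
        if acc > best_acc:
            best_k, best_acc = k, acc
    return best_k, best_acc
```

On a toy dataset where the second feature is noise, the search stops at the first feature:

```python
X = [[0.0, 5.0], [0.1, 1.0], [1.0, 5.2], [1.1, 0.9]]
y = [0, 0, 1, 1]
print(incremental_feature_selection(X, y, [0, 1]))
```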

15.
16.
Willingale R  Jones DJ  Lamb JH  Quinn P  Farmer PB  Ng LL 《Proteomics》2006,6(22):5903-5914
We have developed a technique for analysing blood plasma using MALDI-MS, with subsequent data analysis to identify significant and specific differences between heart failure (HF) patients and healthy individuals. A training dataset comprising 100 HF patients and 100 healthy individuals was used to search for biomarkers (m/z range 1000-10,000). EWP cartridges used in tandem with microcon centrifugal filters were found to give the best results. A data management chain including event binning, background subtraction, and feature extraction was developed to reduce the data, and statistical analysis was used to map feature intensities onto a common scale. Various mathematical approaches, including a simple cumulative score, support vector machines (SVM), and genetic algorithms (GAs), were then used to combine the results from individual features and provide a robust classification algorithm. The SVM gave the most promising results (accuracy 95%, receiver operating characteristic (ROC) score of 0.997 using 18 selected features). Finally, a test dataset comprising a further 32 HF patients and 20 controls was used to verify that the 18 putative biomarkers and classification algorithms gave reliable predictions (accuracy 88.5%, ROC score 0.998).
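Assuming the ROC score quoted above is the area under the ROC curve, it can be computed directly as the probability that a randomly chosen patient receives a higher classifier score than a randomly chosen control, with ties counting one half. The sketch below uses that pairwise definition; the function name and the example scores are hypothetical.

```python
def roc_auc(pos_scores, neg_scores):
    """Area under the ROC curve via pairwise comparison of scores.

    Equals the fraction of (positive, negative) pairs in which the positive
    sample scores higher, counting ties as half a win.
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


if __name__ == "__main__":
    # Hypothetical classifier outputs for patients and controls.
    print(roc_auc([0.9, 0.3], [0.5, 0.1]))
```

Unlike accuracy, this value does not depend on a decision threshold, which is how a test set can show a lower accuracy than the training set while keeping a near-identical ROC score.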

17.
Hematopoiesis is a complicated process involving a series of biological sub-processes that lead to the formation of various blood components. A widely accepted model of early hematopoiesis proceeds from long-term hematopoietic stem cells (LT-HSCs) to multipotent progenitors (MPPs) and then to lineage-committed progenitors. However, the molecular mechanisms of early hematopoiesis have not been fully characterized. In this study, we applied a computational strategy to identify the gene expression signatures distinguishing three types of closely related hematopoietic cells collected in recent studies: (1) hematopoietic stem cell/multipotent progenitor cells; (2) LT-HSCs; and (3) hematopoietic progenitor cells. Each cell was represented by its expression profile over a total of 20,475 genes. The expression features were analyzed with a Monte-Carlo Feature Selection (MCFS) method, resulting in a ranked feature list. Incremental feature selection (IFS) and a support vector machine (SVM) optimized with the sequential minimal optimization (SMO) algorithm were then used to obtain the optimal classifier, which had the highest Matthews correlation coefficient (MCC) of 0.889 and used 6698 features to represent cells. In addition, an updated MCFS program yielded seventeen decision rules that classify the three cell types with an overall accuracy of 0.812. A literature review confirmed that both the rules and the top features used to build the optimal classifier are commonly used or potential biological markers for distinguishing the three cell types of HSPCs. This article is part of a Special Issue entitled: Accelerating Precision Medicine through Genetic and Genomic Big Data Analysis edited by Yudong Cai & Tao Huang.

18.
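[Aim] This study used previously obtained PacBio single molecule real-time (SMRT) sequencing data to identify and analyze transcription factors (TFs), fusion genes, and RNA editing events in the mycelium (AaM) and spores (AaS) of Ascosphaera apis, in order to enrich the available information on A. apis and provide a basis for further study of their functions. [Methods] Full-length transcript sequences from AaM and AaS were aligned against the Nr, Swiss-Prot, and KEGG databases with BLASTx to obtain the best-matching protein sequences, which were then aligned to the Plant TFdb database with hmmscan to obtain TF classification and annotation. Fusion genes were predicted with the fusion_finder.py program in the TOFU package, and their sequence and position information was analyzed. RNA editing events in AaM and AaS were predicted with SAMtools and annotated with ANNOVAR, and genes containing RNA editing sites were then given GO function and KEGG pathway annotations using bioinformatics software. [Results] In AaS, 213 TFs from 17 TF families were identified, with the C2H2 family containing the most members. A total of 921 and 510 fusion genes were identified in AaM and AaS, respectively; 510 were shared, and 411 and 0 were specific to AaM and AaS. A total of 547 and 191 RNA editing events were identified in AaM and AaS, respectively; synonymous single-nucleotide changes were the most numerous in AaM, and non-synonymous single-nucleotide changes in AaS. Twelve base-substitution types were identified in AaM, of which C->T was the most frequent with 158 events; nine base-substitution types were identified in AaS, of which C->T and G->T were the most frequent with 42 events each. Genes with RNA editing sites in AaM and AaS were associated with 19 and 24 GO terms, and with 11 and 20 KEGG pathways, respectively. [Conclusion] The mycelium and spores of A. apis contain abundant TFs, fusion genes, and RNA editing sites; the C2H2 TF family is potentially associated with the growth, development, and cellular activity of A. apis mycelium and spores; the base-substitution types of RNA editing events are species-specific in A. apis and other species; and RNA editing may play a role in the growth and metabolism of A. apis mycelium and spores.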

19.
20.
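Prediction of alternative splice sites for cassette exons and intron retention in the human genome   Total citations: 2, self-citations: 0, citations by others: 2
Alternative splicing of messenger RNA is one of the basic features distinguishing eukaryotes from prokaryotes. Alternative splicing of pre-mRNA greatly enriches protein diversity in higher eukaryotes and is closely related to tissue specificity. This article compiles statistics on some basic features of human cassette exons and retained introns. Using features such as the conservation of single bases, dinucleotides, and trinucleotides near splice sites, a quadratic discriminant method based on a diversity measure was used to predict the donor and acceptor alternative splice sites of cassette exons and retained introns. Cross-validation showed that recognition accuracy for cassette-exon donor and acceptor sites exceeded 93% and 84%, respectively, and for intron-retention donor and acceptor sites exceeded 89% and 81%, respectively.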
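The abstract does not spell out its diversity measure. One definition used in related sequence-classification work, given here strictly as an assumption, is D(X) = N ln N − Σ nᵢ ln nᵢ for a vector of counts nᵢ summing to N, with the increment of diversity between two count vectors defined as ID(X, S) = D(X + S) − D(X) − D(S). A small increment means the two sources have similar composition. The sketch below computes both; the quadratic discriminant step that would consume these values is not shown.

```python
import math


def diversity(counts):
    """D(X) = N*ln(N) - sum(n_i*ln(n_i)) for non-negative counts n_i."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return total * math.log(total) - sum(c * math.log(c) for c in counts if c > 0)


def increment_of_diversity(x, s):
    """ID(X, S) = D(X + S) - D(X) - D(S); near zero for similar compositions."""
    combined = [a + b for a, b in zip(x, s)]
    return diversity(combined) - diversity(x) - diversity(s)


if __name__ == "__main__":
    # Hypothetical trinucleotide counts from a query window and a training class.
    query = [3, 1, 0, 2]
    donor_class = [30, 12, 1, 18]
    print(increment_of_diversity(query, donor_class))
```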

