Background
Chronic kidney disease (CKD) is a disorder associated with the breakdown of kidney structure and function. CKD can be diagnosed in its early stage only by experienced nephrologists and urologists (medical experts) using the disease history, symptoms and laboratory tests. Few studies in the literature address the automatic diagnosis of CKD, and the existing methods are not adequate to support medical experts.
Methods
In this study, a new method was proposed to automatically diagnose chronic kidney disease in its early stage. The method aims to support medical diagnosis using the results of urine tests, blood tests and the disease history. Classification algorithms were used as the data mining methods. In the first phase of the method, the CKD data were pre-processed using K-Means clustering. Then, the classification methods (KNN, SVM, and Naïve Bayes) were applied to the pre-processed data to diagnose CKD.
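As a rough illustration of the classification step (not the authors' implementation, and with hypothetical attribute values), a minimal k-nearest-neighbours classifier over numeric CKD-style attributes can be sketched as:

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, x, k=3):
    """Classify x by majority vote among its k nearest training points."""
    dists = sorted(
        (math.dist(row, x), label) for row, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

# Toy data: [specific_gravity, albumin]-style numeric attributes (hypothetical values).
train_X = [[1.005, 4.0], [1.010, 3.5], [1.020, 0.0], [1.025, 0.5]]
train_y = ["ckd", "ckd", "notckd", "notckd"]
print(knn_predict(train_X, train_y, [1.008, 3.8]))  # → "ckd"
```

In the study itself, the K-Means pre-processing step (details in the paper) runs before classification, and a library implementation would normally replace the hand-rolled classifier above.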
Results
The highest success rate obtained by the classification methods was 97.8% (98.2% for ages 35 and older). This result shows that data mining methods are useful for the automatic diagnosis of CKD in its early stage.
Conclusion
A new automatic early-stage CKD diagnosis method was proposed to help medical doctors. The attributes that provided the highest diagnostic success rate were specific gravity, albumin, sugar and red blood cells used together. The relation between the success rate of the automatic diagnosis method and age was also identified.
Background
By using a standard Support Vector Machine (SVM) trained with the Sequential Minimal Optimization (SMO) method, Naïve Bayes, and other machine learning algorithms, we are able to distinguish between two classes of protein sequences: those folding to highly-designable conformations, and those folding to poorly- or non-designable conformations.
Results
First, we generate all possible compact lattice conformations for the specified shape (a hexagon or a triangle) on the 2D triangular lattice. Then we generate all possible binary hydrophobic/polar (H/P) sequences and, using a specified energy function, thread them through all of these compact conformations. If for a given sequence the lowest energy is obtained for a particular lattice conformation, we assume that this sequence folds to that conformation. Highly-designable conformations have many H/P sequences folding to them, while poorly-designable conformations have few or no H/P sequences. We classify sequences as folding to either highly- or poorly-designable conformations. We randomly selected subsets of the sequences belonging to highly-designable and poorly-designable conformations and used them to train several different standard machine learning algorithms.
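The threading-and-counting procedure can be sketched as follows. This is a toy stand-in: it hard-codes a few hypothetical contact sets instead of enumerating real triangular-lattice conformations, and assumes a simple energy of -1 per H-H contact.

```python
from itertools import product

# Toy stand-in for the enumeration step: each "conformation" is described by
# its residue-residue contacts (pairs of positions adjacent on the lattice but
# not along the chain). Real conformations would be enumerated on the lattice.
CONFORMATIONS = {
    "c1": {(0, 3), (1, 4), (0, 5)},
    "c2": {(0, 3), (2, 5)},
    "c3": {(1, 4)},
}

def energy(seq, contacts):
    """H/P contact energy: each H-H contact contributes -1."""
    return -sum(1 for i, j in contacts if seq[i] == "H" and seq[j] == "H")

def designability(n=6):
    """Count, per conformation, the sequences whose unique lowest energy is
    achieved on that conformation (sequences with ties fold to none)."""
    counts = {name: 0 for name in CONFORMATIONS}
    for seq in product("HP", repeat=n):
        energies = {name: energy(seq, c) for name, c in CONFORMATIONS.items()}
        best = min(energies.values())
        winners = [name for name, e in energies.items() if e == best]
        if len(winners) == 1:
            counts[winners[0]] += 1
    return counts

print(designability())
```

Highly-designable conformations are then the ones with the largest counts; the real computation repeats this over every compact conformation of the chosen shape.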
Conclusion
By using these machine learning algorithms with ten-fold cross-validation, we are able to classify the two classes of sequences with high accuracy, in some cases exceeding 95%.
Objectives
Prediabetes is a major epidemic and is associated with adverse cardio-cerebrovascular outcomes. Early identification of patients who will develop rapid progression of atherosclerosis could be beneficial for improved risk stratification. In this paper, we use several machine learning methods to investigate the factors that are most important for predicting rapid progression of carotid intima-media thickness in impaired glucose tolerance (IGT) participants.
Methods
In the Actos Now for Prevention of Diabetes (ACT NOW) study, 382 participants with IGT underwent carotid intima-media thickness (CIMT) ultrasound evaluation at baseline and at 15–18 months, and were divided into rapid progressors (RP, n = 39, 58 ± 17.5 μm change) and non-rapid progressors (NRP, n = 343, 5.8 ± 20 μm change, p < 0.001 versus RP). To deal with complex multi-modal data consisting of demographic, clinical, and laboratory variables, we propose a general data-driven framework to investigate the ACT NOW dataset. In particular, we first employed a Fisher score-based feature selection method to identify the most effective variables, and then proposed a probabilistic Bayes-based learning method for the prediction. Comparison of the methods and factors was conducted using area under the receiver operating characteristic curve (AUC) analyses and the Brier score.
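A minimal sketch of Fisher score-based feature ranking, using one common two-class form of the score and hypothetical feature values (not the ACT NOW data):

```python
def fisher_score(values, labels):
    """Two-class Fisher score: (mu1 - mu0)^2 / (var1 + var0).
    Larger scores indicate features that separate the classes better."""
    g0 = [v for v, y in zip(values, labels) if y == 0]
    g1 = [v for v, y in zip(values, labels) if y == 1]
    def mean(xs): return sum(xs) / len(xs)
    def var(xs):
        m = mean(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)
    denom = var(g0) + var(g1)
    return (mean(g1) - mean(g0)) ** 2 / denom if denom else float("inf")

# Hypothetical features: rank by score and keep the top-ranked ones.
features = {
    "hba1c": [5.0, 5.2, 5.1, 6.4, 6.6, 6.5],  # separates the classes well
    "age":   [40, 62, 55, 47, 58, 44],        # mostly noise
}
labels = [0, 0, 0, 1, 1, 1]
ranked = sorted(features, key=lambda f: fisher_score(features[f], labels), reverse=True)
print(ranked)  # → ['hba1c', 'age']
```

Feature selection then keeps only the top-k ranked variables before the classifier is trained.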
Results
The experimental results show that the proposed learning methods performed well in identifying and predicting RP. Among the methods, Naïve Bayes performed best (AUC 0.797, Brier score 0.085), compared to the multilayer perceptron (0.729, 0.086) and random forest (0.642, 0.10). The results also show that feature selection has a significant positive impact on prediction performance.
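The Brier score used for the comparison is simply the mean squared difference between predicted probabilities and observed 0/1 outcomes; a minimal sketch with hypothetical predictions:

```python
def brier_score(probs, outcomes):
    """Mean squared difference between predicted probability and the 0/1
    outcome. 0 is perfect; lower is better."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)

# Hypothetical predicted probabilities of rapid progression vs. observed outcomes.
probs    = [0.9, 0.2, 0.8, 0.1]
outcomes = [1,   0,   1,   0]
print(brier_score(probs, outcomes))  # → 0.025
```

Unlike AUC, which depends only on the ranking of predictions, the Brier score also penalizes poorly calibrated probabilities, which is why the two metrics are reported together.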
Conclusions
By dealing with multi-modal data, the proposed learning methods are effective in identifying prediabetic patients at risk for rapid atherosclerosis progression. The proposed framework demonstrated utility for outcome prediction in a typical multidimensional clinical dataset with a relatively small number of subjects, extending the potential utility of machine learning approaches beyond extremely large-scale datasets.
Background
As individual naïve CD4 T lymphocytes circulate in the body after emerging from the thymus, they are likely to have individually varying microenvironmental interactions, even in the absence of stimulation via specific target recognition. It is not clear whether these interactions alter their activation, survival and effector programming. Naïve CD4 T cells show a unimodal distribution for many phenotypic properties, suggesting that the variation is caused by intrinsic stochasticity, although underlying variation due to subsets created by different histories of microenvironmental interactions remains possible. To explore this possibility, we examined the phenotype and functionality of naïve CD4 T cells differing in a basic unimodally distributed property, the CD4 level, as well as the causal origin of these differences.
Results
We examined separated CD4hi and CD4lo subsets of mouse naïve CD4 cells. CD4lo cells were smaller, with higher CD5 levels and lower levels of the dual-specific phosphatase (DUSP)6-suppressing micro-RNA miR181a, and responded poorly, with more Th2-skewed outcomes. Human naïve CD4lo and CD4hi cells showed similar differences. Naïve CD4lo and CD4hi subsets of thymic single-positive CD4 T cells did not show differences, whereas peripheral naïve CD4lo and CD4hi subsets of T cell receptor (TCR)-transgenic T cells did. Adoptive transfer-mediated parking of naïve CD4 cells in vivo lowered CD4 levels, increased CD5 and reactive oxygen species (ROS) levels, and induced hyporesponsiveness, dependent at least in part on the availability of major histocompatibility complex class II (MHCII) molecules. ROS scavenging or DUSP inhibition ameliorated the hyporesponsiveness. Naïve CD4 cells from aged mice showed lower CD4 levels and cell sizes, higher CD5 levels, and hyporesponsiveness and Th2 skewing that were reversed by DUSP inhibition.
Conclusions
Our data show that, underlying a unimodally distributed property, the CD4 level, there are subsets of naïve CD4 cells that vary in the time spent in the periphery receiving MHCII-mediated signals and show a resultant alteration of phenotype and functionality via ROS and DUSP activity. Our findings also suggest the feasibility of pharmacological interventions, via either anti-oxidant or DUSP-inhibitor small molecules, for improving CD4 T cell responses during vaccination of older people.
Background
Long noncoding RNAs (lncRNAs) are widely involved in the initiation and development of cancer. Although some computational methods have been proposed to identify cancer-related lncRNAs, there is still a need to improve their prediction accuracy and efficiency. In addition, the rapidly updated cancer data, as well as the discovery of new mechanisms, open the possibility of improving cancer-related lncRNA prediction algorithms. In this study, we introduce CRlncRC, a novel Cancer-Related lncRNA Classifier that integrates manifold features with five machine-learning techniques.
Results
CRlncRC was built on the integration of four categories of features: genomic, expression, epigenetic and network. Five learning techniques were exploited to develop an effective classification model: Random Forest (RF), Naïve Bayes (NB), Support Vector Machine (SVM), Logistic Regression (LR) and K-Nearest Neighbors (KNN). Using ten-fold cross-validation, we showed that RF is the best model for classifying cancer-related lncRNAs (AUC = 0.82). Feature importance analysis indicated that epigenetic and network features play key roles in the classification. In addition, compared with other existing classifiers, CRlncRC exhibited better performance in both sensitivity and specificity. We further applied CRlncRC to lncRNAs from the TANRIC (The Atlas of non-coding RNA in Cancer) dataset and identified 121 cancer-related lncRNA candidates. These potential cancer-related lncRNAs showed cancer-related indications, and many of them have convincing support in the literature.
Conclusions
Our results indicate that CRlncRC is a powerful method for identifying cancer-related lncRNAs. Machine-learning-based integration of multiple features, especially epigenetic and network features, contributed greatly to cancer-related lncRNA prediction. RF outperformed the other learning techniques in model sensitivity and specificity. In addition, using the CRlncRC method, we predicted a set of cancer-related lncRNAs, all of which displayed a strong relevance to cancer, providing a valuable basis for further studies of cancer-related lncRNA function.
Background
High-resolution mass spectrometry has been employed to rapidly and accurately type and subtype influenza viruses. The detection of signature peptides with unique theoretical masses enables the unequivocal assignment of the type and subtype of a given strain. This analysis has, to date, required the manual inspection of mass spectra of whole-virus and antigen digests.
Results
A computer algorithm, FluTyper, has been designed and implemented to automate the analysis of MALDI mass spectra recorded for proteolytic digests of whole influenza virus and antigens. FluTyper incorporates established signature peptides and newly developed naïve Bayes classifiers for four common influenza antigens (hemagglutinin, neuraminidase, nucleoprotein, and matrix protein 1) to type and subtype the influenza virus based on their detection within proteolytic peptide mass maps. Theoretical and experimental testing of the classifiers demonstrates their applicability at the protein coverage rates normally achievable in mass-mapping experiments. The application of FluTyper to whole-virus and antigen digests of a range of different influenza strains is demonstrated.
Conclusions
The FluTyper algorithm facilitates the rapid and automated typing and subtyping of the influenza virus from mass spectral data. The newly developed naïve Bayes classifiers increase the confidence of influenza virus subtyping, especially where signature peptides are not detected. FluTyper is expected to popularize the use of mass spectrometry to characterize influenza viruses.
Background
Intrinsically Disordered Proteins (IDPs) lack an ordered three-dimensional structure and are enriched in various biological processes. Molecular Recognition Features (MoRFs) are functional regions within IDPs that undergo a disorder-to-order transition on binding to a partner protein. Identifying MoRFs in IDPs using computational methods is a challenging task.
Methods
In this study, we introduce hidden Markov model (HMM) profiles to accurately identify the location of MoRFs in disordered protein sequences. Using a windowing technique, HMM profiles are utilised to extract features from protein sequences, and support vector machines (SVMs) are used to calculate a propensity score for each residue. Two different SVM kernels with high noise tolerance are evaluated with varying window sizes, and the scores of the SVM models are combined to generate the final propensity score for predicting MoRF residues. The SVM models are designed to extract maximal information between MoRF residues, their neighboring regions (Flanks) and the remainder of the sequence (Others).
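The windowing step can be sketched as follows. This is an illustrative sketch, assuming each residue is described by a small numeric vector standing in for an HMM-profile column, with zero-padding past the sequence ends:

```python
def window_features(profile, window=5):
    """For each residue, concatenate the per-residue feature vectors inside a
    centred window; positions beyond the sequence ends are zero-padded."""
    half = window // 2
    dim = len(profile[0])
    pad = [0.0] * dim
    features = []
    for i in range(len(profile)):
        row = []
        for j in range(i - half, i + half + 1):
            row.extend(profile[j] if 0 <= j < len(profile) else pad)
        features.append(row)
    return features

# Toy 2-number "profile columns" for a 4-residue sequence (hypothetical values).
profile = [[0.1, 0.9], [0.3, 0.7], [0.8, 0.2], [0.6, 0.4]]
feats = window_features(profile, window=3)
print(len(feats), len(feats[0]))  # → 4 6  (4 residues, 3 positions x 2 values each)
```

Each per-residue feature row would then be passed to an SVM, which outputs the residue's propensity score.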
Results
To evaluate the proposed method, its performance was compared to that of two other MoRF predictors, MoRFpred and ANCHOR. The results show that the proposed method outperforms both predictors.
Conclusions
Using HMM profiles as a source of feature extraction, the proposed method shows improved performance in predicting MoRFs in disordered protein sequences.
Background
The prediction of calmodulin-binding (CaM-binding) proteins plays a very important role in biology and biochemistry, because the calmodulin protein binds and regulates a multitude of protein targets affecting different cellular processes. Computational methods that can accurately identify CaM-binding proteins and CaM-binding domains would accelerate research in calcium signaling and calmodulin function. Short linear motifs (SLiMs), on the other hand, have been effectively used as features for analyzing protein-protein interactions, though their properties have not been utilized in the prediction of CaM-binding proteins.
Results
We propose a new method for the prediction of CaM-binding proteins based on both the total and average scores of known and new SLiMs in protein sequences, using a new scoring method called sliding window scoring (SWS) as features for the prediction module. A dataset of 194 manually curated human CaM-binding proteins and 193 mitochondrial proteins was obtained and used to test the proposed model. The motif generation tool Multiple EM for Motif Elicitation (MEME) was used to obtain new motifs from the positive and negative datasets individually (the SM approach) and from the combined negative and positive datasets (the CM approach). Moreover, a wrapper criterion with random forest for feature selection (FS) was applied, followed by classification using different algorithms such as k-nearest neighbors (k-NN), support vector machines (SVM), naive Bayes (NB) and random forest (RF).
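The exact SWS formulation is defined in the paper; as a simplified, hypothetical stand-in, sliding a motif-length window along the sequence and reporting total and average exact-match counts looks like this:

```python
def sliding_window_scores(sequence, motifs):
    """For each motif, slide a motif-length window along the sequence and
    count exact matches; report the total and the per-window average."""
    scores = {}
    for motif in motifs:
        w = len(motif)
        windows = max(len(sequence) - w + 1, 1)
        total = sum(sequence[i:i + w] == motif for i in range(len(sequence) - w + 1))
        scores[motif] = {"total": total, "average": total / windows}
    return scores

# Hypothetical short motifs scanned against a toy sequence.
seq = "MKRILKWLLKRIKSQ"
print(sliding_window_scores(seq, ["KRI", "WLL"]))
```

The per-motif total and average scores form the feature vector handed to the feature-selection and classification stages.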
Conclusions
Our proposed method shows very good prediction results and demonstrates how the information contained in SLiMs is highly relevant for predicting CaM-binding proteins. Further, three new CaM-binding motifs have been computationally selected and biologically validated in this study, and these can be used for predicting CaM-binding proteins.
Background
Protein secondary structure prediction methods based on probabilistic models such as the hidden Markov model (HMM) appeal to many because they provide meaningful information relevant to the sequence-structure relationship. However, at present, the prediction accuracy of pure HMM-type methods is much lower than that of machine learning-based methods such as neural networks (NN) or support vector machines (SVM).
Background
Many essential cellular processes, such as cellular metabolism, transport and most regulatory mechanisms, rely on physical interactions between proteins. Genome-wide protein interactome networks of yeast, human and several other animal organisms have already been established, but such a network remains to be established for plants.
Results
We first predicted protein-protein interactions in Arabidopsis thaliana with several methods, including ortholog mapping, SSBP, gene fusion, gene neighbor, phylogenetic profile, coexpression and protein domain analyses, and then used a Naïve Bayesian approach to integrate the results of these methods with text-mining data to build a genome-wide protein interactome network. Furthermore, we used GO enrichment analysis, pathway data and the published literature to validate our network; this confirmation shows the feasibility of using the network to predict protein function and for other purposes.
Conclusions
Our interactome is a comprehensive genome-wide network for the model plant Arabidopsis thaliana, and provides a rich resource for researchers in related fields to study protein function, molecular interactions and potential mechanisms under different conditions.
Background
Genomic selection has gained much attention, and its main goal is to increase predictive accuracy and genetic gain in livestock using dense marker information. Most methods dealing with the large p (number of covariates), small n (number of observations) problem have dealt only with continuous traits, but many important traits in livestock are recorded in a discrete fashion (e.g. pregnancy outcome, disease resistance). It is therefore necessary to evaluate alternatives for analyzing discrete traits in a genome-wide prediction context.
Methods
This study evaluates two threshold versions of Bayesian regressions (Bayes A and the Bayesian LASSO) and two machine learning algorithms (boosting and random forest) for analyzing discrete traits in a genome-wide prediction context. These methods were evaluated using simulated and field data to predict yet-to-be-observed records. Performances were compared based on the models' predictive ability.
Results
The simulation showed that machine learning had some advantages over Bayesian regressions when a small number of QTL regulated the trait under pure additivity. However, the differences were small and disappeared with a large number of QTL. The Bayesian threshold LASSO and boosting achieved the highest accuracies, whereas random forest presented the highest classification performance. Random forest was the most consistent method in detecting resistant and susceptible animals; its phi correlation was up to 81% greater than that of the Bayesian regressions. Random forest also outperformed the other methods in correctly classifying resistant and susceptible animals in the two pure swine lines evaluated, while boosting and Bayes A were more accurate with crossbred data.
Conclusions
The results of this study suggest that the best method for genome-wide prediction may depend on the genetic basis of the population analyzed. All methods were less accurate at correctly classifying intermediate animals than extreme animals. Among the different alternatives proposed to analyze discrete traits, machine learning showed some advantages over Bayesian regressions. Boosting with a pseudo-Huber loss function showed high accuracy, whereas random forest produced more consistent results and an interesting predictive ability. Nonetheless, the best method may be case-dependent, and an initial evaluation of different methods is recommended for a particular problem.

Real-time, accurate traffic congestion prediction can enable Intelligent Traffic Management Systems (ITMSs) that replace traditional systems to improve traffic efficiency and reduce congestion. An ITMS consists of three main layers: the Internet of Things (IoT), edge, and cloud layers. The edge layer can collect real-time data from different routes through IoT devices such as wireless sensors, and it can compute and store the collected data before transmitting it to the cloud for further processing. Thus, the edge is an intermediate layer between the IoT and cloud layers that receives the data transmitted by the IoT layer, overcoming cloud challenges such as high latency. In this paper, a novel real-time traffic congestion prediction strategy (TCPS) is proposed based on the data collected in the edge's cache server at the edge layer. The proposed TCPS contains three stages: (i) a real-time congestion prediction (RCP) stage, (ii) a congestion direction detection (CD2) stage, and (iii) a width change decision (WCD) stage. The RCP stage predicts traffic congestion based on the causes of congestion in the hotspot using a fuzzy inference system.
If congestion is predicted, the CD2 stage detects the congestion direction based on the predictions from the RCP stage, using the Optimal Weighted Naïve Bayes (OWNB) method. The WCD stage aims to prevent congestion from occurring: it changes the width of changeable routes (CR) after the direction of congestion has been detected in CD2. The experimental results have shown that the proposed TCPS outperforms other recent methodologies, providing the highest accuracy (95%), precision (74%), and recall (75%), and the lowest error (5%).
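The OWNB method itself is specific to this work; as a rough, hypothetical sketch of the underlying idea, a feature-weighted naïve Bayes classifier over categorical traffic features (made-up feature names, Laplace smoothing) could look like:

```python
import math
from collections import Counter, defaultdict

def train(samples, labels):
    """Collect per-class priors and per-feature value counts."""
    priors = Counter(labels)
    cond = defaultdict(Counter)  # (class, feature_index) -> value counts
    for x, y in zip(samples, labels):
        for i, v in enumerate(x):
            cond[(y, i)][v] += 1
    return priors, cond

def predict(priors, cond, x, weights):
    """Pick the class maximizing log P(c) + sum_i w_i * log P(x_i | c),
    with Laplace smoothing; the weights let some features count more."""
    total = sum(priors.values())
    best, best_lp = None, -math.inf
    for c, n in priors.items():
        lp = math.log(n / total)
        for i, v in enumerate(x):
            counts = cond[(c, i)]
            lp += weights[i] * math.log((counts[v] + 1) / (sum(counts.values()) + 2))
        if lp > best_lp:
            best, best_lp = c, lp
    return best

# Toy data: (speed_level, density_level) -> congestion direction (hypothetical).
X = [("low", "high"), ("low", "high"), ("high", "low"), ("high", "low")]
y = ["north", "north", "south", "south"]
priors, cond = train(X, y)
print(predict(priors, cond, ("low", "high"), weights=[1.0, 2.0]))  # → north
```

In OWNB the weights would be chosen optimally rather than fixed by hand as they are here.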
The prediction of domain/linker residues in protein sequences is a crucial task in the functional classification of proteins, homology-based protein structure prediction, and high-throughput structural genomics. In this work, a novel consensus-based machine-learning technique was applied for residue-level prediction of domain/linker annotations in protein sequences, using ordered/disordered regions along protein chains and a set of physicochemical properties. Six different classifiers (decision tree, Gaussian naïve Bayes, linear discriminant analysis, support vector machine, random forest, and multilayer perceptron) were exhaustively explored for the residue-level prediction of domain/linker regions. Protein sequences from the curated CATH database were used for training and cross-validation experiments. Test results obtained by applying the developed PDP-CON tool to the mutually exclusive, independent proteins of the CASP-8, CASP-9, and CASP-10 databases are reported. An n-star quality consensus approach was used to combine the results yielded by the different classifiers. The average PDP-CON accuracy and F-measure values for the CASP targets were found to be 0.86 and 0.91, respectively. The dataset, source code, and all supplementary materials for this work are available at https://cmaterju.org/cmaterbioinfo/ for noncommercial use.
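The n-star quality consensus is defined in the paper; a simplified majority-vote sketch of combining residue-level domain/linker calls from several classifiers, with a plain vote threshold standing in for the quality criterion:

```python
from collections import Counter

def consensus(predictions, min_votes):
    """Residue-level consensus over several classifiers' output strings:
    keep the majority label at each position if it has at least min_votes
    votes, otherwise mark the position as unresolved ('?')."""
    merged = []
    for column in zip(*predictions):          # one column per residue
        label, votes = Counter(column).most_common(1)[0]
        merged.append(label if votes >= min_votes else "?")
    return "".join(merged)

# Per-residue domain ('D') / linker ('L') calls from three hypothetical classifiers.
preds = ["DDDLL", "DDLLL", "DDDLD"]
print(consensus(preds, min_votes=2))  # → "DDDLL"
```

Raising `min_votes` trades coverage for confidence, which is the same trade-off the n-star scheme tunes across the six classifiers.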