Similar articles
 20 similar articles found (search time: 15 ms)
1.
This paper presents some essential findings and results on using ranking-based kernels for the analysis and utilization of high dimensional and noisy biomedical data in applied clinical diagnostics. We claim that the presented kernels, combined with a state-of-the-art classification technique, a Support Vector Machine (SVM), could significantly improve the classification rate and predictive power of the wrapper method, e.g. SVM. Moreover, the advantage of such kernels could potentially be exploited for other kernel methods and essential computer-aided tasks such as novelty detection and clustering. Our experimental results and theoretical generalization bounds imply that ranking-based kernels outperform other traditionally employed SVM kernels on high dimensional biomedical and microarray data.

2.

Background  

The construction of interaction networks between proteins is central to understanding the underlying biological processes. However, since many useful relations are excluded from databases and remain hidden in raw text, the study of automatic interaction extraction from text is important in the bioinformatics field.

3.
MOTIVATION: In order to extract protein sequences from nucleotide sequences, it is an important step to recognize points at which regions start that code for proteins. These points are called translation initiation sites (TIS). RESULTS: The task of finding TIS can be modeled as a classification problem. We demonstrate the applicability of support vector machines for this task, and show how to incorporate prior biological knowledge by engineering an appropriate kernel function. With the described techniques the recognition performance can be improved by 26% over leading existing approaches. We provide evidence that existing related methods (e.g. ESTScan) could profit from advanced TIS recognition.

4.
Guo Y  Yu L  Wen Z  Li M 《Nucleic acids research》2008,36(9):3025-3030
Compared to the available protein sequences of different organisms, the number of revealed protein-protein interactions (PPIs) is still very limited. Many computational methods have therefore been developed to facilitate the identification of novel PPIs. However, methods that use only protein sequence information are more universal than those that depend on additional information or predictions about the proteins. In this article, a sequence-based method is proposed by combining a new feature representation using auto covariance (AC) and support vector machine (SVM). AC accounts for the interactions between residues a certain distance apart in the sequence, so this method adequately takes the neighbouring effect into account. When performed on the PPI data of yeast Saccharomyces cerevisiae, the method achieved a very promising prediction result. An independent data set of 11,474 yeast PPIs was used to evaluate this prediction model and the prediction accuracy is 88.09%. The performance of this method is superior to those of the existing sequence-based methods, so it can be a useful supplementary tool for future proteomics studies. The prediction software and all data sets used in this article are freely available at http://www.scucic.cn/Predict_PPI/index.htm.
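
The abstract does not give the AC formula, so the following is only a minimal sketch of one common definition: for a per-residue property series, the covariance between positions a fixed lag apart, averaged over the sequence. The function names, the normalization by (n - lag), and the idea of concatenating values over several properties and lags are assumptions made for illustration, not the authors' published implementation.

```python
def auto_covariance(values, lag):
    """Auto covariance of one per-residue property series at a given lag.

    Assumed form: mean over i of (x_i - mean) * (x_{i+lag} - mean).
    """
    n = len(values)
    if not 0 < lag < n:
        raise ValueError("lag must satisfy 0 < lag < len(values)")
    mean = sum(values) / n
    total = sum(
        (values[i] - mean) * (values[i + lag] - mean) for i in range(n - lag)
    )
    return total / (n - lag)


def ac_features(property_series, max_lag):
    """Concatenate AC values for every property series and lags 1..max_lag.

    property_series is a list of equal-length numeric lists, one per
    (hypothetical) physicochemical property of the residues.
    """
    return [
        auto_covariance(series, lag)
        for series in property_series
        for lag in range(1, max_lag + 1)
    ]
```

Under these assumptions every protein maps to a vector of fixed length (number of properties times max_lag) regardless of sequence length, which is what allows a standard SVM to be trained on it.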

5.
Lo SL  Cai CZ  Chen YZ  Chung MC 《Proteomics》2005,5(4):876-884
Knowledge of protein-protein interaction is useful for elucidating protein function via the concept of 'guilt-by-association'. A statistical learning method, Support Vector Machine (SVM), has recently been explored for the prediction of protein-protein interactions using artificial shuffled sequences as hypothetical noninteracting proteins and it has shown promising results (Bock, J. R., Gough, D. A., Bioinformatics 2001, 17, 455-460). It remains unclear, however, how the prediction accuracy is affected if real protein sequences are used to represent noninteracting proteins. In this work, this effect is assessed by comparing the results derived from the use of real protein sequences with those derived from the use of shuffled sequences. The real protein sequences of hypothetical noninteracting proteins are generated from an exclusion analysis in combination with subcellular localization information of interacting proteins found in the Database of Interacting Proteins. Prediction accuracy using real protein sequences is 76.9% compared to 94.1% using artificial shuffled sequences. The discrepancy likely arises from the expected higher level of difficulty in separating two sets of real protein sequences than in separating a set of real protein sequences from a set of artificial sequences. The use of real protein sequences for training an SVM classification system is expected to give better prediction results in practical cases. This is tested by using both SVM systems for predicting putative protein partners of a set of thioredoxin related proteins. The prediction results are consistent with observations, suggesting that real sequences are more practically useful in developing SVM classification systems for facilitating protein-protein interaction prediction.

6.
MOTIVATION: Contrasts are useful conceptual vehicles for learning processes and exploratory research of the unknown. For example, contrastive information between proteins can reveal the similarities, divergences and relations between two proteins, leading to invaluable insights for a better understanding of the proteins. Such contrastive information is reported in the biomedical literature. However, there have been no reported attempts in current biomedical text mining work to systematically extract and present such useful contrastive information from the literature for exploitation. RESULTS: Our BioContrasts system extracts protein-protein contrastive information from MEDLINE abstracts and presents the information to biologists in a web-application for exploitation. Contrastive information is identified in the text abstracts with contrastive negation patterns such as 'A but not B'. A total of 799 169 pairs of contrastive expressions were successfully extracted from 2.5 million MEDLINE abstracts. Using grounding of contrastive protein names to Swiss-Prot entries, we were able to produce 41 471 pieces of contrasts between Swiss-Prot protein entries. These contrastive pieces of information are then presented via a user-friendly interactive web portal that can be exploited for applications such as the refinement of biological pathways. AVAILABILITY: BioContrasts can be accessed at http://biocontrasts.i2r.a-star.edu.sg. It is also mirrored at http://biocontrasts.biopathway.org. SUPPLEMENTARY INFORMATION: Supplementary materials are available at Bioinformatics online.
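
To illustrate what a contrastive negation pattern such as 'A but not B' might look like in practice, here is a deliberately simplified sketch. The regular expression, the function name, and the restriction to single-token names are all assumptions for illustration; the abstract does not describe how BioContrasts actually matches patterns or grounds names to Swiss-Prot.

```python
import re

# Hypothetical pattern: two single-token names joined by "but not".
# Real protein names can span several tokens; this sketch ignores that.
BUT_NOT = re.compile(r"\b([A-Za-z0-9-]+)\s+but\s+not\s+([A-Za-z0-9-]+)\b")


def find_contrasts(sentence):
    """Return (A, B) pairs for every 'A but not B' match in the sentence."""
    return BUT_NOT.findall(sentence)
```

For example, `find_contrasts("The kinase binds ProtA but not ProtB in vitro.")` returns `[("ProtA", "ProtB")]`, where ProtA and ProtB are made-up names. A practical system would still need a name recognizer to decide whether each side is really a protein.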

7.

Background  

The prediction of protein-protein binding sites can provide structural annotation for the protein interaction data from proteomics studies. This is very important for the biological application of the rapidly increasing protein interaction data. Moreover, methods for predicting protein interaction sites can also provide crucial information for improving the speed and accuracy of protein docking methods.

8.
Extracting protein-protein interaction (PPI) from biomedical literature is an important task in biomedical text mining (BioTM). In this paper, we propose a hash subgraph pairwise (HSP) kernel-based approach for this task. The key to the novel kernel is to use hierarchical hash labels to express the structural information of subgraphs in linear time. We apply the graph kernel to dependency graphs representing the sentence structure for the protein-protein interaction extraction task, which can efficiently make use of full graph structural information, and in particular capture the contiguous topological and label information ignored before. We evaluate the proposed approach on five publicly available PPI corpora. The experimental results show that our approach significantly outperforms the all-path kernel approach on all five corpora and achieves state-of-the-art performance.

9.
Kim D  Yu H 《PloS one》2011,6(1):e15338

Background

Figures are ubiquitous in biomedical full-text articles, and they represent important biomedical knowledge. However, the sheer volume of biomedical publications has made it necessary to develop computational approaches for accessing figures. Therefore, we are developing the Biomedical Figure Search engine (http://figuresearch.askHERMES.org) to allow bioscientists to access figures efficiently. Since text frequently appears in figures, automatically extracting such text may assist the task of mining information from figures. Little research, however, has been conducted exploring text extraction from biomedical figures.

Methodology

We first evaluated an off-the-shelf Optical Character Recognition (OCR) tool on its ability to extract text from figures appearing in biomedical full-text articles. We then developed a Figure Text Extraction Tool (FigTExT) to improve the performance of the OCR tool for figure text extraction through the use of three innovative components: image preprocessing, character recognition, and text correction. We first developed image preprocessing to enhance image quality and to improve text localization. Then we adapted the off-the-shelf OCR tool on the improved text localization for character recognition. Finally, we developed and evaluated a novel text correction framework by taking advantage of figure-specific lexicons.

Results/Conclusions

The evaluation on 382 figures (9,643 figure texts in total) randomly selected from PubMed Central full-text articles shows that FigTExT performed with 84% precision, 98% recall, and 90% F1-score for text localization and with 62.5% precision, 51.0% recall and 56.2% F1-score for figure text extraction. When limiting figure texts to those judged by domain experts to be important content, FigTExT performed with 87.3% precision, 68.8% recall, and 77% F1-score. FigTExT significantly improved the performance of the off-the-shelf OCR tool we used, which on its own performed with 36.6% precision, 19.3% recall, and 25.3% F1-score for text extraction. In addition, our results show that FigTExT can extract texts that do not appear in figure captions or other associated text, further suggesting the potential utility of FigTExT for improving figure search.

Figure 9. Additional reasons for OCR errors. (A) High image complexity. (B) Thick stroke. (C) Low image contrast. (D) Small font size. (E) Non-standard font type.
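
The F1-scores quoted in the results for this entry are consistent with the standard definition of F1 as the harmonic mean of precision and recall. The short sketch below recomputes them from the reported precision and recall values; the function name is an assumption, and only the numbers come from the abstract.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both given as fractions)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision/recall pairs as reported in the abstract.
reported = {
    "text localization": (0.84, 0.98),
    "figure text extraction": (0.625, 0.510),
    "important content only": (0.873, 0.688),
    "off-the-shelf OCR alone": (0.366, 0.193),
}

for task, (p, r) in reported.items():
    print(f"{task}: F1 = {f1_score(p, r):.3f}")
```

The computed values (about 0.905, 0.562, 0.770 and 0.253) agree with the reported 90%, 56.2%, 77% and 25.3% once rounded.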

10.
Protein-protein interaction (PPI) prediction is a central task in achieving a better understanding of cellular and intracellular processes. Because high-throughput experimental methods are both expensive and time-consuming, and are also known to suffer from incompleteness and noise, many computational methods have been developed, with varied degrees of success. However, the inference of PPI networks from multiple heterogeneous data sources remains a great challenge. In this work, we developed a novel method based on approximate Bayesian computation and modified differential evolution sampling (ABC-DEP) and the regularized Laplacian (RL) kernel. The method enables inference of PPI networks from topological properties and multiple heterogeneous features, including gene expression and Pfam domain profiles, in the form of weighted kernels. The optimal weights are obtained by ABC-DEP, and the kernel fusion built from the optimal weights serves as input to RL to infer missing or new edges in the PPI network. Detailed comparisons with control methods have been made, and the results show that the accuracy of PPI prediction measured by AUC is increased by up to 23%, as compared to a baseline without using optimal weights. The method can provide insights into the relations between PPIs and various feature kernels and demonstrates a strong capability of predicting faraway interactions that cannot be well detected by the traditional RL method.

11.
12.
Chronic obstructive pulmonary disease (COPD) is a complex human disease with a high mortality rate. So far, studies of COPD have not been well organized despite the well-documented role of cigarette smoking in the genesis of COPD. In recent years, microarray analyses have helped to identify some potential disease-related genes. However, the low reproducibility of many published gene signatures has been criticized. It has therefore been suggested that incorporating network or pathway information into prognostic biomarker discovery might improve prediction performance. In this analysis, we combined protein-protein interaction (PPI) information with the support vector machine (SVM) method to identify potential COPD-related genes that would allow one to accurately distinguish severe emphysema from non-/mildly emphysematous lung tissue. We identified 8 COPD-related feature genes. When compared with another SVM method which did not use the prior PPI information, the prediction accuracy was significantly enhanced (AUC increased from 0.513 to 0.909). On the basis of these results, one can suppose that incorporating a network of prior knowledge into gene selection methods significantly improves classification accuracy. Consequently, the gene expression profiles from human emphysematous lung tissue may provide insight into the pathogenesis, and a good classification prediction algorithm based on prior biological knowledge can further strengthen this performance.

13.
MOTIVATION: To understand biological processes, we must clarify how proteins interact with each other. However, since information about protein-protein interactions still exists primarily in the scientific literature, it is not accessible in a computer-readable format. Efficient processing of large amounts of interactions therefore needs an intelligent information extraction method. Our aim is to develop an efficient method for extracting information on protein-protein interaction from scientific literature. RESULTS: We present a method for extracting information on protein-protein interactions from the scientific literature. This method, which employs only a protein name dictionary, surface clues on word patterns and simple part-of-speech rules, achieved high recall and precision rates for yeast (recall = 86.8% and precision = 94.3%) and Escherichia coli (recall = 82.5% and precision = 93.5%). The result of extraction suggests that our method should be applicable to any species for which a protein name dictionary is constructed. AVAILABILITY: The program is available on request from the authors.

14.
15.
We study the number of causal variants and associated regions identified by top SNPs in rankings given by the popular 1 df chi-squared statistic, support vector machine (SVM) and the random forest (RF) on simulated and real data. If we apply the SVM and RF to the top 2r chi-square-ranked SNPs, where r is the number of SNPs with P-values within the Bonferroni correction, we find that both improve the ranks of causal variants and associated regions and achieve higher power on simulated data. These improvements, however, as well as stability of the SVM and RF rankings, progressively decrease as the cutoff increases to 5r and 10r. As applications we compare the ranks of previously replicated SNPs in real data, associated regions in type 1 diabetes, as provided by the Type 1 Diabetes Consortium, and disease risk prediction accuracies as given by top ranked SNPs by the three methods. Software and webserver are available at http://svmsnps.njit.edu.

16.
Freshwater crayfish are among the most important aquatic organisms: they play a pivotal role in the aquatic food chain and serve as bioindicators for aquatic ecosystem health assessment. Hemocytes, the blood cells of crustaceans, can be considered stress and health indicators in crayfish, and are used to evaluate the health response. Total hemocyte cell numbers (THCs) are therefore useful parameters for showing the health of crustaceans and serve as stress indicators for judging the quality of the habitat. Since catching fish and other aquatic organisms and collecting data for further assessment are time-consuming and frustrating, scientists today tend to use swifter, more sophisticated, and more reliable methods for modeling ecosystem stressors based on bioindicators. One tool that has attracted the attention of the scientific community in recent decades is machine learning, whose algorithms are reliable and accurate methods for solving classification and regression problems. In this study, a support vector machine is used as a machine learning algorithm to classify healthy and unhealthy crayfish based on physiological characteristics. To handle the non-linearity of the data by mapping it to a high-dimensional space, different kernel functions, including polynomial (PK), Pearson VII function-based universal (PUK), and radial basis function (RBF) kernels, are used and their effect on the performance of the SVM model is evaluated. Both PK and PUK functions performed well in classifying the crayfish. RBF, however, had an adverse impact on the performance of the model. The PUK kernel exhibited an outstanding performance (Accuracy = 100%) for the classification of the healthy and unhealthy crayfish.
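
As a rough illustration of the three kernel families named in this entry, the sketch below writes each one as a plain function of two feature vectors. The parameter names and default values (degree, coef0, gamma, sigma, omega) are assumptions, and the PUK expression follows the commonly cited Pearson VII form; the abstract itself gives no formulas or settings.

```python
import math


def squared_distance(x, y):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def polynomial_kernel(x, y, degree=2, coef0=1.0):
    """Polynomial kernel: (x . y + coef0) ** degree."""
    return (sum(a * b for a, b in zip(x, y)) + coef0) ** degree


def rbf_kernel(x, y, gamma=1.0):
    """Radial basis function kernel: exp(-gamma * ||x - y||^2)."""
    return math.exp(-gamma * squared_distance(x, y))


def puk_kernel(x, y, sigma=1.0, omega=1.0):
    """Pearson VII universal kernel (assumed form).

    1 / (1 + (2 * ||x - y|| * sqrt(2^(1/omega) - 1) / sigma)^2) ^ omega
    """
    dist = math.sqrt(squared_distance(x, y))
    scaled = 2.0 * dist * math.sqrt(2.0 ** (1.0 / omega) - 1.0) / sigma
    return 1.0 / (1.0 + scaled ** 2) ** omega
```

RBF and PUK both equal 1 for identical inputs and decay with distance; in the PUK form, sigma and omega control the width and tailing of that decay.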

17.
Han DS  Kim HS  Jang WH  Lee SD  Suh JK 《Nucleic acids research》2004,32(21):6312-6320
With the accumulation of protein and related data on the Internet, many domain-based computational techniques to predict protein interactions have been developed. However, most techniques still have many limitations when used in practice. They usually suffer from low prediction accuracy and do not provide any interaction possibility ranking method for multiple protein pairs. In this paper, we propose a probabilistic framework to predict the interaction probability of proteins and develop an interaction possibility ranking method for multiple protein pairs. Using the ranking method, one can discern which of multiple protein pairs are more likely to interact with each other. The validity of the prediction model was evaluated using an interacting set of protein pairs in yeast and an artificially generated non-interacting set of protein pairs. When 80% of the set of interacting protein pairs in the DIP (Database of Interacting Proteins) was used as a learning set of interacting protein pairs, high sensitivity (77%) and specificity (95%) were achieved for the test groups containing common domains with the learning set of proteins within our framework. The stability of the prediction model was also evident when tested over DIP CORE, HMS-PCI and TAP data. In the validation of the ranking method, we reveal that some correlations exist between the interacting probability and the accuracy of the prediction.

18.
Improved method for predicting beta-turn using support vector machine   Total citations: 2, self-citations: 0, citations by others: 2
MOTIVATION: Numerous methods for predicting beta-turns in proteins have been developed based on various computational schemes. Here, we introduce a new method of beta-turn prediction that uses the support vector machine (SVM) algorithm together with predicted secondary structure information. Various parameters from the SVM have been adjusted to achieve optimal prediction performance. RESULTS: The SVM method achieved excellent performance as measured by the Matthews correlation coefficient (MCC = 0.45) using a 7-fold cross validation on a database of 426 non-homologous protein chains. To the best of our knowledge, this MCC value is the highest achieved so far for predicting beta-turns. The overall prediction accuracy Qtotal was 77.3%, which is the best among the existing prediction methods. Among its unique attractive features, the present SVM method avoids overtraining, compresses information and provides a predicted reliability index.
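
For readers unfamiliar with the Matthews correlation coefficient quoted in this entry, here is a minimal sketch of its standard definition from a binary confusion matrix. The function name and the convention of returning 0 when the denominator vanishes are assumptions; the abstract reports only the resulting value (MCC = 0.45), not the underlying counts.

```python
import math


def matthews_corrcoef(tp, tn, fp, fn):
    """MCC from true/false positive and negative counts.

    MCC = (tp*tn - fp*fn) / sqrt((tp+fp)(tp+fn)(tn+fp)(tn+fn)).
    Returns 0.0 when the denominator is zero (a common convention).
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom
```

MCC ranges from -1 to 1, with 0 corresponding to chance-level prediction. It is often reported alongside overall accuracy because it remains informative when one class is much rarer than the other.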

19.
20.
BACKGROUND: Interferon-alpha (IFNalpha) in combination with ribavirin can be used for the treatment of patients with chronic hepatitis C. This therapeutic approach achieves an overall sustained response rate of approximately 40%, but treatment takes 6-12 months and patients often experience significant adverse reactions. OBJECTIVE: We aim to develop a tool to distinguish potential responders from nonresponders prior to initiation of IFNalpha-ribavirin treatment. METHODS: Using single nucleotide polymorphisms (SNPs) and viral genotype, we applied the support vector machine (SVM) algorithm to build a tool to predict responsiveness to IFNalpha-ribavirin combination therapy. Furthermore, we utilized the SVM algorithm with the recursive feature elimination method to identify a subset of factors that are significantly more influential than the others. RESULTS AND CONCLUSION: The SVM model is a promising method for inferring responsiveness to IFNalpha dealing with the complex nonlinear relationship between factors (such as SNPs and viral genotype) and successful therapy. In this study, we demonstrate that our tool may allow patients and doctors to make more informed decisions by analyzing host SNP and viral genotype information.
