Similar literature
20 similar documents found
1.
Intrinsically disordered regions (IDRs) play an important role in key biological processes and are closely related to human diseases. IDRs have great potential to serve as targets for drug discovery, most notably in disordered binding regions. Accurate prediction of IDRs is challenging because their genome-wide occurrence and a low ratio of disordered residues make them difficult targets for traditional classification techniques. Existing computational methods mostly rely on sequence profiles to improve accuracy, which is time-consuming and computationally expensive. This article describes an ab initio sequence-only prediction method, based on reduced amino acid alphabets and convolutional neural networks (CNNs), that tries to overcome the challenge of accurate prediction posed by IDRs. We experiment with six different 3-letter reduced alphabets. We argue that the dimensional reduction in the input alphabet facilitates the detection of complex patterns within the sequence by the convolutional step. Experimental results show that our proposed IDR predictor performs at the same level as or outperforms other state-of-the-art methods in the same class, achieving an accuracy of 0.76 and an AUC of 0.85 on the publicly available Critical Assessment of protein Structure Prediction dataset (CASP10). Therefore, our method is suitable for proteome-wide disorder prediction, yielding similar or better accuracy than existing approaches at a faster speed.
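As a rough illustration of the reduced-alphabet idea in this abstract, the Python sketch below maps a protein sequence onto a 3-letter alphabet and one-hot encodes it, which is the kind of input a convolutional layer could consume. The grouping used here (hydrophobic, polar, charged) and the names GROUPS, reduce_sequence and one_hot are assumptions made for illustration; the abstract does not say which six alphabets were tested.

```python
# Hypothetical 3-letter reduced alphabet: H = hydrophobic, P = polar, C = charged.
# This grouping is an illustrative assumption, not one of the paper's six alphabets.
GROUPS = {
    "H": "AVLIMFWC",
    "P": "GSTYNQP",
    "C": "DEKRH",
}

# Invert the grouping into a residue -> reduced-letter lookup (20 entries).
REDUCED_ALPHABET = {aa: letter for letter, aas in GROUPS.items() for aa in aas}
LETTERS = sorted(GROUPS)  # fixed column order: ["C", "H", "P"]


def reduce_sequence(seq):
    """Rewrite a protein sequence in the 3-letter alphabet.

    Raises KeyError for residues outside the 20 standard amino acids.
    """
    return "".join(REDUCED_ALPHABET[aa] for aa in seq.upper())


def one_hot(reduced):
    """Encode a reduced sequence as a list of 3-element 0/1 rows."""
    return [[1 if letter == col else 0 for col in LETTERS] for letter in reduced]


if __name__ == "__main__":
    reduced = reduce_sequence("MKTAYD")
    print(reduced)
    print(one_hot(reduced))
```

With only three input channels instead of twenty, each convolutional filter has far fewer weights per position, which is one way to read the abstract's argument about dimensional reduction.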

2.
We propose a machine-learning approach to sequence-based prediction of protein crystallizability in which we exploit subtle differences between proteins whose structures were solved by X-ray analysis [or by both X-ray and nuclear magnetic resonance (NMR) spectroscopy] and those proteins whose structures were solved by NMR spectroscopy alone. Because the NMR technique is usually applied on relatively small proteins, sequence length distributions of the X-ray and NMR datasets were adjusted to avoid predictions biased by protein size. As feature space for classification, we used frequencies of mono-, di-, and tripeptides represented by the original 20-letter amino acid alphabet as well as by several reduced alphabets in which amino acids were grouped by their physicochemical and structural properties. The classification algorithm was constructed as a two-layered structure in which the output of primary support vector machine classifiers operating on peptide frequencies was combined by a second-level Naive Bayes classifier. Due to the application of metamethods for cost sensitivity, our method is able to handle real datasets with unbalanced class representation. An overall prediction accuracy of 67% [65% on the positive (crystallizable) and 69% on the negative (noncrystallizable) class] was achieved in a 10-fold cross-validation experiment, indicating that the proposed algorithm may be a valuable tool for more efficient target selection in structural genomics. A Web server for protein crystallizability prediction called SECRET is available at http://webclu.bio.wzw.tum.de:8080/secret.
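The mono-, di-, and tripeptide frequency features mentioned in this abstract can be sketched in a few lines of Python. The function name peptide_frequencies and the choice of overlapping windows are assumptions for illustration; the abstract does not state how the peptides were counted.

```python
from collections import Counter


def peptide_frequencies(seq, k):
    """Relative frequency of each overlapping k-peptide in seq.

    k = 1, 2, 3 gives mono-, di-, and tripeptide frequencies.
    Overlapping windows are an assumption, not confirmed by the abstract.
    """
    total = len(seq) - k + 1
    if total <= 0:
        return {}  # sequence shorter than k: no peptides to count
    counts = Counter(seq[i:i + k] for i in range(total))
    return {pep: n / total for pep, n in counts.items()}


if __name__ == "__main__":
    for k in (1, 2, 3):
        print(k, peptide_frequencies("MKTAYDAYD", k))
```

The same function works on a sequence that has first been rewritten in a reduced alphabet, which shrinks the tripeptide feature space from 20^3 = 8000 possible entries to a much smaller number.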

3.
MOTIVATION: A large volume of experimental data on protein phosphorylation is buried in the fast-growing PubMed literature. While of great value, such information is limited in databases owing to the laborious process of literature-based curation. Computational literature mining holds promise to facilitate database curation. RESULTS: A rule-based system, RLIMS-P (Rule-based LIterature Mining System for Protein Phosphorylation), was used to extract protein phosphorylation information from MEDLINE abstracts. An annotation-tagged literature corpus developed at PIR was used to evaluate the system for finding phosphorylation papers and extracting phosphorylation objects (kinases, substrates and sites) from abstracts. RLIMS-P achieved a precision and recall of 91.4 and 96.4% for paper retrieval, and of 97.9 and 88.0% for extraction of substrates and sites. Coupling the high recall for paper retrieval and high precision for information extraction, RLIMS-P facilitates literature mining and database annotation of protein phosphorylation.

4.
Mutations help us to understand the molecular origins of diseases. Researchers, therefore, both publish and seek disease-relevant mutations in public databases and in scientific literature, e.g. Medline. The retrieval tends to be time-consuming and incomplete. Automated screening of the literature is more efficient. We developed extraction methods (called MEMA) that scan Medline abstracts for mutations. MEMA identified 24,351 singleton mutations in conjunction with a HUGO gene name out of 16,728 abstracts. From a sample of 100 abstracts we estimated the recall for the identification of mutation-gene pairs at 35% with a precision of 93%. Recall for the mutation detection alone was >67% with a precision rate of >96%. This shows that our system produces reliable data. The subset consisting of protein sequence mutations (PSMs) from MEMA was compared to the entries in OMIM (20,503 entries versus 6699, respectively). We found 1826 PSM-gene pairs to be common to both datasets (cross-validated). This is 27% of all PSM-gene pairs in OMIM and 91% of those pairs from OMIM which co-occur in at least one Medline abstract. We conclude that Medline covers a large portion of the mutations known to OMIM. Another large portion could be artificially produced mutations from mutagenesis experiments. Access to the database of extracted mutation-gene pairs is available through the web pages of the EBI (refer to http://www.ebi.ac.uk/rebholz/index.html).
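To make the idea of scanning abstracts for mutation mentions concrete, here is a minimal Python sketch based on a regular expression for point mutations written as wild-type residue, position, mutant residue (for example R117H or Gly551Asp). The pattern and the names MUTATION_RE and find_mutations are hypothetical; the abstract does not describe the patterns MEMA actually uses.

```python
import re

# One-letter and three-letter codes for the 20 standard amino acids.
ONE = "ACDEFGHIKLMNPQRSTVWY"
THREE = ("Ala|Arg|Asn|Asp|Cys|Gln|Glu|Gly|His|Ile|"
         "Leu|Lys|Met|Phe|Pro|Ser|Thr|Trp|Tyr|Val")

# Hypothetical pattern: <wild type><position><mutant>, e.g. "R117H" or "Gly551Asp".
MUTATION_RE = re.compile(
    rf"\b(?:([{ONE}])(\d+)([{ONE}])|({THREE})(\d+)({THREE}))\b"
)


def find_mutations(text):
    """Return (wild_type, position, mutant) tuples found in text."""
    hits = []
    for m in MUTATION_RE.finditer(text):
        # Groups 1-3 hold a one-letter match, groups 4-6 a three-letter match.
        wild, pos, mutant = m.group(1, 2, 3) if m.group(1) else m.group(4, 5, 6)
        hits.append((wild, int(pos), mutant))
    return hits


if __name__ == "__main__":
    print(find_mutations("The R117H and Gly551Asp variants of CFTR were studied."))
```

A pattern this simple will also match tokens that are not mutations, such as the cell line name T47D, so a real system would need context checks and a gene-name lookup before pairing a mutation with a gene.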

5.
The recognition and normalization of gene mentions in biomedical literature are crucial steps in biomedical text mining. We present a system for extracting gene names from biomedical literature and normalizing them to gene identifiers in databases. The system consists of four major components: gene name recognition, entity mapping, disambiguation and filtering. The first component is a gene name recognizer based on dictionary matching and semi-supervised learning, which utilizes the co-occurrence information of a large amount of unlabeled MEDLINE abstracts to enhance feature representation of gene named entities. In the stage of entity mapping, we combine the strategies of exact match and approximate match to establish linkage between gene names in the context and the EntrezGene database. For the gene names that map to more than one database identifier, we develop a disambiguation method based on semantic similarity derived from the Gene Ontology and MEDLINE abstracts. To remove the noise produced in the previous steps, we design a filtering method based on the confidence scores in the dictionary used for NER. The system is able to adjust the trade-off between precision and recall based on the result of filtering. It achieves an F-measure of 83% (precision: 82.5%, recall: 83.5%) on the BioCreative II Gene Normalization (GN) dataset, which is comparable to the current state-of-the-art.
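The reported F-measure can be checked against the reported precision and recall, assuming it is the balanced F1 score (the harmonic mean of the two); the abstract does not state the weighting. The function name f_measure and the beta parameter are illustrative.

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall.

    beta = 1 gives the balanced F1 score, assumed here to be
    the "F-measure" quoted in the abstract.
    """
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


if __name__ == "__main__":
    # Precision and recall as percentages, taken from the abstract.
    print(round(f_measure(82.5, 83.5), 1))
```

Under that assumption the result rounds to 83.0, which agrees with the 83% given in the abstract.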

6.
7.
MOTIVATION: To understand biological processes, we must clarify how proteins interact with each other. However, since information about protein-protein interactions still exists primarily in the scientific literature, it is not accessible in a computer-readable format. Efficient processing of large amounts of interactions therefore needs an intelligent information extraction method. Our aim is to develop an efficient method for extracting information on protein-protein interactions from scientific literature. RESULTS: We present a method for extracting information on protein-protein interactions from the scientific literature. This method, which employs only a protein name dictionary, surface clues on word patterns and simple part-of-speech rules, achieved high recall and precision rates for yeast (recall = 86.8% and precision = 94.3%) and Escherichia coli (recall = 82.5% and precision = 93.5%). The result of extraction suggests that our method should be applicable to any species for which a protein name dictionary is constructed. AVAILABILITY: The program is available on request from the authors.

8.

Background  

In this paper, an optimization approach is proposed for producing reduced alphabets for peptide classification using a Genetic Algorithm. The classification task is performed by a multi-classifier system in which each classifier (a linear or radial basis function Support Vector Machine) is trained using features extracted with a different reduced alphabet. Each alphabet is constructed by a Genetic Algorithm whose objective function is the maximization of the area under the ROC curve obtained in several classification problems.

9.
Building an abbreviation dictionary using a term recognition approach   Total citations: 1; self-citations: 0; citations by others: 1
MOTIVATION: Acronyms result from a highly productive type of term variation and trigger the need for an acronym dictionary to establish associations between acronyms and their expanded forms. RESULTS: We propose a novel method for recognizing acronym definitions in a text collection. Assuming a word sequence co-occurring frequently with a parenthetical expression to be a potential expanded form, our method identifies acronym definitions in a similar manner to the statistical term recognition task. Applied to the whole MEDLINE (7 811 582 abstracts), the implemented system extracted 886 755 acronym candidates and recognized 300 954 expanded forms in reasonable time. Our method outperformed base-line systems, achieving 99% precision and 82-95% recall on our evaluation corpus that roughly emulates the whole MEDLINE. AVAILABILITY AND SUPPLEMENTARY INFORMATION: The implementations and supplementary information are available at our web site: http://www.chokkan.org/research/acromine/
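The core assumption of this abstract, that a word sequence frequently preceding a parenthetical expression is a potential expanded form, can be sketched as a simple candidate counter in Python. The regular expression, the six-word window, and the name candidate_expansions are assumptions for illustration; the statistical scoring that the authors apply to choose among candidates is not described in the abstract and is not reproduced here.

```python
import re
from collections import Counter

# Up to six words followed by a parenthesised token that starts with a capital.
# Both the window size and the acronym shape are illustrative assumptions.
PAREN_RE = re.compile(r"\b((?:[\w-]+\s+){1,6})\(([A-Z][A-Za-z0-9-]{1,9})\)")


def candidate_expansions(texts):
    """Count (acronym, candidate expanded form) co-occurrences.

    Every word sequence that ends immediately before the parenthesis is
    counted as a candidate, from one word up to the whole window.
    """
    counts = Counter()
    for text in texts:
        for m in PAREN_RE.finditer(text):
            words = m.group(1).split()
            acronym = m.group(2)
            for n in range(1, len(words) + 1):
                counts[(acronym, " ".join(words[-n:]).lower())] += 1
    return counts


if __name__ == "__main__":
    corpus = [
        "we used support vector machine (SVM) classifiers",
        "a support vector machine (SVM) was trained",
    ]
    for (acronym, phrase), n in candidate_expansions(corpus).most_common():
        print(acronym, n, phrase)
```

Raw counts alone cannot separate "support vector machine" from its nested fragments such as "vector machine", which occur at least as often; that is where a term recognition score of the kind the abstract alludes to would be needed.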

10.
To have a better understanding of the mechanisms of disease development, knowledge of mutations and the genes on which the mutations occur is of crucial importance. Information on disease-related mutations can be accessed through public databases or biomedical literature sources. However, information retrieval from such resources can be problematic for two reasons: manually created databases are usually incomplete and not up to date, and reading through a vast amount of publicly available biomedical documents is very time-consuming. In this paper, we describe an automated system, MuGeX (Mutation Gene eXtractor), that automatically extracts mutation-gene pairs from Medline abstracts for a disease query. Our system is tested on a corpus that consists of 231 Medline abstracts. While recall for mutation detection alone is 85.9%, precision is 95.9%. For extraction of mutation-gene pairs, we focus on Alzheimer's disease. The recall for mutation-gene pair identification is estimated at 91.3%, and precision is estimated at 88.9%. With automatic extraction techniques, MuGeX overcomes the problems of information retrieval from public resources and reduces the time required to access relevant information, while preserving the accuracy of retrieved information.

11.
MOTIVATION: Mining the biomedical literature for references to genes and proteins always involves a tradeoff between high precision with false negatives, and high recall with false positives. Having a reliable method for assessing the relevance of literature mining results is crucial to finding ways to balance precision and recall, and for subsequently building automated systems to analyze these results. We hypothesize that abstracts and titles that discuss the same gene or protein use similar words. To validate this hypothesis, we built a dictionary- and rule-based system to mine Medline for references to genes and proteins, and used a Bayesian metric for scoring the relevance of each reference assignment. RESULTS: We analyzed the entire set of Medline records from 1966 to late 2001, and scored each gene and protein reference using a Bayesian estimated probability (EP) based on word frequency in a training set of 137837 known assignments from 30594 articles to 36197 gene and protein symbols. Two test sets of 148 and 150 randomly chosen assignments, respectively, were hand-validated and categorized as either good or bad. The distributions of EP values, when plotted on a log-scale histogram, are shown to markedly differ between good and bad assignments. Using EP values, recall was 100% at 61% precision (EP = 2 × 10^-5), 63% at 88% precision (EP = 0.008), and 10% at 100% precision (EP = 0.1). These results show that Medline entries discussing the same gene or protein have similar word usage, and that our method of assessing this similarity using EP values is valid, and enables an EP cutoff value to be determined that accurately and reproducibly balances precision and recall, allowing automated analysis of literature mining results.

12.
13.
A major problem in designing a vaccine for the dengue virus has been the high antigenic variability in the envelope protein of different virus strains. In this study, a computational approach was adopted to identify a multi-epitope vaccine candidate against dengue virus that may be suitable for large populations in the dengue-endemic regions. Different bioinformatics tools were used to identify a conserved immunological hot-spot in the dengue envelope protein. The tools also provided predictions of immunogenicity and population coverage for the proposed 'in silico' vaccine candidate against dengue. A peptide region spanning 19 amino acids was identified in the envelope protein and was found to be conserved in all four types of dengue viruses. Ten proteasomal cleavage sites were identified within the 19-mer conserved peptide sequence and a total of 8 overlapping putative cytotoxic T cell (CTL) epitopes were identified. The immunogenicity of these epitopes was evaluated in terms of their binding affinities to and dissociation half-time from respective human leukocyte antigen (HLA) molecules. The HLA allele frequencies were studied among populations in the dengue-endemic regions and compared with respect to HLA restriction patterns of the overlapping epitopes. The cumulative population coverage for these epitopes as vaccine candidates was high, ranging from approximately 80% to 92%. Structural analysis suggested that a 9-mer epitope fitted well into the peptide-binding groove of HLA-A*0201. In conclusion, the 19-mer epitope cluster was shown to have the potential for use as a vaccine candidate against dengue.

14.
The effect of neuropeptides from the corpora cardiaca of the fruit beetle Pachnoda sinuata on proline metabolism has been investigated in vivo. Conspecific injections of a crude extract from corpora cardiaca cause an increase of the concentration of proline in the haemolymph by nearly 20% and a decrease of the concentration of alanine, the precursor in proline synthesis, by about 64% when compared with a water-injected group. Purification of an extract of corpora cardiaca on reversed-phase liquid chromatography revealed two distinct UV absorbance and fluorescence peaks that cause hyperprolinaemia in the fruit beetle. The major peak is the previously identified octapeptide Mem-CC; the second peak is also a peptide, but its primary sequence remains, as yet, unidentified. Synthetic Mem-CC elicited time- and dose-dependent increases/decreases of the concentrations of proline and alanine in the haemolymph respectively. Furthermore, the receptor for this peptide seems to be specific in P. sinuata: only peptides of the large family of adipokinetic hormones with an Asp, Asn or Gly residue at position 7 could elicit biological activity, whereas those with a Trp, Ser or Val residue at this position did not have any activity.

15.
MHC class II heterodimers bind peptides 12-20 aa in length. The peptide flanking residues (PFRs) of these ligands extend from a central binding core consisting of nine amino acids. Increasing evidence suggests that the PFRs can alter the immunogenicity of T cell epitopes. We have previously noted that eluted peptide pool sequence data derived from an MHC class II Ag reflect patterns of enrichment not only in the core binding region but also in the PFRs. We sought to distinguish whether these enrichments reflect cellular processes or direct MHC-peptide interactions. Using the multiple sclerosis-associated allele HLA-DR2, pool sequence data from naturally processed ligands were compared with the patterns of enrichment obtained by binding semicombinatorial peptide libraries to empty HLA-DR2 molecules. Naturally processed ligands revealed patterns of enrichment reflecting both the binding motif of HLA-DR2 (position (P)1, aliphatic; P4, bulky hydrophobic; and P6, polar) and the nonbound flanking regions, including acidic residues at the N terminus and basic residues at the C terminus. These PFR enrichments were independent of MHC-peptide interactions. Further studies revealed similar patterns in nine other HLA alleles, with the C-terminal basic residues being as highly conserved as the previously described N-terminal prolines of MHC class II ligands. There is evidence that addition of C-terminal basic PFRs to known peptide epitopes is able to enhance both processing and T cell activation. Recognition of these allele-transcending patterns in the PFRs may prove useful in epitope identification and vaccine design.

16.
BioRAT: extracting biological information from full-length papers   Total citations: 2; self-citations: 0; citations by others: 2
MOTIVATION: Converting the vast quantity of free-format text found in journals into a concise, structured format makes the researcher's quest for information easier. Recently, several information extraction systems have been developed that attempt to simplify the retrieval and analysis of biological and medical data. Most of this work has used the abstract alone, owing to the convenience of access and the quality of data. Abstracts are generally available through central collections with easy direct access (e.g. PubMed). The full-text papers contain more information, but are distributed across many locations (e.g. publishers' web sites, journal web sites and local repositories), making access more difficult. In this paper, we present BioRAT, a new information extraction (IE) tool, specifically designed to perform biomedical IE, and which is able to locate and analyse both abstracts and full-length papers. BioRAT is a Biological Research Assistant for Text mining, and incorporates a document search ability with domain-specific IE. RESULTS: We show first, that BioRAT performs as well as existing systems, when applied to abstracts; and second, that significantly more information is available to BioRAT through the full-length papers than via the abstracts alone. Typically, less than half of the available information is extracted from the abstract, with the majority coming from the body of each paper. Overall, BioRAT recalled 20.31% of the target facts from the abstracts with 55.07% precision, and achieved 43.6% recall with 51.25% precision on full-length papers.

17.
A diuretic peptide (Acheta-DP) has been isolated from extracts of whole heads of the house cricket, Acheta domesticus. The native peptide increases both cyclic AMP production and the rate of fluid secretion by isolated Malpighian tubules in vitro to an extent comparable with those responses obtained with supra-maximal amounts of crude extracts of corpora cardiaca. The primary structure of Acheta-DP was established as a 46-residue amidated peptide: TGAQSLSIVAPLDVLRQRLMNELNRRRMRELQGSRIQQNRQLLTSI-NH2. Acheta-DP has 41% sequence identity with a diuretic peptide isolated from Manduca sexta, providing direct evidence for the presence of a family of diuretic peptides in insects.

18.
The 10th annual meeting of the Italian Society for Virology (SIV) comprised seven plenary sessions focused on: General virology and viral genetics; Virus-Host interaction and pathogenesis; Viral oncology; Emerging viruses and zoonotic, foodborne and environmental pathways of transmission; Viral immunology and vaccines; Medical virology and antiviral therapy; Viral biotechnologies and gene therapy. The meeting was attended by 143 virologists, about 60% of whom were senior scientists and the rest young scientists. A total of 88 abstracts were submitted, of which 41 were selected for oral presentation. Complete abstracts of oral and poster presentations are available at the web site www.siv-virologia.it. A summary of the plenary lectures and selected oral presentations is reported.

19.
A model of the tick-borne encephalitis virus envelope protein E is presented that contains information on the structural organization of this flavivirus protein and correlates epitopes and antigenic domains to defined sequence elements. It thus reveals details of the structural and functional characteristics of the corresponding protein domains. The localization of three antigenic domains (composed of 16 distinct epitopes) within the primary structure was performed by (i) amino-terminal sequencing of three immunoreactive fragments of protein E and (ii) sequencing the protein E-coding regions of seven antigenic variants of tick-borne encephalitis virus that had been selected in the presence of neutralizing monoclonal antibodies directed against the E protein. Further information about variable and conserved regions was obtained by a comparative computer analysis of flavivirus E protein amino acid sequences. The search for potential T-cell determinants revealed at least one sequence compatible with an amphipathic alpha-helix which is conserved in all flaviviruses sequenced so far. By combining these data with those on the location of disulfide bridges (T. Nowak and G. Wengler, Virology 156:127-137, 1987) and the structural characteristics of epitopes, such as dependency on conformation or on intact disulfide bridges or both, a model was established that goes beyond the location of epitopes in the primary sequence and reveals features of the folding of the polypeptide chain, including the generation of discontinuous protein domains.

20.
Identification of MHC binding peptides is essential for understanding the molecular mechanism of immune response. However, most of the prediction methods use motifs/profiles derived from experimental peptide binding data for specific MHC alleles, thus limiting their applicability only to those alleles for which such data is available. In this work we have developed a structure-based method which does not require experimental peptide binding data for training. Our method models MHC-peptide complexes using crystal structures of 170 MHC-peptide complexes and evaluates the binding energies using two well known residue-based statistical pair potentials, namely the Betancourt-Thirumalai (BT) and Miyazawa-Jernigan (MJ) matrices. Extensive benchmarking of prediction accuracy on a data set of 1654 epitopes from class I and class II alleles available in the SYFPEITHI database indicates that the BT pair potential can predict more than 60% of the known binders in the case of 14 MHC alleles, with AUC values for ROC curves ranging from 0.6 to 0.9. Similar benchmarking on 29,522 class I and class II MHC binding peptides with known IC50 values in the IEDB database showed AUC values higher than 0.6 for 10 class I alleles and 9 class II alleles in predictions involving classification of a peptide as a binder or non-binder. Comparison with recently available benchmarking studies indicated that the prediction accuracy of our method for many of the class I and class II MHC alleles was comparable to that of the sequence-based methods, even though it does not use any experimental data for training. It is also encouraging to note that the ranks of true binding peptides could be further improved when high-scoring peptides obtained from the pair potential were re-ranked using an all-atom force field and the MM/PBSA method.
