Similar literature
 20 similar articles found (search time: 31 ms)
1.
MOTIVATION: Since their initial development, integration and construction of databases for molecular-level data have progressed. Though biological molecules are related to each other and form a complex system, the information is stored in the vast archives of the literature or in diverse databases. There is no unified naming convention for biological objects, and biological terms may be ambiguous or polysemic. This makes the integration and interaction of databases difficult. In order to eliminate these problems, machine-readable natural language resources appear to be quite promising. We have developed a workbench for protein name abbreviation dictionary (PNAD) building. RESULTS: We have developed the PNAD Construction Support System (PNAD-CSS), which offers various convenient facilities to decrease the construction costs of a protein name abbreviation dictionary whose entries are collected from abstracts in biomedical papers. The system allows users to concentrate on higher-level interpretation by removing some troublesome tasks, e.g. management of abstracts and extraction of protein names and their abbreviations. To extract pairs of protein names and abbreviations, we have developed a hybrid system composed of the PROPER System and the PNAD System. The PNAD System can extract the pairs from parenthetical paraphrases involving protein names; the PROPER System identified these pairs with 98.95% precision, 95.56% recall and 97.58% complete precision. AVAILABILITY: The PROPER System is freely available from http://www.hgc.inc.u-tokyo.ac.jp/service/tooldoc/KeX/intro.html. The other software is also available on request. Contact the authors. CONTACT: mikio@ims.u-tokyo.ac.jp
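The idea of pulling name/abbreviation pairs out of parenthetical paraphrases can be illustrated with a minimal sketch. This is not the PNAD or PROPER System; the function name `extract_pairs`, the regular expression and the first-letter matching rule are all assumptions made for illustration only.

```python
import re


def extract_pairs(text):
    """Return (long_form, abbreviation) pairs written as 'long form (ABBR)'."""
    pairs = []
    # Assumed pattern: a parenthesised token of 2-10 letters, digits or hyphens.
    for match in re.finditer(r"\(([A-Za-z0-9-]{2,10})\)", text):
        abbr = match.group(1)
        words = text[:match.start()].split()
        # Candidate long form: as many preceding words as the abbreviation has characters.
        candidate = words[-len(abbr):]
        initials = "".join(word[0] for word in candidate).lower()
        # Accept only if the word initials spell the abbreviation (a simplification).
        if initials == abbr.lower():
            pairs.append((" ".join(candidate), abbr))
    return pairs
```

For example, `extract_pairs("We studied epidermal growth factor (EGF) signalling.")` yields the pair `("epidermal growth factor", "EGF")`. A real system would also need to handle abbreviations that are not simple initialisms.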

2.

Background  

The rapid growth of biomedical literature presents challenges for automatic text processing, and one of the challenges is abbreviation identification. The presence of unrecognized abbreviations in text hinders indexing algorithms and adversely affects information retrieval and extraction. Automatic abbreviation definition identification can help resolve these issues. However, abbreviations and their definitions identified by an automatic process are of uncertain validity. Due to the size of databases such as MEDLINE, only a small fraction of abbreviation-definition pairs can be examined manually. An automatic way to estimate the accuracy of abbreviation-definition pairs extracted from text is needed. In this paper we propose an abbreviation definition identification algorithm that employs a variety of strategies to identify the most probable abbreviation definition. In addition, our algorithm produces an accuracy estimate, pseudo-precision, for each strategy without using a human-judged gold standard. The pseudo-precisions determine the order in which the algorithm applies the strategies in seeking to identify the definition of an abbreviation.
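The ordering step described above, in which strategies are applied from the highest pseudo-precision downward, might be sketched as follows. The abstract does not specify the strategies or how pseudo-precision is computed, so the `(name, pseudo_precision, function)` tuple layout and the function `identify_definition` are hypothetical.

```python
def identify_definition(abbreviation, context, strategies):
    """Try strategies from highest to lowest pseudo-precision; return the first hit.

    `strategies` is a list of (name, pseudo_precision, function) tuples, an
    assumed layout; each function returns a definition string or None.
    """
    ordered = sorted(strategies, key=lambda s: s[1], reverse=True)
    for name, pseudo_precision, strategy in ordered:
        definition = strategy(abbreviation, context)
        if definition is not None:
            # Report which strategy fired, so its estimate can accompany the result.
            return definition, name, pseudo_precision
    return None
```

Returning the pseudo-precision alongside the definition lets a caller attach an accuracy estimate to each extracted pair without consulting a gold standard.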

3.
ADAM: another database of abbreviations in MEDLINE   Cited by: 1 (self-citations: 0, citations by others: 1)
MOTIVATION: Abbreviations are an important type of terminology in the biomedical domain. Although several groups have already created databases of biomedical abbreviations, these are either not public, or are not comprehensive, or focus exclusively on acronym-type abbreviations. We have created another abbreviation database, ADAM, which covers commonly used abbreviations and their definitions (or long-forms) within MEDLINE titles and abstracts, including both acronym and non-acronym abbreviations. RESULTS: A model for recognizing abbreviations and their long-forms from titles and abstracts of MEDLINE (2006 baseline) was employed. After grouping morphological variants, 59 405 abbreviation/long-form pairs were identified. ADAM shows high precision (97.4%) and includes most of the frequently used abbreviations contained in the Unified Medical Language System (UMLS) Lexicon and the Stanford Abbreviation Database. Conversely, one-third of abbreviations in ADAM are novel insofar as they are not included in either database. About 19% of the novel abbreviations are non-acronym-type and these cover at least seven different types of short-form/long-form pairs. AVAILABILITY: A free, public query interface to ADAM is available at http://arrowsmith.psych.uic.edu, and the entire database can be downloaded as a text file.

4.
MOTIVATION: Due to recent interest in the use of textual material to augment traditional experiments, it has become necessary to automatically cluster, classify and filter natural language information. RESULTS: The Simple and Robust Abbreviation Dictionary (SaRAD) provides an easy-to-implement, high-performance tool for the construction of a biomedical symbol dictionary. The algorithms, applied to the MEDLINE document set, result in a high-quality dictionary and toolset to disambiguate abbreviation symbols automatically.

5.
This paper proposes an ensemble of classifiers for biomedical name recognition in which three classifiers, one Support Vector Machine and two discriminative Hidden Markov Models, are combined effectively using a simple majority voting strategy. In addition, we incorporate three post-processing modules, including an abbreviation resolution module, a protein/gene name refinement module and a simple dictionary matching module, into the system to further improve the performance. Evaluation shows that our system achieves the best performance from among 10 systems with a balanced F-measure of 82.58 on the closed evaluation of the BioCreative protein/gene name recognition task (Task 1A).
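A simple majority vote over per-token labels from three classifiers can be sketched as below. The label scheme ("B", "I", "O") and the fallback to "O" when no label has a majority are assumptions of this sketch, not details taken from the paper.

```python
from collections import Counter


def majority_vote(label_sequences):
    """Combine per-token labels from several classifiers by simple majority."""
    combined = []
    # zip(*...) groups the labels that each classifier assigned to the same token.
    for labels in zip(*label_sequences):
        winner, count = Counter(labels).most_common(1)[0]
        # Keep the label only if more than half of the voters agree;
        # otherwise fall back to "O" (outside any name), an assumed tie rule.
        combined.append(winner if count * 2 > len(labels) else "O")
    return combined
```

With three voters, a label is kept whenever at least two classifiers agree on it.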

6.
MBA: a literature mining system for extracting biomedical abbreviations   Cited by: 1 (self-citations: 0, citations by others: 1)

Background  

The exploding growth of the biomedical literature presents many challenges for biological researchers. One such challenge is the use of a great many abbreviations. Extracting abbreviations and their definitions accurately is very helpful to biologists and also facilitates biomedical text analysis. Existing approaches fall into four broad categories: rule based, machine learning based, text alignment based and statistically based. State-of-the-art methods either focus exclusively on acronym-type abbreviations or cannot recognize rare abbreviations. We propose a systematic method to extract abbreviations effectively. First, a scoring method is used to classify the abbreviations into acronym-type and non-acronym-type abbreviations; their corresponding definitions are then identified by two different methods: a text alignment algorithm for the former and a statistical method for the latter.

7.
Gene/protein recognition and normalization is an important preliminary step for many biological text mining tasks. In this paper, we present a multistage gene normalization system which consists of four major subtasks: pre-processing, dictionary matching, ambiguity resolution and filtering. For the first subtask, we apply the gene mention tagger developed in our earlier work, which achieves an F-score of 88.42% on the BioCreative II GM testing set. In the dictionary matching stage, exact matching and approximate matching between gene names and the EntrezGene lexicon are combined. For the ambiguity resolution subtask, we propose a semantic similarity disambiguation method based on Munkres' Assignment Algorithm. In the last step, a filter based on Wikipedia is used to remove false positives. Experimental results show that the presented system can achieve an F-score of 90.1%, outperforming most of the state-of-the-art systems.

8.
Researchers, hindered by a lack of standard gene and protein-naming conventions, endure long, sometimes fruitless, literature searches. A system that is able to automatically assign gene names to their LocusLink ID (LLID) in previously unseen MEDLINE abstracts is described. The system is based on supervised learning and builds a model for each LLID. The training sets for all LLIDs are extracted automatically from MEDLINE references in the LocusLink and SwissProt databases. A validation was done of the performance for all 20,546 human genes with LLIDs. Of these, 7,344 produced good quality models (F-measure >0.7, nearly 60% of which were >0.9) and 13,202 did not, mainly due to insufficient numbers of known document references. A hand validation of MEDLINE documents for a set of 66 genes agreed well with the system's internal accuracy assessment. It is concluded that it is possible to achieve high quality gene disambiguation using scalable automated techniques.

9.
BACKGROUND: Hundreds of genes lacking homology to any protein of known function are sequenced every day. Genome-context methods have proved useful in providing clues about functional annotations for many proteins. However, genome-context methods detect many biological types of functional associations, and do not identify which type of functional association they have found. RESULTS: We have developed two new genome-context-based algorithms. Algorithm 1 extends our previous algorithm for identifying missing enzymes in predicted metabolic pathways (pathway holes) to use genome-context features. The new algorithm has significantly improved scope because it can now be applied to pathway reactions to which sequence similarity methods cannot be applied due to an absence of known sequences for enzymes catalyzing the reaction in other organisms. The new method identifies at least one known enzyme in the top ten hits for 58% of EcoCyc reactions that lack enzyme sequences in other organisms. Surprisingly, the addition of genome-context features does not improve the accuracy of the algorithm when sequences for the enzyme do exist in other organisms. Algorithm 2 uses genome-context methods to predict three distinct types of functional relationships between pairs of proteins: pairs that occur in the same protein complex, the same pathway, or the same operon. This algorithm performs with varying degrees of accuracy on each type of relationship, and performs best in predicting pathway and protein complex relationships.

10.
Siegel RW  Jain R  Bradbury A 《FEBS letters》2001,505(3):467-473
The site-specific recombination system of bacteriophage P1 is composed of the Cre recombinase that recognizes a 34-bp loxP site. The Cre/loxP system has been extensively used to manipulate eukaryotic genomes for functional genomic investigations. The creation of additional heterologous loxP sequences potentially expands the utility of this system, but only if these loxP sequences do not recombine with one another. We have developed a stringent in vivo assay to examine the degree of recombination between all combinations of each previously published heterologous loxP sequence. As expected, homologous loxP sequences efficiently underwent Cre-mediated recombination. However, many of the heterologous loxP pairs were able to support recombination with rates varying from 5 to 100%. Some of these loxP sequences have previously been reported to be non-compatible with one another. Our study also confirmed other heterologous loxP pairs that had previously been shown to be non-compatible, as well as defined additional combinations that could be used in designing new recombination vectors.

11.
R W Siegel  R Jain  A Bradbury 《FEBS letters》2001,499(1-2):147-153
The site-specific recombination system of bacteriophage P1 is composed of the Cre recombinase that recognizes a 34-bp loxP site. The Cre/loxP system has been extensively used to manipulate eukaryotic genomes for functional genomic investigations. The creation of additional heterologous loxP sequences potentially expands the utility of this system, but only if these loxP sequences do not recombine with one another. We have developed a stringent in vivo assay to examine the degree of recombination between all combinations of each previously published heterologous loxP sequence. As expected, homologous loxP sequences efficiently underwent Cre-mediated recombination. However, many of the heterologous loxP pairs were able to support recombination with rates varying from 5 to 100%. Some of these loxP sequences have previously been reported to be non-compatible with one another. Our study also confirmed other heterologous loxP pairs that had previously been shown to be non-compatible, as well as defined additional combinations that could be used in designing new recombination vectors.

12.
Han DS  Kim HS  Jang WH  Lee SD  Suh JK 《Nucleic acids research》2004,32(21):6312-6320
With the accumulation of protein and its related data on the Internet, many domain-based computational techniques to predict protein interactions have been developed. However, most techniques still have many limitations when used in practice. They usually suffer from low accuracy in prediction and do not provide any interaction possibility ranking method for multiple protein pairs. In this paper, we propose a probabilistic framework to predict the interaction probability of proteins and develop an interaction possibility ranking method for multiple protein pairs. Using the ranking method, one can discern the protein pairs that are more likely to interact with each other in multiple protein pairs. The validity of the prediction model was evaluated using an interacting set of protein pairs in yeast and an artificially generated non-interacting set of protein pairs. When 80% of the set of interacting protein pairs in the DIP (Database of Interacting Proteins) was used as a learning set of interacting protein pairs, high sensitivity (77%) and specificity (95%) were achieved for the test groups containing common domains with the learning set of proteins within our framework. The stability of the prediction model was also evident when tested over DIP CORE, HMS-PCI and TAP data. In the validation of the ranking method, we reveal that some correlations exist between the interacting probability and the accuracy of the prediction.

13.
Ascidian larvae develop after an invariant pattern of embryonic cleavage. Fewer than 400 cells constitute the larval central nervous system (CNS), which forms without either extensive migration or cell death. We catalogue the mitotic history of these cells in Ciona intestinalis, using confocal microscopy of whole-mount embryos at stages from neurulation until hatching. The positions of cells contributing to the CNS were reconstructed from confocal image stacks of embryonic nuclei, and maps of successive stages were used to chart the mitotic descent, thereby creating a cell lineage for each cell. The entire CNS is formed from 10th- to 14th-generation cells. Although minor differences exist in cell position, lineage is invariant in cells derived from A-line blastomeres, which form the caudal nerve cord and visceral ganglion. We document the lineage of five pairs of presumed motor neurons within the visceral ganglion: one pair arises from A/A 10.57, and four from progeny of A/A 9.30. The remaining cells of the visceral ganglion are in their 13th and 14th generations at hatching, with most mitotic activity ceasing around 85% of embryonic development. Of the approximately 330 larval cells previously reported in the CNS of Ciona, we document the lineage of 226 that derive predominantly from A-line blastomeres.

14.
15.
New phase supports for liquid-liquid partition chromatography using aqueous poly(ethyleneglycol)-dextran systems have been developed by grafting linear polyacrylamide chains onto premanufactured chromatographic supports carrying primary or secondary aliphatic hydroxyl functions on their surface. Columns prepared from such supports have a higher binding capacity for the dextran-rich stationary phase and much higher performance than columns prepared from cellulose, the previously used phase support for this system. Test separations of DNA restriction fragments, ranging from 11 base pairs to 3829 base pairs, document a high resolution for DNA fragments larger than 200 base pairs.

16.
Decomposing the life track of an animal into behavioral segments is a fundamental challenge for movement ecology. The proliferation of high‐resolution data, often collected many times per second, offers much opportunity for understanding animal movement. However, the sheer size of modern data sets means there is an increasing need for rapid, novel computational techniques to make sense of these data. Most existing methods were designed with smaller data sets in mind and can thus be prohibitively slow. Here, we introduce a method for segmenting high‐resolution movement trajectories into sites of interest and transitions between these sites. This builds on a previous algorithm of Benhamou and Riotte‐Lambert (2012), adapting it for use with high‐resolution data. The data’s resolution removed the need to interpolate between successive locations, allowing us to increase the algorithm’s speed by approximately two orders of magnitude with essentially no drop in accuracy. Furthermore, we incorporate a color scheme for testing the level of confidence in the algorithm's inference (high = green, medium = amber, low = red). We demonstrate the speed and accuracy of our algorithm with application to both simulated and real data (Alpine cattle at 1 Hz resolution). On simulated data, our algorithm correctly identified the sites of interest for 99% of “high confidence” paths. For the cattle data, the algorithm identified the two known sites of interest: a watering hole and a milking station. It also identified several other sites which can be related to hypothesized environmental drivers (e.g., food). Our algorithm gives an efficient method for turning a long, high‐resolution movement path into a schematic representation of broadscale decisions, allowing a direct link to existing point‐to‐point analysis techniques such as optimal foraging theory. It is encoded into an R package called SitesInterest, so should serve as a valuable tool for making sense of these increasingly large data streams.

17.
We present here a neural network-based method for detection of signal peptides (abbreviation used: SP) in proteins. The method is trained on sequences of known signal peptides extracted from the Swiss-Prot protein database and is able to work separately on prokaryotic and eukaryotic proteins. A query protein is dissected into overlapping short sequence fragments, and then each fragment is analyzed with respect to the probability of it being a signal peptide and containing a cleavage site. While the accuracy of the method is comparable to that of other existing prediction tools, it provides a significantly higher speed and portability. The accuracy of cleavage site prediction reaches 73% on heterogeneous source data that contains both prokaryotic and eukaryotic sequences while the accuracy of discrimination between signal peptides and non-signal peptides is above 93% for any source dataset. As a consequence, the method can be easily applied to genome-wide datasets. The software can be downloaded freely from http://rpsp.bioinfo.pl/RPSP.tar.gz.
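The dissection of a query protein into overlapping short fragments can be sketched with a sliding window. The window length, the step size and the scoring callback are placeholders here; the abstract does not give the values or the network used, so `overlapping_fragments` and `best_fragment` are hypothetical helpers.

```python
def overlapping_fragments(sequence, length, step=1):
    """Yield (start, fragment) for every window of `length` residues."""
    for start in range(0, len(sequence) - length + 1, step):
        yield start, sequence[start:start + length]


def best_fragment(sequence, length, score):
    """Return the (start, fragment) pair with the highest score.

    `score` stands in for a trained model that rates one fragment;
    it is an assumption of this sketch. The sequence must be at least
    `length` residues long, otherwise max() raises ValueError.
    """
    return max(overlapping_fragments(sequence, length), key=lambda item: score(item[1]))
```

Because each fragment is scored independently, the per-fragment work is constant in the sequence length, which is one plausible reason such a scheme scales to genome-wide datasets.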

18.
Seven recently established (2007 – 2010) systems of classification for flowering plants (Magnoliidae or angiosperms) are uniformly arranged in a linear fashion to allow ready comparisons with an eighth proposed anew here. Uniform names and orthography are used throughout with author abbreviations; however, the choice of names above the rank of family is not standardised as rules of priority do not apply. Full bibliographic information is not given but is available both in print and online. The number of names per rank for each system is given. Two previously unpublished names recognised by Takhtajan in 2009, Aextoxicales and Polypremaceae, are validated. In addition, nine suborder and two family names are established: Alseuosmiineae, Campynematineae, Crossosomatineae, Geissolomatineae, Mazaceae, Proteineae, Rehmanniaceae, Rutineae, Sarraceniineae, Simmondsiineae, and Smilacineae.

19.
MOTIVATION: Fold recognition is a key step in the protein structure discovery process, especially when traditional sequence comparison methods fail to yield convincing structural homologies. Although many methods have been developed for protein fold recognition, their accuracies remain low. This can be attributed to insufficient exploitation of fold discriminatory features. RESULTS: We have developed a new method for protein fold recognition using structural information of amino acid residues and amino acid residue pairs. Since protein fold recognition can be treated as a protein fold classification problem, we have developed a Support Vector Machine (SVM) based classifier approach that uses secondary structural state and solvent accessibility state frequencies of amino acids and amino acid pairs as feature vectors. Among the individual properties examined, secondary structural state frequencies of amino acids gave an overall accuracy of 65.2% for fold discrimination, which is better than the accuracy of any method reported so far in the literature. Combination of secondary structural state frequencies with solvent accessibility state frequencies of amino acids and amino acid pairs further improved the fold discrimination accuracy to more than 70%, which is approximately 8% higher than the best available method. In this study we have also tested, for the first time, an all-together multi-class method known as the Crammer and Singer method for protein fold classification. Our studies reveal that the three multi-class classification methods, namely one versus all, one versus one and the Crammer and Singer method, yield similar predictions. AVAILABILITY: Dataset and stand-alone program are available upon request.

20.
The precise mechanism of stop codon recognition in translation termination is still unclear. A previously published study by Ivanov and colleagues proposed a new model for stop codon recognition in which 3-nucleotide Ter-anticodons within the loops of hairpin helices 69 (domain IV) and 89 (domain V) in large ribosomal subunit (LSU) rRNA recognize stop codons to terminate protein translation in eubacteria and certain organelles. We evaluated this model by extensive bioinformatic analysis of stop codons and their putative corresponding Ter-anticodons across a much wider range of species, and found many cases for which it cannot explain the stop codon usage without requiring the involvement of one or more of the eight possible noncomplementary base pairs. Involvement of such base pairs may not be structurally or thermodynamically damaging to the model. However, if, according to the model, Ter-anticodon interaction with stop codons occurs within the ribosomal A-site, the structural stringency which that site imposes on sense codon·tRNA anticodon interaction should also extend to stop codon·Ter-anticodon interactions. Moreover, with Ter-tRNA in place of an aminoacyl-tRNA, for each of the various Ter-anticodons there is a sense codon that can interact with it preferentially by complementary and wobble base-pairing. Both these considerations considerably weaken the arguments put forth previously.
