Similar Literature
20 similar records found.
1.
We have developed Textpresso, a new text-mining system for scientific literature whose capabilities go far beyond those of a simple keyword search engine. Textpresso's two major elements are a collection of the full text of scientific articles split into individual sentences, and the implementation of categories of terms for which a database of articles and individual sentences can be searched. The categories are classes of biological concepts (e.g., gene, allele, cell or cell group, phenotype, etc.) and classes that relate two objects (e.g., association, regulation, etc.) or describe one (e.g., biological process, etc.). Together they form a catalog of types of objects and concepts called an ontology. After this ontology is populated with terms, the whole corpus of articles and abstracts is marked up to identify terms of these categories. The current ontology comprises 33 categories of terms. A search engine enables the user to search for one or a combination of these tags and/or keywords within a sentence or document, and as the ontology allows word meaning to be queried, it is possible to formulate semantic queries. Full text access increases recall of biological data types from 45% to 95%. Extraction of particular biological facts, such as gene-gene interactions, can be accelerated significantly by ontologies, with Textpresso automatically performing nearly as well as expert curators to identify sentences; in searches for two uniquely named genes and an interaction term, the ontology confers a 3-fold increase of search efficiency. Textpresso currently focuses on Caenorhabditis elegans literature, with 3,800 full text articles and 16,000 abstracts. The lexicon of the ontology contains 14,500 entries, each of which includes all versions of a specific word or phrase, and it includes all categories of the Gene Ontology database. Textpresso is a useful curation tool, as well as a search engine for researchers, and can readily be extended to other organism-specific corpora of text. Textpresso can be accessed at http://www.textpresso.org or via WormBase at http://www.wormbase.org.
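
The tagging-and-search idea described above can be illustrated with a small sketch. The two-category lexicon, its terms, and the toy corpus below are invented placeholders, not Textpresso's actual 33-category ontology; the sketch only shows the general pattern of marking sentences with category tags and then querying tags together with keywords.

```python
# A minimal sketch of category-based sentence tagging and combined tag/keyword
# search, loosely in the spirit of the Textpresso abstract above. The lexicon,
# category names and corpus are toy placeholders, not the real ontology.
import re

LEXICON = {
    "gene": {"lin-12", "glp-1", "daf-16"},                       # hypothetical gene terms
    "regulation": {"regulates", "represses", "activates", "inhibits"},
}

def tag_sentence(sentence):
    """Return the set of categories whose terms appear in the sentence."""
    words = set(re.findall(r"[\w\-]+", sentence.lower()))
    return {cat for cat, terms in LEXICON.items() if words & terms}

def search(sentences, required_categories, keywords=()):
    """Return sentences that contain a term from every required category
    and every literal keyword (a Textpresso-style combined query)."""
    return [s for s in sentences
            if required_categories <= tag_sentence(s)
            and all(k.lower() in s.lower() for k in keywords)]

corpus = [
    "lin-12 activates glp-1 in the somatic gonad.",
    "The dauer phenotype was scored at 25 degrees.",
]
print(search(corpus, {"gene", "regulation"}))
```

In the real system the markup is done once over the whole corpus rather than at query time, but the query semantics are analogous: categories and/or keywords matched within a sentence or document.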

2.

Background  

The MEDLINE database contains over 12 million references to scientific literature, with about 3/4 of recent articles including an abstract of the publication. Retrieval of entries using queries with keywords is useful for human users who need to obtain small selections. However, particular analyses of the literature or database developments may need the complete ranking of all the references in the MEDLINE database as to their relevance to a topic of interest. This report describes a method that does this ranking using the differences in word content between MEDLINE entries related to a topic and the whole of MEDLINE, in a computational time appropriate for an article search query engine.
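
The abstract does not spell out the scoring function, so the sketch below should be read only as one plausible way to rank entries by "differences in word content": weight each word by the log-ratio of its frequency in a topic-related subset versus the whole collection, then score every document by its average word weight. The smoothing constant and the per-document averaging are illustrative assumptions, not the paper's actual method.

```python
# Hypothetical log-ratio weighting for topic relevance ranking; an illustration
# of the general idea only, not the method of the paper summarized above.
import math
from collections import Counter

def word_weights(topic_docs, all_docs, smoothing=1.0):
    """Weight each word by how much more frequent it is in the topic subset."""
    topic = Counter(w for d in topic_docs for w in d.lower().split())
    background = Counter(w for d in all_docs for w in d.lower().split())
    vocab = len(background)
    t_total, b_total = sum(topic.values()), sum(background.values())
    return {w: math.log(((topic[w] + smoothing) / (t_total + smoothing * vocab)) /
                        ((c + smoothing) / (b_total + smoothing * vocab)))
            for w, c in background.items()}

def rank(all_docs, weights):
    """Order documents by their average word weight, most topic-like first."""
    def score(doc):
        words = doc.lower().split()
        return sum(weights.get(w, 0.0) for w in words) / max(len(words), 1)
    return sorted(all_docs, key=score, reverse=True)
```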

3.
MPtopo: A database of membrane protein topology
The reliability of the transmembrane (TM) sequence assignments for membrane proteins (MPs) in standard sequence databases is uncertain because the vast majority are based on hydropathy plots. A database of MPs with dependable assignments is necessary for developing new computational tools for the prediction of MP structure. We have therefore created MPtopo, a database of MPs whose topologies have been verified experimentally by means of crystallography, gene fusion, and other methods. Tests using MPtopo strongly validated four existing MP topology-prediction algorithms. MPtopo is freely available over the internet and can be queried by means of an SQL-based search engine.

4.
Google Scholar (GS), a commonly used web-based academic search engine, catalogues between 2 and 100 million records of both academic and grey literature (articles not formally published by commercial academic publishers). Google Scholar collates results from across the internet and is free to use. As a result it has received considerable attention as a method for searching for literature, particularly in searches for grey literature, as required by systematic reviews. The reliance on GS as a standalone resource has been greatly debated, however, and its efficacy in grey literature searching has not yet been investigated. Using systematic review case studies from environmental science, we investigated the utility of GS in systematic reviews and in searches for grey literature. Our findings show that GS results contain moderate amounts of grey literature, with the majority found on average at page 80. We also found that, when searched for specifically, the majority of literature identified using Web of Science was also found using GS. However, our findings showed moderate/poor overlap in results when similar search strings were used in Web of Science and GS (10–67%), and that GS missed some important literature in five of six case studies. Furthermore, a general GS search failed to find any grey literature from a case study that involved manual searching of organisations’ websites. If used in systematic reviews for grey literature, we recommend that searches of article titles focus on the first 200 to 300 results. We conclude that whilst Google Scholar can find much grey literature and specific, known studies, it should not be used alone for systematic review searches. Rather, it forms a powerful addition to other traditional search methods. In addition, we advocate the use of tools to transparently document and catalogue GS search results to maintain high levels of transparency and the ability to be updated, critical to systematic reviews.

5.
A consolidated approach to the study of the mental representation of word meanings has consisted in contrasting different domains of knowledge, broadly reflecting the abstract-concrete dichotomy. More fine-grained semantic distinctions have emerged in neuropsychological and cognitive neuroscience work, reflecting semantic category specificity, but almost exclusively within the concrete domain. Theoretical advances, particularly within the area of embodied cognition, have more recently put forward the idea that distributed neural representations tied to the kinds of experience maintained with the concepts' referents might distinguish conceptual meanings with a high degree of specificity, including those within the abstract domain. Here we report the results of two psycholinguistic rating studies incorporating such theoretical advances with two main objectives: first, to provide empirical evidence of fine-grained distinctions within both the abstract and the concrete semantic domains with respect to relevant psycholinguistic dimensions; second, to develop a carefully controlled linguistic stimulus set that may be used for auditory as well as visual neuroimaging studies focusing on the parametrization of the semantic space beyond the abstract-concrete dichotomy. Ninety-six participants rated a set of 210 sentences across pre-selected concrete (mouth, hand, or leg action-related) and abstract (mental state-, emotion-, mathematics-related) categories, with respect either to different semantic domain-related scales (rating study 1), or to concreteness, familiarity, and context availability (rating study 2). Inferential statistics and correspondence analyses highlighted distinguishing semantic and psycholinguistic traits for each of the pre-selected categories, indicating that a simple abstract-concrete dichotomy is not sufficient to account for the entire semantic variability within either domain.

6.
Brazilian scientists have been contributing to the protozoology field for more than 100 years with important discoveries of new species such as Trypanosoma cruzi and Leishmania spp. In this work, we used a Brazilian thesis database (Coordination for the Improvement of Higher Education Personnel) covering the period from 1987 to 2011 to identify researchers who contributed substantially to protozoology. We selected 248 advisors by filtering to obtain researchers who supervised at least 10 theses. Based on a computational analysis of the thesis databases, we found students who were supervised by these scientists. A computational procedure was developed to determine the advisors’ scientific ancestors using the Lattes Platform. These analyses provided a list of 1,997 researchers who were inspected through Lattes CV examination and allowed the identification of the pioneers of Brazilian protozoology. Moreover, we investigated the areas in which researchers who earned PhDs in protozoology are now working. We found that 68.4% of them are still in protozoology, while 16.7% have migrated to other fields. We observed that support for protozoology by national or international agencies is clearly correlated with the increase of scientists in the field. Finally, we described the academic genealogy of Brazilian protozoology by formalising the “forest” of Brazilian scientists involved in the study of protozoa and their vectors over the past century.
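
The ancestor-tracing step lends itself to a simple graph traversal: advisor-student links form a directed graph, and a researcher's "scientific ancestors" are everyone reachable by walking advisor links upward. The three-person toy genealogy below is invented; the actual procedure operates on Lattes Platform records.

```python
# Minimal sketch of tracing scientific ancestors over advisor-student links.
# The toy genealogy here is hypothetical.
from collections import defaultdict

advisors_of = defaultdict(set)      # student -> set of advisors
advisors_of["C"] = {"B"}
advisors_of["B"] = {"A"}

def scientific_ancestors(researcher):
    """Return every advisor reachable by repeatedly following advisor links."""
    seen, stack = set(), list(advisors_of[researcher])
    while stack:
        advisor = stack.pop()
        if advisor not in seen:
            seen.add(advisor)
            stack.extend(advisors_of[advisor])
    return seen

print(scientific_ancestors("C"))    # {'A', 'B'}
```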

7.
To tell the truth, I find it difficult to work when flying, or even when sitting in an airport for an extended period of time. So, typically I take along a book to read. And when I truly cannot concentrate, for example when a flight is considerably delayed, I have even been known to resort to word puzzles. Depending on the type, they do not require much attention (that is, you can pick up right where you left off after you glance at the flight status screen for the twentieth or so time, even though you know nothing has changed), or effort (although you need to use a pen or pencil, not a keyboard), but nonetheless they can keep your mind somewhat occupied. I even rationalize doing them based on the assumption that they are sharpening my observational/pattern-finding skills. One type of word puzzle that is particularly mindless, but for that very reason I still enjoy in the above circumstances, is a word search; you are given a grid with letters and/or numbers, and a list of “hidden” terms, and you circle them within the grid, crossing them off the list as you go along. I do admit that the categories of terms used in the typical word searches can become rather mundane (breeds of dog, types of food, words that are followed by “stone,” words associated with a famous movie star, words from a particular television show, etc.). Therefore, on one of my last seminar trips I decided to generate my own word search, using the category of autophagy.

8.
We present a novel protein structure database search tool, 3D-BLAST, that is useful for analyzing novel structures and can return a ranked list of alignments. This tool has the features of BLAST (for example, robust statistical basis, and effective and reliable search capabilities) and employs a kappa-alpha (κ, α) plot derived structural alphabet and a new substitution matrix. 3D-BLAST searches more than 12,000 protein structures in 1.2 s and yields good results in zones with low sequence similarity.

9.
We discuss several aspects related to load balancing of database search jobs in a distributed computing environment, such as a Linux cluster. Load balancing is a technique for making the most of multiple computational resources, which is particularly relevant in environments in which the usage of such resources is very high. The particular case of the Sequest program is considered here, but the general methodology should apply to any similar database search program. We show how the runtimes for Sequest searches of tandem mass spectral data can be predicted from profiles of previous representative searches, and how this information can be used for better load balancing of novel data. A well-known heuristic load balancing method is shown to be applicable to this problem, and its performance is analyzed for a variety of search parameters.
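
The abstract does not name the heuristic, so the sketch below assumes a common choice for this kind of problem, the longest-processing-time-first (LPT) greedy rule: sort jobs by predicted runtime and always hand the next job to the currently least-loaded node. The job names and runtimes are made up.

```python
# LPT-style greedy assignment of predicted-runtime jobs to nodes; an assumed
# illustration of "a well-known heuristic load balancing method", not
# necessarily the one used in the paper above.
import heapq

def lpt_schedule(predicted_runtimes, n_nodes):
    """Return {job: node} assignments and the resulting per-node loads."""
    loads = [(0.0, node) for node in range(n_nodes)]    # (current load, node id)
    heapq.heapify(loads)
    assignment = {}
    for job, runtime in sorted(predicted_runtimes.items(), key=lambda kv: -kv[1]):
        load, node = heapq.heappop(loads)
        assignment[job] = node
        heapq.heappush(loads, (load + runtime, node))
    return assignment, {node: load for load, node in loads}

jobs = {"spectrum_A": 120.0, "spectrum_B": 45.0, "spectrum_C": 80.0}   # seconds, hypothetical
print(lpt_schedule(jobs, n_nodes=2))
```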

10.
We describe a new approach to identify proteins involved in disease pathogenesis. The technology, Epitope-Mediated Antigen Prediction (E-MAP), leverages the specificity of patients' immune responses to disease-relevant targets and requires no prior knowledge about the protein. E-MAP links pathologic antibodies of unknown specificity, isolated from patient sera, to their cognate antigens in the protein database. The E-MAP process first involves reconstruction of a predicted epitope using a peptide combinatorial library. We then search the protein database for closely matching amino acid sequences. Previously published attempts to identify unknown antibody targets in this manner have largely been unsuccessful for two reasons: 1) short predicted epitopes yield too many irrelevant matches from a database search and 2) the epitopes may not accurately represent the native antigen with sufficient fidelity. Using an in silico model, we demonstrate the critical threshold requirements for epitope length and epitope fidelity. We find that epitopes generally need to have at least seven amino acids, with an overall accuracy of >70% to the native protein, in order to correctly identify the protein in a nonredundant protein database search. We then confirmed these findings experimentally, using the predicted epitopes for four monoclonal antibodies. Since many predicted epitopes often fail to achieve the seven amino acid threshold, we demonstrate the efficacy of paired epitope searches. This is the first systematic analysis of the computational framework to make this approach viable, coupled with experimental validation.
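
The database-search step can be sketched as a sliding-window scan: compare the predicted epitope against every window of the same length in every protein and keep windows whose identity clears a threshold (the abstract's figures suggest a length of at least seven residues and >70% identity). The protein entry below is a toy sequence, not a real database.

```python
# Sliding-window epitope-to-database matching; the thresholds follow the
# figures quoted in the abstract above, the sequences are toy examples.
def epitope_hits(epitope, proteins, min_identity=0.70):
    """Return (protein, position, window, identity) for windows above threshold."""
    k = len(epitope)
    hits = []
    for name, seq in proteins.items():
        for i in range(len(seq) - k + 1):
            window = seq[i:i + k]
            identity = sum(a == b for a, b in zip(epitope, window)) / k
            if identity > min_identity:
                hits.append((name, i, window, round(identity, 2)))
    return hits

proteins = {"protA": "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"}   # hypothetical entry
print(epitope_hits("KQRQISF", proteins))
```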

11.
12.
MOTIVATION: Several measures of DNA sequence dissimilarity have been developed. The purpose of this paper is 3-fold. Firstly, we compare the performance of several word-based or alignment-based methods. Secondly, we give a general guideline for choosing the window size and determining the optimal word sizes for several word-based measures at different window sizes. Thirdly, we use a large-scale simulation method to simulate data from the distribution of SK-LD (symmetric Kullback-Leibler discrepancy). These simulated data can be used to estimate the degree of dissimilarity beta between any pair of DNA sequences. RESULTS: Our study shows (1) for whole sequence similarity/dissimilarity identification the window size taken should be as large as possible, but probably not >3000, as restricted by CPU time in practice, (2) for each measure the optimal word size increases with window size, (3) when the optimal word size is used, SK-LD performance is superior in both simulation and real data analysis, (4) the estimate beta-hat of beta based on SK-LD can be used to quickly filter out a large number of dissimilar sequences and speed up alignment-based database searches for similar sequences and (5) beta-hat is also applicable in local similarity comparison situations. For example, it can help in selecting oligo probes with high specificity and, therefore, has potential in probe design for microarrays. AVAILABILITY: The algorithm SK-LD, the estimate beta-hat and the simulation software are implemented in MATLAB code, and are available at http://www.stat.ncku.edu.tw/tjwu
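
For reference, the symmetric Kullback-Leibler discrepancy between the k-word (k-mer) compositions of two DNA windows can be computed as below. The pseudocount smoothing is an added assumption so that the discrepancy stays finite when a word is absent from one of the windows; it is not part of the published definition.

```python
# Symmetric Kullback-Leibler discrepancy (SK-LD) between k-word distributions
# of two DNA windows, with assumed pseudocount smoothing.
import math
from collections import Counter
from itertools import product

def kmer_distribution(seq, k, pseudocount=0.5):
    """Smoothed relative frequencies of all 4^k words of length k in seq."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    words = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[w] for w in words) + pseudocount * len(words)
    return {w: (counts[w] + pseudocount) / total for w in words}

def sk_ld(seq1, seq2, k=3):
    """KL(p||q) + KL(q||p) over the k-word distributions of the two windows."""
    p, q = kmer_distribution(seq1, k), kmer_distribution(seq2, k)
    return sum((p[w] - q[w]) * math.log(p[w] / q[w]) for w in p)

print(sk_ld("ACGTACGTACGTGGGCCC", "ACGTACGTACGTGGGCAT", k=2))
```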

13.
The cognitive analysis of adult language disorders continues to draw heavily on linguistic theory, but increasingly it reflects the influence of connectionist, spreading activation models of cognition. In the area of spoken word production, ‘localist’ connectionist models represent a natural evolution from the psycholinguistic theories of earlier decades. By contrast, the parallel distributed processing framework forces more radical rethinking of aphasic impairments. This paper exemplifies these multiple influences in contemporary cognitive aphasiology. Topics include (i) what aphasia reveals about semantic-phonological interaction in lexical access; (ii) controversies surrounding the interpretation of semantic errors; and (iii) a computational account of the relationship between naming and word repetition in aphasia. Several of these topics have been addressed using case series methods, including computational simulation of the individual, quantitative error patterns of diverse groups of patients and analysis of brain lesions that correlate with error rates and patterns. Efforts to map the lesion correlates of nonword errors in naming and repetition highlight the involvement of sensorimotor areas in the brain and suggest the need to better integrate models of word production with models of speech and action.

14.
Because the volume of information available online is growing at breakneck speed, keeping up with meaning and information communicated by the media and netizens is a new challenge both for scholars and for companies who must address public relations crises. Most current theories and tools are directed at identifying one website or one piece of online news and do not attempt to develop a rapid understanding of all websites and all news covering one topic. This paper represents an effort to integrate statistics, word segmentation, complex networks and visualization to analyze headline keywords and word relationships in online Chinese news using two samples: the 2011 Bohai Bay oil spill and the 2010 Gulf of Mexico oil spill. We gathered all the news headlines concerning the two trending events in the search results from Baidu, the most popular Chinese search engine. We used Simple Chinese Word Segmentation to segment all the headlines into words and then took words as nodes and considered adjacent relations as edges to construct word networks both using the whole sample and at the monthly level. Finally, we develop an integrated mechanism to analyze the features of word networks based on news headlines that can account for all the keywords in the news about a particular event and therefore track the evolution of news deeply and rapidly.
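
The network construction itself is compact enough to sketch: segment each headline into words, treat words as nodes, and connect each pair of adjacent words with a weighted edge. The paper used Simple Chinese Word Segmentation; the jieba segmenter is substituted here purely for illustration, and the two headlines are invented.

```python
# Headline word-network construction: words as nodes, adjacent words as edges.
# jieba stands in for the segmenter used in the paper; headlines are made up.
import jieba
import networkx as nx

def headline_network(headlines):
    g = nx.Graph()
    for headline in headlines:
        words = [w for w in jieba.lcut(headline) if w.strip()]
        for a, b in zip(words, words[1:]):              # adjacent words share an edge
            if g.has_edge(a, b):
                g[a][b]["weight"] += 1
            else:
                g.add_edge(a, b, weight=1)
    return g

g = headline_network(["渤海湾漏油事故调查", "墨西哥湾漏油事件最新进展"])
print(sorted(g.degree, key=lambda x: -x[1])[:5])        # highest-degree keywords
```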

15.
Opportunities for associationist learning of word meaning, where a word is heard or read contemporaneously with information being available on its meaning, are considered too infrequent to account for the rate of language acquisition in children. It has been suggested that additional learning could occur in a distributional mode, where information is gleaned from the distributional statistics (word co-occurrence etc.) of natural language. Such statistics are relevant to meaning because of the Distributional Principle that ‘words of similar meaning tend to occur in similar contexts’. Computational systems, such as Latent Semantic Analysis, have substantiated the viability of distributional learning of word meaning, by showing that semantic similarities between words can be accurately estimated from analysis of the distributional statistics of a natural language corpus. We consider whether appearance similarities can also be learnt in a distributional mode. As grounds for such a mode we advance the Appearance Hypothesis that ‘words with referents of similar appearance tend to occur in similar contexts’. We assess the viability of such learning by looking at the performance of a computer system that interpolates, on the basis of distributional and appearance similarity, from words that it has been explicitly taught the appearance of, in order to identify and name objects that it has not been taught about. Our experiment uses a test set of 660 simple concrete noun words. Appearance information on words is modelled using sets of images of examples of the word. Distributional similarity is computed from a standard natural language corpus. Our computational results support the viability of distributional learning of appearance.
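
The distributional half of such a system can be sketched with plain co-occurrence counting: build a co-occurrence vector for each word from a corpus window, compare vectors with cosine similarity, and label an untaught word by its nearest taught neighbour. The window size and data handling below are toy assumptions; the published system works from a large standard corpus and also folds in appearance similarity from image sets.

```python
# Co-occurrence vectors + cosine similarity as a stand-in for the distributional
# similarity component described above. All data here is illustrative.
import math
from collections import Counter, defaultdict

def cooccurrence_vectors(sentences, window=2):
    """Count, for each word, the words appearing within +/- window positions."""
    vectors = defaultdict(Counter)
    for s in sentences:
        words = s.lower().split()
        for i, w in enumerate(words):
            for c in words[max(0, i - window):i] + words[i + 1:i + 1 + window]:
                vectors[w][c] += 1
    return vectors

def cosine(u, v):
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def nearest_taught(word, taught_words, vectors):
    """Name an untaught word after its distributionally closest taught word."""
    return max(taught_words, key=lambda t: cosine(vectors[word], vectors[t]))
```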

16.
The rapid proliferation of genomic DNA sequences has created a significant need for software that can both focus on relatively small areas (such as within genes or promoters) and provide wide-zoom views of patterns across entire genomes. We present our DNA Motif Lexicon that enables users to perform genome-wide searches for motifs of interest and create customizable results pages, where results differ in the degree and extent of annotation. Searching for a particular motif is akin to a word search in a natural language; our motif lexicon speaks to this new time when we will increasingly rely upon DNA dictionaries that offer rich types of annotation. Indeed, the concept of "lexomics" introduced in this paper may be well suited to the types of meta-analyses involved in deciphering regulatory information. Currently supporting five genomes, our web-based lexicon allows users to look up motifs of interest and build user-defined result pages to include the following: (1) all base pair locations where a motif is found, with links to further search the "neighborhoods" near each of these locations, and whether each location of the motif is genic (within a gene), intergenic, or a bridging sequence (overlapping a gene boundary); (2) NCBI hot-links to the nearest upstream and downstream genes for each location; (3) statistical information about the query; (4) whether the motif is a certain type of repeat; (5) links for the reverse, complement and reverse-complement of the motif of interest; and (6) hot-links to PubMed abstracts which mention the motif of interest. A software framework facilitates the continual development of new annotation modules. The tool is located at: http://genomics.wheatoncollege.edu/cgi-bin/lexicon.exe.
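
A motif lookup of this kind reduces to exact string matching plus interval bookkeeping: find every position of the motif and of its reverse complement, then classify each hit as genic, intergenic or bridging against a table of gene coordinates. The genome string and gene interval below are toy data, not one of the five supported genomes.

```python
# Motif lookup with reverse-complement handling and genic/intergenic/bridging
# classification; the genome and gene coordinates are toy placeholders.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(motif):
    return motif.translate(COMPLEMENT)[::-1]

def find_motif(genome, motif):
    return [i for i in range(len(genome) - len(motif) + 1)
            if genome[i:i + len(motif)] == motif]

def classify(start, length, genes):
    """genes: list of (gene_start, gene_end) half-open intervals."""
    end = start + length
    for g_start, g_end in genes:
        if start >= g_start and end <= g_end:
            return "genic"
        if start < g_end and end > g_start:
            return "bridging"
    return "intergenic"

genome = "GCTATAATGCGCATTATACG"
genes = [(0, 10)]
for m in ("TATAAT", reverse_complement("TATAAT")):
    for pos in find_motif(genome, m):
        print(m, pos, classify(pos, len(m), genes))
```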

17.
18.
In the “digital native” generation, internet search engines are a commonly used source of information. However, adolescents may fail to recognize relevant search results when they are related in discipline to the search topic but lack other cues. Middle school students, high school students, and adults rated simulated search results for relevance to the search topic. The search results were designed to contrast deep discipline-based relationships with lexical similarity to the search topic. Results suggest that the ability to recognize disciplinary relatedness without supporting cues may continue to develop into high school. Despite frequent search engine usage, younger adolescents may require additional support to make the most of the information available to them.

19.

Background

There has been a marked rise in suicide by charcoal burning (CB) in some East Asian countries but little is known about its incidence in mainland China. We examined media-reported CB suicides and the availability of online information about the method in mainland China.

Methods

We extracted and analyzed data for i) the characteristics and trends of fatal and nonfatal CB suicides reported by mainland Chinese newspapers (1998–2014); ii) trends and geographic variations in online searches using keywords relating to CB suicide (2011–2014); and iii) the content of Internet search results.

Results

109 CB suicide attempts (89 fatal and 20 nonfatal) were reported by newspapers in 13 out of the 31 provinces or provincial-level-municipalities in mainland China. There were increasing trends in the incidence of reported CB suicides and in online searches using CB-related keywords. The province-level search intensities were correlated with CB suicide rates (Spearman’s correlation coefficient = 0.43 [95% confidence interval: 0.08–0.68]). Two-thirds of the web links retrieved using the search engine contained detailed information about the CB suicide method, of which 15% showed pro-suicide attitudes, and the majority (86%) did not encourage people to seek help.

Limitations

The incidence of CB suicide was based on newspaper reports and likely to be underestimated.

Conclusions

Mental health and suicide prevention professionals in mainland China should be alert to the increased use of this highly lethal suicide method. Better surveillance and intervention strategies need to be developed and implemented.

20.
Dictionary-driven prokaryotic gene finding
Gene identification, also known as gene finding or gene recognition, is among the important problems of molecular biology that have been receiving increasing attention with the advent of large scale sequencing projects. Previous strategies for solving this problem can be categorized into essentially two schools of thought: one school employs sequence composition statistics, whereas the other relies on database similarity searches. In this paper, we propose a new gene identification scheme that combines the best characteristics from each of these two schools. In particular, our method determines gene candidates among the ORFs that can be identified in a given DNA strand through the use of the Bio-Dictionary, a database of patterns that covers essentially all of the currently available sample of the natural protein sequence space. Our approach relies entirely on the use of redundant patterns as the agents on which the presence or absence of genes is predicated and does not employ any additional evidence, e.g. ribosome-binding site signals. The Bio-Dictionary Gene Finder (BDGF), the algorithm’s implementation, is a single computational engine able to handle the gene identification task across distinct archaeal and bacterial genomes. The engine exhibits performance that is characterized by simultaneously very high values of sensitivity and specificity, and a high percentage of correctly predicted start sites. Using a collection of patterns derived from an old (June 2000) release of the Swiss-Prot/TrEMBL database that contained 451,602 proteins and fragments, we demonstrate our method’s generality and capabilities through an extensive analysis of 17 complete archaeal and bacterial genomes. Examples of previously unreported genes are also shown and discussed in detail.
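
The first stage of such a scheme, enumerating candidate ORFs on a strand, can be sketched as below: scan the three reading frames and record stretches running from a start codon to the next in-frame stop codon. The Bio-Dictionary pattern scoring that BDGF then applies to each candidate is not reproduced here, and the minimum-length cutoff and toy sequence are illustrative assumptions.

```python
# Candidate ORF enumeration over the three reading frames of one strand; only
# the step that precedes BDGF's pattern-based scoring, with assumed parameters.
STOPS = {"TAA", "TAG", "TGA"}

def candidate_orfs(strand, min_codons=30):
    """Return (start, end, frame) for ATG-to-stop stretches of sufficient length."""
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(strand) - 2, 3):
            codon = strand[i:i + 3]
            if codon == "ATG" and start is None:
                start = i
            elif codon in STOPS and start is not None:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs

# Toy example (cutoff lowered so the short sequence yields a hit):
print(candidate_orfs("CCATGAAATTTGGGCCCTAAGG", min_codons=3))
```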
