Similar literature
20 similar articles found
1.
The standard method for determining the genotypes of Enterocytozoon bieneusi is based on the DNA sequence of the internal transcribed spacer (ITS) region of the rRNA gene. There are 81 genotypes with 111 genotype names: 26 genotypes have been identified exclusively in humans, eight have been identified in humans and in other hosts, 27 have been identified exclusively in cattle and pigs, six have been identified exclusively in cats and dogs, and 14 have been identified in miscellaneous hosts. Because none of these genotypes has taxonomic status, and their naming is therefore not governed by the International Code of Zoological Nomenclature, some genotypes have received multiple names, each different and published separately by different authors. Because of the proliferation of genotypes with overlapping names and multiple hosts, the scientific literature has become confusing and difficult to use efficiently. To reduce confusion and provide guidance for future publications, we tabulated all names, GenBank accession numbers, and author citations, and we propose that the first published name has precedence and should become the primary name used in all subsequent publications in which genotyping is based on ITS sequencing. In those publications, the names and GenBank numbers that were submitted at later dates should also be provided by the authors as synonyms to aid readers and reviewers.

2.
When parents select similar-sounding names for their children, do they set themselves up for more speech errors in the future? Questionnaire data from 334 respondents suggest that they do. Respondents whose names shared initial or final sounds with a sibling’s reported that their parents accidentally called them by the sibling’s name more often than those without such name overlap. Having a sibling of the same gender, similar appearance, or similar age was also associated with more frequent name substitutions. Almost all other name substitutions by parents involved other family members, and over 5% of respondents reported a parent substituting the name of a pet, which suggests a strong role for social and situational cues in retrieving personal names for direct address. To the extent that retrieval cues are shared with other people or animals, other names become available and may substitute for the intended name, particularly when names sound similar.

3.
The identification of gene/protein names in natural language text is an important problem in named entity recognition. In previous work we have processed MEDLINE documents to obtain a collection of over two million names of which we estimate that perhaps two thirds are valid gene/protein names. Our problem has been how to purify this set to obtain a high quality subset of gene/protein names. Here we describe an approach which is based on the generation of certain classes of names that are characterized by common morphological features. Within each class inductive logic programming (ILP) is applied to learn the characteristics of those names that are gene/protein names. The criteria learned in this manner are then applied to our large set of names. We generated 193 classes of names and ILP led to criteria defining a select subset of 1,240,462 names. A simple false positive filter was applied to remove 8% of this set leaving 1,145,913 names. Examination of a random sample from this gene/protein name lexicon suggests it is composed of 82% (+/-3%) complete and accurate gene/protein names, 12% names related to genes/proteins (too generic, a valid name plus additional text, part of a valid name, etc.), and 6% names unrelated to genes/proteins. The lexicon is freely available at ftp.ncbi.nlm.nih.gov/pub/tanabe/Gene.Lexicon.

4.
BACKGROUND: A scientific name for an organism can be associated with almost all biological data. Name identification is an important step in many text mining tasks aiming to extract useful information from biological, biomedical and biodiversity text sources. A scientific name acts as an important metadata element to link biological information. RESULTS: We present NetiNeti (Name Extraction from Textual Information-Name Extraction for Taxonomic Indexing), a machine learning based approach for recognition of scientific names, including the discovery of new species names from text, that also handles misspellings, OCR errors and other variations in names. The system generates candidate names using rules for scientific names and applies probabilistic machine learning methods to classify names based on structural features of candidate names and features derived from their contexts. NetiNeti can also disambiguate scientific names from other names using the contextual information. We evaluated NetiNeti on legacy biodiversity texts and biomedical literature (MEDLINE). NetiNeti performs better (precision = 98.9% and recall = 70.5%) than a popular dictionary based approach (precision = 97.5% and recall = 54.3%) on a 600-page biodiversity book that was manually marked by an annotator. On a small set of PubMed Central's full text articles annotated with scientific names, the precision and recall values are 98.5% and 96.2% respectively. NetiNeti found more than 190,000 unique binomial and trinomial names in more than 1,880,000 PubMed records when used on the full MEDLINE database. NetiNeti also successfully identifies almost all of the new species names mentioned within web pages. Additionally, we present the comparison results of various machine learning algorithms on our annotated corpus. Naive Bayes and Maximum Entropy with Generalized Iterative Scaling (GIS) parameter estimation are the top two performing algorithms.
CONCLUSIONS: We present NetiNeti, a machine learning based approach for identification and discovery of scientific names. The system implementing the approach can be accessed at http://namefinding.ubio.org.

5.
The assumption that a name uniquely identifies an entity introduces two types of errors: splitting treats one entity as two or more (because of name variants); lumping treats multiple entities as if they were one (because of shared names). Here we investigate the extent to which splitting and lumping affect commonly-used measures of large-scale named-entity networks within two disambiguated bibliographic datasets: one for co-author names in biomedicine (PubMed, 2003–2007); the other for co-inventor names in U.S. patents (USPTO, 2003–2007). In both cases, we find that splitting has relatively little effect, whereas lumping has a dramatic effect on network measures. For example, in the biomedical co-authorship network, lumping (based on last name and both initials) drives several measures down: the global clustering coefficient by a factor of 4 (from 0.265 to 0.066); degree assortativity by a factor of ∼13 (from 0.763 to 0.06); and average shortest path by a factor of 1.3 (from 5.9 to 4.5). These results can be explained in part by the fact that lumping artificially creates many intransitive relationships and high-degree vertices. This effect of lumping is much less dramatic but persists with measures that give less weight to high-degree vertices, such as the mean local clustering coefficient and log-based degree assortativity. Furthermore, the log-log distribution of collaborator counts follows a much straighter line (power law) with splitting and lumping errors than without, particularly at the low and the high counts. This suggests that part of the power law often observed for collaborator counts in science and technology reflects an artifact: name ambiguity.
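The mechanism described above (lumping creates intransitive relationships and a high-degree vertex) can be illustrated with a minimal sketch in plain Python. The toy graph, the author labels and the `lump` helper are invented for illustration and are not taken from the study; the clustering measure is the standard ratio of closed to connected triples.

```python
from itertools import combinations

def global_clustering(edges):
    """Global clustering coefficient: closed triples / all connected triples."""
    adj = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops that relabeling might create
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    triples = closed = 0
    for nbrs in adj.values():
        # every pair of neighbours of a node forms one connected triple
        for a, b in combinations(sorted(nbrs), 2):
            triples += 1
            if b in adj[a]:
                closed += 1
    return closed / triples if triples else 0.0

def lump(edges, merge):
    """Relabel nodes via `merge` (distinct person -> shared name) to simulate lumping."""
    return [(merge.get(u, u), merge.get(v, v)) for u, v in edges]

# Two unrelated three-author papers; "Smith_1" and "Smith_2" are different people.
edges = [("Smith_1", "B"), ("B", "C"), ("Smith_1", "C"),
         ("Smith_2", "X"), ("X", "Y"), ("Smith_2", "Y")]

before = global_clustering(edges)
after = global_clustering(lump(edges, {"Smith_1": "Smith", "Smith_2": "Smith"}))
print(before, after)
```

In this toy case the merged "Smith" vertex links B and C to X and Y even though none of them ever collaborated, so four of its six neighbour pairs are open triples and the coefficient falls.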

6.
We continued our effort to make a comprehensive database (LISTA) for the yeast Saccharomyces cerevisiae. As in previous editions, the genetic names are consistently associated with each sequence with a known and confirmed ORF. If necessary, synonyms are given in the case of allelic duplicated sequences. Although the first publication of a sequence gives, according to our rules, the genetic name of a gene, in some instances more commonly used names are given to avoid nomenclature problems and the use of old designations which are no longer used. In these cases the old designation is given as a synonym. Thus sequences can be found either by the name or by synonyms given in LISTA. Each entry contains the genetic name, the mnemonic from the EMBL data bank, the codon bias, the reference of the publication of the sequence, the chromosomal location as far as known, and SWISSPROT and EMBL accession numbers. New entries will also contain the name from the systematic sequencing efforts. Since the release of LISTA4.1 we have updated the database continuously. To obtain more information on the included sequences, each entry has been screened against non-redundant nucleotide and protein data bank collections, resulting in LISTA-HON and LISTA-HOP. This release includes reports from full Smith and Waterman peptide-level searches against a non-redundant protein sequence database. The LISTA database can be linked to the associated data sets or to nucleotide and protein banks by the Sequence Retrieval System (SRS). The database is available by FTP and on the World Wide Web.

7.
Linking gene and protein names mentioned in the literature to unique identifiers in referent genomic databases is an essential step in accessing and integrating knowledge in the biomedical domain. However, it remains a challenging task due to lexical and terminological variation, and ambiguity of gene name mentions in documents. We present a generic and effective rule-based approach to link gene mentions in the literature to referent genomic databases, where pre-processing of both gene synonyms in the databases and gene mentions in text are first applied. The mapping method employs a cascaded approach, which combines exact, exact-like and token-based approximate matching by using flexible representations of a gene synonym dictionary and gene mentions generated during the pre-processing phase. We also consider multi-gene name mentions and permutation of components in gene names. A systematic evaluation of the suggested methods has identified steps that are beneficial for improving either precision or recall in gene name identification. The results of the experiments on the BioCreAtIvE2 data sets (identification of human gene names) demonstrated that our methods achieved highly encouraging results with F-measure of up to 81.20%.
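A cascade of exact, exact-like and token-based matching, as outlined above, might look like the following rough sketch. The abstract does not give the actual rules, so everything here is an assumption: the normalization, the Jaccard token score, the `min_jaccard` threshold, and the dictionary entries and identifiers (all hypothetical).

```python
import re

def normalize(s):
    """Exact-like key (assumed rule): lowercase and drop non-alphanumerics."""
    return re.sub(r"[^a-z0-9]", "", s.lower())

def tokens(s):
    """Split into alphabetic and numeric tokens, ignoring order and case."""
    return frozenset(re.findall(r"[a-z]+|[0-9]+", s.lower()))

def link(mention, dictionary, min_jaccard=0.6):
    """Return (identifier, stage) for a mention; stages are tried in order."""
    # Stage 1: exact string match against the synonym dictionary
    if mention in dictionary:
        return dictionary[mention], "exact"
    # Stage 2: exact-like match after normalization
    norm = {normalize(k): v for k, v in dictionary.items()}
    key = normalize(mention)
    if key in norm:
        return norm[key], "exact-like"
    # Stage 3: token-based approximate match (tolerates permuted components)
    m = tokens(mention)
    best_id, best = None, 0.0
    for syn, gid in dictionary.items():
        t = tokens(syn)
        score = len(m & t) / len(m | t) if (m | t) else 0.0
        if score > best:
            best_id, best = gid, score
    if best >= min_jaccard:
        return best_id, "token"
    return None, "unmatched"

# Hypothetical synonym dictionary; names and identifiers are made up.
SYNONYMS = {
    "ABC1": "ID:0001",
    "abc transporter 1": "ID:0001",
    "kinase beta 2": "ID:0002",
}

for mention in ["ABC1", "abc-1", "transporter 1 abc", "unrelated protein"]:
    print(mention, "->", link(mention, SYNONYMS))
```

Ordering the stages from strict to loose keeps precision high for easy cases and lets the looser stages recover recall, which is one plausible reading of why a cascade suits this task.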

8.
Pictorial stimuli are commonly used by scientists to explore central processes, including memory, attention, and language. Pictures that have been collected and put into sets for these purposes often contain visual ambiguities that lead to name disagreement amongst subjects. In the present work, we propose new norms which reflect these sources of name disagreement, and we apply this method to two sets of pictures: the Snodgrass and Vanderwart (S&V) set and the Bank of Standardized Stimuli (BOSS). Naming responses of the presented pictures were classified within response categories based on whether they were correct, incorrect, or equivocal. To characterize the naming strategy where an alternative name was being used, responses were further divided into different sub-categories that reflected various sources of name disagreement. Naming strategies were also compared across the two sets of stimuli. Results showed that the pictures of the S&V set and the BOSS were more likely to elicit alternative specific and equivocal names, respectively. It was also found that the use of incorrect names was not significantly different across stimulus sets but that errors were more likely caused by visual ambiguity in the S&V set and by a misuse of names in the BOSS. Norms for name disagreement presented in this paper are useful for subsequent research for their categorization and elucidation of name disagreement that occurs when choosing visual stimuli from one or both stimulus sets. The sources of disagreement should be examined carefully as they help to provide an explanation of errors and inconsistencies of many concepts during picture naming tasks.

9.
A set of stable simple common bird names helps non-ornithologist birders, who contribute to conservation by visiting protected areas and participating in citizen science projects. Changes in English bird names have caused discomfort in the local birding community, especially those that followed international standardisation of common bird names between 2000 and 2005. To understand the extent and nature of English bird name changes, an analysis was done of all southern African bird names through the eight editions of Roberts Birds of South/Southern Africa field guides published from 1940 to 2016. Of 813 species listed in both the first and the latest of the field guides, 453 (55.7%) had their names changed, among which 108 (13.3%) had changes in both the group name and the species epithet. The greatest single wave of changes (31.4%) occurred in the first ‘Roberts bird guide’ (the seventh field guide) in 2007, following international standardisation. Mean word and syllable counts of bird names also increased significantly in that edition. Name changes were associated with new authorships, taxonomic changes and use of geographic species epithets. There was a trend towards name stability for southern African endemic species. Further name changes should be kept to a minimum, shortening and simplifying wherever possible.

10.
In total, 70 genebank accessions comprising 50 hexaploid, 12 tetraploid and 8 diploid wheats of the Gatersleben collection were selected based on the screening of the passport data for identical cultivar names or accession numbers of the donor genebanks. Twelve potential duplicate groups consisting of three to nine accessions with identical names/numbers were selected and analysed with DNA markers (microsatellites). A bootstrap approach based on re-sampling of both microsatellite markers and alleles within marker loci was used to test for homogeneity. Although several homogeneous groups were identified, it became clear that cultivar name identity alone did not allow the determination of duplicates. A combination of SSR analysis followed by the bootstrap method and a database survey considering the botanical classification and other available data (origin, growth habit and donor) is recommended in order to determine duplicates. A procedure for the identification of duplicates and their further handling in ex situ genebanks is discussed.

11.
GeneInfoMiner is a web-based system for searching Medline abstracts using sequence ID lists such as GenBank accession numbers derived from high-throughput experiments. It will map query results to MeSH topics to facilitate the exploration of the biological significance of the sequence ID lists. GeneInfoMiner is based on a custom gene and protein name identification engine that can map gene and protein names to important molecular biology databases.

12.
Misspellings of organism scientific names create barriers to optimal storage and organization of biological data, reconciliation of data stored under different spelling variants of the same name, and appropriate responses from user queries to taxonomic data systems. This study presents an analysis of the nature of the problem from first principles, reviews some available algorithmic approaches, and describes Taxamatch, an improved name matching solution for this information domain. Taxamatch employs a custom Modified Damerau-Levenshtein Distance algorithm in tandem with a phonetic algorithm, together with a rule-based approach incorporating a suite of heuristic filters, to produce improved levels of recall, precision and execution time over the existing dynamic programming algorithms n-grams (as bigrams and trigrams) and standard edit distance. Although entirely phonetic methods are faster than Taxamatch, they are inferior in the area of recall since many real-world errors are non-phonetic in nature. Excellent performance of Taxamatch (as recall, precision and execution time) is demonstrated against a reference database of over 465,000 genus names and 1.6 million species names, as well as against a range of error types as present at both genus and species levels in three sets of sample data for species and four for genera alone. An ancillary authority matching component is included which can be used both for misspelled names and for otherwise matching names where the associated cited authorities are not identical.
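For orientation, the sketch below shows the textbook restricted Damerau-Levenshtein distance (also called optimal string alignment), which counts insertions, deletions, substitutions and adjacent transpositions. It is not Taxamatch's custom modified algorithm, whose details the abstract does not give; the function name and the example names are chosen here for illustration only.

```python
def osa_distance(a, b):
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # adjacent transposition, e.g. "rc" <-> "cr"
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]

print(osa_distance("Quercus", "Quecrus"))  # swapped letters count as one edit
print(osa_distance("Pinus", "Pinnus"))     # one inserted letter
```

Counting a swap of neighbouring letters as a single edit matters for typed names, where transpositions are a common error that plain Levenshtein distance would score as two edits.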

13.
The recognition and normalization of gene mentions in biomedical literature are crucial steps in biomedical text mining. We present a system for extracting gene names from biomedical literature and normalizing them to gene identifiers in databases. The system consists of four major components: gene name recognition, entity mapping, disambiguation and filtering. The first component is a gene name recognizer based on dictionary matching and semi-supervised learning, which utilizes the co-occurrence information of a large amount of unlabeled MEDLINE abstracts to enhance feature representation of gene named entities. In the stage of entity mapping, we combine the strategies of exact match and approximate match to establish linkage between gene names in the context and the EntrezGene database. For the gene names that map to more than one database identifier, we develop a disambiguation method based on semantic similarity derived from the Gene Ontology and MEDLINE abstracts. To remove the noise produced in the previous steps, we design a filtering method based on the confidence scores in the dictionary used for NER. The system is able to adjust the trade-off between precision and recall based on the result of filtering. It achieves an F-measure of 83% (precision: 82.5%; recall: 83.5%) on the BioCreative II Gene Normalization (GN) dataset, which is comparable to the current state-of-the-art.
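The reported figures can be checked with the usual F-measure formula. The sketch below assumes the balanced F1 score (beta = 1), which the abstract does not state explicitly; the function name is chosen here for illustration.

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (F1 when beta == 1)."""
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Precision and recall as reported in the abstract, in percent.
score = f_measure(82.5, 83.5)
print(round(score, 1))
```

With precision 82.5% and recall 83.5% the balanced score rounds to 83.0%, consistent with the 83% quoted above.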

14.
Genew: the Human Gene Nomenclature Database
Genew, the Human Gene Nomenclature Database, is the only resource that provides data for all human genes which have approved symbols. It is managed by the HUGO Gene Nomenclature Committee (HGNC) as a confidential database, containing over 16,000 records, 80% of which are represented on the Web by searchable text files. The data in Genew are highly curated by HGNC editors, and gene records can be searched on the Web by symbol or name to directly retrieve information on gene symbol, gene name, cytogenetic location, OMIM number and PubMed ID. Data are integrated with other human gene databases, e.g. GDB, LocusLink and SWISS-PROT, and approved gene symbols are carefully co-ordinated with the Mouse Genome Database (MGD). Approved gene symbols are available for querying and browsing at http://www.gene.ucl.ac.uk/cgi-bin/nomenclature/searchgenes.pl.

15.
Taxonomic names associated with digitized biocollections labels have flooded into repositories such as GBIF, iDigBio and VertNet. The names on these labels are often misspelled, out of date, or present other problems, as they were often captured only once during accessioning of specimens, or have a history of label changes without clear provenance. Before records are reliably usable in research, it is critical that these issues be addressed. However, still missing is an assessment of the scope of the problem, the effort needed to solve it, and a way to improve effectiveness of tools developed to aid the process. We present a carefully human-vetted analysis of 1000 verbatim scientific names taken at random from those published via the data aggregator VertNet, providing the first rigorously reviewed, reference validation data set. In addition to characterizing formatting problems, human vetting focused on detecting misspelling, synonymy, and the incorrect use of Darwin Core. Our results reveal a sobering view of the challenge ahead, as less than 47% of name strings were found to be currently valid. More optimistically, nearly 97% of name combinations could be resolved to a currently valid name, suggesting that computer-aided approaches may provide feasible means to improve digitized content. Finally, we associated names back to biocollections records and fit logistic models to test potential drivers of issues. A set of candidate variables (geographic region, year collected, higher-level clade, and the institutional digitally accessible data volume) and their 2-way interactions all predict the probability of records having taxon name issues, based on model selection approaches. We strongly encourage further experiments to use this reference data set as a means to compare automated or computer-aided taxon name tools for their ability to resolve and improve the existing wealth of legacy data.

16.
Errors by Rataj in lecto- and neotypification of five names in Echinodorus are corrected. These errors include the selection of lectotype specimens that were not cited in the protologue and the designation of neotypes from syntypes. Lectotypes for four of the names are designated here. The other is not typified at this time, as we have as yet been unable to examine the cited collections. In addition, a lectotype is designated for the Linnaean name Alisma cordifolia, a name for which Rataj did not choose a lectotype. A new species, Echinodorus reticulatus, is described and illustrated. Five new combinations at the subspecies level are proposed, with most of these subspecies having a distribution from Central America to Paraguay and Argentina.

17.
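This study examined the 12S rRNA gene locus, which is commonly used for animal species identification. Using partial 12S rRNA gene sequences determined from 17 mammal species commonly encountered in wildlife crime cases, together with DNA sequences of the same species and their close relatives downloaded from the NCBI database, we constructed phylogenetic trees. Based on the clustering in the trees, we assessed the correctness of the relevant gene sequences and species names in the NCBI database and flagged the accession numbers of erroneous sequences, so that they do not compromise the accurate identification of animals in later cases. Mitochondrial DNA was extracted from the 17 species (26 samples in total), and a partial fragment of the mitochondrial 12S rRNA gene was amplified with universal primers and sequenced. Using the BLAST alignment function of the NCBI database, we ranked species from high to low homology with the study species and downloaded a total of 351 12S rRNA gene sequences of these closely related species from the NCBI gene database; phylogenetic trees of each species and its close relatives were then built with MEGA7.0. The comparison showed that the Latin species names attached to three sequences, including accession number KP202279, are wrong, and that the Latin species names attached to 11 sequences, including accession number AY184436, are questionable. Some species in GenBank also have synonymous Latin names. The reliability of NCBI database records therefore needs further verification: they should serve as only one of the reference sources for identifying species in casework, and methods such as phylogenetic tree construction can be used to confirm the accuracy of the results.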

18.
An infant's own name is a unique social cue. Infants are sensitive to their own name by 4 months of age, but whether they use their names as a social cue is unknown. Electroencephalogram (EEG) was measured as infants heard their own name or strangers' names and while looking at novel objects. Event related brain potentials (ERPs) in response to names revealed that infants differentiate their own name from stranger names from the first phoneme. The amplitude of the ERPs to objects indicated that infants attended more to objects after hearing their own names compared to another name. Thus, by 5 months of age infants not only detect their name, but also use it as a social cue to guide their attention to events and objects in the world.

19.
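Correction to "New taxa of the genus Impatiens from Yunnan"
While preparing this paper last year, the authors revised the manuscript several times. Unfortunately, when we hurriedly submitted it for publication, we selected a version that had not been fully corrected. Our oversight inevitably introduced the following errors: some species names in the abstract do not agree with the corresponding names in the descriptions, and the descriptions themselves contain a number of errors. To correct these errors, we have had to republish the paper, apart from the plates. All names adopted in this paper should be regarded as validly published, correct names.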

20.

Background

The digitization of biodiversity data is leading to the widespread application of taxon names that are superfluous, ambiguous or incorrect, resulting in mismatched records and inflated species numbers. The ultimate consequences of misspelled names and bad taxonomy are erroneous scientific conclusions and faulty policy decisions. The lack of tools for correcting this ‘names problem’ has become a fundamental obstacle to integrating disparate data sources and advancing the progress of biodiversity science.

Results

The TNRS, or Taxonomic Name Resolution Service, is an online application for automated and user-supervised standardization of plant scientific names. The TNRS builds upon and extends existing open-source applications for name parsing and fuzzy matching. Names are standardized against multiple reference taxonomies, including the Missouri Botanical Garden's Tropicos database. Capable of processing thousands of names in a single operation, the TNRS parses and corrects misspelled names and authorities, standardizes variant spellings, and converts nomenclatural synonyms to accepted names. Family names can be included to increase match accuracy and resolve many types of homonyms. Partial matching of higher taxa combined with extraction of annotations, accession numbers and morphospecies allows the TNRS to standardize taxonomy across a broad range of active and legacy datasets.

Conclusions

We show how the TNRS can resolve many forms of taxonomic semantic heterogeneity, correct spelling errors and eliminate spurious names. As a result, the TNRS can aid the integration of disparate biological datasets. Although the TNRS was developed to aid in standardizing plant names, its underlying algorithms and design can be extended to all organisms and nomenclatural codes. The TNRS is accessible via a web interface at http://tnrs.iplantcollaborative.org/ and as a RESTful web service and application programming interface. Source code is available at https://github.com/iPlantCollaborativeOpenSource/TNRS/.
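The two core steps named in this abstract, fuzzy matching of misspelled names and conversion of synonyms to accepted names, can be sketched with Python's standard `difflib` module. This is only an illustration of the idea: it does not use the TNRS code or web service, the two-entry reference table is hand-written rather than drawn from Tropicos, and the `resolve` function and its `cutoff` value are assumptions.

```python
import difflib

# Toy reference taxonomy (hand-written for illustration, not from Tropicos).
ACCEPTED = {"Solanum lycopersicum", "Quercus robur"}
SYNONYMS = {"Lycopersicon esculentum": "Solanum lycopersicum"}

def resolve(name, cutoff=0.85):
    """Return the accepted name for `name`, or None if nothing is close enough."""
    known = sorted(ACCEPTED | set(SYNONYMS))
    cleaned = " ".join(name.split())  # collapse stray whitespace
    match = cleaned if cleaned in known else None
    if match is None:
        # fuzzy step: best candidate by difflib's similarity ratio
        close = difflib.get_close_matches(cleaned, known, n=1, cutoff=cutoff)
        match = close[0] if close else None
    if match is None:
        return None
    # synonym step: map a matched synonym to its accepted name
    return SYNONYMS.get(match, match)

for raw in ["Quercus robar", "Lycopersicon  esculentum", "Zzzz yyyy"]:
    print(repr(raw), "->", resolve(raw))
```

A production resolver would also parse authorities and ranks before matching, as the abstract describes; the sketch skips parsing to keep the fuzzy-match and synonym steps visible.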
