Found 20 similar articles (search time: 31 ms)
1.
ABSTRACT: BACKGROUND: A scientific name for an organism can be associated with almost all biological data. Name identification is an important step in many text mining tasks aiming to extract useful information from biological, biomedical and biodiversity text sources. A scientific name acts as an important metadata element to link biological information. RESULTS: We present NetiNeti (Name Extraction from Textual Information-Name Extraction for Taxonomic Indexing), a machine learning based approach for recognition of scientific names including the discovery of new species names from text that will also handle misspellings, OCR errors and other variations in names. The system generates candidate names using rules for scientific names and applies probabilistic machine learning methods to classify names based on structural features of candidate names and features derived from their contexts. NetiNeti can also disambiguate scientific names from other names using the contextual information. We evaluated NetiNeti on legacy biodiversity texts and biomedical literature (MEDLINE). NetiNeti performs better (precision = 98.9 % and recall = 70.5 %) compared to a popular dictionary based approach (precision = 97.5 % and recall = 54.3 %) on a 600-page biodiversity book that was manually marked by an annotator. On a small set of PubMed Central's full text articles annotated with scientific names, the precision and recall values are 98.5 % and 96.2 % respectively. NetiNeti found more than 190,000 unique binomial and trinomial names in more than 1,880,000 PubMed records when used on the full MEDLINE database. NetiNeti also successfully identifies almost all of the new species names mentioned within web pages. Additionally, we present the comparison results of various machine learning algorithms on our annotated corpus. Naive Bayes and Maximum Entropy with Generalized Iterative Scaling (GIS) parameter estimation are the top two performing algorithms. 
CONCLUSIONS: We present NetiNeti, a machine learning based approach for identification and discovery of scientific names. The system implementing the approach can be accessed at http://namefinding.ubio.org.
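The two-stage idea described in the abstract above (rule-based candidate generation, then classification on structural and contextual features) can be sketched roughly as follows. This is an illustration only, not NetiNeti's implementation: the regular expression, the suffix list, and the names `CANDIDATE_RULE`, `LATIN_SUFFIXES`, `generate_candidates`, and `features` are all assumptions made for the example, and the classifier itself (for instance Naive Bayes) is left out.

```python
import re

# Hypothetical candidate rule: a capitalized word followed by a lowercase word,
# the rough shape of a binomial name. It deliberately over-generates.
CANDIDATE_RULE = re.compile(r"\b([A-Z][a-z]+) ([a-z]{3,})\b")

# Hypothetical list of Latin-looking endings used as one structural feature.
LATIN_SUFFIXES = ("us", "a", "um", "is", "ii", "ae", "ensis")


def generate_candidates(text):
    """Return (genus, epithet, start_offset) for every rule match in text."""
    return [(m.group(1), m.group(2), m.start()) for m in CANDIDATE_RULE.finditer(text)]


def features(genus, epithet, start, text):
    """Build structural and contextual features for one candidate."""
    before = text[:start].split()
    return {
        "epithet_latin_suffix": epithet.endswith(LATIN_SUFFIXES),  # structural
        "genus_length": len(genus),                                # structural
        "previous_word": before[-1].lower() if before else "<start>",  # contextual
    }


text = "The specimen of Quercus alba was collected in May."
for genus, epithet, start in generate_candidates(text):
    print(genus, epithet, features(genus, epithet, start, text))
```

The rule also proposes "The specimen" as a candidate. In a design like this, the classifier trained on such feature dictionaries is what would reject false candidates of that kind.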
2.
3.
Readily available proxies for the time of disease onset, such as the time of the first diagnostic code, can lead to substantial risk prediction error when analyses are based on poor proxies. Due to the lack of detailed documentation and the labor intensiveness of manual annotation, it is often only feasible to ascertain, for a small subset, the current status of the disease by a follow-up time rather than the exact onset time. In this paper, we aim to develop risk prediction models for the onset time by efficiently leveraging both a small number of labels on the current status and a large number of unlabeled observations on imperfect proxies. Under a semiparametric transformation model for onset and a highly flexible measurement error model for proxy onset time, we propose a semisupervised risk prediction method that efficiently combines information from proxies and limited labels. Starting from an initial estimator based solely on the labeled subset, we perform a one-step correction with the full data, augmenting against a mean zero rank correlation score derived from the proxies. We establish the consistency and asymptotic normality of the proposed semisupervised estimator and provide a resampling procedure for interval estimation. Simulation studies demonstrate that the proposed estimator performs well in finite samples. We illustrate the proposed estimator by developing a genetic risk prediction model for obesity using data from the Mass General Brigham Healthcare Biobank.
4.
5.
Background
The majority of information in the biological literature resides in full text articles, instead of abstracts. Yet, abstracts remain the focus of many publicly available literature data mining tools. Most literature mining tools rely on pre-existing lexicons of biological names, often extracted from curated gene or protein databases. This is a limitation, because such databases have low coverage of the many name variants which are used to refer to biological entities in the literature.
6.
Kohane IS 《Nature reviews. Genetics》2011,12(6):417-428
If genomic studies are to be a clinically relevant and timely reflection of the relationship between genetics and health status--whether for common or rare variants--cost-effective ways must be found to measure both the genetic variation and the phenotypic characteristics of large populations, including the comprehensive and up-to-date record of their medical treatment. The adoption of electronic health records, used by clinicians to document clinical care, is becoming widespread and recent studies demonstrate that they can be effectively employed for genetic studies using the informational and biological 'by-products' of health-care delivery while maintaining patient privacy.
7.
We present an automated system for assigning protein, gene, or mRNA class labels to biological terms in free text. Three machine learning algorithms and several extended ways for defining contextual features for disambiguation are examined, and a fully unsupervised manner for obtaining training examples is proposed. We train and evaluate our system over a collection of 9 million words of molecular biology journal articles, obtaining accuracy rates up to 85%.
8.
9.
10.
11.
Lu Wang Jill Schnall Aeron Small Rebecca A. Hubbard Jason H. Moore Scott M. Damrauer Jinbo Chen 《Biometrics》2021,77(1):67-77
Clinically relevant information from electronic health records (EHRs) permits derivation of a rich collection of phenotypes. Unlike traditionally designed studies where scientific hypotheses are specified a priori before data collection, the true phenotype status of any given individual in EHR-based studies is not directly available. Structured and unstructured data elements need to be queried through preconstructed rules to identify case and control groups. A sufficient number of controls can usually be identified with high accuracy by making the selection criteria stringent. But more relaxed criteria are often necessary for more thorough identification of cases to ensure achievable statistical power. The resulting pool of candidate cases consists of genuine cases contaminated with noncase patients who do not satisfy the control definition. The presence of patients who are neither true cases nor controls among the identified cases is a unique challenge in EHR-based case-control studies. Ignoring case contamination would lead to biased estimation of odds ratio association parameters. We propose an estimating equation approach to bias correction, study its large sample property, and evaluate its performance through extensive simulation studies and an application to a pilot study of aortic stenosis in the Penn Medicine EHR. Our method holds the promise of facilitating more efficient EHR studies by accommodating enlarged albeit contaminated case pools.
12.
Clinical data describing the phenotypes and treatment of patients represents an underused data source that has much greater research potential than is currently realized. Mining of electronic health records (EHRs) has the potential for establishing new patient-stratification principles and for revealing unknown disease correlations. Integrating EHR data with genetic data will also give a finer understanding of genotype-phenotype relationships. However, a broad range of ethical, legal and technical reasons currently hinder the systematic deposition of these data in EHRs and their mining. Here, we consider the potential for furthering medical research and clinical care using EHR data and the challenges that must be overcome before this is a reality.
13.
Kurreeman F Liao K Chibnik L Hickey B Stahl E Gainer V Li G Bry L Mahan S Ardlie K Thomson B Szolovits P Churchill S Murphy SN Cai T Raychaudhuri S Kohane I Karlson E Plenge RM 《American journal of human genetics》2011,(1):529-69
Discovering and following up on genetic associations with complex phenotypes require large patient cohorts. This is particularly true for patient cohorts of diverse ancestry and clinically relevant subsets of disease. The ability to mine the electronic health records (EHRs) of patients followed as part of routine clinical care provides a potential opportunity to efficiently identify affected cases and unaffected controls for appropriate-sized genetic studies. Here, we demonstrate proof-of-concept that it is possible to use EHR data linked with biospecimens to establish a multi-ethnic case-control cohort for genetic research of a complex disease, rheumatoid arthritis (RA). In 1,515 EHR-derived RA cases and 1,480 controls matched for both genetic ancestry and disease-specific autoantibodies (anti-citrullinated protein antibodies [ACPA]), we demonstrate that the odds ratios and aggregate genetic risk score (GRS) of known RA risk alleles measured in individuals of European ancestry within our EHR cohort are nearly identical to those derived from a genome-wide association study (GWAS) of 5,539 autoantibody-positive RA cases and 20,169 controls. We extend this approach to other ethnic groups and identify a large overlap in the GRS among individuals of European, African, East Asian, and Hispanic ancestry. We also demonstrate that the distribution of a GRS based on 28 non-HLA risk alleles in ACPA+ cases partially overlaps with ACPA- subgroup of RA cases. Our study demonstrates that the genetic basis of rheumatoid arthritis risk is similar among cases of diverse ancestry divided into subsets based on ACPA status and emphasizes the utility of linking EHR clinical data with biospecimens for genetic studies.
14.
15.
16.
Chalup SK 《International journal of neural systems》2002,12(6):447-465
Incremental learning concepts are reviewed in machine learning and neurobiology. They are identified in evolution, neurodevelopment and learning. A timeline of qualitative axon, neuron and synapse development summarizes the review on neurodevelopment. A discussion of experimental results on data incremental learning with recurrent artificial neural networks reveals that incremental learning often seems to be more efficient or powerful than standard learning but can produce unexpected side effects. A characterization of incremental learning is proposed which takes the elaborated biological and machine learning concepts into account.
17.
ABSTRACT: The electronic health record (EHR) contains rich histories of clinical care, but has not traditionally been mined for information related to sleep habits. Here, we performed a retrospective EHR study based on a cohort of 3,652 individuals with self-reported sleep behaviors documented from visits to the sleep clinic. These individuals were obese (mean body mass index 33.6 kg/m²) and had a high prevalence of sleep apnea (60.5%); however, we found sleep behaviors largely concordant with prior prospective cohort studies. In our cohort, average wake time was 1 hour later and average sleep duration was 40 minutes longer on weekends than on weekdays (p < 10⁻¹²). Sleep duration varied considerably as a function of age and tended to be longer in females and in whites. Additionally, through phenome-wide association analyses, we found an association of long weekend sleep with depression, and an unexpectedly large number of associations of long weekday sleep with mental health and neurological disorders (q < 0.05). We then sought to replicate previously published genetic associations with morning/evening preference on a subset of our cohort with extant genotyping data (n = 555). While those findings did not replicate in our cohort, a polymorphism (rs3754214) in high linkage disequilibrium with a previously published polymorphism near TARS2 was associated with long sleep duration (p < 0.01). Collectively, our results highlight the potential of the EHR for uncovering the correlates of human sleep in real-world populations.
18.
Exploring the feasibility of using electronic health records in the surveillance of fetal alcohol syndrome
Craig Hansen Marvin Adams Deborah J. Fox Leslie A. O'Leary Jaime L. Frías Heather Freiman F. John Meaney 《Birth defects research. Part A, Clinical and molecular teratology》2014,100(2):67-78
19.
Expanding digital data sources, including social media, online news articles and blogs, provide an opportunity to understand better the context and intensity of human-nature interactions, such as wildlife exploitation. However, online searches encompassing large taxonomic groups can generate vast datasets, which can be overwhelming to filter for relevant content without the use of automated tools. The variety of machine learning models available to researchers, and the need for manually labelled training data with an even balance of labels, can make applying these tools challenging. Here, we implement and evaluate a hierarchical text classification pipeline which brings together three binary classification tasks with increasingly specific relevancy criteria. Crucially, the hierarchical approach facilitates the filtering and structuring of a large dataset, of which relevant sources make up a small proportion. Using this pipeline, we also investigate how the accuracy with which text classifiers identify relevant and irrelevant texts is influenced by the use of different models, training datasets, and the classification task. To evaluate our methods, we collected data from Facebook, Twitter, Google and Bing search engines, with the aim of identifying sources documenting the hunting and persecution of bats (Chiroptera). Overall, the ‘state-of-the-art’ transformer-based models were able to identify relevant texts with an average accuracy of 90%, with some classifiers achieving accuracy of >95%. Whilst this demonstrates that application of more advanced models can lead to improved accuracy, comparable performance was achieved by simpler models when applied to longer documents and less ambiguous classification tasks. Hence, the benefits from using more computationally expensive models are dependent on the classification context. 
We also found that stratification of training data, according to the presence of key search terms, improved classification accuracy for less frequent topics within datasets, and therefore improves the applicability of classifiers to future data collection. Overall, whilst our findings reinforce the usefulness of automated tools for facilitating online analyses in conservation and ecology, they also highlight that the effectiveness and appropriateness of such tools is determined by the nature and volume of data collected, the complexity of the classification task, and the computational resources available to researchers.