Similar literature
1.
The recognition and normalization of gene mentions in biomedical literature are crucial steps in biomedical text mining. We present a system for extracting gene names from biomedical literature and normalizing them to gene identifiers in databases. The system consists of four major components: gene name recognition, entity mapping, disambiguation and filtering. The first component is a gene name recognizer based on dictionary matching and semi-supervised learning, which utilizes the co-occurrence information of a large amount of unlabeled MEDLINE abstracts to enhance feature representation of gene named entities. In the stage of entity mapping, we combine the strategies of exact match and approximate match to establish linkage between gene names in the context and the EntrezGene database. For the gene names that map to more than one database identifier, we develop a disambiguation method based on semantic similarity derived from the Gene Ontology and MEDLINE abstracts. To remove the noise produced in the previous steps, we design a filtering method based on the confidence scores in the dictionary used for NER. The system is able to adjust the trade-off between precision and recall based on the result of filtering. It achieves an F-measure of 83% (precision: 82.5%, recall: 83.5%) on the BioCreative II Gene Normalization (GN) dataset, which is comparable to the current state-of-the-art.
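
The entity-mapping stage described above (exact match first, approximate match as a fallback) might be sketched as follows. This is only an illustration and not the authors' implementation: the toy lexicon, the made-up identifiers, the `normalize` helper and the 0.85 similarity cutoff are all assumptions, and Python's `difflib` stands in for whatever approximate matcher the system actually uses.

```python
import difflib

# Hypothetical toy lexicon: normalized gene name -> made-up identifier.
# A real system would load names and identifiers from EntrezGene.
LEXICON = {
    "tp53": "GENE:0001",
    "brca1": "GENE:0002",
    "tumor necrosis factor": "GENE:0003",
}


def normalize(name):
    """Lowercase, turn hyphens into spaces, and collapse whitespace."""
    return " ".join(name.lower().replace("-", " ").split())


def map_mention(mention, lexicon=LEXICON, cutoff=0.85):
    """Return (identifier, match_type) for a gene mention.

    Exact lookup is tried first; if it fails, the closest lexicon entry
    above the (assumed) similarity cutoff is used instead.
    """
    key = normalize(mention)
    if key in lexicon:
        return lexicon[key], "exact"
    close = difflib.get_close_matches(key, list(lexicon), n=1, cutoff=cutoff)
    if close:
        return lexicon[close[0]], "approximate"
    return None, "unmapped"


print(map_mention("TP53"))
print(map_mention("Tumour necrosis factor"))
print(map_mention("xyz"))
```

A mention that maps to several identifiers would then go to the disambiguation step, which this sketch does not cover.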

2.
Text mining for the life sciences aims to aid database curation, knowledge summarization and information retrieval through the automated processing of biomedical texts. To provide comprehensive coverage and enable full integration with existing biomolecular database records, it is crucial that text mining tools scale up to millions of articles and that their analyses can be unambiguously linked to information recorded in resources such as UniProt, KEGG, BioGRID and NCBI databases. In this study, we investigate how fully automated text mining of complex biomolecular events can be augmented with a normalization strategy that identifies biological concepts in text, mapping them to identifiers at varying levels of granularity, ranging from canonicalized symbols to unique genes and proteins and broad gene families. To this end, we have combined two state-of-the-art text mining components, previously evaluated on two community-wide challenges, and have extended and improved upon these methods by exploiting their complementary nature. Using these systems, we perform normalization and event extraction to create a large-scale resource that is publicly available, unique in semantic scope, and covers all 21.9 million PubMed abstracts and 460 thousand PubMed Central open access full-text articles. This dataset contains 40 million biomolecular events involving 76 million gene/protein mentions, linked to 122 thousand distinct genes from 5032 species across the full taxonomic tree. Detailed evaluations and analyses reveal promising results for application of this data in database and pathway curation efforts. The main software components used in this study are released under an open-source license. Further, the resulting dataset is freely accessible through a novel API, providing programmatic and customized access (http://www.evexdb.org/api/v001/).
Finally, to allow for large-scale bioinformatic analyses, the entire resource is available for bulk download from http://evexdb.org/download/, under the Creative Commons – Attribution – Share Alike (CC BY-SA) license.

3.
Gene/protein recognition and normalization is an important preliminary step for many biological text mining tasks. In this paper, we present a multistage gene normalization system which consists of four major subtasks: pre-processing, dictionary matching, ambiguity resolution and filtering. For the first subtask, we apply the gene mention tagger developed in our earlier work, which achieves an F-score of 88.42% on the BioCreative II GM testing set. In the dictionary matching stage, exact matching and approximate matching between gene names and the EntrezGene lexicon are combined. For the ambiguity resolution subtask, we propose a semantic similarity disambiguation method based on Munkres' assignment algorithm. In the last step, a filter based on Wikipedia is used to remove false positives. Experimental results show that the presented system can achieve an F-score of 90.1%, outperforming most of the state-of-the-art systems.

4.
Quantitative PCR (qPCR) is a powerful tool for measuring gene expression levels. Accurate and reproducible results are dependent on the correct choice of reference genes for data normalization. Atropa belladonna is a commercial plant species from which pharmaceutical tropane alkaloids are extracted. In this study, eight candidate reference genes, namely 18S ribosomal RNA (18S), actin (ACT), cyclophilin (CYC), elongation factor 1α (EF-1α), β-fructosidase (FRU), glyceraldehyde-3-phosphate dehydrogenase (GAPDH), phosphoglycerate kinase (PGK), and beta-tubulin (TUB), were selected and their expression stabilities studied to determine their suitability for normalizing gene expression in A. belladonna. The expression stabilities of these genes were analyzed in the root, stem, and leaf under cold, heat, NaCl, UV-B, methyl jasmonate, salicylic acid, and abscisic acid treatments using geNorm, NormFinder, and BestKeeper. The statistical algorithms indicated that PGK was a reliable gene for normalizing gene expression under most of the experimental conditions. The pairwise variation analysis showed that two genes were sufficient for proper expression normalization, except when analyzing gene expression in heat-treated roots. However, the choice of the second reference gene depended on specific conditions. Finally, the relative expression level of the PMT gene of A. belladonna was measured to validate the selection of PGK as a reliable reference gene. In summary, our results should guide the selection of appropriate reference genes for gene expression studies in different organs of A. belladonna and under different abiotic stress conditions.
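
To make the stability ranking above more concrete, here is a minimal sketch of the geNorm stability measure M as it is commonly described: for each gene, the average over all other candidate genes of the standard deviation, across samples, of the log2 expression ratio. The gene names and relative quantities are invented for illustration and are not data from the study; NormFinder and BestKeeper use different criteria and are not shown.

```python
import math
import statistics


def genorm_m(quantities):
    """Compute a geNorm-style stability value M for each gene.

    quantities: dict mapping gene name -> list of relative expression
    quantities (linear scale), one per sample, in the same sample order.
    Needs at least two genes and two samples. Lower M means more stable.
    """
    genes = list(quantities)
    m_values = {}
    for j in genes:
        pairwise_sds = []
        for k in genes:
            if k == j:
                continue
            # log2 ratio of gene j to gene k in every sample
            log_ratios = [
                math.log2(a / b) for a, b in zip(quantities[j], quantities[k])
            ]
            pairwise_sds.append(statistics.stdev(log_ratios))
        m_values[j] = sum(pairwise_sds) / len(pairwise_sds)
    return m_values


# Made-up example: geneA and geneB keep a constant ratio, geneC fluctuates.
EXAMPLE = {
    "geneA": [1.0, 2.0, 4.0],
    "geneB": [2.0, 4.0, 8.0],
    "geneC": [1.0, 4.0, 1.0],
}

for gene, m in sorted(genorm_m(EXAMPLE).items(), key=lambda item: item[1]):
    print(f"{gene}: M = {m:.3f}")
```

In this toy data geneC receives the highest M and would be the first candidate excluded in geNorm's stepwise elimination.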

8.
Despite its superiority for evaluating gene expression, real-time quantitative polymerase chain reaction (qPCR) results can be significantly biased by the use of inappropriate reference genes under different experimental conditions. Reaumuria soongorica is a dominant species of desert ecosystems in arid central Asia. Given the increasing interest in ecological engineering and potential genetic resources for arid agronomy, it is important to analyze gene function. However, systematic evaluation of stable reference genes should be performed prior to such analyses. In this study, the stabilities of 10 candidate reference genes were analyzed under 4 kinds of abiotic stresses (drought, salt, dark, and heat) within 4 accessions (HG010, HG020, XGG030, and XGG040) from 2 different habitats using 3 algorithms (geNorm, NormFinder, and BestKeeper). After validation of the ribulose-1,5-bisphosphate carboxylase/oxygenase large subunit (rbcL) expression pattern, our data suggested that histone H2A (H2A) and eukaryotic initiation factor 4A-2 (EIF4A2) were the most stable reference genes, cyclophilin (CYCL) was moderate, and elongation factor 1α (EF1α) was the worst choice. This first systematic analysis of stably expressed genes will facilitate future functional analyses and deep mining of genetic resources in R. soongorica and other species of the Reaumuria genus.

9.
Normalization of fluorescence-based quantitative real-time PCR (qPCR) data varies across quantitative gene expression studies, despite its integral role in accurate data quantification and interpretation. Identification of suitable reference genes plays an essential role in accurate qPCR normalization, as it ensures that uncorrected gene expression data reflect normalized data. The reference residual normalization (RRN) method presented here is a modified approach to conventional 2^(−ΔΔCt) qPCR normalization that increases mathematical transparency and incorporates statistical assessment of reference gene stability. RRN improves mathematical transparency through the use of sample-specific reference residuals (RRi) that are generated from the mean Ct of one or more reference gene(s) that are unaffected by treatment. To determine stability of putative reference genes, RRN uses ANOVA to assess the effect of treatment on expression and subsequent equivalence-threshold testing to establish the minimum permitted resolution. Step-by-step instructions and comprehensive examples that demonstrate the influence of reference gene stability on target gene normalization and interpretation are provided. Through mathematical transparency and statistical rigor, RRN promotes compliance with Minimum Information for Quantitative Experiments and, in so doing, provides increased confidence in qPCR data analysis and interpretation.
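
For orientation, the conventional 2^(−ΔΔCt) calculation that RRN modifies can be written out as a short sketch. This shows only the standard method, not RRN itself, and it assumes roughly 100% amplification efficiency for both the target and the reference gene; the Ct values are invented for illustration.

```python
def fold_change_ddct(ct_target_treated, ct_ref_treated,
                     ct_target_control, ct_ref_control):
    """Relative expression by the conventional 2^(-ddCt) method.

    dCt  = Ct(target) - Ct(reference), computed per condition
    ddCt = dCt(treated) - dCt(control)
    fold = 2 ** (-ddCt)
    Assumes the amplicon doubles every cycle (100% efficiency).
    """
    dct_treated = ct_target_treated - ct_ref_treated
    dct_control = ct_target_control - ct_ref_control
    ddct = dct_treated - dct_control
    return 2.0 ** (-ddct)


# Made-up Ct values: the target crosses threshold 2 cycles earlier after
# treatment while the reference gene is unchanged, giving a 4-fold increase.
print(fold_change_ddct(22.0, 18.0, 24.0, 18.0))  # 4.0
```

If the reference gene itself shifts with treatment, the shift propagates directly into ΔΔCt, which is why reference gene stability matters for the result.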

11.
Structured information provided by manual annotation of proteins with Gene Ontology concepts represents a high-quality reliable data source for the research community. However, a limited scope of proteins is annotated due to the amount of human resources required to fully annotate each individual gene product from the literature. We introduce a novel method for automatic identification of GO terms in natural language text. The method takes into consideration several features: (1) the evidence for a GO term given by the words occurring in text, (2) the proximity between the words, and (3) the specificity of the GO terms based on their information content. The method has been evaluated on the BioCreAtIvE corpus and has been compared to current state of the art methods. The precision reached 0.34 at a recall of 0.34 for the identified terms at rank 1. In our analysis, we observe that the identification of GO terms in the cellular component subbranch of GO is more accurate than for terms from the other two subbranches. This observation is explained by the average number of words forming the terminology over the different subbranches.
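
Feature (3) above relies on information content. The abstract does not say how it is computed, so the sketch below uses a common corpus-based definition as an assumption: IC(t) = −log2 p(t), where p(t) is the fraction of annotations that use term t. The term labels and counts are hypothetical.

```python
import math


def information_content(term_counts, total_annotations):
    """Return IC(t) = -log2(count(t) / total) for every term.

    Frequently used (broad) terms get a low IC; rare (specific) terms
    get a high IC. The counts are assumed to be positive.
    """
    return {
        term: -math.log2(count / total_annotations)
        for term, count in term_counts.items()
    }


# Hypothetical annotation counts out of 1024 annotations in total.
COUNTS = {"broad term": 512, "specific term": 4}
ic = information_content(COUNTS, 1024)
for term, value in ic.items():
    print(f"{term}: IC = {value:.1f} bits")
```

Under this definition a term used in half of all annotations carries 1 bit, while one used in 4 of 1024 carries 8 bits, so matching the rarer term is stronger evidence.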

13.
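In DNA microarray technology, when fluorescently labeled cDNA is synthesized from mRNA by reverse transcription, a known mass of poly(A)+ RNA is often spiked in to normalize the detection sensitivity of the microarray. Poly(A)+ RNA was synthesized by in vitro transcription, using DNA fragments from eukaryotic cDNA clones as templates. After quantification, it was added to the sample reverse transcription reaction at different mass ratios to represent different RNA copy abundances. This allowed the detection sensitivity of the DNA microarray to be quantified and showed that the fluorescence signal intensity of the hybridization spots on the microarray is positively correlated with the RNA copy number of the expressed gene. A DNA microarray containing these internal standards was used to detect changes in gene expression in yeast cells after heat shock, and the results agreed with those obtained by Northern blotting.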

14.
The selection and validation of stably expressed reference genes is a critical issue for proper RT-qPCR data normalization. In zebrafish expression studies, many commonly used reference genes are not generally applicable given their variability in expression levels under a variety of experimental conditions. Inappropriate use of these reference genes may lead to false interpretation of expression data and unreliable conclusions. In this study, we evaluated a novel normalization method in zebrafish using expressed repetitive elements (ERE) as reference targets, instead of specific protein coding mRNA targets. We assessed and compared the expression stability of a number of EREs to that of commonly used zebrafish reference genes in a diverse set of experimental conditions including a developmental time series, a set of different organs from adult fish and different treatments of zebrafish embryos including morpholino injections and administration of chemicals. Using geNorm and rank aggregation analysis we demonstrated that EREs have a higher overall expression stability compared to the commonly used reference genes. Moreover, we propose a limited set of ERE reference targets (hatn10, dna15ta1 and loopern4), which show stable expression throughout the wide range of experiments in this study, as strong candidates for inclusion as reference targets for qPCR normalization in future zebrafish expression studies. Our strategy for finding and evaluating candidate expressed repeat elements for RT-qPCR data normalization has high potential for use in other species as well.
