Similar literature
Found 20 similar articles (search time: 31 ms)
1.

Background

The expressed sequence tag (EST) methodology is an attractive option for the generation of sequence data for species for which no completely sequenced genome is available. The annotation and comparative analysis of such datasets pose a formidable challenge for research groups that do not have the bioinformatics infrastructure of major genome sequencing centres. Therefore, there is a need for user-friendly tools to facilitate the annotation of non-model species EST datasets with well-defined ontologies that enable meaningful cross-species comparisons. To address this, we have developed annot8r, a platform for the rapid annotation of EST datasets with GO terms, EC numbers and KEGG pathways.

Results

annot8r automatically downloads all files relevant for the annotation process and generates a reference database that stores UniProt entries, their associated Gene Ontology (GO), Enzyme Commission (EC) and Kyoto Encyclopaedia of Genes and Genomes (KEGG) annotation and additional relevant data. For each of GO, EC and KEGG, annot8r extracts a specific sequence subset from the UniProt dataset based on the information stored in the reference database. These three subsets are then formatted for BLAST searches. The user provides the protein or nucleotide sequences to be annotated and annot8r runs BLAST searches against these three subsets. The BLAST results are parsed and the corresponding annotations retrieved from the reference database. The annotations are saved both as flat files and in a relational PostgreSQL results database to facilitate more advanced searches within the results. annot8r is integrated with the PartiGene suite of EST analysis tools.

Conclusion

annot8r is a tool that rapidly and efficiently assigns GO, EC and KEGG annotations to datasets resulting from EST sequencing projects. The benefits of an underlying relational database, flexibility and the ease of use of the program make it ideally suited for non-model species EST-sequencing projects.
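
The parse-and-look-up step described in the Results above can be sketched in Python. This is only an illustration of the general idea, not annot8r's own code: it assumes BLAST hits in the standard 12-column tabular format (query ID in column 1, subject ID in column 2, e-value in column 11), and the function name, the e-value cutoff and the in-memory `reference` dictionary standing in for the reference database are all hypothetical.

```python
from collections import defaultdict


def annotate_from_blast(blast_lines, reference, max_evalue=1e-8):
    """Collect, per query, the annotations of every hit passing the cutoff.

    blast_lines: iterable of tab-separated BLAST tabular lines (12 columns).
    reference:   dict mapping subject ID -> set of annotation terms
                 (a hypothetical stand-in for the reference database).
    max_evalue:  hypothetical cutoff; hits with a larger e-value are ignored.
    """
    annotations = defaultdict(set)
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query, subject, evalue = fields[0], fields[1], float(fields[10])
        if evalue <= max_evalue:
            annotations[query].update(reference.get(subject, ()))
    return {query: sorted(terms) for query, terms in annotations.items()}


# Illustrative input: sequence IDs, accessions and GO terms are made up.
hits = [
    "est1\tP12345\t85.0\t100\t15\t0\t1\t100\t1\t100\t1e-30\t200",
    "est1\tQ99999\t40.0\t50\t30\t0\t1\t50\t1\t50\t0.5\t30",
    "est2\tQ99999\t90.0\t80\t8\t0\t1\t80\t1\t80\t1e-20\t150",
]
reference = {
    "P12345": {"GO:0005524", "GO:0004672"},
    "Q99999": {"GO:0003677"},
}
result = annotate_from_blast(hits, reference)
```

In this sketch the weak second hit for `est1` is dropped by the cutoff, so `est1` only inherits the terms of its strong hit. Storing the result in a relational database, as the abstract describes, would be a separate step.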

2.

Background

Genes and gene products are frequently annotated with Gene Ontology concepts based on the evidence provided in genomics articles. Manually locating and curating information about a genomic entity from the biomedical literature requires vast amounts of human effort. Hence, there is clearly a need for automated computational tools to annotate the genes and gene products with Gene Ontology concepts by computationally capturing the related knowledge embedded in textual data.

Results

In this article, we present an automated genomic entity annotation system, GEANN, which extracts information about the characteristics of genes and gene products from article abstracts in PubMed and translates the discovered knowledge into Gene Ontology (GO) concepts, a widely used standardized vocabulary of genomic traits. GEANN uses textual "extraction patterns" and a semantic matching framework to locate phrases that match a pattern and to produce Gene Ontology annotations for genes and gene products. In our experiments, GEANN reached a precision of 78% at a recall of 61%. On a select set of Gene Ontology concepts, GEANN either outperforms or is comparable to two other automated annotation studies. Use of WordNet for semantic pattern matching improves the precision and recall by 24% and 15%, respectively, and the improvement due to semantic pattern matching becomes more apparent as the Gene Ontology terms become more general.

Conclusion

GEANN is useful for two distinct purposes: (i) automating the annotation of genomic entities with Gene Ontology concepts, and (ii) providing existing annotations with additional "evidence articles" from the literature. The use of textual extraction patterns that are constructed based on the existing annotations achieves high precision. The semantic pattern matching framework provides a more flexible pattern matching scheme than "exact matching", with the advantage of locating approximate pattern occurrences with similar semantics. The relatively low recall of our pattern-based approach may be enhanced either by employing a probabilistic annotation framework based on the annotation neighbourhoods in textual data, or, alternatively, the statistical enrichment threshold may be adjusted to lower values for applications that put more value on achieving higher recall.
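
For readers unfamiliar with the two measures reported above, the following minimal Python sketch shows how precision and recall are conventionally computed when predicted annotations are compared with a curated reference set. It is not GEANN's evaluation code; the function name and the gene/term pairs are hypothetical.

```python
def precision_recall(predicted, reference):
    """Return (precision, recall) for two collections of (gene, GO term) pairs.

    precision = correct predictions / all predictions
    recall    = correct predictions / all reference annotations
    """
    predicted, reference = set(predicted), set(reference)
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall


# Hypothetical annotations, for illustration only.
predicted = {("geneA", "GO:1"), ("geneA", "GO:2"), ("geneB", "GO:3"), ("geneC", "GO:4")}
reference = {("geneA", "GO:1"), ("geneB", "GO:3"), ("geneB", "GO:5")}
precision, recall = precision_recall(predicted, reference)
```

Here two of the four predictions are correct and two of the three reference annotations are recovered, which illustrates the trade-off the Conclusion refers to: loosening a threshold typically adds predictions, raising recall at the cost of precision.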

3.

Background

Since the initial publication of its complete genome sequence, Arabidopsis thaliana has become more important than ever as a model for plant research. However, the initial genome annotation was submitted by multiple centers using inconsistent methods, making the data difficult to use for many applications.

Results

Over the course of three years, TIGR has completed its effort to standardize the structural and functional annotation of the Arabidopsis genome. Using both manual and automated methods, Arabidopsis gene structures were refined and gene products were renamed and assigned to Gene Ontology categories. We present an overview of the methods employed, tools developed, and protocols followed, summarizing the contents of each data release with special emphasis on our final annotation release (version 5).

Conclusion

Over the entire period, several thousand new genes and pseudogenes were added to the annotation. Approximately one third of the originally annotated gene models were significantly refined, yielding improved gene structure annotations, and every protein-coding gene was manually inspected and classified using Gene Ontology terms.

4.
5.

Background

Gene-list annotations are critical for researchers to explore the complex relationships between genes and functionalities. Currently, the annotations of a gene list are usually summarized by a table or a barplot. As such, potentially biologically important complexities such as one gene belonging to multiple annotation categories are difficult to extract. We have devised explicit and efficient visualization methods that provide intuitive methods for interrogating the intrinsic connections between biological categories and genes.

Findings

We have constructed a data model and now present two novel methods in a Bioconductor package, "GeneAnswers", to simultaneously visualize genes, concepts (a.k.a. annotation categories), and concept-gene connections (a.k.a. annotations): the "Concept-and-Gene Network" and the "Concept-and-Gene Cross Tabulation". These methods have been tested and validated with microarray-derived gene lists.

Conclusions

These new visualization methods can effectively present annotations using Gene Ontology, Disease Ontology, or any other user-defined gene annotations that have been pre-associated with an organism's genome by human curation, automated pipelines, or a combination of the two. The gene-annotation data model and associated methods are available in the Bioconductor package called "GeneAnswers" described in this publication.
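
The idea behind a concept-and-gene cross tabulation can be sketched in a few lines of Python. This is not the GeneAnswers API (GeneAnswers is an R/Bioconductor package); the function names and the small concept-to-gene mapping are hypothetical and serve only to show how genes that belong to several annotation categories become visible in such a table.

```python
def cross_tabulate(concept_to_genes):
    """Build a gene-by-concept 0/1 table from a concept -> genes mapping.

    Returns (concepts, table), where table maps each gene to a list of 0/1
    flags aligned with the sorted list of concepts.
    """
    concepts = sorted(concept_to_genes)
    genes = sorted({gene for members in concept_to_genes.values() for gene in members})
    table = {
        gene: [int(gene in concept_to_genes[concept]) for concept in concepts]
        for gene in genes
    }
    return concepts, table


def multi_concept_genes(concept_to_genes):
    """List genes annotated to more than one concept."""
    _, table = cross_tabulate(concept_to_genes)
    return [gene for gene, row in table.items() if sum(row) > 1]


# Illustrative mapping only; not taken from the publication.
example = {
    "apoptosis": {"TP53", "BAX"},
    "cell cycle": {"TP53", "CDK1"},
}
concepts, table = cross_tabulate(example)
shared = multi_concept_genes(example)
```

A summary table or bar plot would report two genes per category here, whereas the cross tabulation also shows that one gene is counted in both.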

6.

Background  

Gene annotation is a pivotal component in computational genomics, encompassing prediction of gene function, expression analysis, and sequence scrutiny. Hence, quantitative measures of the annotation landscape constitute a pertinent bioinformatics tool. GeneCards is a gene-centric compendium of rich annotative information for over 50,000 human gene entries, building upon 68 data sources, including Gene Ontology (GO), pathways, interactions, phenotypes, publications and many more.

7.

Background

Interferons (IFNs) play a critical role in the host antiviral defense and are an essential component of current therapies against hepatitis C virus (HCV), a major cause of liver disease worldwide. To examine liver-specific responses to IFN and begin to elucidate the mechanisms of IFN inhibition of virus replication, we performed a global quantitative proteomic analysis in a human hepatoma cell line (Huh7) in the presence and absence of IFN treatment using the isotope-coded affinity tag (ICAT) method and tandem mass spectrometry (MS/MS).

Results

In three subcellular fractions from the Huh7 cells treated with IFN (400 IU/ml, 16 h) or mock-treated, we identified more than 1,364 proteins at a threshold that corresponds to less than 5% false-positive error rate. Among these, 54 were induced by IFN and 24 were repressed by more than two-fold, respectively. These IFN-regulated proteins represented multiple cellular functions including antiviral defense, immune response, cell metabolism, signal transduction, cell growth and cellular organization. To analyze this proteomics dataset, we utilized several systems-biology data-mining tools, including Gene Ontology via the GoMiner program and the Cytoscape bioinformatics platform.

Conclusions

Integration of the quantitative proteomics with global protein interaction data using the Cytoscape platform led to the identification of several novel and liver-specific key regulatory components of the IFN response, which may be important in regulating the interplay between HCV, interferon and the host response to virus infection.

8.

Background

With the increased availability of high-throughput data, such as DNA microarray data, researchers are capable of producing large amounts of biological data. During the analysis of such data, there is often a need to further explore the similarity of genes not only with respect to their expression, but also with respect to their functional annotation, which can be obtained from Gene Ontology (GO).

Results

We present the freely available software package GOSim, which calculates the functional similarity of genes based on various information-theoretic similarity concepts for GO terms. GOSim extends existing tools by providing additional, recently developed functional similarity measures for genes. These can, for example, be used to cluster genes according to their biological function. Conversely, they can also be used to evaluate the homogeneity of a given grouping of genes with respect to their GO annotation. GOSim hence provides the researcher with a flexible and powerful tool to combine knowledge stored in GO with experimental data. It can be seen as complementary to other tools that, for instance, search for significantly overrepresented GO terms within a given group of genes.

Conclusion

GOSim is implemented as a package for the statistical computing environment R and is distributed under GPL within the CRAN project.
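
As an illustration of what an information-theoretic similarity between GO terms looks like, the Python sketch below computes the Resnik and Lin measures on a toy ontology. The abstract does not say which measures GOSim implements, so these two are chosen here only as well-known examples; this is not GOSim's R interface, and the term names, the parent links and the annotation probabilities are all hypothetical.

```python
import math

# Hypothetical toy ontology: each term maps to its direct parents.
PARENTS = {
    "root": [],
    "binding": ["root"],
    "catalysis": ["root"],
    "dna_binding": ["binding"],
    "rna_binding": ["binding"],
}

# Hypothetical probability that a gene is annotated to a term or its descendants.
PROB = {
    "root": 1.0,
    "binding": 0.5,
    "catalysis": 0.5,
    "dna_binding": 0.25,
    "rna_binding": 0.125,
}


def ancestors(term):
    """Return the term together with all of its ancestors."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def information_content(term):
    """IC(t) = -log p(t): rarer terms carry more information."""
    return -math.log(PROB[term])


def resnik(term1, term2):
    """Resnik similarity: IC of the most informative common ancestor."""
    common = ancestors(term1) & ancestors(term2)
    return max(information_content(term) for term in common)


def lin(term1, term2):
    """Lin similarity: Resnik similarity normalised by the terms' own IC."""
    denominator = information_content(term1) + information_content(term2)
    if denominator == 0:
        return 0.0  # convention chosen for this sketch when both terms are the root
    return 2 * resnik(term1, term2) / denominator
```

Two terms that share only the root have similarity zero under both measures, while siblings below a specific parent score higher. Gene-level similarity, as used for clustering, would then be derived by combining the term-level scores of the genes' annotations.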

9.

Background

The current progress in sequencing projects calls for rapid, reliable and accurate function assignments of gene products. A variety of methods has been designed to annotate sequences on a large scale. However, these methods can either only be applied for specific subsets, or their results are not formalised, or they do not provide precise confidence estimates for their predictions.

Results

We have developed a large-scale annotation system that tackles all of these shortcomings. In our approach, annotation was provided through Gene Ontology terms by applying multiple Support Vector Machines (SVM) for the classification of correct and false predictions. The general performance of the system was benchmarked with a large dataset. An organism-wise cross-validation was performed to define confidence estimates, resulting in an average precision of 80% for 74% of all test sequences. The validation results show that the prediction performance was organism-independent and could reproduce the annotation of other automated systems as well as high-quality manual annotations. We applied our trained classification system to Xenopus laevis sequences, yielding functional annotation for more than half of the known expressed genome. Compared to the currently available annotation, we provided more than twice the number of contigs with good quality annotation, and additionally we assigned a confidence value to each predicted GO term.

Conclusions

We present a complete automated annotation system that overcomes many of the usual problems by applying a controlled vocabulary of Gene Ontology and an established classification method on large and well-described sequence data sets. In a case study, the function for Xenopus laevis contig sequences was predicted and the results are publicly available at ftp://genome.dkfz-heidelberg.de/pub/agd/gene_association.agd_Xenopus.

10.
11.
12.
13.
14.

Background

Pheochromocytoma and neuroblastoma are the most common neural crest-derived tumors in adults and children, respectively. We have performed a large-scale in silico analysis of altogether 1784 neuroblastoma and 531 pheochromocytoma samples to establish similarities and differences using analysis of mRNA and microRNA expression, chromosome aberrations and a novel bioinformatics analysis based on cooperative game theory.

Methods

Datasets obtained from Gene Expression Omnibus and ArrayExpress were subjected to a complex bioinformatics analysis using GeneSpring, Gene Set Enrichment Analysis, Ingenuity Pathway Analysis and our own software.

Results

Comparison of neuroblastoma and pheochromocytoma with other tumors revealed the overexpression of genes involved in development of noradrenergic cells. Among these, the significance of paired-like homeobox 2b in pheochromocytoma has not been reported previously. The analysis of similar expression patterns in neuroblastoma and pheochromocytoma revealed the same anti-apoptotic strategies in these tumors. Cancer regulation by stathmin turned out to be the major difference between pheochromocytoma and neuroblastoma. Underexpression of genes involved in neuronal cell-cell interactions was observed in unfavorable neuroblastoma. By the comparison of hypoxia- and Ras-associated pheochromocytoma, we have found that enhanced insulin-like growth factor 1 signaling may be responsible for the activation of Src homology 2 domain containing transforming protein 1, the main co-factor of RET. Hypoxia induced factor 1α and vascular endothelial growth factor signaling included the most prominent gene expression changes between von Hippel-Lindau- and multiple endocrine neoplasia type 2A-associated pheochromocytoma.

Conclusions

These pathways include previously undescribed pathomechanisms of neuroblastoma and pheochromocytoma, and the associated gene products may serve as diagnostic markers and therapeutic targets.

15.

Background

Endorepellin, the C-terminal domain V of the heparan sulfate proteoglycan perlecan, exhibits powerful and targeted anti-angiogenic activity on endothelial cells. To identify proteins involved with endorepellin anti-angiogenic action, we performed an extensive comparative proteomic analysis between vehicle- and endorepellin-treated human endothelial cells.

Results

Proteomic analysis of endorepellin influence on human umbilical vein endothelial cells identified five differentially expressed proteins, three of which (β-actin, calreticulin, and chaperonin/Hsp60) were down-regulated and two of which (vimentin and the β subunit of prolyl 4-hydroxylase also known as protein disulfide isomerase) were up-regulated in response to endorepellin treatment—and associated with a fold change (endorepellin/control) ≤ 0.75 and ≥ 2.00, and a statistically significant p-value as determined by Student's t test.

Conclusion

The proteins identified represent potential target areas involved in the anti-angiogenic mechanism of action of endorepellin. Further elucidation of these targets should ultimately prove useful in developing endorepellin as an anti-angiogenic therapy in humans.
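
The fold-change criterion quoted in the Results (endorepellin/control ≤ 0.75 for down-regulation, ≥ 2.00 for up-regulation) can be written as a small Python helper. This is a sketch under stated assumptions: the function name and the example intensities are hypothetical, and the statistical test that the abstract also requires (Student's t test) is deliberately left out.

```python
def classify_fold_change(treated, control, down=0.75, up=2.00):
    """Classify a protein by its treated/control abundance ratio.

    The default thresholds are the ones quoted in the abstract; a real
    analysis would additionally require a significant p-value.
    """
    if control <= 0:
        raise ValueError("control abundance must be positive")
    ratio = treated / control
    if ratio <= down:
        return "down-regulated"
    if ratio >= up:
        return "up-regulated"
    return "unchanged"


# Hypothetical spot intensities, for illustration only.
labels = [
    classify_fold_change(3.0, 4.0),  # ratio 0.75
    classify_fold_change(4.0, 2.0),  # ratio 2.0
    classify_fold_change(5.0, 4.0),  # ratio 1.25
]
```

Note that the two thresholds are not symmetric on a log scale, so a protein needs a larger relative change to count as up-regulated than to count as down-regulated.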

16.

Background  

Incorrectly annotated sequence data are becoming more commonplace as databases increasingly rely on automated techniques for annotation. Hence, there is an urgent need for computational methods for checking consistency of such annotations against independent sources of evidence and detecting potential annotation errors. We show how a machine learning approach designed to automatically predict a protein's Gene Ontology (GO) functional class can be employed to identify potential gene annotation errors.

17.

Background

The mitotic spindle is a complex mechanical apparatus required for accurate segregation of sister chromosomes during mitosis. We designed a genetic screen using automated microscopy to discover factors essential for mitotic progression. Using an RNA interference library of 49,164 double-stranded RNAs targeting 23,835 human genes, we performed a loss-of-function screen to look for small interfering RNAs that arrest cells in metaphase.

Results

Here we report the identification of genes that, when suppressed, result in structural defects in the mitotic spindle leading to bent, twisted, monopolar, or multipolar spindles, and cause cell cycle arrest. We further describe a novel analysis methodology for large-scale RNA interference datasets that relies on supervised clustering of these genes based on Gene Ontology, protein families, tissue expression, and protein-protein interactions.

Conclusion

This approach was used to functionally classify the identified genes into discrete mitotic processes. We confirmed the identity of a subset of these genes and examined more closely their mechanical role in spindle architecture.

18.

Background  

Cellular processes require the interaction of many proteins across several cellular compartments. Determining the collective network of such interactions is an important aspect of understanding the role and regulation of individual proteins. The Gene Ontology (GO) is used by model organism databases and other bioinformatics resources to provide functional annotation of proteins. The annotation process provides a mechanism to document the binding of one protein with another. We have constructed protein interaction networks for mouse proteins utilizing the information encoded in the GO annotations. The work reported here presents a methodology for integrating and visualizing information on protein-protein interactions.

19.

Background

Clustering is a widely used technique for analysis of gene expression data. Most clustering methods group genes based on distances, while few methods group genes according to the similarities of the distributions of their expression levels. Furthermore, as biological annotation resources have accumulated, an increasing number of genes have been annotated into functional categories. As a result, evaluating the performance of clustering methods in terms of the functional consistency of the resulting clusters is of great interest.

Results

In this paper, we propose the WDCM (Weibull Distribution-based Clustering Method), a robust approach for clustering gene expression data, in which the expression levels of each gene are treated as a random variable following its own Weibull distribution. Our WDCM is based on the concept that genes with similar expression profiles have similar distribution parameters, and thus the genes are clustered via the Weibull distribution parameters. We used the WDCM to cluster three cancer gene expression data sets (lung cancer, B-cell follicular lymphoma and bladder carcinoma) and obtained well-clustered results. We compared the performance of WDCM with k-means and Self Organizing Map (SOM) using functional annotation information given by the Gene Ontology (GO). The results showed that the functional annotation ratios of WDCM are higher than those of the other methods. We also used the external measure Adjusted Rand Index to validate the performance of the WDCM. The comparative results demonstrate that the WDCM provides better clustering performance than the k-means and SOM algorithms. The merit of the proposed WDCM is that it can be applied to cluster incomplete gene expression data without imputing the missing values. Moreover, the robustness of WDCM is also evaluated on the incomplete data sets.

Conclusions

The results demonstrate that our WDCM produces clusters with more consistent functional annotations than the other methods. The WDCM is also verified to be robust and is capable of clustering gene expression data containing a small quantity of missing values.
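
The Adjusted Rand Index mentioned in the Results compares a clustering with a reference partition while correcting for chance agreement. The Python sketch below implements the standard pair-counting formula; it is not the authors' code, the function name is hypothetical, and returning 1.0 for the degenerate case where the formula's denominator is zero is a convention chosen here.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand Index between two labelings of the same items.

    1.0 means identical partitions (up to renaming of clusters); values
    near 0 mean agreement no better than chance; negative values are possible.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("label lists must have the same length")
    total_pairs = comb(len(labels_a), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two items: nothing to disagree on

    # Pairs of items placed together in both, in A only, and in B only.
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())

    expected = together_a * together_b / total_pairs
    maximum = (together_a + together_b) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both partitions are a single cluster
    return (together_both - expected) / (maximum - expected)


# Cluster names differ but the grouping is the same.
same = adjusted_rand_index([0, 0, 1, 1], ["x", "x", "y", "y"])
# Every pair grouped in one partition is split in the other.
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

Because the index only depends on which items are grouped together, it can be used to compare methods such as k-means and SOM against the same reference labels regardless of how each method numbers its clusters.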

20.

Background

Annotations that describe the function of sequences are enormously important to researchers during laboratory investigations and when making computational inferences. However, there has been little investigation into the data quality of sequence function annotations. Here we have developed a new method of estimating the error rate of curated sequence annotations, and applied this to the Gene Ontology (GO) sequence database (GOSeqLite). This method involved artificially adding errors to sequence annotations at known rates, and used regression to model the impact on the precision of annotations based on BLAST matched sequences.

Results

We estimated the error rate of curated GO sequence annotations in the GOSeqLite database (March 2006) at between 28% and 30%. Annotations made without use of sequence similarity based methods (non-ISS) had an estimated error rate of between 13% and 18%. Annotations made with the use of sequence similarity methodology (ISS) had an estimated error rate of 49%.

Conclusion

While the overall error rate is reasonably low, it would be prudent to treat all ISS annotations with caution. Electronic annotators that use ISS annotations as the basis of predictions are likely to have higher false prediction rates, and for this reason designers of these systems should consider avoiding ISS annotations where possible. Electronic annotators that use ISS annotations to make predictions should be viewed sceptically. We recommend that curators thoroughly review ISS annotations before accepting them as valid. Overall, users of curated sequence annotations from the GO database should feel assured that they are using a comparatively high quality source of information.
