Related Articles
20 related articles found (search time: 15 ms)
1.
The large variety of clustering algorithms and their variants can be daunting to researchers wishing to explore patterns within their microarray datasets. Furthermore, each clustering method has distinct biases in finding patterns within the data, and clusterings may not be reproducible across different algorithms. A consensus approach utilizing multiple algorithms can show where the various methods agree and expose robust patterns within the data. In this paper, we present a software package - Consense, written for R/Bioconductor - that utilizes such an approach to explore microarray datasets. Consense produces clustering results for each of the clustering methods and generates a report of metrics comparing the individual clusterings. A feature of Consense is the identification of genes that cluster consistently with an index gene across methods. Using simulated microarray data, we explore the sensitivity of the metrics to the biases of the different clustering algorithms. The framework is easily extensible, allowing this tool to be used with other functional genomic data types, as well as other high-throughput OMICS data generated from metabolomic and proteomic experiments. It also provides a flexible environment to benchmark new clustering algorithms. Consense is currently available as an installable R/Bioconductor package (http://www.ohsucancer.com/isrdev/consense/).
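The abstract does not show Consense's own function calls, so the R sketch below only illustrates the consensus idea it describes, using base clustering functions and the cluster package; the data, gene names and choice of k are invented.

```r
## Illustrative sketch of the consensus idea (not the Consense API): cluster the
## same genes with several algorithms and count, for a chosen index gene, how
## many methods place each other gene in the same cluster.
library(cluster)   # provides pam()

set.seed(1)
expr <- matrix(rnorm(100 * 20), nrow = 100,
               dimnames = list(paste0("gene", 1:100), paste0("sample", 1:20)))
k <- 4

km <- kmeans(expr, centers = k)$cluster
pm <- pam(expr, k = k)$clustering
hc <- cutree(hclust(dist(expr)), k = k)
labels <- cbind(kmeans = km, pam = pm, hclust = hc)

## Genes agreeing with the index gene in every method are "consensus" partners.
index <- "gene1"
agreement <- colSums(t(labels) == labels[index, ])
consensus_partners <- setdiff(names(agreement)[agreement == ncol(labels)], index)
head(consensus_partners)
```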

2.
MOTIVATION: The IntAct repository is one of the largest and most widely used databases for the curation and storage of molecular interaction data. These datasets need to be analyzed by computational methods. Software packages in the statistical environment R provide powerful tools for conducting such analyses. RESULTS: We introduce Rintact, a Bioconductor package that allows users to transform PSI-MI XML 2.5 interaction data files from IntAct into R graph objects. On these, they can use methods from R and Bioconductor for a variety of tasks: determining cohesive subgraphs, computing summary statistics, fitting mathematical models to the data or rendering graphical layouts. Rintact provides a programmatic interface to the IntAct repository and allows the use of the analytic methods provided by R and Bioconductor. AVAILABILITY: Rintact is freely available at http://bioconductor.org
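Rintact's converter functions are not named above, so the sketch below only illustrates the target representation: once interactions are available as interactor pairs, a Bioconductor graph object can be built and queried with standard methods. The interaction table is invented.

```r
## Hedged sketch: building an R graph object from an interaction table.
## A plain data.frame of interactor pairs stands in for parsed PSI-MI data.
library(graph)   # Bioconductor package providing graphNEL objects

interactions <- data.frame(a = c("P1", "P1", "P2", "P4"),
                           b = c("P2", "P3", "P3", "P5"),
                           stringsAsFactors = FALSE)

prots <- unique(c(interactions$a, interactions$b))
g <- new("graphNEL", nodes = prots, edgemode = "undirected")
g <- addEdge(interactions$a, interactions$b, g)

## Typical downstream summaries of the kind mentioned in the abstract:
degree(g)      # node degrees
connComp(g)    # cohesive subgraphs / connected components
```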

3.
MOTIVATION: Despite the growing literature devoted to finding differentially expressed genes in assays probing different tissue types, little attention has been paid to the combinatorial nature of feature selection inherent to large, high-dimensional gene expression datasets. New flexible data analysis approaches capable of searching relevant subgroups of genes and experiments are needed to understand multivariate associations of gene expression patterns with observed phenotypes. RESULTS: We present in detail a deterministic algorithm to discover patterns of multivariate gene associations in gene expression data. The patterns discovered are differential with respect to a control dataset. The algorithm is exhaustive and efficient, reporting all existent patterns that fit a given input parameter set while avoiding enumeration of the entire pattern space. The value of the pattern discovery approach is demonstrated by finding a set of genes that differentiate between two types of lymphoma. Moreover, these genes are found to behave consistently in an independent dataset produced in a different laboratory using different arrays, thus validating the genes selected using our algorithm. We show that genes deemed significant in terms of their multivariate statistics would be missed using other methods. AVAILABILITY: Our set of pattern discovery algorithms including a user interface is distributed as a package called Genes@Work. This package is freely available to non-commercial users and can be downloaded from our website (http://www.research.ibm.com/FunGen).
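As a hedged illustration of what a differential pattern means here (not the Genes@Work algorithm itself, which searches the pattern space exhaustively), the R sketch below flags genes whose expression range in the phenotype samples is separated from their control range by a margin; the data and the margin are invented.

```r
## Toy sketch of a differential pattern criterion: a gene is a candidate when
## its expression interval in the phenotype set does not overlap its interval
## in the control set, with a separation margin delta.
set.seed(2)
phen <- matrix(rnorm(50 * 10), nrow = 50, dimnames = list(paste0("g", 1:50), NULL))
ctrl <- matrix(rnorm(50 * 10), nrow = 50, dimnames = list(paste0("g", 1:50), NULL))
phen[1:5, ] <- phen[1:5, ] + 3          # a few genuinely shifted genes

delta <- 0.5                            # required separation margin
lo_p <- apply(phen, 1, min); hi_p <- apply(phen, 1, max)
lo_c <- apply(ctrl, 1, min); hi_c <- apply(ctrl, 1, max)
separated <- (lo_p > hi_c + delta) | (hi_p < lo_c - delta)
rownames(phen)[separated]               # candidate differential genes
```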

4.
SurvJamda (Survival prediction by joint analysis of microarray data) is an R package that uses joint analysis of microarray gene expression data to predict patients' survival and assess their risk. Joint analysis can be performed by merging datasets or by meta-analysis to increase the sample size and improve survival prognosis. The prognosis performance derived from the combined datasets can be assessed to determine which feature selection approach, joint analysis method and bias estimation provide the most robust prognosis for a given set of datasets. AVAILABILITY: The survJamda package is available at the Comprehensive R Archive Network, http://cran.r-project.org. CONTACT: hyasrebi@yahoo.com.
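survJamda's own functions are not listed in the abstract; the R sketch below, with invented data, only illustrates the merging strategy it describes: pool per-dataset standardized expression matrices and fit a single survival model on the combined samples.

```r
## Hedged sketch of joint analysis by merging: two expression datasets are
## scaled per dataset (to reduce batch differences), pooled, and used to fit a
## Cox proportional hazards model with the survival package.
library(survival)

set.seed(3)
make_set <- function(n) {
  x <- matrix(rnorm(n * 5), ncol = 5,
              dimnames = list(NULL, paste0("gene", 1:5)))
  list(x = scale(x),                       # per-dataset standardization
       time = rexp(n, rate = 0.1),
       status = rbinom(n, 1, 0.7))
}
d1 <- make_set(60); d2 <- make_set(80)

merged <- data.frame(rbind(d1$x, d2$x),
                     time = c(d1$time, d2$time),
                     status = c(d1$status, d2$status))
fit <- coxph(Surv(time, status) ~ gene1 + gene2 + gene3 + gene4 + gene5,
             data = merged)
summary(fit)$coefficients
```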

5.
The Comparative Toxicogenomics Database (CTD; http://ctdbase.org/) is a public resource that curates interactions between environmental chemicals and gene products, and their relationships to diseases, as a means of understanding the effects of environmental chemicals on human health. CTD provides a triad of core information in the form of chemical-gene, chemical-disease, and gene-disease interactions that are manually curated from scientific articles. To increase the efficiency, productivity, and data coverage of manual curation, we have leveraged text mining to help rank and prioritize the triaged literature. Here, we describe our text-mining process that computes and assigns each article a document relevancy score (DRS), wherein a high DRS suggests that an article is more likely to be relevant for curation at CTD. We evaluated our process by first text mining a corpus of 14,904 articles triaged for seven heavy metals (cadmium, cobalt, copper, lead, manganese, mercury, and nickel). Based upon initial analysis, a representative subset of 3,583 articles was then selected from this corpus and sent to five CTD biocurators for review. The resulting curation of these 3,583 articles was analyzed for a variety of parameters, including article relevancy, novel data content, interaction yield rate, mean average precision, and biological and toxicological interpretability. We show that for all measured parameters, the DRS is an effective indicator for scoring and improving the ranking of literature for the curation of chemical-gene-disease information at CTD. Overall, we demonstrate how fully incorporating text mining-based DRS scoring into our curation pipeline enhances manual curation by prioritizing more relevant articles, thereby increasing data content, productivity, and efficiency.
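The actual DRS computation is not specified in this summary; the R sketch below shows, with invented terms and weights, the general shape of such a relevancy score: weight curation-relevant terms and rank articles by their weighted occurrences.

```r
## Hedged sketch of a document relevancy score: weight curation-relevant terms
## and rank abstracts by weighted term counts. The real CTD DRS formula is not
## given above; the terms, weights and abstracts here are illustrative.
term_weights <- c(cadmium = 2, exposure = 1, expression = 1, polymorphism = 1.5)

score_article <- function(text, weights) {
  words <- tolower(unlist(strsplit(text, "[^A-Za-z]+")))
  sum(weights[names(weights) %in% words])
}

abstracts <- c(a1 = "Cadmium exposure alters gene expression in renal cells",
               a2 = "A survey of museum specimen storage practices")
drs <- sapply(abstracts, score_article, weights = term_weights)
sort(drs, decreasing = TRUE)   # higher DRS = higher curation priority
```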

6.
ACUA: a software tool for automated codon usage analysis
Currently available codon usage analysis tools lack an intuitive graphical user interface and are limited to inbuilt calculations. ACUA (Automated Codon Usage Analysis) has been developed to perform high-throughput sequence analysis aiding statistical profiling of codon usage. The results of ACUA are presented in a spreadsheet with all prerequisite codon usage data required for statistical analysis, displayed in a graphical interface. The package is also capable of on-click sequence retrieval from the results interface, and this feature is unique to ACUA. AVAILABILITY: The package is available for non-commercial purposes and can be downloaded from: http://www.bioinsilico.com/acua.
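ACUA itself is a graphical tool, but the bookkeeping it automates can be sketched in a few lines of R: split an open reading frame into codons and tabulate their usage. The example ORF is invented.

```r
## Minimal sketch of the core calculation behind codon usage profiling
## (this illustrates only the underlying counting, not ACUA's interface).
codon_counts <- function(orf) {
  orf <- toupper(orf)
  codons <- substring(orf, seq(1, nchar(orf) - 2, by = 3),
                           seq(3, nchar(orf),     by = 3))
  table(codons)
}

orf <- "ATGGCTGCAAAAGCTAAATAA"
counts <- codon_counts(orf)
counts
round(counts / sum(counts), 3)   # relative frequency of each codon
```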

7.
MOTIVATION: A Robot Scientist is a physically implemented robotic system that can automatically carry out cycles of scientific experimentation. We are commissioning a new Robot Scientist designed to investigate gene function in S. cerevisiae. This Robot Scientist will be capable of initiating >1,000 experiments, and making >200,000 observations, a day. Robot Scientists provide a unique test bed for the development of methodologies for the curation and annotation of scientific experiments: because the experiments are conceived and executed automatically by computer, it is possible to completely capture and digitally curate all aspects of the scientific process. This new ability brings with it significant technical challenges. To meet these challenges we apply an ontology-driven approach to the representation of all the Robot Scientist's data and metadata. RESULTS: We demonstrate the utility of developing an ontology for our new Robot Scientist. This ontology is based on a general ontology of experiments. The ontology aids the curation and annotation of the experimental data and metadata, as well as the equipment metadata, and supports the design of database systems to hold the data and metadata. AVAILABILITY: EXPO in XML and OWL formats is at: http://sourceforge.net/projects/expo/. All materials about the Robot Scientist project are available at: http://www.aber.ac.uk/compsci/Research/bio/robotsci/.

8.
JIGSAW: integration of multiple sources of evidence for gene prediction
MOTIVATION: Computational gene finding systems play an important role in finding new human genes, although no systems are yet accurate enough to predict all or even most protein-coding regions perfectly. Ab initio programs can be augmented by evidence such as expression data or protein sequence homology, which improves their performance. The amount of such evidence continues to grow, but computational methods continue to have difficulty predicting genes when the evidence is conflicting or incomplete. Genome annotation pipelines collect a variety of types of evidence about gene structure and synthesize the results, which can then be refined further through manual, expert curation of gene models. RESULTS: JIGSAW is a new gene finding system designed to automate the process of predicting gene structure from multiple sources of evidence, with results that often match the performance of human curators. JIGSAW computes the relative weight of different lines of evidence using statistics generated from a training set, and then combines the evidence using dynamic programming. Our results show that JIGSAW's performance is superior to ab initio gene finding methods and to other pipelines such as Ensembl. Even without evidence from alignment to known genes, JIGSAW can substantially improve gene prediction accuracy as compared with existing methods. AVAILABILITY: JIGSAW is available as an open source software package at http://cbcb.umd.edu/software/jigsaw.
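JIGSAW's decoding is based on dynamic programming over weighted evidence; the toy R sketch below, with invented evidence tracks and weights, illustrates only the weighted-evidence combination step, not the full gene-structure decoding.

```r
## Toy sketch of combining evidence sources with learned weights: each source
## votes per base for "coding", and the weighted vote is thresholded. This is
## not JIGSAW's algorithm, only an illustration of evidence weighting.
evidence <- rbind(abinitio = c(1, 1, 0, 0, 1, 1, 1, 0),
                  est      = c(1, 1, 1, 0, 0, 1, 1, 0),
                  protein  = c(0, 1, 1, 0, 1, 1, 0, 0))
weights  <- c(abinitio = 0.30, est = 0.45, protein = 0.25)  # e.g. from training accuracy

score <- as.vector(weights %*% evidence)   # weighted vote per position
coding_call <- score > 0.5
rbind(score = round(score, 2), coding = coding_call)
```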

9.
MOTIVATION: The biological literature is a major repository of knowledge. Many biological databases draw much of their content from a careful curation of this literature. However, as the volume of literature increases, the burden of curation increases. Text mining may provide useful tools to assist in the curation process. To date, the lack of standards has made it impossible to determine whether text mining techniques are sufficiently mature to be useful. RESULTS: We report on a Challenge Evaluation task that we created for the Knowledge Discovery and Data Mining (KDD) Challenge Cup. We provided a training corpus of 862 journal articles curated in FlyBase, along with the associated lists of genes and gene products, as well as the relevant data fields from FlyBase. For the test, we provided a corpus of 213 new ('blind') articles; the 18 participating groups provided systems that flagged articles for curation, based on whether the article contained experimental evidence for gene expression products. We report on the evaluation results and describe the techniques used by the top performing groups.

10.
Gramene: development and integration of trait and gene ontologies for rice
Gramene (http://www.gramene.org/) is a comparative genome database for cereal crops and a community resource for rice. We are populating and curating Gramene with annotated rice (Oryza sativa) genomic sequence data and associated biological information including molecular markers, mutants, phenotypes, polymorphisms and Quantitative Trait Loci (QTL). In order to support queries across various data sets as well as across external databases, Gramene will employ three related controlled vocabularies. The specific goals of Gramene are, first, to provide a Trait Ontology (TO) that can be used across the cereal crops to facilitate phenotypic comparisons both within and between genera. Second, a vocabulary for plant anatomy terms, the Plant Ontology (PO), will facilitate the curation of morphological and anatomical feature information with respect to expression, localization of genes and gene products, and the affected plant parts in a phenotype. The TO and PO are both in the early stages of development in collaboration with the International Rice Research Institute, TAIR and MaizeDB as part of the Plant Ontology Consortium. Finally, as part of another consortium comprising macromolecular databases from other model organisms, the Gene Ontology Consortium, we are annotating the confirmed and predicted protein entries from rice using both electronic and manual curation.

11.
The fine periodic growth patterns on shell surfaces have been widely used for studies in the ecology and evolution of scallops. Modern X-ray CT scanners and digital cameras can provide high-resolution image data that contain abundant information such as the shell formation rate, ontogenetic age, and life span of shellfish organisms. We introduced a novel multiscale image processing method based on matched filters with Gaussian kernels and partial differential equation (PDE) multiscale hierarchical decomposition to segment the small tubular and periodic structures in scallop shell images. The periodic patterns of structures (consisting of bifurcation points, crossover points of the rings and ribs, and the connected lines) could be found by our Space-based Depth-First Search (SDFS) algorithm. We created a MATLAB package to implement our method of periodic pattern extraction and pattern matching on the CT and digital scallop images available in this study. The results confirmed the hypothesis that the shell cyclic structure patterns encompass genetically specific information that can be used as an effective invariable biomarker for biological individual recognition. The package is available with a quick-start guide and includes three examples: http://mgb.ouc.edu.cn/novegene/html/code.php.
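The full pipeline (PDE multiscale decomposition and the SDFS pattern search) is implemented in the authors' MATLAB package; as a hedged sketch of one ingredient, the R code below builds a Gaussian kernel and convolves it over an image matrix, the basic operation underlying the matched-filtering step. The image is random and the kernel parameters are illustrative.

```r
## Gaussian kernel construction and a plain "same-size" 2D convolution,
## written out explicitly for clarity (not optimized, not the authors' code).
gaussian_kernel <- function(size = 7, sigma = 1.5) {
  ax <- seq(-(size - 1) / 2, (size - 1) / 2)
  k  <- outer(ax, ax, function(x, y) exp(-(x^2 + y^2) / (2 * sigma^2)))
  k / sum(k)
}

convolve2d <- function(img, k) {
  pad <- (nrow(k) - 1) / 2
  padded <- matrix(0, nrow(img) + 2 * pad, ncol(img) + 2 * pad)
  padded[pad + seq_len(nrow(img)), pad + seq_len(ncol(img))] <- img
  out <- matrix(0, nrow(img), ncol(img))
  for (i in seq_len(nrow(img)))
    for (j in seq_len(ncol(img)))
      out[i, j] <- sum(padded[i:(i + 2 * pad), j:(j + 2 * pad)] * k)
  out
}

img <- matrix(runif(64 * 64), 64, 64)      # stand-in for a shell image
filtered <- convolve2d(img, gaussian_kernel())
```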

12.
MOTIVATION: Knowledge base construction has been an area of intense activity and great importance in the growth of computational biology. However, there is little or no history of work on the subject of evaluation of knowledge bases, either with respect to their contents or with respect to the processes by which they are constructed. This article proposes the application of a metric from software engineering known as the found/fixed graph to the problem of evaluating the processes by which genomic knowledge bases are built, as well as the completeness of their contents. RESULTS: Well-understood patterns of change in the found/fixed graph are found to occur in two large publicly available knowledge bases. These patterns suggest that the current manual curation processes will take far too long to complete the annotations of even just the most important model organisms, and that at their current rate of production, they will never be sufficient for completing the annotation of all currently available proteomes.
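A found/fixed graph simply plots cumulative counts of issues found versus issues fixed over time; the gap between the two curves and its trend indicate how far the process is from completion. The R sketch below uses simulated weekly counts purely for illustration.

```r
## Simulated found/fixed graph: cumulative annotation issues identified ("found")
## versus completed or corrected ("fixed") over one year of curation.
set.seed(4)
weeks <- 1:52
found <- cumsum(rpois(52, lambda = 20))   # issues identified per week, accumulated
fixed <- cumsum(rpois(52, lambda = 15))   # issues resolved per week, accumulated

plot(weeks, found, type = "l", xlab = "Week", ylab = "Cumulative count",
     main = "Found/fixed graph (simulated)")
lines(weeks, fixed, lty = 2)
legend("topleft", legend = c("found", "fixed"), lty = c(1, 2))
```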

13.
The Zebrafish Information Network (ZFIN) is a web based community resource that serves as a centralized location for the curation and integration of zebrafish genetic, genomic and developmental data. ZFIN is publicly accessible at http://zfin.org. ZFIN provides an integrated representation of mutants, genes, genetic markers, mapping panels, publications and community contact data. Recent enhancements to ZFIN include: (i) an anatomical dictionary that provides a controlled vocabulary of anatomical terms, grouped by developmental stages, that may be used to annotate and query gene expression data; (ii) gene expression data; (iii) expanded support for genome sequence; (iv) gene annotation using the standardized vocabulary of Gene Ontology (GO) terms that can be used to elucidate relationships between gene products in zebrafish and other organisms; and (v) collaborations with other databases (NCBI, Sanger Institute and SWISS-PROT) to provide standardization and interconnections based on shared curation.

14.
Fly, 2013, 7(2): 151-156
In modern functional genomics, registration techniques are used to construct reference gene expression patterns and create a spatiotemporal atlas of the expression of all the genes in a network. In this paper we present a software package called GCPReg, which can be used to register the expression patterns of segmentation genes in the early Drosophila embryo. The key task this package performs is the extraction of spatially localized characteristic features of expression patterns. To facilitate this task, we have developed an easy-to-use interactive graphical interface. We describe GCPReg usage and demonstrate how this package can be applied to register gene expression patterns in wild-type and mutant embryos. GCPReg has been designed to operate on a UNIX platform and is freely available via the Internet at http://urchin.spbcas.ru/downloads/GCPReg/GCPReg.htm.

15.
16.
SUMMARY: Vbmp is an R package for Gaussian Process classification of data over multiple classes. It features multinomial probit regression with Gaussian Process priors and estimates class posterior probabilities employing fast variational approximations to the full posterior. This software also incorporates feature weighting by means of Automatic Relevance Determination. Being equipped with only one main function and reasonable default values for optional parameters, vbmp combines flexibility with ease of use, as demonstrated on a breast cancer microarray study. AVAILABILITY: The R library vbmp implementing this method is part of Bioconductor and can be downloaded from http://www.dcs.gla.ac.uk/~girolami
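A minimal usage sketch follows; the argument order and the control option are assumptions based on the package's documented interface and should be checked against the Bioconductor vignette before use. The data are simulated.

```r
## Hedged sketch of multi-class GP classification with vbmp.
library(vbmp)

set.seed(5)
n <- 60; p <- 10
X <- matrix(rnorm(n * p), n, p)
y <- sample(1:3, n, replace = TRUE)      # three classes
train <- 1:40; test <- 41:60

## ASSUMPTION: the call shape (train X, train classes, test X, test classes,
## covariance parameters, control list) and bThetaEstimate follow the
## package's documented example; verify before relying on this.
theta <- rep(1, p)
fit <- vbmp(X[train, ], y[train], X[test, ], y[test], theta,
            control = list(bThetaEstimate = TRUE))   # ARD feature weighting
predClass(fit)                            # predicted classes for the test set
```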

17.
SUMMARY: DNAfan (DNA Feature ANalyzer) is a tool combining sequence filtering and pattern searching. DNAfan automatically extracts user-defined sets of sequence fragments from large sequence sets. Fragments are defined by annotated gene feature keys and co- or non-occurring patterns within the feature or close to it. A gene feature parser and a pattern-based filter tool localize and extract the specific subset of sequences. The selected sequence data can subsequently be retrieved for analyses or further processed with DNAfan to find the occurrence of specific patterns or structural motifs. DNAfan is a powerful tool for pattern analysis. Its filtering features restrict the pattern search to a well-defined set of sequences, allowing a drastic reduction in false-positive hits. AVAILABILITY: http://bighost.ba.itb.cnr.it:8080/Framework.
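DNAfan itself is a web service; the R sketch below only mirrors its core operation with invented data: extract fragments defined by annotated feature coordinates, then restrict the pattern search to those fragments rather than to the whole sequence.

```r
## Sketch of feature-restricted pattern searching with base R: the sequence,
## feature coordinates and search pattern are all illustrative.
seqs <- c(rec1 = "TTGACAGCTAGCTCAGTCCTAGGTATAATGCTAGC")
features <- data.frame(rec = "rec1", key = "promoter", start = 1, end = 20,
                       stringsAsFactors = FALSE)

## Extract only the annotated fragments...
fragments <- mapply(function(rec, s, e) substr(seqs[rec], s, e),
                    features$rec, features$start, features$end)

## ...and search the pattern inside them, not in the full sequences.
pattern <- "TAGCT"
hits <- gregexpr(pattern, fragments)
Map(function(frag, h) if (h[1] > 0) as.integer(h) else integer(0),
    fragments, hits)
```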

18.
Challenges in integrating Escherichia coli molecular biology data
One key challenge in Systems Biology is to provide mechanisms to collect and integrate the data necessary to meet multiple analysis requirements. Typically, biological content is scattered over multiple data sources and there is no easy way of comparing heterogeneous data contents. This work discusses ongoing standardisation and interoperability efforts and exposes integration challenges for the model organism Escherichia coli K-12. The goal is to analyse the major obstacles faced by integration processes, suggest ways to systematically identify them, and, whenever possible, propose solutions or means to assist manual curation. Integration of gene, protein and compound data was evaluated by performing comparisons over EcoCyc, KEGG, BRENDA, ChEBI, Entrez Gene and UniProt contents. Cross-links, a number of standard nomenclatures and name information supported the comparisons. Except for the gene integration scenario, in no other scenario did a single integration element perform well enough to support the process by itself. Indeed, integration of both enzyme and compound records implies considerable curation. The results showed that, even for a well-studied model organism, source contents are still far from being as standardized as would be desired, and metadata varies considerably from source to source. Before designing any data integration pipeline, researchers should decide on the sources that best fit the purpose of the analysis and be aware of existing conflicts/inconsistencies so that they can intervene in their resolution. Moreover, they should be aware of the limits of automatic integration so that they can define the extent of manual curation necessary for each application.
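As a hedged sketch of the kind of cross-source comparison described above, the R code below outer-joins records from two invented sources on a shared identifier and flags entries whose names conflict or are missing, i.e. the cases that would need manual curation.

```r
## Toy cross-source comparison: align two sources on a gene symbol (standing in
## for a proper cross-link) and classify each merged record.
src_a <- data.frame(symbol = c("thrA", "thrB", "lacZ"),
                    name_a = c("aspartokinase I", "homoserine kinase",
                               "beta-galactosidase"), stringsAsFactors = FALSE)
src_b <- data.frame(symbol = c("thrA", "lacZ", "araC"),
                    name_b = c("aspartokinase I", "b-galactosidase",
                               "arabinose operon regulator"), stringsAsFactors = FALSE)

merged <- merge(src_a, src_b, by = "symbol", all = TRUE)   # outer join on the identifier
merged$status <- ifelse(is.na(merged$name_a) | is.na(merged$name_b), "missing",
                 ifelse(merged$name_a == merged$name_b, "consistent", "conflict"))
merged
```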

19.
Increasingly, animal behavior studies are enhanced through the use of accelerometry. Translating raw accelerometer data into animal behaviors requires the development of classifiers. Here, we present the “rabc” (r for animal behavior classification) package to assist researchers with the interactive development of such animal behavior classifiers in a supervised classification approach. The package uses datasets consisting of accelerometer data with their corresponding animal behaviors (e.g., for triaxial accelerometer data along the x, y and z axes arranged as “x, y, z, x, y, z,…, behavior”). Using an example dataset collected on white stork (Ciconia ciconia), we illustrate the workflow of this package, including accelerometer data visualization, feature calculation, feature selection, feature visualization, extreme gradient boosting model training, validation, and, finally, a demonstration of the behavior classification results.
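The rabc function names are not given in the abstract, so the sketch below uses the xgboost package directly to illustrate the modelling step of the workflow: per-burst summary features (invented here) are used to train a multi-class gradient-boosted classifier.

```r
## Hedged sketch of the classification step only; real features would be
## computed from the raw x, y, z samples of each labeled burst.
library(xgboost)

set.seed(7)
n_bursts <- 200
feats <- data.frame(mean_x = rnorm(n_bursts), mean_z = rnorm(n_bursts),
                    var_x = rchisq(n_bursts, 2), odba = rchisq(n_bursts, 3))
behavior <- sample(c("resting", "walking", "flying"), n_bursts, replace = TRUE)
label <- as.integer(factor(behavior)) - 1      # xgboost expects 0-based class labels

fit <- xgboost(data = as.matrix(feats), label = label,
               nrounds = 30, objective = "multi:softmax",
               num_class = 3, verbose = 0)
pred <- predict(fit, as.matrix(feats))
table(predicted = levels(factor(behavior))[pred + 1], truth = behavior)
```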

20.
Haploview: analysis and visualization of LD and haplotype maps
SUMMARY: Research over the last few years has revealed significant haplotype structure in the human genome. The characterization of these patterns, particularly in the context of medical genetic association studies, is becoming a routine research activity. Haploview is a software package that provides computation of linkage disequilibrium statistics and population haplotype patterns from primary genotype data in a visually appealing and interactive interface. AVAILABILITY: http://www.broad.mit.edu/mpg/haploview/ CONTACT: jcbarret@broad.mit.edu
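Haploview is a Java application; as a small illustration of one statistic it reports, the R sketch below computes pairwise LD (r-squared) between two biallelic markers from phased haplotype counts, which are invented.

```r
## Pairwise LD (r^2) from the counts of the four phased haplotypes AB, Ab, aB, ab.
r_squared <- function(hap_counts) {
  p  <- hap_counts / sum(hap_counts)
  pA <- p["AB"] + p["Ab"]          # allele frequency of A at locus 1
  pB <- p["AB"] + p["aB"]          # allele frequency of B at locus 2
  D  <- p["AB"] - pA * pB          # linkage disequilibrium coefficient
  unname(D^2 / (pA * (1 - pA) * pB * (1 - pB)))
}

r_squared(c(AB = 60, Ab = 10, aB = 10, ab = 20))
```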
