Similar Documents
1.
Data mining in bioinformatics using Weka
The Weka machine learning workbench provides a general-purpose environment for automatic classification, regression, clustering and feature selection, which are common data mining problems in bioinformatics research. It contains an extensive collection of machine learning algorithms and data pre-processing methods, complemented by graphical user interfaces for data exploration and for the experimental comparison of different machine learning techniques on the same problem. Weka can process data given in the form of a single relational table. Its main objectives are to (a) assist users in extracting useful information from data and (b) enable them to easily identify a suitable algorithm for generating an accurate predictive model from it. AVAILABILITY: http://www.cs.waikato.ac.nz/ml/weka
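The "single relational table" Weka processes is typically an ARFF file. As a minimal sketch of what that looks like in practice, here is a simplified pure-Python reader for the basic @relation/@attribute/@data layout (a real ARFF parser also handles quoting, sparse data and comments; the gene-expression example data are invented for illustration):

```python
def parse_arff(text):
    """Parse a basic ARFF document into (relation, attribute names, data rows)."""
    relation, attributes, rows = None, [], []
    in_data = False
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith('%'):      # skip blanks and comments
            continue
        lower = line.lower()
        if lower.startswith('@relation'):
            relation = line.split(None, 1)[1]
        elif lower.startswith('@attribute'):
            attributes.append(line.split()[1])    # attribute name only
        elif lower.startswith('@data'):
            in_data = True
        elif in_data:
            rows.append(line.split(','))
    return relation, attributes, rows

example = """@relation gene_expression
@attribute gene string
@attribute expr numeric
@attribute class {up,down}
@data
BRCA1,2.3,up
TP53,0.4,down
"""
relation, attrs, rows = parse_arff(example)
```

Once data are in this tabular form, every Weka learner (classification, regression, clustering, feature selection) can be applied to the same table.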

2.
sMOL Explorer is a 2D ligand-based computational tool that provides three major functionalities through a Web interface: data management; information retrieval and extraction; and statistical analysis and data mining. With sMOL Explorer, users can create personal databases by adding each small molecule via a drawing interface or by uploading data files from internal and external projects into the sMOL database. The database can then be browsed and queried with textual and structural similarity searches. A molecule can also be submitted for search against external public databases, including PubChem, KEGG, DrugBank and eMolecules. Moreover, users can easily access a variety of data mining tools from the Weka and R packages to perform analyses including (1) finding frequent substructures, (2) clustering molecular fingerprints, (3) identifying and removing irrelevant attributes from the data and (4) building classification models of biological activity. AVAILABILITY: sMOL Explorer is an Open Source project and is freely available to all interested users at http://www.biotec.or.th/ISL/SMOL/
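The structural similarity search mentioned above is conventionally done by comparing molecular fingerprints with the Tanimoto coefficient. The sketch below illustrates that general technique (not sMOL Explorer's actual implementation); the bit-set fingerprints and molecule names are invented:

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient between two fingerprints given as sets of on-bits."""
    if not fp_a and not fp_b:
        return 1.0
    inter = len(fp_a & fp_b)
    return inter / (len(fp_a) + len(fp_b) - inter)

def similarity_search(query_fp, database, threshold=0.5):
    """Return (name, score) pairs at or above threshold, best match first."""
    hits = [(name, tanimoto(query_fp, fp)) for name, fp in database.items()]
    return sorted([h for h in hits if h[1] >= threshold],
                  key=lambda h: h[1], reverse=True)

db = {
    "mol_a": {1, 2, 3, 4},   # shares 3 of the query's bits
    "mol_b": {1, 2, 9},      # shares 2
    "mol_c": {7, 8},         # shares none
}
hits = similarity_search({1, 2, 3}, db)
```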

3.
Recent technological advances have made it possible to identify and quantify thousands of proteins in a single proteomics experiment. As a result of these developments, data analysis has become the bottleneck of the proteomics experiment. To provide the proteomics community with a user-friendly platform for comprehensive analysis, inspection and visualization of quantitative proteomics data, we developed the Graphical Proteomics Data Explorer (GProX). The program requires no special bioinformatics training, as all functions of GProX are accessible within its user-friendly graphical interface, which will be intuitive to most users. Basic features facilitate the uncomplicated management and organization of large data sets and complex experimental setups, as well as the inspection and graphical plotting of quantitative data. These are complemented by readily available high-level analysis options such as database querying, clustering based on abundance ratios, feature enrichment tests (e.g. for GO terms) and pathway analysis tools. A number of plotting options for the visualization of quantitative proteomics data are available, and most analysis functions in GProX create customizable, high-quality graphical displays in both vector and bitmap formats. The generic import requirements allow data originating from essentially all mass spectrometry platforms, quantitation strategies and software to be analyzed in the program. GProX represents a powerful approach to proteomics data analysis, providing proteomics experimenters with a toolbox for bioinformatics analysis of quantitative proteomics data. The program is released as open source and can be freely downloaded from the project webpage at http://gprox.sourceforge.net
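To make "clustering based on abundance ratios" concrete: quantitative proteomics data arrive as per-protein ratios between conditions, and a first-pass grouping can be done on the log2 of those ratios. This toy sketch (not GProX's actual algorithm, which offers proper clustering) simply bins proteins as up-, down- or unregulated; the protein names and ratios are invented:

```python
import math

def mean_log2_ratio(ratios):
    """Mean log2 of a protein's abundance ratios across replicates."""
    return sum(math.log2(r) for r in ratios) / len(ratios)

def bin_by_regulation(proteins, cutoff=1.0):
    """Group proteins into up / down / unchanged by mean log2 ratio (|cutoff| = 2-fold)."""
    clusters = {"up": [], "down": [], "unchanged": []}
    for name, ratios in proteins.items():
        m = mean_log2_ratio(ratios)
        if m >= cutoff:
            clusters["up"].append(name)
        elif m <= -cutoff:
            clusters["down"].append(name)
        else:
            clusters["unchanged"].append(name)
    return clusters

data = {
    "P1": [4.0, 4.5],   # strongly increased in both replicates
    "P2": [0.25, 0.2],  # strongly decreased
    "P3": [1.1, 0.9],   # essentially unchanged
}
clusters = bin_by_regulation(data)
```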

4.
MOTIVATION: Effective use of proteomics data, specifically mass spectrometry data, relies on the ability to read and write the many mass spectrometer file formats. Even with mass spectrometer vendor-specific libraries and vendor-neutral file formats such as mzXML and mzData, it can be difficult to extract raw data files in a form suitable for batch processing and basic research. Introduced here is the ProteomeCommons.org Input and Output (IO) Framework, which is designed to abstractly represent mass spectrometry data. This project is a public, open-source, free-to-use framework that supports most mass spectrometry data formats, including current formats, legacy formats and proprietary formats that require a vendor-specific library in order to operate. The IO Framework includes an on-line tool for non-programmers and a set of libraries that developers may use to convert between various proteomics file formats. AVAILABILITY: The current source code and documentation for the ProteomeCommons.org IO Framework are freely available at http://www.proteomecommons.org/current/531/
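The core idea of such an IO layer is one abstract in-memory representation with per-format readers and writers that normalise into and out of it. This sketch illustrates the pattern only; the `Spectrum` class and both tiny text formats are hypothetical, not the actual ProteomeCommons.org API:

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Spectrum:
    """Format-neutral representation of one mass spectrum."""
    precursor_mz: float
    peaks: List[Tuple[float, float]]   # (m/z, intensity) pairs

def read_simple_text(text):
    """Reader for a made-up input format: first line precursor m/z, then peak lines."""
    lines = [l for l in text.strip().splitlines() if l]
    precursor = float(lines[0])
    peaks = [tuple(map(float, l.split())) for l in lines[1:]]
    return Spectrum(precursor, peaks)

def write_csv(spectrum):
    """Writer targeting a different (also made-up) comma-separated format."""
    header = f"precursor,{spectrum.precursor_mz}"
    body = "\n".join(f"{mz},{i}" for mz, i in spectrum.peaks)
    return header + "\n" + body

# Convert between the two formats via the shared abstraction.
s = read_simple_text("445.12\n100.1 250\n101.2 300\n")
converted = write_csv(s)
```

Adding a new format then means writing one reader or writer against `Spectrum`, rather than one converter per format pair.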

5.
BioJava: an open-source framework for bioinformatics
SUMMARY: BioJava is a mature open-source project that provides a framework for processing biological data. BioJava contains powerful analysis and statistical routines, tools for parsing common file formats and packages for manipulating sequences and 3D structures. It enables rapid bioinformatics application development in the Java programming language. AVAILABILITY: BioJava is an open-source project distributed under the Lesser GPL (LGPL). BioJava can be downloaded from the BioJava website (http://www.biojava.org). BioJava requires Java 1.5 or higher. All queries should be directed to the BioJava mailing lists. Details are available at http://biojava.org/wiki/BioJava:MailingLists
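BioJava itself is a Java library; as a language-neutral illustration of the kind of sequence manipulation such frameworks provide, here is a minimal pure-Python sketch of two classic operations (reverse complement and GC content), not BioJava's actual API:

```python
# Translation table mapping each DNA base to its complement.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Reverse-complement a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]

def gc_content(seq):
    """Fraction of G/C bases in a DNA sequence."""
    s = seq.upper()
    return (s.count("G") + s.count("C")) / len(s)

rc = reverse_complement("ATGC")
gc = gc_content("ATGC")
```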

6.
SUMMARY: The Helmholtz Network for Bioinformatics (HNB) is a joint venture of eleven German bioinformatics research groups that offers convenient access to numerous bioinformatics resources through a single web portal. The 'Guided Solution Finder' which is available through the HNB portal helps users to locate the appropriate resources to answer their queries by employing a detailed, tree-like questionnaire. Furthermore, automated complex tool cascades ('tasks'), involving resources located on different servers, have been implemented, allowing users to perform comprehensive data analyses without the requirement of further manual intervention for data transfer and re-formatting. Currently, automated cascades for the analysis of regulatory DNA segments as well as for the prediction of protein functional properties are provided. AVAILABILITY: The HNB portal is available at http://www.hnbioinfo.de

7.
About five years ago, ontology was almost unknown in bioinformatics, even more so in molecular biology. Nowadays, many bioinformatics articles mention it in connection with text mining, data integration or as a metaphysical cure for problems in standardisation of nomenclature and other applications. This article attempts to give an account of what concept ontologies in the domain of biology and bioinformatics are; what they are not; how they can be constructed; how they can be used; and some fallacies and pitfalls creators and users should be aware of.
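Computationally, a concept ontology of the kind the article describes is essentially a directed acyclic graph of is-a relations over which transitive queries are run (the pattern used by, e.g., the Gene Ontology). The sketch below uses an invented toy hierarchy to show the structure and an ancestor query:

```python
# Toy is-a hierarchy: each term maps to its direct parent terms.
IS_A = {
    "serine protease": ["protease"],
    "protease": ["enzyme"],
    "enzyme": ["protein"],
    "protein": [],
}

def ancestors(term, graph=IS_A):
    """All terms reachable from `term` via is-a edges (transitive closure)."""
    seen = set()
    stack = list(graph.get(term, []))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(graph.get(parent, []))
    return seen

anc = ancestors("serine protease")
```

Queries like this are what make ontologies useful for annotation: anything annotated as a "serine protease" is automatically retrievable as an "enzyme" or a "protein".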

8.
Most bioinformatics tools require specialized input formats for sequence comparison and analysis. This is particularly true of molecular phylogeny programs, which accept only certain formats. In addition, it is often necessary to eliminate highly similar sequences from the input, especially when the dataset is large. Moreover, most programs place restrictions on the sequence name. Here we introduce SeqMaT, a Sequence Manipulation Tool. It has the following functions: (1) data format conversion, (2) sequence name coding and decoding, (3) redundant and highly similar sequence removal, and (4) data mining utilities. SeqMaT was developed in Java in two versions, web-based and standalone. The standalone program is convenient for manipulating large numbers of sequences, while the web version guarantees wide availability of the tool to researchers and practitioners across the Internet. AVAILABILITY: The tool is freely available at http://glee.ist.unomaha.edu/seqmat
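The "highly similar sequence removal" step can be sketched as a greedy filter: keep a sequence only if it is below an identity threshold against everything already kept. Real tools compute identity from proper alignments; this toy version (not SeqMaT's actual algorithm) scores equal-length sequences position-by-position on invented data:

```python
def identity(a, b):
    """Fraction of matching positions between two equal-length sequences."""
    return sum(x == y for x, y in zip(a, b)) / max(len(a), len(b))

def remove_redundant(seqs, threshold=0.9):
    """Greedily keep a sequence only if it is < threshold identical to all kept ones."""
    kept = []
    for s in seqs:
        if all(identity(s, k) < threshold for k in kept):
            kept.append(s)
    return kept

# The second sequence is 90% identical to the first and is dropped.
kept = remove_redundant(["ACGTACGTAC", "ACGTACGTAA", "TTTTGGGGCC"])
```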

9.
MOTIVATION: As more whole genome sequences become available, comparing multiple genomes at the sequence level can provide insights leading to new biological discoveries. However, genome comparison poses significant challenges. These include the computational resources required to handle the large volume of genome data. More importantly, since the choice of genomes to be compared is entirely subjective, there are too many possible choices for genome comparison. For these reasons, there is a pressing need for bioinformatics systems for comparing multiple genomes in which users can freely choose the genomes to be compared. RESULTS: PLATCOM (Platform for Computational Comparative Genomics) is an integrated system for the comparative analysis of multiple genomes. The system is built on several public databases, and a suite of genome analysis applications is provided as exemplary genome data mining tools over these internal databases. Researchers are able to visually investigate genomic sequence similarities, conserved gene neighborhoods, conserved metabolic pathways and putative gene fusion events among a set of selected genomes. AVAILABILITY: http://platcom.informatics.indiana.edu/platcom

10.
BACKGROUND: Ongoing innovation in phylogenetics and evolutionary biology has been accompanied by a proliferation of software tools, data formats, analytical techniques and web servers. This brings with it the challenge of integrating phylogenetic and other related biological data found in a wide variety of formats, and underlines the need for reusable software that can read, manipulate and transform this information into the various forms required to build computational pipelines. RESULTS: We built a Python software library for working with phylogenetic data that is tightly integrated with Biopython, a broad-ranging toolkit for computational biology. Our library, Bio.Phylo, is highly interoperable with existing libraries, tools and standards, and is capable of parsing common file formats for phylogenetic trees, performing basic transformations and manipulations, attaching rich annotations, and visualizing trees. We unified the modules for working with the standard file formats Newick, NEXUS and phyloXML behind a consistent and simple API, providing a common set of functionality independent of the data source. CONCLUSIONS: Bio.Phylo meets a growing need in bioinformatics for working with heterogeneous types of phylogenetic data. By supporting interoperability with multiple file formats and leveraging existing Biopython features, this library simplifies the construction of phylogenetic workflows. We also provide examples of the benefits of building a community around a shared open-source project. Bio.Phylo is included with Biopython, available through the Biopython website, http://biopython.org
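Of the formats named above, Newick is the simplest: a nested parenthesis notation for trees. The following stripped-down parser (an illustration, not Bio.Phylo's implementation) handles names and nesting only, ignoring branch lengths, quoting and comments, and returns nested `(name, children)` tuples:

```python
def parse_newick(s):
    """Parse a bare Newick string like '((A,B)ab,C)root;' into nested tuples."""
    pos = 0

    def parse_clade():
        nonlocal pos
        children = []
        if s[pos] == '(':
            pos += 1                       # consume '('
            children.append(parse_clade())
            while s[pos] == ',':
                pos += 1                   # consume ',' between siblings
                children.append(parse_clade())
            pos += 1                       # consume ')'
        start = pos                        # clade label follows the group
        while pos < len(s) and s[pos] not in '(),;':
            pos += 1
        return (s[start:pos], children)

    return parse_clade()

tree = parse_newick("((A,B)ab,C)root;")
```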

12.
Background: Metagenomic sequencing is a complex sampling procedure applied to unknown mixtures of many genomes. Metagenome data with known genome compositions are essential both for benchmarking bioinformatics software and for investigating the influence of various factors on the data. Compared to data from real microbiome samples or from a defined microbial mock community, data simulated with proper computational models are better suited for this purpose, as they provide more flexibility for controlling multiple factors. Methods: We developed a non-uniform metagenomic sequencing simulation system (nuMetaSim) that is capable of mimicking various factors in real metagenomic sequencing, reflecting multiple properties of real data with customizable parameter settings. Results: We generated 9 comprehensive metagenomic datasets of different composition complexity from 203 bacterial genomes and 2 archaeal genomes related to the human intestinal system. Conclusion: The data can serve as benchmarks for comparing the performance of different methods in different situations, and the software package allows users to generate simulation data that better reflect the specific properties of their scenarios.
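The core of any such simulator is drawing reads from member genomes with non-uniform, abundance-weighted probability. This minimal sketch shows only that sampling step (real simulators like nuMetaSim additionally model sequencing errors and coverage bias); the genomes and weights are invented:

```python
import random

def simulate_reads(genomes, weights, n_reads, read_len, seed=0):
    """Sample fixed-length reads from genomes, choosing genomes proportionally to weights."""
    rng = random.Random(seed)              # seeded for reproducible datasets
    names = list(genomes)
    reads = []
    for _ in range(n_reads):
        name = rng.choices(names, weights=[weights[n] for n in names])[0]
        g = genomes[name]
        start = rng.randrange(len(g) - read_len + 1)
        reads.append((name, g[start:start + read_len]))
    return reads

genomes = {"gA": "ACGT" * 50, "gB": "TTAA" * 50}
# gA is 9x more abundant than gB in this simulated community.
reads = simulate_reads(genomes, {"gA": 0.9, "gB": 0.1}, n_reads=20, read_len=10)
```

Because the true genome of origin is recorded with each read, the output is directly usable as a benchmark with known ground truth.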

13.
Since the publication of the human genome, two key points have emerged. First, it is still not certain which regions of the genome code for proteins. Second, the number of discrete protein-coding genes is far fewer than the number of different proteins. Proteomics has the potential to address some of these postgenomic issues if the obstacles that we face can be overcome in our efforts to combine proteomic and genomic data. There are many challenges associated with high-throughput and high-output proteomic technologies. Consequently, for proteomics to continue at its current growth rate, new approaches must be developed to ease data management and data mining. Initiatives have been launched to develop standard data formats for exchanging mass spectrometry proteomic data, including the Proteomics Standards Initiative formed by the Human Proteome Organization. Databases such as SwissProt and Uniprot are publicly available repositories for protein sequences annotated for function, subcellular location and known potential post-translational modifications. The availability of bioinformatics solutions is crucial for proteomics technologies to fulfil their promise of adding further definition to the functional output of the human genome. The aim of the Oxford Genome Anatomy Project is to provide a framework for integrating molecular, cellular, phenotypic and clinical information with experimental genetic and proteomics data. This perspective also discusses models to make the Oxford Genome Anatomy Project accessible and beneficial for academic and commercial research and development.

16.
We describe PerlMAT, a Perl microarray toolkit providing easy-to-use object-oriented methods for the simplified manipulation, management and analysis of microarray data. The toolkit provides objects for the encapsulation of microarray spots and reporters, several common microarray data file formats and GAL files. In addition, an analysis object provides methods for data processing, and an image object enables the visualisation of microarray data. This important addition to the Perl developer's library will facilitate more widespread use of Perl for microarray application development within the bioinformatics community. The coherent interface and well-documented code enable rapid analysis by even inexperienced Perl developers. AVAILABILITY: Software is available at http://sourceforge.net/projects/perlmat

18.
The Protein Circular Dichroism Data Bank (PCDDB) [https://pcddb.cryst.bbk.ac.uk] is an established resource for the biological, biophysical, chemical, bioinformatics, and molecular biology communities. It is a freely accessible repository of validated protein circular dichroism (CD) spectra and associated sample and metadata, with entries having links to other bioinformatics resources including, amongst others, structure (PDB), AlphaFold, and sequence (UniProt) databases, as well as to published papers which produced the data and cite the database entries. It includes primary (unprocessed) and final (processed) spectral data, which are available in both text and pictorial formats, as well as detailed sample and validation information produced for each of the entries. Recently the metadata content associated with each of the entries, as well as the number and structural breadth of the protein components included, have been expanded. The PCDDB includes data on both wild-type and mutant proteins, and because CD studies primarily examine proteins in solution, it also contains examples of the effects of different environments on their structures, plus thermal unfolding/folding series. Methods for both sequence and spectral comparisons are included. The data included in the PCDDB complement results from crystal, cryo-electron microscopy, NMR spectroscopy, bioinformatics characterisations and classifications, and other structural information available for the proteins via links to other databases. The entries in the PCDDB have been used for the development of new analytical methodologies, for interpreting spectral and other biophysical data, and for providing insight into structures and functions of individual soluble and membrane proteins and protein complexes.

19.

Background

The analysis of microbial communities through DNA sequencing brings many challenges: the integration of different types of data with methods from ecology, genetics, phylogenetics, multivariate statistics, visualization and testing. With the increased breadth of experimental designs now being pursued, project-specific statistical analyses are often needed, and these analyses are often difficult (or impossible) for peer researchers to independently reproduce. The vast majority of the requisite tools for performing these analyses reproducibly are already implemented in R and its extensions (packages), but with limited support for high throughput microbiome census data.

Results

Here we describe a software project, phyloseq, dedicated to the object-oriented representation and analysis of microbiome census data in R. It supports importing data from a variety of common formats, as well as many analysis techniques. These include calibration, filtering, subsetting, agglomeration, multi-table comparisons, diversity analysis, parallelized Fast UniFrac, ordination methods, and production of publication-quality graphics; all in a manner that is easy to document, share, and modify. We show how to apply functions from other R packages to phyloseq-represented data, illustrating the availability of a large number of open source analysis techniques. We discuss the use of phyloseq with tools for reproducible research, a practice common in other fields but still rare in the analysis of highly parallel microbiome census data. We have made available all of the materials necessary to completely reproduce the analysis and figures included in this article, an example of best practices for reproducible research.

Conclusions

The phyloseq project for R is a new open-source software package, freely available on the web from both GitHub and Bioconductor.
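Among the diversity analyses phyloseq offers (in R) is alpha-diversity from a vector of taxon counts. As a language-neutral illustration of one such metric, here is the standard Shannon index in pure Python; the count vectors are invented examples:

```python
import math

def shannon(counts):
    """Shannon diversity H' = -sum(p_i * ln p_i) over non-zero taxon counts."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)

h_even = shannon([25, 25, 25, 25])   # maximally even community of 4 taxa: H' = ln(4)
h_skew = shannon([97, 1, 1, 1])      # community dominated by one taxon: lower H'
```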

20.
Current research in the biosciences depends heavily on the effective exploitation of huge amounts of data. These are in disparate formats, remotely dispersed, and based on the different vocabularies of various disciplines. Furthermore, data are often stored or distributed using formats that leave implicit many important features relating to the structure and semantics of the data. Conceptual data modelling involves the development of implementation-independent models that capture and make explicit the principal structural properties of data. Entities such as a biopolymer or a reaction, and their relations, e.g. catalyses, can be formalised using a conceptual data model. Conceptual models are implementation-independent and can be transformed in systematic ways for implementation on different platforms, e.g. traditional database management systems. This paper describes the basics of the most widely used conceptual modelling notations, the ER (entity-relationship) model and the class diagrams of the UML (unified modelling language), and illustrates their use through several examples from bioinformatics. In particular, models are presented for protein structures and motifs, and for genomic sequences.
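The transformation from conceptual model to implementation can be made concrete in code: each entity becomes a class and each relationship a typed reference between classes. This is a toy rendering of the biopolymer / reaction / catalyses example (the class layout and sample values are invented for illustration):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Biopolymer:
    """Entity: a biopolymer, identified by name and sequence."""
    name: str
    sequence: str

@dataclass
class Reaction:
    """Entity: a reaction; `catalysed_by` realises the 'catalyses' relationship."""
    name: str
    catalysed_by: List[Biopolymer] = field(default_factory=list)

enzyme = Biopolymer("hexokinase", "MIASHLGAVK")   # toy sequence fragment
rxn = Reaction("glucose phosphorylation", catalysed_by=[enzyme])
```

The same ER diagram could equally be transformed into relational tables with a foreign key realising the relationship, which is what "implementation-independent" means in practice.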
