Similar Articles
20 similar articles found (search time: 31 ms)
1.
The structural comparison of two proteins comes up in many applications in structural biology where it is often necessary to find similarities in very large conformation sets. This work describes techniques to achieve significant speedup in the computation of structural similarity between two given conformations, at the expense of introducing a small error in the similarity measure. Furthermore, the proposed computational scheme allows for a tradeoff between speedup and error. This scheme exploits the fact that the Cα representation of a protein conformation contains redundant information, due to the chain topology and limited compactness of proteins. This redundancy can be reduced by approximating subchains of a protein by their centers of mass, resulting in a smaller number of points to describe a conformation. A Haar wavelet analysis of random chains and proteins is used to justify this approximated representation. Similarity measures computed with this representation are highly correlated to the measures computed with the original Cα representation. Therefore, they can be used in applications where small similarity errors can be tolerated or as fast filters in applications that require exact measures. Computational tests have been conducted on two applications, nearest neighbor search and automatic structural classification.

2.
3.
The ability to analyze and classify three-dimensional (3D) biological morphology has lagged behind the analysis of other biological data types such as gene sequences. Here, we introduce the techniques of data mining to the study of 3D biological shapes to bring the analyses of phenomes closer to the efficiency of studying genomes. We compiled five training sets of highly variable morphologies of mammalian teeth from the MorphoBrowser database. Samples were labeled either by dietary class or by conventional dental types (e.g. carnassial, selenodont). We automatically extracted a multitude of topological attributes using Geographic Information Systems (GIS)-like procedures that were then used in several combinations of feature selection schemes and probabilistic classification models to build and optimize classifiers for predicting the labels of the training sets. In terms of classification accuracy, computational time and size of the feature sets used, non-repeated best-first search combined with 1-nearest neighbor classifier was the best approach. However, several other classification models combined with the same searching scheme proved practical. The current study represents a first step in the automatic analysis of 3D phenotypes, which will be increasingly valuable with the future increase in 3D morphology and phenomics databases.
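The winning combination reported above, greedy forward ("best-first") feature selection wrapped around a 1-nearest-neighbor classifier, can be sketched generically. This is an illustrative reimplementation, not the study's code, using leave-one-out accuracy as the selection criterion:

```python
import numpy as np

def nn1_accuracy(X, y, feats):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier on chosen features."""
    Xf = X[:, feats]
    d = np.linalg.norm(Xf[:, None, :] - Xf[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)       # never match a sample to itself
    return float(np.mean(y[d.argmin(axis=1)] == y))

def best_first_select(X, y):
    """Greedy forward selection of features for 1-NN; stops when no candidate improves."""
    chosen, best = [], -1.0
    remaining = list(range(X.shape[1]))
    while remaining:
        acc, f = max((nn1_accuracy(X, y, chosen + [f]), f) for f in remaining)
        if acc <= best:
            break                     # no remaining feature improves accuracy
        chosen.append(f)
        remaining.remove(f)
        best = acc
    return chosen, best
```

On data where one attribute separates the classes and the rest are noise, the procedure selects just that attribute and stops.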

4.
Systemic analysis of available large-scale biological/biomedical data is critical for studying biological mechanisms, and developing novel and effective treatment approaches against diseases. However, different layers of the available data are produced using different technologies and scattered across individual computational resources without any explicit connections to each other, which hinders extensive and integrative multi-omics-based analysis. We aimed to address this issue by developing a new data integration/representation methodology and its application by constructing a biological data resource. CROssBAR is a comprehensive system that integrates large-scale biological/biomedical data from various resources and stores them in a NoSQL database. CROssBAR is enriched with the deep-learning-based prediction of relationships between numerous data entries, which is followed by the rigorous analysis of the enriched data to obtain biologically meaningful modules. These complex sets of entities and relationships are displayed to users via easy-to-interpret, interactive knowledge graphs within an open-access service. CROssBAR knowledge graphs incorporate relevant genes-proteins, molecular interactions, pathways, phenotypes, diseases, as well as known/predicted drugs and bioactive compounds, and they are constructed on-the-fly based on simple non-programmatic user queries. These intensely processed heterogeneous networks are expected to aid systems-level research, especially to infer biological mechanisms in relation to genes, proteins, their ligands, and diseases.

5.
Translational cancer genomics research aims to ensure that experimental knowledge is subject to computational analysis, and integrated with a variety of records from omics and clinical sources. The data retrieval from such sources is not trivial, due to their redundancy and heterogeneity, and the presence of false evidence. In silico marker identification, therefore, remains a complex task that is mainly motivated by the impact that target identification from the elucidation of gene co-expression dynamics and regulation mechanisms, combined with the discovery of genotype–phenotype associations, may have for clinical validation. Based on the reuse of publicly available gene expression data, our aim is to propose cancer marker classification by integrating the prediction power of multiple annotation sources. In particular, with reference to the functional annotation for colorectal markers, we indicate a classification of markers into diagnostic and prognostic classes combined with susceptibility and risk factors.

6.
MOTIVATION: A large fraction of biological research concentrates on individual proteins and on small families of proteins. One of the current major challenges in bioinformatics is to extend our knowledge to very large sets of proteins. Several major projects have tackled this problem. Such undertakings usually start with a process that clusters all known proteins or large subsets of this space. Some work in this area is carried out automatically, while other attempts incorporate expert advice and annotation. RESULTS: We propose a novel technique that automatically clusters protein sequences. We consider all proteins in SWISSPROT, and carry out an all-against-all BLAST similarity test among them. With this similarity measure in hand we proceed to perform a continuous bottom-up clustering process by applying alternative rules for merging clusters. The outcome of this clustering process is a classification of the input proteins into a hierarchy of clusters of varying degrees of granularity. Here we compare the clusters that result from alternative merging rules, and validate the results against InterPro. Our preliminary results show that clusters that are consistent with several rather than a single merging rule tend to comply with InterPro annotation. This is an affirmation of the view that the protein space consists of families that differ markedly in their evolutionary conservation.
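The bottom-up merging with alternative rules can be illustrated with a toy agglomerative procedure over a precomputed similarity matrix. Here "single" (best cross-pair) and "average" (mean over cross pairs) linkage stand in for the merging rules; the actual all-against-all BLAST pipeline is of course far larger:

```python
def agglomerate(sim, rule="average"):
    """Bottom-up clustering of items given a symmetric similarity matrix.

    `rule` scores cluster-to-cluster similarity when merging: 'single'
    (best cross pair) or 'average' (mean over all cross pairs).
    Returns the merge history as (cluster_a, cluster_b, score) triples,
    i.e. a hierarchy of clusters of increasing granularity.
    """
    clusters = [[i] for i in range(len(sim))]
    history = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                cross = [sim[a][b] for a in clusters[i] for b in clusters[j]]
                score = max(cross) if rule == "single" else sum(cross) / len(cross)
                if best is None or score > best[0]:
                    best = (score, i, j)
        score, i, j = best
        history.append((clusters[i][:], clusters[j][:], score))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return history
```

Clusters that appear in the histories of several different rules are the "consistent" ones the abstract refers to.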

7.
8.
Methods of blind source separation are used in many contexts to separate composite data sets according to their sources. Multiply labeled fluorescence microscopy images represent such sets, in which the sources are the individual labels. Their distributions are the quantities of interest and have to be extracted from the images. This is often challenging, since the recorded emission spectra of fluorescent dyes are environment- and instrument-specific. We have developed a nonnegative matrix factorization (NMF) algorithm to detect and separate spectrally distinct components of multiply labeled fluorescence images. It operates on spectrally resolved images and delivers both the emission spectra of the identified components and images of their abundance. We tested the proposed method using biological samples labeled with up to four spectrally overlapping fluorescent labels. In most cases, NMF accurately decomposed the images into contributions of individual dyes. However, the solutions are not unique when spectra overlap strongly or when images are diffuse in their structure. To arrive at satisfactory results in such cases, we extended NMF to incorporate preexisting qualitative knowledge about spectra and label distributions. We show how data acquired through excitations at two or three different wavelengths can be integrated and that multiple excitations greatly facilitate the decomposition. By allowing reliable decomposition in cases where the spectra of the individual labels are not known or are known only inaccurately, the proposed algorithms greatly extend the range of questions that can be addressed with quantitative microscopy.
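A minimal version of the core factorization, without the paper's extensions for prior knowledge or multiple excitations, is the classic Lee-Seung multiplicative-update NMF. Treating each pixel's recorded spectrum as a row of V, W gives per-pixel component abundances and H the emission spectra:

```python
import numpy as np

def nmf(V, k, iters=600, seed=0):
    """Factor a nonnegative matrix V (pixels x spectral channels) into
    abundances W (pixels x k) and component spectra H (k x channels)
    using Lee-Seung multiplicative updates for the Frobenius objective."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, k)) + 0.1
    H = rng.random((k, m)) + 0.1
    eps = 1e-12                       # guard against division by zero
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H
```

The updates keep both factors nonnegative by construction; the non-uniqueness the abstract mentions shows up as different (W, H) pairs reconstructing V equally well.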

9.

Background

It is a major challenge of computational biology to provide a comprehensive functional classification of all known proteins. Most existing methods seek recurrent patterns in known proteins based on manually-validated alignments of known protein families. Such methods can achieve high sensitivity, but are limited by the necessary manual labor. This makes our current view of the protein world incomplete and biased. This paper concerns ProtoNet, an automatic unsupervised global clustering system that generates a hierarchical tree of over 1,000,000 proteins, based solely on sequence similarity.

Results

In this paper we show that ProtoNet correctly captures functional and structural aspects of the protein world. Furthermore, a novel feature is an automatic procedure that reduces the tree to 12% of its original size. This procedure utilizes only parameters intrinsic to the clustering process. Despite the substantial reduction in size, the system's predictive power concerning biological functions is hardly affected. We then carry out an automatic comparison with existing functional protein annotations. Consequently, 78% of the clusters in the compressed tree (5,300 clusters) get assigned a biological function with high confidence. The clustering and compression processes are unsupervised and robust.

Conclusions

We present an automatically generated unbiased method that provides a hierarchical classification of all currently known proteins.

10.
Genes that encode glycosylphosphatidylinositol anchored proteins (GPI-APs) constitute an estimated 1-2% of eukaryote genomes. Current computational methods for the prediction of GPI-APs are sensitive and specific; however, the analysis of the processing site (omega- or omega-site) of GPI-APs is still challenging. Only 10% of the proteins that are annotated as GPI-APs have the omega-site experimentally verified. We describe an integrated computational and experimental proteomics approach for the identification and characterization of GPI-APs that provides the means to identify GPI-APs and the derived GPI-anchored peptides in LC-MS/MS data sets. The method takes advantage of sequence features of GPI-APs and the known core structure of the GPI-anchor. The first stage of the analysis encompasses LC-MS/MS based protein identification. The second stage involves prediction of the processing sites of the identified GPI-APs and prediction of the corresponding terminal tryptic peptides. The third stage calculates possible GPI structures on the peptides from stage two. The fourth stage calculates the scores by comparing the theoretical spectra of the predicted GPI-peptides against the observed MS/MS spectra. Automated identification of C-terminal GPI-peptides from porcine membrane dipeptidase, folate receptor and CD59 in complex LC-MS/MS data sets demonstrates the sensitivity and specificity of this integrated computational and experimental approach.
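Stage two, predicting the C-terminal tryptic peptide of the mature (omega-truncated) chain, can be sketched with the standard trypsin cleavage rule (cleave after K/R except before P). The omega-site index here is assumed to come from an upstream predictor; this is a generic sketch, not the authors' pipeline:

```python
def tryptic_peptides(seq):
    """Cleave after K or R, except when followed by P (standard trypsin rule)."""
    peptides, start = [], 0
    for i, aa in enumerate(seq):
        if aa in "KR" and (i + 1 == len(seq) or seq[i + 1] != "P"):
            peptides.append(seq[start:i + 1])
            start = i + 1
    if start < len(seq):
        peptides.append(seq[start:])          # C-terminal remainder
    return peptides

def gpi_terminal_peptide(seq, omega):
    """C-terminal tryptic peptide of the mature GPI-anchored chain, i.e. of
    the sequence truncated at the predicted omega-site (0-based, inclusive)."""
    return tryptic_peptides(seq[:omega + 1])[-1]
```

Stages three and four would then enumerate GPI-anchor variants on this peptide and score theoretical against observed MS/MS spectra.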

11.
Functional classification of proteins from sequences alone has become a critical bottleneck in understanding the myriad of protein sequences that accumulate in our databases. The great diversity of homologous sequences hides, in many cases, a variety of functional activities that cannot be anticipated. Their identification appears critical for a fundamental understanding of the evolution of living organisms and for biotechnological applications. ProfileView is a sequence-based computational method, designed to functionally classify sets of homologous sequences. It relies on two main ideas: the use of multiple profile models whose construction explores evolutionary information in available databases, and a novel definition of a representation space in which to analyze sequences with multiple profile models combined together. ProfileView classifies protein families by enriching known functional groups with new sequences and discovering new groups and subgroups. We validate ProfileView on seven classes of widespread proteins involved in the interaction with nucleic acids, amino acids and small molecules, and in a large variety of functions and enzymatic reactions. ProfileView agrees with the large set of functional data collected for these proteins from the literature regarding the organization into functional subgroups and residues that characterize the functions. In addition, ProfileView resolves undefined functional classifications and extracts the molecular determinants underlying protein functional diversity, showing its potential to select sequences towards accurate experimental design and discovery of novel biological functions. On protein families with complex domain architecture, ProfileView functional classification reconciles domain combinations, unlike phylogenetic reconstruction. 
ProfileView outperforms the functional classification approach PANTHER, the two k-mer-based methods CUPP and eCAMI, and a neural network approach based on Restricted Boltzmann Machines, and it overcomes the time-complexity limitations of the latter.

12.
Data integration is key to functional and comparative genomics because integration allows diverse data types to be evaluated in new contexts. To achieve data integration in a scalable and sensible way, semantic standards are needed, both for naming things (standardized nomenclatures, use of key words) and also for knowledge representation. The Mouse Genome Informatics database and other model organism databases help to close the gap between information and understanding of biological processes because these resources enforce well-defined nomenclature and knowledge representation standards. Model organism databases have a critical role to play in ensuring that diverse kinds of data, especially genome-scale data sets and information, remain useful to the biological community in the long-term. The efforts of model organism database groups ensure not only that organism-specific data are integrated, curated and accessible but also that the information is structured in such a way that comparison of biological knowledge across model organisms is facilitated.

13.
Cell membrane proteins play an important role in tissue architecture and cell-cell communication. We hypothesize that segmentation and multidimensional characterization of the distribution of cell membrane proteins, on a cell-by-cell basis, enable improved classification of treatment groups and identify important characteristics that can otherwise be hidden. We have developed a series of computational steps to 1) delineate cell membrane protein signals and associate them with a specific nucleus; 2) compute a coupled representation of the multiplexed DNA content with membrane proteins; 3) rank computed features associated with such a multidimensional representation; 4) visualize selected features for comparative evaluation through heatmaps; and 5) discriminate between treatment groups in an optimal fashion. The novelty of our method is in the segmentation of the membrane signal and the multidimensional representation of phenotypic signature on a cell-by-cell basis. To test the utility of this method, the proposed computational steps were applied to images of cells that have been irradiated with different radiation qualities in the presence and absence of other small molecules. These samples are labeled for their DNA content and E-cadherin membrane proteins. We demonstrate that multidimensional representations of cell-by-cell phenotypes improve predictive and visualization capabilities among different treatment groups, and identify hidden variables.

14.
Building structural models of entire cells has been a long-standing cross-discipline challenge for the research community, as it requires an unprecedented level of integration between multiple sources of biological data and enhanced methods for computational modeling and visualization. Here, we present the first 3D structural models of an entire Mycoplasma genitalium (MG) cell, built using the CellPACK suite of computational modeling tools. Our model recapitulates the data described in recent whole-cell system biology simulations and provides a structural representation for all MG proteins, DNA and RNA molecules, obtained by combining experimental and homology-modeled structures and lattice-based models of the genome. We establish a framework for gathering, curating and evaluating these structures, exposing current weaknesses of modeling methods and the boundaries of MG structural knowledge, and visualization methods to explore functional characteristics of the genome and proteome. We compare two approaches for data gathering, a manually-curated workflow and an automated workflow that uses homologous structures, both of which are appropriate for the analysis of mesoscale properties such as crowding and volume occupancy. Analysis of model quality provides estimates of the regularization that will be required when these models are used as starting points for atomic molecular dynamics simulations.

15.
MOTIVATION: Structural genomics projects aim to solve a large number of protein structures with the ultimate objective of representing the entire protein space. The computational challenge is to identify and prioritize a small set of proteins with new, currently unknown, superfamilies or folds. RESULTS: We develop a method that assigns each protein a likelihood of it belonging to a new, yet undetermined, structural superfamily. The method relies on a variant of ProtoNet, an automatic hierarchical classification scheme of all protein sequences from SwissProt. Our results show that proteins that are remote from solved structures in the ProtoNet hierarchy are more likely to belong to new superfamilies. The results are validated against SCOP releases from recent years that account for about half of the solved structures known to date. We show that our new method and the representation of ProtoNet are superior in detecting new targets, compared to our previous method using ProtoMap classification. Furthermore, our method outperforms PSI-BLAST search in detecting potential new superfamilies.

16.

Background

Network-based approaches for the analysis of large-scale genomics data have become well established. Biological networks provide a knowledge scaffold against which the patterns and dynamics of ‘omics’ data can be interpreted. The background information required for the construction of such networks is often dispersed across a multitude of knowledge bases in a variety of formats. The seamless integration of this information is one of the main challenges in bioinformatics. The Semantic Web offers powerful technologies for the assembly of integrated knowledge bases that are computationally comprehensible, thereby providing a potentially powerful resource for constructing biological networks and network-based analysis.

Results

We have developed the Gene eXpression Knowledge Base (GeXKB), a semantic web technology based resource that contains integrated knowledge about gene expression regulation. To affirm the utility of GeXKB we demonstrate how this resource can be exploited for the identification of candidate regulatory network proteins. We present four use cases that were designed from a biological perspective in order to find candidate members relevant for the gastrin hormone signaling network model. We show how a combination of specific query definitions and additional selection criteria derived from gene expression data and prior knowledge concerning candidate proteins can be used to retrieve a set of proteins that constitute valid candidates for regulatory network extensions.

Conclusions

Semantic web technologies provide the means for processing and integrating various heterogeneous information sources. The GeXKB offers biologists such an integrated knowledge resource, allowing them to address complex biological questions pertaining to gene expression. This work illustrates how GeXKB can be used in combination with gene expression results and literature information to identify new potential candidates that may be considered for extending a gene regulatory network.

Electronic supplementary material

The online version of this article (doi:10.1186/s12859-014-0386-y) contains supplementary material, which is available to authorized users.

17.
Mass spectrometry (MS) is a technique widely used in biological studies: it associates a spectrum with a biological sample. A spectrum consists of pairs of values (intensity, m/z), where the intensity measures the abundance of biomolecules (such as proteins) with a given mass-to-charge ratio (m/z) in the originating sample. In proteomics experiments, MS spectra are used to identify expression patterns in clinical samples that may be responsible for diseases. Recently, to improve the identification of peptides/proteins related to such patterns, the MS/MS process has been adopted, which performs cascades of mass spectrometric analyses on selected peaks. This technique has been shown to improve the identification and quantification of proteins/peptides in samples. Nevertheless, MS analysis deals with huge amounts of data, often affected by noise, and thus requires automatic data management systems. Tools have been developed, often supplied with the instruments, that allow: (i) spectra analysis and visualization, (ii) pattern recognition, (iii) protein database querying, and (iv) peptide/protein quantification and identification. Currently, most of the tools supporting these phases need to be optimized to improve the protein (and protein-function) identification processes. In this article we survey applications that support spectrometrists and biologists in obtaining information from biological samples, analyzing the available software for the different phases. We consider different mass spectrometry techniques, and thus different requirements. We focus on tools for (i) data preprocessing, which prepares the results obtained from spectrometers for analysis; (ii) spectra analysis, representation and mining, aimed at identifying common and/or hidden patterns in spectra sets or at classifying data; (iii) database querying to identify peptides; and (iv) improving and boosting the identification and quantification of selected peaks. We outline some open problems and report on requirements that represent new challenges for bioinformatics.
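Phase (i), data preprocessing, can be illustrated with a deliberately naive peak picker: a point is reported as a peak when it is the maximum of its local window and clears an intensity threshold. Real tools add smoothing, baseline subtraction and centroiding; this is only a sketch of the idea:

```python
def pick_peaks(mz, intensity, window=2, min_intensity=0.0):
    """Naive peak picking on a (mz, intensity) spectrum: a point is a peak
    if it is the maximum of its local window and above a threshold.
    Flat plateaus may be reported once per point (acceptable for a sketch)."""
    peaks = []
    n = len(intensity)
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        if intensity[i] >= min_intensity and intensity[i] == max(intensity[lo:hi]):
            peaks.append((mz[i], intensity[i]))
    return peaks
```

The peak list, rather than the raw trace, is what downstream phases (pattern mining, database querying) typically consume.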

18.
SUMMARY: Genomic Analysis and Rapid Biological ANnotation (GARBAN) is a new tool that provides an integrated framework to analyze simultaneously and compare multiple data sets derived from microarray or proteomic experiments. It carries out automated classifications of genes or proteins according to the criteria of the Gene Ontology Consortium at a level of depth defined by the user. Additionally, it performs clustering analysis of all sets based on functional categories or on differential expression levels. GARBAN also provides graphical representations of the biological pathways in which all the genes/proteins participate. AVAILABILITY: http://garban.tecnun.es.

19.
Novel and improved computational tools are required to transform large-scale proteomics data into valuable information of biological relevance. To this end, we developed ProteoConnections, a bioinformatics platform tailored to address the pressing needs of proteomics analyses. The primary focus of this platform is to organize peptide and protein identifications, evaluate the quality of the acquired data set, profile abundance changes, and accelerate data interpretation. Peptide and protein identifications are stored into a relational database to facilitate data mining and to evaluate the quality of data sets using graphical reports. We integrated databases of known PTMs and other bioinformatics tools to facilitate the analysis of phosphoproteomics data sets and to provide insights for subsequent biological validation experiments. Phosphorylation sites are also annotated according to kinase consensus motifs, contextual environment, protein domains, binding motifs, and evolutionary conservation across different species. The practical application of ProteoConnections is further demonstrated for the analysis of the phosphoproteomics data sets from rat intestinal IEC-6 cells where we identified 9615 phosphorylation sites on 2108 phosphoproteins. Combined proteomics and bioinformatics analyses revealed valuable biological insights on the regulation of phosphoprotein functions via the introduction of new binding sites on scaffold proteins or the modulation of protein-protein, protein-DNA, or protein-RNA interactions. Quantitative proteomics data can be integrated into ProteoConnections to determine the changes in protein phosphorylation under different cell stimulation conditions or kinase inhibitors, as demonstrated here for the MEK inhibitor PD184352.
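The kinase consensus-motif annotation mentioned above can be sketched with regular expressions. The motif table below is a tiny hypothetical subset chosen for illustration (real resources define many more, richer motifs); the matching logic is generic, not ProteoConnections' implementation:

```python
import re

# Hypothetical minimal motif table: (regex, offset of the phosphoacceptor
# S/T within the match). '.' matches any residue.
CONSENSUS = {
    "PKA": (r"R.[ST]", 2),      # R-x-S/T
    "CK2": (r"[ST]..E", 0),     # S/T-x-x-E
    "CDK": (r"[ST]P.[KR]", 0),  # S/T-P-x-K/R (proline-directed)
}

def annotate_site(sequence, pos):
    """List kinases whose consensus motif fits around the 0-based
    phosphosite `pos` (must be an S or T in `sequence`)."""
    if sequence[pos] not in "ST":
        return []
    hits = []
    for kinase, (pattern, offset) in CONSENSUS.items():
        start = pos - offset
        if start >= 0 and re.match(pattern, sequence[start:]):
            hits.append(kinase)
    return hits
```

A site can match several motifs at once, which is why such annotations are treated as candidate kinase assignments rather than identifications.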

20.
With numerous whole genomes now in hand, and experimental data about genes and biological pathways on the increase, a systems approach to biological research is becoming essential. Ontologies provide a formal representation of knowledge that is amenable to computational as well as human analysis, an obvious underpinning of systems biology. Mapping function to gene products in the genome consists of two, somewhat intertwined enterprises: ontology building and ontology annotation. Ontology building is the formal representation of a domain of knowledge; ontology annotation is association of specific genomic regions (which we refer to simply as 'genes', including genes and their regulatory elements and products such as proteins and functional RNAs) to parts of the ontology. We consider two complementary representations of gene function: the Gene Ontology (GO) and pathway ontologies. GO represents function from the gene's eye view, in relation to a large and growing context of biological knowledge at all levels. Pathway ontologies represent function from the point of view of biochemical reactions and interactions, which are ordered into networks and causal cascades. The more mature GO provides an example of ontology annotation: how conclusions from the scientific literature and from evolutionary relationships are converted into formal statements about gene function. Annotations are made using a variety of different types of evidence, which can be used to estimate the relative reliability of different annotations.


Copyright © Beijing Qinyun Technology Development Co., Ltd. (北京勤云科技发展有限公司) · 京ICP备09084417号