Similar literature
1.
MOTIVATION: Protein sequence clustering has been widely exploited to facilitate in-depth analysis of protein functions and families. For some applications of protein sequence clustering, it is highly desirable that a hierarchical structure, also referred to as dendrogram, which shows how proteins are clustered at various levels, is generated. However, as the sizes of contemporary protein databases continue to grow at rapid rates, it is of great interest to develop some summarization mechanisms so that the users can browse the dendrogram and/or search for the desired information more effectively. RESULTS: In this paper, the design of a novel incremental clustering algorithm aimed at generating summarized dendrograms for analysis of protein databases is described. The proposed incremental clustering algorithm employs a statistics-based model to summarize the distributions of the similarity scores among the proteins in the database and to control formation of clusters. Experimental results reveal that, due to the summarization mechanism incorporated, the proposed incremental clustering algorithm offers the users highly concise dendrograms for analysis of protein clusters with biological significance. Another distinction of the proposed algorithm is its incremental nature. As the sizes of the contemporary protein databases continue to grow at fast rates, due to the concern of efficiency, it is desirable that cluster analysis of a protein database can be carried out incrementally, when the protein database is updated. Experimental results with the Swiss-Prot protein database reveal that the time complexity for carrying out incremental clustering with k new proteins added into the database containing n proteins is O(n^(2β) log n), where β ≅ 0.865, provided that k < n. AVAILABILITY: The Linux executable is available on the following supplementary page.

2.
We use a geodatabase to investigate the distribution patterns of an important subset of floristic reports recorded for the Parco Nazionale delle Foreste Casentinesi, Monte Falterona, Campigna in the northern Apennines, Italy. This database was analysed using spatial statistical techniques and a digital elevation model. Significant relationships between species presence, sampling effort and species richness were then analysed in relation to topographical variables and to an existing vegetation map. Report-based rarefaction techniques were used to compare areas having different numbers of species recorded. Overall, the analysis shows that some areas of the park are richer in species of conservation interest than others, and that these have been more intensely investigated. Meanwhile, for other areas, botanical knowledge is scarce or even absent. This has led to clustering and redundancy of floristic data in some areas. The study confirms that the existence of a complete and up-to-date geodatabase creates a valuable resource which enables information gaps to be bridged. Such gaps often exist in biological databases for rare and narrowly distributed species. The wider application of these analyses should also give useful indications of how the incidences of these species of conservation interest are associated with particular environmental variables.

3.
4.
MOTIVATION: The development of an annotated global database suitable for a wide range of investigations is a challenging and labor-intensive task. Thus, the development of databases tailored for specific applications remains necessary. For example, in the field of toxicology, no annotated gene array databases are currently available that may assist in the correlation of changes in gene activity to cellular functions and processes associated with the toxic response. RESULTS: As an example of a tailored annotated database, an attempt was made to systematize available biological information on genes present on the Affymetrix Rat Toxicology U34 GeneChip, with a focus on how the gene products relate to liver cells and their response to chemical toxins. The information collected was embedded in a local relational database to analyze data obtained in toxicological gene array experiments with hydrazine-exposed hepatocytes. The advantages and benefits of the tailored database in the biological interpretation of the results are demonstrated.

5.
6.
7.
Mégy K  Audic S  Claverie JM 《Genome biology》2002,3(9):preprint00-3

Background  

Cardiovascular diseases are the leading cause of death worldwide, particularly in the developed countries; the identification of genes specifically expressed in the cardiac muscle is thus of major biomedical interest. In this study, we performed a comprehensive analysis of the expression profiles to identify genes over-expressed in the human adult heart using the public Expressed Sequence Tags (ESTs) database. The initial set of genes expressed in the heart was constructed by clustering and assembling ESTs from human adult heart cDNA libraries. Expression profiles were then generated for each of these genes by counting their cognate ESTs in all libraries. Differential expression was assessed by applying a previously published statistical procedure to these profiles.

8.
Yu C  Zavaljevski N  Desai V  Reifman J 《Proteins》2009,74(2):449-460
In this article, we present a new method termed CatFam (Catalytic Families) to automatically infer the functions of catalytic proteins, which account for 20-40% of all proteins in living organisms and play a critical role in a variety of biological processes. CatFam is a sequence-based method that generates sequence profiles to represent and infer protein catalytic functions. CatFam generates profiles through a stepwise procedure that carefully controls profile quality and employs nonenzymes as negative samples to establish profile-specific thresholds associated with a predefined nominal false-positive rate (FPR) of predictions. The adjustable FPR allows for fine precision control of each profile and enables the generation of profile databases that meet different needs: function annotation with high precision and hypothesis generation with moderate precision but better recall. Multiple tests of CatFam databases (generated with distinct nominal FPRs) against enzyme and nonenzyme datasets show that the method's predictions have consistently high precision and recall. For example, a 1% FPR database predicts protein catalytic functions for a dataset of enzymes and nonenzymes with 98.6% precision and 95.0% recall. Comparisons of CatFam databases against other established profile-based methods for the functional annotation of 13 bacterial genomes indicate that CatFam consistently achieves higher precision and (in most cases) higher recall, and that (on average) CatFam provides 21.9% additional catalytic functions not inferred by the other similarly reliable methods. These results strongly suggest that the proposed method provides a valuable contribution to the automated prediction of protein catalytic functions. The CatFam databases and the database search program are freely available at http://www.bhsai.org/downloads/catfam.tar.gz.

9.
10.
Computational interactomics deals with prediction of functionally related proteins. One approach for solving this problem using comparative genomics consists of analyzing similarities between phylogenetic profiles of proteins. In contrast to most methods, which predict only pairwise interactions between proteins, in the present work we have applied cluster analysis techniques in order to find modules of functionally related proteins. We have performed the cluster analysis of phylogenetic profiles of E. coli proteins using several clustering techniques and various modes for estimation of distances between profiles. We report here that the best correspondence in the composition of resultant clusters to known metabolic pathways is achieved using Ward’s clustering together with Hamming’s distance. The proposed technique of assessing predictions of the modules of functionally related proteins can be used for comparative analysis of different algorithms for computational interactomics.
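
The distance step mentioned in this abstract can be illustrated with a minimal sketch. It assumes that a phylogenetic profile is a binary presence/absence vector across genomes, which is the usual convention but is not spelled out in the abstract. The protein names and profile values are hypothetical, and the Ward clustering step itself is not shown.

```python
def hamming_distance(profile_a, profile_b):
    """Count the genomes in which exactly one of the two proteins is present."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same set of genomes")
    return sum(a != b for a, b in zip(profile_a, profile_b))


def pairwise_distances(profiles):
    """Return {(name_x, name_y): distance} for every unordered pair of proteins."""
    names = sorted(profiles)
    return {
        (x, y): hamming_distance(profiles[x], profiles[y])
        for i, x in enumerate(names)
        for y in names[i + 1:]
    }


# Hypothetical presence/absence profiles across six genomes (1 = homolog present).
profiles = {
    "protA": [1, 1, 0, 0, 1, 0],
    "protB": [1, 1, 0, 0, 1, 1],
    "protC": [0, 0, 1, 1, 0, 1],
}

distances = pairwise_distances(profiles)
```

In this toy input, protA and protB differ in a single genome, so they would be natural candidates for the same module, while protC has the opposite pattern in most genomes.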

11.
12.
The KEGG databases at GenomeNet
The Kyoto Encyclopedia of Genes and Genomes (KEGG) is the primary database resource of the Japanese GenomeNet service (http://www.genome.ad.jp/) for understanding higher order functional meanings and utilities of the cell or the organism from its genome information. KEGG consists of the PATHWAY database for the computerized knowledge on molecular interaction networks such as pathways and complexes, the GENES database for the information about genes and proteins generated by genome sequencing projects, and the LIGAND database for the information about chemical compounds and chemical reactions that are relevant to cellular processes. In addition to these three main databases, limited amounts of experimental data for microarray gene expression profiles and yeast two-hybrid systems are stored in the EXPRESSION and BRITE databases, respectively. Furthermore, a new database, named SSDB, is available for exploring the universe of all protein coding genes in the complete genomes and for identifying functional links and ortholog groups. The data objects in the KEGG databases are all represented as graphs and various computational methods are developed to detect graph features that can be related to biological functions. For example, the correlated clusters are graph similarities which can be used to predict a set of genes coding for a pathway or a complex, as summarized in the ortholog group tables, and the cliques in the SSDB graph are used to annotate genes. The KEGG databases are updated daily and made freely available (http://www.genome.ad.jp/kegg/).

13.
Specific features of energy confinement scalings constructed using different experimental databases for tokamak plasmas are considered. In the multimachine database, some pairs of engineering variables are collinear; e.g., the current I and the input power P both increase with increasing minor radius a. As a result, scalings derived from this database are reliable only for discharges in which such ratios as I/a^2 or P/a^2 are close to their values averaged over the database. The collinearity of variables allows one to exclude the normalized Debye radius d* from the scaling expressed in a nondimensional form. In one-machine databases, the dimensionless variables are functionally dependent, which allows one to cast a scaling without d*. In a database combined from two devices, the collinearity may be absent, so the Debye radius cannot generally be excluded from the scaling. It is shown that the experiments performed in support of the absence of d* in the two-machine scaling are unconvincing. Transformation expressions are given that allow one to compare experiments for the determination of scaling in any set of independent variables.

14.
Human dipeptidyl peptidase IV (hDPP-IV) plays an important role in the inactivation of glucagon-like peptide-1, which is related to type 2 diabetes. One approach to treatment is the development of small hDPP-IV inhibitors. In order to design better inhibitors, we analyzed 5-(aminomethyl)-6-(2,4-dichlorophenyl)-2-(3,5-dimethoxyphenyl)pyrimidin-4-amine and a set of 24 molecules found in the BindingDB web database for model design. The analysis of their molecular properties allowed the design of a multiple linear regression model for activity prediction. Their docking analysis allowed visualization of the interactions between the pharmacophore regions and hDPP-IV. After both analyses were performed, we proposed a set of nine molecules in order to predict their activity. Four of them displayed promising activity and were therefore subjected to docking as well as pharmacokinetic and toxicological studies. Two compounds from the proposed set showed suitable pharmacokinetic and toxicological characteristics and were therefore considered promising for future synthesis and in vitro studies.

15.
To develop a useful fermentation process model, it is first necessary to identify which batch operating parameters are critical in determining the process outcome. To identify critical processing inputs in large databases, we have explored the use of Decision Tree Analysis with the decision metrics of Gain (i.e., Shannon Entropy changes), Gain Ratio, and a multiple hypergeometric distribution. The usefulness of this approach lies in its ability to treat "categorical" variables, which are typical of archived fermentation databases, as well as "continuous" variables. In this work, we demonstrate the use of Decision Tree Analysis for the problem of optimizing recombinant green fluorescent protein production in E. coli. A database of 85 fermentations was generated to examine the effect of 15 process input parameters on final biomass yield, maximum recombinant protein concentration, and productivity. The use of Decision Tree Analysis led to a considerable reduction in the fermentation database through the identification of the significant as well as insignificant inputs. However, different decision metrics selected different inputs and different numbers of inputs to classify the data for each output.
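
Two of the decision metrics named in this abstract, Gain and Gain Ratio, can be sketched for a single categorical input. This is the textbook formulation (entropy reduction, and that reduction divided by the entropy of the split itself), not necessarily the exact procedure used in the study. The hypergeometric metric is not shown, and the input names and values below are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(items):
    """Shannon entropy (in bits) of a list of categorical values."""
    total = len(items)
    return -sum((n / total) * log2(n / total) for n in Counter(items).values())


def gain_and_gain_ratio(values, labels):
    """Information gain and gain ratio of splitting `labels` by `values`."""
    total = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    # Weighted entropy that remains after splitting on the input.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    gain = entropy(labels) - remainder
    split_info = entropy(values)  # penalizes inputs with many distinct values
    ratio = gain / split_info if split_info > 0 else 0.0
    return gain, ratio


# Hypothetical archive of four fermentation runs.
outcome = ["high", "high", "low", "low"]   # e.g. final biomass yield class
medium = ["A", "A", "B", "B"]              # input that fully explains the outcome
operator = ["x", "y", "x", "y"]            # input unrelated to the outcome

medium_gain, medium_ratio = gain_and_gain_ratio(medium, outcome)
operator_gain, operator_ratio = gain_and_gain_ratio(operator, outcome)
```

Here the `medium` input removes all uncertainty about the outcome, while `operator` removes none, so a tree built on either metric would split on `medium` first.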

16.
Nucleotide sequence and taxonomy reference databases are critical resources for widespread applications including marker-gene and metagenome sequencing for microbiome analysis, diet metabarcoding, and environmental DNA (eDNA) surveys. Reproducibly generating, managing, using, and evaluating nucleotide sequence and taxonomy reference databases creates a significant bottleneck for researchers aiming to generate custom sequence databases. Furthermore, database composition drastically influences results, and lack of standardization limits cross-study comparisons. To address these challenges, we developed RESCRIPt, a Python 3 software package and QIIME 2 plugin for reproducible generation and management of reference sequence taxonomy databases, including dedicated functions that streamline creating databases from popular sources, and functions for evaluating, comparing, and interactively exploring qualitative and quantitative characteristics across reference databases. To highlight the breadth and capabilities of RESCRIPt, we provide several examples for working with popular databases for microbiome profiling (SILVA, Greengenes, NCBI-RefSeq, GTDB), eDNA and diet metabarcoding surveys (BOLD, GenBank), as well as for genome comparison. We show that bigger is not always better, and reference databases with standardized taxonomies and those that focus on type strains have quantitative advantages, though may not be appropriate for all use cases. Most databases appear to benefit from some curation (quality filtering), though sequence clustering appears detrimental to database quality. Finally, we demonstrate the breadth and extensibility of RESCRIPt for reproducible workflows with a comparison of global hepatitis genomes. RESCRIPt provides tools to democratize the process of reference database acquisition and management, enabling researchers to reproducibly and transparently create reference materials for diverse research applications. 
RESCRIPt is released under a permissive BSD-3 license at https://github.com/bokulich-lab/RESCRIPt.

17.

In systems biology, study of a complex and multicomponent system, such as morphogenesis, comprises accumulation of data on morphogenetic processes in databases, classification and logical analysis of this information, and computer simulation of the processes in question using the data accumulated and the results of their analysis. This paper describes realization of the first steps in a systems study of morphogenesis (annotating research papers, compiling information in a database, data systematization, and their logical analysis) by the example of Arabidopsis thaliana, a model object in plant molecular biology. The database AGNS (Arabidopsis GeneNet Supplementary; http://wwwmgs.bionet.nsc.ru/agns) contains the experimentally confirmed information from published papers on specific features of gene expression and phenotypes of wild-type, mutant, and transgenic A. thaliana plants. AGNS queries and logical data analysis with the aid of specially developed software makes it possible to model various morphogenetic processes from gene expression to functioning of gene networks and their contribution to the development of certain traits.


18.
Next‐generation DNA sequencing has enabled a rapid expansion in the size of molecular fungal ecology studies employing the nuclear internal transcribed spacer (ITS) region. Many sequence‐processing pipelines and protocols require sequence clustering to generate operational taxonomic units (OTUs) based on sequence similarity as a step to reduce total data quantity and complexity prior to taxonomic assignment. However, the consequences of ITS sequence clustering in regard to sample taxonomic coverage have not been carefully examined. Here we demonstrate that typically used clustering thresholds for fungal ITS sequences result in statistically significant losses in taxonomic coverage. Analyses using environmentally derived fungal sequences indicated an average of 3.1% of species went undetected (P < 0.05) if the sequences were denoised and clustered at a 97% threshold prior to taxonomic assignment. Additionally, an in silico analysis using a reference fungal ITS database suggested that approximately 25% of species went undetected if the sequences were clustered prior to taxonomic assignment. Finally, analysis of sequences derived from pure‐cultured fungal isolates of known identity indicated sequence denoising and clustering were not critical in improving identification accuracy.

19.
Multiple myeloma (MM) is a common hematologic malignancy for which the underlying molecular mechanisms remain largely unclear. This study aimed to elucidate key candidate genes and pathways in MM by integrated bioinformatics analysis. Expression profiles GSE6477 and GSE47552 were obtained from the Gene Expression Omnibus database, and differentially expressed genes (DEGs) with p < .05 and |logFC| > 1 were identified. Functional enrichment, protein–protein interaction network construction and survival analyses were then performed. First, 51 upregulated and 78 downregulated DEGs shared between the two GSE datasets were identified. Second, functional enrichment analysis showed that these DEGs are mainly involved in the B cell receptor signaling pathway, hematopoietic cell lineage, and NF-kappa B pathway. Moreover, interrelation analysis of immune system processes showed enrichment of the downregulated DEGs mainly in B cell differentiation, positive regulation of monocyte chemotaxis and positive regulation of T cell proliferation. Finally, the correlation between DEG expression and survival in MM was evaluated using the PrognoScan database. In conclusion, we identified key candidate genes that affect the outcomes of patients with MM, and these genes might serve as potential therapeutic targets.
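
The DEG selection rule in this abstract (a p-value below .05 and an absolute log fold change above 1, with DEGs shared between the two datasets) can be sketched as follows. The gene names and statistics are hypothetical placeholders rather than values from GSE6477 or GSE47552, and requiring the same direction of change in both datasets is an assumption about what "shared" means.

```python
def classify_deg(log_fc, p_value, p_cutoff=0.05, lfc_cutoff=1.0):
    """Label a gene 'up', 'down', or 'not significant' under the stated cutoffs."""
    if p_value >= p_cutoff or abs(log_fc) <= lfc_cutoff:
        return "not significant"
    return "up" if log_fc > 0 else "down"


def shared_degs(dataset_a, dataset_b):
    """Genes that are significant in both datasets with the same direction."""
    shared = {}
    for gene, (log_fc, p_value) in dataset_a.items():
        if gene not in dataset_b:
            continue
        call_a = classify_deg(log_fc, p_value)
        call_b = classify_deg(*dataset_b[gene])
        if call_a == call_b and call_a != "not significant":
            shared[gene] = call_a
    return shared


# Hypothetical per-gene results: gene -> (logFC, p-value).
dataset_a = {"GENE1": (2.3, 0.001), "GENE2": (-1.8, 0.01), "GENE3": (0.4, 0.001)}
dataset_b = {"GENE1": (1.5, 0.02), "GENE2": (-1.2, 0.20), "GENE3": (0.6, 0.03)}

result = shared_degs(dataset_a, dataset_b)
```

Only GENE1 passes both cutoffs in both datasets: GENE2 fails the p-value cutoff in the second dataset, and GENE3 has too small a fold change in both.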

20.
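[Objective] To clarify how salt stress at different concentrations affects the growth, development, and signaling pathways of Dunaliella salina by sequencing its transcriptome and analyzing gene function. [Methods] Transcriptomes of D. salina cultured in 9% NaCl and in 24% NaCl were obtained and sequenced on the Illumina platform. The resulting sequences were assembled and redundancy was removed. [Results] A total of 40,682 unigenes were obtained, of which 10,905 were annotated in the NR database, 2,768 in the NT database, 7,261 in the SWISS-PROT database, and 6,499 in the COG/KOG database. Compared with cells in the low-salt environment, D. salina cells under high-salt stress showed 717 upregulated genes and 1,012 downregulated genes. Functional clustering of 60 significantly differentially expressed genes further showed that salt stress induced the expression of genes in the photosynthesis pathway. [Conclusion] D. salina enhances its salt tolerance by increasing the expression of photosynthesis genes. This study broadly surveyed gene transcription levels of D. salina in high-salt and low-salt environments, provides a platform for further revealing differential gene expression under salt stress, and offers a theoretical basis for further study of the salt-tolerance mechanism of D. salina.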
