首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 31 毫秒
1.
MOTIVATION: Despite the growing literature devoted to finding differentially expressed genes in assays probing different tissues types, little attention has been paid to the combinatorial nature of feature selection inherent to large, high-dimensional gene expression datasets. New flexible data analysis approaches capable of searching relevant subgroups of genes and experiments are needed to understand multivariate associations of gene expression patterns with observed phenotypes. RESULTS: We present in detail a deterministic algorithm to discover patterns of multivariate gene associations in gene expression data. The patterns discovered are differential with respect to a control dataset. The algorithm is exhaustive and efficient, reporting all existent patterns that fit a given input parameter set while avoiding enumeration of the entire pattern space. The value of the pattern discovery approach is demonstrated by finding a set of genes that differentiate between two types of lymphoma. Moreover, these genes are found to behave consistently in an independent dataset produced in a different laboratory using different arrays, thus validating the genes selected using our algorithm. We show that the genes deemed significant in terms of their multivariate statistics will be missed using other methods. AVAILABILITY: Our set of pattern discovery algorithms including a user interface is distributed as a package called Genes@Work. This package is freely available to non-commercial users and can be downloaded from our website (http://www.research.ibm.com/FunGen).  相似文献   

2.
The problem of detecting DNA motifs with functional relevance in real biological sequences is difficult due to a number of biological, statistical and computational issues and also because of the lack of knowledge about the structure of searched patterns. Many algorithms are implemented in fully automated processes, which are often based upon a guess of input parameters from the user at the very first step. In this paper, we present a novel method for the detection of seeded DNA motifs, composed by regions with a different extent of variability. The method is based on a multi-step approach, which was implemented in a motif searching web tool (MOST). Overrepresented exact patterns are extracted from input sequences and clustered to produce motifs core regions, which are then extended and scored to generate seeded motifs. The combination of automated pattern discovery algorithms and different display tools for the evaluation and selection of results at several analysis steps can potentially lead to much more meaningful results than complete automation can produce. Experimental results on different yeast and human real datasets proved the methodology to be a promising solution for finding seeded motifs. MOST web tool is freely available at http://telethon.bio.unipd.it/bioinfo/MOST.  相似文献   

3.
4.
Complementarity-determining regions (CDRs) are antibody loops that make up the antigen binding site. Here, we show that all CDR types have structurally similar loops of different lengths. Based on these findings, we created length-independent canonical classes for the non-H3 CDRs. Our length variable structural clusters show strong sequence patterns suggesting either that they evolved from the same original structure or result from some form of convergence. We find that our length-independent method not only clusters a larger number of CDRs, but also predicts canonical class from sequence better than the standard length-dependent approach.

To demonstrate the usefulness of our findings, we predicted cluster membership of CDR-L3 sequences from 3 next-generation sequencing datasets of the antibody repertoire (over 1,000,000 sequences). Using the length-independent clusters, we can structurally classify an additional 135,000 sequences, which represents a ~20% improvement over the standard approach. This suggests that our length-independent canonical classes might be a highly prevalent feature of antibody space, and could substantially improve our ability to accurately predict the structure of novel CDRs identified by next-generation sequencing.  相似文献   


5.
新疆天山雪莲(Sasussured involucrata)具有较高的极端低温耐受特性,为低温耐受机制研究提供了一种非常好的模式植物。新疆天山雪莲转录组注解知识库(http://www.shengtingbiology.com/Saussurea KBase/index.jsp)是基于网络数据资源的综合性数据库,由html、Perl、Perl CGI/DBD/DBI、Java和Java Script编程所设计的前端界面和用于数据存取、注释及管理的后端数据库管理系统Postgrel SQL构成。知识库包含基因组数据、转录组原始数据、质量控制数据、GC含量、功能基因序列及注释、功能基因代谢通路、功能基因的注释统计、雪莲与其它物种的转录组或基因组比较分析数据和生物分析软件包等资源。该数据库不仅有利于低温功能基因组学及低温耐受机制研究,而且为冷耐受性状物种的分子育种提供基因资源平台和理论依据。  相似文献   

6.
SUMMARY: VizRank is a tool that finds interesting two-dimensional projections of class-labeled data. When applied to multi-dimensional functional genomics datasets, VizRank can systematically find relevant biological patterns. AVAILABILITY: http://www.ailab.si/supp/bi-vizrank SUPPLEMENTARY INFORMATION: http://www.ailab.si/supp/bi-vizrank.  相似文献   

7.
The large variety of clustering algorithms and their variants can be daunting to researchers wishing to explore patterns within their microarray datasets. Furthermore, each clustering method has distinct biases in finding patterns within the data, and clusterings may not be reproducible across different algorithms. A consensus approach utilizing multiple algorithms can show where the various methods agree and expose robust patterns within the data. In this paper, we present a software package - Consense, written for R/Bioconductor - that utilizes such an approach to explore microarray datasets. Consense produces clustering results for each of the clustering methods and produces a report of metrics comparing the individual clusterings. A feature of Consense is identification of genes that cluster consistently with an index gene across methods. Utilizing simulated microarray data, sensitivity of the metrics to the biases of the different clustering algorithms is explored. The framework is easily extensible, allowing this tool to be used by other functional genomic data types, as well as other high-throughput OMICS data types generated from metabolomic and proteomic experiments. It also provides a flexible environment to benchmark new clustering algorithms. Consense is currently available as an installable R/Bioconductor package (http://www.ohsucancer.com/isrdev/consense/).  相似文献   

8.
9.
SPLASH: structural pattern localization analysis by sequential histograms   总被引:6,自引:0,他引:6  
MOTIVATION: The discovery of sparse amino acid patterns that match repeatedly in a set of protein sequences is an important problem in computational biology. Statistically significant patterns, that is patterns that occur more frequently than expected, may identify regions that have been preserved by evolution and which may therefore play a key functional or structural role. Sparseness can be important because a handful of non-contiguous residues may play a key role, while others, in between, may be changed without significant loss of function or structure. Similar arguments may be applied to conserved DNA patterns. Available sparse pattern discovery algorithms are either inefficient or impose limitations on the type of patterns that can be discovered. RESULTS: This paper introduces a deterministic pattern discovery algorithm, called Splash, which can find sparse amino or nucleic acid patterns matching identically or similarly in a set of protein or DNA sequences. Sparse patterns of any length, up to the size of the input sequence, can be discovered without significant loss in performances. Splash is extremely efficient and embarrassingly parallel by nature. Large databases, such as a complete genome or the non-redundant SWISS-PROT database can be processed in a few hours on a typical workstation. Alternatively, a protein family or superfamily, with low overall homology, can be analyzed to discover common functional or structural signatures. Some examples of biologically interesting motifs discovered by Splash are reported for the histone I and for the G-Protein Coupled Receptor families. Due to its efficiency, Splash can be used to systematically and exhaustively identify conserved regions in protein family sets. These can then be used to build accurate and sensitive PSSM or HMM models for sequence analysis. AVAILABILITY: Splash is available to non-commercial research centers upon request, conditional on the signing of a test field agreement. CONTACT: acal@us.ibm.com, Splash main page http://www.research.ibm.com/splash  相似文献   

10.
SUMMARY: In this paper we present a data mining system, which allows the application of different clustering and cluster validity algorithms for DNA microarray data. This tool may improve the quality of the data analysis results, and may support the prediction of the number of relevant clusters in the microarray datasets. This systematic evaluation approach may significantly aid genome expression analyses for knowledge discovery applications. The developed software system may be effectively used for clustering and validating not only DNA microarray expression analysis applications but also other biomedical and physical data with no limitations. AVAILABILITY: The program is freely available for non-profit use on request at http://www.cs.tcd.ie/Nadia.Bolshakova/Machaon.html CONTACT: Nadia.Bolshakova@cs.tcd.ie.  相似文献   

11.
12.
CROW (Columns and Rows Of Workstations - http://www.sicmm.org/crow/) is a parallel computer cluster based on the Beowulf (http://www.beowulf.org/) idea, modified to support a larger number of processors. Its architecture is based on point-to-point network architecture, which does not require the use of any network switching equipment in the system. Thus, the cost is lower, and there is no degradation in network performance even for a larger number of processors.  相似文献   

13.
ART is a set of simulation tools that generate synthetic next-generation sequencing reads. This functionality is essential for testing and benchmarking tools for next-generation sequencing data analysis including read alignment, de novo assembly and genetic variation discovery. ART generates simulated sequencing reads by emulating the sequencing process with built-in, technology-specific read error models and base quality value profiles parameterized empirically in large sequencing datasets. We currently support all three major commercial next-generation sequencing platforms: Roche's 454, Illumina's Solexa and Applied Biosystems' SOLiD. ART also allows the flexibility to use customized read error model parameters and quality profiles. AVAILABILITY: Both source and binary software packages are available at http://www.niehs.nih.gov/research/resources/software/art.  相似文献   

14.
We describe a web-based resource to identify, search and analyze sequence patterns conserved in the multiple sequence alignments of orthologous promoters from closely related / distant Saccharomyces spp. The webtool interfaces with a database where conserved sequence patterns (greater than 4 bp) have been previously extracted from genome-wide promoter alignments, allowing one to carry out user-defined genome-wide searches for conserved sequences to assist in the discovery of novel promoter elements based on comparative genomics. The web-based server can be accessed at http://www2.imtech.res.in/ anand/sacch_prom_pat.html.  相似文献   

15.
Background

Genomic islands (GIs) are clusters of alien genes in some bacterial genomes, but not be seen in the genomes of other strains within the same genus. The detection of GIs is extremely important to the medical and environmental communities. Despite the discovery of the GI associated features, accurate detection of GIs is still far from satisfactory.

Results

In this paper, we combined multiple GI-associated features, and applied and compared various machine learning approaches to evaluate the classification accuracy of GIs datasets on three genera: Salmonella, Staphylococcus, Streptococcus, and their mixed dataset of all three genera. The experimental results have shown that, in general, the decision tree approach outperformed better than other machine learning methods according to five performance evaluation metrics. Using J48 decision trees as base classifiers, we further applied four ensemble algorithms, including adaBoost, bagging, multiboost and random forest, on the same datasets. We found that, overall, these ensemble classifiers could improve classification accuracy.

Conclusions

We conclude that decision trees based ensemble algorithms could accurately classify GIs and non-GIs, and recommend the use of these methods for the future GI data analysis. The software package for detecting GIs can be accessed at http://www.esu.edu/cpsc/che_lab/software/GIDetector/.

  相似文献   

16.
PartiGene--constructing partial genomes   总被引:4,自引:0,他引:4  
Expressed sequence tags (ESTs) offer a low-cost approach to gene discovery and are being used by an increasing number of laboratories to obtain sequence information for a wide variety of organisms. The challenge lies in processing and organizing this data within a genomic context to facilitate large scale analyses. Here we present PartiGene, an integrated sequence analysis suite that uses freely available public domain software to (1) process raw trace chromatograms into sequence objects suitable for submission to dbEST; (2) place these sequences within a genomic context; (3) perform customizable first-pass annotation of the data; and (4) present the data as HTML tables and an SQL database resource. PartiGene has been used to create a number of non-model organism database resources including NEMBASE (http://www.nematodes.org) and LumbriBase (http://www.earthworms.org/). The packages are readily portable, freely available and can be run on simple Linux-based workstations. AVAILABILITY: PartiGene is available from http://www.nematodes.org/PartiGene and also forms part of the EST analysis software, associated with the Natural Environmental Research Council (UK) Bio-Linux project (http://envgen.nox.ac.uk/biolinux.html).  相似文献   

17.
Enrichment analysis methods, e.g., gene set enrichment analysis, represent one class of important bioinformatical resources for mining patterns in biomedical datasets. However, tools for inferring patterns and rules of a list of drugs are limited. In this study, we developed a web-based tool, DrugPattern, for drug set enrichment analysis. We first collected and curated 7019 drug sets, including indications, adverse reactions, targets, pathways, etc. from public databases. For a list of interested drugs, DrugPattern then evaluates the significance of the enrichment of these drugs in each of the 7019 drug sets. To validate DrugPattern, we employed it for the prediction of the effects of oxidized low-density lipoprotein (oxLDL), a factor expected to be deleterious. We predicted that oxLDL has beneficial effects on some diseases, most of which were supported by evidence in the literature. Because DrugPattern predicted the potential beneficial effects of oxLDL in type 2 diabetes (T2D), animal experiments were then performed to further verify this prediction. As a result, the experimental evidences validated the DrugPattern prediction that oxLDL indeed has beneficial effects on T2D in the case of energy restriction. These data confirmed the prediction accuracy of our approach and revealed unexpected protective roles for oxLDL in various diseases. This study provides a tool to infer patterns and rules in biomedical datasets based on drug set enrichment analysis. DrugPattern is available at http://www.cuilab.cn/drugpattern.  相似文献   

18.
MOTIVATION: Subcellular localization is a key functional characteristic of proteins. A fully automatic and reliable prediction system for protein subcellular localization is needed, especially for the analysis of large-scale genome sequences. RESULTS: In this paper, Support Vector Machine has been introduced to predict the subcellular localization of proteins from their amino acid compositions. The total prediction accuracies reach 91.4% for three subcellular locations in prokaryotic organisms and 79.4% for four locations in eukaryotic organisms. Predictions by our approach are robust to errors in the protein N-terminal sequences. This new approach provides superior prediction performance compared with existing algorithms based on amino acid composition and can be a complementary method to other existing methods based on sorting signals. AVAILABILITY: A web server implementing the prediction method is available at http://www.bioinfo.tsinghua.edu.cn/SubLoc/. SUPPLEMENTARY INFORMATION: Supplementary material is available at http://www.bioinfo.tsinghua.edu.cn/SubLoc/.  相似文献   

19.
MOTIVATION: Recently, the concept of the constrained sequence alignment was proposed to incorporate the knowledge of biologists about structures/functionalities/consensuses of their datasets into sequence alignment such that the user-specified residues/nucleotides are aligned together in the computed alignment. The currently developed programs use the so-called progressive approach to efficiently obtain a constrained alignment of several sequences. However, the kernels of these programs, the dynamic programming algorithms for computing an optimal constrained alignment between two sequences, run in (gamman2) memory, where gamma is the number of the constraints and n is the maximum of the lengths of sequences. As a result, such a high memory requirement limits the overall programs to align short sequences only. RESULTS: We adopt the divide-and-conquer approach to design a memory-efficient algorithm for computing an optimal constrained alignment between two sequences, which greatly reduces the memory requirement of the dynamic programming approaches at the expense of a small constant factor in CPU time. This new algorithm consumes only O(alphan) space, where alpha is the sum of the lengths of constraints and usually alpha < n in practical applications. Based on this algorithm, we have developed a memory-efficient tool for multiple sequence alignment with constraints. AVAILABILITY: http://genome.life.nctu.edu.tw/MUSICME.  相似文献   

20.
设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号