Found 20 similar articles (search time: 8 ms)
1.
2.
MOTIVATION: Discovery of regulatory motifs in unaligned DNA sequences remains a fundamental problem in computational biology. Two categories of algorithms have been developed to identify common motifs from a set of DNA sequences. The first can be called a 'multiple genes, single species' approach. It proposes that a degenerate motif is embedded in some or all of the otherwise unrelated input sequences and tries to describe a consensus motif and identify its occurrences. It is often used for co-regulated genes identified through experimental approaches. The second approach can be called 'single gene, multiple species'. It requires orthologous input sequences and tries to identify unusually well conserved regions by phylogenetic footprinting. Both approaches perform well, but each has some limitations. It is tempting to combine the knowledge of co-regulation among different genes and conservation among orthologous genes to improve our ability to identify motifs. RESULTS: Based on the Consensus algorithm previously established by our group, we introduce a new algorithm called PhyloCon (Phylogenetic Consensus) that takes into account both conservation among orthologous genes and co-regulation of genes within a species. This algorithm first aligns conserved regions of orthologous sequences into multiple sequence alignments, or profiles, then compares profiles representing non-orthologous sequences. Motifs emerge as common regions in these profiles. Here we present a novel statistic to compare profiles of DNA sequences and a greedy approach to search for common subprofiles. We demonstrate that PhyloCon performs well on both synthetic and biological data. AVAILABILITY: Software available upon request from the authors. http://ural.wustl.edu/softwares.html
3.
MOTIVATION: Accurate computational prediction of protein functional sites is critical to maximizing the utility of recent high-throughput sequencing efforts. Among the available approaches, position-specific conservation scores remain among the most popular due to their accuracy and ease of computation. Unfortunately, high false positive rates remain a limiting factor. Using phylogenetic motifs (PMs), we have developed two combined (conservation + PMs) prediction schemes that significantly improve prediction accuracy. RESULTS: Our first approach, called position-specific MINER (psMINER), rank orders alignment columns by conservation. Subsequently, positions that are not also identified as PMs are excluded from the prediction set. This approach improves prediction accuracy, in a statistically significant way, compared to the underlying conservation scores. Increased accuracy is a general result, meaning improvement is observed over several different conservation scores that span a continuum of complexity. In addition, a hybrid MINER (hMINER) that quantitatively considers both scoring regimes provides further improvement. More importantly, it provides critical insight into the relative importance of phylogeny versus alignment conservation. Both methods outperform other common prediction algorithms that also utilize phylogenetic concepts. Finally, we demonstrate that the presented results are critically sensitive to functional site definition, thus highlighting the need for more complete benchmarks within the prediction community.
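The rank-then-filter idea in this abstract can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not say which conservation scores were used, so normalized Shannon entropy stands in for them here, and the function names (`column_conservation`, `rank_columns`, `filter_by_pms`) and the set of PM positions are hypothetical, not taken from psMINER.

```python
import math
from collections import Counter


def column_conservation(alignment):
    """Score each alignment column in [0, 1]; 1.0 means fully conserved.

    Assumption: conservation = 1 - (Shannon entropy / log2(20)), gaps ignored.
    The abstract does not specify the scores used by psMINER.
    """
    max_entropy = math.log2(20)  # 20 amino acids
    scores = []
    for j in range(len(alignment[0])):
        column = [seq[j] for seq in alignment if seq[j] != "-"]
        if not column:
            scores.append(0.0)  # all-gap column carries no signal
            continue
        total = len(column)
        counts = Counter(column)
        entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
        scores.append(1.0 - entropy / max_entropy)
    return scores


def rank_columns(alignment):
    """Return column indices ordered from most to least conserved."""
    scores = column_conservation(alignment)
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)


def filter_by_pms(ranked_columns, pm_positions):
    """Keep only ranked columns that also fall inside a phylogenetic motif.

    pm_positions is a hypothetical precomputed set of column indices.
    """
    return [j for j in ranked_columns if j in pm_positions]


alignment = ["ACD", "ACE", "AGF"]
ranked = rank_columns(alignment)
predicted = filter_by_pms(ranked, pm_positions={0, 2})
```

The filter only removes positions, so it can lower the false positive rate of the underlying score but cannot recover sites the score ranks poorly.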
4.
5.
CMDWave (Conserved Motif Detection using WAVElets) is a web server that predicts conserved motifs in protein sequences. A set of query protein sequences is first aligned using ClustalW to obtain equal-sized sequences. CMDWave then converts the sequences into a numerical representation using electron-ion interaction potential (EIIP). This is followed by a wavelet decomposition and reconstruction. A new similarity metric along with thresholding is then used to identify conserved motifs across all the query sequences. Users need not specify the number of motifs to be identified. For larger groups of sequences, results can be emailed to the users.
6.
7.
The Brassicaceae is an economically and scientifically important family distributed globally, including oilseed rape and the model plant, Arabidopsis thaliana. Although growing molecular data have been used in phylogenetic studies, the relationships among major clades and tribes of Brassicaceae are still controversial. Here, we investigated the core Brassicaceae phylogenetics using 222 plastomes and 235 nrDNA cistrons, including 106 plastomes and 112 nrDNA cistrons assembled from newly sequenced genome skimming data of 112 taxa. The sampling covered 73 genera from 61.5% of tribes and four unassigned genera and species. Three well supported lineages LI, LII, and LIII were revealed in our plastomic analyses, with LI sister to LII + LIII. In addition, the monophyly of the newly delimited LII was strongly supported by three different partition strategies, concatenated methods under Bayesian and Maximum Likelihood analyses. LII comprised 13 tribes, including four tribes previously unassigned to any lineage, that is Biscutelleae as the earliest diverging clade and Cochlearieae as the sister to Megacarpaeeae + Anastaticeae. Within LII, the intertribal relationships were also well resolved, except that a conflicting position of Orychophragmus was detected among different datasets. In LIII, Shehbazia was resolved as a member of Chorisporeae, but Chorisporeae, Dontostemoneae, and Euclidieae were all resolved as paraphyletic, which was also confirmed by nrDNA analyses. Moreover, the loss of the rps16 gene was detected as likely to be a synapomorphy of the tribes Arabideae and Alysseae. Overall, using genome skimming data, we resolved robust phylogenetic relationships of core Brassicaceae and shed new light on the complex evolutionary history of this family.
8.
9.
Tino P Zhao H Yan H 《IEEE/ACM transactions on computational biology and bioinformatics / IEEE, ACM》2011,8(4):1093-1107
The effects of a drug on the genomic scale can be assessed in a three-color cDNA microarray with the three color intensities represented through the so-called hexaMplot. In our recent study, we have shown that the Hough Transform (HT) applied to the hexaMplot can be used to detect groups of coexpressed genes in the normal-disease-drug samples. However, the standard HT is not well suited for the purpose because 1) the assayed genes need first to be hard-partitioned into equally and differentially expressed genes, with HT ignoring possible information in the former group; 2) the hexaMplot coordinates are negatively correlated and there is no direct way of expressing this in the standard HT; and 3) it is not clear how to quantify the association of coexpressed genes with the line along which they cluster. We address these deficiencies by formulating a dedicated probabilistic model-based HT. The approach is demonstrated by assessing effects of the drug Rg1 on homocysteine-treated human umbilical vein endothelial cells. Compared with our previous study, we robustly detect stronger natural groupings of coexpressed genes. Moreover, the gene groups show coherent biological functions with high significance, as detected by the Gene Ontology analysis.
10.
MOTIVATION: Genes with identical patterns of occurrence across the phyla tend to function together in the same protein complexes or participate in the same biochemical pathway. However, the requirement that the profiles be identical (i) severely restricts the number of functional links that can be established by such phylogenetic profiling; (ii) limits detection to very strong functional links, failing to capture relations between genes that are not in the same pathway, but nevertheless subserve a common function and (iii) misses relations between analogous genes. Here we present and apply a method for relaxing the restriction, based on the probability that a given arbitrary degree of similarity between two profiles would occur by chance, with no biological pressure. Function is then inferred at any desired level of confidence. RESULTS: We derive an expression for the probability distribution of a given number of chance co-occurrences of a pair of non-homologous orthologs across a set of genomes. The method is applied to 2905 clusters of orthologous genes (COGs) from 44 fully sequenced microbial genomes representing all three domains of life. Among the results are the following. (1) Of the 51 000 annotated intrapathway gene pairs, 8935 are linked at a level of significance of 0.01. This is over 30-fold greater than the 271 intrapathway pairs obtained at the same confidence level when identical profiles are used. (2) Of the 540 000 interpathway gene pairs, some 65 000 are linked at the 0.01 level of significance, some 12 standard deviations beyond the number expected by chance at this confidence level. We speculate that many of these links involve nearest-neighbor path, and discuss some examples. (3) The difference in the percentage of linked interpathway and intrapathway genes is highly significant, consistent with the intuitive expectation that genes in the same pathway are generally under greater selective pressure than those that are not. (4) The method appears to recover metabolic networks well. This is illustrated by the TCA cycle, which is recovered as a highly connected, weighted edge network of 30 of its 31 COGs. (5) The fraction of pairs having a common pathway is a symmetric function of the Hamming distance between their profiles. This finding, that the functional correlation between profiles with near maximum Hamming distance is as large as between profiles with near zero Hamming distance, and as statistically significant, is plausibly explained if the former group represents analogous genes.
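The chance co-occurrence calculation described in this abstract can be illustrated with a short sketch. The abstract does not reproduce the derived expression, so the code below assumes the standard hypergeometric null (two genes present in n and m of N genomes, placed independently); the function names are hypothetical.

```python
from math import comb


def cooccurrence_pmf(N, n, m, k):
    """P(exactly k co-occurrences) for genes present in n and m of N genomes.

    Assumption: hypergeometric null, i.e. presence patterns are independent.
    """
    if k < max(0, n + m - N) or k > min(n, m):
        return 0.0
    return comb(n, k) * comb(N - n, m - k) / comb(N, m)


def cooccurrence_pvalue(N, n, m, k):
    """P(at least k co-occurrences) under the same null."""
    return sum(cooccurrence_pmf(N, n, m, i) for i in range(k, min(n, m) + 1))


def hamming(profile_a, profile_b):
    """Number of genomes in which two presence/absence profiles differ."""
    return sum(a != b for a, b in zip(profile_a, profile_b))


# Two illustrative profiles over 8 genomes (1 = gene present).
a = (1, 1, 1, 0, 0, 1, 0, 0)
b = (1, 1, 0, 0, 0, 1, 0, 0)
k = sum(x and y for x, y in zip(a, b))
p = cooccurrence_pvalue(len(a), sum(a), sum(b), k)
```

A pair would be called linked when `p` falls below the chosen significance level (0.01 in the abstract), which relaxes the requirement that the two profiles be identical.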
11.
Detecting DNA-binding helix-turn-helix structural motifs using sequence and structure information
In this work, we analyse the potential for using structural knowledge to improve the detection of the DNA-binding helix–turn–helix (HTH) motif from sequence. Starting from a set of DNA-binding protein structures that include a functional HTH motif and have no apparent sequence similarity to each other, two different libraries of hidden Markov models (HMMs) were built. One library included sequence models of whole DNA-binding domains, which incorporate the HTH motif; the second library included shorter models of ‘partial’ domains, representing only the fraction of the domain that corresponds to the functionally relevant HTH motif itself. The libraries were scanned against a dataset of protein sequences, some containing the HTH motifs, others not. HMM predictions were compared with the results obtained from a previously published structure-based method and subsequently combined with it. The combined method proved more effective than either of the single-featured approaches, showing that the information carried by motif sequences and motif structures is to some extent complementary and can successfully be used together for the detection of DNA-binding HTHs in proteins of unknown function.
12.
To quantify target genes in biological samples using DNA microarrays, we employed reference DNA to normalize variations in spot size and hybridization. This method was tested using nitrite reductase (nirS), naphthalene dioxygenase (nahA), and Escherichia coli O157 O-antigen biosynthesis genes as model genes and lambda DNA as the reference DNA. We observed a good linearity between the log signal ratio and log DNA concentration ratio at DNA concentrations above the method's detection limit, which was approximately 10 pg. This approach for designing quantitative microarrays and the inferred equation from this study provide a simple and convenient way to estimate the target gene concentration from the hybridization signal ratio.
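The log-log linearity reported in this abstract suggests a simple calibration workflow, sketched below. The abstract does not give the inferred equation or its coefficients, so the least-squares fit, the function names, and the calibration values are all illustrative assumptions (the numbers are synthetic, not measurements from the study).

```python
import math


def fit_loglog(conc_ratios, signal_ratios):
    """Least-squares fit of log10(signal ratio) = slope * log10(conc ratio) + intercept."""
    xs = [math.log10(c) for c in conc_ratios]
    ys = [math.log10(s) for s in signal_ratios]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def estimate_conc_ratio(signal_ratio, slope, intercept):
    """Invert the fitted line to estimate target/reference DNA concentration ratio."""
    return 10 ** ((math.log10(signal_ratio) - intercept) / slope)


# Synthetic calibration points (hypothetical): target/reference concentration
# ratios and the signal ratios they would give under signal = 2 * conc**0.8.
calib_conc = [0.1, 1.0, 10.0, 100.0]
calib_signal = [2 * c ** 0.8 for c in calib_conc]

slope, intercept = fit_loglog(calib_conc, calib_signal)
unknown = estimate_conc_ratio(2 * 5.0 ** 0.8, slope, intercept)
```

Such a fit would only be meaningful above the detection limit the abstract mentions (approximately 10 pg), since linearity is reported only in that range.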
13.
14.
Homology-based methods fail to assign genes to many metabolic activities present in sequenced organisms. To suggest genes for these orphan activities we developed a novel method that efficiently combines local structure of a metabolic network with phylogenetic profiles. We validated our method using known metabolic genes in Saccharomyces cerevisiae and Escherichia coli. We show that our method should be easily transferable to other organisms, and that it is robust to errors in incomplete metabolic networks.
15.
Kosakovsky Pond SL Posada D Gravenor MB Woelk CH Frost SD 《Molecular biology and evolution》2006,23(10):1891-1901
The evolution of homologous sequences affected by recombination or gene conversion cannot be adequately explained by a single phylogenetic tree. Many tree-based methods for sequence analysis, for example, those used for detecting sites evolving nonneutrally, have been shown to fail if such phylogenetic incongruity is ignored. However, it may be possible to propose several phylogenies that can correctly model the evolution of nonrecombinant fragments. We propose a model-based framework that uses a genetic algorithm to search a multiple-sequence alignment for putative recombination break points, quantifies the level of support for their locations, and identifies sequences or clades involved in putative recombination events. The software implementation can be run quickly and efficiently in a distributed computing environment, and various components of the methods can be chosen for computational expediency or statistical rigor. We evaluate the performance of the new method on simulated alignments and on an array of published benchmark data sets. Finally, we demonstrate that prescreening alignments with our method allows one to analyze recombinant sequences for positive selection.
16.
N Maizels 《Annals of the New York Academy of Sciences》2012,1267(1):53-60
17.
18.
Rogov SI Momynaliev KT Govorun VM 《Journal of bioinformatics and computational biology》2006,4(4):853-864
RESULTS: A new algorithm is developed to find groups of genes whose expression values change in a concordant manner across a series of DNA array experiments. The algorithm, named CoexpressionFinder, can find more complete and internally coordinated groups of gene expression vectors than hierarchical clustering. It also finds more genes having coordinated expression. The algorithm's design allows parallel execution. AVAILABILITY: The algorithm is implemented as a Java application which is freely available at: http://www.bioinformatics.ru/cf/index.jsp and http://bioinformatics.ru/cf/index.jsp.
19.
Wernicke S 《IEEE/ACM transactions on computational biology and bioinformatics / IEEE, ACM》2006,3(4):347-359
Motifs in a given network are small connected subnetworks that occur in significantly higher frequencies than would be expected in random networks. They have recently gathered much attention as a concept to uncover structural design principles of complex networks. Kashtan et al. [Bioinformatics, 2004] proposed a sampling algorithm for performing the computationally challenging task of detecting network motifs. However, among other drawbacks, this algorithm suffers from a sampling bias and scales poorly with increasing subgraph size. Based on a detailed analysis of the previous algorithm, we present a new algorithm for network motif detection which overcomes these drawbacks. Furthermore, we present an efficient new approach for estimating the frequency of subgraphs in random networks that, in contrast to previous approaches, does not require the explicit generation of random networks. Experiments on a testbed of biological networks show our new algorithms to be orders of magnitude faster than previous approaches, allowing for the detection of larger motifs in bigger networks than previously possible and thus facilitating deeper insight into the field.
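To make the counting task concrete, the sketch below enumerates connected 3-node subgraphs of an undirected network by brute force. It is deliberately not the algorithm presented in the abstract, nor the sampling method of Kashtan et al.; it only shows the quantity those methods compute faster. The function name and the toy edge list are hypothetical.

```python
from itertools import combinations


def count_three_node_subgraphs(edges):
    """Count connected 3-node induced subgraphs in an undirected graph.

    Brute force over all node triples (cubic in the number of nodes), which is
    the kind of poor scaling that dedicated motif-detection algorithms avoid.
    """
    adjacency = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)

    counts = {"path": 0, "triangle": 0}
    for a, b, c in combinations(sorted(adjacency), 3):
        n_edges = (b in adjacency[a]) + (c in adjacency[a]) + (c in adjacency[b])
        if n_edges == 3:
            counts["triangle"] += 1
        elif n_edges == 2:
            counts["path"] += 1  # two edges on three nodes always form a path
    return counts


toy_edges = [(1, 2), (2, 3), (1, 3), (3, 4)]
toy_counts = count_three_node_subgraphs(toy_edges)
```

Calling a subgraph a motif additionally requires comparing these counts against their expected frequency in random networks, which this sketch does not attempt.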
20.