Similar articles
 Found 20 similar articles (search time: 31 ms)
1.
The cWINNOWER algorithm detects fuzzy motifs in DNA sequences rich in protein-binding signals. A signal is defined as any short nucleotide pattern having up to d mutations differing from a motif of length l. The algorithm finds such motifs if a clique consisting of a sufficiently large number of mutated copies of the motif (i.e., the signals) is present in the DNA sequence. The cWINNOWER algorithm substantially improves the sensitivity of the winnower method of Pevzner and Sze by imposing a consensus constraint, enabling it to detect much weaker signals. We studied the minimum detectable clique size qc as a function of sequence length N for random sequences. We found that qc increases linearly with N for a fast version of the algorithm based on counting three-member sub-cliques. Imposing consensus constraints reduces qc by a factor of three in this case, which makes the algorithm dramatically more sensitive. Our most sensitive algorithm, which counts four-member sub-cliques, needs a minimum of only 13 signals to detect motifs in a sequence of length N = 12,000 for (l, d) = (15, 4).
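
The signal definition above (a window within d mutations of a motif of length l) can be illustrated with a minimal Python sketch. This is not the cWINNOWER algorithm itself: the function names `hamming`, `find_signals` and `may_share_motif` are hypothetical, and the motif is assumed to be known in advance, which is exactly what the real algorithm does not require.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


def find_signals(sequence, motif, d):
    """Start positions of all windows that differ from `motif` in at most d positions."""
    l = len(motif)
    return [
        i
        for i in range(len(sequence) - l + 1)
        if hamming(sequence[i:i + l], motif) <= d
    ]


def may_share_motif(a, b, d):
    """Two signals derived from one motif can differ in at most 2*d positions
    (triangle inequality), so this is a necessary condition, not a sufficient one."""
    return hamming(a, b) <= 2 * d


print(find_signals("ACGTACGTTT", "ACGA", 1))  # windows at 0 and 4 are "ACGT"
```

Clique-based methods work without knowing the motif: they can only use a pairwise test such as `may_share_motif` to connect candidate windows, which is why the size of the clique of mutually compatible windows matters.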

2.
We describe an algorithm (IRSA) for identification of common regulatory signals in samples of unaligned DNA sequences. The algorithm was tested on randomly generated sequences of fixed length with an implanted signal of length 15 with 4 mutations, and on natural upstream regions of bacterial genes regulated by PurR, ArgR and CRP. Then it was applied to upstream regions of orthologous genes from Escherichia coli and related genomes. Some new palindromic binding signals and direct-repeat signals were identified. Finally, we present a parallel version suitable for computers supporting the MPI protocol. This implementation is not strictly bounded by the number of available processors. The computation speed depends linearly on the number of processors.

3.
The analysis of repeats in DNA sequences is an important subject in bioinformatics. In this paper, we propose a novel projection-assemble algorithm to find unknown interspersed repeats in DNA sequences. The algorithm employs a random projection algorithm to obtain a candidate fragment set and an exhaustive search algorithm to examine each pair of fragments in that set for potential linkage, and then assembles the linked fragments. The complexity of our projection-assemble algorithm is nearly linear in the length of the genome sequence, and its memory usage is limited by the hardware. We tested our algorithm with both simulated data and real biological data, and the results show that our projection-assemble algorithm is efficient. By means of this algorithm, we found an unlabeled repeat region that occurs five times in the Escherichia coli genome, with a length of more than 5,000 bp and a mismatch probability of less than 4%.

4.
The degeneracy of codons allows a multitude of possible sequences to code for the same protein. Hidden within the particular choice of sequence for each organism are over 100 previously undiscovered biologically significant, short oligonucleotides (length, 2 to 7 nucleotides). We present an information-theoretic algorithm that finds these novel signals. Applying this algorithm to the 209 sequenced bacterial genomes in the NCBI database, we determine a set of oligonucleotides for each bacterium which uniquely characterizes the organism. Some of these signals have known biological functions, like restriction enzyme binding sites, but most are new. An accompanying scoring algorithm is introduced that accurately (92%) places sequences of 100 kb with their correct species among the choice of hundreds. This algorithm also does far better than previous methods at relating phage genomes to their bacterial hosts, suggesting that the lists of oligonucleotides are "genomic fingerprints" that encode information about the effects of the cellular environment on DNA sequence. Our approach provides a novel basis for phylogeny and is potentially ideally suited for classifying the short DNA fragments obtained by environmental shotgun sequencing. The methods developed here can be readily extended to other problems in bioinformatics.

5.
6.
We present an efficient algorithm for detecting putative regulatory elements in the upstream DNA sequences of genes, using gene expression information obtained from microarray experiments. Based on a generalized suffix tree, our algorithm looks for motif patterns whose appearance in the upstream region is most correlated with the expression levels of the genes. We are able to find the optimal pattern in time linear in the total length of the upstream sequences. We implement our algorithm and apply it to publicly available microarray gene expression data, and show that our method is able to discover biologically significant motifs, including various motifs which have been reported previously using the same data set. We further discuss applications for which the efficiency of the method is essential, as well as possible extensions to our algorithm.

7.
MOTIVATION: Despite the growing literature devoted to finding differentially expressed genes in assays probing different tissue types, little attention has been paid to the combinatorial nature of feature selection inherent to large, high-dimensional gene expression datasets. New flexible data analysis approaches capable of searching relevant subgroups of genes and experiments are needed to understand multivariate associations of gene expression patterns with observed phenotypes. RESULTS: We present in detail a deterministic algorithm to discover patterns of multivariate gene associations in gene expression data. The patterns discovered are differential with respect to a control dataset. The algorithm is exhaustive and efficient, reporting all existing patterns that fit a given input parameter set while avoiding enumeration of the entire pattern space. The value of the pattern discovery approach is demonstrated by finding a set of genes that differentiate between two types of lymphoma. Moreover, these genes are found to behave consistently in an independent dataset produced in a different laboratory using different arrays, thus validating the genes selected using our algorithm. We show that the genes deemed significant in terms of their multivariate statistics will be missed using other methods. AVAILABILITY: Our set of pattern discovery algorithms including a user interface is distributed as a package called Genes@Work. This package is freely available to non-commercial users and can be downloaded from our website (http://www.research.ibm.com/FunGen).

8.
MOTIVATION: A consensus sequence for a family of related sequences is, as the name suggests, a sequence that captures the features common to most members of the family. Consensus sequences are important in various DNA sequencing applications and are a convenient way to characterize a family of molecules. RESULTS: This paper describes a new algorithm for finding a consensus sequence, using the popular optimization method known as simulated annealing. Unlike the conventional approach of finding a consensus sequence by first forming a multiple sequence alignment, this algorithm searches for a sequence that minimises the sum of pairwise distances to each of the input sequences. The resulting consensus sequence can then be used to induce a multiple sequence alignment. The time required by the algorithm scales linearly with the number of input sequences and quadratically with the length of the consensus sequence. We present results demonstrating the high quality of the consensus sequences and alignments produced by the new algorithm. For comparison, we also present similar results obtained using ClustalW. The new algorithm outperforms ClustalW in many cases.
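
The objective described above (a sequence minimising the summed distance to every input sequence, searched by simulated annealing) can be sketched in Python as follows. This is an illustrative sketch only, not the paper's implementation: the use of Levenshtein distance, the single-character move set, the cooling schedule and all function names and parameter values are assumptions.

```python
import math
import random


def levenshtein(a, b):
    """Edit distance by dynamic programming; time is quadratic in sequence length."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def total_distance(candidate, seqs):
    """Objective: sum of distances from the candidate to each input sequence."""
    return sum(levenshtein(candidate, s) for s in seqs)


def mutate(s, rng, alphabet="ACGT"):
    """Random single-character substitution, insertion or deletion (never empties s)."""
    op = rng.choice(("sub", "ins", "del"))
    if op == "del" and len(s) > 1:
        i = rng.randrange(len(s))
        return s[:i] + s[i + 1:]
    if op == "ins":
        i = rng.randrange(len(s) + 1)
        return s[:i] + rng.choice(alphabet) + s[i:]
    i = rng.randrange(len(s))
    return s[:i] + rng.choice(alphabet) + s[i + 1:]


def anneal_consensus(seqs, steps=2000, t0=2.0, cooling=0.995, seed=0):
    """Return the best candidate seen and its cost; starts from the first input."""
    rng = random.Random(seed)
    current = seqs[0]
    cur_cost = total_distance(current, seqs)
    best, best_cost = current, cur_cost
    t = t0
    for _ in range(steps):
        cand = mutate(current, rng)
        cost = total_distance(cand, seqs)
        # Always accept improvements; accept worse moves with Boltzmann probability.
        if cost <= cur_cost or rng.random() < math.exp((cur_cost - cost) / t):
            current, cur_cost = cand, cost
            if cur_cost < best_cost:
                best, best_cost = current, cur_cost
        t = max(t * cooling, 1e-6)
    return best, best_cost
```

Each evaluation of the objective costs one distance computation per input sequence, each quadratic in sequence length, which is consistent with the scaling stated in the abstract.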

9.
SPLASH: structural pattern localization analysis by sequential histograms   (total citations: 6; self-citations: 0; citations by others: 6)
MOTIVATION: The discovery of sparse amino acid patterns that match repeatedly in a set of protein sequences is an important problem in computational biology. Statistically significant patterns, that is, patterns that occur more frequently than expected, may identify regions that have been preserved by evolution and which may therefore play a key functional or structural role. Sparseness can be important because a handful of non-contiguous residues may play a key role, while others, in between, may be changed without significant loss of function or structure. Similar arguments may be applied to conserved DNA patterns. Available sparse pattern discovery algorithms are either inefficient or impose limitations on the type of patterns that can be discovered. RESULTS: This paper introduces a deterministic pattern discovery algorithm, called Splash, which can find sparse amino or nucleic acid patterns matching identically or similarly in a set of protein or DNA sequences. Sparse patterns of any length, up to the size of the input sequence, can be discovered without significant loss in performance. Splash is extremely efficient and embarrassingly parallel by nature. Large databases, such as a complete genome or the non-redundant SWISS-PROT database, can be processed in a few hours on a typical workstation. Alternatively, a protein family or superfamily, with low overall homology, can be analyzed to discover common functional or structural signatures. Some examples of biologically interesting motifs discovered by Splash are reported for the histone I and for the G-Protein Coupled Receptor families. Due to its efficiency, Splash can be used to systematically and exhaustively identify conserved regions in protein family sets. These can then be used to build accurate and sensitive PSSM or HMM models for sequence analysis. AVAILABILITY: Splash is available to non-commercial research centers upon request, conditional on the signing of a test field agreement.
CONTACT: acal@us.ibm.com, Splash main page http://www.research.ibm.com/splash

10.
ABSTRACT: BACKGROUND: The molecular recognition based on the complementary base pairing of deoxyribonucleic acid (DNA) is the fundamental principle in the fields of genetics, DNA nanotechnology and DNA computing. We present an exhaustive DNA sequence design algorithm that allows the generation of sets containing a maximum number of sequences with defined properties. EGNAS (Exhaustive Generation of Nucleic Acid Sequences) offers the possibility of controlling both interstrand and intrastrand properties. The guanine-cytosine content can be adjusted. Sequences can be forced to start and end with guanine or cytosine. This option reduces the risk of "fraying" of DNA strands. It is possible to limit cross hybridizations of a defined length, and to adjust the uniqueness of sequences. Self-complementarity and hairpin structures of certain length can be avoided. Sequences and subsequences can optionally be forbidden. Furthermore, sequences can be designed to have minimum interactions with predefined strands and neighboring sequences. RESULTS: The algorithm is realized in a C++ program. TAG sequences can be generated and combined with primers for single-base extension reactions, which were described for multiplexed genotyping of single nucleotide polymorphisms. Thereby, possible foldback through intrastrand interaction of TAG-primer pairs can be limited. The design of sequences for specific attachment of molecular constructs to DNA origami is presented. CONCLUSIONS: We developed a new software tool called EGNAS for the design of unique nucleic acid sequences. The presented exhaustive algorithm allows the generation of larger sets of sequences than previous software under equal constraints. EGNAS is freely available for noncommercial use at http://www.chm.tu-dresden.de/pc6/EGNAS.
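
A few of the per-sequence constraints listed above (GC content, G/C at both ends, self-complementarity) are easy to express as predicates. The Python sketch below is only an illustration of those checks; it is not EGNAS (which is a C++ program), and the function names are hypothetical.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def gc_content(seq):
    """Fraction of bases that are G or C."""
    return sum(base in "GC" for base in seq) / len(seq)


def reverse_complement(seq):
    """Sequence of the strand that would pair with `seq`, read 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def is_self_complementary(seq):
    """True if the sequence equals its own reverse complement."""
    return seq == reverse_complement(seq)


def has_gc_ends(seq):
    """True if the sequence starts and ends with G or C (to reduce fraying)."""
    return seq[0] in "GC" and seq[-1] in "GC"


print(gc_content("GATTACAG"))           # 3 of 8 bases are G or C
print(is_self_complementary("GAATTC"))  # palindromic site, pairs with itself
```

An exhaustive generator would enumerate candidate sequences and keep only those passing every enabled predicate, plus the set-level checks (cross hybridization, uniqueness) that need pairwise comparison.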

11.
Storage of sequence data is a major concern because the amount of data generated at several locations is growing exponentially. Therefore, there is a need to develop techniques to store data using compression algorithms. Here we describe an optimal storage algorithm (OPTSDNA) for storing large amounts of DNA sequences of varying length. This paper provides a performance analysis of OPTSDNA in a distributed bioinformatics computing system for the analysis of DNA sequences. The OPTSDNA algorithm is used to store DNA sequences of various sizes, from very small to very large, in a database. The algorithm calculates the storage size, and the response time is also measured in this work. The efficiency and performance of the algorithm are high (in terms of storage size, expressed as a percentage) when compared with other known algorithms that use a sequential approach.

12.
The recent interest sparked by the discovery of a variety of functions for non-coding RNA molecules has highlighted the need for suitable tools for the analysis and the comparison of RNA sequences. Many trans-acting non-coding RNA genes and cis-acting RNA regulatory elements present motifs, conserved both in structure and sequence, that can hardly be detected by primary sequence analysis alone. We present an algorithm that takes as input a set of unaligned RNA sequences expected to share a common motif, and outputs the regions that are most conserved throughout the sequences, according to a similarity measure that takes into account both the sequence of the regions and the secondary structure they can form according to base-pairing and thermodynamic rules. Only a single parameter is needed as input, which denotes the number of distinct hairpins the motif has to contain. No further constraints on the size, number and position of the single elements comprising the motif are required. The algorithm can be split into two parts: first, it extracts from each input sequence a set of candidate regions whose predicted optimal secondary structure contains the number of hairpins given as input. Then, the regions selected are compared with each other to find the groups of most similar ones, formed by a region taken from each sequence. To avoid exhaustive enumeration of the search space and to reduce the execution time, a greedy heuristic is introduced for this task. We present different experiments, which show that the algorithm is capable of characterizing and discovering known regulatory motifs in mRNA like the iron responsive element (IRE) and selenocysteine insertion sequence (SECIS) stem–loop structures. We also show how it can be applied to corrupted datasets in which a motif does not appear in all the input sequences, as well as to the discovery of more complex motifs in non-coding RNA.

13.
Results of experiments in transmitting information with the aid of the sense of touch are presented. The patterns are limited sequences of binary signals, presented as vibrotactile pulses at the forearm. The information transmitted by a signal decreases with the length of the sequence and the serial number of the signal within the sequence. With increasing difficulty in pattern recognition, the information transmitted by the whole pattern exceeds the sum of the information transmitted by the signals. It is shown that the process during a sequence is not stationary. Finally, it is shown how the recognition of signals is correlated with the adjoining signals in the sequence.

14.
The analysis of signals consisting of discrete and irregular data causes methodological problems for Fourier spectral analysis: since it is based on sinusoidal functions, rectangular signals with unequal periodicities cannot easily be replicated. Walsh spectral analysis is based on the so-called "Walsh functions", a complete set of orthonormal, rectangular waves, and thus seems to be the method of choice for analysing signals consisting of binary or ordinal data. The paper compares Walsh spectral analysis and Fourier spectral analysis on the basis of simulated and real binary data sets of various lengths. Simulated data were derived from signals with defined cyclic patterns that were corrupted with randomly generated signals of the same length. The Walsh and Fourier spectra of each set were determined and up to 25% of the periodogram coefficients were utilized as input for an inverse transform. The mean square approximation error (MSE) was calculated for each of the series in order to compare the goodness of fit between the original and the reconstructed signal. The same procedure was performed with real data derived from a behavioral observation in pigs. The comparison of the two methods revealed that, in the analysis of discrete and binary time series, Walsh spectral analysis is the more appropriate method if the time series is rather short. If the length of the signal increases, the difference between the two methods is less substantial.
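
The procedure described above (transform, keep a fraction of the coefficients, invert, measure MSE) can be sketched for the Walsh side in Python. This is an assumption-laden illustration rather than the paper's method: it uses the fast Walsh-Hadamard transform in natural (Hadamard) ordering rather than sequency ordering, selects the largest coefficients by magnitude, requires a power-of-two signal length, and all names are hypothetical.

```python
def fwht(x):
    """Fast Walsh-Hadamard transform (unnormalised, natural ordering)."""
    a = list(x)
    n = len(a)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a[j], a[j + h] = a[j] + a[j + h], a[j] - a[j + h]
        h *= 2
    return a


def reconstruct(signal, fraction):
    """Keep the largest-magnitude coefficients (a given fraction) and invert."""
    n = len(signal)
    coeffs = fwht(signal)
    keep = max(1, int(n * fraction))
    top = sorted(range(n), key=lambda k: abs(coeffs[k]), reverse=True)[:keep]
    kept = [coeffs[k] if k in top else 0 for k in range(n)]
    # The transform is its own inverse up to a factor of n.
    return [v / n for v in fwht(kept)]


def mse(original, approx):
    """Mean square approximation error between two equal-length series."""
    return sum((o - a) ** 2 for o, a in zip(original, approx)) / len(original)


square = [1, 1, 1, 1, -1, -1, -1, -1]
print(fwht(square))                            # a single non-zero coefficient
print(mse(square, reconstruct(square, 0.25)))  # exact reconstruction
```

A rectangular wave that coincides with one Walsh function is captured by a single coefficient, whereas a Fourier series would need many sinusoids to approximate the same edges, which is the intuition behind the comparison in the abstract.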

16.
We describe a fast computer algorithm for identifying consensus patterns in DNA sequences. The method requires no prior assumptions about the consensus pattern other than its length. In particular, no previous knowledge of the frequency or spacing of consensus patterns is required. However, a priori information about the shape of the consensus pattern, or invariability of individual positions, or the overall conservation level, can be utilized to enhance the selectivity and sensitivity of search. As the number of all possible consensus words increases very rapidly with length, comprehensive searches have usually been restricted to a maximum of 10–12 nucleotides, even when large mainframes are used. Our algorithm enables searching for consensus patterns of this order on current mid-range and powerful microcomputers. Searches may be conducted on single, long sequences or a set of possibly aligned shorter sequences. We give examples of identified consensus patterns in both prokaryotic and eukaryotic DNA sequences, along with some typical program timings. Received on January 14, 1991; accepted on March 5, 1991

17.
MOTIVATION: To devise a method that, unlike available methods, directly measures variations in phylogenetic signals in gene sequences that result from recombination, tests the significance of the signal variations and distinguishes misleading signals. RESULTS: We have developed a method that we call 'sister-scanning' for assessing phylogenetic and compositional signals in the various patterns of identity that occur between four nucleotide sequences. A Monte Carlo randomization is done for all columns (positions) within a window and Z-scores are obtained for four real sequences or three real sequences with an outlier that is also randomized. The usefulness of the approach is demonstrated using tobamovirus and luteovirus sequences. Contradictory phylogenetic signals were distinguished in both datasets, as were regions of sequence that contained no clear signal or potentially misleading signals related to compositional similarities. In the tobamovirus dataset, contradictory phylogenetic signals were separated by coding sequences up to a kilobase long that contained no clear signal. Our re-analysis of this dataset using sister-scanning also yielded the first evidence known to us of an inter-species recombination site within a viral RNA-dependent RNA polymerase gene together with evidence of an unusual pattern of conservation in the three codon positions.

18.
Comparative ab initio prediction of gene structures using pair HMMs   (total citations: 3; self-citations: 0; citations by others: 3)
We present a novel comparative method for the ab initio prediction of protein coding genes in eukaryotic genomes. The method simultaneously predicts the gene structures of two un-annotated input DNA sequences which are homologous to each other and retrieves the subsequences which are conserved between the two DNA sequences. It is capable of predicting partial, complete and multiple genes and can align pairs of genes which differ by events of exon-fusion or exon-splitting. The method employs a probabilistic pair hidden Markov model. We generate annotations using our model with two different algorithms: the Viterbi algorithm in its linear memory implementation and a new heuristic algorithm, called the stepping stone, for which both memory and time requirements scale linearly with the sequence length. We have implemented the model in a computer program called DOUBLESCAN. In this article, we introduce the method and confirm the validity of the approach on a test set of 80 pairs of orthologous DNA sequences from mouse and human. More information can be found at: http://www.sanger.ac.uk/Software/analysis/doublescan/
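
For readers unfamiliar with the Viterbi algorithm mentioned above, the following Python sketch shows the textbook version for an ordinary single-sequence HMM. It is deliberately much simpler than what the abstract describes: it is not a pair HMM, it is not the linear-memory variant, and it is not the stepping stone heuristic. The two-state model and all names are made up for illustration.

```python
import math


def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most probable state path for `obs`, computed in log space."""
    score = {
        s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states
    }
    back = []  # one back-pointer table per observation after the first
    for o in obs[1:]:
        new_score, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda p: score[p] + math.log(trans_p[p][s]))
            new_score[s] = (
                score[prev] + math.log(trans_p[prev][s]) + math.log(emit_p[s][o])
            )
            pointers[s] = prev
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: score[s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Hypothetical toy model: two states that each prefer one symbol.
states = ("A", "B")
start_p = {"A": 0.5, "B": 0.5}
trans_p = {"A": {"A": 0.9, "B": 0.1}, "B": {"A": 0.1, "B": 0.9}}
emit_p = {"A": {"x": 0.9, "y": 0.1}, "B": {"x": 0.1, "y": 0.9}}
print("".join(viterbi("xxxyyy", states, start_p, trans_p, emit_p)))  # AAABBB
```

This version stores a back-pointer for every position and state; for a pair HMM over two sequences the same bookkeeping grows with the product of their lengths, which is the cost that memory-reduced variants are designed to avoid.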

19.
In vitro molecular circuits, based on DNA-programmable chemistries, can perform an increasing range of high-level functions, such as molecular level computation, image or chemical pattern recognition and pattern generation. Most reported demonstrations, however, can only accept nucleic acids as input signals. Real-world applications of these programmable chemistries critically depend on strategies to interface them with a variety of non-DNA inputs, in particular small biologically relevant chemicals. We introduce here a general strategy to interface DNA-based circuits with non-DNA signals, based on input-translating modules. These translating modules contain a DNA response part and an allosteric protein sensing part, and use a simple design that renders them fully tunable and modular. They can be repurposed to either transmit or invert the response associated with the presence of a given input. By combining these translating-modules with robust and leak-free amplification motifs, we build sensing circuits that provide a fluorescent quantitative time-response to the concentration of their small-molecule input, with good specificity and sensitivity. The programmability of the DNA layer can be leveraged to perform DNA based signal processing operations, which we demonstrate here with logical inversion, signal modulation and a classification task on two inputs. The DNA circuits are also compatible with standard biochemical conditions, and we show the one-pot detection of an enzyme through its native metabolic activity. We anticipate that this sensitive small-molecule-to-DNA conversion strategy will play a critical role in the future applications of molecular-level circuitry.

20.
The synaptonemal complex (SC) isolated from spermatocyte nuclei by exhaustive hydrolysis of the latter with DNase II contains tightly associated DNA sequences (SCAR DNA). Here we studied the compositional properties of a cloned family of SCAR DNA of golden hamster; specifically, we localized 27 SCAR DNA clones on compositionally fractionated genomic DNA from golden hamster. We observed that sequences of the SCAR DNA family are mainly localized in the GC-poor isochore families L1 and L2, which accounted for 63% of the hybridization signals. This means that 37% of the signals fall in the GC-rich isochores, indicating that SCAR DNA is present throughout the genome, even though the isochore families differ in density and sequence type. Moreover, SCAR DNA sequences containing regions of homology with LINE/SINE repeats were observed in all the isochore families. The compositional localization of SCAR DNA is in agreement with the hypothesis that SC and SCAR DNA participate in chromatin organization during meiotic prophase I, which should result in the attachment of chromatin loops to lateral elements of the SC along the whole length of the latter.
