Similar articles
 Found 20 similar articles (search time: 31 ms)
1.
Efficient detection of unusual words.   Total citations: 3 (self-citations: 0, citations by others: 3)
Words that are, by some measure, over- or underrepresented in the context of larger sequences have been variously implicated in biological functions and mechanisms. In most approaches to such anomaly detection, the words (up to a certain length) are enumerated more or less exhaustively and are individually checked in terms of observed and expected frequencies, variances, and scores of discrepancy and significance thereof. Here we take the global approach of annotating the suffix tree of a sequence with some such values and scores, having in mind to use it as a collective detector of all unexpected behaviors, or perhaps just as a preliminary filter for words suspicious enough to undergo a more accurate scrutiny. We consider in depth the simple probabilistic model in which sequences are produced by a random source emitting symbols from a known alphabet independently and according to a given distribution. Our main result consists of showing that, within this model, full tree annotations can be carried out in a time-and-space optimal fashion for the mean, variance and some of the adopted measures of significance. This result is achieved by an ad hoc embedding in statistical expressions of the combinatorial structure of the periods of a string. Specifically, we show that the expected value and variance of all substrings in a given sequence of n symbols can be computed and stored in (optimal) O(n^2) overall worst-case, O(n log n) expected time and space. The O(n^2) time bound constitutes an improvement by a linear factor over direct methods. Moreover, we show that under several accepted measures of deviation from expected frequency, the candidate over- or underrepresented words are restricted to the O(n) words that end at internal nodes of a compact suffix tree, as opposed to the Θ(n^2) possible substrings.
This surprising fact is a consequence of properties of the following form: if a word that ends in the middle of an arc is, say, overrepresented, then its extension to the nearest node of the tree is even more so. Based on this, we design global detectors of favored and unfavored words for our probabilistic framework in overall linear time and space, discuss related software implementations and display the results of preliminary experiments.
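
The quantities named in this abstract can be illustrated with a small direct computation. The sketch below is not the suffix-tree method of the paper; it is a naive per-word illustration under the same i.i.d. source model. The function names (`expected_count`, `observed_count`, `periods`) and the symbol-probability dictionary are assumptions introduced here. The expected count follows from linearity of expectation; the variance is omitted because it depends on the word's periods, which is where the combinatorial structure mentioned above comes in.

```python
from math import prod


def expected_count(word, n, probs):
    """Expected number of (possibly overlapping) occurrences of `word`
    in a random text of length n whose symbols are drawn independently
    with probabilities `probs` (a dict: symbol -> probability)."""
    m = len(word)
    if m > n:
        return 0.0
    # There are n - m + 1 start positions, each matching with the same probability.
    return (n - m + 1) * prod(probs[c] for c in word)


def observed_count(word, text):
    """Number of overlapping occurrences of `word` in `text`."""
    m = len(word)
    return sum(1 for i in range(len(text) - m + 1) if text[i:i + m] == word)


def periods(word):
    """All d with 0 < d < len(word) such that the word overlaps itself
    when shifted by d positions."""
    m = len(word)
    return [d for d in range(1, m) if word[d:] == word[:m - d]]
```

For example, `expected_count("AA", 10, {"A": 0.5, "C": 0.5})` evaluates the nine possible start positions at probability 0.25 each. A word with a nonempty list of periods, such as `"ABAB"`, can overlap itself, which is what makes its count variance differ from that of a non-overlapping word.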

2.

Background  

Sequence comparison by alignment is a fundamental tool of molecular biology. In this paper we show how a number of sequence comparison tasks, including the detection of unique genomic regions, can be accomplished efficiently without an alignment step. Our procedure for nucleotide sequence comparison is based on shortest unique substrings. These are substrings which occur only once within the sequence or set of sequences analysed and which cannot be further reduced in length without losing the property of uniqueness. Such substrings can be detected using generalized suffix trees.
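
The definition of a shortest unique substring given above can be made concrete with a brute-force sketch. This is not the generalized suffix tree procedure the abstract refers to; it enumerates all substrings and is only practical for short inputs. The function name `shortest_unique_substrings` is an assumption introduced here.

```python
from collections import Counter


def shortest_unique_substrings(seq):
    """Return (start, substring) pairs for substrings that occur exactly once
    in `seq` and stop being unique if a character is dropped from either end."""
    n = len(seq)
    # Count every occurrence of every substring (quadratic number of substrings).
    counts = Counter(seq[i:j] for i in range(n) for j in range(i + 1, n + 1))
    result = []
    for i in range(n):
        for j in range(i + 1, n + 1):
            if counts[seq[i:j]] != 1:
                continue
            # seq[i:j] is the shortest unique substring starting at i.
            # It is minimal only if trimming either end gives a repeated
            # (or empty) string.
            trimmed = (seq[i + 1:j], seq[i:j - 1])
            if all(t == "" or counts[t] > 1 for t in trimmed):
                result.append((i, seq[i:j]))
            # Any longer substring starting at i contains this one as a prefix,
            # so it can be reduced and need not be examined.
            break
    return result
```

For `"ABAB"` the only such substring is `"BA"` at position 1: `"ABA"` is unique as well, but it can be shortened to `"BA"` without losing uniqueness.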

3.
The ability of DNA sequences to adopt unusual structures under superhelical torsional stress has been studied. Sequences that are forced to adopt an unusual conformation in topologically constrained pBR322 form V DNA (Lk = 0) were mapped using restriction enzymes as probes. Restriction enzymes such as BamHI, PstI, AvaI and HindIII could not cleave their recognition sequences. The removal of topological constraint relieved this inhibition. The influence of neighbouring sequences on the ability of a given sequence to adopt an unusual DNA structure, presumably the left-handed Z conformation, was studied through single hit analysis. Using multiple cut restriction enzymes such as NarI and FspI, it could be shown that under identical topological strain, the extent of structural alteration is greatly influenced by the neighbouring sequences. In the light of the variety of sequences and locations that could be mapped to adopt non-B conformation in pBR322 form V DNA, restriction enzymes appear as potential structural probes for natural DNA sequences.

4.
5.
6.
High-throughput sequencing techniques are becoming attractive to molecular biologists and ecologists as they provide a time- and cost-effective way to explore diversity patterns in environmental samples at an unprecedented resolution. An issue common to many studies is the definition of what fractions of a data set should be considered as rare or dominant. Yet this question has not been satisfactorily addressed, nor has the impact of such a definition on data set structure and interpretation been fully evaluated. Here we propose a strategy, MultiCoLA (Multivariate Cutoff Level Analysis), to systematically assess the impact of various abundance or rarity cutoff levels on the resulting data set structure and on the consistency of the further ecological interpretation. We applied MultiCoLA to a 454 massively parallel tag sequencing data set of V6 ribosomal sequences from marine microbes in temperate coastal sands. Consistent ecological patterns were maintained after removing up to 35–40% rare sequences and similar patterns of beta diversity were observed after denoising the data set by using a preclustering algorithm of 454 flowgrams. This example validates the importance of exploring the impact of the definition of rarity in large community data sets. Future applications can be foreseen for data sets from different types of habitats, e.g. other marine environments, soil and human microbiota.

7.

Background

Various approaches to alignment-free sequence comparison are based on the length of exact or inexact word matches between pairs of input sequences. Haubold et al. (J Comput Biol 16:1487–1500, 2009) showed how the average number of substitutions per position between two DNA sequences can be estimated based on the average length of exact common substrings.

Results

In this paper, we study the length distribution of k-mismatch common substrings between two sequences. We show that the number of substitutions per position can be accurately estimated from the position of a local maximum in the length distribution of their k-mismatch common substrings.
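
A length distribution of k-mismatch common substrings can be computed naively as sketched below. The exact definition used in the paper is not given in this abstract, so the sketch assumes one plausible reading: for each position in the first sequence, take the longest stretch that matches some stretch of the second sequence with at most k mismatches. The function names are assumptions introduced here, and the quadratic scan is for illustration only.

```python
from collections import Counter


def kmismatch_extension(a, b, i, j, k):
    """Length of the longest common stretch starting at a[i] and b[j]
    that contains at most k mismatches."""
    mismatches = 0
    length = 0
    while i + length < len(a) and j + length < len(b):
        if a[i + length] != b[j + length]:
            mismatches += 1
            if mismatches > k:
                break
        length += 1
    return length


def longest_kmismatch_lengths(a, b, k):
    """For every position i in `a`, the maximum k-mismatch extension over
    all start positions j in `b` (0 if `b` is empty)."""
    return [
        max((kmismatch_extension(a, b, i, j, k) for j in range(len(b))), default=0)
        for i in range(len(a))
    ]


def length_distribution(lengths):
    """Histogram mapping each length to the number of positions with that length."""
    return Counter(lengths)
```

With k = 0 this reduces to exact common substrings, the case used by the earlier estimator cited in the Background paragraph; the position of a local maximum in `length_distribution(...)` is the feature the Results paragraph refers to.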

8.
Accumulating molecular data, particularly complete organellar genome sequences, continue to advance our understanding of the evolution of mitochondrial and chloroplast DNAs. Although the notion of a single primary origin for each organelle has been reinforced, new models have been proposed that tie the acquisition of mitochondria more closely to the origin of the eukaryotic cell per se than is implied by classic endosymbiont theory. The form and content of the ancestral proto-mitochondrial and proto-chloroplast genomes are becoming clearer but unusual patterns of organellar genome structure and organization continue to be discovered. The 'single-gene circle' arrangement recently reported for dinoflagellate chloroplast genomes is a notable example of a highly derived organellar genome.

9.
Larvae of the deep-sea lanternfish genus Hygophum (Myctophidae) exhibit a remarkable morphological diversity that is quite unexpected, considering their homogeneous adult morphology. In an attempt to elucidate the evolutionary patterns of such larval morphological diversity, nucleotide sequences of a portion of the mitochondrially encoded 16S ribosomal RNA gene were determined for seven Hygophum species and three outgroup taxa. Secondary structure-based alignment resulted in a character matrix consisting of 1172 bp of unambiguously aligned sequences, which were subjected to phylogenetic analyses using maximum-parsimony, maximum-likelihood, and neighbor-joining methods. The resultant tree topologies from the three methods were congruent, with most nodes, including that of the genus Hygophum, being strongly supported by various tree statistics. The most parsimonious reconstruction of the three previously recognized, distinct larval morphs onto the molecular phylogeny revealed that one of the morphs had originated as the common ancestor of the genus, the other two having diversified separately in two subsequent major clades. The patterns of such diversification are discussed in terms of the unusual larval eye morphology and geographic distribution.

10.
Unusual chromosome architecture and behaviour at an HSR   Total citations: 2 (self-citations: 0, citations by others: 2)
Sullivan BA, Bickmore WA. Chromosoma, 2000, 109(3): 181-189
Amplification of sequences within mammalian chromosomes is often accompanied by the formation of homogeneously staining regions (HSRs). The arrangement of DNA sequences within such amplicons has been investigated, but little is known about the chromosome structure or behaviour of these unusual regions. We have analysed the metaphase chromosome structure of the dihydrofolate reductase (DHFR) amplicon of CHOC400 cells. The chromatin in this region contains hyperacetylated nucleosomes yet, at the same time, appears to be densely packed like heterochromatin. The region does not bind heterochromatin proteins. We show that the dense packing of the region is restricted to DNA located close to the chromosome core/scaffold. In contrast, levels of the chromosome scaffold protein topoisomerase II at HSRs are the same as those found at other euchromatic locations. Metaphase chromosome condensation of the HSR is shown to be sensitive to topoisomerase II inhibitors, and sister chromatids often appear to remain attached within the HSRs at metaphase. We suggest that these features underlie anaphase bridging and the aberrant interphase structure of the HSR. The DHFR amplicon is widely used as a model system to study mammalian DNA replication. We conclude that the higher-order chromosome structure of this amplicon is unusual and suggest that caution needs to be exercised in extrapolating data from HSRs to normal chromosomal loci. Received: 19 October 1999; in revised form: 13 December 1999 / Accepted: 27 December 1999

11.
On the statistical assessment of similarities in DNA sequences   Total citations: 3 (self-citations: 2, citations by others: 1)
The statistical behavior of the similarity score for unrelated DNA sequences, calculated by letter-by-letter comparison or from various forms of optimal alignment, was studied. It was found that natural DNA sequences from a database and true random sequences show the same statistical behavior in terms of such scores. This makes it possible to adopt a simple criterion for the rejection of fortuitous similarity. It is based on the mean and standard deviation of chance scores whose expected values, depending on chain length, gap penalty and probability of letter coincidence, may be calculated from formulae given in the paper.
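
The rejection criterion described here can be illustrated for the simplest case only: an ungapped, letter-by-letter comparison of two sequences of equal length. The paper's own formulae (which also cover gap penalties and optimal alignment) are not reproduced in this abstract, so the sketch below assumes the elementary binomial model instead, in which each position matches independently with coincidence probability p. All function names are assumptions introduced here.

```python
from math import sqrt


def coincidence_probability(freq_a, freq_b):
    """Probability that two independently drawn letters coincide, given
    letter-frequency dicts for each sequence."""
    return sum(freq_a.get(c, 0.0) * freq_b.get(c, 0.0) for c in freq_a)


def chance_score_stats(n, p):
    """Mean and standard deviation of the number of matching positions
    among n independent positions (binomial model)."""
    return n * p, sqrt(n * p * (1.0 - p))


def identity_z_score(x, y, p):
    """Number of standard deviations by which the observed match count
    of two equal-length sequences exceeds the chance expectation."""
    if len(x) != len(y):
        raise ValueError("letter-by-letter comparison needs equal lengths")
    matches = sum(1 for a, b in zip(x, y) if a == b)
    mean, sd = chance_score_stats(len(x), p)
    return (matches - mean) / sd
```

A similarity would then be treated as fortuitous when the z-score stays below a chosen threshold; the threshold itself is a judgment call and is not taken from the source.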

12.
It is generally thought that skilled behavior in human beings results from a functional hierarchy of the motor control system, within which reusable motor primitives are flexibly integrated into various sensori-motor sequence patterns. The underlying neural mechanisms governing the way in which continuous sensori-motor flows are segmented into primitives and the way in which series of primitives are integrated into various behavior sequences have, however, not yet been clarified. In earlier studies, this functional hierarchy has been realized through the use of explicit hierarchical structure, with local modules representing motor primitives in the lower level and a higher module representing sequences of primitives switched via additional mechanisms such as gate-selecting. When sequences contain similarities and overlap, however, a conflict arises in such earlier models between generalization and segmentation, induced by this separated modular structure. To address this issue, we propose a different type of neural network model. The current model neither makes use of separate local modules to represent primitives nor introduces explicit hierarchical structure. Rather than forcing architectural hierarchy onto the system, functional hierarchy emerges through a form of self-organization that is based on two distinct types of neurons, each with different time properties ("multiple timescales"). Through the introduction of multiple timescales, continuous sequences of behavior are segmented into reusable primitives, and the primitives, in turn, are flexibly integrated into novel sequences. In experiments, the proposed network model, coordinating the physical body of a humanoid robot through high-dimensional sensori-motor control, also successfully situated itself within a physical environment. 
Our results suggest that it is not only the spatial connections between neurons but also the timescales of neural activity that act as important mechanisms leading to functional hierarchy in neural systems.

13.
14.
Prevalence of quadruplexes in the human genome   Total citations: 28 (self-citations: 17, citations by others: 11)
Guanine-rich DNA sequences of a particular form have the ability to fold into four-stranded structures called G-quadruplexes. In this paper, we present a working rule to predict which primary sequences can form this structure, and describe a search algorithm to identify such sequences in genomic DNA. We count the number of quadruplexes found in the human genome and compare that with the figure predicted by modelling DNA as a Bernoulli stream or as a Markov chain, using windows of various sizes. We demonstrate that the distribution of loop lengths is significantly different from what would be expected in a random case, providing an indication of the number of potentially relevant quadruplex-forming sequences. In particular, we show that there is a significant repression of quadruplexes in the coding strand of exonic regions, which suggests that quadruplex-forming patterns are disfavoured in sequences that will form RNA.
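
A pattern search of the kind described can be sketched with a regular expression. The abstract does not spell out the working rule, so the pattern below is an assumption: four runs of at least three guanines separated by three loops of one to seven bases, a form commonly quoted for putative quadruplex sequences. The constant `G4_PATTERN` and the function `find_quadruplex_candidates` are names introduced here, and only the given strand is scanned.

```python
import re

# Assumed rule: G-tract (>= 3 G) + loop (1-7 nt), three times, then a final G-tract.
G4_PATTERN = re.compile(r"G{3,}(?:[ACGT]{1,7}G{3,}){3}")


def find_quadruplex_candidates(seq):
    """Return (start, end, matched_sequence) for non-overlapping candidate
    quadruplex-forming stretches on the given strand."""
    seq = seq.upper()
    return [(m.start(), m.end(), m.group()) for m in G4_PATTERN.finditer(seq)]
```

Scanning the complementary strand would need the reverse complement of the input (or the equivalent C-rich pattern). Counting overlapping candidates would need a lookahead-based pattern rather than `finditer` on the plain expression.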

15.
Prokaryotic, eukaryotic and mitochondrial DNA sequences with a total length of 300,000 nucleotides have been analyzed to find out whether stretches of alternating purines and pyrimidines are unusual in terms of occurrence, composition and base sequence. Alternating runs longer than 5 nucleotides are significantly under-represented in the natural sequences as compared to random ones. Octanucleotides are the most deficient, occurring at only 60% of the frequency expected in random sequences. An unexpectedly high proportion of these octamers consists of alternating tetramers with the repeat structure (PuPyPuPy)2 or (PyPuPyPu)2. DNA stretches containing such sequences can potentially form an S1 nuclease-sensitive slippage (staggered loop) structure, which might serve as a locally unstacked intermediate in the B- to Z-DNA conformational transition.
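
Locating alternating purine-pyrimidine stretches is a simple linear scan, sketched below. The function name `alternating_runs` and the default threshold of six nucleotides (chosen to match "longer than 5") are assumptions introduced here. Any character other than A or G is treated as a pyrimidine, so ambiguity codes such as N would need to be filtered out beforehand.

```python
PURINES = frozenset("AG")


def alternating_runs(seq, min_len=6):
    """Return (start, run) pairs for maximal stretches in which purines
    and pyrimidines strictly alternate and whose length is >= min_len."""
    seq = seq.upper()
    runs = []
    start = 0
    for i in range(1, len(seq) + 1):
        # A run ends at the end of the sequence or where two neighbouring
        # bases belong to the same class (both purine or both pyrimidine).
        at_end = i == len(seq)
        if at_end or (seq[i] in PURINES) == (seq[i - 1] in PURINES):
            if i - start >= min_len:
                runs.append((start, seq[start:i]))
            start = i
    return runs
```

Comparing the number of runs found in a natural sequence with the number found in shuffled copies of the same sequence gives a rough version of the under-representation test described above.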

16.
We present a computer-aided approach for identifying and aligning consensus secondary structure within a set of functionally related oligonucleotide sequences aligned by sequence. The method relies on visualization of secondary structure using a generalization of the dot matrix representation appropriate for consensus sequence data sets. An interactive computer program implementing such a visualization of consensus structure has been developed. The program allows for alignment editing, data and display filtering and various modes of base pair representation, including co-variation. The utility of this approach is demonstrated with four sample data sets derived from in vitro selection experiments and one data set comprising tRNA sequences.

17.
PcoC is a soluble periplasmic protein encoded by the plasmid-borne pco copper resistance operon of Escherichia coli. Like PcoA, a multicopper oxidase encoded in the same locus and its chromosomal homolog CueO, PcoC contains unusual methionine-rich sequences. Although essential for copper resistance, the functions of PcoC, PcoA, and their conserved methionine-rich sequences are not known. Similar methionine motifs observed in eukaryotic copper transporters have been proposed to bind copper, but there are no precedents for such metal binding sites in structurally characterized proteins. The high-resolution structures of apo PcoC, determined for both the native and selenomethionine-containing proteins, reveal a seven-stranded beta barrel with the methionines unexpectedly housed on a solvent-exposed loop. Several potential metal-binding sites can be discerned by comparing the structures to spectroscopic data reported for copper-loaded PcoC. In the native structure, the methionine loop interacts with the same loop on a second molecule in the asymmetric unit. In the selenomethionine structure, the methionine loops are more exposed, forming hydrophobic patches on the protein surface. These two arrangements suggest that the methionine motifs might function in protein-protein interactions between PcoC molecules or with other methionine-rich proteins such as PcoA. Analytical ultracentrifugation data indicate that a weak monomer-dimer equilibrium exists in solution for the apo protein. Dimerization is significantly enhanced upon binding Cu(I) with a measured Δ(ΔG°)

18.
For the realization of a practical high-throughput protein detection and analysis system, a novel peptide array has been constructed using a designed glycopeptide model library with an α-helical secondary structure. This study will help increase the diversity of such array systems and extend their application to focused proteomics and ligand screening through effective detection of sugar-binding proteins. Fluorescent glycopeptides with an α-helix, a β-strand, or a loop structure were designed initially to select a suitable scaffold for the detection of a model protein. After selection of the α-helical structure as the best scaffold, a small model library with various saccharides was constructed to have charge and hydrophobicity variations in the peptide sequences. When various sugar-binding proteins were added to the peptide library array, the fluorescent peptides showed different responses in fluorescence intensities depending on their sequences as well as saccharides. The patterns of these responses could be regarded as “protein fingerprints” (PFPs), which are able to establish the identities of the target proteins. The resulting PFPs reflected the recognition properties of the proteins. Furthermore, statistical analysis of the obtained PFPs was performed using cluster analysis. The PFPs of sugar-binding proteins were clustered successfully depending on their families and binding properties. These studies demonstrate that arrays with glycopeptide libraries based on designed structures can be promising tools to detect and analyze the target proteins. Designed peptides with functional groups such as sugars will play roles as the capturing agents of high-throughput protein nano/micro arrays for focused proteomics and ligand screening studies.

19.
MOTIVATION: To devise a method that, unlike available methods, directly measures variations in phylogenetic signals in gene sequences that result from recombination, tests the significance of the signal variations and distinguishes misleading signals. RESULTS: We have developed a method that we call 'sister-scanning' for assessing phylogenetic and compositional signals in the various patterns of identity that occur between four nucleotide sequences. A Monte Carlo randomization is done for all columns (positions) within a window and Z-scores are obtained for four real sequences or three real sequences with an outlier that is also randomized. The usefulness of the approach is demonstrated using tobamovirus and luteovirus sequences. Contradictory phylogenetic signals were distinguished in both datasets, as were regions of sequence that contained no clear signal or potentially misleading signals related to compositional similarities. In the tobamovirus dataset, contradictory phylogenetic signals were separated by coding sequences up to a kilobase long that contained no clear signal. Our re-analysis of this dataset using sister-scanning also yielded the first evidence known to us of an inter-species recombination site within a viral RNA-dependent RNA polymerase gene together with evidence of an unusual pattern of conservation in the three codon positions.

20.
A simple way to look at DNA   Total citations: 8 (self-citations: 1, citations by others: 8)
A method is presented for embedding nucleotide sequence data in a simple metric space. Computer graphical examination of spatially-represented sequences permits rapid searches for canonical patterns or interesting structures. Sequence comparisons are facilitated by plots of distance measures for homologous sequences, and the large-scale structure of the genetic code can be studied by measures such as fractal dimensionality.
