An accurate approximation is derived to the distribution of the length of the longest matching word present between two random DNA sequences of finite length, using only elementary probability arguments. The distribution is shown to be consistent with previous asymptotic results for the mean and variance of longest common words. The application of the distribution to assessing the statistical significance of sequence similarities is considered. It is shown how the distribution can be modified to take account of non-independence of neighbouring bases in real sequences.

An algorithm was developed to compare several DNA, RNA or protein sequences simultaneously. With the algorithm, conserved regions of one sequence are located by doing pairwise comparisons with other sequences, which is advantageous in planning site-directed mutagenesis studies. The observation matrices filled with the comparison scores are superimposed and added together, and those points having values greater than or equal to the stringency are accepted. The predicted secondary structural features can also be compared. Received on August 21, 1987; accepted on November 20, 1987.

Drosophila melanogaster telomeres contain arrays of two non-LTR retrotransposons called HeT-A and TART. Previous studies have shown that HeT-A- and TART-like sequences are also located at non-telomeric sites in the Y chromosome heterochromatin. By in situ hybridization experiments, we mapped TART sequences in the h16 region of the long arm close to the centromere of the Y chromosome of D. melanogaster. HeT-A sequences were localized in two different regions on the Y chromosome, one very close to the centromere in the short arm (h18-h19) and the other in the long arm (h13-h14). To assess a possible heterochromatic location of TART and HeT-A elements in other Drosophila species, we performed in situ hybridization experiments, using both TART and HeT-A probes, on mitotic and polytene chromosomes of D. simulans, D. sechellia, D. mauritiana, D. yakuba and D. teissieri. We found that TART and HeT-A probes hybridize at specific heterochromatic regions of the Y chromosome in all Drosophila species that we analyzed.

Summary. We examine in this paper one of the expected consequences of the hypothesis that modern proteins evolved from random heteropeptide sequences. Specifically, we investigate the lengthwise distributions of amino acids in a set of 1,789 protein sequences with little sequence identity using the run test statistic (r_o) of Mood (1940, Ann. Math. Stat. 11, 367–392). The probability density of r_o for a collection of random sequences has mean = 0 and variance = 1 [the N(0,1) distribution] and can be used to measure the tendency of amino acids of a given type to cluster together in a sequence relative to that of a random sequence. We implement the run test using binary representations of protein sequences in which the amino acids of interest are assigned a value of 1 and all others a value of 0. We consider individual amino acids and sets of various combinations of them based upon hydrophobicity (4 sets), charge (3 sets), volume (4 sets), and secondary structure propensity (3 sets). We find that any sequence chosen randomly has a 90% or greater chance of having a lengthwise distribution of amino acids that is indistinguishable from the random expectation regardless of amino acid type. We regard this as strong support for the random-origin hypothesis. However, we do observe significant deviations from the random expectation, as might be expected after billions of years of evolution. Two important global trends are found: (1) Amino acids with a strong α-helix propensity show a strong tendency to cluster whereas those with β-sheet or reverse-turn propensity do not. (2) Clustered rather than evenly distributed patterns tend to be preferred by the individual amino acids and this is particularly so for methionine. Finally, we consider the problem of reconciling the random nature of protein sequences with structurally meaningful periodic "patterns" that can be detected by sliding-window, autocorrelation, and Fourier analyses. Two examples, rhodopsin and bacteriorhodopsin, show that such patterns are a natural feature of random sequences.
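
The binary run test described in this abstract can be illustrated with a short sketch. It assumes the standard normal approximation to the number of runs in a two-symbol sequence (the Wald-Wolfowitz form); the exact definition and sign convention of r_o in the paper may differ, and the function name `runs_z_score` is hypothetical. Under the convention used here, a negative value means fewer runs than expected, i.e. clustering of the residues of interest.

```python
import math


def runs_z_score(sequence, residues_of_interest):
    """Standardized number of runs for a binary view of a protein sequence.

    Assumption: standard Wald-Wolfowitz normal approximation, not
    necessarily the exact r_o statistic used in the abstract.
    """
    # Binary representation: 1 for residues of interest, 0 otherwise.
    bits = [1 if aa in residues_of_interest else 0 for aa in sequence]
    n1 = sum(bits)
    n0 = len(bits) - n1
    n = n0 + n1
    if n1 == 0 or n0 == 0:
        return None  # the test is undefined with only one symbol present

    # A new run starts wherever two neighbouring symbols differ.
    runs = 1 + sum(1 for a, b in zip(bits, bits[1:]) if a != b)

    # Mean and variance of the run count for a random arrangement.
    mean = 2.0 * n1 * n0 / n + 1.0
    var = 2.0 * n1 * n0 * (2.0 * n1 * n0 - n) / (n * n * (n - 1.0))
    if var <= 0:
        return None
    return (runs - mean) / math.sqrt(var)
```

For example, `runs_z_score("AAAALLLL", {"A"})` is negative (the alanines are clustered), while `runs_z_score("ALALALAL", {"A"})` is positive (they alternate).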

The present work describes an attempt to identify reliable criteria which could be used as distance indices between protein sequences. Seven different criteria have been tested: i and ii) the scores of the alignments as given by the BESTFIT and the FASTA programs; iii) the ratio parameter, i.e. the BESTFIT score divided by the length of the aligned peptides; iv and v) the statistical significance (Z-scores) of the scores calculated by BESTFIT and FASTA, as obtained by comparison with shuffled sequences; vi) the Z-scores provided by the program RELATE, which performs a segment-by-segment comparison of 2 sequences; and vii) an original distance index calculated by the program DOCMA from all the pairwise dotplots between the sequences. These 7 criteria have been tested against the amino acid sequences of 39 globins and those of the 20 aminoacyl-tRNA synthetases from E. coli. The distances between the sequences were analyzed by multivariate analysis techniques. The results show that the distances calculated from the scores of the pairwise alignments are not adequately sensitive. The Z-score from RELATE is not selective enough and too demanding in computer time. Three criteria gave a classification consistent with the known similarities between the sequences in the sets, namely the Z-scores from BESTFIT and FASTA and the multiple dotplot comparison distance index from DOCMA.
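
The idea of a Z-score obtained by comparison with shuffled sequences can be sketched as follows. This is a minimal illustration, not the BESTFIT or FASTA procedure: the similarity score used here is simply the length of the longest common word, chosen as a stand-in because it is easy to compute, and the names `longest_common_word` and `shuffle_z_score` as well as the default of 200 shuffles are assumptions made for the example.

```python
import random
import statistics


def longest_common_word(a, b):
    """Length of the longest contiguous word shared by a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0] * (len(b) + 1)
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                # Extend the common word ending at the previous positions.
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best


def shuffle_z_score(a, b, score=longest_common_word, n_shuffles=200, seed=0):
    """Z-score of score(a, b) against scores for shuffled copies of b.

    Stand-in scoring function and shuffle count are illustrative only.
    """
    rng = random.Random(seed)
    observed = score(a, b)
    letters = list(b)
    null_scores = []
    for _ in range(n_shuffles):
        rng.shuffle(letters)  # preserves composition, destroys order
        null_scores.append(score(a, "".join(letters)))
    mean = statistics.mean(null_scores)
    sd = statistics.pstdev(null_scores)
    if sd == 0:
        return None  # no spread in the null scores, Z-score undefined
    return (observed - mean) / sd
```

Shuffling keeps the composition of the second sequence fixed, so the Z-score measures how much of the observed score is due to residue order rather than composition alone.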

A method was previously developed for computation of pattern probabilities in random sequences under Markov chain models. We extend this method to the calculation of the joint distribution for two patterns. An application yields the distribution of the right choice measure for expressivity and how significance bounds depend on sequence length. These bounds are used to show that the choice of pyrimidine in codon position 3 of Escherichia coli genes deviates considerably from a general Markov process model for coding regions. We also derive some statistical evidence that this significant deviation is limited to codon position 3.

In this paper, we give an overview of the different results existing on the statistical distribution of word counts in a Markovian sequence of letters. Results concerning the number of overlapping occurrences, the number of renewals and the number of clumps will be presented. Counts of single words and also multiple words are considered. Most of the results are approximations as the length of the sequence tends to infinity. We will see that Gaussian approximations switch to (compound) Poisson approximations for rare words. Modeling DNA sequences or proteins by stationary Markov chains, these results can be used to study the statistical frequency of motifs in a given sequence.
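
The three counts named in this abstract differ only in how overlapping occurrences of a word are treated. A minimal sketch, assuming the usual definitions (an occurrence is any position where the word matches; a renewal is an occurrence that does not overlap the previously counted renewal; a clump is a maximal chain of mutually overlapping occurrences) and a hypothetical function name `word_count_statistics`:

```python
def word_count_statistics(sequence, word):
    """Return (overlapping occurrences, renewals, clumps) of word."""
    k = len(word)
    starts = [i for i in range(len(sequence) - k + 1)
              if sequence[i:i + k] == word]

    # Renewals: scan left to right and skip any occurrence that
    # overlaps the last occurrence that was counted.
    renewals = 0
    next_free = 0
    for s in starts:
        if s >= next_free:
            renewals += 1
            next_free = s + k

    # Clumps: a new clump begins whenever an occurrence does not
    # overlap the occurrence immediately before it.
    clumps = 0
    prev = None
    for s in starts:
        if prev is None or s >= prev + k:
            clumps += 1
        prev = s

    return len(starts), renewals, clumps
```

For the self-overlapping word `AA` in `AAAAA`, this gives 4 overlapping occurrences, 2 renewals and 1 clump, which shows why the three counts need separate distributional results.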

We present a theory for cooperative chiral order in the transition between right-handed B-DNA and left-handed Z-DNA. This theory, based on the random-field Ising model, predicts the characteristic length scale of Z-DNA segments. This length scale depends on whether the DNA is a homopolymer or a random sequence: it is approximately 4000 nucleotides in a homopolymer but only approximately 25 nucleotides in a random sequence. These theoretical results are consistent with experiments on DNA homopolymers and random sequences.

Protein-protein interactions are fundamentally important in many biological processes and there is a pressing need to understand the principles of protein-protein interactions. Mutagenesis studies have found that only a small fraction of surface residues, known as hot spots, are responsible for the physical binding in protein complexes. However, revealing hot spots by mutagenesis experiments is usually time consuming and expensive. In order to complement the experimental efforts, we propose a new computational approach in this paper to predict hot spots. Our method, Rough Set-based Multiple Criteria Linear Programming (RS-MCLP), integrates rough sets theory and multiple criteria linear programming to choose dominant features and computationally predict hot spots. Our approach is benchmarked by a dataset of 904 alanine-mutated residues and the results show that our RS-MCLP method performs better than other methods, e.g., MCLP, Decision Tree, Bayes Net, and the existing HotSprint database. In addition, we reveal several biological insights based on our analysis. We find that four features (the change of accessible surface area, percentage of the change of accessible surface area, size of a residue, and atomic contacts) are critical in predicting hot spots. Furthermore, we find that three residues (Tyr, Trp, and Phe) are abundant in hot spots through analyzing the distribution of amino acids.

Identifying and predicting the structural characteristics of novel repeats throughout the genome can lend insight into biological function. Specific repeats are believed to have biological significance as a function of their distribution patterns. We have developed 'GenomeMark,' a computer program that detects and statistically analyzes candidate repeats. Specifically, 'GenomeMark' identifies the periodic distribution of unique words, calculating their chi-squared and Z-score values. Using 'GenomeMark,' we identified novel sequence words present in tandem throughout genomes. We found that these sequences have remarkable spacer sequence distributions and many were genome specific, validating the genome signature theory. Further analysis confirmed that many of these sequences have a specific biological function. The program is available from the authors upon request and is freely available for non-commercial and academic entities.

We present data-analytic and statistical tools for studying rates of rearrangement of whole genomes and for assessing the stability of these methods with changes in the level of resolution of the genomic data. We construct datasets on the numbers of conserved syntenies and conserved segments shared by pairs of animal genomes at different levels of resolution. We fit these data to an evolutionary tree and find the rates of rearrangement on various evolutionary lineages. We document the lack of clocklike behavior of rearrangement processes, the independence of translocation and inversion rates, and the level of resolution beyond which translocation rates are lost in noise due to other processes.

Distribution data for 111 mainly sclerophyll forest tree species in southern and eastern Australia were analysed to detect possible phytogeographic provinces. The suitability of fourteen methods of numerical classification was assessed. Overall, the Kulczynski (1927) similarity coefficient together with the flexible clustering strategy (β = 0.25) (Lance & Williams 1967) and the asymmetric information statistic (Dale, Lance & Albrecht 1971) were considered to produce the most satisfactory results.

Integrated sequences of mouse mammary tumor virus (MMTV) have been localized in the genomes of five inbred mouse strains (Balb/c, C3H, DBA/2, A.TH, 129-SV) and one mammary tumor cell line (GR). Two major classes of MMTV sequences have been detected in mouse DNA fractions as obtained by Cs2SO4/BAMD (3,6-bis-(acetatomercurimethyl)dioxane) density gradient centrifugation. The first one corresponds to previously described endogenous sequences (Mtv loci), whereas the second one corresponds to endogenous sequences not previously known, and/or recently acquired; in the case of GR cells exogenous sequences may also be present in this class. The genome distribution is somewhat different for the two classes of sequences, the first one being practically only present in the lightest DNA segments of the mouse genome (GC ≅ 38%); the second one being also represented in heavier segments (GC ≅ 43%). This integration pattern suggests that "ancient" endogenous sequences are practically only localized in genome segments of roughly matching composition, whereas exogenous and recently acquired endogenous MMTV sequences may also be present in heavier fractions.

The distribution of RNA motifs in natural sequences.
Functional analysis of genome sequences has largely ignored RNA genes and their structures. We introduce here the notion of 'ribonomics' to describe the search for the distribution of and eventually the determination of the physiological roles of these RNA structures found in the sequence databases. The utility of this approach is illustrated here by the identification in the GenBank database of RNA motifs having known binding or chemical activity. The frequency of these motifs indicates that most have originated from evolutionary drift and are selectively neutral. On the other hand, their distribution among species and their location within genes suggest that the destiny of these motifs may be more elaborate. For example, the hammerhead motif has a skewed organismal presence, is phylogenetically stable and recent work on a schistosome version confirms its in vivo biological activity. The under-representation of the valine-binding motif and the Rev-binding element in GenBank hints at a detrimental effect on cell growth or viability. Data on the presence and the location of these motifs may provide critical guidance in the design of experiments directed towards the understanding and the manipulation of RNA complexes and activities in vivo.

Studying the distribution of a motif along sequences may help in understanding its biological function or in detecting regions of interest. A statistical model is needed to assess the significance of the observed distribution. We propose a heterogeneous compound Poisson process to model the possibility of overlap between occurrences and some heterogeneity of the sequence known a priori. The estimation procedure of the parameters is described and tests of homogeneous sub-models are proposed. We also consider the detection of rich regions using either cumulated distances or moving intervals, via a homogenization technique. Illustrations of the method are given with applications to bacterial genomes.
