Similar articles
 20 similar articles found (search time: 31 ms)
1.
2.
The nucleotide (nt) sequences of inverted terminal repeats (ITR) from human adenovirus (Ad) 19, bovine Ad1 (BAd1), bovine Ad3 (BAd3), canine Ad2 (CAd2) and an avian Ad, EDS-76, were determined. The length of the ITR sequence was 160 bp in Ad19, 159 bp in BAd1, 195 bp in BAd3, 196 bp in CAd2 and 52 bp in EDS-76. CAd2 had the longest ITR among the examined Ads, BAd3 the second longest, and EDS-76 had the shortest ITR. A TAAT sequence located between the 10th and 13th nt counted from the ends was conserved in all Ads examined so far. To determine phylogenetic relationships among human and animal Ads, sequences of their ITRs were compared, and a phylogenetic tree was constructed by using the maximum-likelihood method. This method computes the probability of a particular set of sequences on a given tree and maximizes this probability over all evolutionary trees [Felsenstein, J. Mol. Evol. 17 (1981) 368-376]. From these analyses, it was found that members belonging to the same human Ad subgenus are related closely to each other, whereas representatives of different human subgenera are distributed rather divergently among animal Ads.

3.

Background  

In the last decade, there have been many applications of formal language theory in bioinformatics, such as RNA structure prediction and detection of patterns in DNA. However, in the field of proteomics, the size of the protein alphabet and the complexity of the relationships between amino acids have mainly limited the application of formal language theory to the production of grammars whose expressive power is not higher than that of stochastic regular grammars. These grammars, like other state-of-the-art methods, cannot cover any higher-order dependencies such as the nested and crossing relationships that are common in proteins. In order to overcome some of these limitations, we propose a Stochastic Context-Free Grammar-based framework for the analysis of protein sequences in which grammars are induced using a genetic algorithm.

4.

Background  

The comprehension of the gene regulatory code in eukaryotes is one of the major challenges of systems biology, and is a requirement for the development of novel therapeutic strategies for multifactorial diseases. Its bi-fold degeneration precludes brute force and statistical approaches based on the genomic sequence alone. Rather, recursive integration of systematic, whole-genome experimental data with advanced statistical regulatory sequence predictions needs to be developed. Such experimental approaches, as well as the prediction tools, are only starting to become available, and the increasing numbers of genome sequences and empirical sequence annotations are under continual discovery-driven change. Furthermore, given the complexity of the question, a multi-laboratory effort lasting a decade or more needs to be envisioned. These constraints need to be considered in the creation of a framework that can pave a road to successful comprehension of the gene regulatory code.

5.
Given the availability of complete genome sequences from related organisms, sequence conservation can provide important clues for predicting gene structure. In particular, one should be able to leverage information about known genes in one species to help determine the structures of related genes in another. Such an approach is appealing in that high-quality gene prediction can be achieved for newly sequenced species, such as mouse and puffer fish, using the extensive knowledge that has been accumulated about human genes. This article reports a novel approach to predicting the exon-intron structures of mouse genes by incorporating constraints from orthologous human genes using techniques that have previously been exploited in speech and natural language processing applications. The approach uses a context-free grammar to parse a training corpus of annotated human genes. A statistical training procedure produces a weighted recursive transition network (RTN) intended to capture the general features of a mammalian gene. This RTN is expanded into a finite state transducer (FST) and composed with an FST capturing the specific features of the human orthologue. This model includes a trigram language model on the amino acid sequence as well as exon length constraints. A final stage uses the free software package ClustalW to align the top n candidates in the search space. For a set of 98 orthologous human-mouse pairs, we achieved 96% sensitivity and 97% specificity at the exon level on the mouse genes, given only knowledge gleaned from the annotated human genome.

6.
7.
An accurate approximation is derived to the distribution of the length of the longest matching word present between two random DNA sequences of finite length, using only elementary probability arguments. The distribution is shown to be consistent with previous asymptotic results for the mean and variance of longest common words. The application of the distribution to assessing the statistical significance of sequence similarities is considered. It is shown how the distribution can be modified to take account of non-independence of neighbouring bases in real sequences.
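The statistic this abstract studies, the length of the longest word shared exactly by two sequences, can be computed with a standard dynamic-programming pass. The sketch below is not the paper's approximation (the abstract does not give it). It only computes the statistic and estimates its distribution by simulation, assuming independent, uniformly distributed bases. The function names `longest_common_word` and `empirical_distribution` are hypothetical.

```python
import random
from collections import Counter


def longest_common_word(a: str, b: str) -> int:
    """Length of the longest contiguous word occurring in both a and b."""
    best = 0
    # prev[j] = length of the common word ending at the previous base of a and b[j-1]
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0] * (len(b) + 1)
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best


def empirical_distribution(n: int, m: int, trials: int = 200, seed: int = 0) -> Counter:
    """Monte Carlo estimate of the distribution for random sequences of lengths n and m.

    Assumption: bases are independent and uniform over A, C, G, T.
    """
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(trials):
        a = "".join(rng.choice("ACGT") for _ in range(n))
        b = "".join(rng.choice("ACGT") for _ in range(m))
        counts[longest_common_word(a, b)] += 1
    return counts
```

An observed match length could then be compared against the simulated counts to judge, roughly, how surprising it is under this simple null model.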

8.
9.
Using two sets of nucleotide sequences of the human and simian T-cell leukemia/lymphoma virus type I (HTLV-I/STLV-I), one consisting of 522 bp of the env gene from 70 viral strains and the other a 140-bp segment from the pol gene of 52 viral strains, I estimated cladograms based on a statistical parsimony procedure that was developed specifically to estimate within-species gene trees. An extension of a nesting procedure is offered for sequence data that forms nested clades used in hypothesis testing. The nested clades were used to test three hypotheses relating to transmission of HTLV/STLV sequences: (1) Have cross-species transmissions occurred and, if so, how many? (2) In what direction have they occurred? (3) What are the geographic relationships of these transmission events? The analyses support a range of 11-16 cross-species transmissions throughout the history of these sequences. Additionally, outgroup weights were assigned to haplotypes using arguments from coalescence theory to infer directionality of transmission events. The geographic origins of transmission events and of particular viral strains could not be resolved, owing to small samples and inadequate sampling design. Finally, this approach is compared directly to results obtained from a traditional maximum parsimony approach and found to be superior at establishing relationships and identifying instances of transmission.

10.
Li W. Gene, 2001, 276(1-2): 57-72
The concept of homogeneity of G+C content is always relative and subjective. This point is emphasized and quantified in this paper using a simple example of one sequence segmented into two subsequences. Whether the sequence is homogeneous or not can be answered by whether the two-subsequence model describes the DNA sequence better than the one-sequence model. There are at least three equivalent ways of looking at the 1-to-2 segmentation: Jensen-Shannon divergence measure, log likelihood ratio test, and model selection using Bayesian information criterion. Once a criterion is chosen, a DNA sequence can be recursively segmented into multiple domains. We use one subjective criterion called segmentation strength based on the Bayesian information criterion. Whether or not a sequence is homogeneous and how many domains it has depend on this criterion. We compare six different genome sequences (yeast S. cerevisiae chromosomes III and IV, bacterium M. pneumoniae, human major histocompatibility complex sequence, longest contigs in human chromosomes 21 and 22) by recursive segmentations at different strength criteria. Results by recursive segmentation confirm that yeast chromosome IV is more homogeneous than yeast chromosome III, human chromosome 21 is more homogeneous than human chromosome 22, and bacterial genomes may not be homogeneous due to short segments with distinct base compositions. The recursive segmentation also provides a quantitative criterion for identifying isochores in human sequences. Some features of our recursive segmentation, such as the possibility of delineating domain borders accurately, are superior to those of the moving-window approach commonly used in such analyses.
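The 1-to-2 segmentation described above can be illustrated with the Jensen-Shannon divergence between the base compositions of the two candidate subsequences. The sketch below scans every cut point and returns the one that maximizes the divergence. It is a minimal illustration, not the paper's implementation: the abstract does not give the segmentation-strength criterion, so it is not reproduced here, and the names `entropy` and `best_split` are hypothetical. With entropies in bits, the divergence multiplied by the sequence length equals the base-2 log likelihood ratio of the two-composition model over the one-composition model, which is why the two views in the abstract coincide.

```python
import math
from collections import Counter


def entropy(counts: Counter, total: int) -> float:
    """Shannon entropy (bits) of the composition given by symbol counts."""
    return -sum((c / total) * math.log2(c / total) for c in counts.values() if c > 0)


def best_split(seq: str):
    """Return (cut position, Jensen-Shannon divergence) for the best 1-to-2 split.

    The cut position i means seq[:i] and seq[i:]. Returns (None, 0.0) when no
    cut gives a positive divergence, e.g. for a perfectly uniform sequence.
    """
    n = len(seq)
    total = Counter(seq)
    h_all = entropy(total, n)
    left = Counter()
    best_pos, best_js = None, 0.0
    for i in range(1, n):
        left[seq[i - 1]] += 1
        right = total - left  # Counter subtraction keeps only positive counts
        js = h_all - (i / n) * entropy(left, i) - ((n - i) / n) * entropy(right, n - i)
        if js > best_js:
            best_pos, best_js = i, js
    return best_pos, best_js
```

Recursive segmentation would apply `best_split` again to each half whenever the split passes a chosen threshold. That threshold is exactly the subjective choice the abstract emphasizes.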

11.
An allometric model for trees   (total citations: 1; self-citations: 0; citations by others: 1)

12.
13.
The present study investigated the effects of sequence complexity, defined in terms of phonemic similarity and phonotactic probability, on the timing and accuracy of serial ordering for speech production in healthy speakers and speakers with either hypokinetic or ataxic dysarthria. Sequences consisted of strings of consonant-vowel (CV) syllables with each syllable containing the same vowel, /a/, paired with a different consonant. High complexity sequences contained phonemically similar consonants, and sounds and syllables that had low phonotactic probabilities; low complexity sequences contained phonemically dissimilar consonants and high probability sounds and syllables. Sequence complexity effects were evaluated by analyzing speech error rates and within-syllable vowel and pause durations. This analysis revealed that speech error rates were significantly higher and speech duration measures were significantly longer during production of high complexity sequences than during production of low complexity sequences. Although speakers with dysarthria produced longer overall speech durations than healthy speakers, the effects of sequence complexity on error rates and speech durations were comparable across all groups. These findings indicate that the duration and accuracy of processes for selecting items in a speech sequence are influenced by their phonemic similarity and/or phonotactic probability. Moreover, this robust complexity effect is present even in speakers with damage to subcortical circuits involved in serial control for speech.

14.
The analysis of repeats in DNA sequences is an important subject in bioinformatics. In this paper, we propose a novel projection-assemble algorithm to find unknown interspersed repeats in DNA sequences. The algorithm employs a random projection algorithm to obtain a candidate fragment set, then an exhaustive search over each pair of fragments in the candidate set to find potential linkages, and finally assembles the linked fragments. The complexity of our projection-assemble algorithm is nearly linear in the length of the genome sequence, and its memory usage is limited by the hardware. We tested our algorithm with both simulated data and real biological data, and the results show that our projection-assemble algorithm is efficient. By means of this algorithm, we found an unlabeled repeat region that occurs five times in the Escherichia coli genome, with a length of more than 5,000 bp and a mismatch probability of less than 4%.

15.
MOTIVATION: Distance measures built on the notion of text compression have been used for the comparison and classification of entire genomes and mitochondrial genomes. The present study was undertaken in order to explore their utility in the classification of protein sequences. RESULTS: We constructed compression-based distance measures (CBMs) using the Lempel-Ziv and the PPMZ compression algorithms and compared their performance with that of the Smith-Waterman algorithm and BLAST, using nearest neighbour or support vector machine classification schemes. The datasets included a subset of the SCOP protein structure database to test distant protein similarities, 3-phosphoglycerate kinase sequences selected from archaean, bacterial and eukaryotic species, as well as low- and high-complexity sequence segments of the human proteome. CBM values show a dependence on the length and the complexity of the sequences compared. In classification tasks CBMs performed especially well on distantly related proteins, where the performance of a combined measure, constructed from a CBM and a BLAST score, approached or even slightly exceeded that of the Smith-Waterman algorithm and two hidden Markov model-based algorithms.
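As a rough illustration of a compression-based distance, the sketch below computes the normalized compression distance between two protein strings. Two assumptions are made here and are not taken from the abstract: the distance formula is the commonly used normalized compression distance, which may differ from the authors' CBM definition, and Python's `zlib` (DEFLATE, an LZ77-family compressor) stands in for the Lempel-Ziv and PPMZ compressors named in the abstract. The function names are hypothetical.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Size in bytes of data after zlib compression at the highest level."""
    return len(zlib.compress(data, 9))


def ncd(x: str, y: str) -> float:
    """Normalized compression distance between two sequences.

    Near 0 when one sequence adds little information to the other,
    near 1 when the sequences share almost nothing compressible.
    """
    bx, by = x.encode("ascii"), y.encode("ascii")
    cx, cy = compressed_size(bx), compressed_size(by)
    cxy = compressed_size(bx + by)  # cost of describing both sequences together
    return (cxy - min(cx, cy)) / max(cx, cy)
```

Because compressor overhead dominates for very short inputs, values from this sketch depend on sequence length and complexity, consistent with the dependence the abstract reports for CBMs.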

16.
A computer package written in Fortran-IV for the PDP-11 minicomputer is described. The package's novel features are: software for voice-entry of sequence data; a less memory-intensive algorithm for optimal sequence alignment; and programs that fit statistical models to nucleic acid and protein sequences.

17.
We quantify the VDJ recombination and somatic hypermutation processes in human B cells using probabilistic inference methods on high-throughput DNA sequence repertoires of human B-cell receptor heavy chains. Our analysis captures the statistical properties of the naive repertoire, first after its initial generation via VDJ recombination and then after selection for functionality. We also infer statistical properties of the somatic hypermutation machinery (exclusive of subsequent effects of selection). Our main results are the following: the B-cell repertoire is substantially more diverse than T-cell repertoires, owing to longer junctional insertions; sequences that pass initial selection are distinguished by having a higher probability of being generated in a VDJ recombination event; somatic hypermutations have a non-uniform distribution along the V gene that is well explained by an independent site model for the sequence context around the hypermutation site.

18.
19.
20.
Polymerase chain reaction (PCR) products were characterized for a repeated sequence family (designated "O-150") of the human filarial parasite Onchocerca volvulus. In phylogenetic inferences, the O-150 sequences clustered into closely related groups, suggesting that concerted evolution maintains sequence homology in this family. Using a novel mathematical model based on a nested application of an analysis of variance, we demonstrated that African rainforest and savannah strain parasite populations are significantly different. In contrast, parasites collected in the New World are indistinguishable from African savannah strains of O. volvulus. This finding supports the hypothesis that onchocerciasis was recently introduced into the New World, possibly as a result of the slave trade.
