Similar articles
 20 similar articles found (search time: 46 ms)
1.
A coding sequence is defined as a DNA sequence that encodes the primary structure of a protein (a polypeptide). Such a sequence must satisfy a specific constraint: it must code for a functional protein. Because the genetic code is degenerate, for a given polypeptide there exists a set of synonymous sequences that would code for the same polypeptide. Translation conditional models are defined on such sets. The aim of this paper is to give a common formalism. Besides the codon bias model, a few other conditional models are defined. Statistical estimators and comparison methods are briefly presented. These models can be used for gene classification or to identify remarkable features in a real sequence. An example is presented on Escherichia coli genes.
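The size of the synonymous set mentioned in this abstract can be illustrated with a minimal sketch. It assumes the standard genetic code and one-letter amino acid symbols; the names `CODON_DEGENERACY` and `count_synonymous_sequences` are hypothetical and do not come from the paper.

```python
from math import prod

# Number of synonymous codons per amino acid in the standard genetic code
# (assumption: the standard code applies; stop codons are ignored).
CODON_DEGENERACY = {
    "A": 4, "R": 6, "N": 2, "D": 2, "C": 2, "Q": 2, "E": 2, "G": 4,
    "H": 2, "I": 3, "L": 6, "K": 2, "M": 1, "F": 2, "P": 4, "S": 6,
    "T": 4, "W": 1, "Y": 2, "V": 4,
}


def count_synonymous_sequences(peptide: str) -> int:
    """Return how many distinct coding sequences encode the given peptide."""
    # Each residue can be encoded independently, so the counts multiply.
    return prod(CODON_DEGENERACY[aa] for aa in peptide)
```

Even a short peptide has a large synonymous set: the count grows multiplicatively with length, which is why conditional models defined on that set are of interest.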

2.
Bostick DL  Shen M  Vaisman II 《Proteins》2004,56(3):487-501
A topological representation of proteins is developed that makes use of two metrics: the Euclidean metric for identifying natural nearest neighboring residues via the Delaunay tessellation in Cartesian space and the distance between residues in sequence space. Using this representation, we introduce a quantitative and computationally inexpensive method for the comparison of protein structural topology. The method ultimately results in a numerical score quantifying the distance between proteins in a heuristically defined topological space. The properties of this scoring scheme are investigated and correlated with the standard Cα distance root-mean-square deviation measure of protein similarity calculated by rigid body structural alignment. The topological comparison method is shown to have a characteristic dependence on protein conformational differences and secondary structure. This distinctive behavior is also observed in the comparison of proteins within families of structural relatives. The ability of the comparison method to successfully classify proteins into classes, superfamilies, folds, and families that are consistent with standard classification methods, both automated and human-driven, is demonstrated. Furthermore, it is shown that the scoring method allows for a fine-grained classification on the family, protein, and species level that agrees very well with currently established phylogenetic hierarchies. This fine classification is achieved without requiring visual inspection of proteins, sequence analysis, or the use of structural superimposition methods. Implications of the method for a fast, automated, topological hierarchical classification of proteins are discussed.

3.
A biologically realistic method was used to simulate evolutionary trees. The method uses a real DNA coding sequence as the starting point, simulates mutation according to the mutational spectrum of Escherichia coli (including base substitutions, insertions, and deletions), and separates the processes of mutation and selection. Trees of 8, 16, 32, and 64 taxa were simulated with average branch lengths of 50, 100, 150, 200, and 250 changes per branch. The resulting sequences were aligned with ClustalX, and trees were estimated by Neighbor Joining, Parsimony, Maximum Likelihood, and Bayesian methods from both DNA sequences and the corresponding protein sequences. The estimated trees were compared with the true trees, and both topological and branch length accuracies were scored. Over the variety of conditions tested, Bayesian trees estimated from DNA sequences that had been aligned according to the alignment of the corresponding protein sequences were the most accurate, followed by Maximum Likelihood trees estimated from DNA sequences and Parsimony trees estimated from protein sequences.

4.
Comparing DNA or protein sequences plays an important role in the functional analysis of genomes. Although many methods are available for sequence comparison, few retain the information content of the sequences. We propose a new approach, the Yau-Hausdorff method, which considers all translations and rotations when seeking the best match between graphical curves of DNA or protein sequences. The complexity of this method is lower than that of any other two-dimensional minimum Hausdorff algorithm. The Yau-Hausdorff method can be used to measure the similarity of DNA sequences on the basis of two important tools: the Yau-Hausdorff distance and the graphical representation of DNA sequences. The graphical representations of DNA sequences conserve all sequence information, and the Yau-Hausdorff distance is mathematically proven to be a true metric. Therefore, the proposed distance can precisely measure the similarity of DNA sequences. Phylogenetic analyses of DNA sequences using the Yau-Hausdorff distance show the accuracy and stability of our approach in similarity comparison of DNA or protein sequences. This study demonstrates that the Yau-Hausdorff distance is a natural metric for DNA and protein sequences with a high level of stability. The approach can also be applied to similarity analysis of protein sequences by graphical representations, as well as to general two-dimensional shape matching.
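As background for the distance named in this abstract, here is a minimal sketch of the ordinary Hausdorff distance between two planar point sets at fixed positions. It is not the Yau-Hausdorff method itself: the abstract states that the method additionally searches over all translations and rotations, and that search, as well as the graphical representation that turns a sequence into a curve, is not shown here. The function names are hypothetical.

```python
from math import dist


def directed_hausdorff(a, b):
    """Largest distance from a point of `a` to its nearest point in `b`."""
    return max(min(dist(p, q) for q in b) for p in a)


def hausdorff(a, b):
    """Symmetric Hausdorff distance between two non-empty 2D point sets."""
    # The directed distance is not symmetric, so both directions are needed.
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))
```

This brute-force version compares every pair of points, so its cost grows with the product of the two set sizes. It is meant only to make the underlying definition concrete.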

5.
Rong Liu  Jianjun Hu 《Proteins》2013,81(11):1885-1899
Accurate prediction of DNA-binding residues has become a problem of increasing importance in structural bioinformatics. Here, we present DNABind, a novel hybrid algorithm for identifying these crucial residues by exploiting the complementarity between machine learning- and template-based methods. Our machine learning-based method is based on the probabilistic combination of a structure-based and a sequence-based predictor, both of which were implemented using support vector machine algorithms. The former includes our structural features, such as solvent accessibility, local geometry, topological features, and relative positions, which can effectively quantify the difference between DNA-binding and nonbinding residues. The latter combines evolutionary conservation features with three other sequence attributes. Our template-based method depends on structural alignment and uses the template structure from known protein–DNA complexes to infer DNA-binding residues. We show that the template method performs very well when reliable templates are found for the query proteins but tends to be strongly influenced by the template quality as well as by conformational changes upon DNA binding. In contrast, the machine learning approach performs better when high-quality templates are not available (about one third of the cases in our dataset) or when the query protein is subject to intensive transformation changes upon DNA binding. Our extensive experiments indicate that the hybrid approach can distinctly improve on the performance of the individual methods for both bound and unbound structures. DNABind also significantly outperforms the state-of-the-art algorithms by around 10% in terms of the Matthews correlation coefficient. The proposed methodology could also have wide application in various protein functional site annotations. DNABind is freely available at http://mleg.cse.sc.edu/DNABind/ . Proteins 2013; 81:1885–1899. © 2013 Wiley Periodicals, Inc.

6.
Histone octamers (hos) and DNA topoisomerase I contribute, along with other proteins, to the higher order structure of chromatin. Here we report on the similar topological requirements of these two protein model systems for their interaction with DNA. Both histone octamers and topoisomerase I positively and consistently respond to DNA supercoiling and curvature, and to the spatial accessibility of the preferential interaction sites. These findings (1) point to the relevance of the topology-related DNA conformation in protein interactions and define the particular role of the helically phased rotational information; and (2) help to solve the apparent paradoxical behaviour of ubiquitous and abundant proteins that interact with defined DNA sites in spite of the lack of clear sequence consensuses. Considering firstly, that the interactions with DNA of both DNA topoisomerase I and histone octamers are topology-sensitive and that upon their interaction the DNA conformation is modified; and secondly, that similar behaviours have also been reported for DNA topoisomerase II and histone H1, a topology-based functional correlation among all these determinants of the higher order structure of chromatin is here suggested.

7.
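A path topological distance between topological trees   Total citations: 1 (self-citations: 1; citations by others: 0)
A new topological distance between phylogenetic trees is presented. Phylogenetic trees were constructed with three methods (NJ, MP, and UPGMA) from DNA sequence data for 14 mitochondrial genes (including combined ones) of 13 animal species. The partition topological distance and the path topological distance introduced here were used to compare these 14 trees with one another and with the true tree. The results show that NJ was the most effective at recovering the known tree, followed by MP, with UPGMA the least effective. The topological distance between the known tree and the trees built from these 14 DNA sequences generally decreased as sequence length increased, but the correlation between the two did not reach significance. The partition topological distance reflects overall differences in topology between trees, but its measurement precision is lower than that of the path topological distance.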
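The partition topological distance (分割拓扑距离) used as the baseline in this abstract is commonly computed as the number of bipartitions found in one tree but not in the other. The following is a minimal sketch of that common definition. It assumes each tree is already given as a collection of splits (the set of taxa on one side of each internal branch), and it does not reproduce the path topological distance proposed in the paper, whose definition is not given here. The function names are hypothetical.

```python
def normalize_split(side, taxa):
    """Return the canonical side of a bipartition: the one without the reference taxon."""
    side = frozenset(side)
    reference = min(taxa)  # arbitrary but fixed reference taxon
    return side if reference not in side else frozenset(taxa) - side


def partition_distance(splits_a, splits_b, taxa):
    """Count the bipartitions present in exactly one of the two trees."""
    a = {normalize_split(s, taxa) for s in splits_a}
    b = {normalize_split(s, taxa) for s in splits_b}
    return len(a ^ b)  # size of the symmetric difference
```

Normalizing each split makes the comparison independent of which side of a branch was recorded, so `{A, B}` and its complement count as the same bipartition.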

8.
The constitution of the centromeric portions of the sex chromosomes of the red-necked wallaby, Macropus rufogriseus (family Macropodidae, subfamily Macropodinae), was investigated to develop an overview of the sequence composition of centromeres in a marsupial genome that harbors large amounts of centric and pericentric heterochromatin. The large, C-band-positive centromeric region of the X chromosome was microdissected and the isolated DNA was microcloned. Further sequence and cytogenetic analyses of three representative clones show that all chromosomes in this species carry a 178-bp satellite sequence containing a CENP-B DNA binding domain (CENP-B box) shown herein to selectively bind marsupial CENP-B protein. Two other repeats isolated in this study localize specifically to the sex chromosomes yet differ in copy number and intrachromosomal distribution. Immunocytohistochemistry assays with anti-CENP-E, anti-CREST, anti-CENP-B, and anti-trimethyl-H3K9 antibodies defined a restricted point localization of the outer kinetochore at the functional centromere within an enlarged pericentric and heterochromatic region. The distribution of these repeated sequences within the karyotype of this species, coupled with the apparent high copy number of these sequences, indicates a capacity for retention of large amounts of centromere-associated DNA in the genome of M. rufogriseus.

9.
A variety of different methods to generate diverse proteins, including random mutagenesis and recombination, are currently available and most of them accumulate the mutations on the target gene of a protein, whose sequence space remains unchanged. On the other hand, a pool of diverse genes, which is generated by random insertions, deletions and exchange of the homologous domains with different lengths in the target gene, would present the protein lineages resulting in new fitness landscapes. Here we report a method to generate a pool of protein variants with different sequence spaces by employing green fluorescent protein (GFP) as a model protein. This process, designated functional salvage screen (FSS), comprises the following procedures: a defective GFP template expressing no fluorescence is first constructed by genetically disrupting a predetermined region(s) of the protein and a library of GFP variants is generated from the defective template by incorporating the randomly fragmented genomic DNA from Escherichia coli into the defined region(s) of the target gene, followed by screening of the functionally salvaged, fluorescence-emitting GFPs. Two approaches, sequence-directed and PCR-coupled methods, were attempted to generate the library of GFP variants with new sequences derived from the genomic segments of E. coli. The functionally salvaged GFPs were selected and analyzed in terms of the sequence space and functional properties. The results demonstrate that the functional salvage process not only can be a simple and effective method to create protein lineages with new sequence spaces, but also can be useful in elucidating the involvement of a specific region(s) or domain(s) in the structure and function of a protein.

10.
Shachar O  Linial M 《Proteins》2004,57(3):531-538
With currently available sequence data, it is feasible to conduct extensive comparisons among large sets of protein sequences. It is still a much more challenging task to partition the protein space into structurally and functionally related families solely based on sequence comparisons. The ProtoNet system automatically generates a treelike classification of the whole protein space. It stands to reason that this classification reflects evolutionary relationships, both close and remote. In this article, we examine this hypothesis. We present a semiautomatic procedure that singles out certain inner nodes in the ProtoNet tree that should ideally correspond to structurally and functionally defined protein families. We compare the performance of this method against several expert systems. Some of the competing methods incorporate additional extraneous information on protein structure or on enzymatic activities. The ProtoNet-based method performs at least as well as any of the methods with which it was compared. This article illustrates the ProtoNet-based method on several evolutionarily diverse families. Using this new method, an evolutionary divergence scheme can be proposed for a large number of structurally and functionally related superfamilies.

11.
12.
13.
Protein structure prediction is based mainly on the modeling of proteins by homology to known structures; this knowledge-based approach is the most promising method to date. Although it is used throughout protein research, no general rules concerning the quality and applicability of the concepts and procedures used in homology modeling have yet been put forward. Therefore, the main goal of the present work is to provide tools for assessing the accuracy of modeling at a given level of sequence homology. A large set of known structures from different conformational and functional classes and with various degrees of homology was selected. Pairwise structure superpositions were performed. Starting with the definition of the structurally conserved regions and the determination of topologically correct sequence alignments, we correlated geometrical properties with sequence homology (defined by the 250 PAM Dayhoff matrix) and identity. It is shown that both the topological differences of the protein backbones and the relative positions of corresponding side chains diverge with decreasing sequence identity. Below 50% identity, the deviation in regions that are not structurally conserved increases continually, implying that with decreasing sequence identity, modeling has to take into account more and more structurally diverging loop regions that are difficult to predict. © 1993 Wiley-Liss, Inc.
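The sequence identity referred to in this abstract can be made concrete with a minimal sketch. It assumes two sequences that are already aligned to equal length with `-` as the gap character, and it uses the number of gap-free columns as the denominator. That denominator is an assumption: other conventions exist, and the abstract does not say which one the authors used. The function name is hypothetical.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percentage of identical residues over the gap-free alignment columns."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    # Keep only columns where both sequences have a residue.
    pairs = [(x, y) for x, y in zip(aligned_a, aligned_b) if x != gap and y != gap]
    if not pairs:
        return 0.0
    matches = sum(x == y for x, y in pairs)
    return 100.0 * matches / len(pairs)
```

With a function of this kind, the 50% threshold discussed in the abstract can be checked for any aligned pair before deciding how much weight to give loop regions in a model.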

14.
A number of experimental methods have been reported for estimating the number of genes in a genome, or the closely related coding density of a genome, defined as the fraction of base pairs in codons. Recently, DNA sequence data representative of the genome as a whole have become available for several organisms, making the problem of estimating coding density amenable to sequence analytic methods. Estimates of coding density for a single genome vary widely, so that methods with characterized error bounds have become increasingly desirable. We present a method to estimate the protein coding density in a corpus of DNA sequence data, in which a 'coding statistic' is calculated for a large number of windows of the sequence under study, and the distribution of the statistic is decomposed into two normal distributions, assumed to be the distributions of the coding statistic in the coding and noncoding fractions of the sequence windows. The accuracy of the method is evaluated using known data and application is made to the yeast chromosome III sequence and to C. elegans cosmid sequences. It can also be applied to fragmentary data, for example a collection of short sequences determined in the course of STS mapping.

15.
The E. coli dam (DNA adenine methylase) enzyme is known to methylate the sequence GATC. A general method for cloning sequence-specific DNA methylase genes was used to isolate the dam gene on a 1.14 kb fragment, inserted in the plasmid vector pBR322. Subsequent restriction mapping and subcloning experiments established a set of approximate boundaries of the gene. The nucleotide sequence of the dam gene was determined, and analysis of that sequence revealed a unique open reading frame which corresponded in length to that necessary to code for a protein the size of dam. Amino acid composition derived from this sequence corresponds closely to the amino acid composition of the purified dam protein. Enzymatic and DNA:DNA hybridization methods were used to investigate the possible presence of dam genes in a variety of prokaryotic organisms.
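Since the abstract identifies GATC as the dam target sequence, a minimal sketch can show how candidate sites are located in a DNA string. It only finds occurrences of the motif and says nothing about whether a given site is actually methylated. The function name is hypothetical.

```python
def find_gatc_sites(sequence: str) -> list:
    """Return the 0-based start positions of every GATC occurrence."""
    seq = sequence.upper()  # accept lower-case input
    sites = []
    start = seq.find("GATC")
    while start != -1:
        sites.append(start)
        # Resume one base further on so overlapping searches are not skipped.
        start = seq.find("GATC", start + 1)
    return sites
```

Because GATC is its own reverse complement, scanning one strand is enough to find the sites on both strands.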

16.
Mutation frequencies vary along a nucleotide sequence, and nucleotide positions with an exceptionally high mutation frequency are called hotspots. Mutation hotspots in DNA often reflect intrinsic properties of the mutation process, such as the specificity with which mutagens interact with nucleic acids and the sequence-specificity of DNA repair/replication enzymes. They might also reflect structural and functional features of target protein or RNA sequences in which they occur. The determinants of mutation frequency and specificity are complex and there are many analytical methods for their study. This paper discusses computational approaches to analysing mutation spectra (distribution of mutations along the target genes) that include many detectable (mutable) positions. The following methods are reviewed: mutation hotspot prediction; pairwise and multiple comparisons of mutation spectra; derivation of a consensus sequence; and analysis of correlation between nucleotide sequence features and mutation spectra. Spectra of spontaneous and induced mutations are used for illustration of the complexities and pitfalls of such analyses. In general, the DNA sequence context of mutation hotspots is a fingerprint of interactions between DNA and DNA repair/replication/modification enzymes, and the analysis of hotspot context provides evidence of such interactions.

17.
Sumiyama K  Kim CB  Ruddle FH 《Genomics》2001,71(2):260-262
The discovery of cis-element control motifs in noncoding DNA poses a difficult problem in genome analysis. Functional analysis by means of reporter constructs expressed in transgenic organisms is the most reliable method, but is by itself time-consuming and expensive. Searching noncoding DNA for known control motifs by sequence analysis is problematic, since protein binding motifs are short, in the range of 8-10 bp, and occur frequently by chance. Heretofore, the most reliable sequence analysis method has been the comparison of homologous sequence domains in related but moderately evolutionarily divergent species such as, for example, mouse and human. In such pairwise combinations, control regions are conserved because they serve a vital function and can be identified by their similar sequences. Single pairwise comparisons, however, allow the discovery of conserved sequence strings only at low resolution and without specific identity. We have investigated the possibility of using multiple sequence comparisons to correct these shortcomings. We applied this method to the Hoxc8 early enhancer region that has been previously analyzed in depth by functional methods and through its application successfully identified known protein binding cis-element motifs. Candidate protein binding sites could also be identified. This method, based on evolutionarily related sequence comparisons, should be quite useful as a prescreening step prior to functional analysis with corresponding savings in time and resources.
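The statement that 8-10 bp motifs occur frequently by chance can be quantified with a minimal sketch. It assumes a simplified null model in which the four bases are independent and equally likely, which real genomes do not follow exactly. The optional doubling for both strands is a further approximation that holds only for motifs that are not their own reverse complement. The function name is hypothetical.

```python
def expected_chance_hits(seq_len: int, motif_len: int, both_strands: bool = False) -> float:
    """Expected occurrences of one specific motif in random DNA (uniform bases)."""
    # Number of windows of length motif_len that fit in the sequence.
    positions = max(seq_len - motif_len + 1, 0)
    # Each window matches a specific motif with probability (1/4) ** motif_len.
    expected = positions / 4 ** motif_len
    return 2 * expected if both_strands else expected
```

Under this model a specific 8 bp motif is expected about once per 65,536 positions on one strand, so chance matches accumulate over long noncoding regions and over large sets of candidate motifs.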

18.
We propose a new approach to studying protein-coding and non-coding regions in DNA sequences that makes use of two complementary statistical methods. Principal component analysis (PCA) is a graphical method for representing DNA sequences characterized by quantitative parameters; it serves as an aid to intuition. Discriminant analysis (DA) is a quantitative method that allows the DNA sequences to be classified. It leads to an evaluation of the first method and to a decision. The value of this approach is confirmed by the fact that it reproduced results recently described in the literature. Furthermore, this general methodology has allowed us to show the existence of parameters that identify the functional domains of nucleic acid sequences without making use of the properties of the genetic code.

19.
20.
During the last two decades a large number of computational methods have been developed for predicting transmembrane protein topology. Current predictors rely on topogenic signals in the protein sequence, such as the distribution of positively charged residues in extra-membrane loops and the existence of N-terminal signals. However, phosphorylation and glycosylation are post-translational modifications (PTMs) that occur in a compartment-specific manner and therefore the presence of a phosphorylation or glycosylation site in a transmembrane protein provides topological information. We examine the combination of phosphorylation and glycosylation site prediction with transmembrane protein topology prediction. We report the development of a Hidden Markov Model based method, capable of predicting the topology of transmembrane proteins and the existence of kinase specific phosphorylation and N/O-linked glycosylation sites along the protein sequence. Our method integrates a novel feature in transmembrane protein topology prediction, which results in improved performance for topology prediction and reliable prediction of phosphorylation and glycosylation sites. The method is freely available at http://bioinformatics.biol.uoa.gr/HMMpTM.
