Similar articles
1.
We describe an alternative method for scoring the pairwise alignment of two biological sequences. Designed to overcome the bias due to the composition of the alignment, it measures the distance (in standard deviations) between the given alignment and the mean value of all other alignments that can be obtained by a permutation of either sequence. We demonstrate that the standard deviation can be calculated efficiently. By concentrating upon the ungapped case, the mean and standard deviation can be calculated exactly and in two steps, the first being O(N) time, where N is the length of the sequence, the second in a fixed number of calculations, i.e., in O(1) time. We argue that this statistic is a more consistent measure than a similarity score based upon a standard scoring matrix. Even in the ungapped case, the statistic proves in many cases to be more accurate than the commonly used FASTA gapped Z-score (Pearson and Lipman, 1988), in which the sequence is matched against a random sample of the database. We demonstrate the use of the POZ-score as a secondary filter which screens out several well-known types of false positives, reducing the amount of manual screening to be done by the biologist.
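The permutation-based statistic described above can be illustrated with a minimal sketch. It assumes a toy match/mismatch scoring function and uses the standard closed form for the mean and variance of a sum over a uniformly random permutation; it runs in O(N^2) rather than the O(N) composition-based calculation the abstract reports, and all function names are hypothetical. A brute-force enumeration is included only to check the closed form on very short inputs.

```python
from itertools import permutations
from math import sqrt


def match_score(x, y, match=1.0, mismatch=-1.0):
    """Toy scoring function (an assumption; a substitution matrix would also work)."""
    return match if x == y else mismatch


def permutation_mean_var(a, b, score=match_score):
    """Exact mean and variance of the ungapped score over all permutations of b."""
    n = len(a)
    if n != len(b) or n < 2:
        raise ValueError("sequences must have equal length >= 2")
    # c[i][j] is the score of pairing position i of a with position j of b.
    c = [[score(x, y) for y in b] for x in a]
    row = [sum(r) / n for r in c]
    col = [sum(c[i][j] for i in range(n)) / n for j in range(n)]
    grand = sum(row) / n
    mean = n * grand
    # Variance of sum_i c[i][pi(i)] for a uniformly random permutation pi.
    var = sum((c[i][j] - row[i] - col[j] + grand) ** 2
              for i in range(n) for j in range(n)) / (n - 1)
    return mean, var


def permutation_z_score(a, b, score=match_score):
    """Distance, in standard deviations, of the observed score from the permutation mean."""
    mean, var = permutation_mean_var(a, b, score)
    observed = sum(score(x, y) for x, y in zip(a, b))
    if var <= 1e-12:  # all permutations score the same; no meaningful z-score
        return 0.0
    return (observed - mean) / sqrt(var)


def brute_force_mean_var(a, b, score=match_score):
    """Enumerate every permutation of b; feasible only for very short sequences."""
    totals = [sum(score(x, y) for x, y in zip(a, p)) for p in permutations(b)]
    mean = sum(totals) / len(totals)
    var = sum((t - mean) ** 2 for t in totals) / len(totals)
    return mean, var
```

Calling `permutation_z_score(a, b)` on two equal-length sequences gives a composition-corrected score in this simplified setting; it is not the authors' POZ-score implementation.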

2.
C Sander  R Schneider 《Proteins》1991,9(1):56-68
The database of known protein three-dimensional structures can be significantly increased by the use of sequence homology, based on the following observations. (1) The database of known sequences, currently at more than 12,000 proteins, is two orders of magnitude larger than the database of known structures. (2) The currently most powerful method of predicting protein structures is model building by homology. (3) Structural homology can be inferred from the level of sequence similarity. (4) The threshold of sequence similarity sufficient for structural homology depends strongly on the length of the alignment. Here, we first quantify the relation between sequence similarity, structure similarity, and alignment length by an exhaustive survey of alignments between proteins of known structure and report a homology threshold curve as a function of alignment length. We then produce a database of homology-derived secondary structure of proteins (HSSP) by aligning to each protein of known structure all sequences deemed homologous on the basis of the threshold curve. For each known protein structure, the derived database contains the aligned sequences, secondary structure, sequence variability, and sequence profile. Tertiary structures of the aligned sequences are implied, but not modeled explicitly. The database effectively increases the number of known protein structures by a factor of five to more than 1800. The results may be useful in assessing the structural significance of matches in sequence database searches, in deriving preferences and patterns for structure prediction, in elucidating the structural role of conserved residues, and in modeling three-dimensional detail by homology.

3.
Boni MF  Posada D  Feldman MW 《Genetics》2007,176(2):1035-1047
Statistical tests for detecting mosaic structure or recombination among nucleotide sequences usually rely on identifying a pattern or a signal that would be unlikely to appear under clonal reproduction. Dozens of such tests have been described, but many are hampered by long running times, confounding of selection and recombination, and/or inability to isolate the mosaic-producing event. We introduce a test that is exact, nonparametric, rapidly computable, free of the infinite-sites assumption, able to distinguish between recombination and variation in mutation/fixation rates, and able to identify the breakpoints and sequences involved in the mosaic-producing event. Our test considers three sequences at a time: two parent sequences that may have recombined, with one or two breakpoints, to form the third sequence (the child sequence). Excess similarity of the child sequence to a candidate recombinant of the parents is a sign of recombination; we take the maximum value of this excess similarity as our test statistic Delta(m,n,b). We present a method for rapidly calculating the distribution of Delta(m,n,b) and demonstrate that it has comparable power to and a much improved running time over previous methods, especially in detecting recombination in large data sets.

4.
Adrenodoxin reductase is an NADP-dependent flavoenzyme which functions as the reductase of mitochondrial P450 systems. We sequenced two adrenodoxin reductase cDNAs isolated from a bovine adrenal cortex cDNA library. The deduced amino acid sequence shows no similarity to the sequence of the microsomal P450 systems or other known protein sequences. Nonetheless, by sequence analysis and comparisons with known sequences of dinucleotide-binding folds of two NADP-binding flavoenzymes, two regions of adrenodoxin reductase sequence were identified as the FAD- and NADP-binding sites. These analyses revealed a consensus sequence for the NADP-binding dinucleotide fold (GXGXXAXXXAXXXXXXG, in one-letter amino acid code) that differs from FAD- and NAD-binding dinucleotide-fold sequences. In the database of protein sequences, the NADP-binding-site sequence appears solely in NADP-dependent enzymes, the binding sites of which were not known to date. Thus, this sequence may be used for identification of a certain type of NADP-binding site of enzymes that show no significant sequence similarity.
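The consensus GXGXXAXXXAXXXXXXG quoted above can be turned into a simple pattern search. The sketch below assumes X stands for any single upper-case residue letter and uses a lookahead so that overlapping matches are also reported; the function name and the example sequence are made up for illustration.

```python
import re

# G X G XX A XXX A XXXXXX G, with X taken to mean any one residue (assumption).
NADP_MOTIF = re.compile(r"(?=(G[A-Z]G[A-Z]{2}A[A-Z]{3}A[A-Z]{6}G))")


def find_nadp_motif(protein_seq):
    """Return 0-based start positions of every (possibly overlapping) motif match."""
    return [m.start() for m in NADP_MOTIF.finditer(protein_seq.upper())]


# Hypothetical sequence containing one motif instance starting at index 2.
example = "MK" + "GAGPSAFYTAQHLLKHG" + "LL"
hits = find_nadp_motif(example)
```

A match only indicates that the consensus is present; as the abstract notes, it identifies a certain type of NADP-binding site rather than every such site.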

5.
A challenge for mammalian genetics is the recognition of critical regulatory regions in primary gene sequence. One approach to this problem is to compare sequences from genes exhibiting highly conserved expression patterns in disparate organisms. Previous transgenic and transfection analyses defined conserved regulatory domains in the mouse and human adenosine deaminase (ADA) genes. We have thus attempted to identify regions with comparable similarity levels potentially indicative of critical ADA regulatory regions. On the basis of aligned regions of the mouse and human ADA gene, using a 24-bp window, we find that similarity overall (67.7%) and throughout the noncoding sequences (67.1%) is markedly lower than that of the coding regions (81%). This low overall similarity facilitated recognition of more highly conserved regions. In addition to the highly conserved exons, ten noncoding regions >100 bp in length displayed >70% sequence similarity. Most of these contained numerous 24-bp windows with much higher levels of similarity. A number of these regions, including the promoter and the thymic enhancer, were more similar than several exons. A third block, located near the thymic enhancer but just outside of a minimally defined locus control region, exhibited stronger similarity than the promoter or thymic enhancer. In contrast, only fragmentary similarity was exhibited in a region that harbors a strong duodenal enhancer in the human gene. These studies show that comparative sequence analysis can be a powerful tool for identifying conserved regulatory domains, but that some conserved sequences may not be detected by certain functional analyses such as transgenic mice. Received: 27 March 1998 / Accepted: 22 September 1998
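The 24-bp window comparison mentioned above can be sketched as a sliding-window percent identity over a pairwise alignment. The sketch assumes the two sequences are already aligned to equal length with `-` as the gap character, and that any column containing a gap counts as a mismatch; both choices, and the function names, are assumptions rather than the authors' procedure.

```python
def window_identity(aligned_a, aligned_b, window=24):
    """Percent identity for every window start along two aligned sequences."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    if window <= 0 or window > len(aligned_a):
        raise ValueError("window must be between 1 and the alignment length")
    # 1 where both columns hold the same base, 0 for mismatches and gap columns.
    matches = [1 if x == y and x != "-" else 0
               for x, y in zip(aligned_a.upper(), aligned_b.upper())]
    current = sum(matches[:window])
    result = [100.0 * current / window]
    # Slide the window one column at a time, updating the running match count.
    for end in range(window, len(matches)):
        current += matches[end] - matches[end - window]
        result.append(100.0 * current / window)
    return result


def conserved_starts(identities, threshold=70.0):
    """Window start positions whose identity exceeds the threshold."""
    return [i for i, value in enumerate(identities) if value > threshold]
```

Runs of consecutive start positions returned by `conserved_starts` would correspond to candidate conserved regions; merging them into regions longer than 100 bp is left out of this sketch.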

6.
We developed a new method which searches for sequence segments responsible for the recognition of a given chemical structure. These segments are detected as those locally conserved between a sequence to be analyzed (target sequence) and a set of sequences (reference sequences). Reference sequences are the sequences of functionally related proteins whose ligands contain a common chemical substructure in their molecular structures. 'Similarity graphing' cuts the target sequence into segments, aligns them pairwise with the reference sequences, calculates the degree of similarity for each alignment, and graphically shows cumulative similarity values along the target sequence. Any locally conserved regions, short or long in length and weak or strong in similarity, are detected at their optimal conditions by adjusting three parameters. The 'enzyme-reaction database' contains chemical structures and their related enzymes. When a chemical substructure is input into the database, sequences of the enzymes related to the input substructure are systematically searched from the NBRF sequence database and output as reference sequences. Examples of analysis using similarity graphing in combination with the enzyme-reaction database showed great potential for the systematic analysis of the relationships between sequences and molecular recognition in protein engineering.

7.
Sequences in public databases may contain a number of sequencing errors. A double binomial model describing the distribution of indel-excluded similarity coefficients (S) among repeatedly sequenced 16S rRNA was previously developed and it produced a confidence interval of S useful for testing sequence identity among sequences of 400-bp length. We characterized patterns in sequencing errors found in nearly complete 16S rRNA sequences of Vibrionaceae as highly variable in reported sequence length and containing a small number of indels. To accommodate these characteristics, a simple binomial model for distribution of the similarity coefficient (H) that included indels was derived from the double binomial model for S. The model showed good fit to empirical data. By using either a pre-determined or bootstrap-estimated standard probability of base matching, we were able to use the exact binomial test to determine the relative level of sequencing error for a given pair of duplicated sequences. A limitation of the method is the requirement that duplicated sequences for the same template sequence be paired, but this can be overcome by using only conserved regions of 16S rRNA sequences and pairing a given sequence with its highest scoring BLAST search hit from the nr database of GenBank.
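The exact binomial test mentioned above can be sketched as a lower-tail probability: given a standard probability of base matching, how likely is it to see this few matching positions or fewer? The sketch assumes a one-sided test and that the similarity coefficient is simply matches divided by compared positions; the abstract does not give the exact definition of H, so both points, and the function names, are assumptions. Probabilities are summed in log space to avoid overflow for sequences of 16S rRNA length.

```python
from math import exp, lgamma, log


def binomial_log_pmf(k, n, p):
    """Log of the binomial probability of exactly k matches in n positions (0 < p < 1)."""
    return (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
            + k * log(p) + (n - k) * log(1.0 - p))


def binomial_lower_tail(k, n, p):
    """One-sided exact p-value: probability of observing k or fewer matches."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    if not 0 <= k <= n:
        raise ValueError("k must lie between 0 and n")
    return min(1.0, sum(exp(binomial_log_pmf(i, n, p)) for i in range(k + 1)))
```

For example, `binomial_lower_tail(matches, positions, p_match)` with a hypothetical `p_match` close to 1 would flag a pair of duplicated sequences whose number of matching positions is improbably low under the assumed error level.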

8.
We have isolated a cDNA clone for the Chlamydomonas reinhardtii pre-apoplastocyanin. The sequence contains codons for the complete pre-protein including a two-domain, lumen-targeting transit sequence and the mature apoprotein. The transit sequence (47 amino acids) is the shortest one described for chloroplast lumenal proteins, and like other C. reinhardtii lumen-targeting transit sequences appears to lack an uncharged amino-terminal domain usually present in plant lumen-directing sequences. The mature protein is deduced to be 98 amino acids in length and shows highest primary sequence similarity (74-76% identity) to other unicellular algal plastocyanins. Southern hybridization analysis of C. reinhardtii genomic DNA indicates the presence of a single nuclear gene, as is the case for all other plastocyanin genes characterized to date, although the algal gene might be interrupted. Codon usage in this gene reflects the high GC content of C. reinhardtii nuclear DNA, but is more highly biased than that found in the C. reinhardtii copper-repressible gene for the functionally equivalent pre-apocytochrome c552 (perhaps contributing to the more efficient synthesis in vivo of plastocyanin over cytochrome c552). The deduced physical properties of this plastocyanin are compared to those of the C. reinhardtii plastidic cytochrome c552.

9.
We have cloned and characterized a complete set of seven U1-related sequences from Drosophila melanogaster. These sequences are located at the three cytogenetic loci 21D, 82E, and 95C. Three of these sequences have been previously studied: one U1 gene at 21D which encodes the prototype U1 sequence (U1a), one U1 gene at 82E which encodes a U1 variant with a single nucleotide substitution (U1b), and a pseudogene at 82E. The four previously uncharacterized genes are another U1b gene at 82E, two additional U1a genes at 95C, and a U1 gene at 95C which encodes a new variant (U1c) with a distinct single nucleotide change relative to U1a. Three blocks of 5' flanking sequence similarity are common to all six full length genes. Using specific primer extension assays, we have observed that the U1b RNA is expressed in Drosophila Kc cells and is associated with snRNP proteins, suggesting that the U1b-containing snRNP particles are able to participate in the process of pre-mRNA splicing. We have also examined the expression throughout Drosophila development of the two U1 variants relative to the prototype sequence. The U1c variant is undetectable by our methods, while the U1b variant exhibits a primarily embryonic pattern reminiscent of the expression of certain U1 variants in sea urchin, Xenopus, and mouse.

10.
Yang XL  Bai DZ  Qiu W  Dong HQ  Li DQ  Chen F  Ma RL  Hugh TB  Gao JF 《遗传》2012,34(7):887-894
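Given the known BAC (bacterial artificial chromosome) clone sequences and predicted gene annotations for the MHC (major histocompatibility complex) region of Chinese Merino sheep, restriction fragments of six BAC clones from the MHC region of a Chinese Merino sheep genomic BAC library were used as probes to screen a Chinese Merino sheep mixed-tissue cDNA library by in situ plaque hybridization (library-to-library hybridization). The positive cDNA clones isolated were fully sequenced, aligned against the corresponding BAC clones with known sequence and gene annotation, and searched for sequence similarity with Blastn against the NCBI database, with the aim of verifying the accuracy of the gene annotation and making a preliminary analysis of gene (sequence) function. Two rounds of hybridization yielded 27 positive cDNA clones (sequences). All of these sequences could be mapped to the corresponding BAC clones, and 25 of them fell within exons of annotated genes. The Blastn similarity search of the NCBI database showed that 23 sequences were most similar to cattle gene sequences and were closely related to immune function.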

11.
AIMS: The aim of the study was to isolate and characterize the endophytic fungi from the rhizomes of the Chinese traditional medicinal plant Dioscorea zingiberensis and to detect their antibacterial activities. METHODS AND RESULTS: After strict sterile sample preparation, nine fungal endophytes were isolated from rhizomes of D. zingiberensis. The endophytes were classified by morphological traits and internal transcribed spacer (ITS) rRNA gene sequence analysis. Their ITS rDNA sequences were 99-100% identical to Nectria, Fusarium, Rhizopycnis, Acremonium and Penicillium spp. respectively. Of these, the most frequent genera were Fusarium and Nectria. One isolate, Dzf7, was unclassified on the basis of its low sequence similarity. The next closest species was Alternaria longissima (c. 92.4% sequence similarity). Endophyte isolate Dzf5 showed the closest sequence similarity (c. 99.5%) to an uncultured soil fungus (DQ420800) obtained from Cedar Creek, USA. Bioassays using a modified broth dilution test were used to detect antibacterial activity; n-butanol extracts of both mycelia and culture filtrates of the D. zingiberensis endophytes showed biological activity against Bacillus subtilis, Staphylococcus haemolyticus, Escherichia coli and Xanthomonas vesicatoria. Minimal inhibitory concentration (MIC) values of the extracts were between 31.25 µg/ml and 125 µg/ml. CONCLUSIONS: Endophytic fungus Dzf2 (c. 99.8% sequence similarity to Fusarium redolens) isolated from D. zingiberensis rhizome showed the most potent antibacterial activities. SIGNIFICANCE AND IMPACT OF THE STUDY: Endophytic fungi isolated from D. zingiberensis may be used as potential producers of antibacterial natural products.

12.
We report the sequence of a cDNA encoding a rabbit immunoglobulin gamma heavy chain of d12 and e14 allotypes with high homology to partial cDNA sequences from rabbits of d11 and e15 allotypes. The encoded rabbit protein shows homologies with human (68-70%) and mouse (60-63%) gamma chains. The nucleotide sequence homologies of the CH domains range from 76-84% with human and 64-76% with mouse sequences. Comparison of the portion of VH encoding amino acid positions 34-112 with a previously determined VH sequence of the same allotype shows high conservation of sequences in the second and third framework segments but more marked differences both in length and encoded amino acids of the second and third complementarity-determining regions (CDRs). We also found a high degree of homology with a human genomic V-region, VH26 (77%) and a remarkable similarity between rabbit and human second CDR sequences and human genomic D minigenes. These results provide additional evidence that D minigene sequences share information with the CDR2 portion of VH regions.

13.
A Markov analysis of DNA sequences (cited 12 times: 0 self-citations, 12 citations by others)
We present a model by which we look at the DNA sequence as a Markov process. It has been suggested by several workers that some basic biological or chemical features of nucleic acids stand behind the frequencies of dinucleotides (doublets) in these chains. Comparing patterns of doublet frequencies in DNA of different organisms was shown to be a fruitful approach to some phylogenetic questions (Russel & Subak-Sharpe, 1977). Grantham (1978) formulated mRNA sequence indices, some of which involve certain doublet frequencies. He suggested that using these indices may provide indications of the molecular constraints existing during gene evolution. Nussinov (1981) has shown that a set of dinucleotide preference rules holds consistently for eukaryotes, and suggested a strong correlation between these rules and degenerate codon usage. Gruenbaum, Cedar & Razin (1982) found that methylation in eukaryotic DNA occurs exclusively at C-G sites. Important biological information thus seems to be contained in the doublet frequencies. One of the basic questions to be asked (the "correlation question") is to what extent are the 64 trinucleotide (triplet) frequencies measured in a sequence determined by the 16 doublet frequencies in the same sequence. The DNA is described here as a Markov process, with the nucleotides being outcomes of a sequence generator. Answering the correlation question mentioned above means finding the order of the Markov process. The difficulty is that natural sequences are of finite length, and statistical noise is quite strong. We show that even for a 16000 nucleotide long sequence (like that of the human mitochondrial genome) the finite length effect cannot be neglected. Using the Markov chain model, the correlation between doublet and triplet frequencies can, however, be determined even for finite sequences, taking proper account of the finite length. 
Two natural DNA sequences, the human mitochondrial genome and the SV40 DNA, are analysed as examples of the method.
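The "correlation question" above, how far triplet frequencies are determined by doublet frequencies, can be illustrated with a first-order Markov prediction: the expected frequency of triplet xyz is f(xy) * f(yz) / f(y). The sketch below is a minimal illustration under the assumption that frequencies are estimated from overlapping counts along a single sequence; it does not apply the finite-length correction the abstract describes, and the function names are hypothetical.

```python
from collections import Counter


def kmer_freqs(seq, k):
    """Relative frequencies of overlapping k-mers in seq."""
    total = len(seq) - k + 1
    if total <= 0:
        raise ValueError("sequence is shorter than k")
    counts = Counter(seq[i:i + k] for i in range(total))
    return {kmer: count / total for kmer, count in counts.items()}


def expected_triplet_freqs(seq):
    """Triplet frequencies predicted from doublets under a first-order Markov chain."""
    singles = kmer_freqs(seq, 1)
    doublets = kmer_freqs(seq, 2)
    expected = {}
    for xy, p_xy in doublets.items():
        for yz, p_yz in doublets.items():
            if xy[1] == yz[0]:  # the two doublets must share the middle base
                expected[xy[0] + yz] = p_xy * p_yz / singles[xy[1]]
    return expected
```

On the very short sequence `ACAC` the observed frequency of `ACA` is 1/2 while the prediction is 4/9, and the predictions do not sum to one. This is a small-scale instance of the finite-length effect the abstract says cannot be neglected.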

14.
de Franco FF  Kuhn GC  de Sene FM  Manfrin MH 《Genetica》2006,128(1-3):287-295
In this study, we have compared 34 repetition units of pBuM-2 satellite DNA of individuals from six isolated populations of Drosophila gouveai, a cactophilic member of the Drosophila buzzatii cluster (repleta group). In contrast to the results of previous morphological and molecular data, which suggest differentiation among the D. gouveai populations, the sequences and the cluster analysis of pBuM-2 monomers showed that this repetitive element is highly conserved among the six D. gouveai populations (97.8% similarity), indicating a slow rate of evolution of pBuM-2 sequences at the population level. Probably, some homogenization mechanisms of tandem sequences, such as unequal crossing over or gene conversion, have maintained the sequence similarity of pBuM-2 among D. gouveai populations. Alternatively, such a result may be associated with a functional role of pBuM-2 sequences, although it is not understood at present.

15.
16.
We have developed a computational method of protein design to detect amino acid sequences that are adaptable to given main-chain coordinates of a protein. In this method, the selection of amino acid types employs a Metropolis Monte Carlo method with a scoring function in conjunction with the approximation of free energies computed from 3D structures. To compute the scoring function, a side-chain prediction using another Metropolis Monte Carlo method was performed to select structurally suitable side-chain conformations from a side-chain library. In total, two layers of Monte Carlo procedures were performed, first to select amino acid types (1st layer Monte Carlo) and then to predict side-chain conformations (2nd layer Monte Carlo). We applied this method to sequence design for the entire sequence of the SH3 domain, Protein G, and BPTI. The predicted sequences were similar to those of the wild-type proteins. We compared the results of the predictions with and without the 2nd layer Monte Carlo method. The results revealed that the two-layer Monte Carlo method produced better sequence similarity to the wild-type proteins than the one-layer method. Finally, we applied this method to neuraminidase of influenza virus. The results were consistent with the sequences identified from the isolated viruses.
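The outer (amino-acid selection) layer described above can be sketched as a generic Metropolis Monte Carlo loop. This is a minimal sketch, not the authors' implementation: `score_fn` is a hypothetical stand-in for their scoring function (which in the paper would itself involve the second, side-chain Monte Carlo layer), lower scores are assumed to be better, and the temperature and step count are arbitrary.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def metropolis_accept(delta, temperature, rng):
    """Metropolis criterion: always accept improvements, sometimes accept worse moves."""
    if delta <= 0:
        return True
    return rng.random() < math.exp(-delta / temperature)


def design_sequence(length, score_fn, steps=2000, temperature=1.0, seed=0):
    """Search sequence space by single-residue mutations; return the best sequence seen."""
    rng = random.Random(seed)
    current = [rng.choice(AMINO_ACIDS) for _ in range(length)]
    current_score = score_fn(current)
    best, best_score = current[:], current_score
    for _ in range(steps):
        pos = rng.randrange(length)
        old = current[pos]
        current[pos] = rng.choice(AMINO_ACIDS)  # propose a new residue type
        new_score = score_fn(current)
        if metropolis_accept(new_score - current_score, temperature, rng):
            current_score = new_score
            if new_score < best_score:
                best, best_score = current[:], new_score
        else:
            current[pos] = old  # reject the move and restore the old residue
    return "".join(best), best_score
```

A toy `score_fn`, such as counting residues that differ from a chosen letter, is enough to exercise the loop; a realistic score would evaluate the sequence against the fixed main-chain coordinates.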

17.
Guan Z  Meng X  Sun Z  Xu Z  Song R 《Gene》2008,423(1):36-42
The sodium-dependent phosphate transporter gene from the unicellular green alga Dunaliella viridis, DvSPT1, shares similarity with members of the Pi transporter family. Sequencing analysis of a D. viridis BAC clone containing the DvSPT1 gene revealed two inverted duplicated copies of this gene (DvSPT1 and DvSPT1-2 respectively). The duplication covered most of both genes except for their 3' downstream region. The duplicated genomic sequences exhibited 97.9% identity with a synonymous divergence of Ks=0.0126 in the coding region. These data indicate a very recent gene duplication in the D. viridis genome, providing an excellent opportunity to investigate sequence and expression divergence of duplicated genes at an early stage. Scattered point mutations and length polymorphism of simple sequence repeats (SSRs) were predominant among the sequence divergence soon after gene duplication. Due to sequence divergence in the 5' regulatory regions and a swap of the entire 3' downstream regions (3'-UTR), DvSPT1 and DvSPT1-2 showed expression divergence in response to extra-cellular NaCl concentration changes. According to their expression patterns, the two diverged gene copies would provide better adaptation to a broader range of extra-cellular NaCl concentration. Furthermore, Southern blot analysis indicated that there might be a large phosphate transporter gene family in D. viridis.

18.
We propose a computational method to measure and visualize interrelationships among any number of DNA sequences allowing, for example, the examination of hundreds or thousands of complete mitochondrial genomes. An "image distance" is computed for each pair of graphical representations of DNA sequences, and the distances are visualized as a Molecular Distance Map: Each point on the map represents a DNA sequence, and the spatial proximity between any two points reflects the degree of structural similarity between the corresponding sequences. The graphical representation of DNA sequences utilized, Chaos Game Representation (CGR), is genome- and species-specific and can thus act as a genomic signature. Consequently, Molecular Distance Maps could inform species identification, taxonomic classifications and, to a certain extent, evolutionary history. The image distance employed, Structural Dissimilarity Index (DSSIM), implicitly compares the occurrences of oligomers of length up to k (herein k = 9) in DNA sequences. We computed DSSIM distances for more than 5 million pairs of complete mitochondrial genomes, and used Multi-Dimensional Scaling (MDS) to obtain Molecular Distance Maps that visually display the sequence relatedness in various subsets, at different taxonomic levels. This general-purpose method does not require DNA sequence alignment and can thus be used to compare similar or vastly different DNA sequences, genomic or computer-generated, of the same or different lengths. We illustrate potential uses of this approach by applying it to several taxonomic subsets: phylum Vertebrata, (super)kingdom Protista, classes Amphibia-Insecta-Mammalia, class Amphibia, and order Primates. This analysis of an extensive dataset confirms that the oligomer composition of full mtDNA sequences can be a source of taxonomic information. 
This method also correctly finds the mtDNA sequences most closely related to that of the anatomically modern human (the Neanderthal, the Denisovan, and the chimp), and finds that the sequence most different from it in this dataset belongs to a cucumber.
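The Chaos Game Representation underlying the method above can be sketched in a few lines: starting from the centre of the unit square, each nucleotide moves the current point halfway toward that nucleotide's corner. The corner assignment used here (A bottom-left, C top-left, G top-right, T bottom-right) is one common convention and is an assumption, as are the function names and the choice to skip ambiguous bases. The DSSIM image distance and the MDS step are not covered by this sketch.

```python
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_points(seq):
    """Chaos Game Representation coordinates, one point per recognised nucleotide."""
    x, y = 0.5, 0.5
    points = []
    for base in seq.upper():
        corner = CORNERS.get(base)
        if corner is None:  # skip ambiguous bases such as N (assumption)
            continue
        x = (x + corner[0]) / 2
        y = (y + corner[1]) / 2
        points.append((x, y))
    return points


def cgr_grid(seq, k):
    """Count CGR points in a 2^k x 2^k grid; grid[row][col] with row taken from y."""
    size = 2 ** k
    grid = [[0] * size for _ in range(size)]
    for x, y in cgr_points(seq):
        grid[int(y * size)][int(x * size)] += 1
    return grid
```

Once at least k bases have been read, the grid cell a point falls in depends only on the last k bases, which is why comparing such images implicitly compares oligomer occurrences, as the abstract notes.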

19.
Annotations of genes and their products are largely guided by inferring homology. Sequence similarity is the primary measure used for annotation purposes; however, domain content and order have been given less importance, despite the fact that domain insertions, deletions, and positional changes can bring about functional variety. Several recently developed methods quantify domain architecture similarity, but they depend on sequence alignments and focus only on homologous proteins. We present an alignment-free domain architecture-similarity search (ADASS) algorithm that identifies proteins that share very poor sequence similarity yet have similar domain architectures. We introduce a “singlet matching-triplet comparison” method in ADASS, wherein a triplet of domains is compared with other triplets in a pair-wise comparison of two domain architectures. Different events in the triplet comparison are scored as per a scoring scheme and an average pairwise distance score (Domain Architecture Distance score - DAD Score) is calculated between protein domain architectures. We use domain architectures of a selected domain, termed the centric domain, and cluster them based on DAD score. The algorithm has a high Positive Prediction Value (PPV) with respect to the clustering of the sequences of selected domain architectures. A comparison of domain architecture-based dendrograms using the ADASS method and an existing method revealed that ADASS can classify proteins depending on the extent of domain architecture-level similarity. ADASS is more relevant in cases of proteins with tiny domains that contribute little to the overall sequence similarity but contribute significantly to the overall function.

20.
The protein phosphatase type-1 catalytic subunit (PP1c) does not exist freely in the cell and its activity must be very strictly controlled. Several protein inhibitors of PP1c have been described including the classical mammalian inhibitor-1 (I-1) and inhibitor-2 (I-2). Association of these inhibitors with PP1c appears to involve multiple contacts and in the case of I-2 no less than five I-2 interaction subdomains have been proposed. In this report, we provide both in vitro and in vivo evidence that the Dictyostelium discoideum genome encodes a protein (DdI-2) that is an ortholog of mammalian I-2, being the first PP1c interacting protein characterized in this social amoeba. Despite the low overall sequence similarity of DdI-2 with other I-2 sequences and its long N-terminal extension, the five PP1c interaction motifs proposed for mammalian I-2 are reasonably conserved in the Dictyostelium ortholog. We demonstrate that DdI-2 interacts with and inhibits D. discoideum PP1c (DdPP1c), which we have previously characterized. Moreover, using yeast two-hybrid assays we show that a stable interaction of DdI-2 with DdPP1c requires multiple contacts.
