Similar articles
20 similar articles found (search time: 63 ms)
1.
We present a method based on hierarchical self-organizing maps (SOMs) for recognizing patterns in protein sequences. The method is fully automatic, does not require prealigned sequences, is insensitive to redundancy in the training set, and works surprisingly well even with small learning sets. Because it uses unsupervised neural networks, it is able to extract patterns that are not present in all of the unaligned sequences of the learning set. The identification of these patterns in sequence databases is sensitive and efficient. The procedure comprises three main training stages. In the first stage, one SOM is trained to extract common features from the set of unaligned learning sequences. A feature is a number of ungapped sequence segments (usually 4-16 residues long) that are similar to segments in most of the sequences of the learning set according to an initial similarity matrix. In the second training stage, the recognition of each individual feature is refined by selecting an optimal weighting matrix out of a variety of existing amino acid similarity matrices. In a third stage of the SOM procedure, the position of the features in the individual sequences is learned. This allows for variants with feature repeats and feature shuffling. The procedure has been successfully applied to a number of notoriously difficult cases with distinct recognition problems: helix-turn-helix motifs in DNA-binding proteins, the CUB domain of developmentally regulated proteins, and the superfamily of ribokinases. A comparison with the established database search procedure PROFILE (and with several others) led to the conclusion that the new automatic method performs satisfactorily.

2.
Strong sequence similarity has been reported among WrbA (the Trp repressor-binding protein of Escherichia coli); Ycp4, a protein of unknown function from the budding yeast Saccharomyces cerevisiae; P25, the pap1-dependent protein of the fission yeast Schizosaccharomyces pombe; and the translation product of a partial cDNA sequence from rice seedling root (Oryza sativa, locus Ricr02421a; here referred to as RicR). Further homology search with the profile method indicates that all the above sequences are related to the flavodoxin family and, in turn, allows detection of the recently proposed flavodoxin-like proteins from E. coli, MioC and the hypothetical protein YihB. We discuss sequence conservation with reference to the known 3-dimensional structures of flavodoxins. Conserved sequence and hydrophobicity patterns, as well as residue-pair interaction potentials, strongly support the hypothesis that these proteins share the alpha/beta twisted open-sheet fold typical of flavodoxins, with an additional alpha/beta unit in the WrbA family. On the basis of the proposed structural homology, we discuss the details of the putative FMN-binding sites. Our analysis also suggests that the helix-turn-helix motif we identified previously in the C-terminal region of the WrbA family is unlikely to reflect a DNA-binding function of this new protein family.

3.
Lac repressor (LacR) is a helix-turn-helix motif sequence-specific DNA binding protein. Based on proton NMR spectroscopic investigations, Kaptein and co-workers have proposed that the helix-turn-helix motif of LacR binds to DNA in an orientation opposite to that of the helix-turn-helix motifs of lambda repressor, lambda cro, 434 repressor, 434 cro, and CAP [Boelens, R., Scheek, R., van Boom, J. and Kaptein, R., J. Mol. Biol. 193, 1987, 213-216]. In the present work, we have determined the orientation of the helix-turn-helix motif of LacR in the LacR-DNA complex by the affinity cleaving method. The DNA cleaving moiety EDTA.Fe was attached to the N-terminus of a 56-residue synthetic protein corresponding to the DNA binding domain of LacR. We have formed the complex between the modified protein and the left DNA half site for LacR. The locations of the resulting DNA cleavage positions relative to the left DNA half site provide strong support for the proposal of Kaptein and co-workers.

4.
Lo SL, Cai CZ, Chen YZ, Chung MC. Proteomics 2005, 5(4): 876-884
Knowledge of protein-protein interaction is useful for elucidating protein function via the concept of 'guilt-by-association'. A statistical learning method, Support Vector Machine (SVM), has recently been explored for the prediction of protein-protein interactions using artificial shuffled sequences as hypothetical noninteracting proteins and it has shown promising results (Bock, J. R., Gough, D. A., Bioinformatics 2001, 17, 455-460). It remains unclear, however, how the prediction accuracy is affected if real protein sequences are used to represent noninteracting proteins. In this work, this effect is assessed by comparing the results derived from the use of real protein sequences with those derived from the use of shuffled sequences. The real protein sequences of hypothetical noninteracting proteins are generated from an exclusion analysis in combination with subcellular localization information of interacting proteins found in the Database of Interacting Proteins. Prediction accuracy using real protein sequences is 76.9% compared to 94.1% using artificial shuffled sequences. The discrepancy likely arises from the expected higher level of difficulty in separating two sets of real protein sequences than in separating a set of real protein sequences from a set of artificial sequences. The use of real protein sequences for training an SVM classification system is expected to give better prediction results in practical cases. This is tested by using both SVM systems for predicting putative protein partners of a set of thioredoxin-related proteins. The prediction results are consistent with observations, suggesting that real sequences are more practically useful in developing an SVM classification system for facilitating protein-protein interaction prediction.
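The idea of using artificial shuffled sequences as hypothetical noninteracting proteins can be illustrated with a minimal sketch. It assumes a shuffled sequence is simply a random permutation of the residues of a real sequence, so length and amino acid composition are preserved; the function name `shuffle_sequence` and the example sequence are hypothetical, and the abstract does not state which shuffling scheme the cited work used.

```python
import random


def shuffle_sequence(seq: str, rng: random.Random) -> str:
    """Return a random permutation of the residues in seq.

    Assumption: a simple residue-level shuffle, which keeps the
    length and amino acid composition of the input unchanged.
    """
    residues = list(seq)
    rng.shuffle(residues)  # in-place Fisher-Yates shuffle
    return "".join(residues)


if __name__ == "__main__":
    rng = random.Random(0)  # seeded only to make the demo repeatable
    real = "MKTAYIAKQRQISFVKSHFSRQ"  # made-up example sequence
    print(real)
    print(shuffle_sequence(real, rng))
```

Because composition is preserved, a classifier separating real from shuffled sequences can only exploit residue order, which is one plausible reason the task is easier than separating two sets of real sequences.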

5.
The defective prophage of Bacillus subtilis 168, PBSX, is a chromosomally based element which encodes a non-infectious phage-like particle with bactericidal activity. PBSX is induced by agents which elicit the SOS response. In a PBSX thermoinducible strain which carries the xhi1479 mutation, PBSX is induced by raising the growth temperature from 37 °C to 48 °C. A 1.2-kb fragment has been cloned which complements the xhi1479 mutation. The nucleotide (nt) sequence of this fragment contains an open reading frame (ORF) which encodes a protein of 113 amino acids (aa). This aa sequence resembles that of other bacteriophage repressors and suggests that the N-terminal region forms a helix-turn-helix motif, typical of the DNA-binding domain of many bacterial regulatory proteins. The ORF is preceded by four 15-bp direct repeats, each of which contains an internal palindromic sequence, and by sequences resembling a SigA-dependent promoter. The nt sequence of an equivalent fragment from the PBSX thermoinducible strain has also been determined. There are three aa differences within the ORF compared to the wild type, one of which lies within the helix-turn-helix segment. This ORF encodes a repressor protein of PBSX.

6.
The chromosomal gene from Pseudomonas aeruginosa encoding beta-lactamase has been cloned, and the sequence determined and compared with corresponding sequences of beta-lactamases from members of the Enterobacteriaceae. Upstream of the beta-lactamase gene is an open reading frame which we postulate encodes a regulatory protein, AmpR. We identified a helix-turn-helix region in AmpR and a putative AmpR-binding site.

7.
Sequence-based motif prediction is of great interest and remains a challenge. In this work, we develop a local combinational variable approach for sequence-based helix-turn-helix (HTH) motif prediction. First, we choose a sequence data set for 88 proteins of 22 amino acids in length to launch an optimized traversal for extracting local combinational segments (LCS) from the data set. Then, after LCS refinement, local combinational variables (LCV) are generated to construct prediction models for HTH motifs. The prediction ability of LCV sets at different thresholds is calculated to settle on a moderate threshold. The large data set we used comprises 13 HTH families, with 17,455 sequences in total. Our approach predicts HTH motifs more precisely using only primary protein sequence information, with 93.29% accuracy, 93.93% sensitivity and 92.66% specificity. Comparing prediction results for newly reported HTH-containing proteins with those of other prediction web services indicates that the LCV approach yields a good prediction model. Comparisons with profile-HMM models from the Pfam protein families database show that the LCV approach maintains a good balance while dealing with HTH-containing proteins and non-HTH proteins at the same time. The LCV approach is to some extent complementary to the profile-HMM models because of its better identification of false-positive data. Furthermore, genome-wide predictions detect new HTH proteins in both Homo sapiens and Escherichia coli, which broadens the applications of the LCV approach. Software for mining LCVs from sequence data sets can be obtained freely from the anonymous ftp site ftp://cheminfo.tongji.edu.cn/LCV/.

8.
Joo K, Lee J, Kim I, Lee SJ, Lee J. Biophysical Journal 2008, 95(10): 4813-4819
We present a new method for multiple sequence alignment (MSA), which we call MSACSA. The method is based on the direct application of a global optimization method called conformational space annealing (CSA) to a consistency-based score function constructed from pairwise sequence alignments between constituting sequences. We applied MSACSA to two MSA databases, the 82 families from the BAliBASE reference set 1 and the 366 families from the HOMSTRAD set. In all 450 cases, we obtained well-optimized alignments satisfying more pairwise constraints, producing, in consequence, more accurate alignments on average compared with a recent alignment method, SPEM. One of the advantages of MSACSA is that it provides not just the global minimum alignment but also many distinct low-lying suboptimal alignments for a given objective function. This is because conformational space annealing can maintain conformational diversity while searching for the conformations with low energies. This characteristic can help to alleviate the problem arising from using an inaccurate score function. The method was the key factor for our success in the recent blind protein structure prediction experiment.

9.
10.
In Rhizobium meliloti, expression of the nodulation genes (nod and nol genes) is under both positive and negative controls. These genes are activated by the products of the three related nodD genes, in conjunction with signal molecules from the host plants. We showed that negative regulation is mediated by a repressor protein, binding to the overlapping nodD1 and nodA as well as to the nodD2 promoters. The encoding gene, termed nolR, was identified and cloned from strain 41. By subcloning, deletion and Tn5 mutagenesis, a region of 594 base-pairs was found to be necessary and sufficient for repressor production in strains of R. meliloti lacking the repressor or in Escherichia coli. Sequence analysis revealed that nolR encodes a 13,349 Da protein, which is in agreement with the molecular weight of the NolR protein, determined after purification by affinity chromatography, utilizing long synthetic DNA multimers of the 21 base-pair conserved repressor-binding sequence. Our data suggest that the native NolR binds to the operator site in dimeric form. NolR contains a helix-turn-helix motif, which shows homology to the DNA-binding sequences of numerous prokaryotic regulatory proteins such as the repressor XylR or the activator NodD and other members of the LysR family. Comparison of the putative DNA-binding helix-turn-helix motifs of a large number of regulatory proteins pointed to a number of novel regularities in this sequence. Hybridizations with an internal nolR fragment showed that sequences homologous to the nolR gene are present in all R. meliloti isolates tested, even in those that do not produce the repressor. However, in another species such as Rhizobium leguminosarum, where NodD is autoregulated, such sequences were not detected.

11.
mop is the structural gene for the molybdenum-pterin binding protein, which is the major molybdenum binding protein in Clostridium pasteurianum. The mop gene was detected by immunoscreening genomic libraries of C. pasteurianum and identified by determining the nucleotide sequence of the cloned insert of clostridial DNA. The deduced amino acid sequence of an open reading frame proved to be identical to the first twelve residues of purified Mop. The DNA sequence flanking the mop gene contains promoter-like consensus sequences which are probably responsible for the expression of Mop in Escherichia coli. The deduced amino acid composition shows that the protein is hydrophobic, lacks aromatic and cysteine residues and has a calculated molecular weight of 7,038. The N-terminal amino acid sequence of Mop has sequence homology with DNA binding proteins. The pattern and type of residues in the N-terminal region suggest it forms the helix-turn-helix structure observed in DNA binding proteins. We propose that Mop may be a regulatory protein binding the anabolic source of molybdenum.

12.
The availability of fast and accurate sequencing procedures along with the use of PCR has led to a proliferation of studies of variability at the molecular level in populations. Nevertheless, it is often impractical to examine long genomic stretches and a large number of individuals at the same time. In order to optimize this kind of study, we suggest a heuristic procedure for detection of the shortest region whose informational content can be considered sufficient for significant phylogenetic reconstruction. The method is based on comparing, by means of a simple index, the pairwise genetic distances obtained from a set of reference sequences to those obtained for different windows of variable size and position. We also present an approach for testing whether the informative content in the stretches selected in this way is significantly different from the corresponding content shown by the larger genomic regions used as reference. Application of this test to the analysis of the VP1 protein gene of foot-and-mouth-disease type C virus allowed us to define optimal stretches whose informative content is not significantly different from that displayed by the complete VP1 sequence. We showed that the predictions made for type C sequences are valid for type O sequences, indicating that the results of the procedure are consistent. Correspondence to: J. Dopazo
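The window comparison described above can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not name its "simple index", so the sketch substitutes the Pearson correlation between full-length and window pairwise distances, and it uses the uncorrected p-distance (fraction of differing sites) as the genetic distance. All function names are hypothetical.

```python
from itertools import combinations


def p_distance(a: str, b: str) -> float:
    """Fraction of aligned positions at which two sequences differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def pairwise_distances(seqs, start=0, end=None):
    """p-distances for all sequence pairs, restricted to [start, end)."""
    return [p_distance(a[start:end], b[start:end])
            for a, b in combinations(seqs, 2)]


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either list has no variance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


def best_window(seqs, size):
    """Return (index value, start) of the window of the given size whose
    pairwise distances agree best with those of the full sequences.
    Assumes equal-length, pre-aligned sequences."""
    full = pairwise_distances(seqs)
    length = len(seqs[0])
    scored = [(pearson(full, pairwise_distances(seqs, s, s + size)), s)
              for s in range(length - size + 1)]
    return max(scored)
```

Scanning several values of `size` and keeping the smallest one whose index exceeds a chosen cutoff would approximate the search for the shortest sufficiently informative region; the cutoff itself is a further assumption not given in the abstract.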

13.
The detection and alignment of locally conserved regions (motifs) in multiple sequences can provide insight into protein structure, function, and evolution. A new Gibbs sampling algorithm is described that detects motif-encoding regions in sequences and optimally partitions them into distinct motif models; this is illustrated using a set of immunoglobulin fold proteins. When applied to sequences sharing a single motif, the sampler can be used to classify motif regions into related submodels, as is illustrated using helix-turn-helix DNA-binding proteins. Other statistically based procedures are described for searching a database for sequences matching motifs found by the sampler. When applied to a set of 32 very distantly related bacterial integral outer membrane proteins, the sampler revealed that they share a subtle, repetitive motif. Although BLAST (Altschul SF et al., 1990, J Mol Biol 215:403-410) fails to detect significant pairwise similarity between any of the sequences, the repeats present in these outer membrane proteins, taken as a whole, are highly significant (based on a generally applicable statistical test for motifs described here). Analysis of bacterial porins with known trimeric beta-barrel structure and related proteins reveals a similar repetitive motif corresponding to alternating membrane-spanning beta-strands. These beta-strands occur on the membrane interface (as opposed to the trimeric interface) of the beta-barrel. The broad conservation and structural location of these repeats suggests that they play important functional roles.

14.
Modeling residue usage in aligned protein sequences via maximum likelihood (total citations: 9; self-citations: 6; citations by others: 3)
A computational method is presented for characterizing residue usage, i.e., site-specific residue frequencies, in aligned protein sequences. The method obtains frequency estimates that maximize the likelihood of the sequences in a simple model for sequence evolution, given a tree or a set of candidate trees computed by other methods. These maximum-likelihood frequencies constitute a profile of the sequences, and thus the method offers a rigorous alternative to sequence weighting for constructing such a profile. The ability of this method to discard misleading phylogenetic effects allows the biochemical propensities of different positions in a sequence to be more clearly observed and interpreted.
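For orientation, the sketch below computes the naive baseline that such a method improves on: unweighted site-specific residue frequencies counted directly from the alignment columns. It is not the maximum-likelihood estimator of the abstract, since it ignores the tree entirely and therefore lets closely related sequences dominate each column. The function name `column_frequencies` and the use of "-" as the gap character are assumptions.

```python
from collections import Counter


def column_frequencies(alignment):
    """Unweighted residue frequencies for each alignment column.

    alignment: list of equal-length strings; '-' marks a gap and is
    excluded from the counts. Returns one dict per column mapping
    residue -> relative frequency (empty dict for all-gap columns).
    """
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for i in range(length):
        counts = Counter(seq[i] for seq in alignment if seq[i] != "-")
        total = sum(counts.values())
        profile.append({res: n / total for res, n in counts.items()}
                       if total else {})
    return profile


if __name__ == "__main__":
    toy = ["ACD", "ACE", "GCD", "A-D"]  # made-up toy alignment
    for i, column in enumerate(column_frequencies(toy)):
        print(i, column)
```

In the toy alignment, three near-identical sequences would push the first column toward "A" regardless of how they are related, which is the kind of phylogenetic effect that sequence weighting or the tree-based likelihood approach is meant to correct.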

15.
Despite the establishment of design principles to optimize codon choice for heterologous expression vector design, the relationship between codon sequence and final protein yield remains poorly understood. In this work, we present a computational framework for the identification of a set of mutant codon sequences for optimized heterologous protein production, which uses a codon-sequence mechanistic model of protein synthesis. Through a sensitivity analysis on the optimal steady state configuration of protein synthesis, we are able to identify the set of codons that are the most rate limiting with respect to steady state protein synthesis rate, and we replace them with synonymous codons recognized by charged tRNAs more efficient for translation, so that the resulting codon-elongation rate is higher. Repeating this procedure, we iteratively optimize the codon sequence for higher protein synthesis rate, taking into account multiple constraints of various types. We determine a small set of optimized synonymous codon sequences that are very close to each other in sequence space but differ in properties such as ribosomal utilization or secondary structure. This limited number of sequences can then be offered for further experimental study. Overall, the proposed method is very valuable in understanding the effects of the different properties of mRNA sequences on the final protein yield in heterologous protein production, and it can find applications in synthetic biology and biotechnology.

16.
We developed a new method which searches for sequence segments responsible for the recognition of a given chemical structure. These segments are detected as those locally conserved between a sequence to be analyzed (target sequence) and a set of sequences (reference sequences). Reference sequences are the sequences of functionally related proteins whose ligands contain a common chemical substructure in their molecular structures. 'Similarity graphing' cuts target sequences into segments, aligns them pairwise with the reference sequences, calculates the degree of similarity for each alignment, and graphically displays cumulative similarity values along the target sequence. Any locally conserved regions, short or long in length and weak or strong in similarity, are detected at their optimal conditions by adjusting three parameters. The 'enzyme-reaction database' contains chemical structures and their related enzymes. When a chemical substructure is input into the database, sequences of the enzymes related to the input substructure are systematically searched from the NBRF sequence database and output as reference sequences. Examples of analysis using similarity graphing in combination with the enzyme-reaction database showed great potential for the systematic analysis of the relationships between sequences and molecular recognition for protein engineering.

17.
The CI protein of coliphage 186 is responsible for maintaining the stable lysogenic state. To do this CI must recognize two distinct DNA sequences, termed A type sites and B type sites. Here we investigate whether CI contains two separate DNA binding motifs or whether CI has one motif that recognizes two different operator sequences. Sequence alignment with 186-like repressors predicts an N-terminal helix-turn-helix (HTH) motif, albeit with poor homology to a large master set of such motifs. The domain structure of CI was investigated by linker insertion mutagenesis and limited proteolysis. CI consists of an N-terminal domain, which weakly dimerizes and binds both A and B type sequences, and a C-terminal domain, which associates to octamers but is unable to bind DNA. A fusion protein consisting of the 186 N-terminal domain and the phage lambda oligomerization domain binds A and B type sequences more efficiently than the isolated 186 CI N-terminal domain, hence the 186 C-terminal domain likely mediates oligomerization and cooperativity. Site-directed mutation of the putative 186 HTH motif eliminates binding to both A and B type sites, supporting the idea that binding to the two distinct DNA sequences is mediated by a variant HTH motif.

18.
MOTIVATION: We present a method for modeling protein families by means of probabilistic suffix trees (PSTs). The method is based on identifying significant patterns in a set of related protein sequences. The patterns can be of arbitrary length, and the input sequences do not need to be aligned, nor is delineation of domain boundaries required. The method is automatic, and can be applied, without assuming any preliminary biological information, with surprising success. Basic biological considerations such as amino acid background probabilities and amino acid substitution probabilities can be incorporated to improve performance. RESULTS: The PST can serve as a predictive tool for protein sequence classification, and for detecting conserved patterns (possibly functionally or structurally important) within protein sequences. The method was tested on the Pfam database of protein families with more than satisfactory performance. Exhaustive evaluations show that the PST model detects many more related sequences than pairwise methods such as Gapped-BLAST, and is almost as sensitive as a hidden Markov model that is trained from a multiple alignment of the input sequences, while being much faster.

19.
Statistical methods have been developed for finding local patterns, also called motifs, in multiple protein sequences. The aligned segments may imply functional or structural core regions. However, the existing methods often have difficulties in aligning multiple proteins when sequence residue identities are low (e.g., less than 25%). In this article, we develop a Bayesian model and Markov chain Monte Carlo (MCMC) methods for identifying subtle motifs in protein sequences. Specifically, a motif is defined not only in terms of specific sites characterized by amino acid frequency vectors, but also as a combination of secondary characteristics such as hydrophobicity, polarity, etc. MCMC methods are proposed to search for a motif pattern with high posterior probability under the new model. A special MCMC algorithm is developed, involving transitions between state spaces of different dimensions. The proposed methods were supported by a simulation study. They were then tested on two real datasets, including a group of helix-turn-helix proteins and one set from the CATH Protein Structure Classification Database. Statistical comparisons showed that the new approach worked better than a typical Gibbs sampling approach which is based only on an amino acid model.

20.
Nucleotide sequences of the cysB region of Salmonella typhimurium and Escherichia coli have been determined and compared. A total of 1759 nucleotides were sequenced in S. typhimurium and 1840 in E. coli. Both contain a 972-nucleotide open reading frame identified as the coding region for the cysB regulatory protein on the basis of sequence homology and by comparison of the deduced amino acid sequences with known physicochemical properties of this protein. The DNA sequence identity for the cysB coding region in the two species is 80.5%. The deduced amino acid sequences are 95% identical. The predicted cysB polypeptide molecular weights are 36,013 for S. typhimurium and 36,150 for E. coli. For both proteins a helix-turn-helix region similar to that found in other DNA-binding proteins is predicted from the deduced amino acid sequence. Sequences upstream of cysB contain open reading frames which represent the carboxyl-terminal end of the topA gene product, DNA topoisomerase I. A pattern of highly conserved nucleotide sequences in the 151 nucleotides immediately preceding the cysB initiator codon in both species suggests that this region may contain multiple signals for the regulation of cysB expression.
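Identity figures such as the 80.5% (DNA) and 95% (amino acid) values quoted above are typically obtained by counting matching positions in a pairwise alignment. The sketch below shows that calculation under simplifying assumptions: the two sequences are already aligned, have equal length, and contain no gaps. The function name `percent_identity` and the toy sequences are hypothetical, and the abstract does not state how its values were computed.

```python
def percent_identity(a: str, b: str) -> float:
    """Percentage of positions at which two aligned sequences match.

    Assumes a and b are already aligned, equal in length, and gap-free.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    if not a:
        raise ValueError("sequences must not be empty")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


if __name__ == "__main__":
    # Made-up toy fragments, not the cysB sequences.
    print(percent_identity("ATGAAATTACAA", "ATGAAACTGCAA"))
```

Synonymous codon changes alter the DNA without altering the protein, which is consistent with amino acid identity being higher than nucleotide identity for the same coding region.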
