Similar articles
 Found 20 similar articles; search took 31 ms
1.
MOTIVATION: We present a method for modeling protein families by means of probabilistic suffix trees (PSTs). The method is based on identifying significant patterns in a set of related protein sequences. The patterns can be of arbitrary length, and the input sequences do not need to be aligned, nor is delineation of domain boundaries required. The method is automatic, and can be applied, without assuming any preliminary biological information, with surprising success. Basic biological considerations such as amino acid background probabilities and amino acid substitution probabilities can be incorporated to improve performance. RESULTS: The PST can serve as a predictive tool for protein sequence classification, and for detecting conserved patterns (possibly functionally or structurally important) within protein sequences. The method was tested on the Pfam database of protein families with more than satisfactory performance. Exhaustive evaluations show that the PST model detects many more related sequences than pairwise methods such as Gapped-BLAST, and is almost as sensitive as a hidden Markov model that is trained from a multiple alignment of the input sequences, while being much faster.
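
The core idea of a PST, predicting each residue from the longest previously seen context, can be illustrated with a minimal sketch. This is not the authors' implementation: it is a plain variable-order Markov model without the pruning and smoothing criteria of a real PST, and the names `train_counts`, `next_prob`, `log_score` and the parameters `max_order`, `pseudo` and `min_count` are assumptions made for illustration.

```python
import math
from collections import defaultdict


def train_counts(sequences, max_order=3):
    """Count how often each residue follows every context of length 0..max_order."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for i, residue in enumerate(seq):
            for k in range(0, min(max_order, i) + 1):
                counts[seq[i - k:i]][residue] += 1
    return counts


def next_prob(counts, context, residue, alphabet_size=20, pseudo=1.0, min_count=2):
    """P(residue | longest suffix of context seen at least min_count times).

    Falls back to the empty context when no longer suffix is well supported.
    Add-`pseudo` smoothing is an illustrative choice, not the published scheme.
    """
    for start in range(len(context) + 1):
        suffix = context[start:]
        seen = counts.get(suffix, {})
        total = sum(seen.values())
        if total >= min_count or suffix == "":
            return (seen.get(residue, 0) + pseudo) / (total + pseudo * alphabet_size)


def log_score(counts, seq, max_order=3):
    """Average log-probability per residue of a non-empty sequence under the model."""
    total = 0.0
    for i, residue in enumerate(seq):
        context = seq[max(0, i - max_order):i]
        total += math.log(next_prob(counts, context, residue))
    return total / len(seq)


# Toy, unaligned "family" (hypothetical data) and two query sequences.
model = train_counts(["ACDACDACD", "ACDACD"])
family_like = log_score(model, "ACDACD")
unrelated = log_score(model, "DCADCA")
```

In this toy setting the sequence that follows the family's recurring pattern receives the higher score; a classifier would compare such scores against a threshold chosen on held-out data.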

2.

Background

Discovering sequence patterns with variation can unveil functions of a protein family that are important for drug discovery. Exploring protein families using existing methods such as multiple sequence alignment is computationally expensive, so pattern search, called motif finding in bioinformatics, is used instead. However, at present, combinatorial algorithms result in large sets of solutions, and probabilistic models require a richer representation of the amino acid associations. To overcome these shortcomings, we present a method for ranking and compacting these solutions in a new representation referred to as Aligned Pattern Clusters (APCs). To tackle the problem of a large solution set, our method reveals a reduced set of candidate solutions without losing any information. To address the problem of representation, our method captures the amino acid associations and conservations of the aligned patterns. Our algorithm renders a set of APCs in which a set of patterns is discovered, pruned, aligned, and synthesized from the input sequences of a protein family.

Results

Our algorithm identifies the binding or other functional segments, and their embedded residues that are important drug targets, from the cytochrome c and ubiquitin protein families taken from UniProt. The results are independently confirmed by Pfam's multiple sequence alignment. For the cytochrome c protein, the number of resulting patterns with variations is reduced by 76.62% from the number of original patterns without variations. Furthermore, all of the top four candidate APCs correspond to binding segments, each with one of its conserved amino acids as the binding residue. The discovered proximal APCs agree with Pfam and PROSITE results. Surprisingly, the distal binding site discovered by our algorithm is found by neither Pfam nor PROSITE, but is confirmed by the three-dimensional cytochrome c structure. When applied to the ubiquitin protein family, our results agree with Pfam and reveal six of the seven lysine binding residues as conserved aligned columns with an entropy redundancy measure of 1.0.
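
The abstract does not define its entropy redundancy measure. As a sketch, assuming the common normalization R = 1 - H / log2(alphabet size), a fully conserved aligned column scores exactly 1.0, which is consistent with the value reported above. The function name `column_redundancy` and the example columns are hypothetical.

```python
import math
from collections import Counter


def column_redundancy(column, alphabet_size=20):
    """Return 1 - H/H_max for one aligned column (assumed definition).

    H is the Shannon entropy (bits) of the residue frequencies in the column,
    and H_max = log2(alphabet_size) is the entropy of a uniform column.
    """
    n = len(column)
    if n == 0:
        raise ValueError("column must contain at least one residue")
    counts = Counter(column)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return 1.0 - entropy / math.log2(alphabet_size)


conserved = column_redundancy("KKKKKK")               # every sequence has lysine
mixed = column_redundancy("KR", alphabet_size=2)      # two equally likely symbols
```

Under this assumed definition a value of 1.0 means the column is invariant, and a value of 0.0 means the residues are spread uniformly over the alphabet.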

Conclusion

The discovery, ranking, reduction, and representation of a set of patterns are important for averting time-consuming and expensive simulations and experiments during proteomic study and drug discovery.

3.
4.
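Huang Jing (黄静) 《Journal of Biomathematics》(生物数学学报) 2003,18(3):351-356
A method is proposed for modeling protein families with neural networks. Its theoretical starting point is to use a neural network to identify common feature patterns in a set of protein sequences from the same family; the resulting model can then be used to predict protein family membership. The patterns the method can identify are unrestricted in length, position, and similar respects, and the protein sequences fed to the neural network during model building and prediction do not need to be pre-aligned. Applied to twenty families in the Pfam protein database, the method achieved an average prediction accuracy of 95.5%.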

5.
Systematic and fully automated identification of protein sequence patterns.   Cited by: 4 (self-citations: 0, citations by others: 4)
We present an efficient algorithm to systematically and automatically identify patterns in protein sequence families. The procedure is based on the Splash deterministic pattern discovery algorithm and on a framework to assess the statistical significance of patterns. We demonstrate its application to the fully automated discovery of patterns in 974 PROSITE families (the complete subset of PROSITE families which are defined by patterns and contain DR records). Splash generates patterns with better specificity and undiminished sensitivity, or vice versa, in 28% of the families; identical statistics were obtained in 48% of the families, worse statistics in 15%, and mixed behavior in the remaining 9%. In about 75% of the cases, Splash patterns identify sequence sites that overlap more than 50% with the corresponding PROSITE pattern. The procedure is sufficiently rapid to enable its use for daily curation of existing motif and profile databases. Moreover, our results show that the statistical significance of discovered patterns correlates well with their biological significance. The trypsin subfamily of serine proteases is used to illustrate this method's ability to exhaustively discover all motifs in a family that are statistically and biologically significant. Finally, we discuss applications of sequence patterns to multiple sequence alignment and the training of more sensitive score-based motif models, akin to the procedure used by PSI-BLAST. All results are available at http://www.research.ibm.com/spat/.

6.
We report a method for detection of recurring side-chain patterns (DRESPAT) using an unbiased and automated graph theoretic approach. We first list all structural patterns as sub-graphs where the protein is represented as a graph. The patterns from proteins are compared pair-wise to detect patterns common to a protein pair based on content and geometry criteria. The recurring pattern is then detected using an automated search algorithm from the all-against-all pair-wise comparison data of proteins. Intra-protein pattern comparison data are used to enable detection of patterns recurring within a protein. A method has been proposed for empirical calculation of statistical significance of recurring pattern. The method was tested on 17 protein sets of varying size, composed of non-redundant representatives from SCOP superfamilies. Recurring patterns in serine proteases, cysteine proteases, lipases, cupredoxin, ferredoxin, ferritin, cytochrome c, aspartoyl proteases, peroxidases, phospholipase A2, endonuclease, SH3 domain, EF-hand and lectins show additional residues conserved in the vicinity of the known functional sites. On the basis of the recurring patterns in ferritin, EF-hand and lectins, we could separate proteins or domains that are structurally similar yet different in metal ion-binding characteristics. In addition, novel recurring patterns were observed in glutathione-S-transferase, phospholipase A2 and ferredoxin with potential structural/functional roles. The results are discussed in relation to the known functional sites in each family. Between 2000 and 50,000 patterns were enumerated from each protein with between ten and 500 patterns detected as common to an evolutionarily related protein pair. Our results show that unbiased extraction of functional site pattern is not feasible from an evolutionarily related protein pair but is feasible from protein sets comprising five or more proteins. 
The DRESPAT method does not require a user-defined pattern, size or location of the pattern and therefore has the potential to uncover new functional sites in protein families.

7.
Several proteins and genes are members of families that share a common evolutionary origin. To outline evolutionary relationships and recognize conserved patterns, sequence comparison has become an important process. The current work critically investigates the role of k-mers in the composition vector method for comparing genome sequences. Composition vector methods are generally applied with different choices of k: for some values of k the results are satisfactory, but for others they are not. In the proposed work, the standard composition vector method is carried out using a 3-mer string length, and a special type of information-based similarity index is used as the distance measure. The work establishes that using 3-mers together with the information-based similarity index gives satisfactory results in all cases, especially for the comparison of whole genome sequences. These choices provide a kind of unified approach to genome sequence comparison.
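
A k-mer comparison of this kind can be sketched as follows. Two simplifications are assumed here: raw 3-mer frequencies are used in place of the full composition vector (which normally subtracts a background expectation), and the Jensen-Shannon divergence stands in for the information-based similarity index, which the abstract does not specify. The names `kmer_frequencies` and `js_divergence` and the toy sequences are hypothetical.

```python
import math
from collections import Counter


def kmer_frequencies(seq, k=3):
    """Relative frequency of every overlapping k-mer in seq."""
    if len(seq) < k:
        raise ValueError("sequence is shorter than k")
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: c / total for kmer, c in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two sparse distributions.

    Ranges from 0 (identical) to 1 (no k-mer in common).
    """
    div = 0.0
    for kmer in set(p) | set(q):
        pi, qi = p.get(kmer, 0.0), q.get(kmer, 0.0)
        mi = 0.5 * (pi + qi)
        if pi > 0:
            div += 0.5 * pi * math.log2(pi / mi)
        if qi > 0:
            div += 0.5 * qi * math.log2(qi / mi)
    return div


# Toy sequences (hypothetical data).
a = kmer_frequencies("ATGCGATGCGATGCGA")
b = kmer_frequencies("ATGCGATGCGTTGCGA")
c = kmer_frequencies("CCCCCCCCCCCCCCCC")
d_close = js_divergence(a, b)
d_far = js_divergence(a, c)
```

Pairwise distances computed this way can be collected into a matrix and passed to any standard clustering or tree-building routine.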

8.
A large portion of the typical eukaryotic genome is composed of repetitive sequences. A common situation, in which several related but different repeat families share the same conserved motif, complicates repeat classification and repeat boundary definition. If the repeats are aligned by the motif position, then the sequence profile (pattern) resulting from the alignment will represent an overlap of the profiles (patterns) corresponding to the individual families. A novel algorithm for the decomposition of overlapping patterns is proposed. It can be used with both continuous and gapped patterns. The technique is based on the accumulation of simultaneously occurring pattern features found by a cross-correlation procedure with limited lag length, hence the name Cumulative Local Cross-Correlation (hereafter CLCC). Its sensitivity is tested on human genomic sequences. A software implementation of the algorithm is available on request from the author.

9.
Worms by number     
This paper investigates alternation patterns in length, shape and orientation of dorsal cirri (fleshy segmental appendages) of phyllodocidans, a large group of polychaete worms (Annelida). We document the alternation patterns in several families of Phyllodocida (Syllidae, Hesionidae, Sigalionidae, Polynoidae, Aphroditidae and Acoetidae) and identify the simple mathematical rule bases that describe the progression of these sequences. Two fundamentally different binary alternation patterns were found on the first four segments: 1011 for nereidiform families and 1010 for aphroditiform families. The alternation pattern in all aphroditiform families matches a simple one-dimensional cellular automaton and that for Syllidae (nereidiform) matches the Fibonacci string sequence. Hesionidae (nereidiform) showed the greatest variation in alternation patterns, but all corresponded to various known substitution rules. Comparison of binary patterns of the first 22 segments using a distance measure supports the current ideas on phylogeny within Phyllodocida. These results suggest that gene(s) involved in post-larval segmental growth employ a switching sequence that corresponds to simple mathematical substitution rules.
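
A substitution rule of the kind mentioned above can be sketched in a few lines. The rule 1 -> 10, 0 -> 1 is the standard generator of the Fibonacci word; whether the paper uses exactly this encoding is an assumption. The strictly alternating string and the Hamming distance are likewise illustrative stand-ins, since the abstract names neither the cellular automaton nor the distance measure.

```python
def substitute(word, rules, steps):
    """Apply a symbol-wise substitution rule to word, steps times."""
    for _ in range(steps):
        word = "".join(rules[symbol] for symbol in word)
    return word


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


# Standard Fibonacci-word substitution (assumed to match the paper's encoding).
FIBONACCI_RULES = {"1": "10", "0": "1"}

fib8 = substitute("1", FIBONACCI_RULES, 4)          # 1 -> 10 -> 101 -> 10110 -> 10110101
fib22 = substitute("1", FIBONACCI_RULES, 7)[:22]    # first 22 "segments"
alt22 = "10" * 11                                    # strictly alternating pattern
distance = hamming(fib22, alt22)
```

The first four symbols of the Fibonacci word are 1011, which matches the nereidiform pattern reported for the first four segments, while the alternating string begins 1010.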

10.
Cappello V  Tramontano A  Koch U 《Proteins》2002,47(2):106-115
Comparative analysis of protein binding sites for similar ligands yields information about conserved interactions, relevant for ligand affinity, and variable interactions, which are important for specificity. The pattern of variability can indicate new targets for a pharmacologically validated class of compounds binding to a similar site. A particularly vast group of therapeutically interesting proteins using the same or similar substrates are those that bind adenine-containing ligands. Drug development is focusing on compounds occupying the adenine-binding site and their specificity is an issue of paramount importance. We use a simple scheme to characterize and classify the adenine-binding sites in terms of their intermolecular interactions, and show that this classification does not necessarily correspond to protein classifications based on either sequence or structural similarity. We find that only a limited number of the different hydrogen bond patterns possible for adenine-binding is used, which can be utilized as an effective classification scheme. Closely related protein families usually share similar hydrogen patterns, whereas non-polar interactions are less well conserved. Our classification scheme can be used to select groups of proteins with a similar ligand-binding site, thus facilitating the definition of the properties that can be exploited to design specific inhibitors.

11.
《Ecological Informatics》2007,2(2):121-127
Scaling of ecological data can present a challenge firstly because of the large amount of information contained in an ecological data set, and secondly because of the problem of fitting data to models that we want to use to capture structure. We present a measure of similarity between data collected at several scales using the same set of attributes. The measure is based on the concept of Kolmogorov complexity and implemented through minimal message length estimates of information content and cluster analysis (the models). The similarity represents common patterns across scales, within the model class. We thus provide a novel solution to the problem of simultaneously considering data structure, model fit and scale. The methods are illustrated in application to an ecological data set.

12.
Nucleotide binding site (NBS)–leucine-rich repeat (LRR) genes belong to the largest class of disease-resistance gene super groups in plants, and their intra- or interspecies nucleotide variations have been studied extensively to understand their evolution and function. However, little is known about the evolutionary patterns of their copy numbers in related species. Here, 129, 245, 239 and 508 NBSs were identified in maize, sorghum, brachypodium and rice, respectively, suggesting considerable variations of these genes. Based on phylogenetic relationships from a total of 496 ancestral branches of grass NBS families, three gene number variation patterns were categorized: conserved, sharing two or more species, and species-specific. Notably, the species-specific NBS branches are dominant (71.6%), while there is only a small percentage (3.83%) of conserved families. In contrast, the conserved families are dominant in 51 randomly selected house-keeping genes (96.1%). The opposite patterns between NBS and the other gene groups suggest that natural selection is responsible for the drastic number variation of NBS genes. The rapid expansion and/or contraction may be a fundamentally important strategy for a species to adapt to the quickly changing species-specific pathogen spectrum. In addition, the small proportion of conserved NBSs suggests that the loss of NBSs may be a general tendency in grass species.

13.
MOTIVATION: Improved comparisons of multiple sequence alignments (profiles) with other profiles can identify subtle relationships between protein families and motifs significantly beyond the resolution of sequence-based comparisons. RESULTS: The local alignment of multiple alignments (LAMA) method was modified to estimate alignment score significance by applying a new measure based on Fisher's combining method. To verify the new procedure, we used known protein structures, sequence annotations and cyclical relations consistency analysis (CYRCA) sets of consistently aligned blocks. Using the new significance measure improved the sensitivity of LAMA without altering its selectivity. The program performed better than other profile-to-profile methods (COMPASS and Prof_sim) and a sequence-to-profile method (PSI-BLAST). The testing was large scale and used several parameters, including pseudo-counts profile calculations and local ungapped blocks or more extended gapped profiles. This comparison provides guidelines to the relative advantages of each method for different cases. We demonstrate and discuss the unique advantages of using block multiple alignments of protein motifs.
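
Fisher's combining method itself is standard and can be sketched independently of LAMA: for k independent p-values, the statistic X = -2 * sum(ln p_i) follows a chi-square distribution with 2k degrees of freedom under the null hypothesis. How LAMA derives the individual p-values is not described in the abstract, so the inputs below are hypothetical, as is the function name `fisher_combined_pvalue`.

```python
import math


def fisher_combined_pvalue(pvalues):
    """Combine independent p-values with Fisher's method.

    Uses the closed-form chi-square survival function for even degrees of
    freedom 2k: P(X > x) = exp(-x/2) * sum_{j=0}^{k-1} (x/2)^j / j!
    """
    if not pvalues or any(not (0.0 < p <= 1.0) for p in pvalues):
        raise ValueError("p-values must lie in (0, 1]")
    half_stat = -sum(math.log(p) for p in pvalues)  # equals X / 2
    term, total = 1.0, 1.0
    for j in range(1, len(pvalues)):
        term *= half_stat / j                        # (x/2)^j / j!
        total += term
    return math.exp(-half_stat) * total


# Hypothetical per-column (or per-block) p-values.
single = fisher_combined_pvalue([0.05])
combined = fisher_combined_pvalue([0.05, 0.05])
```

With a single input the combined value equals that input, and two moderately significant values combine into a smaller p-value, which is the behaviour that lets weak but consistent signals add up.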

14.
We present a system for multi-class protein classification based on neural networks. The basic issue concerning the construction of neural network systems for protein classification is the sequence encoding scheme that must be used in order to feed the neural network. To deal with this problem we propose a method that maps a protein sequence into a numerical feature space using the matching scores of the sequence to groups of conserved patterns (called motifs) into protein families. We consider two alternative ways for identifying the motifs to be used for feature generation and provide a comparative evaluation of the two schemes. We also evaluate the impact of the incorporation of background features (2-grams) on the performance of the neural system. Experimental results on real datasets indicate that the proposed method is highly efficient and is superior to other well-known methods for protein classification.

15.
16.
Carrot (Daucus carota L.) chromosomes are small and uniform in shape and length. Here, mitotic chromosomes were subjected to multicolour fluorescence in situ hybridization (mFISH) with probes derived from conserved plant repetitive DNA (18-25S and 5S rDNA, telomeres), a carrot-specific centromeric repeat (Cent-Dc), carrot-specific repetitive elements (DCREs), and miniature inverted-repeat transposable elements (MITEs). A set of major chromosomal landmarks comprising rDNA and telomeric and centromeric sequences in combination with chromosomal measurements enabled discrimination of carrot chromosomes. In addition, reproducible and unique FISH patterns generated by three carrot genome-specific repeats (DCRE22, DCRE16, and DCRE9) and two transposon families (DcSto and Krak) in combination with telomeric and centromeric reference probes allowed identification of chromosome pairs and construction of detailed carrot karyotypes. Hybridization patterns for DCREs were observed as pericentromeric and interstitial dotted tracks (DCRE22), signals in pericentromeric regions (DCRE16), or scattered signals (DCRE9) along chromosomes similar to those observed for both MITE families.

17.
Protein-protein interactions play an essential role in the functioning of the cell. The importance of charged residues and their diverse role in protein-protein interactions have been well studied using experimental and computational methods. Often, charged residues located in protein interaction interfaces are conserved across the families of homologous proteins and protein complexes. However, on a large scale, it has been recently shown that charged residues are significantly less conserved than other residue types in protein interaction interfaces. The goal of this work is to understand the role of charged residues in the protein interaction interfaces through their conservation patterns. Here, we propose a simple approach where the structural conservation of the charged residue pairs is analyzed among the pairs of homologous binary complexes. Specifically, we determine a large set of homologous interactions using an interaction interface similarity measure and catalog the basic types of conservation patterns among the charged residue pairs. We find an unexpected conservation pattern, which we call the correlated reappearance, occurring among the pairs of homologous interfaces more frequently than the fully conserved pairs of charged residues. Furthermore, the analysis of the conservation patterns across different superkingdoms as well as structural classes of proteins has revealed that the correlated reappearance of charged residues is by far the most prevalent conservation pattern, often occurring more frequently than the unconserved charged residues. We discuss a possible role that the new conservation pattern may play in the long-range electrostatic steering effect.

18.
Identification of structural domains in uncharacterized protein sequences is important in the prediction of protein tertiary folds and functional sites, and hence in designing biologically active molecules. We present a new predictive computational method of classifying a protein into single, two continuous or two discontinuous domains using Bayesian Data Mining. The algorithm requires only the primary sequence and computer-predicted secondary structure. It incorporates correlation patterns between certain 3-dimensional motifs and some local helical folds found conserved in the vicinity of protein domains with high statistical confidence. The prediction of domain class by this computationally simple and fast method shows good accuracy, with average accuracies of 83.3% for single-domain, 60% for two-continuous-domain and 65.7% for two-discontinuous-domain proteins. Experiments on the large validation sample show its performance to be significantly better than that of DGS and DomSSEA. Computations of Bayesian probabilities show important features in terms of correlation of certain conserved patterns of secondary folds and tertiary motifs and give new insight. Applications for improved accuracy of predicting domain boundary points relevant to protein structural and functional modeling are also highlighted.

19.
Liu Y  Engelman DM  Gerstein M 《Genome Biology》2002,3(10):research0054.1-research0054.12

Background

Polytopic membrane proteins can be related to each other on the basis of the number of transmembrane helices and sequence similarities. Building on the Pfam classification of protein domain families, and using transmembrane-helix prediction and sequence-similarity searching, we identified a total of 526 well-characterized membrane protein families in 26 recently sequenced genomes. To this we added a clustering of a number of predicted but unclassified membrane proteins, resulting in a total of 637 membrane protein families.

Results

Analysis of the occurrence and composition of these families revealed several interesting trends. The number of assigned membrane protein domains has an approximately linear relationship to the total number of open reading frames (ORFs) in the 26 genomes studied. Caenorhabditis elegans is an apparent outlier, because of its high representation of seven-span transmembrane (7-TM) chemoreceptor families. In all genomes, including that of C. elegans, the number of distinct membrane protein families has a logarithmic relation to the number of ORFs. Glycine, proline, and tyrosine locations tend to be conserved in transmembrane regions within families, whereas isoleucine, valine, and methionine locations are relatively mutable. Analysis of motifs in putative transmembrane helices reveals that GxxxG and GxxxxxxG (which can be written GG4 and GG7, respectively; see Materials and methods) are among the most prevalent. Although this was noted in earlier studies, we now find that these motifs are particularly well conserved within families, especially those corresponding to transporters, symporters, and channels.
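
Scanning a helix for the GG4 and GG7 motifs can be sketched with a regular expression. A lookahead is used so that overlapping occurrences are all reported. The function name `find_gg_motif` and the example helix are hypothetical, and this is not the authors' pipeline, which also involves transmembrane-helix prediction.

```python
import re


def find_gg_motif(helix, spacing):
    """Return start indices where a G is followed by another G `spacing` residues later.

    spacing=4 corresponds to GxxxG (GG4) and spacing=7 to GxxxxxxG (GG7).
    The zero-width lookahead lets overlapping motifs be found.
    """
    if spacing < 1:
        raise ValueError("spacing must be at least 1")
    pattern = re.compile(r"(?=G.{%d}G)" % (spacing - 1))
    return [match.start() for match in pattern.finditer(helix)]


# Hypothetical transmembrane segment with two overlapping GxxxG motifs.
helix = "LIGAVLGAVLG"
gg4_sites = find_gg_motif(helix, 4)
gg7_sites = find_gg_motif("GAAAAAAG", 7)
```

Counting such sites per family, and comparing the counts with those expected from residue frequencies alone, is one way motif prevalence could be assessed.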

Conclusions

We carried out a genome-wide analysis on patterns of the classified polytopic membrane protein families and analyzed the distribution of conserved amino acids and motifs in the transmembrane helix regions in these families.

20.
Plant disease resistance (R) genes have undergone significant evolutionary divergence to cope with rapid changes in pathogens. These highly variable evolutionary patterns may have contributed to diversity in R gene protein families or structures. Here, the evolutionary patterns of 76 identified R genes and their homologs were investigated within and between plant species. Results demonstrated that nucleotide binding sites and leucine-rich-repeat genes located in loci with complex evolutionary histories tended to evolve rapidly, have high variation in copy numbers, exhibit high levels of nucleotide variation and frequent gene conversion events, and also exhibit high non-synonymous to synonymous substitution ratios in LRR regions. However, non-NBS-LRR R genes are relatively well conserved with constrained variation and are more likely to participate in the basic defense system of hosts. In addition, both conserved and highly divergent evolutionary patterns were observed for the same R genes and were consistent with inter- and intra-specific distributions of some R genes. These results thus indicate either continuous or altered evolutionary patterns between and within species. The present investigation is the first attempt to investigate evolutionary patterns among all clearly functional R genes. The results reported here thus provide a foundation for future plant disease studies.

