Similar articles
 Found 20 similar articles (search time: 36 ms)
1.
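Ai Liang, Feng Jie. 《生物信息学》 (Chinese Journal of Bioinformatics), 2023, 21(3): 179-186
This paper proposes a new, fast, alignment-free method for protein sequence similarity and evolutionary analysis. To characterize a protein sequence, the 10 physicochemical properties of the amino acids are first condensed into 6 principal components by principal component analysis. The principal-component scores are then averaged, using the number of each amino acid in the sequence as weights, and combined with positional information about the amino acids to form a 26-dimensional feature vector for the protein sequence. Finally, the Euclidean distance is used to measure similarity and evolutionary relationships between protein sequences. Tests on 3 protein sequence datasets show that the proposed method clusters every protein sequence correctly and is simple and fast, which demonstrates its effectiveness.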

2.
The data deluge in the post-genomic era demands the development of novel data mining tools. Existing molecular phylogeny analyses (MPAs) developed for individual gene/protein sequences are alignment-based. However, the size of genomic data and the uncertainties associated with alignments necessitate the development of alignment-free methods for MPA. Derivation of distances between sequences is an important step in both alignment-dependent and alignment-free methods. Various alignment-free distance measures based on oligo-nucleotide frequencies, information content, compression techniques, etc. have been proposed. However, these distance measures do not account for the relative order of components, viz. nucleotides or amino acids. A new distance measure, based on the concept of the 'return time distribution' (RTD) of k-mers, is proposed, which accounts for the sequence composition and the relative order of its components. Statistical parameters of RTDs are used to derive a distance function. The resultant distance matrix is used for clustering and phylogeny using Neighbor-joining. Its performance for MPA and subtyping was evaluated using simulated data generated by block-bootstrap, receiver operating characteristics and leave-one-out cross validation methods. The proposed method was successfully applied for MPA of the family Flaviviridae and subtyping of Dengue viruses. It is observed that the method retains resolution for classification and subtyping of viruses at varying levels of sequence similarity and taxonomic hierarchy.
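
As an illustration of the return-time idea described in this abstract, the following minimal Python sketch collects, for every k-mer, the gaps between its successive occurrences and summarizes each distribution. The abstract does not say which statistical parameters or which distance function the authors use, so the choice of mean and standard deviation, the Euclidean distance, the gap convention (difference of start positions) and all function names here are assumptions made for illustration only.

```python
from itertools import product
from math import sqrt
from statistics import mean, pstdev


def return_times(seq, k):
    """Map each k-mer to the list of gaps between its successive occurrences.

    A gap is the difference between the start positions of two consecutive
    occurrences (an assumed convention, not taken from the abstract).
    """
    last_seen = {}
    times = {}
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in last_seen:
            times.setdefault(kmer, []).append(i - last_seen[kmer])
        last_seen[kmer] = i
    return times


def rtd_vector(seq, k=1, alphabet="ACGT"):
    """Summarize every k-mer's return-time distribution by (mean, std).

    k-mers that never recur contribute (0.0, 0.0); this is an assumption.
    """
    times = return_times(seq, k)
    vector = []
    for kmer in map("".join, product(alphabet, repeat=k)):
        gaps = times.get(kmer, [])
        vector.append(mean(gaps) if gaps else 0.0)
        vector.append(pstdev(gaps) if gaps else 0.0)
    return vector


def rtd_distance(seq_a, seq_b, k=1):
    """Euclidean distance between the two summary vectors (assumed metric)."""
    va, vb = rtd_vector(seq_a, k), rtd_vector(seq_b, k)
    return sqrt(sum((x - y) ** 2 for x, y in zip(va, vb)))
```

A matrix of such pairwise distances could then be passed to any Neighbor-joining implementation, which is the clustering step the abstract names.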

3.
A probabilistic measure for alignment-free sequence comparison   (Cited by: 3; self-citations: 0; citations by others: 3)
MOTIVATION: Alignment-free sequence comparison methods are still in the early stages of development compared to those of alignment-based sequence analysis. In this paper, we introduce a probabilistic measure of similarity between two biological sequences without alignment. The method is based on the concept of comparing the similarity/dissimilarity between two constructed Markov models. RESULTS: The method was tested against six DNA sequences, which are the thrA, thrB and thrC genes of the threonine operons from Escherichia coli K-12 and from Shigella flexneri; and one random sequence having the same base composition as thrA from E. coli. These results were compared with those obtained from the CLUSTAL W algorithm (alignment-based) and the chaos game representation (alignment-free). The method was further tested against a more complex set of 40 DNA sequences and compared with other existing sequence similarity measures (alignment-free). AVAILABILITY: All datasets and computer codes written in MATLAB are available upon request from the first author.

4.
In the past, a large number of methods have been developed for predicting various characteristics of a protein from its composition. In order to exploit the full potential of protein composition, we developed the web-server COPid to assist researchers in annotating the function of a protein from its composition using the whole or part of the protein. COPid has three modules called search, composition and analysis. The search module allows searching of protein sequences in six different databases. Search results list database proteins in ascending order of Euclidean distance or descending order of compositional similarity with the query sequence. The composition module allows calculation of the composition of a sequence and the average composition of a group of sequences. The composition module also allows computing the composition of various types of amino acids (e.g. charged, polar, hydrophobic residues). The analysis module provides the following options: i) comparing the composition of two classes of proteins, ii) creating a phylogenetic tree based on the composition and iii) generating input patterns for machine learning techniques. We have evaluated the performance of composition-based (or alignment-free) similarity search in the subcellular localization of proteins. It was found that the alignment-free method performs reasonably well in predicting certain classes of proteins. The COPid web-server is available at http://www.imtech.res.in/raghava/copid/.
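
The composition-based search described in this abstract can be sketched roughly as follows in Python. This is not COPid's code: the function names, the percentage scaling and the decision to ignore non-standard residues are assumptions for illustration. Only the overall idea, ranking database proteins by ascending Euclidean distance between amino acid compositions, comes from the abstract.

```python
from collections import Counter
from math import sqrt

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def composition(seq):
    """Percentage of each standard amino acid; other symbols are ignored."""
    counts = Counter(seq.upper())
    total = sum(counts[a] for a in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [100.0 * counts[a] / total for a in AMINO_ACIDS]


def composition_distance(seq_a, seq_b):
    """Euclidean distance between two 20-dimensional composition vectors."""
    ca, cb = composition(seq_a), composition(seq_b)
    return sqrt(sum((x - y) ** 2 for x, y in zip(ca, cb)))


def rank_by_distance(query, database):
    """Return database names sorted by ascending distance to the query.

    `database` is a hypothetical dict mapping a protein name to its sequence.
    """
    return sorted(database,
                  key=lambda name: composition_distance(query, database[name]))
```

Because only residue frequencies are compared, two sequences with the same residues in a different order have distance zero, which is the main limitation of a pure composition measure.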

5.

Background  

The rapid burgeoning of available protein data makes the use of clustering within families of proteins increasingly important. The challenge is to identify subfamilies of evolutionarily related sequences. This identification reveals phylogenetic relationships, which provide prior knowledge to help researchers understand biological phenomena. A good evolutionary model is essential to achieve a clustering that reflects the biological reality, and an accurate estimate of protein sequence similarity is crucial to the building of such a model. Most existing algorithms estimate this similarity using techniques that are not necessarily biologically plausible, especially for hard-to-align sequences such as proteins with different domain structures, which cause many difficulties for the alignment-dependent algorithms. In this paper, we propose a novel similarity measure based on matching amino acid subsequences. This measure, named SMS for Substitution Matching Similarity, is especially designed for application to non-aligned protein sequences. It allows us to develop a new alignment-free algorithm, named CLUSS, for clustering protein families. To the best of our knowledge, this is the first alignment-free algorithm for clustering protein sequences. Unlike other clustering algorithms, CLUSS is effective on both alignable and non-alignable protein families. In the rest of the paper, we use the term "phylogenetic" in the sense of "relatedness of biological functions".  相似文献   

6.
7.
Measures of genetic distance based on alignment methods are confined to studying sequences that are conserved and identifiable in all organisms under study. A number of alignment-free techniques based on either statistical linguistics or information theory have been developed to overcome the limitations of alignment methods. We present a novel alignment-free approach to measuring the similarity among genetic sequences that incorporates elements from both word rank order-frequency statistics and information theory. We first validate this method on the human influenza A viral genomes as well as on the human mitochondrial DNA database. We then apply the method to study the origin of the SARS coronavirus. We find that the majority of the SARS genome is most closely related to group 1 coronaviruses, with smaller regions of matches to sequences from groups 2 and 3. The information based similarity index provides a new tool to measure the similarity between datasets based on their information content and may have a wide range of applications in the large-scale analysis of genomic databases.  相似文献   

8.
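[Background] The diagnosis of canine brucellosis currently presents certain difficulties. [Objective] To screen for and study the specific epitope of monoclonal antibody 4H3 against Brucella canis. [Methods] Using phage display peptide library technology, monoclonal antibody 4H3 against Brucella canis was used as the target molecule and coated onto ELISA plates, and a random 12-mer peptide library was screened through 3 rounds of biopanning. After the 3 rounds of screening, the phage yield increased from 5.00×10^-7 to 9.84×10^-6, and the false-positive rate decreased with each round. Fourteen positive clones from the third round were randomly picked and amplified, and their genomic DNA was extracted and sequenced; the affinity and specificity of the positive clones were tested by iELISA and cELISA. [Results] The 14 positive monoclonal phages carried 3 different short peptide sequences: KMSIRHPIRLPI, ILRRRRKRIIQI and QRIHMRLTTQS. iELISA showed that the affinity of the 3 peptides for the monoclonal antibody ranked KMSIRHPIRLPI > ILRRRRKRIIQI > QRIHMRLTTQS; cELISA showed that the peptides KMSIRHPIRLPI and ILRRRRKRIIQI had relatively strong specificity. The 2 peptides with stronger affinity and higher specificity, KMSIRHPIRLPI and ILRRRRKRIIQI, were analyzed in detail; alignment analysis ...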

9.

Background

Protein phosphorylation is a generic way to regulate signal transduction pathways in all kingdoms of life. In many organisms, it is achieved by the large family of Ser/Thr/Tyr protein kinases which are traditionally classified into groups and subfamilies on the basis of the amino acid sequence of their catalytic domains. Many protein kinases are multi-domain in nature but the diversity of the accessory domains and their organization are usually not taken into account while classifying kinases into groups or subfamilies.

Methodology

Here, we present an approach which considers amino acid sequences of complete gene products, in order to suggest refinements in sets of pre-classified sequences. The strategy is based on alignment-free similarity scores and iterative Area Under the Curve (AUC) computation. Similarity scores are computed by detecting common patterns between two sequences and scoring them using a substitution matrix, with a consistent normalization scheme. This allows us to handle full-length sequences, and implicitly takes into account domain diversity and domain shuffling. We quantitatively validate our approach on a subset of 212 human protein kinases. We then employ it on the complete repertoire of human protein kinases and suggest few qualitative refinements in the subfamily assignment stored in the KinG database, which is based on catalytic domains only. Based on our new measure, we delineate 37 cases of potential hybrid kinases: sequences for which classical classification based entirely on catalytic domains is inconsistent with the full-length similarity scores computed here, which implicitly consider multi-domain nature and regions outside the catalytic kinase domain. We also provide some examples of hybrid kinases of the protozoan parasite Entamoeba histolytica.

Conclusions

The implicit consideration of multi-domain architectures is a valuable inclusion to complement other classification schemes. The proposed algorithm may also be employed to classify other families of enzymes with multi-domain architecture.

10.
Genes encoding dextranolytic enzymes were isolated from Paenibacillus strains Dex40-8 and Dex50-2. Single, similar but non-identical dex1 genes were isolated from each strain, and a more divergent dex2 gene was isolated from strain Dex50-2. The protein deduced from the Dex40-8 dex1 gene sequence had 716 amino acids, with a predicted Mr of 80.8 kDa. The proteins deduced from the Dex50-2 dex1 and dex2 gene sequences had 905 and 596 amino acids, with predicted Mr of 100.1 kDa and 68.3 kDa, respectively. The deduced amino acid sequences of all three dextranolytic proteins had similarity to family 66 glycosyl hydrolases and were predicted to possess cleavable N-terminal signal peptides. Homology searches suggest that the Dex40-8 and Dex50-2 Dex1 proteins have one and two copies, respectively, of a carbohydrate-binding module similar to CBM_4_9 (pfam02018.11). The Dex50-2 Dex2 deduced amino acid sequence had highest sequence similarity to thermotolerant dextranases from thermophilic Paenibacillus strains, while the Dex40-8 and Dex50-2 Dex1 deduced protein sequences formed a distinct sequence clade among the family 66 proteins. Examination of seven Paenibacillus strains, using a polymerase chain reaction-based assay, indicated that multiple family 66 genes are common within this genus. The three recombinant proteins expressed in Escherichia coli possessed dextranolytic activity and were able to convert ethanol-insoluble blue dextran into an ethanol-soluble product, indicating they are endodextranases (EC 3.2.1.11). The reaction catalysed by each enzyme had a distinct temperature and pH dependence.  相似文献   

11.
12.
The successful prediction of thermophilic proteins is useful for designing stable enzymes that are functional at high temperature. We have used the increment of diversity (ID), a novel amino acid composition-based similarity distance, in a 2-class K-nearest neighbor classifier to classify thermophilic and mesophilic proteins, yielding the KNN-ID classifier for predicting thermophilic proteins. Instead of extracting features from protein sequences as done previously, our approach was based on a diversity measure of symbol sequences. The similarity distance between each pair of protein sequences was first calculated to quantitatively measure the similarity level of one given sequence and the other. The class of the query protein is then determined using the K-nearest neighbor algorithm. Comparisons with multiple recently published methods showed that the KNN-ID proposed in this study outperforms the other methods. The improved predictive performance indicated it is a simple and effective classifier for discriminating thermophilic and mesophilic proteins. Finally, the influence of protein length and protein identity on prediction accuracy was discussed further. The prediction model and dataset used in this article can be freely downloaded from http://wlxy.imu.edu.cn/college/biostation/fuwu/KNN-ID/index.htm.
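
The abstract does not give the formula for the increment of diversity. The Python sketch below therefore assumes the definition commonly used in the literature: for a count vector with total N, the diversity is D = N log N - sum(n_i log n_i), and ID(X, Y) = D(X + Y) - D(X) - D(Y). The base-2 logarithm, the use of raw amino acid counts, the majority vote and every function name are assumptions, not details from the paper.

```python
from collections import Counter
from math import log2

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def diversity(counts):
    """D = N*log2(N) - sum(n_i*log2(n_i)) over the non-zero counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return total * log2(total) - sum(n * log2(n) for n in counts if n > 0)


def increment_of_diversity(seq_a, seq_b):
    """ID(X, Y) = D(X + Y) - D(X) - D(Y) on amino acid count vectors."""
    ca, cb = Counter(seq_a), Counter(seq_b)
    xa = [ca[a] for a in AMINO_ACIDS]
    xb = [cb[a] for a in AMINO_ACIDS]
    merged = [p + q for p, q in zip(xa, xb)]
    return diversity(merged) - diversity(xa) - diversity(xb)


def knn_predict(query, training, k=3):
    """Majority label among the k training sequences with the smallest ID.

    `training` is a hypothetical list of (sequence, label) pairs.
    """
    nearest = sorted(
        training, key=lambda item: increment_of_diversity(query, item[0])
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```

Under this definition the ID is zero when two sequences have proportional compositions and grows as their compositions diverge, which is what allows it to serve as the distance in a nearest-neighbor classifier.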

13.

Background

The function of a protein can be deciphered with higher accuracy from its structure than from its amino acid sequence. Due to the huge gap in the available protein sequence and structural space, tools that can generate functionally homogeneous clusters using only the sequence information, hold great importance. For this, traditional alignment-based tools work well in most cases and clustering is performed on the basis of sequence similarity. But, in the case of multi-domain proteins, the alignment quality might be poor due to varied lengths of the proteins, domain shuffling or circular permutations. Multi-domain proteins are ubiquitous in nature, hence alignment-free tools, which overcome the shortcomings of alignment-based protein comparison methods, are required. Further, existing tools classify proteins using only domain-level information and hence miss out on the information encoded in the tethered regions or accessory domains. Our method, on the other hand, takes into account the full-length sequence of a protein, consolidating the complete sequence information to understand a given protein better.

Results

Our web-server, CLAP (Classification of Proteins), is one such alignment-free software for automatic classification of protein sequences. It utilizes a pattern-matching algorithm that assigns local matching scores (LMS) to residues that are a part of the matched patterns between two sequences being compared. CLAP works on full-length sequences and does not require prior domain definitions. Pilot studies undertaken previously on protein kinases and immunoglobulins have shown that CLAP yields clusters that have high functional and domain architectural similarity. Moreover, parsing at a statistically determined cut-off resulted in clusters that corroborated the sub-family level classification of that particular domain family.

Conclusions

CLAP is a useful protein-clustering tool, independent of domain assignment, domain order, sequence length and domain diversity. Our method can be used for any set of protein sequences, yielding functionally relevant clusters with high domain architectural homogeneity. The CLAP web server is freely available for academic use at http://nslab.mbu.iisc.ernet.in/clap/.

14.

Background  

The chemical properties and biological function of a protein are a direct consequence of its primary structure. Several algorithms have been developed which determine alignment and similarity of primary protein sequences. However, character-based similarity cannot provide insight into the structural aspects of a protein. We present a method based on spectral similarity to compare subsequences of amino acids that behave similarly but are not aligned well by considering amino acids as mere characters. This approach finds a similarity score between sequences based on any given attribute, like hydrophobicity of amino acids, on the basis of spectral information after partial conversion to the frequency domain.

15.
The profile hidden Markov model (PHMM) is widely used to assign the protein sequences to their respective families. A major limitation of a PHMM is the assumption that given states the observations (amino acids) are independent. To overcome this limitation, the dependency between amino acids in a multiple sequence alignment (MSA) which is the representative of a PHMM can be appended to the PHMM. Due to the fact that with a MSA, the sequences of amino acids are biologically related, the one-by-one dependency between two amino acids can be considered. In other words, based on the MSA, the dependency between an amino acid and its corresponding amino acid located above can be combined with the PHMM. For this purpose, the new emission probability matrix which considers the one-by-one dependencies between amino acids is constructed. The parameters of a PHMM are of two types; transition and emission probabilities which are usually estimated using an EM algorithm called the Baum-Welch algorithm. We have generalized the Baum-Welch algorithm using similarity emission matrix constructed by integrating the new emission probability matrix with the common emission probability matrix. Then, the performance of similarity emission is discussed by applying it to the top twenty protein families in the Pfam database. We show that using the similarity emission in the Baum-Welch algorithm significantly outperforms the common Baum-Welch algorithm in the task of assigning protein sequences to protein families.  相似文献   

16.
A 3133-bp nucleotide sequence of the gene Paz1 on chromosome 4 of barley, encoding endosperm protein Z4, has been determined. The sequence includes 1079 bp 5' upstream and 523 bp 3' downstream of the coding region. The 1079-bp 5' upstream region of the gene shows little similarity to 5' regions of other sequenced genes expressed in the developing cereal endosperm. The coding sequence is interrupted by one 334-bp-long intron (bases 1497-1830). The deduced amino acid sequence, which was corroborated by peptide sequences, consists of 399 amino acids and has a molecular mass of 43,128 Da. This sequence confirms protein Z4 to be a member of the serpin superfamily of proteins. The similarity with other members of the family expressed as amino acids in identical positions is in the order of 25-30% and pronounced in the carboxy-terminal half of the molecule. Sequence residues assumed to form clusters stabilizing the tertiary structure are highly conserved. Protein Z4 is synthesized in the developing endosperm without a signal peptide and protein Z4 mRNA was evenly distributed among the free and membrane-bound polyribosomes of the endosperm cell. An internal hydrophobic region of 21 amino acids (residues 36-56) may serve as a signal for targeting the polypeptide into the lumen of the endoplasmic reticulum. The gene for protein Z4 could not be detected in the barley variety Maskin and some of its descendants. The 'high-lysine' alleles, lys1 (Hiproly barley) and lys3a (Bomi mutant 1508) on chromosome 7, enhance and repress, respectively, the expression of the protein Z4 gene. Also, 1554 bp of another 8-kbp fragment of the barley genome Paz psi, similar to the protein-Z4-coding region, have been determined. Small insertions and deletions and the presence of an internal stop codon identify this fragment as part of a pseudogene related to the protein Z4 gene.

17.
The introduction of two-dimensional (2D) graphs and their numerical characterization for comparative analyses of DNA/RNA and protein sequences without the need of sequence alignments is an active yet recent research topic in bioinformatics. Here, we used a 2D artificial representation (four-color maps) with a simple numerical characterization through topological indices (TIs) to aid the discovery of remote homologs of Adenylation domains (A-domains) from the Nonribosomal Peptide Synthetases (NRPS) class in the proteome of the cyanobacterium Microcystis aeruginosa. Cyanobacteria are a rich source of structurally diverse oligopeptides that are predominantly synthesized by NRPS. Several A-domains share amino acid identities lower than 20 %, making them a possible source of remote homologs. Therefore, A-domains cannot be easily retrieved by BLASTp searches using a single template. To cope with the sequence diversity of the A-domains we have combined homology-search methods with an alignment-free tool that uses protein four-color-maps. TI2BioP (Topological Indices to BioPolymers) version 2.0, available at http://ti2biop.sourceforge.net/ allowed the calculation of simple TIs from the protein sequences (four-color maps). Such TIs were used as input predictors for the statistical estimations required to build the alignment-free models. We concluded that the use of graphical/numerical approaches in cooperation with other sequence search methods, like multi-templates BLASTp and profile HMM, can give the most complete exploration of the repertoire of highly diverse protein families.

18.
Forty original sequences of peptide substrates and inhibitors of protein kinases and phosphatases were aligned in a chain matrix without artificial gaps. Fifteen protein kinase peptide substrates and inhibitors (PKSI peptides) contained a common dipeptide ArgArg and also additional important tetra-, tri- and dipeptide homologies. Three further peptide substrates were significantly similar to these peptides but lacked the ArgArg dipeptide. Sequence comparison of individual PKSI peptides revealed probabilistically restricted consensus sequence—PKSI motif—comprising 8 homologous and 13 non-randomly distributed amino acids without considering mutation analysis. This template motif was compared with the consensus sequences of 12 different immunoglobulin domains. In 11 of 12 these domains, the starts of homologous segments were found at nearly the same domain related sites, beginning with serine. A single-triplet mutation, of any of the first two triplet bases that encode equally localized amino acids in each of the two sequence sets (PKSI and Ig) revealed additional homologies with the other set. A primary derived motif version composed of 9 homologous and seven non-randomly distributed amino acids was consequently established by its feedback projection into the original sequence sets. This procedure yielded a second preliminary motif version (revised motif) formed by a sequence of 9 homologous amino acids and two non-randomly distributed amino acids. In addition, three shorter oligopeptide motifs called important stereotypes were derived, based on repeated homology between Ig chains and the revised motif. The most extensive similarities in terms of these stereotypes occurred in the CH2 and CH4 domains of Ig peptides, and inhibitors of cAMP dependent protein kinase and protein kinase A. 
Further comparisons based on a reference sequence set arranged with the aid of feedback projection revealed a lower similarity between variable Ig chains reflected in a decreased number of homologous amino acids. Two final motif versions, FMC and FMV, were found in two different subsets of constant and variable Ig chains, respectively. FMC was composed of seven homologous and one non-randomly distributed amino acids forming the dispersed structure STLR(C)LVSD, whereas 6 homologous and one questionable amino acid constituted FMV. Only CH4 and CH1 domain segments contained all five high-incidence amino acids, which represented a higher level of similarity than homologous amino acids of all preliminary and final motifs. Four such amino acids were present also in three PKSI peptides. All similarities described here occur in domain segments positionally overlapping with the CDR1 region of variable chains. The results are discussed in terms of immunoglobulin evolution, the position of Fc receptor binding sites and degeneration or mutability of the triplets of motif-constituting amino acids.  相似文献   

19.
A method for comparison of protein sequences based on their primary and secondary structure is described. Protein sequences are annotated with predicted secondary structures (using a modified Chou and Fasman method). Two-letter code sequences are generated (Xx, where X is the amino acid and x is its annotated secondary structure). Sequences are compared with a dynamic programming method (STRALIGN) that includes a similarity matrix for both the amino acids and secondary structures. The similarity value for each paired two-letter code is a linear combination of similarity values for the paired amino acids and their annotated secondary structures. The method has been applied to eight globin proteins (28 pairs) for which the X-ray structure is known. For protein pairs with high primary sequence similarity (greater than 45%), STRALIGN alignment is identical to that obtained by a dynamic programming method using only primary sequence information. However, alignment of protein pairs with lower primary sequence similarity improves significantly with the addition of secondary structure annotation. Alignment of the pair with the least primary sequence similarity of 16% was improved from 0 to 37% 'correct' alignment using this method. In addition, STRALIGN was successfully applied to seven pairs of distantly related cytochrome c proteins, and three pairs of distantly related picornavirus proteins.

20.
Digital signal processing (DSP) techniques for biological sequence analysis continue to grow in popularity due to the inherent digital nature of these sequences. DSP methods have demonstrated early success for detection of coding regions in a gene. More recently, these methods have been used to establish DNA gene similarity. We present the inter-coefficient difference (ICD) transformation, a novel extension of the discrete Fourier transformation, which can be applied to any DNA sequence. The ICD method is a mathematical, alignment-free DNA comparison method that generates a genetic signature for any DNA sequence that is used to generate relative measures of similarity among DNA sequences. We demonstrate our method on a set of insulin genes obtained from an evolutionarily wide range of species, and on a set of avian influenza viral sequences, which represents a set of highly similar sequences. We compare phylogenetic trees generated using our technique against trees generated using traditional alignment techniques for similarity and demonstrate that the ICD method produces a highly accurate tree without requiring an alignment prior to establishing sequence similarity.
