1.

Dynamic protein domains: identification, interdependence, and stability

Yesylevskyy SO Kharkyanen VN Demchenko AP 《Biophysical journal》2006,91(2):670-685

Existing methods of domain identification in proteins usually provide no information about the degree of domain independence and stability. However, this information is vital for many areas of protein research. The recently developed hierarchical clustering of correlation patterns (HCCP) technique provides machine-based domain identification in a computationally simple and physically consistent way. Here we present a modification of this technique, which not only allows determination of the most plausible number of dynamic domains but also makes it possible to estimate the degree of their independence (the extent of correlated motion) and stability (the range of environmental conditions in which domains remain intact). With this technique we provided domain assignments and calculated intra- and interdomain correlations and interdomain energies for >2500 test proteins. It is shown that mean intradomain correlation of motions can serve as a quantitative criterion of domain independence, and the HCCP stability gap is a measure of their stability. Our data show that the motions of domains with high stability are usually independent. In contrast, the domains with moderate stability usually exhibit a substantial degree of correlated motions. It is shown that in multidomain proteins the domains are most stable if they are of similar size, and this correlates with the observed abundance of such proteins.
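As a rough illustration of what clustering residues by their correlation patterns can look like, the sketch below applies plain average-linkage agglomerative clustering to a small correlation matrix. It is not the authors' HCCP implementation: the matrix values, the function name `cluster_by_correlation`, and the choice of average linkage are all assumptions made for this example.

```python
def cluster_by_correlation(corr, n_clusters):
    """Greedy average-linkage clustering: repeatedly merge the two clusters
    whose members are, on average, most strongly correlated."""
    clusters = [[i] for i in range(len(corr))]

    def mean_corr(a, b):
        # Average pairwise correlation between members of clusters a and b.
        return sum(corr[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > n_clusters:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                score = mean_corr(clusters[x], clusters[y])
                if best is None or score > best[0]:
                    best = (score, x, y)
        _, x, y = best
        merged = sorted(clusters[x] + clusters[y])
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)]
        clusters.append(merged)
    return sorted(clusters)


# Hypothetical 4-residue correlation matrix (invented values).
CORR = [
    [1.0, 0.9, 0.1, 0.0],
    [0.9, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.8],
    [0.0, 0.1, 0.8, 1.0],
]
domains = cluster_by_correlation(CORR, 2)
```

With this toy matrix, residues 0 and 1 end up in one cluster and residues 2 and 3 in the other, since each pair moves in a strongly correlated way while the correlations between the pairs are weak.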

2.

The blind search for the closed states of hinge-bending proteins

Yesylevskyy SO Kharkyanen VN Demchenko AP 《Proteins》2008,71(2):831-843

The hinge-bending proteins provide the most pronounced example of the large-amplitude slow motions in a number of proteins, which are critical for their functioning. They are often used as the test ground for developing novel approaches to the simulation of slow protein dynamics. In the present study, we present an algorithm that allows physically consistent simulations of slow protein dynamics in globular proteins. Our algorithm is based on the hierarchical clustering of the correlation patterns (HCCP) technique of domain identification, which allows subdividing the protein into a hierarchy of rigid-body-like clusters. The clusters are allowed to rotate relative to one another on the automatically identified hinges. The clusters are found in the course of an automated, objective and well-tested procedure. In the present communication, our technique is applied to 10 hinge-bending proteins. For each of the proteins, we performed a blind search for the closed conformation, starting from the open one. The resulting closed conformations are compared with the closed states observed in crystallographic structures. It is shown that our technique produces realistic closed conformations for 8 out of 10 studied proteins. This demonstrates that the HCCP technique can be used for finding alternative protein conformations and for sampling the slow motions in proteins.

3.

Continuous and discontinuous domains: an algorithm for the automatic generation of reliable protein domain definitions. (Total citations: 9; self-citations: 5; citations by others: 4)

A. S. Siddiqui G. J. Barton 《Protein science : a publication of the Protein Society》1995,4(5):872-884

An algorithm is presented for the fast and accurate definition of protein structural domains from coordinate data without prior knowledge of the number or type of domains. The algorithm explicitly locates domains that comprise one or two continuous segments of protein chain. Domains that include more than two segments are also located. The algorithm was applied to a nonredundant database of 230 protein structures and the results compared to domain definitions obtained from the literature, or by inspection of the coordinates on molecular graphics. For 70% of the proteins, the derived domains agree with the reference definitions, 18% show minor differences and only 12% (28 proteins) show very different definitions. Three screens were applied to identify the derived domains least likely to agree with the subjective definition set. These screens revealed a set of 173 proteins, 97% of which agree well with the subjective definitions. The algorithm represents a practical domain identification tool that can be run routinely on the entire structural database. Adjustment of parameters also allows smaller compact units to be identified in proteins.

4.

Armadillo: domain boundary prediction by amino acid composition

Dumontier M Yao R Feldman HJ Hogue CW 《Journal of molecular biology》2005,350(5):1061-1073

The identification and annotation of protein domains provides a critical step in the accurate determination of molecular function. Both computational and experimental methods of protein structure determination may be deterred by large multi-domain proteins or flexible linker regions. Knowledge of domains and their boundaries may reduce the experimental cost of protein structure determination by allowing researchers to work on a set of smaller and possibly more successful alternatives. Current domain prediction methods often rely on sequence similarity to conserved domains and as such are poorly suited to detect domain structure in poorly conserved or orphan proteins. We present here a simple computational method to identify protein domain linkers and their boundaries from sequence information alone. Our domain predictor, Armadillo (http://armadillo.blueprint.org), uses any amino acid index to convert a protein sequence to a smoothed numeric profile from which domains and domain boundaries may be predicted. We derived an amino acid index called the domain linker propensity index (DLI) from the amino acid composition of domain linkers using a non-redundant structure dataset. The index indicates that Pro and Gly show a propensity for linker residues while small hydrophobic residues do not. Armadillo predicts domain linker boundaries from Z-score distributions and obtains 35% sensitivity with DLI in a two-domain, single-linker dataset (within +/-20 residues from linker). The combination of DLI and an entropy-based amino acid index increases the overall Armadillo sensitivity to 56% for two domain proteins. Moreover, Armadillo achieves 37% sensitivity for multi-domain proteins, surpassing most other prediction methods. Armadillo provides a simple, but effective method by which prediction of domain boundaries can be obtained with reasonable sensitivity. 
Armadillo should prove to be a valuable tool for rapidly delineating protein domains in poorly conserved proteins or those with no sequence neighbors. Domain meta-predictors could yield improved results by using Armadillo as a first-line predictor.
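The profile-based idea described in this abstract (map each residue to an index value, smooth the result, then look at Z-scores) can be sketched as follows. This is only a minimal illustration, not the Armadillo code: the index values in `TOY_INDEX` are invented placeholders rather than the published DLI, and the window size, Z-score threshold, and function names are assumptions.

```python
from statistics import mean, pstdev

# Placeholder index values for illustration only; NOT the published DLI.
TOY_INDEX = {"P": 1.0, "G": 0.8, "S": 0.3, "A": -0.2, "L": -0.6, "V": -0.7}


def smoothed_profile(seq, index, window=3):
    """Map residues to index values, then average over a centered window."""
    raw = [index.get(res, 0.0) for res in seq]
    half = window // 2
    profile = []
    for i in range(len(raw)):
        chunk = raw[max(0, i - half): i + half + 1]
        profile.append(sum(chunk) / len(chunk))
    return profile


def z_scores(values):
    """Standardize values against their own mean and population std dev."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma if sigma else 0.0 for v in values]


def candidate_linker_positions(seq, index, window=3, z_cut=1.0):
    """Return positions whose smoothed index value has a Z-score >= z_cut."""
    z = z_scores(smoothed_profile(seq, index, window))
    return [i for i, score in enumerate(z) if score >= z_cut]


# Made-up sequence with a Pro/Gly-rich stretch in the middle.
hits = candidate_linker_positions("LLVVALLPGPGLLVVAL", TOY_INDEX)
```

In this toy sequence the Pro/Gly-rich stretch at positions 7 to 10 stands out from the hydrophobic flanks, so those positions are reported as candidate linker residues.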

5.

Prediction of phosphotyrosine signaling networks using a scoring matrix-assisted ligand identification approach

Li L Wu C Huang H Zhang K Gan J Li SS 《Nucleic acids research》2008,36(10):3263-3273

Systematic identification of binding partners for modular domains such as Src homology 2 (SH2) is important for understanding the biological function of the corresponding SH2 proteins. We have developed a worldwide web-accessible computer program dubbed SMALI for scoring matrix-assisted ligand identification for SH2 domains and other signaling modules. The current version of SMALI harbors 76 unique scoring matrices for SH2 domains derived from screening oriented peptide array libraries. These scoring matrices are used to search a protein database for short peptides preferred by an SH2 domain. An experimentally determined cut-off value is used to normalize an SMALI score, therefore allowing for direct comparison in peptide-binding potential for different SH2 domains. SMALI employs distinct scoring matrices from Scansite, a popular motif-scanning program. Moreover, SMALI contains built-in filters for phosphoproteins, Gene Ontology (GO) correlation and colocalization of subject and query proteins. Compared to Scansite, SMALI exhibited improved accuracy in identifying binding peptides for SH2 domains. Applying SMALI to a group of SH2 domains identified hundreds of interactions that overlap significantly with known networks mediated by the corresponding SH2 proteins, suggesting SMALI is a useful tool for facile identification of signaling networks mediated by modular domains that recognize short linear peptide motifs.
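A scoring-matrix scan with cut-off normalization, as outlined in this abstract, might look roughly like the sketch below. It is not the SMALI program: the matrix entries, the cut-off, the choice of scoring three positions after each tyrosine, and the function name `scan_sequence` are all hypothetical and chosen only to make the example small.

```python
# Toy position-specific scores for positions +1..+3 after a tyrosine.
# All numbers are invented for illustration; they are not SMALI matrices.
TOY_MATRIX = [
    {"E": 1.5, "D": 1.0},  # Y+1
    {"E": 1.0, "N": 2.0},  # Y+2
    {"I": 2.0, "V": 1.5},  # Y+3
]
TOY_CUTOFF = 3.0  # hypothetical cut-off used to normalize raw scores


def scan_sequence(seq, matrix, cutoff):
    """Score the residues following every tyrosine and return
    (position, peptide, relative score) for windows scoring >= 1.0
    after division by the cut-off."""
    hits = []
    width = len(matrix)
    for i, res in enumerate(seq):
        # Skip non-tyrosines and tyrosines too close to the C-terminus.
        if res != "Y" or i + 1 + width > len(seq):
            continue
        window = seq[i + 1: i + 1 + width]
        raw = sum(col.get(aa, 0.0) for col, aa in zip(matrix, window))
        relative = raw / cutoff
        if relative >= 1.0:
            hits.append((i, "Y" + window, round(relative, 2)))
    return hits


# Made-up protein sequence containing three tyrosines.
hits = scan_sequence("MSAYENIGGYAAAKYDEV", TOY_MATRIX, TOY_CUTOFF)
```

Dividing by a per-matrix cut-off puts scores from different matrices on a common scale, which is what makes a relative score of 1.0 a meaningful threshold across domains in this sketch.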

6.

GlobPlot: Exploring protein sequences for globularity and disorder (Total citations: 2; self-citations: 0; citations by others: 2)

Linding R Russell RB Neduva V Gibson TJ 《Nucleic acids research》2003,31(13):3701-3708

A major challenge in the proteomics and structural genomics era is to predict protein structure and function, including identification of those proteins that are partially or wholly unstructured. Non-globular sequence segments often contain short linear peptide motifs (e.g. SH3-binding sites) which are important for protein function. We present here a new tool for discovery of such unstructured, or disordered regions within proteins. GlobPlot (http://globplot.embl.de) is a web service that allows the user to plot the tendency within the query protein for order/globularity and disorder. We show examples with known proteins where it successfully identifies inter-domain segments containing linear motifs, and also apparently ordered regions that do not contain any recognised domain. GlobPlot may be useful in domain hunting efforts. The plots indicate that instances of known domains may often contain additional N- or C-terminal segments that appear ordered. Thus GlobPlot may be of use in the design of constructs corresponding to globular proteins, as needed for many biochemical studies, particularly structural biology. GlobPlot has a pipeline interface--GlobPipe--for the advanced user to do whole proteome analysis. GlobPlot can also be used as a generic infrastructure package for graphical displaying of any possible propensity.

7.

The change of protein intradomain mobility on ligand binding: is it a commonly observed phenomenon?

Yesylevskyy SO Kharkyanen VN Demchenko AP 《Biophysical journal》2006,91(8):3002-3013

Analysis of changes in the dynamics of protein domains on ligand binding is important in several aspects: for the understanding of the hierarchical nature of protein folding and dynamics at equilibrium; for analysis of signal transduction mechanisms triggered by ligand binding, including allostery; for drug design; and for construction of biosensors reporting on the presence of target ligand in studied media. In this work we use the recently developed HCCP computational technique for the analysis of stabilities of dynamic domains in proteins, their intrinsic motions and of their changes on ligand binding. The work is based on comparative studies of 157 ligand binding proteins, for which several crystal structures (in ligand-free and ligand-bound forms) are available. We demonstrate that the domains of the proteins presented in the Protein DataBank are far more robust than was previously thought: in the majority of the studied proteins (152 out of 157), the ligand binding does not lead to significant change of domain stability. The only exceptions to this rule are four bacterial periplasmic transport proteins and calmodulin. Thus, as a rule, the pattern of correlated motions in dynamic domains, which determines their stability, is insensitive to ligand binding. This rule may be the general feature for a vast majority of proteins.

8.

Method for identification of rigid domains and hinge residues in proteins based on exhaustive enumeration

Eunsung Park Julian Lee 《Proteins》2015,83(6):1054-1067

Many proteins undergo large‐scale motions where relatively rigid domains move against each other. The identification of rigid domains, as well as the hinge residues important for their relative movements, is important for various applications including flexible docking simulations. In this work, we develop a method for protein rigid domain identification based on an exhaustive enumeration of maximal rigid domains, the rigid domains not fully contained within other domains. The computation is performed by mapping the problem to that of finding maximal cliques in a graph. A minimal set of rigid domains is then selected, which covers most of the protein with minimal overlap. In contrast to the results of existing methods that partition a protein into non‐overlapping domains using approximate algorithms, the rigid domains obtained from exact enumeration naturally contain overlapping regions, which correspond to the hinges of the inter‐domain bending motion. The performance of the algorithm is demonstrated on several proteins.
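The mapping to maximal cliques mentioned in this abstract can be illustrated with the basic Bron-Kerbosch recursion, a standard algorithm for enumerating maximal cliques. The abstract does not say which enumeration algorithm the authors use, so this choice is an assumption, as are the toy graph and the interpretation of an edge as "these two residues keep a fixed mutual distance".

```python
def maximal_cliques(adjacency):
    """Enumerate all maximal cliques with the basic Bron-Kerbosch recursion.

    adjacency maps each vertex to the set of its neighbours.
    """
    cliques = []

    def expand(r, p, x):
        # r: current clique, p: candidates to add, x: already-explored vertices.
        if not p and not x:
            cliques.append(sorted(r))
            return
        for v in list(p):
            expand(r | {v}, p & adjacency[v], x & adjacency[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adjacency), set())
    return sorted(cliques)


# Hypothetical rigidity graph: vertices are residues, and an edge means the
# pair behaves as mutually rigid (invented data).
GRAPH = {
    0: {1, 2},
    1: {0, 2},
    2: {0, 1, 3, 4},
    3: {2, 4},
    4: {2, 3},
}
domains = maximal_cliques(GRAPH)
hinge = set(domains[0]) & set(domains[1])
```

Here the two maximal cliques share residue 2, which mirrors the abstract's point that overlapping regions of exactly enumerated rigid domains correspond to hinges.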

9.

Prediction of Ras-effector interactions using position energy matrices

Kiel C Serrano L 《Bioinformatics (Oxford, England)》2007,23(17):2226-2230

MOTIVATION: One of the more challenging problems in biology is to determine the cellular protein interaction network. Progress has been made to predict protein-protein interactions based on structural information, assuming that structurally similar proteins interact in a similar way. In a previous publication, we have determined a genome-wide Ras-effector interaction network based on homology models, with a high accuracy of predicting binding and non-binding domains. However, for a prediction on a genome-wide scale, homology modelling is a time-consuming process. Therefore, we here successfully developed a faster method using position energy matrices, where based on different Ras-effector X-ray template structures, all amino acids in the effector binding domain are sequentially mutated to all other amino acid residues and the effect on binding energy is calculated. Those pre-calculated matrices can then be used to score for binding any Ras or effector sequences. RESULTS: Based on position energy matrices, the sequences of putative Ras-binding domains can be scanned quickly to calculate an energy sum value. By calibrating energy sum values using quantitative experimental binding data, thresholds can be defined and thus non-binding domains can be excluded quickly. Sequences which have energy sum values above this threshold are considered to be potential binding domains, and could be further analysed using homology modelling. This prediction method could be applied to other protein families sharing conserved interaction types, in order to rapidly determine large-scale cellular protein interaction networks. Thus, it could have an important impact on future in silico structural genomics approaches, in particular with regard to increasing structural proteomics efforts, aiming to determine all possible domain folds and interaction types. AVAILABILITY: All matrices are deposited in the ADAN database (http://adan-embl.ibmc.umh.es/). 
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

10.

Use of a Probabilistic Motif Search to Identify Histidine Phosphotransfer Domain-Containing Proteins

Defne Surujon David I. Ratner 《PloS one》2016,11(1)

The wealth of newly obtained proteomic information affords researchers the possibility of searching for proteins of a given structure or function. Here we describe a general method for the detection of a protein domain of interest in any species for which a complete proteome exists. In particular, we apply this approach to identify histidine phosphotransfer (HPt) domain-containing proteins across a range of eukaryotic species. From the sequences of known HPt domains, we created an amino acid occurrence matrix which we then used to define a conserved, probabilistic motif. Examination of various organisms either known to contain (plant and fungal species) or believed to lack (mammals) HPt domains established criteria by which new HPt candidates were identified and ranked. Search results using a probabilistic motif matrix compare favorably with data to be found in several commonly used protein structure/function databases: our method identified all known HPt proteins in the Arabidopsis thaliana proteome, confirmed the absence of such motifs in mice and humans, and suggests new candidate HPts in several organisms. Moreover, probabilistic motif searching can be applied more generally, in a manner both readily customized and computationally compact, to other protein domains; this utility is demonstrated by our identification of histones in a range of eukaryotic organisms.
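One way to picture an amino acid occurrence matrix being turned into a probabilistic motif search is the sketch below. It is a generic log-odds scan, not the authors' procedure: the motif instances in `KNOWN` are made up (they are not real HPt sequences), and the uniform background of 1/20, the pseudocount, and the function names are assumptions.

```python
from collections import Counter
from math import log


def occurrence_matrix(aligned):
    """Count residues at each column of equal-length aligned motif instances."""
    return [Counter(column) for column in zip(*aligned)]


def log_odds_score(window, counts, n_seqs, background=0.05, pseudo=1.0):
    """Sum of per-position log-odds with a simple pseudocount.

    A uniform background of 1/20 is an assumption made for brevity.
    """
    score = 0.0
    for aa, col in zip(window, counts):
        freq = (col[aa] + pseudo * background) / (n_seqs + pseudo)
        score += log(freq / background)
    return score


def best_window(protein, counts, n_seqs):
    """Return (score, start) of the highest-scoring window in the protein."""
    width = len(counts)
    scored = [
        (log_odds_score(protein[i:i + width], counts, n_seqs), i)
        for i in range(len(protein) - width + 1)
    ]
    return max(scored)


# Made-up motif instances and query protein, for illustration only.
KNOWN = ["HQLKG", "HELKG", "HQIKG", "HQLRG"]
COUNTS = occurrence_matrix(KNOWN)
score, start = best_window("MSTAHQLKGSSAV", COUNTS, len(KNOWN))
```

Ranking candidate proteins by their best window score is one simple way to order hits; the criteria actually used to rank HPt candidates are described in the paper, not here.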

11.

SMART: a web-based tool for the study of genetically mobile domains (Total citations: 61; self-citations: 2; citations by others: 59)

Schultz J Copley RR Doerks T Ponting CP Bork P 《Nucleic acids research》2000,28(1):231-234

SMART (a Simple Modular Architecture Research Tool) allows the identification and annotation of genetically mobile domains and the analysis of domain architectures (http://SMART.embl-heidelberg.de). More than 400 domain families found in signalling, extra-cellular and chromatin-associated proteins are detectable. These domains are extensively annotated with respect to phyletic distributions, functional class, tertiary structures and functionally important residues. Each domain found in a non-redundant protein database as well as search parameters and taxonomic information are stored in a relational database system. User interfaces to this database allow searches for proteins containing specific combinations of domains in defined taxa.

12.

Clustering of multi‐domain protein sequences

Prachi Mehrotra Vimla Kany G. Ami Narayanaswamy Srinivasan 《Proteins》2018,86(7):759-776

The overall function of a multi‐domain protein is determined by the functional and structural interplay of its constituent domains. Traditional sequence alignment‐based methods commonly utilize domain‐level information and provide classification only at the level of domains. Such methods cannot take into account the contributions of other domains in the proteins and of domain‐linker regions when classifying multi‐domain proteins. An alignment‐free protein sequence comparison tool, CLAP (CLAssification of Proteins), was previously developed in our laboratory specifically to handle multi‐domain protein sequences without a requirement of defining domain boundaries and sequential order of domains. Through this method we aim to achieve a biologically meaningful classification scheme for multi‐domain protein sequences. In this article, CLAP‐based classification has been explored on 5 datasets of multi‐domain proteins and we present detailed analysis for proteins containing (1) Tyrosine phosphatase and (2) SH3 domain. At the domain level, the CLAP‐based classification scheme resulted in a clustering similar to that obtained from an alignment‐based method. CLAP‐based clusters obtained for full‐length datasets were shown to consist of proteins with similar functions and domain architectures. Our study demonstrates that multi‐domain proteins could be classified effectively by considering full‐length sequences without a requirement of identification of domains in the sequence.

13.

Conservation of inter-residue interactions and prediction of folding rates of domain repeats

Rajathei David Mary Mani K Saravanan 《Journal of biomolecular structure & dynamics》2013,31(3):534-551

Domains are the main structural and functional units of larger proteins. They tend to be contiguous in primary structure and can fold and function independently. It has been observed that 10–20% of all encoded proteins contain duplicated domains and the average pairwise sequence identity between them is usually low. In the present study, we have analyzed the structural similarity between domain repeats of proteins with known structures available in the Protein Data Bank using structure-based inter-residue interaction measures such as the number of long-range contacts, surrounding hydrophobicity, and pairwise interaction energy. We used the RADAR program for detecting the repeats in a protein sequence, which were further validated using Pfam domain assignments. The sequence identity between the repeats in domains ranges from 20 to 40% and their secondary structural elements are well conserved. The number of long-range contacts, surrounding hydrophobicity calculations and pairwise interaction energy of the domain repeats clearly reveal the conservation of 3-D structure environment in the repeats of domains. The proportions of mainchain–mainchain hydrogen bonds and hydrophobic interactions are also highly conserved between the repeats. The present study has suggested that the computation of these structure-based parameters will give better clues about the tertiary environment of the repeats in domains. The folding rates of individual domains in the repeats predicted using the long-range order parameter indicate that the predicted folding rates correlate well with most of the experimentally observed folding rates for the analyzed independently folded domains.

14.

MALDI/MS peptide mass fingerprinting for proteome analysis: identification of hydrophobic proteins attached to eucaryote keratinocyte cytoplasmic membrane using different matrices in concert

Gonnet F Lemaître G Waksman G Tortajada J 《Proteome science》2003,1(1):2

BACKGROUND: MALDI-TOF-MS has become an important analytical tool in the identification of proteins and evaluation of their role in biological processes. A typical protocol consists of sample purification, separation of proteins by 2D-PAGE, enzymatic digestion and identification of proteins by peptide mass fingerprint. Unfortunately, this approach is not appropriate for the identification of membrane or low or high pI proteins. An alternative technique uses 1D-PAGE, which results in a mixture of proteins in each gel band. The direct analysis of the proteolytic digestion of this mixture is often problematic because of poor peptide detection and consequent poor sequence coverage in databases. Sequence coverage can be improved through the combination of several matrices. RESULTS: The aim of this study was to trust the MALDI analysis of complex biological samples, in order to identify proteins that interact with the membrane network of keratinocytes. Peptides obtained from protein trypsin digestions may have either hydrophobic or hydrophilic sections, in which case, the direct analysis of such a mixture by MALDI does not allow desorbing of all peptides. In this work, MALDI/MS experiments were thus performed using four different matrices in concert. The data were analysed with three algorithms in order to test each of them. We observed that the use of at least two matrices in concert leads to a twofold increase of the coverage of each protein. Considering data obtained in this study, we recommend the use of HCCA in concert with the SA matrix in order to obtain a good coverage of hydrophilic proteins, and DHB in concert with the SA matrix to obtain a good coverage of hydrophobic proteins. CONCLUSION: In this work, experiments were performed directly on complex biological samples, in order to see systematic comparison between different matrices for real-life samples and to show a correlation that will be applicable to similar studies. 
When a 1D gel is needed, each band may contain a great number of proteins, each present in small amounts. To improve the protein coverage, we have performed experiments with some matrices in concert. These experiments enabled reliable identification of proteins, without the use of Nanospray MS/MS experiments.

15.

A simple model for proteins with interacting domains. Applications to scanning calorimetry data (Total citations: 18; self-citations: 0; citations by others: 18)

J F Brandts C Q Hu L N Lin M T Mos 《Biochemistry》1989,28(21):8588-8596

A simple thermodynamic model is formulated for the purpose of interpreting scanning calorimetry data on proteins that have interacting domains. Interactions are quantified by inclusion of an interface free energy, ΔG_AB, in the thermodynamics of unfolding for multidomain proteins. The assumption is made that ΔG_AB goes to zero with the unfolding of either domain involved in pairwise interaction, so the interaction term appears to stabilize only the domain with the lower T_m. Application of the model to calorimetric data leads to an estimate of -25,000 cal/mol for interactions between the regulatory and catalytic subunits of native aspartate transcarbamoylase and to a value of 0 for ΔG_AB between the transmembrane and cytoplasmic domains of band 3 of the human erythrocyte membrane. Estimates of changes in ΔG_AB are also obtained for mutant forms of yeast phosphoglycerate kinase that have been altered in the hinge region between amino-terminal and carboxy-terminal domains. The model is also applied to ligand binding to proteins having domains that communicate through pairwise interaction. It is shown that whenever the ΔG_AB term is ligand-dependent, then attachment of the ligand to the binding domain will be partially controlled by the other (regulatory) domain. This situation can sometimes be recognized and quantified when calorimetric scans are carried out at varying ligand concentrations. According to the model, the binding of MgATP to the carboxy-terminal domain of phosphoglycerate kinase is strongly stabilized (ca. 20% of the unitary free energy of binding) by participation of the amino-terminal domain, which acts to increase the binding constant 25-fold. (ABSTRACT TRUNCATED AT 250 WORDS)

16.

DDBASE2.0: updated domain database with improved identification of structural domains

Vinayagam A Shi J Pugalenthi G Meenakshi B Blundell TL Sowdhamini R 《Bioinformatics (Oxford, England)》2003,19(14):1760-1764

MOTIVATION: Although many methods are available for the identification of structural domains from protein three-dimensional structures, accurate definition of protein domains and the curation of such data for a large number of proteins are often possible only after manual intervention. The availability of domain definitions for protein structural entries is useful for the sequence analysis of aligned domains, structure comparison, fold recognition procedures and understanding protein folding, domain stability and flexibility. RESULTS: We have improved our method of domain identification starting from the concept of clustering secondary structural elements, but with an intention of reducing the number of discontinuous segments in identified domains. The results of our modified and automatic approach have been compared with the domain definitions from other databases. On a test data set of 55 proteins, this method acquires high agreement (88%) in the number of domains with the crystallographers' definition and resources such as SCOP, CATH, DALI, 3Dee and PDP databases. This method also obtains 98% overlap score with the other resources in the definition of domain boundaries of the 55 proteins. We have examined the domain arrangements of 4592 non-redundant protein chains using the improved method to include 5409 domains leading to an update of the structural domain database. AVAILABILITY: The latest version of the domain database and online domain identification methods are available from http://www.ncbs.res.in/~faculty/mini/ddbase/ddbase.html Supplementary information: http://www.ncbs.res.in/~faculty/mini/ddbase/supplementary/supplementary.html

17.

Identification of domains and domain interface residues in multidomain proteins from graph spectral method

Sistla RK K V B Vishveshwara S 《Proteins》2005,59(3):616-626

We present a novel method for the identification of structural domains and domain interface residues in proteins by a graph spectral method. This method converts the three-dimensional structure of the protein into a graph by using atomic coordinates from the PDB file. Domain definitions are obtained by constructing either a protein backbone graph or a protein side-chain graph. The graph is constructed based on the interactions between amino acid residues in the three-dimensional structure of the proteins. The spectral parameters of such a graph contain information regarding the domains and subdomains in the protein structure. This is based on the fact that the interactions among amino acids are higher within a domain than across domains. This is evident in the spectra of the protein backbone and the side-chain graphs, thus differentiating the structural domains from one another. Further, residues that occur at the interface of two domains can also be easily identified from the spectra. This method is simple, elegant, and robust. Moreover, a single numeric computation yields both the domain definitions and the interface residues.

18.

Sequence optimization for native state stability determines the evolution and folding kinetics of a small protein

Larson SM Pande VS 《Journal of molecular biology》2003,332(1):275-286

Investigating the relative importance of protein stability, function, and folding kinetics in driving protein evolution has long been hindered by the fact that we can only compare modern natural proteins, the products of the very process we seek to understand, to each other, with no external references or baselines. Through a large-scale all-atom simulation of protein evolution, we have created a large diverse alignment of SH3 domain sequences which have been selected only for native state stability, with no other influencing factors. Although the average pairwise identity between computationally evolved and natural sequences is only 17%, the residue frequency distributions of the computationally evolved sequences are similar to natural SH3 sequences at 86% of the positions in the domain, suggesting that optimization for the native state structure has dominated the evolution of natural SH3 domains. Additionally, the positions which play a consistent role in the transition state of three well-characterized SH3 domains (by phi-value analysis) are structurally optimized for the native state, and vice versa. Indeed, we see a specific and significant correlation between sequence optimization for native state stability and conservation of transition state structure.

19.

On the relationship between the protein structure and protein dynamics

Lu CH Huang SW Lai YL Lin CP Shih CH Huang CC Hsu WL Hwang JK 《Proteins》2008,72(2):625-634

Recently, we have developed a method (Shih et al., Proteins: Structure, Function, and Bioinformatics 2007;68: 34-38) to compute correlation of fluctuations of proteins. This method, referred to as the protein fixed-point (PFP) model, is based on the positional vectors of atoms issuing from the fixed point, which is the point of the least fluctuations in proteins. One corollary from this model is that atoms lying on the same shell centered at the fixed point will have the same thermal fluctuations. In practice, this model provides a convenient way to compute the average dynamical properties of proteins directly from the geometrical shapes of proteins without the need of any mechanical models, and hence no trajectory integration or sophisticated matrix operations are needed. As a result, it is more efficient than molecular dynamics simulation or normal mode analysis. Though in the previous study the PFP model has been successfully applied to a number of proteins of various folds, it is not clear to what extent this model will be applied. In this article, we have carried out the comprehensive analysis of the PFP model for a dataset comprising 972 high-resolution X-ray structures with pairwise sequence identity or=0.5. Our result shows that the fixed-point model is indeed quite general and will be a useful tool for high throughput analysis of dynamical properties of proteins.
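The shell corollary stated in this abstract can be made concrete with a very small geometric sketch. Two things here are assumptions rather than statements from the abstract: the fixed point is approximated by the centroid of the coordinates, and the squared distance from it is used as a stand-in for relative fluctuation. The coordinates are invented.

```python
def centroid(coords):
    """Arithmetic mean of a list of (x, y, z) coordinates."""
    n = len(coords)
    return tuple(sum(c[k] for c in coords) / n for k in range(3))


def shell_profile(coords):
    """Squared distance of each atom from the centroid, used here as a
    stand-in for relative fluctuation (an assumption, see the text above)."""
    cx, cy, cz = centroid(coords)
    return [(x - cx) ** 2 + (y - cy) ** 2 + (z - cz) ** 2 for x, y, z in coords]


# Invented coordinates: two atoms on an inner shell, two on an outer shell.
COORDS = [(1.0, 0.0, 0.0), (-1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (0.0, -2.0, 0.0)]
profile = shell_profile(COORDS)
```

Atoms at the same distance from the reference point receive identical values, which is the shell property the abstract describes; no trajectory or matrix diagonalization is involved in this kind of calculation.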

20.

Rate matrices for analyzing large families of protein sequences.

C Devauchelle A Grossmann A Hénaut M Holschneider M Monnerot J L Risler B Torrésani 《Journal of computational biology》2001,8(4):381-399

We propose and study a new approach for the analysis of families of protein sequences. This method is related to the LogDet distances used in phylogenetic reconstructions; it can be viewed as an attempt to embed these distances into a multidimensional framework. The proposed method starts by associating a Markov matrix to each pairwise alignment deduced from a given multiple alignment. The central objects under consideration here are matrix-valued logarithms L of these Markov matrices, which exist under conditions that are compatible with fairly large divergence between the sequences. These logarithms allow us to compare data from a family of aligned proteins with simple models (in particular, continuous reversible Markov models) and to test the adequacy of such models. If one neglects fluctuations arising from the finite length of sequences, any continuous reversible Markov model with a single rate matrix Q over an arbitrary tree predicts that all the observed matrices L are multiples of Q. Our method exploits this fact, without relying on any tree estimation. We test this prediction on a family of proteins encoded by the mitochondrial genome of 26 multicellular animals, which include vertebrates, arthropods, echinoderms, molluscs, and nematodes. A principal component analysis of the observed matrices L shows that a single rate model can be used as a rough approximation to the data, but that systematic deviations from any such model are unmistakable and related to the evolutionary history of the species under consideration.