Similar literature
 Found 20 similar documents; search took 46 ms
1.
Revisiting the problem of intron-exon identification, we use a principal component analysis (PCA) to classify DNA sequences and present first results that validate our approach. Sequences are translated into document vectors that represent their word content; a principal component analysis then defines Gaussian-distributed sequence classes. The classification uses word content and variation of word usage to distinguish sequences. We test our approach with several data sets of genomic DNA and are able to classify introns and exons with an accuracy of up to 96%. We compare the method with the best traditional coding measure, the non-overlapping hexamer frequency count, and find that the PCA method produces better results. We also investigate the degree of cross-validation between different data sets of introns and exons and find evidence that the quality of a data set can be detected.
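As an illustration of the "document vector" idea above, the following minimal sketch turns a DNA sequence into a vector of relative word (k-mer) frequencies. The function name `word_vector`, the default word length, and the choice to ignore words containing ambiguous bases are assumptions made for this example; the abstract does not specify how the authors build their vectors.

```python
from collections import Counter
from itertools import product


def word_vector(seq, k=6, step=None):
    """Return relative k-mer frequencies of seq, ordered over all 4**k words.

    step defaults to k, which gives non-overlapping words (as in a
    non-overlapping hexamer count); step=1 gives overlapping words.
    """
    step = k if step is None else step
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(0, len(seq) - k + 1, step))
    # Fixed lexicographic word order so that vectors are comparable.
    words = ["".join(p) for p in product("ACGT", repeat=k)]
    # Words containing characters other than A, C, G, T are ignored.
    total = sum(counts[w] for w in words)
    return [counts[w] / total if total else 0.0 for w in words]
```

Vectors built this way for many sequences could then be stacked into a matrix and passed to any PCA routine.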

2.
In this study, we examined whether evolutionarily driven differences in primary sequences correlate with, and thus predict, the genetic diversity of related marker loci, which is an important criterion for assessing the quality of any DNA marker. We adopted a new approach to quantitative symbolic DNA sequence analysis, called DNA random walk representation, to study multiallelic marker loci from Begonia × tuberhybrida Voss. We describe a significant correlation between random walk-derived digital invariants and the genetic diversity of the marker loci. Specifically, on the 3D contour plot of a multivariate principal component analysis (PCA), we found a statistical correlation between the first two PCA factors and the number of alleles per marker locus. Based on that correlation, we suggest that DNA walk representation may predict allele-rich loci solely from their primary sequences, which improves the current design of new DNA germplasm identifiers.
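A DNA walk maps each base to a step and accumulates the steps along the sequence. The sketch below uses one common one-dimensional rule (purines step up, pyrimidines step down); this rule and the name `dna_walk` are assumptions for illustration, since the abstract does not state which mapping or which invariants the authors used.

```python
from itertools import accumulate

# Assumed step rule: purines (A, G) step up, pyrimidines (C, T) step down.
STEP = {"A": 1, "G": 1, "C": -1, "T": -1}


def dna_walk(seq):
    """Return the cumulative position of a one-dimensional walk along seq."""
    return list(accumulate(STEP[base] for base in seq.upper()))
```

Summary statistics of such a walk (for example its final position or its range) are the kind of numerical descriptors that could be fed into a PCA.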

3.
We propose a new approach to studying protein-coding and non-coding regions in DNA sequences that makes use of two complementary statistical methods. Principal component analysis (PCA) is a graphical method for representing DNA sequences that are characterized by a set of quantitative parameters; it serves as an aid to intuition. Discriminant analysis (DA) is a quantitative method for classifying the DNA sequences; it provides an evaluation of the first method and leads to a decision. The value of this approach is confirmed by the fact that we recovered some results recently described in the literature. Furthermore, this general methodology has allowed us to show the existence of parameters that identify the functional domains of nucleic acid sequences without making use of the properties of the genetic code.

4.
Defining the RNA target selectivity of the proteins regulating mRNA metabolism is a key issue in RNA biology. Here we present a novel use of principal component analysis (PCA) to extract the RNA sequence preference of RNA binding proteins. We show that PCA can be used to compare the changes in the nuclear magnetic resonance (NMR) spectrum of a protein upon binding a set of quasi-degenerate RNAs and define the nucleobase specificity. We couple this application of PCA to an automated NMR spectra recording and processing protocol and obtain an unbiased and high-throughput NMR method for the analysis of nucleobase preference in protein–RNA interactions. We test the method on the RNA binding domains of three important regulators of RNA metabolism.

5.
Sequence analysis of large protein families can produce sub-clusters even within the same family. In some cases, it is of interest to know precisely which amino acid position variations are most responsible for driving separation into sub-clusters. In large protein families composed of large proteins, it can be quite challenging to assign the relative importance to specific amino acid positions. Principal components analysis (PCA) is ideal for such a task, since the problem is posed in a large variable space, i.e. the number of amino acids that make up the protein sequence, and PCA is powerful at reducing the dimensionality of complex problems by projecting the data into an eigenspace that represents the directions of greatest variation. However, PCA of aligned protein sequence families is complicated by the fact that protein sequences are traditionally represented by single letter alphabetic codes, whereas PCA of protein sequence families requires conversion of sequence information into a numerical representation. Here, we introduce a new amino acid sequence conversion algorithm optimized for PCA data input. The method is demonstrated using a small artificial dataset to illustrate the characteristics and performance of the algorithm, as well as a small protein sequence family consisting of nine members, COG2263, and finally with a large protein sequence family, Pfam04237, which contains more than 1,800 sequences that group into two sub-clusters.

6.
Principal component analysis (PCA) is a dimensionality reduction and data analysis tool commonly used in many areas. The main idea of PCA is to represent high-dimensional data with a few representative components that capture most of the variance present in the data. However, traditional PCA has an obvious disadvantage when it is applied to data where interpretability is important. In applications where the features have physical meanings, we lose the ability to interpret the principal components extracted by conventional PCA, because each principal component is a linear combination of all the original features. For this reason, sparse PCA has been proposed to improve the interpretability of traditional PCA by introducing sparsity into the loading vectors of the principal components. Sparse PCA can be formulated as an ℓ1-regularized optimization problem, which can be solved by proximal gradient methods. However, these methods do not scale well, because computation of the exact gradient is generally required at each iteration. The stochastic gradient framework addresses this challenge by computing an expected gradient at each iteration. Nevertheless, stochastic approaches typically have low convergence rates due to their high variance. In this paper, we propose a convex sparse principal component analysis (Cvx-SPCA), which leverages a proximal variance-reduced stochastic scheme to achieve a geometric convergence rate. We further show that the convergence analysis can be significantly simplified by using a weak condition that allows a broader class of objectives to be applied. The efficiency and effectiveness of the proposed method are demonstrated on a large-scale electronic medical record cohort.
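Proximal gradient methods for ℓ1-regularized problems alternate a gradient step with the proximal operator of the ℓ1 norm, which is the entrywise soft-thresholding map. The sketch below shows only that operator; the name `soft_threshold` is chosen for this example, and the code is not the Cvx-SPCA algorithm itself, whose details the abstract does not give.

```python
def soft_threshold(vector, lam):
    """Proximal operator of lam * ||x||_1, applied entrywise.

    Entries with magnitude at most lam become exactly zero, which is
    what produces sparse loading vectors; larger entries shrink by lam.
    """
    out = []
    for x in vector:
        if x > lam:
            out.append(x - lam)
        elif x < -lam:
            out.append(x + lam)
        else:
            out.append(0.0)
    return out
```

In a proximal gradient iteration this operator would be applied to the loading vector after each gradient step, with `lam` equal to the step size times the regularization weight.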

7.
The objective of the present investigation was to develop a quantitative electroencephalographic measure (qEEG) that is sensitive and specific to changes in sustained human performance. A principal components analysis (PCA) was performed on the qEEG obtained from participants during a continuous performance test. Measures of sensitivity (proportion of correctly identified correct responses, or hits) and specificity (proportion of correctly identified incorrect responses, or misses) were calculated to assess the classification accuracy of each newly derived component. PCA solutions produced a right hemisphere component comprised of beta-wave activity measured from four unipolar sites (F8, C6a, C6, and T4) that appeared to be sensitive and specific to changes in human performance. Results provide evidence for the validity of a right hemisphere qEEG measure that is sensitive and specific to changes in sustained human performance. Consistent with the findings of previous research, the present findings implicate the right cerebral hemisphere in the sustained attention process.
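The two accuracy measures defined above can be computed directly from paired labels. In this minimal sketch, `True` stands for a correct response (hit) and `False` for an incorrect one (miss); the function name and the boolean encoding are assumptions for illustration.

```python
def sensitivity_specificity(actual, predicted):
    """Return (sensitivity, specificity) for paired boolean labels.

    sensitivity = correctly identified hits / all actual hits
    specificity = correctly identified misses / all actual misses
    """
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a and p)
    fn = sum(1 for a, p in pairs if a and not p)
    tn = sum(1 for a, p in pairs if not a and not p)
    fp = sum(1 for a, p in pairs if not a and p)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one actual hit and one actual miss")
    return tp / (tp + fn), tn / (tn + fp)
```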

8.
We present herein the first complete genome sequence of a thermophilic Bacillus-related species, Geobacillus kaustophilus HTA426, which is composed of a 3.54 Mb chromosome and a 47.9 kb plasmid, along with a comparative analysis with five other mesophilic bacillar genomes. Upon orthologous grouping of the six bacillar sequenced genomes, it was found that 1257 common orthologous groups composed of 1308 genes (37%) are shared by all the bacilli, whereas 839 genes (24%) in the G. kaustophilus genome were found to be unique to that species. We were able to find the first prokaryotic sperm protamine P1 homolog, polyamine synthase, polyamine ABC transporter and RNA methylase in the 839 unique genes; these may contribute to thermophily by stabilizing the nucleic acids. Contrasting results were obtained from the principal component analysis (PCA) of the amino acid composition and synonymous codon usage for highlighting the thermophilic signature of the G. kaustophilus genome. Only in the PCA of the amino acid composition were the Bacillus-related species located near, but were distinguishable from, the borderline distinguishing thermophiles from mesophiles on the second principal axis. Further analysis revealed some asymmetric amino acid substitutions between the thermophiles and the mesophiles, which are possibly associated with the thermoadaptation of the organism.

9.
Evolution of Substrate Specificities in the P-Type ATPase Superfamily   Total citations: 23 (self-citations: 0, citations by others: 23)
P-type ATPases make up a large superfamily of ATP-driven pumps involved in the transmembrane transport of charged substrates. We have performed an analysis of conserved core sequences in 159 P-type ATPases. The various ATPases group together in five major branches according to substrate specificity, and not according to the evolutionary relationship of the parental species, indicating that invention of new substrate specificities is accompanied by abrupt changes in the rate of sequence evolution. A hitherto-unrecognized family of P-type ATPases has been identified that is expected to be represented in all the major phyla of eukarya. Received: 21 May 1997 / Accepted: 1 August 1997

10.
This study investigated the species diversity and substrate utilization patterns of culturable thermophilic bacterial communities in hot aerobic poultry and cattle manure composts by coupling 16S rDNA analysis with Biolog data. Based on the phylogenetic relationships of 16S rDNA sequences, 34 thermophilic (grown at 60 °C) bacteria isolated during aerobic composting of poultry manure and cattle manure were classified as Bacillus licheniformis, B. atrophaeus, Geobacillus stearothermophilus, G. thermodenitrificans, Brevibacillus thermoruber, Ureibacillus terrenus, U. thermosphaericus, and Paenibacillus cookii. In this study, B. atrophaeus, Br. thermoruber, and P. cookii were recorded for the first time in hot compost. Physiological profiles of these bacteria, obtained from the Biolog Gram-positive (GP) microplate system, were subjected to principal component analysis (PCA). All isolates were categorized into eight different PCA groups based on their substrate utilization patterns. The bacterial community from poultry manure compost comprised more divergent species (21 isolates, seven species) and utilized more diverse substrates (eight PCA groups) than that from cattle manure compost (13 isolates, five species, and four PCA groups). Many thermophilic bacteria isolated in this study could use a variety of carboxylic acids. Isolate B110 (from poultry manure compost), which is 97.6% similar to U. terrenus in its 16S rDNA sequence, possesses particularly high activity in utilizing a broad spectrum of substrates. This isolate may have potential applications in industry.

11.
Exploitation of microbial wealth, of which almost 95% or more is still unexplored, is a growing need. The taxonomic placement of a new isolate based on phenotypic characteristics is now being supported by information preserved in the 16S rRNA gene. However, the analysis of 16S rDNA sequences retrieved from metagenomes by the available bioinformatics tools is subject to limitations. In this study, the occurrences of nucleotide features in 16S rDNA sequences have been used to ascertain the taxonomic placement of organisms. The tetra- and penta-nucleotide features were extracted from the training data set of 16S rDNA sequences and were subjected to an artificial neural network (ANN)-based tool known as the self-organizing map (SOM), which helped in visualizing the unsupervised classification. For selection of significant features, principal component analysis (PCA) or curvilinear component analysis (CCA) was applied. The SOM along with these techniques could discriminate the sample sequences with more than 90% accuracy, highlighting the relevance of the features. To ascertain the confidence level of the developed classification approach, the test data set was specifically evaluated for Thiobacillus, with Acidiphilium, Paracoccus and Starkeya, which have been taxonomically reassigned. The evaluation proved the excellent generalization capability of the developed tool. The topology of genera in the SOM supported the conventional chemo-biochemical classification reported in Bergey's Manual.

12.
The contributions of conformational dynamics to substrate specificity have been examined by the application of principal component analysis to molecular dynamics trajectories of alpha-lytic protease. The wild-type alpha-lytic protease is highly specific for substrates with small hydrophobic side chains at the specificity pocket, while the Met190→Ala binding pocket mutant has a much broader specificity, actively hydrolyzing substrates ranging from Ala to Phe. Based on a combination of multiconformation analysis of cryo-X-ray crystallographic data, solution nuclear magnetic resonance (NMR), and normal mode calculations, we had hypothesized that the large alteration in specificity of the mutant enzyme is mainly attributable to changes in the dynamic movement of the two walls of the specificity pocket. To test this hypothesis, we performed a principal component analysis using 1-nanosecond molecular dynamics simulations using either a global or local solvent boundary condition. The results of this analysis strongly support our hypothesis and verify the results previously obtained by in vacuo normal mode analysis. We found that the walls of the wild-type substrate binding pocket move in tandem with one another, causing the pocket size to remain fixed so that only small substrates are recognized. In contrast, the M190A mutant shows uncoupled movement of the binding pocket walls, allowing the pocket to sample both smaller and larger sizes, which appears to be the cause of the observed broad specificity. The results suggest that the protein dynamics of alpha-lytic protease may play a significant role in defining the patterns of substrate specificity. As shown here, concerted local movements within proteins can be efficiently analyzed through a combination of principal component analysis and molecular dynamics trajectories using a local solvent boundary condition to reduce computational time and matrix size.

13.
Phosphagen kinases constitute a large family of enzymes catalyzing the reversible phosphorylation of guanidino acceptor compounds. These guanidino substrates differ substantially in size and chemical properties. In spite of the appearance of X-ray crystal structures for two members of this family, creatine kinase (CK) and arginine kinase (AK), the structural correlates of substrate specificity remain to be fully elucidated. We have determined the cDNA and deduced amino acid sequences for lombricine (guanidinethylphosphoserine) kinase (LK) from the echiuroid worm Urechis caupo and expressed the cDNA in Escherichia coli. The recombinant protein was purified by affinity chromatography and showed high capacity for phosphorylation of lombricine. Phosphagen kinases consist of a small, N-terminal domain and a much larger domain connected by a linker sequence. A key event in catalysis in CK and AK, and certainly all other phosphagen kinases, is a large conformational change involving a rotation of the two domains and the movement of two highly conserved flexible loops (one located in the small domain; the other located in the large domain of these enzymes) which clamp down on the substrates. Multiple sequence alignments of Urechis LK with the only other LK sequence available and CK, AK and glycocyamine kinase sequences confirm the importance of the small flexible loop located in the N-terminal domain of phosphagen kinases as one component of the structural determinants of guanidine specificity. The role of the other flexible loop in the large domain in terms of substrate specificity remains questionable.

14.
Summary: Kallikrein-like simple serine proteases are encoded by closely related members of a gene family in several mammalian species. Molecular cloning and genomic Southern blot analysis after conventional and pulsed-field gel electrophoresis indicate that the rat kallikrein gene family comprises 15–20 members, probably closely linked at a single locus. Determination of the nucleotide sequences of the rGK-3, -4, and -6 genes here completes sequence data for a total of nine rat kallikrein family members. Comparison of the rat gene sequences to each other and to those of human and mouse kallikrein family genes reveals patterns of relatedness indicative of concerted evolution. Analysis of nucleotide sequence variants in kallikrein family members shows that most sequence variants are shared by multiple family members; the patterns of shared variants are complex and indicate multiple short gene conversions between family members. Sequence exchanges between family members generate novel assortments of variants in amino acid coding regions that may affect substrate specificity and thereby contribute to the diversity of enzyme activity. Furthermore, small sequence exchanges also may play a role in generating the diverse patterns of tissue-specific expression of rat family members. These analyses indicate an important role for gene conversion in the evolution of the functional diversity of these duplicated genes.

15.
Members of the 70-kDa family of molecular chaperones assist in a number of molecular interactions that are essential under both normal and stress conditions. These functions require ATP and co-chaperone molecules and are associated with a cyclic transition of intramolecular conformational changes. As a new putative function, we have previously shown that mammalian Hsp/Hsc70 as well as a distant relative, Hsp110, selectively bind certain RNA sequences via their N-terminal ATP-binding domain. To investigate this phenomenon in more detail, here we examined RNA-binding affinity and specificity of various deletion mutants of human Hsp70. We demonstrate that, although the N-terminal ATPase domain alone is sufficient for RNA binding, its binding affinity is considerably reduced when compared to that of the full-length protein. Additionally, we provide evidence that binding of RNA to a membrane-immobilized protein partner results in complete loss of RNA sequence specificity. Using various Hsp70 homologs, we show distinct RNA-binding properties of these proteins judged by sequence specificity, ribopolymer sensitivity, and northwestern analysis. Finally, we present data disclosing that RNA binding by DnaK, the Escherichia coli homolog, is influenced by the activity of its co-chaperones, DnaJ and GrpE. We conclude that the RNA-binding capability of this class of molecular chaperones is a conserved feature and it is strongly influenced by the structural and conformational properties. Furthermore, the notion that RNA binding of some Hsp70 family members is influenced by co-chaperones suggests an RNA-binding cycle resembling the protein-binding property of the chaperones.

16.
An important aspect of the functional annotation of enzymes is not only the type of reaction catalysed by an enzyme, but also the substrate specificity, which can vary widely within the same family. In many cases, prediction of family membership and even substrate specificity is possible from enzyme sequence alone, using a nearest neighbour classification rule. However, the combination of structural information and sequence information can improve the interpretability and accuracy of predictive models. The method presented here, Active Site Classification (ASC), automatically extracts the residues lining the active site from one representative three-dimensional structure and the corresponding residues from sequences of other members of the family. From a set of representatives with known substrate specificity, a Support Vector Machine (SVM) can then learn a model of substrate specificity. Applied to a sequence of unknown specificity, the SVM can then predict the most likely substrate. The models can also be analysed to reveal the underlying structural reasons determining substrate specificities and thus yield valuable insights into mechanisms of enzyme specificity. We illustrate the high prediction accuracy achieved on two benchmark data sets and the structural insights gained from ASC by a detailed analysis of the family of decarboxylating dehydrogenases. The ASC web service is available at http://asc.informatik.uni-tuebingen.de/.

17.
Can genome analysis tell us about the lifestyle of an organism? We ask this question considering a thorough cross comparison of thermophilic and mesophilic genomes, since presently the number of available genomes is enough to ensure statistical significance of the results. We analyze, by means of principal component analysis (PCA), the codon composition of a database comprising 116 genomes, selected so as to include one species for each genus and show that a cross genomic approach can allow the extraction of common determinants of thermostability at the genome level. The results of our analysis indicate that all the known features of thermostability can be found in the 64 component loadings of the second principal axis of PCA. By this, we develop an index of thermostability whose discriminative power between mesophiles and thermophiles scores with 98% accuracy at the genome level and with 95% accuracy at the protein sequence level. We also prove that these results are not due to phylogenetic differences between archaea and bacteria.
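In PCA, the coordinate of a sample on a principal axis is the dot product of its mean-centered feature vector with that axis's loadings. The sketch below computes such a score for a generic feature vector; with 64 codon frequencies and the loadings of the second axis it would give a number of the kind the abstract describes. The name `pc_score` and the assumption that the index is exactly this projection are illustrative only; the abstract does not define the index.

```python
def pc_score(features, mean, loadings):
    """Project a mean-centered feature vector onto one principal axis.

    features, mean and loadings must have the same length (64 for a
    codon-composition vector).
    """
    if not (len(features) == len(mean) == len(loadings)):
        raise ValueError("features, mean and loadings must have equal length")
    return sum((x - m) * w for x, m, w in zip(features, mean, loadings))
```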

18.
PLK1 (polo-like kinase 1) is a key mitotic kinase and a therapeutic target in the treatment of proliferative diseases. Here we investigate the relative substrate specificity and pharmacological relatedness of PLK1, -2, -3, and -4 that together comprise a conserved family of Ser/Thr kinases (PLK family). We report consensus substrate sequences for PLK2, -3, and -4 and an expanded consensus sequence for PLK1, which we use to design an optimal peptide substrate, PLKtide. We report inhibitory activity for the entire PLK family across a diverse set of small-molecule ATP-competitive inhibitors including several clinical compounds. With respect to both substrate and ATP-site specificity, highest similarity is observed between PLK2 and PLK3, PLK1 is next most similar, and PLK4 is least similar. Further, we have identified and report time-dependent inhibition by two potent and selective PLK inhibitors.

19.
We present a new support vector machine (SVM)-based approach to predict the substrate specificity of subtypes of a given protein sequence family. We demonstrate the usefulness of this method on the example of aryl acid-activating and amino acid-activating adenylation domains (A domains) of nonribosomal peptide synthetases (NRPS). The residues of gramicidin synthetase A that are 8 Å around the substrate amino acid and corresponding positions of other adenylation domain sequences with 397 known and unknown specificities were extracted and used to encode this physico-chemical fingerprint into normalized real-valued feature vectors based on the physico-chemical properties of the amino acids. The SVM software package SVM(light) was used for training and classification, with transductive SVMs to take advantage of the information inherent in unlabeled data. Specificities for very similar substrates that frequently show cross-specificities were pooled to the so-called composite specificities and predictive models were built for them. The reliability of the models was confirmed in cross-validations and in comparison with a currently used sequence-comparison-based method. When comparing the predictions for 1230 NRPS A domains that are currently detectable in UniProt, the new method was able to give a specificity prediction in an additional 18% of the cases compared with the old method. For 70% of the sequences both methods agreed, for <6% they did not, mainly on low-confidence predictions by the existing method. None of the predictive methods could infer any specificity for 2.4% of the sequences, suggesting completely new types of specificity.

20.
A class of UDP-glycosyltransferases (UGTs) defined by the presence of a C-terminal consensus sequence is found throughout the plant and animal kingdoms. Whereas mammalian enzymes use UDP-glucuronic acid, the plant enzymes typically use UDP-glucose in the transfer reactions. A diverse array of aglycones can be glucosylated by these UGTs. In plants, the aglycones include plant hormones, secondary metabolites involved in stress and defense responses, and xenobiotics such as herbicides. Glycosylation is known to regulate many properties of the aglycones such as their bioactivity, their solubility, and their transport properties within the cell and throughout the plant. As a means of providing a framework to start to understand the substrate specificities and structure-function relationships of plant UGTs, we have now applied a molecular phylogenetic analysis to the multigene family of 99 UGT sequences in Arabidopsis. We have determined the overall organization and evolutionary relationships among individual members with a surprisingly high degree of confidence. Through constructing a composite phylogenetic tree that also includes all of the additional plant UGTs with known catalytic activities, we can start to predict both the evolutionary history and substrate specificities of new sequences as they are identified. The tree already suggests that while the activities of some subgroups of the UGT family are highly conserved among different plant species, other subgroups shift substrate specificity with relative ease.
