Found 20 similar documents; search took 46 ms.
1.
Revisiting the problem of intron-exon identification, we use a principal component analysis (PCA) to classify DNA sequences and present first results that validate our approach. Sequences are translated into document vectors that represent their word content; a principal component analysis then defines Gaussian-distributed sequence classes. The classification uses word content and variation of word usage to distinguish sequences. We test our approach with several data sets of genomic DNA and are able to classify introns and exons with an accuracy of up to 96%. We compare the method with the best traditional coding measure, the non-overlapping hexamer frequency count, and find that the PCA method produces better results. We also investigate the degree of cross-validation between different data sets of introns and exons and find evidence that the quality of a data set can be detected.
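The word-content vectors mentioned in this abstract can be illustrated with a short sketch. It is not the authors' code: the function name `kmer_vector`, the default word length, and the example sequence are assumptions made for illustration, and the PCA step itself is left out. With `k=6, step=6` the same function gives a non-overlapping hexamer frequency count of the kind the abstract names as the baseline.

```python
from collections import Counter
from itertools import product


def kmer_vector(seq, k=2, step=1):
    """Relative frequencies of all 4**k DNA words, in a fixed lexicographic order.

    step=1 counts overlapping words; step=k counts non-overlapping words.
    Words containing characters other than A, C, G, T are ignored.
    """
    seq = seq.upper()
    words = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(0, len(seq) - k + 1, step))
    total = sum(counts[w] for w in words)
    return [counts[w] / total if total else 0.0 for w in words]


# Hypothetical example sequence; each sequence becomes one row of the
# matrix that a PCA would later be applied to.
vec = kmer_vector("ATGGCCATTGTAATGGGCCGC", k=2)
print(len(vec))
```

Each sequence maps to a vector of fixed length (16 for dinucleotides, 4096 for hexamers), so sequences of different lengths become comparable rows of one data matrix.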
2.
In this study, we examined whether evolutionarily driven differences in primary sequences correlate with, and thus predict, the genetic diversity of related marker loci, which is an important criterion for assessing the quality of any DNA marker. We adopted a new approach to quantitative symbolic DNA sequence analysis, the DNA random walk representation, to study multiallelic marker loci from Begonia × tuberhybrida Voss. We found a significant correlation between random walk-derived digital invariants and the genetic diversity of the marker loci. Specifically, on the 3D contour plot of a multivariate principal component analysis (PCA), we observed a statistical correlation between the first two PCA factors and the number of alleles per marker locus. Based on that correlation, we suggest that the DNA walk representation may predict allele-rich loci solely from their primary sequences, which could improve the design of new DNA germplasm identifiers.
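A DNA walk of the general kind this abstract refers to can be sketched in a few lines. The abstract does not say which mapping the authors used, so the purine/pyrimidine rule below (+1 for C or T, -1 for A or G) is an assumption, as are the function name `dna_walk` and the two summary values computed at the end.

```python
from itertools import accumulate


def dna_walk(seq):
    """Cumulative purine/pyrimidine walk: +1 for C or T, -1 for A or G.

    Any other character (for example N) contributes a step of 0.
    """
    steps = {"C": 1, "T": 1, "A": -1, "G": -1}
    return list(accumulate(steps.get(base, 0) for base in seq.upper()))


walk = dna_walk("ACGTTTCA")
# Two simple numbers that could serve as walk-derived descriptors
# (illustrative only, not the invariants used in the study).
net_displacement = walk[-1]
max_excursion = max(abs(y) for y in walk)
print(walk, net_displacement, max_excursion)
```

Descriptors like these, computed per locus, are the sort of numeric columns that could then be passed to a PCA.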
3.
New statistical approach to discriminate between protein coding and non-coding regions in DNA sequences and its evaluation (total citations: 3; self-citations: 0; citations by others: 3)
C J Michel. Journal of theoretical biology, 1986, 120(2):223-236
We propose a new approach to study protein coding and non-coding regions in DNA sequences, by making use of two complementary statistical methods. Principal component analysis (PCA) is a graphical method for representing DNA sequences characterized by a set of quantitative parameters; it serves as an aid to intuition. Discriminant analysis (DA) is a quantitative method that makes it possible to classify the DNA sequences. It provides an evaluation of the first method and leads to a decision. The value of this approach is confirmed by the fact that it also recovers results recently described in the literature. Furthermore, this general methodology has allowed us to show the existence of parameters that identify the functional domains of nucleic acid sequences without making use of the properties of the genetic code.
4.
Katherine M. Collins, Alain Oregioni, Laura E. Robertson, Geoff Kelly, Andres Ramos. Nucleic acids research, 2015, 43(6):e41
Defining the RNA target selectivity of the proteins regulating mRNA metabolism is a key issue in RNA biology. Here we present a novel use of principal component analysis (PCA) to extract the RNA sequence preference of RNA binding proteins. We show that PCA can be used to compare the changes in the nuclear magnetic resonance (NMR) spectrum of a protein upon binding a set of quasi-degenerate RNAs and define the nucleobase specificity. We couple this application of PCA to an automated NMR spectra recording and processing protocol and obtain an unbiased and high-throughput NMR method for the analysis of nucleobase preference in protein–RNA interactions. We test the method on the RNA binding domains of three important regulators of RNA metabolism.
5.
Sequence analysis of large protein families can produce sub-clusters even within the same family. In some cases, it is of interest to know precisely which amino acid position variations are most responsible for driving separation into sub-clusters. In large protein families composed of large proteins, it can be quite challenging to assign the relative importance to specific amino acid positions. Principal components analysis (PCA) is ideal for such a task, since the problem is posed in a large variable space, i.e. the number of amino acids that make up the protein sequence, and PCA is powerful at reducing the dimensionality of complex problems by projecting the data into an eigenspace that represents the directions of greatest variation. However, PCA of aligned protein sequence families is complicated by the fact that protein sequences are traditionally represented by single letter alphabetic codes, whereas PCA of protein sequence families requires conversion of sequence information into a numerical representation. Here, we introduce a new amino acid sequence conversion algorithm optimized for PCA data input. The method is demonstrated using a small artificial dataset to illustrate the characteristics and performance of the algorithm, as well as a small protein sequence family consisting of nine members, COG2263, and finally with a large protein sequence family, Pfam04237, which contains more than 1,800 sequences that group into two sub-clusters.
6.
Principal component analysis (PCA) is a dimensionality reduction and data analysis tool commonly used in many areas. The main idea of PCA is to represent high-dimensional data with a few representative components that capture most of the variance present in the data. However, there is an obvious disadvantage of traditional PCA when it is applied to analyze data where interpretability is important. In applications where the features have some physical meaning, we lose the ability to interpret the principal components extracted by conventional PCA because each principal component is a linear combination of all the original features. For this reason, sparse PCA has been proposed to improve the interpretability of traditional PCA by introducing sparsity to the loading vectors of principal components. Sparse PCA can be formulated as an ℓ1-regularized optimization problem, which can be solved by proximal gradient methods. However, these methods do not scale well because computation of the exact gradient is generally required at each iteration. The stochastic gradient framework addresses this challenge by computing an expected gradient at each iteration. Nevertheless, stochastic approaches typically have low convergence rates due to the high variance. In this paper, we propose a convex sparse principal component analysis (Cvx-SPCA), which leverages a proximal variance reduced stochastic scheme to achieve a geometric convergence rate. We further show that the convergence analysis can be significantly simplified by using a weak condition which allows a broader class of objectives to be applied. The efficiency and effectiveness of the proposed method are demonstrated on a large-scale electronic medical record cohort.
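The proximal gradient idea behind ℓ1-regularized problems can be made concrete with the soft-thresholding operator, which is the proximal operator of the ℓ1 norm. The sketch below shows only this generic building block and a single proximal gradient step; it is not the Cvx-SPCA algorithm, and the names `soft_threshold`, `prox_grad_step`, `eta`, and `lam` are assumptions for illustration.

```python
def soft_threshold(v, lam):
    """Proximal operator of lam * ||v||_1: shrink each entry toward zero by lam."""
    out = []
    for x in v:
        shrunk = max(abs(x) - lam, 0.0)
        out.append(-shrunk if x < 0 and shrunk > 0 else shrunk)
    return out


def prox_grad_step(w, grad, eta, lam):
    """One proximal gradient step: a gradient step of size eta on the smooth
    part of the objective, followed by soft-thresholding with threshold eta * lam."""
    moved = [wi - eta * gi for wi, gi in zip(w, grad)]
    return soft_threshold(moved, eta * lam)


print(soft_threshold([3.0, -2.0, 0.5, 0.0], 1.0))
```

Entries whose magnitude falls below the threshold are set exactly to zero, which is how the ℓ1 penalty produces sparse loading vectors.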
7.
The objective of the present investigation was to develop a quantitative electroencephalographic measure (qEEG) that is sensitive and specific to changes in sustained human performance. A principal components analysis (PCA) was performed on the qEEG obtained from participants during a continuous performance test. Measures of sensitivity (proportion of correctly identified correct responses, or hits) and specificity (proportion of correctly identified incorrect responses, or misses) were calculated to assess the classification accuracy of each newly derived component. PCA solutions produced a right hemisphere component comprised of beta-wave activity measured from four unipolar sites (F8, C6a, C6, and T4) that appeared to be sensitive and specific to changes in human performance. Results provide evidence for the validity of a right hemisphere qEEG measure that is sensitive and specific to changes in sustained human performance. Consistent with the findings of previous research, the present findings implicate the right cerebral hemisphere in the sustained attention process.
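The two accuracy measures defined in this abstract can be written out directly. This is a minimal sketch using the abstract's own definitions (hits as the positive class, misses as the negative class); the function name and the example labels are hypothetical and are not data from the study.

```python
def sensitivity_specificity(actual, predicted):
    """Return (sensitivity, specificity) for two equal-length lists of booleans.

    True marks a correct response (hit), False an incorrect response (miss).
    Sensitivity is the fraction of actual hits predicted as hits;
    specificity is the fraction of actual misses predicted as misses.
    """
    pairs = list(zip(actual, predicted))
    true_pos = sum(1 for a, p in pairs if a and p)
    true_neg = sum(1 for a, p in pairs if not a and not p)
    pos = sum(1 for a in actual if a)
    neg = len(actual) - pos
    if pos == 0 or neg == 0:
        raise ValueError("both classes must be present")
    return true_pos / pos, true_neg / neg


# Hypothetical labels: three hits and two misses, with one error in each class.
actual = [True, True, True, False, False]
predicted = [True, True, False, False, True]
print(sensitivity_specificity(actual, predicted))
```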
8.
Thermoadaptation trait revealed by the genome sequence of thermophilic Geobacillus kaustophilus (total citations: 1; self-citations: 0; citations by others: 1)
Takami H, Takaki Y, Chee GJ, Nishi S, Shimamura S, Suzuki H, Matsui S, Uchiyama I. Nucleic acids research, 2004, 32(21):6292-6303
We present herein the first complete genome sequence of a thermophilic Bacillus-related species, Geobacillus kaustophilus HTA426, which is composed of a 3.54 Mb chromosome and a 47.9 kb plasmid, along with a comparative analysis with five other mesophilic bacillar genomes. Upon orthologous grouping of the six bacillar sequenced genomes, it was found that 1257 common orthologous groups composed of 1308 genes (37%) are shared by all the bacilli, whereas 839 genes (24%) in the G. kaustophilus genome were found to be unique to that species. We were able to find the first prokaryotic sperm protamine P1 homolog, polyamine synthase, polyamine ABC transporter and RNA methylase in the 839 unique genes; these may contribute to thermophily by stabilizing the nucleic acids. Contrasting results were obtained from the principal component analysis (PCA) of the amino acid composition and synonymous codon usage for highlighting the thermophilic signature of the G. kaustophilus genome. Only in the PCA of the amino acid composition were the Bacillus-related species located near, but were distinguishable from, the borderline distinguishing thermophiles from mesophiles on the second principal axis. Further analysis revealed some asymmetric amino acid substitutions between the thermophiles and the mesophiles, which are possibly associated with the thermoadaptation of the organism.
9.
Evolution of Substrate Specificities in the P-Type ATPase Superfamily (total citations: 23; self-citations: 0; citations by others: 23)
P-type ATPases make up a large superfamily of ATP-driven pumps involved in the transmembrane transport of charged substrates.
We have performed an analysis of conserved core sequences in 159 P-type ATPases. The various ATPases group together in five
major branches according to substrate specificity, and not according to the evolutionary relationship of the parental species,
indicating that invention of new substrate specificities is accompanied by abrupt changes in the rate of sequence evolution.
A hitherto-unrecognized family of P-type ATPases has been identified that is expected to be represented in all the major phyla
of eukarya.
Received: 21 May 1997 / Accepted: 1 August 1997
10.
Species Diversity and Substrate Utilization Patterns of Thermophilic Bacterial Communities in Hot Aerobic Poultry and Cattle Manure Composts (total citations: 1; self-citations: 1; citations by others: 0)
This study investigated the species diversity and substrate utilization patterns of culturable thermophilic bacterial communities in hot aerobic poultry and cattle manure composts by coupling 16S rDNA analysis with Biolog data. Based on the phylogenetic relationships of 16S rDNA sequences, 34 thermophilic (grown at 60 °C) bacteria isolated during aerobic composting of poultry manure and cattle manure were classified as Bacillus licheniformis, B. atrophaeus, Geobacillus stearothermophilus, G. thermodenitrificans, Brevibacillus thermoruber, Ureibacillus terrenus, U. thermosphaericus, and Paenibacillus cookii. In this study, B. atrophaeus, Br. thermoruber, and P. cookii were recorded for the first time in hot compost. Physiological profiles of these bacteria, obtained from the Biolog Gram-positive (GP) microplate system, were subjected to principal component analysis (PCA). All isolates were categorized into eight different PCA groups based on their substrate utilization patterns. The bacterial community from poultry manure compost comprised more divergent species (21 isolates, seven species) and utilized more diverse substrates (eight PCA groups) than that from cattle manure compost (13 isolates, five species, and four PCA groups). Many thermophilic bacteria isolated in this study could use a variety of carboxylic acids. Isolate B110 (from poultry manure compost), which is 97.6% similar to U. terrenus in its 16S rDNA sequence, possesses particularly high activity in utilizing a broad spectrum of substrates. This isolate may have potential applications in industry.
11.
Exploitation of microbial wealth, of which almost 95% or more is still unexplored, is a growing need. The taxonomic placement of a new isolate based on phenotypic characteristics is now being supported by information preserved in the 16S rRNA gene. However, the analysis of 16S rDNA sequences retrieved from metagenomes with the available bioinformatics tools is subject to limitations. In this study, the occurrences of nucleotide features in 16S rDNA sequences have been used to ascertain the taxonomic placement of organisms. The tetra- and penta-nucleotide features were extracted from the training data set of 16S rDNA sequences and were subjected to an artificial neural network (ANN) based tool known as the self-organizing map (SOM), which helped in visualization of the unsupervised classification. For selection of significant features, principal component analysis (PCA) or curvilinear component analysis (CCA) was applied. The SOM along with these techniques could discriminate the sample sequences with more than 90% accuracy, highlighting the relevance of the features. To ascertain the confidence level in the developed classification approach, the test data set was specifically evaluated for Thiobacillus, with Acidiphilium, Paracoccus and Starkeya, which are taxonomically reassigned. The evaluation demonstrated the excellent generalization capability of the developed tool. The topology of genera in the SOM supported the conventional chemo-biochemical classification reported in Bergey's Manual.
12.
Enzyme specificity under dynamic control II: Principal component analysis of α-lytic protease using global and local solvent boundary conditions
Nobuyuki Ota, David A. Agard. Protein science : a publication of the Protein Society, 2001, 10(7):1403-1414
The contributions of conformational dynamics to substrate specificity have been examined by the application of principal component analysis to molecular dynamics trajectories of alpha-lytic protease. The wild-type alpha-lytic protease is highly specific for substrates with small hydrophobic side chains at the specificity pocket, while the Met190→Ala binding pocket mutant has a much broader specificity, actively hydrolyzing substrates ranging from Ala to Phe. Based on a combination of multiconformation analysis of cryo-X-ray crystallographic data, solution nuclear magnetic resonance (NMR), and normal mode calculations, we had hypothesized that the large alteration in specificity of the mutant enzyme is mainly attributable to changes in the dynamic movement of the two walls of the specificity pocket. To test this hypothesis, we performed a principal component analysis on 1-nanosecond molecular dynamics simulations run with either a global or a local solvent boundary condition. The results of this analysis strongly support our hypothesis and verify the results previously obtained by in vacuo normal mode analysis. We found that the walls of the wild-type substrate binding pocket move in tandem with one another, causing the pocket size to remain fixed so that only small substrates are recognized. In contrast, the M190A mutant shows uncoupled movement of the binding pocket walls, allowing the pocket to sample both smaller and larger sizes, which appears to be the cause of the observed broad specificity. The results suggest that the protein dynamics of alpha-lytic protease may play a significant role in defining the patterns of substrate specificity. As shown here, concerted local movements within proteins can be efficiently analyzed through a combination of principal component analysis and molecular dynamics trajectories using a local solvent boundary condition to reduce computational time and matrix size.
13.
Phosphagen kinases constitute a large family of enzymes catalyzing the reversible phosphorylation of guanidino acceptor compounds. These guanidino substrates differ substantially in size and chemical properties. In spite of the appearance of X-ray crystal structures for two members of this family, creatine kinase (CK) and arginine kinase (AK), the structural correlates of substrate specificity remain to be fully elucidated. We have determined the cDNA and deduced amino acid sequences for lombricine (guanidinethylphosphoserine) kinase (LK) from the echiuroid worm Urechis caupo and expressed the cDNA in Escherichia coli. The recombinant protein was purified by affinity chromatography and showed high capacity for phosphorylation of lombricine. Phosphagen kinases consist of a small, N-terminal domain and a much larger domain connected by a linker sequence. A key event in catalysis in CK and AK, and certainly all other phosphagen kinases, is a large conformational change involving a rotation of the two domains and the movement of two highly conserved flexible loops (one located in the small domain; the other located in the large domain of these enzymes) which clamp down on the substrates. Multiple sequence alignments of Urechis LK with the only other LK sequence available and with CK, AK and glycocyamine kinase sequences confirm the importance of the small flexible loop located in the N-terminal domain of phosphagen kinases as one component of the structural determinants of guanidine specificity. The role of the other flexible loop in the large domain in terms of substrate specificity remains questionable.
14.
Evolution of the rat kallikrein gene family: Gene conversion leads to functional diversity (total citations: 2; self-citations: 0; citations by others: 2)
Debora R. Wines, James M. Brady, E. Michelle Southard, Raymond J. MacDonald. Journal of molecular evolution, 1991, 32(6):476-492
Summary: Kallikrein-like simple serine proteases are encoded by closely related members of a gene family in several mammalian species. Molecular cloning and genomic Southern blot analysis after conventional and pulsed-field gel electrophoresis indicate that the rat kallikrein gene family comprises 15–20 members, probably closely linked at a single locus. Determination of the nucleotide sequences of the rGK-3, -4, and -6 genes here completes sequence data for a total of nine rat kallikrein family members. Comparison of the rat gene sequences to each other and to those of human and mouse kallikrein family genes reveals patterns of relatedness indicative of concerted evolution. Analysis of nucleotide sequence variants in kallikrein family members shows that most sequence variants are shared by multiple family members; the patterns of shared variants are complex and indicate multiple short gene conversions between family members. Sequence exchanges between family members generate novel assortments of variants in amino acid coding regions that may affect substrate specificity and thereby contribute to the diversity of enzyme activity. Furthermore, small sequence exchanges also may play a role in generating the diverse patterns of tissue-specific expression of rat family members. These analyses indicate an important role for gene conversion in the evolution of the functional diversity of these duplicated genes.
15.
Analysis of sequence-specific binding of RNA to Hsp70 and its various homologs indicates the involvement of N- and C-terminal interactions.
Members of the 70-kDa family of molecular chaperones assist in a number of molecular interactions that are essential under both normal and stress conditions. These functions require ATP and co-chaperone molecules and are associated with a cyclic transition of intramolecular conformational changes. As a new putative function, we have previously shown that mammalian Hsp/Hsc70 as well as a distant relative, Hsp110, selectively bind certain RNA sequences via their N-terminal ATP-binding domain. To investigate this phenomenon in more detail, here we examined RNA-binding affinity and specificity of various deletion mutants of human Hsp70. We demonstrate that, although the N-terminal ATPase domain alone is sufficient for RNA binding, its binding affinity is considerably reduced when compared to that of the full-length protein. Additionally, we provide evidence that binding of RNA to a membrane-immobilized protein partner results in complete loss of RNA sequence specificity. Using various Hsp70 homologs, we show distinct RNA-binding properties of these proteins judged by sequence specificity, ribopolymer sensitivity, and northwestern analysis. Finally, we present data disclosing that RNA binding by DnaK, the Escherichia coli homolog, is influenced by the activity of its co-chaperones, DnaJ and GrpE. We conclude that the RNA-binding capability of this class of molecular chaperones is a conserved feature and that it is strongly influenced by the structural and conformational properties. Furthermore, the notion that RNA binding of some Hsp70 family members is influenced by co-chaperones suggests an RNA-binding cycle resembling the protein-binding property of the chaperones.
16.
An important aspect of the functional annotation of enzymes is not only the type of reaction catalysed by an enzyme, but also the substrate specificity, which can vary widely within the same family. In many cases, prediction of family membership and even substrate specificity is possible from enzyme sequence alone, using a nearest neighbour classification rule. However, the combination of structural information and sequence information can improve the interpretability and accuracy of predictive models. The method presented here, Active Site Classification (ASC), automatically extracts the residues lining the active site from one representative three-dimensional structure and the corresponding residues from sequences of other members of the family. From a set of representatives with known substrate specificity, a Support Vector Machine (SVM) can then learn a model of substrate specificity. Applied to a sequence of unknown specificity, the SVM can then predict the most likely substrate. The models can also be analysed to reveal the underlying structural reasons determining substrate specificities and thus yield valuable insights into mechanisms of enzyme specificity. We illustrate the high prediction accuracy achieved on two benchmark data sets and the structural insights gained from ASC by a detailed analysis of the family of decarboxylating dehydrogenases. The ASC web service is available at http://asc.informatik.uni-tuebingen.de/.
17.
Can genome analysis tell us about the lifestyle of an organism? We ask this question considering a thorough cross comparison of thermophilic and mesophilic genomes, since presently the number of available genomes is enough to ensure statistical significance of the results. We analyze, by means of principal component analysis (PCA), the codon composition of a database comprising 116 genomes, selected so as to include one species for each genus and show that a cross genomic approach can allow the extraction of common determinants of thermostability at the genome level. The results of our analysis indicate that all the known features of thermostability can be found in the 64 component loadings of the second principal axis of PCA. By this, we develop an index of thermostability whose discriminative power between mesophiles and thermophiles scores with 98% accuracy at the genome level and with 95% accuracy at the protein sequence level. We also prove that these results are not due to phylogenetic differences between archaea and bacteria.
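The 64 components in this abstract correspond to the 64 possible codons. As a rough illustration of how a coding sequence could be turned into such a 64-component composition vector before PCA, consider the sketch below. The function name, the in-frame reading from position 0, and the example sequence are assumptions; the abstract does not describe the authors' exact preprocessing.

```python
from itertools import product

# All 64 codons in a fixed lexicographic order (AAA, AAC, ..., TTT).
CODONS = ["".join(c) for c in product("ACGT", repeat=3)]


def codon_composition(cds):
    """Relative frequency of each of the 64 codons, read in frame from position 0.

    Trailing bases that do not fill a codon, and codons containing characters
    other than A, C, G, T, are skipped.
    """
    cds = cds.upper()
    counts = dict.fromkeys(CODONS, 0)
    for i in range(0, len(cds) - 2, 3):
        codon = cds[i:i + 3]
        if codon in counts:
            counts[codon] += 1
    total = sum(counts.values())
    return [counts[c] / total if total else 0.0 for c in CODONS]


# Hypothetical coding fragment; pooling the counts over all genes of a genome
# would give one 64-component row per genome.
vector = codon_composition("ATGGCCATGTAA")
print(len(vector))
```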
18.
Pharmacological and functional comparison of the polo-like kinase family: insight into inhibitor and substrate specificity (total citations: 1; self-citations: 0; citations by others: 1)
PLK1 (polo-like kinase 1) is a key mitotic kinase and a therapeutic target in the treatment of proliferative diseases. Here we investigate the relative substrate specificity and pharmacological relatedness of PLK1, -2, -3, and -4 that together comprise a conserved family of Ser/Thr kinases (PLK family). We report consensus substrate sequences for PLK2, -3, and -4 and an expanded consensus sequence for PLK1, which we use to design an optimal peptide substrate, PLKtide. We report inhibitory activity for the entire PLK family across a diverse set of small-molecule ATP-competitive inhibitors including several clinical compounds. With respect to both substrate and ATP-site specificity, highest similarity is observed between PLK2 and PLK3, PLK1 is next most similar, and PLK4 is least similar. Further, we have identified and report time-dependent inhibition by two potent and selective PLK inhibitors.
19.
Specificity prediction of adenylation domains in nonribosomal peptide synthetases (NRPS) using transductive support vector machines (TSVMs)
We present a new support vector machine (SVM)-based approach to predict the substrate specificity of subtypes of a given protein sequence family. We demonstrate the usefulness of this method on the example of aryl acid-activating and amino acid-activating adenylation domains (A domains) of nonribosomal peptide synthetases (NRPS). The residues of gramicidin synthetase A that lie within 8 Å of the substrate amino acid, and the corresponding positions of other adenylation domain sequences with 397 known and unknown specificities, were extracted and used to encode this physico-chemical fingerprint into normalized real-valued feature vectors based on the physico-chemical properties of the amino acids. The SVM software package SVM(light) was used for training and classification, with transductive SVMs to take advantage of the information inherent in unlabeled data. Specificities for very similar substrates that frequently show cross-specificities were pooled to the so-called composite specificities and predictive models were built for them. The reliability of the models was confirmed in cross-validations and in comparison with a currently used sequence-comparison-based method. When comparing the predictions for 1230 NRPS A domains that are currently detectable in UniProt, the new method was able to give a specificity prediction in an additional 18% of the cases compared with the old method. For 70% of the sequences both methods agreed, for <6% they did not, mainly on low-confidence predictions by the existing method. None of the predictive methods could infer any specificity for 2.4% of the sequences, suggesting completely new types of specificity.
20.
Phylogenetic analysis of the UDP-glycosyltransferase multigene family of Arabidopsis thaliana (total citations: 2; self-citations: 0; citations by others: 2)
A class of UDP-glycosyltransferases (UGTs) defined by the presence of a C-terminal consensus sequence is found throughout the plant and animal kingdoms. Whereas mammalian enzymes use UDP-glucuronic acid, the plant enzymes typically use UDP-glucose in the transfer reactions. A diverse array of aglycones can be glucosylated by these UGTs. In plants, the aglycones include plant hormones, secondary metabolites involved in stress and defense responses, and xenobiotics such as herbicides. Glycosylation is known to regulate many properties of the aglycones such as their bioactivity, their solubility, and their transport properties within the cell and throughout the plant. As a means of providing a framework to start to understand the substrate specificities and structure-function relationships of plant UGTs, we have now applied a molecular phylogenetic analysis to the multigene family of 99 UGT sequences in Arabidopsis. We have determined the overall organization and evolutionary relationships among individual members with a surprisingly high degree of confidence. Through constructing a composite phylogenetic tree that also includes all of the additional plant UGTs with known catalytic activities, we can start to predict both the evolutionary history and substrate specificities of new sequences as they are identified. The tree already suggests that while the activities of some subgroups of the UGT family are highly conserved among different plant species, other subgroups shift substrate specificity with relative ease.