1.

Classification of protein families and detection of the determinant residues with an improved self-organizing map

Miguel A. Andrade Georg Casari Chris Sander Alfonso Valencia 《Biological cybernetics》1997,76(6):441-450

Using a SOM (self-organizing map) we can classify sequences within a protein family into subgroups that generally correspond to biological subcategories. These maps tend to show sequence similarity as proximity in the map. Combining maps generated at different levels of resolution, the structure of relations in protein families can be captured that could not otherwise be represented in a single map. The underlying representation of maps enables us to retrieve characteristic sequence patterns for individual subgroups of sequences. Such patterns tend to correspond to functionally important regions. We present a modified SOM algorithm that includes a convergence test that dynamically controls the learning parameters to adapt them to the learning set instead of being fixed and externally optimized by trial and error. Given the variability of protein family size and distribution, the addition of this feature is necessary. The method is successfully tested with a number of families. The rab family of small GTPases is used to illustrate the performance of the method. Received: 25 July 1996 / Accepted in revised form: 13 February 1997
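
The idea of letting a convergence test drive the learning parameters can be sketched as follows. This is a minimal illustration and not the authors' algorithm: the one-dimensional map, the quantization-error criterion, the tolerance `tol` and the rule of halving the rate while shrinking the radius are all assumptions made for the example.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def quantization_error(data, units):
    """Mean squared distance from each sample to its best-matching unit."""
    return sum(min(sq_dist(x, u) for u in units) for x in data) / len(data)


def train_som(data, n_units=4, rate=0.5, radius=1, tol=1e-3,
              max_epochs=200, seed=0):
    """Train a 1-D SOM; tighten the parameters only after the error settles."""
    rng = random.Random(seed)
    dim = len(data[0])
    units = [[rng.random() for _ in range(dim)] for _ in range(n_units)]
    prev = quantization_error(data, units)
    err = prev
    for _ in range(max_epochs):
        for x in rng.sample(data, len(data)):  # one shuffled pass over the data
            bmu = min(range(n_units), key=lambda i: sq_dist(x, units[i]))
            for i in range(n_units):
                if abs(i - bmu) <= radius:  # unit lies in the neighbourhood
                    units[i] = [u + rate * (xi - u)
                                for u, xi in zip(units[i], x)]
        err = quantization_error(data, units)
        if abs(prev - err) < tol:  # assumed test: error has stopped changing
            if radius == 0:
                break  # already at the finest resolution, so stop
            radius -= 1  # otherwise move to a finer resolution
            rate *= 0.5
        prev = err
    return units, err
```

The point of the design is that the schedule adapts to the learning set: a small or tightly clustered family reaches the tolerance quickly and moves on, while a larger or more diverse one stays longer at each resolution.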

2.

Recognition of multiple patterns in unaligned sets of sequences: comparison of kernel clustering method with other methods

Kel A Tikunov Y Voss N Wingender E 《Bioinformatics (Oxford, England)》2004,20(10):1512-1516

3.

Motif recognition and alignment for many sequences by comparison of dot-matrices (Cited by: 4; self-citations: 0; citations by others: 4)

M Vingron P Argos 《Journal of molecular biology》1991,218(1):33-43

Calculation of dot-matrices is a widespread tool in the search for sequence similarities. When sequences are distant, even this approach may fail to point out common regions. If several plots calculated for all members of a sequence set consistently displayed a similarity between them, this would increase its credibility. We present an algorithm to delineate dot-plot agreement. A novel procedure based on matrix multiplication is developed to identify common patterns and reliably aligned regions in a set of distantly related sequences. The algorithm finds motifs independent of input sequence lengths and reduces the dependence on gap penalties. When sequences share greater similarity, the same approach converts to a multiple sequence alignment procedure.
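
How matrix multiplication can expose dot-plot agreement is easiest to see on a toy case. The sketch below is a simplification under stated assumptions: it scores identity only (1 or 0), whereas real dot-matrices use windowed similarity scores, and the function names are hypothetical rather than taken from the paper.

```python
def dot_matrix(a, b):
    """Identity dot-matrix: entry [i][j] is 1 when a[i] equals b[j]."""
    return [[1 if x == y else 0 for y in b] for x in a]


def mat_mul(m, n):
    """Plain matrix product of two lists of rows."""
    cols = list(zip(*n))
    return [[sum(p * q for p, q in zip(row, col)) for col in cols]
            for row in m]


def supported_matches(a, b, c):
    """Keep a direct a-b match only if a third sequence c agrees with it.

    Entry [i][j] of the product dot(a, c) x dot(c, b) counts the positions
    k in c that match both a[i] and b[j].
    """
    direct = dot_matrix(a, b)
    via_c = mat_mul(dot_matrix(a, c), dot_matrix(c, b))
    return [[d * v for d, v in zip(d_row, v_row)]
            for d_row, v_row in zip(direct, via_c)]
```

For example, `supported_matches("ACD", "ACE", "AC")` keeps the two diagonal matches A-A and C-C, because the third sequence contains both residues, while a third sequence such as `"GG"` supports none of them.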

4.

FuzzyART neural network for protein classification

Angadi UB Venkatesulu M 《Journal of bioinformatics and computational biology》2010,8(5):825-841

One of the major research directions in bioinformatics is that of predicting the protein superfamily in large databases and classifying a given set of protein domains into superfamilies. The classification reflects the structural, evolutionary and functional relatedness. These relationships are embodied in hierarchical classification such as Structural Classification of Proteins (SCOP), which is manually curated. Such classification is essential for the structural and functional analysis of proteins. Yet, a large number of proteins remain unclassified. We have proposed an unsupervised machine-learning FuzzyART neural network algorithm to classify a given set of proteins into SCOP superfamilies. The proposed method is fast learning and uses an atypical non-linear pattern recognition technique. In this approach, we have constructed a similarity matrix from p-values of BLAST all-against-all, trained the network with FuzzyART unsupervised learning algorithm using the similarity matrix as input vectors and finally the trained network offers SCOP superfamily level classification. In this experiment, we have evaluated the performance of our method with existing techniques on six different datasets. We have shown that the trained network is able to classify a given similarity matrix of a set of sequences into SCOP superfamilies at high classification accuracy.

5.

Emergence of invariant-feature detectors in the adaptive-subspace self-organizing map

Teuvo Kohonen 《Biological cybernetics》1996,75(4):281-291

A new self-organizing map (SOM) architecture called the ASSOM (adaptive-subspace SOM) is shown to create sets of translation-invariant filters when randomly displaced or moving input patterns are used as training data. No analytical functional forms for these filters are thereby postulated. Different kinds of filters are formed by the ASSOM when pictures are rotated during learning, or when they are zoomed. The ASSOM can thus act as a learning feature-extraction stage for pattern recognizers, being able to adapt to many sensory environments and to many different transformation groups of patterns. Received: 14 September 1995 / Accepted in revised form: 8 May 1996

6.

K Sumi T Nishioka J Oda 《Protein engineering》1991,4(4):413-420

We developed a new method which searches sequence segments responsible for the recognition of a given chemical structure. These segments are detected as those locally conserved among a sequence to be analyzed (target sequence) and a set of sequences (reference sequences). Reference sequences are the sequences of functionally related proteins, ligands of which contain a common chemical substructure in their molecular structures. 'Similarity graphing' cuts target sequences into segments, aligns them with reference sequence pairwise, calculates the degree of similarity for each alignment, and shows graphically cumulative similarity values on target sequence. Any locally conserved regions, short or long in length and weak or strong in similarity, are detected at their optimal conditions by adjusting three parameters. The 'enzyme-reaction database' contains chemical structures and their related enzymes. When a chemical substructure is input into the database, sequences of the enzymes related to the input substructure are systematically searched from the NBRF sequence database and output as reference sequences. Examples of analysis using similarity graphing in combination with the enzyme-reaction database showed a great potentiality in the systematic analysis of the relationships between sequences and molecular recognitions for protein engineering.

7.

Prediction of Gene Expression Specificity by Promoter Sequence Patterns (Cited by: 1; self-citations: 0; citations by others: 1)

Fujibuchi Wataru; Kanehisa Minoru 《DNA research》1997,4(2):81-90

8.

Detecting homology of distantly related proteins with consensus sequences (Cited by: 15; self-citations: 0; citations by others: 15)

L Patthy 《Journal of molecular biology》1987,198(4):567-577

A simple protocol is described that is suitable for the detection of distantly related members of a protein family. In this procedure, similarity to a consensus sequence is used to distinguish chance similarity from similarity due to common ancestry. The consensus sequence is constructed from the sequences of established members of a protein family and it incorporates features characteristic of the protein fold of this family: conserved residues, the pattern of variable and conserved segments, preferred location of gaps etc. The database is searched with the consensus sequence, using the unitary matrix or log odds matrix for scoring the alignments, with variable gap penalty. The advantage of the method is that it weights key residues, ignores sequence similarity in variable segments (thus partially eliminating "background noise" coming from chance similarity), distinguishes gaps disrupting conserved segments from those occurring in positions known to be tolerant of gap events. The utility of the method was demonstrated in the case of the protein family homologous with the internal repeats of complement B as well as the internal repeats identified in fibroblast proteoglycan PG40. The consensus sequence method succeeded in finding some new members of these protein families that could not be detected by earlier methods of sequence comparison.

9.

Sequence-based classification using discriminatory motif feature selection

Xiong H Capurso D Sen S Segal MR 《PloS one》2011,6(11):e27382

Most existing methods for sequence-based classification use exhaustive feature generation, employing, for example, all k-mer patterns. The motivation behind such (enumerative) approaches is to minimize the potential for overlooking important features. However, there are shortcomings to this strategy. First, practical constraints limit the scope of exhaustive feature generation to patterns of length ≤ k, such that potentially important, longer (> k) predictors are not considered. Second, features so generated exhibit strong dependencies, which can complicate understanding of derived classification rules. Third, and most importantly, numerous irrelevant features are created. These concerns can compromise prediction and interpretation. While remedies have been proposed, they tend to be problem-specific and not broadly applicable. Here, we develop a generally applicable methodology, and an attendant software pipeline, that is predicated on discriminatory motif finding. In addition to the traditional training and validation partitions, our framework entails a third level of data partitioning, a discovery partition. A discriminatory motif finder is used on sequences and associated class labels in the discovery partition to yield a (small) set of features. These features are then used as inputs to a classifier in the training partition. Finally, performance assessment occurs on the validation partition. Important attributes of our approach are its modularity (any discriminatory motif finder and any classifier can be deployed) and its universality (all data, including sequences that are unaligned and/or of unequal length, can be accommodated). We illustrate our approach on two nucleosome occupancy datasets and a protein solubility dataset, previously analyzed using enumerative feature generation. Our method achieves excellent performance results, with and without optimization of classifier tuning parameters. 
A Python pipeline implementing the approach is available at http://www.epibiostat.ucsf.edu/biostat/sen/dmfs/.
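
For contrast, the enumerative k-mer feature generation that the abstract argues against can be sketched in a few lines. This is an illustration only and is not taken from the authors' pipeline; the function names are hypothetical.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping substring of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def n_patterns_up_to(k, alphabet_size=4):
    """Number of distinct patterns of length 1..k over the alphabet."""
    return sum(alphabet_size ** i for i in range(1, k + 1))
```

`kmer_counts("ACGTAC", 2)` counts `AC` twice and `CG`, `GT` and `TA` once each. The feature space grows geometrically with k (84 DNA patterns for k up to 3, 5460 for k up to 6), which is the practical constraint that limits exhaustive generation to short patterns.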

10.

An unsupervised automatic method for sorting neuronal spike waveforms in awake and freely moving animals (Cited by: 1; self-citations: 0; citations by others: 1)

Aksenova TI Chibirova OK Dryga OA Tetko IV Benabid AL Villa AE 《Methods (San Diego, Calif.)》2003,30(2):178-187

The present study introduces an approach to automatic classification of extracellularly recorded action potentials of neurons. The classification of spike waveform is considered a pattern recognition problem of special segments of signal that correspond to the appearance of spikes. The spikes generated by one neuron should be recognized as members of the same class. The spike waveforms are described by the nonlinear oscillating model as an ordinary differential equation with perturbation, thus characterizing the signal distortions in both amplitude and phase. It is shown that the use of local variables reduces the problem of spike recognition to the separation of a mixture of normal distributions in the transformed feature space. We have developed an unsupervised iteration-learning algorithm that estimates the number of classes and their centers according to the distance between spike trajectories in phase space. This algorithm scans the learning set to evaluate spike trajectories with maximal probability density in their neighborhood. Following the learning, the procedure of minimal distance is used to perform spike recognition. Estimation of trajectories in phase space requires calculation of the first- and second-order derivatives, and integral operators with piecewise polynomial kernels were used. This provided the computational efficiency of the developed approach for real-time application as required by recordings in behaving animals and in human neurosurgical operations. The new method of spike sorting was tested on simulated and real data and performed better than other approaches currently used in neurophysiology.

11.

Identification of consensus patterns in unaligned DNA sequences known to be functionally related (Cited by: 16; self-citations: 0; citations by others: 16)

Hertz Gerald Z.; Hartzell George W. III; Stormo Gary D. 《Bioinformatics (Oxford, England)》1990,6(2):81-92

We have developed a method for identifying consensus patterns in a set of unaligned DNA sequences known to bind a common protein or to have some other common biochemical function. The method is based on a matrix representation of binding site patterns. Each row of the matrix represents one of the four possible bases, each column represents one of the positions of the binding site and each element is determined by the frequency with which the indicated base occurs at the indicated position. The goal of the method is to find the most significant matrix (i.e. the one with the lowest probability of occurring by chance) out of all the matrices that can be formed from the set of related sequences. The reliability of the method improves with the number of sequences, while the time required increases only linearly with the number of sequences. To test this method, we analysed 11 DNA sequences containing promoters regulated by the Escherichia coli LexA protein. The matrices we found were consistent with the known consensus sequence, and could distinguish the generally accepted LexA binding sites from other DNA sequences. Received on November 6, 1989; accepted on December 20, 1989
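
The matrix representation described here (one row per base, one column per site position, entries given by base frequency) can be written out directly. The sketch assumes the candidate sites have already been aligned, which sidesteps the search over unaligned sequences that is the paper's actual contribution, and it uses information content against a uniform background as a stand-in for the paper's significance measure. Both are assumptions made for illustration.

```python
import math

BASES = "ACGT"


def frequency_matrix(sites):
    """Rows are bases, columns are site positions, entries are frequencies."""
    length = len(sites[0])
    return {b: [sum(s[i] == b for s in sites) / len(sites)
                for i in range(length)]
            for b in BASES}


def information_content(matrix, background=0.25):
    """Total information in bits, summed over all columns (assumed score)."""
    total = 0.0
    for i in range(len(matrix["A"])):
        for b in BASES:
            f = matrix[b][i]
            if f > 0:  # a zero frequency contributes nothing to the sum
                total += f * math.log2(f / background)
    return total
```

A perfectly conserved column contributes 2 bits and a column with uniform base usage contributes 0, so a higher total corresponds to a pattern that is less likely to arise by chance.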

12.

Self-organized neural maps of human protein sequences.

E. A. Ferrán B. Pflugfelder P. Ferrara 《Protein science : a publication of the Protein Society》1994,3(3):507-521

We have recently described a method based on artificial neural networks to cluster protein sequences into families. The network was trained with Kohonen's unsupervised learning algorithm using, as inputs, the matrix patterns derived from the dipeptide composition of the proteins. We present here a large-scale application of that method to classify the 1,758 human protein sequences stored in the SwissProt database (release 19.0), whose lengths are greater than 50 amino acids. In the final 2-dimensional topologically ordered map of 15 x 15 neurons, proteins belonging to known families were associated with the same neuron or with neighboring ones. Also, as an attempt to reduce the time-consuming learning procedure, we compared 2 learning protocols: one of 500 epochs (100 SUN CPU-hours [CPU-h]), and another one of 30 epochs (6.7 CPU-h). A further reduction of learning-computing time, by a factor of about 3.3, with similar protein clustering results, was achieved using a matrix of 11 x 11 components to represent the sequences. Although network training is time consuming, the classification of a new protein in the final ordered map is very fast (14.6 CPU-seconds). We also show a comparison between the artificial neural network approach and conventional methods of biosequence analysis.
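
The dipeptide-composition input can be sketched as a 20 x 20 matrix of adjacent-residue frequencies. The normalisation to fractions and the skipping of non-standard residues are assumptions for this example, and the reduced 11 x 11 representation mentioned in the abstract is not reproduced here.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def dipeptide_matrix(seq):
    """20 x 20 matrix whose entry [i][j] is the fraction of adjacent pairs
    made of residue i followed by residue j."""
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    counts = [[0.0] * 20 for _ in range(20)]
    total = 0
    for a, b in zip(seq, seq[1:]):
        if a in index and b in index:  # skip pairs with non-standard residues
            counts[index[a]][index[b]] += 1
            total += 1
    if total:
        counts = [[c / total for c in row] for row in counts]
    return counts
```

Every protein maps to a matrix of the same size whatever its length, which is what allows sequences of different lengths to be fed to a network with a fixed input layer.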

13.

EFICAz: a comprehensive approach for accurate genome-scale enzyme function inference (Cited by: 1; self-citations: 0; citations by others: 1)

Tian W Arakaki AK Skolnick J 《Nucleic acids research》2004,32(21):6226-6239

EFICAz (Enzyme Function Inference by Combined Approach) is an automatic engine for large-scale enzyme function inference that combines predictions from four different methods developed and optimized to achieve high prediction accuracy: (i) recognition of functionally discriminating residues (FDRs) in enzyme families obtained by a Conservation-controlled HMM Iterative procedure for Enzyme Family classification (CHIEFc), (ii) pairwise sequence comparison using a family specific Sequence Identity Threshold, (iii) recognition of FDRs in Multiple Pfam enzyme families, and (iv) recognition of multiple Prosite patterns of high specificity. For FDR (i.e. conserved positions in an enzyme family that discriminate between true and false members of the family) identification, we have developed an Evolutionary Footprinting method that uses evolutionary information from homofunctional and heterofunctional multiple sequence alignments associated with an enzyme family. The FDRs show a significant correlation with annotated active site residues. In a jackknife test, EFICAz shows high accuracy (92%) and sensitivity (82%) for predicting four EC digits in testing sequences that are <40% identical to any member of the corresponding training set. Applied to Escherichia coli genome, EFICAz assigns more detailed enzymatic function than KEGG, and generates numerous novel predictions.

14.

Classification of G-protein coupled receptors by alignment-independent extraction of principal chemical properties of primary amino acid sequences

Lapinsh M Gutcaits A Prusis P Post C Lundstedt T Wikberg JE 《Protein science : a publication of the Protein Society》2002,11(4):795-805

We have developed an alignment-independent method for classification of G-protein coupled receptors (GPCRs) according to the principal chemical properties of their amino acid sequences. The method relies on a multivariate approach where the primary amino acid sequences are translated into vectors based on the principal physicochemical properties of the amino acids and transformation of the data into a uniform matrix by applying a modified autocross-covariance transform. The application of principal component analysis to a data set of 929 class A GPCRs showed a clear separation of the major classes of GPCRs. The application of partial least squares projection to latent structures created a highly valid model (cross-validated correlation coefficient, Q(2) = 0.895) that gave unambiguous classification of the GPCRs in the training set according to their ligand binding class. The model was further validated by external prediction of 535 novel GPCRs not included in the training set. Of the latter, only 14 sequences, confined in rapidly expanding GPCR classes, were mispredicted. Moreover, 90 orphan GPCRs out of 165 were tentatively identified to GPCR ligand binding class. The alignment-independent method could be used to assess the importance of the principal chemical properties of every single amino acid in the protein sequences for their contributions in explaining GPCR family membership. It was then revealed that all amino acids in the unaligned sequences contributed to the classifications, albeit to varying extent; the most important amino acids being those that could also be determined to be conserved by using traditional alignment-based methods.
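
An auto-cross-covariance transform turns per-residue property tracks of any length into a vector of fixed length. The sketch uses the standard form (products of property values a fixed lag apart, averaged over the sequence); the modification applied in the paper is not described in the abstract and is not reproduced, and the input layout `z[j][i]` (property j at residue i) is an assumption.

```python
def auto_cross_covariance(z, max_lag):
    """Flatten property tracks into max_lag * P * P lagged-product terms.

    z[j][i] is the value of property j at residue i; every track must be
    longer than max_lag.
    """
    n = len(z[0])
    features = []
    for lag in range(1, max_lag + 1):
        for j in range(len(z)):
            for k in range(len(z)):
                s = sum(z[j][i] * z[k][i + lag] for i in range(n - lag))
                features.append(s / (n - lag))  # average over available pairs
    return features
```

With P properties and lags 1 to L the output always has L * P * P terms, so receptors of unequal length become rows of one uniform matrix that can be passed to principal component analysis or partial least squares without any alignment.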

15.

Estimating the number of clusters in multivariate data by self-organizing maps

Costa JA Netto ML 《International journal of neural systems》1999,9(3):195-202

Determining the structure of data without prior knowledge of the number of clusters or any information about their composition is a problem of interest in many fields, such as image analysis, astrophysics, biology, etc. Partitioning a set of n patterns in a p-dimensional feature space must be done such that those in a given cluster are more similar to each other than to the rest. As there are approximately K^n/K! possible ways of partitioning the patterns among K clusters, finding the best solution is very hard when n is large. The search space is increased when we have no a priori number of partitions. Although the self-organizing feature map (SOM) can be used to visualize clusters, the automation of knowledge discovery by SOM is a difficult task. This paper proposes region-based image processing methods to post-process the U-matrix obtained after the unsupervised learning performed by SOM. Mathematical morphology is applied to identify regions of neurons that are similar. The number of regions and their labels are automatically found and they are related to the number of clusters in a multivariate data set. New data can be classified by labeling it according to the best match neuron. Simulations using data sets drawn from finite mixtures of p-variate normal densities are presented as well as related advantages and drawbacks of the method.
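
The size of the search space can be checked numerically. The exact number of ways to split n labelled patterns into K non-empty clusters is the Stirling number of the second kind, for which K^n/K! is the leading-term approximation quoted in the abstract. The function names below are chosen for this illustration.

```python
from math import factorial


def stirling2(n, k):
    """Number of ways to split n labelled items into k non-empty clusters."""
    table = [[0] * (k + 1) for _ in range(n + 1)]
    table[0][0] = 1
    for i in range(1, n + 1):
        for j in range(1, min(i, k) + 1):
            # item i either joins one of j existing clusters or starts a new one
            table[i][j] = j * table[i - 1][j] + table[i - 1][j - 1]
    return table[n][k]


def approx_partitions(n, k):
    """Leading-term approximation K^n / K!."""
    return k ** n / factorial(k)
```

For 10 patterns and 3 clusters the exact count is 9330 against an approximation of 9841.5, and both grow exponentially in n, which is why exhaustive search is impractical for large data sets.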

16.

A comprehensive vertebrate phylogeny using vector representations of protein sequences from whole genomes (Cited by: 4; self-citations: 0; citations by others: 4)

Stuart GW Moffett K Leader JJ 《Molecular biology and evolution》2002,19(4):554-562

We recently developed a method for producing comprehensive gene and species phylogenies from unaligned whole genome data using singular value decomposition (SVD) to analyze character string frequencies. This work provides an integrated gene and species phylogeny for 64 vertebrate mitochondrial genomes composed of 832 total proteins. In addition, to provide a theoretical basis for the method, we present a graphical interpretation of both the original frequency matrix and the SVD-derived matrix. These large matrices describe high-dimensional Euclidean spaces within which biomolecular sequences can be uniquely represented as vectors. In particular, the SVD-derived vector space describes each protein relative to a restricted set of newly defined, independent axes, each of which represents a novel form of conserved motif, termed a correlated peptide motif. A quantitative comparison of the relative orientations of protein vectors in this space provides accurate and straightforward estimates of sequence similarity, which can in turn be used to produce comprehensive gene trees. Alternatively, the vector representations of genes from individual species can be summed, allowing species trees to be produced.

17.

Regularized common spatial patterns with subject-to-subject transfer of EEG signals

Minmin Cheng Zuhong Lu Haixian Wang 《Cognitive neurodynamics》2017,11(2):173-181

In the context of brain-computer interface (BCI) system, the common spatial patterns (CSP) method has been used to extract discriminative spatial filters for the classification of electroencephalogram (EEG) signals. However, the classification performance of CSP typically deteriorates when a few training samples are collected from a new BCI user. In this paper, we propose an approach that maintains or improves the recognition accuracy of the system with only a small size of training data set. The proposed approach is formulated by regularizing the classical CSP technique with the strategy of transfer learning. Specifically, we incorporate into the CSP analysis inter-subject information involving the same task, by minimizing the difference between the inter-subject features. Experimental results on two data sets from BCI competitions show that the proposed approach greatly improves the classification performance over that of the conventional CSP method; the transformed variant proved to be successful in almost every case, based on a small number of available training samples.

18.

Finding flexible patterns in unaligned protein sequences.

I. Jonassen J. F. Collins D. G. Higgins 《Protein science : a publication of the Protein Society》1995,4(8):1587-1595

We present a new method for the identification of conserved patterns in a set of unaligned related protein sequences. It is able to discover patterns of a quite general form, allowing for both ambiguous positions and for variable length wildcard regions. It allows the user to define a class of patterns (e.g., the degree of ambiguity allowed and the length and number of gaps), and the method is then guaranteed to find the conserved patterns in this class scoring highest according to a significance measure defined. Identified patterns may be refined using one of two new algorithms. We present a new (nonstatistical) significance measure for flexible patterns. The method is shown to recover known motifs for PROSITE families and is also applied to some recently described families from the literature.
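
Patterns with ambiguous positions and variable-length wildcard regions are conventionally written in PROSITE notation, and such a pattern can be matched against a sequence by translating it to a regular expression. The sketch covers only a subset of the notation (literals, bracketed sets, `x`, `x(n)` and `x(m,n)`) and shows how a pattern is matched, not how the method discovers it.

```python
import re


def prosite_to_regex(pattern):
    """Translate a simplified PROSITE-style pattern into a regex string."""
    parts = []
    for element in pattern.split("-"):
        if element == "x":
            parts.append(".")  # single wildcard position
        elif element.startswith("x(") and element.endswith(")"):
            parts.append(".{" + element[2:-1] + "}")  # fixed or variable gap
        else:
            parts.append(element)  # literal residue or [ambiguous set]
    return "".join(parts)


# Hypothetical pattern and sequence, used only to demonstrate matching
regex = prosite_to_regex("C-x(2,4)-C-x(3)-[LIV]")
match = re.search(regex, "AACAAACDEFLG")
```

Here `regex` is `C.{2,4}C.{3}[LIV]` and the match covers the substring `CAAACDEFL`: the variable wildcard region absorbs three residues and the final ambiguous position is filled by L.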

19.

Automated search of natively folded protein fragments for high-throughput structure determination in structural genomics

Kuroda Y Tani K Matsuo Y Yokoyama S 《Protein science : a publication of the Protein Society》2000,9(12):2313-2321

Structural genomic projects envision almost routine protein structure determinations, which are currently imaginable only for small proteins with molecular weights below 25,000 Da. For larger proteins, structural insight can be obtained by breaking them into small segments of amino acid sequences that can fold into native structures, even when isolated from the rest of the protein. Such segments are autonomously folding units (AFU) and have sizes suitable for fast structural analyses. Here, we propose to expand an intuitive procedure often employed for identifying biologically important domains to an automatic method for detecting putative folded protein fragments. The procedure is based on the recognition that large proteins can be regarded as a combination of independent domains conserved among diverse organisms. We thus have developed a program that reorganizes the output of BLAST searches and detects regions with a large number of similar sequences. To automate the detection process, it is reduced to a simple geometrical problem of recognizing rectangular shaped elevations in a graph that plots the number of similar sequences at each residue of a query sequence. We used our program to quantitatively corroborate the premise that segments with conserved sequences correspond to domains that fold into native structures. We applied our program to a test data set composed of 99 amino acid sequences containing 150 segments with structures listed in the Protein Data Bank, and thus known to fold into native structures. Overall, the fragments identified by our program have an almost 50% probability of forming a native structure, and comparable results are observed with sequences containing domain linkers classified in SCOP. Furthermore, we verified that our program identifies AFU in libraries from various organisms, and we found a significant number of AFU candidates for structural analysis, covering an estimated 5 to 20% of the genomic databases. 
Altogether, these results argue that methods based on sequence similarity can be useful for dissecting large proteins into small autonomously folding domains, and such methods may provide an efficient support to structural genomics projects.
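
The geometric step described in the abstract, recognising rectangular elevations in a plot of similar-sequence counts per residue, can be approximated with a simple run-length scan. The fixed height threshold and minimum length are assumptions for this illustration, and the actual program's criteria may differ.

```python
def find_elevations(counts, min_height, min_length):
    """Return (start, end) index pairs, end exclusive, for every run of at
    least min_length residues whose count is at least min_height."""
    segments = []
    start = None
    for i, c in enumerate(counts):
        if c >= min_height:
            if start is None:
                start = i  # a new elevation begins here
        elif start is not None:
            if i - start >= min_length:
                segments.append((start, i))
            start = None
    # close an elevation that runs to the end of the sequence
    if start is not None and len(counts) - start >= min_length:
        segments.append((start, len(counts)))
    return segments
```

With the counts `[0, 0, 5, 6, 5, 5, 0, 1, 7, 7, 0]`, a height of 4 and a minimum length of 3, only residues 2 to 5 are reported as a candidate fragment. The two-residue peak is discarded as too short to be a folding unit.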

20.

Prediction of common secondary structures of RNAs: a genetic algorithm approach (Cited by: 10; self-citations: 3; citations by others: 7)

Chen JH Le SY Maizel JV 《Nucleic acids research》2000,28(4):991-999

In this study we apply a genetic algorithm to a set of RNA sequences to find common RNA secondary structures. Our method is a three-step procedure. At the first stage of the procedure for each sequence, a genetic algorithm is used to optimize the structures in a population to a certain degree of stability. In this step, the free energy of a structure is the fitness criterion for the algorithm. Next, for each structure, we define a measure of structural conservation with respect to those in other sequences. We use this measure in a genetic algorithm to improve the structural similarity among sequences for the structures in the population of a sequence. Finally, we select those structures satisfying certain conditions of structural stability and similarity as predicted common structures for a set of RNA sequences. We have obtained satisfactory results from a set of tRNA, 5S rRNA, rev response elements (RRE) of HIV-1 and RRE of HIV-2/SIV, respectively.