Found 20 similar articles (search time: 0 ms)
1.
Clustering with neural networks   Cited by: 3 (self-citations: 0; cited by others: 3)
Behzad Kamgar-Parsi, J. A. Gualtieri, J. E. Devaney, Behrooz Kamgar-Parsi. Biological Cybernetics, 1990, 63(3): 201-208
Partitioning a set of N patterns in a d-dimensional metric space into K clusters, in a way that those in a given cluster are more similar to each other than the rest, is a problem of interest in many fields, such as image analysis, taxonomy, and astrophysics. As there are approximately K^N/K! possible ways of partitioning the patterns among K clusters, finding the best solution is beyond exhaustive search when N is large. We show that this problem, in spite of its exponential complexity, can be formulated as an optimization problem for which very good, but not necessarily optimal, solutions can be found by using a Hopfield model of neural networks. To obtain a very good solution, the network must start from many randomly selected initial states. The network is simulated on the MPP, a 128 × 128 SIMD array machine, where we use the massive parallelism not only in solving the differential equations that govern the evolution of the network, but also in starting the network from many initial states at once, thus obtaining many solutions in one run. We achieve speedups of two to three orders of magnitude over serial implementations, and Analog VLSI implementations promise further speedups of three to six orders of magnitude. Supported by a National Research Council-NASA Research Associateship.
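To make the size of the search space concrete, the following minimal sketch compares the exact number of ways to split N labelled patterns into K non-empty clusters (a Stirling number of the second kind) with the approximation K^N/K! quoted in the abstract. Reading the abstract's count as K^N/K! is an assumption, and the function names `stirling2` and `approx_partitions` are illustrative only. The sketch does not reproduce the Hopfield network itself.

```python
from math import comb, factorial


def stirling2(n: int, k: int) -> int:
    """Exact number of ways to split n labelled items into k non-empty clusters."""
    # Inclusion-exclusion: S(n, k) = (1/k!) * sum_j (-1)^j * C(k, j) * (k - j)^n
    total = sum((-1) ** j * comb(k, j) * (k - j) ** n for j in range(k + 1))
    return total // factorial(k)


def approx_partitions(n: int, k: int) -> float:
    """Large-n approximation K^N / K! (assumed reading of the abstract)."""
    return k ** n / factorial(k)


if __name__ == "__main__":
    for n, k in [(4, 2), (20, 3), (50, 5)]:
        print(n, k, stirling2(n, k), approx_partitions(n, k))
```

The approximation becomes tight quickly as N grows, and both counts grow exponentially in N, which is why exhaustive search is ruled out.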
2.
3.
Background
The rapid burgeoning of available protein data makes the use of clustering within families of proteins increasingly important. The challenge is to identify subfamilies of evolutionarily related sequences. This identification reveals phylogenetic relationships, which provide prior knowledge to help researchers understand biological phenomena. A good evolutionary model is essential to achieve a clustering that reflects the biological reality, and an accurate estimate of protein sequence similarity is crucial to the building of such a model. Most existing algorithms estimate this similarity using techniques that are not necessarily biologically plausible, especially for hard-to-align sequences such as proteins with different domain structures, which cause many difficulties for the alignment-dependent algorithms. In this paper, we propose a novel similarity measure based on matching amino acid subsequences. This measure, named SMS for Substitution Matching Similarity, is especially designed for application to non-aligned protein sequences. It allows us to develop a new alignment-free algorithm, named CLUSS, for clustering protein families. To the best of our knowledge, this is the first alignment-free algorithm for clustering protein sequences. Unlike other clustering algorithms, CLUSS is effective on both alignable and non-alignable protein families. In the rest of the paper, we use the term "phylogenetic" in the sense of "relatedness of biological functions".
4.
O'Sullivan O, Suhre K, Abergel C, Higgins DG, Notredame C. Journal of Molecular Biology, 2004, 340(2): 385-395
Most bioinformatics analyses require the assembly of a multiple sequence alignment. It has long been suspected that structural information can help to improve the quality of these alignments, yet the effect of combining sequences and structures has not been evaluated systematically. We developed 3DCoffee, a novel method for combining protein sequences and structures in order to generate high-quality multiple sequence alignments. 3DCoffee is based on TCoffee version 2.00, and uses a mixture of pairwise sequence alignments and pairwise structure comparison methods to generate multiple sequence alignments. We benchmarked 3DCoffee using a subset of HOMSTRAD, the collection of reference structural alignments. We found that combining TCoffee with the threading program Fugue makes it possible to improve the accuracy of our HOMSTRAD dataset by four percentage points when using one structure only per dataset. Using two structures yields an improvement of ten percentage points. The measures carried out on HOM39, a HOMSTRAD subset composed of distantly related sequences, show a linear correlation between multiple sequence alignment accuracy and the ratio of the number of provided structures to the total number of sequences. Our results suggest that in the case of distantly related sequences, a single structure may not be enough for computing an accurate multiple sequence alignment.
5.
6.
7.
The increasing number and diversity of protein sequence families requires new methods to define and predict details regarding function. Here, we present a method for analysis and prediction of functional sub-types from multiple protein sequence alignments. Given an alignment and set of proteins grouped into sub-types according to some definition of function, such as enzymatic specificity, the method identifies positions that are indicative of functional differences by comparison of sub-type specific sequence profiles, and analysis of positional entropy in the alignment. Alignment positions with significantly high positional relative entropy correlate with those known to be involved in defining sub-types for nucleotidyl cyclases, protein kinases, lactate/malate dehydrogenases and trypsin-like serine proteases. We highlight new positions for these proteins that suggest additional experiments to elucidate the basis of specificity. The method is also able to predict sub-type for unclassified sequences. We assess several variations on a prediction method, and compare them to simple sequence comparisons. For assessment, we remove close homologues to the sequence for which a prediction is to be made (by a sequence identity above a threshold). This simulates situations where a protein is known to belong to a protein family, but is not a close relative of another protein of known sub-type. Considering the four families above, and a sequence identity threshold of 30%, our best method gives an accuracy of 96% compared to 80% obtained for sequence similarity and 74% for BLAST. We describe the derivation of a set of sub-type groupings derived from an automated parsing of alignments from PFAM and the SWISSPROT database, and use this to perform a large-scale assessment. The best method gives an average accuracy of 94% compared to 68% for sequence similarity and 79% for BLAST. We discuss implications for experimental design, genome annotation and the prediction of protein function and protein intra-residue distances.
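The idea of scoring alignment columns by relative entropy between sub-type profiles can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' method: the pseudocount, the symmetrised form of the relative entropy, and all names (`column_profile`, `relative_entropy`, `subtype_column_scores`) are hypothetical choices made for the example.

```python
from collections import Counter
from math import log2

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def column_profile(residues, pseudocount=1.0):
    """Amino acid frequencies for one alignment column (gaps are ignored)."""
    counts = Counter(r for r in residues if r in AMINO_ACIDS)
    total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
    return {a: (counts[a] + pseudocount) / total for a in AMINO_ACIDS}


def relative_entropy(p, q):
    """Kullback-Leibler divergence D(p || q) in bits."""
    return sum(p[a] * log2(p[a] / q[a]) for a in AMINO_ACIDS)


def subtype_column_scores(subtype_a, subtype_b):
    """Symmetrised relative entropy per column for two aligned sub-types."""
    length = len(subtype_a[0])
    scores = []
    for i in range(length):
        p = column_profile([seq[i] for seq in subtype_a])
        q = column_profile([seq[i] for seq in subtype_b])
        scores.append(relative_entropy(p, q) + relative_entropy(q, p))
    return scores


if __name__ == "__main__":
    # Toy alignment: only the middle column separates the two sub-types.
    print(subtype_column_scores(["ACD", "ACE"], ["AGD", "AGE"]))
```

Columns where the two sub-types share the same residue distribution score zero, while columns that separate the sub-types score high, which is the behaviour the abstract attributes to specificity-determining positions.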
8.
For applications such as comparative modelling one major issue is the reliability of sequence alignments. Reliable regions in alignments can be predicted using sub-optimal alignments of the same pair of sequences. Here we show that reliable regions in alignments can also be predicted from multiple sequence profile information alone. Alignments were created for a set of remotely related pairs of proteins using five different test methods. Structural alignments were used to assess the quality of the alignments and the aligned positions were scored using information from the observed frequencies of amino acid residues in sequence profiles pre-generated for each template structure. High-scoring regions of these profile-derived alignment scores were a good predictor of reliably aligned regions. These profile-derived alignment scores are easy to obtain and are applicable to any alignment method. They can be used to detect those regions of alignments that are reliably aligned and to help predict the quality of an alignment. For those residues within secondary structure elements, the regions predicted as reliably aligned agreed with the structural alignments for between 92% and 97.4% of the residues. In loop regions just under 92% of the residues predicted to be reliable agreed with the structural alignments. The percentage of residues predicted as reliable ranged from 32.1% for helix residues to 52.8% for strand residues. This information could also be used to help predict conserved binding sites from sequence alignments. Residues in the template that were identified as binding sites, that aligned to an identical amino acid residue and where the sequence alignment agreed with the structural alignment were in highly conserved, high scoring regions over 80% of the time. This suggests that many binding sites that are present in both target and template sequences are in sequence-conserved regions and that there is the possibility of translating reliability to binding site prediction.
9.
Konstantinos Blekas, Dimitrios I Fotiadis, Aristidis Likas. Journal of Computational Biology, 2005, 12(1): 64-82
We present a system for multi-class protein classification based on neural networks. The basic issue concerning the construction of neural network systems for protein classification is the sequence encoding scheme that must be used in order to feed the neural network. To deal with this problem we propose a method that maps a protein sequence into a numerical feature space using the matching scores of the sequence to groups of conserved patterns (called motifs) into protein families. We consider two alternative ways for identifying the motifs to be used for feature generation and provide a comparative evaluation of the two schemes. We also evaluate the impact of the incorporation of background features (2-grams) on the performance of the neural system. Experimental results on real datasets indicate that the proposed method is highly efficient and is superior to other well-known methods for protein classification.
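The background 2-gram features mentioned in the abstract can be illustrated with a short sketch that maps a protein sequence to a fixed-length vector of dipeptide frequencies. The normalisation by window count and the names `TWO_GRAMS` and `two_gram_features` are assumptions for this example; the motif-matching features of the paper are not reproduced here.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered pairs of the 20 standard amino acids, in a fixed order.
TWO_GRAMS = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def two_gram_features(sequence):
    """Relative frequency of each 2-gram over the overlapping windows of a sequence."""
    counts = dict.fromkeys(TWO_GRAMS, 0)
    windows = 0
    for i in range(len(sequence) - 1):
        gram = sequence[i:i + 2]
        if gram in counts:  # skip windows containing non-standard residues
            counts[gram] += 1
            windows += 1
    if windows == 0:
        return [0.0] * len(TWO_GRAMS)
    return [counts[g] / windows for g in TWO_GRAMS]


if __name__ == "__main__":
    features = two_gram_features("ACAC")
    print(len(features), features[TWO_GRAMS.index("AC")])
```

A fixed-length vector of this kind can be fed to a neural network regardless of the length of the input sequence.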
10.
G Y Srinivasarao, L S Yeh, C R Marzec, B C Orcutt, W C Barker. Bioinformatics (Oxford, England), 1999, 15(5): 382-390
MOTIVATION: The Protein Information Resource (PIR) maintains a database of annotated and curated alignments in order to visually represent interrelationships among sequences in the PIR-International Protein Sequence Database, to spread and standardize protein names, features and keywords among members of a family or superfamily, and to aid us in classifying sequences, in identifying conserved regions, and in defining new homology domains. RESULTS: Release 22.0, (December 1998), of the PIR-ALN database contains a total of 3806 alignments, including 1303 superfamily, 2131 family and 372 homology domain alignments. This is an appropriate dataset to develop and extract patterns, test profiles, train neural networks or build Hidden Markov Models (HMMs). These alignments can be used to standardize and spread annotation to newer members by homology, as well as to understand the modular architecture of multidomain proteins. PIR-ALN includes 529 alignments that can be used to develop patterns not represented in PROSITE, Blocks, PRINTS and Pfam databases. The ATLAS information retrieval system can be used to browse and query the PIR-ALN alignments. AVAILABILITY: PIR-ALN is currently being distributed as a single ASCII text file along with the title, member, species, superfamily and keyword indexes. The quarterly and weekly updates can be accessed via the WWW at pir.georgetown.edu. The quarterly updates can also be obtained by anonymous FTP from the PIR FTP site at NBRF.Georgetown.edu, directory [ANONYMOUS.PIR.ALIGNMENT].
11.
Background
In the past years the Smith-Waterman sequence comparison algorithm has gained popularity due to improved implementations and rapidly increasing computing power. However, the quality and sensitivity of a database search is not only determined by the algorithm but also by the statistical significance testing for an alignment. The e-value is the most commonly used statistical validation method for sequence database searching. The CluSTr database and the Protein World database have been created using an alternative statistical significance test: a Z-score based on Monte-Carlo statistics. Several papers have described the superiority of the Z-score as compared to the e-value, using simulated data. We were interested in whether this could be validated when applied to existing, evolutionarily related protein sequences.
12.
MOTIVATION: In previous work, we developed ATGpr, a computer program for predicting the fullness of a cDNA, i.e. whether it contains an initiation codon or not. Statistical information of short nucleotide fragments was fully exploited in the prediction algorithm. However, sequence similarities to known proteins, which are becoming increasingly available due to recent rapid growth of protein databases, were not used in the prediction. In this work, we present a new prediction algorithm based on both statistical and similarity information, which provides better performance in sensitivity and specificity. RESULTS: We evaluated the accuracy of ATGpr for predicting fullness of cDNA sequences from human clustered ESTs of UniGene, and we obtained specificity, sensitivity, and correlation coefficient of this prediction. Specificity and sensitivity crossed at 46% over the ATGpr score threshold of 0.33 and the maximum correlation coefficient of 0.34 was obtained at this threshold. Without ATGpr we found it effective to use alignments with known proteins for predicting the fullness of cDNA sequences. That is, specificity increased monotonically as similarity (identity of the alignments) increased. Specificity exceeded 80% when identity was greater than 40%. For more effective prediction of fullness of cDNA sequences we combined the similarity (identity of query sequence) with known proteins and ATGpr score. As a result, specificity exceeded 80% when identity was greater than 20%. AVAILABILITY: The prediction program, called ATGpr_sim, is available at http://www.hri.co.jp/atgpr/ATGpr_sim.html CONTACT: nisikawa@crl.hitachi.co.jp
13.
14.
A new method to analyze the similarity between multiply aligned protein motifs (blocks) was developed. It identifies sets of consistently aligned blocks. These are found to be protein regions of similar function and structure that appear in different contexts. For example, the Rossmann fold ligand-binding region is found similar to TIM barrel and methylase regions, various protein families are predicted to have a TIM-barrel fold and the structural relation between the ClpP protease and crotonase folds is identified from their sequence. Besides identifying local structure features, sequence similarity across short sequence-regions (less than 20 amino acid regions) also predicts structure similarity of whole domains (folds) a few hundred amino acid residues long. Most of these relations could not be identified by other advanced sequence-to-sequence or sequence-to-multiple alignments comparisons. We describe the method (termed CYRCA), present examples of our findings, and discuss their implications.
15.
We present here a neural network-based method for detection of signal peptides (abbreviation used: SP) in proteins. The method is trained on sequences of known signal peptides extracted from the Swiss-Prot protein database and is able to work separately on prokaryotic and eukaryotic proteins. A query protein is dissected into overlapping short sequence fragments, and then each fragment is analyzed with respect to the probability of it being a signal peptide and containing a cleavage site. While the accuracy of the method is comparable to that of other existing prediction tools, it provides a significantly higher speed and portability. The accuracy of cleavage site prediction reaches 73% on heterogeneous source data that contains both prokaryotic and eukaryotic sequences while the accuracy of discrimination between signal peptides and non-signal peptides is above 93% for any source dataset. As a consequence, the method can be easily applied to genome-wide datasets. The software can be downloaded freely from http://rpsp.bioinfo.pl/RPSP.tar.gz.
16.
An important task in functional genomics is to cluster homologous proteins, which may share common functions. Annotating proteins of unknown function by transferring annotations from their homologues of known annotations is one of the most efficient ways to predict protein function. In this paper, we use a modularity-based method called CD for grouping together homologous proteins. The method employs a global heuristic search strategy to find the partitioning of the weighted adjacency graph with the largest modularity. The weighted adjacency graph is constructed by the sigmoidal transformation of all pairwise sequence similarities between all protein sequences in a given dataset. The method has been extensively tested on several subsets from the superfamily level of the SCOP (Structural Classification of Proteins) database, where some homologous proteins have very low sequence similarity. Compared with the widely used method MCL, we observe that the number of clusters obtained by CD is closer to the number of superfamilies in the dataset, the value of the F-measure given by CD is 10% better than MCL on average, and CD is more tolerant to noise in the sequence similarity. The experimental results indicate that CD is ideally suited for clustering homologous proteins when sequence similarity is low.
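The two ingredients named in the abstract, a sigmoidal transformation of pairwise similarities and the modularity of a partition of the resulting weighted graph, can be sketched as below. The sketch uses the standard weighted modularity Q = (1/2m) Σ_ij [w_ij − k_i k_j / 2m] δ(c_i, c_j); the `midpoint` and `steepness` parameters of the sigmoid and all function names are hypothetical, and the global heuristic search used by CD is not reproduced.

```python
from math import exp


def sigmoid_weight(similarity, midpoint=0.5, steepness=10.0):
    """Map a similarity score to an edge weight in (0, 1); parameters are assumed."""
    return 1.0 / (1.0 + exp(-steepness * (similarity - midpoint)))


def weighted_graph(similarity):
    """Build a symmetric weight matrix with a zero diagonal from a similarity matrix."""
    n = len(similarity)
    return [[0.0 if i == j else sigmoid_weight(similarity[i][j]) for j in range(n)]
            for i in range(n)]


def modularity(weights, labels):
    """Weighted modularity of the partition given by one cluster label per node."""
    n = len(weights)
    strength = [sum(row) for row in weights]  # weighted degree k_i
    two_m = sum(strength)                     # twice the total edge weight
    if two_m == 0:
        return 0.0
    q = 0.0
    for i in range(n):
        for j in range(n):
            if labels[i] == labels[j]:
                q += weights[i][j] - strength[i] * strength[j] / two_m
    return q / two_m


if __name__ == "__main__":
    sims = [[0.0, 0.9, 0.1, 0.1],
            [0.9, 0.0, 0.1, 0.1],
            [0.1, 0.1, 0.0, 0.9],
            [0.1, 0.1, 0.9, 0.0]]
    w = weighted_graph(sims)
    print(modularity(w, [0, 0, 1, 1]), modularity(w, [0, 1, 0, 1]))
```

In this toy example the partition that keeps the two highly similar pairs together scores a higher modularity than the partition that splits them, which is the quantity a search procedure such as the one described would try to maximise.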
17.
18.
A 35-kDa protein (designated p35) showing antigenic homology with an N-terminal epitope on the SV-40 large T-antigen oncoprotein was purified from transformed cardiomyocytes. Sequence analysis of several tryptic peptides indicated that p35 was not homologous to previously described sequences. Polyclonal antibody raised against synthetic peptide containing one of the tryptic fragments was used in Western blot analyses to ascertain the tissue-specific pattern of p35 expression. p35 was expressed ubiquitously in adult mouse tissues, and was detected in both embryonic and transformed cardiomyocyte preparations. Subcellular fractionation studies indicated that p35 is an integral membrane protein. Expression of p35 appeared to be regulated by growth conditions as evidenced by a transient decrease in protein levels following the addition of serum to quiescent NIH 3T3 cells.
19.
A model has been developed that permits the prediction of mRNA nucleic acid sequence from the sequences of the translated proteins. The model relies on the information obtained from the comparison of protein sequences in related species to reduce the number of possible codons for those amino acids where mutations are observed. The predictions so obtained have been tested by applying the model to proteins whose mRNA sequences are known. The model's predictions have been found to be 100% accurate if three or more different amino acids are known at a given position and if the protein sequences are restricted to relatively closely related species (within the same class). The use of this model may permit a reduction of the mRNA sequence degeneracy and therefore be helpful in the synthesis of cDNA probes or for the prediction of restriction endonuclease sites. Computer programs have been developed to ease the use of the model.
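One way such a reduction of codon degeneracy could work is sketched below. The abstract does not state the rule used, so this example assumes that residues observed at the same position in related species are encoded by codons differing by a single nucleotide substitution; that assumption, and the names `codons_for`, `hamming` and `reduced_codons`, are hypothetical. Only the standard genetic code table is taken as given.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code in UCAG order (first base varies slowest); '*' marks stop codons.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def codons_for(amino_acid):
    """All mRNA codons that encode the given one-letter amino acid."""
    return {codon for codon, aa in CODON_TABLE.items() if aa == amino_acid}


def hamming(a, b):
    """Number of positions at which two codons differ."""
    return sum(x != y for x, y in zip(a, b))


def reduced_codons(amino_acid, others):
    """Codons for amino_acid that lie one substitution away from some codon of
    every other residue observed at the same position (assumed rule)."""
    candidates = codons_for(amino_acid)
    for other in others:
        candidates = {c for c in candidates
                      if any(hamming(c, o) == 1 for o in codons_for(other))}
    return candidates


if __name__ == "__main__":
    # Leucine has six codons; if methionine (AUG) is seen at the same position
    # in a related species, only the leucine codons adjacent to AUG remain.
    print(sorted(codons_for("L")), sorted(reduced_codons("L", ["M"])))
```

Under this assumed rule, each additional amino acid observed at a position can only shrink the candidate set, which is consistent with the abstract's remark that predictions improve when three or more different amino acids are known.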
20.
Ohnishi H, Nakahara T, Furuse K, Sasaki H, Tsukita S, Furuse M. The Journal of Biological Chemistry, 2004, 279(44): 46014-46022
The apical junctional complex is composed of various cell adhesion molecules and cytoplasmic plaque proteins. Using a monoclonal antibody that recognizes a chicken 155-kDa cytoplasmic antigen (p155) localizing at the apical junctional complex, we have cloned a cDNA of its mouse homologue. The full-length cDNA of mouse p155 encoded a 148-kDa polypeptide containing a coiled-coil domain with sequence similarity to cingulin, a tight junction (TJ)-associated plaque protein. We designated this protein JACOP (junction-associated coiled-coil protein). Immunofluorescence staining showed that JACOP was concentrated in the junctional complex in various types of epithelial and endothelial cells. Furthermore, in the liver and kidney, JACOP was also distributed along non-junctional actin filaments. Upon immunoelectron microscopy, JACOP was found to be localized to the undercoat of TJs in the liver, but in some tissues, its distribution was not restricted to TJs but extended to the area of adherens junctions. Overexpression studies have revealed that JACOP was recruited to the junctional complex in epithelial cells and to cell-cell contacts and stress fibers in fibroblasts. These findings suggest that JACOP is involved in anchoring the apical junctional complex, especially TJs, to actin-based cytoskeletons.