首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 31 毫秒
1.
We propose new methods for finding similarities in protein structure databases. These methods extract feature vectors on triplets of SSEs (Secondary Structure Elements) of proteins. The feature vectors are then indexed using a multidimensional index structure. Our first technique considers the problem of finding proteins similar to a given query protein in a protein dataset. It quickly finds promising proteins using the index structure. These proteins are then aligned to the query protein using a popular pairwise alignment tool such as VAST. We also develop a novel statistical model to estimate the goodness of a match using the SSEs. Our second technique considers the problem of joining two protein datasets to find an all-to-all similarity. Experimental results show that our techniques improve the pruning time of VAST 3 to 3.5 times, while keeping the sensitivity similar. Our technique can also be incorporated with DALI and CE to improve their running times by a factor of 2 and 2.7 respectively. The software is available online at http://bioserver.cs.ucsb.edu/.  相似文献   

2.
MOTIVATION: Recognizing proteins that have similar tertiary structure is the key step of template-based protein structure prediction methods. Traditionally, a variety of alignment methods are used to identify similar folds, based on sequence similarity and sequence-structure compatibility. Although these methods are complementary, their integration has not been thoroughly exploited. Statistical machine learning methods provide tools for integrating multiple features, but so far these methods have been used primarily for protein and fold classification, rather than addressing the retrieval problem of fold recognition-finding a proper template for a given query protein. RESULTS: Here we present a two-stage machine learning, information retrieval, approach to fold recognition. First, we use alignment methods to derive pairwise similarity features for query-template protein pairs. We also use global profile-profile alignments in combination with predicted secondary structure, relative solvent accessibility, contact map and beta-strand pairing to extract pairwise structural compatibility features. Second, we apply support vector machines to these features to predict the structural relevance (i.e. in the same fold or not) of the query-template pairs. For each query, the continuous relevance scores are used to rank the templates. The FOLDpro approach is modular, scalable and effective. Compared with 11 other fold recognition methods, FOLDpro yields the best results in almost all standard categories on a comprehensive benchmark dataset. Using predictions of the top-ranked template, the sensitivity is approximately 85, 56, and 27% at the family, superfamily and fold levels respectively. Using the 5 top-ranked templates, the sensitivity increases to 90, 70, and 48%.  相似文献   

3.
4.
Proteins that contain similar structural elements often have analogous functions regardless of the degree of sequence similarity or structure connectivity in space. In general, protein structure comparison (PSC) provides a straightforward methodology for biologists to determine critical aspects of structure and function. Here, we developed a novel PSC technique based on angle-distance image (A-D image) transformation and matching, which is independent of sequence similarity and connectivity of secondary structure elements (SSEs). An A-D image is constructed by utilizing protein secondary structure information. According to various types of SSEs, the mutual SSE pairs of the query protein are classified into three different types of sub-images. Subsequently, corresponding sub-images between query and target protein structures are compared using modified cross-correlation approaches to identify the similarity of various patterns. Structural relationships among proteins are displayed by hierarchical clustering trees, which facilitate the establishment of the evolutionary relationships between structure and function of various proteins.Four standard testing datasets and one newly created dataset were used to evaluate the proposed method. The results demonstrate that proteins from these five datasets can be categorized in conformity with their spatial distribution of SSEs. Moreover, for proteins with low sequence identity that share high structure similarity, the proposed algorithms are an efficient and effective method for structural comparison.  相似文献   

5.
Fish species recognition is an important task to preserve ecosystems, feed humans, and tourism. In particular, the Pantanal is a wetland region that harbors hundreds of species and is considered one of the most important ecosystems in the world. In this paper, we present a new method based on convolutional neural networks (CNNs) for Pantanal fish species recognition. A new CNN composed of three branches that classify the fish species, family and order is proposed with the aim of improving the recognition of species with similar characteristics. The branch that classifies the fish species uses information learned from the family and order, which has shown to improve the overall accuracy. Results on unrestricted image dataset showed that the proposed method provides superior results to traditional approaches. Our method obtained an accuracy of 0.873 versus 0.864 of traditional CNN in recognition of 68 fish species. In addition, our method provides fish family and order recognition, which obtained accuracies of 0.938 and 0.96, respectively. We hope that, with these promising results, an automatic tool can be developed to monitor species in an important region such as the Pantanal.  相似文献   

6.
We develop a novel method of asserting the similarity between two biological sequences without the need for alignment. The proposed method uses free energy of nearest-neighbor interactions as a simple measure of dissimilarity. It is used to perform a search for similarities of a query sequence against three complex datasets. The sensitivity and selectivity are computed and evaluated and the performance of the proposed distance measure is compared. Real data analysis shows that is a very efficient, sensitive and high-selective algorithm in comparing large dataset of DNA sequences.  相似文献   

7.
MOTIVATION: We consider the problem of finding similarities in protein structure databases. Current techniques sequentially compare the given query protein to all of the proteins in the database to find similarities. Therefore, the cost of similarity queries increases linearly as the volume of the protein databases increase. As the sizes of experimentally determined and theoretically estimated protein structure databases grow, there is a need for scalable searching techniques. RESULTS: Our techniques extract feature vectors on triplets of SSEs (Secondary Structure Elements). Later, these feature vectors are indexed using a multidimensional index structure. For a given query protein, this index structure is used to quickly prune away unpromising proteins in the database. The remaining proteins are then aligned using a popular alignment tool such as VAST. We also develop a novel statistical model to estimate the goodness of a match using the SSEs. Experimental results show that our techniques improve the pruning time of VAST 3 to 3.5 times while maintaining similar sensitivity.  相似文献   

8.
PepPat, a hybrid method that combines pattern matching with similarity scoring, is described. We also report PepPat's application in the identification of a novel tachykinin-like peptide. PepPat takes as input a query peptide and a user-specified regular expression pattern within the peptide. It first performs a database pattern match and then ranks candidates on the basis of their similarity to the query peptide. PepPat calculates similarity over the pattern spanning region, enhancing PepPat's sensitivity for short query peptides. PepPat can also search for a user-specified number of occurrences of a repeated pattern within the target sequence. We illustrate PepPat's application in short peptide ligand mining. As a validation example, we report the identification of a novel tachykinin-like peptide, C14TKL-1, and show it is an NK1 (neuokinin receptor 1) agonist whose message is widely expressed in human periphery. Availability: PepPat is offered online at:  相似文献   

9.
An effective image retrieval system is developed based on the use of neural networks (NNs). It takes advantages of association ability of multilayer NNs as matching engines which calculate similarities between a user's drawn sketch and the stored images. The NNs memorize pixel information of every size-reduced image (thumbnail) in the learning phase. In the retrieval phase, pixel information of a user's drawn rough sketch is inputted to the learned NNs and they estimate the candidates. Thus the system can retrieve candidates quickly and correctly by utilizing the parallelism and association ability of NNs. In addition, the system has learning capability: it can automatically extract features of a user's drawn sketch during the retrieval phase and can store them as additional information to improve the performance. The software for querying, including efficient graphical user interfaces, has been implemented and tested. The effectiveness of the proposed system has been investigated through various experimental tests.  相似文献   

10.
Functionally analogous enzymes are those that catalyze similar reactions on similar substrates but do not share common ancestry, providing a window on the different structural strategies nature has used to evolve required catalysts. Identification and use of this information to improve reaction classification and computational annotation of enzymes newly discovered in the genome projects would benefit from systematic determination of reaction similarities. Here, we quantified similarity in bond changes for overall reactions and catalytic mechanisms for 95 pairs of functionally analogous enzymes (non-homologous enzymes with identical first three numbers of their EC codes) from the MACiE database. Similarity of overall reactions was computed by comparing the sets of bond changes in the transformations from substrates to products. For similarity of mechanisms, sets of bond changes occurring in each mechanistic step were compared; these similarities were then used to guide global and local alignments of mechanistic steps. Using this metric, only 44% of pairs of functionally analogous enzymes in the dataset had significantly similar overall reactions. For these enzymes, convergence to the same mechanism occurred in 33% of cases, with most pairs having at least one identical mechanistic step. Using our metric, overall reaction similarity serves as an upper bound for mechanistic similarity in functional analogs. For example, the four carbon-oxygen lyases acting on phosphates (EC 4.2.3) show neither significant overall reaction similarity nor significant mechanistic similarity. By contrast, the three carboxylic-ester hydrolases (EC 3.1.1) catalyze overall reactions with identical bond changes and have converged to almost identical mechanisms. The large proportion of enzyme pairs that do not show significant overall reaction similarity (56%) suggests that at least for the functionally analogous enzymes studied here, more stringent criteria could be used to refine definitions of EC sub-subclasses for improved discrimination in their classification of enzyme reactions. The results also indicate that mechanistic convergence of reaction steps is widespread, suggesting that quantitative measurement of mechanistic similarity can inform approaches for functional annotation.  相似文献   

11.
12.
13.
R.R. Janghel  Y.K. Rathore 《IRBM》2021,42(4):258-267
ObjectivesAlzheimer's Disease (AD) is the most general type of dementia. In all leading countries, it is one of the primary reasons of death in senior citizens. Currently, it is diagnosed by calculating the MSME score and by the manual study of MRI Scan. Also, different machine learning methods are utilized for automatic diagnosis but existing has some limitations in terms of accuracy. So, main objective of this paper to include a preprocessing method before CNN model to increase the accuracy of classification.Materials and methodIn this paper, we present a deep learning-based approach for detection of Alzheimer's Disease from ADNI database of Alzheimer's disease patients, the dataset contains fMRI and PET images of Alzheimer's patients along with normal person's image. We have applied 3D to 2D conversion and resizing of images before applying VGG-16 architecture of Convolution neural network for feature extraction. Finally, for classification SVM, Linear Discriminate, K means clustering, and Decision tree classifiers are used.ResultsThe experimental result shows that the average accuracy of 99.95% is achieved for the classification of the fMRI dataset, while the average accuracy of 73.46% is achieved with the PET dataset. On comparing results on the basis of accuracy, specificity, sensitivity and on some other parameters we found that these results are better than existing methods.Conclusionsthis paper, suggested a unique way to increase the performance of CNN models by applying some preprocessing on image dataset before sending to CNN architecture for feature extraction. We applied this method on ADNI database and on comparing the accuracies with other similar approaches it shows better results.  相似文献   

14.
With great potential for assisting radiological image interpretation and decision making, content-based image retrieval in the medical domain has become a hot topic in recent years. Many methods to enhance the performance of content-based medical image retrieval have been proposed, among which the relevance feedback (RF) scheme is one of the most promising. Given user feedback information, RF algorithms interactively learn a user’s preferences to bridge the “semantic gap” between low-level computerized visual features and high-level human semantic perception and thus improve retrieval performance. However, most existing RF algorithms perform in the original high-dimensional feature space and ignore the manifold structure of the low-level visual features of images. In this paper, we propose a new method, termed dual-force ISOMAP (DFISOMAP), for content-based medical image retrieval. Under the assumption that medical images lie on a low-dimensional manifold embedded in a high-dimensional ambient space, DFISOMAP operates in the following three stages. First, the geometric structure of positive examples in the learned low-dimensional embedding is preserved according to the isometric feature mapping (ISOMAP) criterion. To precisely model the geometric structure, a reconstruction error constraint is also added. Second, the average distance between positive and negative examples is maximized to separate them; this margin maximization acts as a force that pushes negative examples far away from positive examples. Finally, the similarity propagation technique is utilized to provide negative examples with another force that will pull them back into the negative sample set. We evaluate the proposed method on a subset of the IRMA medical image dataset with a RF-based medical image retrieval framework. Experimental results show that DFISOMAP outperforms popular approaches for content-based medical image retrieval in terms of accuracy and stability.  相似文献   

15.
16.
  1. Changes in insect biomass, abundance, and diversity are challenging to track at sufficient spatial, temporal, and taxonomic resolution. Camera traps can capture habitus images of ground‐dwelling insects. However, currently sampling involves manually detecting and identifying specimens. Here, we test whether a convolutional neural network (CNN) can classify habitus images of ground beetles to species level, and estimate how correct classification relates to body size, number of species inside genera, and species identity.
  2. We created an image database of 65,841 museum specimens comprising 361 carabid beetle species from the British Isles and fine‐tuned the parameters of a pretrained CNN from a training dataset. By summing up class confidence values within genus, tribe, and subfamily and setting a confidence threshold, we trade‐off between classification accuracy, precision, and recall and taxonomic resolution.
  3. The CNN classified 51.9% of 19,164 test images correctly to species level and 74.9% to genus level. Average classification recall on species level was 50.7%. Applying a threshold of 0.5 increased the average classification recall to 74.6% at the expense of taxonomic resolution. Higher top value from the output layer and larger sized species were more often classified correctly, as were images of species in genera with few species.
  4. Fine‐tuning enabled us to classify images with a high mean recall for the whole test dataset to species or higher taxonomic levels, however, with high variability. This indicates that some species are more difficult to identify because of properties such as their body size or the number of related species.
  5. Together, species‐level image classification of arthropods from museum collections and ecological monitoring can substantially increase the amount of occurrence data that can feasibly be collected. These tools thus provide new opportunities in understanding and predicting ecological responses to environmental change.
  相似文献   

17.
We present a new method for protein structure comparison that combines indexing and dynamic programming (DP). The method is based on simple geometric features of triplets of secondary structures of proteins. These features provide indexes to a hash table that allows fast retrieval of similarity information for a query protein. After the query protein is matched with all proteins in the hash table producing a list of putative similarities, the dynamic programming algorithm is used to align the query protein with each protein of this list. Since the pairwise comparison with DP is applied only to a small subset of proteins and, furthermore, DP re-uses information that is already computed and stored in the hash table, the approach is very fast even when searching the entire PDB. We have done extensive experimentation showing that our approach achieves results of quality comparable to that of other existing approaches but is generally faster.  相似文献   

18.
MOTIVATION: The Dictionary of Interfaces in Proteins (DIP) is a database collecting the 3D structure of interacting parts of proteins that are called patches. It serves as a repository, in which patches similar to given query patches can be found. The computation of the similarity of two patches is time consuming and traversing the entire DIP requires some hours. In this work we address the question of how the patches similar to a given query can be identified by scanning only a small part of DIP. The answer to this question requires the investigation of the distribution of the similarity of patches. RESULTS: The score values describing the similarity of two patches can roughly be divided into three ranges that correspond to different levels of spatial similarity. Interestingly, the two iso-score lines separating the three classes can be determined by two different approaches. Applying a concept of the theory of random graphs reveals significant structural properties of the data in DIP. These can be used to accelerate scanning the DIP for patches similar to a given query. Searches for very similar patches could be accelerated by a factor of more than 25. Patches with a medium similarity could be found 10 times faster than by brute-force search.  相似文献   

19.
We present a new method for conducting protein structure similarity searches, which improves on the efficiency of some existing techniques. Our method is grounded in the theory of differential geometry on 3D space curve matching. We generate shape signatures for proteins that are invariant, localized, robust, compact, and biologically meaningful. The invariancy of the shape signatures allows us to improve similarity searching efficiency by adopting a hierarchical coarse-to-fine strategy. We index the shape signatures using an efficient hashing-based technique. With the help of this technique we screen out unlikely candidates and perform detailed pairwise alignments only for a small number of candidates that survive the screening process. Contrary to other hashing based techniques, our technique employs domain specific information (not just geometric information) in constructing the hash key, and hence, is more tuned to the domain of biology. Furthermore, the invariancy, localization, and compactness of the shape signatures allow us to utilize a well-known local sequence alignment algorithm for aligning two protein structures. One measure of the efficacy of the proposed technique is that we were able to perform structure alignment queries 36 times faster (on the average) than a well-known method while keeping the quality of the query results at an approximately similar level.  相似文献   

20.
Convolutional Neural Networks (CNNs) are statistical models suited for learning complex visual patterns. In the context of Species Distribution Models (SDM) and in line with predictions of landscape ecology and island biogeography, CNN could grasp how local landscape structure affects prediction of species occurrence in SDMs. The prediction can thus reflect the signatures of entangled ecological processes. Although previous machine-learning based SDMs can learn complex influences of environmental predictors, they cannot acknowledge the influence of environmental structure in local landscapes (hence denoted “punctual models”). In this study, we applied CNNs to a large dataset of plant occurrences in France (GBIF), on a large taxonomical scale, to predict ranked relative probability of species (by joint learning) to any geographical position. We examined the way local environmental landscapes improve prediction by performing alternative CNN models deprived of information on landscape heterogeneity and structure (“ablation experiments”). We found that the landscape structure around location crucially contributed to improve predictive performance of CNN-SDMs. CNN models can classify the predicted distributions of many species, as other joint modelling approaches, but they further prove efficient in identifying the influence of local environmental landscapes. CNN can then represent signatures of spatially structured environmental drivers. The prediction gain is noticeable for rare species, which open promising perspectives for biodiversity monitoring and conservation strategies. Therefore, the approach is of both theoretical and practical interest. We discuss the way to test hypotheses on the patterns learnt by CNN, which should be essential for further interpretation of the ecological processes at play.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号