首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 46 毫秒
1.
Taxonomy-independent analysis plays an essential role in microbial community analysis. Hierarchical clustering is one of the most widely employed approaches to finding operational taxonomic units, the basis for many downstream analyses. Most existing algorithms have quadratic space and computational complexities, and thus can be used only for small or medium-scale problems. We propose a new online learning-based algorithm that simultaneously addresses the space and computational issues of prior work. The basic idea is to partition a sequence space into a set of subspaces using a partition tree constructed using a pseudometric, then recursively refine a clustering structure in these subspaces. The technique relies on new methods for fast closest-pair searching and efficient dynamic insertion and deletion of tree nodes. To avoid exhaustive computation of pairwise distances between clusters, we represent each cluster of sequences as a probabilistic sequence, and define a set of operations to align these probabilistic sequences and compute genetic distances between them. We present analyses of space and computational complexity, and demonstrate the effectiveness of our new algorithm using a human gut microbiota data set with over one million sequences. The new algorithm exhibits a quasilinear time and space complexity comparable to greedy heuristic clustering algorithms, while achieving a similar accuracy to the standard hierarchical clustering algorithm.  相似文献   

2.
《Biophysical journal》2021,120(20):4472-4483
Single-molecule (SM) approaches have provided valuable mechanistic information on many biophysical systems. As technological advances lead to ever-larger data sets, tools for rapid analysis and identification of molecules exhibiting the behavior of interest are increasingly important. In many cases the underlying mechanism is unknown, making unsupervised techniques desirable. The divisive segmentation and clustering (DISC) algorithm is one such unsupervised method that idealizes noisy SM time series much faster than computationally intensive approaches without sacrificing accuracy. However, DISC relies on a user-selected objective criterion (OC) to guide its estimation of the ideal time series. Here, we explore how different OCs affect DISC’s performance for data typical of SM fluorescence imaging experiments. We find that OCs differing in their penalty for model complexity each optimize DISC’s performance for time series with different properties such as signal/noise and number of sample points. Using a machine learning approach, we generate a decision boundary that allows unsupervised selection of OCs based on the input time series to maximize performance for different types of data. This is particularly relevant for SM fluorescence data sets, which often have signal/noise near the derived decision boundary and include time series of nonuniform length because of stochastic bleaching. Our approach, AutoDISC, allows unsupervised per-molecule optimization of DISC, which will substantially assist in the rapid analysis of high-throughput SM data sets with noisy samples and nonuniform time windows.  相似文献   

3.
Gene expression microarray experiments frequently generate datasets with multiple values missing. However, most of the analysis, mining, and classification methods for gene expression data require a complete matrix of gene array values. Therefore, the accurate estimation of missing values in such datasets has been recognized as an important issue, and several imputation algorithms have already been proposed to the biological community. Most of these approaches, however, are not particularly suitable for time series expression profiles. In view of this, we propose a novel imputation algorithm, which is specially suited for the estimation of missing values in gene expression time series data. The algorithm utilizes Dynamic Time Warping (DTW) distance in order to measure the similarity between time expression profiles, and subsequently selects for each gene expression profile with missing values a dedicated set of candidate profiles for estimation. Three different DTW-based imputation (DTWimpute) algorithms have been considered: position-wise, neighborhood-wise, and two-pass imputation. These have initially been prototyped in Perl, and their accuracy has been evaluated on yeast expression time series data using several different parameter settings. The experiments have shown that the two-pass algorithm consistently outperforms, in particular for datasets with a higher level of missing entries, the neighborhood-wise and the position-wise algorithms. The performance of the two-pass DTWimpute algorithm has further been benchmarked against the weighted K-Nearest Neighbors algorithm, which is widely used in the biological community; the former algorithm has appeared superior to the latter one. Motivated by these findings, indicating clearly the added value of the DTW techniques for missing value estimation in time series data, we have built an optimized C++ implementation of the two-pass DTWimpute algorithm. The software also provides for a choice between three different initial rough imputation methods.  相似文献   

4.

Background

RNA secondary structure prediction is a mainstream bioinformatic domain, and is key to computational analysis of functional RNA. In more than 30 years, much research has been devoted to defining different variants of RNA structure prediction problems, and to developing techniques for improving prediction quality. Nevertheless, most of the algorithms in this field follow a similar dynamic programming approach as that presented by Nussinov and Jacobson in the late 70's, which typically yields cubic worst case running time algorithms. Recently, some algorithmic approaches were applied to improve the complexity of these algorithms, motivated by new discoveries in the RNA domain and by the need to efficiently analyze the increasing amount of accumulated genome-wide data.

Results

We study Valiant's classical algorithm for Context Free Grammar recognition in sub-cubic time, and extract features that are common to problems on which Valiant's approach can be applied. Based on this, we describe several problem templates, and formulate generic algorithms that use Valiant's technique and can be applied to all problems which abide by these templates, including many problems within the world of RNA Secondary Structures and Context Free Grammars.

Conclusions

The algorithms presented in this paper improve the theoretical asymptotic worst case running time bounds for a large family of important problems. It is also possible that the suggested techniques could be applied to yield a practical speedup for these problems. For some of the problems (such as computing the RNA partition function and base-pair binding probabilities), the presented techniques are the only ones which are currently known for reducing the asymptotic running time bounds of the standard algorithms.  相似文献   

5.
P Gao  X Zhou  ZN Wang  YX Song  LL Tong  YY Xu  ZY Yue  HM Xu 《PloS one》2012,7(7):e42015

Objective

Over the past decades, many studies have used data mining technology to predict the 5-year survival rate of colorectal cancer, but there have been few reports that compared multiple data mining algorithms to the TNM classification of malignant tumors (TNM) staging system using a dataset in which the training and testing data were from different sources. Here we compared nine data mining algorithms to the TNM staging system for colorectal survival analysis.

Methods

Two different datasets were used: 1) the National Cancer Institute''s Surveillance, Epidemiology, and End Results dataset; and 2) the dataset from a single Chinese institution. An optimization and prediction system based on nine data mining algorithms as well as two variable selection methods was implemented. The TNM staging system was based on the 7th edition of the American Joint Committee on Cancer TNM staging system.

Results

When the training and testing data were from the same sources, all algorithms had slight advantages over the TNM staging system in predictive accuracy. When the data were from different sources, only four algorithms (logistic regression, general regression neural network, Bayesian networks, and Naïve Bayes) had slight advantages over the TNM staging system. Also, there was no significant differences among all the algorithms (p>0.05).

Conclusions

The TNM staging system is simple and practical at present, and data mining methods are not accurate enough to replace the TNM staging system for colorectal cancer survival prediction. Furthermore, there were no significant differences in the predictive accuracy of all the algorithms when the data were from different sources. Building a larger dataset that includes more variables may be important for furthering predictive accuracy.  相似文献   

6.
Obtaining satisfactory results with neural networks depends on the availability of large data samples. The use of small training sets generally reduces performance. Most classical Quantitative Structure-Activity Relationship (QSAR) studies for a specific enzyme system have been performed on small data sets. We focus on the neuro-fuzzy prediction of biological activities of HIV-1 protease inhibitory compounds when inferring from small training sets. We propose two computational intelligence prediction techniques which are suitable for small training sets, at the expense of some computational overhead. Both techniques are based on the FAMR model. The FAMR is a Fuzzy ARTMAP (FAM) incremental learning system used for classification and probability estimation. During the learning phase, each sample pair is assigned a relevance factor proportional to the importance of that pair. The two proposed algorithms in this paper are: 1) The GA-FAMR algorithm, which is new, consists of two stages: a) During the first stage, we use a genetic algorithm (GA) to optimize the relevances assigned to the training data. This improves the generalization capability of the FAMR. b) In the second stage, we use the optimized relevances to train the FAMR. 2) The Ordered FAMR is derived from a known algorithm. Instead of optimizing relevances, it optimizes the order of data presentation using the algorithm of Dagher et al. In our experiments, we compare these two algorithms with an algorithm not based on the FAM, the FS-GA-FNN introduced in [4], [5]. We conclude that when inferring from small training sets, both techniques are efficient, in terms of generalization capability and execution time. The computational overhead introduced is compensated by better accuracy. Finally, the proposed techniques are used to predict the biological activities of newly designed potential HIV-1 protease inhibitors.  相似文献   

7.
8.
9.
In a social network, users hold and express positive and negative attitudes (e.g. support/opposition) towards other users. Those attitudes exhibit some kind of binary relationships among the users, which play an important role in social network analysis. However, some of those binary relationships are likely to be latent as the scale of social network increases. The essence of predicting latent binary relationships have recently began to draw researchers'' attention. In this paper, we propose a machine learning algorithm for predicting positive and negative relationships in social networks inspired by structural balance theory and social status theory. More specifically, we show that when two users in the network have fewer common neighbors, the prediction accuracy of the relationship between them deteriorates. Accordingly, in the training phase, we propose a segment-based training framework to divide the training data into two subsets according to the number of common neighbors between users, and build a prediction model for each subset based on support vector machine (SVM). Moreover, to deal with large-scale social network data, we employ a sampling strategy that selects small amount of training data while maintaining high accuracy of prediction. We compare our algorithm with traditional algorithms and adaptive boosting of them. Experimental results of typical data sets show that our algorithm can deal with large social networks and consistently outperforms other methods.  相似文献   

10.
Sequence analysis is the basis of bioinformatics, while sequence alignment is a fundamental task for sequence analysis. The widely used alignment algorithm, Dynamic Programming, though generating optimal alignment, takes too much time due to its high computation complexity O(N(2)). In order to reduce computation complexity without sacrificing too much accuracy, we have developed a new approach to align two homologous sequences. The new approach presented here, adopting our novel algorithm which combines the methods of probabilistic and combinatorial analysis, reduces the computation complexity to as low as O(N). The computation speed by our program is at least 15 times faster than traditional pairwise alignment algorithms without a loss of much accuracy. We hence named the algorithm Super Pairwise Alignment (SPA). The pairwise alignment execution program based on SPA and the detailed results of the aligned sequences discussed in this article are available upon request.  相似文献   

11.
《IRBM》2020,41(5):267-275
Background and objectiveClustering is a widely used popular method for data analysis within many clustering algorithms for years. Today it is used in many predictions, collaborative filtering and automatic segmentation systems on different domains. Also, to be broadly used in practice, such clustering algorithms need to give both better performance and robustness when compared to the ones currently used. In recent years, evolutionary algorithms are used in many domains since they are robust and easy to implement. And many clustering problems can be easily solved with such algorithms if the problem is modeled as an optimization problem. In this paper, we present an optimization approach for clustering by using four well-known evolutionary algorithms which are Biogeography-Based Optimization (BBO), Grey Wolf Optimization (GWO), Genetic Algorithm (GA) and Particle Swarm Optimization (PSO).Methodthe objective function has been specified to minimize the total distance from cluster centers to the data points. Euclidean distance is used for distance calculation. We have applied this objective function to the given algorithms both to find the most efficient clustering algorithm and to compare the clustering performances of algorithms against different data sizes. In order to benchmark the clustering performances of algorithms in the experiments, we have used a number of datasets with different data sizes such as some small scale, medium and big data. The clustering performances have been compared to K-means as it is a widely used clustering algorithm for years in literature. Rand Index, Adjusted Rand Index, Mirkin's Index and Hubert's Index have been considered as parameters for evaluating the clustering performances.ResultAs a result of the clustering experiments of algorithms over different datasets with varying data sizes according to the specified performance criteria, GA and GWO algorithms show better clustering performances among the others.ConclusionsThe results of the study showed that although the algorithms have shown satisfactory clustering results on small and medium scale datasets, the clustering performances on Big data need to be improved.  相似文献   

12.
The calculation of multipoint likelihoods of pedigree data is crucial for extracting the full available information needed for both parametric and nonparametric linkage analysis. Recent mathematical advances in both the Elston-Stewart and Lander-Green algorithms for computing exact multipoint likelihoods of pedigree data have enabled researchers to analyze data sets containing more markers and more individuals both faster and more efficiently. This paper presents novel algorithms that further extend the computational boundary of the Elston-Stewart algorithm. They have been implemented into the software package VITESSE v. 2 and are shown to be several orders of magnitude faster than the original implementation of the Elston-Stewart algorithm in VITESSE v. 1 on a variety of real pedigree data. VITESSE v. 2 was faster by a factor ranging from 168 to over 1,700 on these data sets, thus making a qualitative difference in the analysis. The main algorithm is based on the faster computation of the conditional probability of a component nuclear family within the pedigree by summing over the joint genotypes of the children instead of the parents as done in the VITESSE v. 1. This change in summation allows the parent-child transmission part of the calculation to be not only computed for each parent separately, but also for each locus separately by using inheritance vectors as is done in the Lander-Green algorithm. Computing both of these separately can lead to substantial computational savings. The use of inheritance vectors in the nuclear family calculation represents a partial synthesis of the techniques of the Lander-Green algorithm into the Elston-Stewart algorithm. In addition, the technique of local set recoding is introduced to further reduce the complexity of the nuclear family computation. These new algorithms, however, are not universally faster on all types of pedigree data compared to the method implemented in VITESSE v. 1 of summing over the parents. Therefore, a hybrid algorithm is introduced which combines the strength of both summation methods by using a numerical heuristic to decide which of the two to use for a given nuclear family within the pedigree and is shown to be faster than either method on its own. Finally, this paper discusses various complexity issues regarding both the Elston-Stewart and Lander-Green algorithms and possible future directions of further synthesis.  相似文献   

13.
The reconstruction of gene regulatory networks (GRNs) from high-throughput experimental data has been considered one of the most important issues in systems biology research. With the development of high-throughput technology and the complexity of biological problems, we need to reconstruct GRNs that contain thousands of genes. However, when many existing algorithms are used to handle these large-scale problems, they will encounter two important issues: low accuracy and high computational cost. To overcome these difficulties, the main goal of this study is to design an effective parallel algorithm to infer large-scale GRNs based on high-performance parallel computing environments. In this study, we proposed a novel asynchronous parallel framework to improve the accuracy and lower the time complexity of large-scale GRN inference by combining splitting technology and ordinary differential equation (ODE)-based optimization. The presented algorithm uses the sparsity and modularity of GRNs to split whole large-scale GRNs into many small-scale modular subnetworks. Through the ODE-based optimization of all subnetworks in parallel and their asynchronous communications, we can easily obtain the parameters of the whole network. To test the performance of the proposed approach, we used well-known benchmark datasets from Dialogue for Reverse Engineering Assessments and Methods challenge (DREAM), experimentally determined GRN of Escherichia coli and one published dataset that contains more than 10 thousand genes to compare the proposed approach with several popular algorithms on the same high-performance computing environments in terms of both accuracy and time complexity. The numerical results demonstrate that our parallel algorithm exhibits obvious superiority in inferring large-scale GRNs.  相似文献   

14.
With the rapid growth of the Internet and overwhelming amount of information and choices that people are confronted with, recommender systems have been developed to effectively support users’ decision-making process in the online systems. However, many recommendation algorithms suffer from the data sparsity problem, i.e. the user-object bipartite networks are so sparse that algorithms cannot accurately recommend objects for users. This data sparsity problem makes many well-known recommendation algorithms perform poorly. To solve the problem, we propose a recommendation algorithm based on the semi-local diffusion process on the user-object bipartite network. The simulation results on two sparse datasets, Amazon and Bookcross, show that our method significantly outperforms the state-of-the-art methods especially for those small-degree users. Two personalized semi-local diffusion methods are proposed which further improve the recommendation accuracy. Finally, our work indicates that sparse online systems are essentially different from the dense online systems, so it is necessary to reexamine former algorithms and conclusions based on dense data in sparse systems.  相似文献   

15.
MOTIVATION: The increasing use of DNA microarray-based tumor gene expression profiles for cancer diagnosis requires mathematical methods with high accuracy for solving clustering, feature selection and classification problems of gene expression data. RESULTS: New algorithms are developed for solving clustering, feature selection and classification problems of gene expression data. The clustering algorithm is based on optimization techniques and allows the calculation of clusters step-by-step. This approach allows us to find as many clusters as a data set contains with respect to some tolerance. Feature selection is crucial for a gene expression database. Our feature selection algorithm is based on calculating overlaps of different genes. The database used, contains over 16 000 genes and this number is considerably reduced by feature selection. We propose a classification algorithm where each tissue sample is considered as the center of a cluster which is a ball. The results of numerical experiments confirm that the classification algorithm in combination with the feature selection algorithm perform slightly better than the published results for multi-class classifiers based on support vector machines for this data set. AVAILABILITY: Available on request from the authors.  相似文献   

16.
Multiple sequence alignments (MSAs) have become one of the most studied approaches in bioinformatics to perform other outstanding tasks such as structure prediction, biological function analysis or next-generation sequencing. However, current MSA algorithms do not always provide consistent solutions, since alignments become increasingly difficult when dealing with low similarity sequences. As widely known, these algorithms directly depend on specific features of the sequences, causing relevant influence on the alignment accuracy. Many MSA tools have been recently designed but it is not possible to know in advance which one is the most suitable for a particular set of sequences. In this work, we analyze some of the most used algorithms presented in the bibliography and their dependences on several features. A novel intelligent algorithm based on least square support vector machine is then developed to predict how accurate each alignment could be, depending on its analyzed features. This algorithm is performed with a dataset of 2180 MSAs. The proposed system first estimates the accuracy of possible alignments. The most promising methodologies are then selected in order to align each set of sequences. Since only one selected algorithm is run, the computational time is not excessively increased.  相似文献   

17.
Large-scale artificial neural networks have many redundant structures, making the network fall into the issue of local optimization and extended training time. Moreover, existing neural network topology optimization algorithms have the disadvantage of many calculations and complex network structure modeling. We propose a Dynamic Node-based neural network Structure optimization algorithm (DNS) to handle these issues. DNS consists of two steps: the generation step and the pruning step. In the generation step, the network generates hidden layers layer by layer until accuracy reaches the threshold. Then, the network uses a pruning algorithm based on Hebb’s rule or Pearson’s correlation for adaptation in the pruning step. In addition, we combine genetic algorithm to optimize DNS (GA-DNS). Experimental results show that compared with traditional neural network topology optimization algorithms, GA-DNS can generate neural networks with higher construction efficiency, lower structure complexity, and higher classification accuracy.  相似文献   

18.
In statistical mechanics, the equilibrium properties of a physical system of particles can be calculated as the statistical average over accessible microstates of the system. In general, these calculations are computationally intractable since they involve summations over an exponentially large number of microstates. Clustering algorithms are one of the methods used to numerically approximate these sums. The most basic clustering algorithms first sub-divide the system into a set of smaller subsets (clusters). Then, interactions between particles within each cluster are treated exactly, while all interactions between different clusters are ignored. These smaller clusters have far fewer microstates, making the summation over these microstates, tractable. These algorithms have been previously used for biomolecular computations, but remain relatively unexplored in this context. Presented here, is a theoretical analysis of the error and computational complexity for the two most basic clustering algorithms that were previously applied in the context of biomolecular electrostatics. We derive a tight, computationally inexpensive, error bound for the equilibrium state of a particle computed via these clustering algorithms. For some practical applications, it is the root mean square error, which can be significantly lower than the error bound, that may be more important. We how that there is a strong empirical relationship between error bound and root mean square error, suggesting that the error bound could be used as a computationally inexpensive metric for predicting the accuracy of clustering algorithms for practical applications. An example of error analysis for such an application-computation of average charge of ionizable amino-acids in proteins-is given, demonstrating that the clustering algorithm can be accurate enough for practical purposes.  相似文献   

19.
A pseudo-random generator is an algorithm to generate a sequence of objects determined by a truly random seed which is not truly random. It has been widely used in many applications, such as cryptography and simulations. In this article, we examine current popular machine learning algorithms with various on-line algorithms for pseudo-random generated data in order to find out which machine learning approach is more suitable for this kind of data for prediction based on on-line algorithms. To further improve the prediction performance, we propose a novel sample weighted algorithm that takes generalization errors in each iteration into account. We perform intensive evaluation on real Baccarat data generated by Casino machines and random number generated by a popular Java program, which are two typical examples of pseudo-random generated data. The experimental results show that support vector machine and k-nearest neighbors have better performance than others with and without sample weighted algorithm in the evaluation data set.  相似文献   

20.
Protein identification using mass spectrometry is an indispensable computational tool in the life sciences. A dramatic increase in the use of proteomic strategies to understand the biology of living systems generates an ongoing need for more effective, efficient, and accurate computational methods for protein identification. A wide range of computational methods, each with various implementations, are available to complement different proteomic approaches. A solid knowledge of the range of algorithms available and, more critically, the accuracy and effectiveness of these techniques is essential to ensure as many of the proteins as possible, within any particular experiment, are correctly identified. Here, we undertake a systematic review of the currently available methods and algorithms for interpreting, managing, and analyzing biological data associated with protein identification. We summarize the advances in computational solutions as they have responded to corresponding advances in mass spectrometry hardware. The evolution of scoring algorithms and metrics for automated protein identification are also discussed with a focus on the relative performance of different techniques. We also consider the relative advantages and limitations of different techniques in particular biological contexts. Finally, we present our perspective on future developments in the area of computational protein identification by considering the most recent literature on new and promising approaches to the problem as well as identifying areas yet to be explored and the potential application of methods from other areas of computational biology.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号