期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

Rapid initial clustering of large data sets

Hugh G. Gauch Jr. 《Plant Ecology》1980,42(1-3):103-111

Summary Multivariate analysis of plant community data has three goals: summarization of redundancy, identification of outliers, and elueidation of relationships. The first two are handled conveniently by initial fast clustering, and the third by subsequent ordination and hierarchical clustering, and perhaps table arrangement.Initial clustering algorithms should achieve withincluster homogeneity and require minimal computer resources. However, algorithmic uniqueness and a hierarchy are not needed. Computing time should be proportional to the amount of data, with no higher dependencies on the number of samples. A method is presented here meeting these requirements, called composite clustering and implemented in a FORTRAN program called COMPCLUS. The computer time required for COMPCLUS clustering is on the order of the time required merely to read the data, regardless of the number of samples.Several large field data sets were analyzed effectively by using COMPCLUS to reduce redundancy and identify outliers, and then ordinating the resulting composite clusters by detrended correspondence analysis (DECORANA). Various clusterings of the same data set can be compared using a percent mutual matches (PMM) index, and a matrix of such values can be ordinated for simultaneous comparison of a number of clusterings.This paper benefited at many points from discussions with Mark O. Hill and Robert H. Whittaker. Mark Hill suggested condensed data storage. This work was done under a National Science Foundation grant to Robert Whittaker. I also appreciate technical assistance from Timothy F. Mason and Steven B. Singer. 相似文献

2.

Multivariate data evaluation V. Analysis of time-series (unifactorial model)

P P Mager 《Gegenbaurs morphologisches Jahrbuch》1974,120(4):485-491

相似文献

3.

Multidimensional scaling for large genomic data sets 总被引：1，自引：0，他引：1

Jengnan Tzeng Henry Horng-Shing Lu Wen-Hsiung Li 《BMC bioinformatics》2008,9(1):179

Background

Multi-dimensional scaling (MDS) is aimed to represent high dimensional data in a low dimensional space with preservation of the similarities between data points. This reduction in dimensionality is crucial for analyzing and revealing the genuine structure hidden in the data. For noisy data, dimension reduction can effectively reduce the effect of noise on the embedded structure. For large data set, dimension reduction can effectively reduce information retrieval complexity. Thus, MDS techniques are used in many applications of data mining and gene network research. However, although there have been a number of studies that applied MDS techniques to genomics research, the number of analyzed data points was restricted by the high computational complexity of MDS. In general, a non-metric MDS method is faster than a metric MDS, but it does not preserve the true relationships. The computational complexity of most metric MDS methods is over O(N ² ), so that it is difficult to process a data set of a large number of genes N, such as in the case of whole genome microarray data. 相似文献

4.

PRAP-computation of Bremer support for large data sets

Müller K 《Molecular phylogenetics and evolution》2004,31(2):780-782

相似文献

5.

Analyzing large data sets: rbcL 500 revisited

Rice KA Donoghue MJ Olmstead RG 《Systematic biology》1997,46(3):554-563

相似文献

6.

Uncovering hidden phylogenetic consensus in large data sets

Pattengale ND Aberer AJ Swenson KM Stamatakis A Moret BM 《IEEE/ACM transactions on computational biology and bioinformatics / IEEE, ACM》2011,8(4):902-911

Many of the steps in phylogenetic reconstruction can be confounded by “rogue” taxa—taxa that cannot be placed with assurance anywhere within the tree, indeed, whose location within the tree varies with almost any choice of algorithm or parameters. Phylogenetic consensus methods, in particular, are known to suffer from this problem. In this paper, we provide a novel framework to define and identify rogue taxa. In this framework, we formulate a bicriterion optimization problem, the relative information criterion, that models the net increase in useful information present in the consensus tree when certain taxa are removed from the input data. We also provide an effective greedy heuristic to identify a subset of rogue taxa and use this heuristic in a series of experiments, with both pathological examples from the literature and a collection of large biological data sets. As the presence of rogue taxa in a set of bootstrap replicates can lead to deceivingly poor support values, we propose a procedure to recompute support values in light of the rogue taxa identified by our algorithm; applying this procedure to our biological data sets caused a large number of edges to move from “unsupported” to “supported” status, indicating that many existing phylogenies should be recomputed and reevaluated to reduce any inaccuracies introduced by rogue taxa. We also discuss the implementation issues encountered while integrating our algorithm into RAxML v7.2.7, particularly those dealing with scaling up the analyses. This integration enables practitioners to benefit from our algorithm in the analysis of very large data sets (up to 2,500 taxa and 10,000 trees, although we present the results of even larger analyses). 相似文献

7.

Multivariate data processing. VII. Analysis of time series (multifactorial model)

P P Mager 《Gegenbaurs morphologisches Jahrbuch》1974,120(6):791-799

相似文献

8.

Population Level Inference for Multivariate MEG Analysis

Anna Jafarpour Gareth Barnes Lluis Fuentemilla Emrah Duzel Will D. Penny 《PloS one》2013,8(8)

Multivariate analysis is a very general and powerful technique for analysing Magnetoencephalography (MEG) data. An outstanding problem however is how to make inferences that are consistent over a group of subjects as to whether there are condition-specific differences in data features, and what are those features that maximise these differences. Here we propose a solution based on Canonical Variates Analysis (CVA) model scoring at the subject level and random effects Bayesian model selection at the group level. We apply this approach to beamformer reconstructed MEG data in source space. CVA estimates those multivariate patterns of activation that correlate most highly with the experimental design; the order of a CVA model is then determined by the number of significant canonical vectors. Random effects Bayesian model comparison then provides machinery for inferring the optimal order over the group of subjects. Absence of a multivariate dependence is indicated by the null model being the most likely. This approach can also be applied to CVA models with a fixed number of canonical vectors but supplied with different feature sets. We illustrate the method by identifying feature sets based on variable-dimension MEG power spectra in the primary visual cortex and fusiform gyrus that are maximally discriminative of data epochs before versus after visual stimulation. 相似文献

9.

Two-step vegetation analysis based on very large data sets

Eddy Van der Maarel Ileana Espejel Patricia Moreno-Casasola 《Plant Ecology》1987,68(3):139-143

A two-step method for the classification of very large phytosociological data sets is demonstrated. Stratification of the set is suggested either by area in the case of a large and geographically heterogeneous region, or by vegetation type in the case of a set covering all the plant communities of an area. First, cluster analysis is performed on each subset. The resulting basic clusters are summarized by calculating a ‘synoptic coverabundance value’ for each species in each cluster. All basic clusters are then subjected to the same procedure. Second order clusters are interpreted as community types. The synoptic value proposed reflects both frequency and average cover-abundance. It is emphasized that a species should have a high frequency to be used as a diagnostic species. The method is demonstrated with a set of 1138 relevés and 250 species of coastal sand dune vegetation in Yucatan treated with the programs TWINSPAN and TABORD. Some problems and perspectives of the approach are discussed in the light of hierarchy theory and classification theory. 相似文献

10.

Fitting semiparametric random effects models to large data sets

Pennell ML Dunson DB 《Biostatistics (Oxford, England)》2007,8(4):821-834

For large data sets, it can be difficult or impossible to fit models with random effects using standard algorithms due to memory limitations or high computational burdens. In addition, it would be advantageous to use the abundant information to relax assumptions, such as normality of random effects. Motivated by data from an epidemiologic study of childhood growth, we propose a 2-stage method for fitting semiparametric random effects models to longitudinal data with many subjects. In the first stage, we use a multivariate clustering method to identify G相似文献

11.

Biostratigraphical dating of Cretaceous coral communities using large data sets

Hannes Löser 《Pal?ontologische Zeitschrift》2002,76(1):75-81

Habitats of hermatypic corals are shallow and turbulent marine environments that often lack biostratigraphic index fossils. For that reason many Cretaceous coral faunas are imprecisely dated or dated only on the basis of comparisons with other coral faunas. Using a large database on the taxonomy, stratigraphical and geographical distribution of corals in the Cretaceous, a method is proposed that will make it possible to specify the stratigraphical age of coral associations on the basis of their specific composition. In this process the stratigraphical range of the species (calculated before from well-dated faunas) is summarized and a probable age of the association proposed. The method does not only help to assess the biostratigraphical age of a fauna, but may also indicate whether a fauna represents an original composition or is a mixed association derived from reworked horizons or olistoliths. The method can be applied to any other group of organisms, provided that the essential data for a comparison are available. 相似文献

12.

Efficient clustering of large EST data sets on parallel computers

下载免费PDF全文

Kalyanaraman A Aluru S Kothari S Brendel V 《Nucleic acids research》2003,31(11):2963-2974

Clustering expressed sequence tags (ESTs) is a powerful strategy for gene identification, gene expression studies and identifying important genetic variations such as single nucleotide polymorphisms. To enable fast clustering of large-scale EST data, we developed PaCE (for Parallel Clustering of ESTs), a software program for EST clustering on parallel computers. In this paper, we report on the design and development of PaCE and its evaluation using Arabidopsis ESTs. The novel features of our approach include: (i) design of memory efficient algorithms to reduce the memory required to linear in the size of the input, (ii) a combination of algorithmic techniques to reduce the computational work without sacrificing the quality of clustering, and (iii) use of parallel processing to reduce run-time and facilitate clustering of larger data sets. Using a combination of these techniques, we report the clustering of 168 200 Arabidopsis ESTs in 15 min on an IBM xSeries cluster with 30 dual-processor nodes. We also clustered 327 632 rat ESTs in 47 min and 420 694 Triticum aestivum ESTs in 3 h and 15 min. We demonstrate the quality of our software using benchmark Arabidopsis EST data, and by comparing it with CAP3, a software widely used for EST assembly. Our software allows clustering of much larger EST data sets than is possible with current software. Because of its speed, it also facilitates multiple runs with different parameters, providing biologists a tool to better analyze EST sequence data. Using PaCE, we clustered EST data from 23 plant species and the results are available at the PlantGDB website. 相似文献

13.

Obtaining maximal concatenated phylogenetic data sets from large sequence databases 总被引：2，自引：0，他引：2

Sanderson MJ Driskell AC Ree RH Eulenstein O Langley S 《Molecular biology and evolution》2003,20(7):1036-1042

To improve the accuracy of tree reconstruction, phylogeneticists are extracting increasingly large multigene data sets from sequence databases. Determining whether a database contains at least k genes sampled from at least m species is an NP-complete problem. However, the skewed distribution of sequences in these databases permits all such data sets to be obtained in reasonable computing times even for large numbers of sequences. We developed an exact algorithm for obtaining the largest multigene data sets from a collection of sequences. The algorithm was then tested on a set of 100,000 protein sequences of green plants and used to identify the largest multigene ortholog data sets having at least 3 genes and 6 species. The distribution of sizes of these data sets forms a hollow curve, and the largest are surprisingly small, ranging from 62 genes by 6 species, to 3 genes by 65 species, with more symmetrical data sets of around 15 taxa by 15 genes. These upper bounds to sequence concatenation have important implications for building the tree of life from large sequence databases. 相似文献

14.

Management and multivariate analysis of large data sets in vegetation research

Otto Wildi 《Plant Ecology》1980,42(1-3):175-180

相似文献

15.

Comparative performance of supertree algorithms in large data sets using the soapberry family (Sapindaceae) as a case study

Buerki S Forest F Salamin N Alvarez N 《Systematic biology》2011,60(1):32-44

For the last 2 decades, supertree reconstruction has been an active field of research and has seen the development of a large number of major algorithms. Because of the growing popularity of the supertree methods, it has become necessary to evaluate the performance of these algorithms to determine which are the best options (especially with regard to the supermatrix approach that is widely used). In this study, seven of the most commonly used supertree methods are investigated by using a large empirical data set (in terms of number of taxa and molecular markers) from the worldwide flowering plant family Sapindaceae. Supertree methods were evaluated using several criteria: similarity of the supertrees with the input trees, similarity between the supertrees and the total evidence tree, level of resolution of the supertree and computational time required by the algorithm. Additional analyses were also conducted on a reduced data set to test if the performance levels were affected by the heuristic searches rather than the algorithms themselves. Based on our results, two main groups of supertree methods were identified: on one hand, the matrix representation with parsimony (MRP), MinFlip, and MinCut methods performed well according to our criteria, whereas the average consensus, split fit, and most similar supertree methods showed a poorer performance or at least did not behave the same way as the total evidence tree. Results for the super distance matrix, that is, the most recent approach tested here, were promising with at least one derived method performing as well as MRP, MinFlip, and MinCut. The output of each method was only slightly improved when applied to the reduced data set, suggesting a correct behavior of the heuristic searches and a relatively low sensitivity of the algorithms to data set sizes and missing data. Results also showed that the MRP analyses could reach a high level of quality even when using a simple heuristic search strategy, with the exception of MRP with Purvis coding scheme and reversible parsimony. The future of supertrees lies in the implementation of a standardized heuristic search for all methods and the increase in computing power to handle large data sets. The latter would prove to be particularly useful for promising approaches such as the maximum quartet fit method that yet requires substantial computing power. 相似文献

16.

High-throughput film-densitometry: an efficient approach to generate large data sets

Typke D Nordmeyer RA Jones A Lee J Avila-Sakar A Downing KH Glaeser RM 《Journal of structural biology》2005,149(1):17-29

A film-handling machine (robot) has been built which can, in conjunction with a commercially available film densitometer, exchange and digitize over 300 electron micrographs per day. Implementation of robotic film handling effectively eliminates the delay and tedium associated with digitizing images when data are initially recorded on photographic film. The modulation transfer function (MTF) of the commercially available densitometer is significantly worse than that of a high-end, scientific microdensitometer. Nevertheless, its signal-to-noise ratio (S/N) is quite excellent, allowing substantial restoration of the output to "near-to-perfect" performance. Due to the large area of the standard electron microscope film that can be digitized by the commercial densitometer (up to 10,000 x 13,680 pixels with an appropriately coded holder), automated film digitization offers a fast and inexpensive alternative to high-end CCD cameras as a means of acquiring large amounts of image data in electron microscopy. 相似文献

17.

STEME: efficient EM to find motifs in large data sets

Reid JE Wernisch L 《Nucleic acids research》2011,39(18):e126

MEME and many other popular motif finders use the expectation-maximization (EM) algorithm to optimize their parameters. Unfortunately, the running time of EM is linear in the length of the input sequences. This can prohibit its application to data sets of the size commonly generated by high-throughput biological techniques. A suffix tree is a data structure that can efficiently index a set of sequences. We describe an algorithm, Suffix Tree EM for Motif Elicitation (STEME), that approximates EM using suffix trees. To the best of our knowledge, this is the first application of suffix trees to EM. We provide an analysis of the expected running time of the algorithm and demonstrate that STEME runs an order of magnitude more quickly than the implementation of EM used by MEME. We give theoretical bounds for the quality of the approximation and show that, in practice, the approximation has a negligible effect on the outcome. We provide an open source implementation of the algorithm that we hope will be used to speed up existing and future motif search algorithms. 相似文献

18.

Resolution of phylogenetic conflict in large data sets by increased taxon sampling 总被引：9，自引：0，他引：9

Hedtke SM Townsend TM Hillis DM 《Systematic biology》2006,55(3):522-529

相似文献

19.

Extraction of tacit knowledge from large ADME data sets via pairwise analysis

Keefer CE Chang G Kauffman GW 《Bioorganic & medicinal chemistry》2011,19(12):3739-3749

Pharmaceutical companies routinely collect data across multiple projects for common ADME endpoints. Although at the time of collection the data is intended for use in decision making within a specific project, knowledge can be gained by data mining the entire cross-project data set for patterns of structure-activity relationships (SAR) that may be applied to any project. One such data mining method is pairwise analysis. This method has the advantage of being able to identify small structural changes that lead to significant changes in activity. In this paper, we describe the process for full pairwise analysis of our high-throughput ADME assays routinely used for compound discovery efforts at Pfizer (microsomal clearance, passive membrane permeability, P-gp efflux, and lipophilicity). We also describe multiple strategies for the application of these transforms in a prospective manner during compound design. Finally, a detailed analysis of the activity patterns in pairs of compounds that share the same molecular transformation reveals multiple types of transforms from an SAR perspective. These include bioisosteres, additives, multiplicatives, and a type we call switches as they act to either turn on or turn off an activity. 相似文献

20.

Analysis of the real EADGENE data set: Multivariate approaches and post analysis (Open Access publication)

《遗传、选种与进化》2007,39(6):651-668

The aim of this paper was to describe, and when possible compare, the multivariate methods used by the participants in the EADGENE WP1.4 workshop. The first approach was for class discovery and class prediction using evidence from the data at hand. Several teams used hierarchical clustering (HC) or principal component analysis (PCA) to identify groups of differentially expressed genes with a similar expression pattern over time points and infective agent (E. coli or S. aureus). The main result from these analyses was that HC and PCA were able to separate tissue samples taken at 24 h following E. coli infection from the other samples. The second approach identified groups of differentially co-expressed genes, by identifying clusters of genes highly correlated when animals were infected with E. coli but not correlated more than expected by chance when the infective pathogen was S. aureus. The third approach looked at differential expression of predefined gene sets. Gene sets were defined based on information retrieved from biological databases such as Gene Ontology. Based on these annotation sources the teams used either the GlobalTest or the Fisher exact test to identify differentially expressed gene sets. The main result from these analyses was that gene sets involved in immune defence responses were differentially expressed. 相似文献