Similar articles
20 similar articles found (search time: 15 ms)
1.
In this paper, we describe a new brute force algorithm for building the k-Nearest Neighbor Graph (k-NNG). The k-NNG algorithm has many applications in areas such as machine learning, bio-informatics, and clustering analysis. While there are very efficient algorithms for data of low dimensions, for high dimensional data the brute force search is the best algorithm. There are two main parts to the algorithm: the first part is finding the distances between the input vectors, which may be formulated as a matrix multiplication problem; the second is the selection of the k-NNs for each of the query vectors. For the second part, we describe a novel graphics processing unit (GPU)-based multi-select algorithm based on quick sort. Our optimization makes clever use of warp voting functions available on the latest GPUs along with user-controlled cache. Benchmarks show significant improvement over state-of-the-art implementations of the k-NN search on GPUs.
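The two parts named in this abstract (pairwise distances, then selection of the k nearest) can be illustrated with a minimal CPU sketch. It is not the authors' GPU implementation: the distance step uses the Gram-matrix identity that underlies the matrix-multiplication formulation, and the selection step swaps in `heapq.nsmallest` for the quick-sort-based multi-select. The function names and the sample points are hypothetical.

```python
import heapq

def squared_distances(points):
    """Pairwise squared Euclidean distances via
    ||x - y||^2 = ||x||^2 + ||y||^2 - 2 x.y (the matrix-multiplication form)."""
    norms = [sum(v * v for v in p) for p in points]
    n = len(points)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            dot = sum(a * b for a, b in zip(points[i], points[j]))
            dist[i][j] = norms[i] + norms[j] - 2.0 * dot
    return dist

def knn_graph(points, k):
    """For each point index, return the indices of its k nearest other points."""
    dist = squared_distances(points)
    graph = []
    for i, row in enumerate(dist):
        candidates = [(d, j) for j, d in enumerate(row) if j != i]
        # Stand-in for the paper's GPU multi-select: a heap-based top-k.
        nearest = heapq.nsmallest(k, candidates)
        graph.append([j for _, j in nearest])
    return graph

# Hypothetical sample data.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 3.0), (5.0, 5.0)]
graph = knn_graph(points, k=2)
```

Here `graph[i]` lists the neighbours of point `i` from nearest to farthest; the brute-force cost is quadratic in the number of points, which is why the abstract moves both steps to the GPU.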

2.
Litter decomposition rate (k) is typically estimated from proportional litter mass loss data using models that assume constant, normally distributed errors. However, such data often show non-normal errors with reduced variance near bounds (0 or 1), potentially leading to biased k estimates. We compared the performance of nonlinear regression using the beta distribution, which is well-suited to bounded data and this type of heteroscedasticity, to standard nonlinear regression (normal errors) on simulated and real litter decomposition data. Although the beta model often provided better fits to the simulated data (based on the corrected Akaike Information Criterion, AICc), standard nonlinear regression was robust to violation of homoscedasticity and gave k estimates that were as accurate as, or more accurate than, those from nonlinear beta regression. Our simulation results also suggest that k estimates will be most accurate when study length captures mid to late stage decomposition (50–80% mass loss) and the number of measurements through time is ≥5. Regression method and data transformation choices had the smallest impact on k estimates during mid and late stage decomposition. Estimates of k were more variable among methods and generally less accurate during early and end stage decomposition. With real data, neither model was predominantly best; in most cases the models were indistinguishable based on AICc, and gave similar k estimates. However, when decomposition rates were high, normal and beta model k estimates often diverged substantially. Therefore, we recommend a pragmatic approach where both models are compared and the best is selected for a given data set. Alternatively, both models may be used via model averaging to develop weighted parameter estimates. We provide code to perform nonlinear beta regression with freely available software.

3.
4.

Background

The analysis of biological networks has become a major challenge due to the recent development of high-throughput techniques that are rapidly producing very large data sets. The exploding volumes of biological data demand extreme computational power and special computing facilities (i.e. super-computers). An inexpensive solution, such as General Purpose computation based on Graphics Processing Units (GPGPU), can be adapted to tackle this challenge, but the limited internal memory of the device can pose a new problem of scalability. Efficient data and computational parallelism with partitioning is required to provide a fast and scalable solution to this problem.

Results

We propose an efficient parallel formulation of the k-Nearest Neighbour (kNN) search problem, which is a popular method for classifying objects in several fields of research, such as pattern recognition, machine learning and bioinformatics. Although the kNN search is simple and straightforward, its performance degrades dramatically for large data sets, since the task is computationally intensive. The proposed approach is not only fast but also scalable to large-scale instances. Based on our approach, we implemented a software tool GPU-FS-kNN (GPU-based Fast and Scalable k-Nearest Neighbour) for CUDA enabled GPUs. The basic approach is simple and adaptable to other available GPU architectures. We observed speed-ups of 50–60 times compared with CPU implementation on a well-known breast microarray study and its associated data sets.

Conclusion

Our GPU-based Fast and Scalable k-Nearest Neighbour search technique (GPU-FS-kNN) provides a significant performance improvement for nearest neighbour computation in large-scale networks. The source code and the software tool are available under the GNU General Public License (GPL) at https://sourceforge.net/p/gpufsknn/.
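The partitioning idea mentioned in the Background (splitting the data so that each piece fits in limited device memory) can be sketched on the CPU as below. This is only an illustration under assumptions, not the GPU-FS-kNN code: the reference set is scanned in fixed-size chunks and a running top-k list per query is merged after each chunk. The names `chunked_knn` and `chunk_size` are hypothetical.

```python
import heapq

def chunked_knn(queries, references, k, chunk_size):
    """k nearest references per query, scanning references in fixed-size chunks."""
    # Per-query running list of (squared distance, reference index).
    best = [[] for _ in queries]
    for start in range(0, len(references), chunk_size):
        chunk = references[start:start + chunk_size]
        for qi, q in enumerate(queries):
            partial = [
                (sum((a - b) ** 2 for a, b in zip(q, r)), start + offset)
                for offset, r in enumerate(chunk)
            ]
            # Merge this chunk's candidates with the running top-k and keep k.
            best[qi] = heapq.nsmallest(k, best[qi] + partial)
    return [[idx for _, idx in hits] for hits in best]

# Hypothetical one-dimensional sample data.
refs = [(0.0,), (10.0,), (2.0,), (7.0,), (3.0,)]
result = chunked_knn([(2.5,), (9.0,)], refs, k=2, chunk_size=2)
```

Only one chunk of distances is held at a time, so peak memory depends on `chunk_size` rather than on the full size of the reference set.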

5.

Background  

Metagenomics, or the sequencing and analysis of collective genomes (metagenomes) of microorganisms isolated from an environment, promises direct access to the "unculturable majority". This emerging field offers the potential to lay a solid basis for our understanding of the entire living world. However, taxonomic classification, an essential task in the analysis of metagenomic data sets, is still far from being solved. We present a novel strategy to predict the taxonomic origin of environmental genomic fragments. The proposed classifier combines the idea of the k-nearest neighbor with strategies from kernel-based learning.

6.
Integration of molecular genetic techniques and geometric morphometrics represents a valuable tool in the resolution of taxonomic uncertainty and the identification of significant units for conservation. We combined mitochondrial DNA cytochrome c oxidase subunit II gene sequence data and geometric morphometric analysis to examine taxonomic status and identify units for conservation in four species of the hypogean beetle Duvalius (Coleoptera, Trechinae) using mainly museum specimens collected in central Italy. Previous taxonomic studies based on morphological traits described several subspecies often inhabiting geographically distinct caves. Phylogenetic analysis identified two well supported monophyletic lineages and a number of different clades with relatively small genetic differences, suggesting a short divergence time in line with known geological history of the study area. Geometric morphometrics, on the other hand, recovered a high level of distinctiveness among specimens. Neither the genetic nor the morphometric analyses entirely corroborated former taxonomic nomenclature, suggesting possible rearrangements and the definition of evolutionarily significant units. Beetles of the genus Duvalius are protected by regional laws and the majority of taxa considered in this study inhabit caves located outside protected areas. Our study advocates the importance of devoting protection efforts to networks of cave ecosystems rather than single locations or species.

7.
Annals of Botany, 1993, 71(3): 257-277
Four distance coefficients are compared on four data sets composed of samples coming from western European populations of the genera Dactylorhiza, Orchis and Epipactis (Orchidaceae). The performance of the distance coefficients is evaluated through: (a) the quality of clusters obtained by five classical methods (as compared to a priori classification), (b) the Mantel statistic with respect to an a priori distance matrix resulting from previous knowledge, (c) the result obtained with the k-means method, and (d) principal coordinate diagrams. It appears that: (a) the Mahalanobis distance based on the pooled dispersion matrix performs best on the whole; (b) a distance based on the recently developed Common Principal Component model, used with a log transformation, also provides useful information and performs best on the largest data set; (c) the Gölz and Reinhard taxonomic distance, widely used among orchidologists, is attractive for its simplicity, good performance and the valuable information it provides, despite its theoretical shortcomings. The results obtained for the Dactylorhiza samples are briefly discussed from a taxonomic perspective, especially for samples whose identification was in doubt.

8.
Oligomers of length k, or k-mers, are convenient and widely used features for modeling the properties and functions of DNA and protein sequences. However, k-mers suffer from the inherent limitation that if the parameter k is increased to resolve longer features, the probability of observing any specific k-mer becomes very small, and k-mer counts approach a binary variable, with most k-mers absent and a few present once. Thus, any statistical learning approach using k-mers as features becomes susceptible to noisy training set k-mer frequencies once k becomes large. To address this problem, we introduce alternative feature sets using gapped k-mers, a new classifier, gkm-SVM, and a general method for robust estimation of k-mer frequencies. To make the method applicable to large-scale genome-wide applications, we develop an efficient tree data structure for computing the kernel matrix. We show that compared to our original kmer-SVM and alternative approaches, our gkm-SVM predicts functional genomic regulatory elements and tissue-specific enhancers with significantly improved accuracy, increasing the precision by up to a factor of two. We then show that gkm-SVM consistently outperforms kmer-SVM on human ENCODE ChIP-seq datasets, and further demonstrate the general utility of our method using a Naïve-Bayes classifier. Although developed for regulatory sequence analysis, these methods can be applied to any sequence classification problem.
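To make the contrast between plain and gapped k-mers concrete, here is a minimal counting sketch. It assumes a gapped k-mer is a fixed-length window in which only some positions are informative and the rest are wildcards (written `.` here); the abstract does not spell out the definition, so the parameter names `window` and `informative` are hypothetical, and neither the tree data structure nor the SVM is reproduced.

```python
from collections import Counter
from itertools import combinations

def kmer_counts(seq, k):
    """Count every contiguous k-mer in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def gapped_kmer_counts(seq, window, informative):
    """Count gapped k-mers: each window of length `window` contributes one
    pattern per choice of `informative` kept positions; others become '.'."""
    counts = Counter()
    for i in range(len(seq) - window + 1):
        word = seq[i:i + window]
        for keep in combinations(range(window), informative):
            kept = set(keep)
            pattern = "".join(c if j in kept else "." for j, c in enumerate(word))
            counts[pattern] += 1
    return counts

# Hypothetical toy sequence.
seq = "ACGTACGT"
plain = kmer_counts(seq, 4)
gapped = gapped_kmer_counts(seq, window=4, informative=3)
```

Because several distinct windows can map to the same gapped pattern, gapped counts are less sparse than plain k-mer counts at the same window length, which is the motivation the abstract gives for using them as features.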

9.
Based on the well-known k-mer model, we propose a k-mer natural vector model for representing a genetic sequence based on the numbers and distributions of k-mers in the sequence. We show that there exists a one-to-one correspondence between a genetic sequence and its associated k-mer natural vector. The k-mer natural vector method can be easily and quickly used to perform phylogenetic analysis of genetic sequences without requiring evolutionary models or human intervention. Whole or partial genomes can be handled more effectively with our proposed method. It is applied to the phylogenetic analysis of genetic sequences, and the results obtained demonstrate that the k-mer natural vector method is a very powerful tool for analysing and annotating genetic sequences and determining evolutionary relationships, in terms of both accuracy and efficiency.
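A rough sketch of what "numbers and distributions of k-mers" might look like in code follows. For each k-mer it records the count, the mean start position, and a normalized second central moment of the positions. The 1-based positions and the normalization by count times number of windows are assumptions made for illustration, not details given in the abstract, and a vector truncated at the second moment does not by itself have the one-to-one property the authors prove.

```python
from collections import defaultdict

def kmer_natural_vector(seq, k):
    """Map each k-mer to (count, mean position, normalized second moment)."""
    positions = defaultdict(list)
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i + 1)  # 1-based start positions (assumed)
    total = len(seq) - k + 1  # number of k-mer windows in the sequence
    vector = {}
    for kmer, pos in positions.items():
        n = len(pos)
        mean = sum(pos) / n
        # Assumed normalization: divide by count times number of windows.
        second = sum((p - mean) ** 2 for p in pos) / (n * total)
        vector[kmer] = (n, mean, second)
    return vector

# Hypothetical toy sequence.
vec = kmer_natural_vector("ACACGT", 2)
```

Two sequences can then be compared by a distance between their vectors (for example Euclidean distance over a fixed ordering of all possible k-mers), which is what makes alignment-free phylogenetic analysis possible.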

10.

Background

A basic task in bioinformatics is the counting of k-mers in genome sequences. Existing k-mer counting tools are most often optimized for small k < 32 and suffer from excessive memory resource consumption or degrading performance for large k. However, given the technology trend towards long reads of next-generation sequencers, support for large k becomes increasingly important.

Results

We present the open source k-mer counting software Gerbil, which has been designed for the efficient counting of k-mers for k ≥ 32. Our software is the result of an intensive process of algorithm engineering. It implements a two-step approach. In the first step, genome reads are loaded from disk and redistributed to temporary files. In the second step, the k-mers of each temporary file are counted via a hash table approach. In addition to its basic functionality, Gerbil can optionally use GPUs to accelerate the counting step. In a set of experiments with real-world genome data sets, we show that Gerbil is able to efficiently support both small and large k.

Conclusions

While Gerbil’s performance is comparable to existing state-of-the-art open source k-mer counting tools for small k < 32, it vastly outperforms its competitors for large k, thereby enabling new applications which require large values of k.
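The two-step approach described under Results can be sketched as follows. This is an in-memory illustration under assumptions, not Gerbil's implementation: Python lists stand in for the temporary files, a CRC32 hash of each k-mer picks the bucket, and `Counter` plays the role of the hash table. The names `partition_kmers`, `count_buckets` and `num_buckets` are hypothetical.

```python
import zlib
from collections import Counter

def partition_kmers(reads, k, num_buckets):
    """Step 1: distribute the k-mers of all reads into buckets
    (lists standing in for temporary files)."""
    buckets = [[] for _ in range(num_buckets)]
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Identical k-mers always land in the same bucket.
            buckets[zlib.crc32(kmer.encode()) % num_buckets].append(kmer)
    return buckets

def count_buckets(buckets):
    """Step 2: count each bucket independently with a hash table."""
    totals = Counter()
    for bucket in buckets:
        totals.update(Counter(bucket))
    return totals

# Hypothetical toy reads.
reads = ["ACGTAC", "GTACGT"]
counts = count_buckets(partition_kmers(reads, k=3, num_buckets=4))
```

Since every occurrence of a given k-mer ends up in one bucket, each bucket can be counted on its own with a table sized for that bucket alone, which keeps peak memory bounded regardless of the total input size.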

11.
We propose a computational method to measure and visualize interrelationships among any number of DNA sequences allowing, for example, the examination of hundreds or thousands of complete mitochondrial genomes. An "image distance" is computed for each pair of graphical representations of DNA sequences, and the distances are visualized as a Molecular Distance Map: Each point on the map represents a DNA sequence, and the spatial proximity between any two points reflects the degree of structural similarity between the corresponding sequences. The graphical representation of DNA sequences utilized, Chaos Game Representation (CGR), is genome- and species-specific and can thus act as a genomic signature. Consequently, Molecular Distance Maps could inform species identification, taxonomic classifications and, to a certain extent, evolutionary history. The image distance employed, Structural Dissimilarity Index (DSSIM), implicitly compares the occurrences of oligomers of length up to k (herein k = 9) in DNA sequences. We computed DSSIM distances for more than 5 million pairs of complete mitochondrial genomes, and used Multi-Dimensional Scaling (MDS) to obtain Molecular Distance Maps that visually display the sequence relatedness in various subsets, at different taxonomic levels. This general-purpose method does not require DNA sequence alignment and can thus be used to compare similar or vastly different DNA sequences, genomic or computer-generated, of the same or different lengths. We illustrate potential uses of this approach by applying it to several taxonomic subsets: phylum Vertebrata, (super)kingdom Protista, classes Amphibia-Insecta-Mammalia, class Amphibia, and order Primates. This analysis of an extensive dataset confirms that the oligomer composition of full mtDNA sequences can be a source of taxonomic information. 
This method also correctly finds the mtDNA sequences most closely related to that of the anatomically modern human (the Neanderthal, the Denisovan, and the chimp), and finds that the sequence most different from it in this dataset belongs to a cucumber.
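As an illustration of the Chaos Game Representation mentioned above, the sketch below uses the common construction: start at the centre of the unit square and, for each nucleotide, move halfway towards that nucleotide's corner. The corner assignment shown is a widespread convention but is an assumption here, as are the function names; the DSSIM image distance and the MDS step are not reproduced.

```python
# Assumed corner assignment for the four nucleotides.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}

def cgr_points(seq):
    """Return the CGR trajectory: one (x, y) point per nucleotide."""
    x, y = 0.5, 0.5
    points = []
    for base in seq:
        cx, cy = CORNERS[base]
        x, y = (x + cx) / 2.0, (y + cy) / 2.0  # move halfway to the corner
        points.append((x, y))
    return points

def cgr_grid(seq, k):
    """Bin CGR points into a 2^k x 2^k grid of counts. After k steps the
    cell a point falls in is determined by the last k nucleotides, so the
    first k - 1 points are skipped."""
    size = 2 ** k
    grid = [[0] * size for _ in range(size)]
    for x, y in cgr_points(seq)[k - 1:]:
        row = min(size - 1, int(y * size))
        col = min(size - 1, int(x * size))
        grid[row][col] += 1
    return grid

# Hypothetical toy sequence.
points = cgr_points("ACGT")
grid = cgr_grid("ACGT", 1)
```

Rendered as a grey-scale image, such a grid is the kind of graphical representation on which an image distance can be computed pairwise without aligning the sequences.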

12.
The current study describes the taxonomic and functional composition of metagenomic sequences obtained from a filamentous microbial mat isolated from the Comau fjord, located in the northernmost part of the Chilean Patagonia. The taxonomic composition of the microbial community showed a high proportion of members of the Gammaproteobacteria, including a high number of sequences that were recruited to the genomes of Moritella marina MP-1 and Colwellia psychrerythraea 34H, suggesting the presence of populations related to these two psychrophilic bacterial species. Functional analysis of the community indicated a high proportion of genes coding for the transport and metabolism of amino acids, as well as for energy production. Among the energy production functions, we found protein-coding genes for sulfate and nitrate reduction, both processes associated with Gammaproteobacteria-related sequences. This report provides the first examination of the taxonomic composition and genetic diversity associated with these conspicuous microbial mat communities and provides a framework for future microbial studies in the Comau fjord.

13.
Multivariate analysis of leaf shape, anatomy, and Fourier-transform infrared (FTIR) data of 27 Camellia species with secretory structures (sects. Archecamellia, Stereocarpus, Furfuracea, Chrysantha), together with three species from related genera, Gordonia and Tutcheria (Theaceae), was conducted to clarify some taxonomic problems. Our results show that crystals occurring in adaxial epidermal cells are observed for the first time in Chrysantha species, and the secretory structures described are in fact cork warts. Furthermore, we introduce a form coefficient (Fc) to assess the shape of epidermal cells, since they are usually irregular and difficult to describe. Pearson correlation analysis indicates that Fc is useful to assess epidermal cell shape. Principal component analysis (PCA) of leaf shape indicates that two species from section Archecamellia and two species from section Stereocarpus are significantly different from those in section Furfuracea. Cluster analysis of FTIR data visualizes the degree of affinity among the 30 species examined here, which is consistent with the cluster analysis (CA) of anatomical data, as illustrated in the dendrogram. Therefore, our study indicates that integrated leaf characters based on leaf shape, anatomy, and FTIR data are useful in the taxonomic treatment of Camellia species with secretory structures. Taxonomic controversies among the Camellia species with secretory structures could thus be successfully addressed using only a few intact or small portions of leaves. Moreover, our results tend to support the view that Chrysantha species should not be merged into section Archecamellia, and that section Heterogenea should not be recognized in taxonomic treatments of Camellia species with secretory structures.

14.
The purpose of this study is to investigate the ability of multivariate analysis of dynamic contrast-enhanced magnetic resonance imaging (DCE-MRI) and diffusion-weighted MRI (DW-MRI) parametric maps, obtained early in the course of therapy, to predict which patients will achieve pathologic complete response (pCR) at the time of surgery. Thirty-three patients underwent DCE-MRI (to estimate Ktrans, ve, kep, and vp) and DW-MRI [to estimate the apparent diffusion coefficient (ADC)] at baseline (t1) and after the first cycle of neoadjuvant chemotherapy (t2). Four analyses were performed and evaluated using receiver-operating characteristic (ROC) analysis to test their ability to predict pCR. First, a region of interest (ROI) level analysis input the mean Ktrans, ve, kep, vp, and ADC into the logistic model. Second, a voxel-based analysis was performed in which a longitudinal registration algorithm aligned serial parameters to a common space for each patient. The voxels with an increase in kep, Ktrans, and vp or a decrease in ADC or ve were then detected and input into the regression model. In the third analysis, both the ROI and voxel level data were included in the regression model. In the fourth analysis, the ROI and voxel level data were combined with selected clinical data in the regression model. The overfitting-corrected area under the ROC curve (AUC) with 95% confidence intervals (CIs) was then calculated to evaluate the performance of the four analyses. The combination of kep, ADC ROI, and voxel level data achieved the best AUC (95% CI) of 0.87 (0.77–0.98).

15.
Genomics, 2019, 111(6): 1298-1305
Based on the k-mer model for protein sequences, a novel k-mer natural vector method is proposed to characterize the features of k-mers in a protein sequence, in which the numbers and distributions of k-mers are considered. It is proved that the relationship between a protein sequence and its k-mer natural vector is one-to-one. Phylogenetic analysis of protein sequences can therefore be easily performed without requiring evolutionary models or human intervention. In addition, there is no criterion for choosing a suitable k, and k has a great influence on the results obtained as well as on computational complexity. In this paper, a compound k-mer natural vector is utilized to quantify each protein sequence. The results obtained from phylogenetic analysis of three protein datasets demonstrate that our new method can precisely describe the evolutionary relationships of proteins and greatly improve computational efficiency.

16.
Comptes Rendus Palevol, 2005, 4(6-7): 517-530
Previous research indicated that ammonoid taxonomic diversity exploded after the Late Permian mass extinction, regaining pre-extinction levels by the Late Induan (Dienerian substage). From taxonomic analyses it had been inferred that ammonoids recovered rapidly, relative to other marine invertebrate groups. Complementing taxonomic metrics with morphologic and spatial data revealed more complex recovery dynamics. Morphological analysis indicated that ammonoids did not fully recover until the Spathian or Anisian. Taxonomic diversity is a poor predictor of disparity during the recovery. Spatial partitioning of taxonomic and morphological diversity revealed spatially homogeneous recovery patterns. Combining taxonomic, morphological, and spatial data refined interpretations of Triassic ammonoid recovery patterns and indicated that ecological, not intrinsic, factors were the probable control on ammonoid recovery rates. To cite this article: A.J. McGowan, C. R. Palevol 4 (2005).

17.
The initial rates and steady-state values of proton uptake by broken chloroplasts have been measured as functions of light intensity at various concentrations of chlorophyll, pyocyanine, supporting electrolyte, buffer, as well as pH and temperature. Kinetic analysis of the data shows that the rate of decay of proton gradient due to backward leakage depends on light intensity. Under steady illumination, the decay constant kL is equal to kD + mR0, where R0 is the initial rate of proton uptake which is a function of light intensity, kD is the decay constant in the dark and m is a parameter which is independent of light intensity. Treatment of chloroplasts with lysolecithin, neutral detergent, 2,4-dinitrophenol, or valinomycin in the presence of K+ increases kD without affecting m. Treatment with N,N′-dicyclohexylcarbodiimide or adenylyl imidodiphosphate under appropriate conditions decreases m without affecting kD. Treatment with glutaraldehyde makes kL independent of light intensity and hence m = 0. These results suggest that the light-dependent part (mR0) of kL is due to leakage of protons through the coupling factor (CF1-CF0) complex which can open or close depending on light intensity and that the light-independent part (kD) of the decay constant kL is due to proton leakage elsewhere.
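The relation between the decay constant and the steady-state value can be written out as a short worked form. It assumes simple first-order decay of the accumulated protons, which the abstract implies by speaking of a decay constant but does not state explicitly; the symbol H for the extent of proton uptake is introduced here only for illustration.

```latex
\begin{align}
k_L &= k_D + m R_0, \\
\frac{dH}{dt} &= R_0 - k_L H, \\
H_{\mathrm{ss}} &= \frac{R_0}{k_L} = \frac{R_0}{k_D + m R_0}.
\end{align}
```

Under this assumption the steady-state uptake would level off towards 1/m as R0 grows with light intensity, whereas with m = 0 (the glutaraldehyde case) it would stay proportional to R0.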

18.
The rapid melting of glaciers and loss of sea ice will result in changes in habitat conditions that may drive substantial changes in biodiversity. In order to bioassess the changing polar ecosystem and evaluate biological conservation, pelagic ciliate communities at different taxonomic resolutions were studied at five habitats in the Amundsen Sea during the austral summer from December 2010 to January 2011. Distinctive spatial patterns were observed in the communities among the five habitats (oceanic areas, transitional areas, polynyas, edges of glaciers, and edges of sea ice) in response to environmental variability (e.g., temperature, salinity, chlorophyll a, and nutrients). The distributions in the numbers of different taxonomic levels and of three biodiversity indices (Shannon-Wiener H′, Pielou’s J′, and Margalef D) also revealed clear spatial variability with the maximum mean species number and indices in the polynya and maximum genus and family numbers in the transitional area. Presence/absence data at taxonomic resolutions up to the family level provided sufficient information to evaluate the ecological patterns of pelagic ciliate communities and could accurately reflect habitat variations. The k-dominance curves illustrated clearly that maximum diversity occurred in the polynya at the species level and in the transitional area at the genus and family level. We suggest that the diversity at higher taxonomic resolutions should be considered more in future monitoring. Our findings provide basic data and an approach toward answering important questions about biological conservation, especially the biodiversity at various taxonomic resolutions in response to increasing climate change in polar ecosystems.
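The three biodiversity indices named above have standard textbook definitions, sketched here for a list of per-taxon abundances. The abstract does not give formulas or say which logarithm base the authors used, so the natural logarithm and the function name `diversity_indices` are assumptions.

```python
import math

def diversity_indices(abundances):
    """Return (Shannon-Wiener H', Pielou's J', Margalef D) for taxon counts."""
    counts = [c for c in abundances if c > 0]
    total = sum(counts)      # N: total number of individuals
    richness = len(counts)   # S: number of taxa present
    # H' = -sum(p_i * ln p_i), with p_i the proportion of taxon i.
    shannon = -sum((c / total) * math.log(c / total) for c in counts)
    # J' = H' / ln S (evenness); defined as 0 when only one taxon is present.
    pielou = shannon / math.log(richness) if richness > 1 else 0.0
    # D = (S - 1) / ln N (richness adjusted for sample size).
    margalef = (richness - 1) / math.log(total) if total > 1 else 0.0
    return shannon, pielou, margalef

# Hypothetical sample: four equally abundant taxa.
h, j, d = diversity_indices([10, 10, 10, 10])
```

Running the same function on counts aggregated to genus or family level gives the coarser-resolution indices that the abstract compares across habitats.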

19.
20.
Genomics, 2020, 112(3): 2233-2240
MicroRNA-like small RNAs (milRNAs), 21–22 nucleotides in length, are a type of small non-coding RNA first found in Neurospora crassa in 2010. Identifying milRNAs of species without genomic information is a difficult problem. Here, knowledge-based energy features are developed to identify milRNAs by combining the k-mer scheme with a distance-dependent pair potential. Compared with the k-mer scheme, the features developed here can alleviate the curse of dimensionality inherent in the k-mer scheme once k becomes large. In addition, milRNApredictor built on the novel features performs comparably to the k-mer scheme, and achieves a sensitivity of 74.21% and a specificity of 75.72% based on 10-fold cross-validation. Furthermore, for novel miRNA prediction, there is high overlap between the results from milRNApredictor and the state-of-the-art mirnovo. However, milRNApredictor is simpler to use, with reduced requirements for input data and dependencies. Taken together, milRNApredictor can be used to identify de novo fungal milRNAs and other very short small RNAs of non-model organisms.
