Similar literature
 Found 20 similar documents; search took 15 ms
1.
We introduce novel profile-based string kernels for use with support vector machines (SVMs) for the problems of protein classification and remote homology detection. These kernels use probabilistic profiles, such as those produced by the PSI-BLAST algorithm, to define position-dependent mutation neighborhoods along protein sequences for inexact matching of k-length subsequences ("k-mers") in the data. By use of an efficient data structure, the kernels are fast to compute once the profiles have been obtained. For example, the time needed to run PSI-BLAST in order to build the profiles is significantly longer than both the kernel computation time and the SVM training time. We present remote homology detection experiments based on the SCOP database where we show that profile-based string kernels used with SVM classifiers strongly outperform all recently presented supervised SVM methods. We further examine how to incorporate predicted secondary structure information into the profile kernel to obtain a small but significant performance improvement. We also show how we can use the learned SVM classifier to extract "discriminative sequence motifs"--short regions of the original profile that contribute almost all the weight of the SVM classification score--and show that these discriminative motifs correspond to meaningful structural features in the protein data. The use of PSI-BLAST profiles can be seen as a semi-supervised learning technique, since PSI-BLAST leverages unlabeled data from a large sequence database to build more informative profiles. Recently presented "cluster kernels" give general semi-supervised methods for improving SVM protein classification performance. We show that our profile kernel results also outperform cluster kernels while providing much better scalability to large datasets.  相似文献   

2.
Grain yield of the maize plant depends on the sizes, shapes, and numbers of ears and the kernels they bear. An automated pipeline that can measure these components of yield from easily‐obtained digital images is needed to advance our understanding of this globally important crop. Here we present three custom algorithms designed to compute such yield components automatically from digital images acquired by a low‐cost platform. One algorithm determines the average space each kernel occupies along the cob axis using a sliding‐window Fourier transform analysis of image intensity features. A second counts individual kernels removed from ears, including those in clusters. A third measures each kernel's major and minor axis after a Bayesian analysis of contour points identifies the kernel tip. Dimensionless ear and kernel shape traits that may interrelate yield components are measured by principal components analysis of contour point sets. Increased objectivity and speed compared to typical manual methods are achieved without loss of accuracy as evidenced by high correlations with ground truth measurements and simulated data. Millimeter‐scale differences among ear, cob, and kernel traits that ranged more than 2.5‐fold across a diverse group of inbred maize lines were resolved. This system for measuring maize ear, cob, and kernel attributes is being used by multiple research groups as an automated Web service running on community high‐throughput computing and distributed data storage infrastructure. Users may create their own workflow using the source code that is staged for download on a public repository.  相似文献   

3.
This paper mainly focuses on how to effectively and efficiently measure visual similarity for local feature based representation. Among existing methods, metrics based on Bag of Visual Word (BoV) techniques are efficient and conceptually simple, at the expense of effectiveness. By contrast, kernel based metrics are more effective, but at the cost of greater computational complexity and increased storage requirements. We show that a unified visual matching framework can be developed to encompass both BoV and kernel based metrics, in which local kernel plays an important role between feature pairs or between features and their reconstruction. Generally, local kernels are defined using Euclidean distance or its derivatives, based either explicitly or implicitly on an assumption of Gaussian noise. However, local features such as SIFT and HoG often follow a heavy-tailed distribution which tends to undermine the motivation behind Euclidean metrics. Motivated by recent advances in feature coding techniques, a novel efficient local coding based matching kernel (LCMK) method is proposed. This exploits the manifold structures in Hilbert space derived from local kernels. The proposed method combines advantages of both BoV and kernel based metrics, and achieves a linear computational complexity. This enables efficient and scalable visual matching to be performed on large scale image sets. To evaluate the effectiveness of the proposed LCMK method, we conduct extensive experiments with widely used benchmark datasets, including 15-Scenes, Caltech101/256, PASCAL VOC 2007 and 2011 datasets. Experimental results confirm the effectiveness of the relatively efficient LCMK method.  相似文献   

4.
ABSTRACT Point counts are the most frequently used technique for sampling bird populations and communities, but have well‐known limitations such as inter‐ and intraobserver errors and limited availability of expert field observers. The use of acoustic recordings to survey birds offers solutions to these limitations. We designed a Soundscape Recording System (SRS) that combines a four‐channel, discrete microphone system with a quadraphonic playback system for surveying bird communities. We compared the effectiveness of SRS and point counts for estimating species abundance, richness, and composition of riparian breeding birds in California by comparing data collected simultaneously using both methods. We used the temporal‐removal method to estimate individual bird detection probabilities and species abundances using the program MARK. Akaike's Information Criterion provided strong evidence that detection probabilities differed between the two survey methods and among the 10 most common species. The probability of detecting birds was higher when listening to SRS recordings in the laboratory than during the field survey. Additionally, SRS data demonstrated a better fit to the temporal‐removal model assumptions and yielded more reliable estimates of detection probability and abundance than point‐count data. Our results demonstrate how the perceptual constraints of observers can affect temporal detection patterns during point counts and thus influence abundance estimates derived from time‐of‐detection approaches. We used a closed‐population capture–recapture approach to calculate jackknife estimates of species richness and average species detection probabilities for SRS and point counts using the program CAPTURE. SRS and point counts had similar species richness and detection probabilities. However, the methods differed in the composition of species detected based on Jaccard's similarity index. 
Most individuals (83%) detected during point counts vocalized at least once during the survey period and were available for detection using a purely acoustic technique, such as SRS. SRS provides an effective method for surveying bird communities, particularly when most species are detected by sound. SRS can eliminate or minimize observer biases, produce permanent records of surveys, and resolve problems associated with the limited availability of expert field observers.
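The species-composition comparison above relies on Jaccard's similarity index, the number of species shared by two lists divided by the number of species found in either. A minimal sketch follows; the function name and the species lists are hypothetical illustrations, not data from the study.

```python
def jaccard_index(species_a, species_b):
    """Jaccard similarity: |A intersect B| / |A union B| for two species lists."""
    a, b = set(species_a), set(species_b)
    union = a | b
    if not union:
        # Convention chosen for this sketch: two empty lists count as identical.
        return 1.0
    return len(a & b) / len(union)


# Hypothetical detections from the two survey methods (not real data).
srs_species = {"song sparrow", "yellow warbler", "black phoebe"}
point_count_species = {"song sparrow", "yellow warbler", "tree swallow"}
print(jaccard_index(srs_species, point_count_species))
```

An index of 1 means both methods detected exactly the same species; an index of 0 means they shared none.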

5.
The time at which natural enemies colonize crop fields is an important determinant of their ability to suppress pest populations. This timing depends on the distance between source and sink habitats in the landscape. Here we estimate the time to colonization of sink habitats from a distant source habitat, using empirical mark-capture data of Diadegma semiclausum in broccoli. The data originated from experiments conducted at two locations and dispersal was quantified by suction sampling before and after a major disturbance. Three dispersal kernels were fitted to the dispersal data: a normal, a negative exponential, and a square root negative exponential kernel. These kernels are characterized by a thin, an intermediate, and a fat tail, respectively. The dispersal kernels were included in an integro-difference equation model for parasitoid population redistribution to generate estimates of time to colonization of D. semiclausum in sink habitats at distances between 100 and 2000 m from a source. We show that the three dispersal kernels receive similar support from the data, but can produce a wide range of outcomes. The estimated arrival time of 1% of the D. semiclausum population at a distance of 2000 m from the source ranges from 12 days to a length of time greatly exceeding the life span of the parasitoid. The square root negative exponential function, having the thickest tail among the tested functions, gave the fastest spread and colonization in three of the four data sets, but it gave the slowest redistribution in the fourth. In all four data sets, the rate of accumulation at the target increased with the mean dispersal distance of the fitted kernel model, irrespective of the fatness of the tail. This study underscores the importance of selecting a proper dispersal kernel for modelling spread and colonization time of organisms, and of the collection of pertinent data that enable kernel estimation and that can discriminate between different kernel shapes.
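To make the thin, intermediate, and fat tails concrete, the sketch below writes the three kernel families as one-dimensional densities and gives the probability of dispersing farther than a distance d. The one-dimensional form, the parameter names `sigma` and `a`, and the example values are assumptions for illustration; the abstract does not state the parameterization the authors fitted.

```python
import math


def normal_kernel(x, sigma):
    """Thin-tailed kernel: Gaussian density with standard deviation sigma."""
    return math.exp(-x * x / (2 * sigma * sigma)) / (sigma * math.sqrt(2 * math.pi))


def neg_exp_kernel(x, a):
    """Intermediate tail: two-sided negative exponential with scale a."""
    return math.exp(-abs(x) / a) / (2 * a)


def sqrt_neg_exp_kernel(x, a):
    """Fat tail: density proportional to exp(-sqrt(|x| / a)); 4a normalizes it."""
    return math.exp(-math.sqrt(abs(x) / a)) / (4 * a)


def tail_normal(d, sigma):
    """P(|X| > d) under the normal kernel."""
    return math.erfc(d / (sigma * math.sqrt(2)))


def tail_neg_exp(d, a):
    """P(|X| > d) under the negative exponential kernel."""
    return math.exp(-d / a)


def tail_sqrt_neg_exp(d, a):
    """P(|X| > d) under the square root negative exponential kernel."""
    u = math.sqrt(d / a)
    return (1 + u) * math.exp(-u)


# Hypothetical scale of 100 m for all three kernels, evaluated at 2000 m.
for name, tail in [("normal", tail_normal),
                   ("negative exponential", tail_neg_exp),
                   ("square root negative exponential", tail_sqrt_neg_exp)]:
    print(name, tail(2000.0, 100.0))
```

With a shared scale parameter the three kernels do not share a mean dispersal distance, so this comparison only illustrates tail shape; it does not reproduce the fitted models or the colonization times reported in the abstract.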

6.
Prediction of gene dynamic behavior is a challenging and important problem in genomic research, and estimating temporal correlations and non-stationarity is key to this process. Unfortunately, most existing techniques for including temporal correlations treat the time course as evenly spaced time intervals and use stationary models with time-invariant settings. This assumption is often violated in microarray time course data, since expression is measured at unequal time points, where the difference in sampling times varies from minutes to days. Furthermore, the unevenly spaced short time courses with sudden changes make the prediction of genetic dynamics difficult. In this paper, we develop two types of Bayesian state space models to tackle this challenge for inferring and predicting the gene expression profiles associated with diseases. In the univariate time-varying Bayesian state space models, we treat both the stochastic transition matrix and the observation matrix as time-variant in a linear setting and point out that this can easily be extended to a nonlinear setting. In the multivariate Bayesian state space model, we include temporal correlation structures in the covariance matrix estimations. In both models, the unevenly spaced short time courses with unseen time points are treated as hidden state variables. Bayesian approaches with various prior and hyper-prior models with MCMC algorithms are used to estimate the model parameters and hidden variables. We apply our models to multiple tissue polygenetic Affymetrix data sets. Results show that the proposed models predict the genomic dynamic behavior well.

7.
8.
The simulation of dispersal processes in landscapes over large spatial extents is challenging because of the large difference in geographical scale between overwhelmingly dominant localised dispersal events, and rare long-distance dispersal events which typically drive overall rates of spread. While localised dispersal may point to high resolution individual level models, long-distance dispersal events are likely to involve much coarser grid-based models. In this paper we propose a discrete space (i.e., grid-based) model for dispersal processes in continuous space. We start by illustrating the behaviour of continuous space walks when their movement is discretised to a grid. The importance of short time period cell-to-cell moves which return a walk to its previous grid cell location is identified. A conceptual model which uses a Markov chain buffer phase between cells to replicate the observed behaviour of discretised continuous space walks is proposed. Analysis of the Markov chain shows that it can be parameterised using just two parameters in addition to the dispersal kernel. An algorithm for implementation of the proposed model is presented. Empirical results demonstrate that the proposed mechanism produces good matches to continuous space dispersal processes with both exponential and heavy-tailed dispersal kernels.  相似文献   

9.
Multiple kernel learning (MKL) has been shown to be flexible and effective in depicting heterogeneous data sources, since MKL can introduce multiple kernels rather than a single fixed kernel into applications. However, MKL has high time and space complexity compared with single kernel learning, which is undesirable in real-world applications. Meanwhile, the kernel mapping in MKL generally takes one of two forms, implicit kernel mapping and empirical kernel mapping (EKM), of which the latter has received less attention. In this paper, we focus on MKL with EKM and propose a reduced multiple empirical kernel learning machine, RMEKLM for short. To the best of our knowledge, it is the first to reduce both the time and space complexity of MKL with EKM. Unlike existing MKL, the proposed RMEKLM adopts the Gauss Elimination technique to extract a set of feature vectors, and it is validated that doing so does not lose much information of the original feature space. RMEKLM then uses the extracted feature vectors to span a reduced orthonormal subspace of the feature space, which is visualized in terms of its geometric structure. It can be demonstrated that the spanned subspace is isomorphic to the original feature space, which means that the dot product of two vectors in the original feature space is equal to that of the two corresponding vectors in the generated orthonormal subspace. More importantly, the proposed RMEKLM requires simpler computation and less storage space, especially during testing. Finally, the experimental results show that RMEKLM is efficient and effective in terms of both complexity and classification.
The contributions of this paper are as follows: (1) by mapping the input space into an orthonormal subspace, the geometry of the generated subspace is visualized; (2) this paper is the first to reduce both the time and space complexity of the EKM-based MKL; (3) this paper adopts Gauss Elimination, an off-the-shelf technique, to generate a basis of the original feature space, which is stable and efficient.

10.
Biological data mining using kernel methods can be improved by a task-specific choice of the kernel function. Oligo kernels for genomic sequence analysis have proven to have a high discriminative power and to provide interpretable results. Oligo kernels that consider subsequences of different lengths can be combined and parameterized to increase their flexibility. For adapting these parameters efficiently, gradient-based optimization of the kernel-target alignment is proposed. The power of this new, general model selection procedure and the benefits of fitting kernels to problem classes are demonstrated by adapting oligo kernels for bacterial gene start detection.
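The quantity being optimized, the kernel-target alignment, is commonly defined as the normalized Frobenius inner product between the kernel matrix K and the ideal kernel yy^T built from ±1 labels. The sketch below computes that standard definition in plain Python; it assumes this definition is the one intended and does not attempt the gradient-based parameter adaptation described in the abstract. The function name and the toy matrices are hypothetical.

```python
import math


def kernel_target_alignment(K, y):
    """Alignment <K, yy^T>_F / (||K||_F * ||yy^T||_F) for labels y in {-1, +1}."""
    n = len(y)
    # Frobenius inner product of K with the ideal kernel y y^T.
    inner = sum(K[i][j] * y[i] * y[j] for i in range(n) for j in range(n))
    k_norm = math.sqrt(sum(K[i][j] ** 2 for i in range(n) for j in range(n)))
    # ||y y^T||_F equals the squared Euclidean norm of y.
    target_norm = sum(v * v for v in y)
    return inner / (k_norm * target_norm)


labels = [1, 1, -1]
ideal = [[a * b for b in labels] for a in labels]
identity = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]
print(kernel_target_alignment(ideal, labels))
print(kernel_target_alignment(identity, labels))
```

A kernel that matches the label structure exactly reaches an alignment of 1, while an uninformative kernel such as the identity scores lower, which is why the alignment can serve as a model selection criterion.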

11.
Tractable space‐time point process models are needed in various fields, for example in weed science, for gaining biological knowledge and for predicting weed development in order to optimize local herbicide treatments, or in epidemiology, for predicting the risk of a disease. Motivated by the spatio‐temporal point patterns for two weed species, we propose a spatio‐temporal Cox model with intensity based on gamma random fields. The model is an extension of Neyman–Scott and shot‐noise Cox processes to the space‐time domain and it allows spatial and temporal inhomogeneity. We use the weed example to give a first intuitive interpretation of the model and then show how the model is constructed more rigorously and how to estimate the parameters. The weed data are analysed using the proposed model, and both spatially and temporally the model shows a good fit to the data using classical goodness‐of‐fit tests.

12.
Marginalized kernels for biological sequences   (Total citations: 1; self-citations: 0; citations by others: 1)
MOTIVATION: Kernel methods such as support vector machines require a kernel function between objects to be defined a priori. Several studies have derived kernels from probability distributions, e.g., the Fisher kernel. However, a general methodology for designing a kernel has not been fully developed. RESULTS: We propose a reasonable way of designing a kernel when objects are generated from latent variable models (e.g., HMM). First of all, a joint kernel is designed for complete data which include both visible and hidden variables. Then a marginalized kernel for visible data is obtained by taking the expectation with respect to hidden variables. We will show that the Fisher kernel is a special case of marginalized kernels, which gives another viewpoint to the Fisher kernel theory. Although our approach can be applied to any object, we particularly derive several marginalized kernels useful for biological sequences (e.g., DNA and proteins). The effectiveness of marginalized kernels is illustrated in the task of classifying bacterial gyrase subunit B (gyrB) amino acid sequences.
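The two-step construction in the abstract (a joint kernel on complete data, then an expectation over hidden variables) can be written as a single formula. The notation below is an assumption made for illustration: x and x' are the visible data, h and h' the hidden variables, z = (x, h) the complete data, K_z the joint kernel, and p(h | x) the posterior from the latent variable model.

```latex
K(x, x') \;=\; \sum_{h} \sum_{h'} p(h \mid x)\, p(h' \mid x')\, K_z(z, z'),
\qquad z = (x, h), \quad z' = (x', h')
```

For continuous hidden variables the sums would be replaced by integrals.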

13.
Mismatch string kernels for discriminative protein classification   (Total citations: 1; self-citations: 0; citations by others: 1)
MOTIVATION: Classification of protein sequences into functional and structural families based on sequence homology is a central problem in computational biology. Discriminative supervised machine learning approaches provide good performance, but simplicity and computational efficiency of training and prediction are also important concerns. RESULTS: We introduce a class of string kernels, called mismatch kernels, for use with support vector machines (SVMs) in a discriminative approach to the problem of protein classification and remote homology detection. These kernels measure sequence similarity based on shared occurrences of fixed-length patterns in the data, allowing for mutations between patterns. Thus, the kernels provide a biologically well-motivated way to compare protein sequences without relying on family-based generative models such as hidden Markov models. We compute the kernels efficiently using a mismatch tree data structure, allowing us to calculate the contributions of all patterns occurring in the data in one pass while traversing the tree. When used with an SVM, the kernels enable fast prediction on test sequences. We report experiments on two benchmark SCOP datasets, where we show that the mismatch kernel used with an SVM classifier performs competitively with state-of-the-art methods for homology detection, particularly when very few training examples are available. Examination of the highest-weighted patterns learned by the SVM classifier recovers biologically important motifs in protein families and superfamilies.
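The idea of counting shared fixed-length patterns while allowing mutations can be sketched with a brute-force (k, m)-mismatch kernel: every k-mer in a sequence contributes to all k-mers within m substitutions of it, and the kernel is the inner product of the resulting count vectors. This is an illustrative sketch only. It enumerates neighborhoods explicitly rather than using the mismatch tree described in the abstract, and the function names and toy alphabet are hypothetical.

```python
from collections import Counter
from itertools import combinations, product


def mismatch_neighborhood(kmer, m, alphabet):
    """All strings of the same length within Hamming distance m of kmer."""
    neighbors = {kmer}
    for r in range(1, m + 1):
        for positions in combinations(range(len(kmer)), r):
            for letters in product(alphabet, repeat=r):
                chars = list(kmer)
                for pos, letter in zip(positions, letters):
                    chars[pos] = letter
                neighbors.add("".join(chars))
    return neighbors


def mismatch_features(seq, k, m, alphabet):
    """Count vector: each k-mer of seq adds 1 to every k-mer in its neighborhood."""
    feats = Counter()
    for i in range(len(seq) - k + 1):
        for beta in mismatch_neighborhood(seq[i:i + k], m, alphabet):
            feats[beta] += 1
    return feats


def mismatch_kernel(s, t, k, m, alphabet):
    """Inner product of the two mismatch feature vectors."""
    fs = mismatch_features(s, k, m, alphabet)
    ft = mismatch_features(t, k, m, alphabet)
    return sum(count * ft[beta] for beta, count in fs.items())


# Toy nucleotide-style alphabet; protein sequences would use 20 amino acids.
print(mismatch_kernel("ACGTAC", "ACGAAC", k=3, m=1, alphabet="ACGT"))
```

With m = 0 this reduces to the plain k-mer spectrum kernel. The explicit enumeration grows quickly with k, m, and alphabet size, which is the cost an efficient data structure is meant to avoid.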

14.
Understanding patterns of pollen movement at the landscape scale is important for establishing management rules following the release of genetically modified (GM) crops. We use here a mating model adapted to cultivated species to estimate dispersal kernels from the genotypes of the progenies of male-sterile plants positioned at different sampling sites within a 10 x 10-km oilseed rape production area. Half of the pollen clouds sampled by the male-sterile plants originated from uncharacterized pollen sources that could consist of both large volunteer and feral populations, and fields within and outside the study area. The geometric dispersal kernel was the most appropriate to predict pollen movement in the study area. It predicted a much larger proportion of long-distance pollination than previously fitted dispersal kernels. This best-fitting mating model underestimated the level of differentiation among pollen clouds but could predict its spatial structure. The estimation method was validated on simulated genotypic data, and proved to provide good estimates of both the shape of the dispersal kernel and the rate and composition of pollen issued from uncharacterized pollen sources. The best dispersal kernel fitted here, the geometric kernel, should now be integrated into models that aim at predicting gene flow at the landscape level, in particular between GM and non-GM crops.  相似文献   

15.
Statistical and learning techniques are becoming increasingly popular for different tasks in bioinformatics. Many of the most powerful statistical and learning techniques are applicable to points in a Euclidean space but not directly applicable to discrete sequences such as protein sequences. One way to apply these techniques to protein sequences is to embed the sequences into a Euclidean space and then apply these techniques to the embedded points. In this work we introduce a biologically motivated sequence embedding, the homology kernel, which takes into account intuitions from local alignment, sequence homology, and predicted secondary structure. This embedding allows us to directly apply learning techniques to protein sequences. We apply the homology kernel in several ways. We demonstrate how the homology kernel can be used for protein family classification and outperforms state-of-the-art methods for remote homology detection. We show that the homology kernel can be used for secondary structure prediction and is competitive with popular secondary structure prediction methods. Finally, we show how the homology kernel can be used to incorporate information from homologous sequences in local sequence alignment.  相似文献   

16.
Semi-supervised protein classification using cluster kernels   (Total citations: 2; self-citations: 0; citations by others: 2)
MOTIVATION: Building an accurate protein classification system depends critically upon choosing a good representation of the input sequences of amino acids. Recent work using string kernels for protein data has achieved state-of-the-art classification performance. However, such representations are based only on labeled data--examples with known 3D structures, organized into structural classes--whereas in practice, unlabeled data are far more plentiful. RESULTS: In this work, we develop simple and scalable cluster kernel techniques for incorporating unlabeled data into the representation of protein sequences. We show that our methods greatly improve the classification performance of string kernels and outperform standard approaches for using unlabeled data, such as adding close homologs of the positive examples to the training data. We achieve equal or superior performance to previously presented cluster kernel methods while achieving far greater computational efficiency. AVAILABILITY: Source code is available at www.kyb.tuebingen.mpg.de/bs/people/weston/semiprot. The Spider MATLAB package is available at www.kyb.tuebingen.mpg.de/bs/people/spider. SUPPLEMENTARY INFORMATION: www.kyb.tuebingen.mpg.de/bs/people/weston/semiprot.

17.
Current modelling of inoculum transmission from a cropping season to the following one relies on the extrapolation of kernels estimated on data at short distances from punctual sources, because data collected at larger distances are scarce. We estimated the dispersal kernel of Leptosphaeria maculans ascospores from stubble left after harvest in the summer previous to newly sown oilseed rape fields, using phoma stem canker autumn disease severity. We built a dispersal model to analyse the data. Source strengths are described in the spatial domain covered by source fields by a log‐Gaussian spatial process. Infection potentials in the following season are described in the space consisting of the target fields, by a convolution of sources and a power‐exponential dispersal kernel. Data were collected on farmers' fields considered as sources in 2009 and 2011 (72 and 39 observation points) and as targets in 2010 and 2012 (172 and 200 points). We applied the Bayesian approach for model selection and parameter estimation. We obtained fat tail kernels for both data sets. This estimation is the first from data acquired over distances of 0 to 1000 m, using several non‐punctual inoculum sources. It opens the prospect of refining the existing simulators, or developing disease risk maps.  相似文献   

18.
Parametric kernel methods currently dominate the literature regarding the construction of animal home ranges (HRs) and utilization distributions (UDs). These methods frequently fail to capture the kinds of hard boundaries common to many natural systems. Recently a local convex hull (LoCoH) nonparametric kernel method, which generalizes the minimum convex polygon (MCP) method, was shown to be more appropriate than parametric kernel methods for constructing HRs and UDs, because of its ability to identify hard boundaries (e.g., rivers, cliff edges) and convergence to the true distribution as sample size increases. Here we extend the LoCoH in two ways: "fixed sphere-of-influence," or r-LoCoH (kernels constructed from all points within a fixed radius r of each reference point), and an "adaptive sphere-of-influence," or a-LoCoH (kernels constructed from all points within a radius a such that the distances of all points within the radius to the reference point sum to a value less than or equal to a), and compare them to the original "fixed-number-of-points," or k-LoCoH (all kernels constructed from k-1 nearest neighbors of root points). We also compare these nonparametric LoCoH to parametric kernel methods using manufactured data and data collected from GPS collars on African buffalo in the Kruger National Park, South Africa. Our results demonstrate that LoCoH methods are superior to parametric kernel methods in estimating areas used by animals, excluding unused areas (holes) and, generally, in constructing UDs and HRs arising from the movement of animals influenced by hard boundaries and irregular structures (e.g., rocky outcrops). We also demonstrate that a-LoCoH is generally superior to k- and r-LoCoH (with software for all three methods available at http://locoh.cnr.berkeley.edu).  相似文献   
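The k-LoCoH construction described above, one local convex hull per reference point built from that point and its k-1 nearest neighbors, can be sketched in plain Python. This is a minimal illustration under stated assumptions: the function names and the toy coordinates are hypothetical, the hull routine is a standard monotone-chain implementation, and the later steps of building a utilization distribution from the hulls are not shown. It is not the software distributed by the authors.

```python
def convex_hull(points):
    """Monotone-chain convex hull; returns vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def polygon_area(hull):
    """Shoelace formula; degenerate hulls (fewer than 3 vertices) have area 0."""
    n = len(hull)
    twice_area = sum(hull[i][0] * hull[(i + 1) % n][1]
                     - hull[(i + 1) % n][0] * hull[i][1] for i in range(n))
    return abs(twice_area) / 2


def k_locoh_hulls(points, k):
    """One hull per root point: the root plus its k-1 nearest neighbors."""
    hulls = []
    for root in points:
        by_distance = sorted(
            points,
            key=lambda q: (q[0] - root[0]) ** 2 + (q[1] - root[1]) ** 2)
        hulls.append(convex_hull(by_distance[:k]))
    return hulls


# Hypothetical relocation fixes (arbitrary units).
fixes = [(0, 0), (1, 0), (1, 1), (0, 1), (0.5, 0.5), (5, 5)]
for hull in k_locoh_hulls(fixes, k=3):
    print(hull, polygon_area(hull))
```

The r-LoCoH and a-LoCoH variants would change only how the neighbor set is selected (a fixed radius r, or a cumulative-distance bound a) while reusing the same hull step.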

19.
Protein homology detection using string alignment kernels   (Total citations: 2; self-citations: 0; citations by others: 2)
MOTIVATION: Remote homology detection between protein sequences is a central problem in computational biology. Discriminative methods involving support vector machines (SVMs) are currently the most effective methods for the problem of superfamily recognition in the Structural Classification Of Proteins (SCOP) database. The performance of SVMs depends critically on the kernel function used to quantify the similarity between sequences. RESULTS: We propose new kernels for strings adapted to biological sequences, which we call local alignment kernels. These kernels measure the similarity between two sequences by summing up scores obtained from local alignments with gaps of the sequences. When tested in combination with SVM on their ability to recognize SCOP superfamilies on a benchmark dataset, the new kernels outperform state-of-the-art methods for remote homology detection. AVAILABILITY: Software and data available upon request.

20.
Precise measures of population abundance and trend are needed for species conservation; these are most difficult to obtain for rare and rapidly changing populations. We compare uncertainty in densities estimated from spatio–temporal models with that from standard design-based methods. Spatio–temporal models allow us to target priority areas where, and at times when, a population may most benefit. Generalised additive models were fitted to a 31-year time series of point-transect surveys of an endangered Hawaiian forest bird, the Hawai‘i ‘ākepa Loxops coccineus. This allowed us to estimate bird densities over space and time. We used two methods to quantify uncertainty in density estimates from the spatio–temporal model: the delta method (which assumes independence between detection and distribution parameters) and a variance propagation method. With the delta method we observed a 52% decrease in the width of the design-based 95% confidence interval (CI), while we observed a 37% decrease in CI width when propagating the variance. We mapped bird densities as they changed across space and time, allowing managers to evaluate management actions. Integrating detection function modelling with spatio–temporal modelling exploits survey data more efficiently by producing finer-grained abundance estimates than are possible with design-based methods as well as producing more precise abundance estimates. Model-based approaches require switching from making assumptions about the survey design to assumptions about bird distribution. Such a switch warrants consideration. In this case the model-based approach benefits conservation planning through improved management efficiency and reduced costs by taking into account both spatial shifts and temporal changes in population abundance and distribution.  相似文献   


Copyright©北京勤云科技发展有限公司  京ICP备09084417号