
1.

Using substitution matrices to estimate probability distributions for biological sequences.

Eleazar Eskin William Stafford Noble Yoram Singer 《Journal of computational biology》2002,9(6):775-791

Accurately estimating probabilities from observations is important for probabilistic-based approaches to problems in computational biology. In this paper we present a biologically-motivated method for estimating probability distributions over discrete alphabets from observations using a mixture model of common ancestors. The method is an extension of substitution matrix-based probability estimation methods. In contrast to previous such methods, our method has a simple Bayesian interpretation and has the advantage over Dirichlet mixtures that it is both effective and simple to compute for large alphabets. The method is applied to estimate amino acid probabilities based on observed counts in an alignment and is shown to perform comparably to previous methods. The method is also applied to estimate probability distributions over protein families and improves protein classification accuracy.

2.

Bayesian phylogenetic inference using DNA sequences: a Markov Chain Monte Carlo Method (total citations: 39; self-citations: 18; citations by others: 21)

Yang Z; Rannala B 《Molecular biology and evolution》1997,14(7):717-724

An improved Bayesian method is presented for estimating phylogenetic trees using DNA sequence data. The birth-death process with species sampling is used to specify the prior distribution of phylogenies and ancestral speciation times, and the posterior probabilities of phylogenies are used to estimate the maximum posterior probability (MAP) tree. Monte Carlo integration is used to integrate over the ancestral speciation times for particular trees. A Markov Chain Monte Carlo method is used to generate the set of trees with the highest posterior probabilities. Methods are described for an empirical Bayesian analysis, in which estimates of the speciation and extinction rates are used in calculating the posterior probabilities, and a hierarchical Bayesian analysis, in which these parameters are removed from the model by an additional integration. The Markov Chain Monte Carlo method avoids the requirement of our earlier method for calculating MAP trees to sum over all possible topologies (which limited the number of taxa in an analysis to about five). The methods are applied to analyze DNA sequences for nine species of primates, and the MAP tree, which is identical to a maximum-likelihood estimate of topology, has a probability of approximately 95%.

3.

A Bayesian estimate of harbour seal survival using sparse photo-identification data

B. L. Mackey J. W. Durban S. J. Middlemas & P. M. Thompson 《Journal of Zoology》2008,274(1):18-27

Survival rates have rarely been estimated for pinniped populations due to the constraints of obtaining unbiased sample data. In this paper, we present an approach for estimating survival probabilities from individual recognition data in the form of photographic documentation of pelage patterns. This method was applied to estimate adult (age 2+) survival for harbour seals in the Moray Firth, NE Scotland. An astronomical telescope was used to obtain digital images of individual seals, and high-quality images were used to document the annual presence or absence of individuals at a single haul-out site over a 4-year period. A total of 95 females, 10 males and 57 individuals of unknown sex were photographically documented during the study period. Survival and recapture probabilities were estimated using Jolly–Seber mark–recapture models in a Bayesian statistical framework. Computer-intensive Markov Chain Monte Carlo methods were used to estimate the probability distributions for the survival and recapture probabilities, conveying the full extent of the uncertainty resulting from unavoidably sparse observational data. The deviance information criterion was used to identify a best-fitting model that accounted for variation in the probability of capture between sexes, with constant survival. The model estimated adult survival as 0.98 (95% probability interval of 0.94–1.00) using our photo-identification data alone, and 0.97 (0.92–0.99) with the use of an informative prior distribution based on previously published estimates of harbour seal survival. This paper represents the first survival estimate for harbour seals in the UK, and the first survival estimate using photo-identification data in any species of pinniped.

4.

Estimating sexual dimorphism by method-of-moments

Steven C. Josephson Kenneth E. Juell Alan R. Rogers 《American journal of physical anthropology》1996,100(2):191-206

Estimating the degree of sexual dimorphism is difficult in fossil species because most specimens lack indicators of sex. We present a procedure that estimates sexual dimorphism in samples of unknown sex using method-of-moments. We assume that the distribution of a metric trait is composed of two underlying normal distributions, one for males and one for females. We use three moments around the mean of the combined-sex distribution to estimate the means and the common standard deviation of the two underlying distributions. This procedure has advantages over previous methods: it is relatively simple to use, specimens need not be assigned to sex a priori, no reference to living species analogs is required, and the method provides conservative estimates of dimorphism under a variety of conditions. The method performs best when the male and female distributions overlap minimally but also works well when overlap is substantial. Simulations indicate that this relatively simple method is more accurate and reliable than previous methods for estimating dimorphism. © 1996 Wiley-Liss, Inc.
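
The abstract does not give its moment equations, so the following Python sketch is only one possible reading of the idea. It assumes an equal sex ratio (a 50:50 mixture) and uses the mean together with the second and fourth central moments; the function names and the equal-proportion assumption are hypothetical and not taken from the paper. Under these assumptions, with a equal to half the gap between the sex means, m2 = sigma^2 + a^2 and m4 = 3*sigma^4 + 6*sigma^2*a^2 + a^4, so 3*m2^2 - m4 = 2*a^4.

```python
import math


def central_moment(values, order):
    """Sample central moment of the given order (divides by n)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** order for v in values) / len(values)


def estimate_dimorphism(values):
    """Return (larger_mean, smaller_mean, common_sd) for a pooled sample.

    Assumes (hypothetically) a 50:50 mixture of two normal distributions
    that share one standard deviation.
    """
    mean = sum(values) / len(values)
    m2 = central_moment(values, 2)
    m4 = central_moment(values, 4)
    # 3*m2^2 - m4 equals 2*a^4; clamp at zero to guard against sampling noise.
    excess = max(3.0 * m2 ** 2 - m4, 0.0)
    half_gap = (excess / 2.0) ** 0.25
    # Whatever variance is not explained by the gap is the common variance.
    common_var = max(m2 - half_gap ** 2, 0.0)
    return mean + half_gap, mean - half_gap, math.sqrt(common_var)
```

For example, `estimate_dimorphism(measurements)` could be called on a list of pooled trait measurements of unknown sex; the ratio of the two returned means would then serve as a dimorphism index under the stated assumptions.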

5.

Protein family classification using sparse Markov transducers.

Eleazar Eskin William Stafford Noble Yoram Singer 《Journal of computational biology》2003,10(2):187-213

We present a method for classifying proteins into families based on short subsequences of amino acids using a new probabilistic model called sparse Markov transducers (SMT). We classify a protein by estimating probability distributions over subsequences of amino acids from the protein. Sparse Markov transducers, similar to probabilistic suffix trees, estimate a probability distribution conditioned on an input sequence. SMTs generalize probabilistic suffix trees by allowing for wild-cards in the conditioning sequences. Since substitutions of amino acids are common in protein families, incorporating wild-cards into the model significantly improves classification performance. We present two models for building protein family classifiers using SMTs. As protein databases become larger, data-driven learning algorithms for probabilistic models such as SMTs will require vast amounts of memory. We therefore describe and use efficient data structures to improve the memory usage of SMTs. We evaluate SMTs by building protein family classifiers using the Pfam and SCOP databases and compare our results to previously published results and state-of-the-art protein homology detection methods. SMTs outperform previous probabilistic suffix tree methods and under certain conditions perform comparably to state-of-the-art protein homology methods.

6.

A morphological and geometric method for estimating the selectivity of gill nets

Feodor Lobyrev Matthew J. Hoffman 《Reviews in Fish Biology and Fisheries》2018,28(4):909-924

We propose a new method for estimating gill net selectivity which estimates the probabilities leading to retention by analyzing both the fish morphology and the mesh geometry. This method estimates the number of fish approaching and contacting gill nets of different mesh sizes as an intermediate step towards computing the selectivity. Instead of assuming an underlying probability distribution as in indirect methods, we split the entire interaction between a fish and the gill net into several stages, each with its own probability. All the necessary parameters to compute these probabilities can be obtained from measurements of the fish, knowledge of the mesh geometry, and catch data from different mesh sizes. The framework offers three pathways for computing the total number of fish contacting the gill nets and has the capability to use both wedged and entangled fish in the analysis. As a proof of concept, the method is applied to catch data for cod (G. morhua) and Dolly Varden (S. malma) to estimate the number of fish contacting the gill nets in both cases. By estimating the number of fish contacting the gill net in addition to the selectivity, this method provides an important step towards deriving estimates of fish density in a particular fishery from gill net measurement.

7.

Estimating recombination rates using three-site likelihoods

Wall JD 《Genetics》2004,167(3):1461-1473

We introduce a new method for jointly estimating crossing-over and gene conversion rates using sequence polymorphism data. The method calculates probabilities for subsets of the data consisting of three segregating sites and then forms a composite likelihood by multiplying together the probabilities of many subsets. Simulations show that this new method performs better than previously proposed methods for estimating gene conversion rates, but that all methods require large amounts of data to provide reliable estimates. While existing methods can easily estimate an "average" gene conversion rate over many loci, they cannot reliably estimate gene conversion rates for a single region of the genome.

8.

Evaluation of several methods for estimating phylogenetic trees when substitution rates differ over nucleotide sites

Ziheng Yang 《Journal of molecular evolution》1995,40(6):689-697

Several maximum likelihood and distance matrix methods for estimating phylogenetic trees from homologous DNA sequences were compared when substitution rates at sites were assumed to follow a gamma distribution. Computer simulations were performed to estimate the probabilities that various tree estimation methods recover the true tree topology. The case of four species was considered, and a few combinations of parameters were examined. Attention was applied to discriminating among different sources of error in tree reconstruction, i.e., the inconsistency of the tree estimation method, the sampling error in the estimated tree due to limited sequence length, and the sampling error in the estimated probability due to the number of simulations being limited. Compared to the least squares method based on pairwise distance estimates, the joint likelihood analysis is found to be more robust when rate variation over sites is present but ignored and an assumption is thus violated. With limited data, the likelihood method has a much higher probability of recovering the true tree and is therefore more efficient than the least squares method. The concept of statistical consistency of a tree estimation method and its implications were explored, and it is suggested that, while the efficiency (or sampling error) of a tree estimation method is a very important property, statistical consistency of the method over a wide range of, if not all, parameter values is prerequisite.

9.

Site occupancy models with heterogeneous detection probabilities (total citations: 1; self-citations: 0; citations by others: 1)

Royle JA 《Biometrics》2006,62(1):97-102

Models for estimating the probability of occurrence of a species in the presence of imperfect detection are important in many ecological disciplines. In these "site occupancy" models, the possibility of heterogeneity in detection probabilities among sites must be considered because variation in abundance (and other factors) among sampled sites induces variation in detection probability (p). In this article, I develop occurrence probability models that allow for heterogeneous detection probabilities by considering several common classes of mixture distributions for p. For any mixing distribution, the likelihood has the general form of a zero-inflated binomial mixture for which inference based upon integrated likelihood is straightforward. A recent paper by Link demonstrates that in closed population models used for estimating population size, different classes of mixture distributions are indistinguishable from data, yet can produce very different inferences about population size. I demonstrate that this problem can also arise in models for estimating site occupancy in the presence of heterogeneous detection probabilities. The implications of this are discussed in the context of an application to avian survey data and the development of animal monitoring programs.
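
To make the zero-inflated binomial form concrete, here is a minimal Python sketch for the simplest case only: a constant detection probability p with no mixing distribution. The names `psi` (occupancy probability), `n_visits`, `site_likelihood` and `log_likelihood` are illustrative assumptions, not the article's notation or code. A heterogeneous model would replace the binomial term with its average over a mixing distribution for p.

```python
import math


def site_likelihood(detections, n_visits, psi, p):
    """Likelihood of observing `detections` in `n_visits` at one site.

    An occupied site (probability psi) yields a binomial count; an
    unoccupied site (probability 1 - psi) can only yield zero detections.
    """
    binom = (math.comb(n_visits, detections)
             * p ** detections * (1.0 - p) ** (n_visits - detections))
    zero_mass = (1.0 - psi) if detections == 0 else 0.0
    return psi * binom + zero_mass


def log_likelihood(counts, n_visits, psi, p):
    """Sum of per-site log-likelihoods, assuming independent sites."""
    return sum(math.log(site_likelihood(y, n_visits, psi, p)) for y in counts)
```

Maximising `log_likelihood` over `psi` and `p` (for instance with a grid search) would give point estimates under this simplified constant-p model.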

10.

Maximum likelihood inference of protein phylogeny and the origin of chloroplasts (total citations: 15; self-citations: 0; citations by others: 15)

Hirohisa Kishino Takashi Miyata Masami Hasegawa 《Journal of molecular evolution》1990,31(2):151-160

Summary: A maximum likelihood method for inferring protein phylogeny was developed. It is based on a Markov model that takes into account the unequal transition probabilities among pairs of amino acids and does not assume constancy of rate among different lineages. Therefore, this method is expected to be powerful in inferring phylogeny among distantly related proteins, either orthologous or paralogous, where the evolutionary rate may deviate from constancy. Not only amino acid substitutions but also insertion/deletion events during evolution were incorporated into the Markov model. A simple method for estimating a bootstrap probability for the maximum likelihood tree among alternatives without performing a maximum likelihood estimation for each resampled data set was developed. These methods were applied to amino acid sequence data of a photosynthetic membrane protein, psbA, from photosystem II, and the phylogeny of this protein was discussed in relation to the origin of chloroplasts.

11.

Mathematical expressions useful in the construction, description and evaluation of protein libraries

Bosley AD Ostermeier M 《Biomolecular engineering》2005,22(1-3):57-61

The creation of protein libraries by random mutagenesis and cassette mutagenesis has proven to be a successful method of protein engineering. Appropriate statistical analysis is important for the proper construction of these libraries and even more important for the interpretation of data from these libraries. We present simple mathematical expressions useful in the creation and evaluation of such libraries. These equations are useful in estimating the distribution of mutations, the degeneracy of the library and the frequency of a particular clone in the library. In addition, general equations addressing the probability that a particular clone is in a library, the probability that a library is complete, and the consequences of retransformation of the library on these probabilities are presented.
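
The abstract does not reproduce its equations, so the Python sketch below uses standard textbook expressions rather than the paper's own. It assumes that every one of `n_variants` possible sequences is equally likely and that `n_clones` are drawn independently; all function names are hypothetical. The completeness formula is an approximation that treats the presence of each variant as independent.

```python
import math


def prob_clone_present(n_clones, n_variants):
    """Probability that one particular variant occurs at least once."""
    return 1.0 - (1.0 - 1.0 / n_variants) ** n_clones


def expected_distinct(n_clones, n_variants):
    """Expected number of distinct variants represented in the library."""
    return n_variants * prob_clone_present(n_clones, n_variants)


def approx_prob_complete(n_clones, n_variants):
    """Approximate probability that every variant is present.

    Uses the Poisson-style approximation (1 - exp(-N/V))**V, which
    ignores the weak dependence between variants.
    """
    return (1.0 - math.exp(-n_clones / n_variants)) ** n_variants
```

For instance, `prob_clone_present(3 * 8000, 8000)` gives the chance of finding one specific variant when a library of 8000 equiprobable variants is oversampled threefold, under the equal-frequency assumption stated above.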

12.

Estimating genotypes with independently sampled descent graphs

Henshall JM Tier B Kerr RJ 《Genetical research》2001,78(3):281-288

A method for estimating genotypic and identity-by-descent probabilities in complex pedigrees is described. The method consists of an algorithm for drawing independent genotype samples which are consistent with the pedigree and observed genotype. The probability distribution function for samples obtained using the algorithm can be evaluated up to a normalizing constant, and combined with the likelihood to produce a weight for each sample. Importance sampling is then used to estimate genotypic and identity-by-descent probabilities. On small but complex pedigrees, the genotypic probability estimates are demonstrated to be empirically unbiased. On large complex pedigrees, while the algorithm for obtaining genotype samples is feasible, importance sampling may require an infeasible number of samples to estimate genotypic probabilities with accuracy.

13.

Exploiting uncertain ecological fieldwork data with multi-event capture-recapture modelling: an example with bird sex assignment

Genovart M Pradel R Oro D 《The Journal of animal ecology》2012,81(5):970-977

1. Sex plays a crucial role in evolutionary life histories. However, the inclusion of sex in demographic analysis may be a challenge in fieldwork, particularly in monomorphic species. Although behavioural data may help us to sex individuals in the field, this kind of data is unlikely to be error free and is usually discarded. 2. Here we propose a multi-event capture-recapture model that enables us to exploit uncertain field observations regarding the sex of individuals based on behavioural or morphological criteria. The multi-event capture-recapture model allows us to account for sex uncertainty without restricting our ability to estimate the parameters of interest. In this case, by adding the confirmed sex of just a few individuals, we greatly improve the efficiency of the optimization algorithm. 3. Using such an approach, we analysed sex differences in demographic parameters (e.g. survival, transience and sex ratio) in a population of Audouin's gulls using observations from long-term fieldwork monitoring (1988-2007). We also assessed the probability of ascertaining sex over time and the probability of error for each field-sexing criterion. 4. We detected no strong effect of sex on either survival or transience probabilities; both sexes showed a decreasing trend in survival over time, and transience probability after recruitment increased with age and over time. The probability of ascertaining sex over time depended on observers' experience. Strikingly, courtship feeding (but not copulation) emerged as the most reliable clue for sexing individuals, which would suggest that Audouin's gulls engage in same-sex sexual behaviour such as same-sex mounting. 5. The present modelling emerged as a reliable method for estimating demographic parameters and state transition parameters in ecological studies in which field observations of sex or other individual states are assigned erroneously and uncertainly. This approach could also be useful for applied ecologists for assessing the reliability of their criteria for assigning sex or other individual covariates in the field, thereby permitting them to optimize their field ecological protocols.

14.

A computational framework to empower probabilistic protein design

Fromer M Yanover C 《Bioinformatics (Oxford, England)》2008,24(13):i214-i222

MOTIVATION: The task of engineering a protein to perform a target biological function is known as protein design. A commonly used paradigm casts this functional design problem as a structural one, assuming a fixed backbone. In probabilistic protein design, positional amino acid probabilities are used to create a random library of sequences to be simultaneously screened for biological activity. Clearly, certain choices of probability distributions will be more successful in yielding functional sequences. However, since the number of sequences is exponential in protein length, computational optimization of the distribution is difficult. RESULTS: In this paper, we develop a computational framework for probabilistic protein design following the structural paradigm. We formulate the distribution of sequences for a structure using the Boltzmann distribution over their free energies. The corresponding probabilistic graphical model is constructed, and we apply belief propagation (BP) to calculate marginal amino acid probabilities. We test this method on a large structural dataset and demonstrate the superiority of BP over previous methods. Nevertheless, since the results obtained by BP are far from optimal, we thoroughly assess the paradigm using high-quality experimental data. We demonstrate that, for small scale sub-problems, BP attains identical results to those produced by exact inference on the paradigmatic model. However, quantitative analysis shows that the distributions predicted significantly differ from the experimental data. These findings, along with the excellent performance we observed using BP on the smaller problems, suggest potential shortcomings of the paradigm. We conclude with a discussion of how it may be improved in the future.

15.

The relationship between species detection probability and local extinction probability

Alpizar-Jara R Nichols JD Hines JE Sauer JR Pollock KH Rosenberry CS 《Oecologia》2004,141(4):652-660

In community-level ecological studies, generally not all species present in sampled areas are detected. Many authors have proposed the use of estimation methods that allow detection probabilities that are <1 and that are heterogeneous among species. These methods can also be used to estimate community-dynamic parameters such as species local extinction probability and turnover rates (Nichols et al. Ecol Appl 8:1213–1225; Conserv Biol 12:1390–1398). Here, we present an ad hoc approach to estimating community-level vital rates in the presence of joint heterogeneity of detection probabilities and vital rates. The method consists of partitioning the number of species into two groups using the detection frequencies and then estimating vital rates (e.g., local extinction probabilities) for each group. Estimators from each group are combined in a weighted estimator of vital rates that accounts for the effect of heterogeneity. Using data from the North American Breeding Bird Survey, we computed such estimates and tested the hypothesis that detection probabilities and local extinction probabilities were negatively related. Our analyses support the hypothesis that species detection probability covaries negatively with local probability of extinction and turnover rates. A simulation study was conducted to assess the performance of vital parameter estimators as well as other estimators relevant to questions about heterogeneity, such as coefficient of variation of detection probabilities and proportion of species in each group. Both the weighted estimator suggested in this paper and the original unweighted estimator for local extinction probability performed fairly well and provided no basis for preferring one to the other.

16.

Protein fold recognition by total alignment probability

Bienkowska JR Yu L Zarakhovich S Rogers RG Smith TF 《Proteins》2000,40(3):451-462

We present a protein fold-recognition method that uses a comprehensive statistical interpretation of structural Hidden Markov Models (HMMs). The structure/fold recognition is done by summing the probabilities of all sequence-to-structure alignments. The optimal alignment can be defined as the most probable, but suboptimal alignments may have comparable probabilities. These suboptimal alignments can be interpreted as optimal alignments to the "other" structures from the ensemble or optimal alignments under minor fluctuations in the scoring function. Summing probabilities for all alignments gives a complete estimate of sequence-model compatibility. In the case of HMMs that produce a sequence, this reflects the fact that due to our indifference to exactly how the HMM produced the sequence, we should sum over all possibilities. We have built a set of structural HMMs for 188 protein structures and have compared two methods for identifying the structure compatible with a sequence: by the optimal alignment probability and by the total probability. Fold recognition by total probability was 40% more accurate than fold recognition by the optimal alignment probability. Proteins 2000;40:451-462.
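
The difference between total probability and optimal-alignment probability can be illustrated on a toy HMM. The Python sketch below is not the authors' structural model: the two states, the three-letter alphabet and every number in `START`, `TRANS` and `EMIT` are made up for illustration, and state paths stand in for sequence-to-structure alignments. The forward recursion sums over all paths, while the Viterbi recursion keeps only the best one.

```python
from itertools import product

# Hypothetical toy parameters: two hidden states, three symbols.
START = {"H": 0.6, "E": 0.4}
TRANS = {"H": {"H": 0.7, "E": 0.3}, "E": {"H": 0.4, "E": 0.6}}
EMIT = {"H": {"A": 0.5, "G": 0.1, "V": 0.4},
        "E": {"A": 0.2, "G": 0.3, "V": 0.5}}


def path_probability(seq, path):
    """Joint probability of a sequence and one specific state path."""
    prob = START[path[0]] * EMIT[path[0]][seq[0]]
    for prev, cur, sym in zip(path, path[1:], seq[1:]):
        prob *= TRANS[prev][cur] * EMIT[cur][sym]
    return prob


def total_probability(seq):
    """Forward algorithm: probability summed over all state paths."""
    alpha = {s: START[s] * EMIT[s][seq[0]] for s in START}
    for sym in seq[1:]:
        alpha = {s: EMIT[s][sym] * sum(alpha[r] * TRANS[r][s] for r in alpha)
                 for s in START}
    return sum(alpha.values())


def best_path_probability(seq):
    """Viterbi recursion: probability of the single most likely path."""
    delta = {s: START[s] * EMIT[s][seq[0]] for s in START}
    for sym in seq[1:]:
        delta = {s: EMIT[s][sym] * max(delta[r] * TRANS[r][s] for r in delta)
                 for s in START}
    return max(delta.values())


def brute_force_total(seq):
    """Sanity check: enumerate every path explicitly (small inputs only)."""
    return sum(path_probability(seq, path)
               for path in product(START, repeat=len(seq)))
```

Scoring a sequence against several such models with `total_probability` instead of `best_path_probability` corresponds, in this simplified setting, to the two recognition criteria compared in the abstract.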

17.

Expanded Fermi solution for estimating the survival of ingested pathogenic and probiotic microbial cells and spores

Peleg M Normand MD Horowitz J Corradini MG 《Applied and environmental microbiology》2011,77(1):312-319

The expanded Fermi solution was originally developed for estimating the number of food-poisoning victims when information concerning the circumstances of exposure is scarce. The method has been modified for estimating the initial number of pathogenic or probiotic cells or spores so that enough of them will survive the food preparation and digestive tract's obstacles to reach or colonize the gut in sufficient numbers to have an effect. The method is based on identifying the relevant obstacles and assigning each a survival probability range. The assumed number of needed survivors is also specified as a range. The initial number is then estimated to be the ratio of the number of survivors to the product of the survival probabilities. Assuming that the values of the number of survivors and the survival probabilities are uniformly distributed over their respective ranges, the sought initial number is construed as a random variable with a probability distribution whose parameters are explicitly determined by the individual factors' ranges. The distribution of the initial number is often approximately lognormal, and its mode is taken to be the best estimate of the initial number. The distribution also provides a credible interval for this estimated initial number. The best estimate and credible interval are shown to be robust against small perturbations of the ranges and therefore can help assessors achieve consensus where hard knowledge is scant. The calculation procedure has been automated and made freely downloadable as a Wolfram Demonstration.
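
The ratio described in the abstract can be explored with a small Monte Carlo simulation. The Python sketch below is an illustrative stand-in, not the authors' analytical solution or their Wolfram Demonstration: it draws the survivor count and each survival probability uniformly from its range, forms the ratio, and then applies the lognormal approximation mentioned in the abstract, using the standard fact that a lognormal distribution has mode exp(mu - sigma^2). Function names, ranges and sample sizes are hypothetical.

```python
import math
import random
import statistics


def sample_initial_numbers(survivor_range, survival_ranges,
                           n_samples=20000, seed=0):
    """Draw samples of N0 = survivors / product(survival probabilities)."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n_samples):
        survivors = rng.uniform(*survivor_range)
        joint_survival = 1.0
        for low, high in survival_ranges:
            joint_survival *= rng.uniform(low, high)
        samples.append(survivors / joint_survival)
    return samples


def lognormal_mode(samples):
    """Mode of a lognormal fitted to the samples: exp(mu - sigma^2)."""
    logs = [math.log(x) for x in samples]
    mu = statistics.fmean(logs)
    sigma_sq = statistics.pvariance(logs)
    return math.exp(mu - sigma_sq)
```

As a usage example, `lognormal_mode(sample_initial_numbers((1e3, 1e4), [(0.01, 0.1), (0.1, 0.5)]))` would give a best-estimate initial number for two obstacles with made-up survival ranges; percentiles of the same samples could serve as an interval estimate.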

18.

ROC analysis with multiple classes and multiple tests: methodology and its application in microarray studies

Li J Fine JP 《Biostatistics (Oxford, England)》2008,9(3):566-576

The accuracy of a single diagnostic test for binary outcome can be summarized by the area under the receiver operating characteristic (ROC) curve. Volume under the surface and hypervolume under the manifold have been proposed as extensions for multiple class diagnosis (Scurfield, 1996, 1998). However, the lack of simple inferential procedures for such measures has limited their practical utility. Part of the difficulty is that calculating such quantities may not be straightforward, even with a single test. The decision rule used to generate the ROC surface requires class probability assessments, which are not provided by the tests. We develop a method based on estimating the probabilities via some procedure, for example, multinomial logistic regression. Bootstrap inferences are proposed to account for variability in estimating the probabilities and perform well in simulations. The ROC measures are compared to the correct classification rate, which depends heavily on class prevalences. An example of tumor classification with microarray data demonstrates that this property may lead to substantially different analyses. The ROC-based analysis yields notable decreases in model complexity over previous analyses.

19.

Estimating the annual number of breeding attempts from breeding dates using mixture models

Thomas Cornulier David A. Elston Peter Arcese Tim G. Benton David J.T. Douglas Xavier Lambin Jane Reid Robert A. Robinson William J. Sutherland 《Ecology letters》2009,12(11):1184-1193

Well-established statistical methods exist to estimate variation in a number of key demographic rates from field data, including life-history transition probabilities and reproductive success per attempt. However, our understanding of the processes underlying population change remains incomplete without knowing the number of reproductive attempts individuals make annually; this is a key demographic rate for which we have no satisfactory method of estimating. Using census data to estimate this parameter requires disaggregating the overlying temporal distributions of first and subsequent breeding attempts. We describe a Bayesian mixture method to estimate the annual number of reproductive attempts from field data to provide a new tool for demographic inference. We validate our method using comprehensive data on individually-marked song sparrows Melospiza melodia, and then apply it to more typical nest record data collected over 45 years on yellowhammers Emberiza citrinella. We illustrate the utility of our method by testing, and rejecting, the hypothesis that declines in UK yellowhammer populations have occurred concurrently with declines in annual breeding frequency.

20.

Comparison of Methods for Estimating Bird Abundance and Trends From Historical Count Data

FRANK R. THOMPSON III FRANK A. LA SORTE 《The Journal of wildlife management》2008,72(8):1674-1682

Abstract: The use of bird counts as indices has come under increasing scrutiny because assumptions concerning detection probabilities may not be met, but there also seems to be some resistance to use of model-based approaches to estimating abundance. We used data from the United States Forest Service, Southern Region bird monitoring program to compare several common approaches for estimating annual abundance or indices and population trends from point-count data. We compared indices of abundance estimated as annual means of counts and from a mixed-Poisson model to abundance estimates from a count-removal model with 3 time intervals and a distance model with 3 distance bands. We compared trend estimates calculated from an autoregressive, exponential model fit to annual abundance estimates from the above methods and also by estimating trend directly by treating year as a continuous covariate in the mixed-Poisson model. We produced estimates for 6 forest songbirds based on an average of 621 and 459 points in 2 physiographic areas from 1997 to 2004. There was strong evidence that detection probabilities varied among species and years. Nevertheless, there was good overall agreement across trend estimates from the 5 methods for 9 of 12 comparisons. In 3 of 12 comparisons, however, patterns in detection probabilities potentially confounded interpretation of uncorrected counts. Estimates of detection probabilities differed greatly between removal and distance models, likely because the methods estimated different components of detection probability and the data collection was not optimally designed for either method. Given that detection probabilities often vary among species, years, and observers, investigators should address detection probability in their surveys, whether it be by estimation of probability of detection and abundance, estimation of effects of key covariates when modeling count as an index of abundance, or through design-based methods to standardize these effects.