Similar Articles
20 similar articles found (search time: 31 ms)
1.
We have examined the statistical requirements for detecting mixtures of two lognormal distributions in doubly truncated data when the sample size is large. The expectation-maximization (EM) algorithm was used for parameter estimation, and a bootstrap approach based on the likelihood ratio statistic was used to test for a mixture of distributions. Analysis of computer-simulated mixtures showed that as the ratio of the difference between the means to the minimum standard deviation increases, the power of detection increases and the accuracy of the parameter estimates improves. These procedures were used to examine the distribution of red blood cell volume in blood samples. Each distribution was doubly truncated to eliminate artifactual frequency counts and tested for best fit to a single lognormal distribution or a mixture of two lognormal distributions. A single population was found in samples from 60 healthy individuals. Two subpopulations of cells were detected in 25 of 27 blood mixtures prepared in vitro. Analyses of blood from 40 patients treated for iron-deficiency anemia showed that subpopulations could be detected in all patients by 6 weeks after the onset of treatment. To determine whether two-component mixtures could be detected, distributions were examined from untransfused patients with refractory anemia. In two patients with inherited sideroblastic anemia a mixture of microcytic and normocytic cells was found, while in a third patient a single population of microcytic cells was identified. In two family members previously identified as carriers of inherited sideroblastic anemia, mixtures of microcytic and normocytic subpopulations were found. Of 25 patients with acquired myelodysplastic anemia examined, a good fit to a mixture of subpopulations containing abnormal microcytic or macrocytic cells was found in two. We have demonstrated that with large sample sizes, mixtures of distributions can be detected even when the distributions appear unimodal. These statistical techniques provide a means to characterize and quantify alterations in erythrocyte subpopulations in anemia, but they can also be applied to any set of grouped, doubly truncated data to test for the presence of a mixture of two lognormal distributions.
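To make the testing strategy concrete, here is a minimal sketch (not the authors' implementation) of a parametric bootstrap likelihood-ratio test for one versus two lognormal components. It works on the log scale, where a lognormal mixture becomes a normal mixture, uses scikit-learn's GaussianMixture for the EM fits, and, unlike the paper, ignores grouping and double truncation; all data are simulated.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def lrt_stat(x):
    """2 * (log-lik of 2-component fit - log-lik of 1-component fit),
    computed on the log scale. Also returns the 1-component fit."""
    z = np.log(x).reshape(-1, 1)
    g1 = GaussianMixture(n_components=1).fit(z)
    g2 = GaussianMixture(n_components=2, n_init=5).fit(z)
    return 2 * len(z) * (g2.score(z) - g1.score(z)), g1  # .score() is mean log-lik

rng = np.random.default_rng(0)
x = np.exp(np.concatenate([rng.normal(4.0, 0.15, 3000),   # simulated cell volumes
                           rng.normal(4.4, 0.15, 1000)]))
t_obs, g1 = lrt_stat(x)

# Parametric bootstrap: simulate from the single-lognormal fit to
# approximate the null distribution of the LRT statistic.
boot = [lrt_stat(np.exp(g1.sample(len(x))[0].ravel()))[0] for _ in range(199)]
p_value = (1 + sum(b >= t_obs for b in boot)) / (1 + len(boot))
print(f"LRT = {t_obs:.1f}, bootstrap p = {p_value:.3f}")
```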

2.
A method is proposed that aims at identifying clusters of individuals that show similar patterns when observed repeatedly. We consider linear mixed models, which are widely used for modeling longitudinal data. In contrast to the classical assumption of a normal distribution for the random effects, a finite mixture of normal distributions is assumed. Typically, the number of mixture components is unknown and has to be chosen, ideally by data-driven tools. For this purpose, an EM algorithm-based approach is considered that uses a penalized normal mixture as the random effects distribution. The penalty term shrinks the pairwise distances of cluster centers, drawing on the group lasso and fused lasso methods. The effect is that individuals with similar time trends are merged into the same cluster. The strength of regularization is determined by a single penalization parameter, and a new model choice criterion is proposed for selecting its optimal value.

3.
Fitting mixture models to grouped and truncated data via the EM algorithm (cited by 3: 0 self-citations, 3 by others)
The fitting of finite mixture models via the EM algorithm is considered for data that are available only in grouped form and that may also be truncated. A practical example is presented in which a mixture of two doubly truncated lognormal distributions is adopted to model the distribution of red blood cell volume in cows during recovery from anemia.
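The grouped-data EM can be sketched compactly. The toy version below is a deliberate simplification: it represents each bin by its log-midpoint, whereas the paper's EM uses exact interval probabilities and renormalizes for truncation. Bin edges, counts, and starting values are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

def em_grouped_lognormal_mix(edges, counts, iters=300):
    """EM for a two-component lognormal mixture from binned counts,
    approximating each bin by its log-midpoint (the paper instead uses
    exact interval probabilities and renormalizes for truncation)."""
    z = np.log((edges[:-1] + edges[1:]) / 2)          # log bin midpoints
    n = counts / counts.sum()                         # bin mass
    mu = np.array([np.quantile(z, 0.3), np.quantile(z, 0.7)])
    sd = np.full(2, z.std() / 2)
    w = np.array([0.5, 0.5])
    for _ in range(iters):
        dens = w[:, None] * norm.pdf(z, mu[:, None], sd[:, None])  # (2, bins)
        r = dens / (dens.sum(0) + 1e-300)             # bin responsibilities
        wk = (r * n).sum(1)
        mu = (r * n * z).sum(1) / wk
        sd = np.sqrt((r * n * (z - mu[:, None]) ** 2).sum(1) / wk)
        w = wk
    return w, np.exp(mu), sd                          # medians on original scale

edges = np.linspace(20, 140, 61)                      # hypothetical volume bins (fL)
rng = np.random.default_rng(1)
sim = np.exp(np.concatenate([rng.normal(np.log(60), 0.12, 5000),
                             rng.normal(np.log(90), 0.12, 3000)]))
counts, _ = np.histogram(sim, edges)
print(em_grouped_lognormal_mix(edges, counts.astype(float)))
```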

4.
Zhou X, Yan L, Prows DR, Yang R. Genomics. 2011;97(6):379-385.
Of the two most popular models in survival analysis, the accelerated failure time (AFT) model can fit survival data more easily than the Cox proportional hazards model (PHM). In this study, we develop a general parametric AFT model for identifying survival trait loci, in which the flexible generalized F distribution, which includes many commonly used distributions as special cases, is specified as the baseline survival distribution. An EM algorithm for maximum likelihood estimation of the model parameters is given. Simulations are conducted to validate the flexibility and utility of the proposed mapping procedure. In an analysis of survival time following hyperoxic acute lung injury (HALI) in an F2 mouse mating population, the generalized F distribution performed best among six competing survival distributions and detected four QTLs controlling differential HALI survival.

5.
Binomial and geometric mixtures can be used to model data gathered in capture-recapture surveys of animal populations, removal surveys of harvested populations, registrations of disease populations, ecological species censuses, and so on. To compute a nonparametric maximum likelihood estimator for the mixing distribution of heterogeneous capture probabilities, we consider a conditional approach and use a reliable and fast integrative procedure that combines the EM algorithm, to increase the likelihood, with the vertex-exchange method, to update the number of support points. A convergent Newtonian algorithm is used in the M-step of the EM algorithm.
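The following is a minimal sketch of the general scheme, combining EM weight updates with a vertex-direction search for new support points over a fixed candidate grid. It fits an unconditional binomial mixture to simulated data and omits the zero-truncated conditional likelihood that the paper uses for capture-recapture surveys.

```python
import numpy as np
from scipy.stats import binom

def npmle_binom_mixture(x, n_trials, grid=np.linspace(0.01, 0.99, 99), iters=200):
    """NPMLE sketch for a mixing distribution over binomial success
    probabilities: vertex search to add support points, EM for weights."""
    support, weights = [x.mean() / n_trials], np.array([1.0])
    for _ in range(iters):
        f = np.array([binom.pmf(x, n_trials, t) for t in support])   # (k, N)
        mix = weights @ f
        # directional derivative of the log-likelihood over candidate points
        grad = np.array([np.sum(binom.pmf(x, n_trials, t) / mix)
                         for t in grid]) - len(x)
        t_new = grid[grad.argmax()]
        if grad.max() > 1e-6 and t_new not in support:                # add a vertex
            support.append(t_new)
            weights = np.append(weights * 0.9, 0.1)
            f = np.array([binom.pmf(x, n_trials, t) for t in support])
        # one EM step on the current support to increase the likelihood
        r = (weights[:, None] * f) / (weights @ f + 1e-300)
        weights = r.sum(1) / len(x)
    return np.array(support), weights

rng = np.random.default_rng(5)
p_true = rng.choice([0.2, 0.6], size=400, p=[0.7, 0.3])  # heterogeneous capture probs
x = rng.binomial(10, p_true)
support, w = npmle_binom_mixture(x, n_trials=10)
print(np.round(support[w > 0.01], 2), np.round(w[w > 0.01], 3))
```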

6.
The stochastic nature of high-throughput screening (HTS) data suggests that information may be gleaned by applying statistical methods to HTS data. A foundation of parametric statistics is the study and elucidation of population distributions, which can be modeled using modern spreadsheet software. The methods and results described here use fundamental concepts of statistical population distributions, analyzed using a spreadsheet, to provide tools in a developing armamentarium for extracting information from HTS data. Specific examples using two HTS kinase assays are analyzed. The analyses use normal and gamma distributions, which combine to form mixture distributions. HTS data were found to be well described by such mixture distributions, and deconvolution of the mixtures into their constituent gamma and normal parts provided insight into how the assays performed. In particular, the proportion of hits confirmed was predicted from the original HTS data and used to assess screening assay performance. The analyses also provide a method for determining how hit thresholds (values used to separate active from inactive compounds) affect the proportion of compounds verified as active and how the threshold can be chosen to optimize the selection process.
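A spreadsheet analysis like the one described can be reproduced in code. The sketch below makes several assumptions not in the original: simulated percent-inhibition data, a normal component for inactives and a gamma component for actives, and an illustrative hit threshold of 30. It fits the normal-gamma mixture by direct numerical maximum likelihood and predicts the proportion of true actives among compounds above the threshold.

```python
import numpy as np
from scipy.stats import norm, gamma
from scipy.optimize import minimize

def nll(theta, x):
    """Negative log-likelihood: w * Normal(inactives) + (1-w) * Gamma(actives).
    theta = (logit w, mu, log sd, log shape, log scale)."""
    w = 1 / (1 + np.exp(-theta[0]))
    mu, sd, a, s = theta[1], *np.exp(theta[2:])
    pdf = w * norm.pdf(x, mu, sd) + (1 - w) * gamma.pdf(x, a, scale=s)
    return -np.sum(np.log(pdf + 1e-300))

rng = np.random.default_rng(2)
x = np.concatenate([rng.normal(0, 5, 9000),       # inactive compounds (% inhibition)
                    rng.gamma(3, 15, 1000)])      # active compounds
res = minimize(nll, np.array([2.0, 0.0, np.log(5), np.log(3), np.log(10)]),
               args=(x,), method="Nelder-Mead", options={"maxiter": 20000})
w = 1 / (1 + np.exp(-res.x[0]))
mu, sd, a, s = res.x[1], *np.exp(res.x[2:])

thr = 30.0  # illustrative hit threshold
p_active = (1 - w) * gamma.sf(thr, a, scale=s)
p_inactive = w * norm.sf(thr, mu, sd)
print(f"predicted confirmation rate above {thr}: "
      f"{p_active / (p_active + p_inactive):.2f}")
```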

7.

Background

Mixtures of beta distributions are a flexible tool for modeling data with values on the unit interval, such as methylation levels. However, maximum likelihood parameter estimation with beta distributions suffers from problems because of singularities in the log-likelihood function if some observations take the values 0 or 1.

Methods

While ad-hoc corrections have been proposed to mitigate this problem, we propose a different approach to parameter estimation for beta mixtures where such problems do not arise in the first place. Our algorithm combines latent variables with the method of moments instead of maximum likelihood, which has computational advantages over the popular EM algorithm.

Results

As an application, we demonstrate that methylation state classification is more accurate when using adaptive thresholds from beta mixtures than non-adaptive thresholds on observed methylation levels. We also demonstrate that we can accurately infer the number of mixture components.

Conclusions

The hybrid algorithm between likelihood-based component un-mixing and moment-based parameter estimation is a robust and efficient method for beta mixture estimation. We provide an implementation of the method (“betamix”) as open source software under the MIT license.
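Here is a minimal re-sketch of the idea as described in the abstract: an EM-like iteration in which the E-step computes responsibilities from the current beta densities and the M-step is replaced by weighted method-of-moments updates. This is written from the abstract, not from the betamix source, and it assumes observations strictly inside (0, 1).

```python
import numpy as np
from scipy.stats import beta as beta_dist

def mom(m, v):
    """Method-of-moments (alpha, beta) from a mean and variance on (0, 1)."""
    c = m * (1 - m) / v - 1
    return m * c, (1 - m) * c

def fit_beta_mixture(x, k=2, iters=200):
    q = np.quantile(x, np.linspace(0, 1, k + 1))          # crude quantile init
    params = [mom(x[(x >= q[i]) & (x <= q[i + 1])].mean(),
                  x[(x >= q[i]) & (x <= q[i + 1])].var()) for i in range(k)]
    w = np.full(k, 1 / k)
    for _ in range(iters):
        # "E-step": responsibilities under the current beta densities
        dens = np.stack([w[j] * beta_dist.pdf(x, *params[j]) for j in range(k)])
        r = dens / (dens.sum(0) + 1e-300)
        # "M-step": weighted method of moments instead of maximum likelihood
        w = r.mean(1)
        for j in range(k):
            m = np.average(x, weights=r[j])
            v = np.average((x - m) ** 2, weights=r[j])
            params[j] = mom(m, v)
    return w, params

rng = np.random.default_rng(6)
x = np.concatenate([rng.beta(1.5, 12, 2000),   # unmethylated-like component
                    rng.beta(10, 2, 1000)])    # methylated-like component
print(fit_beta_mixture(x))
```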

8.
Sergeev AS, Arapova RK. Genetika. 2002;38(3):407-418.
Estimation of gametic frequencies in multilocus polymorphic systems from the numerical distribution of multilocus genotypes in a population sample ("analysis without pedigrees") is difficult because some gametes cannot be recognized in the data obtained. Even in the case of codominant systems, where all alleles can be recognized from genotypes, so that direct estimation of gene (allele) frequencies is possible ("complete data"), estimation of the frequencies of multilocus gametes from data on multilocus genotypes is sometimes impossible, whether population data or even family data are used for studying genotypic segregation or linkage analysis ("incomplete data"). Such incomplete data are analyzed on the basis of the corresponding genetic models using the expectation-maximization (EM) algorithm. In this study, an EM algorithm based on the random-marriage model for a nonsubdivided population was used to estimate gametic frequencies. The algorithm places no limitations on the number of loci or on the number of alleles at each locus. Loci and alleles are identified by numbers, which makes it possible to organize loops. In each combination of alleles for a given combination of m out of L loci (L being the total number of loci studied), the chosen alleles are assigned the value 1 and the remaining alleles the value 0. The sum of the zeros and ones for each gamete is its gametic value (h), and the sum of the gametic values of the gametes that form a given genotype is the genotypic value (g) of that genotype. Gametes with the same h are then united into a single class, which reduces the number of estimated parameters. In the general case of m loci, this procedure yields m + 1 classes of gametes and 2m + 1 classes of genotypes with genotypic values g = 0, 1, 2, ..., 2m. The unknown frequencies of the m + 1 classes of gametes can be represented as functions of the gametic frequencies whose maximum likelihood estimates (MLEs) have been obtained in all previous EM procedures and of the only unknown frequency (Pm(m)) that is to be estimated in the given EM procedure. At the expectation step, the expected frequencies (Fm(g)) of the genotypes with genotypic values g are expressed in terms of products of the frequencies of the m + 1 classes of gametes. The genotype data are the numbers (ng) of individuals with genotypic values g = 0, 1, 2, 3, ..., 2m. The maximization step maximizes the logarithm of the likelihood function (LLF) for the ng values. Thus, in each case the EM algorithm reduces to the solution of a single equation in one unknown parameter using the ng values, i.e., the numbers of individuals after the corresponding regrouping of the genotype data. Treatment of Kurbatova's data on the MNSs and Rhesus systems (with alleles C, Cw, c, D, d, E, e), using both Weir's EM algorithm and the EM algorithm suggested here, yielded similar results. However, the MLEs obtained with either algorithm often converged to a wrong solution: the sum of the frequencies of all gametes (4 and 12 gametes for MNSs and Rhesus, respectively) was not equal to 1.0 even when the global maximum of the LLF was reached (as it was for MNSs with Weir's EM algorithm), with each parameter falling within admissible limits (e.g., [0, min(PN, Ps)] for PNs).
The chi-squared statistic is suggested as a goodness-of-fit function for the distribution of genotypes in a sample in order to select acceptable solutions. However, the minimum of this function guarantees the acceptability of a solution only if all constraints on the parameters are met: the sum of the estimated gametic frequencies is 1.0, each frequency falls within its admissible limits, and the "gametic algebra" is complied with (none of the frequencies is negative).

9.
This paper presents various algorithmic approaches for computing the maximum likelihood estimator of the mixing distribution of a one-parameter family of densities and provides a unifying, computer-oriented concept for the statistical analysis of unobserved heterogeneity (i.e., observations stemming from different subpopulations) in a univariate sample. Both the case of an unknown number of population subgroups and the case of a known number, with emphasis on the former, are handled by the computer package C.A.MAN (Computer Assisted Mixture Analysis). It includes an algorithmic menu with choices of the EM algorithm, the vertex-exchange algorithm, a combination of both, and the vertex-direction method. To ensure reliable convergence, a step-length menu is provided for the three latter methods, each achieving monotonicity for the direction of choice. C.A.MAN also has the option to work with restricted support size, that is, the case in which the number of components is known a priori; in this case, the EM algorithm is used. Applications of mixture modeling to medical problems are discussed.

10.
Mandal S, Qin J, Pfeiffer RM. Biometrics. 2023;79(3):1701-1712.
We propose and study a simple and innovative non-parametric approach to estimating the age-of-onset distribution for a disease from a cross-sectional sample of the population that includes individuals with prevalent disease. First, we estimate the joint distribution of two event times: the age of disease onset and the survival time after disease onset. We accommodate the fact that individuals had to be alive at the time of the study by conditioning on their survival until the age at sampling. We propose a computationally efficient expectation-maximization (EM) algorithm and derive the asymptotic properties of the resulting estimates. From these joint probabilities we then obtain non-parametric estimates of the age-at-onset distribution by marginalizing over the survival time from disease onset to death. The method accommodates categorical covariates and can be used to obtain unbiased estimates of the covariate distribution in the source population. We show in simulations that our method performs well in finite samples, even under large amounts of truncation for prevalent cases. We apply the proposed method to data from female participants in the Washington Ashkenazi Study to estimate the age-at-onset distribution of breast cancer associated with carrying BRCA1 or BRCA2 mutations.

11.
A technique for fitting mixture distributions to phenylthiocarbamide (PTC) sensitivity is described. Under the assumption of Hardy-Weinberg equilibrium, a mixture of three normal components is postulated for the observed distribution, with the mixing parameters corresponding to the proportions of the three genotypes associated with two alleles, A and a, acting at a single locus. The corresponding genotypes AA, Aa, and aa are considered to have separate means and variances. This paper is concerned with estimating the parameters of the model, and their standard errors, using an application of the EM algorithm. The technique also caters for the fact that the sensitivity measurements are known only to lie between the endpoints of certain intervals, exact measurement of the attribute being impossible.
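The genetic constraint can be built into an EM iteration directly: the mixing weights are (p^2, 2p(1-p), (1-p)^2) and only the allele frequency p is updated. The sketch below is a simplification that treats measurements as exact rather than interval-censored, and all data and starting values are simulated assumptions.

```python
import numpy as np
from scipy.stats import norm

def em_hwe_mixture(x, iters=500):
    """EM for a 3-component normal mixture whose weights obey
    Hardy-Weinberg proportions (p^2, 2p(1-p), (1-p)^2).
    Sketch only: ignores the interval censoring handled in the paper."""
    p = 0.5
    mu = np.quantile(x, [0.25, 0.5, 0.75])
    sd = np.full(3, x.std())
    for _ in range(iters):
        w = np.array([p ** 2, 2 * p * (1 - p), (1 - p) ** 2])
        dens = w[:, None] * norm.pdf(x, mu[:, None], sd[:, None])  # (3, N)
        r = dens / (dens.sum(0) + 1e-300)
        n = r.sum(1)
        p = (2 * n[0] + n[1]) / (2 * len(x))      # allele-frequency update
        mu = (r @ x) / n
        sd = np.sqrt((r * (x - mu[:, None]) ** 2).sum(1) / n)
    return p, mu, sd

rng = np.random.default_rng(7)
p_true = 0.45
geno = rng.choice(3, size=2000,
                  p=[p_true**2, 2*p_true*(1-p_true), (1-p_true)**2])
x = rng.normal(np.array([2.0, 5.0, 8.0])[geno], 1.0)   # simulated sensitivities
print(em_hwe_mixture(x))
```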

12.
Tail moments in the single-cell gel electrophoresis (comet) assay usually do not follow a normal distribution, which complicates statistical analysis. Researchers have used a wide variety of statistical techniques in attempts to overcome this problem. In many cases, the tail moments follow a bimodal distribution that can be modeled with a mixture of gamma distributions; this bimodality may be due to cells being in two different stages of the cell cycle at the time of treatment. Maximum likelihood, modified to accommodate censored data, can be used to estimate the five parameters of the gamma mixture distribution for each slide. A weighted analysis of variance on the parameter estimates for the gamma mixtures can then be performed to determine differences in DNA damage between treatments. These methods were applied to an experiment on the effect of thymidine kinase on DNA damage and repair. Analysis based on the mixture of gamma distributions was found to be more statistically valid, more powerful, and more informative than analysis based on log-transformed tail moments.
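For illustration, the five-parameter gamma mixture can be fit by direct numerical maximum likelihood, as in the sketch below. Unlike the paper's approach it does not accommodate censored tail moments, and the data and starting values are simulated assumptions.

```python
import numpy as np
from scipy.stats import gamma
from scipy.optimize import minimize

def nll(theta, x):
    """Negative log-likelihood of a two-component gamma mixture;
    theta = (logit weight, log shape1, log scale1, log shape2, log scale2)."""
    w = 1 / (1 + np.exp(-theta[0]))
    a1, s1, a2, s2 = np.exp(theta[1:])
    pdf = w * gamma.pdf(x, a1, scale=s1) + (1 - w) * gamma.pdf(x, a2, scale=s2)
    return -np.sum(np.log(pdf + 1e-300))

rng = np.random.default_rng(8)
x = np.concatenate([rng.gamma(2.0, 1.0, 300),     # e.g. cells in one cycle stage
                    rng.gamma(8.0, 2.0, 200)])    # cells in another stage
res = minimize(nll, np.array([0.0, 0.5, 0.0, 2.0, 0.5]), args=(x,),
               method="Nelder-Mead", options={"maxiter": 20000})
w_hat = 1 / (1 + np.exp(-res.x[0]))
print(w_hat, np.exp(res.x[1:]))
```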

13.
Measuring gene expression over time can provide important insights into basic cellular processes. Identifying groups of genes with similar expression time courses is a crucial first step in the analysis. Because biologically relevant groups frequently overlap (genes often have several distinct roles in those cellular processes), this is a difficult problem for classical clustering methods. We use a mixture model to circumvent this principal problem, with hidden Markov models (HMMs) as effective and flexible components. We show that the ensuing estimation problem can be addressed with additional labeled data (partially supervised learning of mixtures) through a modification of the expectation-maximization (EM) algorithm. Good starting points for the mixture estimation are obtained through a modification of Bayesian model merging, which allows us to learn a collection of initial HMMs. We infer groups from mixtures with a simple information-theoretic decoding heuristic that quantifies the level of ambiguity in group assignment. Effectiveness is demonstrated with high-quality annotation data. Because the HMMs we propose capture asynchronous behavior by design, the groups we find are also asynchronous; synchronous subgroups are obtained from a novel algorithm based on Viterbi paths. We show the suitability of our HMM mixture approach on biological and simulated data and through favorable comparison with previous approaches. A software implementation of the method is freely available under the GPL from http://ghmm.org/gql.

14.
Finite mixtures of Gaussian distributions are known to provide an accurate approximation to any unknown density. Motivated by DNA repair studies in which data are collected for samples of cells from different individuals, we propose a class of hierarchically weighted finite mixture models. The modeling framework incorporates a collection of k Gaussian basis distributions, with the individual-specific response densities expressed as mixtures of these bases. To allow heterogeneity among individuals and predictor effects, we model the mixture weights, while treating the basis distributions as unknown but common to all distributions. This results in a flexible hierarchical model for samples of distributions. We consider analysis of variance-type structures and a parsimonious latent factor representation, which leads to simplified inferences on non-Gaussian covariance structures. Methods for posterior computation are developed, and the model is used to select genetic predictors of baseline DNA damage, susceptibility to induced damage, and rate of repair.

15.
Liu M, Lu W, Shao Y. Biometrics. 2006;62(4):1053-1061.
Interval mapping using normal mixture models has been an important tool for analyzing quantitative traits in experimental organisms. When the primary phenotype is time-to-event, it is natural to use survival models, such as Cox's proportional hazards model, instead of normal mixtures to model the phenotype distribution. An extra challenge in modeling time-to-event data is that the underlying population may consist of susceptible and nonsusceptible subjects. In this article, we propose a semiparametric proportional hazards mixture cure model that allows for missing covariates. We discuss applications to quantitative trait loci (QTL) mapping when the primary trait is time-to-event from a population of mixed susceptibility. The model can be used to characterize QTL effects on both susceptibility and the time-to-event distribution, to estimate QTL location, and to naturally incorporate covariate effects of other risk factors. Maximum likelihood estimates of the model parameters, together with their variance estimates, can be obtained numerically using an EM-type algorithm. The proposed methods are assessed by simulations under practical settings and illustrated using a real data set containing survival times of mice after infection with Listeria monocytogenes. An extension to multiple intervals is also discussed.

16.
Nielsen R, Hubisz MJ, Clark AG. Genetics. 2004;168(4):2373-2382.
Most of the available SNP data have eluded valid population genetic analysis because most population genetic methods do not correctly accommodate the special discovery process used to identify SNPs; as a result, the allele frequency distributions in these data are biased by the ascertainment protocol. We show how this problem can be corrected by obtaining maximum-likelihood estimates of the true allele frequency distribution. In simple cases, the ML estimate can be obtained analytically, but in other cases computational methods based on numerical optimization or the EM algorithm must be used. We illustrate the new correction method by analyzing previously published SNP data from the SNP Consortium. Appropriate treatment of SNP ascertainment is vital to our ability to make correct inferences from the data of the International HapMap Project.
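For the simple analytic case, the correction has a closed form. The sketch below makes an assumption about the design: the discovery panel of d chromosomes is a subset of the n genotyped chromosomes, so a SNP with i derived copies is ascertained with probability 1 - C(i,d)/C(n,d) - C(n-i,d)/C(n,d); dividing the observed spectrum by this probability and renormalizing recovers the true spectrum. The more general designs treated in the paper require numerical optimization or EM.

```python
import numpy as np
from scipy.special import comb

def correct_sfs(observed, n, d):
    """Divide the observed site-frequency spectrum by the ascertainment
    probability and renormalize (panel-as-subset case only; an assumption,
    not the paper's general method)."""
    i = np.arange(1, n)
    p_asc = 1 - comb(i, d) / comb(n, d) - comb(n - i, d) / comb(n, d)
    corrected = observed / p_asc
    return corrected / corrected.sum()

n, d = 20, 4
i = np.arange(1, n)
true_sfs = (1 / i) / (1 / i).sum()               # neutral expectation, for checking
p_asc = 1 - comb(i, d) / comb(n, d) - comb(n - i, d) / comb(n, d)
observed = true_sfs * p_asc
observed /= observed.sum()
print(np.allclose(correct_sfs(observed, n, d), true_sfs))   # recovers the truth
```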

17.
Matrix population models are a standard tool for studying stage-structured populations, but they are not flexible in describing stage duration distributions. This study describes a method for modeling various such distributions in matrix models. The method uses a mixture of two negative binomial distributions (parametrized using a maximum likelihood method) to approximate a target (true) distribution. To examine the performance of the method, populations consisting of two life stages (juvenile and adult) were considered. The juvenile duration distribution followed a gamma distribution, lognormal distribution, or zero-truncated (over-dispersed) Poisson distribution, each of which represents a target distribution to be approximated by a mixture distribution. The true population growth rate based on a target distribution was obtained using an individual-based model, and the extent to which matrix models can approximate the target dynamics was examined. The results show that the method generally works well for the examined target distributions but is prone to biased predictions under some conditions. In addition, the method uniformly outperforms an existing method whose performance was also examined for comparison. Other details regarding parameter estimation and model development are also discussed.
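The fitting step can be sketched as follows: draw integer stage durations from a target distribution, then estimate a two-component negative binomial mixture by numerical maximum likelihood. The target gamma parameters and starting values are illustrative assumptions, and the sketch omits the zero-truncated forms and matrix-model construction described in the paper.

```python
import numpy as np
from scipy.stats import nbinom, gamma
from scipy.optimize import minimize

rng = np.random.default_rng(9)
# Hypothetical target: juvenile durations from a (discretized) gamma distribution.
dur = np.round(gamma.rvs(9, scale=1.2, size=5000, random_state=rng)).astype(int)

def nll(theta, x):
    """Two-component negative binomial mixture;
    theta = (logit w, log r1, logit p1, log r2, logit p2)."""
    w = 1 / (1 + np.exp(-theta[0]))
    r1, r2 = np.exp(theta[1]), np.exp(theta[3])
    p1, p2 = 1 / (1 + np.exp(-theta[2])), 1 / (1 + np.exp(-theta[4]))
    pmf = w * nbinom.pmf(x, r1, p1) + (1 - w) * nbinom.pmf(x, r2, p2)
    return -np.sum(np.log(pmf + 1e-300))

theta0 = np.array([0.0, np.log(2.0), 0.0, np.log(20.0), 0.0])  # asymmetric start
res = minimize(nll, theta0, args=(dur,), method="Nelder-Mead",
               options={"maxiter": 20000})
print(res.x)
```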

18.
Eutectic mixtures formed between active pharmaceutical ingredients and/or excipients offer broad scope for pharmaceutical applications. This study aimed to explore the crystallization behavior of two eutectic mixtures (EMs): lidocaine-tetracaine and lidocaine-camphor (1:1 w/w). Thermogravimetric analysis (TGA) was used to characterize degradation behavior, while modulated temperature differential scanning calorimetry (MTDSC), run as first-heating, cooling, and second-heating cycles, was used to qualitatively analyze the complex exothermic and endothermic thermal transitions. Raman microspectroscopy characterized vibrational information specific to chemical bonds. The prepared EMs were left at room temperature for 24 h to visually examine their crystallization potential. Degradation of lidocaine, tetracaine, camphor, lidocaine-tetracaine EM, and lidocaine-camphor EM began at 196.56, 163.82, 76.86, 146.01, and 42.72°C, respectively, indicating that the eutectic mixtures are less thermostable than their individual components. MTDSC showed crystallization peaks for lidocaine, tetracaine, and camphor at 31.86, 29.36, and 174.02°C, respectively (n = 3). No crystallization peak was observed for the lidocaine-tetracaine EM, whereas the lidocaine-camphor EM showed a crystallization peak at 18.81°C. Crystallization occurred in the lidocaine-camphor EM after 24 h at room temperature, but not in the lidocaine-tetracaine EM. Certain peak shifts observed in the Raman spectra indicated possible interactions between the components when a eutectic mixture was formed. We found that if the components forming a eutectic mixture have crystallization peaks close to each other and sufficient hydrogen-bonding capability, their eutectic mixture is least likely to crystallize out (as seen in the lidocaine-tetracaine EM), and vice versa (lidocaine-camphor EM).
Key words: crystallization, degradation, eutectic mixture, Raman spectroscopy, thermal analysis

19.
The available information on the sample size requirements of mixture analysis methods is insufficient to permit a precise evaluation of the potential problems facing practical applications of mixture analysis. We use results from Monte Carlo simulation to assess the sample size requirements of a simple mixture analysis method under conditions relevant to biological applications. The mixture model used includes two univariate normal components with equal variances but assumes that the researcher is ignorant as to the equality of the variances. The method relies on the EM algorithm to compute the maximum likelihood estimates of the mixture parameters and on the likelihood ratio test to assess the number of components in the mixture. Our results suggest that sample sizes close to 500 or 1000 observations may be required to adequately resolve mixtures commonly found in biology. Such sample sizes are difficult to achieve; however, this mixture analysis method may be a reasonable option when the researcher faces problems that are intractable by other means.
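A minimal version of such a Monte Carlo power study appears below. Because the likelihood ratio test for the number of mixture components has a nonstandard null distribution, the 5% critical value is itself simulated under the single-component null. The component means, the separation of 1.5 SD, and the replication counts are illustrative assumptions, and the unequal-variance fit reflects a researcher "ignorant" of variance equality.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def lrt(x):
    """LRT statistic for 1 vs. 2 normal components via EM fits."""
    z = x.reshape(-1, 1)
    return 2 * len(z) * (GaussianMixture(2, n_init=3).fit(z).score(z)
                         - GaussianMixture(1).fit(z).score(z))

rng = np.random.default_rng(10)
reps = 100  # illustrative; more replications tighten the estimates
for n in (100, 500, 1000):
    # simulate the null (one component) to get a 5% critical value
    null = np.sort([lrt(rng.normal(0.0, 1.0, n)) for _ in range(reps)])
    crit = null[int(0.95 * reps)]
    # power against two components separated by 1.5 SD
    power = np.mean([lrt(np.concatenate([rng.normal(0.0, 1.0, n // 2),
                                         rng.normal(1.5, 1.0, n // 2)])) > crit
                     for _ in range(reps)])
    print(f"n = {n}: power ≈ {power:.2f}")
```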

20.
This paper investigates the prospects of successful mass spectrometric protein identification based on mass data from proteolytic digests of complex protein mixtures. Sets of proteolytic peptide masses representing various numbers of digested proteins in a mixture were generated in silico. In each set, different proteins were selected from a protein sequence collection, and for each protein the sequence coverage was randomly selected within a particular regime (15-30% or 30-60%). We demonstrate that the Probity algorithm, which is characterized by an optimal tolerance for random interference, can, when employed in an iterative procedure, correctly identify >95% of proteins at a desired significance level in mixtures composed of hundreds of yeast proteins under realistic mass spectrometric experimental constraints. Using a model of the distribution of protein abundance, we show that the very high identification efficiency achievable through appropriate informatics procedures is hampered by the limited dynamic range of mass spectrometry. These results stress the need to choose experimental protocols for comprehensive proteome analysis carefully, focusing on truly critical issues such as the dynamic range, which limits the possibility of identifying low-abundance proteins.
