Similar Documents
1.
Hamada M, Kiryu H, Iwasaki W, Asai K. PLoS ONE. 2011;6(2):e16450
In a number of estimation problems in bioinformatics, accuracy measures of the target problem are usually given, and it is important to design estimators that are suited to those accuracy measures. However, there is often a discrepancy between the estimator employed and the given accuracy measure of the problem. In this study, we introduce a general class of efficient estimators for estimation problems on high-dimensional binary spaces, which represent many fundamental problems in bioinformatics. Theoretical analysis reveals that the proposed estimators generally fit commonly used accuracy measures (e.g., sensitivity, PPV, MCC, and F-score), can be computed efficiently in many cases, and cover a wide range of problems in bioinformatics from the viewpoint of the principle of maximum expected accuracy (MEA). It is also shown that several important algorithms in bioinformatics can be interpreted in a unified manner. The concept presented in this paper not only gives a useful framework for designing MEA-based estimators but is also highly extendable and sheds new light on many problems in bioinformatics.
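A minimal sketch of the MEA idea on a binary space (the function name, the gain function, and the threshold derivation are ours, not the paper's full estimator class): predicting 1 at position i yields expected gain γ·p_i, predicting 0 yields 1 − p_i, so the expected gain γ·TP + TN is maximized by thresholding the marginal posterior probabilities at 1/(γ+1).

```python
import numpy as np

def gamma_centroid(p, gamma=1.0):
    """Pointwise MEA-style estimator on a binary space.

    Given marginal posterior probabilities p_i that bit i equals 1,
    maximize the expected gain gamma*TP + TN by predicting 1 exactly
    when p_i > 1/(gamma + 1).  Larger gamma trades PPV for sensitivity,
    illustrating how the estimator can be matched to an accuracy measure.
    Illustrative sketch only.
    """
    p = np.asarray(p, dtype=float)
    return (p > 1.0 / (gamma + 1.0)).astype(int)
```

With gamma = 1 this is the ordinary 0.5-threshold (centroid) predictor; raising gamma lowers the threshold and increases sensitivity.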

2.
Joint classification and pairing of human chromosomes
We reexamine the problems of computer-aided classification and pairing of human chromosomes, and propose to jointly optimize the solutions of these two related problems. The combined problem is formulated as an optimal three-dimensional assignment problem with a maximum likelihood objective function. This formulation poses two technical challenges: 1) estimating the posterior probability that two chromosomes form a pair and that the pair belongs to a class, and 2) designing good heuristic algorithms for the three-dimensional assignment problem, which is NP-hard. We present various techniques to solve these problems. We also generalize our algorithms to cases where the cell data are incomplete, as is often encountered in practice.

3.
Accurate class probability estimation is important for medical decision making but is challenging, particularly when the number of candidate features exceeds the number of cases. Special methods have been developed for nonprobabilistic classification, but relatively little attention has been given to class probability estimation with numerous candidate variables. In this paper, we investigate overfitting in the development of regularized class probability estimators and its relation to accurate class probability estimation in terms of mean square error. Using simulation studies based on real datasets, we found that some degree of overfitting can be desirable for reducing mean square error. We also introduce a mean square error decomposition for class probability estimation that helps clarify the relationship between overfitting and prediction accuracy.
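The bias-variance trade-off behind this abstract can be seen in a toy Monte-Carlo experiment (ours, not the paper's high-dimensional setup): a shrunken probability estimator (x + a)/(n + 2a) is biased away from p = 0.5 but has lower variance than the unbiased x/n, and at moderate n the shrinkage can reduce mean square error.

```python
import numpy as np

def prob_mse(p_true, n, shrink, reps=20000, seed=1):
    """Monte-Carlo mean square error of the shrunken class-probability
    estimator p_hat = (x + shrink) / (n + 2*shrink), where x ~ Bin(n, p).

    shrink = 0 gives the unbiased estimator x/n; shrink > 0 pulls the
    estimate toward 0.5, a mild form of regularization.  Toy sketch of
    the MSE trade-off discussed above.
    """
    rng = np.random.default_rng(seed)
    x = rng.binomial(n, p_true, size=reps)
    p_hat = (x + shrink) / (n + 2 * shrink)
    return float(np.mean((p_hat - p_true) ** 2))
```

At p_true = 0.5 the shrunken estimator is still unbiased, so shrinkage is a pure variance reduction and the MSE strictly improves.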

4.
This paper extends the multilevel survival model by allowing for a cured fraction in the model. Random effects induced by the multilevel clustering structure are specified in the linear predictors of both the hazard function and the cured probability parts. Adopting the generalized linear mixed model (GLMM) approach to formulate the problem, parameter estimation is achieved by maximizing a best linear unbiased prediction (BLUP) type log-likelihood at the initial step of estimation, and is then extended to obtain residual maximum likelihood (REML) estimators of the variance components. The proposed multilevel mixture cure model is applied to analyze (i) child survival study data with multilevel clustering and (ii) chronic granulomatous disease (CGD) data on recurrent infections as illustrations. A simulation study is carried out to evaluate the performance of the REML estimators and assess the accuracy of the standard error estimates.

5.
Considerable attention has been focused on predicting the secondary structure of aligned RNA sequences, since doing so is useful not only for improving the limited accuracy of conventional secondary structure prediction but also for finding non-coding RNAs in genomic sequences. Although many algorithms exist for predicting secondary structure for aligned RNA sequences, further improvement of the accuracy is still awaited. In this article, toward improving the accuracy, a theoretical classification of state-of-the-art algorithms for predicting secondary structure for aligned RNA sequences is presented. The classification is based on the viewpoint of maximum expected accuracy (MEA), which has been successfully applied to various problems in bioinformatics. The classification reveals several disadvantages of the current algorithms, and we propose an improvement of a previously introduced algorithm (CentroidAlifold). Finally, computational experiments strongly support the theoretical classification and indicate that the improved CentroidAlifold substantially outperforms the other algorithms.

6.
The measurement of biallelic pairwise association, called linkage disequilibrium (LD), is an important issue for understanding genomic architecture. A plethora of measures of association in two-by-two tables have been proposed in the literature. Besides the problem of choosing an appropriate measure, the problem of estimating these measures has been neglected in the literature. It needs to be emphasized that the definition of a measure and the choice of an estimator for it are conceptually distinct tasks. In this paper, we compare the performance of various estimators for the three popular LD measures D', r and Y in a simulation study for small to moderate sample sizes (N <= 500). The usual frequency plug-in estimators can lead to unreliable or undefined estimates. Estimators based on the computationally expensive volume measures have been proposed recently as a remedy to this well-known problem. We confirm that volume estimators have better expected mean square error than the naive plug-in estimators, but they are outperformed by estimators that plug easy-to-calculate non-informative Bayesian probability estimates into the theoretical formulae for the measures. Fully Bayesian estimators with non-informative Dirichlet priors have comparable accuracy but are computationally more expensive. We recommend the use of non-informative Bayesian plug-in estimators based on Jeffreys' prior, particularly when dealing with SNP array data, where small table entries and table margins are likely to occur.
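The Bayesian plug-in idea is easy to sketch for D' and r (the Y measure and the fully Bayesian variants are omitted; the function name and interface are ours): replace raw cell frequencies with posterior-mean probabilities under a Jeffreys-type Dirichlet prior, i.e. add 1/2 to every cell of the 2x2 haplotype table, before evaluating the theoretical formulae.

```python
import numpy as np

def ld_measures(n11, n12, n21, n22, prior=0.0):
    """Estimate the LD measures D' and r from a 2x2 haplotype count table.

    prior = 0.0 gives the naive frequency plug-in estimator, which can be
    unreliable or undefined (division by zero) when a cell or margin is
    empty; prior = 0.5 plugs in posterior-mean probabilities under a
    Jeffreys (Dirichlet(1/2,...,1/2)) prior, which avoids that problem.
    Illustrative sketch of the plug-in principle discussed above.
    """
    counts = np.array([n11, n12, n21, n22], dtype=float) + prior
    p11, p12, p21, p22 = counts / counts.sum()
    pA, pB = p11 + p12, p11 + p21        # marginal allele frequencies
    D = p11 - pA * pB                    # raw disequilibrium coefficient
    r = D / np.sqrt(pA * (1 - pA) * pB * (1 - pB))
    # normalizing constant for D' depends on the sign of D
    Dmax = min(pA * (1 - pB), (1 - pA) * pB) if D > 0 else \
           min(pA * pB, (1 - pA) * (1 - pB))
    Dprime = D / Dmax if Dmax > 0 else 0.0
    return float(Dprime), float(r)
```

For a table in perfect LD the naive estimator returns D' = r = 1, while the Jeffreys plug-in shrinks both toward zero, reflecting the finite sample size.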

7.
When the sample size is not large or when the underlying disease is rare, to assure collection of an appropriate number of cases and to control the relative error of estimation, one may employ inverse sampling, in which one continues sampling subjects until exactly the desired number of cases is obtained. This paper focuses on interval estimation of the simple difference between two proportions under independent inverse sampling. It develops three asymptotic interval estimators on the basis of the maximum likelihood estimator (MLE), the uniformly minimum variance unbiased estimator (UMVUE), and the asymptotic likelihood ratio test (ALRT). To compare the performance of these three estimators, the coverage probability and the expected length of the resulting confidence intervals are calculated on the basis of the exact distribution. This paper finds that when the underlying proportions of cases in both comparison populations are small or moderate (≤0.20), all three asymptotic interval estimators developed here perform reasonably well even for a pre-determined number of cases as small as 5. When the pre-determined number of cases is moderate or large (≥50), all three estimators are essentially equivalent in all the situations considered here. Because application of the two interval estimators derived from the MLE and the UMVUE does not involve the numerical iterative procedure needed for the ALRT, for simplicity we may use these two estimators without losing efficiency.

8.
Multistate models can successfully be used for describing complex event history data, for example, stages in the disease progression of a patient. The so-called "illness-death" model plays a central role in the theory and practice of these models. Many time-to-event datasets from medical studies with multiple end points can be reduced to this generic structure. In these models, one important goal is the modeling of transition rates, but biomedical researchers are also interested in reporting interpretable results in a simple and summarized manner. These include estimates of predictive probabilities, such as the transition probabilities, occupation probabilities, cumulative incidence functions, and the sojourn time distributions. We give a review of some of the available methods for estimating such quantities in the progressive illness-death model, conditionally (or not) on covariate measures. For some of these quantities, estimators based on subsampling are employed. Subsampling, also referred to as landmarking, leads to small sample sizes and usually to heavily censored data, yielding estimators with higher variability. To overcome this issue, estimators based on a preliminary estimation (presmoothing) of the probability of censoring may be used. Among these, the presmoothed estimators for the cumulative incidences are new. We also introduce feasible estimation methods for the cumulative incidence function conditionally on covariate measures. The proposed methods are illustrated using real data. A comparative simulation study of several estimation approaches is performed, and existing software in the form of R packages is discussed.
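The landmarking (subsampling) idea is simple to sketch in the fully observed case (no censoring; function name and interface are ours, and the censoring-presmoothing machinery of the abstract is omitted): restrict attention to subjects still healthy at the landmark time s and take the empirical proportion that is ill and alive at time t.

```python
import numpy as np

def landmark_transition_prob(t_ill, t_death, s, t):
    """Landmark (subsampling) estimator of
    P(alive and ill at t | healthy at s) in a progressive
    illness-death model, assuming fully observed (uncensored) times.

    t_ill   : time of illness onset per subject (inf if never ill)
    t_death : time of death per subject
    """
    t_ill = np.asarray(t_ill, dtype=float)
    t_death = np.asarray(t_death, dtype=float)
    at_risk = (t_ill > s) & (t_death > s)    # healthy at landmark s
    ill_at_t = (t_ill <= t) & (t_death > t)  # ill and alive at t
    return float((at_risk & ill_at_t).sum() / at_risk.sum())
```

With censored data this subsample shrinks and the estimator becomes more variable, which is exactly the motivation for the presmoothed estimators the abstract describes.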

9.
Fetal loss often precludes the ascertainment of infection status in studies of perinatal transmission of HIV. The standard analysis based on liveborn babies can result in biased estimation and invalid inference in the presence of fetal death. This paper focuses on the problem of estimating treatment effects on mother-to-child transmission when infection status is unknown for some babies. Minimal data structures for identifiability of parameters are given. Methods using the full likelihood and inverse-probability-of-selection-weighted estimators are suggested. Simulation studies show that these estimators perform well in finite samples. The methods are applied to data from a clinical trial in Dar es Salaam, Tanzania. To validly estimate the treatment effect using likelihood methods, investigators should make sure that the design includes a mini-study among uninfected mothers and that efforts are made to ascertain the infection status of as many of the lost babies as possible. The inverse probability weighting methods require precise estimation of the probability of observing infection status. Our methodology can further be applied to the study of other vertically transmissible infections which are potentially fatal pre- and perinatally.
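The inverse-probability-of-selection weighting can be sketched in a few lines (a generic IPW sketch under the abstract's missing-at-random-given-covariates setting; the function name and inputs are ours): babies whose status was ascertained are up-weighted by the inverse of their ascertainment probability, restoring the composition of the full cohort.

```python
import numpy as np

def ipw_transmission_rate(infected, observed, pi_obs):
    """Inverse-probability-of-selection-weighted estimate of the
    transmission probability when infection status is missing for
    some babies.

    infected : infection indicator (only used where observed is True)
    observed : True if the baby's infection status was ascertained
    pi_obs   : estimated probability of ascertainment per baby
    """
    infected = np.asarray(infected, dtype=float)
    observed = np.asarray(observed, dtype=bool)
    w = 1.0 / np.asarray(pi_obs, dtype=float)   # selection weights
    return float(np.sum(w[observed] * infected[observed]) / np.sum(w[observed]))
```

When ascertainment is complete (all pi_obs = 1) the estimator reduces to the ordinary sample proportion, as it should.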

10.
Pedigree reconstruction using genotypic markers has become an important tool for the study of natural populations. The nonstandard nature of the underlying statistical problems has led to the necessity of developing specialized statistical and computational methods. In this article, a new version of pedigree reconstruction tools (PRT 2.0) is presented. The software implements algorithms proposed in Almudevar & Field (Journal of Agricultural Biological and Environmental Statistics, 4, 1999, 136) and Almudevar (Biometrics, 57, 2001a, 757) for the reconstruction of single generation sibling groups (SG). A wider range of enumeration algorithms is included, permitting improved computational performance. In particular, an iterative version of the algorithm designed for larger samples is included in a fully automated form. The new version also includes expanded simulation utilities, as well as extensive reporting, including half-sibling compatibility, parental genotype estimates and flagging of potential genotype errors. A number of alternative algorithms are described and demonstrated. A comparative discussion of the underlying methodologies is presented. Although important aspects of this problem remain open, we argue that a number of methodologies including maximum likelihood estimation (COLONY 1.2 and 2.0) and the set cover formulation (KINALYZER) exhibit undesirable properties in the sibling reconstruction problem. There is considerable evidence that large sets of individuals not genetically excluded as siblings can be inferred to be a true sibling group, but it is also true that unrelated individuals may be genetically compatible with a true sibling group by chance. Such individuals may be identified on a statistical basis. PRT 2.0, based on these sound statistical principles, is able to efficiently match or exceed the highest reported accuracy rates, particularly for larger SG. 
The new version is available at http://www.urmc.rochester.edu/biostat/people/faculty/almudevar.cfm.
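The genetic-exclusion test underlying sibling reconstruction can be illustrated at a single locus by brute force (a toy version written by us, not PRT's algorithm): a set of genotypes is compatible with full sibship if some pair of parental genotypes, built from the observed alleles, can produce every child genotype.

```python
from itertools import combinations_with_replacement, product

def sibgroup_compatible(genotypes):
    """Brute-force check that a list of single-locus genotypes
    (each a 2-tuple of allele labels) is consistent with a full-sibling
    group: some pair of parental genotypes produces all of them.

    Toy illustration of Mendelian exclusion; real sibling
    reconstruction combines many loci and statistical evidence.
    """
    alleles = sorted({a for g in genotypes for a in g})
    parents = list(combinations_with_replacement(alleles, 2))
    for p1, p2 in product(parents, repeat=2):
        # all genotypes this parental pair can transmit
        offspring = {tuple(sorted((a, b))) for a in p1 for b in p2}
        if all(tuple(sorted(g)) in offspring for g in genotypes):
            return True
    return False
```

This also illustrates the abstract's caveat: unrelated individuals can pass such a compatibility check by chance, which is why PRT supplements exclusion with statistical criteria.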

11.
MINQUE (Minimum Norm Quadratic Unbiased Estimation) theory is applied to the problem of estimating variance components in family data (siblings) with variable family size. Using this approach, the traditional iterative maximum likelihood estimators are shown to be asymptotically normal, even though the data come from non-identical parent distributions. Asymptotic expressions are also obtained for the variance of the MINQUE estimators, which hold even if the data are decidedly non-normal (e.g. a mixture of normals). In the case of normal data, exact small-sample variance estimates are derived. Simulations demonstrate the fast rate of convergence to asymptotic properties as the number of families increases. These desirable qualities suggest that the easy-to-compute MINQUE class of estimators may provide a useful alternative method for modelling familial aggregation.

12.
In this paper, we consider several variations of the following basic tiling problem: given a sequence of real numbers and two size-bound parameters, we want to find a set of tiles of maximum total weight such that each tile satisfies the size bounds. A solution to this problem is important for a number of computational biology applications, such as selecting genomic DNA fragments for PCR-based amplicon microarrays and performing homology searches with long sequence queries. Our goal is to design efficient algorithms with linear or near-linear time and space in the normal range of parameter values for these problems. For this purpose, we first discuss the solution to a basic online interval maximum problem via a sliding-window approach and show how to use this solution in a nontrivial manner for many of the tiling problems introduced. We also discuss NP-hardness results and approximation algorithms for generalizations of our basic tiling problem to higher dimensions. Finally, computational results from applying our tiling algorithms to genomic sequences of five model eukaryotes are reported.
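The sliding-window interval-maximum primitive the abstract refers to can be implemented in O(n) total time with a monotonically decreasing deque of indices (a standard technique; this sketch is ours, not the paper's exact routine):

```python
from collections import deque

def sliding_window_max(a, w):
    """Maximum of every length-w window of sequence a, in O(n) time.

    Maintains a deque of indices whose values are strictly decreasing;
    the front of the deque is always the maximum of the current window.
    This is the online interval-maximum primitive that sliding-window
    tiling algorithms build on.
    """
    dq, out = deque(), []
    for i, x in enumerate(a):
        while dq and a[dq[-1]] <= x:   # drop elements dominated by x
            dq.pop()
        dq.append(i)
        if dq[0] <= i - w:             # front index fell out of the window
            dq.popleft()
        if i >= w - 1:
            out.append(a[dq[0]])
    return out
```

Each index is pushed and popped at most once, giving amortized O(1) work per element.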

13.
This paper discusses interval estimation of the simple difference (SD) between the proportion of primary infection and the proportion of secondary infection, given the primary infection, by developing three asymptotic interval estimators using Wald's test statistic, the likelihood ratio test, and the basic principle of Fieller's theorem. The paper further evaluates and compares the performance of these interval estimators with respect to the coverage probability and the expected length of the resulting confidence intervals. It finds that the asymptotic confidence interval using the likelihood ratio test consistently performs well in all situations considered here. When the underlying SD is within 0.10 and the total number of subjects is not large (say, 50), the interval estimators using Fieller's theorem would be preferable to the estimator using Wald's test statistic if the primary infection probability were moderate (say, 0.30), but the latter is preferable to the former if this probability were large (say, 0.80). When the total number of subjects is large (say, ≥200), all three interval estimators perform well in almost all situations considered in this paper. In these cases, for simplicity, we may apply either of the two interval estimators using Wald's test statistic or Fieller's theorem without losing much accuracy and efficiency compared with the interval estimator using the asymptotic likelihood ratio test.

14.
This paper studies the application of evolutionary algorithms to the bi-objective travelling salesman problem. Two classes of evolutionary algorithms are considered: estimation of distribution algorithms (EDAs) and genetic algorithms (GAs). The solution to this problem is a set of trade-off alternatives. The problem is solved by optimizing the order of the cities so as to simultaneously minimize the two objectives of travelling distance and travelling cost incurred by the travelling salesman. In this paper, binary-representation-based evolutionary algorithms are replaced with an integer representation. Three existing EDAs are altered to use this integer representation, namely the restricted Boltzmann machine (RBM), the univariate marginal distribution algorithm (UMDA), and population-based incremental learning (PBIL). Each city is associated with a representative integer, and the probability of each representative integer being located in any position of the chromosome is captured by the probabilistic model of the EDAs. New sequences of cities are obtained by sampling from the probabilistic model. A refinement operator and a local search operator are proposed in this work. The EDAs are subsequently hybridized with the GA in order to complement the limitations of both algorithms. The effect that each of these operators has on the quality of the solutions is investigated. Empirical results show that the hybrid algorithms are capable of finding a set of good trade-off solutions.
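The position-by-city probability modeling can be sketched with a minimal single-objective UMDA (our simplification: one objective, no refinement, local search, or GA hybridization; all names are ours): a matrix P[pos, city] is re-estimated from the elite tours each generation, and new tours are sampled from it while renormalizing over the cities not yet placed.

```python
import numpy as np

def umda_tsp(dist, pop=100, elite=20, iters=50, seed=0):
    """Minimal UMDA-style EDA for a single-objective TSP, illustrating
    the integer-representation modeling described above.

    dist : n x n distance matrix (list of lists or array)
    Returns the best tour found and its length.  Sketch only.
    """
    rng = np.random.default_rng(seed)
    n = len(dist)
    P = np.full((n, n), 1.0 / n)        # P[pos, city]

    def tour_len(t):
        return sum(dist[t[i]][t[(i + 1) % n]] for i in range(n))

    best, best_len = None, float("inf")
    for _ in range(iters):
        tours = []
        for _ in range(pop):
            avail, t = np.ones(n, dtype=bool), []
            for pos in range(n):
                p = P[pos] * avail      # mask cities already placed
                c = rng.choice(n, p=p / p.sum())
                t.append(int(c))
                avail[c] = False
            tours.append(t)
        tours.sort(key=tour_len)
        if tour_len(tours[0]) < best_len:
            best, best_len = tours[0], tour_len(tours[0])
        # M-step: re-estimate the position-by-city model from the elite
        P = np.full((n, n), 1e-6)       # small floor avoids zero columns
        for t in tours[:elite]:
            for pos, c in enumerate(t):
                P[pos, c] += 1.0
        P /= P.sum(axis=1, keepdims=True)
    return best, best_len
```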

15.
Many applications of data partitioning (clustering) have been well studied in bioinformatics. Consider, for instance, a set N of organisms (elements) based on DNA marker data. A partition divides all elements in N into two or more disjoint clusters that cover all elements, where each cluster contains a non-empty subset of N. Different partitioning algorithms may produce different partitions. Computing the distance between, and finding the consensus partition (also called consensus clustering) of, two or more partitions are important and interesting problems that arise frequently in bioinformatics and data mining, in which different distance functions may be considered in different partition algorithms. In this article, we discuss the k partition-distance problem. Given a set of elements N with k partitions of N, the k partition-distance problem is to delete the minimum number of elements from each partition such that all remaining partitions become identical. This problem is NP-complete for general k > 2 partitions, and no algorithms were previously known. We design the first known heuristic and approximation algorithms with performance ratio 2 to solve the k partition-distance problem in O(k · ρ · |N|) time, where ρ is the maximum number of clusters of these k partitions and |N| is the number of elements in N. We also present the first known exact algorithm in O(ℓ · 2^ℓ · k² · |N|²) time, where ℓ is the partition-distance of the optimal solution for this problem. Performances of our exact and approximation algorithms in testing random data against actual sets of organisms based on DNA markers are compared and discussed. Experimental results reveal that our algorithms can improve the computational speed of the exact algorithm for the two partition-distance problem in practice if the maximum number of elements per cluster is less than ρ. From both theoretical and computational points of view, our solutions are at most twice the partition-distance of the optimal solution.
A website offering an interactive service for solving the k partition-distance problem with our algorithms and previous ones is available (see http://mail.tmue.edu.tw/~yhchen/KPDP.html).

16.
The multichannel recordings of the signals of many cells cultivated on a multielectrode array (MEA) pose some challenging problems. A classic problem is the separation of the recordings of a single electrode into classes of recordings where each class is caused by a single cell; this is the well-known spike sorting. A "dual" problem is the determination of the set of electrodes that record signals of a single cell. This set is called the neighborhood of the cell and often has more than one element if the MEA has a large number of electrodes at high density. A method for the reconstruction of the neighborhoods from the multichannel recordings is presented. Special effort is directed to precise peak detection. For the evaluation of the algorithm, artificial data obtained from an appropriate model of MEA recordings are used. Because the artificial data provide a ground truth, an evaluation of the accuracy of the algorithm is possible. The algorithm works well for realistic parameters.
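A minimal stand-in for the peak-detection step (our simplification, not the paper's precise method): accept a sample as a spike peak if it exceeds a threshold, is a local maximum, and no peak was accepted within a refractory window.

```python
import numpy as np

def detect_peaks(signal, thresh, refractory=3):
    """Threshold-crossing peak detector for a single recording channel.

    A sample i is a peak if signal[i] > thresh, it is a local maximum
    relative to its neighbors, and at least `refractory` samples have
    passed since the last accepted peak.  Toy sketch for illustration.
    """
    x = np.asarray(signal, dtype=float)
    peaks, last = [], -10**9
    for i in range(1, len(x) - 1):
        if (x[i] > thresh and x[i] >= x[i - 1] and x[i] > x[i + 1]
                and i - last > refractory):
            peaks.append(i)
            last = i
    return peaks
```

Real MEA pipelines add filtering, adaptive thresholds (e.g. noise-level estimates), and sub-sample peak alignment before spikes are compared across electrodes to form neighborhoods.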

17.
In this article, we study the estimation of the mean response and the regression coefficient in semiparametric regression problems when the response variable is subject to nonrandom missingness. When the missingness is independent of the response conditional on high-dimensional auxiliary information, the parametric approach may misspecify the relationship between covariates and response, while the nonparametric approach is infeasible because of the curse of dimensionality. To overcome this, we study a model-based approach to condense the auxiliary information and estimate the parameters of interest nonparametrically on the condensed covariate space. Our estimators possess the double robustness property, i.e., they are consistent whenever the model for the response given auxiliary covariates or the model for the missingness given auxiliary covariates is correct. We conduct a number of simulations to compare the numerical performance of our estimators with other existing estimators in the current missing data literature, including the propensity score approach and the inverse probability weighted estimating equation. A set of real data is used to illustrate our approach.
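The double robustness property can be illustrated with the standard augmented-IPW estimator of a mean under missingness (a generic sketch, not the paper's dimension-reduction estimator; the function name and inputs are ours): it combines an outcome-model prediction with an inverse-probability correction, and is consistent if either component model is correct.

```python
import numpy as np

def doubly_robust_mean(y, observed, mu_hat, pi_hat):
    """Augmented IPW (doubly robust) estimator of E[Y] with missing Y.

    y        : outcomes (values at unobserved positions are never used)
    observed : 1 if Y was observed, 0 otherwise
    mu_hat   : outcome-model predictions E[Y | X] for every unit
    pi_hat   : missingness-model probabilities P(observed | X)

    Consistent whenever mu_hat or pi_hat comes from a correctly
    specified model -- the double robustness property.
    """
    y = np.asarray(y, dtype=float)
    r = np.asarray(observed, dtype=float)
    mu = np.asarray(mu_hat, dtype=float)
    pi = np.asarray(pi_hat, dtype=float)
    y0 = np.where(r == 1, y, 0.0)            # mask unobserved outcomes
    return float(np.mean(r * y0 / pi - (r - pi) / pi * mu))
```

With fully observed data (all pi_hat = 1) the augmentation term vanishes and the estimator reduces to the sample mean.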

18.
By introducing initial proportion factors for the regions, we consider a closed-population mark-recapture model for two regions, A and B. Using the full maximum likelihood function and properties of the multinomial distribution, we derive the transition probabilities between the two regions and the initial proportion of each region in the case where individual capture probabilities are equal across regions. We then derive parameter expressions for the two-region closed-population mark-recapture model under the condition that capture probabilities differ between regions but individual transfer rates are low, and illustrate the methods with an example.
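The single-region building block of such closed-population models is the classical mark-recapture estimate; the following sketch (ours, not the two-region model above) shows Chapman's bias-corrected form of the Lincoln-Petersen estimator.

```python
def lincoln_petersen(n1, n2, m2):
    """Chapman's bias-corrected Lincoln-Petersen estimate of a closed
    population size.

    n1 : number of animals captured and marked in the first sample
    n2 : number captured in the second sample
    m2 : number in the second sample that were already marked
    """
    return (n1 + 1) * (n2 + 1) / (m2 + 1) - 1
```

The two-region model of the abstract extends this by letting marked individuals move between regions A and B with estimated transition probabilities.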

19.
We consider the problem of estimating segregation ratios in families ascertained through affected children, formulate it as an incomplete-data problem, and work out the EM algorithm for maximum likelihood estimation of the segregation ratios. We treat both the case of known and the case of unknown ascertainment probability. We also derive expressions for the covariance matrix of the estimators that are suitable for computation alongside the EM algorithm. We illustrate the method with an example, compare the computational effort with that required by the scoring method, and argue that the EM algorithm is simpler.
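For the simplest case (complete ascertainment: a family of size s enters the sample exactly when it has at least one affected child, a truncated binomial model), the EM iteration can be sketched as follows; the zero-affected families are the missing data filled in by the E-step. The function name and interface are ours.

```python
def em_segregation_ratio(counts, s, iters=200):
    """EM estimate of the segregation ratio p from families of size s
    ascertained through at least one affected child.

    counts : dict mapping r -> number of ascertained families with
             r affected children, r = 1..s (truncated binomial data)
    """
    n = sum(counts.values())
    total_affected = sum(r * c for r, c in counts.items())
    p = 0.5
    for _ in range(iters):
        q = 1.0 - p
        # E-step: expected number of unascertained (0-affected) families
        n0 = n * q**s / (1.0 - q**s)
        # M-step: ordinary binomial MLE on the completed data
        p = total_affected / (s * (n + n0))
    return p
```

At the truncated-binomial proportions implied by p = 0.25 and s = 2 (families with 1 and 2 affected children in ratio 6:1), the iteration converges back to 0.25, confirming the fixed point.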

20.
Clinical trials with Poisson distributed count data as the primary outcome are common in various medical areas, such as relapse counts in multiple sclerosis trials or the number of attacks in trials for the treatment of migraine. In this article, we present approximate sample size formulae for testing noninferiority using asymptotic tests which are based on restricted or unrestricted maximum likelihood estimators of the Poisson rates. The Poisson outcomes are allowed to be observed for unequal follow-up schemes, and both the situation where the noninferiority margin is expressed in terms of the difference and the situation where it is expressed in terms of the ratio are considered. The exact type I error rates and powers of these tests are evaluated, and the accuracy of the approximate sample size formulae is examined. The test statistic using the restricted maximum likelihood estimators (for the difference test problem) and the test statistic based on the logarithmic transformation employing the maximum likelihood estimators (for the ratio test problem) show favorable type I error control and can be recommended for practical application. The approximate sample size formulae show high accuracy even for small sample sizes and provide power values identical or close to the aspired ones. The methods are illustrated by a clinical trial example from anesthesia.
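A textbook-style approximation in the spirit of the difference-margin formulae above can be sketched as follows (our simplification: unit follow-up, 1:1 allocation, unrestricted-variance Wald statistic; not the paper's restricted-MLE version). Noninferiority of rate lam1 against lam2 with margin delta is concluded when lam1 − lam2 > −delta.

```python
from math import ceil
from statistics import NormalDist

def poisson_ni_sample_size(lam1, lam2, margin, alpha=0.025, power=0.8):
    """Approximate per-group sample size for a noninferiority test of
    the Poisson rate difference, assuming unit follow-up time and
    equal allocation.

    Based on the standard Wald-type approximation
    n = (z_{1-alpha} + z_{power})^2 (lam1 + lam2) / (lam1 - lam2 + margin)^2.
    Sketch only; exact/restricted versions differ in small samples.
    """
    z_a = NormalDist().inv_cdf(1 - alpha)
    z_b = NormalDist().inv_cdf(power)
    num = (z_a + z_b) ** 2 * (lam1 + lam2)
    den = (lam1 - lam2 + margin) ** 2
    return ceil(num / den)
```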


Copyright © 北京勤云科技发展有限公司 (Beijing Qinyun Technology Development Co., Ltd.)  京ICP备09084417号