Similar articles
 20 similar articles found (search time: 31 ms)
1.
We develop an iterative relaxation algorithm called RIBRA for NMR protein backbone assignment. RIBRA applies nearest neighbor and weighted maximum independent set algorithms to solve the problem. To deal with noisy NMR spectral data, RIBRA is executed in an iterative fashion based on the quality of spectral peaks. We first produce spin system pairs using the spectral data without missing peaks, then the data group with one missing peak, and finally the data group with two missing peaks. We test RIBRA on two real NMR datasets, hbSBD and hbLBD, on perfect BMRB data (902 proteins), and on four synthetic BMRB datasets that simulate four kinds of errors. The accuracies of RIBRA on hbSBD and hbLBD are 91.4% and 83.6%, respectively. The average accuracy of RIBRA on the perfect BMRB datasets is 98.28%, and on the four kinds of synthetic datasets it is 98.28%, 95.61%, 98.16%, and 96.28%, respectively.

2.
Systolic blood pressure (SBP) is an age-dependent complex trait for which both environmental and genetic factors may play a role in explaining variability among individuals. We performed a genome-wide scan of the rate of change in SBP over time on the Framingham Heart Study data and one randomly selected replicate of the simulated data from the Genetic Analysis Workshop 13. We used a variance-component model to carry out linkage analysis and a Markov chain Monte Carlo-based multiple imputation approach to recover missing information. Furthermore, we adopted two selection strategies along with the multiple imputation to deal with subjects taking antihypertensive treatment. The simulated data were used to compare these two strategies and to explore the effectiveness of the multiple imputation in recovering varying degrees of missing information and its impact on linkage analysis results. For the Framingham data, the marker with the highest LOD score for SBP slope was found on chromosome 7. Interestingly, we found that SBP slopes were not heritable in males but were heritable in females; the marker with the highest LOD score was found on chromosome 18. Using the simulated data, we found that handling treated subjects using the multiple imputation improved the linkage results. We conclude that multiple imputation is a promising approach for recovering missing information in longitudinal genetic studies and hence for improving subsequent linkage analyses.

3.
4.
Background: Population-based net survival by tumour stage at diagnosis is a key measure in cancer surveillance. Unfortunately, data on tumour stage are often missing for a non-negligible proportion of patients and the mechanism giving rise to the missingness is usually anything but completely at random. In this setting, restricting analysis to the subset of complete records typically gives biased results. Multiple imputation is a promising practical approach to the issues raised by the missing data, but its use in conjunction with the Pohar-Perme method for estimating net survival has not been formally evaluated. Methods: We performed a resampling study using colorectal cancer population-based registry data to evaluate the ability of multiple imputation, used along with the Pohar-Perme method, to deliver unbiased estimates of stage-specific net survival and recover missing stage information. We created 1000 independent data sets, each containing 5000 patients. Stage data were then made missing at random under two scenarios (30% and 50% missingness). Results: Complete records analysis showed substantial bias and poor confidence interval coverage. Across both scenarios our multiple imputation strategy virtually eliminated the bias and greatly improved confidence interval coverage. Conclusions: In the presence of missing stage data, complete records analysis often gives severely biased results. We showed that combining multiple imputation with the Pohar-Perme estimator provides a valid practical approach for the estimation of stage-specific colorectal cancer net survival. As usual, when the percentage of missing data is high the results should be interpreted cautiously and sensitivity analyses are recommended.

5.
Paleontological investigations into morphological diversity, or disparity, are often confronted with large amounts of missing data. We illustrate how missing discrete data affect disparity using a novel simulation for removing data based on parameters from published datasets that contain both extinct and extant taxa. We develop an algorithm that assesses the distribution of missing characters in extinct taxa, and simulates data loss by applying that distribution to extant taxa. We term this technique “linkage.” We compare differences in disparity metrics and ordination spaces produced by linkage and random character removal. When we incorporated linkage among characters, disparity metrics declined and ordination spaces shrank at a slower rate with increasing missing data, indicating that correlations among characters govern the sensitivity of disparity analysis. We also present and test a new disparity method that uses the linkage algorithm to correct for the bias caused by missing data. We equalized proportions of missing data among time bins before calculating disparity, and found that estimates of disparity changed when missing data were taken into account. By removing the bias of missing data, we can gain new insights into the morphological evolution of organisms and highlight the detrimental effects of missing data on disparity analysis.
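
The idea of applying the missing-character distribution of extinct taxa to extant taxa can be illustrated with a small sketch. The abstract does not give the algorithm's details, so this is only one hypothetical reading: each extant taxon receives the missingness pattern of a randomly drawn extinct taxon, so that characters that tend to be missing together stay missing together. The function name `transfer_missingness`, the use of `None` for a missing character, and the row-per-taxon layout are all assumptions made for illustration.

```python
import random


def transfer_missingness(extinct, extant, rng):
    """Return a copy of `extant` with missingness patterns copied from `extinct`.

    Hypothetical sketch: each row is one taxon, each column one discrete
    character, and None marks a missing score. For every extant taxon we draw
    one extinct taxon at random and blank out the same character positions.
    """
    degraded = []
    for row in extant:
        template = rng.choice(extinct)  # extinct taxon whose gaps we copy
        degraded.append(
            [None if t is None else value for value, t in zip(row, template)]
        )
    return degraded


# Example: a single extinct template makes the outcome deterministic.
extinct_taxa = [[0, None, 1, None]]
extant_taxa = [[1, 1, 0, 0], [0, 1, 1, 0]]
result = transfer_missingness(extinct_taxa, extant_taxa, random.Random(0))
```

Random character removal, the comparison case in the abstract, would instead blank out positions independently of any observed pattern.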

6.
Models for longitudinal data are employed in a wide range of behavioral, biomedical, psychosocial, and health‐care‐related research. One popular model for continuous response is the linear mixed‐effects model (LMM). Although simulations by recent studies show that LMM provides reliable estimates under departures from the normality assumption for complete data, the invariable occurrence of missing data in practical studies renders such robustness results less useful when applied to real study data. In this paper, we show by simulation studies that, in the presence of missing data, estimates of the fixed effects of LMM are biased under departures from normality. We discuss two robust alternatives, the weighted generalized estimating equations (WGEE) and the augmented WGEE (AWGEE), and compare their performance with LMM using real as well as simulated data. Our simulation results show that both WGEE and AWGEE provide valid inference for skewed non‐normal data when missing data follow the missing at random (MAR) mechanism, the most popular missing data mechanism for real study data.

7.
Yuan Y, Little RJ. Biometrics, 2007, 63(4): 1172-1180
This article concerns item nonresponse adjustment for two-stage cluster samples. Specifically, we focus on two types of nonignorable nonresponse: nonresponse depending on covariates and underlying cluster characteristics, and depending on covariates and the missing outcome. In these circumstances, standard weighting and imputation adjustments are liable to be biased. To obtain consistent estimates, we extend the standard random-effects model by modeling these two types of missing data mechanism. We also propose semiparametric approaches based on fitting a spline on the propensity score, to weaken assumptions about the relationship between the outcome and covariates. These new methods are compared with existing approaches by simulation. The National Health and Nutrition Examination Survey data are used to illustrate these approaches.

8.
Protecting against nonrandomly missing data in longitudinal studies   (Cited by: 1 total; 0 self-citations; 1 by others)
Brown CH. Biometrics, 1990, 46(1): 143-155
Nonrandomly missing data can pose serious problems in longitudinal studies. We generally have little knowledge about how missingness is related to the data values, and longitudinal studies are often far from complete. Two approaches that have been used to handle missing data--use of maximum likelihood with an ignorable mechanism and direct modeling of the missing data mechanism--have the disadvantage of not giving consistent estimates under important classes of nonrandom mechanisms. We introduce two protective estimators, that is, estimators that retain their consistency over a wide range of nonrandom mechanisms. We compare these protective estimators using longitudinal data from a mental health panel study. We also investigate their robustness to certain departures from normality.

9.
Bayesian networks can be used to identify possible causal relationships between variables based on their conditional dependencies and independencies, which can be particularly useful in complex biological scenarios with many measured variables. Here we propose two improvements to an existing method for Bayesian network analysis, designed to increase the power to detect potential causal relationships between variables (including potentially a mixture of both discrete and continuous variables). Our first improvement relates to the treatment of missing data. When there is missing data, the standard approach is to remove every individual with any missing data before performing analysis. This can be wasteful and undesirable when there are many individuals with missing data, perhaps with only one or a few variables missing. This motivates the use of imputation. We present a new imputation method that uses a version of nearest neighbour imputation, whereby missing data from one individual is replaced with data from another individual, their nearest neighbour. For each individual with missing data, the subsets of variables to be used to select the nearest neighbour are chosen by sampling without replacement the complete data and estimating a best fit Bayesian network. We show that this approach leads to marked improvements in the recall and precision of directed edges in the final network identified, and we illustrate the approach through application to data from a recent study investigating the causal relationship between methylation and gene expression in early inflammatory arthritis patients. We also describe a second improvement in the form of a pseudo-Bayesian approach for upweighting certain network edges, which can be useful when there is prior evidence concerning their directions.

10.
Hopke PK, Liu C, Rubin DB. Biometrics, 2001, 57(1): 22-33
Many chemical and environmental data sets are complicated by the existence of fully missing values or censored values known to lie below detection thresholds. For example, week-long samples of airborne particulate matter were obtained at Alert, NWT, Canada, between 1980 and 1991, where some of the concentrations of 24 particulate constituents were coarsened in the sense of being either fully missing or below detection limits. To facilitate scientific analysis, it is appealing to create complete data by filling in missing values so that standard complete-data methods can be applied. We briefly review commonly used strategies for handling missing values and focus on the multiple-imputation approach, which generally leads to valid inferences when faced with missing data. Three statistical models are developed for multiply imputing the missing values of airborne particulate matter. We expect that these models are useful for creating multiple imputations in a variety of incomplete multivariate time series data sets.

11.
We considered the contribution of two mitochondrial and two nuclear data sets for the phylogenetic reconstruction of 22 species of seed beetles in the genus Curculio (Coleoptera: Curculionidae). A phylogenetic tree from representatives found on various hosts was inferred from a combined data set of mitochondrial DNA cytochrome oxidase subunit I, mitochondrial cytochrome b, nuclear elongation factor 1alpha, and nuclear phosphoglycerate mutase, used for the first time as a molecular marker. Separate parsimony analyses of each data set showed that individual gene trees were mainly congruent and often complementary in the support of clades, but the analysis was complicated by failure of PCR amplification of nuclear genes for many taxa and hence missing data entries. When the four gene partitions were combined in a simultaneous analysis despite the missing data, this increased the resolution and taxonomic coverage compared to the individual source trees. Alternative approaches of combining the information via supertree methodology produced a comparatively less resolved tree, and hence seem inferior to combining data matrices even in cases where numerous taxa are missing. The molecular data suggest a classification of the European species into two species groups that are in accordance with morphological characteristics, but the data do not support any of the previously recognised American species groups.

12.
Missing value estimation methods for DNA microarrays   (Cited by: 39 total; 0 self-citations; 39 by others)
MOTIVATION: Gene expression microarray experiments can generate data sets with multiple missing expression values. Unfortunately, many algorithms for gene expression analysis require a complete matrix of gene array values as input. For example, methods such as hierarchical clustering and K-means clustering are not robust to missing data, and may lose effectiveness even with a few missing values. Methods for imputing missing data are needed, therefore, to minimize the effect of incomplete data sets on analyses, and to increase the range of data sets to which these algorithms can be applied. In this report, we investigate automated methods for estimating missing data. RESULTS: We present a comparative study of several methods for the estimation of missing values in gene microarray data. We implemented and evaluated three methods: a Singular Value Decomposition (SVD) based method (SVDimpute), weighted K-nearest neighbors (KNNimpute), and row average. We evaluated the methods using a variety of parameter settings and over different real data sets, and assessed the robustness of the imputation methods to the amount of missing data over the range of 1-20% missing values. We show that KNNimpute appears to provide a more robust and sensitive method for missing value estimation than SVDimpute, and both SVDimpute and KNNimpute surpass the commonly used row average method (as well as filling missing values with zeros). We report results of the comparative experiments and provide recommendations and tools for accurate estimation of missing microarray data under a variety of conditions.
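
A weighted K-nearest-neighbour imputation of the kind named in entry 12 can be sketched in a few lines. This is a minimal illustration, not the published KNNimpute implementation: the function name `knn_impute`, the use of `None` for missing values, the root-mean-square distance over shared columns, and the inverse-distance weighting are assumptions chosen for the sketch.

```python
import math


def knn_impute(matrix, k=2):
    """Fill None entries with an inverse-distance-weighted mean of k neighbours.

    Illustrative sketch: rows are genes, columns are experiments. A neighbour
    must have an observed value in the column being filled. Distances are
    measured on the original (unfilled) matrix, over columns both rows share.
    """
    filled = [row[:] for row in matrix]
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            if value is not None:
                continue
            candidates = []
            for other_index, other in enumerate(matrix):
                if other_index == i or other[j] is None:
                    continue
                shared = [(a, b) for a, b in zip(row, other)
                          if a is not None and b is not None]
                if not shared:
                    continue
                # Root-mean-square difference, so rows with different
                # amounts of overlap remain comparable.
                dist = math.sqrt(
                    sum((a - b) ** 2 for a, b in shared) / len(shared)
                )
                candidates.append((dist, other[j]))
            if not candidates:
                continue  # nothing to borrow from; leave the gap as None
            candidates.sort(key=lambda c: c[0])
            nearest = candidates[:k]
            # Small constant avoids division by zero for identical rows.
            weights = [1.0 / (d + 1e-9) for d, _ in nearest]
            filled[i][j] = (
                sum(w * v for w, (_, v) in zip(weights, nearest)) / sum(weights)
            )
    return filled


expression = [
    [1.0, 2.0, None],
    [1.0, 2.0, 3.0],
    [10.0, 20.0, 30.0],
]
completed = knn_impute(expression, k=1)
```

In the example, the first row matches the second row exactly on the shared columns, so with `k=1` its missing value is taken from that row. The row-average baseline mentioned in the abstract would instead fill the gap with the mean of the row's own observed values.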

13.
We have undertaken two-dimensional gel electrophoresis proteomic profiling on a series of cell lines with different recombinant antibody production rates. Due to the nature of gel-based experiments, not all protein spots are detected across all samples in an experiment, and hence datasets are invariably incomplete. New approaches are therefore required for the analysis of such graduated datasets. We approached this problem in two ways. Firstly, we applied a missing value imputation technique to calculate missing data points. Secondly, we combined a singular value decomposition based hierarchical clustering with the expression variability test to identify protein spots whose expression correlates with increased antibody production. The results have shown that while imputation of missing data was a useful method to improve the statistical analysis of such data sets, this was of limited use in differentiating between the samples investigated, and highlighted a small number of candidate proteins for further investigation.

14.
The Generalized Euclidean Distance (GED) measure has been extensively used to conduct morphological disparity analyses based on palaeontological matrices of discrete characters. This is in part because some implementations allow the use of morphological matrices with high percentages of missing data without needing to prune taxa for a subsequent ordination of the data set. Previous studies have suggested that this way of using the GED may generate a bias in the resulting morphospace, but a detailed study of this possible effect has been lacking. Here, we test whether the percentage of missing data for a taxon artificially influences its position in the morphospace, and if missing data affects pre‐ and post‐ordination disparity measures. We find that this use of the GED creates a systematic bias, whereby taxa with higher percentages of missing data are placed closer to the centre of the morphospace than those with more complete scorings. This bias extends into pre‐ and post‐ordination calculations of disparity measures and can lead to erroneous interpretations of disparity patterns, especially if specimens present in a particular time interval or clade have distinct proportions of missing information. We suggest that this implementation of the GED should be used with caution, especially in cases with high percentages of missing data. Results recovered using an alternative distance measure, Maximum Observed Rescaled Distance (MORD), are more robust to missing data. As a consequence, we suggest that MORD is a more appropriate distance measure than GED when analysing data sets with high amounts of missing data.

15.
Methods to handle missing data have been an area of statistical research for many years. Little has been done within the context of pedigree analysis. In this paper we present two methods for imputing missing data for polygenic models using family data. The imputation schemes take into account familial relationships and use the observed familial information for the imputation. A traditional multiple imputation approach and a multiple imputation or data augmentation approach within a Gibbs sampler for the handling of missing data for a polygenic model are presented. We used both the Genetic Analysis Workshop 13 simulated missing phenotype and the complete phenotype data sets as the means to illustrate the two methods. We looked at the phenotypic trait systolic blood pressure and the covariate gender at time point 11 (1970) for Cohort 1 and time point 1 (1971) for Cohort 2. Comparing the results for three replicates of complete and missing data incorporating multiple imputation, we find that multiple imputation via a Gibbs sampler produces more accurate results. Thus, we recommend the Gibbs sampler for imputation purposes because of the ease with which it can be extended to more complicated models, the consistency of the results, and the accountability of the variation due to imputation.

16.
17.
Montane species endemic to the “sky islands” of the North American southwest were significantly impacted by changing climates during the Pleistocene. We combined mitochondrial and genomic data with species distribution modelling to determine whether Aphonopelma marxi, a large tarantula from the nearby Colorado Plateau, was similarly impacted by glacial climates. Genetic analyses revealed that the species comprises three main clades that diverged in the Pleistocene. A clade distributed along the Mogollon Rim appears to have persisted in place during glacial conditions, whereas the other two clades probably colonized central and northeastern portions of the species' range from refugia in canyons. Climate models support this hypothesis for the Mogollon Rim, but late glacial climate data appear too coarse to detect suitable areas in canyons. Locations of canyon refugia could not be inferred from genomic analyses due to missing data, encouraging us to explore the effect of missing loci in phylogeographical inferences using RADseq. Results from analyses with varying amounts of missing data suggest that samples with large amounts of missing data can still improve inferences, and that the specific loci that are missing matter more than the number of missing loci. This study highlights the profound impact of Pleistocene climates on tarantulas endemic to the Colorado Plateau, as well as the mixed nature of the region's fauna. Some animals recently colonized from nearby deserts as glacial climates receded, whereas others, like tarantulas, appear to have persisted on the Mogollon Rim and in refugia associated with the region's famous river‐cut canyons.

18.
Animal sociability measurements based on inter-individual distances or nearest-neighbour distributions can be obtained automatically with telemetry collars. So far, all the indices that have been used require the whole group to be observed. Here, we propose an index of the variability in affinity relationships in groups of domestic herbivores, whose definition does not depend on group size and that can be used even if some data are missing. This index and its estimators are based on a function that measures how frequently an animal is closer than another one to a third animal. When no data are missing, we show that our estimator and the variance of the sociability matrix sensu Sibbald (considered as the reference method) are strongly correlated. We then consider two cases of missing data. In the first case, some animals are randomly missing, for example because of random breakdown of telemetry collars. Our estimator is unbiased by such missing data and its variance decreases as the number of observation dates increases. In the second case, the same animals are missing at all observation dates, for example in large herds where there are more individuals to be observed than available telemetry collars. Our estimator of affinity variance within a group is biased by such missing data. Thus, it requires regularly changing which animals are equipped with telemetry collars during the experiment. Conversely, the estimator remains unbiased at the population level, that is, if several independent groups are being analysed. We finally illustrate how this estimator can be used by investigating changes in the variability of affinities according to group size in grazing heifers.

19.
Analysts often estimate treatment effects in observational studies using propensity score matching techniques. When there are missing covariate values, analysts can multiply impute the missing data to create m completed data sets. Analysts can then estimate propensity scores on each of the completed data sets, and use these to estimate treatment effects. However, there has been relatively little attention on developing imputation models to deal with the additional problem of missing treatment indicators, perhaps due to the consequences of generating implausible imputations. However, simply ignoring the missing treatment values, akin to a complete case analysis, could also lead to problems when estimating treatment effects. We propose a latent class model to multiply impute missing treatment indicators. We illustrate its performance through simulations and with data taken from a study on determinants of children's cognitive development. This approach is seen to obtain treatment effect estimates closer to the true treatment effect than when employing conventional imputation procedures as well as compared to a complete case analysis.
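
Once an effect has been estimated on each of the m completed data sets, the per-data-set results are usually combined with Rubin's rules: the pooled estimate is the mean of the m estimates, and the total variance is the average within-imputation variance plus (1 + 1/m) times the between-imputation variance. The abstract does not say which pooling procedure the authors used, so the sketch below is a generic illustration; the function name `pool_rubin` and the example numbers are made up.

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Combine m per-imputation estimates and their variances (Rubin's rules).

    Returns (pooled_estimate, total_variance). Requires m >= 2 so that the
    between-imputation variance is defined.
    """
    m = len(estimates)
    if m < 2 or len(variances) != m:
        raise ValueError("need at least two estimates with matching variances")
    pooled = mean(estimates)        # pooled point estimate
    within = mean(variances)        # average within-imputation variance
    between = variance(estimates)   # sample variance across imputations
    total = within + (1 + 1 / m) * between
    return pooled, total


# Illustrative numbers only: three imputed data sets.
pooled_effect, total_var = pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
```

The (1 + 1/m) factor inflates the variance to reflect that only a finite number of imputations was drawn, which is why the pooled variance exceeds the average of the per-data-set variances whenever the estimates disagree.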

20.
Restriction site-associated DNA sequencing (RAD-seq) and related methods have become relatively common approaches to resolve species-level phylogeny. It is not clear, however, whether RAD-seq data matrices are well suited to relaxed clock inference of divergence times, given the size of the matrices and the abundance of missing data. We investigated the sensitivity of Bayesian relaxed clock estimates of divergence times to alternative analytical decisions on an empirical RAD-seq phylogenetic matrix. We explored the relative contribution of secondary calibration strategies, amount of missing data, and the data partition analyzed to overall variance in divergence times inferred using BEAST MCMC analyses of Carex section Schoenoxiphium (Cyperaceae), a recent radiation for which we have nearly complete species sampling of RAD-seq data. The crown node for Schoenoxiphium was estimated to be 15.22 (9.56–21.18) Ma using a single calibration point and low missing data, 11.93 (8.07–16.03) Ma using multiple calibration points and low missing data, and 8.34 (5.41–11.22) Ma using multiple calibrations but high missing data. We found that using matrices with more than half of the individuals with missing data inferred younger mean ages for all nodes. Moreover, we found that our molecular clock estimates are sensitive to the positions of the calibration(s) in our phylogenetic tree (using matrices with low missing data), especially when only a single calibration was applied to estimate divergence times. These results argue for sensitivity analyses and caution in interpreting divergence time estimates from RAD-seq data.

