Similar Articles (20 results)
1.
Influenza viruses have been responsible for large losses of lives around the world and continue to present a great public health challenge. Antigenic characterization based on the hemagglutination inhibition (HI) assay is one of the routine procedures for influenza vaccine strain selection. However, the HI assay is only a crude experiment reflecting the antigenic correlations among testing antigens (viruses) and reference antisera (antibodies). Moreover, antigenic characterization is usually based on more than one HI dataset. The combination of multiple datasets results in an incomplete HI matrix with many unobserved entries. This paper proposes a new computational framework for constructing an influenza antigenic cartography from this incomplete matrix, which we refer to as Matrix Completion-Multidimensional Scaling (MC-MDS). In this approach, we first reconstruct the HI matrices with viruses and antibodies using low-rank matrix completion, and then generate the two-dimensional antigenic cartography using multidimensional scaling. Moreover, for influenza HI tables with a herd immunity effect (such as those from human influenza viruses), we propose a temporal model to reduce the inherent temporal bias of HI tables caused by herd immunity. By applying our method to HI datasets containing H3N2 influenza A viruses isolated from 1968 to 2003, we identified eleven clusters of antigenic variants, representing all major antigenic drift events in these 36 years. Our results showed that both the completed HI matrix and the antigenic cartography obtained via MC-MDS are useful in identifying influenza antigenic variants and thus can be used to facilitate influenza vaccine strain selection. The webserver is available at http://sysbio.cvm.msstate.edu/AntigenMap.
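As a rough illustration of the two-stage MC-MDS idea (not the authors' implementation), the sketch below completes a toy log2 HI table with an iterative soft-thresholded SVD and then embeds the antigens in two dimensions with multidimensional scaling; the matrix values, rank and shrinkage parameter are all made up.

```python
import numpy as np
from sklearn.manifold import MDS

def soft_impute(M, mask, rank=2, shrink=0.1, n_iter=200):
    """Fill unobserved entries of M (mask == False) with a low-rank estimate."""
    X = np.where(mask, M, 0.0)
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        s = np.maximum(s - shrink, 0.0)                      # soft-threshold singular values
        low_rank = (U[:, :rank] * s[:rank]) @ Vt[:rank, :]
        X = np.where(mask, M, low_rank)                      # keep observed entries fixed
    return X

# toy log2 HI titres: rows = test antigens, columns = reference antisera; np.nan = unmeasured
hi = np.array([[8., 7., np.nan],
               [7., np.nan, 5.],
               [np.nan, 4., 6.],
               [3., 4., np.nan]])
mask = ~np.isnan(hi)
completed = soft_impute(np.nan_to_num(hi), mask)

# antigenic distances from the completed titres, then a 2-D map (the "cartography")
dist = np.abs(completed[:, None, :] - completed[None, :, :]).mean(axis=2)
coords = MDS(n_components=2, dissimilarity="precomputed", random_state=0).fit_transform(dist)
print(coords)
```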

2.
King R, Brooks SP, Coulson T. Biometrics, 2008, 64(4): 1187-1195.
We consider the issue of analyzing complex ecological data in the presence of covariate information and model uncertainty. Several issues can arise when analyzing such data, not least the need to account for missing covariate values. This is most acutely observed in the presence of time-varying covariates. We consider mark-recapture-recovery data, where the corresponding recapture probabilities are less than unity, so that individuals are not always observed at each capture event. This often leads to a large amount of missing time-varying individual covariate information, because the covariate cannot usually be recorded if an individual is not observed. In addition, we address the problem of model selection over these covariates with missing data. We consider a Bayesian approach, which is able to handle large amounts of missing data by essentially treating the missing values as auxiliary variables. This approach also allows a quantitative comparison of different models via posterior model probabilities, obtained via the reversible jump Markov chain Monte Carlo algorithm. To demonstrate this approach we analyze data relating to Soay sheep, which pose several statistical challenges in fully describing the intricacies of the system.

3.
With the increasing availability of diverse biological information for proteins, integration of heterogeneous data becomes more useful for many problems in proteomics, such as annotating protein functions and predicting novel protein–protein interactions. In this paper, we present an integrative approach called InteHC (Integrative Hierarchical Clustering) to identify protein complexes from multiple data sources. Although integrating multiple sources can effectively improve the coverage of the currently incomplete protein interactome (the false-negative issue), it can also introduce potential false-positive interactions that could hurt the performance of protein complex prediction. Our proposed InteHC method effectively addresses these issues to facilitate accurate protein complex prediction, and it can be summarized in the following three steps. First, for each individual source/feature, InteHC computes a matrix of affinity scores for each protein pair, indicating their propensity to interact or to share a co-complex relationship. Second, InteHC computes a final score matrix, which is the weighted sum of the affinity scores from the individual sources; the weights indicating the reliability of individual sources are learned from a supervised model (a linear ranking SVM). Finally, a hierarchical clustering algorithm is performed on the final score matrix to generate clusters as predicted protein complexes. In our experiments, we compared the results obtained by hierarchical clustering on each individual feature with those predicted by InteHC on the combined matrix, and observed that integration of heterogeneous data significantly benefits the identification of protein complexes. Moreover, a comprehensive comparison demonstrates that InteHC performs much better than 14 state-of-the-art approaches. All the experimental data and results can be downloaded from http://www.ntu.edu.sg/home/zhengjie/data/InteHC. Proteins 2013; 81:2023–2033.
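A compressed sketch of the combine-then-cluster step, assuming hand-picked source weights in place of the reliabilities that InteHC learns with a linear ranking SVM; the affinity matrices and the clustering cutoff are toy values.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

rng = np.random.default_rng(0)

def toy_affinity(n=5):
    """Symmetric toy affinity matrix for n proteins with scores in [0, 1]."""
    a = rng.random((n, n))
    a = (a + a.T) / 2
    np.fill_diagonal(a, 1.0)
    return a

sources = [toy_affinity(), toy_affinity()]       # two heterogeneous data sources
weights = [0.7, 0.3]                             # stand-ins for learned reliabilities

final = sum(w * s for w, s in zip(weights, sources))   # weighted sum of affinity scores
dist = 1.0 - final / final.max()                       # turn affinity into a dissimilarity
np.fill_diagonal(dist, 0.0)

Z = linkage(squareform(dist, checks=False), method="average")
clusters = fcluster(Z, t=0.5, criterion="distance")    # cluster labels = predicted complexes
print(clusters)
```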

4.
In epidemics of infectious diseases such as influenza, an individual may have one of four possible final states: prior immune, escaped from infection, infected with symptoms, and infected asymptomatically. The exact state is often not observed. In addition, the unobserved transmission times of asymptomatic infections further complicate analysis. Under the assumption of missing at random, data-augmentation techniques can be used to integrate out such uncertainties. We adapt an importance-sampling-based Monte Carlo Expectation-Maximization (MCEM) algorithm to the setting of an infectious disease transmitted in close contact groups. Assuming independence between close contact groups, we propose a hybrid EM-MCEM algorithm that applies the MCEM or the traditional EM algorithm to each close contact group depending on the dimension of missing data in that group, and we discuss variance estimation for this approach. In addition, we propose a bootstrap approach to assess the total Monte Carlo error and factor that error into the variance estimation. The proposed methods are evaluated using simulation studies. We use the hybrid EM-MCEM algorithm to analyze two influenza epidemics from the late 1970s to assess the effects of age and preseason antibody levels on the transmissibility and pathogenicity of the viruses.

5.
Fang Y, Benjamin W, Sun M, Ramani K. PLoS ONE, 2011, 6(5): e19349.
Protein-protein interaction (PPI) network analysis plays an essential role in understanding the functional relationships among proteins in a living biological system. Despite the success of current approaches for understanding the PPI network, the large fraction of missing and spurious PPIs and the low coverage of the complete PPI network remain major concerns. In this paper, based on the diffusion process, we propose a new concept of global geometric affinity and an accompanying computational scheme to filter the uncertain PPIs, namely, to reduce the spurious PPIs and recover the missing PPIs in the network. The main concept defines a diffusion process in which all proteins simultaneously participate, yielding a similarity metric (the global geometric affinity, GGA) that robustly reflects the internal connectivity among proteins. The robustness of the GGA is attributed to propagating the local connectivity to a global representation of similarity among proteins through the diffusion process. The propagation is extremely fast, as only simple matrix products are required, and our method is thus geared toward applications in high-throughput PPI networks. Furthermore, we propose two new approaches that determine the optimal geometric scale of the PPI network and the optimal threshold for assigning PPIs from the GGA matrix. Our approach is tested on three protein-protein interaction networks and performs well under substantial random noise from deletions and insertions of true PPIs. Our approach has the potential to benefit biological experiments, to better characterize network data sets, and to drive new discoveries.
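A minimal sketch of a diffusion-style affinity on a small PPI graph; the number of diffusion steps and the acceptance threshold are placeholders, not the GGA scale-selection and thresholding procedures proposed in the paper.

```python
import numpy as np

# toy undirected PPI network as an adjacency matrix (1 = observed interaction)
A = np.array([[0, 1, 1, 0, 0],
              [1, 0, 1, 0, 0],
              [1, 1, 0, 1, 0],
              [0, 0, 1, 0, 1],
              [0, 0, 0, 1, 0]], dtype=float)

# row-normalise to a transition matrix and diffuse for t steps (only matrix products needed)
P = A / A.sum(axis=1, keepdims=True)
t = 3
affinity = np.linalg.matrix_power(P, t)          # propagated, "global" connectivity
affinity = (affinity + affinity.T) / 2           # symmetrise

# simple post-processing: call a PPI wherever the diffused affinity exceeds a threshold
threshold = 0.15
predicted = (affinity > threshold).astype(int)
np.fill_diagonal(predicted, 0)
print(predicted)
```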

6.
Bayesian networks can be used to identify possible causal relationships between variables based on their conditional dependencies and independencies, which can be particularly useful in complex biological scenarios with many measured variables. Here we propose two improvements to an existing method for Bayesian network analysis, designed to increase the power to detect potential causal relationships between variables (including potentially a mixture of both discrete and continuous variables). Our first improvement relates to the treatment of missing data. When there is missing data, the standard approach is to remove every individual with any missing data before performing analysis. This can be wasteful and undesirable when there are many individuals with missing data, perhaps with only one or a few variables missing. This motivates the use of imputation. We present a new imputation method that uses a version of nearest neighbour imputation, whereby missing data from one individual is replaced with data from another individual, their nearest neighbour. For each individual with missing data, the subsets of variables used to select the nearest neighbour are chosen by sampling the complete data without replacement and estimating a best-fit Bayesian network. We show that this approach leads to marked improvements in the recall and precision of directed edges in the final network identified, and we illustrate the approach through application to data from a recent study investigating the causal relationship between methylation and gene expression in early inflammatory arthritis patients. We also describe a second improvement in the form of a pseudo-Bayesian approach for upweighting certain network edges, which can be useful when there is prior evidence concerning their directions.

7.
With the amount of available sequence data rapidly increasing, supermatrices are at the forefront of systematic studies. As an alternative to supertrees, supermatrices take a total-evidence approach in which different genes and other lines of data are merged into a single data matrix, which is then analyzed in an attempt to obtain the phylogeny that best explains the data. However, questions arise when combining data sets in which one or more taxa do not have sequences available for each individual gene. Two possible solutions are to either leave all taxa separate and code unavailable sequences as missing, or to combine taxa at a level for which monophyly is assumed a priori. By reanalyzing previous work, we show that combining taxa may yield misleading results, i.e., hypotheses of relationships that are not supported by the underlying data.

8.
Supermatrices are often characterized by a large amount of missing data. One possible approach to minimize such missing data is to create composite taxa. These taxa are formed by sampling sequences from different species in order to obtain a composite sequence that includes a maximum number of genes. Although this approach is increasingly used, its accuracy has rarely been tested, and some authors prefer to analyze incomplete supermatrices by coding unavailable sequences as missing. To further validate the composite taxon approach, it was applied to complete mitochondrial matrices of 102 mammal species representing 93 families, with varying amounts of missing data. On average, missing-data and composite matrices showed similar congruence with model trees obtained from the complete sequence matrix. As expected, the level of congruence with model trees decreased as missing data increased with both approaches. We conclude that the composite taxon approach is worth considering in a phylogenomic context, since it performs well and reduces computing time compared with missing-data matrices.

9.
Surveillance is critical to mounting an appropriate and effective response to pandemics. However, aggregated case report data suffers from reporting delays and can lead to misleading inferences. Unlike aggregated case report data, line list data is a table containing individual features for each reported case, such as dates of symptom onset and reporting, and is a good source for modeling delays. Current methods for modeling reporting delays are not particularly appropriate for line list data, which typically has missing symptom onset dates that are non-ignorable for modeling reporting delays. In this paper, we develop a Bayesian approach that dynamically integrates imputation and estimation for line list data. Specifically, this Bayesian approach can accurately estimate the epidemic curve and instantaneous reproduction numbers, even with most symptom onset dates missing. The Bayesian approach is also robust to deviations from model assumptions, such as changes in the reporting delay distribution or incorrect specification of the maximum reporting delay. We apply the Bayesian approach to COVID-19 line list data in Massachusetts and find that the reproduction number estimates correspond more closely to the control measures than estimates based on the reported curve.
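A toy illustration of the underlying idea rather than the authors' Bayesian sampler: missing onset dates are imputed by subtracting delays drawn from an assumed reporting-delay distribution from the report dates, after which the epidemic curve is tallied by onset day.

```python
import numpy as np

rng = np.random.default_rng(1)

# toy line list: report day for every case, onset day known only for some (np.nan = missing)
report_day = np.array([10, 10, 11, 12, 12, 13, 14], dtype=float)
onset_day  = np.array([ 7, np.nan,  9, np.nan, 10, np.nan, 12], dtype=float)

# assumed reporting-delay distribution (a discretised gamma); in a full Bayesian treatment
# this would be estimated jointly with the imputations rather than fixed in advance
delays = rng.gamma(shape=2.0, scale=1.5, size=10_000).round()

missing = np.isnan(onset_day)
imputed = onset_day.copy()
imputed[missing] = report_day[missing] - rng.choice(delays, size=missing.sum())
imputed = np.minimum(imputed, report_day)        # onset cannot follow the report

# epidemic curve by (imputed) onset day
days, counts = np.unique(imputed, return_counts=True)
print(dict(zip(days.astype(int), counts)))
```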

10.
Missing value estimation methods for DNA microarrays
MOTIVATION: Gene expression microarray experiments can generate data sets with multiple missing expression values. Unfortunately, many algorithms for gene expression analysis require a complete matrix of gene array values as input. For example, methods such as hierarchical clustering and K-means clustering are not robust to missing data, and may lose effectiveness even with a few missing values. Methods for imputing missing data are therefore needed to minimize the effect of incomplete data sets on analyses, and to increase the range of data sets to which these algorithms can be applied. In this report, we investigate automated methods for estimating missing data. RESULTS: We present a comparative study of several methods for the estimation of missing values in gene microarray data. We implemented and evaluated three methods: a Singular Value Decomposition (SVD) based method (SVDimpute), weighted K-nearest neighbors (KNNimpute), and row average. We evaluated the methods using a variety of parameter settings and over different real data sets, and assessed the robustness of the imputation methods to the amount of missing data over the range of 1–20% missing values. We show that KNNimpute provides a more robust and sensitive method for missing value estimation than SVDimpute, and that both SVDimpute and KNNimpute surpass the commonly used row average method (as well as filling missing values with zeros). We report the results of the comparative experiments and provide recommendations and tools for accurate estimation of missing microarray data under a variety of conditions.
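A bare-bones weighted K-nearest-neighbours imputation in the spirit of KNNimpute, using gene-wise Euclidean distance over the commonly observed arrays; K and the toy expression matrix are illustrative.

```python
import numpy as np

def knn_impute(X, k=2):
    """Fill NaNs in a genes x arrays matrix using the k most similar complete genes."""
    X = X.copy()
    complete = X[~np.isnan(X).any(axis=1)]                 # candidate neighbour genes
    for i in np.where(np.isnan(X).any(axis=1))[0]:
        obs = ~np.isnan(X[i])
        d = np.sqrt(((complete[:, obs] - X[i, obs]) ** 2).mean(axis=1))
        nn = np.argsort(d)[:k]
        w = 1.0 / (d[nn] + 1e-12)                          # weight neighbours by 1/distance
        X[i, ~obs] = (w[:, None] * complete[nn][:, ~obs]).sum(axis=0) / w.sum()
    return X

expr = np.array([[1.0, 1.1, 0.9, 1.0],
                 [2.0, 2.1, np.nan, 1.9],
                 [0.1, 0.2, 0.1, np.nan],
                 [2.1, 2.0, 1.9, 2.0],
                 [0.0, 0.1, 0.2, 0.1]])
print(knn_impute(expr))
```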

11.
Gene expression microarray experiments frequently generate datasets with multiple values missing. However, most of the analysis, mining, and classification methods for gene expression data require a complete matrix of gene array values. Therefore, the accurate estimation of missing values in such datasets has been recognized as an important issue, and several imputation algorithms have already been proposed to the biological community. Most of these approaches, however, are not particularly suitable for time series expression profiles. In view of this, we propose a novel imputation algorithm that is specially suited for the estimation of missing values in gene expression time series data. The algorithm utilizes the Dynamic Time Warping (DTW) distance to measure the similarity between time expression profiles, and subsequently selects, for each gene expression profile with missing values, a dedicated set of candidate profiles for estimation. Three different DTW-based imputation (DTWimpute) algorithms have been considered: position-wise, neighborhood-wise, and two-pass imputation. These were initially prototyped in Perl, and their accuracy was evaluated on yeast expression time series data using several different parameter settings. The experiments showed that the two-pass algorithm consistently outperforms the neighborhood-wise and the position-wise algorithms, in particular for datasets with a higher level of missing entries. The performance of the two-pass DTWimpute algorithm was further benchmarked against the weighted K-Nearest Neighbors algorithm, which is widely used in the biological community; the former appeared superior to the latter. Motivated by these findings, which clearly indicate the added value of DTW techniques for missing value estimation in time series data, we have built an optimized C++ implementation of the two-pass DTWimpute algorithm. The software also provides a choice between three different initial rough imputation methods.
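A compact dynamic time warping distance, the core ingredient DTWimpute uses to pick donor profiles for a gene with missing values; the profiles are toy data and the actual imputation step is omitted.

```python
import numpy as np

def dtw_distance(a, b):
    """Classic O(len(a)*len(b)) dynamic time warping distance between two profiles."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# rank candidate donor profiles by DTW similarity to a target expression time series
target = np.array([0.1, 0.5, 1.2, 0.8, 0.3])
candidates = {"geneA": np.array([0.0, 0.6, 1.1, 0.7, 0.2]),
              "geneB": np.array([1.0, 0.9, 0.2, 0.1, 0.0])}
ranked = sorted(candidates, key=lambda g: dtw_distance(target, candidates[g]))
print(ranked)   # nearest donors first
```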

12.
The hemagglutination inhibition (HI) test has long been used as a standard measure of antibody response for inactivated influenza vaccines. However, the HI test has limitations, such as insensitivity when using some H3N2 virus strains and failure to detect neutralizing antibodies that target regions distant from the receptor binding site. We therefore examined a hemagglutinin pseudovirus neutralization (PVN) test as a possible supplement or alternative to the HI test. We evaluated the association of HI or PVN titres with protection against influenza infection in mice based on morbidity (with illness defined as 25% body weight loss). We assessed this relationship using dose–response models incorporating HI or PVN titres as a variable. Morbidity was correlated with the pre-exposure titres, and this correlation was well described by a modified dose–response model. The mathematical modelling suggests that PVN titres consistently show a stronger association with in vivo protection than HI titres in mice. Given our findings, the PVN test warrants further investigation as a tool for evaluating antibody responses to influenza vaccines containing hemagglutinin. The resulting models may also be useful for analyzing human clinical data to identify potentially protective antibody titres against influenza illness.
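The kind of dose–response relationship described can be sketched as a logistic model of illness probability against log2 titre, fit here by simple least squares; the data and parameter values are made up and this is not the paper's fitted model.

```python
import numpy as np
from scipy.optimize import curve_fit

def dose_response(log2_titre, alpha, beta):
    """Probability of illness as a decreasing logistic function of log2 antibody titre."""
    return 1.0 / (1.0 + np.exp(-(alpha - beta * log2_titre)))

# toy mouse data: pre-exposure log2 titres and illness outcome (1 = 25% body weight loss)
log2_titre = np.array([1, 2, 3, 3, 4, 5, 6, 6, 7, 8], dtype=float)
ill        = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0], dtype=float)

params, _ = curve_fit(dose_response, log2_titre, ill, p0=[3.0, 1.0])
alpha, beta = params
print(f"alpha={alpha:.2f}, beta={beta:.2f}")
print("P(ill | titre 1:40) =", round(dose_response(np.log2(40), alpha, beta), 3))
```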

13.
Microarray experiments generate data sets with information on the expression levels of thousands of genes in a set of biological samples. Unfortunately, such experiments often produce multiple missing expression values, normally due to various experimental problems. As many algorithms for gene expression analysis require a complete data matrix as input, the missing values have to be estimated in order to analyze the available data. Alternatively, genes and arrays can be removed until no missing values remain. However, for genes or arrays with only a small number of missing values, it is desirable to impute those values. For the subsequent analysis to be as informative as possible, it is essential that the estimates for the missing gene expression values are accurate. A small number of badly estimated missing values might be enough for clustering methods, such as hierarchical clustering or K-means clustering, to produce misleading results. Thus, accurate methods for missing value estimation are needed. We present novel methods for estimation of missing values in microarray data sets that are based on the least squares principle and that utilize correlations between both genes and arrays. For this set of methods, we use the common reference name LSimpute. We compare the estimation accuracy of our methods with the widely used KNNimpute on three complete data matrices from public data sets by randomly knocking out data (labeling it as missing). From these tests, we conclude that our LSimpute methods produce estimates that are consistently more accurate than those obtained using KNNimpute. Additionally, we examine a more classic approach to missing value estimation based on expectation maximization (EM). We refer to our EM implementations as EMimpute, and the estimation errors of the EMimpute methods are compared with those produced by our novel methods. The results indicate that, on average, the estimates from our best-performing LSimpute method are at least as accurate as those from the best EMimpute algorithm.
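A stripped-down regression-based imputation in the LSimpute spirit: each missing entry is estimated from the single most correlated complete gene by least squares. A real implementation combines several such regressions and also exploits array correlations; everything here is a toy.

```python
import numpy as np

def ls_impute_single(X):
    """Impute NaNs gene-wise from the most correlated complete gene via least squares."""
    X = X.copy()
    complete = X[~np.isnan(X).any(axis=1)]
    for i in np.where(np.isnan(X).any(axis=1))[0]:
        obs = ~np.isnan(X[i])
        # pick the complete gene with the strongest correlation on the observed arrays
        cors = [abs(np.corrcoef(X[i, obs], g[obs])[0, 1]) for g in complete]
        donor = complete[int(np.argmax(cors))]
        slope, intercept = np.polyfit(donor[obs], X[i, obs], deg=1)   # least-squares line
        X[i, ~obs] = slope * donor[~obs] + intercept
    return X

expr = np.array([[1.0, 2.0, 3.0, 4.0],
                 [2.1, 4.2, np.nan, 8.1],
                 [4.0, 3.0, 2.0, 1.0]])
print(ls_impute_single(expr))
```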

14.
Arterial walls typically have a heterogeneous structure with three different layers (intima, media, and adventitia). Each layer can be modeled as a fiber-reinforced material with two families of relatively stiff collagenous fibers symmetrically arranged within an isotropic soft ground matrix. In this paper, we present two different modeling approaches, the embedded fiber (EF) approach and the angular integration (AI) approach, to simulate the anisotropic behavior of individual arterial wall layers involving layer-specific data. The EF approach directly incorporates the microscopic arrangement of fibers that are synthetically generated from a random walk algorithm and captures material anisotropy at the element level of the finite element formulation. The AI approach smears fibers in the ground matrix and treats the material as homogeneous, with material anisotropy introduced at the constitutive level by enhancing the isotropic strain energy with two anisotropic terms. Both approaches include the influence of fiber dispersion introduced by the fiber angular distribution (departure of individual fibers from the mean orientation) and take into consideration the dispersion caused by fiber waviness, which has not been previously considered. By comparing the numerical results with published experimental data for different layers of a human aorta, we show that, using histological data, both approaches can successfully capture the anisotropic behavior of individual arterial wall layers. Furthermore, through a comprehensive parametric study, we establish connections between the AI phenomenological material parameters and the EF parameters, which have straightforward physical or geometrical interpretations. This study provides valuable insight for the calibration of phenomenological parameters used in homogenized modeling based on the microscopic fiber arrangement. Moreover, it facilitates a better understanding of individual arterial wall layers, which will eventually advance the study of the structure–function relationship of arterial walls as a whole.
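A rough numerical illustration of the angular-integration idea: a standard exponential fiber strain energy integrated over a von Mises-type orientation density. The constants, density and loading are invented and this is not the constitutive law calibrated in the paper.

```python
import numpy as np

def ai_fiber_energy(F, mu_angle=0.0, b=2.0, k1=1.0, k2=1.0, n_theta=360):
    """Angular integration: sum an exponential fiber strain energy over an in-plane
    orientation density, so dispersed fibers all contribute to the anisotropic part."""
    dtheta = np.pi / n_theta
    thetas = np.linspace(-np.pi / 2, np.pi / 2, n_theta, endpoint=False)
    rho = np.exp(b * np.cos(2 * (thetas - mu_angle)))        # von Mises-type density
    rho /= rho.sum() * dtheta                                 # normalise over orientations
    C = F.T @ F                                               # right Cauchy-Green tensor
    energy = 0.0
    for theta, r in zip(thetas, rho):
        a0 = np.array([np.cos(theta), np.sin(theta), 0.0])    # fiber direction
        I4 = a0 @ C @ a0                                      # squared fiber stretch
        if I4 > 1.0:                                          # fibers resist only tension
            energy += r * (k1 / (2 * k2)) * (np.exp(k2 * (I4 - 1.0) ** 2) - 1.0) * dtheta
    return energy

# incompressible uniaxial-type stretch of 1.2 along the mean fiber direction
lam = 1.2
F = np.diag([lam, 1.0 / np.sqrt(lam), 1.0 / np.sqrt(lam)])
print(ai_fiber_energy(F))
```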

15.
The hemagglutination inhibition (HI) test is the classic method for evaluating the immunogenicity of seasonal influenza vaccines. Because human and avian influenza viruses bind different receptors, red blood cells from different species may differ in detection sensitivity. We compared the classic chicken red blood cell HI method with the horse red blood cell HI method reported in the literature. When testing sera from recipients of a pandemic influenza vaccine, the two methods performed differently against different strain antigens, but the results differed by no more than two-fold, indicating that the classic chicken red blood cell HI method remains suitable for evaluating the immunogenicity of pandemic influenza vaccines.

16.
Objective: Accurately determining the three-dimensional structure of proteins from nuclear magnetic resonance (NMR) spectroscopy experiments is currently a popular topic in biophysics, because proteins are essential components of living organisms and knowledge of their spatial structure is crucial for studying their function; the severe shortage of experimental data, however, makes this a major challenge. Methods: In this paper, the protein structure determination problem is addressed with matrix completion (MC) algorithms that recover the distance matrix. First, an initial distance matrix model is built; because of the lack of experimental data, this initial distance matrix is incomplete. The missing entries of the initial distance matrix are then recovered with an MC algorithm, yielding the full three-dimensional protein structure. To further test performance, four proteins with different topologies and six existing MC algorithms were selected, and the recovery performance of the algorithms was examined under different sampling rates and different levels of noise. Results: Performance was assessed by the mean and standard deviation of two key metrics, root-mean-square deviation (RMSD) and computation time; the results show that when the sampling rate and noise factor are kept within a certain range, both the RMSD and its standard deviation remain small. In addition, the characteristics and advantages of the different algorithms are compared in more detail; under exact sampling conditions...
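A toy version of the pipeline described: complete a partially observed squared-distance matrix with a simple iterative low-rank scheme, then recover 3-D coordinates from the completed distances by classical multidimensional scaling. The "protein" is a handful of random points and the sampling pattern is arbitrary.

```python
import numpy as np

def complete_sq_distances(D2, mask, rank=5, n_iter=500):
    """Iterative truncated-SVD completion of a squared distance matrix.

    A squared Euclidean distance matrix for points in 3-D has rank at most 5,
    which is what makes low-rank matrix completion applicable here."""
    X = np.where(mask, D2, D2[mask].mean())
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        X = (U[:, :rank] * s[:rank]) @ Vt[:rank, :]
        X = np.where(mask, D2, np.maximum(X, 0.0))   # keep observed entries fixed
    return X

def coords_from_distances(D2, dim=3):
    """Classical MDS: recover coordinates (up to a rigid motion) from squared distances."""
    n = D2.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    G = -0.5 * J @ D2 @ J                             # Gram matrix by double centring
    w, V = np.linalg.eigh(G)
    idx = np.argsort(w)[::-1][:dim]
    return V[:, idx] * np.sqrt(np.maximum(w[idx], 0.0))

# toy "protein": 8 random points in 3-D, with roughly 40% of pairwise distances unobserved
rng = np.random.default_rng(0)
pts = rng.normal(size=(8, 3))
D2 = ((pts[:, None, :] - pts[None, :, :]) ** 2).sum(-1)
obs = rng.random(D2.shape) > 0.4
mask = np.triu(obs, 1)
mask = mask | mask.T
np.fill_diagonal(mask, True)

recovered = coords_from_distances(complete_sq_distances(D2, mask))
print(recovered.shape)   # (8, 3): coordinates up to rotation, reflection and translation
```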

17.
To guide control policies, it is important that the determinants of influenza transmission are fully characterized. Such assessment is complex because the risk of influenza infection is multifaceted and depends both on immunity acquired naturally or via vaccination and on the individual level of exposure to influenza in the community or in the household. Here, we analyse a large household cohort study conducted in 2007–2010 in Vietnam using innovative statistical methods to ascertain, in an integrative framework, the relative contribution of variables that influence the transmission of seasonal (H1N1, H3N2, B) and pandemic H1N1pdm09 influenza. Influenza infection was diagnosed by haemagglutination-inhibition (HI) antibody assay of paired serum samples. We used a Bayesian data augmentation Markov chain Monte Carlo strategy based on digraphs to reconstruct unobserved chains of transmission in households and estimate transmission parameters. The probability of transmission from an infected individual to another household member was 8% (95% CI, 6%, 10%) on average, and varied with pre-season titers, age and household size. Within households of size 3, the probability of transmission from an infected member to a child with low pre-season HI antibody titers was 27% (95% CI 21%–35%). High pre-season HI titers were protective against infection, with a reduction in the hazard of infection of 59% (95% CI, 44%–71%) and 87% (95% CI, 70%–96%) for intermediate (1∶20–1∶40) and high (≥1∶80) HI titers, respectively. Even after correcting for pre-season HI titers, adults had half the infection risk of children. Twenty-six percent (95% CI: 21%, 30%) of infections may be attributed to household transmission. Our results highlight the importance of integrated analysis by influenza sub-type, age and pre-season HI titers in order to infer influenza transmission risks in and outside of the household.

18.
Missing outcomes or irregularly timed multivariate longitudinal data frequently occur in clinical trials or biomedical studies. The multivariate t linear mixed model (MtLMM) has been shown to be a robust approach to modeling multi-outcome continuous repeated measures in the presence of outliers or heavy-tailed noise. This paper presents a framework for fitting the MtLMM with an arbitrary missing data pattern embodied within multiple outcome variables recorded at irregular occasions. To address the serial correlation among the within-subject errors, a damped exponential correlation structure is considered in the model. Under the missing at random mechanism, an efficient alternating expectation-conditional maximization (AECM) algorithm is used to carry out estimation of parameters and imputation of missing values. Techniques for the estimation of random effects and the prediction of future responses are also investigated. Applications to an HIV-AIDS study and a pregnancy study involving multivariate longitudinal data with missing outcomes, as well as a simulation study, highlight the superiority of MtLMMs in providing more adequate estimation, imputation and prediction performance.

19.
Seaman SR, White IR, Copas AJ, Li L. Biometrics, 2012, 68(1): 129-137.
Two approaches commonly used to deal with missing data are multiple imputation (MI) and inverse-probability weighting (IPW). IPW is also used to adjust for unequal sampling fractions. MI is generally more efficient than IPW but more complex. Whereas IPW requires only a model for the probability that an individual has complete data (a univariate outcome), MI needs a model for the joint distribution of the missing data (a multivariate outcome) given the observed data. Inadequacies in either model may lead to important bias if large amounts of data are missing. A third approach combines MI and IPW to give a doubly robust estimator. A fourth approach (IPW/MI) combines MI and IPW but, unlike doubly robust methods, imputes only isolated missing values and uses weights to account for remaining larger blocks of unimputed missing data, such as would arise, for example, in a cohort study subject to sample attrition and/or unequal sampling fractions. In this article, we examine the performance, in terms of bias and efficiency, of IPW/MI relative to MI and IPW alone and investigate whether Rubin's rules variance estimator is valid for IPW/MI. We prove that Rubin's rules variance estimator is valid for IPW/MI for linear regression with an imputed outcome, we present simulations supporting the use of this variance estimator in more general settings, and we demonstrate that IPW/MI can have advantages over alternatives. IPW/MI is applied to data from the National Child Development Study.
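For reference, Rubin's rules pool estimates from M imputed data sets as in the small sketch below (the per-imputation estimates are made up); in IPW/MI each imputed data set would simply be analyzed with the inverse-probability weights before pooling.

```python
import numpy as np

def rubins_rules(estimates, variances):
    """Pool point estimates and within-imputation variances from M imputed data sets."""
    estimates = np.asarray(estimates, dtype=float)
    variances = np.asarray(variances, dtype=float)
    m = len(estimates)
    q_bar = estimates.mean()                       # pooled point estimate
    w_bar = variances.mean()                       # average within-imputation variance
    b = estimates.var(ddof=1)                      # between-imputation variance
    total_var = w_bar + (1 + 1 / m) * b            # Rubin's total variance
    return q_bar, total_var

# toy regression coefficient estimated on M = 5 multiply imputed data sets
est, var = rubins_rules([0.42, 0.45, 0.40, 0.47, 0.43],
                        [0.010, 0.011, 0.009, 0.012, 0.010])
print(f"pooled estimate {est:.3f}, total variance {var:.4f}")
```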

20.
In many longitudinal studies, the individual characteristics associated with the repeated measures may be possible covariates of the time to an event of interest, and thus it is desirable to model the time-to-event process and the longitudinal process jointly. Statistical analyses may be further complicated in such studies by missing data such as informative dropouts. This article considers a nonlinear mixed-effects model for the longitudinal process and the Cox proportional hazards model for the time-to-event process. We provide a method for simultaneous likelihood inference on the two models that allows for nonignorably missing data. The approach is illustrated with a recent AIDS study by jointly modeling HIV viral dynamics and time to viral rebound.
