Similar literature
1.
Single-particle cryo-electron microscopy (cryo-EM) is a technique that takes projection images of biomolecules frozen at cryogenic temperatures. A major advantage of this technique is its ability to image single biomolecules in heterogeneous conformations. While this poses a challenge for data analysis, recent algorithmic advances have enabled the recovery of heterogeneous conformations from the noisy imaging data. Here, we review methods for the reconstruction and heterogeneity analysis of cryo-EM images, ranging from linear-transformation-based methods to nonlinear deep generative models. We overview the dimensionality-reduction techniques used in heterogeneous 3D reconstruction methods and specify what information each method can infer from the data. Then, we review the methods that use cryo-EM images to estimate probability distributions over conformations in reduced subspaces or predefined by atomistic simulations. We conclude with the ongoing challenges for the cryo-EM community.

2.
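Data mining of gene expression profiling microarrays   Total citations: 3, self-citations: 1, citations by others: 3
With the rapid development of microarray (gene chip) technology, expression profiling microarray analysis, aCGH, and related methods have been widely applied across the life sciences, and the resulting data have grown exponentially. Extracting biologically meaningful results from these massive data sets has become a difficult problem for biologists. This paper reviews data mining methods for expression profiling microarrays. It introduces the basic analysis workflow and the current key analysis directions, such as GO analysis, pathway and regulatory network analysis, and cluster analysis, together with their computational principles and several easy-to-use analysis software packages. It also describes how several free scientific computing packages are applied to the bioinformatic analysis of expression profiles. The aim is to provide a reference for researchers engaged in microarray analysis.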

3.
The analysis of the large amount of data generated in mass spectrometry-based proteomics experiments represents a significant challenge and is currently a bottleneck in many proteomics projects. In this review we discuss critical issues related to data processing and analysis in proteomics and describe available methods and tools. We place special emphasis on the elaboration of results that are supported by sound statistical arguments.

4.
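GA/WV: a feature gene selection algorithm for mining cancer gene expression profiles   Total citations: 1, self-citations: 0, citations by others: 1
Identifying sets of feature genes in cancer expression profiles can advance research on cancer type classification, and may also give patients better clinical diagnoses. Although some methods have succeeded in gene expression profile analysis, cancer classification based on gene expression profile data remains a great challenge, mainly because a general and reliable method for assessing gene importance is lacking. GA/WV is a new method for assessing the importance of genes for classification from complex biological expression data. The feature gene sets obtained by combining a genetic algorithm (GA) with the weighted voting classification algorithm (WV) are suitable not only for the WV classifier but also for other classifiers. The GA/WV method was validated on cancer gene expression profile data sets, and the results show that it is a successful and reliable feature gene selection method.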

5.
The accurate determination of the biological effects of low doses of pollutants is a major public health challenge. DNA microarrays are a powerful tool for investigating small intracellular changes. However, the inherent low reliability of this technique, the small number of replicates and the lack of suitable statistical methods for the analysis of such a large number of attributes (genes) impair accurate data interpretation. To overcome this problem, we combined results of two independent analysis methods (ANOVA and RELIEF). We applied this analysis protocol to compare gene expression patterns in Saccharomyces cerevisiae growing in the absence and continuous presence of varying low doses of radiation. Global distribution analysis highlights the importance of mitochondrial membrane functions in the response. We demonstrate that microarrays detect cellular changes induced by irradiation at doses that are 1000-fold lower than the minimal dose associated with mutagenic effects.

6.
《Biophysical journal》2020,118(3):765-780
Biomolecular simulations are intrinsically high dimensional and generate noisy data sets of ever-increasing size. Extracting important features from the data is crucial for understanding the biophysical properties of molecular processes, but remains a big challenge. Machine learning (ML) provides powerful dimensionality reduction tools. However, such methods are often criticized as resembling black boxes with limited human-interpretable insight. We use methods from supervised and unsupervised ML to efficiently create interpretable maps of important features from molecular simulations. We benchmark the performance of several methods, including neural networks, random forests, and principal component analysis, using a toy model with properties reminiscent of macromolecular behavior. We then analyze three diverse biological processes: conformational changes within the soluble protein calmodulin, ligand binding to a G protein-coupled receptor, and activation of an ion channel voltage-sensor domain, unraveling features critical for signal transduction, ligand binding, and voltage sensing. This work demonstrates the usefulness of ML in understanding biomolecular states and demystifying complex simulations.

7.
L Boddy  M F Wilkins  C W Morris 《Cytometry》2001,44(3):195-209
BACKGROUND: Analytical flow cytometry (AFC), by quantifying sometimes more than 10 optical parameters on cells at rates of approximately 10^3 cells/s, rapidly generates vast quantities of multidimensional data, which provides a considerable challenge for data analysis. We review the application of multivariate data analysis and pattern recognition techniques to flow cytometry. METHODS: Approaches were divided into two broad types depending on whether the aim was identification or clustering. Multivariate statistical approaches, supervised artificial neural networks (ANNs), problems of overlapping character distributions, unbounded data sets, missing parameters, scaling up, and estimating proportions of different types of cells comprised the first category. Classic clustering methods, fuzzy clustering, and unsupervised ANNs comprised the second category. We demonstrate the state of the art by using AFC data on marine phytoplankton populations. RESULTS AND CONCLUSIONS: Information held within the large quantities of data generated by AFC was tractable using ANNs, but for field studies the problem of obtaining suitable training data needs to be resolved, and coping with an almost infinite number of cell categories needs further research.

8.
MOTIVATION: Discriminant analysis for high-dimensional and low-sample-sized data has become a hot research topic in bioinformatics, mainly motivated by its importance and challenge in applications to tumor classifications for high-dimensional microarray data. Two of the popular methods are the nearest shrunken centroids, also called predictive analysis of microarray (PAM), and shrunken centroids regularized discriminant analysis (SCRDA). Both methods are modifications to the classic linear discriminant analysis (LDA) in two aspects tailored to high-dimensional and low-sample-sized data: one is the regularization of the covariance matrix, and the other is variable selection through shrinkage. In spite of their usefulness, there are potential limitations with each method. The main concern is that both PAM and SCRDA are possibly too extreme: the covariance matrix in the former is restricted to be diagonal while in the latter there is barely any restriction. Based on the biology of gene functions and given the feature of the data, it may be beneficial to estimate the covariance matrix as an intermediate between the two; furthermore, more effective shrinkage schemes may be possible. RESULTS: We propose modified LDA methods to integrate biological knowledge of gene functions (or variable groups) into classification of microarray data. Instead of simply treating all the genes independently or imposing no restriction on the correlations among the genes, we group the genes according to their biological functions extracted from existing biological knowledge or data, and propose regularized covariance estimators that encourage between-group gene independence and within-group gene correlations while maintaining the flexibility of any general covariance structure. Furthermore, we propose a shrinkage scheme on groups of genes that tends to retain or remove a whole group of the genes altogether, in contrast to the standard shrinkage on individual genes. We show that one of the proposed methods performed better than PAM and SCRDA in a simulation study and several real data examples.
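The "standard shrinkage on individual genes" mentioned in this abstract is usually described as soft-thresholding of each gene's standardized class differences. The sketch below illustrates only that baseline idea and is not the group-wise scheme the authors propose; the function names, the dictionary layout, and the threshold value are assumptions made for illustration.

```python
def soft_threshold(d, delta):
    """Shrink d toward zero by delta; values within delta of zero become 0."""
    magnitude = abs(d) - delta
    if magnitude <= 0:
        return 0.0
    return magnitude if d > 0 else -magnitude


def shrink_and_select(differences, delta):
    """Soft-threshold per-class differences and list the genes that survive.

    differences: hypothetical dict mapping gene name -> list of standardized
    differences between each class centroid and the overall centroid.
    A gene is kept only if at least one shrunken difference is non-zero.
    """
    shrunken = {
        gene: [soft_threshold(d, delta) for d in ds]
        for gene, ds in differences.items()
    }
    kept = [gene for gene, ds in shrunken.items() if any(d != 0.0 for d in ds)]
    return shrunken, kept


shrunken, kept = shrink_and_select({"g1": [2.5, -2.5], "g2": [0.4, -0.3]}, 1.0)
```

In this toy input, gene g2 has no class difference larger than the threshold, so it drops out of the classifier entirely, which is how shrinkage doubles as variable selection.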

9.
MOTIVATION: Array comparative genomic hybridization (CGH) allows detection and mapping of copy number of DNA segments. A challenge is to make inferences about the copy number structure of the genome. Several statistical methods have been proposed to determine genomic segments with different copy number levels. However, to date, no comprehensive comparison of various characteristics of these methods exists. Moreover, the segmentation results have not been utilized in downstream analyses. RESULTS: We describe a comparison of three popular and publicly available methods for the analysis of array CGH data and we demonstrate how segmentation results may be utilized in downstream analyses such as testing and classification, yielding higher power and prediction accuracy. Since the methods operate on individual chromosomes, we also propose a novel procedure for merging segments across the genome, which results in an interpretable set of copy number levels, and thus facilitates identification of copy number alterations in each genome. AVAILABILITY: http://www.bioconductor.org

10.
Zhao H  Zuo C  Chen S  Bang H 《Biometrics》2012,68(3):717-725
Summary: Increasingly, estimations of health care costs are used to evaluate competing treatments or to assess the expected expenditures associated with certain diseases. In health policy and economics, the primary focus of these estimations has been on the mean cost, because the total cost can be derived directly from the mean cost, and because information about total resources utilized is highly relevant for policymakers. Yet, the median cost also could be important, both as an intuitive measure of central tendency in cost distribution and as a subject of interest to payers and consumers. In many prospective studies, cost data collection is incomplete for some subjects due to right censoring, which typically is caused by loss to follow-up or by limited study duration. Censoring poses a unique challenge for cost data analysis because of so-called induced informative censoring, in that traditional methods suited for survival data generally are invalid in censored cost estimation. In this article, we propose methods for estimating the median cost and its confidence interval (CI) when data are subject to right censoring. We also consider the estimation of the ratio and difference of two median costs and their CIs. These methods can be extended to the estimation of other quantiles and other informatively censored data. We conduct simulation and real data analysis in order to examine the performance of the proposed methods.

11.
12.
A major challenge in the analysis of DNA methylation (DNAm) data is variability introduced from intra-sample cellular heterogeneity, such as whole blood which is a convolution of DNAm profiles across a unique cell type. When this source of variability is confounded with an outcome of interest, if unaccounted for, false positives ensue. Current methods to estimate the cell type proportions in whole blood DNAm samples are only appropriate for one technology and lead to technology-specific biases if applied to data generated from other technologies. Here, we propose the technology-independent alternative: methylCC, which is available at https://github.com/stephaniehicks/methylCC.

13.
Microarray expression studies suffer from the problem of batch effects and other unwanted variation. Many methods have been proposed to adjust microarray data to mitigate the problems of unwanted variation. Several of these methods rely on factor analysis to infer the unwanted variation from the data. A central problem with this approach is the difficulty in discerning the unwanted variation from the biological variation that is of interest to the researcher. We present a new method, intended for use in differential expression studies, that attempts to overcome this problem by restricting the factor analysis to negative control genes. Negative control genes are genes known a priori not to be differentially expressed with respect to the biological factor of interest. Variation in the expression levels of these genes can therefore be assumed to be unwanted variation. We name this method "Remove Unwanted Variation, 2-step" (RUV-2). We discuss various techniques for assessing the performance of an adjustment method and compare the performance of RUV-2 with that of other commonly used adjustment methods such as Combat and Surrogate Variable Analysis (SVA). We present several example studies, each concerning genes differentially expressed with respect to gender in the brain, and find that RUV-2 performs as well as or better than other methods. Finally, we discuss the possibility of adapting RUV-2 for use in studies not concerned with differential expression and conclude that there may be promise but substantial challenges remain.

14.
BACKGROUND/AIMS: Complex traits pose a particular challenge to standard methods for segregation analysis (SA), and for such traits it is difficult to assess the ability of complex SA (CSA) to approximate the true mode of inheritance. Here we use an oligogenic Bayesian Markov chain Monte Carlo method for SA (OSA) to verify results from a single-locus likelihood-based CSA for data on a quantitative measure of reading ability. METHODS: We compared the profile likelihood from CSA, maximized over the trait allele frequency, to the posterior distribution of genotype effects from OSA to explore differences in the overall parameter estimates from SA on the original phenotype data and the same data Winsorized to reduce the potential influence of three outlying data points. RESULTS: Bayesian OSA revealed two modes of inheritance, one of which coincided with the QTL model from CSA. Winsorizing abolished the model originally estimated by CSA; both CSA and OSA identified only the second OSA model. CONCLUSION: Differences between the results from the two methods alerted us to the presence of influential data points, and identified the QTL model best supported by the data. Thus, the Bayesian OSA proved a valuable tool for assessing and verifying inheritance models from CSA.
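Winsorizing, as used in this abstract to limit the influence of three outlying data points, replaces the most extreme observations with the nearest retained value instead of discarding them. A minimal sketch of count-based Winsorizing follows; the function name, its parameters, and the sample values are assumptions for illustration, and the abstract does not state exactly how the authors chose their replacement values.

```python
def winsorize(values, n_low=0, n_high=0):
    """Clamp the n_low smallest and n_high largest values to the nearest kept value.

    The sample size is unchanged; only the extreme observations are pulled in.
    """
    if n_low + n_high >= len(values):
        raise ValueError("cannot winsorize every observation")
    ordered = sorted(values)
    lower = ordered[n_low]                      # smallest value that is kept
    upper = ordered[len(ordered) - 1 - n_high]  # largest value that is kept
    return [min(max(v, lower), upper) for v in values]


# Hypothetical trait scores with one extreme value at each end.
scores = [-50, 2, 3, 4, 100]
adjusted = winsorize(scores, n_low=1, n_high=1)
```

Because the observations stay in the data set, family structure in a pedigree analysis is preserved, which is one common reason to prefer Winsorizing over trimming.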

15.
Putting the "landscape" in landscape genetics   Total citations: 1, self-citations: 0, citations by others: 1
Landscape genetics has emerged as a new research area that integrates population genetics, landscape ecology and spatial statistics. Researchers in this field can combine the high resolution of genetic markers with spatial data and a variety of statistical methods to evaluate the role that landscape variables play in shaping genetic diversity and population structure. While interest in this research area is growing rapidly, our ability to fully utilize landscape data, test explicit hypotheses and truly integrate these diverse disciplines has lagged behind. Part of the current challenge in the development of the field of landscape genetics is bridging the communication and knowledge gap between these highly specific and technical disciplines. The goal of this review is to help bridge this gap by exposing geneticists to terminology, sampling methods and analysis techniques widely used in landscape ecology and spatial statistics but rarely addressed in the genetics literature. We offer a definition for the term "landscape genetics", provide an overview of the landscape genetics literature, give guidelines for appropriate sampling design and useful analysis techniques, and discuss future directions in the field. We hope this review will stimulate increased dialog and enhance interdisciplinary collaborations advancing this exciting new field.

16.
Nested effects models for high-dimensional phenotyping screens   Total citations: 2, self-citations: 0, citations by others: 2
MOTIVATION: In high-dimensional phenotyping screens, a large number of cellular features is observed after perturbing genes by knockouts or RNA interference. Comprehensive analysis of perturbation effects is one of the most powerful techniques for attributing functions to genes, but not much work has been done so far to adapt statistical and computational methodology to the specific needs of large-scale and high-dimensional phenotyping screens. RESULTS: We introduce and compare probabilistic methods to efficiently infer a genetic hierarchy from the nested structure of observed perturbation effects. These hierarchies elucidate the structures of signaling pathways and regulatory networks. Our methods achieve two goals: (1) they reveal clusters of genes with highly similar phenotypic profiles, and (2) they order (clusters of) genes according to subset relationships between phenotypes. We evaluate our algorithms in the controlled setting of simulation studies and show their practical use in two experimental scenarios: (1) a data set investigating the response to microbial challenge in Drosophila melanogaster, and (2) a compendium of expression profiles of Saccharomyces cerevisiae knockout strains. We show that our methods identify biologically justified genetic hierarchies of perturbation effects. AVAILABILITY: The software used in our analysis is freely available in the R package 'nem' from www.bioconductor.org.

17.
Recent technological advances have made it possible to collect high-dimensional genomic data along with clinical data on a large number of subjects. In the studies of chronic diseases such as cancer, it is of great interest to integrate clinical and genomic data to build a comprehensive understanding of the disease mechanisms. Despite extensive studies on integrative analysis, it remains an ongoing challenge to model the interaction effects between clinical and genomic variables, due to high dimensionality of the data and heterogeneity across data types. In this paper, we propose an integrative approach that models interaction effects using a single-index varying-coefficient model, where the effects of genomic features can be modified by clinical variables. We propose a penalized approach for separate selection of main and interaction effects. Notably, the proposed methods can be applied to right-censored survival outcomes based on a Cox proportional hazards model. We demonstrate the advantages of the proposed methods through extensive simulation studies and provide applications to a motivating cancer genomic study.

18.
Plasmode is a term coined several years ago to describe data sets that are derived from real data but for which some truth is known. Omic techniques, most especially microarray and genomewide association studies, have catalyzed a new zeitgeist of data sharing that is making data and data sets publicly available on an unprecedented scale. Coupling such data resources with a science of plasmode use would allow statistical methodologists to vet proposed techniques empirically (as opposed to only theoretically) and with data that are by definition realistic and representative. We illustrate the technique of empirical statistics by consideration of a common task when analyzing high dimensional data: the simultaneous testing of hundreds or thousands of hypotheses to determine which, if any, show statistical significance warranting follow-on research. The now-common practice of multiple testing in high dimensional experiment (HDE) settings has generated new methods for detecting statistically significant results. Although such methods have heretofore been subject to comparative performance analysis using simulated data, simulating data that realistically reflect data from an actual HDE remains a challenge. We describe a simulation procedure using actual data from an HDE where some truth regarding parameters of interest is known. We use the procedure to compare estimates for the proportion of true null hypotheses, the false discovery rate (FDR), and a local version of FDR obtained from 15 different statistical methods.
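To make the quantities in this abstract concrete, the sketch below shows two widely used textbook procedures: the Benjamini-Hochberg step-up rule for controlling the FDR, and a simple threshold-based estimate of the proportion of true null hypotheses. The abstract does not say whether either is among the 15 methods compared, so treat them as illustrative assumptions; the function names, the tuning value 0.5, and the p-values are hypothetical.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return one boolean per hypothesis: True if rejected at FDR level alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose ordered p-value satisfies p_(k) <= (k / m) * alpha.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            k_max = rank
    rejected = [False] * m
    for idx in order[:k_max]:
        rejected[idx] = True
    return rejected


def estimate_null_proportion(p_values, lam=0.5):
    """Estimate the share of true nulls from the p-values above the cutoff lam.

    True nulls have roughly uniform p-values, so about (1 - lam) of them
    should exceed lam; the estimate is capped at 1.
    """
    above = sum(1 for p in p_values if p > lam)
    return min(1.0, above / (len(p_values) * (1.0 - lam)))


p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
decisions = benjamini_hochberg(p, alpha=0.05)
pi0 = estimate_null_proportion([0.01, 0.02, 0.03, 0.6])
```

With a plasmode data set, the hypotheses that are truly null are known, so the rejections and the estimated null proportion can be checked directly against the truth.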

19.

Thanks to advances in high-throughput sequencing technologies, the importance of microbiome to human health and disease has been increasingly recognized. Analyzing microbiome data from sequencing experiments is challenging due to their unique features such as compositional data, excessive zero observations, overdispersion, and complex relations among microbial taxa. Clustered microbiome data have become prevalent in recent years from designs such as longitudinal studies, family studies, and matched case–control studies. The within-cluster dependence compounds the challenge of the microbiome data analysis. Methods that properly accommodate intra-cluster correlation and features of the microbiome data are needed. We develop robust and powerful differential composition tests for clustered microbiome data. The methods do not rely on any distributional assumptions on the microbial compositions, which provides flexibility to model various correlation structures among taxa and among samples within a cluster. By leveraging the adjusted sandwich covariance estimate, the methods properly accommodate sample dependence within a cluster. The two-part version of the test can further improve power in the presence of excessive zero observations. Different types of confounding variables can be easily adjusted for in the methods. We perform extensive simulation studies under commonly adopted clustered data designs to evaluate the methods. We demonstrate that the methods properly control the type I error under all designs and are more powerful than existing methods in many scenarios. The usefulness of the proposed methods is further demonstrated with two real datasets from longitudinal microbiome studies on pregnant women and inflammatory bowel disease patients. The methods have been incorporated into the R package “miLineage” publicly available at https://tangzheng1.github.io/tanglab/software.html.


20.
Sequencing studies have been discovering numerous rare variants, allowing the identification of the effects of rare variants on disease susceptibility. As a method to increase the statistical power of studies on rare variants, several groupwise association tests that group rare variants in genes and detect associations between genes and diseases have been proposed. One major challenge in these methods is to determine which variants are causal in a group, and to overcome this challenge, previous methods used prior information that specifies how likely each variant is to be causal. Another source of information that can be used to determine causal variants is the observed data because case individuals are likely to have more causal variants than control individuals. In this article, we introduce a likelihood ratio test (LRT) that uses both data and prior information to infer which variants are causal and uses this finding to determine whether a group of variants is involved in a disease. We demonstrate through simulations that LRT achieves higher power than previous methods. We also evaluate our method on mutation screening data of the susceptibility gene for ataxia telangiectasia, and show that LRT can detect an association in real data. To increase the computational speed of our method, we show how we can decompose the computation of LRT, and propose an efficient permutation test. With this optimization, we can efficiently compute an LRT statistic and its significance at a genome-wide level. The software for our method is publicly available at http://genetics.cs.ucla.edu/rarevariants.
