Similar Literature
20 similar documents retrieved.
1.

Background

One important preprocessing step in the analysis of microarray data is background subtraction. For high-density oligonucleotide arrays, it is recognized as a crucial step for the overall performance of the analysis pipeline from raw intensities to expression values.

Results

We propose here an algorithm for background estimation based on a model in which the cost function is quadratic in a set of fitting parameters, so that minimization can be performed through linear algebra. The model incorporates two effects: (1) correlated intensities between neighboring features on the chip and (2) sequence-dependent affinities for non-specific hybridization, fitted by an extended nearest-neighbor model.

Conclusion

The algorithm has been tested on 360 GeneChips from publicly available data of recent expression experiments. The algorithm is fast and accurate. Strong correlations between the fitted values for different experiments as well as between the free-energy parameters and their counterparts in aqueous solution indicate that the model captures a significant part of the underlying physical chemistry.
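The computational core of the approach is that a cost function quadratic in its fitting parameters reduces background estimation to a linear least-squares problem. The sketch below illustrates only that reduction; the design matrix, parameter values, and noise level are invented for illustration and are not the paper's actual neighbor/sequence model.

```python
# Minimal sketch: when the background model is linear in its fitting
# parameters, the quadratic cost ||y - X beta||^2 is minimized by ordinary
# linear algebra (here a least-squares solve). The design matrix is
# hypothetical; the paper combines neighbouring-feature intensities with
# sequence-dependent affinity terms.
import numpy as np

rng = np.random.default_rng(0)
n_probes, n_params = 1000, 5

X = rng.normal(size=(n_probes, n_params))   # stand-in for neighbour/sequence features
true_beta = np.array([0.8, 0.3, -0.2, 0.1, 0.05])
y = X @ true_beta + rng.normal(scale=0.1, size=n_probes)  # observed background intensities

# Quadratic cost => closed-form solution of the normal equations.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
background_estimate = X @ beta_hat
print(beta_hat)
```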

2.
Tank and Hopfield have shown that networks of analog neurons can be used to solve linear programming (LP) problems. We have re-examined their approach and found that their network model frequently computes solutions that are only suboptimal or that violate the LP problem's constraints. As their approach has proven unreliable, we have developed a new network model: the goal programming network. To this end, a network model was first developed for goal programming problems, a particular type of LP problem. From the way the network operates on such problems, it was concluded that any overconstrainedness present in an LP formulation should be removed, and we provide a simple procedure to accomplish this.
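To make the goal-programming remedy concrete, the sketch below (using a conventional LP solver rather than an analog network) adds a non-negative deviation variable to each constraint of a deliberately overconstrained toy LP and penalizes the total deviation, so a compromise solution always exists. The numbers and penalty weight are illustrative assumptions, not taken from the paper.

```python
# Goal-programming sketch: relax a possibly inconsistent LP by introducing
# per-constraint deviation variables and minimizing the total deviation.
import numpy as np
from scipy.optimize import linprog

# Original (possibly inconsistent) constraints A x <= b with objective c x, x >= 0.
A = np.array([[1.0, 1.0],
              [-1.0, 0.0],
              [1.0, 1.0]])
b = np.array([1.0, 0.0, -2.0])   # rows 1 and 3 conflict: x1+x2 <= 1 and x1+x2 <= -2
c = np.array([1.0, 2.0])

m, n = A.shape
# Decision vector: [x (n) | deviations d (m)], d >= 0.
# Constraints become A x - d <= b; the objective trades off c x against sum(d).
A_goal = np.hstack([A, -np.eye(m)])
c_goal = np.concatenate([c, 10.0 * np.ones(m)])   # heavy weight on deviations

res = linprog(c_goal, A_ub=A_goal, b_ub=b, bounds=[(0, None)] * (n + m))
x, d = res.x[:n], res.x[n:]
print("x =", x, "deviations =", d)   # only the inconsistent constraint is relaxed
```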

3.
The use of non-invasive genetic sampling to estimate population size in elusive or rare species is increasing. The data generated from this sampling differ from traditional mark-recapture data in that individuals may be captured multiple times within a session, or there may be only a single sampling event. To accommodate this type of data, we develop a method, named capwire, based on a simple urn model containing individuals with two capture probabilities. The method is evaluated using simulations of an urn and of a more biologically realistic system in which individuals occupy space and display heterogeneous movement and DNA-deposition patterns. We also analyse a small number of real data sets. The results indicate that when the data contain capture heterogeneity, the method provides estimates with small bias and good coverage, along with high accuracy and precision. Performance is not as consistent when capture rates are homogeneous and when dealing with populations substantially larger than 100. For the few real data sets where N is approximately known, capwire's estimates are very good. We compare capwire's performance to commonly used rarefaction methods and to two heterogeneity estimators in Program CAPTURE: Mh-Chao and Mh-jackknife. No method works best in all situations. While less precise, the Chao estimator is very robust. We also examine how large samples should be to achieve a given level of accuracy using capwire. We conclude that capwire provides an improved way to estimate N for some DNA-based data sets.
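The sketch below is a minimal simulation of the two-ratio urn idea that underlies capwire, not the authors' implementation: "easy" individuals are drawn several times more often than "hard" ones, and only the capture counts of observed individuals are returned, as in non-invasive sampling. All parameter values are hypothetical.

```python
# Toy two-ratio urn: a population with easy- and hard-to-capture individuals,
# sampled with replacement; unseen individuals leave no record.
import numpy as np

rng = np.random.default_rng(1)

def simulate_counts(n_easy=30, n_hard=70, ratio=3.0, n_samples=150):
    """Draw n_samples captures from an urn in which 'easy' individuals are
    ratio times more likely to be drawn; return counts for individuals seen
    at least once."""
    weights = np.concatenate([np.full(n_easy, ratio), np.full(n_hard, 1.0)])
    probs = weights / weights.sum()
    draws = rng.choice(n_easy + n_hard, size=n_samples, p=probs)
    counts = np.bincount(draws, minlength=n_easy + n_hard)
    return counts[counts > 0]   # non-invasive sampling reveals only captured individuals

observed = simulate_counts()
print("distinct individuals seen:", observed.size,
      "total captures:", observed.sum())
```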

4.
Wang M, Williamson JM. Biometrics 2005, 61(4): 973-981.
We extend the Mantel-Haenszel estimating function to estimate both the intra-cluster pairwise correlation and the main effects for sparse clustered binary data. We propose both a composite likelihood approach and an estimating function approach for the analysis of such data. The proposed estimators are consistent and asymptotically normally distributed. Simulation results demonstrate that the two approaches are comparable in terms of bias and efficiency; however, the estimating equation approach is computationally simpler. Analysis of the Georgia High Blood Pressure survey is used for illustration.

5.
MOTIVATION: Maximum-likelihood methods for solving the consensus sequence identification (CSI) problem on DNA sequences may find only a local optimum rather than the global optimum. Additionally, such methods do not allow logical constraints to be imposed on their models. This study develops a linear programming technique that solves CSI problems by finding an optimal consensus sequence. The method is computationally more efficient, is guaranteed to reach the global optimum, and can be extended to treat more complicated CSI problems with ambiguous conserved patterns. RESULTS: A CSI problem is first formulated as a non-linear mixed 0-1 optimization program, which is then converted into a linear mixed 0-1 program. The proposed method provides the following advantages over maximum-likelihood methods: (1) it is guaranteed to find the global optimum; (2) it can embed various logical constraints into the corresponding model; (3) it is applicable to problems with many long sequences; (4) it can find the second- and third-best solutions. An extension of the proposed linear mixed 0-1 program is also designed to solve CSI problems with an unknown spacer length between conserved regions. Two examples, searching for CRP-binding sites and for FNR-binding sites in the Escherichia coli genome, are used to illustrate and test the proposed method. AVAILABILITY: A software package, Global Site Seer, for the Microsoft Windows operating system is available at http://www.iim.nctu.edu.tw/~cjfu/gss.htm
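As a hedged illustration of the 0-1 programming formulation (not the paper's exact model), the sketch below encodes a toy consensus search as a linear binary program with the PuLP package: binary variables select one consensus character per position, and the objective counts matches across the input sequences.

```python
# Toy consensus-sequence identification as a linear 0-1 program.
import pulp

seqs = ["ACGTG", "ACGTA", "TCGTG", "ACGAG"]   # toy aligned sequences
L, alphabet = len(seqs[0]), "ACGT"

prob = pulp.LpProblem("consensus", pulp.LpMaximize)
x = {(j, c): pulp.LpVariable(f"x_{j}_{c}", cat="Binary")
     for j in range(L) for c in alphabet}

# Exactly one consensus character per position.
for j in range(L):
    prob += pulp.lpSum(x[j, c] for c in alphabet) == 1

# Objective: number of sequence positions matching the consensus.
prob += pulp.lpSum(x[j, s[j]] for s in seqs for j in range(L))

prob.solve(pulp.PULP_CBC_CMD(msg=False))
consensus = "".join(c for j in range(L) for c in alphabet if x[j, c].value() == 1)
print(consensus)   # expected: ACGTG
```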

6.
7.
Hu YJ. Nucleic Acids Research 2003, 31(13): 3446-3449.
RNA molecules play an important role in many biological activities. Knowing a molecule's secondary structure can help us better understand its ability to function. Methods for RNA structure determination have traditionally relied on biochemical, biophysical and phylogenetic analyses. With advances in computer technology, an increasing number of computational approaches have recently been developed. They have different goals and apply various algorithms: some focus on secondary structure prediction for a single sequence; some aim at finding a global alignment of multiple sequences; some predict the structure based on free-energy minimization; some use comparative sequence analysis to determine the structure. In this paper, we describe how to correctly use GPRM, a genetic programming approach to finding common secondary structure elements in a set of unaligned, coregulated or homologous RNA sequences. GPRM can be accessed at http://bioinfo.cis.nctu.edu.tw/service/gprm/.

8.
Evolutionary biologists have adopted simple likelihood models for purposes of estimating ancestral states and evaluating character independence on specified phylogenies; however, for purposes of estimating phylogenies by using discrete morphological data, maximum parsimony remains the only option. This paper explores the possibility of using standard, well-behaved Markov models for estimating morphological phylogenies (including branch lengths) under the likelihood criterion. An important modification of standard Markov models involves making the likelihood conditional on characters being variable, because constant characters are absent in morphological data sets. Without this modification, branch lengths are often overestimated, resulting in potentially serious biases in tree topology selection. Several new avenues of research are opened by an explicitly model-based approach to phylogenetic analysis of discrete morphological data, including combined-data likelihood analyses (morphology + sequence data), likelihood ratio tests, and Bayesian analyses.
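A hedged sketch of the conditioning step described above, in our own notation (the standard correction for data sets containing only variable characters; the paper's exact formulation may differ):

```latex
% For character data x on tree T under a k-state Markov model, the likelihood
% is renormalized over variable characters only:
\[
  L_{\mathrm{var}}(x \mid T) \;=\;
  \frac{\Pr(x \mid T)}{1 - \sum_{j=1}^{k} \Pr(\chi_j \mid T)},
\]
% where \chi_j denotes the (unobservable) constant character in which every
% taxon has state j. Omitting the denominator, i.e. ignoring that constant
% characters cannot be observed, inflates apparent change and hence
% branch-length estimates.
```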

9.
S R Lipsitz. Biometrics 1992, 48(1): 271-281.
In many empirical analyses, the response of interest is categorical with an ordinal scale attached. Many investigators prefer to formulate a linear model, assigning scores to each category of the ordinal response and treating it as continuous. When the covariates are categorical, Haber (1985, Computational Statistics and Data Analysis 3, 1-10) has developed a method to obtain maximum likelihood (ML) estimates of the parameters of the linear model using Lagrange multipliers. However, when the covariates are continuous, the only method we found in the literature is ordinary least squares (OLS), performed under the assumption of homogeneous variance. The OLS estimates are unbiased and consistent but, since variance homogeneity is violated, the OLS estimates of variance can be biased and may not be consistent. We discuss a variance estimate (White, 1980, Econometrica 48, 817-838) that is consistent for the true variance of the OLS parameter estimates. The possible bias encountered by using the naive OLS variance estimate is discussed. An estimated generalized least squares (EGLS) estimator is proposed and its efficiency relative to OLS is discussed. Finally, an empirical comparison of OLS, EGLS, and ML estimators is made.
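The contrast between the naive OLS variance and a heteroskedasticity-consistent (White-type sandwich) variance can be sketched in a few lines of linear algebra; the simulated ordinal scores below are purely illustrative and are not the survey data analyzed in the paper.

```python
# OLS point estimates plus naive vs. White (sandwich) variance estimators.
import numpy as np

rng = np.random.default_rng(2)
n = 200
x = rng.normal(size=n)
y = np.clip(np.round(2.5 + 0.8 * x + rng.normal(size=n)), 1, 4)  # scored ordinal response

X = np.column_stack([np.ones(n), x])
beta = np.linalg.solve(X.T @ X, X.T @ y)          # OLS estimates (unbiased, consistent)
resid = y - X @ beta

XtX_inv = np.linalg.inv(X.T @ X)
naive_var = resid @ resid / (n - X.shape[1]) * XtX_inv        # assumes homogeneous variance
meat = X.T @ (X * resid[:, None] ** 2)
white_var = XtX_inv @ meat @ XtX_inv                          # heteroskedasticity-consistent

print("naive SE:", np.sqrt(np.diag(naive_var)))
print("White SE:", np.sqrt(np.diag(white_var)))
```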

10.
Y Peng, Y Zhang, G Kou, Y Shi. PLoS ONE 2012, 7(7): e41713.
Determining the number of clusters in a data set is an essential yet difficult step in cluster analysis. Since this task involves more than one criterion, it can be modeled as a multiple criteria decision making (MCDM) problem. This paper proposes an MCDM-based approach to estimate the number of clusters for a given data set. In this approach, MCDM methods treat different numbers of clusters as alternatives and the outputs of a clustering algorithm on validity measures as criteria. The proposed method is examined in an experimental study using three MCDM methods, the well-known k-means clustering algorithm, ten relative measures, and fifteen public-domain UCI machine learning data sets. The results show that MCDM methods work fairly well in estimating the number of clusters in the data and outperform the ten relative measures considered in the study.
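A minimal sketch of the pipeline described above, assuming scikit-learn is available: candidate cluster numbers play the role of MCDM alternatives, validity indices the role of criteria, and a simple equal-weight aggregation stands in for the three MCDM methods used in the paper.

```python
# Alternatives = candidate k; criteria = clustering validity indices;
# a weighted-sum ranking stands in for the MCDM methods.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score, calinski_harabasz_score, davies_bouldin_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

rows = []
for k in range(2, 8):                                # alternatives: candidate k
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    rows.append([silhouette_score(X, labels),        # higher is better
                 calinski_harabasz_score(X, labels), # higher is better
                 -davies_bouldin_score(X, labels)])  # lower is better, so negate

M = np.array(rows)
M = (M - M.min(axis=0)) / (M.max(axis=0) - M.min(axis=0))   # normalize criteria to [0, 1]
scores = M.mean(axis=1)                                      # equal-weight aggregation
best_k = range(2, 8)[int(np.argmax(scores))]
print("estimated number of clusters:", best_k)
```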

11.
Models for longitudinal data: a generalized estimating equation approach (total citations: 84; self-citations: 0; cited by others: 84)
S L Zeger, K Y Liang, P S Albert. Biometrics 1988, 44(4): 1049-1060.
This article discusses extensions of generalized linear models for the analysis of longitudinal data. Two approaches are considered: subject-specific (SS) models in which heterogeneity in regression parameters is explicitly modelled; and population-averaged (PA) models in which the aggregate response for the population is the focus. We use a generalized estimating equation approach to fit both classes of models for discrete and continuous outcomes. When the subject-specific parameters are assumed to follow a Gaussian distribution, simple relationships between the PA and SS parameters are available. The methods are illustrated with an analysis of data on mother's smoking and children's respiratory disease.
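For a population-averaged analysis of this kind, the GEE implementation in statsmodels can be used as sketched below; the variable names and simulated data only loosely mimic the smoking/respiratory-illness example and are not the paper's data.

```python
# Population-averaged logistic GEE with an exchangeable working correlation.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n_children, n_visits = 150, 4
child = np.repeat(np.arange(n_children), n_visits)
smoke = np.repeat(rng.integers(0, 2, n_children), n_visits)   # mother smokes (0/1)
age = np.tile(np.arange(n_visits), n_children)                # visit index as an age proxy
b = rng.normal(scale=0.7, size=n_children)                    # child-level heterogeneity
p = 1 / (1 + np.exp(-(-1.0 + 0.5 * smoke - 0.1 * age + b[child])))
wheeze = rng.binomial(1, p)

df = pd.DataFrame({"id": child, "smoke": smoke, "age": age, "wheeze": wheeze})

model = smf.gee("wheeze ~ smoke + age", groups="id", data=df,
                family=sm.families.Binomial(),
                cov_struct=sm.cov_struct.Exchangeable())
print(model.fit().summary())
```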

12.
13.

Background

Predicting patient prognosis from large-scale genomic data is a fundamentally challenging problem in genomic medicine, and prognosis remains poor in many diseases. This may be caused by the high complexity of biological systems, in which multiple biological components and their hierarchical relationships are involved. Moreover, it is challenging to develop robust computational solutions for high-dimension, low-sample-size data.

Results

In this study, we propose a Pathway-Associated Sparse Deep Neural Network (PASNet) that not only predicts patients' prognoses but also describes the complex biological processes, in terms of biological pathways, associated with prognosis. PASNet models a multilayered, hierarchical biological system of genes and pathways to predict clinical outcomes by leveraging deep learning. The sparse solution of PASNet provides a capability for model interpretability that most conventional fully-connected neural networks lack. We applied PASNet to long-term survival prediction in glioblastoma multiforme (GBM), a primary brain cancer with poor prognosis. The predictive performance of PASNet was evaluated with multiple cross-validation experiments. PASNet showed a higher Area Under the Curve (AUC) and F1-score than previous long-term survival prediction classifiers, and the significance of PASNet's performance was assessed by the Wilcoxon signed-rank test. Furthermore, the biological pathways identified by PASNet have been reported as significant pathways in GBM in previous biological and medical research.

Conclusions

PASNet can describe the biological processes underlying clinical outcomes while predicting prognosis more accurately than current state-of-the-art methods. To the best of our knowledge, PASNet is the first pathway-based deep neural network to represent hierarchical relationships between genes and pathways and their nonlinear effects. Additionally, PASNet is promising owing to its flexible model representation and interpretability, embodying the strengths of deep learning. The open-source code of PASNet is available at https://github.com/DataX-JieHao/PASNet.
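The central architectural idea, a sparse gene-to-pathway layer whose connections are restricted by known pathway membership, can be sketched as a masked linear layer; the PyTorch code below uses toy sizes and a random membership matrix and is not PASNet's published implementation (see the repository above for that).

```python
# Toy masked gene -> pathway layer: each hidden node corresponds to a pathway
# and only receives input from its member genes.
import torch
import torch.nn as nn

n_genes, n_pathways = 200, 20
# membership[i, j] = 1 if gene i belongs to pathway j (random toy membership here).
membership = (torch.rand(n_genes, n_pathways) < 0.1).float()

class PathwayLayer(nn.Module):
    def __init__(self, mask):
        super().__init__()
        self.mask = mask
        self.weight = nn.Parameter(torch.randn(mask.shape) * 0.01)
        self.bias = nn.Parameter(torch.zeros(mask.shape[1]))

    def forward(self, x):
        # Zero out connections between genes and pathways they do not belong to.
        return torch.sigmoid(x @ (self.weight * self.mask) + self.bias)

model = nn.Sequential(
    PathwayLayer(membership),        # sparse gene -> pathway layer
    nn.Linear(n_pathways, 2),        # pathway -> outcome (e.g. long/short-term survival)
)
expr = torch.randn(8, n_genes)       # a toy batch of expression profiles
print(model(expr).shape)             # torch.Size([8, 2])
```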

14.
15.
16.
In living organisms, genes and other molecules interact to form complex regulatory networks, and biological processes exist in the form of such networks, ranging from metabolic pathway networks to transcriptional regulatory networks and from signal transduction networks to protein-protein interaction networks. Network phenomena are therefore a fundamental and characteristic feature of the complexity of life. This paper systematically reviews four classes of models for reconstructing gene regulatory networks from expression profile data: Boolean network models, linear models, differential equation models, and Bayesian network models, and analyzes and summarizes each class of model in depth. The paper also discusses network reconstruction research that draws on genome sequence information, protein-protein interaction information, and biomedical literature, providing a valuable reference for revealing the complex mechanisms of life at the systems biology level.
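As a concrete example of the simplest model class reviewed above, the sketch below defines a toy three-gene Boolean network with invented update rules and enumerates the attractor reached from every initial state.

```python
# Toy Boolean network: each gene is ON/OFF and is updated synchronously by a
# logical function of its regulators. The rules below are illustrative only.
from itertools import product

def update(state):
    """Synchronous update of a toy 3-gene Boolean network (genes A, B, C)."""
    a, b, c = state
    return (
        b and not c,     # A is activated by B and repressed by C
        a or c,          # B is activated by A or C
        not a,           # C is repressed by A
    )

# Enumerate the state space and report the attractor reached from each state.
for start in product([False, True], repeat=3):
    state, seen = start, []
    while state not in seen:
        seen.append(state)
        state = update(state)
    cycle = seen[seen.index(state):]
    print(start, "-> attractor:", cycle)
```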

17.
18.
T Jombart, R M Eggo, P J Dodd, F Balloux. Heredity 2011, 106(2): 383-390.
Epidemiology and public health planning will increasingly rely on the analysis of genetic sequence data. In particular, genetic data coupled with the dates and locations of sampled isolates can be used to reconstruct the spatiotemporal dynamics of pathogens during outbreaks. Thus far, phylogenetic methods have been used to tackle this issue. Although these approaches have proved useful for describing the spread of pathogens, they do not aim at directly reconstructing the underlying transmission tree. Instead, phylogenetic models infer most recent common ancestors between pairs of isolates, which can be inadequate for densely sampled recent outbreaks, where the sample includes ancestral and descendant isolates. In this paper, we introduce a novel method based on a graph approach to reconstruct transmission trees directly from genetic data. Using simulated data, we show that our approach can efficiently reconstruct genealogies of isolates in situations where classical phylogenetic approaches fail to do so. We then illustrate our method by analyzing data from the early stages of the swine-origin A/H1N1 influenza pandemic. Using 433 isolates sequenced at both the hemagglutinin and neuraminidase genes, we reconstruct the likely history of the worldwide spread of this new influenza strain. The presented methodology opens new perspectives for the analysis of genetic data in the context of disease outbreaks.
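A hedged sketch of the graph intuition, not the authors' algorithm: assign each isolate, as its putative infector, the genetically closest isolate sampled at an earlier date. The toy sequences and sampling days below are placeholders.

```python
# Greedy transmission-graph sketch from pairwise genetic distances and dates.
isolates = {
    "A": ("ACGTACGT", 0),   # (sequence, sampling day)
    "B": ("ACGTACGA", 3),
    "C": ("ACGAACGA", 5),
    "D": ("ACGTACGC", 6),
}

def hamming(s1, s2):
    """Number of mismatched positions between two equal-length sequences."""
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))

ancestry = {}
for name, (seq, day) in isolates.items():
    earlier = [(hamming(seq, s2), d2, n2)
               for n2, (s2, d2) in isolates.items() if d2 < day]
    ancestry[name] = min(earlier)[2] if earlier else None   # None marks the index case

print(ancestry)   # {'A': None, 'B': 'A', 'C': 'B', 'D': 'A'}
```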

19.
Genetic structure is ubiquitous in wild populations and is the result of natural selection, genetic drift, mutation, and gene flow. Genetic drift and divergent selection promote the generation of genetic structure, while gene flow homogenizes subpopulations. The ability to detect genetic structure from marker data diminishes rapidly as the level of differentiation among subpopulations decreases. Weak genetic structure may be unimportant over evolutionary time scales but can have important implications in ecology and conservation biology. In this paper we examine methods for detecting and quantifying weak genetic structure using simulated data. We simulated populations consisting of two putative subpopulations evolving for up to 50 generations with varying degrees of gene flow (migration) and varying amounts of information (allelic diversity). A number of techniques are available to detect and quantify genetic structure, but here we concentrate on four methods: F_ST, population assignment, relatedness, and sibship assignment. Under the simple mating system simulated here, the four methods produce qualitatively similar results. However, the assignment method performed relatively poorly when genetic structure was weak, and we therefore caution against using this method when the analytical aim is to detect fine-scale patterns. Further work should examine situations with different mating systems, for example where a few individuals dominate the reproductive output of the population. This study will help workers design their experiments (e.g., choose sample sizes of markers and individuals) and decide which methods are likely to be most appropriate for their particular data.
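As a reminder of what the simplest of these four measures computes, the sketch below evaluates Wright's F_ST for one biallelic locus in two subpopulations, F_ST = (H_T - H_S)/H_T, with illustrative allele frequencies chosen to show how small the statistic is when structure is weak.

```python
# Wright's F_ST for one biallelic locus and two subpopulations (toy frequencies).
p1, p2 = 0.55, 0.45            # allele frequency in each subpopulation
h1, h2 = 2 * p1 * (1 - p1), 2 * p2 * (1 - p2)
h_s = (h1 + h2) / 2            # mean within-subpopulation expected heterozygosity
p_bar = (p1 + p2) / 2
h_t = 2 * p_bar * (1 - p_bar)  # total expected heterozygosity
fst = (h_t - h_s) / h_t
print(f"F_ST = {fst:.4f}")     # a small value: weak structure is hard to detect
```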

20.
The Maximal Margin (MAMA) linear programming classification algorithm has recently been proposed and tested for cancer classification based on expression data, and it demonstrated sound performance on publicly available expression datasets. We developed a web interface to give potential users easy access to the MAMA classification tool. Basic and advanced options provide flexibility in its use. The input data format is the same as that used in most publicly available datasets, which makes the web resource particularly convenient for non-expert machine-learning users working in the field of expression data analysis.
