Similar literature
1.
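Research Progress in Gene Chip Data Analysis Methods   Total citations: 2, self-citations: 0, citations by others: 2
The advent of gene chip technology has changed the outlook of biomedical research, but the massive volume of data it produces is a bottleneck limiting its development. To extract the valuable information hidden in these data, many sophisticated computational tools and methods for gene chip data analysis have been tried in recent years. This paper reviews classification methods for gene chip expression data from the past five years: it compares clustering-based classification methods by category, covers current techniques that apply systems-biology ideas such as data mining and information fusion, and evaluates the analysis results.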

2.
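To draw reliable biological conclusions from gene chip experiments, both an optimized experimental design and sound data analysis are required. This paper discusses several experimental design issues related to gene chip data analysis methods, outlines differential expression analysis, clustering analysis, and functional enrichment analysis together with their recent progress, and introduces some software and its applications.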

3.
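As an emerging technology, gene chips have played an important role in many research fields, including botany, zoology, medicine, and agronomy. This paper reviews each stage of gene chip data analysis, including data preprocessing, normalization, identification of differentially expressed genes, and clustering analysis, as well as the application of gene chips in plant functional genomics research.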

4.
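Application of Bioinformatics in Gene Chips   Total citations: 13, self-citations: 1, citations by others: 13
Bioinformatics and gene chips are two new methods and technologies in life science research, and they are closely related: bioinformatics has advanced the research and application of gene chips, while gene chips have enriched the subject matter of bioinformatics. This thesis explores the application of bioinformatics to gene chips, applying bioinformatics methods to high-density gene chip design and to the management and analysis of chip experiment data. It proposes gene chip design criteria from an informatics perspective and an optimized design method for oligonucleotide probes, applies the method to the design of resequencing chips and gene expression chips, and on that basis develops a software system for high-density gene chip design and a system for analyzing experimental results.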

5.
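Gene Chip Data Recognition Based on the AR Model   Total citations: 5, self-citations: 5, citations by others: 0
The autoregressive (AR) model is introduced into gene chip data recognition, and a time-series feature extraction method based on the AR model is proposed. With dynamic time warping (DTW) as the classifier, recognition results on standard tumor gene chip data show that the method can reach a recognition rate of 100% and can be applied to the recognition and classification of gene chip data and to the inference of genetic diseases.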
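To make the classification step concrete, the following is a minimal sketch of dynamic time warping used as a nearest-template classifier. It is not the authors' implementation: the AR feature extraction is omitted, and the function names, labels, and toy sequences are assumptions made only for illustration.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two numeric sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = minimal accumulated cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three admissible predecessor alignments.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def classify_1nn(sample, templates):
    """Return the label of the template closest to `sample` under DTW.

    `templates` is a list of (label, sequence) pairs (hypothetical data layout).
    """
    return min(templates, key=lambda t: dtw_distance(sample, t[1]))[0]


# Toy sequences standing in for per-sample feature series (illustrative only).
templates = [("tumor", [0, 1, 2, 3]), ("normal", [3, 2, 1, 0])]
print(classify_1nn([0, 1, 1, 2, 3], templates))
```

Because DTW allows one element to align with several, the sample `[0, 1, 1, 2, 3]` matches the first template at zero cost even though the two sequences differ in length.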

6.
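Chen Lei, Liu Yihui. 生物信息学 (Chinese Journal of Bioinformatics), 2011, 9(3): 229-234
Gene chip technology is an important research tool in genomics. Gene chip data (microarray data) are typically high-dimensional, which makes dimensionality reduction a necessary step in microarray data analysis. This paper analyzes the lung cancer microarray data provided by G. J. Gordon et al. of Harvard Medical School. Feature attributes are extracted from the microarray data with the t-test and the Wilcoxon rank-sum test, respectively. Then, following the CART (Classification and Regression Tree) algorithm with the Gini diversity index as the error function, a classification tree is grown extensively from the extracted features and subsequently pruned to find the tree of optimal size, with the aim of improving generalization so that the tree adapts well to new data. Experiments show that the method achieves a stable classification accuracy above 96% on the lung cancer microarray data and yields easily understood classification rules and key classification genes.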
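The core of growing a CART tree with a Gini criterion can be sketched as below: compute the Gini impurity of a set of labels, then scan one feature for the threshold that minimizes the weighted impurity of the two children. This is a simplified illustration under assumed names and toy values; feature selection by t-test or rank-sum test and the pruning step are not shown.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(values, labels):
    """Find the threshold on one feature that minimises weighted Gini impurity.

    Returns (threshold, weighted_impurity); samples with value <= threshold
    would go to the left child. Returns (None, parent_impurity) if no split helps.
    """
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, gini(labels))
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold can separate equal values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (threshold, score)
    return best


# Hypothetical expression values of one gene for four samples.
print(best_split([1.0, 2.0, 8.0, 9.0], ["a", "a", "b", "b"]))
```

A full tree would apply `best_split` to every selected gene at each node, recurse on the two children, and then prune back the grown tree.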

7.
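To gain a more systematic understanding of progress in applying gene chips to groundwater contamination research, the literature on gene chip technology and its application in this field was surveyed. The paper outlines the principles, types, and workflow of gene chips; summarizes the latest applications of phylogenetic oligonucleotide arrays and functional gene arrays in groundwater contamination research; discusses problems and countermeasures concerning detection performance, data analysis, and application; and proposes directions for further research on improving gene chip performance and application.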

9.
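Objective: To study the statistical power of the mixed effects model in analyzing tumor expression-profile gene chip data and to examine its analytical performance. Methods: A mixed effects model was used to analyze gene chip data from a tumor case study, with gene set enrichment analysis (GSEA) as the reference for comparing the validity and soundness of the results. Results: In the comparison of the two methods on the tumor gene chip data, beyond the differentially expressed pathways identified by both, the mixed effects model identified 8 additional differentially expressed pathways that GSEA failed to detect, and these are supported by the literature. Of the top 10 differentially expressed pathways identified by the mixed effects model, 6 have existing biological evidence, versus only 4 of the top 10 identified by GSEA. Conclusion: As a typical top-down method, the mixed effects model has the advantage of reducing dimensionality by constructing latent variables, which effectively reduces multiple complex sources of variation and thereby safeguards the accuracy and soundness of the results. Its statistical power exceeds that of GSEA, making it an effective method for screening tumor gene chip data.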

10.
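To address missing values in gene chip data, a method is proposed that exploits the intrinsic link between protein-protein interactions and gene expression, using protein interaction information to improve the accuracy of missing-value estimation. Protein interaction relationships are combined with distances between gene expression profiles to compute expression similarity between genes, and this new similarity measure is used to select a more suitable set of genes for estimating the missing values of a gene with missing data. Combining the new measure with the conventional KNNimpute and LLSimpute methods yields the improved algorithms PPI-KNNimpute and PPI-LLSimpute. Tests on real data sets show that protein interaction information effectively improves the accuracy of missing-value estimation.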
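The idea of letting interaction information influence neighbor selection can be sketched as follows. The abstract does not state how the two sources are combined, so the rule used here (multiplying the expression distance of an interacting gene pair by a factor `alpha` below 1) is purely an assumption for illustration, as are the function name, the data layout, and the unweighted mean; it should not be read as the PPI-KNNimpute algorithm itself.

```python
import math


def knn_impute(expr, target, column, k, interacts=frozenset(), alpha=0.5):
    """Estimate expr[target][column] as the mean over the k most similar genes.

    expr: dict mapping gene -> list of expression values (None = missing).
    interacts: set of frozenset({gene_a, gene_b}) pairs with a known interaction.
    alpha: hypothetical factor (< 1) shrinking the distance of interacting pairs.
    Assumes every candidate shares at least one observed column with `target`.
    """
    def distance(g, h):
        # Root-mean-square difference over columns observed in both genes.
        shared = [(x, y) for x, y in zip(expr[g], expr[h])
                  if x is not None and y is not None]
        d = math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))
        return d * alpha if frozenset((g, h)) in interacts else d

    # Only genes that have a value in the missing column can contribute.
    candidates = [g for g in expr if g != target and expr[g][column] is not None]
    nearest = sorted(candidates, key=lambda g: distance(target, g))[:k]
    return sum(expr[g][column] for g in nearest) / len(nearest)


# Toy data: g1 is missing its third value (illustrative numbers only).
expr = {
    "g1": [1.0, 2.0, None],
    "g2": [1.2, 2.2, 3.0],
    "g3": [1.5, 2.5, 9.0],
    "g4": [5.0, 6.0, 7.0],
}
print(knn_impute(expr, "g1", 2, k=1))
print(knn_impute(expr, "g1", 2, k=1, interacts={frozenset(("g1", "g3"))}, alpha=0.2))
```

In the toy data, the nearest neighbor by expression alone is `g2`; once the assumed interaction between `g1` and `g3` is taken into account, `g3` becomes the nearest neighbor and the estimate changes accordingly.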

11.
Ecological Informatics, 2007, 2(2): 138-149
Ecological patterns are difficult to extract directly from vegetation data. The respective surveys provide a high number of interrelated species occurrence variables. Since often only a limited number of ecological gradients determine species distributions, the data might be represented by much fewer but effectively independent variables. This can be achieved by reducing the dimensionality of the data. Conventional methods are either limited to linear feature extraction (e.g., principal component analysis, and Classical Multidimensional Scaling, CMDS) or require a priori assumptions on the intrinsic data dimensionality (e.g., Nonmetric Multidimensional Scaling, NMDS, and self organized maps, SOM). In this study we explored the potential of Isometric Feature Mapping (Isomap). This new method of dimensionality reduction is a nonlinear generalization of CMDS. Isomap is based on a nonlinear geodesic inter-point distance matrix. Estimating geodesic distances requires one free threshold parameter, which defines linear geometry to be preserved in the global nonlinear distance structure. We compared Isomap to its linear (CMDS) and nonmetric (NMDS) equivalents. Furthermore, the use of geodesic distances allowed also extending NMDS to a version that we called NMDS-G. In addition we investigated a supervised Isomap variant (S-Isomap) and showed that all these techniques are interpretable within a single methodical framework. As an example we investigated seven plots (subdivided in 456 subplots) in different secondary tropical montane forests with 773 species of vascular plants. A key problem for the study of tropical vegetation data is the heterogeneous small scale variability implying large ranges of β-diversity. The CMDS and NMDS methods did not reduce the data dimensionality reasonably. On the contrary, Isomap explained 95% of the data variance in the first five dimensions and provided ecologically interpretable visualizations; NMDS-G yielded similar results. The main shortcoming of the latter was the high computational cost and the requirement to predefine the dimension of the embedding space. The S-Isomap learning scheme did not improve the Isomap variant for an optimal threshold parameter but substantially improved the nonoptimal solutions. We conclude that Isomap as a new ordination method allows effective representations of high dimensional vegetation data sets. The method is promising since it does not require a priori assumptions, and is computationally highly effective.

12.
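Chip Data Mining with SAS-Based Multivariate Statistical Methods   Total citations: 4, self-citations: 0, citations by others: 4
Huang Xiaoyun, Cao Bo, Yang Yue. 生物信息学 (Chinese Journal of Bioinformatics), 2010, 8(2): 147-149
SAS software was used to mine a lung cancer chip experiment from GEO. Non-parametric tests, discriminant analysis, and regression analysis were applied to the expression data of 14 nuclear receptors in that experiment. At the 0.05 significance level, the four genes ER1, VDR, RARα, and RORα were expressed at statistically different levels between adenocarcinoma and squamous cell carcinoma, and RARβ expression differed between the recurrence and non-recurrence groups. Discriminant analysis showed that VDR and RORα expression could predict pathological type, but the overall misclassification rate was high (0.2389); for RARβ and PPARα, the overall misclassification rate for predicting recurrence was even higher (0.3457). In a regression equation built to predict pathological type, the variables selected into the model were again VDR and RORα, with odds ratios of 0.126 and 4.452, respectively. Multivariate statistical methods based on SAS are thus a potential approach to chip data mining; once chip experiments are standardized, conclusions from integrated SAS analysis of data from different chip experiments will help drive hypothesis generation.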

13.
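Visualization of Gene Expression Profile Data Based on Manifold Learning   Total citations: 2, self-citations: 0, citations by others: 2
Visualizing gene expression profiles is essentially a problem of reducing the dimensionality of high-dimensional data. Manifold learning algorithms are used to reduce and visualize gene expression profiles, and the suitability of two typical manifold learning algorithms (Isomap and LLE) for this task is discussed. The quality of the dimensionality reduction is evaluated quantitatively using intra-class and inter-class distances. Analysis of two typical gene chip data sets (a colon cancer gene expression data set and an acute leukemia gene expression data set) finds that the intrinsic dimension of both is below 3, so they can be visualized in a low-dimensional projection space with manifold learning. Compared with projections from conventional dimensionality reduction methods such as PCA and MDS, Isomap gives better visualizations.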
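One simple way to turn intra-class and inter-class distances into a single score is sketched below. The abstract does not give the exact formula, so the ratio used here (mean within-class distance divided by mean between-class distance, lower meaning better separation) is an assumption, as are the function name and the toy coordinates.

```python
import math
from itertools import combinations


def intra_inter_ratio(points, labels):
    """Mean within-class distance divided by mean between-class distance.

    Requires at least one same-class pair and one different-class pair.
    """
    intra, inter = [], []
    for (p, lp), (q, lq) in combinations(zip(points, labels), 2):
        d = math.dist(p, q)  # Euclidean distance in the embedded space
        (intra if lp == lq else inter).append(d)
    return (sum(intra) / len(intra)) / (sum(inter) / len(inter))


# Hypothetical 2-D embedding of four samples from two classes.
embedding = [(0, 0), (0, 1), (10, 0), (10, 1)]
well_separated = intra_inter_ratio(embedding, ["a", "a", "b", "b"])
mixed = intra_inter_ratio(embedding, ["a", "b", "a", "b"])
print(well_separated < mixed)
```

Applied to the outputs of several embeddings of the same labeled data, a score of this kind gives a rough numeric basis for comparing them.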

14.
Aim: Spatial floristic and faunistic data bases promote the investigation of biogeographical gradients in relation to environmental determinants on regional to continental scales. Our aim was to extract major gradients in the distribution of vascular plant species from a grid‐based inventory (the German FLORKART data base) and relate them to long‐term precipitation and temperature records as well as soil conditions. We present an ordination technique capable of coping with this complex data array. The goal was also to sort out the influence of spatial autocorrelation, assuming floristic autocorrelation is anisotropic. Location: Germany, at a spatial resolution of 6′ × 10′. Methods: Isometric feature mapping (Isomap) was applied as a nonlinear ordination method. Isomap was coupled to ‘eigenvector‐based filters’ for generating spatial reference models representing spatial autocorrelation. What is novel here is that the derived filters are not based on the assumption of equidirectional autocorrelation. Instead, the so‐called ‘principal coordinates of anisotropic neighbour matrices’ build filters to test the influence of geographical vicinity in directions of high similarity among observations. Results: The Isomap ordination of floristic data explained more than 95% of the data variance in six dimensions. The leading two dimensions (representing about 80% of the FLORKART data variance) revealed clear spatial gradients that could be related to independent effects of temperature, precipitation and soil observations. By contrast, the third and higher FLORKART dimensions were dominated by an antagonism of anisotropic spatial autocorrelation and soil conditions. A subsequent cluster analysis of the floristic Isomap coordinates educed the spatial organization of the floristic survey, indicating a considerable sampling bias. Conclusions: We showed that Isomap provides a consistent methodical framework for both ordination and derived spatial filters. The technique is useful for tracing the often nonlinear features of species occurrence data to environmental drivers, taking into account anisotropic spatial autocorrelation. We also showed that sampling biases are a conspicuous source of variance in a frequently used floristic data base.

15.

Background: Life processes are determined by the organism's genetic profile and multiple environmental variables. However, the interaction between these factors is inherently non-linear [1]. Microarray data is one representation of the nonlinear interactions among genes and between genes and environmental factors. Still, most microarray studies use linear methods for the interpretation of nonlinear data. In this study, we apply Isomap, a nonlinear method of dimensionality reduction, to analyze three independent large Affymetrix high-density oligonucleotide microarray data sets.

16.
Clustering multiple surfaces is a challenging task, particularly when the surfaces intersect. Available methods such as Isomap fail to capture the true shape of a surface near the intersection and produce incorrect clusterings. The Isomap algorithm uses shortest paths between points. The main drawback of the shortest-path algorithm is that it lacks a curvature constraint, which allows a path to connect points on different surfaces. In this paper we tackle this problem by imposing a curvature constraint on the shortest-path algorithm used in Isomap. The algorithm chooses several landmark nodes at random and then checks whether there is a curvature-constrained path between each landmark node and every other node in the neighborhood graph. We build a binary feature vector for each point, where each entry represents the connectivity of that point to a particular landmark. The binary feature vectors can then be used as input to a conventional clustering algorithm such as hierarchical clustering. We apply our method to simulated and real datasets and show that it performs comparably to the best methods, such as K-manifold and spectral multi-manifold clustering.

17.
The recent explosion in procurement and availability of high-dimensional gene- and protein-expression profile datasets for cancer diagnostics has necessitated the development of sophisticated machine learning tools with which to analyze them. A major limitation in the ability to accurately classify these high-dimensional datasets stems from the 'curse of dimensionality', occurring in situations where the number of genes or peptides significantly exceeds the total number of patient samples. Previous attempts at dealing with this issue have mostly centered on the use of a dimensionality reduction (DR) scheme, Principal Component Analysis (PCA), to obtain a low-dimensional projection of the high-dimensional data. However, linear PCA and other linear DR methods, which rely on Euclidean distances to estimate object similarity, do not account for the inherent underlying nonlinear structure associated with most biomedical data. The motivation behind this work is to identify the appropriate DR methods for analysis of high-dimensional gene- and protein-expression studies. Towards this end, we empirically and rigorously compare three nonlinear (Isomap, Locally Linear Embedding, Laplacian Eigenmaps) and three linear DR schemes (PCA, Linear Discriminant Analysis, Multidimensional Scaling) with the intent of determining a reduced subspace representation in which the individual object classes are more easily discriminable.

18.
MOTIVATION: Genome-wide gene expression measurements, as currently determined by the microarray technology, can be represented mathematically as points in a high-dimensional gene expression space. Genes interact with each other in regulatory networks, restricting the cellular gene expression profiles to a certain manifold, or surface, in gene expression space. To obtain knowledge about this manifold, various dimensionality reduction methods and distance metrics are used. For data points distributed on curved manifolds, a sensible distance measure would be the geodesic distance along the manifold. In this work, we examine whether an approximate geodesic distance measure captures biological similarities better than the traditionally used Euclidean distance. RESULTS: We computed approximate geodesic distances, determined by the Isomap algorithm, for one set of lymphoma and one set of lung cancer microarray samples. Compared with the ordinary Euclidean distance metric, this distance measure produced more instructive, biologically relevant, visualizations when applying multidimensional scaling. This suggests the Isomap algorithm as a promising tool for the interpretation of microarray data. Furthermore, the results demonstrate the benefit and importance of taking nonlinearities in gene expression data into account.
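The difference between Euclidean and approximate geodesic distance can be illustrated with a small sketch of the first stage of Isomap: connect each point to its nearest neighbors, then take shortest-path lengths through that graph. The function name, the choice of `k`, and the half-circle toy data are assumptions for illustration; a real analysis would typically use a library implementation and a faster shortest-path routine.

```python
import math


def geodesic_distances(points, k):
    """Approximate geodesic distances via shortest paths on a k-NN graph."""
    n = len(points)
    inf = float("inf")
    dist = [[0.0 if i == j else inf for j in range(n)] for i in range(n)]
    for i in range(n):
        # Connect each point to its k nearest neighbours (edges are symmetric).
        order = sorted(range(n), key=lambda j: math.dist(points[i], points[j]))
        for j in order[1:k + 1]:
            dist[i][j] = dist[j][i] = math.dist(points[i], points[j])
    # Floyd-Warshall: relax every pair of nodes through every intermediate node.
    for m in range(n):
        for i in range(n):
            for j in range(n):
                if dist[i][m] + dist[m][j] < dist[i][j]:
                    dist[i][j] = dist[i][m] + dist[m][j]
    return dist  # entries stay infinite if the graph is disconnected


# Eleven points on a half circle of radius 1: a curved one-dimensional manifold.
pts = [(math.cos(math.pi * t / 10), math.sin(math.pi * t / 10)) for t in range(11)]
geo = geodesic_distances(pts, k=2)
print(math.dist(pts[0], pts[10]), geo[0][10])
```

For the two end points of the half circle, the straight-line distance is the diameter, whereas the graph distance follows the arc and comes out close to half the circumference. Isomap then applies multidimensional scaling to the matrix of such graph distances.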

19.
Segmentation of the 3D human body is a very challenging problem in applications exploiting volume capture data. Direct clustering in the Euclidean space is usually complex or even unsolvable. This paper presents an original method based on the Isomap (isometric feature mapping) transform of the volume data-set. The 3D articulated posture is mapped by Isomap in the pose of Da Vinci's Vitruvian man. The limbs are unrolled from each other and separated from the trunk and pelvis, and the topology of the human body shape is recovered. In such a configuration, Hoshen-Kopelman clustering applied to concentric spherical shells is used to automatically group points into the labelled principal curves. Shepard interpolation is utilised to back-map points of the principal curves into the original volume space. The experimental results performed on many different postures have proved the validity of the proposed method. Reliability of less than 2 cm and 3° in the location of the joint centres and direction axes of rotations has been obtained, respectively, which qualifies this procedure as a potential tool for markerless motion analysis.

20.
The analysis of molecular motion starting from extensive sampling of molecular configurations remains an important and challenging task in computational biology. Existing methods require a significant amount of time to extract the most relevant motion information from such data sets. In this work, we provide a practical tool for molecular motion analysis. The proposed method builds upon the recent ScIMAP (Scalable Isomap) method, which, by using proximity relations and dimensionality reduction, has been shown to reliably extract from simulation data a few parameters that capture the main, linear and/or nonlinear, modes of motion of a molecular system. The results we present in the context of protein folding reveal that the proposed method characterizes the folding process essentially as well as ScIMAP. At the same time, by projecting the simulation data and computing proximity relations in a low-dimensional Euclidean space, it renders such analysis computationally practical. In many instances, the proposed method reduces the computational cost from several CPU months to just a few CPU hours, making it possible to analyze extensive simulation data in a matter of a few hours using only a single processor. These results establish the proposed method as a reliable and practical tool for analyzing motions of considerably large molecular systems and proteins with complex folding mechanisms.
