Similar articles
20 similar articles found
1.
Ji X  Li-Ling J  Sun Z 《FEBS letters》2003,542(1-3):125-131
In this work we have developed a new framework for microarray gene expression data analysis. This framework is based on hidden Markov models. We have benchmarked the performance of this probability model-based clustering algorithm on several gene expression datasets for which external evaluation criteria were available. The results showed that this approach could produce clusters of quality comparable to two prevalent clustering algorithms, but with the major advantage of determining the number of clusters. We have also applied this algorithm to analyze published data of yeast cell cycle gene expression and found it able to identify biologically meaningful gene groups. In addition, this algorithm can find correlations between different functional groups and distinguish between functional genes and regulatory genes, which helps in constructing a network describing particular biological associations. Currently, this method is limited to time series data. Supplementary materials are available at http://www.bioinfo.tsinghua.edu.cn/~rich/hmmgep_supp/.

2.
Feature extraction is one of the most important and effective methods for reducing dimensionality in data mining, particularly with the emergence of high-dimensional data such as microarray gene expression data. Feature extraction for gene selection mainly serves two purposes. One is to identify certain disease-related genes. The other is to find a compact set of discriminative genes to build a pattern classifier with reduced complexity and improved generalization capabilities. Depending on the purpose of gene selection, two types of feature extraction algorithms, ranking-based and set-based, are employed in microarray gene expression data analysis. In ranking-based feature extraction, features are generally evaluated on an individual basis, without considering inter-relationships between features, while set-based feature extraction evaluates features based on their role in a feature set by taking into account dependencies between features. Like learning methods, feature extraction faces a problem in its generalization ability, namely robustness. However, the issue of robustness is often overlooked in feature extraction. In order to improve the accuracy and robustness of feature extraction for microarray data, a novel approach based on multi-algorithm fusion is proposed. By fusing different types of feature extraction algorithms to select features from the sample set, the proposed approach is able to improve feature extraction performance. The new approach is tested on gene expression datasets including Colon cancer data, CNS data, DLBCL data, and Leukemia data. The testing results show that the performance of this algorithm is better than that of existing solutions.

3.
Habitat fragmentation has been cited as one of the critical reasons for biodiversity loss. Establishing connected nature reserve networks is an effective way to reduce habitat fragmentation. However, the resources devoted to nature reserves have always been scarce. Therefore it is important to allocate these scarce resources in an optimal way. The optimal design of a reserve network which is effective both ecologically and economically has become an important research topic in the reserve design literature. The problem of optimal selection of a subset from a larger group of potential habitat sites is solved using either heuristic or formal optimization methods. The heuristic methods, although flexible and computationally fast, cannot guarantee that the solution is optimal and therefore may lead to scarce resources being used in an ineffective way. The formal optimization methods, on the other hand, guarantee that the solution is optimal, but it has been argued that it would be difficult to model the site selection process using optimization models, especially when spatial attributes of the reserve have to be taken into account. This paper presents a linear integer programming model for the design of a minimal connected reserve network using a graph theory approach. A connected tree is determined corresponding to a connected reserve. Computational performance of the model is tested using datasets randomly generated by the software GAMS. Results show that the model can solve a connected reserve design problem which includes 100 potential sites and 30 species in a reasonable period of time. As an empirical application, the model is applied to the protection of endangered and threatened bird species in the Cache River basin area in Illinois, US. Two connected reserve networks are determined for 13 bird species.

4.
Wang Y C  Hayri Önal 《农业工程》2011,31(5):235-240
Habitat fragmentation has been cited as one of the critical reasons for biodiversity loss. Establishing connected nature reserve networks is an effective way to reduce habitat fragmentation. However, the resources devoted to nature reserves have always been scarce. Therefore it is important to allocate these scarce resources in an optimal way. The optimal design of a reserve network which is effective both ecologically and economically has become an important research topic in the reserve design literature. The problem of optimal selection of a subset from a larger group of potential habitat sites is solved using either heuristic or formal optimization methods. The heuristic methods, although flexible and computationally fast, cannot guarantee that the solution is optimal and therefore may lead to scarce resources being used in an ineffective way. The formal optimization methods, on the other hand, guarantee that the solution is optimal, but it has been argued that it would be difficult to model the site selection process using optimization models, especially when spatial attributes of the reserve have to be taken into account. This paper presents a linear integer programming model for the design of a minimal connected reserve network using a graph theory approach. A connected tree is determined corresponding to a connected reserve. Computational performance of the model is tested using datasets randomly generated by the software GAMS. Results show that the model can solve a connected reserve design problem which includes 100 potential sites and 30 species in a reasonable period of time. As an empirical application, the model is applied to the protection of endangered and threatened bird species in the Cache River basin area in Illinois, US. Two connected reserve networks are determined for 13 bird species.

5.
Wigle DA  Rossant J  Jurisica I 《Genome biology》2001,2(7):reviews1019.1-reviews1019.4
Microarrays of mouse genes are now available from several sources, and they have so far given new insights into gene expression in embryonic development, regions of the brain and during apoptosis. Microarray data posted on the internet can be reanalyzed to study a range of questions.

6.
SUMMARY: The NetAffx Gene Ontology (GO) Mining Tool is a web-based, interactive tool that permits traversal of the GO graph in the context of microarray data. It accepts a list of Affymetrix probe sets and renders a GO graph as a heat map colored according to significance measurements. The rendered graph is interactive, with nodes linked to public web sites and to lists of the relevant probe sets. The GO Mining Tool provides visualization combining biological annotation with expression data, encompassing thousands of genes in one interactive view. AVAILABILITY: GO Mining Tool is freely available at http://www.affymetrix.com/analysis/query/go_analysis.affx

7.
MOTIVATION: It is understood that clustering genes is useful for exploring scientific knowledge from DNA microarray gene expression data. The explored knowledge can ultimately be used for annotating the biological functions of novel genes. Representing the explored knowledge in an efficient manner is then closely related to the classification accuracy. However, this issue has not yet received the attention it deserves. RESULT: A novel method based on template theory in cognitive psychology and pattern recognition is developed in this study for representing knowledge extracted from cluster analysis effectively. The basic principle is to represent knowledge according to the relationship between genes and a found cluster structure. Based on this novel knowledge representation method, a pattern recognition algorithm (the decision tree algorithm C4.5) is then used to construct a classifier for annotating biological functions of novel genes. The experiments on five published datasets show that this method has improved the classification performance compared with the conventional method. The statistical tests indicate that this improvement is significant. AVAILABILITY: The software package can be obtained upon request from the author.

8.
9.
Much of the recent work to determine primary structures of nucleic acids and proteins employs the “fragmentation” or “overlap” stratagem. Typically, a preparation of a given polymer with unknown sequence is purified and then subjected to an enzyme known to cut the polymer at certain specific sites. The quantities and sequences of the resulting fragments are determined. For RNA primary sequences, pancreatic ribonuclease and T1 ribonuclease are ordinarily used as fragmenting enzymes. A technique is described for evaluating such fragment data. It has the following properties: It is easily determined whether or not the fragment data is inconsistent. It is always possible to determine the first and last nucleotides of the unknown sequence from the data of two limit digests. Consistent data from two limit digests can always be fitted into a convenient conceptual framework developed within the theory of graphs. In most cases, partial digest information can be used to modify the framework constructed from two limit digests, as such information is obtained. An efficient analysis of all fragment data in this conceptual framework can always be made. One can detect inconsistencies and can generate the entire list of polymer sequences consistent with the fragment data.

10.
Normalizing DNA microarray data   Total citations: 1, self-citations: 0, citations by others: 1

11.
MOTIVATION: One problem with discriminant analysis of DNA microarray data is that each sample is represented by quite a large number of genes, and many of them are irrelevant, insignificant or redundant to the discriminant problem at hand. Methods for selecting important genes are, therefore, of much significance in microarray data analysis. In the present study, a new criterion, called LS Bound measure, is proposed to address the gene selection problem. The LS Bound measure is derived from leave-one-out procedure of LS-SVMs (least squares support vector machines), and as the upper bound for leave-one-out classification results it reflects to some extent the generalization performance of gene subsets. RESULTS: We applied this LS Bound measure for gene selection on two benchmark microarray datasets: colon cancer and leukemia. We also compared the LS Bound measure with other evaluation criteria, including the well-known Fisher's ratio and Mahalanobis class separability measure, and other published gene selection algorithms, including Weighting factor and SVM Recursive Feature Elimination. The strength of the LS Bound measure is that it provides gene subsets leading to more accurate classification results than the filter method while its computational complexity is at the level of the filter method. AVAILABILITY: A companion website can be accessed at http://www.ntu.edu.sg/home5/pg02776030/lsbound/. The website contains: (1) the source code of the gene selection algorithm; (2) the complete set of tables and figures regarding the experimental study; (3) proof of the inequality (9). CONTACT: ekzmao@ntu.edu.sg.
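
The abstract above uses Fisher's ratio as one of its filter-style baselines. As a rough illustration of that baseline only (not of the LS Bound measure itself), the sketch below ranks genes by the two-class Fisher's ratio, taken here as the squared difference of class means divided by the sum of class variances. The function names and the toy expression values are hypothetical and not taken from the paper.

```python
from statistics import mean, pvariance


def fisher_ratio(class_a, class_b):
    """Fisher's ratio for one gene: squared mean difference over summed variances."""
    spread = pvariance(class_a) + pvariance(class_b)
    if spread == 0:
        # Degenerate case: no within-class variation at all.
        return float("inf") if mean(class_a) != mean(class_b) else 0.0
    return (mean(class_a) - mean(class_b)) ** 2 / spread


def rank_genes(expression, labels):
    """Rank genes from most to least discriminative.

    expression: {gene name: [one value per sample]}
    labels:     [0 or 1 per sample], in the same sample order
    """
    scores = {}
    for gene, values in expression.items():
        a = [v for v, y in zip(values, labels) if y == 0]
        b = [v for v, y in zip(values, labels) if y == 1]
        scores[gene] = fisher_ratio(a, b)
    return sorted(scores, key=scores.get, reverse=True)


# Toy data (made up): g1 separates the two classes, g2 does not.
toy_expression = {
    "g1": [1.0, 1.2, 0.9, 3.0, 3.1, 2.9],
    "g2": [2.0, 1.0, 3.0, 2.0, 1.0, 3.0],
}
toy_labels = [0, 0, 0, 1, 1, 1]
ranking = rank_genes(toy_expression, toy_labels)
```

Because each gene is scored on its own, this kind of filter is cheap but ignores dependencies between genes, which is the gap that subset-level criteria aim to close.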

12.
The advent of DNA microarray technology has offered the promise of casting new insights onto deciphering secrets of life by monitoring activities of thousands of genes simultaneously. Current analyses of microarray data focus on precise classification of biological types, for example, tumor versus normal tissues. A further scientifically challenging task is to extract disease-relevant genes from the bewildering amounts of raw data, which is one of the most critical themes in the post-genomic era, but it is generally ignored due to lack of an efficient approach. In this paper, we present a novel ensemble method for gene extraction that can be tailored to fulfill multiple biological tasks including (i) precise classification of biological types; (ii) disease gene mining; and (iii) target-driven gene networking. We also give a numerical application for (i) and (ii) using a public microarray data set and set aside a separate paper to address (iii).

13.
An ensemble method for gene discovery based on DNA microarray data   Total citations: 9, self-citations: 0, citations by others: 9
DNA microarrays are now able to measure the expressions of thousands of genes simultaneously. These measurements or gene profiling provides a snapshot of life that maps to a cross section of genetic activities in a four-dimension space of time and the biological entity. Although recent microarray experiments[1, 2] hold the promise of the innovative technology to cast new insights onto discovery of secrets of life, development of powerful and efficient analysis strategies for microarray dat…

14.

Background  

Data clustering analysis has been extensively applied to extract information from gene expression profiles obtained with DNA microarrays. To this aim, existing clustering approaches, mainly developed in computer science, have been adapted to microarray data analysis. However, previous studies revealed that microarray datasets have very diverse structures, some of which may not be correctly captured by current clustering methods. We therefore approached the problem from a new starting point, and developed a clustering algorithm designed to capture dataset-specific structures at the beginning of the process.

15.
In poplar, genetic research on wood properties is very important for the improvement of wood quality. Studies of wood formation genes at each developmental stage using modern biotechnology have often been limited to several genes or gene families. Because of the complex regulatory network involved in the co-expression and interactions of thousands of genes, however, the genetic mechanisms of wood formation must be surveyed on a genome-wide scale. In this study, we identified wood formation-related genes using a differentially co-expressed (DCE) gene subset approach based on biological networks inferred from microarray data. Gene co-expression networks in leaf, root, and wood tissues were first constructed and topologically analyzed using microarray data collected from the Gene Expression Omnibus. The DCE gene modules in wood-forming tissue were then detected based on graph theory, which was followed by gene ontology (GO) enrichment analysis and GO annotation of probe sets. Finally, 72 probe sets were identified in the largest cohesive subgroup of the DCE gene network in wood tissue, with most of the probe sets associated with wood formation-related biological processes and GO cellular component categories. The approach described in this paper provides an effective strategy to identify wood formation genes in poplar and should contribute to the better understanding of the genetic and molecular mechanisms underlying wood properties in trees.

16.
Cancer, being among the most serious diseases, causes many deaths every year. Many investigators have devoted themselves to designing effective treatments for this disease. Cancer always involves abnormal cell growth with the potential to invade or spread to other parts of the body. In contrast, tumor suppressor genes (TSGs) act as guardians to prevent a disordered cell cycle and genomic instability in normal cells. Studies on TSGs can assist in the design of effective treatments against cancer. In this study, we propose a computational method to discover potential TSGs. Based on the known TSGs, a number of candidate genes were selected by applying the shortest path approach in a weighted graph that was constructed using a protein–protein interaction network. The analysis of selected genes shows that some of them are new TSGs recently reported in the literature, while others may be novel TSGs.
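
The shortest path approach mentioned in this abstract can be sketched roughly as follows: run Dijkstra's algorithm between every pair of known TSGs in a weighted interaction graph and collect the intermediate nodes as candidates. This is a minimal sketch under stated assumptions, not the authors' pipeline: the abstract does not say how edge weights are derived or how candidates are filtered, and the graph, gene names, and weights below are invented for illustration.

```python
import heapq
from itertools import combinations


def shortest_path(graph, source, target):
    """Dijkstra's algorithm on {node: {neighbour: non-negative weight}}."""
    dist = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    done = set()
    while heap:
        d, node = heapq.heappop(heap)
        if node in done:
            continue
        done.add(node)
        if node == target:
            break
        for nbr, weight in graph.get(node, {}).items():
            nd = d + weight
            if nd < dist.get(nbr, float("inf")):
                dist[nbr] = nd
                prev[nbr] = node
                heapq.heappush(heap, (nd, nbr))
    if target not in dist:
        return []  # target is unreachable from source
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return path[::-1]


def candidate_genes(graph, known):
    """Genes lying on a shortest path between two known genes."""
    known = set(known)
    found = set()
    for a, b in combinations(sorted(known), 2):
        found.update(n for n in shortest_path(graph, a, b) if n not in known)
    return found


def add_edge(graph, u, v, weight):
    """Insert an undirected weighted edge."""
    graph.setdefault(u, {})[v] = weight
    graph.setdefault(v, {})[u] = weight


# Hypothetical interaction graph; a lower weight means a stronger interaction.
ppi = {}
add_edge(ppi, "TSG_A", "X", 1.0)
add_edge(ppi, "X", "TSG_B", 1.0)
add_edge(ppi, "TSG_A", "Y", 5.0)
add_edge(ppi, "Y", "TSG_B", 5.0)
candidates = candidate_genes(ppi, ["TSG_A", "TSG_B"])
```

In this toy graph only `X` is selected, because the route through `Y` is longer; a real analysis would need some further filtering step, since hub proteins tend to appear on many shortest paths.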

17.
MOTIVATION: Many standard statistical techniques are effective on data that are normally distributed with constant variance. Microarray data typically violate these assumptions since they come from non-Gaussian distributions with a non-trivial mean-variance relationship. Several methods have been proposed that transform microarray data to stabilize variance and draw its distribution towards the Gaussian. Some methods, such as log or generalized log, rely on an underlying model for the data. Others, such as the spread-versus-level plot, do not. We propose an alternative data-driven multiscale approach, called the Data-Driven Haar-Fisz for microarrays (DDHFm) with replicates. DDHFm has the advantage of being 'distribution-free' in the sense that no parametric model for the underlying microarray data is required to be specified or estimated; hence, DDHFm can be applied very generally, not just to microarray data. RESULTS: DDHFm achieves very good variance stabilization of microarray data with replicates and produces transformed intensities that are approximately normally distributed. Simulation studies show that it performs better than other existing methods. Application of DDHFm to real one-color cDNA data validates these results. AVAILABILITY: The R package of the Data-Driven Haar-Fisz transform (DDHFm) for microarrays is available in Bioconductor and CRAN.
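
For orientation, the model-based transforms that this abstract contrasts with DDHFm can be written in a few lines. The sketch below shows one common form of the generalized log, glog(x) = ln(x + sqrt(x² + c²)), which behaves like ln(2x) for large intensities but stays finite at zero and for negative background-corrected values. It is not DDHFm, and the constant `c` is an assumption: in practice it would be estimated from the data rather than fixed by hand.

```python
import math


def glog(x, c=1.0):
    """Generalized log: ln(x + sqrt(x^2 + c^2)); c > 0 is a tuning constant."""
    return math.log(x + math.sqrt(x * x + c * c))


# Made-up intensities, including a zero and a negative background-corrected value.
intensities = [-3.0, 0.0, 10.0, 1000.0, 1e6]
transformed = [glog(x, c=2.0) for x in intensities]
```

The plain log transform is undefined at the first two values, which is one practical reason the generalized form is used on background-corrected data.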

18.

Background

In this paper we present a method for the statistical assessment of cancer predictors which make use of gene expression profiles. The methodology is applied to a new data set of microarray gene expression data collected in Casa Sollievo della Sofferenza Hospital, Foggia – Italy. The data set is made up of normal (22) and tumor (25) specimens extracted from 25 patients affected by colon cancer. We propose to give answers to some questions which are relevant for the automatic diagnosis of cancer such as: Is the size of the available data set sufficient to build accurate classifiers? What is the statistical significance of the associated error rates? In what ways can accuracy be considered dependent on the adopted classification scheme? How many genes are correlated with the pathology and how many are sufficient for an accurate colon cancer classification? The method we propose answers these questions whilst avoiding the potential pitfalls hidden in the analysis and interpretation of microarray data.

Results

We estimate the generalization error, evaluated through the Leave-K-Out Cross Validation error, for three different classification schemes by varying the number of training examples and the number of genes used. The statistical significance of the error rate is measured by using a permutation test. We provide a statistical analysis in terms of the frequencies of the genes involved in the classification. Using the whole set of genes, we found that the Weighted Voting Algorithm (WVA) classifier learns the distinction between normal and tumor specimens with 25 training examples, providing e = 21% (p = 0.045) as an error rate. This remains constant even when the number of examples increases. Moreover, Regularized Least Squares (RLS) and Support Vector Machines (SVM) classifiers can learn with only 15 training examples, with an error rate of e = 19% (p = 0.035) and e = 18% (p = 0.037) respectively. Moreover, the error rate decreases as the training set size increases, reaching its best performance with 35 training examples. In this case, RLS and SVM have error rates of e = 14% (p = 0.027) and e = 11% (p = 0.019) respectively. Concerning the number of genes, we found about 6000 genes (p < 0.05) correlated with the pathology, resulting from the signal-to-noise statistic. Moreover, the performance of the RLS and SVM classifiers does not change when 74% of the genes are used. Performance then degrades progressively, with the error rate reaching e = 16% (p < 0.05) when only 2 genes are employed. The biological relevance of a set of genes determined by our statistical analysis and the major roles they play in colorectal tumorigenesis are discussed.

Conclusions

The method proposed provides statistically significant answers to precise questions relevant for the diagnosis and prognosis of cancer. We found that, with as few as 15 examples, it is possible to train statistically significant classifiers for colon cancer diagnosis. As for the definition of the number of genes sufficient for a reliable classification of colon cancer, our results suggest that it depends on the accuracy required.
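
The idea of attaching a p-value to a cross-validated error rate, as described in this entry, can be illustrated with a small permutation test. The sketch below is a simplification under several assumptions: it uses leave-one-out rather than leave-K-out cross validation, a one-dimensional nearest-centroid rule instead of the WVA, RLS, or SVM classifiers, and invented toy values. The p-value is the fraction of label shufflings whose error is at least as low as the observed one.

```python
import random


def loo_error(values, labels):
    """Leave-one-out error of a 1-D nearest-centroid classifier."""
    errors = 0
    for i, (x, y) in enumerate(zip(values, labels)):
        rest = [(v, l) for j, (v, l) in enumerate(zip(values, labels)) if j != i]
        groups = {0: [v for v, l in rest if l == 0],
                  1: [v for v, l in rest if l == 1]}
        # Predict the class whose training centroid is nearest to the held-out value.
        centroids = {c: sum(g) / len(g) for c, g in groups.items() if g}
        predicted = min(centroids, key=lambda c: abs(x - centroids[c]))
        errors += predicted != y
    return errors / len(values)


def permutation_p_value(values, labels, n_permutations=200, seed=0):
    """Return (observed error, p-value) under random relabelling of the samples."""
    rng = random.Random(seed)
    observed = loo_error(values, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if loo_error(values, shuffled) <= observed:
            hits += 1
    # The +1 terms count the observed labelling itself as one permutation.
    return observed, (1 + hits) / (1 + n_permutations)


# Toy data (made up): one well-separated "gene", five samples per class.
values = [1.0, 1.1, 0.9, 1.2, 0.8, 3.0, 3.1, 2.9, 3.2, 2.8]
labels = [0] * 5 + [1] * 5
observed_error, p_value = permutation_p_value(values, labels)
```

A small p-value here means that an error rate this low is rarely reached when the labels carry no information, which is the sense in which the abstract calls a classifier statistically significant.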

19.
Mining microarray expression data by literature profiling   Total citations: 1, self-citations: 1, citations by others: 0
Chaussabel D  Sher A 《Genome biology》2002,3(10):research0055.1-research0055.16

Background  

The rapidly expanding fields of genomics and proteomics have prompted the development of computational methods for managing, analyzing and visualizing expression data derived from microarray screening. Nevertheless, the lack of efficient techniques for assessing the biological implications of gene-expression data remains an important obstacle in exploiting this information.

20.
The graphoglyptid ichnogenus Paleodictyon has been alternatively interpreted as a foraging or farming trace; as a subsurface burrow for the habitation of one or more unknown organisms; as the remains of a xenophyophore; and as the result of modular growth of an unknown organism. Graph theory and analysis of the geometry of the regular ichnospecies suggest that if the elements of Paleodictyon are interpreted as tunnels, then they are of extraordinary length relative to the size of any likely solitary tracemaker. In addition, because each vertex of the mesh is of degree three, any possible path through the mesh requires revisiting in order to travel through the entire network; this makes the minimum path length even longer. These results suggest that it is unlikely that Paleodictyon is the result of subsurface burrowing.
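
The degree argument in this abstract follows from a standard result: a connected graph can be traversed along every edge exactly once only if it has zero or two vertices of odd degree. The sketch below checks that condition and gives a lower bound on how many edges must be walked twice. The example graphs are simple stand-ins (a complete graph on four vertices, where every vertex has degree three, and a triangle), not a reconstruction of an actual Paleodictyon mesh.

```python
from collections import Counter


def odd_degree_vertices(edges):
    """Vertices touched by an odd number of edges."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return sorted(v for v, d in degree.items() if d % 2 == 1)


def has_eulerian_trail(edges):
    """True if every edge can be walked exactly once (graph assumed connected)."""
    return len(odd_degree_vertices(edges)) in (0, 2)


def min_repeated_edges_lower_bound(edges):
    """Lower bound on edges that must be walked twice to cover the whole graph.

    Each repeated edge changes the parity of at most two vertices, and an
    open walk tolerates two odd-degree vertices (its endpoints).
    """
    odd = len(odd_degree_vertices(edges))
    return max(0, (odd - 2) // 2)


# K4: four vertices, each of degree three.
k4 = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
# Triangle: every vertex has degree two.
triangle = [(0, 1), (1, 2), (2, 0)]

k4_traversable = has_eulerian_trail(k4)
k4_repeats = min_repeated_edges_lower_bound(k4)
```

For a mesh in which all n vertices have degree three, the bound grows as (n - 2) / 2, so the amount of forced backtracking scales with the size of the network.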
