Similar literature
 20 similar documents found.
1.
Carlson SM, Najmi A, Cohen HJ. Proteomics, 2007, 7(7): 1037-1046
Correlated variables have been shown to confound statistical analyses in microarray experiments. The same effect applies to an even greater degree in proteomics, especially with the use of MS for parallel measurements. Biological effects such as PTM, fragmentation, and multimer formation can produce strongly correlated variables. The problem is compounded in some types of MS by technical effects such as incomplete chromatographic separation, binding to multiple surfaces, or multiple ionizations. Existing methods for dimension reduction, notably principal components analysis and related techniques, are not always satisfactory because they produce data that often lack clear biological interpretation. We propose a preprocessing algorithm that clusters highly correlated features, using the Bayes information criterion to select an optimal number of clusters. Statistical analysis of clusters, instead of individual features, benefits from lower noise, and reduces the difficulties associated with strongly correlated data. This preprocessing increases the statistical power of analyses using false discovery rate on simulated data. Strong correlations are often present in real data, and we find that clustering improves biomarker discovery in clinical SELDI-TOF-MS datasets of plasma from patients with Kawasaki disease, and bone-marrow cell extracts from patients with acute myeloid or acute lymphoblastic leukemia.
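
The preprocessing idea in this abstract (merging highly correlated features before statistical testing) can be illustrated with a minimal sketch. This is not the authors' algorithm: a fixed correlation threshold stands in for their information-criterion-based choice of the number of clusters, and the names `pearson`, `group_correlated`, `cluster_means` and the `threshold` parameter are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences.
    Assumes neither sequence is constant (non-zero variance)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def group_correlated(features, threshold=0.9):
    """Greedy grouping: a feature joins the first cluster whose seed
    (its first member) it correlates with at or above `threshold`.
    ASSUMPTION: the threshold replaces the BIC-based model selection
    described in the abstract."""
    clusters = []
    for name, values in features.items():
        for cluster in clusters:
            if pearson(features[cluster[0]], values) >= threshold:
                cluster.append(name)
                break
        else:
            # No existing cluster matched, so start a new one.
            clusters.append([name])
    return clusters

def cluster_means(features, clusters):
    """Replace each cluster by the per-sample mean of its members."""
    merged = []
    for cluster in clusters:
        rows = [features[name] for name in cluster]
        merged.append([sum(col) / len(col) for col in zip(*rows)])
    return merged
```

Downstream tests would then run on the output of `cluster_means` rather than on individual peaks, which is where the lower noise mentioned in the abstract would come from.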

2.
In this paper, we propose a hybrid clustering method that combines the strengths of bottom-up hierarchical clustering with that of top-down clustering. The first method is good at identifying small clusters but not large ones; the strengths are reversed for the second method. The hybrid method is built on the new idea of a mutual cluster: a group of points closer to each other than to any other points. Theoretical connections between mutual clusters and bottom-up clustering methods are established, aiding in their interpretation and providing an algorithm for identification of mutual clusters. We illustrate the technique on simulated and real microarray datasets.
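
One way to read the mutual-cluster definition above is that the largest distance inside the group must be smaller than the smallest distance from any member to any non-member. The check below is a sketch under that reading; the function name `is_mutual_cluster` and the index-based interface are assumptions, and the abstract's own identification algorithm is not reproduced.

```python
import math
from itertools import combinations

def is_mutual_cluster(points, subset, dist=math.dist):
    """Return True if the points indexed by `subset` are all closer to
    each other than any of them is to a point outside the subset.
    ASSUMPTION: 'closer' is read as diameter < smallest outside gap."""
    inside = sorted(subset)
    outside = [i for i in range(len(points)) if i not in subset]
    if len(inside) < 2 or not outside:
        return False  # the definition needs a group and something to compare with
    diameter = max(dist(points[i], points[j])
                   for i, j in combinations(inside, 2))
    gap = min(dist(points[i], points[j]) for i in inside for j in outside)
    return diameter < gap
```

For example, with `points = [(0, 0), (0, 1), (5, 5), (5, 6), (2.5, 3)]`, the subset `{0, 1}` passes the check, while `{0, 2}` does not because point 1 lies closer to point 0 than point 2 does.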

3.
MOTIVATION: Unsupervised analysis of microarray gene expression data attempts to find biologically significant patterns within a given collection of expression measurements. For example, hierarchical clustering can be applied to expression profiles of genes across multiple experiments, identifying groups of genes that share similar expression profiles. Previous work using the support vector machine supervised learning algorithm with microarray data suggests that higher-order features, such as pairwise and tertiary correlations across multiple experiments, may provide significant benefit in learning to recognize classes of co-expressed genes. RESULTS: We describe a generalization of the hierarchical clustering algorithm that efficiently incorporates these higher-order features by using a kernel function to map the data into a high-dimensional feature space. We then evaluate the utility of the kernel hierarchical clustering algorithm using both internal and external validation. The experiments demonstrate that the kernel representation itself is insufficient to provide improved clustering performance. We conclude that mapping gene expression data into a high-dimensional feature space is only a good idea when combined with a learning algorithm, such as the support vector machine, that does not suffer from the curse of dimensionality. AVAILABILITY: Supplementary data at www.cs.columbia.edu/compbio/hiclust. Software source code available by request.
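
A kernel can drive a distance-based method such as hierarchical clustering without ever building the feature space explicitly, because the squared feature-space distance is K(x, x) - 2K(x, y) + K(y, y). The sketch below shows that identity with a polynomial kernel; the kernel choice, its degree, and the function names are assumptions for illustration, not details taken from the abstract.

```python
import math

def poly_kernel(x, y, degree=2):
    """Polynomial kernel (1 + <x, y>)^degree. A degree of 2 implicitly
    includes pairwise products of the input coordinates."""
    return (1.0 + sum(a * b for a, b in zip(x, y))) ** degree

def kernel_distance(x, y, kernel=poly_kernel):
    """Distance between x and y in the feature space induced by `kernel`,
    computed from kernel evaluations only."""
    squared = kernel(x, x) - 2.0 * kernel(x, y) + kernel(y, y)
    # Guard against tiny negative values caused by rounding.
    return math.sqrt(max(squared, 0.0))
```

Passing `kernel_distance` wherever a clustering routine expects a pairwise distance gives a kernelised variant of that routine; with a plain dot-product kernel it reduces to the Euclidean distance.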

4.

Background  

Clustering is a popular data exploration technique widely used in microarray data analysis. Most conventional clustering algorithms, however, generate only one set of clusters independent of the biological context of the analysis. This is often inadequate to explore data from different biological perspectives and gain new insights. We propose a new clustering model that can generate multiple versions of different clusters from a single dataset, each of which highlights a different aspect of the given dataset.

5.

Background  

Although the use of clustering methods has rapidly become one of the standard computational approaches in the literature of microarray gene expression data analysis, little attention has been paid to uncertainty in the results obtained.

6.
We describe an algorithm for finding the most statistically significant non-overlapping subtrees of a hierarchical clustering of gene expression data with respect to a set of secondary data labels on genes. The method is implemented as a Java plug-in for a commercial gene expression analysis program (GeneSpring).

7.
8.
High-throughput technologies, such as proteomic screening and DNA micro-arrays, produce vast amounts of data requiring comprehensive analytical methods to decipher the biologically relevant results. One approach would be to manually search the biomedical literature; however, this would be an arduous task. We developed an automated literature-mining tool, termed MedGene, which comprehensively summarizes and estimates the relative strengths of all human gene-disease relationships in Medline. Using MedGene, we analyzed a novel micro-array expression dataset comparing breast cancer and normal breast tissue in the context of existing knowledge. We found no correlation between the strength of the literature association and the magnitude of the difference in expression level when considering changes as high as 5-fold; however, a significant correlation was observed (r = 0.41; p = 0.05) among genes showing an expression difference of 10-fold or more. Interestingly, this only held true for estrogen receptor (ER) positive tumors, not ER negative. MedGene identified a set of relatively understudied, yet highly expressed genes in ER negative tumors worthy of further examination.

9.

Background

A hierarchy, characterized by tree-like relationships, is a natural method of organizing data in various domains. When considering an unsupervised machine learning routine, such as clustering, a bottom-up hierarchical (BU, agglomerative) algorithm is used as a default and is often the only method applied.

Methodology/Principal Findings

We show that hierarchical clustering algorithms that involve global considerations, such as top-down (TD, divisive) or glocal (global-local) algorithms, are better suited to reveal meaningful patterns in the data. This is demonstrated by testing the correspondence between the results of several algorithms (TD, glocal and BU) and the correct annotations provided by experts. The correspondence was tested in multiple domains including gene expression experiments, stock trade records and functional protein families. The performance of each of the algorithms is evaluated by statistical criteria that are assigned to clusters (nodes of the hierarchy tree) based on expert-labeled data. Whereas TD algorithms perform better on global patterns, BU algorithms perform well and are advantageous when finer granularity of the data is sought. In addition, a novel TD algorithm that is based on genuine density of the data points is presented and is shown to outperform other divisive and agglomerative methods. Application of the algorithm to more than 500 protein sequences belonging to ion-channels illustrates the potential of the method for inferring overlooked functional annotations. ClustTree, a graphical Matlab toolbox for applying various hierarchical clustering algorithms and testing their quality, is made available.

Conclusions

Although currently rarely used, global approaches, in particular TD or glocal algorithms, should be considered in the exploratory process of clustering. In general, applying unsupervised clustering methods can improve the quality of manually created mappings of protein families. As demonstrated, it can also provide insights into erroneous and missed annotations.

10.
When applying hierarchical clustering algorithms to cluster patient samples from microarray data, the clustering patterns generated by most algorithms tend to be dominated by groups of highly differentially expressed genes that have closely related expression patterns. Sometimes, these genes may not be relevant to the biological process under study or their functions may already be known. The problem is that these genes can potentially drown out the effects of other genes that are relevant or have novel functions. We propose a procedure called complementary hierarchical clustering that is designed to uncover the structures arising from these novel genes that are not as highly expressed. Simulation studies show that the procedure is effective when applied to a variety of examples. We also define a concept called relative gene importance that can be used to identify the influential genes in a given clustering. Finally, we analyze a microarray data set from 295 breast cancer patients, using clustering with the correlation-based distance measure. The complementary clustering reveals a grouping of the patients which is uncorrelated with a number of known prognostic signatures and significantly differing distant metastasis-free probabilities.

11.
12.
HCPM is a tool for clustering protein structures from comparative modeling, ab initio structure prediction, etc. A hierarchical clustering algorithm is designed and tested, and a heuristic is provided for an optimal cluster selection. The method has been successfully tested during the CASP6 experiment.

13.
We present the first practical algorithm for the optimal linear leaf ordering of trees that are generated by hierarchical clustering. Hierarchical clustering has been extensively used to analyze gene expression data, and we show how optimal leaf ordering can reveal biological structure that is not observed with an existing heuristic ordering method. For a tree with n leaves, there are $2^{n-1}$ linear orderings consistent with the structure of the tree. Our optimal leaf ordering algorithm runs in time $O(n^4)$, and we present further improvements that make the running time of our algorithm practical.
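
The count of orderings follows from the fact that a binary tree with n leaves has n - 1 internal nodes, each of which can have its two subtrees swapped independently. The brute-force sketch below enumerates those orderings for a tiny tree and picks the one with the smallest sum of adjacent leaf distances. It is exponential and only meant to make the problem concrete; it is not the polynomial-time algorithm of the abstract, and the nested-tuple tree encoding and function names are assumptions.

```python
def orderings(tree):
    """Yield every leaf ordering consistent with a binary tree.
    A tree is either a leaf label or a (left, right) tuple."""
    if not isinstance(tree, tuple):
        yield [tree]
        return
    left, right = tree
    for left_order in orderings(left):
        for right_order in orderings(right):
            yield left_order + right_order   # children as given
            yield right_order + left_order   # children swapped

def ordering_cost(order, dist):
    """Sum of distances between leaves that end up adjacent."""
    return sum(dist[a][b] for a, b in zip(order, order[1:]))

def best_ordering(tree, dist):
    """Exhaustive search over all consistent orderings (small trees only)."""
    return min(orderings(tree), key=lambda order: ordering_cost(order, dist))
```

For the four-leaf tree `(("a", "b"), ("c", "d"))`, `orderings` yields 8 sequences, matching the count stated above for n = 4.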

14.
Heat treatment of milk induces the Maillard reaction between lactose and proteins; in this context, β‐lactoglobulin and α‐lactalbumin adducts have been used as markers to monitor milk quality. Since some milk proteins have been reported as essential for the delivery of microelements and, being resistant against proteolysis in the gastrointestinal tract, also contributing to the acquired immune response against pathogens and the stimulation of cellular proliferation, it is crucial to systematically determine the milk subproteome affected by the Maillard reaction for a careful evaluation of aliment functional properties. This is more important when milk is the unique nutritional source, as in infant diet. To this purpose, a combination of proteomic procedures based on analyte capture by combinatorial peptide ligand libraries, selective trapping of lactosylated peptides by m‐aminophenylboronic acid‐agarose chromatography and collision‐induced dissociation and electron transfer dissociation MS was used for systematic identification of the lactosylated proteins in milk samples subjected to different thermal treatments. An exhaustive modification of proteins was observed in milk powdered preparations for infant nutrition. Globally, this approach allowed the identification of 271 non‐redundant modification sites in 33 milk proteins, which also included low‐abundance components involved in nutrient delivery, defence response against virus/microorganisms and cellular proliferative events. A comparison of the modified peptide identification percentages resulting from electron transfer dissociation or collision‐induced dissociation fragmentation spectra confirmed the first activation mode as most advantageous for the analysis of lactosylated proteins. Nutritional, biological and toxicological consequences of these findings are discussed on the basis of the recent literature on this subject, emphasizing their impact on newborn diet.

15.
Clustering analysis has been an important research topic in the machine learning field due to its wide applications. In recent years, it has even become a valuable and useful tool for in-silico analysis of microarray or gene expression data. Although a number of clustering methods have been proposed, they have difficulty meeting the requirements of automation, high quality, and high efficiency at the same time. In this paper, we propose a novel, parameterless and efficient clustering algorithm, namely, correlation search technique (CST), which is suited to the analysis of gene expression data. The unique feature of CST is that it incorporates validation techniques into the clustering process so that high quality clustering results can be produced on the fly. Through experimental evaluation, CST is shown to greatly outperform other clustering methods in terms of clustering quality, efficiency, and automation on both synthetic and real data sets.

16.
17.

Purpose

The metal and mining industry routinely conducts life cycle assessment studies to monitor and document the potential environmental impacts of their products. These studies are typically conducted independently by the various commodity associations. To facilitate alignment of these methodologies, a working group comprised of interested industry organizations and their representatives was formed to propose uniform recommendations for key methodological choices.

Methods

Existing methodologies used by the participating associations were reviewed to identify areas of alignment as well as areas which could benefit from discussions and alignment. Recommendations for selected topics were then developed through a series of moderated discussions among the participating organizations throughout 2012 and 2013. Efforts were taken in the creation of the document to ensure alignment with the international standards ISO 14040 (2006) and ISO 14044 (2006). Four methodology issues were chosen to be addressed with respect to industry alignment: system boundary, recycling allocation, co-product allocation, and impact assessment categories.

Results and discussion

Recommendations for system boundary conclude that boundaries should include end-of-life disposal and recycling and, whenever possible, the product use phase, particularly for material and product comparison. For co-product allocation methods, the recommendations were based on the type of co-products being produced and included a range of options to guide practitioners’ decisions. It was recommended for recycling allocation that practitioners use the avoided burden methodology. Lastly, for the life cycle impact assessment stage, it was recommended that life cycle assessments (LCAs) on metal and mining products should report the following impact categories: global warming potential, acidification potential, eutrophication potential, photochemical oxidant creation potential, and ozone depletion potential. It was recommended that inclusion of other impact categories will be periodically re-evaluated by the metal industry. Further, the recommendation is that, while impact categories included are limited to the five above, all life cycle inventory (LCI) datasets themselves should contain accurate and comprehensive inventory data, given reasonable accessibility and data collection cost constraints.

Conclusions

Methodological alignment for LCA studies in the metal and mining industry will lead to improved consistency and applicability of the LCA data and results. Specifically, these recommendations improve the consistency of decisions regarding system boundary, recycling allocation, co-product allocation, and impact assessment categories. Further research is suggested to improve the specificity of certain recommendations (e.g., allocation), as well as expand the scope of the harmonization efforts to include other methodological decisions.

18.
Analysing proteomic data   Total citations: 5, self-citations: 0, citations by others: 5
The rapid growth of proteomics has been made possible by the development of reproducible 2D gels and biological mass spectrometry. However, despite technical improvements, 2D gels are still less than perfectly reproducible, and gels have to be aligned so that spots for identical proteins appear in the same place. Gels can be warped by a variety of techniques to make them concordant. When gels are manipulated to improve registration, information is lost, so direct methods for gel registration, which make use of all available data for spot matching, are preferable to indirect ones. In order to identify proteins from gel spots, a property or combination of properties unique to that protein is required. These can then be used to search databases for possible matches. Molecular mass, pI, amino acid composition and short sequence tags can all be used in database searches. Currently the method of choice for protein identification is mass spectrometry. Proteins are eluted from the gels and cleaved with specific endoproteases to produce a series of peptides of different molecular mass. In peptide mass fingerprinting, the peptide profile of the unknown protein is compared with theoretical peptide libraries generated from sequences in the different databases. Tandem mass spectrometry (MS/MS) generates short amino acid sequence tags for the individual peptides. These partial sequences combined with the original peptide masses are then used for database searching, greatly improving specificity. Increasingly, protein identification from MS/MS data is being fully or partially automated. When working with organisms that do not have sequenced genomes (the case with most helminths), protein identification by database searching becomes problematic. A number of approaches to cross-species protein identification have been suggested, but if the organism being studied is only distantly related to any organism with a sequenced genome then the likelihood of protein identification remains small. The dynamic nature of the proteome means that there really is no such thing as a single representative proteome, and a complete set of metadata (data about the data) is going to be required if the full potential of database mining is to be realised in the future.
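
The peptide mass fingerprinting step described above can be sketched as an in-silico digest followed by a mass comparison. The example assumes trypsin as the endoprotease (cleavage after K or R, but not before P), which the abstract does not specify, and uses standard monoisotopic residue masses; the function names and the tolerance `tol` are hypothetical.

```python
import re

# Monoisotopic residue masses in daltons.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # added once per peptide for the free termini

def tryptic_digest(sequence):
    """ASSUMPTION: trypsin rule, cleave after K or R unless P follows."""
    return [p for p in re.split(r"(?<=[KR])(?!P)", sequence) if p]

def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(MONO[residue] for residue in peptide) + WATER

def count_matches(observed_masses, sequence, tol=0.01):
    """Number of observed masses within `tol` Da of a theoretical peptide."""
    theoretical = [peptide_mass(p) for p in tryptic_digest(sequence)]
    return sum(any(abs(m - t) <= tol for t in theoretical)
               for m in observed_masses)
```

Scoring every database sequence with `count_matches` and ranking by the result gives a crude fingerprint search; real search engines also account for missed cleavages, modifications, and charge states, none of which are modelled here.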

19.
Scientific publications should provide sufficient detail in terms of methodology and presented data to enable the community to reproduce the methodology to generate similar data and arrive at the same conclusion, if an identical sample is provided for analysis. The advent of high-throughput methods in biological experimentation imposes some unique challenges both in data presentation in classical print format and in describing methodology and data analysis in sufficient detail to conform to good publication practice. To facilitate this process, Proteome Science is adopting a set of methodology and data presentation guidelines to enable both peer reviewers and the scientific community to better evaluate high-throughput proteomic studies.

20.
The application of data mining to proteomic data from mass spectrometry has gained much interest in recent years. Advances made in proteomics and mass spectrometry have resulted in a considerable amount of data that cannot be easily visualized or interpreted. Mass spectral proteomic datasets are typically high dimensional but with small sample size. Consequently, advanced artificial intelligence and machine learning algorithms are increasingly being used for knowledge discovery from such datasets. Their overall goal is to extract useful information that leads to the identification of protein biomarker candidates. Such biomarkers could potentially have diagnostic value as tools for early detection, diagnosis, and prognosis of many diseases. The purpose of this review is to focus on the current trends in mining mass spectral proteomic data. Special emphasis is placed on the critical steps involved in the analysis of surface-enhanced laser desorption/ionization mass spectrometry proteomic data. Examples are drawn from previously published studies, and relevant data mining terminology and techniques are explained.
