Similar articles
 Found 20 similar articles; search took 31 ms
1.
Pvclust: an R package for assessing the uncertainty in hierarchical clustering   Cited by: 11 (self-citations: 0, citations by others: 11)
SUMMARY: Pvclust is an add-on package for the statistical software R that assesses the uncertainty in hierarchical cluster analysis. Pvclust can be used easily for general statistical problems, such as DNA microarray analysis, to perform the bootstrap analysis of clustering, which has been popular in phylogenetic analysis. Pvclust calculates probability values (p-values) for each cluster using bootstrap resampling techniques. Two types of p-values are available: the approximately unbiased (AU) p-value and the bootstrap probability (BP) value. Multiscale bootstrap resampling is used to calculate the AU p-value, which is less biased than the BP value calculated by ordinary bootstrap resampling. In addition, the computation time can be greatly reduced with the parallel computing option.

2.
3.
Pay-for-performance programmes often aim to improve the management of chronic diseases. We evaluate the impact of a local pay-for-performance programme (QOF+), which financially rewarded more ambitious quality targets (‘stretch targets’) than those used nationally in the Quality and Outcomes Framework (QOF). We focus on targets for intermediate outcomes in patients with cardiovascular disease and diabetes. A difference-in-difference approach is used to compare practice-level achievements before and after the introduction of the local pay-for-performance programme. In addition, we analysed patient-level data on exception reporting and intermediate outcomes using an interrupted time series analysis. The local pay-for-performance programme led to significantly higher target achievements (hypertension: p-value <0.001, coronary heart disease: p-value <0.001, diabetes: p-value <0.061, stroke: p-value <0.003). However, the increase was driven by higher rates of exception reporting (hypertension: p-value <0.001, coronary heart disease: p-value <0.03, diabetes: p-value <0.05) in patients with all conditions except for stroke. Exception reporting allows practitioners to exclude patients from target calculations if certain criteria are met, e.g. informed dissent of the patient for treatment. There were no statistically significant improvements in mean blood pressure, cholesterol or HbA1c levels. Thus, achievement of higher payment thresholds in the local pay-for-performance scheme was mainly attributed to increased exception reporting by practices, with no discernible improvements in overall clinical quality. Hence, active monitoring of exception reporting should be considered when setting more ambitious quality targets. More generally, the study suggests a trade-off between additional incentives for better care and monitoring costs.
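
The difference-in-difference comparison mentioned in this abstract can be illustrated with a minimal sketch. It assumes one achievement score per practice in each period and a comparison group that did not receive the local scheme; the function name, argument names and toy numbers are hypothetical and are not taken from the study, which may well have used a regression formulation instead.

```python
from statistics import mean
from typing import Sequence


def difference_in_differences(
    treated_pre: Sequence[float],
    treated_post: Sequence[float],
    control_pre: Sequence[float],
    control_post: Sequence[float],
) -> float:
    """Change in the treated group minus change in the comparison group."""
    treated_change = mean(treated_post) - mean(treated_pre)
    control_change = mean(control_post) - mean(control_pre)
    return treated_change - control_change


# Toy achievement percentages (invented for illustration only).
effect = difference_in_differences(
    treated_pre=[70, 72], treated_post=[80, 82],
    control_pre=[70, 70], control_post=[73, 75],
)
```

The estimate attributes to the programme only the part of the improvement that exceeds the background trend seen in the comparison group.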

4.
Filtering is a common practice used to simplify the analysis of microarray data by removing from subsequent consideration probe sets believed to be unexpressed. The m/n filter, which is widely used in the analysis of Affymetrix data, removes all probe sets having fewer than m present calls among a set of n chips. The m/n filter has been widely used without considering its statistical properties. The level and power of the m/n filter are derived. Two alternative filters, the pooled p-value filter and the error-minimizing pooled p-value filter, are proposed. The pooled p-value filter combines information from the present-absent p-values into a single summary p-value, which is subsequently compared to a selected significance threshold. We show that the pooled p-value filter is the uniformly most powerful statistical test under a reasonable beta model and that it exhibits greater power than the m/n filter in all scenarios considered in a simulation study. The error-minimizing pooled p-value filter compares the summary p-value with a threshold determined to minimize a total-error criterion based on a partition of the distribution of all probes' summary p-values. The pooled p-value and error-minimizing pooled p-value filters clearly perform better than the m/n filter in a case-study analysis. The case-study analysis also demonstrates a proposed method for estimating the number of differentially expressed probe sets excluded by filtering and the subsequent impact on the final analysis. The filter impact analysis shows that the use of even the best filter may hinder, rather than enhance, the ability to discover interesting probe sets or genes. S-plus and R routines to implement the pooled p-value and error-minimizing pooled p-value filters have been developed and are available from www.stjuderesearch.org/depts/biostats/index.html.
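
The m/n filter described in this abstract is simple enough to sketch directly. The example below assumes the present/absent calls are already available as one row of booleans per probe set (one value per chip); the function name and data layout are hypothetical. The pooled p-value filters are not sketched because the abstract does not say how the p-values are pooled.

```python
from typing import List, Sequence


def m_of_n_filter(present_calls: Sequence[Sequence[bool]], m: int) -> List[int]:
    """Return indices of probe sets kept by the m/n filter.

    A probe set is kept when it has at least m present calls among its
    n chips, i.e. it is removed when it has fewer than m present calls.
    """
    return [i for i, calls in enumerate(present_calls) if sum(calls) >= m]


# Three probe sets measured on n = 3 chips (toy data).
calls = [
    [True, True, False],    # 2 present calls
    [False, False, True],   # 1 present call
    [True, True, True],     # 3 present calls
]
kept = m_of_n_filter(calls, m=2)
```

With m = 2 the second probe set is dropped because it has only one present call.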

5.
Permutation tests, in which p-values are attached to a test statistic by randomly permuting the sample or gene labels, are among the most commonly used statistical tools in modern genomic research. Yet permutation p-values published in the genomic literature are often computed incorrectly, understated by about 1/m, where m is the number of permutations. The same is often true in the more general situation when Monte Carlo simulation is used to assign p-values. Although the p-value understatement is usually small in absolute terms, the implications can be serious in a multiple testing context. The understatement arises from the intuitive but mistaken idea of using permutation to estimate the tail probability of the test statistic. We argue instead that permutation should be viewed as generating an exact discrete null distribution. The relevant literature, some of which is likely to have been relatively inaccessible to the genomic community, is reviewed and summarized. A computation strategy is developed for exact p-values when permutations are randomly drawn. The strategy is valid for any number of permutations and samples. Some simple recommendations are made for the implementation of permutation tests in practice.
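
A minimal sketch of the understatement described in this abstract, assuming a two-sample difference-in-means statistic (the statistic is an illustration, not taken from the abstract). With b permuted statistics at least as extreme as the observed one out of m random permutations, the naive estimate b/m can be zero, whereas the widely used alternative (b + 1)/(m + 1) counts the observed labelling as one draw from the null distribution and never falls below 1/(m + 1). The exact computation strategy developed in the paper is not reproduced here; the function name and parameters are hypothetical.

```python
import random
from statistics import mean
from typing import Sequence, Tuple


def permutation_p_values(
    x: Sequence[float], y: Sequence[float], n_perm: int = 999, seed: int = 0
) -> Tuple[float, float]:
    """Return (naive, corrected) two-sided permutation p-values."""
    rng = random.Random(seed)
    observed = abs(mean(x) - mean(y))
    pooled = list(x) + list(y)
    exceed = 0  # b: permuted statistics at least as extreme as observed
    for _ in range(n_perm):
        rng.shuffle(pooled)  # random relabelling of the samples
        stat = abs(mean(pooled[:len(x)]) - mean(pooled[len(x):]))
        if stat >= observed:
            exceed += 1
    naive = exceed / n_perm                   # b / m, may equal 0
    corrected = (exceed + 1) / (n_perm + 1)   # (b + 1) / (m + 1)
    return naive, corrected


naive, corrected = permutation_p_values([10, 11, 12, 13, 14], [0, 1, 2, 3, 4])
```

The gap between the two values is at most about 1/m, which matters once p-values are adjusted for thousands of simultaneous tests.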

6.
A grid layout algorithm for automatic drawing of biochemical networks   Cited by: 4 (self-citations: 0, citations by others: 4)
MOTIVATION: Visualization is indispensable in the research of complex biochemical networks. Available graph layout algorithms are not adequate for satisfactorily drawing such networks. New methods are required to automatically visualize the topological architectures and facilitate the understanding of the functions of the networks. RESULTS: We propose a novel layout algorithm to draw complex biochemical networks. A network is modeled as a system of interacting nodes on square grids. A discrete cost function between each node pair is designed based on the topological relation and the geometric positions of the two nodes. The layouts are produced by minimizing the total cost. We design a fast algorithm to minimize the discrete cost function, by which candidate layouts can be produced efficiently. A simulated annealing procedure is used to choose better candidates. Our algorithm demonstrates its ability to exhibit cluster structures clearly in relatively compact layout areas without any prior knowledge. We developed Windows software to implement the algorithm for CADLIVE. AVAILABILITY: All materials can be freely downloaded from http://kurata21.bio.kyutech.ac.jp/grid/grid_layout.htm; http://www.cadlive.jp/ SUPPLEMENTARY INFORMATION: http://kurata21.bio.kyutech.ac.jp/grid/grid_layout.htm; http://www.cadlive.jp/

7.
8.
Gosselin F. PLoS ONE, 2011, 6(3): e14770

Background

Recent approaches mixing frequentist principles with Bayesian inference propose internal goodness-of-fit (GOF) p-values that might be valuable for critical analysis of Bayesian statistical models. However, GOF p-values developed to date only have known probability distributions under restrictive conditions. As a result, no known GOF p-value has a known probability distribution for any discrepancy function.

Methodology/Principal Findings

We show mathematically that a new GOF p-value, called the sampled posterior p-value (SPP), asymptotically has a uniform probability distribution whatever the discrepancy function. In a moderate finite sample context, simulations also showed that the SPP appears stable to relatively uninformative misspecifications of the prior distribution.

Conclusions/Significance

These reasons, together with its numerical simplicity, make the SPP a better canonical GOF p-value than existing GOF p-values.

9.
MOTIVATION: We present an extensive evaluation of different methods and criteria to detect remote homologs of a given protein sequence. We investigate two associated problems: first, to develop a sensitive searching method to identify possible candidates and, second, to assign a confidence to the putative candidates in order to select the best one. For searching methods where the score distributions are known, p-values are used as confidence measure with great success. For the cases where such theoretical backing is absent, we propose empirical approximations to p-values for searching procedures. RESULTS: As a baseline, we review the performances of different methods for detecting remote protein folds (sequence alignment and threading, with and without sequence profiles, global and local). The analysis is performed on a large representative set of protein structures. For fold recognition, we find that methods using sequence profiles generally perform better than methods using plain sequences, and that threading methods perform better than sequence alignment methods. In order to assess the quality of the predictions made, we establish and compare several confidence measures, including raw scores, z-scores, raw score gaps, z-score gaps, and different methods of p-value estimation. We work our way from the theoretically well backed local scores towards more explorative global and threading scores. The methods for assessing the statistical significance of predictions are compared using specificity--sensitivity plots. For local alignment techniques we find that p-value methods work best, albeit computationally cheaper methods such as those based on score gaps achieve similar performance. For global methods where no theory is available methods based on score gaps work best. By using the score gap functions as the measure of confidence we improve the more powerful fold recognition methods for which p-values are unavailable. 
AVAILABILITY: The benchmark set is available upon request.

10.
MOTIVATION: Chemical carcinogenicity is an important subject in health and environmental sciences, and a reliable method for identifying the characteristic factors of carcinogenicity is needed. The predictive toxicology challenge (PTC) 2000-2001 has provided the opportunity for various data mining methods to evaluate their performance. The cascade model, a data mining method developed by the author, has the capability to mine for local correlations in data sets with a large number of attributes. The current paper explores the effectiveness of the method on the problem of chemical carcinogenicity. RESULTS: Rodent carcinogenicity of 417 compounds examined by the National Toxicology Program (NTP) was used as the training set. The analysis by the cascade model could, for example, obtain the rule 'Highly flexible molecules are carcinogenic, if they have no hydrogen bond acceptors in halogenated alkanes and alkenes'. The resulting rules are applied to predict the activity of 185 compounds examined by the FDA. The ROC analysis performed by the PTC organizers has shown that the current method has excellent predictive power for the female rat data. AVAILABILITY: The binary program of DISCAS 2.1 and samples of input data sets on Windows PC are available at http://www.clab.kwansei.ac.jp/mining/discas/discas.html upon request from the author. SUPPLEMENTARY INFORMATION: A summary of prediction results and cross validations is accessible via http://www.clab.kwansei.ac.jp/~okada/BIJ/BIJsupple.htm. The rules used and the prediction results for each molecule are also provided.

11.
In this report, we describe the results of an extensive investigation of the effects of the conformations of proteins on the solvency of the bulk-phase water in which the proteins are dissolved. The concentrations of the proteins used were usually between 20 and 40%; the temperature was 25 ± 1 °C. To probe the solvency of the water, the apparent equilibrium distribution coefficients (or p-values) of 4 solutes were studied: Na+ (sulfate), glycine, sucrose, and urea. From 8 to 14 isolated proteins in three types of conformations were investigated: native; denatured by agents that unravel the secondary structure (e.g., alpha-helix, beta-pleated sheet) of the protein (i.e., 9 M urea, 3 M guanidine HCl); denatured by agents that only disrupt the tertiary structure but leave the secondary structure intact or even strengthened (i.e., 0.1 M sodium dodecylsulfate or SDS, 2 M n-propanol). The results are as follows: (1) as a rule, native proteins have no or weak effect on the solvency of the water for all 4 probes; (2) exposure to 0.1 M SDS and to 2 M n-propanol, as a rule, does not significantly decrease the p-value of all 4 probes; (3) exposure to 9 M urea and to 3 M guanidine HCl consistently lowers the p-values of sucrose, glycine and Na+ (sulfate) and equally consistently produces no effect on the p-value of urea. Sucrose, glycine, and Na+ are found in low concentrations in cell water while urea is not. These experiments were designed and carried out primarily to test two subsidiary theories of the AI hypotheses: the polarized multilayer (PM) theory of cell water; and the theory of size-dependent solute exclusion. (ABSTRACT TRUNCATED AT 400 WORDS)

12.
The hydrophobic cores of proteins predicted by wavelet analysis   Cited by: 7 (self-citations: 0, citations by others: 7)
MOTIVATION: In the process of protein construction, buried hydrophobic residues tend to assemble in a core of a protein. Methods used to predict these cores either use or do not use sequential alignment. In the case of a close homology, prediction was more accurate if sequential alignment was used. If the homology was weak, predictions would be unreliable. A hydrophobicity plot involving the hydropathy index is useful for purposes of prediction, and smoothing is essential. However, the proposed methods are insufficient. We attempted to predict hydrophobic cores with a low frequency extracted from the hydrophobicity plot, using wavelet analysis. RESULTS: The cores were predicted at a rate of 68.7%, by cross-validation. Using wavelet analysis, the cores of non-homologous proteins can be predicted with close to 70% accuracy, without sequential alignment. AVAILABILITY: The program used in this study is available from Intergalactic Reality (http://www.intergalact.com). CONTACT: hirakawa@grt.kyushu-u.ac.jp, kuhara@grt.kyushu-u.ac.jp

13.
MOTIVATION: An unmanageably large amount of data on genome sequences is accumulating, prompting researchers to develop new methods to analyze them. We have devised a novel method designated oligostickiness, a measure roughly proportional to the binding affinity of an oligonucleotide to a DNA of interest, in order to analyze genome sequences as a whole. RESULTS: Fifteen representative genomes such as Bacillus subtilis, Escherichia coli, Saccharomyces cerevisiae, Caenorhabditis elegans, H. sapiens and others were analyzed by this method using more than 50 probe dodecanucleotides, offering the following findings: (i) Genome sequences can be specifically featured by way of oligostickiness maps. (ii) Oligostickiness analysis, which is similar to but more informative than (G + C) content or repetitive sequence analysis, can reveal intra-genomic structures such as mosaic structures (E. coli and B. subtilis) and highly sticky/non-sticky regions of biological meaning. (iii) Some probe oligonucleotides such as dC(12) and dT(12) can be used for classifying genomes, clearly discriminating prokaryotes and eukaryotes. (iv) Based on global oligostickiness, which is the average value of the local oligostickinesses, the features of a genome could be visualized in spider web mode. The pattern of a spider web, as well as a set of oligostickiness maps, is highly characteristic of each genome or chromosome. We therefore call it chromosome texture, leading to the finding that all the chromosomes contained in a cell, so far investigated, have a common texture. AVAILABILITY: Oligostickiness maps used in this work are available at http://gp.fms.saitama-u.ac.jp/ CONTACT: koichi@fms.saitama-u.ac.jp

14.
Gene recognition by combination of several gene-finding programs   Cited by: 8 (self-citations: 1, citations by others: 7)
MOTIVATION: A number of programs have been developed to predict the eukaryotic gene structures in DNA sequences. However, gene finding is still a challenging problem. RESULTS: We have explored how effective it is to re-analyze and combine the results of several gene-finding programs. We studied several methods with four programs (FEXH, GeneParser3, GENSCAN and GRAIL2). With the HIGHEST-policy combination method or the BOUNDARY method, approximate correlation (AC) improved by 3-5% in comparison with the best single gene-finding program. From another viewpoint, OR-based combination of the four programs is the most reliable way to know whether a candidate exon overlaps with the real exon or not, although it is less sensitive than GENSCAN for exon-intron boundaries. Our methods can easily be extended to combine other programs. AVAILABILITY: We have developed a server program (Shirokane System) and a client program (GeneScope) to use the methods. GeneScope is available through a WWW site (http://gf.genome.ad.jp/). CONTACT: katsu,takagi@ims.u-tokyo.ac.jp
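
One plausible reading of the OR-based combination in this abstract is a plain union of the exon intervals predicted by the individual programs. The sketch below follows that reading only; the abstract does not define the method in detail, so the interval representation (inclusive start and end coordinates), the function name and the toy predictions are all assumptions, and the HIGHEST-policy and BOUNDARY methods are not covered.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # (start, end), inclusive coordinates


def or_combine(predictions: Dict[str, List[Interval]]) -> List[Interval]:
    """Union of exon intervals predicted by any program, with overlaps merged."""
    intervals = sorted(iv for exons in predictions.values() for iv in exons)
    merged: List[Interval] = []
    for start, end in intervals:
        if merged and start <= merged[-1][1]:
            # Overlaps the previous region: extend it instead of adding a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# Toy predictions from three hypothetical program outputs.
combined = or_combine({
    "program_a": [(100, 200), (500, 600)],
    "program_b": [(150, 250)],
    "program_c": [(700, 800)],
})
```

A union of this kind favours detecting that some exon is present in a region, at the cost of blurred boundaries, which is consistent with the trade-off the abstract reports.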

15.
MOTIVATION: The BLAST program for comparing two sequences assumes independent sequences in its random model. The resulting random alignment matrices have correlations across their diagonals. Analytic formulas for the BLAST p-value essentially neglect these correlations and are equivalent to a random model with independent diagonals. Progress on the independent diagonals model has been surprisingly rapid, but the practical magnitude of the correlations it neglects remains unknown. In addition, BLAST uses a finite-size correction that is particularly important when either of the sequences being compared is short. Several formulas for the finite-size correction have now been given, but the corresponding errors in the BLAST p-values have not been quantified. As the lengths of compared sequences tend to infinity, it is also theoretically unknown whether the neglected correlations vanish faster than the finite-size correction. RESULTS: Because we required certain analytic formulas, our study restricted its computer experiments to ungapped sequence alignment. We expect some of our conclusions to extend qualitatively to gapped sequence alignment, however. With this caveat, the finite-size correction appeared to vanish faster than the neglected correlations. Although the finite-size correction underestimated the BLAST p-value, it improved the approximation substantially for all but very short sequences. In practice, the Altschul-Gish finite-size correction was superior to Spouge's. The independent diagonals model was always within a factor of 2 of the true BLAST p-value, although fitting p-value parameters from it probably is unwise. CONTACT: spouge@ncbi.nlm.nih.gov
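
For orientation, the standard Karlin-Altschul form of the BLAST p-value for an ungapped local alignment can be sketched as follows: the expected number of alignments scoring at least S is E = K·m·n·exp(−λS), and the p-value is 1 − exp(−E). The sketch uses the raw sequence lengths m and n and therefore omits any finite-size correction, which is precisely the part the abstract examines. The function name is hypothetical, and the values of λ and K in the usage lines are illustrative placeholders rather than parameters taken from the paper.

```python
import math


def blast_p_value(score: float, m: int, n: int, lam: float, K: float) -> float:
    """Karlin-Altschul p-value for an ungapped local alignment score.

    No finite-size (edge-effect) correction is applied: m and n are the
    raw lengths of the two sequences being compared.
    """
    expected = K * m * n * math.exp(-lam * score)  # E-value
    return -math.expm1(-expected)                  # 1 - exp(-E), stable for small E


# Illustrative parameter values (assumptions, not taken from the abstract).
LAM, K_PARAM = 0.318, 0.134
p_small = blast_p_value(score=100, m=300, n=300, lam=LAM, K=K_PARAM)
p_large = blast_p_value(score=20, m=300, n=300, lam=LAM, K=K_PARAM)
```

For high scores E is tiny and the p-value is essentially equal to E; for low scores E is large and the p-value approaches 1.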

16.
DNA Data Bank of Japan (DDBJ) for genome scale research in life science   Cited by: 5 (self-citations: 0, citations by others: 5)
The DNA Data Bank of Japan (DDBJ, http://www.ddbj.nig.ac.jp) has made an effort to collect as much data as possible, mainly from Japanese researchers. Over the past year, the data we collected, annotated and released to the public grew by 43% in the number of entries and by 52% in the number of bases. Growth has accelerated even after the human genome was sequenced, because sequencing technology has advanced remarkably and become simpler, and research in life science has shifted from the gene scale to the genome scale. In addition, we have developed the Genome Information Broker (GIB, http://gib.genes.nig.ac.jp), which now includes more than 50 complete microbial genomes and Arabidopsis genome data. We have also developed a database of the human genome, the Human Genomics Studio (HGS, http://studio.nig.ac.jp). HGS provides a set of sequences that is as continuous as possible for any one of the 24 chromosomes. Both GIB and HGS have been updated to incorporate newly available data and retrieval tools.

17.
Horizontal gene transfer (HGT) is a common event in prokaryotic evolution. Therefore, it is very important to consider HGT in the study of molecular evolution of prokaryotes. This is true also for conducting computer simulations of their molecular phylogeny because HGT is known to be a serious disturbing factor for estimating their correct phylogeny. To the best of our knowledge, no existing computer program has generated a phylogenetic tree with HGT from an original phylogenetic tree. We developed a program called HGT-Gen that generates a phylogenetic tree with HGT on the basis of an original phylogenetic tree of a protein or gene. HGT-Gen converts an operational taxonomic unit or a clade from one place to another in a given phylogenetic tree. We have also devised an algorithm to compute the average length between any pair of branches in the tree. It defines and computes the relative evolutionary time to normalize evolutionary time for each lineage. The algorithm can generate an HGT between a pair of donor and acceptor lineages at the same evolutionary time. HGT-Gen is used with a sequence-generating program to evaluate the influence of HGT on the molecular phylogeny of prokaryotes in a computer simulation study.

Availability

The database is available for free at http://www.grl.shizuoka.ac.jp/~thoriike/HGT-Gen.html

18.
MOTIVATION: We developed an algorithm to reconstruct ancestral sequences, taking into account the rate variation among sites of the protein sequences. Our algorithm maximizes the joint probability of the ancestral sequences, assuming that the rate is gamma distributed among sites. Our algorithm provably finds the global maximum. The use of 'joint' reconstruction is motivated by studies that use the sequences at all the internal nodes in a phylogenetic tree, for instance to infer patterns of amino-acid replacement or to trace the biochemical changes that occurred during the evolution of a given protein family. RESULTS: We give an algorithm that guarantees finding the global maximum. The efficient search method makes our method applicable to datasets with a large number of sequences. We analyze ancestral sequences of five gene families, exploring the effect of the amount of among-site rate variation and the degree of sequence divergence on the resulting ancestral states. AVAILABILITY AND SUPPLEMENTARY INFORMATION: http://evolu3.ism.ac.jp/~tal/ Contact: tal@ism.ac.jp

19.
ADAPTSITE: detecting natural selection at single amino acid sites.   Cited by: 12 (self-citations: 0, citations by others: 12)
ADAPTSITE is a program package for detecting natural selection at single amino acid sites, using a multiple alignment of protein-coding sequences for a given phylogenetic tree. The program infers ancestral codons at all interior nodes, and computes the total numbers of synonymous (c(S)) and nonsynonymous (c(N)) substitutions as well as the average numbers of synonymous (s(S)) and nonsynonymous (s(N)) sites for each codon site. The probabilities of occurrence of synonymous and nonsynonymous substitutions are approximated by s(S) / (s(S) + s(N)) and s(N) / (s(S) + s(N)), respectively. The null hypothesis of selective neutrality is tested for each codon site, assuming a binomial distribution for the probability of obtaining c(S) and c(N). AVAILABILITY: ADAPTSITE is available free of charge at the World-Wide Web sites http://mep.bio.psu.edu/adaptivevol.html and http://www.cib.nig.ac.jp/dda/yossuzuk/welcome.html. The package includes the source code written in C, binary files for UNIX operating systems, manual, and example files.
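
The per-site neutrality test described in this abstract can be sketched as a binomial tail calculation: under neutrality, each of the c(S) + c(N) substitutions at a codon site is nonsynonymous with probability s(N) / (s(S) + s(N)). The sketch below is not ADAPTSITE's own code; the function name is hypothetical, integer substitution counts are assumed, and the choice of reporting two one-sided tails (an excess or a deficit of nonsynonymous changes) is an assumption about how the test is read.

```python
from math import comb
from typing import Tuple


def neutrality_tail_probabilities(
    c_s: int, c_n: int, s_s: float, s_n: float
) -> Tuple[float, float]:
    """Binomial tail probabilities for one codon site under neutrality.

    Returns (p_excess, p_deficit): the probability of observing at least,
    respectively at most, c_n nonsynonymous changes among c_s + c_n
    substitutions when each is nonsynonymous with probability
    s_n / (s_s + s_n).
    """
    total = c_s + c_n
    p_n = s_n / (s_s + s_n)
    pmf = [comb(total, k) * p_n**k * (1 - p_n) ** (total - k)
           for k in range(total + 1)]
    p_excess = sum(pmf[c_n:])        # P(X >= c_n)
    p_deficit = sum(pmf[:c_n + 1])   # P(X <= c_n)
    return p_excess, p_deficit


# Toy site: 10 nonsynonymous and 0 synonymous changes, s(S) = 1, s(N) = 2.
p_excess, p_deficit = neutrality_tail_probabilities(c_s=0, c_n=10, s_s=1.0, s_n=2.0)
```

A small first value would point towards positive selection at the site and a small second value towards negative selection, subject to whatever multiple-testing treatment the analysis applies across sites.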

20.
SiGN-SSM is open-source gene network estimation software that can run in parallel on PCs and massively parallel supercomputers. The software estimates a state space model (SSM), which is a statistical dynamic model suitable for analyzing short time and/or replicated time series gene expression profiles. SiGN-SSM implements a novel parameter constraint that is effective in stabilizing the estimated models. Also, by using a supercomputer, it is able to determine the gene network structure by a statistical permutation test in a practical amount of time. SiGN-SSM is applicable not only to analyzing temporal regulatory dependencies between genes, but also to extracting the differentially regulated genes from time series expression profiles. AVAILABILITY: SiGN-SSM is distributed under GNU Affero General Public Licence (GNU AGPL) version 3 and can be downloaded at http://sign.hgc.jp/signssm/. Pre-compiled binaries for some architectures are available in addition to the source code. Pre-installed binaries are also available on the Human Genome Center supercomputer system. The online manual and the supplementary information of SiGN-SSM are available on our web site. CONTACT: tamada@ims.u-tokyo.ac.jp.
