Similar articles
Found 20 similar articles (search time: 15 ms)
1.
MOTIVATION: New application areas of survival analysis, for example those based on microarray expression data, call for novel tools able to handle high-dimensional data. While classical (semi-)parametric techniques based on likelihood or partial likelihood functions are omnipresent in clinical studies, they are often inadequate for modelling when there are fewer observations than features in the data. Support vector machines (SVMs) and their extensions are generally found to be particularly useful in such cases: conceptually (as a non-parametric approach), computationally (they reduce to a convex program that can be solved efficiently), theoretically (through their intrinsic relation to learning theory) and empirically. This article discusses an extension of SVMs tuned towards survival data. A particularly useful feature is that the method can incorporate additional structure such as additive models, positivity constraints on the parameters or regression constraints. RESULTS: Besides a discussion of the proposed methods, an empirical case study is conducted on both clinical and microarray gene expression data in the context of cancer studies. Results are expressed in terms of the logrank statistic, the concordance index and the hazard ratio. The reported performances indicate that the present method yields better models for high-dimensional data, while giving results comparable to those of classical techniques based on a proportional hazards model for clinical data.
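The concordance index mentioned above can be illustrated with a short sketch. This is a plain pairwise version of Harrell's concordance index for right-censored data, not the authors' code; the function name and the convention that a higher score means higher risk are assumptions made for illustration.

```python
def concordance_index(times, events, risk_scores):
    """Pairwise concordance index for right-censored survival data.

    times       -- observed follow-up times
    events      -- 1 if the event was observed, 0 if censored
    risk_scores -- model output; higher is assumed to mean higher risk
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable only if subject i had an observed
            # event strictly before subject j's observed time.
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable
```

A value of 1.0 means every comparable pair is ranked correctly, 0.5 corresponds to random ranking, and 0.0 to a fully reversed ranking.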

2.
Due to concerns about data quality, McKechnie, Coe, Gerson, and Wolf (2016) questioned the conclusions of our study (Khaliq et al., 2015) published in this journal. Here, we argue that most of the questioned data points are in fact useful for macrophysiological analyses, mostly because the vast majority of data are explicitly reported in the peer-reviewed physiological literature. Furthermore, we show that our conclusions remain largely robust irrespective of the data inclusion criterion. While we think that constructive debates about the adequate use of primary data in meta-studies as well as more transparency in data inclusion criteria are indeed useful, we also emphasize that data suitability should be evaluated in the light of the scope and scale of the study in which they are used. We hope that this discussion will not discourage the exchange between disciplines such as biogeography and physiology, as this integration is needed to address some of the most urgent scientific challenges.

3.
Environmental data often include low-level concentrations below reporting limits. These data may be reported as “< RL,” where RL is one of several types of reporting limits. Some values also may be reported as a single number, but flagged with a qualifier (J-values) to indicate a difference in precision as compared to values above the RL. A currently used method for reporting censored environmental data called “insider censoring” produces a strong upward bias, while also distorting the shape of the data distribution. This results in inaccurate estimates of summary statistics and regression coefficients, distorts evaluations of whether data follow a normal distribution, and introduces inaccuracies into risk assessments and models. Insider censoring occurs when data measured as below the detection limit (< DL) are reported as less than the higher quantitation limit (< QL), whereas values between the DL and QL are reported as individual numbers. Three unbiased alternatives to insider censoring are presented so that laboratories and their data users can recognize, and remedy, this problem.
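The reporting rule that defines insider censoring can be written down in a few lines. The sketch below only encodes the rule as stated in the abstract; the function name, the string format of censored results and the numeric limits in the usage note are illustrative assumptions.

```python
def report_insider(value, dl, ql):
    """Report one measurement under the 'insider censoring' rule.

    value -- measured concentration
    dl    -- detection limit
    ql    -- quantitation limit (assumed to be higher than dl)
    """
    if ql <= dl:
        raise ValueError("quantitation limit must exceed detection limit")
    if value < dl:
        # Measured below the DL, yet reported against the higher QL.
        return f"<{ql}"
    # Values at or above the DL, including those between the DL and
    # the QL, are reported as individual numbers.
    return value
```

With hypothetical limits DL = 1.0 and QL = 2.0, measurements 0.2 and 0.4 are both reported as "<2.0", while 1.2 and 1.8 appear as numbers even though they are also below the QL. The censoring threshold and the reported limit therefore do not match.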

4.
The method known as Analysis of Concentration (AOC) is proposed as a tool to measure the predictivity of binary data for cover data. The application of AOC to structured tables of oak forests of Central Italy has proved that binary data are more predictive for cover than cover for binary data. The ordinations produced by AOC with binary and cover data are very similar and interpretable with similar results.

5.
Metric data are usually assessed on a continuous scale with good precision, but sometimes agricultural researchers cannot obtain precise measurements of a variable. Values of such a variable cannot then be expressed as real numbers (e.g., 1.51 or 2.56), but often can be represented by intervals into which the values fall (e.g., from 1 to 2 or from 2 to 3). In this situation, statisticians talk about censoring and censored data, as opposed to missing data, where no information is available at all. Traditionally, in agriculture and biology, three methods have been used to analyse such data: (a) when intervals are narrow, some form of imputation (e.g., mid-point imputation) is used to replace the interval and traditional methods for continuous data are employed (such as analyses of variance [ANOVA] and regression); (b) for time-to-event data, the cumulative proportions of individuals that experienced the event of interest are analysed, instead of the individual observed times-to-event; (c) when intervals are wide and many individuals are collected, non-parametric methods of data analysis are favoured, where counts are considered instead of the individual observed value for each sample element. In this paper, we show that these methods may be suboptimal: the first does not respect the process of data collection, the second leads to unreliable standard errors (SEs), and the third does not make full use of all the available information. As an alternative, methods of survival analysis for censored data can be useful, leading to reliable inferences and sound hypothesis testing. These methods are illustrated using three examples from plant and crop sciences.
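The contrast between mid-point imputation and a censored-data likelihood can be sketched as follows. The exponential model is an assumption chosen only to keep the example short; the abstract does not say which distributions the authors use, and both function names are hypothetical.

```python
import math


def midpoint_impute(intervals):
    """Method (a): replace each interval (lo, hi) by its mid-point."""
    return [(lo + hi) / 2 for lo, hi in intervals]


def interval_loglik(rate, intervals):
    """Log-likelihood of interval-censored data under an exponential model.

    Each observation is only known to lie in (lo, hi], so it contributes
    P(lo < T <= hi) = exp(-rate * lo) - exp(-rate * hi) to the likelihood.
    """
    if rate <= 0:
        raise ValueError("rate must be positive")
    total = 0.0
    for lo, hi in intervals:
        total += math.log(math.exp(-rate * lo) - math.exp(-rate * hi))
    return total
```

The first function discards the interval width, whereas the second keeps it in the likelihood, which is the sense in which the survival-analysis approach respects how the data were collected.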

6.
The present study assesses the effects of genotyping errors on the type I error rate of a particular transmission/disequilibrium test (TDT(std)), which assumes that data are errorless, and introduces a new transmission/disequilibrium test (TDT(ae)) that allows for random genotyping errors. We evaluate the type I error rate and power of the TDT(ae) under a variety of simulations and perform a power comparison between the TDT(std) and the TDT(ae), for errorless data. Both the TDT(std) and the TDT(ae) statistics are computed as two times a log-likelihood difference, and both are asymptotically distributed as χ² with 1 df. Genotype data for trios are simulated under a null hypothesis and under an alternative (power) hypothesis. For each simulation, errors are introduced randomly via a computer algorithm with different probabilities (called "allelic error rates"). The TDT(std) statistic is computed on all trios that show Mendelian consistency, whereas the TDT(ae) statistic is computed on all trios. The results indicate that TDT(std) shows a significant increase in type I error when applied to data in which inconsistent trios are removed. This type I error increases both with an increase in sample size and with an increase in the allelic error rates. TDT(ae) always maintains correct type I error rates for the simulations considered. Factors affecting the power of the TDT(ae) are discussed. Finally, the power of TDT(std) is at least that of TDT(ae) for simulations with errorless data. Because data are rarely error free, we recommend that researchers use methods, such as the TDT(ae), that allow for errors in genotype data.
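The form of the test statistic described above (twice a log-likelihood difference, referred to a χ² distribution with 1 df) can be sketched generically. The code does not compute the TDT likelihoods themselves, which the abstract does not specify; the two log-likelihood values are assumed to be supplied by the caller, and the function names are illustrative.

```python
import math


def lrt_statistic(loglik_null, loglik_alt):
    """Likelihood-ratio statistic: two times the log-likelihood difference."""
    return 2.0 * (loglik_alt - loglik_null)


def chi2_1df_pvalue(stat):
    """Upper-tail probability of a chi-squared variable with 1 df.

    For 1 df, P(X > x) = erfc(sqrt(x / 2)), so no statistics library
    is required.
    """
    if stat <= 0.0:
        return 1.0  # numerical noise can make the statistic slightly negative
    return math.erfc(math.sqrt(stat / 2.0))
```

For example, a statistic of about 3.84 corresponds to a p-value of about 0.05.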

7.
8.
Large subunit ribosomal DNA (LSU rDNA) sequence data from 120 taxa and cytochrome oxidase subunit 1 (COI) sequence data from 27 taxa are analyzed separately and together to estimate the internal phylogeny of the class Demospongiae and to evaluate how consistent these data are with pre-existing hypotheses of relationship concerning order-level monophyly and relationships. The monophyly of Porifera is only slightly inconsistent with LSU data, which do not support the monophyly of the class Demospongiae regardless of the inclusion or exclusion of Homoscleromorpha (this result is likely due to the placement of a single hexactinellid taxon within the Demospongiae); however, no LSU support is found for the monophyly of Silicea (Demospongiae+Hexactinellida) unless homoscleromorphs are excluded. Neither the subclasses Ceractinomorpha and Tetractinomorpha, nor the orders Halichondrida, Hadromerida, and Haplosclerida are supported as monophyletic under any data partition. The haplosclerid suborders Haplosclerina and Petrosina are supported as monophyletic to the exclusion of the suborder Spongillina, and the orders Dictyoceratida, Verongida, Poecilosclerida, Astrophorida, Spirophorida, Homosclerophorida, and Agelasida are largely reconstructed as monophyletic, with the exception of a few anomalously placed taxa. Few inter-order relationships are strongly supported by any data partition, but there is moderate support for a verongid+chondrosid clade and a tetractinellid+halichondrid clade. Furthermore, LSU data strongly support the existence of two novel clades that do not correspond to the existing classification and that show no morphological uniformity. Finally, every data partition supports the monophyly of a clade that includes the order Agelasida, some members of the genus Axinella, and two taxa tentatively identified as belonging to the orders Hadromerida and Halichondrida.

9.
A microprocessor-based system which performs realtime correlated acquisition, storage and display of multiparameter (3-parameter) data from a flow cytometer (FACS-III) is presented. List-mode techniques are not employed. The 3-parameter data are collected and correlated, then displayed along with cell frequency as a realtime 3-parameter colour scattergram while the experiment is in progress; in addition, correlated and uncorrelated higher-resolution projections of the 3-parameter data are collected and stored. The data projections may also be displayed as 1-parameter histograms, or as 2-parameter colour or grey-scale scattergrams. Examples of 2- and 3-parameter colour scattergrams are presented. The speed and some characteristics of the realtime acquisition and display software are examined; methods to increase the realtime speed are discussed.

10.
In time series of count data from biomedical applications, correlated measurements, clustering and excess zeros often occur simultaneously. Ignoring such effects might lead to misleading conclusions about treatment outcomes. A generalized mixture Poisson geometric process (GMPGP) model and a zero-altered mixture Poisson geometric process (ZMPGP) model are developed from the geometric process model, which was originally developed for modelling positive continuous data and was later extended to handle count data. These models are motivated by evaluating the trend development of new tumour counts for bladder cancer patients as well as by identifying useful covariates which affect the count level. The models are implemented using a Bayesian method with Markov chain Monte Carlo (MCMC) algorithms and are assessed using the deviance information criterion (DIC).

11.
Using DNA sequence data from multiple genes (often from more than one genome compartment) to reconstruct phylogenetic relationships has become routine. Augmenting this approach with genomic structural characters (e.g., intron gain and loss, changes in gene order) as these data become available from comparative studies already has provided critical insight into some long-standing questions about the evolution of land plants. Here we report on the presence of a group II intron located in the mitochondrial atp1 gene of leptosporangiate and marattioid ferns. Primary sequence data for the atp1 gene are newly reported for 27 taxa, and results are presented from maximum likelihood-based phylogenetic analyses using Bayesian inference for 34 land plants in three data sets: (1) single-gene mitochondrial atp1 (exon+intron sequences); (2) five combined genes (mitochondrial atp1 [exon only]; plastid rbcL, atpB, rps4; nuclear SSU rDNA); and (3) same five combined genes plus morphology. All our phylogenetic analyses corroborate results from previous fern studies that used plastid and nuclear sequence data: the monophyly of euphyllophytes, as well as of monilophytes; whisk ferns (Psilotidae) sister to ophioglossoid ferns (Ophioglossidae); horsetails (Equisetopsida) sister to marattioid ferns (Marattiidae), which together are sister to the monophyletic leptosporangiate ferns. In contrast to the results from the primary sequence data, the genomic structural data (atp1 intron distribution pattern) would seem to suggest that leptosporangiate and marattioid ferns are monophyletic, and together they are the sister group to horsetails, a topology that is rarely reconstructed using primary sequence data.

12.
13.
Measurements in populations which serve as valid indicators of biological relationship should be proportional to genetic distance. In order to test the utility of discrete cranial traits for estimating genetic distances among populations, estimates of admixture are obtained for gene frequency data and nonmetric cranial data in São Paulo mulattos (M). The gene frequency data serve as a control that the three populations are related as stated: estimates of admixture are obtained by using São Paulo whites (W) and blacks (B) as parental populations and by estimating the parameter of admixture, m, in the model pM = (1 − m) pW + m pB (Elston, 1971), where the p's are either gene frequencies or nonmetric trait frequencies. A test of goodness of fit of the model provides a means of ascertaining whether or not the data fit this linear model. While the gene frequency data indicate distances among the three populations which are highly compatible with the linear model of admixture, the nonmetric data show significant deviations from the model. This implies that the frequencies of the nonmetric traits in the populations used in this analysis are not a linear function of genetic distance. This discourages the use of nonmetric traits in making quantitative conclusions about genetic relationships. It also suggests the need for investigation of the use of other skeletal characters for estimating genetic distance, as well as approaches for such investigations through the study of hybrid individuals.
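The linear admixture model pM = (1 − m) pW + m pB can be fitted in several ways. The sketch below uses ordinary (unweighted) least squares across traits, which is an assumption made for brevity and not necessarily the estimator of Elston (1971); the function name is hypothetical.

```python
def estimate_admixture(p_w, p_b, p_m):
    """Least-squares estimate of m in pM = (1 - m) * pW + m * pB.

    Rearranging gives pM - pW = m * (pB - pW), a regression through the
    origin, so m = sum((pM - pW) * (pB - pW)) / sum((pB - pW) ** 2).
    """
    if not (len(p_w) == len(p_b) == len(p_m)):
        raise ValueError("frequency lists must have equal length")
    num = sum((m_ - w) * (b - w) for w, b, m_ in zip(p_w, p_b, p_m))
    den = sum((b - w) ** 2 for w, b in zip(p_w, p_b))
    if den == 0:
        raise ValueError("parental frequencies are identical")
    return num / den
```

With made-up parental frequencies pW = [0.1, 0.5, 0.8] and pB = [0.6, 0.2, 0.3], hybrid frequencies generated with m = 0.4 are recovered as m ≈ 0.4. Large residuals pM − [(1 − m) pW + m pB] would indicate the kind of poor fit the abstract reports for the nonmetric traits.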

14.
We describe an integrated suite of algorithms and software for general accurate mass and time (AMT) tagging data analysis of mass spectrometry data. The AMT approach combines identifications from liquid chromatography (LC) tandem mass spectrometry (MS/MS) data with peptide accurate mass and retention time locations from high-resolution LC-MS data. Our workflow includes the traditional AMT approach, in which MS/MS identifications are located in external databases, as well as methods based on more recent hybrid instruments such as the LTQ-FT or Orbitrap, where MS/MS identifications are embedded with the MS data. We demonstrate our AMT workflow's utility for general data synthesis by combining data from two dissimilar biospecimens. Specifically, we demonstrate its use relevant to serum biomarker discovery by identifying which peptides sequenced by MS/MS analysis of tumor tissue may also be present in the plasma of tumor-bearing and control mice. The analysis workflow, referred to as msInspect/AMT, extends and combines existing open-source platforms for LC-MS/MS (CPAS) and LC-MS (msInspect) data analysis and is available in an unrestricted open-source distribution.
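The core idea of AMT matching, pairing an MS/MS identification with an LC-MS feature when both mass and retention time agree, can be sketched as below. This is not the msInspect/AMT algorithm, which is likely more involved; the function name, the tuple layouts and the tolerance defaults are all assumptions for illustration.

```python
def match_amt(ms2_ids, ms1_features, ppm_tol=5.0, rt_tol=0.5):
    """Match MS/MS identifications to LC-MS features.

    ms2_ids      -- list of (peptide, mass, retention_time)
    ms1_features -- list of (mass, retention_time)
    ppm_tol      -- mass tolerance in parts per million (hypothetical)
    rt_tol       -- retention-time tolerance, same units as the data
    Returns a list of (peptide, feature_index) pairs.
    """
    matches = []
    for peptide, mass, rt in ms2_ids:
        for idx, (f_mass, f_rt) in enumerate(ms1_features):
            ppm_error = abs(f_mass - mass) / mass * 1e6
            # Both the mass and the retention time must agree.
            if ppm_error <= ppm_tol and abs(f_rt - rt) <= rt_tol:
                matches.append((peptide, idx))
    return matches
```

The sketch assumes that retention times from the two runs are already on a common scale; in practice some alignment step would normally be needed first.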

15.
Development of a clearer understanding of the causes and consequences of environmental change is an important issue globally. The consequent demand for objective, reliable and up-to-date environmental information has led to the establishment of long-term integrated environmental monitoring programmes, including the UK's Environmental Change Network (ECN). Databases form the core information resource for such programmes. The UK Environmental Change Network Data Centre manages data on behalf of ECN (as well as other related UK integrated environmental monitoring networks) and provides a robust and integrated system of information management. This paper describes how data are captured – through standardised protocols and data entry systems – as well as through more recent approaches such as wireless sensors. Data are managed centrally through a database and GIS. Quality control is built in at all levels of the system. Data are then made accessible through a variety of data access methods – through bespoke web interfaces, as well as third-party data portals. This paper describes the informatics approach of the ECN Data Centre, which aims to develop a seamless system of data capture, management and data access interfaces to support research.

16.
The decisiveness of a data set has been defined as the degree to which all possible dichotomous trees for that data set differ in length, and the DD statistic (the data decisiveness index) has been proposed to measure this degree. In this paper, we first discuss an exact nonrecursive formula for the length of indecisive data sets (DD = 0) that consist of informative binary characters in which no missing entries are allowed. Next, the concept of indecisive data sets is extended to data sets in which missing entries may be present. Last, indecisive data sets with missing entries are used as an aid to construct hypothetical data sets that single out some of the factors that influence the DD statistic. On the basis of these examples, it is concluded that the concept of data decisiveness is too elusive to be captured in a single and simple index such as DD.

17.
Experimental constraints associated with NMR structures are available from the Protein Data Bank (PDB) in the form of 'Magnetic Resonance' (MR) files. These files contain multiple types of data concatenated without boundary markers and are difficult to use for further research. Reported here are the results of a project initiated to annotate, archive, and disseminate these data to the research community from a searchable resource in a uniform format. The MR files from a set of 1410 NMR structures were analyzed and their original constituent data blocks annotated as to data type using a semi-automated protocol. A new software program called Wattos was then used to parse and archive the data in a relational database. From the total number of MR file blocks annotated as constraints, it proved possible to parse 84% (3337/3975). The constraint lists that were parsed correspond to three data types (2511 distance, 788 dihedral angle, and 38 residual dipolar coupling lists) from the three most popular software packages used in NMR structure determination: XPLOR/CNS (2520 lists), DISCOVER (412 lists), and DYANA/DIANA (405 lists). These constraints were then mapped to a developmental version of the BioMagResBank (BMRB) data model. A total of 31 data types originating from 16 programs have been classified, with the NOE distance constraint being the most commonly observed. The results serve as a model for the development of standards for NMR constraint deposition in computer-readable form. The constraints are updated regularly and are available from the BMRB web site (http://www.bmrb.wisc.edu).

18.
Missing data are a widely recognized nuisance factor in phylogenetic analyses, and the fear of missing data may deter systematists from including characters that are highly incomplete. In this paper, I used simulations to explore the consequences of including sets of characters that contain missing data. More specifically, I tested whether the benefits of increasing the number of characters outweigh the costs of adding missing data cells to a matrix. The results show that the addition of a set of characters with missing data is generally more likely to increase phylogenetic accuracy than decrease it, but the potential benefits of adding these characters quickly disappear as the proportion of missing data increases. Furthermore, despite the overall trend, adding characters with missing data does decrease accuracy in some cases. In these situations, the missing data entries are not themselves misleading, but their presence may mimic the effects of limited taxon sampling, which can positively mislead. Criteria are discussed for predicting whether adding characters with missing data may increase or decrease accuracy. The results of this study also suggest that accuracy can be increased to a surprising degree by (1) "filling the holes" in a data matrix as much as possible (even when relatively few taxa are missing data), and (2) adding fewer characters scored for all taxa rather than adding a larger number of characters known for fewer taxa. Missing data can also be eliminated from an analysis through the exclusion of incomplete taxa rather than incomplete characters, but this approach may reduce the usefulness of the analysis and (in some cases) the accuracy of the estimated trees.

19.
GDPC: connecting researchers with multiple integrated data sources   Cited by: 1 (self-citations: 0, citations by others: 1)
The goal of this project is to simplify access to genomic diversity and phenotype data, thereby encouraging reuse of these data. The Genomic Diversity and Phenotype Connection (GDPC) accomplishes this by retrieving data from one or more data sources and by allowing researchers to analyze integrated data in a standard format. GDPC is written in Java and provides (1) data sources available as web services that transfer XML-formatted data via the SOAP protocol; (2) a Java API for programmatic access to data sources; and (3) a front-end application that allows users to manage data sources, retrieve data based on filters, sort/group data based on property values and save/open the data as XML files. AVAILABILITY: The source code, compiled code, documentation and GDPC Browser are freely available at: www.maizegenetics.net/gdpc/index.html. The current release of GDPC is version 1.0, with updated releases planned for the future. Comments are welcome.

20.
The collection of 4-color fluorescent genotyping data from capillary array electrophoresis microchip devices and their conversion to a format easily and rapidly analyzed by Genetic Profiler genotyping software are presented. Microchip fluorescence intensity data are acquired and stored as 4-color tab-delimited text. These files are converted to electrophoretic signal data (ESD) files using a utility program (TEXT-to-ESD) written in C. TEXT-to-ESD generates an ESD file by converting text data to binary data and then appending a 632-byte ESD-file trailer. Up to 96 ESD files are then assembled into a run folder and imported into Genetic Profiler, where data are reduced to 4-color electropherograms and analyzed. In this manner, DNA fragment sizing data acquired with our high-speed electrophoretic microchip devices can be rapidly analyzed using robust commercial software. Additionally, the conversion program allows Genetic Profiler to size data that have been preprocessed using other third-party software, such as BaseFinder.
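The two conversion steps named above (text to binary, then a 632-byte trailer) can be sketched as follows. The original utility is written in C and the real ESD layout is not given in the abstract, so everything except the trailer size is an assumption: the function name is hypothetical, samples are assumed to be 16-bit little-endian signed integers, and the trailer is a zero-filled placeholder.

```python
import struct

TRAILER_SIZE = 632  # trailer length stated in the abstract


def text_to_binary(text, trailer=None):
    """Convert 4-color tab-delimited text to binary and append a trailer.

    Assumed (not from the source): one row per time point, four integer
    intensities per row, each packed as a little-endian int16.
    """
    if trailer is None:
        trailer = bytes(TRAILER_SIZE)  # placeholder, not a real ESD trailer
    if len(trailer) != TRAILER_SIZE:
        raise ValueError("trailer must be exactly 632 bytes")
    out = bytearray()
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) != 4:
            raise ValueError("expected four tab-separated color channels")
        out += struct.pack("<4h", *(int(f) for f in fields))
    return bytes(out) + trailer
```

Under these assumptions, two rows of input produce 2 × 8 bytes of signal data followed by the 632-byte trailer.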
