Similar Articles

20 similar articles found (search time: 15 ms)
1.

Background

Although numerous investigations have compared gene expression microarray platforms, preprocessing methods and batch correction algorithms using constructed spike-in or dilution datasets, there remains a paucity of studies examining the properties of microarray data using diverse biological samples. Most microarray experiments seek to identify subtle differences between samples with variable background noise, a scenario poorly represented by constructed datasets. Thus, microarray users lack important information regarding the complexities introduced in real-world experimental settings. The recent development of a multiplexed, digital technology for nucleic acid measurement enables counting of individual RNA molecules without amplification and, for the first time, permits such a study.

Results

Using a set of human leukocyte subset RNA samples, we compared previously acquired microarray expression values with RNA molecule counts determined by the nCounter Analysis System (NanoString Technologies) for a set of selected genes. We found that gene measurements across samples correlated well between the two platforms, particularly for high-variance genes, while genes deemed unexpressed by the nCounter generally had both low expression and low variance on the microarray. Confirming previous findings from spike-in and dilution datasets, this "gold-standard" comparison demonstrated signal compression that varied dramatically by expression level and, to a lesser extent, by dataset. Most importantly, examination of three different cell types revealed that noise levels differed across tissues.
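As a rough illustration of the kind of per-gene, cross-platform comparison described above, the following sketch correlates microarray intensities with log-transformed nCounter-style counts gene by gene; the simulated matrices, sample sizes, and variable names are hypothetical stand-ins, not the study's data or pipeline.

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
n_genes, n_samples = 200, 24
gene_sd = rng.uniform(0.2, 2.5, n_genes)                                   # per-gene variability (simulated)
array_log2 = 8 + gene_sd[:, None] * rng.normal(size=(n_genes, n_samples))  # "microarray" log2 intensities
counts = np.round(2 ** (array_log2 + rng.normal(0, 0.7, size=array_log2.shape)))  # "nCounter"-like counts

# Spearman correlation between platforms, computed gene by gene
per_gene_rho = np.array([spearmanr(array_log2[g], np.log2(counts[g] + 1))[0]
                         for g in range(n_genes)])
per_gene_var = array_log2.var(axis=1)

# high-variance genes tend to agree better across platforms than low-variance genes
high_var = per_gene_var > np.median(per_gene_var)
print(per_gene_rho[high_var].mean(), per_gene_rho[~high_var].mean())
```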

Conclusions

Microarray measurements generally correlate with relative RNA molecule counts within optimal ranges, but they suffer from an expression-dependent accuracy bias, and their precision varies across datasets. We urge microarray users to consider expression-level effects when interpreting signals and to evaluate noise properties in each dataset independently.

Electronic supplementary material

The online version of this article (doi:10.1186/1471-2164-15-649) contains supplementary material, which is available to authorized users.

2.

Background

Knowing the phase of marker genotype data can be useful in genome-wide association studies, because it enables analysis frameworks that account for identity by descent or the parental origin of alleles, and because it can greatly increase the amount of usable data through genotype or sequence imputation. Long-range phasing and haplotype library imputation constitute a fast and accurate method to impute phase for SNP data.

Methods

A long-range phasing and haplotype library imputation algorithm was developed. It combines information from surrogate parents and long haplotypes to resolve phase in a manner that is not dependent on the family structure of a dataset or on the presence of pedigree information.
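To make the surrogate-parent idea concrete, here is a deliberately simplified, non-iterative sketch of long-range phasing: individuals who share a long stretch with no opposing homozygotes act as surrogate parents, and wherever a surrogate is homozygous it reveals which allele of a heterozygous focal SNP lies on the shared haplotype. The data are simulated and the function names, window size and voting rule are illustrative assumptions; the published algorithm is iterative and considerably more elaborate.

```python
import numpy as np

def surrogate_parents(G, focal, window, max_opposing=0):
    """Individuals with at most `max_opposing` opposing homozygotes to `focal` within `window`."""
    g_f = G[focal, window]
    opposing = ((G[:, window] == 0) & (g_f == 2)) | ((G[:, window] == 2) & (g_f == 0))
    ok = opposing.sum(axis=1) <= max_opposing
    ok[focal] = False                                # an individual is not its own surrogate
    return np.where(ok)[0]

def phase_het_site(G, focal, snp, surrogates):
    """Majority vote on which allele of a heterozygous SNP lies on the shared haplotype."""
    votes = G[surrogates, snp]
    votes = votes[votes != 1] // 2                   # keep homozygous surrogates: allele 0 or 1
    return int(round(votes.mean())) if votes.size else None

rng = np.random.default_rng(1)
G = rng.integers(0, 3, size=(500, 200))              # toy genotype matrix: individuals x SNPs (0/1/2)
window = np.arange(20)                               # a "long" shared stretch, shortened for toy data
surr = surrogate_parents(G, focal=0, window=window)
allele = phase_het_site(G, focal=0, snp=5, surrogates=surr) if G[0, 5] == 1 else None
print(len(surr), "surrogate parents; phased allele:", allele)
```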

Results

The algorithm performed well on both simulated and real livestock and human datasets in terms of both phasing accuracy and computational efficiency. The percentage of alleles that could be phased in simulated and real datasets of varying size generally exceeded 98%, while the percentage of alleles incorrectly phased in simulated data was generally less than 0.5%. Phasing accuracy was affected by dataset size, with lower accuracy for datasets of fewer than 1,000 individuals, but was not affected by effective population size, family data structure, presence or absence of pedigree information, or SNP density. The method was computationally fast. Compared to a commonly used statistical method (fastPHASE), the current method made about 8% fewer phasing errors and ran about 26 times faster on a small dataset. For larger datasets, the differences in computational time are expected to be even greater. A computer program implementing these methods has been made available.

Conclusions

The algorithm and software developed in this study make feasible the routine phasing of high-density SNP chips in large datasets.

3.
CP Li, ZG Yu, GS Han, KH Chu. PLoS One 2012, 7(7): e42154

Background

The composition vector (CV) method has proven to be a reliable and fast alignment-free method for analyzing large COI barcoding datasets. In this study, we modify this method for analyzing multi-gene datasets in plant DNA barcoding. The modified method uses an adjustable weighting of the vector distance based on the ratio of the sequence lengths of the candidate genes for each pair of taxa.
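A hedged sketch of how such a length-weighted, alignment-free distance could be assembled is shown below: each gene contributes a k-mer composition vector, and the per-gene distances for a pair of taxa are combined with weights proportional to the sequence length that gene contributes. The raw k-mer frequencies and the exact weighting rule here are simplifying assumptions, not the formula used in the paper.

```python
from collections import Counter
from itertools import product
import math, random

def kmer_vector(seq, k=4):
    """Normalized k-mer frequency vector (a simplified stand-in for a composition vector)."""
    kmers = [''.join(p) for p in product('ACGT', repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = max(sum(counts.values()), 1)
    return [counts[km] / total for km in kmers]

def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu, nv = math.sqrt(sum(a * a for a in u)), math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (nu * nv) if nu and nv else 1.0

def weighted_multigene_distance(genes_a, genes_b, k=4):
    """genes_a / genes_b map gene names (e.g. 'matK', 'rbcL') to sequences for two taxa."""
    shared = genes_a.keys() & genes_b.keys()
    length = {g: len(genes_a[g]) + len(genes_b[g]) for g in shared}
    total = sum(length.values())
    return sum((length[g] / total) *                 # weight by each gene's share of total sequence
               cosine_distance(kmer_vector(genes_a[g], k), kmer_vector(genes_b[g], k))
               for g in shared)

random.seed(0)
rand_seq = lambda n: ''.join(random.choice('ACGT') for _ in range(n))
taxon1 = {"matK": rand_seq(800), "rbcL": rand_seq(1400)}
taxon2 = {"matK": rand_seq(800), "rbcL": rand_seq(1400)}
print(weighted_multigene_distance(taxon1, taxon2))
```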

Methodology/Principal Findings

Three datasets were tested: a matK+rbcL dataset with 2,083 sequences, a matK+rbcL dataset with 397 sequences, and a matK+rbcL+trnH-psbA dataset with 397 sequences. We showed that the success rates of grouping sequences at the genus/species level with this modified CV approach are consistently higher than those obtained with the traditional K2P/NJ method. For the matK+rbcL datasets, the modified CV approach outperformed the K2P/NJ approach by 7.9% in both the 2,083-sequence and 397-sequence datasets, and for the matK+rbcL+trnH-psbA dataset the CV approach outperformed the traditional approach by 16.7%.

Conclusions

We conclude that the modified CV approach is an efficient method for analyzing large multi-gene datasets for plant DNA barcoding. Source code, implemented in C++ and supported on MS Windows, is freely available for download at http://math.xtu.edu.cn/myphp/math/research/source/Barcode_source_codes.zip.

4.

Background

Since both the number of SNPs (single nucleotide polymorphisms) used in genomic prediction and the number of individuals in training datasets are rapidly increasing, there is a growing need to improve the efficiency of genomic prediction models in terms of computing time and memory (RAM) requirements.

Methods

In this paper, two alternative algorithms for genomic prediction are presented that replace the originally suggested residual updating algorithm, without affecting the estimates. The first alternative algorithm continues to use residual updating, but takes advantage of the characteristic that the predictor variables in the model (i.e. the SNP genotypes) take only three different values, and is therefore termed “improved residual updating”. The second alternative algorithm, here termed “right-hand-side updating” (RHS-updating), extends the idea of improved residual updating across multiple SNPs. The alternative algorithms can be implemented for a range of different genomic predictions models, including random regression BLUP (best linear unbiased prediction) and most Bayesian genomic prediction models. To test the required computing time and RAM, both alternative algorithms were implemented in a Bayesian stochastic search variable selection model.
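The key observation behind improved residual updating can be illustrated with a small sketch: because each SNP column contains only the values 0, 1 and 2, the quantity x_j'e can be accumulated from two precomputed index lists (heterozygotes and alternate homozygotes) instead of a full dot product, and the residual update touches only those individuals. The simple ridge-type SNP update, the shrinkage value and all variable names below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 2000, 500
X = rng.integers(0, 3, size=(n, m)).astype(np.int8)       # genotypes coded 0/1/2
y = rng.normal(size=n)

het_idx = [np.where(X[:, j] == 1)[0] for j in range(m)]    # precomputed once per SNP
hom_idx = [np.where(X[:, j] == 2)[0] for j in range(m)]
xtx = np.array([len(het_idx[j]) + 4 * len(hom_idx[j]) for j in range(m)], dtype=float)

beta = np.zeros(m)
e = y.copy()                       # residuals given current (all-zero) SNP effects
lam = 10.0                         # ridge-like shrinkage parameter, arbitrary here

for j in range(m):
    # x_j' e without touching individuals carrying genotype 0
    xe = e[het_idx[j]].sum() + 2.0 * e[hom_idx[j]].sum()
    rhs = xe + xtx[j] * beta[j]
    new_beta = rhs / (xtx[j] + lam)
    diff = new_beta - beta[j]
    # residual updating, again restricted to the non-zero genotype classes
    e[het_idx[j]] -= diff
    e[hom_idx[j]] -= 2.0 * diff
    beta[j] = new_beta
```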

Results

Compared to the original algorithm, the improved residual updating algorithm reduced CPU time by 35.3 to 43.3%, without changing memory requirements. The RHS-updating algorithm reduced CPU time by 74.5 to 93.0% and memory requirements by 13.1 to 66.4% compared to the original algorithm.

Conclusions

The presented RHS-updating algorithm provides an interesting alternative to reduce both computing time and memory requirements for a range of genomic prediction models.

5.

Purpose

To evaluate the accuracy of the sub-classification of renal cortical neoplasms using molecular signatures.

Experimental Design

A search of publicly available databases was performed to identify microarray datasets with multiple histologic sub-types of renal cortical neoplasms. Meta-analytic techniques were utilized to identify differentially expressed genes for each histologic subtype. The lists of genes obtained from the meta-analysis were used to create predictive signatures through the use of a pair-based method. These signatures were organized into an algorithm to sub-classify renal neoplasms. The use of these signatures according to our algorithm was validated on several independent datasets.
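As an illustration of what a pair-based signature can look like, the sketch below uses relative-expression rules of the top-scoring-pair type: a pair (a, b) votes for a subtype whenever expression of a exceeds that of b within the same sample, which makes the rule insensitive to platform scaling. The gene names, pairs and data are hypothetical placeholders, not the published signatures.

```python
import numpy as np

def pair_votes(expr, pairs):
    """expr: dict gene -> 1D array over samples; pairs: list of (gene_a, gene_b, subtype)."""
    n = len(next(iter(expr.values())))
    votes = {}
    for a, b, subtype in pairs:
        for i in range(n):
            if expr[a][i] > expr[b][i]:              # within-sample relative expression rule
                votes.setdefault(i, []).append(subtype)
    # majority vote per sample; None if no pair fired
    return [max(set(v), key=v.count) if (v := votes.get(i)) else None for i in range(n)]

# hypothetical genes and pairs, chosen only to show the mechanics
expr = {g: np.random.rand(5) for g in ("CA9", "VIM", "KIT", "CLDN7")}
pairs = [("CA9", "CLDN7", "clear cell"), ("KIT", "VIM", "chromophobe")]
print(pair_votes(expr, pairs))
```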

Results

We identified three Gene Expression Omnibus datasets that fit our criteria for developing a training set. All of the datasets in our study used the Affymetrix platform. The final training dataset included 149 samples representing the four most common histologic subtypes of renal cortical neoplasms: 69 clear cell, 41 papillary, 16 chromophobe, and 23 oncocytomas. When our signatures were validated on external datasets, we correctly classified 68 of the 72 samples (94%). The correct classification by subtype was 19/20 (95%) for clear cell, 14/14 (100%) for papillary, 17/19 (89%) for chromophobe, and 18/19 (95%) for oncocytomas.

Conclusions

Through the use of meta-analytic techniques, we were able to create an algorithm that sub-classified renal neoplasms on a molecular level with 94% accuracy across multiple independent datasets. This algorithm may aid in selecting molecular therapies and may improve the accuracy of subtyping of renal cortical tumors.

6.
7.

Background

Injury is a leading cause of the global burden of disease (GBD). Estimates of non-fatal injury burden have been limited by a paucity of empirical outcomes data. This study aimed to (i) establish the 12-month disability associated with each GBD 2010 injury health state, and (ii) compare approaches to modelling the impact of multiple injury health states on disability as measured by the Glasgow Outcome Scale – Extended (GOS-E).

Methods

12-month functional outcomes for 11,337 survivors to hospital discharge were drawn from the Victorian State Trauma Registry and the Victorian Orthopaedic Trauma Outcomes Registry. ICD-10 diagnosis codes were mapped to the GBD 2010 injury health states. Cases with a GOS-E score >6 were defined as “recovered.” A split dataset approach was used. Cases were randomly assigned to development or test datasets. Probability of recovery for each health state was calculated using the development dataset. Three logistic regression models were evaluated: a) additive, multivariable; b) “worst injury;” and c) multiplicative. Models were adjusted for age and comorbidity and investigated for discrimination and calibration.
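A simplified sketch of how the three model forms can be set up and compared on synthetic data is given below: (a) an additive multivariable logistic regression over health-state indicators, (b) a "worst injury" predictor based on the single health state with the lowest marginal probability of recovery, and (c) a multiplicative combination of per-state recovery probabilities, all compared by C-statistic (AUC). The exact specifications in the study may differ; everything here is illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(3)
n_cases, n_states = 4000, 10
X = (rng.random((n_cases, n_states)) < 0.15).astype(int)        # health-state indicators (synthetic)
true_p = rng.uniform(0.4, 0.95, n_states)                       # per-state recovery probabilities
recovered = (rng.random(n_cases) < np.prod(np.where(X == 1, true_p, 1.0), axis=1)).astype(int)

train, test = slice(0, 3000), slice(3000, None)

# (a) additive, multivariable logistic regression
additive = LogisticRegression(max_iter=1000).fit(X[train], recovered[train])
auc_add = roc_auc_score(recovered[test], additive.predict_proba(X[test])[:, 1])

# marginal recovery probability per health state, estimated from the training split
p_state = np.array([recovered[train][X[train, s] == 1].mean() if X[train, s].any() else 1.0
                    for s in range(n_states)])

# (b) "worst injury": lowest per-state probability among the states a case carries
worst = np.where(X == 1, p_state, 1.0).min(axis=1)
auc_worst = roc_auc_score(recovered[test], worst[test])

# (c) multiplicative combination of per-state probabilities
multiplicative = np.prod(np.where(X == 1, p_state, 1.0), axis=1)
auc_mult = roc_auc_score(recovered[test], multiplicative[test])

print(auc_add, auc_worst, auc_mult)
```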

Findings

A single injury health state was recorded for 46% of cases (1–16 health states per case). The additive (C-statistic 0.70, 95% CI: 0.69, 0.71) and “worst injury” (C-statistic 0.70; 95% CI: 0.68, 0.71) models demonstrated higher discrimination than the multiplicative (C-statistic 0.68; 95% CI: 0.67, 0.70) model. The additive and “worst injury” models demonstrated acceptable calibration.

Conclusions

The majority of patients survived with persisting disability at 12 months, highlighting the importance of improving estimates of non-fatal injury burden. The additive and "worst injury" models performed similarly. GBD 2010 injury states were moderately predictive of recovery one year post-injury. Further evaluation using additional measures of health status and functioning, and comparison with the GBD 2010 disability weights, will be needed to optimise injury states for future GBD studies.

8.

Background

The diagnostic approach to dizzy older patients is not straightforward, as many organ systems can be involved and evidence for diagnostic strategies is lacking. A first differentiation into diagnostic subtypes or profiles may guide the diagnostic process for dizziness and can serve as a classification system in future research. Such differentiation has been attempted in the literature, but only on the basis of pathophysiological reasoning.

Objective

To establish a classification of diagnostic profiles of dizziness based on empirical data.

Design

Cross-sectional study.

Participants and Setting

417 consecutive patients aged 65 years and older presenting with dizziness to 45 primary care physicians in the Netherlands from July 2006 to January 2008.

Methods

We performed tests, including patient history and physical and additional examinations, that had previously been selected by an international expert panel on the basis of an earlier systematic review. The results of these tests were used in a principal component analysis for exploration, data reduction and, finally, differentiation into diagnostic dizziness profiles.
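The exploration and data-reduction step described above follows a standard recipe that can be sketched as follows: standardize the candidate variables, fit a principal component analysis, and inspect the loadings to see which tests drive each retained component. The matrix dimensions and the choice of six components below are placeholders echoing the study design, not its actual data.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)
X = rng.normal(size=(417, 221))                 # patients x candidate variables (placeholder data)
Z = StandardScaler().fit_transform(X)           # standardize before PCA

pca = PCA(n_components=6).fit(Z)                # e.g. six components, one per candidate profile
print("variance explained:", pca.explained_variance_ratio_.sum())

loadings = pca.components_                      # which variables drive each component / profile
top_vars = np.argsort(-np.abs(loadings), axis=1)[:, :5]   # five highest-loading variables per component
print(top_vars)
```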

Results

Demographic data and the results of the tests yielded 221 variables, of which 49 contributed to the classification of dizziness into six diagnostic profiles, which may be named as follows: "frailty", "psychological", "cardiovascular", "presyncope", "non-specific dizziness" and "ENT". Together, these components explained 32% of the variance.

Conclusions

Empirically identified components classify dizziness into six profiles. This classification takes into account the heterogeneity and multicausality of dizziness, may serve as a starting point for research on diagnostic strategies, and can be a first step towards an evidence-based diagnostic approach to dizzy older patients.

9.

Background

Reducing the effects of sequencing errors and PCR artifacts has emerged as an essential component in amplicon-based metagenomic studies. Denoising algorithms have been designed that can reduce error rates in mock community data, but they change the sequence data in a manner that can be inconsistent with the process of removing errors in studies of real communities. In addition, they are limited by the size of the dataset and the sequencing technology used.

Results

FlowClus uses a systematic approach to filter and denoise reads efficiently. When denoising real datasets, FlowClus provides feedback about the process that can be used as the basis to adjust the parameters of the algorithm to suit the particular dataset. When used to analyze a mock community dataset, FlowClus produced a lower error rate compared to other denoising algorithms, while retaining significantly more sequence information. Among its other attributes, FlowClus can analyze longer reads being generated from all stages of 454 sequencing technology, as well as from Ion Torrent. It has processed a large dataset of 2.2 million GS-FLX Titanium reads in twelve hours; using its more efficient (but less precise) trie analysis option, this time was further reduced, to seven minutes.

Conclusions

Many of the amplicon-based metagenomics datasets generated over the last several years have been processed through a denoising pipeline that likely caused deleterious effects on the raw data. By using FlowClus, one can avoid such negative outcomes while maintaining control over the filtering and denoising processes. Because of its efficiency, FlowClus can be used to re-analyze multiple large datasets together, thereby leading to more standardized conclusions. FlowClus is freely available on GitHub (jsh58/FlowClus); it is written in C and supported on Linux.

Electronic supplementary material

The online version of this article (doi:10.1186/s12859-015-0532-1) contains supplementary material, which is available to authorized users.

10.
11.
12.

Background

Diagnosis of soft tissue sarcomas (STS) is challenging. Many remain unclassified (not-otherwise-specified, NOS) or are grouped in controversial categories such as malignant fibrous histiocytoma (MFH), with unclear therapeutic value. We analyzed several independent microarray datasets to identify a predictor, used it to classify unclassifiable sarcomas, and assessed oncogenic pathway activation and chemotherapy response.

Methodology/Principal Findings

We analyzed 5 independent datasets (325 tumor arrays). We developed and validated a predictor, which was used to reclassify MFH and NOS sarcomas. The molecular “match” between MFH and their predicted subtypes was assessed using genome-wide hierarchical clustering and Subclass-Mapping. Findings were validated in 15 paraffin samples profiled on the DASL platform. Bayesian models of oncogenic pathway activation and chemotherapy response were applied to individual STS samples. A 170-gene predictor was developed and independently validated (80-85% accuracy in all datasets). Most MFH and NOS tumors were reclassified as leiomyosarcomas, liposarcomas and fibrosarcomas. “Molecular match” between MFH and their predicted STS subtypes was confirmed both within and across datasets. This classification revealed previously unrecognized tissue differentiation lines (adipocyte, fibroblastic, smooth-muscle) and was reproduced in paraffin specimens. Different sarcoma subtypes demonstrated distinct oncogenic pathway activation patterns, and reclassified MFH tumors shared oncogenic pathway activation patterns with their predicted subtypes. These patterns were associated with predicted resistance to chemotherapeutic agents commonly used in sarcomas.

Conclusions/Significance

STS profiling can aid in diagnosis through a predictor tracking distinct tissue differentiation in unclassified tumors, and in therapeutic management via oncogenic pathway activation and chemotherapy response assessment.

13.
14.
15.
16.
PLoS One 2010, 5(11)

Background

Late-onset Alzheimer's disease (LOAD) is the leading cause of dementia. Recent large genome-wide association studies (GWAS) identified the first strongly supported LOAD susceptibility genes since the discovery of the involvement of APOE in the early 1990s. We have now exploited these GWAS datasets to uncover key LOAD pathophysiological processes.

Methodology

We applied a recently developed tool that mines GWAS data for biologically meaningful information to a LOAD GWAS dataset. The principal findings were then tested in an independent GWAS dataset.

Principal Findings

We found a significant overrepresentation of association signals in pathways related to cholesterol metabolism and the immune response in both of the two largest genome-wide association studies for LOAD.
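The overrepresentation logic behind such pathway analyses can be illustrated with a generic hypergeometric gene-set test: given the genes carrying association signals and a pathway's gene set, it asks whether their overlap is larger than expected by chance. The gene lists below are synthetic placeholders, and dedicated GWAS pathway tools additionally correct for gene size, SNP density and linkage disequilibrium.

```python
from scipy.stats import hypergeom

background = {f"GENE{i}" for i in range(20000)}           # all genes covered by the GWAS (placeholder)
signal_genes = {f"GENE{i}" for i in range(0, 20000, 40)}  # genes carrying association signals
pathway = {f"GENE{i}" for i in range(0, 2000, 10)}        # e.g. a cholesterol-metabolism gene set

N = len(background)                 # population size
K = len(signal_genes)               # "successes" in the population
n = len(pathway & background)       # draws (pathway genes)
k = len(signal_genes & pathway)     # observed overlap

# P(overlap >= k) if K signal genes were drawn at random from the N background genes
p_value = hypergeom.sf(k - 1, N, K, n)
print(k, "overlapping genes, enrichment p =", p_value)
```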

Significance

Processes related to cholesterol metabolism and the innate immune response have previously been implicated by pathological and epidemiological studies of Alzheimer's disease, but it has been unclear whether those findings reflected primary aetiological events or consequences of the disease process. Our independent evidence from two large studies now demonstrates that these processes are aetiologically relevant, and suggests that they may be suitable targets for novel and existing therapeutic approaches.

17.

Background

The widespread popularity of genomic applications is threatened by the “bioinformatics bottleneck” resulting from uncertainty about the cost and infrastructure needed to meet increasing demands for next-generation sequence analysis. Cloud computing services have been discussed as potential new bioinformatics support systems but have not been evaluated thoroughly.

Results

We present benchmark costs and runtimes for common microbial genomics applications, including 16S rRNA analysis, microbial whole-genome shotgun (WGS) sequence assembly and annotation, WGS metagenomics and large-scale BLAST. Sequence dataset types and sizes were selected to correspond to outputs typically generated by small- to midsize facilities equipped with 454 and Illumina platforms, except for WGS metagenomics where sampling of Illumina data was used. Automated analysis pipelines, as implemented in the CloVR virtual machine, were used in order to guarantee transparency, reproducibility and portability across different operating systems, including the commercial Amazon Elastic Compute Cloud (EC2), which was used to attach real dollar costs to each analysis type. We found considerable differences in computational requirements, runtimes and costs associated with different microbial genomics applications. While all 16S analyses completed on a single-CPU desktop in under three hours, microbial genome and metagenome analyses utilized multi-CPU support of up to 120 CPUs on Amazon EC2, where each analysis completed in under 24 hours for less than $60. Representative datasets were used to estimate maximum data throughput on different cluster sizes and to compare costs between EC2 and comparable local grid servers.

Conclusions

Although bioinformatics requirements for microbial genomics depend on dataset characteristics and the analysis protocols applied, our results suggest that smaller sequencing facilities (up to three Roche/454 or one Illumina GAIIx sequencer) engaged in 16S rRNA amplicon sequencing, microbial single-genome and metagenomic WGS projects can achieve cost-efficient bioinformatics support using CloVR in combination with Amazon EC2 as an alternative to local computing centers.

18.

Background

Reaction time (RT) is one of the most widely used measures of performance in experimental psychology, yet relatively few fMRI studies have included trial-by-trial differences in RT as a predictor variable in their analyses. Using a multi-study approach, we investigated whether there are brain regions that show a general relationship between trial-by-trial RT variability and activation across a range of cognitive tasks.

Methodology/Principal Findings

The relation between trial-by-trial differences in RT and brain activation was modeled in five different fMRI datasets spanning a range of experimental tasks and stimulus modalities. Three main findings were identified. First, in a widely distributed set of gray and white matter regions, activation was delayed on trials with long RTs relative to short RTs, suggesting delayed initiation of underlying physiological processes. Second, in lateral and medial frontal regions, activation showed a “time-on-task” effect, increasing linearly as a function of RT. Finally, RT variability reliably modulated the BOLD signal not only in gray matter but also in diffuse regions of white matter.
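A bare-bones sketch of the modelling choice examined here is shown below: alongside a constant-amplitude trial regressor, a parametric regressor whose trial amplitudes are the mean-centred reaction times is built, convolved with a canonical HRF, and entered into the GLM design matrix. The HRF shape, TR and timing values are generic illustrations, not the study's designs.

```python
import numpy as np
from scipy.stats import gamma

def canonical_hrf(tr, duration=32.0):
    """A simple double-gamma approximation of the canonical HRF."""
    t = np.arange(0, duration, tr)
    hrf = gamma.pdf(t, 6) - 0.35 * gamma.pdf(t, 16)
    return hrf / hrf.max()

tr, n_scans = 2.0, 300
onsets = np.arange(10, 580, 12.0)                       # trial onsets in seconds (illustrative)
rts = np.random.default_rng(5).uniform(0.4, 1.2, len(onsets))  # trial-by-trial reaction times

stick_task = np.zeros(n_scans)
stick_rt = np.zeros(n_scans)
idx = (onsets / tr).astype(int)
stick_task[idx] = 1.0
stick_rt[idx] = rts - rts.mean()                        # mean-centred RT modulation

hrf = canonical_hrf(tr)
reg_task = np.convolve(stick_task, hrf)[:n_scans]       # constant-amplitude trial regressor
reg_rt = np.convolve(stick_rt, hrf)[:n_scans]           # parametric "time-on-task" regressor
X = np.column_stack([np.ones(n_scans), reg_task, reg_rt])   # GLM design matrix
```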

Conclusions/Significance

The results highlight the importance of modeling trial-by-trial RT in fMRI analyses and raise the possibility that RT variability may provide a powerful probe for investigating the previously elusive white matter BOLD signal.

19.

Background

A major challenge in the analysis of large and complex biomedical data is to develop an approach for 1) identifying distinct subgroups in the sampled populations, 2) characterizing the relationships among those subgroups, and 3) developing a prediction model that classifies the subgroup membership of new samples by finding a set of predictors. Each subgroup can represent different pathogen serotypes of microorganisms, different tumor subtypes in cancer patients, or different genetic makeups of patients related to treatment response.

Methods

This paper proposes a composite model for subgroup identification and prediction using biclusters. A biclustering technique is first used to identify a set of biclusters from the sampled data. For each bicluster, a subgroup-specific binary classifier is built to determine whether a particular sample lies inside or outside the bicluster. A composite model, which consists of all the binary classifiers, is constructed to classify samples into several disjoint subgroups. The proposed composite model depends neither on any specific biclustering algorithm or pattern of biclusters, nor on any particular classification algorithm.
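A hedged sketch of the composite-model idea is given below: each bicluster yields one binary "inside vs outside" classifier trained on that bicluster's columns, and a new sample is assigned to the subgroup whose classifier is most confident. Logistic regression and hand-made synthetic biclusters are used purely for illustration; as the text notes, any biclustering method and any binary classifier could be substituted.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(6)
X = rng.normal(size=(300, 50))
# pretend a biclustering step returned these (rows, columns) index sets
biclusters = [(np.arange(0, 100), np.arange(0, 10)),
              (np.arange(100, 200), np.arange(10, 25)),
              (np.arange(200, 300), np.arange(25, 40))]
for rows, cols in biclusters:
    X[np.ix_(rows, cols)] += 2.0                        # inject subgroup structure into the toy data

# one binary "inside vs outside the bicluster" classifier per bicluster
models = []
for rows, cols in biclusters:
    inside = np.zeros(len(X), dtype=int)
    inside[rows] = 1
    models.append((cols, LogisticRegression(max_iter=1000).fit(X[:, cols], inside)))

def predict_subgroup(sample):
    """Composite prediction: the bicluster whose classifier is most confident."""
    scores = [m.predict_proba(sample[cols].reshape(1, -1))[0, 1] for cols, m in models]
    return int(np.argmax(scores))

print(predict_subgroup(X[5]), predict_subgroup(X[150]), predict_subgroup(X[250]))
```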

Results

The composite model achieved an overall accuracy of 97.4% on a synthetic dataset consisting of four subgroups. The model was applied to two datasets in which the samples' subgroup memberships were known. The procedure showed 83.7% accuracy in discriminating lung adenocarcinoma and squamous carcinoma subtypes, and identified five serotypes and several subtypes with about 94% accuracy in a pathogen dataset.

Conclusion

The composite model presents a novel approach to developing a biclustering-based classification model from unlabeled sampled data. The proposed approach combines unsupervised biclustering and supervised classification techniques to classify samples into disjoint subgroups based on their associated attributes, such as genotypic factors, phenotypic outcomes, efficacy/safety measures, or responses to treatments. The procedure is useful for identification of unknown species or new biomarkers for targeted therapy.

20.
Xu P, Yang P, Lei X, Yao D. PLoS One 2011, 6(1): e14634

Background

There is growing interest in signal processing and machine learning methods that may make the brain-computer interface (BCI) a new communication channel. A variety of classification methods have been used to convert brain information into control commands. However, most of these methods produce only uncalibrated values and uncertain results.

Methodology/Principal Findings

In this study, we present a probabilistic method, "enhanced BLDA" (EBLDA), for multi-class motor imagery BCI, which uses Bayesian linear discriminant analysis (BLDA) with probabilistic output to improve classification performance. EBLDA builds a new classifier by enlarging the training dataset with test samples classified with high probability. EBLDA is based on the hypothesis that unlabeled samples classified with high probability provide valuable information to enhance the learning process and generate a classifier with refined decision boundaries. To investigate the performance of EBLDA, we first used carefully designed simulated datasets to study how EBLDA works. Then, we adopted a real BCI dataset for further evaluation. The current study shows that: 1) probabilistic information can improve the performance of BCI for subjects with high kappa coefficients; 2) with supplementary training samples drawn from the high-probability test samples, EBLDA is significantly better than BLDA in classification, especially for small training datasets, for which EBLDA can obtain a refined decision boundary by shifting the BLDA decision boundary with the support of information from the test samples.
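The self-training idea behind EBLDA can be sketched as follows: fit a probabilistic discriminant classifier on the small training set, move the most confidently classified test trials (with their predicted labels) into the training set, and refit. Ordinary LDA from scikit-learn stands in for Bayesian LDA here, the 0.95 confidence threshold is arbitrary, and the data are synthetic stand-ins for motor-imagery features.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(7)
n_class, n_feat = 4, 20
centers = rng.normal(0, 3, size=(n_class, n_feat))

def make(n_per_class):
    """Synthetic multi-class 'motor imagery' features: Gaussian blobs around class centers."""
    X = np.vstack([centers[c] + rng.normal(size=(n_per_class, n_feat)) for c in range(n_class)])
    y = np.repeat(np.arange(n_class), n_per_class)
    return X, y

X_tr, y_tr = make(15)            # deliberately small training set
X_te, y_te = make(60)

base = LinearDiscriminantAnalysis().fit(X_tr, y_tr)       # plain LDA as a stand-in for BLDA

proba = base.predict_proba(X_te)
confident = proba.max(axis=1) > 0.95                       # high-probability test trials
X_aug = np.vstack([X_tr, X_te[confident]])
y_aug = np.concatenate([y_tr, proba[confident].argmax(axis=1)])   # pseudo-labels for added trials

enhanced = LinearDiscriminantAnalysis().fit(X_aug, y_aug)  # "EBLDA-like" refit on enlarged set
print("BLDA-like acc:", (base.predict(X_te) == y_te).mean(),
      "EBLDA-like acc:", (enhanced.predict(X_te) == y_te).mean())
```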

Conclusions/Significance

The proposed EBLDA could potentially reduce training effort and is therefore valuable for realizing an effective online BCI system, especially for multi-class BCI systems.
