Similar articles
 20 similar articles found (search time: 31 ms)
1.
In personalized medicine, biomarkers are used to select therapies with the highest likelihood of success based on an individual patient’s biomarker/genomic profile. Two goals are to choose important biomarkers that accurately predict treatment outcomes and to cull unimportant biomarkers to reduce the cost of biological and clinical verifications. These goals are challenging due to the high dimensionality of genomic data. Variable selection methods based on penalized regression (e.g., the lasso and elastic net) have yielded promising results. However, selecting the right amount of penalization is critical to simultaneously achieving these two goals. Standard approaches based on cross-validation (CV) typically provide high prediction accuracy with high true positive rates (TPRs) but at the cost of too many false positives. Alternatively, stability selection (SS) controls the number of false positives, but at the cost of yielding too few true positives. To circumvent these issues, we propose prediction-oriented marker selection (PROMISE), which combines SS with CV to retain the advantages of both methods. Our application of PROMISE with the lasso and elastic net in data analysis shows that, compared to CV, PROMISE produces sparse solutions, few false positives, and small type I + type II error, and maintains good prediction accuracy, with a marginal decrease in the TPRs. Compared to SS, PROMISE offers better prediction accuracy and TPRs. In summary, PROMISE can be applied in many fields to select regularization parameters when the goals are to minimize false positives and maximize prediction accuracy.

2.

Background

The continuous flow of EST data remains one of the richest sources for discoveries in modern biology. The first step in EST data mining is usually associated with EST clustering, the process of grouping of original fragments according to their annotation, similarity to known genomic DNA or each other. Clustered EST data, accumulated in databases such as UniGene, STACK and TIGR Gene Indices have proven to be crucial in research areas from gene discovery to regulation of gene expression.

Results

We have developed a new nucleotide sequence matching algorithm and its implementation for clustering EST sequences. The program is based on the original CLU match detection algorithm, which has improved performance over the widely used d2_cluster. The CLU algorithm automatically ignores low-complexity regions like poly-tracts and short tandem repeats.

Conclusion

CLU represents a new generation of EST clustering algorithm with improved performance over current approaches. An early implementation can be applied in small and medium-size projects. The CLU program is available on an open source basis free of charge. It can be downloaded from http://compbio.pbrc.edu/pti

3.

Background

Horizontal gene transfer plays an important role in evolution because it sometimes allows recipient lineages to adapt to new ecological niches. High gene transfer frequencies were inferred for prokaryotic and early eukaryotic evolution. Does horizontal gene transfer also impact phylogenetic reconstruction of the evolutionary history of genomes and organisms? The answer to this question depends at least in part on the actual gene transfer frequencies and on the ability to weed out transferred genes from further analyses. Are the detected transfers mainly false positives, or are they the tip of an iceberg of many transfer events, most of which go undetected by current methods?

Results

Phylogenetic detection methods appear to be the method of choice to infer gene transfers, especially for ancient transfers and those followed by orthologous replacement. Here we explore how well some of these methods perform using in silico transfers between the terminal branches of a gamma proteobacterial, genome based phylogeny. For the experiments performed here on average the AU test at a 5% significance level detects 90.3% of the transfers and 91% of the exchanges as significant. Using the Robinson-Foulds distance only 57.7% of the exchanges and 60% of the donations were identified as significant. Analyses using bipartition spectra appeared most successful in our test case. The power of detection was on average 97% using a 70% cut-off and 94.2% with 90% cut-off for identifying conflicting bipartitions, while the rate of false positives was below 4.2% and 2.1% for the two cut-offs, respectively. For all methods the detection rates improved when more intervening branches separated donor and recipient.
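As an illustration of one of the measures compared above, the Robinson-Foulds distance can be sketched as the number of bipartitions present in exactly one of two trees. The five taxa and both toy trees below are invented for illustration, trees are given directly as sets of non-trivial bipartitions rather than parsed from files, and some definitions halve the count; this is not the authors' pipeline.

```python
from typing import FrozenSet, Iterable, Set

# A bipartition is an unordered pair of taxon sets (the two sides of a branch).
Bipartition = FrozenSet[FrozenSet[str]]


def bipartition(side_a: Iterable[str], side_b: Iterable[str]) -> Bipartition:
    """Represent a split of the taxa as an unordered pair of taxon sets."""
    return frozenset({frozenset(side_a), frozenset(side_b)})


def robinson_foulds(tree_1: Set[Bipartition], tree_2: Set[Bipartition]) -> int:
    """Count bipartitions found in exactly one of the two trees."""
    return len(tree_1 ^ tree_2)


# Hypothetical reference tree over taxa A-E (non-trivial splits only).
reference_tree = {
    bipartition("AB", "CDE"),
    bipartition("ABC", "DE"),
}
# Hypothetical gene tree after a simulated transfer regroups A with C.
transfer_tree = {
    bipartition("AC", "BDE"),
    bipartition("ABC", "DE"),
}

distance = robinson_foulds(reference_tree, transfer_tree)
print(distance)  # 2
```

A distance of zero means the two trees share every bipartition; each conflicting bipartition in either tree adds one to the count.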

Conclusion

Rates of detected transfers should not be mistaken for the actual transfer rates; most analyses of gene transfers remain anecdotal. The method and significance level to identify potential gene transfer events represent a trade-off between the frequency of erroneous identification (false positives) and the power to detect actual transfer events.

4.
Effects of filtering by Present call on analysis of microarray experiments   Cited by: 1 (self-citations: 0, citations by others: 1)

Background

Affymetrix GeneChips® are widely used for expression profiling of tens of thousands of genes. The large number of comparisons can lead to false positives. Various methods have been used to reduce false positives, but they have rarely been compared or quantitatively evaluated. Here we describe and evaluate a simple method that uses the detection (Present/Absent) call generated by the Affymetrix microarray suite version 5 software (MAS5) to remove data that is not reliably detected before further analysis, and compare this with filtering by expression level. We explore the effects of various thresholds for removing data in experiments of different size (from 3 to 10 arrays per treatment), as well as their relative power to detect significant differences in expression.

Results

Our approach sets a threshold for the fraction of arrays called Present in at least one treatment group. This method removes a large percentage of probe sets called Absent before carrying out the comparisons, while retaining most of the probe sets called Present. It preferentially retains the more significant probe sets (p ≤ 0.001) and those probe sets that are turned on or off, and improves the false discovery rate. Permutations to estimate false positives indicate that probe sets removed by the filter contribute a disproportionate number of false positives. Filtering by fraction Present is effective when applied to data generated either by the MAS5 algorithm or by other probe-level algorithms, for example RMA (robust multichip average). Experiment size greatly affects the ability to reproducibly detect significant differences, and also impacts the effect of filtering; smaller experiments (3–5 samples per treatment group) benefit from more restrictive filtering (≥50% Present).
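The filtering rule described above can be sketched as follows. This is a minimal illustration, assuming detection calls are stored as the single letters "P", "M" and "A", and that Marginal calls do not count as Present; the probe set names and group labels are hypothetical, and the 0.5 default mirrors the ≥50% Present setting mentioned for small experiments.

```python
from typing import Dict, List


def fraction_present(calls: List[str]) -> float:
    """Fraction of arrays in one treatment group with a Present ('P') call."""
    return sum(call == "P" for call in calls) / len(calls)


def passes_filter(calls_by_group: Dict[str, List[str]], threshold: float = 0.5) -> bool:
    """Keep a probe set if at least one treatment group meets the threshold."""
    return any(
        fraction_present(calls) >= threshold for calls in calls_by_group.values()
    )


# Hypothetical detection calls for two probe sets, three arrays per group.
probe_sets = {
    "probe_a": {"control": ["A", "A", "A"], "treated": ["P", "P", "A"]},
    "probe_b": {"control": ["A", "A", "A"], "treated": ["A", "P", "A"]},
}

kept = [name for name, groups in probe_sets.items() if passes_filter(groups)]
print(kept)  # ['probe_a']
```

Requiring the threshold in only one group is what lets probe sets that are turned on or off between treatments survive the filter.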

Conclusion

Use of a threshold fraction of Present detection calls (derived by MAS5) provided a simple method that effectively eliminated from analysis probe sets that are unlikely to be reliable while preserving the most significant probe sets and those turned on or off; it thereby increased the ratio of true positives to false positives.

5.

Background

In recent years real-time PCR has become a leading technique for nucleic acid detection and quantification. These assays have the potential to greatly enhance efficiency in the clinical laboratory. Choice of primer and probe sequences is critical for accurate diagnosis in the clinic, yet current primer/probe signature design strategies are limited, and signature evaluation methods are lacking.

Methods

We assessed the quality of a signature by predicting the number of true positive, false positive and false negative hits against all available public sequence data. We found real-time PCR signatures described in recent literature and used a BLAST search based approach to collect all hits to the primer-probe combinations that should be amplified by real-time PCR chemistry. We then compared our hits with the sequences in the NCBI taxonomy tree that the signature was designed to detect.
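The hit-counting step can be sketched as a comparison between two sets of sequence identifiers. This is only an illustration: the identifiers are made up, the BLAST search itself is not shown, and because true negatives are not among the counts listed above, the sketch reports precision next to sensitivity instead of specificity.

```python
from typing import Dict, Set


def signature_metrics(hits: Set[str], targets: Set[str]) -> Dict[str, float]:
    """Compare predicted hits with the sequences a signature should detect."""
    true_pos = len(hits & targets)    # intended targets that were hit
    false_pos = len(hits - targets)   # hits outside the intended targets
    false_neg = len(targets - hits)   # intended targets that were missed
    sensitivity = true_pos / (true_pos + false_neg) if targets else 0.0
    precision = true_pos / (true_pos + false_pos) if hits else 0.0
    return {
        "true_pos": true_pos,
        "false_pos": false_pos,
        "false_neg": false_neg,
        "sensitivity": sensitivity,
        "precision": precision,
    }


# Hypothetical example: the signature hits two of four target strains
# and nothing outside the target group.
metrics = signature_metrics(
    hits={"strain_1", "strain_2"},
    targets={"strain_1", "strain_2", "strain_3", "strain_4"},
)
print(metrics["sensitivity"], metrics["precision"])  # 0.5 1.0
```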

Results

We found that many published signatures have high specificity (almost no false positives) but low sensitivity (high false negative rate). Where high sensitivity is needed, we offer a revised methodology for signature design which may designate that multiple signatures are required to detect all sequenced strains. We use this methodology to produce new signatures that are predicted to have higher sensitivity and specificity.

Conclusion

We show that current methods for real-time PCR assay design have unacceptably low sensitivities for most clinical applications. Additionally, as new sequence data becomes available, old assays must be reassessed and redesigned. A standard protocol for both generating and assessing the quality of these assays is therefore of great value. Real-time PCR has the capacity to greatly improve clinical diagnostics. The improved assay design and evaluation methods presented herein will expedite adoption of this technique in the clinical lab.

6.

Background

In many eukaryotes, microRNAs (miRNAs) bind to complementary sites in the 3'-untranslated regions (3'-UTRs) of target messenger RNAs (mRNAs) and regulate their expression at the stage of translation. Recent studies have revealed that many miRNAs are evolutionarily conserved; however, the evolution of their target genes has yet to be systematically characterized. We sought to elucidate a set of conserved miRNA/target-gene pairs and to analyse the mechanism underlying miRNA-mediated gene regulation in the early stage of bilaterian evolution.

Results

Initially, we extracted five evolutionarily conserved miRNAs (let-7, miR-1, miR-124, miR-125/lin-4, and miR-34) among five diverse bilaterian animals. Subsequently, we designed a procedure to predict evolutionarily conserved miRNA/target-gene pairs by introducing orthologous gene information. As a result, we extracted 31 orthologous miRNA/target-gene pairs that were conserved among at least four diverse bilaterian animals; the prediction set showed prominent enrichment of orthologous miRNA/target-gene pairs that were verified experimentally. Approximately 84% of the target genes were regulated by three miRNAs (let-7, miR-1, and miR-124) and their function was classified mainly into the following categories: development, muscle formation, cell adhesion, and gene regulation. We used a reporter gene assay to experimentally verify the downregulation of six candidate pairs (out of six tested pairs) in HeLa cells.

Conclusions

The application of our new method enables the identification of 31 miRNA/target-gene pairs that were expected to have been regulated from the era of the common bilaterian ancestor. The downregulation of all six candidate pairs suggests that orthologous information contributed to the elucidation of the primordial set of genes that has been regulated by miRNAs; it was also an efficient tool for the elimination of false positives from the predicted candidates. In conclusion, our study identified potentially important miRNA-target pairs that were evolutionarily conserved throughout diverse bilaterian animals and that may provide new insights into early-stage miRNA functions.

7.

Background

Nasal gene expression profiling is a promising method to characterize COPD non-invasively. We aimed to identify a nasal gene expression profile to distinguish COPD patients from healthy controls. We investigated whether this COPD-associated gene expression profile in nasal epithelium is comparable with the profile observed in bronchial epithelium.

Methods

Genome wide gene expression analysis was performed on nasal epithelial brushes of 31 severe COPD patients and 22 controls, all current smokers, using Affymetrix Human Gene 1.0 ST Arrays. We repeated the gene expression analysis on bronchial epithelial brushes in 2 independent cohorts of mild-to-moderate COPD patients and controls.

Results

In nasal epithelium, 135 genes were significantly differentially expressed between severe COPD patients and controls, 21 being up- and 114 downregulated in COPD (false discovery rate < 0.01). Gene Set Enrichment Analysis (GSEA) showed significant concordant enrichment of COPD-associated nasal and bronchial gene expression in both independent cohorts (GSEA FDR < 0.001).

Conclusion

We identified a nasal gene expression profile that differentiates severe COPD patients from controls. Of interest, part of the nasal gene expression changes in COPD mimics differentially expressed genes in the bronchus. These findings indicate that nasal gene expression profiling is potentially useful as a non-invasive biomarker in COPD.

Trial registration

ClinicalTrials.gov registration number NCT01351792 (registration date May 10, 2011), ClinicalTrials.gov registration number NCT00848406 (registration date February 19, 2009), ClinicalTrials.gov registration number NCT00807469 (registration date December 11, 2008).

8.
9.
“Thermincola potens” strain JR is one of the first Gram-positive dissimilatory metal-reducing bacteria (DMRB) for which there is a complete genome sequence. Consistent with the physiology of this organism, preliminary annotation revealed an abundance of multiheme c-type cytochromes that are putatively associated with the periplasm and cell surface in a Gram-positive bacterium. Here we report the complete genome sequence of strain JR.

“Thermincola potens” strain JR, a Gram-positive anaerobe isolated from a thermophilic microbial fuel cell (MFC), constituted a dominant member of the current-producing bacterial community (10). Strain JR is a Thermincola species in the phylum Firmicutes belonging to the family Peptococcaceae in the order Clostridiales. It shares 99% 16S rRNA gene sequence identity with the two known members of the Thermincola genus, T. carboxydiphila and T. ferriacetica (8, 12). This strain coupled acetate oxidation to reduction of the insoluble electron acceptors MFC anodes and hydrous ferric oxide (HFO) (10). Strain JR is also capable of growth with CO as the sole electron donor and carbon source.

This member of the Firmicutes is the first MFC isolate and Thermincola species to have its genome sequenced and is one of only a few bacteria in the Peptococcaceae to have its genome sequenced (5, 11). Genomic analysis will aid elucidation of electron transfer mechanisms by strain JR, contributing to the knowledge of extracellular respiration by Gram-positive bacteria. By comparing these mechanisms to those in Gram-negative organisms, the conserved and disparate aspects of this seminal metabolism can be identified. This will include analysis of the c-type cytochrome gene makeup of the genome, especially the increased number of proteins with double heme (CXXCH) motifs and multiple heme binding domains compared to the nearest phylogenetic neighbors with sequenced genomes (4, 6, 7). c-type cytochromes are essential for the reduction of insoluble electron acceptors by model Gram-negative bacteria, such as Geobacter or Shewanella species (3, 9); however, their role in Gram-positive mineral respiration is still unknown.

Joint Genome Institute (JGI) sequencing used a combination of 454 and Illumina techniques with 27× coverage. All library construction and sequencing techniques are available at http://www.jgi.doe.gov/. Illumina reads were assembled into 121 contigs using Velvet 0.7.1.18 (13) and shredded into 1-kb pseudoreads (with 100-bp overlap). The pseudoreads were incorporated into a hybrid 454/Illumina assembly using the parallel Phrap assembler (CodonCode Corporation, Dedham, MA) (1, 2). Misassemblies were corrected with Dupfinisher (C. S. Han and P. Chain, presented at the 2006 International Conference on Bioinformatics and Computational Biology). Gene modeling was performed using Prodigal (http://prodigal.ornl.gov/), and resulting protein translations were assigned by comparisons to Pfam, KEGG, and COGs databases using BLASTP or HMMER. The complete genome was a single circular chromosome of approximately 3,036,819 bp with an average G+C content of 45.9%. A total of 2,963 protein-encoding genes were predicted, and 393 (6.9%) had no similarity to public database sequences.
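The shredding of assembled contigs into 1-kb pseudoreads with 100-bp overlap can be pictured with the short sketch below. It is not the JGI pipeline: the function name is hypothetical, and keeping a shorter final pseudoread at the end of a contig is an assumption, since the text does not say how contig ends were handled.

```python
from typing import List


def shred_contig(contig: str, read_len: int = 1000, overlap: int = 100) -> List[str]:
    """Cut a contig into fixed-length pseudoreads that overlap their neighbours."""
    if not contig:
        return []
    step = read_len - overlap  # distance between consecutive pseudoread starts
    reads = []
    start = 0
    while True:
        reads.append(contig[start:start + read_len])
        if start + read_len >= len(contig):
            break  # this pseudoread reached the end of the contig
        start += step
    return reads


# Hypothetical 2,500-bp contig.
contig = "ACGT" * 625
pseudoreads = shred_contig(contig)
print([len(read) for read in pseudoreads])  # [1000, 1000, 700]
```

Consecutive pseudoreads share their last and first 100 bases, which gives the assembler enough overlap to rejoin them alongside the 454 reads.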

10.
Automatic annotation of eukaryotic genes, pseudogenes and promoters   Cited by: 1 (self-citations: 0, citations by others: 1)

11.

Background

Tandem affinity purification coupled with mass spectrometry (TAP/MS) analysis is a popular method for the large-scale identification of novel endogenous protein-protein interactions (PPIs). Computational analysis of TAP/MS data is a critical step, particularly for high-throughput datasets, yet it remains challenging due to the noisy nature of TAP/MS data.

Results

We investigated several major TAP/MS data analysis methods for identifying PPIs, and developed an advanced method, which incorporates an improved statistical method to filter out false positives from the negative controls. Our method is named PPIRank, which stands for PPI ranking in TAP/MS data. We compared PPIRank with several other existing methods in analyzing two pathway-specific TAP/MS PPI datasets from Drosophila.

Conclusion

Experimental results show that PPIRank is more capable than other approaches in terms of identifying known interactions collected in the BioGRID PPI database. Specifically, PPIRank is able to capture more true interactions and simultaneously fewer false positives in both the Insulin and Hippo pathways of Drosophila melanogaster.

12.

Background

Peptides derived from endogenous antigens can bind to MHC class I molecules. Those which bind with high affinity can invoke a CD8+ immune response, resulting in the destruction of infected cells. Much work in immunoinformatics has involved the algorithmic prediction of peptide binding affinity to various MHC-I alleles. A number of tools for MHC-I binding prediction have been developed, many of which are available on the web.

Results

We hypothesize that peptides predicted by a number of tools are more likely to bind than those predicted by just one tool, and that the likelihood of a particular peptide being a binder is related to the number of tools that predict it, as well as the accuracy of those tools. To this end, we have built and tested a heuristic-based method of making MHC-binding predictions by combining the results from multiple tools. The predictive performance of each individual tool is first ascertained. These performance data are used to derive weights such that the predictions of tools with better accuracy are given greater credence. The combined tool was evaluated using ten-fold cross-validation and was found to significantly outperform the individual tools when a high specificity threshold is used. It performs comparably well to the best-performing individual tools at lower specificity thresholds. Finally, it also outperforms the combination of the tools resulting from linear discriminant analysis.
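One way to picture the weighting idea is an accuracy-weighted vote. The abstract does not give the actual weighting formula, so the sketch below simply assumes each tool's weight equals its measured accuracy; the tool names and accuracy values are hypothetical.

```python
from typing import Dict


def combined_score(predictions: Dict[str, bool], accuracy: Dict[str, float]) -> float:
    """Accuracy-weighted fraction of tools that predict the peptide as a binder."""
    total_weight = sum(accuracy[tool] for tool in predictions)
    binder_weight = sum(accuracy[tool] for tool, is_binder in predictions.items() if is_binder)
    return binder_weight / total_weight


# Hypothetical accuracies measured beforehand for three prediction tools.
accuracy = {"tool_a": 0.9, "tool_b": 0.6, "tool_c": 0.5}
# Hypothetical predictions for one peptide.
predictions = {"tool_a": True, "tool_b": False, "tool_c": True}

score = combined_score(predictions, accuracy)
print(round(score, 2))  # 0.7
```

Raising the cut-off applied to this score trades sensitivity for specificity, which is the regime in which the combined method is reported to do best.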

Conclusion

A heuristic-based method of combining the results of the individual tools better facilitates the scanning of large proteomes for potential epitopes, yielding more actual high-affinity binders while reporting very few false positives.

13.

Background

One of the major goals in gene and protein expression profiling of cancer is to identify biomarkers and build classification models for prediction of disease prognosis or treatment response. Many traditional statistical methods, based on microarray gene expression data alone and individual genes' discriminatory power, often fail to identify biologically meaningful biomarkers thus resulting in poor prediction performance across data sets. Nonetheless, the variables in multivariable classifiers should synergistically interact to produce more effective classifiers than individual biomarkers.

Results

We developed an integrated approach, namely network-constrained support vector machine (netSVM), for cancer biomarker identification with an improved prediction performance. The netSVM approach is specifically designed for network biomarker identification by integrating gene expression data and protein-protein interaction data. We first evaluated the effectiveness of netSVM using simulation studies, demonstrating its improved performance over state-of-the-art network-based methods and gene-based methods for network biomarker identification. We then applied the netSVM approach to two breast cancer data sets to identify prognostic signatures for prediction of breast cancer metastasis. The experimental results show that: (1) network biomarkers identified by netSVM are highly enriched in biological pathways associated with cancer progression; (2) prediction performance is much improved when tested across different data sets. Specifically, many genes related to apoptosis, cell cycle, and cell proliferation, which are hallmark signatures of breast cancer metastasis, were identified by the netSVM approach. More importantly, several novel hub genes, biologically important with many interactions in PPI network but often showing little change in expression as compared with their downstream genes, were also identified as network biomarkers; the genes were enriched in signaling pathways such as TGF-beta signaling pathway, MAPK signaling pathway, and JAK-STAT signaling pathway. These signaling pathways may provide new insight to the underlying mechanism of breast cancer metastasis.

Conclusions

We have developed a network-based approach for cancer biomarker identification, netSVM, resulting in an improved prediction performance with network biomarkers. We have applied the netSVM approach to breast cancer gene expression data to predict metastasis in patients. Network biomarkers identified by netSVM reveal potential signaling pathways associated with breast cancer metastasis, and help improve the prediction performance across independent data sets.

14.
Despite significant advances in automated nuclear magnetic resonance-based protein structure determination, the high numbers of false positives and false negatives among the peaks selected by fully automated methods remain a problem. These false positives and negatives impair the performance of resonance assignment methods. One of the main reasons for this problem is that the computational research community often considers peak picking and resonance assignment to be two separate problems, whereas spectroscopists use expert knowledge to pick peaks and assign their resonances at the same time. We propose a novel framework that simultaneously conducts slice picking and spin system forming, an essential step in resonance assignment. Our framework then employs a genetic algorithm, directed by both connectivity information and amino acid typing information from the spin systems, to assign the spin systems to residues. The inputs to our framework can be as few as two commonly used spectra, i.e., CBCA(CO)NH and HNCACB. Unlike existing peak picking and resonance assignment methods that treat peaks as the units, our method is based on ‘slices’, which are one-dimensional vectors in three-dimensional spectra that correspond to certain (N, H) values. Experimental results on both benchmark simulated data sets and four real protein data sets demonstrate that our method significantly outperforms the state-of-the-art methods while using fewer spectra than those methods. Our method is freely available at http://sfb.kaust.edu.sa/Pages/Software.aspx.

15.

Background

Since the initial publication of its complete genome sequence, Arabidopsis thaliana has become more important than ever as a model for plant research. However, the initial genome annotation was submitted by multiple centers using inconsistent methods, making the data difficult to use for many applications.

Results

Over the course of three years, TIGR has completed its effort to standardize the structural and functional annotation of the Arabidopsis genome. Using both manual and automated methods, Arabidopsis gene structures were refined and gene products were renamed and assigned to Gene Ontology categories. We present an overview of the methods employed, tools developed, and protocols followed, summarizing the contents of each data release with special emphasis on our final annotation release (version 5).

Conclusion

Over the entire period, several thousand new genes and pseudogenes were added to the annotation. Approximately one third of the originally annotated gene models were significantly refined yielding improved gene structure annotations, and every protein-coding gene was manually inspected and classified using Gene Ontology terms.

16.

Background

Host-associated microbial communities have important roles in tissue homeostasis and overall health. Severe perturbations can occur within these microbial communities during critical illness due to underlying diseases and clinical interventions, potentially influencing patient outcomes. We sought to profile the microbial composition of critically ill mechanically ventilated patients, and to determine whether microbial diversity is associated with illness severity and mortality.

Methods

We conducted a prospective, observational study of mechanically ventilated critically ill patients with a high incidence of pneumonia in 2 intensive care units (ICUs) in Hamilton, Canada, nested within a randomized trial for the prevention of healthcare-associated infections. The microbial profiles of specimens from 3 anatomical sites (respiratory, and upper and lower gastrointestinal tracts) were characterized using 16S ribosomal RNA gene sequencing.

Results

We collected 65 specimens from 34 ICU patients enrolled in the trial (29 endotracheal aspirates, 26 gastric aspirates and 10 stool specimens). Specimens were collected at a median time of 3 days (lower respiratory tract and gastric aspirates; interquartile range [IQR] 2–4) and 6 days (stool; IQR 4.25–6.75) following ICU admission. We observed a loss of biogeographical distinction between the lower respiratory tract and gastrointestinal tract microbiota during critical illness. Moreover, microbial diversity in the respiratory tract was inversely correlated with APACHE II score (r = -0.46, p = 0.013) and was associated with hospital mortality (median Shannon index: discharged alive, 1.964 vs. deceased, 1.348; p = 0.045).
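For readers unfamiliar with the diversity measure reported above, the Shannon index of a specimen can be computed from its taxon counts as shown in this sketch. The natural logarithm is assumed because the abstract does not state the base, and both example communities are invented rather than taken from the study.

```python
import math
from typing import Sequence


def shannon_index(counts: Sequence[int]) -> float:
    """Shannon diversity H = -sum(p_i * ln p_i) over taxa with non-zero counts."""
    total = sum(counts)
    proportions = [count / total for count in counts if count > 0]
    return -sum(p * math.log(p) for p in proportions)


# Hypothetical read counts per taxon in two specimens.
even_community = [25, 25, 25, 25]    # four equally abundant taxa
dominated_community = [97, 1, 1, 1]  # one taxon dominates

print(shannon_index(even_community) > shannon_index(dominated_community))  # True
```

A community dominated by a single taxon scores close to zero, which matches the direction of the lower median index reported for patients who died.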

Conclusions

The composition of the host-associated microbial communities is severely perturbed during critical illness. Reduced microbial diversity reflects high illness severity and is associated with mortality. Microbial diversity may be a biomarker of prognostic value in mechanically ventilated patients.

Trial registration

ClinicalTrials.gov ID NCT01782755. Registered February 4 2013.

17.
18.

Background

A recent analysis of protein sequences deposited in the NCBI RefSeq database indicates that ~8.5 million protein sequences are encoded in prokaryotic and eukaryotic genomes, where ~30% are explicitly annotated as "hypothetical" or "uncharacterized" protein. Our Comparison of Protein Active-Site Structures (CPASS v.2) database and software compares the sequence and structural characteristics of experimentally determined ligand binding sites to infer a functional relationship in the absence of global sequence or structure similarity. CPASS is an important component of our Functional Annotation Screening Technology by NMR (FAST-NMR) protocol and has been successfully applied to aid the annotation of a number of proteins of unknown function.

Findings

We report a major upgrade to our CPASS software and database that significantly improves its broad utility. CPASS v.2 is designed with a layered architecture to increase flexibility and portability that also enables job distribution over the Open Science Grid (OSG) to increase speed. Similarly, the CPASS interface was enhanced to provide more user flexibility in submitting a CPASS query. CPASS v.2 now allows for both automatic and manual definition of ligand-binding sites and permits pair-wise, one versus all, one versus list, or list versus list comparisons. Solvent accessible surface area, ligand root-mean square difference, and Cβ distances have been incorporated into the CPASS similarity function to improve the quality of the results. The CPASS database has also been updated.

Conclusions

CPASS v.2 is more than an order of magnitude faster than the original implementation, and allows for multiple simultaneous job submissions. Similarly, the CPASS database of ligand-defined binding sites has increased in size by ~38%, dramatically increasing the likelihood of a positive search result. The modification to the CPASS similarity function is effective in reducing CPASS similarity scores for false positives by ~30%, while leaving true positives unaffected. Importantly, receiver operating characteristic (ROC) curves demonstrate the high correlation between CPASS similarity scores and an accurate functional assignment. As indicated by distribution curves, scores ≥ 30% indicate a functional similarity. Software URL: http://cpass.unl.edu.

19.
20.

Background

Existing software for quantitative trait mapping is either not able to model polygenic variation or does not allow incorporation of more than one genetic variance component. Improperly modeling the genetic relatedness among subjects can result in excessive false positives. We have developed an R package, QTLRel, to enable more flexible modeling of genetic relatedness as well as covariates and non-genetic variance components.

Results

We have successfully used the package to analyze many datasets, including F34 body weight data that contains 688 individuals genotyped at 3105 SNP markers, in which we identified 11 QTL. It took 295 seconds to estimate variance components and 70 seconds to perform the genome scan on a Linux machine equipped with a 2.40 GHz Intel(R) Core(TM)2 Quad CPU.

Conclusions

QTLRel provides a toolkit for genome-wide association studies that is capable of calculating genetic incidence matrices from pedigrees, estimating variance components, performing genome scans, incorporating interactive covariates and genetic and non-genetic variance components, as well as other functionalities such as multiple-QTL mapping and genome-wide epistasis.
