Similar literature
 20 similar articles found (search time: 31 ms)
1.
Specific selection pressures often lead to specifically mutated genomes. The open source software SeqFeatR has been developed to identify associations between mutation patterns in biological sequences and specific selection pressures (“features”). For instance, SeqFeatR has been used to discover new T cell epitopes in viral protein sequences for hosts of given HLA types. SeqFeatR supports frequentist and Bayesian methods for the discovery of statistical sequence-feature associations. Moreover, it offers novel ways to visualize results of the statistical analyses and to relate them to further properties. In this article we demonstrate various functions of SeqFeatR with real data. The most frequently used set of functions is also provided by a web server. SeqFeatR is implemented as an R package and is freely available from the R archive CRAN (http://cran.r-project.org/web/packages/SeqFeatR/index.html). The package includes a tutorial vignette. The software is distributed under the GNU General Public License (version 3 or later). The web server URL is https://seqfeatr.zmb.uni-due.de.

2.
With the development of high-throughput experimental techniques such as microarray, mass spectrometry and large-scale mutagenesis, there is an increasing need to automatically annotate gene sets and identify the involved pathways. Although many pathway analysis tools have been developed, new tools are still needed to meet the requirements of flexible or advanced analysis. Here, we developed an R-based software package (SubpathwayMiner) for flexible pathway identification. SubpathwayMiner facilitates sub-pathway identification of metabolic pathways by using pathway structure information. SubpathwayMiner also provides more flexibility in annotating gene sets and identifying the involved pathways (entire pathways and sub-pathways): (i) SubpathwayMiner is able to provide the most up-to-date pathway analysis results for users; (ii) SubpathwayMiner supports multiple species (∼100 eukaryotes, 714 bacteria and 52 Archaea) and different gene identifiers (Entrez Gene IDs, NCBI-gi IDs, UniProt IDs, PDB IDs, etc.) in the KEGG GENE database; (iii) the system cooperates efficiently with other R-based tools in biology. SubpathwayMiner is freely available at http://cran.r-project.org/web/packages/SubpathwayMiner/.

3.
Dependence measures and tests for independence have recently attracted a lot of attention, because they are the cornerstone of algorithms for network inference in probabilistic graphical models. Pearson's product moment correlation coefficient is still by far the most widely used statistic, yet it is largely constrained to detecting linear relationships. In this work we provide an exact formula for the k-th nearest neighbor distance distribution of rank-transformed data. Based on that, we propose two novel tests for independence. An implementation of these tests, together with a general benchmark framework for independence testing, is freely available as a CRAN software package (http://cran.r-project.org/web/packages/knnIndep). In this paper we have benchmarked Pearson's correlation, Hoeffding's D, dcor, Kraskov's estimator for mutual information, maximal information criterion and our two tests. We conclude that no particular method is generally superior to all other methods. However, dcor and Hoeffding's D are the most powerful tests for many different types of dependence.
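
The claim that Pearson's coefficient is largely limited to linear relationships can be illustrated with a small sketch. It is not taken from the knnIndep package; the function name `pearson` and the toy data are assumptions made for illustration only. A variable that depends deterministically on x through a symmetric quadratic yields a correlation of essentially zero, while a linear dependence yields a correlation of one.

```python
import math


def pearson(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


# Toy data (hypothetical): x is symmetric around zero.
xs = [i / 10 for i in range(-50, 51)]
linear = [2 * x + 1 for x in xs]   # linear dependence on x
quadratic = [x * x for x in xs]    # deterministic but non-linear dependence

r_linear = pearson(xs, linear)
r_quadratic = pearson(xs, quadratic)
print(f"linear: {r_linear:.3f}, quadratic: {r_quadratic:.3f}")
```

The quadratic case is fully dependent on x, yet the coefficient cannot see it, which is the kind of situation the alternative tests in the abstract are meant to address.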

4.
Reverse engineering approaches to constructing gene regulatory networks (GRNs) based on genome-wide mRNA expression data have led to significant biological findings, such as the discovery of novel drug targets. However, the reliability of the reconstructed GRNs needs to be improved. Here, we propose an ensemble-based network aggregation approach to improving the accuracy of network topologies constructed from mRNA expression data. To evaluate the performance of different approaches, we created dozens of simulated networks from combinations of gene-set sizes and sample sizes and also tested our methods on three Escherichia coli datasets. We demonstrate that the ensemble-based network aggregation approach can be used to effectively integrate GRNs constructed from different studies, producing more accurate networks. We also apply this approach to building a network from epithelial-mesenchymal transition (EMT) signature microarray data and identify hub genes that might be potential drug targets. The R code used to perform all of the analyses is available in an R package entitled “ENA”, accessible on CRAN (http://cran.r-project.org/web/packages/ENA/).
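
The general idea of aggregating several reconstructed networks into a consensus can be sketched as follows. The abstract does not state the aggregation rule, so the mean-rank rule used here is a stand-in assumption and not the method of the ENA package; the function names `rank_edges` and `aggregate_networks` and the edge scores are hypothetical.

```python
def rank_edges(scores):
    """Rank edges within one network: rank 1 is the highest-scoring edge."""
    ordered = sorted(scores, key=lambda edge: (-scores[edge], edge))
    return {edge: rank for rank, edge in enumerate(ordered, start=1)}


def aggregate_networks(networks):
    """Order edges by their mean rank across networks (assumed rule)."""
    edges = sorted(set().union(*networks))
    total_rank = {edge: 0 for edge in edges}
    for scores in networks:
        # An edge missing from a network is treated as having score 0.
        full = {edge: scores.get(edge, 0.0) for edge in edges}
        for edge, rank in rank_edges(full).items():
            total_rank[edge] += rank
    mean_rank = {edge: total_rank[edge] / len(networks) for edge in edges}
    return sorted(edges, key=lambda edge: (mean_rank[edge], edge))


# Hypothetical edge confidences from three separate reconstructions.
study_1 = {("A", "B"): 0.9, ("A", "C"): 0.2, ("B", "C"): 0.5}
study_2 = {("A", "B"): 0.7, ("A", "C"): 0.6, ("B", "C"): 0.1}
study_3 = {("A", "B"): 0.4, ("A", "C"): 0.1, ("B", "C"): 0.8}

consensus = aggregate_networks([study_1, study_2, study_3])
print(consensus)
```

Working on ranks rather than raw scores is one way to combine networks whose confidence values are on different scales; other aggregation rules would fit the same skeleton.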

5.

Background

DAVID is the most popular tool for interpreting the large lists of genes/proteins classically produced in high-throughput experiments. However, using the DAVID website becomes difficult when analyzing multiple gene lists, because it does not provide an adequate visualization tool to show and compare multiple enrichment results in a concise and informative manner.

Results

We implemented a new R-based graphical tool, BACA (Bubble chArt to Compare Annotations), which uses the DAVID web service for cross-comparing enrichment analysis results derived from multiple large gene lists. BACA is implemented in R and is freely available at the CRAN repository (http://cran.r-project.org/web/packages/BACA/).

Conclusion

The package BACA allows R users to combine multiple annotation charts into one output graph, bypassing the DAVID website.

Electronic supplementary material

The online version of this article (doi:10.1186/s12859-015-0477-4) contains supplementary material, which is available to authorized users.

6.
Cross-linking immunoprecipitation coupled with high-throughput sequencing (CLIP-Seq) has made it possible to identify the targeting sites of RNA-binding proteins in various cell culture systems and tissue types on a genome-wide scale. Here we present a novel model-based approach (MiClip) to identify high-confidence protein-RNA binding sites from CLIP-Seq datasets. This approach assigns a probability score to each potential binding site to help prioritize subsequent validation experiments. The MiClip algorithm has been tested on both HITS-CLIP and PAR-CLIP datasets. In the HITS-CLIP dataset, the signal/noise ratios of miRNA seed motif enrichment produced by the MiClip approach are between 17% and 301% higher than those of the ad hoc method for the top 10 most enriched miRNAs. In the PAR-CLIP dataset, the MiClip approach can identify ∼50% more validated binding targets than the original ad hoc method and two recently published methods. To facilitate the application of the algorithm, we have released an R package, MiClip (http://cran.r-project.org/web/packages/MiClip/index.html), and a public web-based graphical user interface (http://galaxy.qbrc.org/tool_runner?tool_id=mi_clip) for customized analysis.

7.
Contemporary genetic studies are revealing the genetic complexity of many traits in humans and model organisms. Two hallmarks of this complexity are epistasis, meaning gene-gene interaction, and pleiotropy, in which one gene affects multiple phenotypes. Understanding the genetic architecture of complex traits requires addressing these phenomena, but interpreting the biological significance of epistasis and pleiotropy is often difficult. While epistasis reveals dependencies between genetic variants, it is often unclear how the activity of one variant is specifically modifying the other. Epistasis found in one phenotypic context may disappear in another context, rendering the genetic interaction ambiguous. Pleiotropy can suggest either redundant phenotype measures or gene variants that affect multiple biological processes. Here we present an R package, R/cape, which addresses these interpretation ambiguities by implementing a novel method to generate predictive and interpretable genetic networks that influence quantitative phenotypes. R/cape integrates information from multiple related phenotypes to constrain models of epistasis, thereby enhancing the detection of interactions that simultaneously describe all phenotypes. The networks inferred by R/cape are readily interpretable in terms of directed influences that indicate suppressive and enhancing effects of individual genetic variants on other variants, which in turn account for the variance in quantitative traits. We demonstrate the utility of R/cape by analyzing a mouse backcross, thereby discovering novel epistatic interactions influencing phenotypes related to obesity and diabetes. R/cape is an easy-to-use, platform-independent R package and can be applied to data from both genetic screens and a variety of segregating populations including backcrosses, intercrosses, and natural populations. The package is freely available under the GPL-3 license at http://cran.r-project.org/web/packages/cape.
This is a PLOS Computational Biology Software Article

8.
Conventional genome-wide association studies (GWAS) have proven to be a successful strategy for identifying genetic variants associated with complex human traits. However, there is still a large heritability gap between GWAS and traditional family studies. The “missing heritability” has been suggested to be due to a lack of studies focused on epistasis, also called gene–gene interactions, because individual trials have often had insufficient sample size. Meta-analysis is a common method for increasing statistical power. However, sufficiently detailed information is difficult to obtain. A previous study employed a meta-regression-based method to detect epistasis, but it faced the challenge of inconsistent estimates. Here, we describe a Markov chain Monte Carlo-based method, called “Epistasis Test in Meta-Analysis” (ETMA), which uses genotype summary data to obtain consistent estimates of epistasis effects in meta-analysis. We defined a series of conditions to generate simulation data and tested the power and type I error rates of ETMA, individual data analysis, and the conventional meta-regression-based method. ETMA not only successfully facilitated consistency of evidence but also yielded acceptable type I error and higher power than conventional meta-regression. We applied ETMA to three real meta-analysis data sets. We found significant gene–gene interactions in the renin–angiotensin system and the polycyclic aromatic hydrocarbon metabolism pathway, with strong supporting evidence. In addition, glutathione S-transferase (GST) mu 1 and theta 1 were confirmed to exert independent effects on cancer. We concluded that the application of ETMA to real meta-analysis data was successful. Finally, we developed an R package, etma, for the detection of epistasis in meta-analysis [etma is available via the Comprehensive R Archive Network (CRAN) at https://cran.r-project.org/web/packages/etma/index.html].

9.
We propose permutation tests based on the pairwise distances between microarrays to compare location, variability, or equivalence of gene expression between two populations. For these tests the entire microarray or some pre-specified subset of genes is the unit of analysis. The pairwise distances only have to be computed once so the procedure is not computationally intensive despite the high dimensionality of the data. An R software package, permtest, implementing the method is freely available from the Comprehensive R Archive Network at http://cran.r-project.org.
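
A distance-based permutation test of this kind can be sketched as follows. This is a minimal illustration under stated assumptions, not the permtest implementation: the test statistic (mean between-group distance minus mean within-group distance), the Euclidean metric, the function names, and the toy data are all choices made here for illustration. The sketch does reflect the point made in the abstract that the distance matrix is computed once and only the group labels are permuted.

```python
import math
import random
from itertools import combinations


def distance_matrix(samples):
    """Euclidean distances between all pairs of samples, computed once."""
    n = len(samples)
    dist = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        dist[i][j] = dist[j][i] = math.dist(samples[i], samples[j])
    return dist


def location_statistic(dist, group_a, group_b):
    """Mean between-group distance minus mean within-group distance."""
    between = [dist[i][j] for i in group_a for j in group_b]
    within = [dist[i][j]
              for group in (group_a, group_b)
              for i, j in combinations(group, 2)]
    return sum(between) / len(between) - sum(within) / len(within)


def permutation_p_value(dist, group_a, group_b, n_perm=999, seed=0):
    """One-sided p-value obtained by shuffling the group labels."""
    rng = random.Random(seed)
    observed = location_statistic(dist, group_a, group_b)
    pooled = list(group_a) + list(group_b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        perm_a = pooled[:len(group_a)]
        perm_b = pooled[len(group_a):]
        if location_statistic(dist, perm_a, perm_b) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Hypothetical data: 6 arrays per population, 3 genes each, shifted location.
population_1 = [[0.1 * i, 0.2 * i, 0.0] for i in range(6)]
population_2 = [[5 + 0.1 * i, 5 + 0.2 * i, 5.0] for i in range(6)]
dist = distance_matrix(population_1 + population_2)
p_value = permutation_p_value(dist, range(0, 6), range(6, 12))
print(f"p = {p_value:.3f}")
```

Each permutation only re-indexes the precomputed matrix, so the cost per permutation does not depend on the number of genes.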

10.
Understanding of which peptides and proteins have the potential to undergo amyloid formation, and of what driving forces are responsible for amyloid-like fiber formation and stabilization, remains limited. This is mainly because proteins that can undergo the structural changes that lead to amyloid formation are quite diverse and share no obvious sequence or structural homology, despite the structural similarity found in the fibrils. To address these issues, a novel approach based on recursive feature selection and feed-forward neural networks was undertaken to identify key features highly correlated with the self-assembly problem. This approach allowed the identification of seven physicochemical and biochemical properties of the amino acids highly associated with the self-assembly of peptides and proteins into amyloid-like fibrils (normalized frequency of β-sheet, normalized frequency of β-sheet from LG, weights for β-sheet at the window position of 1, isoelectric point, atom-based hydrophobic moment, helix termination parameter at position j+1 and ΔG° values for peptides extrapolated in 0 M urea). Moreover, these features enabled the development of a new predictor (available at http://cran.r-project.org/web/packages/appnn/index.html) capable of accurately and reliably predicting the amyloidogenic propensity from the polypeptide sequence alone, with a prediction accuracy of 84.9% against an external validation dataset of sequences with experimental in vitro evidence of amyloid formation.

11.
When two variables are related by a known function, the coefficient of determination (denoted R²) measures the proportion of the total variance in the observations explained by that function. For linear relationships, this is equal to the square of the correlation coefficient, ρ. When the parametric form of the relationship is unknown, however, it is unclear how to estimate the proportion of explained variance equitably, that is, assigning similar values to equally noisy relationships. Here we demonstrate how to directly estimate a generalised R² when the form of the relationship is unknown, and we consider the performance of the Maximal Information Coefficient (MIC), a recently proposed information theoretic measure of dependence. We show that our approach behaves equitably, has more power than MIC to detect association between variables, and converges faster with increasing sample size. Most importantly, our approach generalises to higher dimensions, estimating the strength of multivariate relationships (Y against A, B, …) as well as measuring association while controlling for covariates (Y against X controlling for C). An R package named matie (“Measuring Association and Testing Independence Efficiently”) is available (http://cran.r-project.org/web/packages/matie/).
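
The statement that R² equals ρ² for linear relationships can be checked numerically. The sketch below covers only this textbook identity for a least-squares straight-line fit; it is not the generalised R² estimator of the matie package, and the function names and data points are assumptions chosen for illustration.

```python
def sums_of_squares(xs, ys):
    """Return (Sxx, Syy, Sxy, mean_x, mean_y) for two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxx, syy, sxy, mean_x, mean_y


def r_squared_linear(xs, ys):
    """R^2 = 1 - SS_res / SS_tot for the least-squares line through the data."""
    sxx, syy, sxy, mean_x, mean_y = sums_of_squares(xs, ys)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    return 1 - ss_res / syy


def pearson_squared(xs, ys):
    """Square of the Pearson correlation coefficient."""
    sxx, syy, sxy, _, _ = sums_of_squares(xs, ys)
    return sxy ** 2 / (sxx * syy)


# Hypothetical noisy observations of a roughly linear relationship.
xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [1.1, 2.9, 5.2, 6.8, 9.1, 11.0, 12.7, 15.3, 16.9, 19.2]

r2 = r_squared_linear(xs, ys)
rho2 = pearson_squared(xs, ys)
print(f"R^2 = {r2:.6f}, rho^2 = {rho2:.6f}")
```

The two quantities agree up to floating-point error because, for the least-squares line, the residual sum of squares is Syy − Sxy²/Sxx.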

12.
Previous phylogenetic studies in oaks (Quercus, Fagaceae) have failed to resolve the backbone topology of the genus with strong support. Here, we utilize next-generation sequencing of restriction-site associated DNA (RAD-Seq) to resolve a framework phylogeny of a predominantly American clade of oaks whose crown age is estimated at 23–33 million years old. Using a recently developed analytical pipeline for RAD-Seq phylogenetics, we created a concatenated matrix of 1.40 × 10^6 aligned nucleotides, constituting 27,727 sequence clusters. RAD-Seq data were readily combined across runs, with no difference in phylogenetic placement between technical replicates, which overlapped by only 43–64% in locus coverage. 17% (4,715) of the loci we analyzed could be mapped with high confidence to one or more expressed sequence tags in NCBI Genbank. A concatenated matrix of the loci that BLAST to at least one EST sequence provides approximately half as many variable or parsimony-informative characters as equal-sized datasets from the non-EST loci. The EST-associated matrix is more complete (fewer missing loci) and has slightly lower homoplasy than non-EST subsampled matrices of the same size, but there is no difference in phylogenetic support or relative attribution of base substitutions to internal versus terminal branches of the phylogeny. We introduce a partitioned RAD visualization method (implemented in the R package RADami; http://cran.r-project.org/web/packages/RADami) to investigate the possibility that suboptimal topologies supported by large numbers of loci (due, for example, to reticulate evolution or lineage sorting) are masked by the globally optimal tree. We find no evidence for strongly-supported alternative topologies in our study, suggesting that the phylogeny we recover is a robust estimate of large-scale phylogenetic patterns in the American oak clade.
Our study is one of the first to demonstrate the utility of RAD-Seq data for inferring phylogeny in a 23–33 million year-old clade.

13.
14.
adegenet: a R package for the multivariate analysis of genetic markers   (cited 4 times in total: 0 self-citations, 4 citations by others)
The package adegenet for the R software is dedicated to the multivariate analysis of genetic markers. It extends the ade4 package of multivariate methods by implementing formal classes and functions to manipulate and analyse genetic markers. Data can be imported from common population genetics software and exported to other software and R packages. adegenet also implements standard population genetics tools along with more original approaches for spatial genetics and hybridization. AVAILABILITY: Stable version is available from CRAN: http://cran.r-project.org/mirrors.html. Development version is available from adegenet website: http://adegenet.r-forge.r-project.org/. Both versions can be installed directly from R. adegenet is distributed under the GNU General Public Licence (v.2).

15.
Many human diseases are attributable to complex interactions among genetic and environmental factors. Statistical tools capable of modeling such complex interactions are necessary to improve identification of genetic factors that increase a patient's risk of disease. Logic Forest (LF), a bagging ensemble algorithm based on logic regression (LR), is able to discover interactions among binary variables predictive of response such as the biologic interactions that predispose individuals to disease. However, LF's ability to recover interactions degrades for more infrequently occurring interactions. A rare genetic interaction may occur if, for example, the interaction increases disease risk in a patient subpopulation that represents only a small proportion of the overall patient population. We present an alternative ensemble adaptation of LR based on boosting rather than bagging called LBoost. We compare the ability of LBoost and LF to identify variable interactions in simulation studies. Results indicate that LBoost is superior to LF for identifying genetic interactions associated with disease that are infrequent in the population. We apply LBoost to a subset of single nucleotide polymorphisms on the PRDX genes from the Cancer Genetic Markers of Susceptibility Breast Cancer Scan to investigate genetic risk for breast cancer. LBoost is publicly available on CRAN as part of the LogicForest package, http://cran.r-project.org/.

16.
Demographic loop analysis is one of the basic methods applied in life cycle analysis in population ecology. Here, we developed an R package called “loop” to implement the algorithmic approach to loop analysis developed in a previous work. The package also provides flexible network plotting and food web analysis. In this paper we illustrate loop decomposition analysis using the life-cycle graph of a tropical tree species, Vouacapoua americana, and compute food web statistics for two real food webs to illustrate food web plotting and the detection of key species in securing food web persistence. The package, including source code and binary versions, is available at the following URL: http://cran.r-project.org/web/packages/loop/.

17.
Interest in gene drive technology has continued to grow as promising new drive systems have been developed in the lab and discussions are moving towards implementing field trials. The prospect of field trials requires models that incorporate a significant degree of ecological detail, including parameters that change over time in response to environmental data such as temperature and rainfall, leading to seasonal patterns in mosquito population density. Epidemiological outcomes are also of growing importance, as: i) the suitability of a gene drive construct for release will depend on its expected impact on disease transmission, and ii) initial field trials are expected to have a measured entomological outcome and a modeled epidemiological outcome. We present MGDrivE 2 (Mosquito Gene Drive Explorer 2): a significant development from the MGDrivE 1 simulation framework that investigates the population dynamics of a variety of gene drive architectures and their spread through spatially-explicit mosquito populations. Key strengths and fundamental improvements of the MGDrivE 2 framework are: i) the ability of parameters to vary with time and induce seasonal population dynamics, ii) an epidemiological module accommodating reciprocal pathogen transmission between humans and mosquitoes, and iii) an implementation framework based on stochastic Petri nets that enables efficient model formulation and flexible implementation. Example MGDrivE 2 simulations are presented to demonstrate the application of the framework to a CRISPR-based split gene drive system intended to drive a disease-refractory gene into a population in a confinable and reversible manner, incorporating time-varying temperature and rainfall data. The simulations also evaluate impact on human disease incidence and prevalence. Further documentation and use examples are provided in vignettes at the project’s CRAN repository. 
MGDrivE 2 is freely available as an open-source R package on CRAN (https://CRAN.R-project.org/package=MGDrivE2). We intend the package to provide a flexible tool capable of modeling gene drive constructs as they move closer to field application and to infer their expected impact on disease transmission.

18.

Background

Sequencing and genotyping technology advancements have led to massive, growing repositories of spatially explicit genetic data and increasing quantities of temporal data (i.e., ancient DNA). These data will allow more complex and fine-scale inferences about population history than ever before; however, new methods are needed to test complex hypotheses.

Results

This article presents popRange, a forward genetic simulator, which incorporates large-scale genetic data with stochastic spatially and temporally explicit demographic and selective models. Features such as spatially and temporally variable selection coefficients and demography are incorporated in a highly flexible manner. popRange is implemented as an R package and presented with an example simulation exploring a selected allele’s trajectory in multiple subpopulations.

Conclusions

popRange allows researchers to evaluate and test complex scenarios by simulating large-scale data with complicated demographic and selective features. popRange is available for download at http://cran.r-project.org/web/packages/popRange/index.html.

19.

Background

A generalized notion of biclustering involves the identification of patterns across subspaces within a data matrix. This approach is particularly well-suited to analysis of heterogeneous molecular biology datasets, such as those collected from populations of cancer patients. Different definitions of biclusters will offer different opportunities to discover information from datasets, making it pertinent to tailor the desired patterns to the intended application. This paper introduces ‘GABi’, a customizable framework for subspace pattern mining suited to large heterogeneous datasets. Most existing biclustering algorithms discover biclusters of only a few distinct structures. However, by enabling definition of arbitrary bicluster models, the GABi framework enables the application of biclustering to tasks for which no existing algorithm could be used.

Results

First, a series of artificial datasets were constructed to represent three clearly distinct scenarios for applying biclustering. With a bicluster model created for each distinct scenario, GABi is shown to recover the correct solutions more effectively than a panel of alternative approaches, where the bicluster model may not reflect the structure of the desired solution. Secondly, the GABi framework is used to integrate clinical outcome data with an ovarian cancer DNA methylation dataset, leading to the discovery that widespread dysregulation of DNA methylation associates with poor patient prognosis, a result that has not previously been reported. This illustrates a further benefit of the flexible bicluster definition of GABi, which is that it enables incorporation of multiple sources of data, with each data source treated in a specific manner, leading to a means of intelligent integrated subspace pattern mining across multiple datasets.

Conclusions

The GABi framework enables discovery of biologically relevant patterns of any specified structure from large collections of genomic data. An R implementation of the GABi framework is available through CRAN (http://cran.r-project.org/web/packages/GABi/index.html).

Electronic supplementary material

The online version of this article (doi:10.1186/s12859-014-0355-5) contains supplementary material, which is available to authorized users.

20.

Key message

We enhance power and accuracy of QTL mapping in multiple related families, by clustering the founders of the families on their local genomic similarity.

Abstract

MCQTL is a linkage mapping software application that allows the joint QTL mapping of multiple related families. In its current implementation, QTLs are modeled with one or two parameters for each parent that is a founder of the multi-cross design. The higher the number of parents, the higher the number of model parameters, which can reduce the power and the accuracy of the mapping. We propose to make use of the availability of increasingly dense genotyping information on the founders to reduce the number of MCQTL parameters and thus boost QTL discovery. We developed clusthaplo, an R package (http://cran.r-project.org/web/packages/clusthaplo/index.html), which aims to cluster haplotypes using a genomic similarity that reflects the probability of sharing the same ancestral allele. Computed in a sliding window along the genome and followed by a clustering method, the genomic similarity allows the local clustering of the parent haplotypes. Our assumption is that the haplotypes belonging to the same class transmit the same ancestral allele, so their putative QTL allelic effects can be modeled with the same parameter, leading to a parsimonious model that is plugged into MCQTL. Intensive simulations using three maize data sets showed a significant gain in power and in accuracy of the QTL mapping with the ancestral allele model compared to the classical MCQTL model. MCQTL_LD (clusthaplo outputs plugged into MCQTL) is a versatile and powerful tool for QTL mapping in multiple related families that makes use of linkage and linkage disequilibrium (web site http://carlit.toulouse.inra.fr/MCQTL/).
