A common goal of microarray and related high-throughput genomic experiments is to identify genes that vary across biological conditions. Most often this is accomplished by identifying genes with changes in mean expression level, so-called differentially expressed (DE) genes, and a number of effective methods for identifying DE genes have been developed. Although useful, these approaches do not accommodate other types of differential regulation. An important example concerns differential coexpression (DC). Investigations of this class of genes are hampered by the large cardinality of the space to be interrogated as well as by influential outliers. As a result, existing DC approaches are often underpowered, exceedingly prone to false discoveries, and/or computationally intractable for even a moderately large number of pairs. To address this, an empirical Bayesian approach for identifying DC gene pairs is developed. The approach provides a false discovery rate controlled list of significant DC gene pairs without sacrificing power. It is applicable within a single study as well as across multiple studies. Computations are greatly facilitated by a modification to the expectation-maximization algorithm and a procedural heuristic. Simulations suggest that the proposed approach outperforms existing methods in far less computational time, and case study results suggest that the approach will likely prove to be a useful complement to current DE methods in high-throughput genomic studies.
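
To make the notion of differential coexpression concrete, the sketch below computes the change in Pearson correlation of one gene pair between two conditions. It is only an illustration of the quantity of interest, not the empirical Bayesian procedure of the abstract; the function names `pearson` and `correlation_shift` are hypothetical.

```python
from math import sqrt
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length expression vectors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def correlation_shift(x1, y1, x2, y2) -> float:
    """Correlation of a gene pair in condition 1 minus that in condition 2.

    Illustrative only: a large absolute value hints at differential
    coexpression, but this naive statistic has no error control and is
    sensitive to outliers, which is the problem the abstract addresses.
    """
    return pearson(x1, y1) - pearson(x2, y2)
```

With p genes there are p(p-1)/2 such pairs, which is the "large cardinality" that makes naive pairwise testing expensive.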

Protein crystals contain two different types of interfaces: biologically relevant ones, observed in protein–protein complexes and oligomeric proteins, and nonspecific ones, corresponding to crystal lattice contacts. Because of the increasing complexity of the objects being tackled in structural biology, distinguishing biological contacts from crystal contacts is not always a trivial task and can lead to wrong interpretation of macromolecular structures. We devised an approach (CRK, core‐rim Ka/Ks ratio) for distinguishing biologically relevant interfaces from nonspecific ones. Given a protein–protein interface, CRK finds a set of homologs to the sequences of the proteins involved in the interface, retrieves and aligns the corresponding coding sequences, on which it carries out a residue‐by‐residue Ka/Ks ratio (ω) calculation. It divides interface residues into a “rim” and a “core” set and analyzes the selection pressure on the residues belonging to the two sets. We developed and tested CRK on different datasets and test cases, consisting of biologically relevant contacts, nonspecific ones or of both types. The method proves very effective in distinguishing the two categories of interfaces, with an overall accuracy rate of 84%. As it relies on different principles when compared with existing tools, CRK is optimally suited to be used in combination with them. In addition, CRK has potential applications in the validation of structures of oligomeric proteins and protein complexes. Proteins 2010. © 2010 Wiley‐Liss, Inc.

Although mutation analysis plays a key part in making a definitive diagnosis of a genetic disease, interpreting the biological implications of mutations remains a time-consuming step that requires integrating various lines of archived information about the genes in question. To expedite this evaluation step of disease-causing genetic variations, here we developed Mutation@A Glance (http://rapid.rcai.riken.jp/mutation/), a highly integrated web-based analysis tool for analysing human disease mutations; it implements a user-friendly graphical interface to visualize about 40 000 known disease-associated mutations and genetic polymorphisms from more than 2600 protein-coding human disease-causing genes. Mutation@A Glance locates already known genetic variation data individually on the nucleotide and the amino acid sequences and makes it possible to cross-reference them with tertiary and/or quaternary protein structures and various functional features associated with specific amino acid residues in the proteins. We showed that the disease-associated missense mutations had a stronger tendency to reside in positions relevant to the structure/function of proteins than neutral genetic variations. From a practical viewpoint, Mutation@A Glance could certainly function as a ‘one-stop’ analysis platform for newly determined DNA sequences, which enables us to readily identify and evaluate new genetic variations by integrating multiple lines of information about the disease-causing candidate genes.

Network theory applied to protein structures provides insights into numerous problems of biological relevance. The explosion in structural data available from PDB and simulations establishes a need for an efficient standalone program that assembles network concepts/parameters under one roof in an automated manner. Herein, we discuss the development/application of an exhaustive, user‐friendly, standalone program package named PSN‐Ensemble, which can handle structural ensembles generated through molecular dynamics (MD) simulation/NMR studies or from multiple X‐ray structures. The novelty in network construction lies in the explicit consideration of side‐chain interactions among amino acids. The program evaluates network parameters dealing with topological organization and long‐range allosteric communication. The introduction of a flexible weighting scheme in terms of residue pairwise cross‐correlation/interaction energy in PSN‐Ensemble brings dynamical/chemical knowledge into the network representation. Also, the results are mapped on a graphical display of the structure, making network analysis easily accessible to the general biological community. The potential of PSN‐Ensemble for examining structural ensembles is exemplified using MD trajectories of a ubiquitin‐conjugating enzyme (UbcH5b). Furthermore, insights derived from network parameters evaluated using PSN‐Ensemble for single‐static structures of active/inactive states of β2‐adrenergic receptor and the ternary tRNA complexes of tyrosyl tRNA synthetases (from organisms across kingdoms) are discussed. PSN‐Ensemble is freely available from http://vishgraph.mbu.iisc.ernet.in/PSN‐Ensemble/psn_index.html .

Generation of full protein coordinates from limited information, e.g., the Cα coordinates, is an important step in protein homology modeling and structure determination, and molecular dynamics (MD) simulations may prove to be important in this task. We describe a new method, in which the protein backbone is built quickly in a rather crude way and then refined by minimization techniques. Subsequently, the side chains are positioned using extensive MD calculations. The method is tested on two proteins, and results compared to proteins constructed using two other MD-based methods. In the first method, we supplemented an existing backbone building method with a new procedure to add side chains. The second one largely consists of available methodology. The constructed proteins are compared to the corresponding X-ray structures, which became available during this study, and they are in good agreement (backbone RMS values of 0.5–0.7 Å, and all-atom RMS values of 1.5–1.9 Å). This comparative study indicates that extensive MD simulations are able, to some extent, to generate details of the native protein structure, and may contribute to the development of a standardized methodology to predict reliably (parts of) protein structures when only partial coordinate data are available. © 1994 John Wiley & Sons, Inc.

Next-generation sequencing (NGS) technologies provide the potential for developing high-throughput and low-cost platforms for clinical diagnostics. A limiting factor to clinical applications of genomic NGS is downstream bioinformatics analysis for data interpretation. We have developed an integrated approach for end-to-end clinical NGS data analysis from variant detection to functional profiling. Robust bioinformatics pipelines were implemented for genome alignment, single nucleotide polymorphism (SNP), small insertion/deletion (InDel), and copy number variation (CNV) detection of whole exome sequencing (WES) data from the Illumina platform. Quality-control metrics were analyzed at each step of the pipeline by use of a validated training dataset to ensure data integrity for clinical applications. We annotate the variants with data regarding the disease population and variant impact. Custom algorithms were developed to filter variants based on criteria, such as quality of variant, inheritance pattern, and impact of variant on protein function. The developed clinical variant pipeline links the identified rare variants to Integrated Genome Viewer for visualization in a genomic context and to the Protein Information Resource’s iProXpress for rich protein and disease information. With the application of our system of annotations, prioritizations, inheritance filters, and functional profiling and analysis, we have created a unique methodology for downstream variant filtering that empowers clinicians and researchers to interpret more effectively the relevance of genomic alterations within a rare genetic disease.

An approach is described for rapidly determining protein structures by NMR that utilizes proteins containing 13C-methyl labeled Val, Leu, and Ile (δ1) and protonated Phe and Tyr in a deuterated background. Using this strategy, the key NOEs that define the hydrophobic core and overall fold of the protein are easily obtained. NMR data are acquired using cryogenic probe technology, which markedly reduces the spectrometer time needed for data acquisition. The approach is demonstrated by determining the overall fold of the antiapoptotic protein, Bcl-xL, from data collected in only 4 days. Refinement of the Bcl-xL structure to a backbone rmsd of 0.95 Å was accomplished with data collected in an additional 3 days. A distance analysis of 180 different proteins and structure calculations using simulated data suggest that our method will allow the global folds of a wide variety of proteins to be determined.

In epidemic models, the effective reproduction number is of central importance to assess the transmission dynamics of an infectious disease and to orient health intervention strategies. Publicly shared data during an outbreak often suffer from two sources of misreporting (underreporting and delay in reporting) that should not be overlooked when estimating epidemiological parameters. The main statistical challenge in models that intrinsically account for a misreporting process lies in the joint estimation of the time-varying reproduction number and the delay/underreporting parameters. Existing Bayesian approaches typically rely on Markov chain Monte Carlo algorithms that are extremely costly from a computational perspective. We propose a much faster alternative based on Laplacian-P-splines (LPS) that combines Bayesian penalized B-splines for flexible and smooth estimation of the instantaneous reproduction number and Laplace approximations to selected posterior distributions for fast computation. Assuming a known generation interval distribution, the incidence at a given calendar time is governed by the epidemic renewal equation and the delay structure is specified through a composite link framework. Laplace approximations to the conditional posterior of the spline vector are obtained from analytical versions of the gradient and Hessian of the log-likelihood, implying a drastic speed-up in the computation of posterior estimates. Furthermore, the proposed LPS approach can be used to obtain point estimates and approximate credible intervals for the delay and reporting probabilities. Simulations of epidemics with different combinations of the underreporting rate and delay structure (one-day, two-day, and weekend delays) show that the proposed LPS methodology delivers fast and accurate estimates, outperforming existing methods that do not take into account underreporting and delay patterns. Finally, LPS is illustrated in two real case studies of epidemic outbreaks.
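
The epidemic renewal equation mentioned in the abstract above can be illustrated with a short sketch. It uses the common discrete form, mean incidence at time t equal to R_t times the sum over s of phi_s times I_(t-s), where phi is the generation interval distribution. The function names and the deterministic propagation of means are assumptions made for illustration; the sketch does not include the spline smoothing, the reporting delay, or the underreporting model of the LPS approach.

```python
from typing import List, Sequence


def expected_incidence(r_t: float,
                       past_incidence: Sequence[float],
                       gen_interval: Sequence[float]) -> float:
    """Mean incidence at time t from the discrete renewal equation.

    past_incidence[-1] is I_(t-1), past_incidence[-2] is I_(t-2), and so on.
    gen_interval[s-1] is phi_s, the probability that the generation
    interval lasts s time units (assumed known, as in the abstract).
    """
    total = 0.0
    for s, phi in enumerate(gen_interval, start=1):
        if s <= len(past_incidence):  # ignore lags before the first record
            total += phi * past_incidence[-s]
    return r_t * total


def simulate_means(r_series: Sequence[float],
                   seed_cases: float,
                   gen_interval: Sequence[float]) -> List[float]:
    """Propagate mean incidence forward, one step per value in r_series."""
    incidence = [float(seed_cases)]
    for r_t in r_series:
        incidence.append(expected_incidence(r_t, incidence, gen_interval))
    return incidence
```

In an estimation setting the direction is reversed: the incidence is observed (imperfectly) and R_t is the unknown, which is where the penalized B-splines and Laplace approximations of the abstract come in.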

A reliable automated approach for assignment of NOESY spectra would allow more rapid determination of protein structures by NMR. In this paper we describe a semi-automated procedure for complete NOESY assignment (SANE, Structure Assisted NOE Evaluation), coupled to an iterative procedure for NMR structure determination where the user is directly involved. Our method is similar to ARIA [Nilges et al. (1997) J. Mol. Biol., 269, 408–422], but is compatible with the molecular dynamics suites AMBER and DYANA. The method is ideal for systems where an initial model or crystal structure is available, but has also been used successfully for ab initio structure determination. Use of this semi-automated iterative approach assists in the identification of errors in the NOE assignments to short-cut the path to an NMR solution structure.

Evaluating or predicting the quality of protein models (i.e., predicted protein tertiary structures) without knowing their native structures is important for selecting and appropriately using protein models. We describe an iterative approach that improves the performance of protein Model Quality Assurance Programs (MQAPs). Given the initial quality scores of a list of models assigned by an MQAP, the method iteratively refines the scores until the ranking of the models does not change. We applied the method to the model quality assessment data generated by 30 MQAPs during the Eighth Critical Assessment of Techniques for Protein Structure Prediction. To various degrees, our method increased the average correlation between predicted and real quality scores of 25 out of 30 MQAPs and reduced the average loss (i.e., the difference between the top ranked model and the best model) for 28 MQAPs. Particularly, for MQAPs with low average correlations (<0.4), the correlation can be increased severalfold. Similar experiments conducted on the CASP9 MQAPs also demonstrated the effectiveness of the method. Our method is a hybrid method that combines the original method of an MQAP and the pair-wise comparison clustering method. It can achieve a high accuracy similar to a full pair-wise clustering method, but with much less computation time when evaluating hundreds of models. Furthermore, without knowing native structures, the iterative refining method can evaluate the performance of an MQAP by analyzing its model quality predictions.

A new, efficient method for the assembly of protein tertiary structure from known, loosely encoded secondary structure restraints and sparse information about exact side chain contacts is proposed and evaluated. The method is based on a new, very simple method for the reduced modeling of protein structure and dynamics, where the protein is described as a lattice chain connecting side chain centers of mass rather than Cαs. The model has implicit built-in multibody correlations that simulate short- and long-range packing preferences, hydrogen bonding cooperativity and a mean force potential describing hydrophobic interactions. Due to the simplicity of the protein representation and definition of the model force field, the Monte Carlo algorithm is at least an order of magnitude faster than previously published Monte Carlo algorithms for structure assembly. In contrast to existing algorithms, the new method requires a smaller number of tertiary restraints for successful fold assembly; on average, one for every seven residues as compared to one for every four residues. For example, for smaller proteins such as the B domain of protein G, the resulting structures have a coordinate root mean square deviation (cRMSD), which is about 3 Å from the experimental structure; for myoglobin, structures whose backbone cRMSD is 4.3 Å are produced, and for a 247-residue TIM barrel, the cRMSD of the resulting folds is about 6 Å. As would be expected, increasing the number of tertiary restraints improves the accuracy of the assembled structures. The reliability and robustness of the new method should enable its routine application in model building protocols based on various (very sparse) experimentally derived structural restraints. Proteins 32:475–494, 1998. © 1998 Wiley-Liss, Inc.
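
The cRMSD values quoted above measure the average distance between matched atoms of a model and the experimental structure. A minimal sketch of that calculation follows, assuming the two coordinate sets are already optimally superimposed and listed in the same atom order; the superposition step itself (for example by the Kabsch algorithm) is omitted, and the function name `coordinate_rmsd` is hypothetical.

```python
from math import sqrt
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def coordinate_rmsd(model: Sequence[Point], reference: Sequence[Point]) -> float:
    """Root mean square deviation between matched 3D coordinates.

    Assumes both structures are already superimposed and that
    model[i] corresponds to reference[i]. Units follow the input,
    typically angstroms.
    """
    if len(model) != len(reference) or not model:
        raise ValueError("coordinate lists must be non-empty and equal in length")
    squared = 0.0
    for (x1, y1, z1), (x2, y2, z2) in zip(model, reference):
        squared += (x1 - x2) ** 2 + (y1 - y2) ** 2 + (z1 - z2) ** 2
    return sqrt(squared / len(model))
```

Without prior superposition the value would also reflect rigid-body displacement, so it would overstate the structural difference.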

We present a novel and efficient approach for assessing protein-protein complex formation, which combines ab initio docking calculations performed with the protein docking algorithm BiGGER and chemical shift perturbation data collected with heteronuclear single quantum coherence (HSQC) or TROSY nuclear magnetic resonance (NMR) spectroscopy. This method, termed "restrained soft-docking," is validated for several known protein complexes. These data demonstrate that restrained soft-docking extends the size limitations of NMR spectroscopy and provides an alternative method for investigating macromolecular protein complexes that requires less experimental time, effort, and resources. The potential utility of this novel NMR and simulated docking approach in current structural genomic initiatives is discussed.

Inference of population structure from genetic data plays an important role in population and medical genetics studies. With the advancement and decreasing cost of sequencing technology, the increasingly available whole genome sequencing data provide much richer information about the underlying population structure. The traditional method originally developed for array-based genotype data for computing and selecting top principal components (PCs) that capture population structure may not perform well on sequencing data for two reasons. First, the number of genetic variants $p$ is much larger than the sample size $n$ in sequencing data such that the sample-to-marker ratio $n/p$ is nearly zero, violating the assumption of the Tracy-Widom test used in that method. Second, the method might not be able to handle the linkage disequilibrium well in sequencing data. To resolve those two practical issues, we propose a new method called ERStruct to determine the number of top informative PCs based on sequencing data. More specifically, we propose to use the ratio of consecutive eigenvalues as a more robust test statistic, and then we approximate its null distribution using modern random matrix theory. Both simulation studies and applications to two public data sets from the HapMap 3 and the 1000 Genomes Projects demonstrate the empirical performance of our ERStruct method.
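
The ratio-of-consecutive-eigenvalues statistic can be sketched as follows. The sketch takes the sample eigenvalues as given and applies a simple sequential stopping rule against a user-supplied `critical_value`. Both the stopping rule and the parameter are assumptions made for illustration: in the abstract the null distribution of the ratio is approximated with random matrix theory, which is not reproduced here.

```python
from typing import List, Sequence


def eigenvalue_ratios(eigenvalues: Sequence[float]) -> List[float]:
    """Ratios lambda_(k+1) / lambda_k of eigenvalues sorted in decreasing order.

    Eigenvalues are assumed strictly positive. A ratio well below 1
    indicates a gap after the k-th eigenvalue.
    """
    lam = sorted(eigenvalues, reverse=True)
    return [lam[k + 1] / lam[k] for k in range(len(lam) - 1)]


def count_top_components(eigenvalues: Sequence[float],
                         critical_value: float) -> int:
    """Count leading PCs whose ratio to the next eigenvalue shows a gap.

    Hypothetical rule: keep adding components while the ratio is below
    critical_value, and stop at the first ratio that is not.
    """
    count = 0
    for ratio in eigenvalue_ratios(eigenvalues):
        if ratio < critical_value:
            count += 1
        else:
            break
    return count
```

A known limitation of this naive rule is that two informative eigenvalues of nearly equal size produce a ratio close to 1 and would stop the count early.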

We describe a statistical framework for reconstructing the sequence of transmission events between observed cases of an endemic infectious disease using genetic, temporal and spatial information. Previous approaches to reconstructing transmission trees have assumed all infections in the study area originated from a single introduction and that a large fraction of cases were observed. There are as yet no approaches appropriate for endemic situations in which a disease is already well established in a host population and in which there may be multiple origins of infection, or that can enumerate unobserved infections missing from the sample. Our proposed framework addresses these shortcomings, enabling reconstruction of partially observed transmission trees and estimating the number of cases missing from the sample. Analyses of simulated datasets show the method to be accurate in identifying direct transmissions, while introductions and transmissions via one or more unsampled intermediate cases could be identified at high to moderate levels of case detection. When applied to partial genome sequences of rabies virus sampled from an endemic region of South Africa, our method reveals several distinct transmission cycles with little contact between them, and direct transmission over long distances suggesting significant anthropogenic influence in the movement of infected dogs.

We have developed a tool for computer-assisted assignments of protein NMR spectra from triple resonance data. The program is designed to resemble established manual assignment procedures as closely as possible. IBIS exports its results in XEASY format. Thus, using IBIS the operator has continuous visual and accounting control over the progress of the assignment procedure. IBIS achieves complete assignments for those residues that exhibit sequential triple resonance connectivities within a few hours or days.

The discovery of rare genetic variants through next generation sequencing is a very challenging issue in the field of human genetics. We propose a novel region‐based statistical approach based on a Bayes Factor (BF) to assess evidence of association between a set of rare variants (RVs) located on the same genomic region and a disease outcome in the context of case‐control design. Marginal likelihoods are computed under the null and alternative hypotheses assuming a binomial distribution for the RV count in the region and a beta or mixture of Dirac and beta prior distribution for the probability of RV. We derive the theoretical null distribution of the BF under our prior setting and show that a Bayesian control of the false discovery rate can be obtained for genome‐wide inference. Informative priors are introduced using prior evidence of association from a Kolmogorov‐Smirnov test statistic. We use our simulation program, sim1000G, to generate RV data similar to the 1000 Genomes sequencing project. Our simulation studies showed that the new BF statistic outperforms standard methods (SKAT, SKAT‐O, Burden test) in case‐control studies with moderate sample sizes and is equivalent to them under large sample size scenarios. Our real data application to a lung cancer case‐control study found enrichment for RVs in known and novel cancer genes. It also suggests that using the BF with informative prior improves the overall gene discovery compared to the BF with noninformative prior.
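
As a rough illustration of a Bayes factor built from binomial counts with a beta prior, the sketch below compares two hypotheses for a single region: cases and controls share one RV probability (null), or each group has its own (alternative). This is a generic textbook beta-binomial construction under assumed inputs (`k_case`, `n_case`, `k_ctrl`, `n_ctrl`, and prior parameters `a`, `b`), not necessarily the exact formulation of the abstract, and it omits the Dirac mixture prior and the informative priors.

```python
from math import lgamma


def log_beta(a: float, b: float) -> float:
    """Natural log of the beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def log_bayes_factor(k_case: int, n_case: int,
                     k_ctrl: int, n_ctrl: int,
                     a: float = 1.0, b: float = 1.0) -> float:
    """Log Bayes factor, alternative over null, for two binomial counts.

    k_* are RV counts out of n_* trials in each group; Beta(a, b) is
    the prior on each RV probability. Binomial coefficients cancel
    between the two marginal likelihoods and are therefore omitted.
    """
    # Alternative: independent probabilities for cases and controls.
    separate = (log_beta(k_case + a, n_case - k_case + b)
                + log_beta(k_ctrl + a, n_ctrl - k_ctrl + b)
                - 2.0 * log_beta(a, b))
    # Null: one shared probability, so the counts are pooled.
    shared = (log_beta(k_case + k_ctrl + a,
                       n_case + n_ctrl - k_case - k_ctrl + b)
              - log_beta(a, b))
    return separate - shared
```

Positive values favour different RV probabilities in cases and controls; negative values favour a common probability.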

Assignment of NMR spectra is a prerequisite for structure determination of proteins using NMR. The time spent on the assignment is long compared to that spent on other parts of the protein structure determination process, but it can be shortened by using either interactive or fully automated computer programs. To benefit from the advantages of both types of program we have developed a version of the interactive assignment program ANSIG to include automatized, yet user-supervised, routines. The new program includes tools for (i) semiautomatic sequential assignment, (ii) plotting of distances from PDB structure files directly in NMR spectra and (iii) statistical analysis of distance restraint violations with the possibility to directly zoom to violated NOEs in NOESY spectra.

Van der Waals (vdW) interaction energies between different atom types, energies of hydrogen bonds (H-bonds), and atomic solvation parameters (ASPs) have been derived from the published thermodynamic stabilities of 106 mutants with available crystal structures by use of an originally designed model for the calculation of free-energy differences. The set of mutants included substitutions of uncharged, inflexible, water-inaccessible residues in alpha-helices and beta-sheets of T4, human, and hen lysozymes and HI ribonuclease. The determined energies of vdW interactions and H-bonds were smaller than in molecular mechanics and followed the "like dissolves like" rule, as expected in condensed media but not in vacuum. The depths of modified Lennard-Jones potentials were -0.34, -0.12, and -0.06 kcal/mole for similar atom types (polar-polar, aromatic-aromatic, and aliphatic-aliphatic interactions, respectively) and -0.10, -0.08, -0.06, -0.02, and nearly 0 kcal/mole for different types (sulfur-polar, sulfur-aromatic, sulfur-aliphatic, aliphatic-aromatic, and carbon-polar, respectively), whereas the depths of H-bond potentials were -1.5 to -1.8 kcal/mole. The obtained solvation parameters, that is, transfer energies from water to the protein interior, were 19, 7, -1, -21, and -66 cal/(mole·Å²) for aliphatic carbon, aromatic carbon, sulfur, nitrogen, and oxygen, respectively, which is close to the cyclohexane scale for aliphatic and aromatic groups but intermediate between octanol and cyclohexane for others. An analysis of additional replacements at the water-protein interface indicates that vdW interactions between protein atoms are reduced when they occur across water.
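
To show what a potential "depth" means, the sketch below evaluates the standard 12-6 Lennard-Jones form, E(r) = depth × [(r_min/r)^12 − 2 (r_min/r)^6], whose minimum value is −depth at r = r_min. The abstract refers to a modified Lennard-Jones potential whose exact form is not given here, so the standard form is a stand-in; the function name and the `r_min` value used in the usage note are assumptions.

```python
def lennard_jones(r: float, depth: float, r_min: float) -> float:
    """Standard 12-6 Lennard-Jones energy at separation r.

    depth is the positive well depth (for example in kcal/mole) and
    r_min the separation at the minimum, in the same unit as r.
    The energy equals -depth at r = r_min and tends to 0 as r grows.
    """
    if r <= 0:
        raise ValueError("separation must be positive")
    x6 = (r_min / r) ** 6
    return depth * (x6 * x6 - 2.0 * x6)
```

For instance, `lennard_jones(4.0, 0.34, 4.0)` evaluates the polar-polar depth quoted above at an assumed minimum-energy separation of 4.0 Å, giving the well bottom of -0.34 kcal/mole.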

