Similar articles
 Found 20 similar articles; search took 109 ms
1.
Lavner Y  Kotlar D 《Gene》2005,345(1):127-138
We study the interrelations between tRNA gene copy numbers, gene expression levels and measures of codon bias in the human genome. First, we show that isoaccepting tRNA gene copy numbers correlate positively with expression-weighted frequencies of amino acids and codons. Using expression data of more than 14,000 human genes, we show a weak positive correlation between gene expression level and frequency of optimal codons (codons with highest tRNA gene copy number). Interestingly, contrary to non-mammalian eukaryotes, codon bias tends to be high in both highly expressed genes and lowly expressed genes. We suggest that selection may act on codon bias, not only to increase elongation rate by favoring optimal codons in highly expressed genes, but also to reduce elongation rate by favoring non-optimal codons in lowly expressed genes. We also show that the frequency of optimal codons is in positive correlation with estimates of protein biosynthetic cost, and suggest another possible action of selection on codon bias: preference of optimal codons as production cost rises, to reduce the rate of amino acid misincorporation. In the analyses of this work, we introduce a new measure of frequency of optimal codons (FOP'), which is unaffected by amino acid composition and is corrected for background nucleotide content; we also introduce a new method for computing expected codon frequencies, based on the dinucleotide composition of the introns and the non-coding regions surrounding a gene.
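
To make the basic idea of a frequency of optimal codons concrete, here is a minimal sketch. It computes the plain fraction of optimal codons among the codons it recognizes; it is not the corrected FOP' measure introduced in the abstract. The codon table is deliberately tiny, and the set of "optimal" codons is a made-up assumption for illustration, not derived from human tRNA gene copy numbers.

```python
# Toy synonymous-codon table covering only three amino acids (illustration only).
SYNONYMOUS = {
    "F": ["TTT", "TTC"],
    "K": ["AAA", "AAG"],
    "E": ["GAA", "GAG"],
}

# Hypothetical optimal codons: assumed here, not taken from tRNA gene copy data.
OPTIMAL = {"TTC", "AAG", "GAG"}


def fop(seq):
    """Fraction of optimal codons among the codons found in SYNONYMOUS.

    Returns None when the sequence contains no codon from the table.
    """
    codon_to_aa = {c: aa for aa, codons in SYNONYMOUS.items() for c in codons}
    # Split the sequence into complete codons, ignoring a trailing partial codon.
    usable = len(seq) - len(seq) % 3
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    scored = [c for c in codons if c in codon_to_aa]
    if not scored:
        return None
    return sum(c in OPTIMAL for c in scored) / len(scored)


print(fop("TTCAAGGAATTT"))  # two of the four scored codons are in OPTIMAL
```

A real analysis would use the full genetic code and an optimal-codon set chosen from tRNA gene copy numbers, as the abstract describes.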

2.
Markov models of codon substitution are powerful inferential tools for studying biological processes such as natural selection and preferences in amino acid substitution. The equilibrium character distributions of these models are almost always estimated using nucleotide frequencies observed in a sequence alignment, primarily as a matter of historical convention. In this note, we demonstrate that a popular class of such estimators are biased, and that this bias has an adverse effect on goodness of fit and estimates of substitution rates. We propose a “corrected” empirical estimator that begins with observed nucleotide counts, but accounts for the nucleotide composition of stop codons. We show via simulation that the corrected estimates outperform the de facto standard estimates not just by providing better estimates of the frequencies themselves, but also by leading to improved estimation of other parameters in the evolutionary models. On a curated collection of sequence alignments, our estimators show a significant improvement in goodness of fit compared to the approach. Maximum likelihood estimation of the frequency parameters appears to be warranted in many cases, albeit at a greater computational cost. Our results demonstrate that there is little justification, either statistical or computational, for continued use of the -style estimators.

3.
The relationship of least squared-error estimation to the commonly used data pre-processing method of stimulus locked signal averaging is discussed. First, a generalized squared-error estimate is derived. Second, two data pre-processing methods are introduced and shown analytically to be equivalent with respect to subsequent least squared-error estimation. The first method consists of fitting known functions directly to unaltered data while the second method fits to the same data after it has been time-averaged. A third method of less utility is also demonstrated to be equivalent. It consists of first fitting to sub-blocks of the unaltered data and then averaging the resulting estimates. Finally, a numerical example is presented. It substantiates the analytical contentions and points out practical considerations which might arise in the course of implementation of the estimation procedure.
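
The equivalence claimed above can be checked numerically with a small sketch. It assumes every trial is sampled at the same time points and that the model is a linear combination of two known functions; the functions, noise level, and trial count are arbitrary choices for illustration and are not taken from the paper.

```python
import math
import random


def fit_two_functions(ts, ys, f1, f2):
    """Least-squares amplitudes (a, b) for the model y ~ a*f1(t) + b*f2(t)."""
    s11 = sum(f1(t) * f1(t) for t in ts)
    s12 = sum(f1(t) * f2(t) for t in ts)
    s22 = sum(f2(t) * f2(t) for t in ts)
    r1 = sum(f1(t) * y for t, y in zip(ts, ys))
    r2 = sum(f2(t) * y for t, y in zip(ts, ys))
    det = s11 * s22 - s12 * s12  # determinant of the 2x2 normal equations
    return ((s22 * r1 - s12 * r2) / det, (s11 * r2 - s12 * r1) / det)


def f1(t):
    return math.exp(-t)


def f2(t):
    return math.sin(t)


rng = random.Random(0)
times = [0.1 * k for k in range(50)]
# 20 stimulus-locked trials: same underlying signal plus independent noise.
trials = [[2.0 * f1(t) + 0.5 * f2(t) + rng.gauss(0, 0.3) for t in times]
          for _ in range(20)]

# Method 1: fit directly to all unaltered (stacked) data.
all_t = times * len(trials)
all_y = [y for trial in trials for y in trial]
direct = fit_two_functions(all_t, all_y, f1, f2)

# Method 2: average across trials first, then fit.
avg = [sum(col) / len(trials) for col in zip(*trials)]
averaged = fit_two_functions(times, avg, f1, f2)

# Method 3: fit each trial (sub-block) separately, then average the estimates.
per_trial = [fit_two_functions(times, tr, f1, f2) for tr in trials]
blockwise = tuple(sum(e[i] for e in per_trial) / len(per_trial) for i in (0, 1))

print(direct, averaged, blockwise)  # the three pairs agree up to rounding
```

The agreement follows because the normal equations are linear in the data: with an identical design for every trial, summing the per-trial right-hand sides is the same as fitting the trial average.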

4.
Robust estimation of allele frequencies in pools of DNA has the potential to reduce genotyping costs and/or increase the number of individuals contributing to a study where hundreds of thousands of genetic markers need to be genotyped in very large population sample sets, such as genome-wide association studies. In order to make accurate allele frequency estimations from pooled samples, a correction for unequal allele representation must be applied. We have developed the polynomial based probe specific correction (PPC), which is a novel correction algorithm for accurate estimation of allele frequencies in data from high-density microarrays. This algorithm was validated through comparison of allele frequencies from a set of 10 individually genotyped DNAs and frequencies estimated from pools of these 10 DNAs using GeneChip 10K Mapping Xba 131 arrays. Our results demonstrate that when using the PPC to correct for allelic biases, the accuracy of the allele frequency estimates increases dramatically.

5.
Modeling residue usage in aligned protein sequences via maximum likelihood (cited by 9: 6 self-citations, 3 by others)
A computational method is presented for characterizing residue usage, i.e., site-specific residue frequencies, in aligned protein sequences. The method obtains frequency estimates that maximize the likelihood of the sequences in a simple model for sequence evolution, given a tree or a set of candidate trees computed by other methods. These maximum-likelihood frequencies constitute a profile of the sequences, and thus the method offers a rigorous alternative to sequence weighting for constructing such a profile. The ability of this method to discard misleading phylogenetic effects allows the biochemical propensities of different positions in a sequence to be more clearly observed and interpreted.

6.
7.
Pei J  Grishin NV 《Proteins》2004,56(4):782-794
We study the effects of various factors in representing and combining evolutionary and structural information for local protein structural prediction based on fragment selection. We prepare databases of fragments from a set of non-redundant protein domains. For each fragment, evolutionary information is derived from homologous sequences and represented as estimated effective counts and frequencies of amino acids (evolutionary frequencies) at each position. Position-specific amino acid preferences called structural frequencies are derived from statistical analysis of discrete local structural environments in database structures. Our method for local structure prediction is based on ranking and selecting database fragments that are most similar to a target fragment. Using secondary structure type as a local structural property, we test our method in a number of settings. The major findings are: (1) the COMPASS-type scoring function for fragment similarity comparison gives better prediction accuracy than three other tested scoring functions for profile-profile comparison. We show that the COMPASS-type scoring function can be derived both in the probabilistic framework and in the framework of statistical potentials. (2) Using the evolutionary frequencies of database fragments gives better prediction accuracy than using structural frequencies. (3) Finer definition of local environments, such as including more side-chain solvent accessibility classes and considering the backbone conformations of neighboring residues, gives increasingly better prediction accuracy using structural frequencies. (4) Combining evolutionary and structural frequencies of database fragments, either in a linear fashion or using a pseudocount mixture formula, results in improvement of prediction accuracy. Combination at the log-odds score level is not as effective as combination at the frequency level. 
This suggests that there might be better ways of combining sequence and structural information than the commonly used linear combination of log-odds scores. Our method of fragment selection and frequency combination gives reasonable results of secondary structure prediction tested on 56 CASP5 targets (average SOV score 0.77), suggesting that it is a valid method for local protein structure prediction. A mixture of predicted structural frequencies and evolutionary frequencies improves the quality of local profile-to-profile alignment by COMPASS.

8.
Best linear unbiased allele-frequency estimation in complex pedigrees (cited by 4: 0 self-citations, 4 by others)
McPeek MS  Wu X  Ober C 《Biometrics》2004,60(2):359-367
Many types of genetic analyses depend on estimates of allele frequencies. We consider the problem of allele-frequency estimation based on data from related individuals. The motivation for this work is data collected on the Hutterites, an isolated founder population, so we focus particularly on the case in which the relationships among the sampled individuals are specified by a large, complex pedigree for which maximum likelihood estimation is impractical. For this case, we propose to use the best linear unbiased estimator (BLUE) of allele frequency. We derive this estimator, which is equivalent to the quasi-likelihood estimator for this problem, and we describe an efficient algorithm for computing the estimate and its variance. We show that our estimator has certain desirable small-sample properties in common with the maximum likelihood estimator (MLE) for this problem. We treat both the case when parental origin of each allele is known and when it is unknown. The results are extended to prediction of allele frequency in some set of individuals S based on genotype data collected on a set of individuals R. We compare the mean-squared error of the BLUE, the commonly used naive estimator (sample frequency) and the MLE when the latter is feasible to calculate. The results indicate that although the MLE performs the best of the three, the BLUE is close in performance to the MLE and is substantially easier to calculate, making it particularly useful for large complex pedigrees in which MLE calculation is impractical or infeasible. We apply our method to allele-frequency estimation in a Hutterite data set.
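
The abstract does not state the estimator's formula, so the following is only a sketch of the generic best linear unbiased estimator of a common mean. It assumes that each individual's value y_i is the fraction of their two alleles that are the allele of interest (0, 0.5 or 1) and that the covariance of y is proportional to a relatedness matrix K; the estimate is then (1'K⁻¹1)⁻¹ 1'K⁻¹y. The matrix K, the toy pedigree, and the function names are all illustrative assumptions.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def blue_allele_frequency(y, k):
    """Generic BLUE of a common mean: weights w solve K w = 1 (K symmetric)."""
    w = solve(k, [1.0] * len(y))
    return sum(wi * yi for wi, yi in zip(w, y)) / sum(w)


# Toy sample (assumed): two outbred full sibs and one unrelated individual.
K = [[1.0, 0.5, 0.0],
     [0.5, 1.0, 0.0],
     [0.0, 0.0, 1.0]]
y = [1.0, 1.0, 0.0]  # both sibs homozygous for the allele, the third has none

print(blue_allele_frequency(y, K))  # the sibs are down-weighted
print(sum(y) / len(y))              # naive sample frequency for comparison
```

In this toy case the two sibs carry partly redundant information, so each receives weight 2/3 against weight 1 for the unrelated individual, and the estimate falls below the naive sample frequency.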

9.
Abstract. Two alternatives are offered to Podani's proposals, based on the claim that Braun-Blanquet cover-abundance estimates cannot be properly analysed by conventional multivariate methods. 1. The ordinal transform scale, based on an extended Braun-Blanquet cover-abundance scale, comes close to a metric cover percentage scale after (1) the abundance values r (very few individuals), + (few ind.), 1 (abundant) and 2m (very abundant, cover < 5%) are replaced by cover percentage estimates and (2) the higher Braun-Blanquet values, notably 4 and 5, with cover intervals 50-75% and 75-100%, respectively, are interpreted as estimates of considerably higher cover values than the usual visual projection on the ground (because of the position of stems and leaves in several layers). I propose the equation ln C = (OTV − 2)/a, where C = Cover%, OTV is the 1 to 9 Ordinal Transfer Value and a is a factor weighting the cover values. With this equation cover values in a geometric series are achieved for the nine values in the extended Braun-Blanquet scale from 0.5% (OTV 1) to 140% (OTV 9) for a = 1.415, and for a = 1.380 from 0.6% to 160%. 2. This makes use of an earlier developed ‘optimum-transformation’ of cover-abundance values. For each species a frequency distribution of cover-abundance values is determined for a large data set, i.e. of dune slack vegetation. Tiny species have low values (OTVs 1-3) with high frequencies and hardly occur with higher OTV values; here all scores are considered ‘optimal’. In dominant species OTVs 7 to 9 have the highest frequencies and only these values are considered optimal. Species with intermediate OTV ranges have optimum ranges with low-bound OTV = 2, 3, 4 and 5, respectively. No species were found in the dune slack data set with a frequency distribution justifying an optimum range with low-bound OTV = 6. 
For mathematically correct numerical treatments, ‘optimum scores’ can be converted to 1 and sub-optimal scores to 0 in order to approach a presence/absence situation. Both alternatives are suggested to be acceptable approximations to a metric basis for numerical analyses.
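
The first alternative can be tabulated with a short sketch. It reads the proposed equation as ln C = (OTV − 2)/a, so that C = exp((OTV − 2)/a); the function name is an illustrative choice, and the default a = 1.415 is one of the two values quoted in the abstract.

```python
import math


def cover_percent(otv, a=1.415):
    """Cover percentage C for an ordinal transfer value, from ln C = (OTV - 2)/a."""
    return math.exp((otv - 2) / a)


# Tabulate the nine values of the extended Braun-Blanquet scale.
for otv in range(1, 10):
    print(otv, round(cover_percent(otv), 1))
```

Because the exponent is linear in OTV, consecutive cover values differ by the constant factor exp(1/a), which is what makes the series geometric.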

10.
Haplotype analyses have become increasingly common in genetic studies of human disease because of their ability to identify unique chromosomal segments likely to harbor disease-predisposing genes. The study of haplotypes is also used to investigate many population processes, such as migration and immigration rates, linkage-disequilibrium strength, and the relatedness of populations. Unfortunately, many haplotype-analysis methods require phase information that can be difficult to obtain from samples of nonhaploid species. There are, however, strategies for estimating haplotype frequencies from unphased diploid genotype data collected on a sample of individuals that make use of the expectation-maximization (EM) algorithm to overcome the missing phase information. The accuracy of such strategies, compared with other phase-determination methods, must be assessed before their use can be advocated. In this study, we consider and explore sources of error between EM-derived haplotype frequency estimates and their population parameters, noting that much of this error is due to sampling error, which is inherent in all studies, even when phase can be determined. In light of this, we focus on the additional error between haplotype frequencies within a sample data set and EM-derived haplotype frequency estimates incurred by the estimation procedure. We assess the accuracy of haplotype frequency estimation as a function of a number of factors, including sample size, number of loci studied, allele frequencies, and locus-specific allelic departures from Hardy-Weinberg and linkage equilibrium. We point out the relative impacts of sampling error and estimation error, calling attention to the pronounced accuracy of EM estimates once sampling error has been accounted for. 
We also suggest that many factors that may influence accuracy can be assessed empirically within a data set, a fact that can be used to create "diagnostics" that a user can turn to for assessing potential inaccuracies in estimation.
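
For readers unfamiliar with EM-based haplotype frequency estimation, here is a minimal sketch for the simplest case: two biallelic loci, where only the double heterozygote has ambiguous phase. It assumes random mating and encodes each genotype as a pair of counts (0, 1 or 2) of the "1" allele at the two loci. The function name and data are illustrative; this is not the software evaluated in the study.

```python
def em_haplotypes(genotypes, n_iter=100):
    """EM estimates of the four haplotype frequencies at two biallelic loci.

    Each genotype is (g1, g2): the number of '1' alleles at locus 1 and locus 2.
    Haplotypes are keyed as (allele at locus 1, allele at locus 2).
    """
    freqs = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}
    n_chrom = 2 * len(genotypes)
    for _ in range(n_iter):
        counts = dict.fromkeys(freqs, 0.0)
        for g1, g2 in genotypes:
            if g1 == 1 and g2 == 1:
                # E-step: split the double heterozygote between its two phases.
                cis = freqs[(0, 0)] * freqs[(1, 1)]
                trans = freqs[(0, 1)] * freqs[(1, 0)]
                w = cis / (cis + trans) if cis + trans > 0 else 0.5
                counts[(0, 0)] += w
                counts[(1, 1)] += w
                counts[(0, 1)] += 1 - w
                counts[(1, 0)] += 1 - w
            else:
                # Phase is unambiguous when at most one locus is heterozygous.
                hap_a = (1 if g1 >= 1 else 0, 1 if g2 >= 1 else 0)
                hap_b = (1 if g1 == 2 else 0, 1 if g2 == 2 else 0)
                counts[hap_a] += 1
                counts[hap_b] += 1
        # M-step: new frequencies are the expected haplotype proportions.
        freqs = {h: c / n_chrom for h, c in counts.items()}
    return freqs


sample = [(2, 2), (0, 0), (1, 1)]
print(em_haplotypes(sample))
```

With more loci the number of compatible haplotype pairs per genotype grows quickly, which is where sample size and the other factors listed in the abstract start to matter.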

11.
Statistical and biochemical studies have revealed nonrandom patterns in codon assignments. The canonical genetic code is known to be highly efficient in minimizing the effects of mistranslational errors and point mutations, since it is known that, when an amino acid is converted to another due to error, the biochemical properties of the resulting amino acid are usually very similar to those of the original one. In this study, we have taken into consideration both relative frequencies of amino acids and relative gene copy frequencies of tRNAs in genomic sequences in order to introduce a fitness function which models the mistranslational probabilities more accurately in modern organisms. The relative gene copy frequencies of tRNAs are used as estimates of the tRNA content. We also altered the rule previously used for the calculation of the probabilities of single base mutation occurrences. Our model signifies higher optimality of the genetic code towards load minimization and suggests the presence of a coevolution of tRNA frequency and the genetic code.

12.
Gilbert PB  Wu C  Jobes DV 《Biometrics》2008,64(1):198-207
Summary. Consider a placebo-controlled preventive HIV vaccine efficacy trial. An HIV amino acid sequence is measured from each volunteer who acquires HIV, and these sequences are aligned together with the reference HIV sequence represented in the vaccine. We develop genome scanning methods to identify positions at which the amino acids in infected vaccine recipient sequences either (A) are more divergent from the reference amino acid than the amino acids in infected placebo recipient sequences or (B) have a different frequency distribution than the placebo sequences, irrespective of a reference amino acid. We consider t-test-type statistics for problem A and Euclidean, Mahalanobis, and Kullback–Leibler-type statistics for problem B. The test statistics incorporate weights to reflect biological information contained in different amino acid positions and mismatches. Position-specific p-values are obtained by approximating the null distribution of the statistics either by a permutation procedure or by a nonparametric estimation. A permutation method is used to estimate a cut-off p-value to control the per comparison error rate at a prespecified level. The methods are examined in simulations and are applied to two HIV examples. The methods for problem B address the general problem of comparing discrete frequency distributions between groups in a high-dimensional data setting.
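
A stripped-down version of problem B at a single alignment position might look like the sketch below: a Kullback–Leibler-type statistic comparing amino acid frequencies between the two groups, with a permutation p-value. The pseudocount smoothing, the absence of biological weights, and all names and data are simplifying assumptions of this sketch, not the statistics defined in the paper.

```python
import math
import random
from collections import Counter


def kl_statistic(group_a, group_b, pseudocount=0.5):
    """KL divergence between smoothed residue frequencies of two groups."""
    alphabet = sorted(set(group_a) | set(group_b))
    ca, cb = Counter(group_a), Counter(group_b)
    na = len(group_a) + pseudocount * len(alphabet)
    nb = len(group_b) + pseudocount * len(alphabet)
    total = 0.0
    for aa in alphabet:
        p = (ca[aa] + pseudocount) / na
        q = (cb[aa] + pseudocount) / nb
        total += p * math.log(p / q)
    return total


def permutation_p_value(group_a, group_b, n_perm=999, seed=0):
    """P-value from shuffling group labels and recomputing the statistic."""
    rng = random.Random(seed)
    observed = kl_statistic(group_a, group_b)
    pooled = list(group_a) + list(group_b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        perm_a = pooled[:len(group_a)]
        perm_b = pooled[len(group_a):]
        if kl_statistic(perm_a, perm_b) >= observed:
            hits += 1
    # Add one to numerator and denominator so the p-value is never zero.
    return (hits + 1) / (n_perm + 1)


# Residues observed at one position (made-up data for illustration).
vaccine = ["K"] * 12 + ["R"] * 3
placebo = ["R"] * 13 + ["K"] * 2
print(permutation_p_value(vaccine, placebo))
```

Scanning a genome would repeat this per position, after which a multiplicity adjustment such as the permutation-based cut-off mentioned in the abstract becomes necessary.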

13.
This study examined the method of simultaneous estimation of recombination frequency and parameters for a qualitative trait locus and compared the results with those of standard methods of linkage analysis. With both approaches we were able to detect linkage of an incompletely penetrant qualitative trait to highly polymorphic markers with recombination frequencies in the range of .00-.05. Our results suggest that detecting linkage at larger recombination frequencies may require larger data sets or large high-density families. When applied to all families without regard to informativeness of the family structure for linkage, analyses of simulated data could detect no advantage of simultaneous estimation over more traditional and much less time-consuming methods, either in detecting linkage, estimating frequency, refining estimates of parameters for the qualitative trait locus, or avoiding false evidence for linkage. However, the method of sampling affected results.

14.
In theory, codon models that account for the dependence of nucleotide substitutions between codon positions as well as differences between synonymous and non-synonymous changes best describe the sequence evolution in protein coding genes. However, in practice we know little about the degree to which violations of the assumptions of codon model-based estimates occur, and how significant these artifacts may be. In nucleotide-based phylogenies from first and second codon positions in a concatenated plastid gene data set, two distantly related taxa (dinoflagellate and haptophyte plastids) were robustly grouped together. This artifactual grouping is attributed to the parallel heterogeneity in leucine (Leu) and serine (Ser) codon usages in the data set. Here, by using this data set, we demonstrated that codon-based phylogenetic estimations are seriously biased, robustly uniting the dinoflagellate and haptophyte plastids into a monophyletic clade, when the model assumption of homogeneity of codon composition was violated. Our results suggest that similar phylogenetic artifacts may occur via codon usage heterogeneity in any amino acids in codon model-based estimations. We advise that homogeneity in codon usage across taxa in a data set be confirmed before codon model-based phylogenetic estimation is attempted.

15.
Despite the increasing opportunity to collect large-scale data sets for population genomic analyses, the use of high-throughput sequencing to study populations of polyploids has seen little application. This is due in large part to problems associated with determining allele copy number in the genotypes of polyploid individuals (allelic dosage uncertainty, ADU), which complicates the calculation of important quantities such as allele frequencies. Here, we describe a statistical model to estimate biallelic SNP frequencies in a population of autopolyploids using high-throughput sequencing data in the form of read counts. We bridge the gap from data collection (using restriction enzyme based techniques [e.g. GBS, RADseq]) to allele frequency estimation in a unified inferential framework using a hierarchical Bayesian model to sum over genotype uncertainty. Simulated data sets were generated under various conditions for tetraploid, hexaploid and octoploid populations to evaluate the model's performance and to help guide the collection of empirical data. We also provide an implementation of our model in the R package polyfreqs and demonstrate its use with two example analyses that investigate (i) levels of expected and observed heterozygosity and (ii) model adequacy. Our simulations show that the number of individuals sampled from a population has a greater impact on estimation error than sequencing coverage. The example analyses also show that our model and software can be used to make inferences beyond the estimation of allele frequencies for autopolyploids by providing assessments of model adequacy and estimates of heterozygosity.

16.
A protein is generally classified into one of the following four structural classes: all alpha, all beta, alpha+beta and alpha/beta. In this paper, based on the weighting to the 20 constituent amino acids, a new method is proposed for predicting the structural class of a protein according to its amino acid composition. The 20 weighting parameters, which reflect the different properties of the 20 constituent amino acids, have been obtained from a training set of proteins through the linear-programming approach. The rate of correct prediction for a training set of proteins by means of the new method was 100%, whereas the highest rate of previous methods was 82.8%. Furthermore, the results showed that the more numerous the training proteins, the more effective the new method.

17.
Development of methods for estimating species trees from multilocus data is a current challenge in evolutionary biology. We propose a method for estimating the species tree topology and branch lengths using approximate Bayesian computation (ABC). The method takes as data a sample of observed rooted gene tree topologies, and then iterates through the following sequence of steps: First, a randomly selected species tree is used to compute the distribution of rooted gene tree topologies. This distribution is then compared to the observed gene topology frequencies, and if the fit between the observed and the predicted distributions is close enough, the proposed species tree is retained. Repeating this many times leads to a collection of retained species trees that are then used to form the estimate of the overall species tree. We test the performance of the method, which we call ST-ABC, using both simulated and empirical data. The simulation study examines both symmetric and asymmetric species trees over a range of branch lengths and sample sizes. The results from the simulation study show that the model performs very well, giving accurate estimates for both the topology and the branch lengths across the conditions studied, and that a sample size of 25 loci appears to be adequate for the method. Further, we apply the method to two empirical cases: a 4-taxon data set for primates and a 7-taxon data set for yeast. In both cases, we find that estimates obtained with ST-ABC agree with previous studies. The method provides efficient estimation of the species tree, and does not require sequence data, but rather the observed distribution of rooted gene topologies without branch lengths. Therefore, this method is a useful alternative to other currently available methods for species tree estimation.
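
The propose, compare, retain loop described above can be sketched for the smallest possible case, three taxa, where the gene tree distribution has a simple closed form under the multispecies coalescent: with internal branch length T in coalescent units, the gene tree matching the species tree has probability 1 − (2/3)e^(−T) and each of the two other topologies has probability (1/3)e^(−T). The uniform prior on T, the distance measure, the tolerance, and the function names are assumptions of this sketch; it is not the authors' ST-ABC implementation.

```python
import math
import random

TOPOLOGIES = ["((A,B),C)", "((A,C),B)", "((B,C),A)"]


def gene_tree_probs(species_tree, t):
    """Rooted gene tree topology probabilities for a 3-taxon species tree."""
    discordant = math.exp(-t) / 3
    return {g: (1 - 2 * discordant) if g == species_tree else discordant
            for g in TOPOLOGIES}


def abc_species_tree(observed_freqs, n_draws=20000, eps=0.05, seed=1):
    """Rejection ABC: keep (tree, branch length) proposals that fit the data."""
    rng = random.Random(seed)
    kept = []
    for _ in range(n_draws):
        tree = rng.choice(TOPOLOGIES)   # randomly selected species tree
        t = rng.uniform(0, 5)           # assumed prior on the internal branch
        predicted = gene_tree_probs(tree, t)
        distance = sum(abs(predicted[g] - observed_freqs[g]) for g in TOPOLOGIES)
        if distance < eps:              # fit is close enough: retain proposal
            kept.append((tree, t))
    return kept


# Made-up observed gene tree topology frequencies across loci.
observed = {"((A,B),C)": 0.8, "((A,C),B)": 0.1, "((B,C),A)": 0.1}
kept = abc_species_tree(observed)
mean_t = sum(t for _, t in kept) / len(kept)
print(len(kept), kept[0][0], round(mean_t, 2))
```

The retained proposals approximate a posterior sample; the most frequent retained topology and the mean retained branch length serve as point estimates. For larger trees the gene tree distribution no longer has such a simple form and must be computed by dedicated algorithms.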

18.
We present a map of the amino acids based on their inter-residue contact energies using the Miyazawa-Jernigan matrix. This work is based on the method of metric multi-dimensional scaling (MMDS). The MMDS map shows, among other things, that the MJ contact energies imply the hydrophobic-hydrophilic nature of the amino acid residues. With the help of the map we are able to compare and draw inferences from uncorrelated data sets such as BLOSUM and PAM with MJ methods. We also use a hierarchical clustering method on our MMDS distance matrix to group the amino acids and arrive at an optimum number of groups for simplifying the amino acid set.

19.
Almudevar A 《Biometrics》2001,57(3):757-763
The problem of assessing the variability in pedigree reconstruction using DNA markers is considered for the special case of single generation samples with no parents present. Error in pedigree reconstruction is measured through a metric imposed on the space of partitions of the individuals into family groups. A confidence set can therefore be taken to be a neighborhood of a point estimate, analogous to the estimation of a parameter in Euclidean space. The coverage probability is estimated using bootstrap techniques. Although the distributional properties of the sample depend on the population genotype frequencies, these are in practice usually unknown. Confidence sets conditioned on a statistic approximately sufficient for these frequencies are compared with confidence sets obtained by substituting frequency estimates directly into the sampling distribution. In two simulation studies, the difference is found to be of some consequence.

20.
Chen P  Gillis KD 《Biophysical journal》2000,79(4):2162-2170
High-resolution measurement of membrane capacitance in the whole-cell-recording configuration can be used to detect small changes in membrane surface area that accompany exocytosis and endocytosis. We have investigated the noise of membrane capacitance measurements to determine the fundamental limits of resolution in actual cells in the whole-cell mode. Two previously overlooked sources of noise are particularly evident at low frequencies. The first noise source is accompanied by a correlation between capacitance estimates, whereas the second noise source is due to "1/f-like" current noise. An analytic expression that summarizes the noise from thermal and 1/f sources is derived, which agrees with experimental measurements from actual cells over a large frequency range. Our results demonstrate that the optimal frequencies for capacitance measurements are higher than previously believed. Finally, we demonstrate that the capacitance noise at high frequencies can be reduced by compensating for the voltage drop of the sine wave across the series resistance.
