Similar articles
 Found 20 similar articles (search time: 46 ms)
1.

Background  

Hidden Markov models and hidden Boltzmann models are employed in computational biology and a variety of other scientific fields for a variety of analyses of sequential data. Whether the associated algorithms are used to compute an actual probability or, more generally, an odds ratio or some other score, a frequent requirement is that the error statistics of a given score be known. What is the chance that random data would achieve that score or better? What is the chance that a real signal would achieve a given score threshold?

2.

Background  

Segmental duplications, or low-copy repeats, are common in mammalian genomes. In the human genome, most segmental duplications are mosaics composed of multiple duplicated fragments. This complex genomic organization complicates analysis of the evolutionary history of these sequences. One model proposed to explain these mosaic patterns is a model of repeated aggregation and subsequent duplication of genomic sequences.

3.
4.

Background  

Profile Hidden Markov Models (HMM) are statistical representations of protein families derived from patterns of sequence conservation in multiple alignments and have been used in identifying remote homologues with considerable success. These conservation patterns arise from fold specific signals, shared across multiple families, and function specific signals unique to the families. The availability of sequences pre-classified according to their function permits the use of negative training sequences to improve the specificity of the HMM, both by optimizing the threshold cutoff and by modifying emission probabilities to minimize the influence of fold-specific signals. A protocol to generate family specific HMMs is described that first constructs a profile HMM from an alignment of the family's sequences and then uses this model to identify sequences belonging to other classes that score above the default threshold (false positives). Ten-fold cross validation is used to optimise the discrimination threshold score for the model. The advent of fast multiple alignment methods enables the use of the profile alignments to align the true and false positive sequences, and the resulting alignments are used to modify the emission probabilities in the original model.

5.
6.

Background  

Recent advances in high-throughput technologies have produced a vast number of protein sequences, while the number of high-resolution structures has seen only a limited increase. This has prompted the development of many strategies to build a protein structure from its sequence, generating a considerable number of alternative models. The selection of the model closest to the native conformation has thus become crucial for structure prediction. Several methods have been developed to score protein models by energies, knowledge-based potentials, and combinations of both.

7.

Background  

The selection of the most accurate protein model from a set of alternatives is a crucial step in protein structure prediction, both in template-based and ab initio approaches. Scoring functions have been developed which can either return a quality estimate for a single model or derive a score from the information contained in the ensemble of models for a given sequence. Local structural features occurring more frequently in the ensemble have a greater probability of being correct. Within the context of the CASP experiment, these so-called consensus methods have been shown to perform considerably better in selecting good candidate models, but tend to fail if the best models are far from the dominant structural cluster. In this paper we show that model selection can be improved if both approaches are combined by pre-filtering the models used during the calculation of the structural consensus.

8.
9.

Background  

The function of a novel gene product is typically predicted by transitive assignment of annotation from similar sequences. We describe a novel method, GOtcha, for predicting gene product function by annotation with Gene Ontology (GO) terms. GOtcha predicts GO term associations with term-specific probability (P-score) measures of confidence. Term-specific probabilities are a novel feature of GOtcha and allow the identification of conflicts or uncertainty in annotation.

10.

Background  

The optimal score for ungapped local alignments of infinitely long random sequences is known to follow a Gumbel extreme value distribution. Less is known about the important case where gaps are allowed. For this case, the distribution is only known empirically in the high-probability region, which is biologically less relevant.
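
As an illustration of the ungapped case described above, the Gumbel tail probability can be written in the familiar Karlin-Altschul form P(S >= x) = 1 - exp(-K m n e^(-lambda x)). The sketch below evaluates it; the function name and the values of `lam` and `k` are assumptions chosen for illustration, since in practice both parameters depend on the scoring system and residue frequencies.

```python
import math

def gumbel_pvalue(score, m, n, lam, k):
    """Tail probability P(S >= score) for the optimal ungapped local
    alignment score of two random sequences of lengths m and n."""
    # Expected number of alignments scoring at least `score` (the E-value).
    expected_hits = k * m * n * math.exp(-lam * score)
    # 1 - exp(-E), computed with expm1 for stability when E is tiny.
    return -math.expm1(-expected_hits)

# Hypothetical parameters, for illustration only.
LAM, K = 0.3, 0.1
p_low = gumbel_pvalue(40, 300, 300, LAM, K)
p_high = gumbel_pvalue(60, 300, 300, LAM, K)
```

For high scores the expected number of hits is small and the P-value is close to that expectation, which is why the two quantities are often used interchangeably in the far tail.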

11.

Background  

Confidence in pairwise alignments of biological sequences, obtained by various methods such as Blast or Smith-Waterman, is critical for automatic analyses of genomic data. Two statistical models have been proposed. In the asymptotic limit of long sequences, the Karlin-Altschul model is based on the computation of a P-value, assuming that the number of high scoring matching regions above a threshold is Poisson distributed. Alternatively, the Lipman-Pearson model is based on the computation of a Z-value from a random score distribution obtained by a Monte-Carlo simulation. Z-values allow the deduction of an upper bound of the P-value (1/Z-value²) following the TULIP theorem. Simulated Z-value distributions are known to fit a Gumbel law. This remarkable property had not been demonstrated and had no obvious biological support.
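
The Monte-Carlo Z-value described above can be sketched as follows. This is a minimal illustration, not the procedure of any particular package: the scoring scheme (match 2, mismatch -1, linear gap -2), the number of shuffles, and the function names are all assumptions. The observed Smith-Waterman score is compared with the scores obtained after shuffling one sequence, and Z is the observed score's distance from the shuffled mean in units of the shuffled standard deviation.

```python
import random
import statistics

def sw_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best Smith-Waterman local alignment score (linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            cur.append(max(0, diag, prev[j] + gap, cur[j - 1] + gap))
        best = max(best, max(cur))
        prev = cur
    return best

def z_value(a, b, n_shuffles=200, seed=0):
    """Z-value of the observed score against scores of shuffled copies of b."""
    rng = random.Random(seed)
    observed = sw_score(a, b)
    letters = list(b)
    null_scores = []
    for _ in range(n_shuffles):
        rng.shuffle(letters)  # preserves the composition of b
        null_scores.append(sw_score(a, "".join(letters)))
    sd = statistics.stdev(null_scores)
    if sd == 0:
        raise ValueError("shuffled scores have zero variance")
    return (observed - statistics.fmean(null_scores)) / sd

seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
z = z_value(seq, seq)
p_upper = min(1.0, 1.0 / z**2)  # the 1/Z² bound quoted in the abstract
```

Shuffling keeps the residue composition fixed, so the Z-value measures how much of the score comes from residue order rather than composition alone.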

12.

Background  

Profile hidden Markov model (HMM) techniques are among the most powerful methods for protein homology detection. Yet, the critical features for successful modelling are not fully known. In the present work we approached this by using two of the most popular HMM packages: SAM and HMMER. The programs' abilities to build models and score sequences were compared on a SCOP/Pfam based test set. The comparison was done separately for local and global HMM scoring.

13.
14.

Background  

The chemical properties and biological function of a protein are a direct consequence of its primary structure. Several algorithms have been developed which determine alignment and similarity of primary protein sequences. However, character-based similarity cannot provide insight into the structural aspects of a protein. We present a method based on spectral similarity to compare subsequences of amino acids that behave similarly but are not aligned well when amino acids are considered as mere characters. This approach finds a similarity score between sequences based on any given attribute, such as hydrophobicity of amino acids, on the basis of spectral information after partial conversion to the frequency domain.

15.

Background  

Non-parametric bootstrapping is a widely-used statistical procedure for assessing confidence of model parameters based on the empirical distribution of the observed data [1] and, as such, it has become a common method for assessing tree confidence in phylogenetics [2]. Traditional non-parametric bootstrapping does not weigh each tree inferred from resampled (i.e., pseudo-replicated) sequences. Hence, the quality of these trees is not taken into account when computing bootstrap scores associated with the clades of the original phylogeny. As a consequence, traditionally, the trees with different bootstrap support or those providing a different fit to the corresponding pseudo-replicated sequences (the fit quality can be expressed through the LS, ML or parsimony score) contribute in the same way to the computation of the bootstrap support of the original phylogeny.
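
The traditional, unweighted procedure described above can be sketched in two steps: resample alignment columns with replacement to form a pseudo-replicate, then count how often a clade of the original tree reappears among the replicate trees. Tree inference itself is left out; the function names and the representation of a tree as a set of clades (each clade a `frozenset` of taxon names) are assumptions made for this illustration.

```python
import random

def bootstrap_alignment(alignment, rng):
    """Return a pseudo-replicate: columns drawn with replacement."""
    n_cols = len(next(iter(alignment.values())))
    cols = [rng.randrange(n_cols) for _ in range(n_cols)]
    return {name: "".join(seq[c] for c in cols)
            for name, seq in alignment.items()}

def clade_support(clade, replicate_clades):
    """Fraction of replicate trees containing `clade`.
    Every replicate counts equally, i.e. no weighting by tree quality."""
    hits = sum(1 for clades in replicate_clades if clade in clades)
    return hits / len(replicate_clades)

alignment = {"A": "ACGTAC", "B": "ACGTTC", "C": "TCGAAC"}
replicate = bootstrap_alignment(alignment, random.Random(1))

# Hypothetical clade sets, standing in for trees inferred from 4 replicates.
ab = frozenset({"A", "B"})
trees = [{ab}, {frozenset({"B", "C"})}, {ab}, {frozenset({"A", "C"})}]
support = clade_support(ab, trees)
```

Because `clade_support` treats every replicate tree identically, it reproduces the equal-contribution behaviour that the abstract identifies as the limitation of the traditional approach.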

16.

Background

The 12q24 locus harbors at least one gene responsible for hypercholesterolemia. Within the 12q24 locus lies the gene of proteasome modulator 9 (PSMD9). PSMD9 is in linkage with type 2 diabetes (T2D), T2D-nephropathy and macrovascular pathology in Italian families, and rare PSMD9 mutations contribute to T2D.

Aims

In the present study, we aimed at determining whether the PSMD9 T2D risk single nucleotide polymorphisms (SNPs) IVS3 + nt460 A > G, IVS3 + nt437 T > C and E197G A > G are linked to hypercholesterolemia in 200 T2D Italian families.

Methods

We characterized 200 Italian families for the presence or absence of hypercholesterolemia, defined by LDL levels ≥ 100 mg/dl in drug-naïve patients and/or a diagnosis of hypercholesterolemia in a patient treated with statin medication. The phenotypes were described as unknown in all cases in which the diagnosis was unclear or the data were missing. We tested the 200 Italian families for evidence of linkage of the PSMD9 SNPs with hypercholesterolemia. The non-parametric linkage analysis was performed for the qualitative phenotype by using the Merlin software; the LOD score and corresponding P-value were calculated. For the significant linkage score, 1000 replicates were performed to calculate the empirical P-value.
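
An empirical P-value from simulated replicates, as mentioned above, is commonly estimated as the share of replicate scores that reach the observed score. The sketch below uses the add-one convention, which keeps the estimate from being exactly zero; this is one common choice and an assumption here, not necessarily the calculation performed by Merlin, and the function name is hypothetical.

```python
def empirical_pvalue(observed, replicate_scores):
    """Add-one estimate of P(score >= observed) from simulated replicates."""
    exceed = sum(1 for s in replicate_scores if s >= observed)
    return (exceed + 1) / (len(replicate_scores) + 1)

# Toy replicate scores, for illustration only.
p_emp = empirical_pvalue(3.0, [0.0, 1.0, 2.0, 3.0, 4.0])
```

With 1000 replicates and no replicate reaching the observed score, this estimator gives 1/1001, which is the smallest P-value that number of replicates can resolve.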

Results

The PSMD9 gene SNPs studied show linkage to hypercholesterolemia. The results are not due to random chance.

Conclusions

PSMD9 should be tested in all populations reporting linkage to hypercholesterolemia within the chromosome 12q24 locus. The impact of this gene on hypercholesterolemia and contribution to cardio- and cerebrovascular events may be high.

17.

Background

Partitioning the human immunoglobulin variable region into variable (V), diversity (D), and joining (J) segments is a common sequence analysis step. We introduce a novel approximate dynamic programming method that uses conserved immunoglobulin gene motifs to improve performance of aligning V-segments of rearranged immunoglobulin (Ig) genes. Our new algorithm enhances the former JOINSOLVER algorithm by processing sequences with insertions and/or deletions (indels) and improves the efficiency for large datasets provided by high throughput sequencing.

Results

In our simulations, which include rearrangements with indels, the V-matching success rate improved from 61% for partial alignments of sequences with indels in the original algorithm to over 99% in the approximate algorithm. An improvement in the alignment of human VDJ rearrangements over the initial JOINSOLVER algorithm was also seen when compared to the Stanford.S22 human Ig dataset with an online VDJ partitioning software evaluation tool.

Conclusions

HTJoinSolver can rapidly identify V- and J-segments with indels to high accuracy for mutated sequences when the mutation probability is around 30% and 20% respectively. The D-segment is much harder to fit even at 20% mutation probability. For all segments, the probability of correctly matching V, D, and J increases with our alignment score.

18.

Background

Small RNAs are important regulators of genome function, yet their prediction in genomes is still a major computational challenge. Statistical analyses of pre-miRNA sequences indicated that their 2D structure tends to have a minimal free energy (MFE) significantly lower than MFE values of equivalently randomized sequences with the same nucleotide composition, in contrast to other classes of non-coding RNA. The computation of many MFEs is, however, too intensive to allow for genome-wide screenings.

Results

Using a local grid infrastructure, MFE distributions of random sequences were pre-calculated on a large scale. These distributions follow a normal distribution and can be used to determine the MFE distribution for any given sequence composition by interpolation. It allows on-the-fly calculation of the normal distribution for any candidate sequence composition.
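
The idea of looking up a normal distribution by interpolation and turning an observed MFE into a P-value can be sketched as below. The table values, the use of GC fraction as the only composition variable, and the function names are assumptions for illustration; the actual precomputed grid presumably covers the full nucleotide composition.

```python
from statistics import NormalDist

# Hypothetical precomputed grid: (GC fraction, mean MFE, sd MFE) in kcal/mol.
TABLE = [(0.3, -18.0, 4.0), (0.5, -25.0, 4.5), (0.7, -34.0, 5.0)]

def interpolate_params(gc):
    """Linearly interpolate the mean and sd of the random-sequence MFE."""
    for (x0, m0, s0), (x1, m1, s1) in zip(TABLE, TABLE[1:]):
        if x0 <= gc <= x1:
            t = (gc - x0) / (x1 - x0)
            return m0 + t * (m1 - m0), s0 + t * (s1 - s0)
    raise ValueError("GC fraction outside the precomputed grid")

def mfe_pvalue(mfe, gc):
    """Lower-tail probability that a random sequence with this composition
    folds with an MFE at least as low as `mfe`."""
    mean, sd = interpolate_params(gc)
    return NormalDist(mean, sd).cdf(mfe)

p_candidate = mfe_pvalue(-40.0, 0.5)
```

The lower tail is used because a more negative MFE indicates a more stable structure, so candidates with small P-values are the ones that fold better than composition-matched random sequences.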

Conclusion

The speedup achieved makes genome-wide screening with this characteristic of a pre-miRNA sequence practical. Although this property alone is not sufficiently discriminative to distinguish miRNAs from other sequences, the MFE-based P-value should be added to the parameters used in the selection of potential miRNA candidates for experimental verification.

19.

Background  

The purpose of this study was to investigate the pattern of distribution of risk factors for stroke and the 10-year probability of stroke by the Framingham risk score in hypertensive patients visiting private clinics vs. tertiary hospitals.

20.

Background  

Detecting remote homologies by direct comparison of protein sequences remains a challenging task. We had previously developed a similarity score between sequences, called a local alignment kernel, that exhibits good performance for this task in combination with a support vector machine. The local alignment kernel depends on an amino acid substitution matrix. Since commonly used BLOSUM or PAM matrices for scoring amino acid matches have been optimized to be used in combination with the Smith-Waterman algorithm, the matrices optimal for the local alignment kernel can be different.
