Similar literature
 Found 20 similar articles (search time: 0 ms)
1.

Background  

We are interested in the problem of predicting secondary structure for small sets of homologous RNAs, by incorporating limited comparative sequence information into an RNA folding model. The Sankoff algorithm for simultaneous RNA folding and alignment is a basis for approaches to this problem. There are two open problems in applying a Sankoff algorithm: development of a good unified scoring system for alignment and folding and development of practical heuristics for dealing with the computational complexity of the algorithm.

2.
Protein structure alignment using a genetic algorithm   Total citations: 3, self-citations: 0, citations by others: 3
Szustakowski JD, Weng Z. Proteins, 2000, 38(4): 428-440
We have developed a novel, fully automatic method for aligning the three-dimensional structures of two proteins. The basic approach is to first align the proteins' secondary structure elements and then extend the alignment to include any equivalent residues found in loops or turns. The initial secondary structure element alignment is determined by a genetic algorithm. After refinement of the secondary structure element alignment, the protein backbones are superposed and a search is performed to identify any additional equivalent residues in a convergent process. Alignments are evaluated using intramolecular distance matrices. Alignments can be performed with or without sequential connectivity constraints. We have applied the method to proteins from several well-studied families: globins, immunoglobulins, serine proteases, dihydrofolate reductases, and DNA methyltransferases. Agreement with manually curated alignments is excellent. A web-based server and additional supporting information are available at http://engpub1.bu.edu/-josephs.

3.
MOTIVATION: Multiple sequence alignments (MSAs) are at the heart of bioinformatics analysis. Recently, a number of multiple protein sequence alignment benchmarks (i.e. BAliBASE, OXBench, PREFAB and SMART) have been released to evaluate new and existing MSA applications. These databases have been well received by researchers and help to quantitatively evaluate MSA programs on protein sequences. Unfortunately, analogous DNA benchmarks are not available, making evaluation of MSA programs difficult for DNA sequences. RESULTS: This work presents the first known multiple DNA sequence alignment benchmarks that are (1) composed of protein-coding portions of DNA and (2) based on biological features such as the tertiary structure of encoded proteins. These reference DNA databases contain a total of 3545 alignments, comprising 68 581 sequences. Two versions of the database are available: mdsa_100s and mdsa_all. The mdsa_100s version contains the alignments of the data sets for which TBLASTN found 100% sequence identity for each sequence. The mdsa_all version includes all hits with an E-value score above the threshold of 0.001. A primary use of these databases is to benchmark the performance of MSA applications on DNA data sets. The first such case study is included in the Supplementary Material.

4.
A new set of DNA base-nucleic acid codes and their hypercomplex number representation have been introduced to take the probability of each nucleotide fully into account. A new scoring system has been proposed to suit the hypercomplex number representation of the DNA base-nucleic acid codes and has been combined with dot matrix analysis and various sequence alignment algorithms. The problem of DNA sequence alignment can then be processed in much the same way as pairwise alignment of protein sequences.

5.
A tool called Locfind for the sequence-based prediction of the localization of eukaryotic proteins is introduced. It is based on bidirectional recurrent neural networks trained to read the amino acid sequence sequentially and produce localization information along the sequence. Systematic variation of the network architecture in combination with an efficient learning algorithm led to 91% correct localization predictions for novel proteins in fivefold cross-validation. The data and evaluation procedure are the same as for the non-plant part of the widely used TargetP tool by Emanuelsson et al. The Locfind system is available on the WWW for predictions (http://www.stepc.gr/~synaptic/locfind.html).

6.
7.
SRP-RNA sequence alignment and secondary structure.   Total citations: 14, self-citations: 21, citations by others: 14
The secondary structures of the RNAs from the signal recognition particle, termed SRP-RNA, were derived by comparative analyses of an alignment of 39 sequences. The models are minimal in that only base pairs are included for which there is comparative evidence. The structures represent refinements of earlier versions and include a new short helix.

8.
  1. Analysis of eukaryotic sequences reveals recurring trends in upstream regions. Oligomers composed of (G/C)n and (A/T)m blocks are preferentially flanked by (G/C)2 doublets on their 3′ rather than on their 5′ ends, that is, (G/C)n(A/T)m(G/C)2 > (G/C)n+2(A/T)m.
  2. These trends are stronger for larger n and smaller m. Additional trends are outlined below.
  3. The trends are correlated with DNA structural parameters, in particular with twist and roll angles.
  4. Generally, the trends hold if the base pair step joining the 5′ (G/C)2 doublet to the (G/C)n(A/T)m oligomer is not undertwisted and is not strongly rolled into the major groove.
  5. Other DNA parameters crucial for DNA-protein interactions are discussed as well.

9.
Prediction of DNA structure from sequence: a build-up technique   Total citations: 2, self-citations: 0, citations by others: 2
A build-up technique has been devised that permits prediction of DNA structure from sequence. No experimental information is employed other than the force field parameters. This strategy for dealing with the multiple minimum problem requires a supercomputer to make the necessary global searches. The number of energy minimization trials that were made for each of the 16 deoxydinucleoside monophosphate conformational building blocks of DNA was 1944. As a test case, the minimum energy conformations of d(GpC) and d(CpG) to 5.5 kcal/mole were then combined to generate energy-minimized structures for d(CpGpC). The number of trials that were made for d(CpGpC) was 3752. Minima for this single-stranded trimer to 15 kcal/mole were then employed to search for minimum energy conformations of the duplex d(CpGpC).d(GpCpG). The number of starting conformations that were utilized at this stage was 1514. The lowest energy duplex had a Z-II-DNA conformation, followed by a B-DNA form at 1.2 kcal/mole. The A- and Z-I-forms as well as many novel Watson-Crick base-paired structures were found at higher energy. Finally, energy-minimized structures of d(CG)6.d(CG)6 in Z-II and B-DNA conformations were computed using torsion angles from the analogous duplex trimer minima.

10.

Background  

DNA sequence comparison is based on optimal local alignment of two sequences using a similarity score. However, some new DNA sequencing technologies do not directly measure the base sequence, but rather an encoded form, such as the two-base encoding considered here. In order to compare such data to a reference sequence, the data must be decoded into sequence. The decoding is deterministic, but the possibility of measurement errors requires searching among all possible error modes and resulting alignments to achieve an optimal balance of fewer errors versus greater sequence similarity.
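
To make the deterministic decoding concrete, here is a minimal sketch. It assumes a SOLiD-style colour-space scheme in which each measured colour is the XOR of the 2-bit codes of two adjacent bases; the abstract does not specify the encoding, and the names `encode` and `decode` are hypothetical.

```python
# Assumed 2-bit base codes (SOLiD-style colour space, not taken from the abstract).
BASE_TO_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
CODE_TO_BASE = "ACGT"


def encode(seq: str) -> list[int]:
    """Return one colour per adjacent base pair: the XOR of the two base codes."""
    codes = [BASE_TO_CODE[base] for base in seq]
    return [left ^ right for left, right in zip(codes, codes[1:])]


def decode(first_base: str, colours: list[int]) -> str:
    """Rebuild the base sequence from a known first base and the colour calls."""
    code = BASE_TO_CODE[first_base]
    bases = [first_base]
    for colour in colours:
        code ^= colour  # next base code = previous base code XOR colour
        bases.append(CODE_TO_BASE[code])
    return "".join(bases)


if __name__ == "__main__":
    colours = encode("ACGT")
    print(colours, decode("A", colours))
    # A single miscalled colour changes every base decoded after it.
    print(decode("A", [1, 2, 1]))
```

Under this assumed scheme, one wrong colour call shifts all downstream bases, which is why naive decoding followed by ordinary alignment is fragile and error modes have to be searched jointly with the alignment.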

11.

Background  

Growing interest in biological pathways has called for new statistical methods for modeling and testing a genetic pathway effect on a health outcome. The fact that genes within a pathway tend to interact with each other and relate to the outcome in a complicated way makes nonparametric methods more desirable. The kernel machine method provides a convenient, powerful and unified method for multi-dimensional parametric and nonparametric modeling of the pathway effect.

12.
13.
14.
MicroRNA identification based on sequence and structure alignment   Total citations: 20, self-citations: 0, citations by others: 20
MOTIVATION: MicroRNAs (miRNA) are approximately 22 nt long non-coding RNAs that are derived from larger hairpin RNA precursors and play important regulatory roles in both animals and plants. The short length of the miRNA sequences and relatively low conservation of pre-miRNA sequences restrict the conventional sequence-alignment-based methods to finding only relatively close homologs. On the other hand, it has been reported that miRNA genes are more conserved in secondary structure than in primary sequence. Therefore, secondary structural features should be more fully exploited in the homologue search for new miRNA genes. RESULTS: In this paper, we present a novel genome-wide computational approach to detect miRNAs in animals based on both sequence and structure alignment. Experiments show that this approach has higher sensitivity than, and comparable specificity to, other reported homologue searching methods. We applied this method to Anopheles gambiae and detected 59 new miRNA genes. AVAILABILITY: This program is available at http://bioinfo.au.tsinghua.edu.cn/miralign. SUPPLEMENTARY INFORMATION: Supplementary information is available at http://bioinfo.au.tsinghua.edu.cn/miralign/supplementary.htm.

15.

Background

Privacy protection is an important issue in medical informatics, and differential privacy is a state-of-the-art framework for data privacy research. Differential privacy offers provable privacy against attackers who have auxiliary information, and can be applied to data mining models (for example, logistic regression). However, differentially private methods sometimes introduce too much noise and make outputs less useful. Given available public data in medical research (e.g. from patients who sign open-consent agreements), we can design algorithms that use both public and private data sets to decrease the amount of noise that is introduced.

Methodology

In this paper, we modify the update step in the Newton-Raphson method to propose a differentially private distributed logistic regression model based on both public and private data.
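
For orientation, the textbook (non-private) Newton-Raphson update for logistic regression is written out below. The abstract does not say how the update step is modified or where noise is added, so this is only the standard starting point; the symbols X (design matrix), y (binary outcomes), p (fitted probabilities), W and β are notation assumed here rather than taken from the paper.

```latex
\beta^{(t+1)} = \beta^{(t)} + \left( X^{\top} W^{(t)} X \right)^{-1} X^{\top} \left( y - p^{(t)} \right),
\qquad
p^{(t)}_i = \frac{1}{1 + \exp\!\left( -x_i^{\top} \beta^{(t)} \right)},
\qquad
W^{(t)} = \operatorname{diag}\!\left( p^{(t)}_i \bigl( 1 - p^{(t)}_i \bigr) \right)
```

Both X^T W X and X^T (y - p) are sums over records, which is the property that generally lets such an update be assembled from per-site contributions in a distributed setting.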

Experiments and results

We evaluate our algorithm on three different data sets and show its advantage, under various scenarios, over (1) a logistic regression model based solely on public data and (2) a differentially private distributed logistic regression model based on private data.

Conclusion

Logistic regression models built with our new algorithm based on both private and public datasets demonstrate better utility than models trained on private or public datasets alone, without sacrificing the rigorous privacy guarantee.

16.

Background  

For successful protein structure prediction by comparative modeling, in addition to identifying a good template protein with known structure, obtaining an accurate sequence alignment between a query protein and a template protein is critical. It is known that alignment accuracy can vary significantly depending on the choice of alignment parameters such as the gap opening penalty and gap extension penalty. Because the accuracy of a sequence alignment is typically measured by comparing it with the corresponding structure alignment, there is no good way of evaluating alignment accuracy without knowing the structure of the query protein, which is obviously not available at the time of structure prediction. Moreover, there is no universal alignment parameter option that would always yield the optimal alignment.
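
The sensitivity to gap parameters can be illustrated with a small sketch that scores two candidate alignments of the same pair of sequences under an affine gap penalty. The function name `alignment_score`, the match and mismatch scores, the penalty values, and the toy sequences are all assumptions chosen for illustration; none comes from the abstract.

```python
def alignment_score(row_a: str, row_b: str, match: float = 1.0,
                    mismatch: float = -1.0, gap_open: float = -10.0,
                    gap_extend: float = -0.5) -> float:
    """Score a gapped pairwise alignment under an affine gap penalty.

    Convention assumed here: the first column of a gap run costs gap_open,
    and every further column of the same run costs gap_extend.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    score = 0.0
    prev_gap = None  # row that carried a gap in the previous column, if any
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            gap_row = "a" if a == "-" else "b"
            score += gap_extend if gap_row == prev_gap else gap_open
            prev_gap = gap_row
        else:
            score += match if a == b else mismatch
            prev_gap = None
    return score


if __name__ == "__main__":
    query = "ACGTGA"
    two_short_gaps = "A-G-GA"  # four matches, two separate one-column gaps
    one_long_gap = "A--GGA"    # three matches, one mismatch, one two-column gap
    for gap_open, gap_extend in [(-10.0, -0.5), (-1.0, -1.0)]:
        s1 = alignment_score(query, two_short_gaps, gap_open=gap_open, gap_extend=gap_extend)
        s2 = alignment_score(query, one_long_gap, gap_open=gap_open, gap_extend=gap_extend)
        print(gap_open, gap_extend, s1, s2)
```

With the steep opening penalty the single long gap scores higher, while with equal opening and extension penalties the two short gaps win, so the preferred alignment flips with the parameter choice alone.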

17.
Whole-genome sequencing harbors unprecedented potential for characterization of individual and family genetic variation. Here, we develop a novel synthetic human reference sequence that is ethnically concordant and use it for the analysis of genomes from a nuclear family with history of familial thrombophilia. We demonstrate that the use of the major allele reference sequence results in improved genotype accuracy for disease-associated variant loci. We infer recombination sites to the lowest median resolution demonstrated to date (< 1,000 base pairs). We use family inheritance state analysis to control sequencing error and inform family-wide haplotype phasing, allowing quantification of genome-wide compound heterozygosity. We develop a sequence-based methodology for Human Leukocyte Antigen typing that contributes to disease risk prediction. Finally, we advance methods for analysis of disease and pharmacogenomic risk across the coding and non-coding genome that incorporate phased variant data. We show these methods are capable of identifying multigenic risk for inherited thrombophilia and informing the appropriate pharmacological therapy. These ethnicity-specific, family-based approaches to interpretation of genetic variation are emblematic of the next generation of genetic risk assessment using whole-genome sequencing.

18.
MOTIVATION: Given that association and dissociation of protein molecules are crucial in most biological processes, several in silico methods have recently been developed to predict protein-protein interactions. Structural evidence has shown that interacting pairs of close homologs (interologs) usually interact physically in the same way. Moreover, conservation of an interaction depends on the conservation of the interface between interacting partners. In this article we make use of both structural similarities among domains of known interacting proteins found in the Database of Interacting Proteins (DIP) and conservation of pairs of sequence patches involved in protein-protein interfaces to predict putative protein interaction pairs. RESULTS: We have obtained a large number of putative protein-protein interactions (approximately 130,000). The list is independent of other techniques, both experimental and theoretical. We separated the list of predictions into three sets according to their relationship with known interacting proteins found in DIP. For each set, only a small fraction of the predicted protein pairs could be independently validated by cross checking with the Human Protein Reference Database (HPRD). The fraction of validated protein pairs was always larger than that expected by using random protein pairs. Furthermore, a correlation map of interacting protein pairs was calculated with respect to molecular function, as defined in the Gene Ontology database. It shows good consistency of the predicted interactions with data in the HPRD database. The intersection between the lists of interactions of other methods and ours produces a network of potentially high-confidence interactions.

19.
We have developed a phylogeny-aware progressive alignment method that recognizes insertions and deletions as distinct evolutionary events and thus avoids systematic errors created by traditional alignment methods. We now extend this method to simultaneously model regional heterogeneity and evolution. This novel method can be flexibly adapted to alignment of nucleotide or amino acid sequences evolving under processes that vary over genomic regions and, being fully probabilistic, provides an estimate of regional heterogeneity of the evolutionary process along the alignment and a measure of local reliability of the solution. Furthermore, the evolutionary modelling of the substitution process permits adjusting the sensitivity and specificity of the alignment and, if high specificity is desired, leaving sequences unaligned when their divergence is beyond meaningful detection of homology.

20.
Percentage is widely used to describe different results in food microbiology, e.g., probability of microbial growth, percent inactivated, and percent of positive samples. Four sets of percentage data, percent-growth-positive, germination extent, probability for one cell to grow, and maximum fraction of positive tubes, were obtained from our own experiments and the literature. These data were modeled using linear and logistic regression. Five methods were used to compare the goodness of fit of the two models: percentage of predictions closer to observations, range of the differences (predicted value minus observed value), deviation of the model, linear regression between the observed and predicted values, and bias and accuracy factors. Logistic regression was a better predictor of at least 78% of the observations in all four data sets. In all cases, the deviation of logistic models was much smaller. The linear correlation between observations and logistic predictions was always stronger. Validation (accomplished using part of one data set) also demonstrated that the logistic model was more accurate in predicting new data points. Bias and accuracy factors were found to be less informative when evaluating models developed for percentage data, since neither of these indices can compare predictions at zero. Model simplification for the logistic model was demonstrated with one data set. The simplified model was as powerful in making predictions as the full linear model, and it also gave clearer insight in determining the key experimental factors.
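
The limitation at zero follows from how bias and accuracy factors are usually computed. The sketch below assumes the definitions commonly used in predictive microbiology, based on the mean of log10(predicted/observed); the abstract does not state its formulas, and the function name `bias_accuracy_factors` is hypothetical.

```python
import math


def bias_accuracy_factors(observed, predicted):
    """Return (bias factor, accuracy factor) from log10(predicted / observed).

    Assumed definitions (not given in the abstract):
      bias factor     = 10 ** mean(log10(predicted / observed))
      accuracy factor = 10 ** mean(|log10(predicted / observed)|)
    """
    if len(observed) != len(predicted) or not observed:
        raise ValueError("need two non-empty sequences of equal length")
    if any(value <= 0 for value in list(observed) + list(predicted)):
        # log10(p / o) is undefined when either value is zero, so percentage
        # data containing 0% cannot be scored with these indices.
        raise ValueError("factors are undefined when a value is zero or negative")
    logs = [math.log10(p / o) for o, p in zip(observed, predicted)]
    n = len(logs)
    bias = 10 ** (sum(logs) / n)
    accuracy = 10 ** (sum(abs(x) for x in logs) / n)
    return bias, accuracy


if __name__ == "__main__":
    print(bias_accuracy_factors([10.0, 20.0], [20.0, 10.0]))
```

For observed values of 10 and 20 with predictions of 20 and 10, the over- and under-prediction cancel in the bias factor (about 1) while the accuracy factor is about 2; any observation or prediction of exactly 0% has to be excluded before either index can be computed.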


Copyright©北京勤云科技发展有限公司  京ICP备09084417号