Similar literature
 20 similar articles found (search time: 15 ms)
1.
We derive an expectation maximization algorithm for maximum-likelihood training of substitution rate matrices from multiple sequence alignments. The algorithm can be used to train hidden substitution models, where the structural context of a residue is treated as a hidden variable that can evolve over time. We used the algorithm to train hidden substitution matrices on protein alignments in the Pfam database. When the accuracy of multiple alignment algorithms is measured with reference to BAliBASE (a database of structural reference alignments), our substitution matrices consistently outperform the PAM series, with the improvement steadily increasing as up to four hidden site classes are added. We discuss several applications of this algorithm in bioinformatics.

2.
Clustering time-course gene expression data (gene trajectories) is an important step towards solving the complex problem of gene regulatory network modeling and discovery, as it significantly reduces the dimensionality of the gene space required for analysis. Traditional clustering methods that perform hill-climbing from randomly initialized cluster centers are prone to produce inconsistent and sub-optimal cluster solutions over different runs. This paper introduces a novel method that hybridizes a genetic algorithm (GA) with the expectation maximization (EM) algorithm for clustering gene trajectories with mixtures of multiple linear regression models (MLRs), with the objective of improving the global optimality and consistency of the clustering performance. The proposed method is applied to cluster human fibroblast and yeast time-course gene expression data based on their trajectory similarities. It outperforms the standard EM method significantly in terms of both clustering accuracy and consistency. The biological implications of the improved clustering performance are demonstrated.

3.
We report an automated procedure for high-throughput NMR resonance assignment for a protein of known structure, or of a homologous structure. Our algorithm performs Nuclear Vector Replacement (NVR) by Expectation/Maximization (EM) to compute assignments. NVR correlates experimentally measured NH residual dipolar couplings (RDCs) and chemical shifts to a given a priori whole-protein 3D structural model. The algorithm requires only uniform 15N-labelling of the protein, and processes unassigned HN-15N HSQC spectra, HN-15N RDCs, and sparse HN-HN NOEs (dNNs). NVR runs in minutes and efficiently assigns the (HN, 15N) backbone resonances as well as the sparse dNNs from the 3D 15N-NOESY spectrum, in O(n^3) time. The algorithm is demonstrated on NMR data from a 76-residue protein, human ubiquitin, matched to four structures, including one mutant (homolog), determined either by X-ray crystallography or by different NMR experiments (without RDCs). NVR achieves an average assignment accuracy of over 99%. We further demonstrate the feasibility of our algorithm for different and larger proteins, using different combinations of real and simulated NMR data for hen lysozyme (129 residues) and streptococcal protein G (56 residues), matched to a variety of 3D structural models.

4.
We conducted a comprehensive study of copy number variants (CNVs) well-tagged by SNPs (r^2 ≥ 0.8) by analyzing their effect on gene expression and their association with disease susceptibility and other complex human traits. We tested whether these CNVs were more likely to be functional than frequency-matched SNPs as trait-associated loci or as expression quantitative trait loci (eQTLs) influencing phenotype by altering gene regulation. Our study found that CNV-tagging SNPs are significantly enriched for cis eQTLs; furthermore, we observed that trait associations from the NHGRI catalog show an overrepresentation of SNPs tagging CNVs relative to frequency-matched SNPs. We found that these SNPs tagging CNVs are more likely to affect multiple expression traits than frequency-matched variants. Given these findings on the functional relevance of CNVs, we created an online resource of expression-associated CNVs (eCNVs) using the most comprehensive population-based map of CNVs to inform future studies of complex traits. Although previous studies of common CNVs that can be typed on existing platforms and/or interrogated by SNPs in genome-wide association studies concluded that such CNVs appear unlikely to have a major role in the genetic basis of several complex diseases examined, our findings indicate that it would be premature to dismiss the possibility that even common CNVs may contribute to complex phenotypes and at least some common diseases.

5.

Background  

Hidden Markov models are employed by numerous bioinformatics programs in use today. Applications range from comparative gene prediction to time-series analyses of micro-array data. The parameters of the underlying models need to be adjusted for specific data sets, for example the genome of a particular species, in order to maximize the prediction accuracy. Computationally efficient algorithms for parameter training are thus key to maximizing the usability of a wide range of bioinformatics applications.

6.
Statistical methodology for the identification and characterization of protein binding sites in a set of unaligned DNA fragments is presented. Each sequence must contain at least one common site. No alignment of the sites is required. Instead, the uncertainty in the location of the sites is handled by employing the missing information principle to develop an "expectation maximization" (EM) algorithm. This approach allows for the simultaneous identification of the sites and characterization of the binding motifs. The reliability of the algorithm increases with the number of fragments, but the computations increase only linearly. The method is illustrated with an example, using known cyclic adenosine monophosphate receptor protein (CRP) binding sites. The final motif is utilized in a search for undiscovered CRP binding sites.

7.
SEMPHY, an evolutionary tree reconstruction algorithm based on structural expectation maximization (SEM), is efficient but suffers from a local optimality problem. To improve SEMPHY, a new algorithm named HSEMPHY, based on the homotopy continuation principle, is proposed in the present study for reconstructing evolutionary trees. The HSEMPHY algorithm computes the conditional probability of the hidden variables in the structure through the maximum entropy principle. It can reduce the influence of the initial value on the final solution by simulating the process of the homotopy principle and by introducing the homotopy parameter beta. HSEMPHY is tested on real datasets and a simulated dataset and compared with SEMPHY and the two most popular reconstruction approaches, PHYML and RAXML. Experimental results show that HSEMPHY is at least as good as PHYML and RAXML and is very robust to poor starting trees.

8.
9.
MOTIVATION: Sequences for new proteins are being determined at a rapid rate, as a result of the Human Genome Project, and related genome research. The ability to predict the three-dimensional structure of proteins from sequence alone would be useful in discovering and understanding their function. Threading, or fold recognition, aims to predict the tertiary structure of a protein by aligning its amino acid sequence with a large number of structures, and finding the best fit. This approach depends on obtaining good performance from both the scoring function, which simulates the free energy for given trial alignments, and the threading algorithm, which searches for the lowest-score alignment. It appears that current scoring functions and threading algorithms need improvement. RESULTS: This paper presents a new threading algorithm. Numerical tests demonstrate that it is more powerful than two popular approximate algorithms, and much faster than exact methods.

10.
11.
We propose a stochastic learning algorithm for multilayer perceptrons of linear-threshold function units, which theoretically converges with probability one and experimentally exhibits a 100% convergence rate and remarkable speed on parity and classification problems with typical generalization accuracy. For learning the n-bit parity function with n hidden units, the algorithm converged on all the trials we tested (n = 2 to 12) after 5.8 × 4.1^n presentations, taking 0.23 × 4.0^(n-6) seconds on a 533 MHz Alpha 21164A chip on average, which is five to ten times faster than the Levenberg-Marquardt algorithm with restarts. For a medium-size classification problem known as Thyroid in the UCI repository, the algorithm is faster than, and comparable in generalization accuracy to, the standard backpropagation and Levenberg-Marquardt algorithms.
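As a rough worked example of the scaling quoted in this abstract, the sketch below evaluates the two averages for a given n. It assumes that the flattened expressions "4.1(n)" and "4.0(n-6)" are exponents, i.e. 5.8 × 4.1^n presentations and 0.23 × 4.0^(n-6) seconds; the function names are hypothetical and do not come from the paper.

```python
def expected_presentations(n: int) -> float:
    """Average number of pattern presentations, assuming 5.8 * 4.1**n."""
    return 5.8 * 4.1 ** n


def expected_seconds(n: int) -> float:
    """Average run time in seconds, assuming 0.23 * 4.0**(n - 6)."""
    return 0.23 * 4.0 ** (n - 6)


if __name__ == "__main__":
    # Tabulate the assumed scaling over the range reported in the abstract.
    for n in range(2, 13):
        print(n, expected_presentations(n), expected_seconds(n))
```

Under this reading, n = 6 gives 0.23 seconds and n = 12 gives 0.23 × 4096 = 942.08 seconds, so the cost grows by roughly a factor of four per added input bit.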

12.
MOTIVATION: Pair-wise alignment of protein sequences and local similarity searches produce many false positives because of compositionally biased regions, also called low-complexity regions (LCRs), of amino acid residues. Masking and filtering such regions significantly improves the reliability of homology searches and, consequently, functional predictions. Most of the available algorithms are based on a statistical approach. We wished to investigate the structural properties of LCRs in biological sequences and develop an algorithm for filtering them. RESULTS: We present an algorithm for detecting and masking LCRs in protein sequences to improve the quality of database searches. We developed the algorithm based on the complexity analysis of subsequences delimited by a pair of identical, repeating subsequences. Given a protein sequence, the algorithm first computes the suffix tree of the sequence. It then collects repeating subsequences from the tree. Finally, the algorithm iteratively tests whether each subsequence delimited by a pair of repeating subsequences meets a given criterion. Test results with 1000 proteins from 20 families in Pfam show that the repeating subsequences are a good indicator of low-complexity regions, and that the algorithm based on such structural information competes strongly with others. AVAILABILITY: http://bioinfo.knu.ac.kr/research/CARD/ CONTACT: swshin@bioinfo.knu.ac.kr
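The three steps named in this abstract (collect repeating subsequences, pair them up, test the delimited region) can be illustrated with a minimal sketch. It is not the authors' implementation: a dictionary of fixed-length k-mers stands in for the suffix tree, and the test criterion is assumed here to be a Shannon-entropy threshold, since the abstract does not say which criterion is used. All function names and parameter values are hypothetical.

```python
import math
from collections import defaultdict


def shannon_entropy(segment: str) -> float:
    """Shannon entropy (bits per residue) of the residue composition."""
    counts = defaultdict(int)
    for residue in segment:
        counts[residue] += 1
    total = len(segment)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def repeat_positions(sequence: str, k: int) -> dict:
    """Map each k-mer occurring at least twice to its start positions.

    Stand-in for the suffix-tree step: simpler, but it only finds
    repeats of one fixed length k.
    """
    positions = defaultdict(list)
    for i in range(len(sequence) - k + 1):
        positions[sequence[i:i + k]].append(i)
    return {kmer: pos for kmer, pos in positions.items() if len(pos) > 1}


def mask_low_complexity(sequence: str, k: int = 2,
                        max_entropy: float = 1.5, max_span: int = 30) -> str:
    """Mask regions delimited by a pair of identical k-mers with 'X'.

    A region runs from the start of one occurrence to the end of the next
    occurrence of the same k-mer. It is masked when it is at most max_span
    residues long and its entropy is at or below max_entropy (assumed
    criterion, not taken from the paper).
    """
    masked = list(sequence)
    for starts in repeat_positions(sequence, k).values():
        for left, right in zip(starts, starts[1:]):
            end = right + k
            if end - left > max_span:
                continue
            if shannon_entropy(sequence[left:end]) <= max_entropy:
                masked[left:end] = "X" * (end - left)
    return "".join(masked)


if __name__ == "__main__":
    # The poly-Q run is replaced by X; the flanking residues are unchanged.
    print(mask_low_complexity("ACDEFGHIKLQQQQQQQQMNPRSTVWY"))
```

A suffix tree would find maximal repeats of every length in linear time; the k-mer dictionary is chosen here only to keep the sketch short.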

13.
Quantitative analysis of mitochondrial DNA (mtDNA) is crucial for proper diagnosis of diseases that are caused by or associated with mtDNA depletion. However, such a quantitative characterization of mtDNA is not a simple procedure and requires several laboratory steps at which potential errors can accumulate. Here, we describe a modified procedure for quantitative human mtDNA analysis. The procedure is based on using two PCR-amplified, fluorescein-labeled DNA probes, complementary to mtDNA (detection probe) and chromosomal 18S rDNA (reference probe), both of similar length. Thus, equal amounts of these probes can be used and, contrary to previously published procedures, no mtDNA purification (apart from total DNA isolation) or 18S rDNA cloning is necessary for probe preparation. Two separate hybridizations (each with one probe) are suggested instead of one hybridization with both probes; this decreases background signals and enables adjustment of the strength of specific signals from both probes, which is useful in the subsequent densitometric analysis after superimposing both images. Using different DNA amounts for reactions, we have shown that the procedure is quantitative over a broad range of sample DNA concentrations. Moreover, we were able to detect mtDNA depletion unambiguously in tissue samples from patients suffering from diseases caused by dysfunction of mtDNA.

14.
15.
An algorithm for the estimation of stochastic processes in a neural system is presented. The process, called the neural system process here, is defined as the continuous stochastic process reflecting the dynamics of a neural system which has some inputs and generates output spike trains. The proposed algorithm first identifies the system parameters and then estimates the neural system process. These procedures are carried out on the basis of the output spike trains, which are regarded as data observed with random omissions through the threshold time function of the neural system. The algorithm is constructed with the well-known Kalman filters and estimates the neural system process in cooperation with the previously presented algorithm for parameter estimation of the threshold time function (Nakao et al., 1983). The performance of the algorithm is examined by applying it to various spike trains simulated by some artificial models and also to neural spike trains recorded in cat optic tract fibers. The results of these applications are thought to demonstrate, to some extent, the effectiveness of the proposed algorithm. Such attempts, we think, will serve to improve techniques for characterizing and modelling stochastic neural systems.

16.
Three-dimensional reconstruction of large macromolecules like viruses at resolutions below 10 Å requires a large set of projection images. Several automatic and semi-automatic particle detection algorithms have been developed over the years. Here we present a general technique designed to automatically identify the projection images of particles. The method is based on Markov random field modelling of the projected images and involves pre-processing of electron micrographs followed by image segmentation and post-processing. The image is modelled as a coupling of two fields: a Markovian and a non-Markovian one. The Markovian field represents the segmented image. The micrograph is the non-Markovian field. The image segmentation step involves an estimation of coupling parameters and the maximum a posteriori estimate of the realization of the Markovian field, i.e., the segmented image. Unlike most current methods, no bootstrapping with an initial selection of particles is required.

17.
This article discusses the problem of unloading a sequence of boxes from a single conveyor line with a minimum number of moves. The problem under study is efficiently solvable with dynamic programming if the complete sequence of boxes is known in advance. In practice, however, the problem typically occurs in a real-time setting where the boxes are simultaneously placed on and picked from the conveyor line. Moreover, a large part of the sequence is often not visible. As a result, only a part of the sequence is known when deciding which boxes to move next. We develop an online algorithm that evaluates the quality of each possible move with a scenario-based stochastic method. Two versions of the algorithm are analyzed: in one version, the quality of each scenario is measured with an exact method, while a heuristic technique is applied in the second version. We evaluate the performance of the proposed algorithms using extensive computational experiments and establish a simple policy for determining which version to choose for specific problems. Numerical results show that the proposed approach consistently provides high-quality results, and compares favorably with the best known deterministic online algorithms. Indeed, the new approach typically provides results with relative gaps of 1–5% to the optimum, which is about 20–80% lower than those obtained with the best deterministic approach.

18.
This paper develops a stochastic theory for rare and nonrecessive genes in large populations that may have individuals of several age groups present at one time. The analysis is based on an age-dependent branching process due to Goodman. An approximate formula for the probability of extinction of a line of mutant genes, originating in an ancestral heterozygote in age group 0, is calculated. Expressions are also given for the asymptotic rates at which the probabilities of extinction of lines at finite times approach their limiting values. These expressions apply regardless of the age of the ancestral heterozygote or whether the line has a positive probability of surviving indefinitely. Mean frequencies at equilibrium are calculated when there is recurrent mutation to an unfavorable gene.
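To make the notion of an extinction probability concrete, the sketch below uses a much simpler model than the age-dependent process in this abstract: a single-type Galton-Watson branching process with Poisson-distributed offspring, which is an assumption made purely for illustration. In that model the extinction probability q of a line started by one individual is the smallest root of q = exp(m(q - 1)), where m is the mean number of offspring, and it can be found by iterating from q = 0. The function name is hypothetical.

```python
import math


def extinction_probability(mean_offspring: float, iterations: int = 1000) -> float:
    """Extinction probability of a Galton-Watson line with Poisson offspring.

    Iterates q <- exp(m * (q - 1)) starting from q = 0, which converges to
    the smallest non-negative fixed point of the offspring generating
    function. After j iterations, q is the probability that the line is
    extinct by generation j.
    """
    q = 0.0
    for _ in range(iterations):
        q = math.exp(mean_offspring * (q - 1.0))
    return q


if __name__ == "__main__":
    for m in (0.5, 1.5, 2.0):
        print(m, extinction_probability(m))
```

For m below 1 the iteration converges to 1 (the line dies out with certainty), while for m above 1 it converges to a value strictly below 1, so the line has a positive probability of surviving indefinitely. The intermediate iterates correspond to the extinction probabilities at finite times mentioned in the abstract, though only for this simplified model.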

19.
Expression profile analysis of genes provides valuable information concerning the genetic response of cells to stimuli. We describe an adaptation of this technology that can be used to probe for the expression of specific families of genes in microbial species. In our method a combination of sets of oligonucleotide probes representing fingerprint sequences specific to protein families is used to identify the presence and expression levels of family homologs in a microbial cell. We demonstrate computationally, using exemplars, that when the cDNA complement from an organism is sequentially screened against a set of specific motif oligonucleotides, statistically significant information can be obtained concerning the expression of the corresponding genes. This method can be used to identify specific genes and pathways simultaneously in several organisms of interest even in the absence of sequence information from the organisms.

20.
In order to analyze male sterility caused by deletion of SRY and DAZ, we examined the accuracy and cost-effectiveness of a modified primed in situ labeling (PRINS) technique for detection of single-copy genes. Peripheral blood samples were collected from 50 healthy men; medium-term cultured lymphocytes from these samples were suspended in fixative solution and then spread on clean slides. We used four primers homologous to unique regions of the SRY and DAZ regions of the human Y-chromosome and incorporated reagents to increase polymerase specificity and to enhance the hybridization signal. PRINS of SRY and DAZ gave bands at Yp11.3 and Yq11.2, respectively, in all 50 metaphase spreads. The PRINS SRY signals were as distinct as those obtained using traditional fluorescence in situ hybridization (FISH). This new method is ideal for rapid localization of single-copy genes or small DNA segments, making PRINS a cost-effective alternative to FISH. Further enhancement of PRINS to increase its speed of implementation may lead to its wide use in the field of medical genetics.
