Similar literature
 20 similar documents found (search time: 31 ms)
1.
Feature selection for the prediction of translation initiation sites   Cited by: 3 (self-citations: 0, citations by others: 3)
Translation initiation sites (TISs) are important signals in cDNA sequences. In many previous attempts to predict TISs in cDNA sequences, three major factors affected prediction performance: the nature of the cDNA sequence sets, the relevant features selected, and the classification methods used. In this paper, we examine different approaches to selecting and integrating relevant features for TIS prediction. The top selected significant features include features from the position weight matrix and the propensity matrix, the number of C nucleotides in the sequence downstream of the ATG, the number of downstream stop codons, the number of upstream ATGs, and the counts of some amino acids, such as A and D. With the numerical data generated from these features, different classification methods, including decision tree, naive Bayes, and support vector machine, were applied to three independent sequence sets. The identified significant features were found to be biologically meaningful, and the experiments showed promising results.
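The count-based features named in the abstract above can be illustrated with a minimal sketch. The function name `tis_features`, the 99-nucleotide window, the choice to count downstream stop codons in frame with the candidate ATG, and the choice to count upstream ATGs in any frame are all assumptions made for illustration; the abstract does not specify them, and the position weight matrix and propensity matrix features are not covered here.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def tis_features(seq, atg_pos, window=99):
    """Count-based features for the candidate start codon at seq[atg_pos:atg_pos + 3].

    The window size and the in-frame counting rule are illustrative assumptions.
    """
    seq = seq.upper()
    if seq[atg_pos:atg_pos + 3] != "ATG":
        raise ValueError("atg_pos must point at an ATG")
    upstream = seq[max(0, atg_pos - window):atg_pos]
    downstream = seq[atg_pos + 3:atg_pos + 3 + window]
    # Codons read in frame with the candidate ATG.
    codons = [downstream[i:i + 3] for i in range(0, len(downstream) - 2, 3)]
    return {
        "downstream_C": downstream.count("C"),
        "downstream_stops": sum(c in STOP_CODONS for c in codons),
        # Upstream ATGs are counted in any reading frame.
        "upstream_ATGs": sum(
            upstream[i:i + 3] == "ATG" for i in range(len(upstream) - 2)
        ),
    }
```

A dictionary like this could be turned into one row of the numerical data that a decision tree, naive Bayes, or support vector machine classifier is trained on.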

2.
3.
4.
Genomic prediction uses DNA sequences and phenotypes to predict genetic values. In homogeneous populations, theory indicates that the accuracy of genomic prediction increases with sample size. However, differences in allele frequencies and linkage disequilibrium patterns can lead to heterogeneity in SNP effects. In this context, calibrating genomic predictions using a large, potentially heterogeneous, training data set may not lead to optimal prediction accuracy. Some studies have tried to address this sample size/homogeneity trade-off using training set optimization algorithms; however, this approach assumes that a single training data set is optimal for all individuals in the prediction set. Here, we propose an approach that identifies, for each individual in the prediction set, a subset of the training data (i.e., a set of support points) from which predictions are derived. The methodology that we propose is a sparse selection index (SSI) that integrates selection index methodology with sparsity-inducing techniques commonly used for high-dimensional regression. The sparsity of the resulting index is controlled by a regularization parameter (λ); the G-Best Linear Unbiased Predictor (G-BLUP), the prediction method most commonly used in plant and animal breeding, appears as the special case λ = 0. In this study, we present the methodology and demonstrate (using two wheat data sets with phenotypes collected in 10 different environments) that the SSI can achieve significant gains (between 5 and 10%) in prediction accuracy relative to the G-BLUP.
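The role of λ described above can be sketched with an L1-penalized quadratic problem solved by coordinate descent. This is only an illustration under assumptions not stated in the abstract: the index weights b for one target individual are taken to minimize 0.5*b'Pb - g'b + λ*sum|b_j|, where P is assumed to be the training genomic relationship matrix plus a variance-ratio multiple of the identity, and g is assumed to hold the genomic relationships between the target individual and the training individuals. The names `soft_threshold` and `sparse_index_weights` are hypothetical.

```python
def soft_threshold(x, lam):
    """Shrink x toward zero by lam; values within [-lam, lam] become 0."""
    if x > lam:
        return x - lam
    if x < -lam:
        return x + lam
    return 0.0


def sparse_index_weights(P, g, lam, n_iter=500, tol=1e-10):
    """Minimise 0.5*b'Pb - g'b + lam*sum|b_j| by cyclic coordinate descent.

    P: symmetric positive-definite matrix as a list of lists (assumed form).
    g: list of relationships between the target and each training individual.
    """
    n = len(g)
    b = [0.0] * n
    for _ in range(n_iter):
        max_change = 0.0
        for j in range(n):
            # Residual relationship not yet explained by the other weights.
            partial = g[j] - sum(P[j][k] * b[k] for k in range(n) if k != j)
            new = soft_threshold(partial, lam) / P[j][j]
            max_change = max(max_change, abs(new - b[j]))
            b[j] = new
        if max_change < tol:
            break
    return b
```

Under this formulation, λ = 0 gives the unpenalized solution b = P⁻¹g, which uses every training individual, while larger λ sets more weights exactly to zero so that only a subset of support points contributes to each prediction.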

5.
Marker assisted selection using best linear unbiased prediction   Cited by: 1 (self-citations: 0, citations by others: 1)

6.
7.
Lu L  Niu B  Zhao J  Liu L  Lu WC  Liu XJ  Li YX  Cai YD 《Peptides》2009,30(2):359-364
GalNAc-transferase catalyzes the biosynthesis of O-linked oligosaccharides. Its specificity is determined by nine amino acid residues denoted R4, R3, R2, R1, R0, R1', R2', R3', R4'. To predict whether the reducing monosaccharide will be covalently linked to the central residue R0 (Ser or Thr), we propose a new method based on feature selection. A total of 277 nonapeptides from reference [Chou KC. A sequence-coupled vector-projection model for predicting the specificity of GalNAc-transferase. Protein Sci 1995;4:1365-83] were chosen as the training set. Each nonapeptide is represented by hundreds of amino acid properties collected from the Amino Acid Index database (http://www.genome.jp/aaindex) and transformed into a numeric vector with 4554 features. The Maximum Relevance Minimum Redundancy (mRMR) method, combined with Incremental Feature Selection (IFS) and Feature Forward Selection (FFS), is then applied for feature selection. The Nearest Neighbor Algorithm (NNA) is used to build prediction models. The optimal model contains 54 features, and its accuracy under the jackknife cross-validation test reaches 91.34%. Final feature analysis indicates that amino acid residues at position R3' play the most important role in the recognition of GalNAc-transferase specificity, which is confirmed by experiments [Elhammer AP, Poorman RA, Brown E, Maggiora LL, Hoogerheide JG, Kezdy FJ. The specificity of UDP-GalNAc:polypeptide N-acetylgalactosaminyltransferase as inferred from a database of in vivo substrates and from the in vitro glycosylation of proteins and peptides. J Biol Chem 1993;268:10029-38; O'Connell BC, Hagen FK, Tabak LA. The influence of flanking sequence on the O-glycosylation of threonine in vitro. J Biol Chem 1992;267:25010-8; Yoshida A, Suzuki M, Ikenaga H, Takeuchi M. Discovery of the shortest sequence motif for high level mucin-type O-glycosylation. J Biol Chem 1997;272:16884-8].
Our method can be used as a tool for predicting O-glycosylation sites and for investigating GalNAc-transferase specificity, which is useful for designing competitive inhibitors of GalNAc-transferase. The prediction software is available upon request.
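The incremental feature selection step with a nearest-neighbor classifier and jackknife (leave-one-out) testing can be sketched as follows. This is a simplified illustration, not the authors' software: the feature ranking `ranked` is assumed to come from a separate mRMR step that is not implemented here, squared Euclidean distance is an assumed choice of metric, and all function names are hypothetical.

```python
def nearest_neighbor_predict(train_X, train_y, x):
    """Return the label of the training vector closest to x."""
    best = min(
        range(len(train_X)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], x)),
    )
    return train_y[best]


def jackknife_accuracy(X, y, features):
    """Leave-one-out accuracy of 1-NN restricted to the given feature indices."""
    sub = [[row[f] for f in features] for row in X]
    hits = 0
    for i in range(len(sub)):
        # Hold out sample i and classify it from all remaining samples.
        rest_X = sub[:i] + sub[i + 1:]
        rest_y = y[:i] + y[i + 1:]
        hits += nearest_neighbor_predict(rest_X, rest_y, sub[i]) == y[i]
    return hits / len(sub)


def incremental_feature_selection(X, y, ranked):
    """Evaluate the top-1, top-2, ... ranked features and keep the best prefix."""
    best_acc, best_k = -1.0, 0
    for k in range(1, len(ranked) + 1):
        acc = jackknife_accuracy(X, y, ranked[:k])
        if acc > best_acc:
            best_acc, best_k = acc, k
    return ranked[:best_k], best_acc
```

Because each prefix of the ranking is scored independently, the cost grows with the number of ranked features times the square of the number of samples, which is manageable for a few hundred nonapeptides.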

8.
We have investigated the relative merits of two commonly used methods for target site selection for ribozymes: secondary structure prediction (MFold program) and in vitro accessibility assays. A total of eight methylated ribozymes with DNA arms were synthesized and analyzed in a transient co-transfection assay in HeLa cells. Residual expression levels ranging from 23 to 72% were obtained with anti-PSKH1 ribozymes compared to cells transfected with an irrelevant control ribozyme. Ribozyme efficacy depended on both ribozyme concentration and the steady-state expression levels of the target mRNA. Allylated ribozymes against a subset of the target sites generally displayed poorer efficacy than their methylated counterparts. This effect appeared to be influenced by in vivo accessibility of the target site. Ribozymes designed on the basis of either selection method displayed a wide range of efficacies with no significant differences in the average activities of the two groups of ribozymes. While in vitro accessibility assays had limited predictive power, there was a significant correlation between certain features of the predicted secondary structure of the target sequence and the efficacy of the corresponding ribozyme. Specifically, ribozyme efficacy appeared to be positively correlated with the presence of short stem regions and helices of low stability within their target sequences. There were no correlations with predicted free energy or loop length.

9.
On prediction of genetic values in marker-assisted selection.   Cited by: 13 (self-citations: 0, citations by others: 13)
C Lange  J C Whittaker 《Genetics》2001,159(3):1375-1381
We suggest a new approximation for the prediction of genetic values in marker-assisted selection. The new approximation is compared to the standard approach. It is shown that the new approach will often provide substantially better prediction of genetic values; furthermore, the new approximation avoids some of the known statistical problems of the standard approach. The advantages of the new approach are illustrated by a simulation study in which the new approximation outperforms both the standard approach and phenotypic selection.

10.
Summary: An equivalence between restricted best linear unbiased prediction (and thus restricted selection index) and a particular example of a selection model is presented. Specifically, the equivalence is between restricted selection and a model of selection on the residuals of the general mixed linear model. This result illustrates that restricted selection acts by nonrandomly sampling those genes that act pleiotropically in multiple trait genetic models. An expression for a mixed linear model which includes restrictions is also presented.

11.
12.
13.
Selection using single nucleotide polymorphism (SNP) markers scattered throughout the genome, known as genomic selection (GS), is considered. This approach permits simultaneous selection of most quantitative trait loci (QTLs) determining the selected trait. According to expert assessment, GS makes it possible to save 92% of the funds spent on traditional selection and is twice as efficient as the latter.

14.
15.
We present a new method for the rapid identification of amino acid residues that contribute to protein-protein interfaces. Tail-interacting protein of 47 kDa (TIP47) binds Rab9 GTPase and the cytoplasmic domains of mannose 6-phosphate receptors and is required for their transport from endosomes to the Golgi apparatus. Cysteine mutations were incorporated randomly into TIP47 by expression in Escherichia coli cells harboring specific misincorporator tRNAs. We made use of the ability of the native TIP47 protein to protect 48 cysteine probes from chemical modification by iodoacetamide as a means to obtain a surface map of TIP47, revealing the identity of surface-localized, hydrophobic residues that are likely to participate in protein-protein interactions. Direct mutation of predicted interface residues confirmed that the protein had altered binding affinity for the mannose 6-phosphate receptor. TIP47 mutants with enhanced or diminished affinities were also selected by affinity chromatography. These methods were validated in comparison with the protein's crystal structure, and provide a powerful means to predict protein-protein interaction interfaces.

16.
In the study of in silico functional genomics, improving the performance of protein function prediction is the ultimate goal for identifying proteins associated with defined cellular functions. The classical prediction approach is to employ pairwise sequence alignments. However, this method often faces difficulties when no statistically significant homologous sequences are identified. An alternative is to predict protein function from sequence-derived features using machine learning. In this case, the choice of features that can be derived from the sequence is of vital importance to ensure adequate discrimination to predict function. In this paper, we have successfully selected biologically significant features for protein function prediction. This was performed using a new feature selection method (FrankSum) that avoids data distribution assumptions, uses a data-independent measurement (p-value) within the feature, identifies redundancy between features, and uses an appropriate ranking criterion for feature selection. We have shown that classifiers generated from features selected by FrankSum outperform classifiers generated from full feature sets, randomly selected features, and features selected by the Wrapper method. We have also shown that the features are concordant across all species and that top-ranking features are biologically informative. We conclude that feature selection is vital for successful protein function prediction and that FrankSum is one of the feature selection methods that can be applied successfully to such a domain.

17.
18.
19.
20.
Xie Minzhu  Lei Xiaowen  Zhong Jianchen  Ouyang Jianxing  Li Guijing 《BMC Bioinformatics》2022,23(8):1-13
Background

Essential proteins are indispensable to the development and survival of cells. The identification of essential proteins not only helps in understanding the minimal requirements for cell survival, but also has practical significance in disease diagnosis, drug design and medical treatment. With the rapid accumulation of protein–protein interaction (PPI) data, computationally identifying essential proteins from protein–protein interaction networks (PINs) has become increasingly popular. To date, a number of approaches for essential protein identification based on PINs have been developed.

Results

In this paper, we propose a new and effective approach called iMEPP to identify essential proteins from PINs by fusing multiple types of biological data and applying the influence maximization mechanism to the PINs. Concretely, we first integrate PPI data, gene expression data and Gene Ontology to construct weighted PINs, to alleviate the impact of the high false-positive rate in the raw PPI data. Then, we define the influence scores of nodes in PINs using both orthologous data and PIN topological information. Finally, we develop an influence discount algorithm to identify essential proteins based on the influence maximization mechanism.

Conclusions

We applied our method to identify essential proteins from the Saccharomyces cerevisiae PIN. Experiments show that our iMEPP method outperforms the existing methods, which validates its effectiveness and advantage.
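The general idea of a discount-style greedy selection on a weighted network can be sketched as below. This is a generic illustration and not the iMEPP algorithm: the abstract does not give the scoring or discount formulas, so the linear discount rule, the `discount` factor, the tie-breaking by node name, and the function name `discount_select` are all assumptions.

```python
def discount_select(weights, scores, k, discount=0.5):
    """Greedily pick k nodes, reducing neighbours' scores after each pick.

    weights: dict mapping node -> {neighbour: edge weight} (weighted network).
    scores: dict mapping node -> initial influence score.
    The discount rule below is an illustrative assumption.
    """
    remaining = dict(scores)
    chosen = []
    for _ in range(min(k, len(remaining))):
        # Highest current score wins; ties are broken by node name.
        node = max(remaining, key=lambda n: (remaining[n], n))
        chosen.append(node)
        picked_score = remaining.pop(node)
        for nb, w in weights.get(node, {}).items():
            if nb in remaining:
                # Neighbours of a picked node lose part of their score.
                remaining[nb] -= discount * w * picked_score
    return chosen
```

The discount step is what distinguishes this from simply taking the top k scores: a node strongly connected to an already selected node is penalized, so the selected set tends to be spread across the network.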
