Similar literature
 Found 20 similar articles (search time: 15 ms)
1.

2.
Lee K  Kim DW  Na D  Lee KH  Lee D 《Nucleic acids research》2006,34(17):4655-4666
Subcellular localization is one of the key functional characteristics of proteins. An automatic and efficient method for predicting protein subcellular localization is needed for large-scale genome analysis. From a machine learning point of view, a protein localization dataset has several challenging characteristics: it has many classes (there are more than 10 localizations in a cell), it is multi-label (a protein may occur in several different subcellular locations), and it is highly imbalanced (the number of proteins in each localization differs markedly). Although much previous work has addressed the prediction of protein subcellular localization, none of it effectively tackles all of these characteristics at the same time. Thus, a new computational method for protein localization is needed for more reliable outcomes. To address the issue, we present a protein localization predictor based on D-SVDD (PLPD), which can find the likelihood of a specific localization of a protein more easily and more correctly. Moreover, we introduce three measurements for the more precise evaluation of a protein localization predictor. Results on various datasets derived from the experiments of Huh et al. (2003) show that the proposed PLPD method represents a different approach that might play a complementary role to existing methods, such as the Nearest Neighbor method and the discriminant covariant method. Finally, after finding a good boundary for each localization using the 5184 classified proteins as training data, we predicted the localizations of 138 proteins whose subcellular localizations could not be clearly observed in the experiments of Huh et al. (2003).

3.

Background

While there are a large number of bioinformatics datasets for clustering, many of them are incomplete, i.e., some data samples are missing attribute values needed by clustering algorithms. A variety of clustering algorithms have been proposed in recent years, but they are usually limited to clustering complete datasets. Moreover, conventional clustering algorithms cannot achieve a trade-off between the accuracy and efficiency of the clustering process, since many essential parameters are determined by the human user’s experience.

Results

The paper proposes a Multiple Kernel Density Clustering algorithm for Incomplete datasets called MKDCI. The MKDCI algorithm consists of recovering missing attribute values of input data samples, learning an optimally combined kernel for clustering the input dataset, reducing dimensionality with the optimal kernel based on multiple basis kernels, detecting cluster centroids with the Isolation Forests method, assigning clusters with arbitrary shape and visualizing the results.

Conclusions

Extensive experiments on several well-known clustering datasets in the bioinformatics field demonstrate the effectiveness of the proposed MKDCI algorithm. Compared with existing density clustering algorithms and parameter-free clustering algorithms, the proposed MKDCI algorithm tends to automatically produce clusters of better quality on incomplete bioinformatics datasets.

4.
5.

Background  

Microarray experiments have become very popular in life science research. However, if such experiments are only considered independently, the possibilities for analysis and interpretation of many life science phenomena are reduced. The accumulation of publicly available data provides biomedical researchers with a valuable opportunity to either discover new phenomena or improve the interpretation and validation of other phenomena that are partially understood or well known. This can only be achieved by intelligently exploiting this rich mine of information.

6.

Background  

Biological information is commonly used to cluster or classify entities of interest such as genes, conditions, species or samples. However, different sources of data can be used to classify the same set of entities, and methods that allow comparing the performance of two data sources, or determining how well a given classification agrees with another, are frequently needed, especially in the absence of a universally accepted "gold standard" classification.

7.
Maize diseases are a major source of yield loss, but owing to a lack of human expertise and the limitations of traditional image-recognition technology, obtaining satisfactory large-scale identification of maize diseases is difficult. Fortunately, advances in deep learning make it possible to identify diseases automatically. However, this approach still faces issues caused by small sample sizes and complex field backgrounds, which affect the accuracy of disease identification. To address these issues, a deep learning-based method for maize disease identification is proposed in this paper. DenseNet121 was used as the main extraction network, and a multi-dilated-CBAM-DenseNet (MDCDenseNet) model was built by combining the multi-dilated module and the convolutional block attention module (CBAM) attention mechanism. Five models (MDCDenseNet, DenseNet121, ResNet50, MobileNetV2, and NASNetMobile) were compared and tested using three kinds of maize leaf images from the PlantVillage dataset and from field collection at Northeast Agricultural University in China. Furthermore, an auxiliary classifier generative adversarial network (ACGAN) and transfer learning were used to expand the dataset and to pre-train the models for optimal identification results. When tested on field-collected datasets with a complex background, the MDCDenseNet model outperformed the other models, with an accuracy of 98.84%. Therefore, it can provide a viable reference for the identification of maize leaf diseases collected from farmland with a small sample size and complex background.

8.
Shotgun proteomics uses liquid chromatography-tandem mass spectrometry to identify proteins in complex biological samples. We describe an algorithm, called Percolator, for improving the rate of confident peptide identifications from a collection of tandem mass spectra. Percolator uses semi-supervised machine learning to discriminate between correct and decoy spectrum identifications, correctly assigning peptides to 17% more spectra from a tryptic Saccharomyces cerevisiae dataset, and up to 77% more spectra from non-tryptic digests, relative to a fully supervised approach.

9.

Background

Meta-analysis of gene expression microarray datasets presents significant challenges for statistical analysis. We developed and validated a new bioinformatic method for the identification of genes upregulated in subsets of samples of a given tumour type (‘outlier genes’), a hallmark of potential oncogenes.

Methodology

A new statistical method (the gene tissue index, GTI) was developed by modifying and adapting algorithms originally developed for statistical problems in economics. We compared the potential of the GTI to detect outlier genes in meta-datasets with four previously defined statistical methods, COPA, the OS statistic, the t-test and ORT, using simulated data. We demonstrated that the GTI performed equally well to existing methods in a single study simulation. Next, we evaluated the performance of the GTI in the analysis of combined Affymetrix gene expression data from several published studies covering 392 normal samples of tissue from the central nervous system, 74 astrocytomas, and 353 glioblastomas. According to the results, the GTI was better able than most of the previous methods to identify known oncogenic outlier genes. In addition, the GTI identified 29 novel outlier genes in glioblastomas, including TYMS and CDKN2A. The over-expression of these genes was validated in vivo by immunohistochemical staining data from clinical glioblastoma samples. Immunohistochemical data were available for 65% (19 of 29) of these genes, and 17 of these 19 genes (90%) showed a typical outlier staining pattern. Furthermore, raltitrexed, a specific inhibitor of TYMS used in the therapy of tumour types other than glioblastoma, also effectively blocked cell proliferation in glioblastoma cell lines, thus highlighting this outlier gene candidate as a potential therapeutic target.

Conclusions/Significance

Taken together, these results support the GTI as a novel approach to identify potential oncogene outliers and drug targets. The algorithm is implemented in an R package (Text S1).

10.
PIR: a new resource for bioinformatics   Cited by: 3 (self-citations: 0, citations by others: 3)
SUMMARY: The Protein Information Resource (PIR) has greatly expanded its Web site and developed a set of interactive search and analysis tools to facilitate the analysis, annotation, and functional identification of proteins. New search engines have been implemented to combine sequence similarity search results with database annotation information. The new PIR search systems have proved very useful in providing enriched functional annotation of protein sequences, determining protein superfamily-domain relationships, and detecting annotation errors in genomic database archives. AVAILABILITY: http://pir.georgetown.edu/. CONTACT: mcgarvey@nbrf.georgetown.edu

11.
Increased concentrations of Total Phosphorus (TP) in freshwater systems lead to eutrophication and can contribute to a wide range of environmental effects. In the modern era, water quality models have increasingly been used globally for the development of management scenarios with the aim of reducing the eutrophication risk. However, the accuracy of these models is limited by the quality of the boundary conditions forcing data, namely TP concentration datasets. In this study, a novel methodology is proposed to improve machine learning prediction accuracy in the modeling of river TP concentration forced with small input training datasets. These models can then be used to increase the quality and consistency of the TP concentration datasets required to force water quality models. This new methodology relies on the generation of 100 new training datasets from the raw training datasets of input predictors through the implementation of an over/undersampling technique. The modeling approach used in this study was supported by the application of ten machine learning algorithms to estimate the TP concentration values in 22 rivers located in Portugal. The modeling approach also included an input feature importance evaluation, as well as model hyperparameter optimization. In general terms, the Extreme Gradient Boosting (XGBoost) and Support Vector Regressor (SVR) models performed best overall, with the ensemble results recorded for both models working to increase the mean Nash-Sutcliffe efficiency (NSE) across all the areas being studied by 96% (0.01 ± 0.22 to 0.31 ± 0.32) and reduce the mean percentage bias (PBIAS) by 43% (18.47 ± 17.31 to 10.60 ± 17.40). The results of this study suggest that the solution proposed has the potential to significantly improve the modeling of TP concentration in rivers with machine learning methods, as well as providing increased scope for its application to larger training datasets and the prediction of other types of dependent variables. 
Hopefully, the results of this study will further add to the body of information available in this area of research and aid the development of the water management process.
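The two evaluation metrics named in this abstract, the Nash-Sutcliffe efficiency (NSE) and the percentage bias (PBIAS), can be illustrated with a minimal sketch. The formulas below are the commonly used textbook definitions; the abstract does not state which exact variant or sign convention the authors applied, so treat this as an assumption. The function names and the sample values are hypothetical and are not data from the study.

```python
from statistics import mean


def nash_sutcliffe(observed, simulated):
    """NSE = 1 - sum((o - s)^2) / sum((o - mean(o))^2).

    1.0 is a perfect fit; 0.0 means the model is no better than
    predicting the mean of the observations.
    """
    obs_mean = mean(observed)
    residual = sum((o - s) ** 2 for o, s in zip(observed, simulated))
    variance = sum((o - obs_mean) ** 2 for o in observed)
    return 1.0 - residual / variance


def percent_bias(observed, simulated):
    """PBIAS = 100 * sum(o - s) / sum(o).

    Assumed convention: positive values mean the model underestimates.
    Some papers use the opposite sign.
    """
    total_error = sum(o - s for o, s in zip(observed, simulated))
    return 100.0 * total_error / sum(observed)


# Hypothetical TP concentrations, for illustration only.
observed = [1.0, 2.0, 3.0, 4.0]
simulated = [1.0, 2.0, 3.0, 2.0]

print(round(nash_sutcliffe(observed, simulated), 3))
print(round(percent_bias(observed, simulated), 3))
```

Under these definitions, an NSE close to 0, such as the baseline mean of 0.01 reported above, indicates a model that performs about as well as always predicting the observed mean.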

12.

Background  

Datasets generated on deep-sequencing platforms have been deposited in various public repositories such as the Gene Expression Omnibus (GEO), Sequence Read Archive (SRA) hosted by the NCBI, or the DNA Data Bank of Japan (DDBJ). Despite being rich data sources, they have not been used much due to the difficulty in locating and analyzing datasets of interest.

13.
Optoacoustic imaging (OAI) methods are rapidly evolving for resolving optical contrast in medical imaging applications. In practice, measurement strategies are commonly implemented under limited-view conditions due to oversized image objectives or system design limitations. Data acquired by limited-view detection may impart artifacts and distortions in reconstructed optoacoustic (OA) images. We propose a hybrid data-driven deep learning approach based on a generative adversarial network (GAN), termed LV-GAN, to efficiently recover high-quality images from limited-view OA images. Trained on both simulated and experimental data, LV-GAN is found capable of achieving high recovery accuracy even under limited detection angles of less than 60°. The feasibility of LV-GAN for artifact removal in biological applications was validated by ex vivo experiments based on two different OAI systems, suggesting high potential for ubiquitous use of LV-GAN to optimize image quality or system design for different scanners and application scenarios.

14.
A R Kessler  B Kessler  S Yehuda 《Life sciences》1986,38(13):1185-1192
In this account we report in vivo effects of a plant lipid preparation (MMPL) on brain cholesterol and on the activity and learning performance of aging male rats. Three-month-old rats were fed for 3 months with a diet that was enriched with 3% MMPL. Another group of 18-month-old rats was fed for 6 months with a 3% MMPL-enriched diet. This food regime markedly lowered the cholesterol level in the hippocampal and cortical regions and increased their lipid membrane fluidity. The animals of both age groups also responded to MMPL with higher activity, and their learning performance improved notably compared to animals fed a normal diet. This improvement continued for at least 4 months after the supply of MMPL was terminated. Significant inverse correlations were obtained between the length of the training period required to attain proper criteria and the cholesterol levels of the hippocampal and cortical brain fractions.

15.
MOTIVATION: We describe APDB, a novel measure for evaluating the quality of a protein sequence alignment, given two or more PDB structures. This evaluation does not require a reference alignment or a structure superposition. APDB is designed to efficiently and objectively benchmark multiple sequence alignment methods. RESULTS: Using existing collections of reference multiple sequence alignments and existing alignment methods, we show that APDB gives results that are consistent with those obtained using conventional evaluations. We also show that APDB is suitable for evaluating sequence alignments that are structurally equivalent. We conclude that APDB provides an alternative to more conventional methods used for benchmarking sequence alignment packages.

16.
SUMMARY: The development of NMR in structural proteomics requires the availability of automatic structure determination methods. Many researchers are confronted with a lack of raw datasets during the validation step of such methods. In order to increase test possibilities, the NMRb website offers a database of NMR raw datasets, ordered by spectral characteristics. AVAILABILITY: NMRb is available from: http://nmrb.cbs.cnrs.fr. SUPPLEMENTARY INFORMATION: A figure of the general organization of NMRb, the relational model organization, and XML structure files are available from http://nmrb.cbs.cnrs.fr/nmrb-doc.html.

17.
The ability to manage the constantly growing body of clinically relevant genetics information available on the internet is becoming crucial in medical practice. Therefore, training students in teaching environments that develop bioinformatics skills is a particular challenge for medical schools. We present here an instructional approach that potentiates learning of hormone/vitamin mechanisms of action in gene regulation with the acquisition and practice of bioinformatics skills. The activity is integrated within the study of the Endocrine System module. Given a nucleotide sequence of a hormone or vitamin-response element, students use internet databases and tools to find the gene to which it belongs. Subsequently, students search how the corresponding hormone/vitamin influences the expression of that particular gene and how a dysfunctional interaction might cause disease. This activity was presented for four consecutive years to cohorts of 50-60 students/year enrolled in the 2nd year of the medical degree. 90% of the students developed a better understanding of the usefulness of bioinformatics and 98% intend to use web-based resources in the future. Since hormones and vitamins regulate genes of all body organ systems, this activity successfully integrates the whole body physiology of the medical curriculum.

18.
Background: Many existing bioinformatics predictors are based on machine learning technology. When applying these predictors in practical studies, their predictive performance should be well understood. Different performance measures, as well as different evaluation methods, are applied in various studies. Even for the same performance measure, different terms, nomenclatures or notations may appear in different contexts. Results: We carried out a review of the most commonly used performance measures and evaluation methods for bioinformatics predictors. Conclusions: It is important in bioinformatics to correctly understand and interpret performance, as it is the key to rigorously comparing the performance of different predictors and to choosing the right predictor.
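As a rough illustration of the kind of performance measures such a review covers, the sketch below computes several widely used binary-classification measures from a confusion matrix. The abstract does not name the specific measures reviewed, so the selection here (sensitivity, specificity, precision, accuracy, and the Matthews correlation coefficient) is an assumption, and the function name and counts are hypothetical.

```python
import math


def _ratio(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def binary_metrics(tp, fp, tn, fn):
    """Compute common performance measures from confusion-matrix counts."""
    mcc_denominator = math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        # Also called recall or true positive rate.
        "sensitivity": _ratio(tp, tp + fn),
        # Also called true negative rate.
        "specificity": _ratio(tn, tn + fp),
        # Also called positive predictive value.
        "precision": _ratio(tp, tp + fp),
        "accuracy": _ratio(tp + tn, tp + fp + tn + fn),
        # Matthews correlation coefficient, in the range [-1, 1].
        "mcc": _ratio(tp * tn - fp * fn, mcc_denominator),
    }


# Hypothetical counts, for illustration only.
for name, value in binary_metrics(tp=40, fp=10, tn=45, fn=5).items():
    print(f"{name}: {value:.3f}")
```

The comments note alternative names for the same quantity (for example, sensitivity and recall), which is the kind of terminology overlap the abstract points out.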

19.
20.