Similar literature
1.
2.
Computer‐automated identification of insect species has long been sought to support activities such as environmental monitoring, forensics, pest diagnostics, border security and vector epidemiology, to name just a few. In order to succeed, an automated identification programme capable of addressing the needs of the end user should be able to classify hundreds of taxa, if not thousands, and is expected to distinguish closely related and hence morphologically similar species. However, it remains unknown how automated identification methods might handle an increase in data quantity, be it in reference imagery or taxonomic diversity. We sought to test the scalability of an automated identification method in terms of the number of reference specimens used to train the classifier and the number of taxa into which the classifier should assign unknown specimens. Is there an optimal number of reference images, where the cost of acquiring more images becomes greater than the marginal increase in identification success? Does increasing taxonomic diversity affect identification success, whether negatively or positively? In order to test the scalability of the automated insect identification enterprise, we used a sparse processing technique and support vector machine to test the largest dataset to date: 72 species of fruit flies (Diptera: Tephritidae) and 76 species of mosquitoes (Diptera: Culicidae). We found that: (i) machine vision methods are capable of correctly classifying large numbers of closely related species; (ii) when the misclassification of a specimen occurs at the species level, it is often classified in the correct genus; (iii) classification success increases asymptotically as new training images are added to the dataset; (iv) broad taxon sampling outside a focal group can increase classification success within it.

3.
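Extracellular matrix proteins play important roles in a range of cellular biological processes, and their dysregulation leads to many major diseases. A theoretical reference dataset of extracellular matrix proteins is the basis for identifying them efficiently, and researchers have developed a series of machine-learning-based tools for predicting extracellular matrix proteins. This paper first describes the basic workflow for building such prediction tools from machine learning models, then summarizes the existing prediction tools one by one, and finally discusses the problems these tools currently face and possible ways to improve them.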

4.
Purpose: To compare the organ dose and effective dose (E) delivered to the patient during percutaneous vertebroplasty (PVP) of one thoracic or lumbar vertebra performed under CT guidance or using a fixed C-arm. Methods: Consecutive adult patients undergoing PVP of one vertebra under CT guidance, with optimized protocol and training of physicians, or using a fixed C-arm were retrospectively included from January 2016 to June 2017. Organ doses were computed for 16 organs using CT Expo 2.4 software for the CT procedures and PCXMC 2.0 for the fixed C-arm procedures. E was also computed with both software packages. Dosimetric values per anatomic location for all procedures were compared using the paired Mann-Whitney-Wilcoxon test. Results: In total, 73 patients were analysed (27 men and 46 women, mean age 78 ± 10 years), among whom 35 (48%) underwent PVP under CT guidance and 38 (52%) PVP using a fixed C-arm. The median E was 11.31 [6.54; 15.82] mSv for all PVPs performed under CT guidance and 5.58 [3.33; 8.71] mSv for the fixed C-arm, and the difference was significant (p<0.001). For lumbar PVP, the organ doses of stomach, liver and colon were higher with CT than with the fixed C-arm: 97% (p=0.02); 21% (p=0.099) and 375% (p=0.002), respectively. For thoracic PVP, the lung organ dose was significantly higher with CT than with the fixed C-arm (127%; p<0.001) and the oesophagus organ doses were not significantly different (p = 0.626). Conclusion: This study showed that E and the organ dose to directly exposed organs were both higher for PVP performed under CT guidance than with the fixed C-arm.

5.
Ternary organic solar cells (OSCs) have progressed significantly in recent years owing to the efficient photon harvesting of a blend photoactive layer that includes three absorption‐complementary materials. With the rapid development of highly efficient ternary OSCs in photovoltaics, the precise energy‐level alignment of the three active components within ternary OSC devices should be taken into account. Machine learning is a computational method that can learn effectively from historical data to build predictive models. In this study, a dataset of 124 fullerene derivatives‐based ternary OSCs is manually constructed from a diverse range of literature, along with their frontier molecular orbital levels and device structures. Different machine‐learning algorithms are trained on these electronic parameters to predict photovoltaic efficiency. The Random Forest approach provides the best predictive capability among the machine‐learning algorithms applied to the dataset. Furthermore, the Random Forest algorithm yields valuable insights into the crucial role of the lowest unoccupied molecular orbital energy levels of organic donors in the performance of ternary OSCs. The outcome of this study demonstrates a smart strategy for extracting underlying complex correlations in fullerene derivatives‐based ternary OSCs, thereby accelerating the development of ternary OSCs and related research fields.

6.
Babnigg G, Giometti CS. Proteomics, 2006, 6(16): 4514-4522
In proteome studies, identification of proteins requires searching protein sequence databases. The public protein sequence databases (e.g., NCBInr, UniProt) each contain millions of entries, and private databases add thousands more. Although much of the sequence information in these databases is redundant, each database uses distinct identifiers for the identical protein sequence and often contains unique annotation information. Users of one database obtain a database-specific sequence identifier that is often difficult to reconcile with the identifiers from a different database. When multiple databases are used for searches or the databases being searched are updated frequently, interpreting the protein identifications and associated annotations can be problematic. We have developed a database of unique protein sequence identifiers called Sequence Globally Unique Identifiers (SEGUID) derived from primary protein sequences. These identifiers serve as a common link between multiple sequence databases and are resilient to annotation changes in either public or private databases throughout the lifetime of a given protein sequence. The SEGUID Database can be downloaded (http://bioinformatics.anl.gov/SEGUID/) or easily generated at any site with access to primary protein sequence databases. Since SEGUIDs are stable, predictions based on the primary sequence information (e.g., pI, Mr) can be calculated just once; we have generated approximately 500 different calculations for more than 2.5 million sequences. SEGUIDs are used to integrate MS and 2-DE data with bioinformatics information and provide the opportunity to search multiple protein sequence databases, thereby providing a higher probability of finding the most valid protein identifications.
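
The idea of an identifier derived only from the primary sequence can be sketched as follows. The abstract does not spell out the digest; this sketch assumes the commonly described SEGUID form, a SHA-1 digest of the upper-cased sequence, Base64-encoded with the trailing padding removed. The record identifiers and sequences are hypothetical.

```python
import base64
import hashlib


def seguid(sequence: str) -> str:
    """Return a SEGUID-style checksum for a protein sequence.

    Assumption: SHA-1 over the upper-cased sequence, Base64-encoded,
    with trailing '=' padding stripped.
    """
    normalized = sequence.strip().upper()
    digest = hashlib.sha1(normalized.encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii").rstrip("=")


# Hypothetical entries from two databases; the first two carry the same
# sequence under different database-specific identifiers.
records = {
    "dbA|P001": "MKTAYIAKQR",
    "dbB|X9": "mktayiakqr",
    "dbA|P002": "MKTAYIAKQQ",
}

# Group database-specific identifiers under the sequence-derived key.
by_seguid = {}
for identifier, seq in records.items():
    by_seguid.setdefault(seguid(seq), []).append(identifier)

for key, identifiers in by_seguid.items():
    print(key, identifiers)
```

Because the key depends only on the sequence, annotation changes or identifier renames in either database leave the grouping untouched, which is the property the abstract relies on.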

7.
Many non-synonymous SNPs (nsSNPs) are associated with diseases, and numerous machine learning methods have been applied to train classifiers for sorting disease-associated nsSNPs from neutral ones. The continuously accumulated nsSNP data allows us to further explore better prediction approaches. In this work, we partitioned the training data into 20 subsets according to either original or substituted amino acid type at the nsSNP site. Using support vector machine (SVM), training classification models on each subset resulted in an overall accuracy of 76.3% or 74.9% depending on the two different partition criteria, while training on the whole dataset obtained an accuracy of only 72.6%. Moreover, the dataset was also randomly divided into 20 subsets, but the corresponding accuracy was only 73.2%. Our results demonstrated that partitioning the whole training dataset into subsets properly, i.e., according to the residue type at the nsSNP site, will improve the performance of the trained classifiers significantly, which should be valuable in developing better tools for predicting the disease-association of nsSNPs.
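
The partition-then-train scheme described above can be sketched as below. This is only an illustration of the routing: a majority-vote placeholder stands in for the SVM, and the sample records, field names and default label are all hypothetical.

```python
from collections import Counter, defaultdict


def train_majority(labels):
    """Placeholder model: remember the most common label in a subset.

    Assumption: this stands in for the per-subset SVM of the abstract.
    """
    return Counter(labels).most_common(1)[0][0]


def train_partitioned(samples, key):
    """Split samples by residue type (key) and train one model per subset."""
    groups = defaultdict(list)
    for sample in samples:
        groups[sample[key]].append(sample["label"])
    return {residue: train_majority(labels) for residue, labels in groups.items()}


def predict(models, sample, key, default="neutral"):
    """Route a sample to the model trained for its residue type."""
    return models.get(sample[key], default)


# Hypothetical nsSNP records: original and substituted residue plus label.
samples = [
    {"original": "R", "substituted": "W", "label": "disease"},
    {"original": "R", "substituted": "H", "label": "disease"},
    {"original": "R", "substituted": "K", "label": "neutral"},
    {"original": "A", "substituted": "V", "label": "neutral"},
    {"original": "A", "substituted": "T", "label": "neutral"},
]

# Partition by the original residue; key="substituted" gives the other criterion.
models = train_partitioned(samples, key="original")
print(predict(models, {"original": "R", "substituted": "Q"}, key="original"))
```

With 20 amino acid types, either criterion yields at most 20 subsets, matching the partition size reported in the abstract.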

8.
9.
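Three domestically produced biochemical identification kits for anaerobic bacteria and purchased API 20A strips were compared on a total of 133 strains: 10 standard reference strains, 109 clinical isolates and 14 faecal isolates. Although each of the four methods has its own strengths and weaknesses, the agreement and reproducibility rates of the domestic kits did not differ significantly from those of the imported API 20A method (P > 0.05). The domestic kits are rapid, simple and accurate, with agreement and reproducibility rates all above 80% or 90%. All four methods still have problems that need to be addressed, and we hope the domestic kits will continue to be refined so that anaerobe detection improves further.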

10.
Quantitative proteomics methods have emerged as powerful tools for measuring protein expression changes at the proteome level. Using MS‐based approaches, it is now possible to routinely quantify thousands of proteins. However, prefractionation of the samples at the protein or peptide level is usually necessary to go deep into the proteome, increasing both MS analysis time and technical variability. Recently, a new MS acquisition method named SWATH was introduced with the potential to provide good coverage of the proteome as well as good measurement precision without prior sample fractionation. In contrast to shotgun‐based MS, however, a library containing experimentally acquired spectra is necessary for the bioinformatics analysis of SWATH data. In this study, spectral libraries are built for two widely used models, Solanum lycopersicum (tomato) and Drosophila melanogaster, which are used to study crop ripening and animal embryogenesis, respectively. The spectral libraries comprise fragments for 5197 and 6040 proteins for S. lycopersicum and D. melanogaster, respectively, and allow reproducible quantification for thousands of peptides per MS analysis. The spectral libraries and all MS data are available in the MassIVE repository with the dataset identifiers MSV000081074 and MSV000081075 and the PRIDE repository with the dataset identifiers PXD006493 and PXD006495.

11.
This paper describes an automated apparatus combining the basic antibody screening and identification techniques of Rosenfield and Lalezari. PVP bromelin and low ionic strength acid Polybrene channels are used; agglutinates are decanted; the remaining cells are hemolyzed, and the optical density is then measured with a colorimeter and recorded on a chart; throughput is 40 samples an hour. This machine was also used for irregular antibody screening and identification. Sensitivity is shown to be equal to that of manual techniques for the detection of ABO, Lewis, Lutheran as well as K, S, M, Kpb, Xga, U and Vel antibodies. Nevertheless, a much greater sensitivity (titers 3 to 10 times higher) is achieved than by manual techniques for the detection of Rh, -k, S, Fya antibodies. The Polybrene channel is suitable for the detection of anti-Rh, Duffy, I and M (human) antibodies; the bromelin channel, however, has a greater sensitivity for other specificities. Anti-M and anti-N sera from rabbits were shown to be non-specific when using this machine. Among almost 15 000 sera tested, no antibody detected by manual techniques escaped the automated screening. This antibody detection machine was applied to compatibility tests prior to transfusion (21 480 units, intended for transfusion to 5 611 patients, were tested). A third channel, PVP without bromelin, was set in parallel so that no anti-M, even a weak one, would escape detection. The sera distributor was slaved to the cells distributor so that the whole procedure was automated. Furthermore, each serum was tested not only against the red cells to be transfused but also against the patient's own red cells and against two selected red cell panels, so as to ensure irregular antibody detection at the same time. Using this machine, 3 to 4% of the cell samples were rejected, i.e. more than with usual techniques. All manually detected antibodies were identified, as were some others that showed only weak reactions by classical techniques.
Total results can be obtained within 20 to 30 minutes, which is quite rapid compared with techniques using, for example, antiglobulin tests.

12.
Successful clustering algorithms are highly dependent on parameter settings. The clustering performance degrades significantly unless parameters are properly set, and yet, it is difficult to set these parameters a priori. To address this issue, in this paper, we propose a unique splitting-while-merging clustering framework, named “splitting merging awareness tactics” (SMART), which does not require any a priori knowledge of either the number of clusters or even the possible range of this number. Unlike existing self-splitting algorithms, which over-cluster the dataset to a large number of clusters and then merge some similar clusters, our framework has the ability to split and merge clusters automatically during the process and produces the most reliable clustering results, by intrinsically integrating many clustering techniques and tasks. The SMART framework is implemented with two distinct clustering paradigms in two algorithms: competitive learning and finite mixture model. Nevertheless, within the proposed SMART framework, many other algorithms can be derived for different clustering paradigms. The minimum message length algorithm is integrated into the framework as the clustering selection criterion. The usefulness of the SMART framework and its algorithms is tested on demonstration datasets and simulated gene expression datasets. Moreover, two real microarray gene expression datasets are studied using this approach. Across many performance metrics, all numerical results show that SMART is superior to the existing self-splitting algorithms and traditional algorithms it was compared with. Three main properties of the proposed SMART framework are summarized as: (1) needing no parameters dependent on the respective dataset or a priori knowledge about the datasets, (2) extendible to many different applications, (3) offering superior performance compared with counterpart algorithms.

13.
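The onset and progression of complex diseases are closely linked to dysfunction of biological pathways in the body, so using computational methods to study the relationship between diseases and pathways from high-throughput data is of great importance. This paper proposes a new network-based global pathway identification method. The method uses protein interaction information and the gene-set composition of pathways to build a complex protein-pathway network. Then, based on expression profile data, a random walk algorithm optimizes disease risk pathways at the global level. Finally, statistically significant risk pathways are identified by a perturbation approach. Applied to risk pathway identification in colorectal cancer, the method identified 15 pathways significantly associated with the onset and progression of colorectal cancer. Compared with other pathway identification methods (hypergeometric test, SPIA), this method identifies disease-related risk pathways more effectively.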
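
The random walk step on a protein-pathway network might look like the sketch below. The abstract does not say which random walk variant is used, so random walk with restart is an assumption here, as are the toy network, the node names, the seed choice and the restart probability.

```python
def random_walk_with_restart(adjacency, seeds, restart=0.3, tol=1e-10, max_iter=1000):
    """Score every node by its steady-state visiting probability.

    adjacency: dict mapping each node to a list of neighbours (undirected,
               every neighbour must also be a key).
    seeds:     nodes where the walk starts and restarts, e.g. proteins
               flagged by expression data (an assumption of this sketch).
    """
    nodes = sorted(adjacency)
    start = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    scores = dict(start)
    for _ in range(max_iter):
        # Restart mass goes back to the seeds on every step.
        updated = {n: restart * start[n] for n in nodes}
        for n in nodes:
            neighbours = adjacency[n]
            if not neighbours:
                # Isolated node: hand its walking mass back to the seeds.
                for s in nodes:
                    updated[s] += (1.0 - restart) * scores[n] * start[s]
                continue
            share = (1.0 - restart) * scores[n] / len(neighbours)
            for m in neighbours:
                updated[m] += share
        if sum(abs(updated[n] - scores[n]) for n in nodes) < tol:
            return updated
        scores = updated
    return scores


# Hypothetical network: proteins P1..P3 linked to each other and to pathways.
network = {
    "P1": ["P2", "PathA"],
    "P2": ["P1", "P3", "PathA"],
    "P3": ["P2", "PathB"],
    "PathA": ["P1", "P2"],
    "PathB": ["P3"],
}

scores = random_walk_with_restart(network, seeds={"P1"})
ranked_pathways = sorted(("PathA", "PathB"), key=scores.get, reverse=True)
print(ranked_pathways)
```

In the abstract the ranking is followed by a significance test; one way to do that would be to recompute the scores on perturbed inputs and compare, but that step is not shown here.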

14.
In the last decade, bacterial taxonomy witnessed a huge expansion. The swift pace of bacterial species (re-)definitions has a serious impact on the accuracy and completeness of first-line identification methods. Consequently, back-end identification libraries need to be synchronized with the List of Prokaryotic names with Standing in Nomenclature. In this study, we focus on bacterial fatty acid methyl ester (FAME) profiling as a broadly used first-line identification method. From the BAME@LMG database, we have selected FAME profiles of individual strains belonging to the genera Bacillus, Paenibacillus and Pseudomonas. Only those profiles resulting from standard growth conditions have been retained. The corresponding data set covers 74, 44 and 95 validly published bacterial species, respectively, represented by 961, 378 and 1673 standard FAME profiles. Through the application of machine learning techniques in a supervised strategy, different computational models have been built for genus and species identification. Three techniques have been considered: artificial neural networks, random forests and support vector machines. Nearly perfect identification has been achieved at genus level. Notwithstanding the known limited discriminative power of FAME analysis for species identification, the computational models have resulted in good species identification results for the three genera. For Bacillus, Paenibacillus and Pseudomonas, random forests have resulted in sensitivity values of 0.847, 0.901 and 0.708, respectively. The random forest models outperform those of the other machine learning techniques. Moreover, our machine learning approach also outperformed the Sherlock MIS (MIDI Inc., Newark, DE, USA). These results show that machine learning proves very useful for FAME-based bacterial species identification.
Besides good bacterial identification at species level, speed and ease of taxonomic synchronization are major advantages of this computational species identification strategy.

15.
This study investigated whether infrared spectroscopy combined with a deep learning algorithm could be a useful tool for determining causes of death by analyzing pulmonary edema fluid from forensic autopsies. A newly designed convolutional neural network‐based deep learning framework, named DeepIR, and eight popular machine learning algorithms were used to construct classifiers. The prediction performances of these classifiers demonstrated that DeepIR outperformed the machine learning algorithms in establishing classifiers to determine the causes of death. Moreover, DeepIR was generally less dependent on preprocessing procedures than were the machine learning algorithms; its validation accuracy stayed within a narrow range of 0.9661 to 0.9856 and its test accuracy ranged from 0.8774 to 0.9167 on the raw pulmonary edema fluid spectral dataset and the nine preprocessing protocol‐based datasets in our study. In conclusion, this study demonstrates that the deep learning‐equipped Fourier transform infrared spectroscopy technique has the potential to be an effective aid for determining causes of death.

16.
Prediction of the β-Hairpins in Proteins Using Support Vector Machine
Hu XZ, Li QZ. The Protein Journal, 2008, 27(2): 115-122
Using a composite vector of increment of diversity and scoring function to express sequence information, a support vector machine (SVM) algorithm for predicting β-hairpin motifs is proposed. The prediction is done on a dataset of 3,088 non-homologous proteins containing 6,027 β-hairpins. The overall accuracy of prediction and Matthews correlation coefficient are 79.9% and 0.59 for the independent testing dataset. In addition, a higher accuracy of 83.3% and a Matthews correlation coefficient of 0.67 in the independent testing dataset are obtained on a dataset previously used by Kumar et al. (Nucleic Acids Res 33:154–159). The performance of the method is also evaluated by predicting the β-hairpins in the CASP6 proteins, where better results are obtained. Moreover, this method is used to predict four kinds of supersecondary structures. The overall accuracy of prediction is 64.5% for the independent testing dataset.
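
For readers comparing the two reported metrics, overall accuracy and the Matthews correlation coefficient can both be computed from a binary confusion matrix, as in the short sketch below. The counts are hypothetical and are not taken from the paper.

```python
import math


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def matthews_corrcoef(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient for a binary confusion matrix.

    Returns 0.0 when the denominator is zero, a common convention for the
    degenerate case where one row or column of the matrix is empty.
    """
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator


# Hypothetical counts: hairpin vs. non-hairpin predictions on a test set.
tp, tn, fp, fn = 40, 40, 10, 10
print(accuracy(tp, tn, fp, fn), matthews_corrcoef(tp, tn, fp, fn))
```

Unlike accuracy, the coefficient uses all four cells of the matrix, so it stays informative when the two classes are unbalanced.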

17.
Remote homology detection refers to the detection of structure homology in evolutionarily related proteins with low sequence similarity. Supervised learning algorithms such as support vector machine (SVM) are currently the most accurate methods. In most of these SVM-based methods, efforts have been dedicated to developing new kernels to better use the pairwise alignment scores or sequence profiles. Moreover, amino acids’ physicochemical properties are not generally used in the feature representation of protein sequences. In this article, we present a remote homology detection method that incorporates two novel features: (1) a protein's primary sequence is represented using amino acids' physicochemical properties and (2) the similarity between two proteins is measured using recurrence quantification analysis (RQA). An optimization scheme was developed to select the amino acid indices (up to 10 for a protein family) that best characterize the given protein family. The selected amino acid indices may enable a better biological explanation of the protein family classification problem than other alignment-based methods provide. An SVM-based classifier then works on the space described by the RQA metrics. The classification scheme is named SVM-RQA. Experiments at the superfamily level of the SCOP1.53 dataset show that, without using alignment or sequence profile information, the features generated from amino acid indices are able to produce results that are comparable to those obtained by the published state-of-the-art SVM kernels. In the future, better prediction accuracies can be expected by combining the alignment-based features with our amino acids property-based features. Supplementary information including the raw dataset, the best-performing amino acid indices for each protein family and the computed RQA metrics for all protein sequences can be downloaded from http://ym151113.ym.edu.tw/svm-rqa.

18.
ABSTRACT: BACKGROUND: In this study we explored preeclampsia through a bioinformatics approach. We created a comprehensive genes/proteins dataset by analysing public proteomic data and text mining the public scientific literature. From this dataset the associated protein-protein interaction network was obtained. Several centrality indexes were explored for hub detection, along with statistical enrichment analysis of metabolic pathways and diseases. RESULTS: We confirmed the well known relationship between preeclampsia and cardiovascular diseases but also identified statistically significant relationships with cancer and aging. Moreover, significant metabolic pathways such as apoptosis, cancer and cytokine-cytokine receptor interaction were also identified by enrichment analysis. The FLT1, VEGFA, FN1, F2 and PGF genes obtained the highest scores in the hub analysis; however, we also found other genes, such as PDIA3, LYN, SH2B2 and NDRG1, with high scores. CONCLUSIONS: The applied methodology not only identified well known genes related to preeclampsia but also proposed new candidates that are poorly explored or completely unknown in the pathogenesis of preeclampsia, which still need to be validated experimentally. Moreover, new possible connections were detected between preeclampsia and other diseases that could open new areas of research. More must be done in this area to identify unknown interactions of proteins/genes and to better integrate metabolic pathways and diseases.

19.
Since the genome of Solanum lycopersicum L. was published in 2012, some studies have explored its proteome, although with limited depth. In this work, we present an extended characterization of the proteome of the tomato pericarp at its ripe red stage. Fractionation of tryptic peptides generated from pericarp proteins by off‐line high‐pH reverse‐phase chromatography in combination with LC‐MS/MS analysis on a Fisher Scientific Q Exactive and a Sciex Triple‐TOF 6600 resulted in the identification of 8588 proteins with a 1% FDR at both the peptide and protein levels. Proteins were mapped through GO and KEGG databases, and a large number of the identified proteins were associated with cytoplasmic organelles and metabolic pathways categories. These results constitute one of the most extensive proteome datasets of tomato so far and provide an experimental confirmation of the existence of a high number of theoretically predicted proteins. All MS data are available in the ProteomeXchange repository with the dataset identifiers PXD004947 and PXD004932.

20.