1.
2.
In the risk analysis of sequential events, the successive gap times are often correlated, e.g. as a result of individual heterogeneity. Correlation is usually accounted for by using a shared gamma‐frailty model, where the variance φ of the random individual effect quantifies the correlation between gap times. This method is known to yield satisfactory estimates of covariate effects, but underestimates φ, which could result in a lack of power of the test of independence. We propose a new test of independence between two sequential gap times where the first is the time elapsed from the origin. The test is based on an approximation of the hazard of the second event given the first gap time in a frailty model, with a frailty distribution belonging to the power variance function family. Simulation results show an increased power of the new test compared with the test derived from the gamma‐frailty model. In the realistic case where hazards are event specific, and using event‐specific approaches, the proposed estimation of the variance of the frailty is less biased than the gamma‐frailty based estimation for a wide range of values ( with the set of parameters considered), and similar for higher values. As an illustration, the methods are applied to a previously analysed asthma prevention trial with results showing a significant positive association between the successive times to asthmatic events. We also analyse data from a cohort of HIV‐seropositive patients in order to assess the effect of risk factors on the occurrence of two successive markers of progression of the HIV disease. The results demonstrate the ability of the proposed model to account for negative correlations between gap times.
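For orientation, the shared gamma-frailty setup referred to in this abstract can be written in a common textbook form. This is a sketch under stated assumptions, not necessarily the parameterisation used in the paper: the symbols Z_i (frailty of individual i), λ_{0j} (baseline hazard of the j-th gap time), β, x_{ij} and τ (Kendall's tau, conditional on covariates) are notation introduced here for illustration.

```latex
% Conditional hazard of the j-th gap time of individual i, given the frailty Z_i
\lambda_{ij}(t \mid Z_i) = Z_i \, \lambda_{0j}(t) \, \exp\!\left(\beta^{\top} x_{ij}\right), \qquad j = 1, 2

% Gamma frailty with mean 1 and variance \varphi (shape 1/\varphi, rate 1/\varphi)
Z_i \sim \mathrm{Gamma}\!\left(\tfrac{1}{\varphi}, \tfrac{1}{\varphi}\right), \qquad
\mathrm{E}[Z_i] = 1, \qquad \mathrm{Var}(Z_i) = \varphi

% Association between the two gap times implied by the gamma frailty
\tau = \frac{\varphi}{\varphi + 2}
```

Because φ ≥ 0, this form can only represent independence (φ = 0) or positive association, which is consistent with the abstract's remark that the proposed model can also account for negative correlations.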

3.
Validation of genetic associations is understood to be a cornerstone for the scientific credibility of the results. To approach this topic, the general concept of genetic association studies is introduced briefly, followed by how the term 'validation' is used in the context of genetic association studies. As a central issue, reasons for the importance of validation and for failure of validation will be described.

4.
Li H 《Human genetics》2012,131(9):1395-1401
Many common human diseases are complex and are expected to be highly heterogeneous, with multiple causative loci and multiple rare and common variants at some of the causative loci contributing to the risk of these diseases. Data from genome-wide association studies (GWAS) and metadata such as known gene functions and pathways provide the possibility of identifying genetic variants, genes and pathways that are associated with complex phenotypes. Single-marker-based tests have been very successful in identifying thousands of genetic variants for hundreds of complex phenotypes. However, these variants explain only very small percentages of the heritabilities. To account for locus and allelic heterogeneity, gene-based and pathway-based tests can be very useful in the next stage of the analysis of GWAS data. U-statistics, which summarize the genomic similarity between pairs of individuals and link the genomic similarity to phenotype similarity, have proved to be very useful for testing the associations between a set of single nucleotide polymorphisms and the phenotypes. Compared to single-marker analysis, the advantage afforded by the U-statistics-based methods is large when the number of markers involved is large. We review several formulations of U-statistics in genetic association studies and point out the links of these statistics with other similarity-based tests of genetic association. Finally, potential applications of U-statistics in the analysis of next-generation sequencing data and in rare-variant association studies are discussed.

5.
The model of a neuron examined in this paper is a binary decision element whose decision rule depends on a weight vector w. The environment of the element is described by an unknown, stationary distribution p(x). The input signals x[n] of the element appear at each step n independently, in accordance with the distribution p(x). During an unsupervised learning process, the weight vector w[n] is changed on the basis of the input vector x[n]. Two self-learning algorithms of the stochastic approximation type are considered. Both algorithms use the same rule for neglecting past experience, that is, the same rule of weight decrease; they differ in the rule of weight increase. It is proved that only one of these algorithms always leads to the same decision rule in a given environment p(x). This work was done during the stay of Dr. L. Bobrowski at the University of Salerno within the framework of the Polish-Italian Agreement on Scientific Cooperation.

6.
Models in which two susceptibility loci jointly influence the risk of developing disease can be explored using logistic regression analysis. Comparison of likelihoods of models incorporating different sets of disease model parameters allows inferences to be drawn regarding the nature of the joint effect of the loci. We have simulated case-control samples generated assuming different two-locus models and then analysed them using logistic regression. We show that this method is practicable and that, for the models we have used, it can be expected to allow useful inferences to be drawn from sample sizes consisting of hundreds of subjects. Interactions between loci can be explored, but interactive effects do not exactly correspond with classical definitions of epistasis. We have particularly examined the issue of the extent to which it is helpful to utilise information from a previously identified locus when investigating a second, unknown locus. We show that for some models conditional analysis can have substantially greater power while for others unconditional analysis can be more powerful. Hence we conclude that in general both conditional and unconditional analyses should be performed when searching for additional loci.

7.
Meta-analysis of genetic association studies   Total citations: 11, self-citations: 0, citations by others: 11
Meta-analysis, a statistical tool for combining results across studies, is becoming popular as a method for resolving discrepancies in genetic association studies. Persistent difficulties in obtaining robust, replicable results in genetic association studies are almost certainly because genetic effects are small, requiring studies with many thousands of subjects to be detected. In this article, we describe how meta-analysis works and consider whether it will solve the problem of underpowered studies or whether it is another affliction visited by statisticians on geneticists. We show that meta-analysis has been successful in revealing unexpected sources of heterogeneity, such as publication bias. If heterogeneity is adequately recognized and taken into account, meta-analysis can confirm the involvement of a genetic variant, but it is not a substitute for an adequately powered primary study.
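As a rough illustration of how results are combined across studies, the sketch below pools per-study log odds ratios with inverse-variance weights (a fixed-effect model) and reports Cochran's Q and the I² heterogeneity statistic. This is one common approach, not necessarily the one described in the article; the function name and the example numbers are hypothetical.

```python
import math


def fixed_effect_meta(log_ors, std_errs):
    """Pool per-study log odds ratios using inverse-variance weights.

    Returns (pooled log OR, its standard error, Cochran's Q, I^2).
    """
    # Each study is weighted by the inverse of its sampling variance.
    weights = [1.0 / se ** 2 for se in std_errs]
    total_weight = sum(weights)
    pooled = sum(w * b for w, b in zip(weights, log_ors)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)

    # Cochran's Q: weighted squared deviations from the pooled estimate.
    q = sum(w * (b - pooled) ** 2 for w, b in zip(weights, log_ors))
    df = len(log_ors) - 1

    # I^2: share of variation attributed to heterogeneity, floored at zero.
    i_squared = max(0.0, (q - df) / q) if q > 0 else 0.0
    return pooled, pooled_se, q, i_squared


# Hypothetical inputs: three studies of the same variant.
study_log_ors = [0.10, 0.25, 0.18]
study_std_errs = [0.10, 0.20, 0.10]

pooled, pooled_se, q, i_squared = fixed_effect_meta(study_log_ors, study_std_errs)
print(f"pooled OR = {math.exp(pooled):.3f}, SE(log OR) = {pooled_se:.3f}")
print(f"Q = {q:.3f}, I^2 = {i_squared:.2f}")
```

When Q is large relative to its degrees of freedom, a random-effects model or an investigation of the sources of heterogeneity is usually more appropriate than the fixed-effect pooling shown here.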

8.
Radar systems have been increasingly used to monitor birds. To take full advantage of the large datasets provided by radars, researchers have implemented machine learning (ML) techniques that automatically read and attempt to classify targets. Here we used data collected from two locations in Portugal with two marine radar antennas (VSR and HSR) to apply and compare the performance of six ML algorithms that are widely used in the literature: random forests (RF), support vector machine (SVM), artificial neural networks (NN), linear discriminant analysis (LDA), quadratic discriminant analysis (QDA) and decision trees (DT), all trained with several dataset configurations. We found that all algorithms performed well (area under the receiver operating characteristic curve (AUC) and accuracy > 0.80, p < 0.001) when discriminating birds from non‐biological targets such as vehicles, rain or wind turbines, but greater variance in the performance among algorithms was apparent when separating different bird functional groups or bird species (e.g. herons vs. gulls). In our case study, only RF was able to hold an accuracy > 0.80 for all classification tasks, although SVM and DT also performed well. Further, all algorithms correctly classified 86% and 66% (VSR and HSR) of the target points, and only 2% and 4% of these points were misclassified by all algorithms. Our results suggest that ML algorithms are suitable for classifying radar targets as birds, and thereby separating them from other non‐biological targets. The ability of these algorithms to discriminate correctly among bird species functional groups was found to be much weaker, but if properly trained and supported by a good ground truthing dataset, targeted to the relevant species groups, some of these algorithms are still able to achieve high accuracies in classification tasks. Such results indicate that ML algorithms are suitable for use in near real‐time monitoring of bird movements, and may help to mitigate collision of birds with, for example, wind turbines or airplanes.

9.
We aim to demonstrate that a complex plant tissue protein mixture can be reliably "fingerprinted" by running conventional 1-D SDS-PAGE in bulk and analyzing gel banding patterns using machine learning methods. An unsupervised approach to filter noise and systemic biases (principal component analysis) was coupled to state-of-the-art supervised methods for classification (support vector machines) and attribute ranking (ReliefF) to improve tissue discrimination, visualization, and recognition of important gel regions.

10.
Gromiha MM  Suwa M 《Proteins》2006,63(4):1031-1037
Discriminating outer membrane proteins (OMPs) from other folding types of globular and membrane proteins is an important task both for identifying OMPs from genomic sequences and for the successful prediction of their secondary and tertiary structures. In this work, we have analyzed the performance of different methods, based on Bayes rules, logistic functions, neural networks, support vector machines, decision trees, etc. for discriminating OMPs. We found that most of the machine learning techniques discriminate OMPs with similar accuracy. The neural network-based method could discriminate the OMPs from other proteins [globular/transmembrane helical (TMH)] at the fivefold cross-validation accuracy of 91.0% in a dataset of 1,088 proteins. The accuracy of discriminating globular proteins is 88.8% and that of TMH proteins is 93.7%. Further, the neural network method is tested with globular proteins belonging to 30 different folding types and it could successfully exclude 95% of the considered proteins. The proteins with SAM domain such as knottins, rubredoxin, and thioredoxin folds are eliminated with 100% accuracy. These accuracy levels are comparable to or better than other methods in the literature. We suggest that this method could be effectively used to discriminate OMPs and for detecting OMPs in genomic sequences.

11.
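Machine learning algorithms and their applications in environmental microbiology research   Total citations: 1, self-citations: 0, citations by others: 1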
陈鹤  陶晔  毛振镀  邢鹏 《微生物学报》2022,62(12):4646-4662
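Microorganisms are ubiquitous in the environment. They are not only key participants in biogeochemical cycles and environmental evolution, but also play important roles in environmental monitoring and in ecological restoration and protection. With the development of high-throughput technologies, large amounts of microbial data are being generated, and using machine learning to model and analyze environmental microbial big data is of great significance for both scientific research and societal applications in areas such as microbial biomarker identification, pollutant prediction, and environmental quality prediction. Machine learning can be divided into two broad categories: supervised learning and unsupervised learning. In microbiome research, unsupervised learning efficiently learns the features of the input data through methods such as clustering and dimensionality reduction, and thereby integrates and groups microbial data. Supervised learning trains models on microbial datasets that have both features and labels; when presented with data that have features but no labels, the trained model can infer the labels, enabling classification, identification, and prediction for new data. However, complex machine learning algorithms usually focus on predictive accuracy at the expense of interpretability. A machine learning model can often be regarded as a "black box" that predicts a particular outcome, meaning that little is known about how the model arrives at its predictions. To apply machine learning more widely in microbiome research and to improve our ability to extract valuable microbial information, it is especially important to understand machine learning algorithms in depth and to improve model interpretability. This article introduces the machine learning algorithms commonly used in environmental microbiology and the steps for building machine learning models from microbiome data, including feature selection, algorithm selection, model construction, and evaluation. It also reviews applications of various machine learning models in environmental microbiology, explores the associations between microbiomes and their surrounding environments, discusses approaches to improving model interpretability, and provides a scientific reference for future environmental monitoring and environmental health prediction.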

12.
The rapid transformation of land cover/land use (LCLU) is a strong indication of global environmental change. In order to monitor LCLU through maps, a significant dataset and a robust technique are necessary. Thus, the primary objective of the current research is to evaluate and compare the efficiency of several notable satellite sensors, including Landsat-8 (L-8), Sentinel-2 (S-2), Sentinel-1 (S-1), combined Sentinel-1 and Sentinel-2 (S-1-2), LISS III (L-3), and LISS IV (L-4), for LCLU mapping using random forest (RF), logit boost (LB), stochastic gradient boosting (SGB), artificial neural network (ANN), and K-nearest neighbor (KNN) models. For this purpose, 300 samples for each of the six LCLU classes have been selected based on a field survey and high-resolution Cartosat-3 images. The classification accuracy measures, namely producer accuracy (PA), user accuracy (UA), overall accuracy (OA) and the kappa coefficient, have been calculated from the confusion matrix of the applied models. The results show that the highest accuracy has been derived from the integrated S-1-2 dataset, followed by S-2, L-8, L-3, L-4, and S-1. The LB model is the most consistent and efficient in comparison with the other models for all the datasets. Regarding variable importance, the SWIR band is consistently the most important variable while the blue band is the least significant. From this comparative assessment of sensors, it has been found that high spatial and spectral resolutions, along with a combination of satellite datasets, are required to get better accuracy, rather than only high spatial resolution, in regional-scale mapping. The present study strongly advocates the use of combined S-1-2 data together with the application of the LB model for LCLU classification.
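The accuracy measures named in this abstract can all be derived from a confusion matrix. The sketch below assumes rows hold the reference (ground-truth) classes and columns hold the mapped classes, which is one common convention and not confirmed by the abstract; the function name and the two-class example matrix are hypothetical.

```python
def accuracy_metrics(matrix):
    """Compute PA, UA, OA and Cohen's kappa from a square confusion matrix.

    Assumes matrix[i][j] counts samples of reference class i mapped as class j.
    """
    n_classes = len(matrix)
    total = sum(sum(row) for row in matrix)
    row_sums = [sum(row) for row in matrix]
    col_sums = [sum(matrix[i][j] for i in range(n_classes)) for j in range(n_classes)]
    diagonal = [matrix[k][k] for k in range(n_classes)]

    # Producer accuracy: correct / reference total (complement of omission error).
    producer = [d / r if r else 0.0 for d, r in zip(diagonal, row_sums)]
    # User accuracy: correct / mapped total (complement of commission error).
    user = [d / c if c else 0.0 for d, c in zip(diagonal, col_sums)]

    overall = sum(diagonal) / total
    # Agreement expected by chance, from the row and column marginals.
    expected = sum(r * c for r, c in zip(row_sums, col_sums)) / total ** 2
    kappa = (overall - expected) / (1 - expected) if expected < 1 else 0.0

    return {
        "producer_accuracy": producer,
        "user_accuracy": user,
        "overall_accuracy": overall,
        "kappa": kappa,
    }


# Hypothetical two-class example (rows: reference, columns: mapped).
example = [
    [40, 10],
    [5, 45],
]
for name, value in accuracy_metrics(example).items():
    print(name, value)
```

If the matrix is stored the other way round (rows as mapped classes), the producer and user accuracies simply swap roles, while overall accuracy and kappa are unchanged.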

13.
14.
Gromiha MM  Suresh MX 《Proteins》2008,70(4):1274-1279
Discriminating thermophilic proteins from their mesophilic counterparts is a challenging task and it would help to design stable proteins. In this work, we have systematically analyzed the amino acid compositions of 3075 mesophilic and 1609 thermophilic proteins belonging to 9 and 15 families, respectively. We found that the charged residues Lys, Arg, and Glu as well as the hydrophobic residues Val and Ile have higher occurrence in thermophiles than in mesophiles. Further, we have analyzed the performance of different methods, based on Bayes rules, logistic functions, neural networks, support vector machines, decision trees and so forth for discriminating mesophilic and thermophilic proteins. We found that most of the machine learning techniques discriminate these classes of proteins with similar accuracy. The neural network-based method could discriminate the thermophiles from mesophiles at the five-fold cross-validation accuracy of 89% in a dataset of 4684 proteins. Moreover, this method is tested with 325 mesophiles in Xylella fastidiosa and 382 thermophiles in Aquifex aeolicus and it could successfully discriminate them with the accuracy of 91%. These accuracy levels are better than other methods in the literature and we suggest that this method could be effectively used to discriminate mesophilic and thermophilic proteins.
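The amino acid composition used as the input feature here is straightforward to compute. The sketch below is a minimal illustration, assuming one-letter codes for the 20 standard residues and ignoring any other characters; the function name and the example sequence are hypothetical and not taken from the paper.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter codes


def amino_acid_composition(sequence):
    """Return the percentage of each standard residue, in AMINO_ACIDS order."""
    counts = {aa: 0 for aa in AMINO_ACIDS}
    n_valid = 0
    for residue in sequence.upper():
        # Non-standard or ambiguous symbols (e.g. X, B, Z) are skipped.
        if residue in counts:
            counts[residue] += 1
            n_valid += 1
    if n_valid == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [100.0 * counts[aa] / n_valid for aa in AMINO_ACIDS]


# Hypothetical sequence fragment, used only to show the call.
features = amino_acid_composition("MKVLEERIKKGE")
for aa, pct in zip(AMINO_ACIDS, features):
    if pct > 0:
        print(f"{aa}: {pct:.1f}%")
```

The resulting 20-element vector is the kind of fixed-length input that the classifiers listed in the abstract (neural networks, support vector machines, decision trees and so on) can consume directly.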

15.
16.
Genetic association studies require that the genotype data from a given person can be correctly linked to the phenotype data from the same person. However, sample misidentification errors sometimes happen, whereby the link becomes invalid for some of the subjects in a study. This can have substantial consequences in terms of power to detect truly associated variants. In family-based studies, Mendelian inconsistencies can be used to detect sample misidentification. Genome-wide association studies (GWAS), however, typically use unrelated individuals, making error detection more problematic. Here we present a method for identifying potential sample misidentifications in GWAS and other genetic association studies building on ideas from forensic sciences. A widely used ad-hoc method for error detection is to check if the sex of an individual matches its X-linked genotype. We generalize this idea to less stringent associations between known genotypes and phenotypes, and show that if several known associations are combined, the power to detect misidentifications increases substantially. Individuals with an unlikely set of phenotypes given their genotypes are flagged as potential errors. We provide analytical and simulation results comparing the odds that the genotype and phenotype are both from the same individual for different numbers of available genotype-phenotype associations and for different information content of the associations. Our method has good sensitivity and specificity with as few as ten moderately informative genotype-phenotype associations. We apply the method to GWAS data from the Danish National Birth Cohort.

17.

Background

By using a standard Support Vector Machine (SVM) with a Sequential Minimal Optimization (SMO) method of training, Naïve Bayes, and other machine learning algorithms, we are able to distinguish between two classes of protein sequences: those folding to highly-designable conformations, and those folding to poorly- or non-designable conformations.

Results

First, we generate all possible compact lattice conformations for the specified shape (a hexagon or a triangle) on the 2D triangular lattice. Then we generate all possible binary hydrophobic/polar (H/P) sequences and, by using a specified energy function, thread them through all of these compact conformations. If for a given sequence the lowest energy is obtained for a particular lattice conformation, we assume that this sequence folds to that conformation. Highly-designable conformations have many H/P sequences folding to them, while poorly-designable conformations have few or no H/P sequences. We classify sequences as folding to either highly- or poorly-designable conformations. We have randomly selected subsets of the sequences belonging to highly-designable and poorly-designable conformations and used them to train several different standard machine learning algorithms.

Conclusion

By using these machine learning algorithms with ten-fold cross-validation we are able to classify the two classes of sequences with high accuracy – in some cases exceeding 95%.

18.
European Americans are often treated as a homogeneous group, but in fact form a structured population due to historical immigration of diverse source populations. Discerning the ancestry of European Americans genotyped in association studies is important in order to prevent false-positive or false-negative associations due to population stratification and to identify genetic variants whose contribution to disease risk differs across European ancestries. Here, we investigate empirical patterns of population structure in European Americans, analyzing 4,198 samples from four genome-wide association studies to show that components roughly corresponding to northwest European, southeast European, and Ashkenazi Jewish ancestry are the main sources of European American population structure. Building on this insight, we constructed a panel of 300 validated markers that are highly informative for distinguishing these ancestries. We demonstrate that this panel of markers can be used to correct for stratification in association studies that do not generate dense genotype data.

19.
Implementation of effective conservation planning relies on a robust understanding of the spatiotemporal distribution of the target species. In the marine realm, this is even more challenging for species rarely seen at the sea surface due to their extreme diving behavior, such as sperm whales. Our study aims at (a) investigating the seasonal movements, (b) predicting the potential distribution, and (c) assessing the diel vertical behavior of this species in the Mascarene Archipelago in the south‐west Indian Ocean. Using 21 satellite tracks of sperm whales and eight environmental predictors, 14 supervised machine learning algorithms were tested and compared to predict the whales' potential distribution during the wet and dry seasons separately. Fourteen of the whales remained in close proximity to Mauritius, while a migratory pattern was evidenced with a synchronized departure for eight females that headed towards Rodrigues Island. The best performing algorithm was the random forest, showing a strong affinity of the whales for sea surface height during the wet season and for bottom temperature during the dry season. A more dispersed distribution was predicted during the wet season, whereas a distribution more restricted to Mauritius and Reunion waters was found during the dry season, probably related to the breeding period. A diel pattern was observed in the diving behavior, likely following the vertical migration of squids. The results of our study fill a knowledge gap regarding seasonal movements and habitat affinities of this vulnerable species, for which a regional IUCN assessment is still missing in the Indian Ocean. Our findings also confirm the great potential of machine learning algorithms in conservation planning and provide highly reproducible tools to support dynamic ocean management.

20.
This protocol details the steps for data quality assessment and control that are typically carried out during case-control association studies. The steps described involve the identification and removal of DNA samples and markers that introduce bias. These critical steps are paramount to the success of a case-control study and are necessary before statistically testing for association. We describe how to use PLINK, a tool for handling SNP data, to perform assessments of failure rate per individual and per SNP and to assess the degree of relatedness between individuals. We also detail other quality-control procedures, including the use of SMARTPCA software for the identification of ancestral outliers. These platforms were selected because they are user-friendly, widely used and computationally efficient. Steps needed to detect and establish a disease association using case-control data are not discussed here. Issues concerning study design and marker selection in case-control studies have been discussed in our earlier protocols. This protocol, which is routinely used in our labs, should take approximately 8 h to complete.
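To make the "failure rate per individual and per SNP" step concrete, the sketch below computes both missing-genotype rates from a small in-memory genotype matrix and flags samples and markers above a cut-off. It is plain Python for illustration only and does not reproduce PLINK's commands or file formats; the function name, the toy matrix and the 0.5 thresholds are hypothetical choices, not values from the protocol.

```python
def missing_rates(genotypes):
    """Return (per-individual, per-SNP) missing-genotype rates.

    genotypes[i][j] is the genotype of individual i at SNP j,
    coded 0/1/2, with None marking a failed (missing) call.
    """
    n_individuals = len(genotypes)
    n_snps = len(genotypes[0])
    per_individual = [
        sum(call is None for call in row) / n_snps for row in genotypes
    ]
    per_snp = [
        sum(row[j] is None for row in genotypes) / n_individuals
        for j in range(n_snps)
    ]
    return per_individual, per_snp


# Toy data: 4 individuals (rows) by 4 SNPs (columns).
genotypes = [
    [0, 1, 2, None],
    [0, None, None, None],
    [1, 1, 0, None],
    [2, 0, 1, 0],
]

# Hypothetical cut-offs; real studies use far stricter values.
MAX_INDIVIDUAL_MISSING = 0.5
MAX_SNP_MISSING = 0.5

per_individual, per_snp = missing_rates(genotypes)
kept_individuals = [i for i, r in enumerate(per_individual) if r <= MAX_INDIVIDUAL_MISSING]
kept_snps = [j for j, r in enumerate(per_snp) if r <= MAX_SNP_MISSING]

print("individual missing rates:", per_individual)
print("SNP missing rates:", per_snp)
print("individuals kept:", kept_individuals)
print("SNPs kept:", kept_snps)
```

In practice the two filters are often applied in sequence, with the rates recomputed after the first filter, because removing poor-quality samples changes the per-SNP failure rates and vice versa.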
