Similar Articles
20 similar articles found (search time: 31 ms)
1.
OBJECTIVES: To analyze the possibility of distant metastasis based on computed tomography (CT) radiomic features in patients with lung cancer. METHODS: This was a retrospective analysis of 348 patients with lung cancer enrolled between 2014 and February 2015. A feature set containing clinical features and 485 radiomic features was extracted from the pretherapy CT images. Feature selection via concave minimization (FSV) was used to select effective features. A support vector machine (SVM) was used to evaluate the predictive ability of each feature. RESULTS: Four radiomic features and three clinical features were obtained by FSV feature selection. Classification accuracy of the proposed SVM with stochastic gradient descent (SGD) was 71.02%, and the area under the curve was 72.84%, using only the radiomic features extracted from CT. After adding the clinical features, an accuracy of 89.09% was achieved. CONCLUSION: The radiomic features of pretherapy CT images may be used as predictors of distant metastasis. They can also be combined with the patient's gender and tumor T and N stage information to assess the possibility of distant metastasis in lung cancer.
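As a rough illustration of the classifier type named in the abstract (a linear SVM trained with stochastic gradient descent), here is a minimal self-contained sketch. The feature vectors, labels, and hyperparameters are invented for demonstration and are not the study's data or model.

```python
# Minimal sketch of a linear SVM trained with SGD on hinge loss + L2 penalty.
# All data below are toy values standing in for radiomic feature vectors.

def svm_sgd(X, y, lr=0.01, lam=0.01, epochs=200):
    """Train weights w and bias b by stochastic subgradient descent."""
    n_features = len(X[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):          # labels yi must be +1 or -1
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:                 # inside margin: hinge subgradient step
                w = [wj + lr * (yi * xj - lam * wj) for wj, xj in zip(w, xi)]
                b += lr * yi
            else:                          # only the L2 penalty shrinks w
                w = [wj - lr * lam * wj for wj in w]
    return w, b

def predict(w, b, xi):
    return 1 if sum(wj * xj for wj, xj in zip(w, xi)) + b >= 0 else -1

# Toy, linearly separable "feature vectors" (e.g. two radiomic features).
X = [[2.0, 3.0], [3.0, 3.5], [0.5, 1.0], [1.0, 0.5]]
y = [1, 1, -1, -1]
w, b = svm_sgd(X, y)
preds = [predict(w, b, xi) for xi in X]
print(preds)  # expected: [1, 1, -1, -1]
```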

2.
Cardiovascular disease (including coronary artery disease and myocardial infarction) is one of the leading causes of death in Europe, and is influenced by both environmental and genetic factors. With the recent advances in genomic tools and technologies there is potential to predict and diagnose heart disease using molecular data from the analysis of blood cells. We analyzed gene expression data from blood samples taken from normal individuals (n = 21), patients with non-significant coronary artery disease (n = 93), patients with unstable angina (n = 16), patients with stable coronary artery disease (n = 14) and patients with myocardial infarction (MI; n = 207). We used a feature selection approach to identify a set of gene expression variables which successfully differentiate different cardiovascular diseases. The initial features were discovered by fitting a linear model for each probe set across all arrays of normal individuals and patients with myocardial infarction. Three different feature optimisation algorithms were devised, which identified two discriminating sets of genes: one using MI patients and normal controls (total genes = 6) and another using MI and unstable angina patients (total genes = 7). In all our classification approaches we used a non-parametric k-nearest neighbour (KNN) classification method (k = 3). The results proved the diagnostic robustness of the final feature sets in discriminating patients with myocardial infarction from healthy controls. Interestingly, they also showed efficacy in discriminating myocardial infarction patients from patients with clinical symptoms of cardiac ischemia but no myocardial necrosis, and from patients with stable coronary artery disease, despite the influence of batch effects and different microarray gene chips and platforms.
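The kNN classification step described above (k = 3) is simple enough to sketch directly. The toy "expression" values and labels below are invented stand-ins for the selected 6- and 7-gene signatures.

```python
# Hedged sketch of non-parametric k-nearest-neighbour classification (k = 3).
# Expression values are made up for illustration only.

from collections import Counter

def knn_predict(train_X, train_y, x, k=3):
    """Classify x by majority vote among its k nearest training samples."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(xi, x)), yi)
        for xi, yi in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

# Toy "expression profiles": MI patients vs. healthy controls.
train_X = [[5.1, 2.0], [4.8, 2.2], [5.0, 1.9],   # MI
           [1.0, 6.0], [1.2, 5.8], [0.9, 6.1]]   # control
train_y = ["MI", "MI", "MI", "ctrl", "ctrl", "ctrl"]

print(knn_predict(train_X, train_y, [4.9, 2.1]))  # expected: MI
print(knn_predict(train_X, train_y, [1.1, 5.9]))  # expected: ctrl
```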

3.
This review focuses on the genetic features of psoriatic arthritis (PsA) and their relationship to phenotypic heterogeneity in the disease, and addresses three questions: what do the recent studies on human leukocyte antigen (HLA) tell us about the genetic relationship between cutaneous psoriasis (PsO) and PsA – that is, is PsO a unitary phenotype; is PsA a genetically heterogeneous or homogeneous entity; and do the genetic factors implicated in determining susceptibility to PsA predict clinical phenotype? We first discuss the results from comparing the HLA typing of two PsO cohorts: one cohort providing the dermatologic perspective, consisting of patients with PsO without evidence of arthritic disease; and the second cohort providing the rheumatologic perspective, consisting of patients with PsA. We show that these two cohorts differ considerably in their predominant HLA alleles, indicating the heterogeneity of the overall PsO phenotype. Moreover, the genotype of patients in the PsA cohort was shown to be heterogeneous with significant elevations in the frequency of haplotypes containing HLA-B*08, HLA-C*06:02, HLA-B*27, HLA-B*38 and HLA-B*39. Because different genetic susceptibility genes imply different disease mechanisms, and possibly different clinical courses and therapeutic responses, we then review the evidence for a phenotypic difference among patients with PsA who have inherited different HLA alleles. We provide evidence that different alleles and, more importantly, different haplotypes implicated in determining PsA susceptibility are associated with different phenotypic characteristics that appear to be subphenotypes. The implication of these findings for the overall pathophysiologic mechanisms involved in PsA is discussed with specific reference to their bearing on the discussion of whether PsA is conceptualised as an autoimmune process or one that is based on entheseal responses.

4.
In this paper, we compare the performance of six different feature selection methods for LC-MS-based proteomics and metabolomics biomarker discovery—t test, the Mann–Whitney–Wilcoxon test (mww test), nearest shrunken centroid (NSC), linear support vector machine–recursive features elimination (SVM-RFE), principal component discriminant analysis (PCDA), and partial least squares discriminant analysis (PLSDA)—using human urine and porcine cerebrospinal fluid samples that were spiked with a range of peptides at different concentration levels. The ideal feature selection method should select the complete list of discriminating features that are related to the spiked peptides without selecting unrelated features. Whereas many studies have to rely on classification error to judge the reliability of the selected biomarker candidates, we assessed the accuracy of selection directly from the list of spiked peptides. The feature selection methods were applied to data sets with different sample sizes and extents of sample class separation determined by the concentration level of spiked compounds. For each feature selection method and data set, the performance for selecting a set of features related to spiked compounds was assessed using the harmonic mean of the recall and the precision (f-score) and the geometric mean of the recall and the true negative rate (g-score). We conclude that the univariate t test and the mww test with multiple testing corrections are not applicable to data sets with small sample sizes (n = 6), but their performance improves markedly with increasing sample size up to a point (n > 12) at which they outperform the other methods. PCDA and PLSDA select small feature sets with high precision but miss many true positive features related to the spiked peptides. NSC strikes a reasonable compromise between recall and precision for all data sets independent of spiking level and number of samples. 
Linear SVM-RFE performs poorly for selecting features related to the spiked compounds, even though the classification error is relatively low.

Biomarkers play an important role in advancing medical research through the early diagnosis of disease and prognosis of treatment interventions (1, 2). Biomarkers may be proteins, peptides, or metabolites, as well as mRNAs or other kinds of nucleic acids (e.g. microRNAs) whose levels change in relation to the stage of a given disease and which may be used to accurately assign the disease stage of a patient. The accurate selection of biomarker candidates is crucial, because it determines the outcome of further validation studies and the ultimate success of efforts to develop diagnostic and prognostic assays with high specificity and sensitivity. The success of biomarker discovery depends on several factors: consistent and reproducible phenotyping of the individuals from whom biological samples are obtained; the quality of the analytical methodology, which in turn determines the quality of the collected data; the accuracy of the computational methods used to extract quantitative and molecular identity information to define the biomarker candidates from raw analytical data; and finally the performance of the applied statistical methods in the selection of a limited list of compounds with the potential to discriminate between predefined classes of samples. De novo biomarker research consists of a biomarker discovery part and a biomarker validation part (3). Biomarker discovery uses analytical techniques that try to measure as many compounds as possible in a relatively low number of samples.
The goal of subsequent data preprocessing and statistical analysis is to select a limited number of candidates, which are subsequently subjected to targeted analyses in a large number of samples for validation.

Advanced technology, such as high-performance liquid chromatography–mass spectrometry (LC-MS), is increasingly applied in biomarker discovery research. Such analyses detect tens of thousands of compounds, as well as background-related signals, in a single biological sample, generating enormous amounts of multivariate data. Data preprocessing workflows reduce data complexity considerably by trying to extract only the information related to compounds, resulting in a quantitative feature matrix in which rows and columns correspond to samples and extracted features, respectively, or vice versa. Features may also be related to data preprocessing artifacts, and the ratio of such erroneous features to compound-related features depends on the performance of the data preprocessing workflow (4). Preprocessed LC-MS data sets contain a large number of features relative to the sample size. These features are characterized by their m/z value and retention time, and in the ideal case they can be combined and linked to compound identities such as metabolites, peptides, and proteins. In LC-MS-based proteomics and metabolomics studies, sample analysis is so time consuming that it is practically impossible to increase the number of samples to a level that balances the number of features in a data set. Therefore, the success of biomarker discovery depends on powerful feature selection methods that can deal with a low sample size and a high number of features.
Because of the unfavorable statistical situation and the risk of overfitting the data, it is ultimately pivotal to validate the selected biomarker candidates in a larger set of independent samples, preferably in a double-blinded fashion, using targeted analytical methods (1).

Biomarker selection is often based on classification methods that are preceded by feature selection methods (filters) or which have built-in feature selection modules (wrappers and embedded methods) that can be used to select a list of compounds/peaks/features that provide the best classification performance for predefined sample groups (e.g. healthy versus diseased) (5). Classification methods are able to classify an unknown sample into a predefined sample class. Univariate feature selection methods such as filters (t test or Wilcoxon–Mann–Whitney tests) cannot be used for sample classification. Some classification methods, such as the nearest shrunken centroid method, have intrinsic feature selection ability, whereas others, such as principal component discriminant analysis (PCDA) and partial least squares regression coupled with discriminant analysis (PLSDA), must be augmented with a feature selection method. There are also classifiers with no feature selection option that perform the classification using all variables, such as support vector machines that use non-linear kernels (6). Classification methods without the ability to select features cannot be used for biomarker discovery, because these methods aim to classify samples into predefined classes but cannot identify the limited number of variables (features or compounds) that form the basis of the classification (6, 7). Different statistical methods with feature selection have been developed according to the complexity of the analyzed data, and these have been extensively reviewed (5, 6, 8, 9).
Ways of optimizing such methods to improve sensitivity and specificity are a major topic in current biomarker discovery research and in the many “omics-related” research areas (6, 10, 11). Comparisons of classification methods with respect to their classification and learning performance have been initiated. Van der Walt et al. (12) focused on finding the most accurate classifiers for simulated data sets with sample sizes ranging from 20 to 100. Rubingh et al. (13) compared the influence of sample size in an LC-MS metabolomics data set on the performance of three different statistical validation tools: cross validation, jack-knifing model parameters, and a permutation test. That study concluded that for small sample sets the outcome of these validation methods is influenced strongly by individual samples and therefore cannot be trusted, and that the validation tools cannot be used to indicate problems due to sample size or the representativeness of sampling. This implies that reducing the dimensionality of the feature space is critical when approaching a classification problem in which the number of features exceeds the number of samples by a large margin. Dimensionality reduction retains a smaller set of features to bring the feature space in line with the sample size and thus allow the application of classification methods that perform with acceptable accuracy only when the sample size and the feature size are similar.

In this study we compared different classification methods focusing on feature selection in two types of spiked LC-MS data sets that mimic the situation of a biomarker discovery study. Our results provide guidelines for researchers who will engage in biomarker discovery or other differential profiling “omics” studies with respect to sample size and selecting the most appropriate feature selection method for a given data set.
We evaluated the following approaches: univariate t test and Mann–Whitney–Wilcoxon test (mww test) with multiple testing correction (14), nearest shrunken centroid (NSC) (15, 16), support vector machine–recursive features elimination (SVM-RFE) (17), PLSDA (18), and PCDA (19). PCDA and PLSDA were combined with the rank-product as a feature selection criterion (20). These methods were evaluated with data sets having three characteristics: different biological background, varying sample size, and varying within- and between-class variability of the added compounds. Data were acquired via LC-MS from human urine and porcine cerebrospinal fluid (CSF) samples that were spiked with a set of known peptides (true positives) at different concentration levels. These samples were then combined in two classes containing peptides spiked at low and high concentration levels. The performance of the classification methods with feature selection was measured based on their ability to select features that were related to the spiked peptides. Because the true positives were known in our data set, we compared performance based on the f-score (the harmonic mean of precision and recall) and the g-score (the geometric mean of recall and the true negative rate).
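The two performance scores are fully specified by the text and can be computed directly from a confusion matrix of selected versus spiked features; the counts in the example are illustrative only, not from the study.

```python
# f-score: harmonic mean of precision and recall.
# g-score: geometric mean of recall and the true negative rate.

import math

def f_score(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def g_score(tp, fn, tn, fp):
    """Geometric mean of recall and the true negative rate (specificity)."""
    recall = tp / (tp + fn)
    tnr = tn / (tn + fp)
    return math.sqrt(recall * tnr)

# e.g. 8 of 10 spiked peptides recovered, 2 unrelated features selected,
# and 90 of 92 unrelated features correctly left unselected:
print(round(f_score(tp=8, fp=2, fn=2), 3))         # → 0.8
print(round(g_score(tp=8, fn=2, tn=90, fp=2), 3))  # → 0.885
```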

5.
The relationship between different levels of integration is a key feature for understanding the genotype-phenotype map. Here, we describe a novel method of integrated data analysis that incorporates protein abundance data into constraint-based modeling to elucidate the biological mechanisms underlying phenotypic variation. Specifically, we studied yeast genetic diversity at three levels of phenotypic complexity in a population of yeast obtained by pairwise crosses of eleven strains belonging to two species, Saccharomyces cerevisiae and S. uvarum. The data included protein abundances, integrated traits (life-history/fermentation) and computational estimates of metabolic fluxes. Results highlighted that the negative correlation between production traits such as population carrying capacity (K) and traits associated with growth and fermentation rates (Jmax) is explained by a differential usage of energy production pathways: a high K was associated with high TCA fluxes, while a high Jmax was associated with high glycolytic fluxes. Enrichment analysis of protein sets confirmed our results. This powerful approach allowed us to identify the molecular and metabolic bases of integrated trait variation, and therefore has a broad applicability domain.

6.

Background

Recently, artificial neural networks (ANN) have been proposed as promising machines for marker-based genomic predictions of complex traits in animal and plant breeding. ANN are universal approximators of complex functions that can capture cryptic relationships between SNPs (single nucleotide polymorphisms) and phenotypic values without the need to explicitly define a genetic model. This concept is attractive for high-dimensional and noisy data, especially when the genetic architecture of the trait is unknown. However, the properties of ANN for the prediction of future outcomes of genomic selection using real data are not well characterized and, due to high computational costs, using whole-genome marker sets is difficult. We examined different non-linear network architectures, as well as several genomic covariate structures as network inputs, in order to assess their ability to predict milk traits in three dairy cattle data sets using large-scale SNP data. For training, a regularized back-propagation algorithm was used. The average correlation between the observed and predicted phenotypes in a 20 times 5-fold cross-validation was used to assess predictive ability. A linear network model served as a benchmark.

Results

Predictive abilities of different ANN models varied markedly, whereas differences between data sets were small. Dimension reduction methods enhanced prediction performance in all data sets, while at the same time computational cost decreased. For the Holstein-Friesian bull data set, an ANN with 10 neurons in the hidden layer achieved a predictive correlation of r=0.47 for milk yield when the entire marker matrix was used. Predictive ability increased when the genomic relationship matrix (r=0.64) was used as input and was best (r=0.67) when principal component scores of the marker genotypes were used. Similar results were found for the other traits in all data sets.

Conclusion

Artificial neural networks are powerful machines for non-linear genome-enabled predictions in animal breeding. However, to produce stable and high-quality outputs, variable selection methods are highly recommended when the number of markers vastly exceeds the sample size.
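The predictive-ability measure used above (the correlation between observed and cross-validation-predicted phenotypes) reduces to a Pearson correlation over out-of-fold predictions. A minimal sketch with invented phenotype values follows; a real analysis would average this over 20 replicates of 5-fold cross-validation.

```python
# Pearson correlation between observed phenotypes and out-of-fold predictions.
# All values are toy stand-ins, not the cattle data.

import math

def pearson(obs, pred):
    n = len(obs)
    mo, mp = sum(obs) / n, sum(pred) / n
    cov = sum((o - mo) * (p - mp) for o, p in zip(obs, pred))
    so = math.sqrt(sum((o - mo) ** 2 for o in obs))
    sp = math.sqrt(sum((p - mp) ** 2 for p in pred))
    return cov / (so * sp)

observed  = [7.1, 6.4, 8.0, 5.9, 7.5]   # e.g. milk-yield phenotypes
predicted = [6.8, 6.6, 7.7, 6.1, 7.2]   # out-of-fold model predictions
r = pearson(observed, predicted)
print(round(r, 2))  # → 0.98
```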

7.

Background

Modern experimental techniques deliver data sets containing profiles of tens of thousands of potential molecular and genetic markers that can be used to improve medical diagnostics. Previous studies performed with three different experimental methods on the same set of neuroblastoma patients create an opportunity to examine whether augmenting gene expression profiles with information on copy number variation can lead to improved predictions of patient survival. We propose a methodology based on a comprehensive cross-validation protocol that includes feature selection within the cross-validation loop and classification using machine learning. We also test the dependence of the results on the feature selection process, using four different feature selection methods.

Results

The models utilising features selected based on information entropy are slightly, but significantly, better than those using features obtained with the t-test. Synergy between data on genetic variation and gene expression is possible, but not confirmed. A slight, but statistically significant, increase in the predictive power of machine learning models was observed for models built on the combined data sets, both with the out-of-bag estimate and in cross-validation performed on a single set of variables. However, the improvement was smaller and non-significant when models were built within the full cross-validation procedure that included feature selection inside the cross-validation loop. Good correlation between the performance of the models in the internal and external cross-validation was observed, confirming the robustness of the proposed protocol and its results.

Conclusions

We have developed a protocol for building predictive machine learning models. The protocol can provide robust estimates of the model performance on unseen data. It is particularly well-suited for small data sets. We have applied this protocol to develop prognostic models for neuroblastoma, using data on copy number variation and gene expression. We have shown that combining these two sources of information may increase the quality of the models. Nevertheless, the increase is small and larger samples are required to reduce noise and bias arising due to overfitting.

Reviewers

This article was reviewed by Lan Hu, Tim Beissbarth and Dimitar Vassilev.

8.
Mutations in the PLA2G6 gene have variable phenotypic outcomes, including infantile neuroaxonal dystrophy, atypical neuroaxonal dystrophy, idiopathic neurodegeneration with brain iron accumulation and Karak syndrome. The cause of this phenotypic variation is so far unknown, which impairs both genetic diagnosis and appropriate family counseling. We report detailed clinical, electrophysiological, neuroimaging, histologic, biochemical and genetic characterization of 11 patients, from 6 consanguineous families, who were followed for a period of up to 17 years. Cerebellar atrophy was constant and the earliest feature of the disease, preceding brain iron accumulation, leading to the provisional diagnosis of a recessive progressive ataxia in these patients. Ultrastructural characterization of patients’ muscle biopsies revealed focal accumulation of granular and membranous material, possibly resulting from defective membrane homeostasis caused by disrupted PLA2G6 function. Enzyme studies in one of these muscle biopsies provided evidence for a relatively low mitochondrial content, which is compatible with the structural mitochondrial alterations seen by electron microscopy. Genetic characterization of the 11 patients led to the identification of six underlying PLA2G6 gene mutations, five of which are novel. Importantly, by combining clinical and genetic data we have observed that while the phenotype of neurodegeneration associated with PLA2G6 mutations is variable in this cohort of patients belonging to the same ethnic background, it is partially influenced by the genotype, considering the age at onset and the functional disability criteria. Molecular testing for PLA2G6 mutations is, therefore, indicated in childhood-onset ataxia syndromes if neuroimaging shows cerebellar atrophy with or without evidence of iron accumulation.

9.
The genetic basis of complex diseases is expected to be highly heterogeneous, with complex interactions among multiple disease loci and environmental factors. Owing to the multi-dimensional nature of interactions among large numbers of genetic loci, efficient statistical approaches have not yet been developed to handle this high-order epistatic complexity. In this article, we introduce a new approach for testing genetic epistasis at multiple loci using an entropy-based statistic for a case-only design. The entropy-based statistic asymptotically follows a χ2 distribution. Computer simulations show that the entropy-based approach has better control of type I error and higher power compared with the standard χ2 test. Motivated by a schizophrenia data set, we propose a method for measuring and testing the relative entropy of a clinical phenotype, through which one can test the contribution or interaction of multiple disease loci to a clinical phenotype. A sequential forward selection procedure is proposed to construct a genetic interaction network, which is illustrated through a tree-based diagram. The network information clearly shows the relative importance of a set of genetic loci for a clinical phenotype. To show the utility of the new entropy-based approach, we apply it to two real data sets: a schizophrenia data set and a published malaria data set. Our approach provides a fast and testable framework for studying genetic epistasis in a case-only design.
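A core quantity behind entropy-based interaction tests of this kind is the relative entropy (Kullback–Leibler divergence) between observed genotype-combination frequencies and those expected under independence; scaled by 2N it yields a likelihood-ratio-type statistic that is asymptotically χ2. A hedged sketch with made-up frequencies (not the paper's exact statistic):

```python
import math

def relative_entropy(p, q):
    """Kullback–Leibler divergence sum(p * log(p/q)), in nats; zero iff p == q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Toy case-only example: observed two-locus genotype-combination frequencies
# versus a uniform expectation standing in for "no interaction".
observed = [0.30, 0.20, 0.25, 0.25]
expected = [0.25, 0.25, 0.25, 0.25]
d = relative_entropy(observed, expected)
print(round(d, 4))  # → 0.0101
```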

10.

Background

The goal of this work is to develop a non-invasive method to help detect Alzheimer's disease in its early stages, by applying voice analysis techniques based on machine learning algorithms.

Methods

We extract temporal and acoustical voice features (e.g. jitter and harmonics-to-noise ratio) from read speech of patients in the Early Stage of Alzheimer's Disease (ES-AD), with Mild Cognitive Impairment (MCI), and from a Healthy Control (HC) group. Three classification methods are used to evaluate the efficiency of these features, namely kNN, SVM and decision tree. To assess the effectiveness of this set of features, we compare them with two sets of feature parameters that are widely used in speech and speaker recognition applications. A two-stage feature selection process is conducted to optimize classification performance. For these experiments, the data samples of the HC, ES-AD and MCI groups were collected at the AP-HP Broca Hospital in Paris.

Results

First, a wrapper feature selection method for each feature set is evaluated and the relevant features for each classifier are selected. By combining, for each classifier, the features selected from each initial set, we improve the classification accuracy by a relative gain of more than 30% for all classifiers. Then the same feature selection procedure is performed anew on the combination of selected feature sets, resulting in an additional significant improvement of classification accuracy.

Conclusion

The proposed method improved the classification accuracy for the ES-AD, MCI and HC groups and demonstrates the potential of speech analysis and machine learning techniques to help detect pathological conditions.
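The wrapper feature selection described above can be sketched as a greedy forward search driven by a classifier-based score. The scoring function below is a toy stand-in; in the study it would be the cross-validated accuracy of kNN, SVM, or a decision tree.

```python
# Greedy forward wrapper: keep adding the feature that most improves the
# classifier-based score, stop when no candidate helps.

def forward_wrapper(n_features, score):
    """Forward selection over feature indices, guided by score(feature_list)."""
    selected, best = [], score([])
    while True:
        candidates = [f for f in range(n_features) if f not in selected]
        if not candidates:
            break
        gains = [(score(selected + [f]), f) for f in candidates]
        new_best, f = max(gains)
        if new_best <= best:
            break
        selected.append(f)
        best = new_best
    return selected, best

# Toy score: features 0 and 2 are informative, the rest add nothing.
def toy_score(feats):
    return 0.5 + 0.2 * (0 in feats) + 0.1 * (2 in feats)

sel, s = forward_wrapper(5, toy_score)
print(sel, round(s, 2))  # → [0, 2] 0.8
```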

11.
E. Luquet, J-P. Léna, C. Miaud & S. Plénet. Heredity (2015) 114(1): 69–79
Variation in the environment can induce different patterns of genetic and phenotypic differentiation among populations. Both neutral processes and selection can influence phenotypic differentiation. Altitudinal phenotypic variation is of particular interest in disentangling the interplay between neutral processes and selection in the dynamics of local adaptation processes but remains little explored. We conducted a common garden experiment to study the phenotypic divergence in larval life-history traits among nine populations of the common toad (Bufo bufo) along an altitudinal gradient in France. We further used correlation among population pairwise estimates of quantitative trait (QST) and neutral genetic divergence (FST from neutral microsatellite markers), as well as altitudinal difference, to estimate the relative role of divergent selection and neutral genetic processes in phenotypic divergence. We provided evidence for a neutral genetic differentiation resulting from both isolation by distance and difference in altitude. We found evidence for phenotypic divergence along the altitudinal gradient (faster development, lower growth rate and smaller metamorphic size). The correlation between pairwise QSTs–FSTs and altitude differences suggested that this phenotypic differentiation was most likely driven by altitude-mediated selection rather than by neutral genetic processes. Moreover, we found different divergence patterns for larval traits, suggesting that different selective agents may act on these traits and/or selection on one trait may constrain the evolution on another through genetic correlation. Our study highlighted the need to design more integrative studies on the common toad to unravel the underlying processes of phenotypic divergence and its selective agents in the context of environmental clines.
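The QST–FST comparison reduces, in outline, to correlating population-pairwise divergence estimates with pairwise altitude differences. All values below are invented; real QST and FST estimation requires variance components and many loci.

```python
# Correlate pairwise altitude differences with toy pairwise QST - FST values.

import math
from itertools import combinations

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

altitudes = [300, 800, 1400, 2000]            # m, one per population (toy)
# Toy pairwise QST - FST values, ordered like combinations(range(4), 2).
qst_minus_fst = [0.02, 0.10, 0.18, 0.05, 0.12, 0.06]

alt_diff = [abs(altitudes[i] - altitudes[j])
            for i, j in combinations(range(len(altitudes)), 2)]
r = pearson(alt_diff, qst_minus_fst)
print(round(r, 2))  # strong positive correlation in this toy example
```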

12.
Mutations in the nuclear gene POLG (encoding the catalytic subunit of DNA polymerase gamma) are an important cause of mitochondrial disease. The most common POLG mutation, A467T, appears to exhibit considerable phenotypic heterogeneity. The mechanism by which this single genetic defect results in such clinical diversity remains unclear. In this study we evaluate the clinical, neuropathological and mitochondrial genetic features of four unrelated patients with homozygous A467T mutations. One patient presented with the severe and lethal Alpers-Huttenlocher syndrome, which was confirmed on neuropathology, and was found to have a depletion of mitochondrial DNA (mtDNA). Of the remaining three patients, one presented with mitochondrial encephalomyopathy, lactic acidosis and stroke-like episodes (MELAS), one with a phenotype in the Myoclonic Epilepsy, Myopathy and Sensory Ataxia (MEMSA) spectrum and one with Sensory Ataxic Neuropathy, Dysarthria and Ophthalmoplegia (SANDO). All three had secondary accumulation of multiple mtDNA deletions. Complete sequence analysis of muscle mtDNA using the MitoChip resequencing chip in all four cases demonstrated significant variation in mtDNA, including a pathogenic MT-ND5 mutation in one patient. These data highlight the variable and overlapping clinical and neuropathological phenotypes and downstream molecular defects caused by the A467T mutation, which may result from factors such as the mtDNA genetic background, nuclear genetic modifiers and environmental stressors.

13.
Landscape genetics lacks explicit methods for dealing with the uncertainty in landscape resistance estimation, which is particularly problematic when sample sizes of individuals are small. Unless uncertainty can be quantified, valuable but small data sets may be rendered unusable for conservation purposes. We offer a method to quantify uncertainty in landscape resistance estimates using multimodel inference as an improvement over single model‐based inference. We illustrate the approach empirically using co‐occurring, woodland‐preferring Australian marsupials within a common study area: two arboreal gliders (Petaurus breviceps and Petaurus norfolcensis) and one ground‐dwelling antechinus (Antechinus flavipes). First, we use maximum‐likelihood and a bootstrap procedure to identify the best‐supported isolation‐by‐resistance model out of 56 models defined by linear and non‐linear resistance functions. We then quantify uncertainty in resistance estimates by examining parameter selection probabilities from the bootstrapped data. The selection probabilities provide estimates of uncertainty in the parameters that drive the relationships between landscape features and resistance. We then validate our method for quantifying uncertainty using simulated genetic and landscape data, showing that for most parameter combinations it provides sensible estimates of uncertainty. We conclude that small data sets can be informative in landscape genetic analyses provided uncertainty can be explicitly quantified. Being explicit about uncertainty in landscape genetic models will make results more interpretable and useful for conservation decision‐making, where dealing with uncertainty is critical.
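The bootstrap-based parameter selection probabilities can be illustrated in miniature: refit candidate models on resampled data and record how often each wins. The candidate "models" here are toy slopes, standing in for maximum-likelihood resistance functions.

```python
# Bootstrap model-selection frequencies: resample the data with replacement,
# pick the best-fitting candidate each time, and tally selection probabilities.

import random

random.seed(1)

# Toy data generated with true slope 2.0 plus Gaussian noise.
data = [(x, 2.0 * x + random.gauss(0, 1.0)) for x in range(20)]

def sse(sample, slope):
    """Score a candidate 'model' (a slope) by its sum of squared errors."""
    return sum((y - slope * x) ** 2 for x, y in sample)

candidate_slopes = [0.5, 1.0, 2.0, 3.0]
wins = {s: 0 for s in candidate_slopes}
n_boot = 200
for _ in range(n_boot):
    boot = [random.choice(data) for _ in data]   # resample with replacement
    best = min(candidate_slopes, key=lambda s: sse(boot, s))
    wins[best] += 1

selection_prob = {s: wins[s] / n_boot for s in candidate_slopes}
print(selection_prob)  # the true slope 2.0 should dominate
```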

14.

Background

The acute-phase increase in serum C-reactive protein (CRP) is used to diagnose and monitor infectious and inflammatory diseases. Little is known about the influence of genetics on acute-phase CRP, particularly in patients with chronic inflammation.

Methods and Findings

We studied two independent sets of patients with chronic inflammation due to rheumatoid arthritis (total 695 patients). A tagSNP approach captured common variation at the CRP locus and the relationship between genotype and serum CRP was explored by linear modelling. Erythrocyte sedimentation rate (ESR) was incorporated as an independent marker of inflammation to adjust for the varying levels of inflammatory disease activity between patients. Common genetic variants at the CRP locus were associated with acute-phase serum CRP (for the most associated haplotype: p = 0.002, p < 0.0005, p < 0.0005 in patient sets 1, 2, and the combined sets, respectively), translating into an approximately 3.5-fold change in expected serum CRP concentrations between carriers of two common CRP haplotypes. For example, when ESR = 50 mm/h the expected geometric mean CRP (95% confidence interval) concentration was 43.1 mg/l (32.1–50.0) for haplotype 1 and 14.2 mg/l (9.5–23.2) for haplotype 4.

Conclusions

Our findings raise questions about the interpretation of acute-phase serum CRP. In particular, failure to take into account the potential for genetic effects may result in the inappropriate reassurance or suboptimal treatment of patients simply because they carry low-CRP–associated genetic variants. CRP is increasingly being incorporated into clinical algorithms to compare disease activity between patients and to predict future clinical events: our findings bear on the use of these algorithms. For example, where access to effective, but expensive, biological therapies in rheumatoid arthritis is rationed on the basis of a DAS28-CRP clinical activity score, two patients with identical underlying disease severity could be given, or denied, treatment on the basis of CRP genotype alone. The accuracy and utility of these algorithms might be improved by using a genetically adjusted CRP measurement. Please see later in the article for the Editors' Summary.
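The haplotype contrast reported above can be illustrated with a log-linear model of CRP on ESR and haplotype. The coefficients below are hypothetical, chosen only so that the haplotype-1 vs haplotype-4 contrast at ESR = 50 mm/h roughly reproduces the reported expected concentrations; they are not the study's fitted values.

```python
import math

# Illustrative log-linear model: log(CRP) = intercept + slope*ESR + haplotype effect.
# All coefficients are assumptions for this sketch, not fitted estimates.
INTERCEPT = 1.0                             # log(mg/l) at ESR = 0 (assumed)
ESR_SLOPE = 0.055                           # per mm/h (assumed)
HAP_EFFECT = {"hap1": 0.0, "hap4": -1.11}   # log-scale haplotype offsets (assumed)

def expected_crp(esr_mm_h, haplotype):
    """Geometric-mean CRP (mg/l) predicted from ESR and CRP haplotype."""
    return math.exp(INTERCEPT + ESR_SLOPE * esr_mm_h + HAP_EFFECT[haplotype])

crp1 = expected_crp(50, "hap1")
crp4 = expected_crp(50, "hap4")
print(round(crp1, 1), round(crp4, 1), round(crp1 / crp4, 2))
```

Because the model is linear on the log scale, the haplotype effect is a constant multiplicative fold-change in expected CRP at any ESR, which is exactly why a genetically adjusted CRP could be computed by dividing out the haplotype factor.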

15.
Statistical modeling of the links between genetic profiles and environmental and clinical data to aid in medical diagnosis is a challenge. Here, we present a computational approach for rapidly selecting the clinical data most important for medical decisions based on personalized genetic profiles. What could take hours or days of computing is available on-the-fly, making this strategy feasible to implement as a routine without demanding great computing power. The key to rapidly obtaining an optimal or nearly optimal mathematical function that evaluates "disease stage" by combining genetic profiles with personal clinical data is to query a precomputed solution database. The database is generated in advance by a new hybrid feature selection method that makes use of support vector machines, recursive feature elimination and random sub-space search. To evaluate the method, data on polymorphisms in the renin-angiotensin-aldosterone system genes, together with clinical data, were obtained from patients with hypertension and control subjects. The disease "risk" was determined by classifying the patients' data with a support vector machine model based on the optimized feature set and then measuring the Euclidean distance to the hyperplane decision function. Our results showed the association of renin-angiotensin-aldosterone system gene haplotypes with hypertension. The association of polymorphism patterns with different ethnic groups was also tracked by the feature selection process. A demonstration of this method is available online on the project's web site.
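The "risk" score described above, the signed Euclidean distance from a patient's feature vector to the SVM decision hyperplane, reduces to a short calculation for a linear decision function. The weights and patient vector below are hypothetical, not the fitted model from the study.

```python
# Toy linear SVM decision function f(x) = w.x + b; dividing the margin by
# ||w|| gives the signed Euclidean distance to the separating hyperplane.
# These weights are made up for illustration only.
w = [0.8, -0.5, 1.2]   # weights over selected genetic + clinical features
b = -0.3               # bias term

def risk_score(x):
    margin = sum(wi * xi for wi, xi in zip(w, x)) + b
    norm = sum(wi * wi for wi in w) ** 0.5
    return margin / norm   # signed distance: sign = class, magnitude = "risk"

patient = [1.0, 0.0, 1.0]   # e.g. encoded haplotype + clinical indicators
print(round(risk_score(patient), 3))
```

The sign of the score gives the predicted class while its magnitude grades how deep the patient sits inside that class region, which is what lets a binary classifier yield a continuous "disease stage".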

16.
17.
Creation of defined genetic mutations is a powerful method for dissecting mechanisms of bacterial disease; however, many genetic tools are only developed for laboratory strains. We have designed a modular and general negative selection strategy based on inducible toxins that provides high selection stringency in clinical Escherichia coli and Salmonella isolates. No strain- or species-specific optimization is needed, yet this system achieves better selection stringency than all previously reported negative selection systems usable in unmodified E. coli strains. The high stringency enables use of negative instead of positive selection in phage-mediated generalized transduction and also allows transfer of alleles between arbitrary strains of E. coli without requiring phage. The modular design should also allow further extension to other bacteria. This negative selection system thus overcomes disadvantages of existing systems, enabling definitive genetic experiments in both lab and clinical isolates of E. coli and other Enterobacteriaceae.

18.
Inferring precise phenotypic patterns from population-scale clinical data is a core computational task in the development of precision, personalized medicine. The traditional approach uses supervised learning, in which an expert designates which patterns to look for (by specifying the learning task and the class labels) and where to look for them (by specifying the input variables). While appropriate for individual tasks, this approach scales poorly and misses the patterns we don't think to look for. Unsupervised feature learning overcomes these limitations by identifying patterns (or features) that collectively form a compact and expressive representation of the source data, with no need for expert input or labeled examples. Its rising popularity is driven by new deep learning methods, which have produced high-profile successes on difficult standardized problems of object recognition in images. Here we introduce its use for phenotype discovery in clinical data. This use is challenging because the largest source of clinical data – Electronic Medical Records – typically contains noisy, sparse, and irregularly timed observations, rendering it a poor substrate for deep learning methods. Our approach couples dirty clinical data to a deep learning architecture via longitudinal probability densities inferred using Gaussian process regression. From episodic, longitudinal sequences of serum uric acid measurements in 4368 individuals we produced continuous phenotypic features that suggest multiple population subtypes, and that accurately distinguished (0.97 AUC) the uric-acid signatures of gout vs. acute leukemia despite not being optimized for that task. The unsupervised features were as accurate as gold-standard features engineered by an expert with complete knowledge of the domain, the classification task, and the class labels. Our findings demonstrate the potential for achieving computational phenotype discovery at population scale. We expect such data-driven phenotypes to expose unknown disease variants and subtypes and to provide rich targets for genetic association studies.
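The Gaussian-process step, interpolating irregularly timed lab values onto a regular grid before feeding them to a learning model, can be sketched as follows. The kernel hyperparameters and the measurement values are assumptions for illustration, not the study's fitted values.

```python
import numpy as np

def rbf(t1, t2, length=30.0, var=1.0):
    """Squared-exponential covariance between two sets of time points."""
    d = t1[:, None] - t2[None, :]
    return var * np.exp(-0.5 * (d / length) ** 2)

def gp_posterior_mean(t_obs, y_obs, t_grid, noise=0.1):
    """GP-regression posterior mean on a regular grid, centred on the
    empirical mean of the observations (hyperparameters are assumed)."""
    m = y_obs.mean()
    K = rbf(t_obs, t_obs) + noise * np.eye(len(t_obs))
    Ks = rbf(t_grid, t_obs)
    return m + Ks @ np.linalg.solve(K, y_obs - m)

t_obs = np.array([0.0, 40.0, 95.0])    # days of lab draws (irregularly spaced)
y_obs = np.array([6.1, 7.4, 6.8])      # serum uric acid, mg/dl (hypothetical)
t_grid = np.linspace(0.0, 100.0, 11)   # regular grid a learner can consume
mu = gp_posterior_mean(t_obs, y_obs, t_grid)
print(np.round(mu, 2))
```

The posterior mean tracks the sparse observations where data exist and reverts smoothly toward the overall mean elsewhere, turning episodic records into a fixed-length vector suitable for a deep learning model.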

19.
20.
Genome-Wide Association Studies (GWAS) make it possible to investigate many relations between Single Nucleotide Polymorphisms (SNPs) and complex diseases. GWAS output can be large and high-dimensional, and the relations between SNPs, phenotypes and diseases are most likely nonlinear. To handle high-volume, high-dimensional data and to find these nonlinear relations, we utilized data mining approaches and designed a hybrid feature selection model combining a support vector machine and a decision tree. The model is tested on prostate cancer data, and for the first time combined genotype and phenotype information is used to increase diagnostic performance. We were able to select phenotypic features such as ethnicity and body mass index, and SNPs that map to specific genes such as CRR9 and TERT. The performance of the proposed hybrid model on the prostate cancer dataset, with a sensitivity of 90.92% and an area under the ROC curve of 0.91, shows the potential of the approach for prediction and early detection of prostate cancer.
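The two reported metrics, sensitivity and area under the ROC curve, can be computed without any dependencies; the labels and scores below are hypothetical, not the study's predictions.

```python
def sensitivity(labels, preds):
    """True-positive rate: TP / (TP + FN) over binary labels/predictions."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    return tp / (tp + fn)

def auc(labels, scores):
    """Rank-based AUC: probability a random positive outscores a random
    negative, counting ties as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical classifier output on six cases (1 = cancer, 0 = control).
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.1]
preds = [1 if s >= 0.5 else 0 for s in scores]
print(sensitivity(labels, preds), auc(labels, scores))
```

Sensitivity depends on the chosen decision threshold, while AUC summarizes ranking quality across all thresholds, which is why papers such as this one report both.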
