Similar literature
20 similar documents found (search time: 31 ms)
1.
MOTIVATION: Microarrays have become a central tool in biological research. Their applications range from functional annotation to tissue classification and genetic network inference. A key step in the analysis of gene expression data is the identification of groups of genes that manifest similar expression patterns. This translates to the algorithmic problem of clustering genes based on their expression patterns. RESULTS: We present a novel clustering algorithm, called CLICK, and its applications to gene expression analysis. The algorithm utilizes graph-theoretic and statistical techniques to identify tight groups (kernels) of highly similar elements, which are likely to belong to the same true cluster. Several heuristic procedures are then used to expand the kernels into the full clusters. We report on the application of CLICK to a variety of gene expression data sets. In all those applications it outperformed extant algorithms according to several common figures of merit. We also point out that CLICK can be successfully used for the identification of common regulatory motifs in the upstream regions of co-regulated genes. Furthermore, we demonstrate how CLICK can be used to accurately classify tissue samples into disease types, based on their expression profiles. Finally, we present a new Java-based graphical tool, called EXPANDER, for gene expression analysis and visualization, which incorporates CLICK and several other popular clustering algorithms. AVAILABILITY: http://www.cs.tau.ac.il/~rshamir/expander/expander.html

2.
Genome-wide association analysis involving data on many single nucleotide polymorphisms (SNPs) is challenging mathematically and computationally. It is time consuming to classify the combinations of multilocus genotypes into high- and low-risk groups without false positive and false negative errors. Hence, we propose the odds ratio-based genetic algorithm (OR-GA) method, which uses the odds ratio as a new quantitative measure of disease risk among many SNP combinations. Genetic algorithms (GA) are applied to generate SNP "barcodes" of genotypes that show the maximal difference of occurrence between the case and control groups, in order to predict disease susceptibility (e.g., osteoporosis). When individuals are grouped into a low and a high bone mass density (BMD) range, different SNP barcode patterns may occur several times in each of these two groups. Our results showed that a GA can effectively identify a specific SNP barcode with an optimized fitness value. SNP barcodes with a low fitness value will naturally be discarded from the population. A representative SNP barcode with a variable number of SNPs is processed by odds ratio analysis to determine the maximum difference between the low and high BMD groups in a statistical manner. Therefore, this paper introduces a powerful procedure for genome-wide analysis of disease-associated SNP barcodes.
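
The odds ratio mentioned in the abstract above can be illustrated with a small sketch. This is not the OR-GA implementation. The function name, the argument names and the continuity correction are assumptions made for illustration only; the abstract does not say how zero counts are handled.

```python
def odds_ratio(group1_with, group1_without, group2_with, group2_without,
               correction=0.5):
    """Odds ratio of carrying a given SNP barcode in group 1 versus group 2.

    Hypothetical helper: the four arguments are the cells of a 2x2 table
    (carriers and non-carriers in each group). If any cell is zero, a
    continuity correction (assumed here, 0.5 per cell) is added to every
    cell so that the ratio stays finite.
    """
    cells = [group1_with, group1_without, group2_with, group2_without]
    if any(c == 0 for c in cells):
        cells = [c + correction for c in cells]
    a, b, c, d = cells
    # Odds in group 1 are a/b, odds in group 2 are c/d.
    return (a * d) / (b * c)
```

For example, `odds_ratio(30, 70, 10, 90)` compares odds of 30/70 against 10/90 and evaluates to 27/7, roughly 3.86. A genetic algorithm could use such a value as part of a fitness score for candidate barcodes.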

3.
We investigate a new method to place patients into risk groups in censored survival data. Properties such as median survival time and end survival rate are implicitly improved by optimizing the area under the survival curve. Artificial neural networks (ANN) are trained to either maximize or minimize this area using a genetic algorithm, and combined into an ensemble to predict one of low, intermediate, or high risk groups. Estimated patient risk can influence treatment choices, and is important for study stratification. A common approach is to sort the patients according to a prognostic index and then group them along the quartile limits. The Cox proportional hazards model (Cox) is one example of this approach. Another method of doing risk grouping is recursive partitioning (Rpart), which constructs a decision tree where each branch point maximizes the statistical separation between the groups. ANN, Cox, and Rpart are compared on five publicly available data sets with varying properties. Cross-validation, as well as separate test sets, are used to validate the models. Results on the test sets show comparable performance, except for the smallest data set where Rpart's predicted risk groups turn out to be inverted, an example of crossing survival curves. Cross-validation shows that all three models exhibit crossing of some survival curves on this small data set but that the ANN model manages the best separation of groups in terms of median survival time before such crossings. The conclusion is that optimizing the area under the survival curve is a viable approach to identify risk groups. Training ANNs to optimize this area combines key strengths of prognostic indices and Rpart. First, a desired minimum group size can be specified, as for a prognostic index. Second, non-linear effects among the covariates can be utilized, as with Rpart.

4.
There is evidence that tetracyclines are potentially useful drugs to treat prion disease, the fatal neurodegenerative disease in which cellular prion proteins change in conformation to become a disease-specific species (PrPSc). Based on an in vitro anti-fibrillogenesis test, and using the peptide PrP106–126 in the presence of tetracycline and 14 derivatives, we carried out a three-dimensional quantitative structure-activity relationship (3D-QSAR) study to investigate the stereoelectronic features required for anti-fibrillogenic activity. A preliminary variable reduction technique was used to search for grid points where statistical indexes of interaction potential distributions present local maximum (or minimum) values. Variable selection genetic algorithms were then used to search for the best 3D-QSAR models. A 6-variable model showed the best predictability of the anti-fibrillogenic activity and highlighted the best tetracycline substitution patterns: hydroxyl group presence in positions 5 and 6, electrodonor substituents on the aromatic D-ring, alkylamine substituent at the amidic group in position 2 and non-epi configuration of the NMe2 group.

5.

Background

The use of structural equation models for the analysis of recursive and simultaneous relationships between phenotypes has become more popular recently. The aim of this paper is to illustrate how these models can be applied in animal breeding to achieve parameterizations of different levels of complexity and, more specifically, to model phenotypic recursion between three calving traits: gestation length (GL), calving difficulty (CD) and stillbirth (SB). All recursive models considered here postulate heterogeneous recursive relationships between GL and liabilities to CD and SB, and between liability to CD and liability to SB, depending on categories of GL phenotype.

Methods

Four models were compared in terms of goodness of fit and predictive ability: 1) standard mixed model (SMM), a model with unstructured (co)variance matrices; 2) recursive mixed model 1 (RMM1), assuming that residual correlations are due to the recursive relationships between phenotypes; 3) RMM2, assuming that correlations between residuals and contemporary groups are due to recursive relationships between phenotypes; and 4) RMM3, postulating that the correlations between genetic effects, contemporary groups and residuals are due to recursive relationships between phenotypes.

Results

For all the RMM considered, the estimates of the structural coefficients were similar. Results revealed a nonlinear relationship between GL and the liabilities both to CD and to SB, and a linear relationship between the liabilities to CD and SB. Differences in terms of goodness of fit and predictive ability of the models considered were negligible, suggesting that RMM3 is plausible.

Conclusions

The applications examined in this study suggest the plausibility of a nonlinear recursive effect from GL onto CD and SB. Also, the fact that the most restrictive model RMM3, which assumes that the only cause of correlation is phenotypic recursion, performs as well as the others indicates that the phenotypic recursion may be an important cause of the observed patterns of genetic and environmental correlations.

6.
Complex sex-biased dispersal patterns often characterize social-group-living species and may ultimately drive patterns of cooperation and competition within and among groups. This study investigates whether observational data or genetic data alone can elucidate the potentially complex dispersal patterns of social-group-living black and white colobus monkeys (Colobus guereza, 'guerezas'), or whether combining both data types provides novel insights. We employed long-term observation of eight neighbouring guereza groups in Kibale National Park, Uganda, as well as microsatellite genotyping of these and two other neighbouring groups. We created a statistical model to examine the observational data and used dyadic relatedness values within and among groups to analyse the genetic data. Analyses of observational and genetic data both supported the conclusion that males typically disperse from their natal groups and often transfer into nearby groups and probably beyond. Both data types also supported the conclusion that females are more philopatric than males but provided somewhat conflicting evidence about the extent of female philopatry. Observational data suggested that female dispersal is rare or nonexistent and transfers into neighbouring groups do not occur, but genetic data revealed numerous pairs of closely related adult females among neighbouring groups. Only by combining both data types were we able to understand the complexity of sex-biased dispersal patterns in guerezas and the processes that could explain our seemingly conflicting results. We suggest that the data are compatible with a scenario of group dissolution prior to the start of this study, followed by female transfers into different neighbouring groups.

7.
We model genetic regulatory networks in the framework of continuous-time recurrent networks. The network parameters are determined from gene expression level time series data using genetic algorithms. We have applied the method to expression data from the development of rat central nervous system, where the active genes cluster into four groups, within which the temporal expression patterns are similar. The data permit us to identify approximately the interactions between these groups of genes. We find that generally a single time series is of limited value in determining the interactions in the network, but multiple time series collected in related tissues or under treatment with different drugs can fix their values much more precisely.
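
As a rough illustration of what a continuous-time recurrent network looks like in practice, the sketch below integrates one commonly used form of the model with the Euler method. The abstract does not give the equations, so the form tau_i * dx_i/dt = -x_i + sigmoid(sum_j w_ij * x_j + b_i), the function names and all parameters are assumptions. In the setting of the abstract, a genetic algorithm would search for the weights, biases and time constants that make the simulated trajectory match the measured expression time series.

```python
import math


def sigmoid(value):
    """Logistic squashing function."""
    return 1.0 / (1.0 + math.exp(-value))


def simulate_ctrnn(weights, biases, taus, initial, dt, steps):
    """Euler integration of an assumed continuous-time recurrent network.

    Assumed model: tau_i * dx_i/dt = -x_i + sigmoid(sum_j w_ij * x_j + b_i).
    Returns the list of states, starting with the initial state.
    """
    state = list(initial)
    trajectory = [list(state)]
    size = len(state)
    for _ in range(steps):
        updated = []
        for i in range(size):
            drive = sum(weights[i][j] * state[j] for j in range(size)) + biases[i]
            derivative = (-state[i] + sigmoid(drive)) / taus[i]
            updated.append(state[i] + dt * derivative)
        state = updated
        trajectory.append(list(state))
    return trajectory
```

A fitness function for the genetic algorithm could then be, for instance, the summed squared difference between such a trajectory and the observed expression levels at the sampled time points.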

8.
The present work provides the first broad-scale screening of allozymes in the land snail Helix aspersa. By using overall information available on the distribution of genetic variation between 102 populations previously investigated, we expect to strengthen our knowledge on the spread of the invasive aspersa subspecies in the Western Mediterranean. We propose a new approach based on a centre-based clustering procedure to cluster populations into groups following rules of geographical proximity and genetic similarity. Assuming a stepping-stone model of diffusion, we apply a partitioning algorithm which clusters only populations that are geographically contiguous. The algorithm used, which is actually part of leading methods developed for analysing large microarray datasets, is that of the k-means. Its goal is to minimize the within-group variance. The spatial constraint is provided by a list of connections between localities deduced from a Delaunay network. After testing each optimal group for the presence of spatial arrangement in the genetic data, the inferred genetic structure was compared with partitions obtained from other methods published for defining homogeneous groups (i.e. the Monmonier and SAMOVA algorithms). Competing biogeographical scenarios inferred from the k-means procedure were then compared and discussed to shed more light on colonization routes taken by the species.

9.
Many social animals live in stable groups, and it has been argued that kinship plays a major role in their group formation process. In this study we present the mathematical analysis of a recent model which uses kinship as a main factor to explain observed group patterns in a finite sample of individuals. We describe the average number of groups and the probability distribution of group sizes predicted by this model. Our method is based on the study of recursive equations underlying these quantities. We obtain asymptotic equivalents for probability distributions and moments as the sample size increases, and we exhibit power-law behaviours. Computer simulations are also utilized to measure the extent to which the asymptotic approximation can be applied with confidence.

10.
The present study is an extension of the investigations made by Grieszbach and Schack (1993) where the recursive estimators of the quantile were introduced. Attention is focused on statistical properties and on the controlling of these estimators in order to reduce their variance and to improve their capability of adaptation. Using methods of stochastic approximation, several control algorithms have been developed, where both the consistent and the adaptive estimation are considered. Due to the recursive computation formula the estimators are suitable for the analysis of large data sets and for sets whose elements are obtained sequentially. In this study, application examples from the analysis of EEG‐records are presented, where quantiles are used as threshold values.
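
To make the idea of a recursive quantile estimator concrete, here is a minimal sketch of a textbook stochastic-approximation update. It is not claimed to be the Grieszbach and Schack estimator or any of the control algorithms from the study; the function name, the starting value and the 1/n step-size rule are assumptions. Each new sample moves the estimate up by a step times p if the sample lies above it, and down by a step times (1 - p) otherwise, so only the current estimate needs to be stored.

```python
def recursive_quantile(samples, p, start=0.0, gain=1.0):
    """Sequentially estimate the p-quantile of a data stream.

    Assumed update rule (Robbins-Monro style):
        q_n = q_{n-1} + (gain / n) * (p - [x_n <= q_{n-1}])
    where [.] is 1 if the condition holds and 0 otherwise.
    """
    estimate = start
    for n, x in enumerate(samples, start=1):
        step = gain / n  # decreasing step: consistent rather than adaptive
        indicator = 1.0 if x <= estimate else 0.0
        estimate += step * (p - indicator)
    return estimate
```

Replacing the decreasing step `gain / n` with a small constant would trade consistency for the ability to track a quantile that drifts over time, which is the kind of consistent versus adaptive distinction the abstract refers to.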

11.
This paper presents an attribute clustering method which is able to group genes based on their interdependence so as to mine meaningful patterns from the gene expression data. It can be used for gene grouping, selection, and classification. The partitioning of a relational table into attribute subgroups allows a small number of attributes within or across the groups to be selected for analysis. By clustering attributes, the search dimension of a data mining algorithm is reduced. The reduction of search dimension is especially important to data mining in gene expression data because such data typically consist of a huge number of genes (attributes) and a small number of gene expression profiles (tuples). Most data mining algorithms are typically developed and optimized to scale to the number of tuples instead of the number of attributes. The situation becomes even worse when the number of attributes overwhelms the number of tuples, in which case the likelihood of reporting patterns that are actually irrelevant due to chance becomes rather high. It is for the aforementioned reasons that gene grouping and selection are important preprocessing steps for many data mining algorithms to be effective when applied to gene expression data. This paper defines the problem of attribute clustering and introduces a methodology for solving it. Our proposed method groups interdependent attributes into clusters by optimizing a criterion function derived from an information measure that reflects the interdependence between attributes. By applying our algorithm to gene expression data, meaningful clusters of genes are discovered. The grouping of genes based on attribute interdependence within a group helps to capture different aspects of gene association patterns in each group. Significant genes selected from each group then contain useful information for gene expression classification and identification.
To evaluate the performance of the proposed approach, we applied it to two well-known gene expression data sets and compared our results with those obtained by other methods. Our experiments show that the proposed method is able to find meaningful clusters of genes. By selecting a subset of genes which have high multiple-interdependence with others within clusters, significant classification information can be obtained. Thus, a small pool of selected genes can be used to build classifiers with a very high classification rate. From the pool, gene expressions of different categories can be identified.

12.
The question of whether animals possess ‘cultures’ or ‘traditions’ continues to generate widespread theoretical and empirical interest. Studies of wild chimpanzees have featured prominently in this discussion, as the dominant approach used to identify culture in wild animals was first applied to them. This procedure, the ‘method of exclusion,’ begins by documenting behavioural differences between groups and then infers the existence of culture by eliminating ecological explanations for their occurrence. The validity of this approach has been questioned because genetic differences between groups have not explicitly been ruled out as a factor contributing to between-group differences in behaviour. Here we investigate this issue directly by analysing genetic and behavioural data from nine groups of wild chimpanzees. We find that the overall levels of genetic and behavioural dissimilarity between groups are highly and statistically significantly correlated. Additional analyses show that only a very small number of behaviours vary between genetically similar groups, and that there is no obvious pattern as to which classes of behaviours (e.g. tool-use versus communicative) have a distribution that matches patterns of between-group genetic dissimilarity. These results indicate that genetic dissimilarity cannot be eliminated as playing a major role in generating group differences in chimpanzee behaviour.

13.
The problem of detecting DNA motifs with functional relevance in real biological sequences is difficult due to a number of biological, statistical and computational issues and also because of the lack of knowledge about the structure of searched patterns. Many algorithms are implemented in fully automated processes, which are often based upon a guess of input parameters from the user at the very first step. In this paper, we present a novel method for the detection of seeded DNA motifs, composed of regions with a different extent of variability. The method is based on a multi-step approach, which was implemented in a motif searching web tool (MOST). Overrepresented exact patterns are extracted from input sequences and clustered to produce motif core regions, which are then extended and scored to generate seeded motifs. The combination of automated pattern discovery algorithms and different display tools for the evaluation and selection of results at several analysis steps can potentially lead to much more meaningful results than complete automation can produce. Experimental results on different real yeast and human datasets proved the methodology to be a promising solution for finding seeded motifs. The MOST web tool is freely available at http://telethon.bio.unipd.it/bioinfo/MOST.

14.
Molecular markers are frequently used to study genetic variation among individuals within or between populations. Differences in marker banding patterns can be used to verify whether individuals do or do not represent distinct groups or populations. In 2005 alone, more than 500 studies used molecular markers to group individuals in clusters. Such studies make use of an arbitrary number of molecular markers from each of an arbitrary number of individuals presumed to represent distinct genotypes. However, the greater the genetic variation, the more likely a larger number of individuals and markers will be needed to capture a population's genetic signature. The numbers of both markers and individuals included thus affect the way in which individuals are organized through cluster analyses, thereby affecting the conclusions drawn. Here we present a method that provides statistical criteria to verify that individual and marker sample sizes are sufficient to accurately depict genetic differentiation among different populations. Our method uses a resampling technique to assess the reproducibility of obtaining a particular grouping pattern for specific data sets. It thus allows the robustness of the results obtained to be estimated without including additional individuals or markers.

15.
The technique of Finite Markov Chain Imbedding (FMCI) is a classical approach to complex combinatorial problems related to sequences. In order to get efficient algorithms, it is known that such approaches need to be first rewritten using recursive relations. We propose here general recursive algorithms that compute, in a numerically stable manner, the exact Cumulative Distribution Function (CDF) or complementary CDF (CCDF). These algorithms are then applied in two particular cases: the local score of one sequence and pattern statistics. In both cases, asymptotic developments are derived. For the local score, our new approach makes it possible, for the very first time, to compute exact p-values for a practical study (finding hydrophobic segments in a protein database) where only approximations were available before. In this study, the asymptotic approximations appear to be completely unreliable for 99.5% of the considered sequences. Concerning the pattern statistics, the new FMCI algorithms dramatically outperform the previous ones as they are more reliable, easier to implement, faster and with lower memory requirements.

16.
Statistical modeling of links between genetic profiles with environmental and clinical data to aid in medical diagnosis is a challenge. Here, we present a computational approach for rapidly selecting important clinical data to assist in medical decisions based on personalized genetic profiles. What could take hours or days of computing is available on-the-fly, making this strategy feasible to implement as a routine without demanding great computing power. The key to rapidly obtaining an optimal/nearly optimal mathematical function that can evaluate the "disease stage" by combining information of genetic profiles with personal clinical data is querying a precomputed solution database. The database is previously generated by a new hybrid feature selection method that makes use of support vector machines, recursive feature elimination and random sub-space search. Here, to evaluate the method, data from polymorphisms in the renin-angiotensin-aldosterone system genes together with clinical data were obtained from patients with hypertension and control subjects. The disease "risk" was determined by classifying the patients' data with a support vector machine model based on the optimized features and then measuring the Euclidean distance to the hyperplane decision function. Our results showed the association of renin-angiotensin-aldosterone system gene haplotypes with hypertension. The association of polymorphism patterns with different ethnic groups was also tracked by the feature selection process. A demonstration of this method is also available online on the project's web site.
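
The "distance to the hyperplane" used as a risk measure in the abstract above can be sketched for the simplest case. The snippet assumes a linear decision function f(x) = w·x + b, which the abstract does not confirm; the function name and arguments are hypothetical. For a linear model, the signed Euclidean distance of a point to the hyperplane f(x) = 0 is f(x) / ||w||.

```python
import math


def signed_distance_to_hyperplane(weights, bias, features):
    """Signed Euclidean distance from a point to the hyperplane w.x + b = 0.

    Assumes a linear decision function; a positive value means the point
    lies on the side the classifier labels as positive.
    """
    norm = math.sqrt(sum(w * w for w in weights))
    if norm == 0.0:
        raise ValueError("weight vector must be non-zero")
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return score / norm
```

With `weights=(3, 4)` and `bias=-5`, the point `(3, 4)` gives a score of 20 and a norm of 5, so the distance is 4.0. Larger magnitudes indicate points that lie further from the decision boundary.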

17.
• Premise of the study: Despite its small size, New Caledonia is characterized by a very diverse flora and striking environmental gradients, which make it an ideal setting to study species diversification. Thirteen of the 19 Araucaria species are endemic to the territory and form a monophyletic group, but patterns and processes that lead to such a high species richness are largely unexplored. • Methods: We used 142 polymorphic AFLP markers and performed analyses based on Bayesian clustering algorithms, genetic distances, and cladistics on 71 samples representing all New Caledonian Araucaria species. We examined correlations between the inferred evolutionary relationships and shared morphological, ecological, or geographic parameters among species, to investigate evolutionary processes that may have driven speciation. • Key results: We showed that genetic divergence among the present New Caledonian Araucaria species is low, suggesting recent diversification rather than pre-existence on Gondwana. We identified three genetic groups that included small-leaved, large-leaved, and coastal species, but detected no association with soil preference, ecological habitat, or rainfall. The observed patterns suggested that speciation events resulted from both differential adaptation and vicariance. Last, we hypothesize that speciation is ongoing and/or there are cryptic species in some genetically (sometimes also morphologically) divergent populations. • Conclusions: Further data are required to provide better resolution and understanding of the diversification of New Caledonian Araucaria species. Nevertheless, our study allowed insights into their evolutionary relationships and provides a framework for future investigations on the evolution of this emblematic group of plants in one of the world's biodiversity hotspots.

18.
DNA extracted directly from nodules was used to assess the genetic diversity of Frankia strains symbiotically associated with two species of the genus Casuarina and two of the genus Allocasuarina naturally occurring in northeastern Australia. DNA from field-collected nodules or extracted from reference cultures of Casuarina-infective Frankia strains was used as the template in PCRs with primers targeting two DNA regions, one in the ribosomal operon and the other in the nif operon. PCR products were then analyzed by using a set of restriction endonucleases. Five distinct genetic groups were recognized on the basis of these restriction patterns. These groups were consistently associated with the host species from which the nodules originated. All isolated reference strains had similar patterns and were assigned to group 1 along with six of the eight unisolated Frankia strains from Casuarina equisetifolia in Australia. Group 2 consisted of two unisolated Frankia strains from C. equisetifolia, whereas groups 3 to 5 comprised all unisolated strains from Casuarina cunninghamiana, Allocasuarina torulosa, and Allocasuarina littoralis, respectively. These results demonstrate that, contrary to the results of previous molecular studies of isolated strains, there is genetic diversity among Frankia strains that infect members of the family Casuarinaceae. The apparent high homogeneity of Frankia strains in these previous studies probably relates to the single host species from which the strains were obtained and the origin of these strains from areas outside the natural geographic range of members of the family Casuarinaceae, where genetic diversity could be lower than in Australia.

19.
Unisexual vertebrates typically form through hybridization events between sexual species in which reproductive mode transitions occur in the hybrid offspring. This evolutionary history is thought to have important consequences for the ecology of unisexual lineages and their interactions with congeners in natural communities. However, these consequences have proven challenging to study owing to uncertainty about patterns of population genetic diversity in unisexual lineages. Of particular interest is resolving the contribution of historical hybridization events versus post formational mutation to patterns of genetic diversity in nature. Here we use restriction site associated DNA genotyping to evaluate genetic diversity and demographic history in Aspidoscelis laredoensis, a diploid unisexual lizard species from the vicinity of the Rio Grande River in southern Texas and northern Mexico. The sexual progenitor species from which one or more lineages are derived also occur in the Rio Grande Valley region, although patterns of distribution across individual sites are quite variable. Results from population genetic and phylogenetic analyses resolved the major axes of genetic variation in this species and highlight how these match predictions based on historical patterns of hybridization. We also found discordance between results of demographic modelling using different statistical approaches with the genomic data. We discuss these insights within the context of the ecological and evolutionary mechanisms that generate and maintain lineage diversity in unisexual species. As one of the most dynamic, intriguing, and geographically well investigated groups of whiptail lizards, these species hold substantial promise for future studies on the constraints of diversification in unisexual vertebrates.

20.
Haplotypes contain genealogical information and play a prominent part in population genetic and evolutionary studies. However, haplotype inference is a complex statistical problem, showing considerable internal algorithm variability and among-algorithm discordance. Thus, haplotypes inferred by statistical algorithms often contain hidden uncertainties, which may complicate and even mislead downstream analysis. Consensus strategy is one of the effective means to increase the confidence of inferred haplotypes. Here, we present a consensus tool, the CVhaplot package, to automate consensus techniques for haplotype inference. It generates consensus haplotypes from the inferences of competing algorithms to increase the confidence of haplotype inference results, while improving the performance of individual algorithms by considering their internal variability. It can effectively identify uncertain haplotypes potentially associated with inference errors. In addition, this tool allows file format conversion for several popular algorithms and extends the applicability of some algorithms to complex data containing triallelic polymorphic sites. CVhaplot is written in Perl and freely available at http://www.ioz.ac.cn/department/agripest/group/zhangdx/CVhaplot.htm.
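
The consensus idea described above can be illustrated with a simple majority vote. This sketch is not CVhaplot's actual procedure, which the abstract does not detail; the data layout (one dictionary per algorithm run, mapping an individual's ID to its inferred haplotype pair), the function name and the agreement threshold are all assumptions. An individual whose haplotype pair is not supported by a sufficient share of runs is flagged as uncertain instead of being given a consensus call.

```python
from collections import Counter


def consensus_haplotypes(inferences, min_agreement=0.5):
    """Majority-vote consensus over several haplotype inference runs.

    inferences: list of dicts {individual_id: (haplotype_1, haplotype_2)},
    one dict per algorithm or run (hypothetical layout).
    Returns (consensus, uncertain): consensus maps individuals to the pair
    supported by more than `min_agreement` of the runs; uncertain lists
    the individuals without such a pair.
    """
    consensus = {}
    uncertain = []
    for individual in inferences[0]:
        # Sort each pair so that (A, B) and (B, A) count as the same call.
        votes = Counter(tuple(sorted(run[individual])) for run in inferences)
        pair, count = votes.most_common(1)[0]
        if count / len(inferences) > min_agreement:
            consensus[individual] = pair
        else:
            uncertain.append(individual)
    return consensus, uncertain
```

The same function can be applied to repeated runs of a single algorithm to probe its internal variability, or to the outputs of different algorithms to probe among-algorithm discordance.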
