Similar Documents
20 similar documents found
5.
Errors-in-variables models in high-dimensional settings pose two challenges in application. First, the number of observed covariates is larger than the sample size, while only a small number of covariates are true predictors under an assumption of model sparsity. Second, the presence of measurement error can result in severely biased parameter estimates and also affects the ability of penalized methods such as the lasso to recover the true sparsity pattern. A new estimation procedure called SIMulation-SELection-EXtrapolation (SIMSELEX) is proposed. This procedure makes double use of lasso methodology: the lasso is first used to estimate sparse solutions in the simulation step, after which a group lasso is implemented to perform variable selection. The SIMSELEX estimator is shown to perform well in variable selection and has significantly lower estimation error than naive estimators that ignore measurement error. SIMSELEX can be applied in a variety of errors-in-variables settings, including linear models, generalized linear models, and Cox survival models. The Supporting Information furthermore shows how SIMSELEX can be applied to spline-based regression models. A simulation study compares the SIMSELEX estimator to existing methods in the linear and logistic model settings and evaluates its performance against naive methods in the Cox and spline models. Finally, the method is used to analyze a microarray dataset that contains gene expression measurements of favorable histology Wilms tumors.
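The SIMEX mechanics behind the simulation and extrapolation steps can be sketched numerically. The fragment below is a minimal illustration, not the SIMSELEX procedure itself: it uses ordinary least squares on a single synthetic error-prone covariate in place of the lasso and group-lasso machinery, with a quadratic extrapolant; all data and parameter values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic errors-in-variables data: true slope beta = 2.0, covariate
# observed with additive measurement error of known sd sigma_u.
n, beta, sigma_u = 2000, 2.0, 0.7
x_true = rng.normal(size=n)
y = beta * x_true + rng.normal(scale=0.5, size=n)
w = x_true + rng.normal(scale=sigma_u, size=n)   # observed, error-prone covariate

def ols_slope(x, y):
    return np.cov(x, y, bias=True)[0, 1] / np.var(x)

# Simulation step: add *extra* measurement error at levels lambda and refit.
lambdas = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
B = 100  # simulated data sets per lambda level
slopes = []
for lam in lambdas:
    est = [ols_slope(w + rng.normal(scale=np.sqrt(lam) * sigma_u, size=n), y)
           for _ in range(B)]
    slopes.append(np.mean(est))

# Extrapolation step: fit a quadratic in lambda and evaluate it at
# lambda = -1, the hypothetical error-free setting.
coef = np.polyfit(lambdas, slopes, deg=2)
simex_slope = np.polyval(coef, -1.0)
naive_slope = ols_slope(w, y)   # attenuated towards zero
```

In SIMSELEX, the estimate at each lambda would instead be a lasso fit, and a group lasso across the lambda grid performs variable selection before the extrapolation step.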

6.
Redundancy Analysis (RDA) is a well-known method used to describe the directional relationship between related data sets. Recently, we proposed sparse Redundancy Analysis (sRDA) for high-dimensional genomic data analysis to find explanatory variables that explain the most variance of the response variables. As more and more biomolecular data become available from different biological levels, such as genotypic and phenotypic data from different omics domains, a natural research direction is to apply an integrated analysis approach in order to explore the underlying biological mechanism of certain phenotypes of the given organism. We show that the multiset sparse Redundancy Analysis (multi-sRDA) framework is a prominent candidate for high-dimensional omics data analysis, since it accounts for the directional information transfer between omics sets and, through its sparse solutions, improves the interpretability of the results. In this paper, we also describe a software implementation of multi-sRDA, based on the Partial Least Squares Path Modeling algorithm. We test our method through simulation and real omics data analysis with data sets of 364,134 methylation markers, 18,424 gene expression markers, and 47 cytokine markers measured on 37 patients with Marfan syndrome.
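To fix ideas, plain (non-sparse) redundancy analysis amounts to a principal component analysis of the fitted values from a multivariate regression of the response block on the explanatory block. The sketch below illustrates only that core step on hypothetical data; the sparsity penalties of sRDA and multi-sRDA are omitted.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical example: X (n x p explanatory block) partly drives Y (n x q responses).
n, p, q = 200, 5, 8
X = rng.normal(size=(n, p))
B = rng.normal(size=(p, q))
Y = X @ B + rng.normal(scale=2.0, size=(n, q))

# Centre both blocks, regress Y on X, and take the SVD of the fitted values:
# the singular vectors give the RDA site and response scores.
Xc = X - X.mean(axis=0)
Yc = Y - Y.mean(axis=0)
coefs, *_ = np.linalg.lstsq(Xc, Yc, rcond=None)
Y_hat = Xc @ coefs
U, s, Vt = np.linalg.svd(Y_hat, full_matrices=False)

# Redundancy index: fraction of total variance in Y explained by X.
redundancy = (s ** 2).sum() / (Yc ** 2).sum()
axis_share = (s ** 2) / (s ** 2).sum()   # variance share of each RDA axis
```

sRDA replaces the dense regression weights with penalized, sparse ones so that only a few explanatory variables load on each axis.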

10.
Summary Second-generation sequencing (sec-gen) technology can sequence millions of short fragments of DNA in parallel, making it capable of assembling complex genomes for a small fraction of the price and time of previous technologies. In fact, a recently formed international consortium, the 1000 Genomes Project, plans to fully sequence the genomes of approximately 1200 people. The prospect of comparative analysis at the sequence level of a large number of samples across multiple populations may be achieved within the next five years. These data present unprecedented challenges in statistical analysis. For instance, analysis operates on millions of short nucleotide sequences, or reads (strings of A, C, G, or T between 30 and 100 characters long), which are the result of complex processing of noisy continuous fluorescence intensity measurements known as base-calling. The complexity of the base-calling discretization process results in reads of widely varying quality within and across sequence samples. This variation in processing quality results in infrequent but systematic errors that we have found to mislead downstream analysis of the discretized sequence read data. For instance, a central goal of the 1000 Genomes Project is to quantify across-sample variation at the single nucleotide level. At this resolution, small error rates in sequencing prove significant, especially for rare variants. Sec-gen sequencing is a relatively new technology for which potential biases and sources of obscuring variation are not yet fully understood. Therefore, modeling and quantifying the uncertainty inherent in the generation of sequence reads is of utmost importance. In this article, we present a simple model to capture uncertainty arising in the base-calling procedure of the Illumina/Solexa GA platform. Model parameters have a straightforward interpretation in terms of the chemistry of base-calling, allowing for informative and easily interpretable metrics that capture the variability in sequencing quality. Our model provides informative estimates readily usable in quality assessment tools while significantly improving base-calling performance.

11.
An input-output-based life cycle inventory (IO-based LCI) is grounded in economic environmental input-output analysis (IO analysis). It is a fast and low-budget method for generating LCI data sets, and is used to close data gaps in life cycle assessment (LCA). Because its methodological basis differs from that of process-based inventory, its application in LCA is a matter of controversy. We developed a German IO-based approach for deriving IO-based LCI data sets that builds on the German IO accounts and the German environmental accounts, which provide data on the sector-specific direct emissions of seven airborne compounds. The method for calculating German IO-based LCI data sets for building products is explained in detail. The appropriateness of employing IO-based LCI for German buildings is analyzed by using process-based LCI data from the Swiss Ecoinvent database to validate the calculated IO-based LCI data. The extent of the deviations between process-based LCI and IO-based LCI varies considerably across the airborne emissions we investigated. We carried out a systematic evaluation of the possible reasons for these deviations. This analysis shows that sector-specific effects (aggregation of sectors) and the quality of the primary emission data from national inventory reporting (NIR) are the main reasons for the deviations. As a rule, IO-based LCI data sets seem to underestimate specific emissions while overestimating sector-specific aspects.
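The core IO-based LCI calculation combines the Leontief inverse of the economy's technology matrix with sector-specific direct emission intensities. A toy two-sector sketch follows; the coefficients are hypothetical and not taken from the German IO accounts.

```python
import numpy as np

# Toy two-sector economy (hypothetical numbers). A[i, j] is the input from
# sector i needed per unit of output of sector j (the technology matrix);
# f[i] is the direct emission intensity of sector i (kg emission per unit output).
A = np.array([[0.1, 0.2],
              [0.3, 0.1]])
f = np.array([2.0, 1.0])

# Final demand: one unit of the first sector's product (e.g. a building material).
y = np.array([1.0, 0.0])

# The Leontief inverse gives the total (direct + indirect) output required
# throughout the economy to satisfy the final demand.
L = np.linalg.inv(np.eye(2) - A)
x_total = L @ y                  # total sectoral output
inventory = f @ x_total          # IO-based life cycle inventory of the emission
```

For this toy economy, one unit of demand in sector 1 requires 1.2 units of its own output and 0.4 units from sector 2, giving a cradle-to-gate inventory of 2.8 kg of the emission.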

12.

Aim

We analysed beta‐diversity patterns of various biological groups simultaneously, from the perspective of site ecological uniqueness. We also investigated whether ecological uniqueness variation could be explained by variations in environmental conditions and spatial variables.

Data

Central Amazonia.

Methods

We estimated the total beta diversity and ecological uniqueness for 14 biological groups, including plants and animals, sampled at the same sites on a mesoscale in central Amazonia, Brazil. The uniqueness values for all biological groups were combined in a single matrix (multi‐taxa matrix of site uniqueness), which was then used as a response variable matrix in a partial redundancy analysis. We also investigated differences in the uniqueness patterns between plant and animal groups.

Results

In general, plants showed higher total beta diversity than animals. For plants, uniqueness was explained mainly by environmental conditions, while for animals, uniqueness was also related to spatial variables. Although variation in uniqueness was mainly related to soil clay content, it is difficult to determine a single major environmental variable underlying the variation in uniqueness because the topographical gradient influences many of them, including soil clay content.

Main Conclusion

The uniqueness values were higher in low‐lying areas, indicating that near‐stream sites were more ecologically unique. Despite the lower number of species in the lowlands, their unique biota contributed strongly to the maintenance of the total beta diversity of the area. This finding should be considered in conservation plans that aim to represent and preserve the regional biota. Our approach proved to be useful to analyse and compare the ecological uniqueness of multiple taxa.
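Ecological uniqueness of the kind analysed above is commonly quantified as local contributions to beta diversity (LCBD), where total beta diversity is the variance of the (transformed) community matrix and each site's uniqueness is its share of that variance. A minimal numpy sketch on hypothetical abundance data, one way to build each row of the multi-taxa uniqueness matrix:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical site-by-species abundance matrix (rows = sites, cols = species),
# Hellinger-transformed so that Euclidean distances are ecologically meaningful.
Y = rng.poisson(lam=3.0, size=(10, 25)).astype(float)
row_sums = Y.sum(axis=1, keepdims=True)
H = np.sqrt(Y / row_sums)          # Hellinger transformation

# Total beta diversity (BD_total) as the total variance of the community
# matrix, and ecological uniqueness (LCBD) as each site's share of that
# variance (Legendre & De Caceres 2013).
S = H - H.mean(axis=0)             # centred matrix of deviations
SS_total = (S ** 2).sum()
BD_total = SS_total / (H.shape[0] - 1)
LCBD = (S ** 2).sum(axis=1) / SS_total   # one uniqueness value per site
```

Computing such a vector for each of the 14 biological groups and stacking them gives the multi-taxa matrix of site uniqueness used as the response in the partial redundancy analysis.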

13.
Abstract. Variation partitioning by (partial) constrained ordination is a popular method for exploratory data analysis, but applications are mostly restricted to simple ecological questions involving only two or three sets of explanatory variables, such as climate and soil, because the complexity of the calculations and results increases rapidly with the number of explanatory variable sets. We demonstrate the existence of a unique algorithm for partitioning the variation in a set of response variables on n sets of explanatory variables, and show how the 2^n − 1 non-overlapping components of variation can be calculated. Methods for evaluating and presenting variation partitioning results are reviewed, and a recursive algorithm is proposed for distributing the many small components of variation over simpler components. Several issues related to the use and usefulness of variation partitioning with n sets of explanatory variables are discussed with reference to a worked example.
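One way to obtain the 2^n − 1 components is by Möbius inversion over subsets of explanatory variable sets, using differences of R² values from the subset regressions. The sketch below does this for n = 3 hypothetical sets with a single response; it reproduces the familiar two-set fractions as a special case but is an illustration, not the paper's recursive algorithm.

```python
from itertools import combinations
import numpy as np

rng = np.random.default_rng(3)

def r2(y, X):
    """R-squared of an OLS regression of y on X (with intercept)."""
    Z = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ beta
    return 1.0 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))

# Three hypothetical explanatory variable sets (e.g. climate, soil, space).
n = 300
sets = [rng.normal(size=(n, 2)) for _ in range(3)]
y = sets[0][:, 0] + 0.5 * sets[1][:, 0] + rng.normal(size=n)

names = range(len(sets))

def r2_union(idx):
    if not idx:
        return 0.0
    return r2(y, np.column_stack([sets[i] for i in idx]))

full = r2_union(tuple(names))

# D(S) = R2(all sets) - R2(all sets except S): the variation attributable
# only to sets inside S.  Moebius inversion of D over subsets yields the
# 2^n - 1 non-overlapping components.
def D(S):
    return full - r2_union(tuple(i for i in names if i not in S))

components = {}
for k in range(1, len(sets) + 1):
    for S in combinations(names, k):
        components[S] = sum(
            (-1) ** (len(S) - len(U)) * D(U)
            for r in range(len(S) + 1) for U in combinations(S, r)
        )
```

By construction the non-overlapping components sum exactly to the R² of the full model, mirroring the identity [a] + [b] + [c] = R²(X1 ∪ X2) in the two-set case; shared components can legitimately be negative.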

15.
In this work we propose the use of functional data analysis (FDA) to deal with a very large dataset of atmospheric aerosol size distributions resolved in both space and time. The data come from a mobile measurement platform in the town of Perugia (Central Italy). An optical particle counter (OPC) is integrated on a cabin of the Minimetrò, an urban transportation system that moves along a monorail on a line transect of the town. The OPC takes a sample of air every six seconds, counts the number of urban aerosol particles with diameters between 0.28 μm and 10 μm, and classifies these particles into 21 size bins according to their diameter. Here, we adopt a 2D functional data representation for each of the 21 spatiotemporal series; space is in fact one-dimensional, since it is measured as the distance along the monorail from the Minimetrò's base station. FDA allows for a reduction of the dimensionality of each dataset and accounts for the high space-time resolution of the data. Functional cluster analysis is then performed to search for similarities among the 21 size channels in terms of their spatiotemporal patterns. The results provide a good classification of the 21 size bins into a relatively small number of groups (between three and four) according to the season of the year. Groups including coarser particles have more similar patterns, while those including finer particles behave more distinctly across the periods of the year. Such features are consistent with the physics of atmospheric aerosols, and the highlighted patterns provide very useful ground for prospective model-based studies.
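The workflow of basis expansion followed by clustering of the coefficient vectors can be sketched as follows. Everything here is synthetic and simplified: a polynomial basis stands in for the spline or Fourier bases usually used in FDA, two artificial "coarse" and "fine" channel groups stand in for the 21 size bins, and a tiny two-group k-means stands in for a full functional clustering method.

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical stand-in for the size-bin series: each "channel" is a noisy
# curve sampled on a common grid, one flattened vector per channel.
t = np.linspace(0, 1, 200)
coarse = [np.sin(2 * np.pi * t) + 0.1 * rng.normal(size=t.size) for _ in range(10)]
fine = [np.cos(2 * np.pi * t) + 2 + 0.1 * rng.normal(size=t.size) for _ in range(11)]
curves = np.array(coarse + fine)

# FDA-style dimension reduction: project each curve on a small polynomial
# basis, so each channel is summarised by a short coefficient vector.
basis = np.vander(t, N=5, increasing=True)      # 1, t, t^2, t^3, t^4
coefs, *_ = np.linalg.lstsq(basis, curves.T, rcond=None)
features = coefs.T                              # one coefficient vector per channel

def kmeans2(X, iters=20):
    """Tiny 2-means (Lloyd's algorithm) with farthest-point initialisation."""
    c0 = X[0]
    c1 = X[np.argmax(((X - c0) ** 2).sum(-1))]
    centers = np.stack([c0, c1])
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        centers = np.stack([X[labels == j].mean(axis=0) for j in (0, 1)])
    return labels

labels = kmeans2(features)
```

With well-separated groups, the clustering recovers the two spatiotemporal patterns from the coefficient vectors alone, which is the essential mechanism behind grouping the size channels.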

16.
High-throughput sequencing methods have become a routine analysis tool in the environmental sciences as well as in the public and private sectors. These methods produce vast amounts of data, which must be analysed in several steps. Although the bioinformatics may be carried out using several public tools, many analytical pipelines offer too few options for the optimal analysis of more complicated or customized designs. Here, we introduce PipeCraft, a flexible and handy bioinformatics pipeline with a user-friendly graphical interface that links several public tools for analysing amplicon sequencing data. Users can customize the pipeline by selecting the most suitable tools and options to process raw sequences from the Illumina, Pacific Biosciences, Ion Torrent and Roche 454 sequencing platforms. We describe the design and options of PipeCraft and evaluate its performance by analysing data sets from three different sequencing platforms, demonstrating that PipeCraft can process large data sets within 24 hr. The graphical user interface and the automated links between the various bioinformatics tools enable easy customization of the workflow. All analytical steps and options are recorded in log files and are easily traceable.

17.
The MobilEe study was the first cross-sectional population-based study to investigate possible health effects of mobile communication networks on children using personal dosimetry. Exposure was assessed every second, resulting in 86,400 measurements over 24 h for each participant; a functional approach to analysing the exposure data was therefore considered appropriate. The aim was to categorize exposure taking into account the course of the measurements over 24 h. The analyses were based on the 480 maxima of the 3-min time intervals. Exposure was classified using a nonparametric functional method: heterogeneity of a sample of functional data was assessed by comparing the functional mode and mean of the distribution of a functional variable, and the partition was built within a descending hierarchical method. The resulting exposure groups were compared with categories derived from a standard method, which used the average exposure over 24 h and set the cut-off at the 90th percentile. The functional classification split the exposure data into two groups. Plots of the mean curves showed that the groups could be interpreted as children with "low exposure" (88%) and "higher exposure" (12%), and these groups were comparable with the categories of the standard method. No association between the categorized exposure and well-being was observed in logistic regression models. The functional classification approach yielded a plausible partition of the exposure data. The comparability with the standard approach might be due to the data structure and should not be generalized to other exposures. Bioelectromagnetics 30:261–269, 2009. © 2009 Wiley-Liss, Inc.

18.
This study was performed in order to evaluate a new LED-based 2D-fluorescence spectrometer for in-line bioprocess monitoring of Chinese hamster ovary (CHO) cell culture processes. The new spectrometer used selected excitation wavelengths of 280, 365, and 455 nm to collect spectral data from six 10-L fed-batch processes. The technique provides data on various fluorescent compounds from the cultivation medium as well as from cell metabolism; in addition, scattered light offers information about the cultivation status. Multivariate data analysis tools were applied to analyze the large data sets of collected fluorescence spectra. First, principal component analysis was used to obtain an overview of the spectral data from all six CHO cultivations. Partial least squares regression models were then developed to correlate the 2D-fluorescence spectral data with selected critical process variables, using offline reference values; a separate, independent fed-batch process was used for model validation and prediction. Almost continuous in-line bioprocess monitoring was realized because 2D-fluorescence spectra were collected every 10 min throughout the cultivation. The new 2D-fluorescence device demonstrated significant potential for accurate prediction of the total cell count, viable cell count, and cell viability. The results strongly indicated that the technique is particularly capable of distinguishing between different cell statuses inside the bioreactor. In addition, the spectral data provided information about the lactate metabolism shift and cellular respiration during the cultivation process. Overall, the 2D-fluorescence device is a highly sensitive tool for process analytical technology applications in mammalian cell cultures.
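The partial least squares regression step can be sketched with a plain NIPALS implementation for a single response. The data below are synthetic stand-ins for spectra and a process variable; this illustrates the algorithm itself, not the study's calibrated models.

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical spectra: 120 "time points" x 60 "wavelength channels", with a
# process variable (e.g. viable cell count) linearly encoded in the spectra.
n, m = 120, 60
X = rng.normal(size=(n, m))
true_w = rng.normal(size=m)
y = X @ true_w + 0.1 * rng.normal(size=n)

def pls1_fit(X, y, n_components):
    """PLS1 regression via the NIPALS algorithm (single response)."""
    Xc, yc = X - X.mean(axis=0), y - y.mean()
    W, P, Q = [], [], []
    Xk, yk = Xc.copy(), yc.copy()
    for _ in range(n_components):
        w = Xk.T @ yk                 # weight: covariance direction with y
        w /= np.linalg.norm(w)
        t = Xk @ w                    # score vector
        tt = t @ t
        p = Xk.T @ t / tt             # X loading
        q = yk @ t / tt               # y loading
        Xk -= np.outer(t, p)          # deflate the spectra
        yk -= q * t                   # deflate the response
        W.append(w); P.append(p); Q.append(q)
    W, P, Q = np.array(W).T, np.array(P).T, np.array(Q)
    beta = W @ np.linalg.solve(P.T @ W, Q)   # regression vector in X space
    return beta, X.mean(axis=0), y.mean()

beta, x_mean, y_mean = pls1_fit(X, y, n_components=15)
y_hat = (X - x_mean) @ beta + y_mean
r2 = 1 - ((y - y_hat) ** 2).sum() / ((y - y.mean()) ** 2).sum()
```

PLS is preferred over ordinary regression here because spectral channels are many and highly collinear; the latent components compress the spectra into a few directions that are maximally covariant with the process variable.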
