Similar Documents
20 similar documents found.
1.
Large quantities of data have been generated from multiple sources at exponential rates in the last few years. These data are generated at high velocity, as real-time and streaming data, in a variety of formats. These characteristics give rise to challenges in modeling, computation, and processing. Hadoop MapReduce (MR) is a well-known data-intensive distributed processing framework that uses a distributed file system (DFS) for Big Data. Current implementations of MR support execution of only a single algorithm in the entire Hadoop cluster. In this paper, we propose MapReducePack (MRPack), a variation of MR that supports execution of a set of related algorithms in a single MR job. We exploit the computational capability of a cluster by increasing the compute-intensiveness of MapReduce while maintaining its data-intensive approach. MRPack uses the available computing resources by dynamically managing task assignment and intermediate data; intermediate data from multiple algorithms are managed using multi-key and skew-mitigation strategies. A performance study of the proposed system shows that it is time, I/O, and memory efficient compared to default MapReduce, reducing execution time by a factor of about three (a 200% improvement) with an approximately 50% decrease in I/O cost. Complexity and qualitative results analysis show significant performance improvement.
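The multi-key idea can be illustrated with a minimal sketch (our illustration, not MRPack's actual code; all names are hypothetical): each mapper tags intermediate records with an algorithm identifier, so several algorithms share one shuffle, and the reducer demultiplexes on that tag.

    from collections import defaultdict

    # Two unrelated "algorithms" packed into one job: word count and
    # per-initial maximum word length. Illustrative only.
    def map_word_count(record):
        for w in record.split():
            yield ("wc", w), 1            # composite key: (algorithm_id, key)

    def map_max_length(record):
        for w in record.split():
            yield ("maxlen", w[0]), len(w)

    def reduce_multi(key, values):
        algo, _ = key
        if algo == "wc":
            return sum(values)            # word-count semantics
        return max(values)                # max-length semantics

    def run_packed_job(records, mappers, reducer):
        groups = defaultdict(list)        # shuffle: group by composite key
        for rec in records:
            for mapper in mappers:
                for k, v in mapper(rec):
                    groups[k].append(v)
        return {k: reducer(k, vs) for k, vs in groups.items()}

    if __name__ == "__main__":
        data = ["big data on hadoop", "hadoop runs mapreduce"]
        out = run_packed_job(data, [map_word_count, map_max_length], reduce_multi)
        print(out[("wc", "hadoop")])      # -> 2

One pass over the input feeds both algorithms, which is the source of the I/O savings the abstract reports.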

2.

Background

High-throughput molecular profiling data have been used to improve clinical decision making by stratifying subjects based on their molecular profiles. Unsupervised clustering algorithms can be used for stratification, but their current speed cannot meet the demands of large-scale molecular data because the correlation-matrix calculation performs poorly. With high-throughput sequencing technologies promising to produce even larger datasets per subject, we expect the performance of state-of-the-art statistical algorithms to be further impacted unless optimisation efforts are carried out. MapReduce, a widely used high-performance parallel framework, can address this problem.

Results

In this paper, we evaluate the current parallel modes for correlation calculation methods and introduce an efficient data distribution and parallel calculation algorithm based on MapReduce to optimise the correlation calculation. We studied the performance of our algorithm using two gene expression benchmarks. In the micro-benchmark, our MapReduce implementation, based on the R package RHIPE, demonstrates a 3.26-5.83-fold speedup over the default Snowfall and a 1.56-1.64-fold speedup over basic RHIPE for the Euclidean, Pearson, and Spearman correlations. Although vanilla R and the optimised Snowfall outperform our optimised RHIPE in the micro-benchmark, they do not scale well to the macro-benchmark, where the optimised RHIPE runs 2.03-16.56 times faster than vanilla R. Benefiting from 3.30-5.13 times faster data preparation, the optimised RHIPE also runs 1.22-1.71 times faster than the optimised Snowfall. Both the optimised RHIPE and the optimised Snowfall complete the Kendall correlation on the TCGA dataset within 7 hours, more than 30 times faster than the estimated vanilla R runtime.

Conclusions

The performance evaluation found that the new MapReduce algorithm and its implementation in RHIPE outperform vanilla R and the conventional parallel algorithms implemented in R Snowfall. We propose that the MapReduce framework holds great promise for large molecular data analysis, in particular for high-dimensional genomic data such as that used in the performance evaluation described in this paper. We aim to use this new algorithm as a basis for optimising high-throughput molecular data correlation calculation for Big Data.
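The core idea — distributing blocks of the pairwise correlation matrix across workers — can be sketched in Python as a stand-in for the RHIPE/R implementation (block size, worker count, and matrix dimensions below are illustrative):

    import numpy as np
    from itertools import combinations_with_replacement
    from multiprocessing import Pool

    def pearson_block(args):
        # Compute one rectangular block of the gene-gene correlation matrix.
        X, rows, cols = args
        block = np.corrcoef(X[rows], X[cols])[:len(rows), len(rows):]
        return rows, cols, block

    def parallel_correlation(X, block=100, workers=4):
        n = X.shape[0]
        idx = [range(i, min(i + block, n)) for i in range(0, n, block)]
        tasks = [(X, list(r), list(c))
                 for r, c in combinations_with_replacement(idx, 2)]
        C = np.empty((n, n))
        with Pool(workers) as pool:       # "map" the blocks, "reduce" by placing
            for rows, cols, blk in pool.map(pearson_block, tasks):
                C[np.ix_(rows, cols)] = blk
                C[np.ix_(cols, rows)] = blk.T   # symmetry: fill the mirror block
        return C

    if __name__ == "__main__":
        X = np.random.rand(300, 50)       # 300 genes x 50 samples
        C = parallel_correlation(X)
        assert np.allclose(C, np.corrcoef(X))

Only the upper-triangular blocks are computed, so the work (and intermediate data) is roughly halved relative to a naive all-pairs distribution.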

3.
Biophysical Journal, 2021, 120(21): 4798-4808
After translation, nascent proteins must escape the ribosomal exit tunnel to attain complete folding to their native states. This escape process also frees the ribosome tunnel for a new translation job. In this study, we investigate the impact of energetic interactions between the ribosomal exit tunnel and nascent proteins on the escape process, using molecular dynamics simulations with partially coarse-grained models that incorporate hydrophobic and electrostatic interactions of the Haloarcula marismortui ribosome tunnel with nascent proteins. We find that, in general, attractive interactions slow down the escape process, whereas repulsive interactions speed it up. For the small globular proteins considered, the median escape time correlates with both the number of hydrophobic residues, Nh, and the net charge, Q, of a nascent protein. A correlation coefficient exceeding 0.96 is found between the median escape time and the combined quantity Nh + 5.9Q, suggesting that modulating the total charge is about six times more effective at changing the escape time than modulating the number of hydrophobic residues. The estimated median escape times fall in the submillisecond-to-millisecond range, indicating that escape does not delay ribosome recycling. Across tunnel-model variants, with and without hydrophobic and electrostatic interactions, the escape-time distribution always follows a simple diffusion model that describes escape as the downhill drift of a Brownian particle, suggesting that nascent proteins escape along barrier-less pathways at the ribosome tunnel.
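The simple diffusion model invoked here is the standard first-passage problem for a drifting Brownian particle; under that model the escape-time density is the inverse Gaussian distribution (a textbook result stated for context, with L the effective tunnel length, v the drift velocity, and D the diffusion coefficient):

    p(t) = \frac{L}{\sqrt{4\pi D t^{3}}}
           \exp\!\left(-\frac{(L - v t)^{2}}{4 D t}\right),
    \qquad \langle t \rangle = \frac{L}{v}.

Fitting this two-parameter density to the simulated escape-time histograms is what licenses the "barrier-less downhill drift" interpretation: a barrier-crossing process would show a different, more strongly peaked shape.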

4.
MapReduce offers an easy-to-use programming paradigm for processing large data sets, making it an attractive model for opportunistic compute resources. However, unlike the dedicated resources where MapReduce has mostly been deployed, opportunistic resources have significantly higher rates of node volatility; as a consequence, the data and task replication scheme adopted by existing MapReduce implementations is woefully inadequate on such volatile resources. In this paper, we propose MOON, short for MapReduce On Opportunistic eNvironments, which is designed to offer reliable MapReduce service for opportunistic computing. MOON adopts a hybrid resource architecture that supplements opportunistic compute resources with a small set of dedicated resources, and it extends Hadoop, an open-source implementation of MapReduce, with adaptive task and data scheduling algorithms that take advantage of this architecture. Our results on an emulated opportunistic computing system running atop a 60-node cluster demonstrate that MOON delivers significant performance improvements over Hadoop on volatile compute resources and even finishes jobs that cannot complete under Hadoop.

5.
MapReduce has become a popular framework for Big Data applications. While MapReduce has received much praise for its scalability and efficiency, its power consumption has not been thoroughly evaluated. Our goal in this paper is to explore the possibility of power-efficient scheduling without the need for expensive power monitors on every node. We begin from the observation that no cluster is truly homogeneous with respect to energy consumption. From there we develop a MapReduce framework that evaluates the current status of each node and dynamically reacts to estimated power usage, shifting work toward more energy-efficient nodes that are currently consuming less power. Our work shows that, given an ideal framework configuration, certain nodes may consume only 62.3% of the dynamic power they consumed under a traditional MapReduce configuration.
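A toy version of the idea — rank nodes by estimated marginal power per additional task, with model coefficients obtained from one-time calibration rather than runtime meters — might look like this (the linear power model and all coefficients are invented for illustration):

    # Hypothetical linear model: watts = idle + delta * utilization.
    NODE_MODEL = {
        "node-a": (60.0, 95.0),     # (idle watts, full-load delta)
        "node-b": (55.0, 70.0),     # more efficient under load
        "node-c": (65.0, 120.0),
    }

    def estimated_power(node, utilization):
        idle, delta = NODE_MODEL[node]
        return idle + delta * utilization

    def pick_node(utilizations):
        # Choose the node whose *marginal* cost of one more task is lowest.
        def marginal(node):
            u = utilizations[node]
            return (estimated_power(node, min(u + 0.1, 1.0))
                    - estimated_power(node, u))
        return min(NODE_MODEL, key=marginal)

    if __name__ == "__main__":
        load = {"node-a": 0.5, "node-b": 0.5, "node-c": 0.5}
        print(pick_node(load))      # -> node-b (smallest per-task power delta)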

6.
We investigate proactive dynamic load balancing on multicore systems, in which threads are continually migrated to reduce the impact of processor/thread mismatches. Our goal is to enhance the flexibility of the SPMD-style programming model and enable SPMD applications to run efficiently in multiprogrammed environments. We present Juggle, a practical decentralized, user-space implementation of a proactive load balancer that emphasizes portability and usability. In this paper we assume perfect intrinsic load balance and focus on extrinsic imbalances caused by OS noise, multiprogramming, and mismatches of threads to hardware parallelism. Juggle shows performance improvements of up to 80% over static load balancing for oversubscribed UPC, OpenMP, and pthreads benchmarks. We also show that Juggle is effective in unpredictable, multiprogrammed environments, with up to a 50% performance improvement over the Linux load balancer and a 25% reduction in performance variation. We analyze the impact of Juggle on parallel applications and derive lower bounds and approximations for thread completion times. We show that results from Juggle closely match theoretical predictions across a variety of architectures, including NUMA and hyper-threaded systems.

7.

Background

Ribonucleic acid (RNA) molecules play important roles in many biological processes, including gene expression and regulation. Their secondary structures are crucial to RNA functionality, and prediction of secondary structures is widely studied. Our previous research shows that cutting long sequences into shorter chunks, predicting the secondary structures of the chunks independently using thermodynamic methods, and reconstructing the entire secondary structure from the predicted chunk structures can yield better accuracy than predicting the secondary structure using the RNA sequence as a whole. The chunking, prediction, and reconstruction processes can use different methods and parameters, some of which produce more accurate predictions than others. In this paper, we study the prediction accuracy and efficiency of three different chunking methods using seven popular secondary structure prediction programs, applied to two datasets of RNA with known secondary structures — including both pseudoknotted and non-pseudoknotted sequences — as well as a family of viral genome RNAs whose structures have not been predicted before. Our modularized MapReduce framework based on Hadoop allows us to study the problem in a parallel and robust environment.

Results

On average, the maximum accuracy-retention values are greater than one for our chunking methods with all seven prediction programs over 50 non-pseudoknotted sequences, meaning that the secondary structure predicted using chunking is more similar to the real structure than the secondary structure predicted from the whole sequence. We observe similar results for the 23 pseudoknotted sequences, except for the NUPACK program with the centered chunking method. The performance analysis of 14 long RNA sequences from the Nodaviridae virus family shows that coarse-grained mapping of chunking and prediction in the MapReduce framework yields shorter turnaround times for short RNA sequences; as sequence length increases, however, fine-grained mapping can surpass coarse-grained mapping in performance.

Conclusions

By using our MapReduce framework together with statistical analysis of the accuracy-retention results, we show how inversion-based chunking methods can outperform predictions that use the whole sequence. Our chunk-based approach also enables us to predict secondary structures for very long RNA sequences, which is not feasible with traditional methods alone.
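A minimal sketch of the chunk-predict-reconstruct pipeline (the predictor below is a placeholder; real runs would invoke programs such as NUPACK from the Hadoop mappers, and the chunk size and overlap are illustrative):

    def chunk_sequence(seq, size, overlap):
        # Fixed-size windows with symmetric overlap between neighbours.
        step = size - overlap
        return [(i, seq[i:i + size])
                for i in range(0, max(len(seq) - overlap, 1), step)]

    def predict_chunk(chunk):
        # Placeholder for a thermodynamic predictor; returns dot-bracket notation.
        return "." * len(chunk)

    def reconstruct(length, predictions):
        # Stitch chunk predictions back together; later chunks win on overlaps.
        structure = ["."] * length
        for offset, pred in predictions:
            structure[offset:offset + len(pred)] = list(pred)
        return "".join(structure)

    if __name__ == "__main__":
        rna = "GGGAAACCCUUUGGGAAACCC"
        chunks = chunk_sequence(rna, size=9, overlap=3)
        preds = [(off, predict_chunk(c)) for off, c in chunks]   # map step
        print(reconstruct(len(rna), preds))                      # reduce step

Coarse-grained mapping corresponds to one mapper per sequence (the loop above); fine-grained mapping would instead hand each chunk to its own mapper.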

8.
Existing work on MapReduce task scheduling with deadline constraints takes into account neither the differences between Map and Reduce tasks nor the cluster's heterogeneity. This paper proposes an extensible MapReduce Task Scheduling algorithm for Deadline constraints on the Hadoop platform, called MTSD. It lets users specify a job's deadline and tries to finish the job before that deadline. MTSD includes a node classification algorithm that measures each node's computing capacity and classifies the nodes of a heterogeneous cluster into several levels. On top of this algorithm, we first introduce a novel data distribution model that distributes data according to each node's capacity level. Experiments show that the node classification algorithm noticeably improves data locality compared with the default scheduler, and can improve other schedulers' locality as well. Second, we calculate each task's average completion time per node level, which improves the precision of remaining-time estimation. Finally, MTSD provides a mechanism to decide which job's tasks should be scheduled by calculating the Map and Reduce task slot requirements.
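The final step — how many map or reduce slots a job needs to meet its deadline — can be sketched as follows (a simplified reading of the mechanism; in MTSD the timing inputs come from the per-level completion-time estimates):

    import math

    def required_slots(tasks_left, avg_task_time, time_to_deadline):
        # Minimum concurrent slots so all remaining tasks finish in time:
        # each slot can run floor(T / t) task "waves" before the deadline.
        if time_to_deadline <= 0:
            raise ValueError("deadline already passed")
        waves = max(int(time_to_deadline // avg_task_time), 1)
        return math.ceil(tasks_left / waves)

    if __name__ == "__main__":
        # 120 map tasks left, 30 s each on this node level, 300 s to deadline:
        # 10 waves fit, so 12 concurrent map slots are needed.
        print(required_slots(120, 30.0, 300.0))   # -> 12

Comparing this requirement against the slots currently free tells the scheduler which job is most urgent.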

9.
Cloud computing serves as a platform for remote users to utilize the heterogeneous resources in data centers to run High-Performance Computing jobs. Physical resources in the Cloud are virtualized, and user services are delivered through Virtual Machines (VMs). Job scheduling is a quintessential part of the Cloud, and efficient utilization of VMs by Cloud Service Providers demands an optimal job scheduling heuristic. An ideal scheduling heuristic should be efficient, fair, and starvation-free, producing a reduced makespan with improved resource utilization. However, static heuristics often lead to inefficient and poor resource utilization in the Cloud. An idle, underutilized host machine still consumes up to 70% of the energy required by an active machine (Ray, in Indian J Comput Sci Eng 1(4):333-339, 2012); consequently, a load-balanced distribution of workload is needed to achieve optimal resource utilization. Existing Cloud scheduling heuristics such as Min-Min, Max-Min, and Sufferage distribute workloads among VMs based on minimum job completion time, which ultimately causes load imbalance. In this paper, a novel Resource-Aware Load Balancing Algorithm (RALBA) is presented to ensure a balanced distribution of workload based on the computation capabilities of VMs. The RALBA framework comprises two phases: (1) scheduling based on the computing capabilities of VMs, and (2) mapping jobs to the VM with the earliest finish time. RALBA shows substantial improvement over traditional heuristics in makespan, resource utilization, and throughput.
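A compact sketch of the two phases described above (VM ratings and job sizes are illustrative; RALBA's actual share computation is based on VM MIPS ratings):

    def ralba_schedule(jobs, vm_mips):
        # Phase 1: each VM's share of total work is proportional to its MIPS.
        total_mips = sum(vm_mips.values())
        total_work = sum(jobs)
        share = {vm: total_work * m / total_mips for vm, m in vm_mips.items()}

        assigned = {vm: [] for vm in vm_mips}
        load = {vm: 0.0 for vm in vm_mips}
        for job in sorted(jobs, reverse=True):        # largest jobs first
            # Phase 2: among VMs with spare share, pick earliest finish time.
            candidates = [vm for vm in vm_mips if load[vm] + job <= share[vm]]
            pool = candidates or list(vm_mips)
            vm = min(pool, key=lambda v: (load[v] + job) / vm_mips[v])
            assigned[vm].append(job)
            load[vm] += job
        return assigned

    if __name__ == "__main__":
        vms = {"vm1": 1000, "vm2": 500, "vm3": 250}   # MIPS ratings
        print(ralba_schedule([80, 60, 50, 40, 30, 20, 10], vms))

Unlike Min-Min-style heuristics, the share computed in phase 1 prevents the fastest VM from absorbing the whole workload, which is where the load balance comes from.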

10.
A data-parallel framework is very attractive for large-scale data processing, since it enables an application to easily process a huge amount of data on commodity machines. MapReduce, a popular data-parallel framework used in fields such as web search, data mining, and data warehousing, has proven very practical for such applications. The star-join query is a common query in data warehouses, a current target domain of data-parallel frameworks. This article proposes a new algorithm that efficiently processes star-join queries in data-parallel frameworks such as MapReduce and Dryad. Our star-join algorithm for general data-parallel frameworks, called Scatter-Gather-Merge, processes star-join queries in a constant number of computation steps, even as the number of participating dimension tables increases. By adopting bloom filters, Scatter-Gather-Merge also avoids a non-trivial amount of I/O. We show that Scatter-Gather-Merge can be easily applied to MapReduce, and our experimental results in both cluster and cloud environments show that it outperforms existing approaches.
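The bloom-filter optimisation can be illustrated with a self-contained sketch (the filter below is deliberately tiny, and the three MapReduce phases of Scatter-Gather-Merge are collapsed into one function for brevity):

    import hashlib

    class BloomFilter:
        def __init__(self, size_bits=1 << 16, hashes=3):
            self.size, self.hashes, self.bits = size_bits, hashes, 0

        def _positions(self, key):
            for i in range(self.hashes):
                h = hashlib.sha256(f"{i}:{key}".encode()).digest()
                yield int.from_bytes(h[:8], "big") % self.size

        def add(self, key):
            for p in self._positions(key):
                self.bits |= 1 << p

        def __contains__(self, key):
            return all(self.bits >> p & 1 for p in self._positions(key))

    def star_join_filter(fact_rows, dim_keys_per_table):
        # Build one filter per dimension table from its join keys, then drop
        # fact rows that cannot possibly join, saving shuffle I/O downstream.
        filters = []
        for keys in dim_keys_per_table:
            bf = BloomFilter()
            for k in keys:
                bf.add(k)
            filters.append(bf)
        return [row for row in fact_rows
                if all(row[i] in f for i, f in enumerate(filters))]

    if __name__ == "__main__":
        facts = [(1, "x"), (2, "y"), (3, "x")]          # (customer_id, product_id)
        print(star_join_filter(facts, [{1, 3}, {"x"}])) # -> [(1, 'x'), (3, 'x')]

Because the fact table dominates a star schema, discarding non-joining fact rows before the shuffle is where the I/O savings come from; bloom false positives only cost a wasted comparison, never a wrong result.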

11.
How heterogeneous are proteome folding timescales, and what physical principles, if any, dictate their limits? We answer this by predicting the copy-number-weighted folding-speed distribution, using native topology, for the E. coli and Yeast proteomes. The two proteomes yield very similar distributions, with average folding times of 100 milliseconds and 170 milliseconds, respectively. The topology-based folding-time distribution is well described by a diffusion-drift mutation model on a flat fitness landscape in the free-energy barrier, bounded between i) the lowest barrier height, determined by the upper limit of folding speed, and ii) the highest barrier height, governed by the lower speed limit of folding. While the fastest timescale of the distribution is near the experimentally measured speed limit of 1 microsecond (typical of barrier-less folders), we find the slowest folding time to be around seconds (8 seconds for the Yeast distribution), approximately an order of magnitude less than the fastest half-life (approximately 2 minutes) in the Yeast proteome. This separation of timescales implies that even the fastest-degrading protein has a moderately high (96%) probability of folding before degradation. The overall agreement with the flat-fitness-landscape model further hints that proteome folding times did not undergo major selection pressures to make proteins fold faster, beyond the primary requirement to "sufficiently beat the clock" against their lifetimes. Direct comparison between predicted folding times and experimentally measured half-lives further shows that 99% of the proteome has a folding time shorter than the corresponding lifetime. Together, these findings suggest that proteome folding kinetics may be bounded by protein half-life.
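The quoted 96% can be reproduced under a simple competing first-order-processes assumption (our back-of-the-envelope reading, not a derivation from the paper): with folding rate k_f = 1/8 s⁻¹ for the slowest folder and degradation rate k_deg = ln 2 / 120 s⁻¹ for the fastest-degrading protein,

    P(\text{fold before degradation})
      = \frac{k_f}{k_f + k_{\mathrm{deg}}}
      = \frac{1/8}{\,1/8 + \ln 2/120\,}
      \approx 0.956.

The order-of-magnitude gap between the two timescales is what keeps this probability high even in the worst case.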

12.
As GPUs, ARM CPUs, and even FPGAs become widespread in modern computing, data centers are gradually developing into heterogeneous clusters. However, many well-known programming models such as MapReduce were designed for homogeneous clusters and perform poorly in heterogeneous environments. In this paper, we reconsider the problem and make four contributions. (1) We analyse the causes of MapReduce's poor performance in heterogeneous clusters; the most important is unreasonable task allocation between nodes of differing computing ability. (2) Based on this, we propose MrHeter, which separates the MapReduce process into a map-shuffle stage and a reduce stage, constructs a separate optimization model for each, and derives different task allocations ml_ij, mr_ij, and r_ij for heterogeneous nodes based on computing ability. (3) To make the approach suitable for dynamic execution, we propose D-MrHeter, which adds a monitoring and feedback mechanism. (4) Finally, we show that MrHeter and D-MrHeter can decrease the total execution time of MapReduce by 30-70% in heterogeneous clusters compared with original Hadoop, performing especially well under heavy workloads and large differences in node computing ability.
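The key idea — splitting a stage's tasks across nodes in proportion to measured computing ability so that all nodes finish at roughly the same time — can be sketched as follows (node speeds are illustrative, and MrHeter solves an optimisation model rather than this direct proportion):

    def allocate(tasks, node_speed):
        # Give node j a share of tasks proportional to its speed.
        total = sum(node_speed.values())
        alloc = {n: int(tasks * s / total) for n, s in node_speed.items()}
        # Hand out rounding leftovers to the fastest nodes first.
        leftover = tasks - sum(alloc.values())
        for n in sorted(node_speed, key=node_speed.get, reverse=True)[:leftover]:
            alloc[n] += 1
        return alloc

    if __name__ == "__main__":
        speeds = {"gpu-node": 4.0, "cpu-node": 2.0, "arm-node": 1.0}
        print(allocate(70, speeds))   # map-stage tasks, in the spirit of ml_ij
        print(allocate(14, speeds))   # reduce-stage tasks, in the spirit of r_ij

Running the proportion separately per stage matters because map-shuffle and reduce stress nodes differently, which is exactly why MrHeter models the two stages independently.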

13.
In this paper, we propose a mathematical expression of closure to efficient causation in terms of λ-calculus; we argue that this opens up the perspective of developing principled computer simulations of systems closed to efficient causation in an appropriate programming language. An important implication of our formulation is that, by exhibiting an expression in λ-calculus — a paradigmatic formalism for computability and programming — we show that there are no conceptual or principled obstacles to realizing a computer simulation or model of closure to efficient causation. We conclude with a brief discussion of whether closure to efficient causation captures all relevant properties of living systems. We suggest that it might not, and that more complex definitions could indeed create some crucial obstacles to computability.
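As a loose illustration (not the paper's construction) of how λ-calculus expresses the kind of self-referential definition that closure to efficient causation requires, here is the strict fixed-point (Z) combinator written in Python's lambda subset:

    # Z combinator: a fixed-point operator for strict evaluation, built purely
    # from lambdas. Z(f) behaves as f(Z(f)), i.e. a self-producing definition.
    Z = lambda f: (lambda x: f(lambda v: x(x)(v)))(lambda x: f(lambda v: x(x)(v)))

    # A function defined only in terms of "itself", with no named recursion:
    fact = Z(lambda self: lambda n: 1 if n == 0 else n * self(n - 1))

    if __name__ == "__main__":
        print(fact(5))   # -> 120

The point of the illustration is the one the abstract makes: circular, self-referential organisation is directly expressible in a standard formalism for computation, so simulating it raises no problem of principle.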

14.
It is generally held that random-coil polypeptide chains undergo a barrier-less continuous collapse when the solvent conditions are changed to favor the fully folded native conformation. We test this hypothesis by probing intramolecular distance distributions during folding in one of the paradigms of folding reactions, that of cytochrome c. The Trp59-to-heme distance was probed by time-resolved Förster resonance energy transfer (FRET) in the microsecond time range of refolding. Contrary to expectation, a state with a Trp59-heme distance close to that of the guanidinium hydrochloride (GdnHCl) denatured state is present after ~27 μs of folding. A concomitant decrease in the population of this state and increase in the population of a compact high-FRET state (efficiency > 90%) show that the collapse is barrier-limited. Small-angle X-ray scattering (SAXS) measurements over a similar time range show that the radius of gyration under native-favoring conditions is comparable to that of the GdnHCl denatured unfolded state. An independent comprehensive global thermodynamic analysis reveals that marginally stable, partially folded structures are also present in the nominally unfolded GdnHCl denatured state. These observations suggest that specifically collapsed intermediate structures of low stability, in rapid equilibrium with the unfolded state, may contribute to the apparent chain contraction observed in previous fluorescence studies using steady-state detection. In the absence of significant dynamic averaging of marginally stable partially folded states, and with the use of probes sensitive to distance distributions, barrier-limited chain contraction is observed upon transfer of the GdnHCl denatured state ensemble to native-like conditions.
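For context, the distance estimates here rest on the standard Förster relation between transfer efficiency E, the donor-acceptor distance r, and the Förster radius R₀ of the dye pair:

    E = \frac{1}{1 + (r/R_0)^{6}}

The steep sixth-power dependence is why an efficiency above 90% pins the Trp59-heme distance well below R₀ and makes FRET a sensitive probe of chain compaction.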

15.
Document similarity has important real-life applications, such as finding duplicate web sites and identifying plagiarism. While basic techniques such as k-similarity algorithms have long been known, the overwhelming amount of data collected in big-data settings calls for novel algorithms that find highly similar documents in a reasonably short time. In particular, pairwise comparison of document features, a key operation in calculating document similarity, demands prohibitively high storage and computation power. In this paper, we propose a new filtering technique that decreases the number of comparisons between the query set and the search set needed to find highly similar documents. The proposed technique utilizes a Z-order prefix, based on the cosine similarity measure, in which only the most important features are used first to find highly similar documents. We propose a three-phase approach comprising near-duplicate detection, common important terms, and a join phase. We utilize the Hadoop distributed file system and the MapReduce parallel programming model to scale our techniques to the big-data setting. Our experimental results on real data show that the proposed method outperforms previous work in the literature in terms of the number of joins, and therefore in speed.
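The filter-then-verify structure — compare only the most important features first and fall through to a full cosine computation for survivors — can be sketched as follows (the Z-order encoding itself is omitted, and the prefix size and threshold are invented):

    import math

    def top_features(weights, k=3):
        # Keep the k highest-weighted terms as the document's "prefix".
        return frozenset(sorted(weights, key=weights.get, reverse=True)[:k])

    def cosine(a, b):
        dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    def similar_pairs(docs, threshold=0.8):
        prefixes = {name: top_features(w) for name, w in docs.items()}
        names, out = list(docs), []
        for i, a in enumerate(names):
            for b in names[i + 1:]:
                if prefixes[a] & prefixes[b]:        # cheap prefix filter
                    s = cosine(docs[a], docs[b])     # expensive verification
                    if s >= threshold:
                        out.append((a, b, round(s, 3)))
        return out

    if __name__ == "__main__":
        docs = {
            "d1": {"data": 3.0, "big": 2.0, "join": 0.5},
            "d2": {"data": 3.0, "big": 1.8, "query": 0.4},
            "d3": {"crane": 2.5, "port": 2.0, "ship": 1.0},
        }
        print(similar_pairs(docs))   # d1-d2 verified; d3 is never compared

In the paper's MapReduce version, the prefix comparison happens in the map phase so that only candidate pairs reach the join phase, which is why the number of joins drops.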

16.
In this paper, we propose a flexible neighbourhood search strategy for quay crane scheduling problems based on the tabu search (TS) framework. In the literature, the container workload of a ship is partitioned into a number of fixed jobs to manage the complexity of the problem. Here, we instead propose flexible jobs that are dynamically reshaped by TS throughout the search process, eliminating the impact of fixed jobs on the generated schedules. Alternative job sequences are investigated for the quay cranes, and a new quay crane dispatching policy is developed to generate schedules. Computational experiments on problem instances from the literature show that our algorithm generates quality schedules for quay crane handling operations in reasonable time.
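A generic TS skeleton of the kind described (the move, cost, and tabu-tenure details below are placeholders on a toy sequencing problem; the paper's neighbourhood additionally reshapes the jobs themselves):

    import random
    from collections import deque

    def tabu_search(initial, neighbours, cost, iterations=200, tenure=10):
        current = best = initial
        best_cost = cost(best)
        tabu = deque(maxlen=tenure)              # short-term memory of visited states
        for _ in range(iterations):
            candidates = [s for s in neighbours(current) if s not in tabu]
            if not candidates:
                break
            current = min(candidates, key=cost)  # best admissible neighbour,
            tabu.append(current)                 # even if worse (escapes local optima)
            if cost(current) < best_cost:
                best, best_cost = current, cost(current)
        return best, best_cost

    if __name__ == "__main__":
        # Toy problem: order jobs to minimise total completion time.
        times = [4, 2, 7, 1, 5]

        def swap_neighbours(seq):
            out = []
            for i in range(len(seq) - 1):
                s = list(seq); s[i], s[i + 1] = s[i + 1], s[i]
                out.append(tuple(s))
            return out

        def total_completion(seq):
            t = done = 0
            for j in seq:
                t += times[j]; done += t
            return done

        random.seed(0)
        start = tuple(random.sample(range(5), 5))
        print(tabu_search(start, swap_neighbours, total_completion))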

17.
18.
Background

The MapReduce framework enables scalable processing and analysis of large datasets by distributing the computational load across connected computer nodes, referred to as a cluster. In bioinformatics, MapReduce has already been adopted for various scenarios, such as mapping next-generation sequencing data to a reference genome, finding SNPs in short-read data, or matching strings in genotype files. Nevertheless, tasks like installing and maintaining MapReduce on a cluster, importing data into its distributed file system, or executing MapReduce programs require advanced computer science knowledge and could thus prevent scientists from using currently available and useful software solutions.

Results

Here we present Cloudgene, a freely available platform that improves the usability of MapReduce programs in bioinformatics by providing a graphical user interface for execution, data import and export, and reproducibility of workflows on in-house clusters (private clouds) and rented clusters (public clouds). The aim of Cloudgene is to provide a standardized graphical execution environment for current and future MapReduce programs, which can be integrated through its plug-in interface. Since Cloudgene can be executed on private clusters, sensitive datasets can be kept in house at all times and data transfer times are minimized.

Conclusions

Our results show that MapReduce programs can be integrated into Cloudgene with little effort and without adding computational overhead to existing programs. The platform lets developers focus on the actual implementation task and offers scientists an environment that hides the complexity of MapReduce. In addition to MapReduce programs, Cloudgene can also launch predefined systems (e.g., Cloud BioLinux, RStudio) in public clouds. Currently, five bioinformatics MapReduce programs and two systems are integrated and have been successfully deployed. Cloudgene is freely available at http://cloudgene.uibk.ac.at.

19.
With the advances of cloud computing and virtualization technologies, running MapReduce applications over clouds has attracted more and more attention in recent years. However, the performance of MapReduce applications can be severely degraded by the overheads of I/O virtualization and resource competition among virtual machines. In this paper, we propose a dynamic block-device reconfiguration algorithm for virtual MapReduce clusters, which reduces the data transfer time between virtual machines and thereby improves the performance of MapReduce applications on top of clouds. The algorithm utilizes a block-device reconfiguration scheme in which a block device attached to one virtual machine can be dynamically detached and reattached to another at runtime; this allows files to be moved across virtual machines without any network transfers. The algorithm is also dynamic in the sense that it estimates the total data transfer times between virtual machines using multiple regression analysis on CPU utilization and data size, and adaptively determines a least-cost data transfer path between a mapper virtual machine and a reducer virtual machine. We have implemented our algorithm in Hadoop MapReduce. Benchmarking results show that the overhead of transferring data from mapper virtual machines to reducer virtual machines is minimized, and the execution times of MapReduce applications are shortened by up to 14%.
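The regression step can be sketched with ordinary least squares (the training rows and the reattach cost below are invented; the paper fits CPU utilization and data size against observed transfer times):

    import numpy as np

    # Columns: CPU utilization (0-1), data size (GB); target: transfer time (s).
    history = np.array([
        [0.20, 1.0,  4.1],
        [0.35, 2.0,  8.3],
        [0.60, 2.0, 10.9],
        [0.80, 4.0, 24.2],
        [0.50, 3.0, 14.0],
    ])

    X = np.column_stack([np.ones(len(history)), history[:, 0], history[:, 1]])
    y = history[:, 2]
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)   # [intercept, b_cpu, b_size]

    def predict_transfer_time(cpu_util, size_gb):
        return coef @ [1.0, cpu_util, size_gb]

    if __name__ == "__main__":
        # Pick the cheaper path: network copy vs. block-device reattach.
        net = predict_transfer_time(cpu_util=0.7, size_gb=2.5)
        reattach_cost = 3.0   # hypothetical fixed detach/reattach overhead
        print("reattach" if reattach_cost < net else "network", round(net, 1))

Refitting the coefficients as new transfers are observed is what makes the path selection adaptive.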

20.
Pit latrines are the most common latrine technology in rural Bangladesh, and untreated effluent from pits can directly contaminate surrounding aquifers. Sand barriers installed around the latrine pit can help reduce contamination, but they can also alter the decomposition of fecal sludge and accelerate pit fill-up, which can counteract their benefits. We aimed to evaluate whether decomposition of fecal sludge and survival of soil-transmitted helminth (STH) ova differed between latrines in coastal Bangladesh with a 50-cm sand barrier surrounding and beneath the pit and latrines without one. We assessed decomposition by measuring the carbon-nitrogen (C/N) ratio of fecal sludge, and we enumerated Ascaris lumbricoides and Trichuris trichiura ova in the pit after 18 and 24 months of latrine use. We compared these outcomes between latrines with and without sand barriers using generalized linear models with robust standard errors to adjust for clustering at the village level. The C/N ratio in latrines with and without a sand barrier was 13.47 vs. 22.64 (mean difference: 9.16, 95% CI: 0.15, 18.18). Pits with sand barriers filled more quickly and were reportedly emptied three times more frequently than pits without: 27/34 latrines with sand barriers vs. 9/34 without were emptied in the previous six months, and most reported disposal methods were unsafe. Compared to latrines without sand barriers, latrines with sand barriers had significantly higher log10 mean counts of non-larvated A. lumbricoides ova (log10 mean difference: 0.35, 95% CI: 0.12, 0.58) and T. trichiura ova (log10 mean difference: 0.47, 95% CI: 0.20, 0.73). Larvated ova counts were similar between the two latrine types for both species. Our findings suggest that sand barriers help contain helminth ova within the pits, but pits with barriers fill up more quickly, leading to more frequent emptying of insufficiently decomposed fecal sludge. Further research is needed on latrine technologies that can both isolate pathogens from the environment and achieve rapid decomposition.
