Frameworks for solving problems with HPC and Cloud computing
Albert Garcia
Abstract: During the last decade, the landscape of distributed computing has changed profoundly with the emergence and popularization of two new actors: MapReduce and Cloud computing.
MapReduce is a data-centric programming paradigm in which large amounts of information are processed through map and reduce operations that rely only upon the input data they are fed. In the absence of dependencies between subsets of the input data, several mappers (or reducers) can run simultaneously with little or no communication. Cloud computing emerges as an option to move toward the ideal of unlimited scalability by providing virtually infinite resources, although applications must be adapted to cope with unknown (and possibly high) network latencies. These characteristics make the two technologies perform very well together.
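To make the paradigm concrete, here is a minimal, framework-independent word-count sketch in plain Python; it is only an illustration of the map/shuffle/reduce semantics described above, not code from the work being presented, and Hadoop would perform the shuffle and parallel execution on the user's behalf.

    from collections import defaultdict
    from itertools import chain

    # Map: each mapper sees only its own chunk of input and emits (key, value) pairs.
    def map_fn(line):
        return [(word, 1) for word in line.split()]

    # Shuffle: group intermediate pairs by key (done transparently by the framework).
    def shuffle(pairs):
        groups = defaultdict(list)
        for key, value in pairs:
            groups[key].append(value)
        return groups

    # Reduce: each reducer sees only the values associated with its keys.
    def reduce_fn(key, values):
        return key, sum(values)

    if __name__ == "__main__":
        chunks = ["the quick brown fox", "the lazy dog", "the fox"]
        # Mappers are independent, so they could run in parallel on different nodes.
        intermediate = chain.from_iterable(map_fn(c) for c in chunks)
        # Reducers are likewise independent once the shuffle has grouped the keys.
        results = [reduce_fn(k, v) for k, v in shuffle(intermediate).items()]
        print(sorted(results))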
Nevertheless, not all problems can be expressed properly in terms of map and reduce operations. In contrast to pleasingly parallel problems, tightly coupled problems require intensive communication, and their input data cannot be decomposed easily. In HPC clusters and supercomputers, MPI remains the standard for expressing and parallelizing compute-intensive problems, in spite of its current limitations.
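For contrast, the following sketch (an assumption for illustration, using the mpi4py binding rather than any code from the work itself) shows the kind of tightly coupled pattern that does not map naturally onto map and reduce operations: every rank must exchange boundary data with its neighbours on every iteration.

    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    size = comm.Get_size()

    # Each rank owns a slice of a 1-D domain; neighbouring ranks must exchange
    # boundary ("halo") cells on every iteration -- a tightly coupled pattern.
    local = np.full(8, float(rank))
    left = (rank - 1) % size
    right = (rank + 1) % size

    for step in range(10):
        # Send our rightmost cell to the right neighbour and receive the
        # left neighbour's rightmost cell, and symmetrically for the other side.
        halo_left = comm.sendrecv(local[-1], dest=right, source=left)
        halo_right = comm.sendrecv(local[0], dest=left, source=right)
        # Simple smoothing update that uses the halo values.
        local[0] = 0.5 * (local[0] + halo_left)
        local[-1] = 0.5 * (local[-1] + halo_right)

    # A global reduction couples all ranks once more.
    total = comm.allreduce(local.sum(), op=MPI.SUM)
    if rank == 0:
        print("global sum:", total)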
We will provide a brief review of the Hadoop MapReduce framework, as well as of newer frameworks based on the same programming paradigm. Drawing on our experience, we will expose its limitations through a real use case in which a simulator was adapted and evaluated. Finally, we will discuss how to bridge the gap between Cloud and MapReduce on one side and HPC and MPI on the other.