Presenting at the SC11 Doctoral Symposium : Architectures for Iterative Data Intensive Analysis Computations on Clouds and Heterogeneous Environments

I'm going to present my research work on "performing parallel computations on clouds",  at the SC11 Doctoral Symposium on Tuesday Nov.15th at 4.45 pm at room TCC LL1.  Drop by if you are attending SC11.

Abstract: Iterative computations are at the core of the vast majority of data-intensive scientific computations. Recent advancements in data intensive computational fields are fueling a dramatic growth in number as well as usage of such data intensive iterative computations. The utility computing model introduced by cloud computing combined with the rich set of cloud infrastructure services offers a very viable environment for the scientists to perform data intensive computations. However, clouds by nature offer unique reliability and sustained performance challenges to large scale distributed computations necessitating computation frameworks specifically tailored for cloud characteristics to harness the power of clouds easily and effectively. My research focuses on identifying and developing user-friendly distributed parallel computation frameworks to facilitate the optimized efficient execution of iterative as well as non-iterative data-intensive computations in cloud environments, alongside the evaluation of heterogeneous cloud resources offering GPGPU resources in addition to CPU resources, for data-intensive iterative computations.

Cloud computing paradigms for biomedical applications

This paper is an extended version of our earlier paper presented at the ECMLS workshop in HPDC 2010, which got invited for an journal publication. In this paper we evaluate the feasibility of cloud platforms to perform pleasingly parallel biomedical computations including Cap3 sequence assembly, BLAST sequence searching and Generative Topographic Mapping (GTM) interpolation. We implement these applications for Amazon EC2 and Windows Azure clouds using a simple framework we created using Cloud infrastructure services, which we call as Classic Cloud Framework. We also compare them with Apache Hadoop implementations and Microsoft DryadLINQ implementations.  Since then, we have expanded on this work to implement MapReduce and iterative MapReduce frameworks for cloud environments using Cloud Infrastructure Services.
Abstract:Cloud computing offers exciting new approaches for scientific computing that leverage major commercial players’ hardware and software investments in large-scale data centers. Loosely coupled problems are very important in many scientific fields, and with the ongoing move towards data-intensive computing, they are on the rise. There exist several different approaches to leveraging clouds and cloud-oriented data processing frameworks to perform pleasingly parallel (also called embarrassingly parallel) computations. In this paper, we present three pleasingly parallel biomedical applications: (i) assembly of genome fragments; (ii) sequence alignment and similarity search; and (iii) dimension reduction in the analysis of chemical structures, which are implemented utilizing a cloud infrastructure service-based utility computing models of Amazon Web Services (http://Amazon.com Inc., Seattle, WA, USA) and Microsoft Windows Azure (Microsoft Corp., Redmond, WA, USA) as well as utilizing MapReduce-based data processing frameworks Apache Hadoop (Apache Software Foundation, Los Angeles, CA, USA) and Microsoft DryadLINQ. We review and compare each of these frameworks, performing a comparative study among them based on performance, cost, and usability. High latency, eventually consistent cloud infrastructure service-based frameworks that rely on off-the-node cloud storage were able to exhibit performance efficiencies and scalability comparable to the MapReduce-based frameworks with local disk-based storage for the applications considered. In this paper, we also analyze variations in cost among the different platform choices (e.g., Elastic Compute Cloud instance types), highlighting the importance of selecting an appropriate platform based on the nature of the computation. 

Optimizing OpenCL Kernels for Iterative Statistical Applications on GPUs

Last year we started implementing several iterative data intensive scientific applications on GPGPU using OpenCL as a class project for the B649:Parallel Architectures and Computing class with Prof.Arun Chauhan. Even though it's a pain to program using OpenCL (at least with the current state of the frameworks), we got fascinated with the GPGPU programming and the results we obtained, which lead to the continuation of the project towards a paper. We plan on integrating this work with our iterative MapReduce work to create a distributed GPGPU programming framework.
Abstract - We present a study of three important kernels that occur frequently in iterative statistical applications: K-Means, Multi-Dimensional Scaling (MDS), and PageRank. We implemented each kernel using OpenCL and evaluated their performance on an NVIDIA Tesla GPGPU card. By examining the underlying algorithms and empirically measuring the performance of various components of the kernel we explored the optimization of these kernels by four main techniques: (1) caching invariant data in GPU memory across iterations, (2) selectively placing data in di erent memory levels, (3) rearranging data in memory, and (4) dividing the work between the GPU and the CPU. The optimizations resulted in performance improvements of up to 5X, compared to naive OpenCL implementations. We believe that these categories of optimizations are also applicable to other similar kernels. Finally, we draw several lessons that would be useful in not only implementing other similar kernels with OpenCL, but also in devising code generation strategies in compilers that target GPGPUs through OpenCL.


Portable Parallel Programming on Cloud and HPC: Scientific Applications of Twister4Azure

Our paper on the data intensive iterative scientific applications of "Twister4Azure : Iterative MapReduce on Azure Cloud" was accepted for publication at IEEE/ACM UCC 2011. In this paper we use present non-trivial parallel applications implemented using Twister4Azure executing on Azure cloud environment utilizing up to 256 instances and tens of thousands of tasks per job.


Abstract— Recent advancements in data intensive computing for science discovery are fueling a dramatic growth in use of data-intensive iterative computations. The utility computing model introduced by cloud computing combined with the rich set of cloud infrastructure services offers a very attractive environment for scientists to perform such data intensive computations. The challenges to large scale distributed computations on clouds demand new computation frameworks that are specifically tailored for cloud characteristics in order to easily and effectively harness the power of clouds.  Twister4Azure is a distributed decentralized iterative MapReduce runtime for Windows Azure Cloud. It extends the familiar, easy-to-use MapReduce programming model with iterative extensions, enabling a wide array of large-scale iterative data analysis for scientific applications on Azure cloud. This paper presents the applicability of Twister4Azure with highlighted features of fault-tolerance, efficiency and simplicity.  We study three data-intensive applications − two iterative scientific applications, Multi-Dimensional Scaling and KMeans Clustering; one data–intensive pleasingly parallel scientific application, BLAST+ sequence searching. Performance measurements show comparable or a factor of 2 to 4 better results than the traditional MapReduce runtimes deployed on up to 256 instances and for jobs with tens of thousands of tasks.

Map Reduce on Azure (MRRoles4Azure)

Last year we wanted to run some of our distributed parallel applications on  Azure and we found that Azure did not have support for any high level distributed computing frameworks (other than the simple queue based model) at that time. This motivated me to develop a MapReduce framework for Azure using Azure infrastructure services. Even though it's based on high-latency cloud services, we were able to achieve performance comparable to Hadoop and DryadLinq for our applications, which further motivated us to release the MapReduce framework to the public.


MapReduceRoles4Azure (MRRoles4Azure) is a distributed decentralized MapReduce runtime for Windows Azure that was developed using Azure cloud infrastructure services.MapReduceRoles4Azure uses Azure Queues for map and reduce task scheduling,Azure Tables for metadata & monitoring data storage, Azure Blob storage for input, output and intermediate data storage and the Window Azure Computeworker roles to perform the computations.  The usage of the cloud infrastructure services allows the MapReduceRoles4Azure implementation to take advantage of the scalability, high availability and the distributed nature of such services guaranteed by the cloud service providers to avoid single point of failures, bandwidth bottlenecks (network as well as storage bottlenecks) and management overheads.

The usage of cloud services usually introduces latencies larger than their optimized non-cloud counterparts and often does not guarantee the time for the data's first availability. These overheads can be conquered, however, by using a sufficiently coarser grained map and reduce tasks. MapReduceRoles4Azureovercomes the availability issues by retrying and by designing the system so it does not rely on the immediate availability of data to all the workers.

MapReduceRoles4Azure is designed around a decentralized control model without a master node, thus avoiding the possible single point of failure.MapReduceRoles4Azure provides users with the capability to dynamically scale up or down the number of computing instances, even in the middle of a MapReduce computation, as and when it is needed.

You can download MRRoles4Azure from here. User guide is available here. We are currently working on an Iterative MapReduce (Twister4Azure) implementation for Azure based on MRRoles4Azure. If you want to try it out, please get in touch with me and I'll provide you with the current development version of Twister4Azure.
Map Reduce in the Clouds   (At that time of this presentation we used call it as AzureMapReduce) 

Generating documentation for .NET projects

Coming from a Java background, I always wanted to find a way to generate JavaDoc like documentation for my .net projects. Looks like http://sandcastle.codeplex.com/ is the way to go.

Installing Sun (Oracle) JDK 6 on Ubuntu

Ubuntu 10.10 installs OpenJDK as the default java distro. Often times I go ahead and replace it with closed source Sun (Oracle) JDK 1.6, in order to avoid any hiccups that I might encounter down the road with OpenJDK. It's possible to download the Sun Linux JDK distro from Oracle directly, but I always find it much easier and cleaner to do it directly through synaptic.

$sudo add-apt-repository ppa:sun-java-community-team/sun-java6
$sudo apt-get  update
$sudo apt-get  install sun-java6-jdk
$sudo apt-get  install sun-java6-plugin
# to set this as default java version
$sudo update-java-alternatives -s java-6-sun      

Do a "java -version" to ensure everything worked well.

Experience with adapting a WS-BPEL runtime for eScience workflows

The paper I presented in the SC09 GCE workshop.

Abstract

"Scientists believe in the concept of collective intelligence and are increasingly collaborating with their peers, sharing data and simulation techniques. These collaborations are made possible by building eScience infrastructures. eScience infrastructures build and assemble various scientific workflow and data management tools which provide rich end user functionality while abstracting the complexities of many underlying technologies. For instance, workflow systems provide a means to execute complex sequence of tasks with or without intensive user intervention and in ways that support flexible reordering and reconfiguration of the workflow. As the workflow technologies continue to emerge, the need for interoperability and standardization clamorous. The Web Services Business Process Execution Language (WS-BPEL) provides one such standard way of defining workflows. WS-BPEL specification encompasses broad range of workflow composition and description capabilities that can be applied to both abstract as well as concrete executable components.

Scientific workflows with their agile characteristics present significant challenges in embracing WS-BPEL for eScience purposes. In this paper we discuss the experiences in adopting a WS-BPEL runtime within an eScience infrastructure with reference to an early implementation of a custom eScience motivated BPEL like workflow engine. Specifically the paper focuses on replacing the early adopter research system with a widely used open source WS-BPEL runtime, Apache ODE, while retaining the interoperable design to switch to any WS-BPEL compliant workflow runtime in future. The paper discusses the challenges encountered in extending a business motivated workflow engine for scientific workflow executions. Further, the paper presents performance benchmarks for the developed system."

Shell scripting examples : file distribution

Today I had to write a shell script to distribute a number of data files among several nodes. Following are some tips which helped me for listing, manipulating and distributing files using shell scripts.

  • Command line arguments
    Shell scripts expose command line arguments using $1, $2, $3,.. variables corresponding to the order of args.
    You can check the given number of arguments using s#
    if [ $# -ne 3 ]
    ....

    fi
  • Assigning the list of files in a dir to an array
    filelist=(`ls /tmp`)
  • Arithmetic operations
    result=$(($x/$y))
  • Looping through an array
    for file in ${filelist[@]}; do
    ....
    done
  • Simple for loop
    for (( i=0;i<10; i++))
    do
    ....
    done
  • Simple while loop. -lt stands for "less than", while -le stands for "less than or equal to"
    while [ $i -lt 10 ]; do
    ....
    i++
    done
  • Secure copy a file to a different machine.
    scp $dir/$file $node_address:$dest_dir

Some helpful links...
http://www.arachnoid.com/linux/shell_programming.html
http://www.freeos.com/guides/lsst/
http://www.tech-recipes.com/rx/636/bash-shell-script-iterate-through-array-values/

Debugging SSL

Currently we are in the process of upgrading some components in our cyberinfrastructure (catchy word huh..) in a move towards more standardised solutions. Our existing system is more or less home grown and has it's own customized SSL layer based on puretls to handle the mutual authentication based on PEM encoded X509 certificats. We are currently moving towards Tomcat based solutions for some parts of our infrastructure. Part of our challenge is to figure out how to setup SSL mutual authentication on Tomcat. Other half of the challenge is to make that work witht he existing clients of other components. 

Soon it proved to be a very rough ride, taking weeks to debug, especially since I'm new to SSL stuff. If you are also new to SSL and want to debug a SSL setup, chances are high that you'll also go on the same path as I went. We tried to use the various clients (existing puretls based, jsse) to debug and figure out what's going on, hoping the stacktraces will give us a clue.. Oh...wait... We are dealing with security.. Even the error messages are encrypted (or is it security by obscurity), as any error would translate to a very small set of generic error messages. This is when OpenSSL command line tool came to rescue us..

Openssl command line program provides use with tools to verify certificates, ssl clients to test SSL enabled servers and even supports setting up a temporary SSL enables server to test your clients. 

  • Testing the mutual authenticaton enabled server SSL setup
[me@home tmp]$ openssl s_client -connect localhost:8443 -cert client_cert.pem -CAfile server_ca.pem -state -key client_key.pem

Above command will use client_cert.pem to authenticate himself to the server and will use the certificate of the trusted CA, server_ca.pem , to authenticate the server. "-state" will give more detailed debug information. Believe me, the error codes I got from this are very specific and quickly lead us to the issue we had. Above command hinted use that something is wrong in our client certificate. This is something we never thought of earlier, as our existing system works well with these certificates. Before this we mainly thought that the error occured due to lack of interoperability between mod_ssl & much older puretls implementation.

  •  Verifying the certificate
[me@home tmp]$ openssl verify -CAfile ca.pem -purpose sslclient client_cert.pem

The above command threw an error mentioning that the "purpose" of our client certificate is not sslclient, which in fact was the bug responsible for many of my fallen hair during the last week. Eventually we found that our existing system which is based on Puretls does not validate this certificate extensions, which made our exisiting system to work even with this bug.

All and all I found openssl command line program to be a very usefull/helpfull extensive tool which comes very handy when debuging SSL setups.