Hadoop as a Service (HaaS)
A number of major cloud service providers, including Amazon, Cloudera, Microsoft and IBM, are announcing new Hadoop-as-a-Service (HaaS) offerings to be made available in the coming months. This article takes a look at Hadoop as well as some of these new service offerings.
What is Hadoop?
Apache Hadoop is advertised as a new way for enterprises to store and analyze their data. Hadoop, administered by the Apache Software Foundation, is an open-source, Java-based project whose contributors work for some of the world’s biggest technology companies.
Hadoop was designed to respond to today’s data reality: enterprises currently collect and generate more data than at any point in the past. It aims to provide fast, reliable analysis of both structured and complex data. Thus, many enterprises are beginning to deploy Hadoop alongside their legacy IT systems, which enables them to combine old and new data sets in more practical ways.
Hadoop was originally based on Google’s MapReduce, a programming model that breaks an application down into a number of small parts. Any of these parts (also called ‘fragments’ or ‘blocks’) can run on any node in the cluster. Hadoop can run applications on systems with thousands of nodes that handle thousands of terabytes of data.
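The idea of breaking a job into parts that can run on any node can be illustrated with a small word-count sketch in plain Python. This is not Hadoop code and uses no Hadoop API; the function names (`split_into_blocks`, `map_block`, `shuffle`, `reduce_counts`, `word_count`) and the `block_size` parameter are hypothetical, chosen only for illustration, and everything runs in a single process rather than on a cluster.

```python
from collections import defaultdict
from itertools import chain


def split_into_blocks(lines, block_size):
    """Split the input into blocks; in a real cluster each block could go to a different node."""
    return [lines[i:i + block_size] for i in range(0, len(lines), block_size)]


def map_block(block):
    """Map step: emit a (word, 1) pair for every word in one block."""
    return [(word.lower(), 1) for line in block for word in line.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(grouped):
    """Reduce step: sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(lines, block_size=2):
    """Run the map, shuffle and reduce steps over the input lines (hypothetical helper)."""
    blocks = split_into_blocks(lines, block_size)
    # Each block is mapped independently, so the order of processing does not matter.
    mapped = chain.from_iterable(map_block(block) for block in blocks)
    return reduce_counts(shuffle(mapped))


if __name__ == "__main__":
    sample = ["big data big clusters", "data on many nodes", "many nodes many blocks"]
    print(word_count(sample))
```

Because each block is mapped independently, the result does not depend on how the input is split, which is the property that lets a framework such as Hadoop spread the work across many nodes.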
Currently, the Hadoop framework is used by major companies, including Google, Yahoo and IBM, largely for applications that involve search engines and advertising. Hadoop is compatible with the following operating systems: Windows, Linux, BSD and OS X.
Amazon’s MapReduce
Back in April 2009, Amazon announced Elastic MapReduce, a hosted service that runs on its Elastic Compute Cloud and storage services. Originally created by Google, MapReduce is a programming model for processing very large data sets. The service gives AWS customers access to the power of Google- or Yahoo-style servers and other programming infrastructure, so that they can model business decisions and analyze very large sets of customer or corporate data.
ZDNet blogger Dana Gardner raved:
“Think of it as having your own tuned supercomputer that you can plug gigantic data sets into and ask questions that will determine the course of your businesses for the next decade. Oh, and you can pay for the pleasure on a credit card.”
Cloudera’s CDH3
Industry experts commented that Cloudera’s CDH3 took Amazon’s MapReduce to the next level. CDH3 is a tuned Hadoop AMI that includes a number of additional software products to aid in administering and running complex jobs on Hadoop. Many of these additions are open-source projects, and include:
- Apache Mahout
- Flume
- Sqoop
- Pig
- Oozie
- Hive
- HBase
- ZooKeeper
- Whirr
Microsoft’s HaaS Offering
At the PASS Summit 2011 in October, Microsoft announced that it would expand its data platform and add a Hadoop-as-a-Service offering to its growing list. The company plans to integrate this new service into Windows Azure and SQL Server in 2012. While Microsoft did not release many details, it did promise to maintain compatibility with the Apache Hadoop codebase and to contribute to the open-source project.
The official Microsoft Blog discussed this pending release as a way to help customers manage and analyze data of any size:
“Microsoft makes this possible through SQL Server 2012 and through new investments to help customers manage ‘big data’, including an Apache Hadoop-based distribution for Windows Server and Windows Azure and a strategic partnership with Hortonworks… We often talk about the economics of the cloud, detailing how customers can achieve unmatched economies of scale by taking advantage of public or private cloud architectures.”
IBM’s BigInsights
On October 24, 2011, IBM announced that its Hadoop-based InfoSphere BigInsights product would be available as a service on the IBM SmartCloud platform. According to industry experts, IBM’s offering is unusual in that it targets business users rather than skilled programmers.
BigInsights uses a spreadsheet interface for working with data and creating jobs. It includes a query language, known as Jaql, that has many similarities with SQL and was designed to query both structured and unstructured data. BigInsights also provides a wide variety of data-visualization options, which are designed to make the results of any given job easier to interpret.
Summary
This article takes a look at Hadoop and Hadoop-as-a-Service (HaaS) offerings, which are touted as a new way for enterprises to deal with data. The article explores the Hadoop framework, an open-source, Java-based project that helps to analyze structured as well as complex data. The article also looks at recent HaaS offerings by Amazon, Cloudera, Microsoft and IBM.
CCSK Exam Preparation
In preparation for the Certificate of Cloud Security Knowledge (CCSK), a security professional should be comfortable with topics related to this post, including:
- Data Security Lifecycle (Domain 5)
- Provider selection (Domain 8)
- Differences in S-P-I models (Domain 10)