This document discusses Hadoop and how it is gaining adoption in the enterprise. It provides an overview of the Hadoop ecosystem including the core components of HDFS, MapReduce, Hive, Pig and HBase. It describes how enterprises are using Hadoop to store and analyze large amounts of structured and unstructured data across clustered servers. The document also discusses data integration tools like Pentaho and Talend that can help schedule and develop ETL processes for Hadoop. Finally, it provides some references and examples of companies that are using Hadoop for applications like log analysis, machine learning and data mining.
Hadoop For Enterprises
1. Hadoop for Enterprise
rev 7
Rajesh Nadipalli
Mar 2012
rajesh.nadipalli@gmail.com
2. Hadoop getting attention
• Feb 2012: Microsoft and Hortonworks partner to develop an Excel plug-in for Hadoop
• Jan 2012: Oracle announces the Big Data Appliance with Cloudera's Hadoop distribution
• Dec 2011: EMC releases the Unified Analytics Platform, which includes the Greenplum Apache Hadoop distribution
• Oct 2011: Microsoft plans to add Hadoop support to SQL Server 2012
• May 2010: IBM introduces the Hadoop-based InfoSphere BigInsights
3. In this Presentation…
Big Data – Big Opportunities
Hadoop for Enterprise – Reference Architecture
MapReduce Overview
Hive
References
4. BIG DATA – BIG OPPORTUNITIES
5. Big Data – Business Opportunity
Enterprises today are challenged with:
Exponential data growth
Complex data needs: structured & unstructured
Real-time insights with key indicators
Heterogeneous environments: private and public clouds
Tighter budgets and the need to do more with less
Traditional relational databases are not able to scale and meet these challenges
7. Why Hadoop?
Hadoop provides…
Distributed File System
Parallel computing across several nodes
Support for structured and unstructured content
Fault tolerance and linear scalability
Open source under the Apache Foundation
Increasing support from vendors
Key philosophy: "moving compute is cheaper than moving data"
Forrester regards Hadoop as the nucleus of the next-generation EDW in the cloud.
8. Some users of Hadoop…
http://wiki.apache.org/hadoop/PoweredBy
• Use Hadoop to store copies of internal log and dimension data sources, and use it as a source for reporting/analytics and machine learning.
• Currently we have 2 major clusters: an 1100-machine cluster with 8800 cores and about 12 PB raw storage, and a 300-machine cluster with 2400 cores and about 3 PB raw storage.
• Each (commodity) node has 8 cores and 12 TB of storage.
• Hadoop is used to analyze search logs and do mining work on the web page database.
• We handle about 3000 TB per week; our clusters vary from 10 to 500 nodes.
• 532-node cluster (8 * 532 cores, 5.3 PB).
• Heavy usage of Java MapReduce, Pig, Hive, HBase.
• Using it for search optimization and research.
• 5-machine cluster (8 cores/machine, 5 TB/machine storage)
• Existing 19 virtual machine cluster (2 cores/machine, 30 TB storage)
• Predominantly Hive and Streaming API based jobs (~20,000 jobs a week)
• Daily batch ETL; log analysis; data mining; machine learning
10. Hadoop for Enterprise – Technology Stack
The stack has the following layers, top to bottom:
User Experience: ad-hoc queries, notifications/alerts, embedded analytics, search
Data Access: Hive, Pig, Datameer, Excel, R (Rhipe, RBits)
Data Processing: MapReduce
Hadoop Data Store: HBase (NoSQL DB) on top of HDFS
Data Sources: applications, databases (internal), log files, RSS feeds, cloud, others
Cross-cutting: Zookeeper (orchestration, quorum), Pentaho (scheduling, integrations), Sqoop (imports from the data sources into the Hadoop data store)
11. Hadoop for BI – Reference Architecture
Data sources (RDBMS, XML, JSON, binary, CSV, log files, Java objects) are imported into the Hadoop File System (HDFS).
Inside the Hadoop distributed computing environment, an N-node scalable cluster runs MapReduce over the data in HDFS.
Results feed enterprise apps: dashboards, Excel, ERP and other enterprise applications, with exports back into an RDBMS.
12. Oracle's Big Data Solution
http://www.oracle.com/technetwork/database/focus-areas/bi-datawarehousing/wp-big-data-with-oracle-521209.pdf?ssSourceSiteId=ocomen
• Oracle sees Hadoop as a good fit for sourcing unstructured data and for MapReduce processing.
• It recommends using the Oracle database for the final analysis stage.
• Oracle Data Integrator can issue Hive queries (ETL).
• Oracle has a wrapper on top of Sqoop called Oraoop (see references).
15. Map Reduce Tips
The first step is to understand what data you have and how to feed it into the Hadoop distributed computing environment.
Distributed applications then provide analytics over the massive data sets, while simultaneously surfacing opportunities.
Hadoop stores your information for future queries, enhancing the exploratory capabilities (as well as historical reference) of your data.
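To make these tips concrete, below is a minimal word-count sketch written against the standard Hadoop MapReduce Java API; the class name, input path, and output path are placeholders, so treat it as an illustration rather than production code.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  // Map phase: emit (word, 1) for every token in the input split
  public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (token.isEmpty()) continue;
        word.set(token);
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the counts emitted for each word
  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) sum += v.get();
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "word count"); // Job.getInstance(...) on newer releases
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenMapper.class);
    job.setCombinerClass(SumReducer.class);               // local pre-aggregation on each mapper
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. an HDFS directory of log files
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory must not exist yet
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}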
17. HDFS
Distributed file system consisting of:
◦ A single "Namenode" that holds the metadata
◦ Several "Datanodes" that hold the data blocks
Designed to run on commodity hardware
Data is imported as blocks (64 MB by default)
These blocks are replicated (typically 3 copies) to protect against hardware failures
Access via Java APIs or the hadoop command line ($ hadoop fs …)
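As a hedged sketch of the two access paths mentioned above, the Java FileSystem API below copies a local file into HDFS and lists a directory; it is roughly equivalent to running "$ hadoop fs -put access.log /logs" and "$ hadoop fs -ls /logs". The file and directory names are hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();   // reads core-site.xml / hdfs-site.xml from the classpath
    FileSystem fs = FileSystem.get(conf);       // handle to the configured (distributed) file system
    // Copy a local file into HDFS; the Namenode records the metadata,
    // the Datanodes store the replicated 64 MB blocks
    fs.copyFromLocalFile(new Path("access.log"), new Path("/logs/access.log"));
    // List the directory, similar to "hadoop fs -ls /logs"
    for (FileStatus status : fs.listStatus(new Path("/logs"))) {
      System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
    }
    fs.close();
  }
}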
19. HBase
Distributed, column-oriented database (NoSQL)
Failure-tolerant
Low latency
HDFS aware
Access via Java APIs or REST APIs
It is not a replacement for an RDBMS
Recommended to use HBase when:
◦ Data is searched by key (or key range)
◦ Data does not conform to a fixed schema (for instance, attributes that change by record)
20. HBase Architecture
Components: Zookeeper, the HBase Master, an Avatar node (failover of the master), and multiple Region Servers.
Zookeeper maintains the quorum and knows which server is the master
The Master keeps track of regions and region servers
Region servers store the table regions
21. HBase Column Storage
HBase stores data as tagged values under a row key; for example:

Row        | Column Family | Column       | Cell
Star Wars  | Cast          | Cast:Actor1  | Harrison Ford
Star Wars  | Cast          | Cast:Actor2  | Carrie Fisher
Star Wars  | Reviews       | Review:IMDB  | Review URL
Star Wars  | Reviews       | Review:ET    | Review URL2
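A hedged sketch of how the table above could be written and read with the 2012-era HBase Java client (HTable, Put, Get); the table name "movies" and the example review URL are made up for illustration, and the table with its column families is assumed to already exist.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class MovieExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();  // reads hbase-site.xml (Zookeeper quorum etc.)
    HTable table = new HTable(conf, "movies");         // hypothetical table name

    // Row key "Star Wars" with cells in the Cast and Reviews column families
    Put put = new Put(Bytes.toBytes("Star Wars"));
    put.add(Bytes.toBytes("Cast"), Bytes.toBytes("Actor1"), Bytes.toBytes("Harrison Ford"));
    put.add(Bytes.toBytes("Cast"), Bytes.toBytes("Actor2"), Bytes.toBytes("Carrie Fisher"));
    put.add(Bytes.toBytes("Reviews"), Bytes.toBytes("IMDB"), Bytes.toBytes("http://example.com/review"));
    table.put(put);

    // Fetch the row back by key and read one cell
    Result row = table.get(new Get(Bytes.toBytes("Star Wars")));
    System.out.println(Bytes.toString(row.getValue(Bytes.toBytes("Cast"), Bytes.toBytes("Actor1"))));
    table.close();
  }
}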
23. Hive Overview
Data warehouse software built on top of Hadoop
HiveQL provides a SQL-like interface; queries are executed as MapReduce jobs
Provides structure over HDFS data, similar to an Oracle external table
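A hedged example of the HiveQL interface, here driven from Java over JDBC: the table, columns, and /logs location are hypothetical, and the connection string assumes a HiveServer2 endpoint (the original HiveServer uses the org.apache.hadoop.hive.jdbc.HiveDriver class and a jdbc:hive:// URL instead).

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveExample {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    Connection conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "", "");
    Statement stmt = conn.createStatement();

    // Like an Oracle external table: the DDL only layers structure over files already in HDFS
    stmt.execute("CREATE EXTERNAL TABLE IF NOT EXISTS web_logs (ip STRING, url STRING, ts STRING) "
        + "ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t' LOCATION '/logs'");

    // The SELECT is compiled into one or more MapReduce jobs behind the scenes
    ResultSet rs = stmt.executeQuery("SELECT url, COUNT(*) AS hits FROM web_logs GROUP BY url");
    while (rs.next()) {
      System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
    }
    conn.close();
  }
}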
25. Pig Overview
Pig is a layer on top of MapReduce for statisticians (programmers)
It provides several standard operators: join, order by, etc.
It allows user-defined functions (UDFs) to be included
UDFs can be written in Java or Python
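Since the slide mentions Java UDFs, here is a hedged sketch of one: a trivial function that upper-cases its first argument, built on Pig's EvalFunc API.

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// A user-defined function Pig can call from a FOREACH ... GENERATE statement
public class ToUpper extends EvalFunc<String> {
  @Override
  public String exec(Tuple input) throws IOException {
    if (input == null || input.size() == 0 || input.get(0) == null) {
      return null;                        // pass nulls through untouched
    }
    return input.get(0).toString().toUpperCase();
  }
}

In a Pig Latin script the jar is first registered (REGISTER toupper.jar;), after which ToUpper(field) can be used like any built-in operator; the jar and field names here are placeholders.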
26. Datameer Overview
http://www.datameer.com/
Key philosophy: business users understand Excel; let them do the grouping, sorting, filtering, and aggregates.
Key steps:
Datameer's source is a MapReduce output.
Datameer takes a quick sample of 5000 records.
The end user is then presented with an Excel-like interface on top of these 5000 records. This is where end users define their filters, formulas, grouping, aggregations, and joins across sheets (they can even join Hadoop data with data from a relational database table).
Once the end user has defined the desired end result, they can submit a job to run on the complete dataset. Datameer then builds the necessary MapReduce jobs and runs them on the complete data set.
Finally, the user gets the results and can build charts, tables, etc. – all in the browser.
29. Pentaho Data Integration
Pentaho is considered a "strong performer" by Forrester (Feb 2012)
It makes building MapReduce jobs easy via its Data Integration IDE
It can read/write to HDFS, and run MapReduce and Pig scripts
The IDE has several standard connectors and transformations, and allows custom Java code
http://www.pentaho.com/big-data/
34. User Experience
This layer of the stack is generally custom development. However, some tools that work with Hadoop are:
Tableau for data analysis & visualizations
SAS Enterprise Miner
IBM BigInsights
39. MapR
No single point of failure at the name node
Performance improvements (5 times faster than HDFS)
Snapshots, multi-site copies
Provides its own extended MapReduce implementation
MapR uses 8 KB blocks instead of the 64 MB block size of HDFS
40. Open Topics – why there are adoption issues
Security – no concept of roles
Backup and recovery
ACID not supported
41. Thank You to my viewers
Rajesh.nadipalli@gmail.com
Hadoop implements the core features that are at the heart of most modern EDWs: cloud-facing architectures, MPP, in-database analytics, mixed workload management, and a hybrid storage layer
http://wiki.apache.org/hadoop/PoweredBy
HBase is a full-fledged database (albeit not relational) that uses HDFS as storage; this means you can run interactive queries and updates on your dataset. Sqoop takes data from any database that supports JDBC and moves it into HDFS. Toad for Cloud Databases offers a SQL view with query, migration, browsing, editing, and reporting capabilities over Apache Hive, Apache HBase, Apache Cassandra, MongoDB, Amazon SimpleDB, Microsoft Azure Table Services, Microsoft SQL Azure, and ODBC-enabled relational databases (including Oracle, SQL Server, MySQL, DB2, and others).
Netflix uses a similar reference architecture for movie recommendations. Hadoop is not suited for low latency; Facebook, however, uses HBase for messaging, which is close to real-time functionality.