You know Pig is more than a farm animal and that Hive is not some ultra-hip bar. You've moved beyond the buzzwords and the word-count demos. Now…you're ready to figure out how it all fits in. In this session we will review common integration scenarios, proven patterns and best practices for integrating Big Data solutions into your existing data warehouse and BI architecture. Learn how you too can ride the Big Data wave without reinventing the wheel, enhancing the information you currently deliver while solving problems that were previously unapproachable.
3. MAKING BUSINESS INTELLIGENT
www.pragmaticworks.com
Big Data
Data Explosion
As recently as 2000, only ¼ of data was digital
Paper, film or other analog media
According to IBM, 90% of data was created in the last 2 years
Data volume now growing 10x every 5 years
Approximately 85% from new sources
Consumerization
4.3 connected devices per adult
27% use social media input
Big Data
Data Complexity: Variety and Velocity
Megabytes
Gigabytes
Terabytes
Petabytes
Big Data
Service Logs
Spatial & GPS coordinates
Data market feeds
eGov feeds
Weather
Text/image
Click stream
Wikis/blogs
Sensors
RFID/Devices
SMS
HD Audio/video
Source: Brian Mitchell, TechEd 2013
Big Data is well…Big
Drove $28b in IT investment in 2012
Expected to grow to $34b in 2014
Challenges:
Data Volumes (Hardware/Storage Economics)
Data Diversity (Multiple Types & Sources)
Data Velocity (Real-Time)
User-Expectations
How do we plan/integrate…
Hadoop on Windows
HDInsight on Windows Azure
Seamlessly scale in the cloud
Backed by Azure Storage Vault (ASV)
Hortonworks Data Platform (HDP)
On-Premise
Based on HDFS
Tools, Techniques & Strategies
Enterprise Data Services
WebHDFS
Sqoop
HCatalog
Pig/Hive
Enterprise Operational Services
Oozie
Other
Windows Azure Blob Storage & AzCopy
Hive ODBC
PolyBase
WebHDFS
Born from HFTP, intended as a replacement
Widely used by Yahoo!
High-performance, first-class native protocol using an industry-standard RESTful mechanism
Complete interface for reading, writing & managing files
Supports secure authentication
Data locality: read and write requests are redirected to the data nodes that hold the data
WebHDFS – Get Example
Request:
curl -i -L "http://host:port/webhdfs/v1/foo/bar?op=OPEN"
Response:
HTTP/1.1 307 TEMPORARY_REDIRECT
Content-Type: application/octet-stream
Location: http://datanode:50075/webhdfs/v1/foo/bar?op=OPEN&offset=0
Content-Length: 0

HTTP/1.1 200 OK
Content-Type: application/octet-stream
Content-Length: 22

Hello, webhdfs user!
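The same read can be sketched in Python using only the standard library. This is a minimal illustration, not an official client: the helper names webhdfs_url and read_file are hypothetical, and the host and port you pass in (for example a NameNode HTTP port of 50070) are assumptions about your cluster. urlopen follows the 307 redirect to the data node automatically for GET requests, which mirrors what curl -L does above.

```python
from urllib.parse import quote, urlencode
from urllib.request import urlopen


def webhdfs_url(host, port, path, op, **params):
    """Build a WebHDFS REST URL for an HDFS path and operation (hypothetical helper)."""
    query = urlencode({"op": op, **params})
    return f"http://{host}:{port}/webhdfs/v1{quote(path)}?{query}"


def read_file(host, port, path):
    """Read a whole HDFS file as bytes.

    The name node answers with a 307 redirect to a data node;
    urlopen follows that redirect for GET requests.
    """
    with urlopen(webhdfs_url(host, port, path, "OPEN")) as response:
        return response.read()
```

Calling read_file("host", 50070, "/foo/bar") against a live cluster would return the file contents as bytes; webhdfs_url on its own is useful for checking the request before sending it.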
WebHDFS – More Examples
Rename Request:
curl -i -X PUT "http://host:port/webhdfs/v1/foo/bar?op=RENAME&destination=/foo/bar2"
Create Directory Request:
curl -i -X PUT "http://host:port/webhdfs/v1/foo2?op=MKDIRS"
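The rename and create-directory calls can be sketched the same way in Python. The helper name webhdfs_put is hypothetical, and the host and port are placeholders you would replace with your own name node. The sketch only builds the PUT requests; passing one to urllib.request.urlopen would actually send it.

```python
from urllib.parse import quote, urlencode
from urllib.request import Request


def webhdfs_put(host, port, path, op, **params):
    """Build (but do not send) a WebHDFS PUT request (hypothetical helper)."""
    query = urlencode({"op": op, **params})
    url = f"http://{host}:{port}/webhdfs/v1{quote(path)}?{query}"
    return Request(url, method="PUT")


# Equivalent of the two curl commands above; the destination path is
# percent-encoded by urlencode, which the server decodes.
rename_request = webhdfs_put("host", 50070, "/foo/bar", "RENAME", destination="/foo/bar2")
mkdirs_request = webhdfs_put("host", 50070, "/foo2", "MKDIRS")
```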
Sqoop
Tool designed to efficiently move data between Hadoop (Hive & HBase) and relational databases
Importing (single and all tables)
Exporting
Eval (Query Execution)
Merge (Multiple HDFS datasets)
Incremental Imports
Generates MapReduce jobs
Can control the level of parallelism
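As an illustration of the import, incremental import and parallelism points above, here is a minimal Python sketch that assembles a sqoop import command line. The function name sqoop_import_args is hypothetical, and the JDBC URL, table and directory in the usage note are made-up placeholders. The flags used (--connect, --table, --target-dir, --num-mappers, --incremental, --check-column, --last-value) are standard Sqoop import options. The sketch only builds the argument list; you would hand it to subprocess.run on a machine where Sqoop is installed.

```python
def sqoop_import_args(jdbc_url, table, target_dir, mappers=4,
                      check_column=None, last_value=0):
    """Assemble a `sqoop import` command as an argument list (hypothetical helper)."""
    args = [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--table", table,
        "--target-dir", target_dir,
        # Number of parallel map tasks used for the import.
        "--num-mappers", str(mappers),
    ]
    if check_column is not None:
        # Incremental import: only fetch rows whose check column
        # is greater than the last imported value.
        args += [
            "--incremental", "append",
            "--check-column", check_column,
            "--last-value", str(last_value),
        ]
    return args
```

For example, sqoop_import_args("jdbc:sqlserver://dbhost;databaseName=Sales", "Orders", "/data/orders", mappers=8, check_column="OrderID", last_value=1000) describes an eight-way parallel import of rows added after OrderID 1000.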
HCatalog/Hive/Pig
HCatalog – Metadata & table management
Users interact with a set of defined tables
Abstracts away the where/how of data storage
Allows for consistent access
Pig – ETL/Data Transformation Scripting
Pig Latin
Java User-Defined Functions (Piggybank/DataFu)
Hive – SQL-like interface
Allows ad-hoc queries for data summarization and analysis
ODBC Connector
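To make the "ad-hoc summarization" point concrete, here is a small Python sketch that runs a HiveQL query through the Hive command-line client (hive -e). The clickstream table and its page column are hypothetical, as are the helper names; the sketch assumes the hive executable is on the PATH of the machine running it.

```python
import subprocess

# Hypothetical table and column names, for illustration only.
SUMMARY_QUERY = (
    "SELECT page, COUNT(*) AS hits "
    "FROM clickstream "
    "GROUP BY page "
    "ORDER BY hits DESC "
    "LIMIT 10"
)


def hive_query_args(query):
    """Build the command line for running one HiveQL statement with `hive -e`."""
    return ["hive", "-e", query]


def run_hive(query):
    """Run the query and return the client's standard output as text."""
    result = subprocess.run(
        hive_query_args(query), capture_output=True, text=True, check=True
    )
    return result.stdout
```

On a cluster node, run_hive(SUMMARY_QUERY) would return the ten most-hit pages as tab-separated text; the same query could equally be issued from a BI tool through the ODBC connector.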
Oozie
Scalable, Reliable, Extensible Workflow Management System/Job Scheduler
Triggered by:
Time
Data Availability
Can run and orchestrate multiple jobs:
MapReduce and Streaming MapReduce
Hive
Pig
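The orchestration idea can be sketched by generating a minimal Oozie workflow definition from Python. This is an illustration under assumptions, not a complete deployment: the build_workflow helper, the action names and the script names are hypothetical, and the schema versions (uri:oozie:workflow:0.4, uri:oozie:hive-action:0.2) may need adjusting for your Oozie release. Time and data-availability triggers are configured separately in an Oozie coordinator and are not shown.

```python
import xml.etree.ElementTree as ET


def build_workflow(name, steps):
    """Return workflow XML that runs the given steps one after another.

    steps is a list of (action_name, kind, script) tuples,
    where kind is "pig" or "hive".
    """
    root = ET.Element("workflow-app",
                      {"name": name, "xmlns": "uri:oozie:workflow:0.4"})
    ET.SubElement(root, "start", {"to": steps[0][0]})
    for i, (action_name, kind, script) in enumerate(steps):
        # On success go to the next step, or to the end node after the last one.
        next_node = steps[i + 1][0] if i + 1 < len(steps) else "end"
        action = ET.SubElement(root, "action", {"name": action_name})
        attrs = {"xmlns": "uri:oozie:hive-action:0.2"} if kind == "hive" else {}
        job = ET.SubElement(action, kind, attrs)
        # ${jobTracker} and ${nameNode} are resolved from job.properties at submit time.
        ET.SubElement(job, "job-tracker").text = "${jobTracker}"
        ET.SubElement(job, "name-node").text = "${nameNode}"
        ET.SubElement(job, "script").text = script
        ET.SubElement(action, "ok", {"to": next_node})
        ET.SubElement(action, "error", {"to": "fail"})
    kill = ET.SubElement(root, "kill", {"name": "fail"})
    ET.SubElement(kill, "message").text = "Workflow failed"
    ET.SubElement(root, "end", {"name": "end"})
    return ET.tostring(root, encoding="unicode")


workflow_xml = build_workflow("log-pipeline", [
    ("clean-logs", "pig", "clean_logs.pig"),
    ("summarize", "hive", "summarize.q"),
])
```

The resulting string would be saved as workflow.xml in HDFS alongside the two scripts; here a Pig cleanup step hands off to a Hive summarization step, and any failure routes to the kill node.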
Windows Azure Blob Storage
Also called Azure Storage Vault (ASV)
Persistent, highly scalable storage with built-in geo-replication
Azure HDInsight clusters are wired for ASV
On-Premise HDP uses HDFS
Separates data from compute nodes:
Clusters can be created and dropped, minimizing costs
Multiple clusters can share data
The Azure Flat (Quantum 10) mesh grid network is the key
Violates the principle of data locality, but out-performs HDFS and Azure competitors
AzCopy
Copies files to and from Windows Azure Blob Storage
Similar to Robocopy
Command-line:
/S /V
Recursively (/S) copies all files in the Beer directory with verbose (/V) logging
PolyBase
Part of Parallel Data Warehouse, allows integration of relational and non-relational data
Creates external tables via an HDFS bridge
Allows on-the-fly joins within SQL Server
Supports parallel:
Imports from HDFS
Exports to HDFS
Resources
Bloggers:
Denny Lee
http://dennyglee.com/
Carl Nolan
http://blogs.msdn.com/b/carlnol/archive/tags/hadoop+streaming/
Cindy Gross
http://blogs.msdn.com/b/cindygross/archive/tags/big+data/
Books:
Hadoop: The Definitive Guide - Tom White
Programming Pig - Alan Gates
Programming Hive - Edward Capriolo
Hadoop MapReduce Cookbook - Srinath Perera
Links to this Presentation:
http://bluewatersql.wordpress.com/resources/
http://www.slideshare.net/bluewatersql/big-dataguide