With so many new, evolving frameworks, tools, and languages, a new big data project can lead to confusion and unwarranted risk.
Many organizations have found Data Warehouse Optimization with Hadoop to be a good starting point on their Big Data journey. Offloading ETL workloads from the enterprise data warehouse (EDW) into Hadoop is a well-defined use case that produces tangible results for driving more insights while lowering costs. You gain significant business agility, avoid costly EDW upgrades, and free up EDW capacity for faster queries. This quick win builds credibility and generates savings to reinvest in more Big Data projects.
A proven reference architecture that includes everything you need in a turnkey solution – the Hadoop distribution, data integration software, servers, networking and services – makes it even easier to get started.
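To make the offload pattern concrete, here is a minimal sketch of what a transformation job moved out of the EDW and onto a Hadoop cluster can look like. This is a PySpark illustration only; the paths, tables, and column names are hypothetical stand-ins, and the turnkey solution described in this deck generates equivalent jobs without hand-written code.

# Minimal sketch of an ETL job offloaded from the EDW to Hadoop (hypothetical names).
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("edw-offload-sketch").getOrCreate()

# Land raw extracts in HDFS instead of staging them inside the warehouse.
raw = spark.read.option("header", "true").csv("hdfs:///landing/transactions/*.csv")

# Run the heavy cleansing and aggregation on the cluster, not in the EDW.
daily_revenue = (raw
    .withColumn("amount", F.col("amount").cast("double"))
    .filter(F.col("amount").isNotNull())
    .groupBy("store_id", "txn_date")
    .agg(F.sum("amount").alias("revenue")))

# Persist the refined result; only this compact summary needs to reach the EDW.
daily_revenue.write.mode("overwrite").parquet("hdfs:///refined/daily_revenue")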
2. Panel moderator
Armando Acosta, Dell
• Subject Matter Expert for Dell Big Data Solutions
• Product Manager for the Dell Hadoop Solutions
• Works with customers to transform IT into better business outcomes
• Seventeen years in technology
4. Organizations actively using data grow 50% faster
The share of organizations that understand the benefits of big data grew slightly, from 39% in 2014 to 42% in 2015.
5. There are challenges that must be addressed
• Older technology can’t keep up: the need to scale to support all data and unpredictable workloads makes effective data management and data integration key priorities
• Data silos hinder decision-making: organizations need to analyze all data, regardless of type or where it resides, and apply it to use cases
• Determining the value: IT/business alignment on strategic business objectives and use cases is critical to achieving ROI from all data
7. The basics of big data and analytics
Where data originates:
• Databases
• Social media
• Sensor data (IoT)
• Devices
• LOB applications
• Cloud
• External sources
How data is moved and prepared for analysis:
• Data integration, aggregation and transformation (sketched below)
Where data is analyzed:
• Analytical engine
• Business intelligence
• In-memory computing
• Enterprise data warehouse
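As a rough sketch of the middle step, moving and preparing data from disparate origins before analysis, the following PySpark snippet joins a line-of-business database extract with sensor data landed as JSON. The connection details, join key, and column names are assumptions for illustration only.

# Sketch: integrate, aggregate, and transform data from two disparate sources.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("integration-sketch").getOrCreate()

# A relational LOB application source read over JDBC (hypothetical endpoint).
orders = (spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://lob-db:5432/sales")
    .option("dbtable", "public.orders")
    .option("user", "etl").option("password", "secret")
    .load())

# Device/sensor (IoT) data landed in HDFS as JSON.
sensors = spark.read.json("hdfs:///landing/iot/*.json")

# Conform and combine the sources into one analysis-ready dataset.
prepared = (orders.join(sensors, on="device_id", how="left")
    .groupBy("region")
    .agg(F.count("*").alias("order_count"), F.avg("temperature").alias("avg_temp")))

prepared.write.mode("overwrite").parquet("hdfs:///analytics/orders_by_region")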
8. Sean Anderson, Cloudera
Product Marketing - IT Solutions at Cloudera
Sean is a tenured infrastructure scaling and cloud strategy consultant with a strong focus on strategic partnerships and innovative hybrid technology. He has been a part of integral shifts in technology, including the rise of cloud computing, open-source standardization, and big data. Sean quickly became a go-to resource and speaker for data-specific workloads, focusing on technologies like Hadoop, MongoDB, Redis, Elasticsearch, SQL, and data warehousing. At Rackspace Hosting, Sean helped build and launch open-source cloud platforms around Hadoop, MongoDB, and Redis. Sean is currently marketing director for IT Solutions at Cloudera, the pioneers of Apache Hadoop.
9. Inefficient data workloads cost customers money
• Frequent ETL breakdowns
• Long reporting wait times
• Ad hoc access pressure on the EDW
• Extreme query complexity
10. Cloudera Enterprise
Making Hadoop Fast, Easy, and Secure
[Platform diagram: batch, stream, SQL, search, and SDK engines to process, analyze, and serve data; unified services for resource management and security; filesystem, relational, and NoSQL storage; data integration; operations and data management spanning batch and real-time workloads.]
A new kind of data platform:
• One place for unlimited data
• Unified data access
Cloudera makes it:
• Fast for business
• Easy to manage
• Secure without compromise
11. Cloudera Navigator Optimizer
Unlock Your Best Hadoop Strategy, Instantly
Active data optimization for Hadoop to save you time and money:
• Instant workload insights
• Intelligent optimization guidance
• Reduced Hadoop workload development effort
12. Brandon Draeger, Intel
Director of Marketing and Business Development for Big Data Solutions
Brandon is a Director of Marketing and Business Development for Big Data Solutions at Intel and manages the go-to-market relationship between Intel and Cloudera and their shared partner ecosystem. Brandon has over 15 years of experience in a variety of enterprise technology disciplines and has held roles in engineering, product management, and strategy at Dell, Symantec, and Dorado Software.
13. Customers Are Struggling
Traditional Tools Aren’t Working
• Data integration and transformation workloads consume as much as 80% of EDW capacity
• 70% of all data warehouses are performance and capacity constrained
• #1 challenge: organizations cite TCO as the biggest obstacle to data integration tools
Source: Gartner, “The State of Data Warehousing in 2014,” June 19, 2014
14. #1 Use Case for Hadoop: Data Warehouse Optimization (ETL Offload)
Customer challenge: processing and storing ever-increasing data volumes with traditional enterprise data warehouses and related data integration technology, and their legacy pricing models, is taxing stagnant IT budgets.
61% of customers who have implemented Hadoop have shifted one or more workloads from legacy data warehouses or mainframes to Hadoop; the most popular workloads being shifted are large-scale data transformations. (Syncsort Customer Survey, 2014)
15. Data challenges for operational efficiency
Operational efficiency:
• Connect: unify all data from disparate tables and sources to reduce existing system load and data transformation costs
• Analyze: deliver streamlined business reporting even with existing analytical tools
• Act: use better, faster reporting for improved data-driven decision making
Key use cases:
• Data warehouse acceleration
• Log aggregation (see the sketch after this list)
• Data pipeline modernization
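As an illustration of the log aggregation use case named above, here is a minimal PySpark sketch. The log format, regular expressions, and paths are assumptions rather than part of the original material.

# Sketch: aggregate web server logs in Hadoop instead of loading them into the EDW.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("log-aggregation-sketch").getOrCreate()

# Raw log lines landed in HDFS (assumed common/combined log format).
logs = spark.read.text("hdfs:///landing/weblogs/*.log")

# Parse the request URL and status code out of each line.
parsed = logs.select(
    F.regexp_extract("value", r'"\w+ (\S+) HTTP', 1).alias("url"),
    F.regexp_extract("value", r'" (\d{3}) ', 1).cast("int").alias("status"))

# Roll up server-error counts per URL for downstream reporting.
error_counts = (parsed.filter(F.col("status") >= 500)
    .groupBy("url").count()
    .orderBy(F.desc("count")))

error_counts.write.mode("overwrite").parquet("hdfs:///analytics/error_counts")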
16. Mark Muncy, Syncsort
Technical Product Marketing Manager – Big Data, Syncsort
Mark Muncy leads Technical Product Marketing for Syncsort’s Big Data portfolio, working with technical and client-facing teams to deliver high-value solutions to the most data-intensive companies in the world. Mark brings to his current role over a decade of hands-on experience in data architecture and ETL development in the gaming, data services, and financial services industries.
17. Modernize the Data Pipeline with Hadoop
Traditional data pipeline: too many workloads in the EDW
• Disparate data sources feed a data staging tool that extracts and loads the data, then cleans and parses it.
• The enterprise data warehouse carries both the data transformation (ETL) jobs and the business reporting queries, straining its performance and capacity.
• The results: longer data transformation job times, missed SLAs for business reporting, slow ad hoc queries, and scaling that is too costly.
Modern data pipeline: Hadoop + ETL
• Disparate data sources land in Hadoop, where the data transformation jobs (clean, parse, transform) run.
• The enterprise data warehouse is reserved for business reporting and query, freeing its performance and capacity.
• The results: reduced data transformation job times, improved SLAs for business reporting, fast ad hoc queries, and economical scaling.
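Here is a minimal sketch of the modernized hand-off, assuming the transformation work already runs in Hadoop (as in the earlier sketch) and only the reporting-ready result flows into the EDW. The JDBC endpoint and table names are placeholders, and commercial tools such as DMX-h generate equivalent jobs without hand-written code.

# Sketch: keep transformation jobs in Hadoop and ship only query-ready data to the EDW.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("pipeline-modernization-sketch").getOrCreate()

# Refined data produced by the offloaded transformation jobs.
staged = spark.read.parquet("hdfs:///refined/daily_revenue")

# Heavy aggregation stays on the cluster, where it scales economically.
report = staged.groupBy("store_id").agg(F.sum("revenue").alias("monthly_revenue"))

# The EDW receives only this compact table for business reporting and ad hoc query.
(report.write.format("jdbc")
    .option("url", "jdbc:postgresql://edw-host:5432/reporting")  # stand-in for the EDW endpoint
    .option("dbtable", "reporting.monthly_revenue")
    .option("user", "etl").option("password", "secret")
    .mode("append")
    .save())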
18. Syncsort DMX-h: A Complete Solution for Hadoop
Connect, Transform, Optimize
• Smarter architecture: the engine runs natively within MapReduce and Spark
• Smarter connectivity: connect streaming and batch data sources across the organization, including mainframe, NoSQL, and everything in between
• Smarter development: a GUI for developing and maintaining Hadoop data pipelines
• Smarter productivity: use-case accelerators to fast-track development
• Enterprise-grade solution: integrated support for Cloudera Navigator, Sentry, Kerberos, and LDAP
Design once, deploy anywhere
• Frees users from the underlying complexities of Hadoop
• Intelligent Execution dynamically optimizes the job for any platform, on premises or in the cloud
• Future-proof your applications
19. Operational efficiency architecture
Source: operational data sources, enterprise data warehouse, relational database management system, data mart
1. Connect: extract, translate, and load (sort, aggregate, group, parse, clean, translate; sketched below)
2. Analyze: enterprise data warehouse, relational database management system, data mart, business reporting and query (e.g., Microsoft APS, SAP HANA)
3. Act: price optimization, improved forecasting, uptime optimization, accelerated response, faster reporting, improved service levels
Foundation: Dell | Cloudera | Syncsort | Intel infrastructure, management, services, security, and Dell Financial Services
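For illustration, the connect-stage verbs above map roughly onto DataFrame operations, as in the following PySpark sketch; the event data, columns, and paths are hypothetical.

# Sketch: sort, aggregate, group, parse, clean, and translate as DataFrame operations.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("etl-verbs-sketch").getOrCreate()
src = spark.read.option("header", "true").csv("hdfs:///landing/service_events/*.csv")

events = (src
    .withColumn("event_ts", F.to_timestamp("event_ts"))    # parse timestamps
    .dropna(subset=["asset_id"])                            # clean: drop incomplete rows
    .dropDuplicates(["event_id"])                           # clean: remove duplicates
    .withColumn("severity", F.upper(F.col("severity"))))    # translate/standardize codes

summary = (events
    .groupBy("asset_id", "severity")                        # group
    .agg(F.count("*").alias("event_count"))                 # aggregate
    .orderBy("asset_id", "severity"))                       # sort

summary.write.mode("overwrite").parquet("hdfs:///analytics/asset_event_summary")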
20. Redeploying talent / reducing staff costs
Save time and cost on Hadoop ETL jobs: an entry-level employee using the Dell | Cloudera | Syncsort solution for Hadoop could save 76.3% over three years compared to a senior engineer (contractor) using a DIY, open-source approach.
Total administrative costs over three years to design 4 ETL jobs per month:
• Expert cost (contractor): $559,298
• Expert cost (employee): $279,149
• Beginner cost: $132,326
21. Entry Level vs. Senior Engineer: Complete Hadoop jobs faster
Time to complete ETL jobs, entry-level new hires (using the solution) vs. experienced senior engineers (DIY):
• Data validation and pre-processing: 6 min, 15 sec vs. 15 min, 45 sec (60.3% less time)
• Fact dimension load with type 2 SCD: 30 min, 11 sec vs. 36 min, 39 sec (17.6% less time)
• Vendor mainframe file integration: 4 min, 48 sec vs. 5 min, 51 sec (17.9% less time)
22. Reclaim days of valuable time
Using the Dell | Cloudera | Syncsort solution for Hadoop, the entry-level technician developed and deployed the Hadoop ETL jobs (fact dimension load with type 2 SCD, data validation and pre-processing, and vendor mainframe file integration) in 53.7% less time: 3.8 days in total, versus 8.3 days for the DIY approach.