This document describes a customer success story involving Cloudera and Xpand IT. It discusses how Xpand IT developed a solution to provide near real-time monitoring and management of Hadoop clusters. The solution involves collecting telemetry data from Hadoop jobs, storing it in Kafka for real-time access, and using Spark to parse the logs and load data into Impala and HBase. This allows for real-time monitoring and control of ETL jobs across multiple Hadoop components in a fault-tolerant manner. The architecture is designed according to lambda architecture principles to handle both real-time and batch data processing.
Cloudera Customer Success Story
1. Customer Success Story
Cloudera & Xpand IT
Nuno Barreto
Associate Partner & Big Data Lead
nuno.barreto@xpand-it.com
Proprietary & Confidential www.xpand-it.com
2. THE PROBLEM
How is process Y progressing?
Who are the main cluster users/departments?
Which engines does each department use?
Do I need to plan on an upgrade?
How much is process X costing me?
Are there available time slots?
7. REAL-TIME & STREAMING
[Architecture diagram. Components: CORE AGENT(s), QUEUE, REAL-TIME, ONLINE DB, ANALYTICS REPO, ETL, PDI extension, ANALYTICS, ANALYTICS DB. Flow labels: start/stop jobs, log, flow control, status check, metadata access, data access, operational data, analytics data, analytical queries.]
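The split between an online store for operational data and a repository for analytics data can be illustrated with a minimal sketch. It is not taken from the slides: the class names `OnlineStore` and `AnalyticsRepo`, the `route` and `runs_per_job` helpers, and the event fields `job` and `status` are all hypothetical, and plain in-memory Python objects stand in for the real queue and databases.

```python
from collections import defaultdict


class OnlineStore:
    """Hypothetical stand-in for the online DB: keeps only the latest status per job."""

    def __init__(self):
        self.latest = {}

    def upsert(self, event):
        # Overwrite the previous status so status checks are a single lookup.
        self.latest[event["job"]] = event["status"]


class AnalyticsRepo:
    """Hypothetical stand-in for the analytics repository: append-only history."""

    def __init__(self):
        self.history = []

    def append(self, event):
        self.history.append(event)


def route(events, online, repo):
    """Send every event to both stores (assumed fan-out, not confirmed by the slides)."""
    for event in events:
        online.upsert(event)
        repo.append(event)


def runs_per_job(repo):
    """Example analytical query: count how many times each job was started."""
    counts = defaultdict(int)
    for event in repo.history:
        if event["status"] == "started":
            counts[event["job"]] += 1
    return dict(counts)
```

In this sketch a status check reads `online.latest`, while questions such as "how often does process X run?" are answered from the full history in the repository.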
11. COLLECT LOG DATA IN (AS) REALTIME (AS POSSIBLE)
SPARK AS KAFKA COLLECTOR
REAL TIME LOG PARSING
ETL TOOL ADAPTABLE
DATA DUMPS IN IMPALA AND HBASE
GENERATES NOTIFICATIONS
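As a rough illustration of the log parsing and notification steps listed above, the following sketch turns raw log lines into structured events and flags the ones that would trigger a notification. The slides do not show the actual log format or the Spark and Kafka code, so everything here is an assumption: the line layout in `LOG_PATTERN`, the `LogEvent` fields, and the rule that `ERROR` and `FATAL` lines raise a notification. Plain Python lists stand in for the Kafka stream.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical log line layout: "<timestamp> <LEVEL> job=<name> <message>".
# The real format used by the ETL tool is not shown in the slides.
LOG_PATTERN = re.compile(
    r"^(?P<timestamp>\S+)\s+(?P<level>[A-Z]+)\s+job=(?P<job>\S+)\s+(?P<message>.*)$"
)


@dataclass(frozen=True)
class LogEvent:
    timestamp: str
    level: str
    job: str
    message: str


def parse_line(line: str) -> Optional[LogEvent]:
    """Parse one raw log line; return None when the line does not match the layout."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return None
    return LogEvent(**match.groupdict())


def needs_notification(event: LogEvent) -> bool:
    """Assumed rule: only ERROR and FATAL events generate a notification."""
    return event.level in {"ERROR", "FATAL"}


def process(lines):
    """Return (all parsed events, events that should trigger a notification)."""
    events = [e for e in map(parse_line, lines) if e is not None]
    alerts = [e for e in events if needs_notification(e)]
    return events, alerts
```

Keeping the pattern in one place is one way to read "ETL tool adaptable": supporting a different tool would mean swapping `LOG_PATTERN` rather than the rest of the pipeline.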
14. DISCLAIMER
What you are about to see is a Work In Progress, so be gentle in case…
• the demo doesn’t work
• features don’t work as described
• the connection goes down