
SQL on Hadoop for Enterprise Analytics


Hear why SQL on Hadoop is the future of analytics. Mike Xu, Looker Data Architect and Evangelist, will share how recent updates to SQL query engines like Spark and Presto are finally allowing companies to harness Hadoop's processing power for analytics. Looker Data Analyst Eric Feinstein will share how a top-10 health insurance company built an in-cluster data platform, using Looker to make all their data in Hadoop accessible to thousands of analysts and business users across the company every day.

In this webinar you will:
• Learn the fundamentals of modeling data out of Hadoop, including Spark
• Hear best practices for navigating joins
• See case study demonstrations



  1. A LITTLE BIT OF HISTORY. Everything old is new again. SQL forever.
  2. The story so far: Why hasn't SQL died yet? It's 2016 and we're still using it?!
  3. Everything old is new again: existing architectures keep reappearing. It takes time to figure out which tools are right for which jobs. SQL is still the best tool for business analytics.
  4. A long long time ago…
  5. Growing pains (late 1990s)
  6. Database problems: database outages, data integrity issues, data latency (late 1990s)
  7. Master/slave (late 1990s)
  8. Transactions (late 1990s)
  9. Performance (late 1990s)
  10. By the time I graduated, SQL was on its last legs (2009)
  11. Cache all the things! (2009)
  12. Stop copying Twitter! (2009)
  13. SQL's golden age ends, NoSQL takes off (2010): column, graph, key-value, document
  14. NoSQL (2010)
  15. Awesome things about NoSQL: no SQL, normal languages as APIs! Non-relational! FAST! (2010)
  16. Remember ORMs? (~2000)
  17. Active Record (~2000)
  18. ORMs 👎 (2011)
  19. Remember EAV (Entity-Attribute-Value)? (1968)
  20. Kind of looks like columns… (1968)
  21. Modern EAV (2010)
  22. Tedious to query (2010)
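To see why slide 22 calls EAV tedious to query: every attribute you read back costs another self-join. Below is a minimal, runnable sketch using Python's built-in sqlite3; the table and column names are invented for illustration, not taken from the deck.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE eav (entity INTEGER, attribute TEXT, value TEXT);
        INSERT INTO eav VALUES
            (1, 'name', 'Ada'),   (1, 'email', 'ada@example.com'),
            (2, 'name', 'Grace'), (2, 'email', 'grace@example.com');
    """)

    # In a conventional table this would just be: SELECT name, email FROM users.
    # In EAV, each additional attribute needs one more self-join.
    rows = conn.execute("""
        SELECT n.entity, n.value AS name, e.value AS email
        FROM eav AS n
        JOIN eav AS e ON e.entity = n.entity AND e.attribute = 'email'
        WHERE n.attribute = 'name'
    """).fetchall()

    print(rows)  # e.g. [(1, 'Ada', 'ada@example.com'), (2, 'Grace', 'grace@example.com')]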
  23. Voila! (2010)
  24. No joins is a feature! (2010)
  25. NoSQL has some rough bumps (2010)
  26. NoSQL has A LOT of rough bumps… (2011)
  27. Throwback Thursday! (2011)
  28. Lock the doors (2011)
  29. MPP columnar DBs! Wait… SQL is back?! (2015)
  30. SQL on Hadoop (2016)
  31. A long long time ago…
  32. What's next? (~2020?)
  33. What's next? (~2020?) "If you have an architecture where you're periodically trying to dump from one system to the other and synchronize, you can simplify your life quite a bit by just putting your data in this storage system called Kudu." – Todd Lipcon
  34. SQL is far past the hype
  35. Fin. "If it ain't broke, don't fix it."
  36. CUSTOMER STORY: Building an event analytics pipeline using Hadoop and Spark
  37. Why consider a big data pipeline? You are rapidly exceeding the limits of your existing database. Everything on your website can be analyzed, and waiting until the next day isn't for you. Data comes and goes to many places, and you want one process for it.
  38. Big data culture: Summary data is not good enough. The company is mandating new technologies. You want to build a data-driven culture. Big SQL is the heart of a data-driven culture.
  39. CASE STUDY: A major healthcare provider wants to create a web event pipeline that: scales massively during periods of healthcare registration and new coverage starts, and can dial back the rest of the year; handles large data volumes (10-15M customers' worth of data) while providing data for analysis in under 1 minute; AND utilizes existing in-house technologies (such as Cloudera Impala). All events processed: page loads, registrations, logins, errors.
  40. Solution: build an event processing framework. Events → Event Collector → Hadoop → ?
  41. High-level process: Events → Event Collector → Message Processing (to be designed) → HDFS → Looker
  42. Why is Hadoop so hard? Need to write in Java and Scala. We don't have structure. Not easy to get data out into BI tools. Event collectors don't tend to feed HDFS out of the box. Typically follows a batch processing framework.
  43. Ingestion mechanism: our ideal ingestion would have three key aspects: low latency; in-flight transformation and processing; the ability to populate multiple destinations.
  44. Spark vs. Storm: two of the major players in data streaming/processing. Spark: own master server; runs on HDFS; micro-batching; exactly-once delivery (eliminates a vulnerability). Storm: not native to Hadoop; less developed; one-at-a-time processing; ETL in flight; sub-second latency.
  45. Flume: Source → Interceptor → Selector → Channel → Sinks, all managed by the Flume agent, with web servers feeding a channel that sinks to HDFS. No in-flight transformation, so this just needs to meet the workload.
  46. Kafka: producers publish to a cluster of brokers coordinated by ZooKeeper; consumers such as Spark Streaming and others read from the brokers.
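The producer side of slide 46, as a hedged sketch: the deck doesn't name a client library, so this assumes the kafka-python package, and the web_events topic name and broker addresses are hypothetical.

    import json
    from kafka import KafkaProducer

    # Connect to the broker cluster; ZooKeeper coordinates the brokers,
    # but producers only need the broker addresses themselves.
    producer = KafkaProducer(
        bootstrap_servers=["broker1:9092", "broker2:9092"],
        value_serializer=lambda event: json.dumps(event).encode("utf-8"),
    )

    # The event collector would publish each page load, registration,
    # login, or error as one message.
    producer.send("web_events", {"type": "pageload", "user_id": 42, "ts": 1456790400})
    producer.flush()  # block until buffered messages reach the brokers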
  47. Flume vs. Kafka? Use both: it works out of the box with Flafka and native connectors. Rather than writing custom connectors between Flume, Kafka, and Spark, the native Flume Kafka source wires them together directly.
  48. Storing the output: our streaming Spark cluster consumes messages from Kafka, and we batch these every minute into an HDFS cluster. We chose this because: data can be queried via Hive, Impala, or Spark SQL; Cloudera is our enterprise choice; we can process a subset in-stream with MLlib or other machine learning algorithms; and we can output summaries to other RDBMS systems.
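A minimal sketch of slide 48's consume-and-batch step, using the Spark 1.x streaming API that was current when this deck was presented; the topic name, broker list, and HDFS path are assumptions for illustration.

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kafka import KafkaUtils

    sc = SparkContext(appName="EventPipeline")
    ssc = StreamingContext(sc, 60)  # one micro-batch per minute

    # Read the collector's events straight from the Kafka brokers.
    stream = KafkaUtils.createDirectStream(
        ssc,
        topics=["web_events"],
        kafkaParams={"metadata.broker.list": "broker1:9092,broker2:9092"},
    )

    # Messages arrive as (key, value) pairs; keep the payload and write
    # each one-minute batch out under HDFS, where Hive, Impala, or
    # Spark SQL can query it.
    stream.map(lambda kv: kv[1]).saveAsTextFiles("hdfs:///data/web_events/batch")

    ssc.start()
    ssc.awaitTermination()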
  49. Final result: Events → Event Collector → Kafka/Flume → Spark SQL → Cloudera, with other storage (RDBMS) and other storage (logs).
  50. Pipeline summary: our pipeline provides several points of flexibility and meets our key priorities. Add data to any point of the pipeline (Kafka, Flume, Impala, Looker) without many custom connectors. The pipeline includes additional sources like Teradata and Oracle. Add in-flight predictive model training and execution without significant additional processing time.
  51. Priority #1: Scale. Kafka is easy to scale: as more volume comes in, adding new brokers can be automated using the Partition Reassignment Tool. By monitoring batch times in Looker on Spark SQL, we can alert when we need to scale up the cluster using Scheduled Looks.
  52. Priority #2: Flexibility. Different events can be parsed out to different Spark Streaming applications with Kafka topics (or another type of consumer). Add more data at any point (Flume, a Kafka producer, or directly to Spark). Looker connects to wherever the data lands, as long as we can query it. Perform analysis IN CLUSTER.
  53. Priority #3: Speed. Analyzing the stream: events per hour; identifying missing batches; volume and timing; right-sizing hardware; duplicate events and missing information.
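One way to run slide 53's checks, sketched with the Spark 1.x-era HiveContext; the web_events table and its ts and event_id columns are assumptions, and the same query would work against Hive or Impala.

    from pyspark import SparkContext
    from pyspark.sql import HiveContext

    sc = SparkContext(appName="StreamChecks")
    sqlContext = HiveContext(sc)

    # Events per hour, plus a duplicate count per hour; an hour that is
    # missing from the result (or whose count collapses) points at a
    # lost batch worth alerting on.
    hourly = sqlContext.sql("""
        SELECT from_unixtime(ts, 'yyyy-MM-dd HH:00') AS hour,
               COUNT(*)                              AS events,
               COUNT(*) - COUNT(DISTINCT event_id)   AS duplicates
        FROM web_events
        GROUP BY from_unixtime(ts, 'yyyy-MM-dd HH:00')
        ORDER BY hour
    """)
    hourly.show()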
  54. Priority #4: In-house technologies. Provide access to Hadoop/Impala via a centralized data hub: a single place to access web-based reports, explores, BI tools, and code libraries. Enable users to ask questions and query web data without writing SQL or knowing about the pipeline.
  55. Analyzing the stream: looking for lost data.
  56. Analyzing the stream: by connecting Looker to various points in the stream (Impala SQL, source logs, summary reports), we can verify complete loads. We also mask the location of information: one dashboard may show a variety of reliable sources.
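A hedged sketch of one such completeness check from slide 56: compare the event count Impala sees with the count the collector's source logs report for the same day. The impyla client, the table name, and the one-line-per-event log convention are all assumptions, not anything the deck specifies.

    from impala.dbapi import connect

    conn = connect(host="impalad.example.com", port=21050)
    cursor = conn.cursor()
    cursor.execute(
        "SELECT COUNT(*) FROM web_events WHERE event_date = '2016-03-01'"
    )
    impala_count = cursor.fetchone()[0]

    # Hypothetical convention: the collector writes one log line per
    # accepted event, so line count == event count.
    with open("/var/log/collector/2016-03-01.log") as log:
        source_count = sum(1 for _ in log)

    if impala_count != source_count:
        print("Possible lost data: source=%d, impala=%d"
              % (source_count, impala_count))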
  57. Other uses and benefits: match data in flight to find bad user accounts; in-flight alerts for missing data; analysis without needing to know the location in the stream; a SQL-on-Hadoop BI solution doesn't require a new skill set.
  58. THANK YOU!
  59. Sources:
      http://www.slideshare.net/Dataversity/thu-1200-penchikalasrinicolor
      http://seldo.com/weblog/2011/08/11/orm_is_an_antipattern
      http://mashable.com/2010/10/04/foursquare-downtime/#aPh4mhYxLSq6
      http://blogs.adobe.com/security/files/2011/04/NoSQL-But-Even-Less-Security.pdf
      http://www.sarahmei.com/blog/2013/11/11/why-you-should-never-use-mongodb
      https://www.percona.com
      http://techcrunch.com/
      http://mashable.com/
