SlideShare una empresa de Scribd logo
1 de 24
High Speed Smart Data Ingest
into Hadoop
Oleg Zhurakousky
Principal Architect @ Hortonworks
Twitter: z_oleg

© Hortonworks Inc. 2012
Watch the video with slide
synchronization on InfoQ.com!
http://www.infoq.com/presentations
/real-time-hadoop

InfoQ.com: News & Community Site
• 750,000 unique visitors/month
• Published in 4 languages (English, Chinese, Japanese and Brazilian
Portuguese)
• Post content from our QCon conferences
• News 15-20 / week
• Articles 3-4 / week
• Presentations (videos) 12-15 / week
• Interviews 2-3 / week
• Books 1 / month
Presented at QCon New York
www.qconnewyork.com
Purpose of QCon
- to empower software development by facilitating the spread of
knowledge and innovation
Strategy
- practitioner-driven conference designed for YOU: influencers of
change and innovation in your teams
- speakers and topics driving the evolution and innovation
- connecting and catalyzing the influencers and innovators
Highlights
- attended by more than 12,000 delegates since 2007
- held in 9 cities worldwide
Agenda
• Domain Problem
– Event Sourcing
– Streaming Sources
– Why Hadoop?

• Architecture
– Compute Mechanics
– Why things are slow – understand the problem
– Mechanics of the ingest
– Write >= Read | Else ;(

– Achieving Mechanical Sympathy –
– Suitcase pattern – optimize Storage, I/O and CPU

© Hortonworks Inc. 2012
Domain Problem

© Hortonworks Inc. 2012
Domain Events
• Domain Events represent the state of entities at a
given time when an important event occurred
• Every state change of interest to the business is
materialized in a Domain Event
Retail & Web
Point of Sale
Web Logs

Energy
Metering
Diagnostics

Telco
Network Data
Device Interactions
Financial
Services
Trades
Market Data
© Hortonworks Inc. 2012

Healthcare
Clinical Trials
Claims
Page 4
Event Sourcing
• All Events are sent to an Event Processor
• The Event Processor stores all events in an Event Log
• Many different Event Listeners can be added to the
Event Processor (or listen directly on the Event Log)

Web

Messaging

Batch

© Hortonworks Inc. 2012

Search

Event
Processor

Analytics

Apps

Page 5
Streaming Sources
• When dealing with Event data (log messages, call
detail records, other event / message type data) there
are some important observations:
– Source data is streamed (not a static copy), thus always in motion
– More then one streaming source
– Streaming (by definition) never stops
– Events are rather small in size (up to several hundred bytes), but
there is a lot of them
Hadoop was not design with that in mind
Telco Use Case
E

E

Geos

© Hortonworks Inc. 2012

E

– Tailing the syslog of a network device
– 200K events per second per streaming source
– Bursts that can double or triple event production
Why Hadoop for Event Sourcing?
• Event data is immutable, an event log is append-only
• Event Sourcing is not just event storage, it is event
processing (HDFS + MapReduce)
• While event data is generally small in size, event
volumes are HUGE across the enterprise
• For an Event Sourcing Solution to deliver its benefits
to the enterprise, it must be able to retain all history
• An Event Sourcing solution is ecosystem dependent –
it must be easily integrated into the Enterprise
technology portfolio (management, data, reporting)

© Hortonworks Inc. 2012

Page 7
Ingest Architecture

© Hortonworks Inc. 2012
Mechanics of Data Ingest & Why and which
things are slow and/or inefficient ?
• Demo

© Hortonworks Inc. 2012

Page 9
Mechanics of a compute
Atos, Portos, Aramis and the D’Artanian – of the compute

© Hortonworks Inc. 2012
Why things are slow?
• I/O | Network bound
• Inefficient use of storage
• CPU is not helping
• Memory is not helping

Imbalanced resource usage leads to limited growth potentials
and misleading assumptions about the capability of yoru
hardware

© Hortonworks Inc. 2012

Page 11
Mechanics of the ingest
• Multiple sources of data (readers)
• Write (Ingest) >= Read (Source) | Else
• “Else” is never good
– Slows the everything down or
– Fills accumulation buffer to the capacity which slows everything
down
– Renders parts or whole system unavailable which slows or brings
everything down
– Etc…
“Else” is just not good!

© Hortonworks Inc. 2012

Page 12
Strive for Mechanical Sympathy
• Mechanical Sympathy – ability to write software with
deep understanding of its impact or lack of impact on
the hardware its running on
• http://mechanical-sympathy.blogspot.com/
• Ensure that all available resources (hardware and
software) work in balance to help achieve the end
goal.

© Hortonworks Inc. 2012

Page 13
Achieving mechanical sympathy during the
ingest?
• Demo

© Hortonworks Inc. 2012

Page 14
The Suitcase Pattern
• Storing event data in uncompressed form not feasible
– Event Data just keeps on coming, and storage is finite
– The more history we can keep in Hadoop the better our trend
analytics will be
– Not just a Storage problem, but also Processing
– I/O is #1 impact on both Ingest & MapReduce performance

• Suitcase Pattern
– Before we travel, we take our clothes off the
rack and pack them (easier to store)
– We then unpack them when we arrive and put
them back on the rack (easier to process)
– Consider event data “traveling” over the network to Hadoop
– we want to compress it before it makes the trip, but in a way that
facilitates how we intend to process it once it arrives
© Hortonworks Inc. 2012
What’s next?
• How much data is available or could be available to us
during the ingest is lost after the ingest?
• How valuable is the data available to us during the
ingest after the ingest completed?
• How expensive would it be to retain (not loose) the
data available during the ingest?

© Hortonworks Inc. 2012

Page 16
Data available during the ingest?
• Record count
• Highest/Lowest record length
• Average record length
• Compression ratio
But with a little more work. . .
• Column parsing
– Unique values
– Unique values per column
– Access to values of each column independently from the record
– Relatively fast column-based searches, without indexing
– Value encoding
– Etc…
Imagine the implications on later data access. . .
© Hortonworks Inc. 2012

Page 17
How expensive is to do that extra work?
• Demo

© Hortonworks Inc. 2012

Page 18
Conclusion - 1
• Hadoop wasn’t optimized to process <1 kB records
– Mappers are event-driven consumers: they do not have a running
thread until a callback thread delivers a message
– Message delivery is an event that triggers the Mapper into action

• “Don’t wake me up unless it is important”
– In Hadoop, generally speaking,
several thousand bytes to
several hundred thousand bytes
is deemed important
– Buffering records during collection allows us to collect more data
before waking up the elephant
– Buffering records during collection also allows us to compress the
whole block of records as a single record to be sent over the
network to Hadoop – resulting in lower network and file I/O
– Buffering records during collection also helps us handle bursts
© Hortonworks Inc. 2012
Conclusion - 2
• Understand your SLAs, but think ahead
– While you may have accomplished your SLA for the ingest your
overall system may still be in the unbalanced state
– Don’t settle on “throw more hardware” until you can validate that
you have approached the limits of your existing hardware
– Strive for mechanical sympathy. Profile your software to
understand its impact on the hardware its running on.
– Don’t think about the ingest in isolation. Think how the ingested
data is going to be used.
– Analyze, the information that is available to you during the ingest
and devise a convenient mechanism for storing it. Your data
analysts will thank you for that.

© Hortonworks Inc. 2012

Page 20
Thank You!
Questions & Answers
Follow: @z_oleg

© Hortonworks Inc. 2012

Page 21
Watch the video with slide synchronization on
InfoQ.com!
http://www.infoq.com/presentations/realtime-hadoop

Más contenido relacionado

Más de C4Media

High Performing Teams Act Like Owners
High Performing Teams Act Like OwnersHigh Performing Teams Act Like Owners
High Performing Teams Act Like OwnersC4Media
 
Does Java Need Inline Types? What Project Valhalla Can Bring to Java
Does Java Need Inline Types? What Project Valhalla Can Bring to JavaDoes Java Need Inline Types? What Project Valhalla Can Bring to Java
Does Java Need Inline Types? What Project Valhalla Can Bring to JavaC4Media
 
Service Meshes- The Ultimate Guide
Service Meshes- The Ultimate GuideService Meshes- The Ultimate Guide
Service Meshes- The Ultimate GuideC4Media
 
Shifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CDShifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CDC4Media
 
CI/CD for Machine Learning
CI/CD for Machine LearningCI/CD for Machine Learning
CI/CD for Machine LearningC4Media
 
Fault Tolerance at Speed
Fault Tolerance at SpeedFault Tolerance at Speed
Fault Tolerance at SpeedC4Media
 
Architectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep SystemsArchitectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep SystemsC4Media
 
ML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.jsML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.jsC4Media
 
Build Your Own WebAssembly Compiler
Build Your Own WebAssembly CompilerBuild Your Own WebAssembly Compiler
Build Your Own WebAssembly CompilerC4Media
 
User & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix ScaleUser & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix ScaleC4Media
 
Scaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's EdgeScaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's EdgeC4Media
 
Make Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home EverywhereMake Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home EverywhereC4Media
 
The Talk You've Been Await-ing For
The Talk You've Been Await-ing ForThe Talk You've Been Await-ing For
The Talk You've Been Await-ing ForC4Media
 
Future of Data Engineering
Future of Data EngineeringFuture of Data Engineering
Future of Data EngineeringC4Media
 
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and MoreAutomated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and MoreC4Media
 
Navigating Complexity: High-performance Delivery and Discovery Teams
Navigating Complexity: High-performance Delivery and Discovery TeamsNavigating Complexity: High-performance Delivery and Discovery Teams
Navigating Complexity: High-performance Delivery and Discovery TeamsC4Media
 
High Performance Cooperative Distributed Systems in Adtech
High Performance Cooperative Distributed Systems in AdtechHigh Performance Cooperative Distributed Systems in Adtech
High Performance Cooperative Distributed Systems in AdtechC4Media
 
Rust's Journey to Async/await
Rust's Journey to Async/awaitRust's Journey to Async/await
Rust's Journey to Async/awaitC4Media
 
Opportunities and Pitfalls of Event-Driven Utopia
Opportunities and Pitfalls of Event-Driven UtopiaOpportunities and Pitfalls of Event-Driven Utopia
Opportunities and Pitfalls of Event-Driven UtopiaC4Media
 
Datadog: a Real-Time Metrics Database for One Quadrillion Points/Day
Datadog: a Real-Time Metrics Database for One Quadrillion Points/DayDatadog: a Real-Time Metrics Database for One Quadrillion Points/Day
Datadog: a Real-Time Metrics Database for One Quadrillion Points/DayC4Media
 

Más de C4Media (20)

High Performing Teams Act Like Owners
High Performing Teams Act Like OwnersHigh Performing Teams Act Like Owners
High Performing Teams Act Like Owners
 
Does Java Need Inline Types? What Project Valhalla Can Bring to Java
Does Java Need Inline Types? What Project Valhalla Can Bring to JavaDoes Java Need Inline Types? What Project Valhalla Can Bring to Java
Does Java Need Inline Types? What Project Valhalla Can Bring to Java
 
Service Meshes- The Ultimate Guide
Service Meshes- The Ultimate GuideService Meshes- The Ultimate Guide
Service Meshes- The Ultimate Guide
 
Shifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CDShifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CD
 
CI/CD for Machine Learning
CI/CD for Machine LearningCI/CD for Machine Learning
CI/CD for Machine Learning
 
Fault Tolerance at Speed
Fault Tolerance at SpeedFault Tolerance at Speed
Fault Tolerance at Speed
 
Architectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep SystemsArchitectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep Systems
 
ML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.jsML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.js
 
Build Your Own WebAssembly Compiler
Build Your Own WebAssembly CompilerBuild Your Own WebAssembly Compiler
Build Your Own WebAssembly Compiler
 
User & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix ScaleUser & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix Scale
 
Scaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's EdgeScaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's Edge
 
Make Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home EverywhereMake Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home Everywhere
 
The Talk You've Been Await-ing For
The Talk You've Been Await-ing ForThe Talk You've Been Await-ing For
The Talk You've Been Await-ing For
 
Future of Data Engineering
Future of Data EngineeringFuture of Data Engineering
Future of Data Engineering
 
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and MoreAutomated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
 
Navigating Complexity: High-performance Delivery and Discovery Teams
Navigating Complexity: High-performance Delivery and Discovery TeamsNavigating Complexity: High-performance Delivery and Discovery Teams
Navigating Complexity: High-performance Delivery and Discovery Teams
 
High Performance Cooperative Distributed Systems in Adtech
High Performance Cooperative Distributed Systems in AdtechHigh Performance Cooperative Distributed Systems in Adtech
High Performance Cooperative Distributed Systems in Adtech
 
Rust's Journey to Async/await
Rust's Journey to Async/awaitRust's Journey to Async/await
Rust's Journey to Async/await
 
Opportunities and Pitfalls of Event-Driven Utopia
Opportunities and Pitfalls of Event-Driven UtopiaOpportunities and Pitfalls of Event-Driven Utopia
Opportunities and Pitfalls of Event-Driven Utopia
 
Datadog: a Real-Time Metrics Database for One Quadrillion Points/Day
Datadog: a Real-Time Metrics Database for One Quadrillion Points/DayDatadog: a Real-Time Metrics Database for One Quadrillion Points/Day
Datadog: a Real-Time Metrics Database for One Quadrillion Points/Day
 

Último

Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 

Último (20)

Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 

High Speed Smart Data Ingest into Hadoop

  • 1. High Speed Smart Data Ingest into Hadoop Oleg Zhurakousky Principal Architect @ Hortonworks Twitter: z_oleg © Hortonworks Inc. 2012
  • 2. Watch the video with slide synchronization on InfoQ.com! http://www.infoq.com/presentations /real-time-hadoop InfoQ.com: News & Community Site • 750,000 unique visitors/month • Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese) • Post content from our QCon conferences • News 15-20 / week • Articles 3-4 / week • Presentations (videos) 12-15 / week • Interviews 2-3 / week • Books 1 / month
  • 3. Presented at QCon New York www.qconnewyork.com Purpose of QCon - to empower software development by facilitating the spread of knowledge and innovation Strategy - practitioner-driven conference designed for YOU: influencers of change and innovation in your teams - speakers and topics driving the evolution and innovation - connecting and catalyzing the influencers and innovators Highlights - attended by more than 12,000 delegates since 2007 - held in 9 cities worldwide
  • 4. Agenda • Domain Problem – Event Sourcing – Streaming Sources – Why Hadoop? • Architecture – Compute Mechanics – Why things are slow – understand the problem – Mechanics of the ingest – Write >= Read | Else ;( – Achieving Mechanical Sympathy – – Suitcase pattern – optimize Storage, I/O and CPU © Hortonworks Inc. 2012
  • 6. Domain Events • Domain Events represent the state of entities at a given time when an important event occurred • Every state change of interest to the business is materialized in a Domain Event Retail & Web Point of Sale Web Logs Energy Metering Diagnostics Telco Network Data Device Interactions Financial Services Trades Market Data © Hortonworks Inc. 2012 Healthcare Clinical Trials Claims Page 4
  • 7. Event Sourcing • All Events are sent to an Event Processor • The Event Processor stores all events in an Event Log • Many different Event Listeners can be added to the Event Processor (or listen directly on the Event Log) Web Messaging Batch © Hortonworks Inc. 2012 Search Event Processor Analytics Apps Page 5
  • 8. Streaming Sources • When dealing with Event data (log messages, call detail records, other event / message type data) there are some important observations: – Source data is streamed (not a static copy), thus always in motion – More then one streaming source – Streaming (by definition) never stops – Events are rather small in size (up to several hundred bytes), but there is a lot of them Hadoop was not design with that in mind Telco Use Case E E Geos © Hortonworks Inc. 2012 E – Tailing the syslog of a network device – 200K events per second per streaming source – Bursts that can double or triple event production
  • 9. Why Hadoop for Event Sourcing? • Event data is immutable, an event log is append-only • Event Sourcing is not just event storage, it is event processing (HDFS + MapReduce) • While event data is generally small in size, event volumes are HUGE across the enterprise • For an Event Sourcing Solution to deliver its benefits to the enterprise, it must be able to retain all history • An Event Sourcing solution is ecosystem dependent – it must be easily integrated into the Enterprise technology portfolio (management, data, reporting) © Hortonworks Inc. 2012 Page 7
  • 11. Mechanics of Data Ingest & Why and which things are slow and/or inefficient ? • Demo © Hortonworks Inc. 2012 Page 9
  • 12. Mechanics of a compute Atos, Portos, Aramis and the D’Artanian – of the compute © Hortonworks Inc. 2012
  • 13. Why things are slow? • I/O | Network bound • Inefficient use of storage • CPU is not helping • Memory is not helping Imbalanced resource usage leads to limited growth potentials and misleading assumptions about the capability of yoru hardware © Hortonworks Inc. 2012 Page 11
  • 14. Mechanics of the ingest • Multiple sources of data (readers) • Write (Ingest) >= Read (Source) | Else • “Else” is never good – Slows the everything down or – Fills accumulation buffer to the capacity which slows everything down – Renders parts or whole system unavailable which slows or brings everything down – Etc… “Else” is just not good! © Hortonworks Inc. 2012 Page 12
  • 15. Strive for Mechanical Sympathy • Mechanical Sympathy – ability to write software with deep understanding of its impact or lack of impact on the hardware its running on • http://mechanical-sympathy.blogspot.com/ • Ensure that all available resources (hardware and software) work in balance to help achieve the end goal. © Hortonworks Inc. 2012 Page 13
  • 16. Achieving mechanical sympathy during the ingest? • Demo © Hortonworks Inc. 2012 Page 14
  • 17. The Suitcase Pattern • Storing event data in uncompressed form not feasible – Event Data just keeps on coming, and storage is finite – The more history we can keep in Hadoop the better our trend analytics will be – Not just a Storage problem, but also Processing – I/O is #1 impact on both Ingest & MapReduce performance • Suitcase Pattern – Before we travel, we take our clothes off the rack and pack them (easier to store) – We then unpack them when we arrive and put them back on the rack (easier to process) – Consider event data “traveling” over the network to Hadoop – we want to compress it before it makes the trip, but in a way that facilitates how we intend to process it once it arrives © Hortonworks Inc. 2012
  • 18. What’s next? • How much data is available or could be available to us during the ingest is lost after the ingest? • How valuable is the data available to us during the ingest after the ingest completed? • How expensive would it be to retain (not loose) the data available during the ingest? © Hortonworks Inc. 2012 Page 16
  • 19. Data available during the ingest? • Record count • Highest/Lowest record length • Average record length • Compression ratio But with a little more work. . . • Column parsing – Unique values – Unique values per column – Access to values of each column independently from the record – Relatively fast column-based searches, without indexing – Value encoding – Etc… Imagine the implications on later data access. . . © Hortonworks Inc. 2012 Page 17
  • 20. How expensive is to do that extra work? • Demo © Hortonworks Inc. 2012 Page 18
  • 21. Conclusion - 1 • Hadoop wasn’t optimized to process <1 kB records – Mappers are event-driven consumers: they do not have a running thread until a callback thread delivers a message – Message delivery is an event that triggers the Mapper into action • “Don’t wake me up unless it is important” – In Hadoop, generally speaking, several thousand bytes to several hundred thousand bytes is deemed important – Buffering records during collection allows us to collect more data before waking up the elephant – Buffering records during collection also allows us to compress the whole block of records as a single record to be sent over the network to Hadoop – resulting in lower network and file I/O – Buffering records during collection also helps us handle bursts © Hortonworks Inc. 2012
  • 22. Conclusion - 2 • Understand your SLAs, but think ahead – While you may have accomplished your SLA for the ingest your overall system may still be in the unbalanced state – Don’t settle on “throw more hardware” until you can validate that you have approached the limits of your existing hardware – Strive for mechanical sympathy. Profile your software to understand its impact on the hardware its running on. – Don’t think about the ingest in isolation. Think how the ingested data is going to be used. – Analyze, the information that is available to you during the ingest and devise a convenient mechanism for storing it. Your data analysts will thank you for that. © Hortonworks Inc. 2012 Page 20
  • 23. Thank You! Questions & Answers Follow: @z_oleg © Hortonworks Inc. 2012 Page 21
  • 24. Watch the video with slide synchronization on InfoQ.com! http://www.infoq.com/presentations/realtime-hadoop