SlideShare una empresa de Scribd logo
1 de 24
H2O on Hadoop
[ H2O – The Open Source In-Memory
Prediction Engine for Big Data ]

Tom Kraljevic
Outline

H2O as a Standalone HDFS client
H2O on Hadoop
Configuration settings
Stuff we learned
Questions
Using H2O as an HDFS client

Why
I want to do run a Generalized Linear Model, and my data lives in
HDFS.

What
Use H2O standalone.

Why
Let’s also use the CPUs and Memory of my Hadoop cluster!

What
Run H2O on Hadoop.
H2O Standalone Deployment Using HDFS

java -jar h2o.jar

java -jar h2o.jar
H2O on Hadoop Deployment
H2O on Hadoop Deployment
H2O on Hadoop Deployment
Standalone Mode Command-Line Invocation

unzip

java -jar
H2O on Hadoop Client Invocation
tomk@mr-0xb1:~/h2o-2.1.0.1144/hadoop$ hadoop

jar h2odriver_cdh4.jar water.hadoop.h2odriver
mapperXmx 30g -nodes 3 -output hdfsOutDir
Determining driver host interface for mapper->driver callback...
[Possible callback IP address: 192.168.1.161]
[Possible callback IP address: 127.0.0.1]
Using mapper->driver callback IP address and port: 192.168.1.161:60576
(You can override these with -driverif and -driverport.)
Driver program compiled with MapReduce V1 (Classic)
13/12/12 13:25:29 WARN conf.Configuration: mapred.map.child.java.opts is deprecated. Instead, use mapreduce.map.java.opts
Memory Settings:
mapred.child.java.opts:
-Xms30g -Xmx30g
mapred.map.child.java.opts: -Xms30g -Xmx30g
Extra memory percent:
10
mapreduce.map.memory.mb: 33792
Job name 'H2O_49792' submitted
JobTracker job ID is 'job_1386713862878_0002'
Waiting for H2O cluster to come up...
H2O node 192.168.1.163:54321 requested flatfile
H2O node 192.168.1.161:54323 requested flatfile
H2O node 192.168.1.162:54339 requested flatfile
Sending flatfiles to nodes...
[Sending flatfile to node 192.168.1.163:54321]
[Sending flatfile to node 192.168.1.161:54323]
[Sending flatfile to node 192.168.1.162:54339]
H2O node 192.168.1.161:54323 reports H2O cluster size 1
H2O node 192.168.1.163:54321 reports H2O cluster size 1
H2O node 192.168.1.162:54339 reports H2O cluster size 1
H2O node 192.168.1.162:54339 reports H2O cluster size 3
H2O node 192.168.1.163:54321 reports H2O cluster size 3
H2O node 192.168.1.161:54323 reports H2O cluster size 3
H2O cluster (3 nodes) is up
(Note: Use the -disown option to exit the driver after cluster formation)
(Press Ctrl-C to kill the cluster)
Blocking until the H2O cluster shuts down...

-libjars ../h2o.jar -
H2O is not the typical Hadoop MapReduce job

• Long-lived Hadoop mapper tasks that take up CPUs and memory
• No Hadoop reduce tasks
• No mapper task retry (user needs to restart H2O cloud on failure)
• No HDFS “input file” to the Hadoop map phase
•

Null splits define number of Hadoop map tasks

• No HDFS “output files” produced by the Hadoop job.
• Multithreaded

• Big Java heap (e.g. 128 GB)
• Mappers talk to each other
•

Mappers need to be run at the same time
H2O on Hadoop: Take a look
H2O on Hadoop: Take a look
H2O on Hadoop: take a look
H2O: take a look
H2O on Hadoop: take a look
Hadoop
Configuration
Client-side (Driver) Hadoop Configuration

mapreduce.job.ubertask.enable

false

[ mapreduceMapMemoryMb = Xmx + 10% fudge factor; ]
mapreduce.map.memory.mb

mapreduceMapMemoryMb

[ mapChildJavaOpts = "-Xms" + mapperXmx + " -Xmx" + mapperXmx; ]
mapred.child.java.opts

mapChildJavaOpts

mapred.map.child.java.opts

mapChildJavaOpts

mapreduce.map.speculative

false

mapred.map.tasks.speculative.execution

false

mapred.map.max.attempts

1

mapred.job.reuse.jvm.num.tasks

1
YARN Settings in CDH4

mapreduce.map.memory.mb

128 GB

mapreduce.map.java.opts.max.heap

100 GB

yarn.nodemanager.resource.memory-mb

128 GB

ResourceManager Configuration Safety Valve for yarn-site.xml
<property>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>131072</value>
</property>
Stuff We Learned

• Empty splits determine the mapper count
• Use Hadoop counters to avoid having Hadoop kill your
job (which doesn’t map anything) after 10 minutes
• Hook System.exit() to get mappers to exit properly
• YARN /etc/alternatives setting
• YARN memory settings
• YARN FIFO scheduler mapper clumping
• Hortonworks -libjars distribution unpacks the .jar file
Debugging a mapper process is an experience

• Finding the process is a challenge
• Finding the output is a challenge

• Convincing Hadoop not to delete the output on a failure is a
challenge
•

Even if your logger setup code gets a chance to run, you still have to
find the output.

• Attaching with standard JDWP debuggers is a challenge
•
•

Finding the process is (again) a challenge
Debugger doesn’t understand the mapper environment; attach didn’t
work.

• Finally resorted to POST-code style printing techniques
•

“Log” output to /tmp, bypassing everything, so I get to keep it
Possible Future Work

• Automated log collection upon failure
•

Fishing out logs from a cluster of 1000 Hadoop nodes is a great job
for a tool.

• Driver vs. host checking of Hadoop version
•

Helps avoid subtle errors

• Custom YARN Application Master
• Experiment further with I/O optimization for H2O Node
placement based on data locality

Your Contributions Welcome
Key Takeaways

• H2O is the premier platform for In-Memory Predictive
Analytics on Big Data
• Open Source Apache 2 license
• H2O mapreduce (in-memory) is not Hadoop mapreduce (HDFS)!

• H2O can read your data from HDFS
• Standalone or running on Hadoop

• Run H2O on Hadoop to use CPUs and Memory from your
existing Hadoop cluster
• Get started easily with the gear you’ve already got
H2O on Hadoop in the Wild

K-Means on a Terabyte of data (Major Insurance Co.)
Partners

K-Means on a Terabyte of data (Insurance)

Más contenido relacionado

La actualidad más candente

Introduction to hadoop high availability
Introduction to hadoop high availability Introduction to hadoop high availability
Introduction to hadoop high availability Omid Vahdaty
 
Hadoop migration and upgradation
Hadoop migration and upgradationHadoop migration and upgradation
Hadoop migration and upgradationShashwat Shriparv
 
MapR Tutorial Series
MapR Tutorial SeriesMapR Tutorial Series
MapR Tutorial Seriesselvaraaju
 
Hadoop operations basic
Hadoop operations basicHadoop operations basic
Hadoop operations basicHafizur Rahman
 
Hw09 Monitoring Best Practices
Hw09   Monitoring Best PracticesHw09   Monitoring Best Practices
Hw09 Monitoring Best PracticesCloudera, Inc.
 
Elastic HBase on Mesos - HBaseCon 2015
Elastic HBase on Mesos - HBaseCon 2015Elastic HBase on Mesos - HBaseCon 2015
Elastic HBase on Mesos - HBaseCon 2015Cosmin Lehene
 
Hands on MapR -- Viadea
Hands on MapR -- ViadeaHands on MapR -- Viadea
Hands on MapR -- Viadeaviadea
 
Hadoop Cluster With High Availability
Hadoop Cluster With High AvailabilityHadoop Cluster With High Availability
Hadoop Cluster With High AvailabilityEdureka!
 
Improving Hadoop Performance via Linux
Improving Hadoop Performance via LinuxImproving Hadoop Performance via Linux
Improving Hadoop Performance via LinuxAlex Moundalexis
 
Hadoop architecture meetup
Hadoop architecture meetupHadoop architecture meetup
Hadoop architecture meetupvmoorthy
 
Introduction to HDFS and MapReduce
Introduction to HDFS and MapReduceIntroduction to HDFS and MapReduce
Introduction to HDFS and MapReduceUday Vakalapudi
 
Build an High-Performance and High-Durable Block Storage Service Based on Ceph
Build an High-Performance and High-Durable Block Storage Service Based on CephBuild an High-Performance and High-Durable Block Storage Service Based on Ceph
Build an High-Performance and High-Durable Block Storage Service Based on CephRongze Zhu
 
8a. How To Setup HBase with Docker
8a. How To Setup HBase with Docker8a. How To Setup HBase with Docker
8a. How To Setup HBase with DockerFabio Fumarola
 
NYC Hadoop Meetup - MapR, Architecture, Philosophy and Applications
NYC Hadoop Meetup - MapR, Architecture, Philosophy and ApplicationsNYC Hadoop Meetup - MapR, Architecture, Philosophy and Applications
NYC Hadoop Meetup - MapR, Architecture, Philosophy and ApplicationsJason Shao
 
Erasure Code at Scale - Thomas William Byrne
Erasure Code at Scale - Thomas William ByrneErasure Code at Scale - Thomas William Byrne
Erasure Code at Scale - Thomas William ByrneCeph Community
 
Hadoop training in hyderabad-kellytechnologies
Hadoop training in hyderabad-kellytechnologiesHadoop training in hyderabad-kellytechnologies
Hadoop training in hyderabad-kellytechnologiesKelly Technologies
 
NoSql day 2019 - Floating on a Raft - Apache HBase durability with Apache Ratis
NoSql day 2019 - Floating on a Raft - Apache HBase durability with Apache RatisNoSql day 2019 - Floating on a Raft - Apache HBase durability with Apache Ratis
NoSql day 2019 - Floating on a Raft - Apache HBase durability with Apache RatisAnkit Singhal
 
Redundancy for Big Hadoop Clusters is hard - Stuart Pook
Redundancy for Big Hadoop Clusters is hard  - Stuart PookRedundancy for Big Hadoop Clusters is hard  - Stuart Pook
Redundancy for Big Hadoop Clusters is hard - Stuart PookEvention
 
How to Increase Performance of Your Hadoop Cluster
How to Increase Performance of Your Hadoop ClusterHow to Increase Performance of Your Hadoop Cluster
How to Increase Performance of Your Hadoop ClusterAltoros
 
Develop with linux containers and docker
Develop with linux containers and dockerDevelop with linux containers and docker
Develop with linux containers and dockerFabio Fumarola
 

La actualidad más candente (20)

Introduction to hadoop high availability
Introduction to hadoop high availability Introduction to hadoop high availability
Introduction to hadoop high availability
 
Hadoop migration and upgradation
Hadoop migration and upgradationHadoop migration and upgradation
Hadoop migration and upgradation
 
MapR Tutorial Series
MapR Tutorial SeriesMapR Tutorial Series
MapR Tutorial Series
 
Hadoop operations basic
Hadoop operations basicHadoop operations basic
Hadoop operations basic
 
Hw09 Monitoring Best Practices
Hw09   Monitoring Best PracticesHw09   Monitoring Best Practices
Hw09 Monitoring Best Practices
 
Elastic HBase on Mesos - HBaseCon 2015
Elastic HBase on Mesos - HBaseCon 2015Elastic HBase on Mesos - HBaseCon 2015
Elastic HBase on Mesos - HBaseCon 2015
 
Hands on MapR -- Viadea
Hands on MapR -- ViadeaHands on MapR -- Viadea
Hands on MapR -- Viadea
 
Hadoop Cluster With High Availability
Hadoop Cluster With High AvailabilityHadoop Cluster With High Availability
Hadoop Cluster With High Availability
 
Improving Hadoop Performance via Linux
Improving Hadoop Performance via LinuxImproving Hadoop Performance via Linux
Improving Hadoop Performance via Linux
 
Hadoop architecture meetup
Hadoop architecture meetupHadoop architecture meetup
Hadoop architecture meetup
 
Introduction to HDFS and MapReduce
Introduction to HDFS and MapReduceIntroduction to HDFS and MapReduce
Introduction to HDFS and MapReduce
 
Build an High-Performance and High-Durable Block Storage Service Based on Ceph
Build an High-Performance and High-Durable Block Storage Service Based on CephBuild an High-Performance and High-Durable Block Storage Service Based on Ceph
Build an High-Performance and High-Durable Block Storage Service Based on Ceph
 
8a. How To Setup HBase with Docker
8a. How To Setup HBase with Docker8a. How To Setup HBase with Docker
8a. How To Setup HBase with Docker
 
NYC Hadoop Meetup - MapR, Architecture, Philosophy and Applications
NYC Hadoop Meetup - MapR, Architecture, Philosophy and ApplicationsNYC Hadoop Meetup - MapR, Architecture, Philosophy and Applications
NYC Hadoop Meetup - MapR, Architecture, Philosophy and Applications
 
Erasure Code at Scale - Thomas William Byrne
Erasure Code at Scale - Thomas William ByrneErasure Code at Scale - Thomas William Byrne
Erasure Code at Scale - Thomas William Byrne
 
Hadoop training in hyderabad-kellytechnologies
Hadoop training in hyderabad-kellytechnologiesHadoop training in hyderabad-kellytechnologies
Hadoop training in hyderabad-kellytechnologies
 
NoSql day 2019 - Floating on a Raft - Apache HBase durability with Apache Ratis
NoSql day 2019 - Floating on a Raft - Apache HBase durability with Apache RatisNoSql day 2019 - Floating on a Raft - Apache HBase durability with Apache Ratis
NoSql day 2019 - Floating on a Raft - Apache HBase durability with Apache Ratis
 
Redundancy for Big Hadoop Clusters is hard - Stuart Pook
Redundancy for Big Hadoop Clusters is hard  - Stuart PookRedundancy for Big Hadoop Clusters is hard  - Stuart Pook
Redundancy for Big Hadoop Clusters is hard - Stuart Pook
 
How to Increase Performance of Your Hadoop Cluster
How to Increase Performance of Your Hadoop ClusterHow to Increase Performance of Your Hadoop Cluster
How to Increase Performance of Your Hadoop Cluster
 
Develop with linux containers and docker
Develop with linux containers and dockerDevelop with linux containers and docker
Develop with linux containers and docker
 

Similar a H2O on Hadoop: take a look

H2O on Hadoop Dec 12
H2O on Hadoop Dec 12 H2O on Hadoop Dec 12
H2O on Hadoop Dec 12 Sri Ambati
 
Playing with Hadoop (NPW2013)
Playing with Hadoop (NPW2013)Playing with Hadoop (NPW2013)
Playing with Hadoop (NPW2013)Søren Lund
 
DC HUG Hadoop for Windows
DC HUG Hadoop for WindowsDC HUG Hadoop for Windows
DC HUG Hadoop for WindowsTerry Padgett
 
Deployment and Management of Hadoop Clusters
Deployment and Management of Hadoop ClustersDeployment and Management of Hadoop Clusters
Deployment and Management of Hadoop ClustersAmal G Jose
 
BDM37: Hadoop in production – the war stories by Nikolaï Grigoriev, Principal...
BDM37: Hadoop in production – the war stories by Nikolaï Grigoriev, Principal...BDM37: Hadoop in production – the war stories by Nikolaï Grigoriev, Principal...
BDM37: Hadoop in production – the war stories by Nikolaï Grigoriev, Principal...Big Data Montreal
 
Big Data and Hadoop in Cloud - Leveraging Amazon EMR
Big Data and Hadoop in Cloud - Leveraging Amazon EMRBig Data and Hadoop in Cloud - Leveraging Amazon EMR
Big Data and Hadoop in Cloud - Leveraging Amazon EMRVijay Rayapati
 
Asbury Hadoop Overview
Asbury Hadoop OverviewAsbury Hadoop Overview
Asbury Hadoop OverviewBrian Enochson
 
2 Hadoop 1.x presentation in understading .pptx
2 Hadoop 1.x presentation in understading .pptx2 Hadoop 1.x presentation in understading .pptx
2 Hadoop 1.x presentation in understading .pptxKishanhari3
 
App cap2956v2-121001194956-phpapp01 (1)
App cap2956v2-121001194956-phpapp01 (1)App cap2956v2-121001194956-phpapp01 (1)
App cap2956v2-121001194956-phpapp01 (1)outstanding59
 
Inside the Hadoop Machine @ VMworld
Inside the Hadoop Machine @ VMworldInside the Hadoop Machine @ VMworld
Inside the Hadoop Machine @ VMworldRichard McDougall
 
App Cap2956v2 121001194956 Phpapp01 (1)
App Cap2956v2 121001194956 Phpapp01 (1)App Cap2956v2 121001194956 Phpapp01 (1)
App Cap2956v2 121001194956 Phpapp01 (1)outstanding59
 
Apache hadoop, hdfs and map reduce Overview
Apache hadoop, hdfs and map reduce OverviewApache hadoop, hdfs and map reduce Overview
Apache hadoop, hdfs and map reduce OverviewNisanth Simon
 
Power Hadoop Cluster with AWS Cloud
Power Hadoop Cluster with AWS CloudPower Hadoop Cluster with AWS Cloud
Power Hadoop Cluster with AWS CloudEdureka!
 
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015Deanna Kosaraju
 
Learn to setup a Hadoop Multi Node Cluster
Learn to setup a Hadoop Multi Node ClusterLearn to setup a Hadoop Multi Node Cluster
Learn to setup a Hadoop Multi Node ClusterEdureka!
 
Hadoop installation with an example
Hadoop installation with an exampleHadoop installation with an example
Hadoop installation with an exampleNikita Kesharwani
 

Similar a H2O on Hadoop: take a look (20)

H2O on Hadoop Dec 12
H2O on Hadoop Dec 12 H2O on Hadoop Dec 12
H2O on Hadoop Dec 12
 
Playing with Hadoop (NPW2013)
Playing with Hadoop (NPW2013)Playing with Hadoop (NPW2013)
Playing with Hadoop (NPW2013)
 
DC HUG Hadoop for Windows
DC HUG Hadoop for WindowsDC HUG Hadoop for Windows
DC HUG Hadoop for Windows
 
Deployment and Management of Hadoop Clusters
Deployment and Management of Hadoop ClustersDeployment and Management of Hadoop Clusters
Deployment and Management of Hadoop Clusters
 
BDM37: Hadoop in production – the war stories by Nikolaï Grigoriev, Principal...
BDM37: Hadoop in production – the war stories by Nikolaï Grigoriev, Principal...BDM37: Hadoop in production – the war stories by Nikolaï Grigoriev, Principal...
BDM37: Hadoop in production – the war stories by Nikolaï Grigoriev, Principal...
 
Big Data and Hadoop in Cloud - Leveraging Amazon EMR
Big Data and Hadoop in Cloud - Leveraging Amazon EMRBig Data and Hadoop in Cloud - Leveraging Amazon EMR
Big Data and Hadoop in Cloud - Leveraging Amazon EMR
 
Asbury Hadoop Overview
Asbury Hadoop OverviewAsbury Hadoop Overview
Asbury Hadoop Overview
 
2 Hadoop 1.x presentation in understading .pptx
2 Hadoop 1.x presentation in understading .pptx2 Hadoop 1.x presentation in understading .pptx
2 Hadoop 1.x presentation in understading .pptx
 
App cap2956v2-121001194956-phpapp01 (1)
App cap2956v2-121001194956-phpapp01 (1)App cap2956v2-121001194956-phpapp01 (1)
App cap2956v2-121001194956-phpapp01 (1)
 
Inside the Hadoop Machine @ VMworld
Inside the Hadoop Machine @ VMworldInside the Hadoop Machine @ VMworld
Inside the Hadoop Machine @ VMworld
 
App Cap2956v2 121001194956 Phpapp01 (1)
App Cap2956v2 121001194956 Phpapp01 (1)App Cap2956v2 121001194956 Phpapp01 (1)
App Cap2956v2 121001194956 Phpapp01 (1)
 
Apache hadoop, hdfs and map reduce Overview
Apache hadoop, hdfs and map reduce OverviewApache hadoop, hdfs and map reduce Overview
Apache hadoop, hdfs and map reduce Overview
 
Hadoop and Big Data
Hadoop and Big DataHadoop and Big Data
Hadoop and Big Data
 
Power Hadoop Cluster with AWS Cloud
Power Hadoop Cluster with AWS CloudPower Hadoop Cluster with AWS Cloud
Power Hadoop Cluster with AWS Cloud
 
2014 hadoop wrocław jug
2014 hadoop   wrocław jug2014 hadoop   wrocław jug
2014 hadoop wrocław jug
 
Unit 5
Unit  5Unit  5
Unit 5
 
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
 
Getting started with AMD GPUs
Getting started with AMD GPUsGetting started with AMD GPUs
Getting started with AMD GPUs
 
Learn to setup a Hadoop Multi Node Cluster
Learn to setup a Hadoop Multi Node ClusterLearn to setup a Hadoop Multi Node Cluster
Learn to setup a Hadoop Multi Node Cluster
 
Hadoop installation with an example
Hadoop installation with an exampleHadoop installation with an example
Hadoop installation with an example
 

Más de Sri Ambati

H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DaySri Ambati
 
Generative AI Masterclass - Model Risk Management.pptx
Generative AI Masterclass - Model Risk Management.pptxGenerative AI Masterclass - Model Risk Management.pptx
Generative AI Masterclass - Model Risk Management.pptxSri Ambati
 
AI and the Future of Software Development: A Sneak Peek
AI and the Future of Software Development: A Sneak Peek AI and the Future of Software Development: A Sneak Peek
AI and the Future of Software Development: A Sneak Peek Sri Ambati
 
LLMOps: Match report from the top of the 5th
LLMOps: Match report from the top of the 5thLLMOps: Match report from the top of the 5th
LLMOps: Match report from the top of the 5thSri Ambati
 
Building, Evaluating, and Optimizing your RAG App for Production
Building, Evaluating, and Optimizing your RAG App for ProductionBuilding, Evaluating, and Optimizing your RAG App for Production
Building, Evaluating, and Optimizing your RAG App for ProductionSri Ambati
 
Building LLM Solutions using Open Source and Closed Source Solutions in Coher...
Building LLM Solutions using Open Source and Closed Source Solutions in Coher...Building LLM Solutions using Open Source and Closed Source Solutions in Coher...
Building LLM Solutions using Open Source and Closed Source Solutions in Coher...Sri Ambati
 
Risk Management for LLMs
Risk Management for LLMsRisk Management for LLMs
Risk Management for LLMsSri Ambati
 
Open-Source AI: Community is the Way
Open-Source AI: Community is the WayOpen-Source AI: Community is the Way
Open-Source AI: Community is the WaySri Ambati
 
Building Custom GenAI Apps at H2O
Building Custom GenAI Apps at H2OBuilding Custom GenAI Apps at H2O
Building Custom GenAI Apps at H2OSri Ambati
 
Applied Gen AI for the Finance Vertical
Applied Gen AI for the Finance Vertical Applied Gen AI for the Finance Vertical
Applied Gen AI for the Finance Vertical Sri Ambati
 
Cutting Edge Tricks from LLM Papers
Cutting Edge Tricks from LLM PapersCutting Edge Tricks from LLM Papers
Cutting Edge Tricks from LLM PapersSri Ambati
 
Practitioner's Guide to LLMs: Exploring Use Cases and a Glimpse Beyond Curren...
Practitioner's Guide to LLMs: Exploring Use Cases and a Glimpse Beyond Curren...Practitioner's Guide to LLMs: Exploring Use Cases and a Glimpse Beyond Curren...
Practitioner's Guide to LLMs: Exploring Use Cases and a Glimpse Beyond Curren...Sri Ambati
 
Open Source h2oGPT with Retrieval Augmented Generation (RAG), Web Search, and...
Open Source h2oGPT with Retrieval Augmented Generation (RAG), Web Search, and...Open Source h2oGPT with Retrieval Augmented Generation (RAG), Web Search, and...
Open Source h2oGPT with Retrieval Augmented Generation (RAG), Web Search, and...Sri Ambati
 
KGM Mastering Classification and Regression with LLMs: Insights from Kaggle C...
KGM Mastering Classification and Regression with LLMs: Insights from Kaggle C...KGM Mastering Classification and Regression with LLMs: Insights from Kaggle C...
KGM Mastering Classification and Regression with LLMs: Insights from Kaggle C...Sri Ambati
 
LLM Interpretability
LLM Interpretability LLM Interpretability
LLM Interpretability Sri Ambati
 
Never Reply to an Email Again
Never Reply to an Email AgainNever Reply to an Email Again
Never Reply to an Email AgainSri Ambati
 
Introducción al Aprendizaje Automatico con H2O-3 (1)
Introducción al Aprendizaje Automatico con H2O-3 (1)Introducción al Aprendizaje Automatico con H2O-3 (1)
Introducción al Aprendizaje Automatico con H2O-3 (1)Sri Ambati
 
From Rapid Prototypes to an end-to-end Model Deployment: an AI Hedge Fund Use...
From Rapid Prototypes to an end-to-end Model Deployment: an AI Hedge Fund Use...From Rapid Prototypes to an end-to-end Model Deployment: an AI Hedge Fund Use...
From Rapid Prototypes to an end-to-end Model Deployment: an AI Hedge Fund Use...Sri Ambati
 
AI Foundations Course Module 1 - Shifting to the Next Step in Your AI Transfo...
AI Foundations Course Module 1 - Shifting to the Next Step in Your AI Transfo...AI Foundations Course Module 1 - Shifting to the Next Step in Your AI Transfo...
AI Foundations Course Module 1 - Shifting to the Next Step in Your AI Transfo...Sri Ambati
 
AI Foundations Course Module 1 - An AI Transformation Journey
AI Foundations Course Module 1 - An AI Transformation JourneyAI Foundations Course Module 1 - An AI Transformation Journey
AI Foundations Course Module 1 - An AI Transformation JourneySri Ambati
 

Más de Sri Ambati (20)

H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
 
Generative AI Masterclass - Model Risk Management.pptx
Generative AI Masterclass - Model Risk Management.pptxGenerative AI Masterclass - Model Risk Management.pptx
Generative AI Masterclass - Model Risk Management.pptx
 
AI and the Future of Software Development: A Sneak Peek
AI and the Future of Software Development: A Sneak Peek AI and the Future of Software Development: A Sneak Peek
AI and the Future of Software Development: A Sneak Peek
 
LLMOps: Match report from the top of the 5th
LLMOps: Match report from the top of the 5thLLMOps: Match report from the top of the 5th
LLMOps: Match report from the top of the 5th
 
Building, Evaluating, and Optimizing your RAG App for Production
Building, Evaluating, and Optimizing your RAG App for ProductionBuilding, Evaluating, and Optimizing your RAG App for Production
Building, Evaluating, and Optimizing your RAG App for Production
 
Building LLM Solutions using Open Source and Closed Source Solutions in Coher...
Building LLM Solutions using Open Source and Closed Source Solutions in Coher...Building LLM Solutions using Open Source and Closed Source Solutions in Coher...
Building LLM Solutions using Open Source and Closed Source Solutions in Coher...
 
Risk Management for LLMs
Risk Management for LLMsRisk Management for LLMs
Risk Management for LLMs
 
Open-Source AI: Community is the Way
Open-Source AI: Community is the WayOpen-Source AI: Community is the Way
Open-Source AI: Community is the Way
 
Building Custom GenAI Apps at H2O
Building Custom GenAI Apps at H2OBuilding Custom GenAI Apps at H2O
Building Custom GenAI Apps at H2O
 
Applied Gen AI for the Finance Vertical
Applied Gen AI for the Finance Vertical Applied Gen AI for the Finance Vertical
Applied Gen AI for the Finance Vertical
 
Cutting Edge Tricks from LLM Papers
Cutting Edge Tricks from LLM PapersCutting Edge Tricks from LLM Papers
Cutting Edge Tricks from LLM Papers
 
Practitioner's Guide to LLMs: Exploring Use Cases and a Glimpse Beyond Curren...
Practitioner's Guide to LLMs: Exploring Use Cases and a Glimpse Beyond Curren...Practitioner's Guide to LLMs: Exploring Use Cases and a Glimpse Beyond Curren...
Practitioner's Guide to LLMs: Exploring Use Cases and a Glimpse Beyond Curren...
 
Open Source h2oGPT with Retrieval Augmented Generation (RAG), Web Search, and...
Open Source h2oGPT with Retrieval Augmented Generation (RAG), Web Search, and...Open Source h2oGPT with Retrieval Augmented Generation (RAG), Web Search, and...
Open Source h2oGPT with Retrieval Augmented Generation (RAG), Web Search, and...
 
KGM Mastering Classification and Regression with LLMs: Insights from Kaggle C...
KGM Mastering Classification and Regression with LLMs: Insights from Kaggle C...KGM Mastering Classification and Regression with LLMs: Insights from Kaggle C...
KGM Mastering Classification and Regression with LLMs: Insights from Kaggle C...
 
LLM Interpretability
LLM Interpretability LLM Interpretability
LLM Interpretability
 
Never Reply to an Email Again
Never Reply to an Email AgainNever Reply to an Email Again
Never Reply to an Email Again
 
Introducción al Aprendizaje Automatico con H2O-3 (1)
Introducción al Aprendizaje Automatico con H2O-3 (1)Introducción al Aprendizaje Automatico con H2O-3 (1)
Introducción al Aprendizaje Automatico con H2O-3 (1)
 
From Rapid Prototypes to an end-to-end Model Deployment: an AI Hedge Fund Use...
From Rapid Prototypes to an end-to-end Model Deployment: an AI Hedge Fund Use...From Rapid Prototypes to an end-to-end Model Deployment: an AI Hedge Fund Use...
From Rapid Prototypes to an end-to-end Model Deployment: an AI Hedge Fund Use...
 
AI Foundations Course Module 1 - Shifting to the Next Step in Your AI Transfo...
AI Foundations Course Module 1 - Shifting to the Next Step in Your AI Transfo...AI Foundations Course Module 1 - Shifting to the Next Step in Your AI Transfo...
AI Foundations Course Module 1 - Shifting to the Next Step in Your AI Transfo...
 
AI Foundations Course Module 1 - An AI Transformation Journey
AI Foundations Course Module 1 - An AI Transformation JourneyAI Foundations Course Module 1 - An AI Transformation Journey
AI Foundations Course Module 1 - An AI Transformation Journey
 

Último

Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessPixlogix Infotech
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 

Último (20)

Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 

H2O on Hadoop: take a look

  • 1. H2O on Hadoop [ H2O – The Open Source In-Memory Prediction Engine for Big Data ] Tom Kraljevic
  • 2. Outline H2O as a Standalone HDFS client H2O on Hadoop Configuration settings Stuff we learned Questions
  • 3. Using H2O as an HDFS client Why I want to do run a Generalized Linear Model, and my data lives in HDFS. What Use H2O standalone. Why Let’s also use the CPUs and Memory of my Hadoop cluster! What Run H2O on Hadoop.
  • 4. H2O Standalone Deployment Using HDFS java -jar h2o.jar java -jar h2o.jar
  • 5. H2O on Hadoop Deployment
  • 6. H2O on Hadoop Deployment
  • 7. H2O on Hadoop Deployment
  • 8. Standalone Mode Command-Line Invocation unzip java -jar
  • 9. H2O on Hadoop Client Invocation tomk@mr-0xb1:~/h2o-2.1.0.1144/hadoop$ hadoop jar h2odriver_cdh4.jar water.hadoop.h2odriver mapperXmx 30g -nodes 3 -output hdfsOutDir Determining driver host interface for mapper->driver callback... [Possible callback IP address: 192.168.1.161] [Possible callback IP address: 127.0.0.1] Using mapper->driver callback IP address and port: 192.168.1.161:60576 (You can override these with -driverif and -driverport.) Driver program compiled with MapReduce V1 (Classic) 13/12/12 13:25:29 WARN conf.Configuration: mapred.map.child.java.opts is deprecated. Instead, use mapreduce.map.java.opts Memory Settings: mapred.child.java.opts: -Xms30g -Xmx30g mapred.map.child.java.opts: -Xms30g -Xmx30g Extra memory percent: 10 mapreduce.map.memory.mb: 33792 Job name 'H2O_49792' submitted JobTracker job ID is 'job_1386713862878_0002' Waiting for H2O cluster to come up... H2O node 192.168.1.163:54321 requested flatfile H2O node 192.168.1.161:54323 requested flatfile H2O node 192.168.1.162:54339 requested flatfile Sending flatfiles to nodes... [Sending flatfile to node 192.168.1.163:54321] [Sending flatfile to node 192.168.1.161:54323] [Sending flatfile to node 192.168.1.162:54339] H2O node 192.168.1.161:54323 reports H2O cluster size 1 H2O node 192.168.1.163:54321 reports H2O cluster size 1 H2O node 192.168.1.162:54339 reports H2O cluster size 1 H2O node 192.168.1.162:54339 reports H2O cluster size 3 H2O node 192.168.1.163:54321 reports H2O cluster size 3 H2O node 192.168.1.161:54323 reports H2O cluster size 3 H2O cluster (3 nodes) is up (Note: Use the -disown option to exit the driver after cluster formation) (Press Ctrl-C to kill the cluster) Blocking until the H2O cluster shuts down... -libjars ../h2o.jar -
  • 10. H2O is not the typical Hadoop MapReduce job • Long-lived Hadoop mapper tasks that take up CPUs and memory • No Hadoop reduce tasks • No mapper task retry (user needs to restart H2O cloud on failure) • No HDFS “input file” to the Hadoop map phase • Null splits define number of Hadoop map tasks • No HDFS “output files” produced by the Hadoop job. • Multithreaded • Big Java heap (e.g. 128 GB) • Mappers talk to each other • Mappers need to be run at the same time
  • 11. H2O on Hadoop: Take a look
  • 12. H2O on Hadoop: Take a look
  • 13. H2O on Hadoop: take a look
  • 14. H2O: take a look
  • 15. H2O on Hadoop: take a look
  • 17. Client-side (Driver) Hadoop Configuration mapreduce.job.ubertask.enable false [ mapreduceMapMemoryMb = Xmx + 10% fudge factor; ] mapreduce.map.memory.mb mapreduceMapMemoryMb [ mapChildJavaOpts = "-Xms" + mapperXmx + " -Xmx" + mapperXmx; ] mapred.child.java.opts mapChildJavaOpts mapred.map.child.java.opts mapChildJavaOpts mapreduce.map.speculative false mapred.map.tasks.speculative.execution false mapred.map.max.attempts 1 mapred.job.reuse.jvm.num.tasks 1
  • 18. YARN Settings in CDH4 mapreduce.map.memory.mb 128 GB mapreduce.map.java.opts.max.heap 100 GB yarn.nodemanager.resource.memory-mb 128 GB ResourceManager Configuration Safety Valve for yarn-site.xml <property> <name>yarn.scheduler.maximum-allocation-mb</name> <value>131072</value> </property>
  • 19. Stuff We Learned • Empty splits determine the mapper count • Use Hadoop counters to avoid having Hadoop kill your job (which doesn’t map anything) after 10 minutes • Hook System.exit() to get mappers to exit properly • YARN /etc/alternatives setting • YARN memory settings • YARN FIFO scheduler mapper clumping • Hortonworks -libjars distribution unpacks the .jar file
  • 20. Debugging a mapper process is an experience • Finding the process is a challenge • Finding the output is a challenge • Convincing Hadoop not to delete the output on a failure is a challenge • Even if your logger setup code gets a chance to run, you still have to find the output. • Attaching with standard JDWP debuggers is a challenge • • Finding the process is (again) a challenge Debugger doesn’t understand the mapper environment; attach didn’t work. • Finally resorted to POST-code style printing techniques • “Log” output to /tmp, bypassing everything, so I get to keep it
  • 21. Possible Future Work • Automated log collection upon failure • Fishing out logs from a cluster of 1000 Hadoop nodes is a great job for a tool. • Driver vs. host checking of Hadoop version • Helps avoid subtle errors • Custom YARN Application Master • Experiment further with I/O optimization for H2O Node placement based on data locality Your Contributions Welcome
  • 22. Key Takeaways • H2O is the premier platform for In-Memory Predictive Analytics on Big Data • Open Source Apache 2 license • H2O mapreduce (in-memory) is not Hadoop mapreduce (HDFS)! • H2O can read your data from HDFS • Standalone or running on Hadoop • Run H2O on Hadoop to use CPUs and Memory from your existing Hadoop cluster • Get started easily with the gear you’ve already got
  • 23. H2O on Hadoop in the Wild K-Means on a Terabyte of data (Major Insurance Co.)
  • 24. Partners K-Means on a Terabyte of data (Insurance)