Hadoop – The War Stories
Running Hadoop in large enterprise environment
Nikolai Grigoriev (ngrigoriev@gmail.com, @nikgrig)
Principal Software Engineer, http://sociablelabs.com
Agenda
● Why Hadoop?
● Planning Hadoop deployment
● Hadoop and real hardware
● Understanding the software stack
● Tuning HDFS, MapReduce and HBase
● Troubleshooting examples
● Testing your applications
Disclaimer: this presentation is based on the combined work experience from more than
one company and represents the author's personal point of view on the problems discussed in it.
Why Hadoop (and why we decided to use it)?
● Need to store hundreds of TB of data
● Need to process it in parallel
● Desire to have both storage and processing horizontally scalable
● Having an open-source platform with commercial support
Our application
Application servers (many :)) → Log processors → “ETL process”
Our application in numbers
● Thousands of user sessions per second
● Average session log size: ~30KB, 3-7 events per log
● Target retention period – at least ~90 days
● Redundancy and HA everywhere
● Pluggable “ETL” modules for additional data
processing
Main problem
Team had no practical knowledge
of Hadoop, HDFS and HBase…
...and there was nobody at the
company to help
But we did not realize...
It was not THE ONLY problem we
were about to face!
First fight – capacity planning
● Tons of articles are written about Hadoop
capacity planning
● Architects may spend months making educated guesses
● Capacity planning is really about finding the
amount of $$$ to be spent on your cluster for
target workload
– If we had infinite amount of $$$ why would we
bother at all? ;)
Hadoop performance limiting factors: CPU, RAM, disk I/O and network
It is all about the balance
● Your Hadoop cluster and your apps use all these resources at different times
● Over-provisioning one of the resources usually leads to a shortage of another one - wasted $$$
What can we say about an app?
● It is going to store X TB of data
– Amount of storage (do not forget the RF!); a back-of-envelope sketch follows below
– Accommodate for growth and failures
● It is going to ingest the data at Y MB/s
– Your network speed and number of nodes
● Latency
– More HDDs and faster HDDs
– More RAM
– More nodes
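As an illustration, a minimal storage sketch. The ingest rate is an assumption; the ~90-day retention comes from the slides and RF=3 is the HDFS default:

# raw disk = daily ingest × retention × RF × headroom
# 500 GB/day is assumed; 1.3 adds ~30% buffer for growth, spill space
# and re-replication after node failures
$ echo "500 * 90 * 3 * 1.3 / 1024" | bc -l
# ≈ 171 TB of raw disk across the cluster for this workload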
We are big enterprise...
Geeky Hadoop developer:
- many “commodity+” hosts
- good but inexpensive networking
- more regular HDDs
- lots of RAM
- I also love cloud…
- my recent OS
- my software configuration
- simple network
Old School Senior IT Guy:
- SANs, RAIDs, SCSI, racks, blades, redundancy, Cisco, HP, fiber optics, 4-year-old rock-solid RHEL, SNMP monitoring…
- what? I am the Boss...
Hadoop cluster vs. old school
application servers
● Mostly identical “commodity+” machines
– Probably with the exception of NN, JT
● Better to have more, simpler machines than fewer monster ones
● No RAID, just JBOD!
● Ethernet: depending on the storage density, bonded 1Gbit may be enough
● Hadoop achieves with software what used to be
achievable with [expensive!] hardware
But still, your application is the
driver, not the IT guy!
[Figure from the Cloudera website: Hadoop machine configuration according to workload]
Your job is:
● Educate your IT, get them on your side or at
least earn their trust
● Try to build a capacity planning spreadsheet
based on what you do know
● Apply common sense to guess what you do not
know
● ...and plan a decent buffer
● Set reasonable performance targets for your
application
Fight #2 – OMG, our application is
slow!!!
● Main part of our application was the MR job merging the
logs
● We had committed to deliver X logs/sec on a target test cluster with sample workload
● We were delivering only ~30% of that
● ...weeks before release :)
● ...and we had run out of other excuses :(
● It was clearly our software and/or
configuration
Wait a second – we have a support contract with a Hadoop vendor!
● I mean no disrespect to the vendors!
● But they do not know your application
● And they do not know your hardware
● And they do not know exactly your OS
● And they do not know your network equipment
● They can help you with some tuning, they can help you with bugs and crashes – but they won't be able (or sometimes simply won't be qualified) to do your job!
We are on our own :(
● We realized that our testing methods were not adequate for a Hadoop-based ETL process
● Testing the product end-to-end was too difficult,
tracking changes was impossible
● Turn-around was too long, we could not try
something quickly and revert back
● Observing and monitoring the live system with
dummy incoming data was not productive
enough
Key to successful testing
● Representative data set
● Ability to repeat the same operation as many
times as needed with quick turnaround
● Each engineer had to be able to run the tests
and try something
● Establishing the key metrics you monitor and try
to improve
● Methodical approach – analyze, change, test, be ready to roll back
Our “reference runner”
● Large sample dataset – representative input data
● “Reset” tool – recreates HBase tables (with predefined regions), cleans HDFS etc.
● Runner tool – injects the test data, prepares the environment, launches the MR job like the real application, and allows quick rebuild and redeploy of parts of the application
● Statistics – answers the manager's question: any improvements since the last run?
Tuning results
● In two weeks we had a job that worked about 3 times faster
● Tuning was done everywhere – from the OS to Hadoop/HBase and our own code
● We were confident that the software was ready to go to production
● Over the following 2 years we realized how bad our design was and how it should have been done ;)
Hadoop MapReduce DOs
● Think processes, not threads
● Reusable objects, lower GC overhead
● Snappy data compression is generally good (see the sketch after the DONTs list)
● Reasonable use of counters provides important information
● For frequently running jobs, the distributed cache helps a lot
● Minimize disk I/O (spills etc.), RAM is cheap
● Avoid unnecessary serialization/deserialization
Hadoop MapReduce DONTs
● Small files in HDFS
● Multithreaded programming inside
mapper/reducer
● Fat tasks using too much heap
● Any I/O in M-R other than HDFS, ZK or HBase
● Over-complicated code (simple things work
better)
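To make the Snappy “DO” concrete, a minimal submission sketch. The jar and class names are hypothetical, and the property names are from the Hadoop 2.x era (older releases used mapred.compress.map.output and mapred.map.output.compression.codec):

$ hadoop jar my-etl-job.jar com.example.LogMergeJob \
    -Dmapreduce.map.output.compress=true \
    -Dmapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec \
    /input/session-logs /output/merged
# Compressing the intermediate (map) output trades a little CPU for much less
# spill and shuffle I/O – usually a clear win for ETL-style jobs.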
Fight #3 – Going Production!
● Remember the slide about engineer vs. IT God
preferences ;)
● Production hardware was slightly different from
the test cluster
● The cluster had been deployed by people who did not know Hadoop
● The first attempt to run the software resulted in a major failure, and the cluster was finally handed over to the developers for fixing ;)
Production hardware
● HP blade servers, 32 core, 128GB of RAM
● Emulex dual-port 10G Ethernet NICs
● 14 HDDs per machine
● OEL 6.3
● 10G switch modules
● Company hosting center with dedicated
networking and operations staff
Step back – 10,000 ft look at the Hadoop stack
The stack, bottom to top:
Hardware
BIOS/Firmware(s)
BIOS/Firmware settings
OS (Linux)
Java (JVM)
Hadoop services
Your application(s)
...with the Network connecting it all
- Hadoop is not just a bunch of Java apps
- It is a data and application platform
- It can run well, just run, barely run or cause constant headache – depending on how much love it receives :)
Hadoop stack (continued)
● In Hadoop, a small problem, sometimes even on a single node, can be a major pain
● Isolating and finding that small problem may be
difficult
● Symptoms are often obvious only at high level
(e.g. application)
● Complex hardware (like HP) adds more
potential problems
Example of one of the problems we
had initially
● Jobs were failing because of timeouts
● Numerous I/O errors observed in job and HDFS logs
● This simple test was failing:
$ dd if=/dev/zero of=test8Gb.bin bs=1M count=8192
$ time hdfs dfs -copyFromLocal test8Gb.bin /
Zzz..zzz...zzz...5min...zzz…
real 4m10.002s
user 0m15.130s
sys 0m4.094s
● IT was clueless and did not really seem to bother
● In fact, 8192 MB / (4 × 60 + 10) s ≈ 33 MB/s (!?!?!)
● A healthy 10G network moves data into HDFS at ~160 MB/s or better
Role of HDFS in Hadoop
● In Hadoop HDFS is the key layer that provides
the distributed filesystem services for other
components
● Health of HDFS directly (and drastically) affects
the health of other components
[Diagram: Map-Reduce and HBase sit on top of HDFS, which holds the data]
So, clearly HDFS was the problem
● But what exactly was the problem with HDFS?
● How exactly does HDFS writing work? (The client streams each block to the first DataNode, which forwards it to the second, which forwards it to the third – the replication pipeline.)
Chasing it down
● Due to node-to-node streaming it was difficult to
understand who was responsible
● The theory of “one bad node in the pipeline” was ruled out, as results were consistently bad across the 14-node cluster
● Idea (isolating the problem is good):
$ time hdfs dfs -Ddfs.replication=1 -copyFromLocal test8Gb.bin /
real 0m42.002s
$ time hdfs dfs -Ddfs.replication=2 -copyFromLocal test8Gb.bin /
real 2m53.184s
$ time hdfs dfs -Ddfs.replication=3 -copyFromLocal test8Gb.bin /
real 3m41.072s
● 8192 MB / 42 s ≈ 195 MB/s – hmmm….
Discoveries
● To make an even longer story short...
– A bug in the “cubic” TCP congestion-control algorithm in the Linux kernel
– NIC firmware was too old
– Kernel driver for Emulex 10G NICs was too old
– Only one out of 8 NIC RX queues was enabled on some
hosts
– A number of network settings were not appropriate for 10G
network
– “irqbalance” process (due to kernel bug) was locking NIC
RX queues by “losing” NIC IRQ handlers
– ...
More discoveries
– Nodes were set up multi-homed, which even HDFS did not support at that time
– Misconfigured DNS and reverse DNS
● On disk I/O side
– Bad filesystem parameters
– Read-ahead settings were wrong
– Disk controller firmware was old
HDFS “litmus” test - TestDFSIO
13/03/13 16:30:02 INFO fs.TestDFSIO: ----- TestDFSIO ----- : write
13/03/13 16:30:02 INFO fs.TestDFSIO: Date & time: Wed Mar 13 16:30:02 UTC 2013
13/03/13 16:30:02 INFO fs.TestDFSIO: Number of files: 16
13/03/13 16:30:02 INFO fs.TestDFSIO: Total MBytes processed: 160000.0
13/03/13 16:30:02 INFO fs.TestDFSIO: Throughput mb/sec: 103.42190773343779
13/03/13 16:30:02 INFO fs.TestDFSIO: Average IO rate mb/sec: 103.61066436767578
13/03/13 16:30:02 INFO fs.TestDFSIO: IO rate std deviation: 4.513343367320971
13/03/13 16:30:02 INFO fs.TestDFSIO: Test exec time sec: 114.876
13/03/13 16:31:31 INFO fs.TestDFSIO: ----- TestDFSIO ----- : read
13/03/13 16:31:31 INFO fs.TestDFSIO: Date & time: Wed Mar 13 16:31:31 UTC 2013
13/03/13 16:31:31 INFO fs.TestDFSIO: Number of files: 16
13/03/13 16:31:31 INFO fs.TestDFSIO: Total MBytes processed: 160000.0
13/03/13 16:31:31 INFO fs.TestDFSIO: Throughput mb/sec: 586.8243268024676
13/03/13 16:31:31 INFO fs.TestDFSIO: Average IO rate mb/sec: 648.8555908203125
13/03/13 16:31:31 INFO fs.TestDFSIO: IO rate std deviation: 267.0954600161208
13/03/13 16:31:31 INFO fs.TestDFSIO: Test exec time sec: 33.683
13/03/13 16:31:31 INFO fs.TestDFSIO:
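For reference, invocations that produce output like the above; the tests jar name and path vary by Hadoop version and distribution, so treat this as a sketch (16 files × 10000 MB matches the 160000 MB totals shown):

$ hadoop jar hadoop-mapreduce-client-jobclient-tests.jar TestDFSIO \
    -write -nrFiles 16 -fileSize 10000
$ hadoop jar hadoop-mapreduce-client-jobclient-tests.jar TestDFSIO \
    -read -nrFiles 16 -fileSize 10000
# Run -write before -read (the read pass consumes the files the write pass
# created), and run TestDFSIO -clean afterwards to remove the test data.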
Fight #4 – tuning Hadoop
● Why do people tune things
(IT was not interested ;) )?
● With your own expensive hardware you want the maximum IOPS and CPU power for the $$$ you have paid
● Not to mention that you simply want your apps to
run faster
● Tuning is an endless process but 80/20 rule
works perfectly
Even before you have something to
tune….
● Pick reasonably good hardware but do not go
high-end
● Same for network equipment
● Hadoop scales well and the redundancy is
achieved by software
● More nodes are almost always better than going for extra node power and/or storage space
● Simpler systems are easier to tune, maintain
and troubleshoot
● Different machines for master nodes
Tuning the hardware and BIOS
● Updating BIOS and firmware to recent versions
● Disabling dynamic CPU frequency scaling
● Tuning memory speed, power profile
● Disk controller, tune disk cache
OS Tuning
● Pick the filesystem (ext3, ext4, XFS...), parameters (reserved blocks 0%) and mount options (noatime, nodiratime, barriers etc.)
● I/O scheduler depending on your disks and tasks
● Read-ahead settings
● Disable swap!
● irqbalance for big machines
● Tune other parameters (number of FDs, sockets)
● Install the major troubleshooting tools (iostat, iotop, tcpdump, strace…) on every node (typical commands sketched below)
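A few typical commands behind these bullets; a sketch only – device names and values are illustrative, and the right scheduler and read-ahead depend on your disks and workload:

# Mount data disks without access-time updates
$ mount -o noatime,nodiratime /dev/sdb1 /data/1
# Deadline is a common scheduler choice for JBOD spinning disks
$ echo deadline > /sys/block/sdb/queue/scheduler
# Raise read-ahead to 8 MB (the value is in 512-byte sectors)
$ blockdev --setra 16384 /dev/sdb
# Effectively disable swapping
$ sysctl -w vm.swappiness=0
# Zero the reserved-blocks percentage on a data-only ext3/ext4 filesystem
$ tune2fs -m 0 /dev/sdb1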
Network tuning
● Test your TCP performance with iperf, ttcp or any other tool you like (see the sketch below)
● Know your NICs well, install the right firmware and kernel modules
● Tune your TCP and IP parameters (work harder if you have an expensive 10G network)
● If your NIC supports TCP offload and it works – use it
● txqueuelen, MTU 9000 (if appropriate), HDFS is chatty
● Learn ethtool and see what it can do for you
● Basic IP networking set-up (DNS etc.) has to be 100% perfect
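A sketch of the verification and tuning steps, assuming a 10G link; hostnames and buffer sizes are illustrative:

# Measure raw TCP throughput between two nodes (start the server side first)
$ iperf -s                        # on datanode02
$ iperf -c datanode02 -t 30 -P 4  # on datanode01
# Typical TCP buffer limits for a 10G network
$ sysctl -w net.core.rmem_max=16777216
$ sysctl -w net.core.wmem_max=16777216
$ sysctl -w net.ipv4.tcp_rmem="4096 87380 16777216"
$ sysctl -w net.ipv4.tcp_wmem="4096 65536 16777216"
# Inspect offload settings, ring buffers and IRQ distribution per RX queue
$ ethtool -k eth0
$ ethtool -g eth0
$ grep eth0 /proc/interrupts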
JVM tuning
● Hadoop allows you to set JVM options for all processes
● Your DataNode, NameNode and HBase RegionServers are going to work hard and you need to help them deal with your workload
● If your MR code is well designed you will most likely NOT need to tune the JVM for MR tasks
● Your main enemy will be GC – until you become at least allies, if not friends :) (sample flags below)
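As an illustration, CMS-era GC flags for an HBase RegionServer (this deck predates mature G1), set via hbase-env.sh; the heap size, thresholds and paths are illustrative, and GC logging is included so you can actually watch your enemy:

$ cat >> /etc/hbase/conf/hbase-env.sh <<'EOF'
export HBASE_REGIONSERVER_OPTS="-Xms16g -Xmx16g \
  -XX:+UseConcMarkSweepGC \
  -XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSInitiatingOccupancyOnly \
  -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
  -Xloggc:/var/log/hbase/gc-regionserver.log"
EOF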
Tuning Hadoop services
● The NameNode deals with many connections and needs ~150 bytes of heap per HDFS block (back-of-envelope below)
● NameNode and DataNode are highly concurrent; the latter needs many threads
● Use HDFS short-circuit reads if appropriate
● ZooKeeper needs to handle enough connections
● HBase uses LOTS of heap
● Reuse JVMs for MR jobs if appropriate
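A back-of-envelope application of the ~150 bytes/block rule; the inputs are illustrative, and file/directory objects add comparable overhead on top of the blocks themselves:

# 500 TB raw / RF 3 / 128 MB block size -> block count, then heap for blocks
$ echo "500*10^12 / 3 / (128*10^6)" | bc          # ≈ 1.3M blocks
$ echo "500*10^12 / 3 / (128*10^6) * 150 / 2^20" | bc   # ≈ 186 (MB of heap)
# Files, directories and safety margin push the real NN heap figure much higher.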
Tuning MapReduce tasks (that means tuning for your code and data)
● If you run different MR jobs, consider tuning parameters for each of them, not once for all of them
● Configure the job scheduler to enforce the SLAs
● Estimate the resources needed for each job
● Plan how you are going to run your jobs (per-job overrides are sketched below)
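A sketch of per-job overrides passed at submission time instead of one global mapred-site.xml; the jar/class names are hypothetical and the property names are from the Hadoop 2.x era:

$ hadoop jar my-etl-job.jar com.example.LogMergeJob \
    -Dmapreduce.job.queuename=etl \
    -Dmapreduce.task.io.sort.mb=512 \
    -Dmapreduce.job.reduces=56 \
    /input/session-logs /output/merged
# queuename ties the job to the scheduler queue that carries its SLA;
# io.sort.mb buys a bigger sort buffer (fewer spills); the reducer count
# is sized for this particular job's data rather than a global default.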
Tuning your own code
● Test and profile your complex MR code outside of
Hadoop (your savings will scale too!)
● Check for GC overhead
● Use reusable objects
● Avoid using expensive formats like JSON and XML
● Anything you waste is multiplied by the number of
rows and the number of tasks!
● Evaluate the need for intermediate data compression
Tuning HBase
● That would require a separate presentation
● You will need to fight hard to reduce GC pauses and overhead
● Pre-splitting regions may be a good idea to better balance the load
● Understand HBase compactions and deal with major compactions your way
(Both ideas are sketched below.)
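A sketch of the last two bullets using the HBase shell; the table name, column family and split points are illustrative and assume row keys starting with a hex-like character:

# Pre-split a new table into 16 regions at creation time
$ echo "create 'session_logs', 'd', \
  {SPLITS => ['1','2','3','4','5','6','7','8','9','a','b','c','d','e','f']}" \
  | hbase shell
# Disable time-based major compactions (hbase.hregion.majorcompaction = 0 in
# hbase-site.xml) and trigger them yourself during quiet hours instead:
$ echo "major_compact 'session_logs'" | hbase shell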
Set up your monitoring (and
alarming)
● You cannot improve what you cannot see!
● Monitor OS, Hadoop and your app metrics
● Ganglia, Graphite, LogStash, even Cloudera
Manager are your friends
● Set the baseline, track your changes, observe
the outcome
Fight #5 - Operations
● Real hand-over to the Operations people
actually never happened
● Any problem was either ignored, or escalated to the engineers within about a minute
● Neither NOC nor Operations staff wanted to acquire enough knowledge of Hadoop and the apps
● Monitoring was nearly non-existent
● Same for appropriate alarms
If you are serious...
● Send your Ops for Hadoop training (or buy
them books and have them read those!)
● Have them automate everything
● Ops have to understand your applications, not
just the platform they are running on
● Your Ops need to be decent Linux admins
● ...and it would be great if they were also OK programmers (scripting, Java…)
● Of course, motivation is key
Plan and train for disaster
● Train your Ops to help your system survive till Monday morning
● Decide what sort of loss you will tolerate (BigData is not always so precious)
● Design your system for resilience, async
processing, queuing etc
Fight #6 - evolution
● Sooner or later you will need to increase your
capacity
– Unless your business is stagnating
● Technically, you will either
– Run out of storage space
– Start hitting the wall on IOPS or CPU and fail to respect your SLAs (even if only internal ones)
– Or be unable to deploy new applications
Understand your application - again
● Even if your apps run fine you need to monitor the performance factors
● Build spreadsheets reflecting your current numbers
● Plan for the business growth
● Translate this into the number of additional nodes
and networking equipment
● Especially important if your hardware purchase
cycle takes months
Conclusions
● Not all companies are ready for BigData – often
because of conservative people in key positions
● Traditional IT/Ops/NOC organizations are often
unable to support these platforms
● Engineers have to be given more power to control how the things they build are run (DevOps)
● Hadoop is a complex platform and has to be
taken seriously for serious applications
● If you really depend on Hadoop you do need to
build in-house expertise
Questions?
Thanks for listening!
Nikolai Grigoriev
ngrigoriev@gmail.com
Más contenido relacionado

La actualidad más candente

Troubleshooting Hadoop: Distributed Debugging
Troubleshooting Hadoop: Distributed DebuggingTroubleshooting Hadoop: Distributed Debugging
Troubleshooting Hadoop: Distributed DebuggingGreat Wide Open
 
Bigdata workshop february 2015
Bigdata workshop  february 2015 Bigdata workshop  february 2015
Bigdata workshop february 2015 clairvoyantllc
 
Optimizing your Infrastrucure and Operating System for Hadoop
Optimizing your Infrastrucure and Operating System for HadoopOptimizing your Infrastrucure and Operating System for Hadoop
Optimizing your Infrastrucure and Operating System for HadoopDataWorks Summit
 
Geographically Distributed PostgreSQL
Geographically Distributed PostgreSQLGeographically Distributed PostgreSQL
Geographically Distributed PostgreSQLmason_s
 
Advanced Hadoop Tuning and Optimization - Hadoop Consulting
Advanced Hadoop Tuning and Optimization - Hadoop ConsultingAdvanced Hadoop Tuning and Optimization - Hadoop Consulting
Advanced Hadoop Tuning and Optimization - Hadoop ConsultingImpetus Technologies
 
Hadoop architecture meetup
Hadoop architecture meetupHadoop architecture meetup
Hadoop architecture meetupvmoorthy
 
Apache Hadoop In Theory And Practice
Apache Hadoop In Theory And PracticeApache Hadoop In Theory And Practice
Apache Hadoop In Theory And PracticeAdam Kawa
 
Hadoop for Scientific Workloads__HadoopSummit2010
Hadoop for Scientific Workloads__HadoopSummit2010Hadoop for Scientific Workloads__HadoopSummit2010
Hadoop for Scientific Workloads__HadoopSummit2010Yahoo Developer Network
 
Hadoop training in hyderabad-kellytechnologies
Hadoop training in hyderabad-kellytechnologiesHadoop training in hyderabad-kellytechnologies
Hadoop training in hyderabad-kellytechnologiesKelly Technologies
 
Architectural Overview of MapR's Apache Hadoop Distribution
Architectural Overview of MapR's Apache Hadoop DistributionArchitectural Overview of MapR's Apache Hadoop Distribution
Architectural Overview of MapR's Apache Hadoop Distributionmcsrivas
 
Improving Hadoop Cluster Performance via Linux Configuration
Improving Hadoop Cluster Performance via Linux ConfigurationImproving Hadoop Cluster Performance via Linux Configuration
Improving Hadoop Cluster Performance via Linux ConfigurationAlex Moundalexis
 
Hadoop Backup and Disaster Recovery
Hadoop Backup and Disaster RecoveryHadoop Backup and Disaster Recovery
Hadoop Backup and Disaster RecoveryCloudera, Inc.
 
Distributed Computing with Apache Hadoop: Technology Overview
Distributed Computing with Apache Hadoop: Technology OverviewDistributed Computing with Apache Hadoop: Technology Overview
Distributed Computing with Apache Hadoop: Technology OverviewKonstantin V. Shvachko
 
Hadoop Cluster With High Availability
Hadoop Cluster With High AvailabilityHadoop Cluster With High Availability
Hadoop Cluster With High AvailabilityEdureka!
 
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)Hari Shankar Sreekumar
 
Improving Hadoop Cluster Performance via Linux Configuration
Improving Hadoop Cluster Performance via Linux ConfigurationImproving Hadoop Cluster Performance via Linux Configuration
Improving Hadoop Cluster Performance via Linux ConfigurationDataWorks Summit
 
Hadoop Adventures At Spotify (Strata Conference + Hadoop World 2013)
Hadoop Adventures At Spotify (Strata Conference + Hadoop World 2013)Hadoop Adventures At Spotify (Strata Conference + Hadoop World 2013)
Hadoop Adventures At Spotify (Strata Conference + Hadoop World 2013)Adam Kawa
 
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)VMware Tanzu
 

La actualidad más candente (20)

Troubleshooting Hadoop: Distributed Debugging
Troubleshooting Hadoop: Distributed DebuggingTroubleshooting Hadoop: Distributed Debugging
Troubleshooting Hadoop: Distributed Debugging
 
Bigdata workshop february 2015
Bigdata workshop  february 2015 Bigdata workshop  february 2015
Bigdata workshop february 2015
 
Optimizing your Infrastrucure and Operating System for Hadoop
Optimizing your Infrastrucure and Operating System for HadoopOptimizing your Infrastrucure and Operating System for Hadoop
Optimizing your Infrastrucure and Operating System for Hadoop
 
Geographically Distributed PostgreSQL
Geographically Distributed PostgreSQLGeographically Distributed PostgreSQL
Geographically Distributed PostgreSQL
 
Advanced Hadoop Tuning and Optimization - Hadoop Consulting
Advanced Hadoop Tuning and Optimization - Hadoop ConsultingAdvanced Hadoop Tuning and Optimization - Hadoop Consulting
Advanced Hadoop Tuning and Optimization - Hadoop Consulting
 
Hadoop architecture meetup
Hadoop architecture meetupHadoop architecture meetup
Hadoop architecture meetup
 
Apache Hadoop In Theory And Practice
Apache Hadoop In Theory And PracticeApache Hadoop In Theory And Practice
Apache Hadoop In Theory And Practice
 
Hadoop for Scientific Workloads__HadoopSummit2010
Hadoop for Scientific Workloads__HadoopSummit2010Hadoop for Scientific Workloads__HadoopSummit2010
Hadoop for Scientific Workloads__HadoopSummit2010
 
Hadoop training in hyderabad-kellytechnologies
Hadoop training in hyderabad-kellytechnologiesHadoop training in hyderabad-kellytechnologies
Hadoop training in hyderabad-kellytechnologies
 
Architectural Overview of MapR's Apache Hadoop Distribution
Architectural Overview of MapR's Apache Hadoop DistributionArchitectural Overview of MapR's Apache Hadoop Distribution
Architectural Overview of MapR's Apache Hadoop Distribution
 
Improving Hadoop Cluster Performance via Linux Configuration
Improving Hadoop Cluster Performance via Linux ConfigurationImproving Hadoop Cluster Performance via Linux Configuration
Improving Hadoop Cluster Performance via Linux Configuration
 
Hadoop, Taming Elephants
Hadoop, Taming ElephantsHadoop, Taming Elephants
Hadoop, Taming Elephants
 
Hadoop Backup and Disaster Recovery
Hadoop Backup and Disaster RecoveryHadoop Backup and Disaster Recovery
Hadoop Backup and Disaster Recovery
 
Distributed Computing with Apache Hadoop: Technology Overview
Distributed Computing with Apache Hadoop: Technology OverviewDistributed Computing with Apache Hadoop: Technology Overview
Distributed Computing with Apache Hadoop: Technology Overview
 
Hadoop Cluster With High Availability
Hadoop Cluster With High AvailabilityHadoop Cluster With High Availability
Hadoop Cluster With High Availability
 
February 2014 HUG : Pig On Tez
February 2014 HUG : Pig On TezFebruary 2014 HUG : Pig On Tez
February 2014 HUG : Pig On Tez
 
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
 
Improving Hadoop Cluster Performance via Linux Configuration
Improving Hadoop Cluster Performance via Linux ConfigurationImproving Hadoop Cluster Performance via Linux Configuration
Improving Hadoop Cluster Performance via Linux Configuration
 
Hadoop Adventures At Spotify (Strata Conference + Hadoop World 2013)
Hadoop Adventures At Spotify (Strata Conference + Hadoop World 2013)Hadoop Adventures At Spotify (Strata Conference + Hadoop World 2013)
Hadoop Adventures At Spotify (Strata Conference + Hadoop World 2013)
 
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)
 

Destacado

Hadoop and Genomics - What You Need to Know - London - Viadex RCC - 2015.03.17
Hadoop and Genomics - What You Need to Know - London - Viadex RCC - 2015.03.17Hadoop and Genomics - What You Need to Know - London - Viadex RCC - 2015.03.17
Hadoop and Genomics - What You Need to Know - London - Viadex RCC - 2015.03.17Allen Day, PhD
 
DataEngConf: Uri Laserson (Data Scientist, Cloudera) Scaling up Genomics with...
DataEngConf: Uri Laserson (Data Scientist, Cloudera) Scaling up Genomics with...DataEngConf: Uri Laserson (Data Scientist, Cloudera) Scaling up Genomics with...
DataEngConf: Uri Laserson (Data Scientist, Cloudera) Scaling up Genomics with...Hakka Labs
 
The SKA Project - The World's Largest Streaming Data Processor
The SKA Project - The World's Largest Streaming Data ProcessorThe SKA Project - The World's Largest Streaming Data Processor
The SKA Project - The World's Largest Streaming Data Processorinside-BigData.com
 
DR Benard Fanaroff on the Square Killometre Array (SKA) project
DR  Benard Fanaroff on the Square Killometre Array (SKA) projectDR  Benard Fanaroff on the Square Killometre Array (SKA) project
DR Benard Fanaroff on the Square Killometre Array (SKA) projectAfrican Academy of Sciences
 
Petascale Storage -- Do It Yourself!
Petascale Storage -- Do It Yourself!Petascale Storage -- Do It Yourself!
Petascale Storage -- Do It Yourself!Tim Lossen
 
Big Data, a space adventure - Mario Cartia - Codemotion Rome 2015
Big Data, a space adventure - Mario Cartia - Codemotion Rome 2015Big Data, a space adventure - Mario Cartia - Codemotion Rome 2015
Big Data, a space adventure - Mario Cartia - Codemotion Rome 2015Codemotion
 
Making Hadoop Realtime by Dr. William Bain of Scaleout Software
Making Hadoop Realtime by Dr. William Bain of Scaleout SoftwareMaking Hadoop Realtime by Dr. William Bain of Scaleout Software
Making Hadoop Realtime by Dr. William Bain of Scaleout SoftwareData Con LA
 
Honor Roll Academy
Honor Roll AcademyHonor Roll Academy
Honor Roll Academytheboge
 
High-Performance Networking Use Cases in Life Sciences
High-Performance Networking Use Cases in Life SciencesHigh-Performance Networking Use Cases in Life Sciences
High-Performance Networking Use Cases in Life SciencesAri Berman
 
Turning Petabytes of Data into Profit with Hadoop for the World’s Biggest Ret...
Turning Petabytes of Data into Profit with Hadoop for the World’s Biggest Ret...Turning Petabytes of Data into Profit with Hadoop for the World’s Biggest Ret...
Turning Petabytes of Data into Profit with Hadoop for the World’s Biggest Ret...Cloudera, Inc.
 
Hadoop as a Platform for Genomics - Strata 2015, San Jose
Hadoop as a Platform for Genomics - Strata 2015, San JoseHadoop as a Platform for Genomics - Strata 2015, San Jose
Hadoop as a Platform for Genomics - Strata 2015, San JoseAllen Day, PhD
 
How to Crunch Petabytes with Hadoop and Big Data using InfoSphere BigInsights...
How to Crunch Petabytes with Hadoop and Big Data using InfoSphere BigInsights...How to Crunch Petabytes with Hadoop and Big Data using InfoSphere BigInsights...
How to Crunch Petabytes with Hadoop and Big Data using InfoSphere BigInsights...Vladimir Bacvanski, PhD
 
Jasper Horrell - SKA and Big Data: Up in Space and on the Ground
Jasper Horrell - SKA and Big Data: Up in Space and on the GroundJasper Horrell - SKA and Big Data: Up in Space and on the Ground
Jasper Horrell - SKA and Big Data: Up in Space and on the GroundSaratoga
 
Big Data Analytic with Hadoop: Customer Stories
Big Data Analytic with Hadoop: Customer StoriesBig Data Analytic with Hadoop: Customer Stories
Big Data Analytic with Hadoop: Customer StoriesYellowfin
 
Architecting for Massive Scalability - St. Louis Day of .NET 2011 - Aug 6, 2011
Architecting for Massive Scalability - St. Louis Day of .NET 2011 - Aug 6, 2011Architecting for Massive Scalability - St. Louis Day of .NET 2011 - Aug 6, 2011
Architecting for Massive Scalability - St. Louis Day of .NET 2011 - Aug 6, 2011Eric D. Boyd
 
Chicago finance-big-data
Chicago finance-big-dataChicago finance-big-data
Chicago finance-big-dataTed Dunning
 
Hadoop ecosystem framework n hadoop in live environment
Hadoop ecosystem framework  n hadoop in live environmentHadoop ecosystem framework  n hadoop in live environment
Hadoop ecosystem framework n hadoop in live environmentDelhi/NCR HUG
 
Scaling up Business Intelligence from the scratch and to 15 countries worldwi...
Scaling up Business Intelligence from the scratch and to 15 countries worldwi...Scaling up Business Intelligence from the scratch and to 15 countries worldwi...
Scaling up Business Intelligence from the scratch and to 15 countries worldwi...Sergii Khomenko
 
Hadoop and Genomics - What you need to know - 2015.04.09 - Shenzhen - BGI
Hadoop and Genomics - What you need to know - 2015.04.09 - Shenzhen - BGIHadoop and Genomics - What you need to know - 2015.04.09 - Shenzhen - BGI
Hadoop and Genomics - What you need to know - 2015.04.09 - Shenzhen - BGIAllen Day, PhD
 
Petabyte scale on commodity infrastructure
Petabyte scale on commodity infrastructurePetabyte scale on commodity infrastructure
Petabyte scale on commodity infrastructureelliando dias
 

Destacado (20)

Hadoop and Genomics - What You Need to Know - London - Viadex RCC - 2015.03.17
Hadoop and Genomics - What You Need to Know - London - Viadex RCC - 2015.03.17Hadoop and Genomics - What You Need to Know - London - Viadex RCC - 2015.03.17
Hadoop and Genomics - What You Need to Know - London - Viadex RCC - 2015.03.17
 
DataEngConf: Uri Laserson (Data Scientist, Cloudera) Scaling up Genomics with...
DataEngConf: Uri Laserson (Data Scientist, Cloudera) Scaling up Genomics with...DataEngConf: Uri Laserson (Data Scientist, Cloudera) Scaling up Genomics with...
DataEngConf: Uri Laserson (Data Scientist, Cloudera) Scaling up Genomics with...
 
The SKA Project - The World's Largest Streaming Data Processor
The SKA Project - The World's Largest Streaming Data ProcessorThe SKA Project - The World's Largest Streaming Data Processor
The SKA Project - The World's Largest Streaming Data Processor
 
DR Benard Fanaroff on the Square Killometre Array (SKA) project
DR  Benard Fanaroff on the Square Killometre Array (SKA) projectDR  Benard Fanaroff on the Square Killometre Array (SKA) project
DR Benard Fanaroff on the Square Killometre Array (SKA) project
 
Petascale Storage -- Do It Yourself!
Petascale Storage -- Do It Yourself!Petascale Storage -- Do It Yourself!
Petascale Storage -- Do It Yourself!
 
Big Data, a space adventure - Mario Cartia - Codemotion Rome 2015
Big Data, a space adventure - Mario Cartia - Codemotion Rome 2015Big Data, a space adventure - Mario Cartia - Codemotion Rome 2015
Big Data, a space adventure - Mario Cartia - Codemotion Rome 2015
 
Making Hadoop Realtime by Dr. William Bain of Scaleout Software
Making Hadoop Realtime by Dr. William Bain of Scaleout SoftwareMaking Hadoop Realtime by Dr. William Bain of Scaleout Software
Making Hadoop Realtime by Dr. William Bain of Scaleout Software
 
Honor Roll Academy
Honor Roll AcademyHonor Roll Academy
Honor Roll Academy
 
High-Performance Networking Use Cases in Life Sciences
High-Performance Networking Use Cases in Life SciencesHigh-Performance Networking Use Cases in Life Sciences
High-Performance Networking Use Cases in Life Sciences
 
Turning Petabytes of Data into Profit with Hadoop for the World’s Biggest Ret...
Turning Petabytes of Data into Profit with Hadoop for the World’s Biggest Ret...Turning Petabytes of Data into Profit with Hadoop for the World’s Biggest Ret...
Turning Petabytes of Data into Profit with Hadoop for the World’s Biggest Ret...
 
Hadoop as a Platform for Genomics - Strata 2015, San Jose
Hadoop as a Platform for Genomics - Strata 2015, San JoseHadoop as a Platform for Genomics - Strata 2015, San Jose
Hadoop as a Platform for Genomics - Strata 2015, San Jose
 
How to Crunch Petabytes with Hadoop and Big Data using InfoSphere BigInsights...
How to Crunch Petabytes with Hadoop and Big Data using InfoSphere BigInsights...How to Crunch Petabytes with Hadoop and Big Data using InfoSphere BigInsights...
How to Crunch Petabytes with Hadoop and Big Data using InfoSphere BigInsights...
 
Jasper Horrell - SKA and Big Data: Up in Space and on the Ground
Jasper Horrell - SKA and Big Data: Up in Space and on the GroundJasper Horrell - SKA and Big Data: Up in Space and on the Ground
Jasper Horrell - SKA and Big Data: Up in Space and on the Ground
 
Big Data Analytic with Hadoop: Customer Stories
Big Data Analytic with Hadoop: Customer StoriesBig Data Analytic with Hadoop: Customer Stories
Big Data Analytic with Hadoop: Customer Stories
 
Architecting for Massive Scalability - St. Louis Day of .NET 2011 - Aug 6, 2011
Architecting for Massive Scalability - St. Louis Day of .NET 2011 - Aug 6, 2011Architecting for Massive Scalability - St. Louis Day of .NET 2011 - Aug 6, 2011
Architecting for Massive Scalability - St. Louis Day of .NET 2011 - Aug 6, 2011
 
Chicago finance-big-data
Chicago finance-big-dataChicago finance-big-data
Chicago finance-big-data
 
Hadoop ecosystem framework n hadoop in live environment
Hadoop ecosystem framework  n hadoop in live environmentHadoop ecosystem framework  n hadoop in live environment
Hadoop ecosystem framework n hadoop in live environment
 
Scaling up Business Intelligence from the scratch and to 15 countries worldwi...
Scaling up Business Intelligence from the scratch and to 15 countries worldwi...Scaling up Business Intelligence from the scratch and to 15 countries worldwi...
Scaling up Business Intelligence from the scratch and to 15 countries worldwi...
 
Hadoop and Genomics - What you need to know - 2015.04.09 - Shenzhen - BGI
Hadoop and Genomics - What you need to know - 2015.04.09 - Shenzhen - BGIHadoop and Genomics - What you need to know - 2015.04.09 - Shenzhen - BGI
Hadoop and Genomics - What you need to know - 2015.04.09 - Shenzhen - BGI
 
Petabyte scale on commodity infrastructure
Petabyte scale on commodity infrastructurePetabyte scale on commodity infrastructure
Petabyte scale on commodity infrastructure
 

Similar a BDM37: Hadoop in production – the war stories by Nikolaï Grigoriev, Principal Software Engineer, SociableLabs

Big Data and Hadoop in Cloud - Leveraging Amazon EMR
Big Data and Hadoop in Cloud - Leveraging Amazon EMRBig Data and Hadoop in Cloud - Leveraging Amazon EMR
Big Data and Hadoop in Cloud - Leveraging Amazon EMRVijay Rayapati
 
Next Generation Hadoop Operations
Next Generation Hadoop OperationsNext Generation Hadoop Operations
Next Generation Hadoop OperationsOwen O'Malley
 
Architecting and productionising data science applications at scale
Architecting and productionising data science applications at scaleArchitecting and productionising data science applications at scale
Architecting and productionising data science applications at scalesamthemonad
 
Infrastructure Around Hadoop
Infrastructure Around HadoopInfrastructure Around Hadoop
Infrastructure Around HadoopDataWorks Summit
 
Redis Developers Day 2014 - Redis Labs Talks
Redis Developers Day 2014 - Redis Labs TalksRedis Developers Day 2014 - Redis Labs Talks
Redis Developers Day 2014 - Redis Labs TalksRedis Labs
 
Apache hadoop, hdfs and map reduce Overview
Apache hadoop, hdfs and map reduce OverviewApache hadoop, hdfs and map reduce Overview
Apache hadoop, hdfs and map reduce OverviewNisanth Simon
 
Challenges in Large Scale Machine Learning
Challenges in Large Scale  Machine LearningChallenges in Large Scale  Machine Learning
Challenges in Large Scale Machine LearningSudarsun Santhiappan
 
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...Reynold Xin
 
Introduction to Hadoop and Big Data Processing
Introduction to Hadoop and Big Data ProcessingIntroduction to Hadoop and Big Data Processing
Introduction to Hadoop and Big Data ProcessingSam Ng
 
Track B-3 解構大數據架構 - 大數據系統的伺服器與網路資源規劃
Track B-3 解構大數據架構 - 大數據系統的伺服器與網路資源規劃Track B-3 解構大數據架構 - 大數據系統的伺服器與網路資源規劃
Track B-3 解構大數據架構 - 大數據系統的伺服器與網路資源規劃Etu Solution
 
Seminar Presentation Hadoop
Seminar Presentation HadoopSeminar Presentation Hadoop
Seminar Presentation HadoopVarun Narang
 
Web-scale data processing: practical approaches for low-latency and batch
Web-scale data processing: practical approaches for low-latency and batchWeb-scale data processing: practical approaches for low-latency and batch
Web-scale data processing: practical approaches for low-latency and batchEdward Capriolo
 
Chicago Data Summit: Keynote - Data Processing with Hadoop: Scalable and Cost...
Chicago Data Summit: Keynote - Data Processing with Hadoop: Scalable and Cost...Chicago Data Summit: Keynote - Data Processing with Hadoop: Scalable and Cost...
Chicago Data Summit: Keynote - Data Processing with Hadoop: Scalable and Cost...Cloudera, Inc.
 
Hadoop at Yahoo! -- University Talks
Hadoop at Yahoo! -- University TalksHadoop at Yahoo! -- University Talks
Hadoop at Yahoo! -- University Talksyhadoop
 
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010Bhupesh Bansal
 

Similar a BDM37: Hadoop in production – the war stories by Nikolaï Grigoriev, Principal Software Engineer, SociableLabs (20)

Big Data and Hadoop in Cloud - Leveraging Amazon EMR
Big Data and Hadoop in Cloud - Leveraging Amazon EMRBig Data and Hadoop in Cloud - Leveraging Amazon EMR
Big Data and Hadoop in Cloud - Leveraging Amazon EMR
 
Training
TrainingTraining
Training
 
Next Generation Hadoop Operations
Next Generation Hadoop OperationsNext Generation Hadoop Operations
Next Generation Hadoop Operations
 
Architecting and productionising data science applications at scale
Architecting and productionising data science applications at scaleArchitecting and productionising data science applications at scale
Architecting and productionising data science applications at scale
 
Infrastructure Around Hadoop
Infrastructure Around HadoopInfrastructure Around Hadoop
Infrastructure Around Hadoop
 
Redis Developers Day 2014 - Redis Labs Talks
Redis Developers Day 2014 - Redis Labs TalksRedis Developers Day 2014 - Redis Labs Talks
Redis Developers Day 2014 - Redis Labs Talks
 
Big data nyu
Big data nyuBig data nyu
Big data nyu
 
Apache hadoop, hdfs and map reduce Overview
Apache hadoop, hdfs and map reduce OverviewApache hadoop, hdfs and map reduce Overview
Apache hadoop, hdfs and map reduce Overview
 
Challenges in Large Scale Machine Learning
Challenges in Large Scale  Machine LearningChallenges in Large Scale  Machine Learning
Challenges in Large Scale Machine Learning
 
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...
 
Introduction to Hadoop and Big Data Processing
Introduction to Hadoop and Big Data ProcessingIntroduction to Hadoop and Big Data Processing
Introduction to Hadoop and Big Data Processing
 
Track B-3 解構大數據架構 - 大數據系統的伺服器與網路資源規劃
Track B-3 解構大數據架構 - 大數據系統的伺服器與網路資源規劃Track B-3 解構大數據架構 - 大數據系統的伺服器與網路資源規劃
Track B-3 解構大數據架構 - 大數據系統的伺服器與網路資源規劃
 
2014 hadoop wrocław jug
2014 hadoop   wrocław jug2014 hadoop   wrocław jug
2014 hadoop wrocław jug
 
Seminar Presentation Hadoop
Seminar Presentation HadoopSeminar Presentation Hadoop
Seminar Presentation Hadoop
 
Hadoop seminar
Hadoop seminarHadoop seminar
Hadoop seminar
 
Hadoop
HadoopHadoop
Hadoop
 
Web-scale data processing: practical approaches for low-latency and batch
Web-scale data processing: practical approaches for low-latency and batchWeb-scale data processing: practical approaches for low-latency and batch
Web-scale data processing: practical approaches for low-latency and batch
 
Chicago Data Summit: Keynote - Data Processing with Hadoop: Scalable and Cost...
Chicago Data Summit: Keynote - Data Processing with Hadoop: Scalable and Cost...Chicago Data Summit: Keynote - Data Processing with Hadoop: Scalable and Cost...
Chicago Data Summit: Keynote - Data Processing with Hadoop: Scalable and Cost...
 
Hadoop at Yahoo! -- University Talks
Hadoop at Yahoo! -- University TalksHadoop at Yahoo! -- University Talks
Hadoop at Yahoo! -- University Talks
 
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
 

Último

Ronisha Informatics Private Limited Catalogue
Ronisha Informatics Private Limited CatalogueRonisha Informatics Private Limited Catalogue
Ronisha Informatics Private Limited Catalogueitservices996
 
SAM Training Session - How to use EXCEL ?
SAM Training Session - How to use EXCEL ?SAM Training Session - How to use EXCEL ?
SAM Training Session - How to use EXCEL ?Alexandre Beguel
 
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full Recording
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full RecordingOpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full Recording
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full RecordingShane Coughlan
 
Understanding Flamingo - DeepMind's VLM Architecture
Understanding Flamingo - DeepMind's VLM ArchitectureUnderstanding Flamingo - DeepMind's VLM Architecture
Understanding Flamingo - DeepMind's VLM Architecturerahul_net
 
Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...
Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...
Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...OnePlan Solutions
 
eSoftTools IMAP Backup Software and migration tools
eSoftTools IMAP Backup Software and migration toolseSoftTools IMAP Backup Software and migration tools
eSoftTools IMAP Backup Software and migration toolsosttopstonverter
 
2024-04-09 - From Complexity to Clarity - AWS Summit AMS.pdf
2024-04-09 - From Complexity to Clarity - AWS Summit AMS.pdf2024-04-09 - From Complexity to Clarity - AWS Summit AMS.pdf
2024-04-09 - From Complexity to Clarity - AWS Summit AMS.pdfAndrey Devyatkin
 
What’s New in VictoriaMetrics: Q1 2024 Updates
What’s New in VictoriaMetrics: Q1 2024 UpdatesWhat’s New in VictoriaMetrics: Q1 2024 Updates
What’s New in VictoriaMetrics: Q1 2024 UpdatesVictoriaMetrics
 
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdf
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdfEnhancing Supply Chain Visibility with Cargo Cloud Solutions.pdf
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdfRTS corp
 
VictoriaMetrics Q1 Meet Up '24 - Community & News Update
VictoriaMetrics Q1 Meet Up '24 - Community & News UpdateVictoriaMetrics Q1 Meet Up '24 - Community & News Update
VictoriaMetrics Q1 Meet Up '24 - Community & News UpdateVictoriaMetrics
 
Zer0con 2024 final share short version.pdf
Zer0con 2024 final share short version.pdfZer0con 2024 final share short version.pdf
Zer0con 2024 final share short version.pdfmaor17
 
Introduction to Firebase Workshop Slides
Introduction to Firebase Workshop SlidesIntroduction to Firebase Workshop Slides
Introduction to Firebase Workshop Slidesvaideheekore1
 
[ CNCF Q1 2024 ] Intro to Continuous Profiling and Grafana Pyroscope.pdf
[ CNCF Q1 2024 ] Intro to Continuous Profiling and Grafana Pyroscope.pdf[ CNCF Q1 2024 ] Intro to Continuous Profiling and Grafana Pyroscope.pdf
[ CNCF Q1 2024 ] Intro to Continuous Profiling and Grafana Pyroscope.pdfSteve Caron
 
Strategies for using alternative queries to mitigate zero results
Strategies for using alternative queries to mitigate zero resultsStrategies for using alternative queries to mitigate zero results
Strategies for using alternative queries to mitigate zero resultsJean Silva
 
Advantages of Cargo Cloud Solutions.pptx
Advantages of Cargo Cloud Solutions.pptxAdvantages of Cargo Cloud Solutions.pptx
Advantages of Cargo Cloud Solutions.pptxRTS corp
 
2024 DevNexus Patterns for Resiliency: Shuffle shards
2024 DevNexus Patterns for Resiliency: Shuffle shards2024 DevNexus Patterns for Resiliency: Shuffle shards
2024 DevNexus Patterns for Resiliency: Shuffle shardsChristopher Curtin
 
Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...
Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...
Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...OnePlan Solutions
 
Keeping your build tool updated in a multi repository world
Keeping your build tool updated in a multi repository worldKeeping your build tool updated in a multi repository world
Keeping your build tool updated in a multi repository worldRoberto Pérez Alcolea
 
Understanding Plagiarism: Causes, Consequences and Prevention.pptx
Understanding Plagiarism: Causes, Consequences and Prevention.pptxUnderstanding Plagiarism: Causes, Consequences and Prevention.pptx
Understanding Plagiarism: Causes, Consequences and Prevention.pptxSasikiranMarri
 
Large Language Models for Test Case Evolution and Repair
Large Language Models for Test Case Evolution and RepairLarge Language Models for Test Case Evolution and Repair
Large Language Models for Test Case Evolution and RepairLionel Briand
 

Último (20)

Ronisha Informatics Private Limited Catalogue
Ronisha Informatics Private Limited CatalogueRonisha Informatics Private Limited Catalogue
Ronisha Informatics Private Limited Catalogue
 
SAM Training Session - How to use EXCEL ?
SAM Training Session - How to use EXCEL ?SAM Training Session - How to use EXCEL ?
SAM Training Session - How to use EXCEL ?
 
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full Recording
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full RecordingOpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full Recording
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full Recording
 
Understanding Flamingo - DeepMind's VLM Architecture
Understanding Flamingo - DeepMind's VLM ArchitectureUnderstanding Flamingo - DeepMind's VLM Architecture
Understanding Flamingo - DeepMind's VLM Architecture
 
Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...
Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...
Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...
 
eSoftTools IMAP Backup Software and migration tools
eSoftTools IMAP Backup Software and migration toolseSoftTools IMAP Backup Software and migration tools
eSoftTools IMAP Backup Software and migration tools
 
2024-04-09 - From Complexity to Clarity - AWS Summit AMS.pdf
2024-04-09 - From Complexity to Clarity - AWS Summit AMS.pdf2024-04-09 - From Complexity to Clarity - AWS Summit AMS.pdf
2024-04-09 - From Complexity to Clarity - AWS Summit AMS.pdf
 
What’s New in VictoriaMetrics: Q1 2024 Updates
What’s New in VictoriaMetrics: Q1 2024 UpdatesWhat’s New in VictoriaMetrics: Q1 2024 Updates
What’s New in VictoriaMetrics: Q1 2024 Updates
 
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdf
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdfEnhancing Supply Chain Visibility with Cargo Cloud Solutions.pdf
Enhancing Supply Chain Visibility with Cargo Cloud Solutions.pdf
 
VictoriaMetrics Q1 Meet Up '24 - Community & News Update
VictoriaMetrics Q1 Meet Up '24 - Community & News UpdateVictoriaMetrics Q1 Meet Up '24 - Community & News Update
VictoriaMetrics Q1 Meet Up '24 - Community & News Update
 
Zer0con 2024 final share short version.pdf
Zer0con 2024 final share short version.pdfZer0con 2024 final share short version.pdf
Zer0con 2024 final share short version.pdf
 
Introduction to Firebase Workshop Slides
Introduction to Firebase Workshop SlidesIntroduction to Firebase Workshop Slides
Introduction to Firebase Workshop Slides
 
[ CNCF Q1 2024 ] Intro to Continuous Profiling and Grafana Pyroscope.pdf
[ CNCF Q1 2024 ] Intro to Continuous Profiling and Grafana Pyroscope.pdf[ CNCF Q1 2024 ] Intro to Continuous Profiling and Grafana Pyroscope.pdf
[ CNCF Q1 2024 ] Intro to Continuous Profiling and Grafana Pyroscope.pdf
 
Strategies for using alternative queries to mitigate zero results
Strategies for using alternative queries to mitigate zero resultsStrategies for using alternative queries to mitigate zero results
Strategies for using alternative queries to mitigate zero results
 
Advantages of Cargo Cloud Solutions.pptx
Advantages of Cargo Cloud Solutions.pptxAdvantages of Cargo Cloud Solutions.pptx
Advantages of Cargo Cloud Solutions.pptx
 
2024 DevNexus Patterns for Resiliency: Shuffle shards
2024 DevNexus Patterns for Resiliency: Shuffle shards2024 DevNexus Patterns for Resiliency: Shuffle shards
2024 DevNexus Patterns for Resiliency: Shuffle shards
 
Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...
Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...
Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...
 
Keeping your build tool updated in a multi repository world
Keeping your build tool updated in a multi repository worldKeeping your build tool updated in a multi repository world
Keeping your build tool updated in a multi repository world
 
Understanding Plagiarism: Causes, Consequences and Prevention.pptx
Understanding Plagiarism: Causes, Consequences and Prevention.pptxUnderstanding Plagiarism: Causes, Consequences and Prevention.pptx
Understanding Plagiarism: Causes, Consequences and Prevention.pptx
 
Large Language Models for Test Case Evolution and Repair
Large Language Models for Test Case Evolution and RepairLarge Language Models for Test Case Evolution and Repair
Large Language Models for Test Case Evolution and Repair
 

BDM37: Hadoop in production – the war stories by Nikolaï Grigoriev, Principal Software Engineer, SociableLabs

  • 1. Hadoop – The War Stories Running Hadoop in large enterprise environment Nikolai Grigoriev (ngrigoriev@gmail.com, @nikgrig) Principal Software Engineer, http://sociablelabs.com
  • 2. Agenda ● Why Hadoop? ● Planning Hadoop deployment ● Hadoop and read hardware ● Understanding the software stack ● Tuning HDFS, MapReduce and HBase ● Troubleshooting examples ● Testing your applications Disclaimer: this presentation is based on the combined work experience from more than one company and represents the author's personal point of view on the problems discussed in it.
  • 3. Why Hadoop (and why have we decided to use it)? ● Need to store hundreds of Tb of info ● Need to process it in parallel ● Desire to have both storage and processing horizontally scalable ● Having and open-source platform with commercial support
  • 4. Our application Application servers (many :) ) Log processors “ETL process”
  • 5. Our application in numbers ● Thousands of user sessions per second ● Average session log size: ~30Kb, 3-7 events per log ● Target retention period – at least ~90 days ● Redundancy and HA everywhere ● Pluggable “ETL” modules for additional data processing
  • 6. Main problem Team had no practical knowledge of Hadoop, HDFS and HBase… ...and there was nobody at the company to help
  • 7. But we did not realize... It was not THE ONLY problem we were about to face!
  • 8. First fight – capacity planning ● Tons of articles are written about Hadoop capacity planning ● Architects may be spending months making educated guesses ● Capacity planning is really about finding the amount of $$$ to be spent on your cluster for target workload – If we had infinite amount of $$$ why would we bother at all? ;)
  • 10. It is all about the balance ● Your Hadoop cluster and your apps use all these resources at different time ● Over-provisioning of one of the resources usually lead to the shortage of another one - wasted $$$
  • 11. What can we say about an app? ● It is going to store X Tb of data – Amount of storage (not to forget the RF!) – Accommodate for growth and failures ● It is going to ingest the data at Y Mb/s – Your network speed and number of nodes ● Latency – More HDDs and faster HDDs – More RAM – More nodes
  • 12. We are big enterprise... Geeky Hadoop developer Old School Senior IT Guy - many “commodity+” hosts - good but inexpensive networking - more regular HDDs - lots of RAM - I also love cloud… - my recent OS - my software configuration - simple network SANs, RAIDs, SCSI, racks, Blades, redundancy, Cisco, HP, fiber optics, 4-year-old rock-solid RHEL, SNMP monitoring… what? I am the Boss...
  • 13. Hadoop cluster vs. old school application servers ● Mostly identical “commodity+” machines – Probably with the exception of NN, JT ● Better to have more simpler machines than fewer monster ones ● No RAID, just JBOD! ● Ethernet depending on the storage density, bonded 1Gbit may be enough ● Hadoop achieves with software what used to be achievable with [expensive!] hardware
  • 14. But still, your application is the driver, not the IT guy! From Cloudera website – Hadoop machine configuration according to workload
  • 15. Your job is: ● Educate your IT, get them on your side or at least earn their trust ● Try to build a capacity planning spreadsheet based on what you do know ● Apply common sense to guess what you do not know ● ...and plan a decent buffer ● Set reasonable performance targets for your application
  • 16. Fight #2 – OMG, our application is slow!!! ● Main part of our application was the MR job merging the logs ● We have committed to deliver X logs/sec on a target test cluster with sample workload ● We were delivering like ~30% of that ● ...weeks before release :) ● ...and we have ran out of other excuses :( ● It was clearly our software and/or configuration
  • 17. Wait a second – we have support contract from Hadoop vendor! ● I mean no disrespect to the vendors! ● But they do not know your application ● And they do not know your hardware ● And they do not know exactly your OS ● And they do not know your network equipment ● They can help you with some tuning, they can help you with bugs and crashes – but they won't be able (or sometimes simply qualified) to do your job!
  • 18. We are on our own :( ● We have realized that our testing methods were not adequate to Hadoop-based ETL process ● Testing the product end-to-end was too difficult, tracking changes was impossible ● Turn-around was too long, we could not try something quickly and revert back ● Observing and monitoring the live system with dummy incoming data was not productive enough
  • 19. Key to successful testing ● Representative data set ● Ability to repeat the same operation as many times as needed with quick turnaround ● Each engineer had to be able to run the tests and try something ● Establishing the key metrics you monitor and try to improve ● Methodological approach – analyze, change, test, be ready to roll back
  • 20. Our “reference runner” Large sample dataset “Reset” tool Runner tool Statistics Recreates HBase tables (predefined regions), cleans HDFS etc Injects the test data, prepares the environment, launches the MR job like real application, allows to quickly rebuild and redeploy the part of the application Any improvements since last run? Manager
  • 21. Tuning results ● In two weeks we had the job that worked about 3 times faster ● Tuning was done everywhere – from OS to Hadoop/HBase and our code ● We were confident that the software was ready to go to production ● During following 2 years later we realized how bad was our design and how it should have been done ;)
  • 22. Hadoop MapReduce DOs ● Think processes, not threads ● Reusable objects, lower GC overhead ● Snappy data compression is generally good ● Reasonable use of counters provides important information ● For frequently running jobs, distributed cache helps a lot ● Minimize disk I/O (spills etc), RAM is cheap ● Avoid unnecessary serialization/deserialization
• 23. Hadoop MapReduce DON'Ts
● Small files in HDFS
● Multithreaded programming inside a mapper/reducer
● Fat tasks using too much heap
● Any I/O in map-reduce other than HDFS, ZK or HBase
● Over-complicated code (simple things work better)
• 24. Fight #3 – Going to Production!
● Remember the slide about engineer vs. IT God preferences ;)
● The production hardware was slightly different from the test cluster
● The cluster had been deployed by people who did not know Hadoop
● The first attempt to run the software resulted in a major failure, and the cluster was finally handed over to the developers for fixing ;)
• 25. Production hardware
● HP blade servers, 32 cores, 128GB of RAM
● Emulex dual-port 10G Ethernet NICs
● 14 HDDs per machine
● OEL 6.3
● 10G switch modules
● Company hosting center with dedicated networking and operations staff
• 26. Step back – a 10,000 ft look at the Hadoop stack
The stack, bottom to top: Hardware → BIOS/Firmware(s) → BIOS/Firmware settings → OS (Linux) → Java (JVM) → Hadoop services → Your application(s) – with the network tying it all together
- Hadoop is not just a bunch of Java apps
- It is a data and application platform
- It can run well, just run, barely run, or cause constant headache – depending on how much love it receives :)
• 27. Hadoop stack (continued)
● In Hadoop, a small problem – sometimes even on a single node – can be a major pain
● Isolating and finding that small problem may be difficult
● Symptoms are often obvious only at a high level (e.g. the application)
● Complex hardware (like HP blades) adds more potential problems
• 28. Example of one of the problems we had initially
● Jobs were failing because of timeouts
● Numerous I/O errors were observed in the job and HDFS logs
● This simple test was failing:
$ dd if=/dev/zero of=test8Gb.bin bs=1M count=8192
$ time hdfs dfs -copyFromLocal test8Gb.bin /
Zzz..zzz...zzz...5min...zzz…
real 4m10.002s
user 0m15.130s
sys 0m4.094s
● IT was clueless but did not really bother
● In fact, 8192Mb / (4 * 60 + 10)s = ~32Mb/s (!?!?!)
● (A healthy 10G network transfers to HDFS at ~160Mb/s or better)
• 29. The role of HDFS in Hadoop
● In Hadoop, HDFS is the key layer that provides distributed filesystem services to the other components
● The health of HDFS directly (and drastically) affects the health of the other components
(Diagram: Map-Reduce and HBase sit on top of HDFS, which holds the data)
• 30. So, clearly HDFS was the problem
● But what exactly was the problem with HDFS?
● How exactly does HDFS writing work? (The client streams each block to the first DataNode, which forwards it to the second, which forwards it to the third – a replication pipeline, so one weak link slows down the entire write)
• 31. Chasing it down
● Due to node-to-node streaming, it was difficult to understand which node was responsible
● The theory of “one bad node in the pipeline” was ruled out, as results were consistently bad across the cluster of 14 nodes
● Idea (isolating the problem is good) – vary the length of the replication pipeline:
$ time hdfs dfs -Ddfs.replication=1 -copyFromLocal test8Gb.bin /
real 0m42.002s
$ time hdfs dfs -Ddfs.replication=2 -copyFromLocal test8Gb.bin /
real 2m53.184s
$ time hdfs dfs -Ddfs.replication=3 -copyFromLocal test8Gb.bin /
real 3m41.072s
● 8192/42 = ~195Mb/s – hmmm…. a single-replica write was fast, so the slowdown appeared only when the pipeline spanned multiple nodes, pointing at the network
• 32. Discoveries
● To make an even longer story short...
– A bug in the CUBIC TCP congestion-control algorithm in the Linux kernel
– The NIC firmware was too old
– The kernel driver for the Emulex 10G NICs was too old
– Only one out of 8 NIC RX queues was enabled on some hosts
– A number of network settings were not appropriate for a 10G network
– The “irqbalance” process (due to a kernel bug) was locking NIC RX queues by “losing” NIC IRQ handlers
– ...
• 33. More discoveries
– Nodes were set up multi-homed, which even HDFS did not support at that time
– Misconfigured DNS and reverse DNS
● On the disk I/O side
– Bad filesystem parameters
– Read-ahead settings were wrong
– The disk controller firmware was old
(a few quick diagnostic checks follow below)
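A few node-level checks in the spirit of what finally exposed these problems (a sketch – eth0 is a placeholder for your actual interface, and the IP below is hypothetical):
$ ethtool -i eth0                    # driver and firmware versions of the NIC
$ ls /sys/class/net/eth0/queues/     # how many RX/TX queues are actually enabled
$ grep eth0 /proc/interrupts         # are the NIC IRQs spread across CPUs?
$ host $(hostname); host 10.1.2.3    # forward and reverse DNS must agree
None of these are Hadoop tools, which is exactly the point: most of the damage was below Hadoop.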
• 34. The HDFS “litmus” test – TestDFSIO (the command producing such a run is sketched after the output)
13/03/13 16:30:02 INFO fs.TestDFSIO: ----- TestDFSIO ----- : write
13/03/13 16:30:02 INFO fs.TestDFSIO: Date & time: Wed Mar 13 16:30:02 UTC 2013
13/03/13 16:30:02 INFO fs.TestDFSIO: Number of files: 16
13/03/13 16:30:02 INFO fs.TestDFSIO: Total MBytes processed: 160000.0
13/03/13 16:30:02 INFO fs.TestDFSIO: Throughput mb/sec: 103.42190773343779
13/03/13 16:30:02 INFO fs.TestDFSIO: Average IO rate mb/sec: 103.61066436767578
13/03/13 16:30:02 INFO fs.TestDFSIO: IO rate std deviation: 4.513343367320971
13/03/13 16:30:02 INFO fs.TestDFSIO: Test exec time sec: 114.876
13/03/13 16:31:31 INFO fs.TestDFSIO: ----- TestDFSIO ----- : read
13/03/13 16:31:31 INFO fs.TestDFSIO: Date & time: Wed Mar 13 16:31:31 UTC 2013
13/03/13 16:31:31 INFO fs.TestDFSIO: Number of files: 16
13/03/13 16:31:31 INFO fs.TestDFSIO: Total MBytes processed: 160000.0
13/03/13 16:31:31 INFO fs.TestDFSIO: Throughput mb/sec: 586.8243268024676
13/03/13 16:31:31 INFO fs.TestDFSIO: Average IO rate mb/sec: 648.8555908203125
13/03/13 16:31:31 INFO fs.TestDFSIO: IO rate std deviation: 267.0954600161208
13/03/13 16:31:31 INFO fs.TestDFSIO: Test exec time sec: 33.683
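The numbers above come from the stock TestDFSIO benchmark that ships with Hadoop. An invocation matching this run (16 files of 10,000MB each) looks roughly like the following – the jar name and location vary by distribution and version, so treat the path as an assumption:
$ hadoop jar hadoop-*test*.jar TestDFSIO -write -nrFiles 16 -fileSize 10000
$ hadoop jar hadoop-*test*.jar TestDFSIO -read -nrFiles 16 -fileSize 10000
$ hadoop jar hadoop-*test*.jar TestDFSIO -clean    # remove the benchmark output afterwards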
• 35. Fight #4 – tuning Hadoop
● Why do people tune things (IT was not interested ;) )?
● With your own expensive hardware you want the maximum IOPS and CPU power for the $$$ you have paid
● Not to mention that you simply want your apps to run faster
● Tuning is an endless process, but the 80/20 rule works perfectly
• 36. Even before you have something to tune…
● Pick reasonably good hardware but do not go high-end
● Same for the network equipment
● Hadoop scales well, and redundancy is achieved in software
● More nodes is almost always better than extra node power and/or storage space
● Simpler systems are easier to tune, maintain and troubleshoot
● Use different machines for the master nodes
• 37. Tuning the hardware and BIOS
● Update the BIOS and firmware to recent versions
● Disable dynamic CPU frequency scaling (example below)
● Tune the memory speed and power profile
● Tune the disk controller and the disk cache
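As one example, the CPU frequency governor can be checked and pinned from Linux when the BIOS leaves it OS-controlled (a sketch – on some platforms this must be done in the BIOS itself, and the cpufreq sysfs tree only exists when scaling is enabled):
$ cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
ondemand
$ # pin every core to full speed (as root)
$ for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do echo performance > $g; done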
• 38. OS tuning
● Pick the filesystem (ext3, ext4, XFS...), its parameters (0% reserved blocks) and mount options (noatime, nodiratime, barriers etc.)
● Pick the I/O scheduler depending on your disks and tasks
● Read-ahead settings
● Disable swap!
● irqbalance for big machines
● Tune other parameters (number of FDs, sockets)
● Install the major troubleshooting tools (iostat, iotop, tcpdump, strace…) on every node
(examples below)
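Several of these items reduce to one-liners of the following kind (a sketch – /dev/sdb and /data1 are placeholders, and the right scheduler and read-ahead values depend on your disks and workload):
$ tune2fs -m 0 /dev/sdb1                           # ext4: reclaim the 5% root reserve for data
$ mount -o remount,noatime,nodiratime /data1       # stop rewriting atimes on every read
$ echo deadline > /sys/block/sdb/queue/scheduler   # often beats cfq on JBOD data disks
$ blockdev --setra 1024 /dev/sdb                   # larger read-ahead for sequential scans
$ sysctl -w vm.swappiness=0                        # strongly discourage swapping JVM heaps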
• 39. Network tuning
● Test your TCP performance with iperf, ttcp or any other tool you like
● Know your NICs well, install the right firmware and kernel modules
● Tune your TCP and IP parameters (work harder if you have an expensive 10G network)
● If your NIC supports TCP offload and it works – use it
● txqueuelen, MTU 9000 (if appropriate) – HDFS is chatty
● Learn ethtool and see what it can do for you
● The basic IP networking setup (DNS etc.) has to be 100% perfect
(examples below)
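On 10G, the kernel defaults of that era were sized for much slower links; tuning went along these lines (a sketch – derive the buffer sizes from your own bandwidth-delay product rather than copying them, and the peer hostname is hypothetical):
$ sysctl -w net.core.rmem_max=16777216
$ sysctl -w net.core.wmem_max=16777216
$ sysctl -w net.ipv4.tcp_rmem="4096 87380 16777216"
$ sysctl -w net.ipv4.tcp_wmem="4096 65536 16777216"
$ ip link set eth0 txqueuelen 10000    # deeper TX queue for a 10G NIC
$ ethtool -k eth0                      # check which offloads (TSO/GSO/GRO...) are active
$ iperf -c datanode02 -P 4             # verify raw node-to-node TCP throughput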
• 40. JVM tuning
● Hadoop allows you to set JVM options for all of its processes (a sketch below)
● Your DataNodes, NameNode and HBase RegionServers are going to work hard, and you need to help them deal with your workload
● If your MR code is well designed, you will most likely NOT need to tune the JVM for MR tasks
● Your main enemy will be GC – until you become at least allies, if not friends :)
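The daemon JVM options live in the env scripts. A sketch of the CMS-era settings one might start from – the heap sizes and log path are placeholders that depend entirely on your RAM and workload:
# hadoop-env.sh
export HADOOP_NAMENODE_OPTS="-Xmx8g -XX:+UseConcMarkSweepGC \
  -XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSInitiatingOccupancyOnly \
  -verbose:gc -XX:+PrintGCDetails -Xloggc:/var/log/hadoop/nn-gc.log"
# hbase-env.sh
export HBASE_REGIONSERVER_OPTS="-Xmx16g -Xmn512m -XX:+UseConcMarkSweepGC \
  -XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSInitiatingOccupancyOnly"
GC logs are cheap; keep them on for the daemons so that long pauses can be correlated with timeouts after the fact.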
• 41. Tuning Hadoop services
● The NameNode deals with many connections and needs ~150 bytes of heap per HDFS block
● The NameNode and DataNodes are highly concurrent; the latter need many threads
● Use HDFS short-circuit reads if appropriate
● ZooKeeper needs to be able to handle enough connections
● HBase uses LOTS of heap
● Reuse JVMs for MR jobs if appropriate
(config excerpt below)
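A few of the corresponding knobs as they would appear in hdfs-site.xml (an illustrative excerpt – the values are placeholders, and property names moved between versions: dfs.datanode.max.transfer.threads was formerly dfs.datanode.max.xcievers):
<property><name>dfs.namenode.handler.count</name><value>64</value></property>
<property><name>dfs.datanode.max.transfer.threads</name><value>4096</value></property>
<property><name>dfs.client.read.shortcircuit</name><value>true</value></property>
<property><name>dfs.domain.socket.path</name><value>/var/run/hadoop-hdfs/dn._PORT</value></property>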
• 42. Tuning MapReduce tasks (that means tuning for your code and data)
● If you run different MR jobs, consider tuning the parameters for each of them, not once for all of them (example below)
● Configure the job scheduler to enforce the SLAs
● Estimate the resources needed by each job
● Plan how you are going to run your jobs
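Per-job tuning can simply mean overriding properties at submission time instead of baking one compromise into mapred-site.xml. A sketch with MR1-era property names (the jar, class, values and fair-scheduler pool name are hypothetical):
$ hadoop jar etl-job.jar com.example.MergeLogs \
    -D mapred.reduce.tasks=28 \
    -D io.sort.mb=256 \
    -D mapred.job.reuse.jvm.num.tasks=10 \
    -D mapred.fairscheduler.pool=etl \
    /input/logs /output/merged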
• 43. Tuning your own code
● Test and profile your complex MR code outside of Hadoop (your savings will scale too!)
● Check for GC overhead (see the jstat commands below)
● Use reusable objects
● Avoid expensive formats like JSON and XML
● Anything you waste is multiplied by the number of rows and the number of tasks!
● Evaluate the need for intermediate data compression
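The GC overhead of a live task or daemon is easy to watch right on the node with the tools that ship with the JDK (&lt;pid&gt; stands for whichever Java process you picked):
$ jps -lm                   # find the PID of the task or daemon
$ jstat -gcutil <pid> 1000  # GC summary every second; steadily growing FGC/FGCT means trouble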
• 44. Tuning HBase
● That requires a separate presentation
● You will need to fight hard to reduce GC pauses and overhead
● Pre-splitting regions may be a good idea to better balance the load (shell example below)
● Understand HBase compactions and deal with major compactions your way
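Pre-splitting and manual major compactions are both one-liners in the HBase shell (a sketch – the table name, column family and split points are hypothetical; useful split points come from your own key distribution):
$ hbase shell
hbase> create 'session_logs', 'd', {SPLITS => ['1000', '2000', '3000', '4000']}
hbase> major_compact 'session_logs'   # run major compactions on your own schedule
(The periodic major compactions can be disabled by setting hbase.hregion.majorcompaction to 0 in hbase-site.xml.)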
• 45. Set up your monitoring (and alerting)
● You cannot improve what you cannot see!
● Monitor the OS, Hadoop and your application metrics
● Ganglia, Graphite, LogStash, even Cloudera Manager are your friends (a metrics2 excerpt below)
● Set the baseline, track your changes, observe the outcome
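Hadoop's own metrics can be pushed to Ganglia through the metrics2 sinks (an illustrative hadoop-metrics2.properties excerpt – the sink class below matches Ganglia 3.1+, and gmond.example.com is a placeholder for your gmond endpoint):
*.sink.ganglia.class=org.apache.hadoop.metrics2.sink.ganglia.GangliaSink31
*.period=10
namenode.sink.ganglia.servers=gmond.example.com:8649
datanode.sink.ganglia.servers=gmond.example.com:8649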
• 46. Fight #5 – Operations
● A real hand-over to the Operations people never actually happened
● Any problem was either ignored or escalated to the engineers within about a minute
● Neither NOC nor Operations staff wanted to acquire enough knowledge of Hadoop and the apps
● Monitoring was nearly non-existent
● Same for appropriate alarms
• 47. If you are serious...
● Send your Ops to Hadoop training (or buy them books and have them read them!)
● Have them automate everything
● Ops have to understand your applications, not just the platform they run on
● Your Ops need to be decent Linux admins
● ...and it would be great if they are also OK programmers (scripting, Java…)
● Of course, motivation is the key
• 48. Plan and train for disaster
● Train your Ops to help your system survive till Monday morning
● Decide what sort of loss you can tolerate (BigData is not always so precious)
● Design your system for resilience – async processing, queuing etc.
• 49. Fight #6 – evolution
● Sooner or later you will need to increase your capacity
– Unless your business is stagnating
● Technically, you will either
– Run out of storage space
– Start hitting the wall on IOPS or CPU and fail to respect your SLAs (even if only internal ones)
– Or be unable to deploy new applications
• 50. Understand your application – again
● Even if your app runs fine, you need to monitor the performance factors
● Build spreadsheets reflecting your current numbers (a worked example below)
● Plan for the business growth
● Translate this into the number of additional nodes and networking equipment
● This is especially important if your hardware purchase cycle takes months
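A hypothetical back-of-the-envelope row from such a spreadsheet (every number here is invented for illustration): 2,000 sessions/sec × 30KB per log ≈ 60MB/s, or roughly 5TB of raw logs per day; 90 days of retention at replication factor 3 gives 5 × 90 × 3 ≈ 1.35PB; adding ~25% headroom for temporary MR output and growth brings it to ≈ 1.7PB; with 14 × 2TB disks (28TB raw) per node that is about 60 data nodes before compression – which is exactly why compression ratios belong in the same spreadsheet.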
• 51. Conclusions
● Not all companies are ready for BigData – often because of conservative people in key positions
● Traditional IT/Ops/NOC organizations are often unable to support these platforms
● Engineers have to be given more power to control how the things they build are run (DevOps)
● Hadoop is a complex platform and has to be taken seriously for serious applications
● If you really depend on Hadoop, you do need to build in-house expertise
  • 52. Questions? Thanks for listening! Nikolai Grigoriev ngrigoriev@gmail.com