© 2013 SpringOne 2GX. All rights reserved. Do not distribute without permission.
Hadoop
Just the Basics for Big Data Rookies
Adam Shook
ashook@gopivotal.com
Agenda
• Hadoop Overview
• HDFS Architecture
• Hadoop MapReduce
• Hadoop Ecosystem
• MapReduce Primer
• Buckle up!
Hadoop Overview
Hadoop Core
• Open-source Apache project out of Yahoo! in 2006
• Distributed fault-tolerant data storage and batch
processing
• Provides linear scalability on commodity hardware
• Adopted by many:
– Amazon, AOL, eBay, Facebook, Foursquare, Google, IBM,
Netflix, Twitter, Yahoo!, and many, many more
Why?
• Bottom line:
– Flexible
– Scalable
– Inexpensive
Overview
• Great at
– Reliable storage for multi-petabyte data sets
– Batch queries and analytics
– Complex hierarchical data structures with changing
schemas, unstructured and structured data
• Not so great at
– Changes to files (can’t do it…)
– Low-latency responses
– Analyst usability
• This is less of a concern now due to higher-level languages
Data Structure
• Bytes!
• No more ETL necessary
• Store data now, process later
• Structure on read
– Built-in support for common data types and formats
– Extendable
– Flexible
Versioning
• Version 0.20.x, 0.21.x, 0.22.x, 1.x.x
– Two main MR packages:
• org.apache.hadoop.mapred (deprecated)
• org.apache.hadoop.mapreduce (new hotness)
• Version 2.x.x, alpha’d in May 2012
– NameNode HA
– YARN – Next Gen MapReduce
HDFS Architecture
HDFS Overview
• Hierarchical UNIX-like file system for data storage
– sort of
• Splitting of large files into blocks
• Distribution and replication of blocks to nodes
• Two key services
– Master NameNode
– Many DataNodes
• Checkpoint Node (Secondary NameNode)
NameNode
• Single master service for HDFS
• Single point of failure (HDFS 1.x)
• Stores file to block to location mappings in the namespace
• All transactions are logged to disk
• NameNode startup reads namespace image and logs
Checkpoint Node (Secondary NN)
• Performs checkpoints of the NameNode’s namespace and
logs
• Not a hot backup!
1. Loads up namespace
2. Reads log transactions to modify namespace
3. Saves namespace as a checkpoint
DataNode
• Stores blocks on local disk
• Sends frequent heartbeats to NameNode
• Sends block reports to NameNode
• Clients connect to DataNode for I/O
How HDFS Works - Writes
(Diagram: Client, NameNode, and DataNodes A–D)
1. Client contacts NameNode to write data
2. NameNode says write it to these nodes
3. Client sequentially writes blocks to DataNodes
How HDFS Works - Writes
(Diagram: block replicas spreading across DataNodes A–D)
DataNodes replicate data blocks, orchestrated by the NameNode
How HDFS Works - Reads
(Diagram: Client, NameNode, and DataNodes A–D)
1. Client contacts NameNode to read data
2. NameNode says you can find it here
3. Client sequentially reads blocks from DataNodes
How HDFS Works - Failure
(Diagram: Client rerouting a read around a failed DataNode)
Client connects to another node serving that block
Block Replication
• Default of three replicas
• Rack-aware placement policy
– First replica on the writer’s node (or a random node)
– Second replica on a node in a different rack
– Third replica on a different node in that second rack
• Automatic re-copy by
NameNode, as needed
(Diagram: DataNodes spread across Rack 1 and Rack 2)
HDFS 2.0 Features
• NameNode High-Availability (HA)
– Two redundant NameNodes in active/passive configuration
– Manual or automated failover
• NameNode Federation
– Multiple independent NameNodes using the same collection
of DataNodes
Hadoop MapReduce
Hadoop MapReduce 1.x
• Moves the code to the data
• JobTracker
– Master service to monitor jobs
• TaskTracker
– Multiple services to run tasks
– Same physical machine as a DataNode
• A job contains many tasks
• A task contains one or more task attempts
JobTracker
• Monitors job and task progress
• Issues task attempts to TaskTrackers
• Re-tries failed task attempts
• Four failed attempts = one failed job
• Schedules jobs in FIFO order by default
– Fair and Capacity Schedulers available as alternatives
• Single point of failure for MapReduce
TaskTrackers
• Runs on same node as DataNode service
• Sends heartbeats and task reports to JobTracker
• Configurable number of map and reduce slots
• Runs map and reduce task attempts
– Separate JVM!
Exploiting Data Locality
• JobTracker will schedule task on a TaskTracker that is
local to the block
– 3 options!
• If TaskTracker is busy, selects TaskTracker on same rack
– Many options!
• If still busy, chooses an available TaskTracker at random
– Rare!
How MapReduce Works
(Diagram: Client, JobTracker, and TaskTrackers A–D co-located with DataNodes A–D)
1. Client submits job to JobTracker
2. JobTracker submits tasks to TaskTrackers
3. Job output is written to DataNodes w/replication
4. JobTracker reports metrics
How MapReduce Works - Failure
(Diagram: JobTracker rerouting a task around a failed TaskTracker)
JobTracker assigns task to different node
YARN
• Abstract framework for distributed application
development
• Split functionality of JobTracker into two components
– ResourceManager
– ApplicationMaster
• TaskTracker becomes NodeManager
– Containers instead of map and reduce slots
• Configurable amount of memory per NodeManager
MapReduce 2.x on YARN
• MapReduce API has not changed
– Rebuild required to upgrade from 1.x to 2.x
• Application Master launches and monitors job via YARN
• MapReduce History Server to store… history
Hadoop Ecosystem
Hadoop Ecosystem
• Core Technologies
– Hadoop Distributed File System
– Hadoop MapReduce
• Many other tools…
– Which I will be describing… now
Moving Data
• Sqoop
– Moving data between RDBMS and HDFS
– Say, migrating MySQL tables to HDFS
• Flume
– Streams event data from sources to sinks
– Say, weblogs from multiple servers into HDFS
Flume Architecture
Higher Level APIs
• Pig
– Data-flow language, aptly named Pig Latin, used to generate
one or more MapReduce jobs against data stored locally or
in HDFS
• Hive
– Data warehousing solution, allowing users to write SQL-like
queries to generate a series of MapReduce jobs against
data stored in HDFS
Pig Word Count
A = LOAD '$input';
B = FOREACH A GENERATE FLATTEN(TOKENIZE($0)) AS word;
C = GROUP B BY word;
D = FOREACH C GENERATE group AS word, COUNT(B);
STORE D INTO '$output';
Key/Value Stores
• HBase
• Accumulo
• Implementations of Google’s Bigtable for HDFS
• Provides random, real-time access to big data
• Supports updates and deletes of key/value pairs
HBase Architecture
MasterZooKeeper
RegionServer
Region
Store
StoreFile
MemStore
StoreFile
Store
StoreFile
MemStore
StoreFile
Client
HDFS
RegionServer
Region
Store
StoreFile
MemStore
StoreFile
Store
StoreFile
MemStore
StoreFile
Data Structure
• Avro
– Data serialization system designed for the Hadoop
ecosystem
– Schemas are expressed as JSON
• Parquet
– Compressed, efficient columnar storage for Hadoop and
other systems
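To illustrate the JSON schema format mentioned above, a minimal Avro record schema might look like the following sketch (the record and field names are invented for the example):

```json
{
  "type": "record",
  "name": "WebEvent",
  "fields": [
    {"name": "url",       "type": "string"},
    {"name": "timestamp", "type": "long"}
  ]
}
```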
Scalable Machine Learning
• Mahout
– Library for scalable machine learning written in Java
– Very robust examples!
– Classification, Clustering, Pattern Mining, Collaborative
Filtering, and much more
Workflow Management
• Oozie
– Scheduling system for Hadoop Jobs
– Support for:
• Java MapReduce
• Streaming MapReduce
• Pig, Hive, Sqoop, Distcp
• Any ol’ Java or shell script program
Real-time Stream Processing
• Storm
– Open-source project that streams data from a source, called a spout, through a series of processing agents called bolts
– Scalable and fault-tolerant, with guaranteed processing of data
– Benchmarks of over a million tuples processed per second per node
Distributed Application Coordination
• ZooKeeper
– An effort to develop and
maintain an open-source
server which enables
highly reliable distributed
coordination
– Designed to be simple,
replicated, ordered, and
fast
– Provides configuration
management, distributed
synchronization, and group
services for applications
ZooKeeper Architecture
Hadoop Streaming
• Write MapReduce mappers and reducers using stdin and
stdout
• Execute on command line using Hadoop Streaming JAR
hadoop jar hadoop-streaming.jar -input input -output outputdir
-mapper org.apache.hadoop.mapreduce.Mapper -reducer /bin/wc
SQL on Hadoop
• Apache Drill
• Cloudera Impala
• Hive Stinger
• Pivotal HAWQ
• MPP execution of SQL queries against HDFS data
HAWQ Architecture
That’s a lot of projects
• I am likely missing several (Sorry, guys!)
• Each cropped up to solve a limitation of Hadoop Core
• Know your ecosystem
• Pick the right tool for the right job
Sample Architecture
(Diagram: a website, webserver, sales system, and call center feed Flume agents into HDFS; MapReduce, Pig, HBase, and Storm process the data, with Oozie coordinating workflows and SQL access on top)
MapReduce Primer
MapReduce Paradigm
• Data processing system with two key phases
• Map
– Perform a map function on input key/value pairs to generate
intermediate key/value pairs
• Reduce
– Perform a reduce function on intermediate key/value groups
to generate output key/value pairs
• Groups created by sorting map output
(Example: three map tasks and two reduce tasks)

Map Input:
(0, "hadoop is fun") | (52, "I love hadoop") | (104, "Pig is more fun")

Map Output:
("hadoop", 1) ("is", 1) ("fun", 1) | ("I", 1) ("love", 1) ("hadoop", 1) | ("Pig", 1) ("is", 1) ("more", 1) ("fun", 1)

SHUFFLE AND SORT

Reducer Input Groups:
("hadoop", {1,1}) ("fun", {1,1}) ("love", {1}) ("I", {1}) | ("is", {1,1}) ("Pig", {1}) ("more", {1})

Reducer Output:
("hadoop", 2) ("fun", 2) ("love", 1) ("I", 1) | ("is", 2) ("Pig", 1) ("more", 1)
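The same flow can be sketched in plain Java, independent of the Hadoop API, with a TreeMap standing in for the shuffle-and-sort phase (all class and method names here are illustrative, not Hadoop classes):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class WordCountSketch {

    static TreeMap<String, Integer> wordCount(String[] lines) {
        // Map phase: emit one (word, 1) pair per token
        List<String> mapOutput = new ArrayList<>();
        for (String line : lines)
            for (String word : line.split(" "))
                mapOutput.add(word);

        // Shuffle and sort: group intermediate pairs by key (TreeMap keeps keys sorted)
        TreeMap<String, List<Integer>> groups = new TreeMap<>();
        for (String word : mapOutput)
            groups.computeIfAbsent(word, k -> new ArrayList<>()).add(1);

        // Reduce phase: sum the values of each group
        TreeMap<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : groups.entrySet()) {
            int sum = 0;
            for (int one : group.getValue()) sum += one;
            counts.put(group.getKey(), sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] lines = {"hadoop is fun", "I love hadoop", "Pig is more fun"};
        wordCount(lines).forEach((word, count) -> System.out.println(word + "\t" + count));
    }
}
```

In real Hadoop, of course, the map and reduce phases run in separate JVMs across many machines; the point here is only the shape of the data between phases.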
Hadoop MapReduce Components
• Map Phase
– Input Format
– Record Reader
– Mapper
– Combiner
– Partitioner
• Reduce Phase
– Shuffle
– Sort
– Reducer
– Output Format
– Record Writer
Writable Interfaces
public interface Writable {
void write(DataOutput out);
void readFields(DataInput in);
}
public interface WritableComparable<T> extends Writable, Comparable<T> {
}
• BooleanWritable
• BytesWritable
• ByteWritable
• DoubleWritable
• FloatWritable
• IntWritable
• LongWritable
• NullWritable
• Text
InputFormat
public abstract class InputFormat<K, V> {
public abstract List<InputSplit> getSplits(JobContext context);
public abstract RecordReader<K, V>
createRecordReader(InputSplit split, TaskAttemptContext context);
}
RecordReader
public abstract class RecordReader<KEYIN, VALUEIN> implements Closeable {
public abstract void initialize(InputSplit split, TaskAttemptContext context);
public abstract boolean nextKeyValue();
public abstract KEYIN getCurrentKey();
public abstract VALUEIN getCurrentValue();
public abstract float getProgress();
public abstract void close();
}
Mapper
public class Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
protected void setup(Context context) { /* NOTHING */ }
protected void cleanup(Context context) { /* NOTHING */ }
protected void map(KEYIN key, VALUEIN value, Context context) {
context.write((KEYOUT) key, (VALUEOUT) value);
}
public void run(Context context) {
setup(context);
while (context.nextKeyValue())
map(context.getCurrentKey(), context.getCurrentValue(), context);
cleanup(context);
}
}
Partitioner
public abstract class Partitioner<KEY, VALUE> {
public abstract int getPartition(KEY key, VALUE value, int numPartitions);
}
• Default HashPartitioner uses key’s hashCode() % numPartitions
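The default computation masks off the sign bit before taking the modulo, so keys with negative hash codes still land in a valid partition. A standalone sketch of that arithmetic (the class name is illustrative):

```java
public class HashPartitionSketch {

    // Same computation as Hadoop's default HashPartitioner:
    // mask the sign bit, then take the result modulo the partition count
    static int getPartition(Object key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        // The same key always maps to the same partition
        System.out.println(getPartition("hadoop", 4));
        System.out.println(getPartition("hadoop", 4));
    }
}
```

This determinism is what guarantees that all values for a given key reach the same reducer.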
Reducer
public class Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
protected void setup(Context context) { /* NOTHING */ }
protected void cleanup(Context context) { /* NOTHING */ }
protected void reduce(KEYIN key, Iterable<VALUEIN> values, Context context) {
for (VALUEIN value : values)
context.write((KEYOUT) key, (VALUEOUT) value);
}
public void run(Context context) {
setup(context);
while (context.nextKey())
reduce(context.getCurrentKey(), context.getValues(), context);
cleanup(context);
}
}
OutputFormat
public abstract class OutputFormat<K, V> {
public abstract RecordWriter<K, V>
getRecordWriter(TaskAttemptContext context);
public abstract void checkOutputSpecs(JobContext context);
public abstract OutputCommitter
getOutputCommitter(TaskAttemptContext context);
}
RecordWriter
public abstract class RecordWriter<K, V> {
public abstract void write(K key, V value);
public abstract void close(TaskAttemptContext context);
}
Word Count Example
Problem
• Count the number of times
each word is used in a
body of text
• Uses TextInputFormat and
TextOutputFormat
map(byte_offset, line)
    foreach word in line
        emit(word, 1)

reduce(word, counts)
    sum = 0
    foreach count in counts
        sum += count
    emit(word, sum)
Mapper Code
public class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable>{
private final static IntWritable ONE = new IntWritable(1);
private Text word = new Text();
public void map(LongWritable key, Text value, Context context) {
String line = value.toString();
StringTokenizer tokenizer = new StringTokenizer(line);
while (tokenizer.hasMoreTokens()) {
word.set(tokenizer.nextToken());
context.write(word, ONE);
}
}
}
Shuffle and Sort
(Diagram: Mappers 0–3 and Reducers 0–3, partitions P0–P3)
1. Mapper outputs to a single logically partitioned file
2. Reducers copy their parts
3. Reducer merges partitions, sorting by key
Reducer Code
public class IntSumReducer
extends Reducer<Text, IntWritable, Text, IntWritable> {
private IntWritable outvalue = new IntWritable();
private int sum = 0;
public void reduce(Text key, Iterable<IntWritable> values, Context context) {
sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
outvalue.set(sum);
context.write(key, outvalue);
}
}
So what’s so hard about it?
(Diagram: a tiny box labeled “MapReduce” inside a much larger box labeled “All the problems you'll ever have ever”)
So what’s so hard about it?
• MapReduce is a limiting programming model
• Entirely different way of thinking
• Simple processing operations such as joins are not so
easy when expressed in MapReduce
• Proper implementation is not so easy
• Lots of configuration and implementation details for
optimal performance
– Number of reduce tasks, data skew, JVM size, garbage
collection
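For instance, even an inner join has to be phrased as map-side tagging plus reduce-side matching. A plain-Java sketch of the reduce-side join idea, not the Hadoop API, with all names invented for illustration:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ReduceSideJoinSketch {

    // Map phase: tag each record with its source ("U" for users, "O" for orders)
    // and emit it under the join key. Shuffle: group tagged records by key.
    // Reduce phase: for each key, cross the user records with the order records.
    static List<String> join(Map<Integer, String> users, Map<Integer, List<String>> orders) {
        TreeMap<Integer, List<String[]>> groups = new TreeMap<>();
        for (Map.Entry<Integer, String> u : users.entrySet())
            groups.computeIfAbsent(u.getKey(), k -> new ArrayList<>())
                  .add(new String[]{"U", u.getValue()});
        for (Map.Entry<Integer, List<String>> o : orders.entrySet())
            for (String order : o.getValue())
                groups.computeIfAbsent(o.getKey(), k -> new ArrayList<>())
                      .add(new String[]{"O", order});

        List<String> joined = new ArrayList<>();
        for (List<String[]> records : groups.values()) {
            List<String> us = new ArrayList<>();
            List<String> os = new ArrayList<>();
            for (String[] rec : records)
                (rec[0].equals("U") ? us : os).add(rec[1]);   // split group by tag
            for (String u : us)
                for (String o : os)
                    joined.add(u + "," + o);                  // emit joined pairs
        }
        return joined;
    }

    public static void main(String[] args) {
        Map<Integer, String> users = Map.of(1, "alice", 2, "bob");
        Map<Integer, List<String>> orders = Map.of(1, Arrays.asList("book", "pen"));
        System.out.println(join(users, orders)); // prints [alice,book, alice,pen]
    }
}
```

All the bookkeeping (tagging, buffering one side per key) is what you end up writing by hand in MapReduce; Pig and Hive generate it for you from a one-line JOIN.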
So what does this mean for you?
• Hadoop is written primarily in Java
• Components are extendable and configurable
• Custom I/O through Input and Output Formats
– Parse custom data formats
– Read and write using external systems
• Higher-level tools enable rapid development of big data
analysis
Resources, Wrap-up, etc.
• http://hadoop.apache.org
• Very supportive community
• Strata + Hadoop World Oct. 28th – 30th in Manhattan
• Plenty of resources available to learn more
– Blogs
– Email lists
– Books
– Shameless Plug -- MapReduce Design Patterns
Getting Started
• Pivotal HD Single-Node VM and Community Edition
– http://gopivotal.com/pivotal-products/data/pivotal-hd
• For the brave and bold -- Roll-your-own!
– http://hadoop.apache.org/docs/current
Acknowledgements
• Apache Hadoop, the Hadoop elephant logo, HDFS,
Accumulo, Avro, Drill, Flume, HBase, Hive, Mahout, Oozie,
Pig, Sqoop, YARN, and ZooKeeper are trademarks of the
Apache Software Foundation
• Cloudera Impala is a trademark of Cloudera
• Parquet is copyright Twitter, Cloudera, and other
contributors
• Storm is licensed under the Eclipse Public License
Learn More. Stay Connected.
• Talk to us on Twitter: @springcentral
• Find Session replays on YouTube: spring.io/video

What AI Means For Your Product Strategy And What To Do About ItWhat AI Means For Your Product Strategy And What To Do About It
What AI Means For Your Product Strategy And What To Do About It
 
Make the Right Thing the Obvious Thing at Cardinal Health 2023
Make the Right Thing the Obvious Thing at Cardinal Health 2023Make the Right Thing the Obvious Thing at Cardinal Health 2023
Make the Right Thing the Obvious Thing at Cardinal Health 2023
 
Enhancing DevEx and Simplifying Operations at Scale
Enhancing DevEx and Simplifying Operations at ScaleEnhancing DevEx and Simplifying Operations at Scale
Enhancing DevEx and Simplifying Operations at Scale
 
Spring Update | July 2023
Spring Update | July 2023Spring Update | July 2023
Spring Update | July 2023
 
Platforms, Platform Engineering, & Platform as a Product
Platforms, Platform Engineering, & Platform as a ProductPlatforms, Platform Engineering, & Platform as a Product
Platforms, Platform Engineering, & Platform as a Product
 
Building Cloud Ready Apps
Building Cloud Ready AppsBuilding Cloud Ready Apps
Building Cloud Ready Apps
 
Spring Boot 3 And Beyond
Spring Boot 3 And BeyondSpring Boot 3 And Beyond
Spring Boot 3 And Beyond
 
Spring Cloud Gateway - SpringOne Tour 2023 Charles Schwab.pdf
Spring Cloud Gateway - SpringOne Tour 2023 Charles Schwab.pdfSpring Cloud Gateway - SpringOne Tour 2023 Charles Schwab.pdf
Spring Cloud Gateway - SpringOne Tour 2023 Charles Schwab.pdf
 
Simplify and Scale Enterprise Apps in the Cloud | Boston 2023
Simplify and Scale Enterprise Apps in the Cloud | Boston 2023Simplify and Scale Enterprise Apps in the Cloud | Boston 2023
Simplify and Scale Enterprise Apps in the Cloud | Boston 2023
 
Simplify and Scale Enterprise Apps in the Cloud | Seattle 2023
Simplify and Scale Enterprise Apps in the Cloud | Seattle 2023Simplify and Scale Enterprise Apps in the Cloud | Seattle 2023
Simplify and Scale Enterprise Apps in the Cloud | Seattle 2023
 
tanzu_developer_connect.pptx
tanzu_developer_connect.pptxtanzu_developer_connect.pptx
tanzu_developer_connect.pptx
 
Tanzu Virtual Developer Connect Workshop - French
Tanzu Virtual Developer Connect Workshop - FrenchTanzu Virtual Developer Connect Workshop - French
Tanzu Virtual Developer Connect Workshop - French
 
Tanzu Developer Connect Workshop - English
Tanzu Developer Connect Workshop - EnglishTanzu Developer Connect Workshop - English
Tanzu Developer Connect Workshop - English
 
Virtual Developer Connect Workshop - English
Virtual Developer Connect Workshop - EnglishVirtual Developer Connect Workshop - English
Virtual Developer Connect Workshop - English
 
Tanzu Developer Connect - French
Tanzu Developer Connect - FrenchTanzu Developer Connect - French
Tanzu Developer Connect - French
 
Simplify and Scale Enterprise Apps in the Cloud | Dallas 2023
Simplify and Scale Enterprise Apps in the Cloud | Dallas 2023Simplify and Scale Enterprise Apps in the Cloud | Dallas 2023
Simplify and Scale Enterprise Apps in the Cloud | Dallas 2023
 
SpringOne Tour: Deliver 15-Factor Applications on Kubernetes with Spring Boot
SpringOne Tour: Deliver 15-Factor Applications on Kubernetes with Spring BootSpringOne Tour: Deliver 15-Factor Applications on Kubernetes with Spring Boot
SpringOne Tour: Deliver 15-Factor Applications on Kubernetes with Spring Boot
 
SpringOne Tour: The Influential Software Engineer
SpringOne Tour: The Influential Software EngineerSpringOne Tour: The Influential Software Engineer
SpringOne Tour: The Influential Software Engineer
 
SpringOne Tour: Domain-Driven Design: Theory vs Practice
SpringOne Tour: Domain-Driven Design: Theory vs PracticeSpringOne Tour: Domain-Driven Design: Theory vs Practice
SpringOne Tour: Domain-Driven Design: Theory vs Practice
 
SpringOne Tour: Spring Recipes: A Collection of Common-Sense Solutions
SpringOne Tour: Spring Recipes: A Collection of Common-Sense SolutionsSpringOne Tour: Spring Recipes: A Collection of Common-Sense Solutions
SpringOne Tour: Spring Recipes: A Collection of Common-Sense Solutions
 

Último

Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiLow Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiSuhani Kapoor
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationshipsccctableauusergroup
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxolyaivanovalion
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Callshivangimorya083
 
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改atducpo
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxStephen266013
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz1
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxolyaivanovalion
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxolyaivanovalion
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfMarinCaroMartnezBerg
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfRachmat Ramadhan H
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingNeil Barnes
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxolyaivanovalion
 
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiVIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiSuhani Kapoor
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Delhi Call girls
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptxAnupama Kate
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxolyaivanovalion
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusTimothy Spann
 

Último (20)

Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiLow Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptx
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
 
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docx
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptx
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFx
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data Storytelling
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptx
 
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiVIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptx
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and Milvus
 

Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)

  • 1. © 2013 SpringOne 2GX. All rights reserved. Do not distribute without permission. Hadoop Just the Basics for Big Data Rookies Adam Shook ashook@gopivotal.com
  • 2. Agenda • Hadoop Overview • HDFS Architecture • Hadoop MapReduce • Hadoop Ecosystem • MapReduce Primer • Buckle up!
  • 4. Hadoop Core • Open-source Apache project out of Yahoo! in 2006 • Distributed fault-tolerant data storage and batch processing • Provides linear scalability on commodity hardware • Adopted by many: – Amazon, AOL, eBay, Facebook, Foursquare, Google, IBM, Netflix, Twitter, Yahoo!, and many, many more
  • 5. Why? • Bottom line: – Flexible – Scalable – Inexpensive
  • 6. Overview • Great at – Reliable storage for multi-petabyte data sets – Batch queries and analytics – Complex hierarchical data structures with changing schemas, unstructured and structured data • Not so great at – Changes to files (can’t do it…) – Low-latency responses – Analyst usability • This is less of a concern now due to higher-level languages
  • 7. Data Structure • Bytes! • No more ETL necessary • Store data now, process later • Structure on read – Built-in support for common data types and formats – Extendable – Flexible
  • 8. Versioning • Version 0.20.x, 0.21.x, 0.22.x, 1.x.x – Two main MR packages: • org.apache.hadoop.mapred (deprecated) • org.apache.hadoop.mapreduce (new hotness) • Version 2.x.x, alpha’d in May 2012 – NameNode HA – YARN – Next Gen MapReduce
  • 10. HDFS Overview • Hierarchical UNIX-like file system for data storage – sort of • Splitting of large files into blocks • Distribution and replication of blocks to nodes • Two key services – Master NameNode – Many DataNodes • Checkpoint Node (Secondary NameNode)
  • 11. NameNode • Single master service for HDFS • Single point of failure (HDFS 1.x) • Stores file to block to location mappings in the namespace • All transactions are logged to disk • NameNode startup reads namespace image and logs
  • 12. Checkpoint Node (Secondary NN) • Performs checkpoints of the NameNode’s namespace and logs • Not a hot backup! 1. Loads up namespace 2. Reads log transactions to modify namespace 3. Saves namespace as a checkpoint
  • 13. DataNode • Stores blocks on local disk • Sends frequent heartbeats to NameNode • Sends block reports to NameNode • Clients connect to DataNode for I/O
  • 14. How HDFS Works - Writes (1) Client contacts NameNode to write data (2) NameNode says write it to these nodes (3) Client sequentially writes blocks to DataNodes
  • 15. How HDFS Works - Writes DataNodes replicate data blocks, orchestrated by the NameNode
  • 16. How HDFS Works - Reads (1) Client contacts NameNode to read data (2) NameNode says you can find it here (3) Client sequentially reads blocks from DataNodes
  • 17. How HDFS Works - Failure Client connects to another node serving that block
  • 18. Block Replication • Default of three replicas • Rack-aware system – One block on same rack – One block on same rack, different host – One block on another rack • Automatic re-copy by NameNode, as needed
  • 19. HDFS 2.0 Features • NameNode High-Availability (HA) – Two redundant NameNodes in active/passive configuration – Manual or automated failover • NameNode Federation – Multiple independent NameNodes using the same collection of DataNodes
  • 21. Hadoop MapReduce 1.x • Moves the code to the data • JobTracker – Master service to monitor jobs • TaskTracker – Multiple services to run tasks – Same physical machine as a DataNode • A job contains many tasks • A task contains one or more task attempts
  • 22. JobTracker • Monitors job and task progress • Issues task attempts to TaskTrackers • Re-tries failed task attempts • Four failed attempts = one failed job • Schedules jobs in FIFO order – Fair Scheduler • Single point of failure for MapReduce
  • 23. TaskTrackers • Runs on same node as DataNode service • Sends heartbeats and task reports to JobTracker • Configurable number of map and reduce slots • Runs map and reduce task attempts – Separate JVM!
  • 24. Exploiting Data Locality • JobTracker will schedule task on a TaskTracker that is local to the block – 3 options! • If TaskTracker is busy, selects TaskTracker on same rack – Many options! • If still busy, chooses an available TaskTracker at random – Rare!
  • 25. How MapReduce Works (1) Client submits job to JobTracker (2) JobTracker submits tasks to TaskTrackers (3) Job output is written to DataNodes w/replication (4) JobTracker reports metrics
  • 26. How MapReduce Works - Failure JobTracker assigns task to different node
  • 27. YARN • Abstract framework for distributed application development • Split functionality of JobTracker into two components – ResourceManager – ApplicationMaster • TaskTracker becomes NodeManager – Containers instead of map and reduce slots • Configurable amount of memory per NodeManager
  • 28. MapReduce 2.x on YARN • MapReduce API has not changed – Rebuild required to upgrade from 1.x to 2.x • Application Master launches and monitors job via YARN • MapReduce History Server to store… history
  • 30. Hadoop Ecosystem • Core Technologies – Hadoop Distributed File System – Hadoop MapReduce • Many other tools… – Which I will be describing… now
  • 31. Moving Data • Sqoop – Moving data between RDBMS and HDFS – Say, migrating MySQL tables to HDFS • Flume – Streams event data from sources to sinks – Say, weblogs from multiple servers into HDFS
  • 33. Higher Level APIs • Pig – Data-flow language – aptly named Pig Latin – used to generate one or more MapReduce jobs against data stored locally or in HDFS • Hive – Data warehousing solution, allowing users to write SQL-like queries to generate a series of MapReduce jobs against data stored in HDFS
  • 34. Pig Word Count A = LOAD '$input'; B = FOREACH A GENERATE FLATTEN(TOKENIZE($0)) AS word; C = GROUP B BY word; D = FOREACH C GENERATE group AS word, COUNT(B); STORE D INTO '$output';
  • 35. Key/Value Stores • HBase • Accumulo • Implementations of Google’s Big Table for HDFS • Provides random, real-time access to big data • Supports updates and deletes of key/value pairs
  • 37. Data Structure • Avro – Data serialization system designed for the Hadoop ecosystem – Expressed as JSON • Parquet – Compressed, efficient columnar storage for Hadoop and other systems
  • 38. Scalable Machine Learning • Mahout – Library for scalable machine learning written in Java – Very robust examples! – Classification, Clustering, Pattern Mining, Collaborative Filtering, and much more
  • 39. Workflow Management • Oozie – Scheduling system for Hadoop Jobs – Support for: • Java MapReduce • Streaming MapReduce • Pig, Hive, Sqoop, Distcp • Any ol’ Java or shell script program
  • 40. Real-time Stream Processing • Storm – Open-source project which streams data from sources, called spouts, to a series of execution agents called bolts – Scalable and fault-tolerant, with guaranteed processing of data – Benchmarks of over a million tuples processed per second per node
  • 41. Distributed Application Coordination • ZooKeeper – An effort to develop and maintain an open-source server which enables highly reliable distributed coordination – Designed to be simple, replicated, ordered, and fast – Provides configuration management, distributed synchronization, and group services for applications
  • 43. Hadoop Streaming • Write MapReduce mappers and reducers using stdin and stdout • Execute on command line using Hadoop Streaming JAR hadoop jar hadoop-streaming.jar -input input -output outputdir -mapper org.apache.hadoop.mapreduce.Mapper -reducer /bin/wc
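In streaming, a mapper can be any program that reads records from stdin and writes tab-separated key/value pairs to stdout. A minimal sketch of such a mapper in Java (StreamingWordMapper is a hypothetical example written for this document, not part of Hadoop):

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical streaming mapper: reads lines from stdin and emits one
// "word<TAB>1" record per token on stdout, the key/value contract that
// Hadoop Streaming expects from mapper processes.
public class StreamingWordMapper {

    // Tokenize one input line into streaming-format output records.
    static List<String> mapLine(String line) {
        List<String> out = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(word + "\t1");
            }
        }
        return out;
    }

    public static void main(String[] args) throws Exception {
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        String line;
        while ((line = in.readLine()) != null) {
            for (String record : mapLine(line)) {
                System.out.println(record);
            }
        }
    }
}
```

It would be passed to the streaming JAR as the `-mapper` command; the framework handles the shuffle between the mapper's stdout and the reducer's stdin.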
  • 44. SQL on Hadoop • Apache Drill • Cloudera Impala • Hive Stinger • Pivotal HAWQ • MPP execution of SQL queries against HDFS data
  • 46. That’s a lot of projects • I am likely missing several (Sorry, guys!) • Each cropped up to solve a limitation of Hadoop Core • Know your ecosystem • Pick the right tool for the right job
  • 47. Sample Architecture HDFS Flume Agent Flume Agent Flume Agent MapReduce Pig HBase Storm Website Oozie Webserver Sales Call Center SQL SQL
  • 49. MapReduce Paradigm • Data processing system with two key phases • Map – Perform a map function on input key/value pairs to generate intermediate key/value pairs • Reduce – Perform a reduce function on intermediate key/value groups to generate output key/value pairs • Groups created by sorting map output
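The two phases above can be sketched as a single-process Java toy (illustrative only – MiniMapReduce and its method names are invented here, not Hadoop APIs): map emits (word, 1) pairs, a shuffle step groups them by key, and reduce sums each group.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy, in-memory illustration of the MapReduce paradigm (not Hadoop code).
public class MiniMapReduce {

    // Map phase: one line in, a list of (word, 1) intermediate pairs out.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.split("\\s+")) {
            pairs.add(Map.entry(word, 1));
        }
        return pairs;
    }

    // Shuffle and sort: group intermediate values by key, sorted by key.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> groups = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            groups.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return groups;
    }

    // Reduce phase: sum the counts for each word group.
    static Map<String, Integer> reduce(Map<String, List<Integer>> groups) {
        Map<String, Integer> out = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> g : groups.entrySet()) {
            int sum = 0;
            for (int v : g.getValue()) sum += v;
            out.put(g.getKey(), sum);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        pairs.addAll(map("hadoop is fun"));
        pairs.addAll(map("I love hadoop"));
        System.out.println(reduce(shuffle(pairs)));
    }
}
```

In real Hadoop the map and reduce calls run on different machines and the shuffle moves data over the network, but the data flow is the same.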
  • 50. Reduce Task 0 Reduce Task 1 Map Task 0 Map Task 1 Map Task 2 (0, "hadoop is fun") (52, "I love hadoop") (104, "Pig is more fun") ("hadoop", 1) ("is", 1) ("fun", 1) ("I", 1) ("love", 1) ("hadoop", 1) ("Pig", 1) ("is", 1) ("more", 1) ("fun", 1) ("hadoop", {1,1}) ("is", {1,1}) ("fun", {1,1}) ("love", {1}) ("I", {1}) ("Pig", {1}) ("more", {1}) ("hadoop", 2) ("fun", 2) ("love", 1) ("I", 1) ("is", 2) ("Pig", 1) ("more", 1) SHUFFLE AND SORT Map Input Map Output Reducer Input Groups Reducer Output
  • 51. Hadoop MapReduce Components • Map Phase – Input Format – Record Reader – Mapper – Combiner – Partitioner • Reduce Phase – Shuffle – Sort – Reducer – Output Format – Record Writer
  • 52. Writable Interfaces public interface Writable { void write(DataOutput out); void readFields(DataInput in); } public interface WritableComparable<T> extends Writable, Comparable<T> { } • BooleanWritable • BytesWritable • ByteWritable • DoubleWritable • FloatWritable • IntWritable • LongWritable • NullWritable • Text
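The Writable contract is simply: serialize your fields in a fixed order in write(), and read them back in the same order in readFields(). A self-contained round-trip sketch using only java.io (PairWritable here is a hypothetical type that mirrors the interface shape rather than implementing the real Hadoop one):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

// Sketch of the Writable serialization contract with a made-up pair type.
public class PairWritableDemo {

    static class PairWritable {
        long left;
        String right;

        // Serialize fields in a fixed order.
        void write(DataOutput out) throws IOException {
            out.writeLong(left);
            out.writeUTF(right);
        }

        // Deserialize fields in the same order they were written.
        void readFields(DataInput in) throws IOException {
            left = in.readLong();
            right = in.readUTF();
        }
    }

    static byte[] toBytes(PairWritable p) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        p.write(new DataOutputStream(bytes));
        return bytes.toByteArray();
    }

    static PairWritable fromBytes(byte[] data) throws IOException {
        PairWritable p = new PairWritable();
        p.readFields(new DataInputStream(new ByteArrayInputStream(data)));
        return p;
    }

    public static void main(String[] args) throws IOException {
        PairWritable p = new PairWritable();
        p.left = 42L;
        p.right = "hadoop";
        PairWritable copy = fromBytes(toBytes(p));
        System.out.println(copy.left + " " + copy.right);
    }
}
```

Keys additionally implement Comparable so the framework can sort them during the shuffle, which is what WritableComparable adds.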
  • 53. InputFormat public abstract class InputFormat<K, V> { public abstract List<InputSplit> getSplits(JobContext context); public abstract RecordReader<K, V> createRecordReader(InputSplit split, TaskAttemptContext context); }
  • 54. RecordReader public abstract class RecordReader<KEYIN, VALUEIN> implements Closeable { public abstract void initialize(InputSplit split, TaskAttemptContext context); public abstract boolean nextKeyValue(); public abstract KEYIN getCurrentKey(); public abstract VALUEIN getCurrentValue(); public abstract float getProgress(); public abstract void close(); }
  • 55. Mapper public class Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> { protected void setup(Context context) { /* NOTHING */ } protected void cleanup(Context context) { /* NOTHING */ } protected void map(KEYIN key, VALUEIN value, Context context) { context.write((KEYOUT) key, (VALUEOUT) value); } public void run(Context context) { setup(context); while (context.nextKeyValue()) map(context.getCurrentKey(), context.getCurrentValue(), context); cleanup(context); } }
  • 56. Partitioner public abstract class Partitioner<KEY, VALUE> { public abstract int getPartition(KEY key, VALUE value, int numPartitions); } • Default HashPartitioner uses key’s hashCode() % numPartitions
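As a standalone sketch, the default rule looks like the following (HashPartitionDemo is a made-up class; the sign-bit mask mirrors Hadoop's HashPartitioner so that keys with negative hash codes still produce a valid partition index):

```java
// Standalone illustration of default hash partitioning:
// (hashCode & Integer.MAX_VALUE) % numPartitions.
// The mask clears the sign bit, since a bare % on a negative
// hashCode would yield a negative (invalid) partition number.
public class HashPartitionDemo {

    static int getPartition(String key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        for (String key : new String[] {"hadoop", "is", "fun"}) {
            System.out.println(key + " -> partition " + getPartition(key, 4));
        }
    }
}
```

The same key always maps to the same partition, which is what guarantees all values for one key arrive at one reducer.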
  • 57. Reducer public class Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT> { protected void setup(Context context) { /* NOTHING */ } protected void cleanup(Context context) { /* NOTHING */ } protected void reduce(KEYIN key, Iterable<VALUEIN> values, Context context) { for (VALUEIN value : values) context.write((KEYOUT) key, (VALUEOUT) value); } public void run(Context context) { setup(context); while (context.nextKey()) reduce(context.getCurrentKey(), context.getValues(), context); cleanup(context); } }
  • 58. OutputFormat public abstract class OutputFormat<K, V> { public abstract RecordWriter<K, V> getRecordWriter(TaskAttemptContext context); public abstract void checkOutputSpecs(JobContext context); public abstract OutputCommitter getOutputCommitter(TaskAttemptContext context); }
  • 59. RecordWriter public abstract class RecordWriter<K, V> { public abstract void write(K key, V value); public abstract void close(TaskAttemptContext context); }
  • 61. Problem • Count the number of times each word is used in a body of text • Uses TextInputFormat and TextOutputFormat map(byte_offset, line) foreach word in line emit(word, 1) reduce(word, counts) sum = 0 foreach count in counts sum += count emit(word, sum)
  • 62. Mapper Code public class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable>{ private final static IntWritable ONE = new IntWritable(1); private Text word = new Text(); public void map(LongWritable key, Text value, Context context) { String line = value.toString(); StringTokenizer tokenizer = new StringTokenizer(line); while (tokenizer.hasMoreTokens()) { word.set(tokenizer.nextToken()); context.write(word, ONE); } } }
  • 63. Shuffle and Sort (1) Mapper outputs to a single logically partitioned file (2) Reducers copy their parts (3) Reducer merges partitions, sorting by key
  • 64. Reducer Code public class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> { private IntWritable outvalue = new IntWritable(); private int sum = 0; public void reduce(Text key, Iterable<IntWritable> values, Context context) { sum = 0; for (IntWritable val : values) { sum += val.get(); } outvalue.set(sum); context.write(key, outvalue); } }
  • 65. So what’s so hard about it? MapReduce is a tiny box inside all the problems you'll ever have
  • 66. So what’s so hard about it? • MapReduce is a limitation • Entirely different way of thinking • Simple processing operations such as joins are not so easy when expressed in MapReduce • Proper implementation is not so easy • Lots of configuration and implementation details for optimal performance – Number of reduce tasks, data skew, JVM size, garbage collection
  • 67. So what does this mean for you? • Hadoop is written primarily in Java • Components are extendable and configurable • Custom I/O through Input and Output Formats – Parse custom data formats – Read and write using external systems • Higher-level tools enable rapid development of big data analysis
  • 68. Resources, Wrap-up, etc. • http://hadoop.apache.org • Very supportive community • Strata + Hadoop World Oct. 28th – 30th in Manhattan • Plenty of resources available to learn more – Blogs – Email lists – Books – Shameless Plug -- MapReduce Design Patterns
  • 69. Getting Started • Pivotal HD Single-Node VM and Community Edition – http://gopivotal.com/pivotal-products/data/pivotal-hd • For the brave and bold -- Roll-your-own! – http://hadoop.apache.org/docs/current
  • 70. Acknowledgements • Apache Hadoop, the Hadoop elephant logo, HDFS, Accumulo, Avro, Drill, Flume, HBase, Hive, Mahout, Oozie, Pig, Sqoop, YARN, and ZooKeeper are trademarks of the Apache Software Foundation • Cloudera Impala is a trademark of Cloudera • Parquet is copyright Twitter, Cloudera, and other contributors • Storm is licensed under the Eclipse Public License
  • 71. Learn More. Stay Connected. • Talk to us on Twitter: @springcentral • Find Session replays on YouTube: spring.io/video

Editor's notes

  1. Apache project based on two Google papers in 2003 and 2004 on the Google File System and MapReduce. Spawned off of Nutch, open-source web-search software, when looking to store the data. Linear scalability using commodity hardware – Facebook adopted Hadoop, growing from 10TB to 15PB. Not for random reads/writes, not for updates – batch processing of large amounts of data. Fault-tolerant system built primarily around distribution and replication of resources. There is no true backup of your Hadoop cluster.
  2. Use HDFS to store petabytes of data using thousands of nodes. The largest Hadoop cluster (March 2011) could hold 30 PB of data.
  3. Looks a lot like a UNIX file system – contains files, folders, permissions, users, and groups. Isn't actually stored that way – large data files are split into blocks and placed on DataNode services. The NameNode is the name server for the file name to block mapping – it knows how the file is split and where the data is in the cluster. All read and write requests go through the NameNode, but data is served from the DataNodes via HTTP. The namespace is stored in memory; transactions are logged on the local file system. The Secondary NameNode or Checkpoint Node creates snapshots of the NameNode namespace for fault tolerance and faster restarts.
  4. Note the NameNode does not persist the data block locations themselves, but does store them in memory – the DataNodes tell the NameNode what blocks they have
  5. Block reports contain all the block IDs a DataNode is holding onto, MD5 checksums, etc.
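The checksum idea in this note can be sketched in plain Java – a toy `hexDigest` helper (hypothetical, not an HDFS API; MD5 here just mirrors the note's mention) that digests a block's bytes so the same bytes always report the same fingerprint:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

// Toy illustration of checksumming a block's bytes so corruption can be
// detected later. hexDigest is a hypothetical helper, not an HDFS API.
public class BlockChecksum {
    static String hexDigest(byte[] block) throws Exception {
        byte[] digest = MessageDigest.getInstance("MD5").digest(block);
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
            hex.append(String.format("%02x", b)); // unsigned two-digit hex
        }
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        byte[] block = "block-data".getBytes(StandardCharsets.UTF_8);
        System.out.println(hexDigest(block)); // 32 hex characters
    }
}
```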
  6. The client contacts the NameNode with a request to write some data. The NameNode responds with the set of DataNodes to write to. The client connects to each DataNode and writes out four blocks, one per node
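The split into fixed-size blocks in this note is simple arithmetic; a plain-Java sketch, assuming the common 128 MB default block size (`numBlocks` is a hypothetical helper, not part of any Hadoop API):

```java
// Sketch: how a file's size maps to a number of HDFS blocks.
// Assumes the 128 MB default block size (older versions used 64 MB).
public class BlockMath {
    static final long BLOCK_SIZE = 128L * 1024 * 1024;

    // Ceiling division: number of blocks needed to store the file.
    static long numBlocks(long fileSize) {
        return (fileSize + BLOCK_SIZE - 1) / BLOCK_SIZE;
    }

    public static void main(String[] args) {
        // A 300 MB file occupies three blocks: 128 + 128 + 44 MB.
        System.out.println(numBlocks(300L * 1024 * 1024)); // prints 3
    }
}
```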
  7. After the file is closed, the DataNodes pass data around to replicate each block in triplicate, all orchestrated by the NameNode. In the event of a node failure, data can be accessed on other nodes, and the NameNode will re-replicate the lost blocks to other nodes
  8. The client contacts the NameNode with a request to write some data. The NameNode responds with the set of DataNodes to write to. The client connects to each DataNode and writes out four blocks, one per node
  9. The client contacts the NameNode with a request to write some data. The NameNode responds with the set of DataNodes to write to. The client connects to each DataNode and writes out four blocks, one per node
  10. The JobTracker takes submitted jobs from clients and determines the locations of the blocks that make up the input. One data block equals one task. Task attempts are distributed to TaskTrackers running alongside each DataNode, thus giving data locality for reading the data. Successful task attempts are good! Failed task attempts are given to another TaskTracker for processing. Four failed attempts of a single task equals one failed job
  11. The client submits a job to the JobTracker for processing. The JobTracker uses the input of the job to determine where the blocks are located (through the NameNode), and then distributes task attempts to the TaskTrackers. The TaskTrackers coordinate the task attempts, and data output is written back to the DataNodes, where it is distributed and replicated as in normal HDFS operations. Job statistics – not output – are reported back to the client upon job completion
  12. The client submits a job to the JobTracker for processing. The JobTracker uses the input of the job to determine where the blocks are located (through the NameNode), and then distributes task attempts to the TaskTrackers. The TaskTrackers coordinate the task attempts, and data output is written back to the DataNodes, where it is distributed and replicated as in normal HDFS operations. Job statistics – not output – are reported back to the client upon job completion
  13. The ResourceManager (RM) manages global assignment of compute resources to applications. The ApplicationMaster (AM) manages the application life cycle – tasked to negotiate resources from the RM, and works with the NMs to execute and monitor tasks. The NodeManager (NM) executes and monitors containers. In YARN, a MapReduce application is equivalent to a job, executed by the MapReduce AM
  14. HBase Master – can run multiple HBase Masters for high availability with automatic failover. HBase RegionServer – hosts a table's Regions, much like how a DataNode hosts a file's blocks
  15. How do I get data into this file system? How do we take all the boilerplate code away? How do we update data in HDFS? Is there a common data format for the ecosystem that plays well with Hadoop? Stream processing, etc.
  16. Uses key/value pairs as input and output to both phases. Highly parallelizable paradigm – a very easy choice for data processing on a Hadoop cluster
  17. Uses key/value pairs as input and output to both phases. Highly parallelizable paradigm – a very easy choice for data processing on a Hadoop cluster
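The key/value paradigm in this note can be sketched without any Hadoop classes – a plain-Java map phase for word count that emits one (word, 1) pair per token. A toy stand-in for a real Mapper, not Hadoop API:

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map.Entry;

// Toy word-count map phase: one input line in, a list of
// (word, 1) key/value pairs out. Mirrors what a Mapper emits.
public class WordCountMap {
    static List<Entry<String, Integer>> map(String line) {
        List<Entry<String, Integer>> pairs = new ArrayList<>();
        for (String token : line.toLowerCase().split("\\s+")) {
            if (!token.isEmpty()) {
                pairs.add(new SimpleEntry<>(token, 1)); // emit (word, 1)
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        System.out.println(map("the quick brown fox the"));
    }
}
```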
  18. Talk about each piece
  19. Talk about each piece; mention keys must be WritableComparable instances
  20. Defines the splits for the MapReduce job as well as the RecordReader
  21. Given an input split, used to create key/value pairs out of the logical split. Responsible for respecting record boundaries
  22. The Mapper is where the good stuff happens. Shown here is the identity Mapper – the default map method that you override with your own logic
  23. Talk about HashPartitioner
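The default HashPartitioner's logic boils down to one line; a plain-Java sketch of that arithmetic (not the Hadoop class itself), masking off the sign bit and modding by the reducer count so identical keys always land on the same reducer:

```java
// Plain-Java sketch of the default hash-partitioning arithmetic:
// partition = (hash with sign bit cleared) mod (number of reducers).
public class HashPartitionSketch {
    static int getPartition(Object key, int numReducers) {
        // & Integer.MAX_VALUE keeps the result non-negative even when
        // the key's hashCode() is negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        System.out.println(getPartition("hadoop", 4));
        // Identical keys always map to the identical partition.
        System.out.println(getPartition("hadoop", 4) == getPartition("hadoop", 4)); // prints true
    }
}
```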
  24. Context is a subclass of Reducer
  25. Each Mapper outputs data to a single file that is logically partitioned by key. Reducers copy their partition over to the local machine – the "shuffle". Each reducer then sorts its partitions into a single sorted local file for processing
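The shuffle-and-sort described here can be simulated in plain Java – a TreeMap stands in for the reducer's sorted merge, and the reduce step sums the emitted 1s. A toy model, not the Hadoop machinery:

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy shuffle/sort/reduce for word count: group the mapped words by key
// in sorted order (as the reducer's merged partition file would be) and
// sum the 1s emitted for each key.
public class ShuffleSortSketch {
    static Map<String, Integer> shuffleSortReduce(List<String> words) {
        // TreeMap keeps keys sorted, mirroring the reducer's sorted merge.
        Map<String, Integer> counts = new TreeMap<>();
        for (String word : words) {
            counts.merge(word, 1, Integer::sum); // the "reduce" step
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(shuffleSortReduce(List.of("b", "a", "b")));
        // prints {a=1, b=2}
    }
}
```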
  26. HDFS is great at storing data, and MapReduce is great at scaling out processing. However, MapReduce is limiting, as everything needs to be expressed in key/value pairs and needs to fit in this box of map, shuffle, sort, reduce. There are many different ways to execute a join using MapReduce; choose which you want based on the type of join, data size, etc. The number of reducers is configurable – fewer reducers means each gets more data to process. Key skew will cause some reducers to get too much work
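The key-skew point can be made concrete with a plain-Java simulation: partition 100 records, 90 of which share one hot key, using the same hash arithmetic as the default partitioner, and watch one reducer's load dominate:

```java
import java.util.Arrays;

// Toy demonstration of key skew: one hot key sends most records to a
// single reducer, no matter how many reducers you configure.
public class SkewSketch {
    // Count how many records each reducer receives under hash partitioning.
    static int[] partitionLoad(String[] keys, int numReducers) {
        int[] load = new int[numReducers];
        for (String k : keys) {
            load[(k.hashCode() & Integer.MAX_VALUE) % numReducers]++;
        }
        return load;
    }

    public static void main(String[] args) {
        String[] keys = new String[100];
        for (int i = 0; i < 90; i++) keys[i] = "hot";        // 90 records, one key
        for (int i = 90; i < 100; i++) keys[i] = "key" + i;  // 10 spread out
        // One reducer gets at least the 90 "hot" records.
        System.out.println(Arrays.toString(partitionLoad(keys, 4)));
    }
}
```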
  27. Java is nice, since a lot of other tools are written in Java and it is pretty easy to get something going quickly. Because of this, anything you can do in Java, you can apply to MapReduce. Just ensure you don't break the paradigm and keep everything parallel
  28. Easy to install on VM and begin exploring