1 ITC CORPORATE PRESENTATION © IT Convergence 2017 • All rights reserved
A Presentation for:
Getting Started with Hadoop, Spark,
Hive and Kafka
Edelweiss Kammermann
New York
March 8th, 2018
IT CONVERGENCE SNAPSHOT
Over 600 Customer Engagements in More Than 50 Countries
EXTENSIVE EXPERTISE ACROSS THE GLOBE
About me
• Computer Engineer, BI and Data Integration Specialist
• Over 20 years of consulting and project management experience in Oracle technology
• Co-founder and Vice President of the Uruguayan Oracle User Group (UYOUG)
• Vice President of LAOUC (Latin America Oracle User Community)
• BI Manager at IT Convergence
• Writer and frequent speaker at international conferences: Collaborate, OTN Tour LA, UKOUG Tech & Apps, OOW, etc.
• Oracle ACE Director
• Oracle Big Data Implementation Specialist
Uruguay
3 Membership Tiers
• Oracle ACE Director
• Oracle ACE
• Oracle ACE Associate
bit.ly/OracleACEProgram
500+ Technical Experts Helping Peers Globally
Nominate yourself or someone you know: acenomination.oracle.com
Connect:
@oracleace
Facebook.com/oracleaces
oracle-ace_ww@oracle.com
Index
What is Big Data?
Hadoop
Hive
Spark
What is Big Data?
• Volume: large amounts of data
• Variety: different data types and formats, including unstructured and semi-structured data
• Velocity: the speed at which data is created and/or consumed
• Veracity: the quality and accuracy of the data
• Value: data has intrinsic value, but it must be discovered
Hadoop
• An open source software platform for distributed storage and processing
• Manages huge volumes of unstructured data
• Processes large data sets in parallel
• Highly scalable
• Fault-tolerant
• Two main components:
  • HDFS: the Hadoop Distributed File System, which stores the data
  • MapReduce: a programming framework that processes the data
HDFS Architecture (Simplified)
• Client: requests operations such as reading or writing data
• NameNode: manages metadata and access control; knows where the data is (which DataNodes contain the blocks of each file) and keeps this information in memory
• DataNodes: store and retrieve data (blocks) at the client's request
HDFS: Writing Data
1. The Client divides the file into fixed-size blocks (usually 64 or 128 MB)
2. For each block, the Client asks the NameNode which DataNodes it can write to, specifying the block size and replication factor
3. For each block, the NameNode provides the DataNode addresses, sorted by increasing distance
HDFS: Writing Data (continued)
4. The Client sends the block data and the list of nodes to the first DataNode
5. Each DataNode sends the data to the following DataNode (replication pipeline)
6. Each DataNode sends Done to the NameNode once the block data is written to hard disk
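The write path described above can be sketched as a small single-process simulation. This is only an illustration under stated assumptions: `ToyNameNode`, `split_into_blocks` and `write_file` are hypothetical names, not Hadoop APIs, and the target selection simply rotates through the DataNodes instead of sorting them by distance.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, one of the usual block sizes


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Step 1: return the size of each block; only the last one may be shorter."""
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])


@dataclass
class ToyNameNode:
    """Hypothetical stand-in for the NameNode: picks targets, tracks block locations."""
    datanodes: list
    locations: dict = field(default_factory=dict)

    def choose_targets(self, block_id, replication):
        # Steps 2-3: answer with a list of DataNodes for this block.
        # Assumption: a simple rotation replaces the real distance-based ordering.
        start = block_id % len(self.datanodes)
        ring = self.datanodes[start:] + self.datanodes[:start]
        return ring[:replication]

    def block_written(self, block_id, datanode):
        # Step 6: a DataNode reports Done once the block is on disk.
        self.locations.setdefault(block_id, []).append(datanode)


def write_file(namenode, file_size, replication=3):
    for block_id, _size in enumerate(split_into_blocks(file_size)):
        pipeline = namenode.choose_targets(block_id, replication)
        # Steps 4-5: the client sends to the first node, each node forwards to the next.
        for datanode in pipeline:
            namenode.block_written(block_id, datanode)


namenode = ToyNameNode(["dn1", "dn2", "dn3", "dn4"])
write_file(namenode, file_size=300 * 1024 * 1024)
print(namenode.locations)
```

A 300 MB file becomes three blocks (two full ones and a shorter last one), and each block ends up recorded on three DataNodes.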
HDFS: Reading Data
1. The Client asks the NameNode for a specific file
2. The NameNode sends the list of blocks of the file and the list of DataNodes for each block
3. The Client downloads the data from the nearest DataNode (sending the block number)
4. The DataNode sends the data for the required block
HDFS: Fault Tolerance
• Node failure
  • DataNodes send a heartbeat every 3 seconds
  • If the NameNode doesn't receive a heartbeat for 10 minutes, it considers that node dead
• Communication failure
  • Detected when the sender does not receive an ACK from a DataNode after several retries
• Data corruption
  • DataNodes send block reports to the NameNode that exclude corrupted blocks (detected by checksum validation)
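The node-failure rule above can be sketched as follows. The 3-second interval and 10-minute timeout come from the slide; `HeartbeatMonitor` and its methods are hypothetical names for illustration, not part of Hadoop.

```python
HEARTBEAT_INTERVAL_S = 3   # DataNodes send a heartbeat every 3 seconds
DEAD_AFTER_S = 10 * 60     # a node silent for 10 minutes is considered dead


class HeartbeatMonitor:
    """Hypothetical NameNode-side bookkeeping of the last heartbeat per DataNode."""

    def __init__(self, dead_after_s=DEAD_AFTER_S):
        self.dead_after_s = dead_after_s
        self.last_seen = {}

    def heartbeat(self, node, now):
        self.last_seen[node] = now

    def dead_nodes(self, now):
        # A node is dead when its last heartbeat is older than the timeout.
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen > self.dead_after_s
        )


monitor = HeartbeatMonitor()
monitor.heartbeat("dn1", now=0)
monitor.heartbeat("dn2", now=0)
# dn1 keeps sending heartbeats every 3 seconds; dn2 goes silent after t=0.
for t in range(HEARTBEAT_INTERVAL_S, 700, HEARTBEAT_INTERVAL_S):
    monitor.heartbeat("dn1", now=t)
print(monitor.dead_nodes(now=700))
```

Timestamps are passed in explicitly (in seconds) so the sketch stays deterministic; a real monitor would read a clock.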
HDFS: High Availability
• Standby NameNode (active-standby configuration)
• NameNodes use shared storage
• DataNodes send block reports to both NameNodes
Diagram: an Active NameNode and a Passive NameNode, both connected to Shared Storage
HDFS: Command Examples
• hadoop fs -ls
• hadoop fs -put <local_path> <hdfs_path>
• hadoop fs -get <hdfs_path> <local_path>
• hadoop fs -cat <hdfs_path>
• hadoop fs -rm -r <hdfs_path> (replaces the deprecated -rmr)
• hadoop fs -copyFromLocal <local_path> <hdfs_path>
MapReduce
• Processes data stored in HDFS
• A MapReduce program is composed of:
  • A Map() method: performs filtering and sorting of the <key, value> inputs
  • A Reduce() method: summarizes the <key, value> pairs provided by the mappers
• Code can be written in many languages (Perl, Python, Java, etc.)
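The Map/Reduce split described above can be illustrated with a word count that runs in a single Python process. This is a sketch of the programming model only, not Hadoop code: `map_phase`, `shuffle` and `reduce_phase` are hypothetical names, and the shuffle step that Hadoop performs between mappers and reducers is imitated with a dictionary.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a <word, 1> pair for every word in the input line."""
    for word in line.split():
        yield (word.lower(), 1)


def shuffle(pairs):
    """Group all values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_phase(key, values):
    """Reduce: summarize the values of one key into a single count."""
    return (key, sum(values))


lines = ["big data is big", "data is everywhere"]
mapped = [pair for line in lines for pair in map_phase(line)]
counts = [reduce_phase(key, values) for key, values in shuffle(mapped).items()]
print(counts)
```

On a cluster, each mapper would process a different block of the input and each reducer a different subset of the keys; here everything happens sequentially.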
MapReduce Example
MapReduce Code Example
Hadoop Demo
But…
MapReduce has a steep learning curve…
How can we analyze Big Data with a familiar language?
Hive
• Open source data warehouse software built on top of Apache Hadoop
• Analyzes and queries data stored in HDFS
• Structures the data into tables
• Provides tools for simple ETL
• SQL-like queries (HiveQL)
• Uses MapReduce as its execution engine
• Metadata is stored in an RDBMS
HiveQL
• UPDATE, INSERT, DELETE
• Limited transaction support
• Indexes supported
• Multi-table insert support
• SQL-92 join support
• Read-only views
Hive: Code Example
SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[HAVING having_condition]
[CLUSTER BY col_list | [DISTRIBUTE BY col_list] [SORT BY col_list]]
[LIMIT number];
Hive: Pros & Cons
• Pros
  • Familiarity with SQL
  • Interactive
  • Connection through JDBC/ODBC drivers
• Cons
  • High latency
  • Doesn't have a query cache
  • Only supports equality joins (equi-joins)
Hive Demo
But…
Hive has high latency…
What if I want better performance and to analyze real-time data?
Spark
• Apache Spark is a fast, in-memory data processing engine
• Provides native bindings for Java, Scala, Python and R
• Supports SQL, streaming data, machine learning and graph processing
• Can run standalone, on Hadoop, or on Apache Mesos
Spark vs MapReduce
• Spark's main advantages over MapReduce
  • Speed
    • Can perform tasks up to 100 times faster if all the data fits in memory
    • Otherwise it can still be more than 10 times faster
  • Spark API (developer friendly)
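Spark: Code Example
// Assumes spark-submit or spark-shell (where the session already exists as `spark`); the app name is illustrative
import org.apache.spark.sql.SparkSession
val sparkSession = SparkSession.builder().appName("WordCount").getOrCreate()
// Read the input from HDFS as an RDD of lines
val textFile = sparkSession.sparkContext.textFile("hdfs:///tmp/words")
// Split each line into words, pair each word with 1, then sum the counts per word
val counts = textFile.flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs:///tmp/words_agg")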
Spark: Components
• Spark Core
• Spark Streaming
• Spark SQL
• MLlib
• GraphX
Spark: Resilient Distributed Dataset (RDD)
• A programming abstraction representing a collection of objects
• Cannot be modified (immutable)
• Can be split across a computing cluster
• Can be created from text files, SQL databases, NoSQL databases (Cassandra, MongoDB, etc.)
• Operations on RDDs
  • Can be split across the cluster and executed as a parallel batch process
  • Fast and scalable parallel processing
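The RDD properties listed above (immutable, partitioned, operated on partition by partition) can be imitated in a few lines of plain Python. `ToyRDD` is a hypothetical single-process class written for this illustration; it is not the Spark API and does no real distribution.

```python
from functools import reduce
from operator import add


class ToyRDD:
    """Hypothetical imitation of an RDD: an immutable, partitioned collection."""

    def __init__(self, partitions):
        # Tuples make the partitions immutable once created.
        self._partitions = tuple(tuple(p) for p in partitions)

    @classmethod
    def parallelize(cls, items, num_partitions):
        items = list(items)
        # Deal the items round-robin into the requested number of partitions.
        return cls(items[i::num_partitions] for i in range(num_partitions))

    def map(self, fn):
        # A transformation never changes this object; it returns a new ToyRDD.
        return ToyRDD(tuple(fn(x) for x in p) for p in self._partitions)

    def reduce(self, fn):
        # Reduce each partition on its own, then combine the partial results.
        partials = [reduce(fn, p) for p in self._partitions if p]
        return reduce(fn, partials)

    def collect(self):
        return [x for p in self._partitions for x in p]


numbers = ToyRDD.parallelize(range(1, 7), num_partitions=3)
squares = numbers.map(lambda x: x * x)
total = squares.reduce(add)
print(total, sorted(numbers.collect()))
```

In Spark the per-partition work would run on different executors; the two-stage `reduce` here only mirrors that shape.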
Spark Streaming
• Takes data as it comes in and processes it in near real time
  • Example: Internet of Things applications
• Breaks the stream down into individual parts called micro-batches
  • Each micro-batch is processed as a small RDD
• Reliable: "checkpoints" store data to disk periodically for fault tolerance
• Windowing operations: compute results across a longer time period than the batch interval
  • Example: top sales from the past 2 hours
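The windowing idea above (a result computed over several micro-batches) can be sketched without Spark. `SlidingWindow` is a hypothetical helper: it keeps the most recent N micro-batches and aggregates sales across them. The product names and amounts are made-up sample data.

```python
from collections import Counter, deque


class SlidingWindow:
    """Hypothetical window that remembers only the last `window_batches` micro-batches."""

    def __init__(self, window_batches):
        # deque(maxlen=...) silently drops the oldest batch when a new one arrives.
        self._batches = deque(maxlen=window_batches)

    def add_batch(self, sales):
        # `sales` maps product name to units sold in one micro-batch.
        self._batches.append(Counter(sales))

    def top(self, n):
        total = Counter()
        for batch in self._batches:
            total.update(batch)
        return total.most_common(n)


window = SlidingWindow(window_batches=2)
window.add_batch({"book": 5, "pen": 2})
window.add_batch({"pen": 4})
first = window.top(1)    # both batches are inside the window
window.add_batch({"book": 1})
second = window.top(1)   # the first batch has slid out of the window
print(first, second)
```

With a 30-minute batch interval, for instance, a 2-hour window would correspond to `window_batches=4`.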
Spark Demo
But…
What if I want to integrate Big Data with my other systems?
Integration Challenge
Systems to integrate: RDBMS, Hadoop, NoSQL, Website
Kafka
Connected systems: RDBMS, Hadoop, NoSQL, Website
• Distributed streaming platform
• Decouples data streams
• Fault-tolerant
• High performance
• Horizontally scalable
How Kafka Works: Kafka Core
• Source systems (RDBMS, NoSQL, Website, Apps) write to Kafka through Producers
• Target systems (Hadoop, RDBMS, NoSQL, Analytic Tools) read from Kafka through Consumers
How Kafka Works: Extended API
• Source systems (RDBMS, NoSQL, Website, Apps) feed Kafka through Kafka Connect Source
• Target systems (Hadoop, RDBMS, NoSQL, Analytic Tools) are fed from Kafka through Kafka Connect Sink
How Kafka Works: Topics & Partitions
• Messages are stored in Topics
  • A similar concept to a database table
• Topics
  • Are identified by a unique name
  • Are split into Partitions (for redundancy and performance)
• Partitions
  • Each partition is ordered
  • When a message arrives at a partition, it is assigned an id: the Offset
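The relationship between topics, partitions and offsets described above can be sketched as an in-memory structure. `ToyTopic` is a hypothetical class, and the key-hash partitioner based on `zlib.crc32` is an assumption made for the illustration; it is not Kafka's own partitioner.

```python
import zlib


class ToyTopic:
    """Hypothetical in-memory topic: a named list of ordered partitions."""

    def __init__(self, name, num_partitions):
        self.name = name
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, key, value):
        # Assumed partitioner: the same key always lands in the same partition.
        partition = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        log = self.partitions[partition]
        log.append(value)
        # The offset is the position of the message inside its partition.
        return partition, len(log) - 1


topic = ToyTopic("orders", num_partitions=3)
p1, o1 = topic.append("customer-42", "order created")
p2, o2 = topic.append("customer-42", "order paid")
print((p1, o1), (p2, o2))
```

Offsets are only meaningful within one partition: two messages in different partitions can both have offset 0.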
How Kafka Works: Brokers
• Brokers = servers in a Kafka cluster
• Are identified by an ID number
• Contain topic partitions
• It is recommended to have at least 3 Brokers in a cluster
Diagram: Brokers 1, 2 and 3 hosting Topic 3 (Partitions 0 and 1) and Topic 2 (Partitions 1 and 2)
How Kafka Works: Replication Factor
• A Topic should have a Replication Factor > 1 (usually 2 or 3) to avoid losing data
Diagram: Topic 3 Partition 0 and Topic 2 Partition 1 are each stored on two of the three Brokers (1, 2 and 3)
How Kafka Works: Replication Factor (after a broker failure)
Diagram: with Broker 1 gone, Brokers 2 and 3 still hold Topic 3 Partition 0 and Topic 2 Partition 1
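Why a replication factor above 1 protects against data loss can be checked with a small simulation. The functions `assign_replicas` and `surviving` are hypothetical, and the round-robin placement is an assumption for the sketch, not Kafka's actual assignment algorithm.

```python
def assign_replicas(partitions, brokers, replication_factor):
    """Place each partition on `replication_factor` distinct brokers (round-robin)."""
    if replication_factor > len(brokers):
        raise ValueError("replication factor cannot exceed the number of brokers")
    assignment = {}
    for i, partition in enumerate(partitions):
        assignment[partition] = [
            brokers[(i + r) % len(brokers)] for r in range(replication_factor)
        ]
    return assignment


def surviving(assignment, failed_broker):
    """Return the replicas that remain for each partition after one broker fails."""
    return {
        partition: [b for b in replicas if b != failed_broker]
        for partition, replicas in assignment.items()
    }


partitions = ["topic3-p0", "topic2-p1"]
replicated = surviving(assign_replicas(partitions, [1, 2, 3], 2), failed_broker=1)
unreplicated = surviving(assign_replicas(partitions, [1, 2, 3], 1), failed_broker=1)

lost_with_rf2 = [p for p, replicas in replicated.items() if not replicas]
lost_with_rf1 = [p for p, replicas in unreplicated.items() if not replicas]
print(lost_with_rf2, lost_with_rf1)
```

With a replication factor of 2, every partition still has a copy after Broker 1 fails; with a factor of 1, the partition that lived only on Broker 1 is gone.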
How Kafka Works: Producers
• Producers write data to Topics
• Can choose the type of acknowledgment (ACK) required from the partition
  • ACK=0 (no ack)
  • ACK=1 (partition leader only)
  • ACK=All (all the in-sync replicas)
Diagram: a Producer writes to Topic 1, whose Partition 0 (offsets 0 to 7) and Partition 1 (offsets 0 to 4) are hosted on Brokers 1 and 2
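The three acknowledgment levels above differ only in what the producer waits for before treating a write as successful. The function below is a hypothetical decision table for that rule, written for illustration; it does not talk to Kafka.

```python
def is_acknowledged(acks, leader_wrote, followers_wrote):
    """Return True when a write counts as successful for the given ack level.

    acks: "0", "1" or "all"
    leader_wrote: whether the partition leader stored the message
    followers_wrote: one boolean per in-sync follower replica
    """
    if acks == "0":
        # Fire and forget: the producer does not wait for any confirmation.
        return True
    if acks == "1":
        # Only the partition leader has to confirm.
        return leader_wrote
    if acks == "all":
        # The leader and every in-sync replica have to confirm.
        return leader_wrote and all(followers_wrote)
    raise ValueError(f"unknown ack level: {acks!r}")


# The leader stored the message, but one follower has not confirmed yet.
results = {
    level: is_acknowledged(level, leader_wrote=True, followers_wrote=[True, False])
    for level in ("0", "1", "all")
}
print(results)
```

The trade-off is latency against durability: lower levels return sooner but can report success for a message that is later lost.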
How Kafka Works: Consumers
• Consumers read data from a Topic
• A Consumer reads
  • In order from each partition
  • In parallel between partitions
Diagram: a Consumer reads from Topic 1, whose Partition 0 (offsets 0 to 7) and Partition 1 (offsets 0 to 4) are hosted on Brokers 1 and 2
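The reading rule above (ordered within a partition, independent across partitions) amounts to keeping one position per partition. `ToyConsumer` is a hypothetical class for illustration, and the message values simply mirror the offsets shown in the diagram.

```python
class ToyConsumer:
    """Hypothetical consumer that tracks the next offset to read in each partition."""

    def __init__(self, partitions):
        # `partitions` maps a partition id to its ordered list of messages.
        self.partitions = partitions
        self.next_offset = {partition: 0 for partition in partitions}

    def poll(self, partition, max_records=2):
        start = self.next_offset[partition]
        records = self.partitions[partition][start:start + max_records]
        # Advance only this partition's position; the others are unaffected.
        self.next_offset[partition] = start + len(records)
        return records


consumer = ToyConsumer({0: list(range(8)), 1: list(range(5))})
a = consumer.poll(0)
b = consumer.poll(1)
c = consumer.poll(0)
print(a, b, c, consumer.next_offset)
```

Because each partition has its own position, several consumers could each take a different partition and read in parallel without losing the per-partition order.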
Kafka Demo
Want to install these tools?
• Hadoop
  • https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/SingleCluster.html
  • https://www.tutorialspoint.com/hadoop/hadoop_enviornment_setup.htm
• Hive
  • https://www.tutorialspoint.com/hive/hive_installation.htm
• Spark
  • https://www.tutorialspoint.com/apache_spark/apache_spark_installation.htm
• Kafka
  • https://www.tutorialspoint.com/apache_kafka/apache_kafka_installation_steps.htm
  • https://kafka.apache.org/quickstart
Want to play with these tools?
• Oracle Big Data Lite pre-built VM
  • http://www.oracle.com/technetwork/database/bigdata-appliance/oracle-bigdatalite-2104726.html
• Cloudera Quickstart VMs
  • https://www.cloudera.com/downloads/quickstart_vms/5-12.html
• Apache Kafka Docker container
  • https://github.com/Landoop/fast-data-dev
Questions?
52 ITC CORPORATE PRESENTATION © IT Convergence 2017 • All rights reserved
53 ITC CORPORATE PRESENTATION © IT Convergence 2017 • All rights reserved

Más contenido relacionado

La actualidad más candente

Parallelization of Structured Streaming Jobs Using Delta Lake
Parallelization of Structured Streaming Jobs Using Delta LakeParallelization of Structured Streaming Jobs Using Delta Lake
Parallelization of Structured Streaming Jobs Using Delta LakeDatabricks
 
hive HBase Metastore - Improving Hive with a Big Data Metadata Storage
hive HBase Metastore - Improving Hive with a Big Data Metadata Storagehive HBase Metastore - Improving Hive with a Big Data Metadata Storage
hive HBase Metastore - Improving Hive with a Big Data Metadata StorageDataWorks Summit/Hadoop Summit
 
Big Data Analytics with Spark
Big Data Analytics with SparkBig Data Analytics with Spark
Big Data Analytics with SparkMohammed Guller
 
Introduction à Cassandra - campus plex
Introduction à Cassandra - campus plexIntroduction à Cassandra - campus plex
Introduction à Cassandra - campus plexjaxio
 
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...Simplilearn
 
How to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and HudiHow to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and HudiFlink Forward
 
Hudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilitiesHudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilitiesNishith Agarwal
 
Apache Spark.
Apache Spark.Apache Spark.
Apache Spark.JananiJ19
 
Apache Hive Tutorial
Apache Hive TutorialApache Hive Tutorial
Apache Hive TutorialSandeep Patil
 
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...HostedbyConfluent
 
Building Your Data Streams for all the IoT
Building Your Data Streams for all the IoTBuilding Your Data Streams for all the IoT
Building Your Data Streams for all the IoTDevOps.com
 
Hadoop Hive Tutorial | Hive Fundamentals | Hive Architecture
Hadoop Hive Tutorial | Hive Fundamentals | Hive ArchitectureHadoop Hive Tutorial | Hive Fundamentals | Hive Architecture
Hadoop Hive Tutorial | Hive Fundamentals | Hive ArchitectureSkillspeed
 
Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...
Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...
Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...Simplilearn
 
INTRODUCTION TO BIG DATA AND HADOOP
INTRODUCTION TO BIG DATA AND HADOOPINTRODUCTION TO BIG DATA AND HADOOP
INTRODUCTION TO BIG DATA AND HADOOPDr Geetha Mohan
 

La actualidad más candente (20)

Parallelization of Structured Streaming Jobs Using Delta Lake
Parallelization of Structured Streaming Jobs Using Delta LakeParallelization of Structured Streaming Jobs Using Delta Lake
Parallelization of Structured Streaming Jobs Using Delta Lake
 
Sqoop
SqoopSqoop
Sqoop
 
hive HBase Metastore - Improving Hive with a Big Data Metadata Storage
hive HBase Metastore - Improving Hive with a Big Data Metadata Storagehive HBase Metastore - Improving Hive with a Big Data Metadata Storage
hive HBase Metastore - Improving Hive with a Big Data Metadata Storage
 
Big Data Analytics with Spark
Big Data Analytics with SparkBig Data Analytics with Spark
Big Data Analytics with Spark
 
Introduction à Cassandra - campus plex
Introduction à Cassandra - campus plexIntroduction à Cassandra - campus plex
Introduction à Cassandra - campus plex
 
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...
 
How to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and HudiHow to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and Hudi
 
Hudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilitiesHudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilities
 
Hive tuning
Hive tuningHive tuning
Hive tuning
 
Apache Spark Architecture
Apache Spark ArchitectureApache Spark Architecture
Apache Spark Architecture
 
Apache Spark.
Apache Spark.Apache Spark.
Apache Spark.
 
Apache Hive Tutorial
Apache Hive TutorialApache Hive Tutorial
Apache Hive Tutorial
 
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
 
Building Your Data Streams for all the IoT
Building Your Data Streams for all the IoTBuilding Your Data Streams for all the IoT
Building Your Data Streams for all the IoT
 
Hadoop Hive Tutorial | Hive Fundamentals | Hive Architecture
Hadoop Hive Tutorial | Hive Fundamentals | Hive ArchitectureHadoop Hive Tutorial | Hive Fundamentals | Hive Architecture
Hadoop Hive Tutorial | Hive Fundamentals | Hive Architecture
 
Big data analytics - hadoop
Big data analytics - hadoopBig data analytics - hadoop
Big data analytics - hadoop
 
Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...
Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...
Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...
 
Hive: Loading Data
Hive: Loading DataHive: Loading Data
Hive: Loading Data
 
INTRODUCTION TO BIG DATA AND HADOOP
INTRODUCTION TO BIG DATA AND HADOOPINTRODUCTION TO BIG DATA AND HADOOP
INTRODUCTION TO BIG DATA AND HADOOP
 
Data science big data and analytics
Data science big data and analyticsData science big data and analytics
Data science big data and analytics
 

Similar a Getting started with Hadoop, Hive, Spark and Kafka

The Open Source and Cloud Part of Oracle Big Data Cloud Service for Beginners
The Open Source and Cloud Part of Oracle Big Data Cloud Service for BeginnersThe Open Source and Cloud Part of Oracle Big Data Cloud Service for Beginners
The Open Source and Cloud Part of Oracle Big Data Cloud Service for BeginnersEdelweiss Kammermann
 
Sql on everything with drill
Sql on everything with drillSql on everything with drill
Sql on everything with drillJulien Le Dem
 
HBaseCon2017 Splice Machine as a Service: Multi-tenant HBase using DCOS (Meso...
HBaseCon2017 Splice Machine as a Service: Multi-tenant HBase using DCOS (Meso...HBaseCon2017 Splice Machine as a Service: Multi-tenant HBase using DCOS (Meso...
HBaseCon2017 Splice Machine as a Service: Multi-tenant HBase using DCOS (Meso...HBaseCon
 
There and back_again_oracle_and_big_data_16x9
There and back_again_oracle_and_big_data_16x9There and back_again_oracle_and_big_data_16x9
There and back_again_oracle_and_big_data_16x9Gleb Otochkin
 
[db tech showcase Tokyo 2017] C13:There and back again or how to connect Orac...
[db tech showcase Tokyo 2017] C13:There and back again or how to connect Orac...[db tech showcase Tokyo 2017] C13:There and back again or how to connect Orac...
[db tech showcase Tokyo 2017] C13:There and back again or how to connect Orac...Insight Technology, Inc.
 
Real World Modern Development Use Cases with RackHD and Adobe
Real World Modern Development Use Cases with RackHD and AdobeReal World Modern Development Use Cases with RackHD and Adobe
Real World Modern Development Use Cases with RackHD and AdobeTimothy Gelter
 
Big Data Best Practices on GCP
Big Data Best Practices on GCPBig Data Best Practices on GCP
Big Data Best Practices on GCPAllCloud
 
Big Data Best Practices on GCP
Big Data Best Practices on GCPBig Data Best Practices on GCP
Big Data Best Practices on GCPAllCloud
 
Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark an...
Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark an...Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark an...
Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark an...Frank Munz
 
SQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for ImpalaSQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for Impalamarkgrover
 
CISCO - Presentation at Hortonworks Booth - Strata 2014
CISCO - Presentation at Hortonworks Booth - Strata 2014CISCO - Presentation at Hortonworks Booth - Strata 2014
CISCO - Presentation at Hortonworks Booth - Strata 2014Hortonworks
 
Microsoft R Server for Data Sciencea
Microsoft R Server for Data ScienceaMicrosoft R Server for Data Sciencea
Microsoft R Server for Data ScienceaData Science Thailand
 
Demystifying Data Warehousing as a Service - DFW
Demystifying Data Warehousing as a Service - DFWDemystifying Data Warehousing as a Service - DFW
Demystifying Data Warehousing as a Service - DFWKent Graziano
 
CCD-410 Cloudera Study Material
CCD-410 Cloudera Study MaterialCCD-410 Cloudera Study Material
CCD-410 Cloudera Study MaterialRoxycodone Online
 
Big Data Infrastructure
Big Data InfrastructureBig Data Infrastructure
Big Data InfrastructureTrivadis
 
Design Choices for Cloud Data Platforms
Design Choices for Cloud Data PlatformsDesign Choices for Cloud Data Platforms
Design Choices for Cloud Data PlatformsAshish Mrig
 
Hadoop Summit San Jose 2015: What it Takes to Run Hadoop at Scale Yahoo Persp...
Hadoop Summit San Jose 2015: What it Takes to Run Hadoop at Scale Yahoo Persp...Hadoop Summit San Jose 2015: What it Takes to Run Hadoop at Scale Yahoo Persp...
Hadoop Summit San Jose 2015: What it Takes to Run Hadoop at Scale Yahoo Persp...Sumeet Singh
 
Managing ScaleIO as Software on Mesos
Managing ScaleIO as Software on MesosManaging ScaleIO as Software on Mesos
Managing ScaleIO as Software on MesosDavid vonThenen
 

Similar a Getting started with Hadoop, Hive, Spark and Kafka (20)

The Open Source and Cloud Part of Oracle Big Data Cloud Service for Beginners
The Open Source and Cloud Part of Oracle Big Data Cloud Service for BeginnersThe Open Source and Cloud Part of Oracle Big Data Cloud Service for Beginners
The Open Source and Cloud Part of Oracle Big Data Cloud Service for Beginners
 
Sql on everything with drill
Sql on everything with drillSql on everything with drill
Sql on everything with drill
 
HBaseCon2017 Splice Machine as a Service: Multi-tenant HBase using DCOS (Meso...
HBaseCon2017 Splice Machine as a Service: Multi-tenant HBase using DCOS (Meso...HBaseCon2017 Splice Machine as a Service: Multi-tenant HBase using DCOS (Meso...
HBaseCon2017 Splice Machine as a Service: Multi-tenant HBase using DCOS (Meso...
 
There and back_again_oracle_and_big_data_16x9
There and back_again_oracle_and_big_data_16x9There and back_again_oracle_and_big_data_16x9
There and back_again_oracle_and_big_data_16x9
 
[db tech showcase Tokyo 2017] C13:There and back again or how to connect Orac...
[db tech showcase Tokyo 2017] C13:There and back again or how to connect Orac...[db tech showcase Tokyo 2017] C13:There and back again or how to connect Orac...
[db tech showcase Tokyo 2017] C13:There and back again or how to connect Orac...
 
Real World Modern Development Use Cases with RackHD and Adobe
Real World Modern Development Use Cases with RackHD and AdobeReal World Modern Development Use Cases with RackHD and Adobe
Real World Modern Development Use Cases with RackHD and Adobe
 
Big Data Best Practices on GCP
Big Data Best Practices on GCPBig Data Best Practices on GCP
Big Data Best Practices on GCP
 
Big Data Best Practices on GCP
Big Data Best Practices on GCPBig Data Best Practices on GCP
Big Data Best Practices on GCP
 
Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark an...
Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark an...Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark an...
Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark an...
 
OpenStack Days Krakow
OpenStack Days KrakowOpenStack Days Krakow
OpenStack Days Krakow
 
SQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for ImpalaSQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for Impala
 
CISCO - Presentation at Hortonworks Booth - Strata 2014
CISCO - Presentation at Hortonworks Booth - Strata 2014CISCO - Presentation at Hortonworks Booth - Strata 2014
CISCO - Presentation at Hortonworks Booth - Strata 2014
 
Microsoft R Server for Data Sciencea
Microsoft R Server for Data ScienceaMicrosoft R Server for Data Sciencea
Microsoft R Server for Data Sciencea
 
Big data overview by Edgars
Big data overview by EdgarsBig data overview by Edgars
Big data overview by Edgars
 
Demystifying Data Warehousing as a Service - DFW
Demystifying Data Warehousing as a Service - DFWDemystifying Data Warehousing as a Service - DFW
Demystifying Data Warehousing as a Service - DFW
 
CCD-410 Cloudera Study Material
CCD-410 Cloudera Study MaterialCCD-410 Cloudera Study Material
CCD-410 Cloudera Study Material
 
Big Data Infrastructure
Big Data InfrastructureBig Data Infrastructure
Big Data Infrastructure
 
Design Choices for Cloud Data Platforms
Design Choices for Cloud Data PlatformsDesign Choices for Cloud Data Platforms
Design Choices for Cloud Data Platforms
 
Hadoop Summit San Jose 2015: What it Takes to Run Hadoop at Scale Yahoo Persp...
Hadoop Summit San Jose 2015: What it Takes to Run Hadoop at Scale Yahoo Persp...Hadoop Summit San Jose 2015: What it Takes to Run Hadoop at Scale Yahoo Persp...
Hadoop Summit San Jose 2015: What it Takes to Run Hadoop at Scale Yahoo Persp...
 
Managing ScaleIO as Software on Mesos
Managing ScaleIO as Software on MesosManaging ScaleIO as Software on Mesos
Managing ScaleIO as Software on Mesos
 

Más de Edelweiss Kammermann

AWDC para desarrolladores y data scientists
AWDC para desarrolladores y data scientists AWDC para desarrolladores y data scientists
AWDC para desarrolladores y data scientists Edelweiss Kammermann
 
Oracle Autonomous Data Warehouse Cloud and Data Visualization
Oracle Autonomous Data Warehouse Cloud and Data VisualizationOracle Autonomous Data Warehouse Cloud and Data Visualization
Oracle Autonomous Data Warehouse Cloud and Data VisualizationEdelweiss Kammermann
 
Working with Oracle Big Data Cloud Compute Edition and Apache Zeppelin
Working with Oracle Big Data Cloud Compute Edition and Apache ZeppelinWorking with Oracle Big Data Cloud Compute Edition and Apache Zeppelin
Working with Oracle Big Data Cloud Compute Edition and Apache ZeppelinEdelweiss Kammermann
 
Moving OBIEE to Oracle Analytics Cloud
Moving OBIEE to Oracle Analytics CloudMoving OBIEE to Oracle Analytics Cloud
Moving OBIEE to Oracle Analytics CloudEdelweiss Kammermann
 
Como elegir entre BI Cloud, Data Visualization and Oracle Analytics Cloud Ser...

Getting started with Hadoop, Hive, Spark and Kafka

  • 1. ITC Corporate Presentation © IT Convergence 2017 • All rights reserved. A presentation for: Getting Started with Hadoop, Spark, Hive and Kafka. Edelweiss Kammermann, New York, March 8th 2018
  • 2. IT Convergence Snapshot
  • 3. Extensive Expertise Across the Globe: over 600 customer engagements in more than 50 countries
  • 4. About me
      • Computer Engineer, BI and Data Integration Specialist
      • Over 20 years of consulting and project management experience in Oracle technology
      • Co-founder and Vice President of the Uruguayan Oracle User Group (UYOUG)
      • Vice President of LAOUC (Latin America Oracle User Community)
      • BI Manager at IT Convergence
      • Writer and frequent speaker at international conferences: Collaborate, OTN Tour LA, UKOUG Tech & Apps, OOW, etc.
      • Oracle ACE Director
      • Oracle Big Data Implementation Specialist
  • 5. Uruguay
  • 6. Oracle ACE Program: 500+ technical experts helping peers globally
      • 3 membership tiers: Oracle ACE Director, Oracle ACE, Oracle ACE Associate
      • bit.ly/OracleACEProgram
      • Nominate yourself or someone you know: acenomination.oracle.com
      • Connect: @oracleace, Facebook.com/oracleaces, oracle-ace_ww@oracle.com
  • 7. Index
      • What is Big Data?
      • Hadoop
      • Hive
      • Spark
  • 8. What is Big Data?
      • Volume: large amounts of data
      • Variety: different data types and formats, including unstructured and semi-structured data
      • Velocity: the speed at which data is created and/or consumed
      • Veracity: the quality and accuracy of the data
      • Value: data has intrinsic value, but it must be discovered
  • 10. Hadoop
      • An open source software platform for distributed storage and processing
      • Manages huge volumes of unstructured data
      • Parallel processing of large data sets
      • Highly scalable
      • Fault-tolerant
      • Two main components:
          • HDFS: Hadoop Distributed File System, for storing information
          • MapReduce: programming framework that processes information
  • 11. HDFS Architecture (Simplified)
      • Client: requests operations such as reading or writing data
      • NameNode: manages metadata and access control; knows where the data is (which DataNodes contain the blocks of each file) and keeps this information in memory
      • DataNodes: store and retrieve data (blocks) at the client's request
  • 12. HDFS: Writing Data (Client, NameNode, DataNodes)
      1. The client divides the file into fixed-size blocks (usually 64 or 128 MB)
      2. For each block, the client asks the NameNode which DataNodes it can write to, specifying block size and replication factor
      3. For each block, the NameNode provides the DataNode addresses, sorted by increasing distance
  • 13. HDFS: Writing Data (Client, NameNode, DataNodes)
      4. The client sends the block data and the list of nodes to the first DataNode
      5. Each DataNode sends the data on to the following DataNode (replication pipeline)
      6. Each DataNode sends "Done" to the NameNode once the block data is written to hard disk
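The write path on slides 12 and 13 can be sketched as a toy model in plain Python. This is an illustration only, not the Hadoop client API: the function names, the `distances` map and the in-memory `storage` dictionary are all assumptions made for the example, and real HDFS placement is rack-aware rather than a simple nearest-first sort.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, one of the sizes mentioned on the slide


def split_into_blocks(file_size: int, block_size: int = BLOCK_SIZE) -> list[int]:
    """Step 1: return the size in bytes of each block of the file."""
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])


def choose_datanodes(distances: dict[str, int], replication: int) -> list[str]:
    """Steps 2-3 (toy NameNode): pick `replication` nodes, nearest first."""
    return sorted(distances, key=lambda node: distances[node])[:replication]


def write_block(block_id: int, pipeline: list[str],
                storage: dict[str, list[int]]) -> list[str]:
    """Steps 4-6: pass the block along the pipeline; each node stores it
    and is recorded as having reported "Done"."""
    done = []
    for node in pipeline:  # client -> first node -> next node -> ...
        storage.setdefault(node, []).append(block_id)
        done.append(node)
    return done


if __name__ == "__main__":
    storage: dict[str, list[int]] = {}
    nodes = {"dn1": 2, "dn2": 0, "dn3": 1, "dn4": 3}  # hypothetical distances
    for block_id, _size in enumerate(split_into_blocks(300 * 1024 * 1024)):
        write_block(block_id, choose_datanodes(nodes, replication=3), storage)
    print(storage)
```

A 300 MB file yields two full 128 MB blocks plus one 44 MB block, and with a replication factor of 3 every block ends up on three of the four hypothetical nodes.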
  • 14. HDFS: Reading Data (Client, NameNode, DataNode)
      1. The client asks the NameNode for a specific file
      2. The NameNode sends the list of blocks of the file and the list of DataNodes for each block
      3. The client downloads the data from the nearest DataNode (sending the block number)
      4. The DataNode sends the data for the required block
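The read path can be sketched in the same spirit. The dictionaries below stand in for the NameNode metadata and the DataNode disks; their layout and the names `locate_blocks` and `read_file` are assumptions for illustration, not part of any Hadoop library.

```python
def locate_blocks(namenode: dict[str, list[tuple[int, list[str]]]],
                  path: str) -> list[tuple[int, list[str]]]:
    """Steps 1-2: the NameNode returns the blocks of the file and,
    for each block, the DataNodes that hold a replica."""
    return namenode[path]


def read_file(namenode: dict[str, list[tuple[int, list[str]]]],
              datanodes: dict[str, dict[int, bytes]],
              distances: dict[str, int],
              path: str) -> bytes:
    """Steps 3-4: fetch every block from the nearest DataNode holding it."""
    data = b""
    for block_id, holders in locate_blocks(namenode, path):
        nearest = min(holders, key=lambda node: distances[node])
        data += datanodes[nearest][block_id]
    return data


if __name__ == "__main__":
    namenode = {"/tmp/words": [(0, ["dn1", "dn2"]), (1, ["dn2", "dn3"])]}
    datanodes = {
        "dn1": {0: b"hello "},
        "dn2": {0: b"hello ", 1: b"world"},
        "dn3": {1: b"world"},
    }
    distances = {"dn1": 1, "dn2": 2, "dn3": 3}  # hypothetical distances
    print(read_file(namenode, datanodes, distances, "/tmp/words"))
```

Note that the file contents never pass through the NameNode in this sketch: it only answers the metadata question, which matches the division of labor described on the architecture slide.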
  • 15. HDFS: Fault Tolerance
      • Node failure
          • DataNodes send a heartbeat every 3 seconds
          • If the NameNode receives no heartbeat for 10 minutes, it considers that node dead
      • Communication failure
          • Detected when the sender receives no ACK from the DataNode after several retries
      • Data corruption
          • DataNodes send block reports to the NameNode, excluding the blocks that are corrupted (checksum validation)
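The node-failure rule can be illustrated with a small sketch. The 3-second and 10-minute values come from the slide; the function names and the data layout are hypothetical, and this is not how the NameNode is actually implemented.

```python
HEARTBEAT_INTERVAL_S = 3       # heartbeat period, from the slide
DEAD_NODE_TIMEOUT_S = 10 * 60  # silence threshold, from the slide


def dead_nodes(last_heartbeat: dict[str, float], now: float,
               timeout: float = DEAD_NODE_TIMEOUT_S) -> list[str]:
    """Return the DataNodes whose last heartbeat is older than the timeout."""
    return sorted(node for node, seen in last_heartbeat.items()
                  if now - seen > timeout)


def under_replicated(block_locations: dict[int, list[str]],
                     dead: list[str], replication: int) -> list[int]:
    """Return the blocks whose live replica count fell below the target."""
    return sorted(block for block, holders in block_locations.items()
                  if len([n for n in holders if n not in dead]) < replication)


if __name__ == "__main__":
    last_seen = {"dn1": 1000.0, "dn2": 395.0, "dn3": 998.0, "dn4": 999.0}
    dead = dead_nodes(last_seen, now=1000.0)
    blocks = {0: ["dn1", "dn2", "dn3"], 1: ["dn1", "dn3", "dn4"]}
    print(dead, under_replicated(blocks, dead, replication=3))
```

Here `dn2` has been silent for 605 seconds, so it is treated as dead and block 0 is flagged as needing another replica.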
  • 16. HDFS: High Availability
      • Standby NameNode (active-standby configuration)
      • NameNodes use shared storage
      • DataNodes send block reports to both NameNodes
      • Diagram: Active NameNode and Passive NameNode, both connected to Shared Storage
  • 17. HDFS: Command Examples
      • hadoop fs -ls
      • hadoop fs -put <local_path> <hdfs_path>
      • hadoop fs -get <hdfs_path> <local_path>
      • hadoop fs -cat <hdfs_path>
      • hadoop fs -rmr <hdfs_path> (deprecated; newer releases use hadoop fs -rm -r <hdfs_path>)
      • hadoop fs -copyFromLocal <local_path> <hdfs_path>
  • 18. MapReduce
      • Processes data from HDFS
      • A MapReduce program is composed of:
          • Map() method: performs filtering and sorting of the <key, value> inputs
          • Reduce() method: summarizes the <key, value> pairs provided by the Mappers
      • Code can be written in many languages (Perl, Python, Java, etc.)
  • 19. MapReduce Example
  • 20. MapReduce Code Example
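The code shown on the original slide is not preserved in this transcript. As a stand-in, here is a minimal word-count sketch of the map, shuffle and reduce phases in plain Python. It runs in a single process and does not use the Hadoop API; the function names are chosen for the example only.

```python
from collections import defaultdict
from typing import Iterable, Iterator


def map_phase(line: str) -> Iterator[tuple[str, int]]:
    """Map: emit a <word, 1> pair for every word in the input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[tuple[str, int]]) -> dict[str, list[int]]:
    """Shuffle: group the values by key and sort the keys."""
    groups: dict[str, list[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_phase(key: str, values: list[int]) -> tuple[str, int]:
    """Reduce: summarize all the values of one key."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> dict[str, int]:
    """Run the three phases in sequence over the input lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["the quick fox", "the lazy dog", "The end"]))
```

In a real cluster the map and reduce calls would run on different nodes and the shuffle would move data over the network; the sketch only shows the data flow between the phases.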
  • 21. Hadoop Demo
  • 22. But… MapReduce has a steep learning curve. How can we analyze Big Data with a familiar language?
  • 23. Hive
      • An open source data warehouse software on top of Apache Hadoop
      • Analyzes and queries data stored in HDFS
      • Structures the data into tables
      • Tools for simple ETL
      • SQL-like queries (HiveQL)
      • Metadata is stored in an RDBMS
      • Uses MapReduce as its execution engine
  • 24. HiveQL
      • UPDATE, INSERT, DELETE
      • Limited transaction support
      • Indexes supported
      • Multi-table insert support
      • SQL-92 join support
      • Read-only views
  • 25. Hive: Code Example
        SELECT [ALL | DISTINCT] select_expr, select_expr, ...
        FROM table_reference
        [WHERE where_condition]
        [GROUP BY col_list]
        [HAVING having_condition]
        [CLUSTER BY col_list | [DISTRIBUTE BY col_list] [SORT BY col_list]]
        [LIMIT number];
  • 26. Hive: Pros & Cons
      • Pros
          • Familiarity with SQL
          • Interactive
          • Connection through JDBC/ODBC drivers
      • Cons
          • High latency
          • Doesn’t have a query cache
          • Only supports equi-joins
  • 27. Hive Demo
  • 28. But… Hive has high latency. What if I want better performance and to analyze real-time data?
  • 29. Spark
      • Apache Spark is a fast, in-memory data processing engine
      • Provides native bindings for Java, Scala, Python and R
      • Supports SQL, streaming data, machine learning and graph processing
      • Can run standalone, on Hadoop, or on Apache Mesos
  • 30. Spark vs MapReduce
      • Spark's main advantages over MapReduce:
          • Speed
              • Can perform tasks up to 100 times faster if all the data fits in memory
              • Otherwise can be more than 10 times faster
          • Spark API (developer friendly)
  • 31. Spark: Code Example
        val textFile = sparkSession.sparkContext.textFile("hdfs:///tmp/words")
        val counts = textFile.flatMap(line => line.split(" "))
          .map(word => (word, 1))
          .reduceByKey(_ + _)
        counts.saveAsTextFile("hdfs:///tmp/words_agg")
  • 32. Spark: Components
      • Spark Core
      • Spark Streaming
      • Spark SQL
      • MLlib
      • GraphX
  • 33. Spark: Resilient Distributed Dataset (RDD)
      • A programming abstraction for a collection of objects
      • Cannot be modified (immutable)
      • Can be split across a computing cluster
      • Can be created from text files, SQL databases, NoSQL databases (Cassandra, MongoDB, etc.)
      • Operations on RDDs
          • Can be split across the cluster and executed in a parallel batch process
          • Fast and scalable parallel processing
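The RDD properties listed above (immutable, partitioned, transformed by operations that run per partition) can be illustrated with a toy class. `ToyRDD` is a hypothetical name and this is not the Spark API: real RDDs are lazy and distributed, whereas this sketch is eager and runs in one process.

```python
from functools import reduce
from operator import add


class ToyRDD:
    """Immutable, partitioned collection; transformations return a new ToyRDD."""

    def __init__(self, partitions):
        # Tuples make the partitions read-only after construction.
        self._partitions = tuple(tuple(part) for part in partitions)

    @classmethod
    def parallelize(cls, items, num_partitions):
        """Split the items round-robin into num_partitions partitions."""
        items = list(items)
        return cls(items[i::num_partitions] for i in range(num_partitions))

    def map(self, fn):
        """Apply fn to each element, partition by partition."""
        return ToyRDD([fn(x) for x in part] for part in self._partitions)

    def filter(self, pred):
        """Keep only the elements for which pred is true."""
        return ToyRDD([x for x in part if pred(x)] for part in self._partitions)

    def reduce(self, fn):
        """Reduce each non-empty partition, then combine the partial results."""
        partials = [reduce(fn, part) for part in self._partitions if part]
        return reduce(fn, partials)

    def collect(self):
        """Return all elements as one list, in partition order."""
        return [x for part in self._partitions for x in part]


if __name__ == "__main__":
    numbers = ToyRDD.parallelize(range(1, 7), num_partitions=3)
    squares = numbers.map(lambda x: x * x)
    print(squares.reduce(add), numbers.collect())
```

Because `map` and `filter` build a new object instead of changing the existing one, `numbers` is still intact after `squares` has been computed, which is the property that lets Spark recompute lost partitions from their lineage.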
  • 34. Spark Streaming
      • Takes the data as it comes in and processes it in near real time
          • Example: Internet of Things applications
      • Breaks the stream down into individual parts called micro-batches
          • Processed together as small RDDs
      • Reliable: "checkpoints" store data to disk periodically for fault tolerance
      • Windowing operations: compute results across a longer time period than the batch interval
          • Example: top sales from the past 2 hours
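Micro-batching and windowing can be sketched without Spark. The functions below are hypothetical helpers written for this example: the first groups timestamped events into fixed-interval batches, the second sums each window that spans the most recent batches, sliding forward one batch at a time.

```python
from collections import defaultdict


def micro_batches(events, batch_interval):
    """Group (timestamp, value) events into batches of batch_interval seconds.

    Intervals with no events produce an empty batch."""
    batches = defaultdict(list)
    for timestamp, value in events:
        batches[int(timestamp // batch_interval)].append(value)
    if not batches:
        return []
    return [batches[i] for i in range(max(batches) + 1)]


def windowed_sums(batches, window_length):
    """For each batch, sum the values of the last window_length batches."""
    sums = []
    for end in range(len(batches)):
        start = max(0, end - window_length + 1)
        sums.append(sum(sum(batch) for batch in batches[start:end + 1]))
    return sums


if __name__ == "__main__":
    sales = [(0.5, 10), (1.2, 5), (2.1, 7), (2.9, 3), (5.0, 1)]
    batches = micro_batches(sales, batch_interval=2)
    print(batches, windowed_sums(batches, window_length=2))
```

With a 2-second batch interval the five events fall into three batches, and a window of two batches reports totals over the last 4 seconds, which is the same idea as "top sales from the past 2 hours" at a smaller scale.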
  • 35. Spark Demo
  • 36. But… What if I want to integrate Big Data with my other systems?
  • 37. Integration Challenge: RDBMS, Hadoop, NoSQL, Website
  • 38. Kafka
      • Distributed streaming platform
      • Decouples data streams
      • Fault-tolerant
      • High performance
      • Horizontally scalable
      • Diagram: RDBMS, Hadoop, NoSQL and Website connected through Kafka
  • 39. How Kafka Works: Kafka Core
      • Source systems (RDBMS, NoSQL, Website, Apps) send data through Producers
      • Target systems (Hadoop, RDBMS, NoSQL, Analytic Tools) receive data through Consumers
  • 40. How Kafka Works: Extended API
      • Source systems (RDBMS, NoSQL, Website, Apps) connect through Kafka Connect Source
      • Target systems (Hadoop, RDBMS, NoSQL, Analytic Tools) connect through Kafka Connect Sink
  • 41. How Kafka Works: Topics & Partitions
      • Messages are stored in Topics
          • Similar in concept to a database table
      • Topics
          • Are identified by a unique name
          • Are split into Partitions (for redundancy and performance)
      • Partitions
          • Each partition is ordered
          • When a message arrives at a partition, it is assigned an id: the Offset
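Topics, partitions and offsets can be modeled with a few lines of Python. `ToyTopic` is a hypothetical class, not a Kafka client, and the CRC32 hash is only a stand-in for the partitioner a real producer would use.

```python
import zlib


class ToyTopic:
    """A named topic made of append-only partition logs."""

    def __init__(self, name: str, num_partitions: int):
        self.name = name
        self.partitions: list[list[str]] = [[] for _ in range(num_partitions)]

    def append(self, key: str, value: str) -> tuple[int, int]:
        """Store the message and return its (partition, offset).

        The same key always maps to the same partition, so messages
        with that key keep their relative order."""
        partition = zlib.crc32(key.encode()) % len(self.partitions)
        log = self.partitions[partition]
        log.append(value)
        return partition, len(log) - 1  # the offset is the position in the log


if __name__ == "__main__":
    orders = ToyTopic("orders", num_partitions=3)
    for value in ("created", "paid", "shipped"):
        print(orders.append("customer-1", value))
```

Offsets are only meaningful inside one partition: two messages in different partitions can both have offset 0, which is why ordering is guaranteed per partition and not across the whole topic.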
  • 42. How Kafka Works: Brokers
      • Brokers = servers in a Kafka cluster
      • Are identified by an ID number
      • Contain topic partitions
      • Recommended to have at least 3 Brokers in a cluster
      • Diagram: Topic 3 Partition 0, Topic 3 Partition 1, Topic 2 Partition 1 and Topic 2 Partition 2 distributed across Broker 1, Broker 2 and Broker 3
  • 43. How Kafka Works: Replication Factor
      • A Topic should have a Replication Factor > 1 (usually 2 or 3) to avoid losing data
      • Diagram: Topic 3 Partition 0 and Topic 2 Partition 1, each stored twice across Broker 1, Broker 2 and Broker 3
  • 44. How Kafka Works: Replication Factor (same text and diagram as slide 43)
  • 45. How Kafka Works: Replication Factor
      • A Topic should have a Replication Factor > 1 (usually 2 or 3) to avoid losing data
      • Diagram: only Broker 2 and Broker 3 remain, still holding Topic 3 Partition 0 and Topic 2 Partition 1
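Why a replication factor above 1 protects against losing a broker can be shown with a small sketch. The round-robin placement below is an assumption made for illustration; Kafka's own replica assignment is more elaborate, and the function names are hypothetical.

```python
def assign_replicas(partitions: list[str], brokers: list[int],
                    replication_factor: int) -> dict[str, list[int]]:
    """Place each partition on `replication_factor` distinct brokers, round-robin."""
    if replication_factor > len(brokers):
        raise ValueError("replication factor cannot exceed the number of brokers")
    return {
        partition: [brokers[(i + r) % len(brokers)]
                    for r in range(replication_factor)]
        for i, partition in enumerate(partitions)
    }


def replicas_after_failure(assignment: dict[str, list[int]],
                           failed_broker: int) -> dict[str, list[int]]:
    """Return the replicas of every partition that survive a broker failure."""
    return {partition: [b for b in replicas if b != failed_broker]
            for partition, replicas in assignment.items()}


if __name__ == "__main__":
    partitions = ["topic2-p1", "topic3-p0"]
    for factor in (1, 2):
        placement = assign_replicas(partitions, [1, 2, 3], factor)
        print(factor, replicas_after_failure(placement, failed_broker=1))
```

With a factor of 1, the partition placed on Broker 1 has no surviving copy after that broker fails; with a factor of 2, every partition still has at least one replica on the remaining brokers.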
  • 46. How Kafka Works: Producers
      • Producers write data into Topics
      • Can choose the type of ACK required from the partition
          • ACK=0 (no ack)
          • ACK=1 (only the partition leader)
          • ACK=All (all the in-sync replicas)
      • Diagram: a Producer writing to Topic 1 Partition 0 (offsets 0 to 7) and Topic 1 Partition 1 (offsets 0 to 4), hosted on Broker 1 and Broker 2
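The three ACK levels differ in what the producer waits for before treating a write as successful. The function below is a hypothetical decision table written for this example, not a call from any Kafka client library.

```python
def is_acknowledged(acks: str, leader_wrote: bool,
                    in_sync_follower_writes: list[bool]) -> bool:
    """Decide whether the producer considers the write successful.

    acks is "0", "1" or "all", mirroring the three levels on the slide."""
    if acks == "0":
        return True  # fire and forget: the producer never waits
    if acks == "1":
        return leader_wrote  # only the partition leader must confirm
    if acks == "all":
        # the leader and every in-sync follower must confirm
        return leader_wrote and all(in_sync_follower_writes)
    raise ValueError(f"unknown acks value: {acks!r}")


if __name__ == "__main__":
    for level in ("0", "1", "all"):
        print(level, is_acknowledged(level, leader_wrote=True,
                                     in_sync_follower_writes=[True, False]))
```

The trade-off is latency against durability: level 0 is the fastest but can lose messages silently, while "all" waits longest and tolerates the loss of the leader right after the write.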
  • 47. How Kafka Works: Consumers
      • Consumers read data from a Topic
      • A Consumer reads:
          • In order within each partition
          • In parallel across partitions
      • Diagram: a Consumer reading from Topic 1 Partition 0 (offsets 0 to 7) and Topic 1 Partition 1 (offsets 0 to 4), hosted on Broker 1 and Broker 2
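A consumer's view of partitions can be sketched by tracking one "next offset" per partition. `ToyConsumer` is a hypothetical class for illustration; it polls the partitions one after another in a single thread, whereas real consumers can read partitions in parallel.

```python
class ToyConsumer:
    """Reads partition logs in offset order, remembering its position in each."""

    def __init__(self, partitions: list[list[str]]):
        self.partitions = partitions
        self.next_offset = [0] * len(partitions)

    def poll(self, max_records: int = 2) -> list[tuple[int, int, str]]:
        """Return up to max_records new (partition, offset, value) per partition."""
        records = []
        for p, log in enumerate(self.partitions):
            start = self.next_offset[p]
            chunk = log[start:start + max_records]
            records.extend((p, start + i, value) for i, value in enumerate(chunk))
            self.next_offset[p] = start + len(chunk)  # advance past what was read
        return records


if __name__ == "__main__":
    consumer = ToyConsumer([["a0", "a1", "a2"], ["b0"]])
    while batch := consumer.poll():
        print(batch)
```

Within a partition the records always come back in offset order, but nothing in the model orders a record of partition 0 relative to a record of partition 1, which matches the guarantee stated on the slide.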
  • 48. Kafka Demo
  • 49. Want to install those tools?
      • Hadoop
          • https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/SingleCluster.html
          • https://www.tutorialspoint.com/hadoop/hadoop_enviornment_setup.htm
      • Hive
          • https://www.tutorialspoint.com/hive/hive_installation.htm
      • Spark
          • https://www.tutorialspoint.com/apache_spark/apache_spark_installation.htm
      • Kafka
          • https://www.tutorialspoint.com/apache_kafka/apache_kafka_installation_steps.htm
          • https://kafka.apache.org/quickstart
  • 50. Want to play with those tools?
      • Oracle pre-built VM: Big Data Lite
          • http://www.oracle.com/technetwork/database/bigdata-appliance/oracle-bigdatalite-2104726.html
      • Cloudera Quickstart VMs
          • https://www.cloudera.com/downloads/quickstart_vms/5-12.html
      • Apache Kafka Docker container
          • https://github.com/Landoop/fast-data-dev
  • 51. Questions?