[OracleCode SF] In-Memory Analytics with Apache Spark and Hazelcast

Apache Spark is a distributed computation framework optimized to work in memory and heavily influenced by concepts from functional programming languages. Hazelcast, an open-source in-memory data grid capable of amazing feats of scale, provides a wide range of distributed computing primitives, including ExecutorService, MapReduce (M/R), and Aggregations frameworks. The nature of data exploration and analysis requires that data scientists be able to ask questions that weren't planned, and get an answer fast! In this talk, Viktor will explore Spark and see how it works together with Hazelcast to provide a robust in-memory open-source big data analytics solution!

@gamussa @hazelcast #oraclecode
IN-MEMORY ANALYTICS
with APACHE SPARK and
HAZELCAST

Who am I?
Solutions Architect
Developer Advocate
@gamussa in internetz
Please, follow me on Twitter
I’m very interesting ©

What’s Apache Spark?
Lightning-Fast Cluster Computing

Run programs up to 100x
faster than Hadoop
MapReduce in memory,
or 10x faster on disk.

When to use Spark?
Data Science Tasks
when questions are unknown
Data Processing Tasks
when you have too much data
You’re tired of Hadoop

Spark Architecture

Resilient Distributed Datasets (RDD)
are the primary abstraction in Spark –
a fault-tolerant collection of elements that can be
operated on in parallel

RDD Operations

operations on RDDs:
transformations and actions

Transformations are lazy
(not computed immediately).
By default, the transformed RDD is recomputed
each time an action is run on it.
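
The lazy-then-recompute behavior can be sketched in plain Java. This does not use Spark: `java.util.stream` pipelines are lazy in a comparable way, and a `Supplier` is used here, purely as an illustrative assumption, as a stand-in for an unpersisted RDD that is rebuilt for every action. The class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class LazyPipelineSketch {
    // Counts how often the transformation function really runs.
    static final AtomicInteger mapCalls = new AtomicInteger();

    // Hypothetical stand-in for an unpersisted RDD: a recipe that rebuilds
    // the pipeline every time an action asks for it.
    static Supplier<Stream<Integer>> squared(List<Integer> source) {
        return () -> source.stream().map(x -> {
            mapCalls.incrementAndGet();
            return x * x;
        });
    }

    public static void main(String[] args) {
        Supplier<Stream<Integer>> transformed = squared(Arrays.asList(1, 2, 3));
        // Defining the transformation has not run the function yet.
        System.out.println("calls after transformation: " + mapCalls.get());

        // First "action": forces the pipeline to run over all elements.
        int sum = transformed.get().reduce(0, Integer::sum);
        System.out.println("sum=" + sum + " calls=" + mapCalls.get());

        // Second "action": the pipeline is rebuilt, so the function runs again.
        List<Integer> all = transformed.get().collect(Collectors.toList());
        System.out.println("all=" + all + " calls=" + mapCalls.get());
    }
}
```

The call counter stays at zero until the first action and grows again with the second one. In Spark itself, persisting an RDD (`persist()` or `cache()`) is the usual way to avoid that repeated work.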

RDD Transformations

RDD Actions

RDD Fault Tolerance

RDD Construction

parallelized collections
take an existing Scala collection
and run functions on it in parallel
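
As a rough single-JVM analogy (an assumption for illustration, not Spark code), the same idea can be shown with Java's parallel streams: an existing collection is taken as-is and a function is applied to its elements in parallel. In Spark's Java API the corresponding entry point is `JavaSparkContext.parallelize(...)`, which spreads the elements across the cluster rather than across local threads. The class and method names below are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

final class ParallelCollectionSketch {
    // Squares each element in parallel and sums the results.
    static int sumOfSquares(List<Integer> data) {
        return data.parallelStream()
                   .mapToInt(x -> x * x)
                   .sum();
    }

    public static void main(String[] args) {
        // An ordinary in-memory collection is the starting point.
        List<Integer> data = Arrays.asList(1, 2, 3, 4, 5);
        System.out.println(sumOfSquares(data)); // prints 55
    }
}
```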

Hadoop datasets
run functions on each record of a file in the Hadoop Distributed
File System (HDFS) or any other storage system supported by
Hadoop

What’s Hazelcast IMDG?
The Fastest In-memory Data Grid

Hazelcast IMDG
is an operational, in-memory, distributed computing platform
that manages data using in-memory storage and performs
parallel execution for breakthrough application speed and scale

High-Density Caching
In-Memory Data Grid
Web Session Clustering
Microservices Infrastructure

What’s Hazelcast IMDG?
In-memory Data Grid
Apache v2 Licensed
Distributed:
  Caches (IMap, JCache)
  Java Collections (IList, ISet, IQueue)
  Messaging (Topic, RingBuffer)
  Computation (ExecutorService, MapReduce)

[Diagram labels: Green Primary, Green Backup, Green Shard]

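import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
// Package names below follow the hazelcast-spark project; check them against your connector version.
import com.hazelcast.spark.connector.HazelcastSparkContext;
import com.hazelcast.spark.connector.rdd.HazelcastJavaRDD;

// Connector settings: where the Hazelcast cluster lives and how entries are batched.
final SparkConf sparkConf = new SparkConf()
        .set("hazelcast.server.addresses", "localhost")
        .set("hazelcast.server.groupName", "dev")
        .set("hazelcast.server.groupPass", "dev-pass")
        .set("hazelcast.spark.readBatchSize", "5000")
        .set("hazelcast.spark.writeBatchSize", "5000")
        .set("hazelcast.spark.valueBatchingEnabled", "true");
// Connect to the Spark master, then wrap the context with the Hazelcast connector.
final JavaSparkContext jsc = new JavaSparkContext("spark://localhost:7077", "app", sparkConf);
final HazelcastSparkContext hsc = new HazelcastSparkContext(jsc);
// Expose a Hazelcast map and a Hazelcast cache as RDDs.
final HazelcastJavaRDD<Object, Object> mapRdd = hsc.fromHazelcastMap("movie");
final HazelcastJavaRDD<Object, Object> cacheRdd = hsc.fromHazelcastCache("my-cache");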

Demo

LIMITATIONS

DATA SHOULD NOT BE
UPDATED WHILE READING
FROM SPARK

WHY?

MAP EXPANSION
SHUFFLES THE DATA
INSIDE THE BUCKET

THE CURSOR NO LONGER POINTS TO
THE CORRECT ENTRY, SO
DUPLICATE OR MISSING
ENTRIES CAN OCCUR
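
A toy model can make this failure mode concrete. The sketch below is not Hazelcast's storage layout: it assumes a simple hash table where a key lives in bucket `key % bucketCount`, a reader that remembers only a bucket index as its cursor, and a growth from 4 to 6 buckets chosen so that both effects show up. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class CursorDriftSketch {
    // Toy hash table: bucket index = key % bucketCount (keys are non-negative).
    static List<List<Integer>> buildTable(List<Integer> keys, int bucketCount) {
        List<List<Integer>> buckets = new ArrayList<>();
        for (int i = 0; i < bucketCount; i++) {
            buckets.add(new ArrayList<>());
        }
        for (int key : keys) {
            buckets.get(key % bucketCount).add(key);
        }
        return buckets;
    }

    // Reads buckets [from, to) and appends their keys to 'out'.
    static void readBuckets(List<List<Integer>> table, int from, int to, List<Integer> out) {
        for (int i = from; i < to; i++) {
            out.addAll(table.get(i));
        }
    }

    // Reads a first batch, lets the table grow, then resumes from the saved cursor.
    static List<Integer> scanWithExpansion(List<Integer> keys) {
        List<Integer> seen = new ArrayList<>();
        List<List<Integer>> table = buildTable(keys, 4);
        int cursor = 2;                       // batch 1 covers buckets 0 and 1
        readBuckets(table, 0, cursor, seen);

        table = buildTable(keys, 6);          // expansion re-buckets every key
        readBuckets(table, cursor, 6, seen);  // resume from the now-stale cursor
        return seen;
    }

    public static void main(String[] args) {
        List<Integer> keys = Arrays.asList(4, 7, 9, 10);
        System.out.println("stored: " + keys);
        System.out.println("read:   " + scanWithExpansion(keys)); // read:   [4, 9, 9, 4, 10]
    }
}
```

In this run, keys 4 and 9 are read twice because they moved to buckets past the cursor, and key 7 is never read because it moved to a bucket the cursor had already passed.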

github.com/hazelcast/hazelcast-spark

THANKS!
Any questions?
You can find me at
@gamussa
viktor@hazelcast.com
