Spark overview
Using Spark 1.2 with Java 8 and Cassandra
by Denis Dus
Spark
Apache Spark is a fast and general-purpose
cluster computing system. It provides high-
level APIs in Java, Scala and Python, and an
optimized engine that supports general
execution graphs. It also supports a rich set of
higher-level tools including Spark SQL for SQL
and structured data processing, MLlib for
machine learning, GraphX for graph
processing, and Spark Streaming.
Components
1. Driver program
Our main program; it connects to the Spark cluster through a SparkContext object and submits transformations and actions on RDDs.
2. Cluster manager
Allocates resources across applications (e.g. standalone manager, Mesos, YARN).
3. Worker node
Executor - a process launched for an application on a worker node; it runs tasks and keeps data in memory or disk storage across them.
Task - a unit of work that is sent to one executor.
Spark RDD
Spark revolves around the concept of
a resilient distributed dataset (RDD), which is a
fault-tolerant collection of elements that can
be operated on in parallel. There are two ways
to create RDDs: parallelizing an existing
collection in your driver program, or
referencing a dataset in an external storage
system, such as a shared filesystem, HDFS,
HBase, or any data source offering a Hadoop
InputFormat.
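The two creation paths above can be sketched as follows (a minimal sketch; the HDFS path and host name are placeholders, not from the deck):

```java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class RddCreation {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("RddCreation").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // 1. Parallelize an existing collection in the driver program
        List<Integer> data = Arrays.asList(1, 2, 3, 4, 5);
        JavaRDD<Integer> fromCollection = sc.parallelize(data);

        // 2. Reference a dataset in external storage (placeholder path;
        //    any Hadoop InputFormat source works the same way)
        JavaRDD<String> fromFile = sc.textFile("hdfs://namenode:8020/path/to/data.txt");

        // Actions trigger computation; textFile above stays lazy until used
        System.out.println(fromCollection.count()); // prints 5

        sc.close();
    }
}
```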
RDD Operations
Spark Stages
Shared variables in Spark
Spark provides two limited types of shared variables for two common usage
patterns: broadcast variables and accumulators.
• Broadcast Variables
Broadcast variables allow the programmer to keep a read-only variable cached
on each machine rather than shipping a copy of it with tasks. They can be
used, for example, to give every node a copy of a large input dataset in an
efficient manner. Spark also attempts to distribute broadcast variables using
efficient broadcast algorithms to reduce communication cost.
• Accumulators
Accumulators are variables that are only “added” to through an associative
operation and can therefore be efficiently supported in parallel.
Spark natively supports accumulators of numeric types, and programmers can
add support for new types. If accumulators are created with a name, they will
be displayed in Spark’s UI. This can be useful for understanding the progress of
running stages.
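Both shared-variable types can be sketched in one small driver (a sketch against the Spark 1.x Java API; the lookup table and filter condition are illustrative):

```java
import java.util.Arrays;

import org.apache.spark.Accumulator;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;

public class SharedVariables {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setAppName("SharedVariables").setMaster("local[*]"));

        // Broadcast: read-only value cached on each machine instead of
        // being shipped with every task
        Broadcast<int[]> lookup = sc.broadcast(new int[]{10, 20, 30});

        // Accumulator: tasks may only add to it; the driver reads the total
        Accumulator<Integer> matches = sc.accumulator(0);

        sc.parallelize(Arrays.asList(0, 1, 2, 1)).foreach(i -> {
            if (lookup.value()[i] > 10) {
                matches.add(1);
            }
        });

        System.out.println(matches.value()); // total computed across tasks

        sc.close();
    }
}
```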
Spark application workflow
Building a simple Spark application
SparkConf sparkConf = new SparkConf().setAppName("SparkApplication").setMaster("local[*]");
JavaSparkContext sparkContext = new JavaSparkContext(sparkConf);
JavaRDD<String> file = sparkContext.textFile("hdfs://...");
JavaRDD<String> words = file.flatMap(new FlatMapFunction<String, String>() {
public Iterable<String> call(String s) {
return Arrays.asList(s.split(" "));
}
});
JavaPairRDD<String, Integer> pairs = words.mapToPair(new PairFunction<String, String, Integer>() {
public Tuple2<String, Integer> call(String s) {
return new Tuple2<String, Integer>(s, 1);
}
});
JavaPairRDD<String, Integer> counts = pairs.reduceByKey(new Function2<Integer, Integer, Integer>() {
public Integer call(Integer a, Integer b) {
return a + b;
}
});
counts.saveAsTextFile("hdfs://...");
sparkContext.close();
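Since the deck targets Java 8, the same word count can be written with lambdas: the Spark 1.x function types (FlatMapFunction, PairFunction, Function2) are single-method interfaces, so lambdas slot in directly:

```java
SparkConf sparkConf = new SparkConf().setAppName("SparkApplication").setMaster("local[*]");
JavaSparkContext sparkContext = new JavaSparkContext(sparkConf);

JavaRDD<String> file = sparkContext.textFile("hdfs://...");

// Each anonymous class above collapses to one lambda
JavaRDD<String> words = file.flatMap(s -> Arrays.asList(s.split(" ")));
JavaPairRDD<String, Integer> pairs = words.mapToPair(s -> new Tuple2<>(s, 1));
JavaPairRDD<String, Integer> counts = pairs.reduceByKey((a, b) -> a + b);

counts.saveAsTextFile("hdfs://...");
sparkContext.close();
```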
Java 8 + Spark 1.2 + Cassandra for BI:
Driver program skeleton
SparkConf sparkConf = new SparkConf()
.setAppName("SparkCassandraTest")
.setMaster("local[*]")
.set("spark.cassandra.connection.host", "127.0.0.1");
JavaSparkContext sparkContext = new JavaSparkContext(sparkConf);
CassandraLoader<UserEvent> cassandraLoader = new CassandraLoader<>(sparkContext,
"dataanalytics", "user_events", UserEvent.class);
JavaRDD<UserEvent> rdd = cassandraLoader.fetchAndUnion(venues, startDate, endDate);
… Events processing here …
sparkContext.close();
Java 8 + Spark 1.2 + Cassandra for BI:
Load events from Cassandra
public class CassandraLoader<T> {
private JavaSparkContext sparkContext;
private String keySpace;
private String tableName;
private Class<T> clazz;
…
private CassandraJavaRDD<T> fetchForVenueAndDateShard(String venueId, String dateShard) {
RowReaderFactory<T> mapper = CassandraJavaUtil.mapRowTo(clazz);
return CassandraJavaUtil.
javaFunctions(sparkContext). // SparkContextJavaFunctions appears here
cassandraTable(keySpace, tableName, mapper). // CassandraJavaRDD appears here
where("venue_id=? AND date_shard=?", venueId, dateShard);
}
…
}
CassandraJavaUtil
The main entry point to Spark Cassandra Connector Java API. Builds useful wrappers around Spark Context, Streaming Context, RDD.
SparkContextJavaFunctions -> CassandraJavaRDD<T> cassandraTable (String keyspace, String table, RowReaderFactory<T> rrf)
Returns a view of a Cassandra table. With this method, each row is converted to an object of type T by the specified row reader factory.
CassandraJavaUtil -> RowReaderFactory<T> mapRowTo(Class<T> targetClass, Pair<String, String>... columnMappings)
Constructs a row reader factory which maps an entire row to an object of a specified type (JavaBean style convention).
The default mapping of attributes to column names can be changed by providing a custom map of attribute-column mappings for the pairs which do
not follow the general convention.
CassandraJavaRDD
CassandraJavaRDD<R> select(String... columnNames)
CassandraJavaRDD<R> where(String cqlWhereClause, Object... args)
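A hypothetical call chain combining both methods, reusing the keyspace, table, and column names from the slides (column projection and filter pushdown both happen on the Cassandra side):

```java
CassandraJavaRDD<UserEvent> events = CassandraJavaUtil
        .javaFunctions(sparkContext)
        .cassandraTable("dataanalytics", "user_events", CassandraJavaUtil.mapRowTo(UserEvent.class))
        .select("venue_id", "date_shard", "event_type", "payload") // fetch only the needed columns
        .where("venue_id=? AND date_shard=?", venueId, dateShard); // filter is pushed down to Cassandra
```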
Java 8 + Spark 1.2 + Cassandra for BI:
Load events from Cassandra
public Map<String, JavaRDD<T>> fetchByVenue(List<String> venueIds, Date startDate, Date endDate) {
Map<String, JavaRDD<T>> result = new HashMap<>();
List<String> dateShards = ShardingUtils.generateDailyShards(startDate, endDate);
List<CassandraJavaRDD<T>> dailyRddList = new LinkedList<>();
venueIds.stream().forEach(venueId -> {
dailyRddList.clear();
dateShards.stream().forEach(dateShard -> {
CassandraJavaRDD<T> rdd = fetchForVenueAndDateShard(venueId, dateShard);
dailyRddList.add(rdd);
});
result.put(venueId, unionRddCollection(dailyRddList));
});
return result;
}
private JavaRDD<T> unionRddCollection(Collection<? extends JavaRDD<T>> rddCollection) {
JavaRDD<T> result = null;
for (JavaRDD<T> rdd : rddCollection) {
result = (result == null) ? rdd : result.union(rdd);
}
return result;
}
public JavaRDD<T> fetchAndUnion(List<String> venueIds, Date startDate, Date endDate) {
Map<String, JavaRDD<T>> data = fetchByVenue(venueIds, startDate, endDate);
return unionRddCollection(data.values());
}
Java 8 + Spark 1.2 + Cassandra for BI:
Some processing
JavaPairRDD<String, Iterable<UserEvent>> groupedRdd = rdd.filter(event -> {
boolean result = false;
boolean isSessionEvent = TYPE_SESSION.equals(event.getEvent_type());
if (isSessionEvent) {
Map<String, String> payload = event.getPayload();
String action = payload.get(PAYLOAD_ACTION_KEY);
if (StringUtils.isNotEmpty(action)) {
result = ACTION_SESSION_START.equals(action) || ACTION_SESSION_STOP.equals(action);
}
}
return result;
}).groupBy(event -> event.getUser_id());
Java 8 + Spark 1.2 + Cassandra for BI:
Some processing
JavaRDD<SessionReport> reportsRdd = groupedRdd.map(pair -> {
String sessionId = pair._1();
Iterable<UserEvent> events = pair._2();
Date sessionStart = null;
Date sessionEnd = null;
for (UserEvent event : events) {
Date eventDate = event.getDate();
if (eventDate != null) {
String action = event.getPayload().get(PAYLOAD_ACTION_KEY);
if (ACTION_SESSION_START.equals(action)) {
if (sessionStart == null || eventDate.before(sessionStart))
sessionStart = eventDate;
}
if (ACTION_SESSION_STOP.equals(action)) {
if (sessionEnd == null || eventDate.after(sessionEnd))
sessionEnd = eventDate;
}
}
}
String sessionType = ((sessionStart != null) && (sessionEnd != null)) ? SessionReport.TYPE_CLOSED : SessionReport.TYPE_ACTIVE;
return new SessionReport(sessionId, sessionType, sessionStart, sessionEnd);
});
Java 8 + Spark 1.2 + Cassandra for BI:
Get result to Driver Program
List<SessionReport> reportsList = reportsRdd.collect(); // Returns the RDD contents as a List to the driver program; beware of OOM
reportsList.forEach(Main::printReport);
….
SessionReport{sessionId='36a39b8e-27b9-4560-a1c5-9bfa77679930', sessionType='closed', sessionStart=2014-08-13 21:37:38, sessionEnd=2014-08-13 21:39:12}
SessionReport{sessionId='aee19a86-e060-42fb-b34f-76cd698e483e', sessionType='closed', sessionStart=2014-07-28 17:17:21, sessionEnd=2014-07-28 19:58:12}
SessionReport{sessionId='cecc03eb-f2fb-4ed4-9354-76ec8a965d8d', sessionType='closed', sessionStart=2014-09-04 19:46:51, sessionEnd=2014-09-04 21:12:43}
SessionReport{sessionId='1bd85e46-3fe2-4d46-acc5-2fe69735c453', sessionType='closed', sessionStart=2014-08-24 15:56:54, sessionEnd=2014-08-24 15:57:55}
SessionReport{sessionId='0d4e4b9f-fbd0-4eaf-a815-4f46693dbb2b', sessionType='closed', sessionStart=2014-09-09 13:39:39, sessionEnd=2014-09-09 13:46:08}
SessionReport{sessionId='32e822a6-5835-4001-bd95-ede38746e3bd', sessionType='closed', sessionStart=2014-08-27 21:24:03, sessionEnd=2014-08-28 01:21:11}
SessionReport{sessionId='cd35f911-29f4-496a-92f0-a9f5b51b0298', sessionType='closed', sessionStart=2014-09-09 20:14:49, sessionEnd=2014-09-10 01:07:17}
SessionReport{sessionId='8941e14f-9278-4a42-b000-1a228244cbc9', sessionType='active', sessionStart=2014-09-15 16:58:39, sessionEnd=UNKNOWN}
SessionReport{sessionId='c5bf123a-2e34-4c85-a25f-a705a2d408fa', sessionType='closed', sessionStart=2014-09-10 21:20:15, sessionEnd=2014-09-10 23:58:42}
SessionReport{sessionId='4252c7fd-90c0-4a34-8ddb-8db47d68c5a6', sessionType='closed', sessionStart=2014-07-09 08:32:35, sessionEnd=2014-07-09 08:34:23}
SessionReport{sessionId='f6441966-8d6d-4f1c-801c-29201fa75fe6', sessionType='active', sessionStart=2014-08-05 20:47:14, sessionEnd=UNKNOWN}
….
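When the full result may not fit in driver memory, a bounded sample can be pulled instead of collect() (a sketch; SessionReport and Main::printReport come from the slides, the sample size is arbitrary):

```java
// take(n) ships only the first n elements to the driver, bounding memory use
List<SessionReport> sample = reportsRdd.take(100);
sample.forEach(Main::printReport);
```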
The End! =)
http://spark.apache.org/docs/1.2.0/index.html
