SlideShare una empresa de Scribd logo
1 © Hortonworks Inc. 2011–2018. All rights reserved
What’s New in Apache Spark 2.3 & 2.4
DWS Melbourne, Australia 2019
Robert Hryniewicz
2 © Hortonworks Inc. 2011–2018. All rights reserved
• Apache Spark and Kubernetes
• Native Vectorized ORC and SQL Cache Readers
• Pandas UDFs for PySpark
• Continuous Stream Processing
• Barrier Execution
• Avro/Image Data Source
• Higher-order Functions
Agenda Highlights
3 © Hortonworks Inc. 2011–2018. All rights reserved
Apache Spark 2.3
4 © Hortonworks Inc. 2011–2018. All rights reserved
Native Vectorized ORC Reader
• Native ORC read and write: ‘spark.sql.orc.impl’ to ‘native’.
• Vectorized ORC reader: ‘spark.sql.orc.enableVectorizedReader’ to ‘true’
See also ORC Improvement in Apache Spark 2.3 by Dongjoon Hyun
5 © Hortonworks Inc. 2011–2018. All rights reserved
Normal UDF
Pandas UDFs (a.k.a. Vectorized UDFs)
Apache Spark
Python Worker
Internal Spark data
Convert to standard Java type
Pickled
Unpickled
Evaluate row by row
Convert to Python data
6 © Hortonworks Inc. 2011–2018. All rights reserved
Pandas UDF
Pandas UDFs (a.k.a. Vectorized UDFs)
Internal Spark data
Apache Arrow format
Convert to Pandas (cheap)
Vectorized operation by Pandas API
Apache Spark
Python Worker
7 © Hortonworks Inc. 2011–2018. All rights reserved
Pandas UDF
Pandas UDFs (a.k.a. Vectorized UDFs)
Internal Spark data
Apache Arrow format
Convert to Pandas (cheap)
Vectorized operation by Pandas API
Apache Spark
Python Worker
8 © Hortonworks Inc. 2011–2018. All rights reserved
Pandas UDF
Pandas UDFs (a.k.a. Vectorized UDFs)
See also Introducing Pandas UDFs for PySpark by Li Jin
9 © Hortonworks Inc. 2011–2018. All rights reserved
Conversion To/From Pandas With Apache Arrow
• Enable Apache Arrow optimization:
‘spark.sql.execution.arrow.enabled’ to
‘true’.
See also Speeding up PySpark with Apache Arrow by Bryan Cutler
10 © Hortonworks Inc. 2011–2018. All rights reserved
Structured Streaming
Continuous Stream Processing
11 © Hortonworks Inc. 2011–2018. All rights reserved
Structured Streaming: Microbatch
Continuous Stream Processing
See also Continuous Processing in Structured Streaming by Josh Torres
12 © Hortonworks Inc. 2011–2018. All rights reserved
Structured Streaming: Continuous Processing
Continuous Stream Processing
See also Continuous Processing in Structured Streaming by Josh Torres
13 © Hortonworks Inc. 2011–2018. All rights reserved
Structured Streaming: Continuous Processing
Continuous Stream Processing
See also Spark Summit Keynote Demo by Michael Armbrust
https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#continuous-processing
14 © Hortonworks Inc. 2011–2018. All rights reserved
Stream-to-Stream Joins
See also Introducing Stream-Stream Joins in Apache Spark 2.3 by Tathagata Das and Joseph Torres
15 © Hortonworks Inc. 2011–2018. All rights reserved
Apache Spark and Kubernetes
See also Running Spark on Kubernetes
16 © Hortonworks Inc. 2011–2018. All rights reserved
Image Support in Spark
• Convert from compressed Images format
(e.g., PNG and JPG) to raw representation
of an image for OpenCV
• One record per one image file
See also SPARK-21866 by Ilya Matiach, and Deep Learning Pipelines for Apache Spark
17 © Hortonworks Inc. 2011–2018. All rights reserved
(Stateless) History Server
History Server Using K-V Store
• Requires storing app lists and UI in the memory
• Requires reading/parsing the whole log file
See also SPARK-18085 and the proposal by Marcelo Vanzin
18 © Hortonworks Inc. 2011–2018. All rights reserved
History Server Using K-V Store
History Server Using K-V Store
• Store app lists and UI in a persistent K-V store (LevelDB)
• Set ‘spark.history.store.path’ to use this feature
• The event log written by lower versions is still compatible
See also SPARK-18085 and the proposal by Marcelo Vanzin
19 © Hortonworks Inc. 2011–2018. All rights reserved
R Structured Streaming
https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
See also SSR: Structured Streaming on R for Machine Learning by Felix Cheung
20 © Hortonworks Inc. 2011–2018. All rights reserved
R Native Function Execution Stability
See also SPARK-21093
21 © Hortonworks Inc. 2011–2018. All rights reserved
Apache Spark 2.4
22 © Hortonworks Inc. 2011–2018. All rights reserved
Apache Spark 2.4
Barrier Execution
Apache Spark 3.0
• [SPARK-24374] barrier execution mode
• [SPARK-24374] barrier execution mode
• [SPARK-24579] optimized data exchange
• [SPARK-24615] accelerator-aware scheduling
See also Project Hydrogen: Unifying State-of-the-art AI and Big Data in Apache Spark by Reynold Xin
See also Project Hydrogen: State-of-the-Art Deep Learning on Apache Spark by Xiangrui Meng
23 © Hortonworks Inc. 2011–2018. All rights reserved
Pandas UDFs: Grouped Aggregate Pandas UDFs
https://github.com/apache/spark/commit/9786ce66c
https://github.com/apache/spark/commit/b2ce17b4c
24 © Hortonworks Inc. 2011–2018. All rights reserved
Pandas UDFs: Grouped Aggregate Pandas UDFs
https://github.com/apache/spark/pull/22620/commits/06a7bd0c
Internal Spark data
Apache Arrow format
Convert to Pandas (cheap)
Vectorized operation by Pandas API
Apache Spark
Python Worker
25 © Hortonworks Inc. 2011–2018. All rights reserved
Eager Evaluation
• Set ‘spark.sql.repl.eagerEval.enabled’ to true to enable eager evaluation in Jupyter
26 © Hortonworks Inc. 2011–2018. All rights reserved
Eager Evaluation
• Set ‘spark.sql.repl.eagerEval.enabled’ to true to enable eager evaluation in Jupyter
See also (ongoing) SPARK-24572 for Eagar Evaluation at R side
27 © Hortonworks Inc. 2011–2018. All rights reserved
Flexible Streaming Sink
• Exposing output rows of each microbatch as a DataFrame
• foreachBatch(f: Dataset[T] => Unit) Scala/Java/Python APIs in DataStreamWriter.
28 © Hortonworks Inc. 2011–2018. All rights reserved
Avro Data Source
• Apache Avro (https://avro.apache.org)
• A data serialization format
• Widely used in the Spark and Hadoop ecosystem, especially for Kafka-based data pipelines.
• Spark-Avro package (https://github.com/databricks/spark-avro)
• Spark SQL can read and write the Avro data.
• Inlining Spark-Avro package [SPARK-24768]
• Better experience for first-time users of Spark SQL and structured streaming
• Expect further improve the adoption of structured streaming
29 © Hortonworks Inc. 2011–2018. All rights reserved
Avro Data Source
• from_avro/to_avro functions to read and write Avro data within a DataFrame instead of
just files.
• Example:
• Decode the Avro data into a struct
• Filter by column `favorite_color`
• Encode the column `name` in Avro format
30 © Hortonworks Inc. 2011–2018. All rights reserved
Avro Data Source
• Refactor Avro Serializer and Deserializer
• External
• Arrow Data -> Row -> InternalRow
• Native
• Arrow Data -> InternalRow
31 © Hortonworks Inc. 2011–2018. All rights reserved
Avro Data Source
• Options:
• compression: compression codec in write
• ignoreExtension: if ignore .avro or not in read
• recordNamespace: record namespace in write
• recordName: top root record name in write
• avroSchema: avro schema to use
• Logical type support:
• Date [SPARK-24772]
• Decimal [SPARK-24774]
• Timestamp [SPARK-24773]
32 © Hortonworks Inc. 2011–2018. All rights reserved
Image Data Source
• Spark datasource for image format
• ImageSchema deprecated use instead:
• SQL syntax support
• Partition discovery
33 © Hortonworks Inc. 2011–2018. All rights reserved
Higher-order Functions
• Takes functions to transform complex datatype like map, array and struct
34 © Hortonworks Inc. 2011–2018. All rights reserved
Higher-order Functions
35 © Hortonworks Inc. 2011–2018. All rights reserved
Built-in Functions
• New or extended built-in functions for ArrayTypes and MapTypes
• 26 functions for ArrayTypes
• transform, filter, reduce, array_distinct, array_intersect, array_union, array_except, array_join,
array_max, array_min, ...
• 8 functions for MapTypes
• map_from_arrays, map_from_entries, map_entries, map_concat, map_filter, map_zip_with,
transform_keys, transform_values
36 © Hortonworks Inc. 2011–2018. All rights reserved
Apache Spark and Kubernetes
• New Spark scheduler backend
• PySpark support [SPARK-23984]
• SparkR support [SPARK-24433]
• Client-mode support [SPARK-23146]
• Support for mounting K8S volumes [SPARK-23529]
Scala 2.12 (Beta) Support
Build Spark against Scala 2.12 [SPARK-14220]
37 © Hortonworks Inc. 2011–2018. All rights reserved
PySpark Custom Worker
• Configuration to select the modules for daemon and worker in PySpark
• Set ‘spark.python.daemon.module’and/or ‘spark.python.worker.module’ to the worker or
daemon modules
See also Remote Python Debugging 4 Spark
38 © Hortonworks Inc. 2011–2018. All rights reserved
Data Source Changes
• CSV
• Option samplingRatio
• for schema inference [SPARK-23846]
• Option enforceSchema
• for throwing an exception when user-
specified schema doesn‘t match the CSV
header [SPARK-23786]
• Option encoding
• for specifying the encoding of outputs.
[SPARK-19018]
• JSON
• Option dropFieldIfAllNull
• for ignoring column of all null values or
empty array/struct during JSON schema
inference [SPARK-23772]
• Option lineSep
• for defining the line separator that should
be used for parsing [SPARK-23765]
• Option encoding
• for specifying the encoding of inputs and
outputs. [SPARK-23723]
39 © Hortonworks Inc. 2011–2018. All rights reserved
Data Source Changes
• Parquet
• Push down
• STRING [SPARK-23972]
• Decimal [SPARK-24549]
• Timestamp [SPARK-24718]
• Date [SPARK-23727]
• Byte/Short [SPARK-24706]
• StringStartsWith [SPARK-24638]
• IN [SPARK-17091]
• ORC
• Native ORC reader is on by default
[SPARK-23456]
• Turn on ORC filter push-down by
default [SPARK-21783]
• Use native ORC reader to read Hive
serde tables by default [SPARK-
22279]
40 © Hortonworks Inc. 2011–2018. All rights reserved
Data Source Changes
• JDBC
• Option queryTimeout
• for the number of seconds the the driver will wait for a Statement object to execute.
[SPARK-23856]
• Option query
• for specifying the query to read from JDBC [SPARK-24423]
• Option pushDownFilters
• for specifying whether the filter pushdown is allowed [SPARK-24288]
• Option cascadeTruncate [SPARK-22880]
41 © Hortonworks Inc. 2011–2018. All rights reserved
What About Apache Spark 3.0?
Spark 2.2.0 RC1
2017/05
Spark 2.2.0 released
2018/07
Spark 2.2.0 RC2, RC3, RC4, RC5
2017/06
Spark 2.2.0 RC6
2017/07
Spark 2.3.0 RC1
2018/01
Spark 2.3.0 RC2, RC3, RC4, RC5
2018/02
Spark 2.3.0 released
2018/02
Spark 2.4.0 RC1
2018/09
Spark 3.0.0
2019/05 (?)
Spark 2.4.0 RC2
2018/10
Spark 2.4.0
2018/11
See also the thread in Spark dev mailing list for Spark 3.0 discussion
42 © Hortonworks Inc. 2011–2018. All rights reserved
Newer Integration for Apache Hive with Apache Spark
• Apache Hive 3 support: Apache Spark
provides a basic Hive compatibility
• Apache Hive ACID table support
• Structured Streaming Support
• Apache Ranger integration support
• Use LLAP and vectorized read/write – fast!
See also this article for Hive warehouse connector
43 © Hortonworks Inc. 2011–2018. All rights reserved
Questions?
44 © Hortonworks Inc. 2011–2018. All rights reserved
Thanks!
Robert Hryniewicz

Más contenido relacionado

La actualidad más candente

Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Databricks
 
Physical Plans in Spark SQL
Physical Plans in Spark SQLPhysical Plans in Spark SQL
Physical Plans in Spark SQL
Databricks
 
Apache Impalaパフォーマンスチューニング #dbts2018
Apache Impalaパフォーマンスチューニング #dbts2018Apache Impalaパフォーマンスチューニング #dbts2018
Apache Impalaパフォーマンスチューニング #dbts2018
Cloudera Japan
 
Apache Spark Architecture
Apache Spark ArchitectureApache Spark Architecture
Apache Spark Architecture
Alexey Grishchenko
 
A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...
A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...
A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...
Databricks
 
Apache Sparkのご紹介 (後半:技術トピック)
Apache Sparkのご紹介 (後半:技術トピック)Apache Sparkのご紹介 (後半:技術トピック)
Apache Sparkのご紹介 (後半:技術トピック)
NTT DATA OSS Professional Services
 
From DataFrames to Tungsten: A Peek into Spark's Future-(Reynold Xin, Databri...
From DataFrames to Tungsten: A Peek into Spark's Future-(Reynold Xin, Databri...From DataFrames to Tungsten: A Peek into Spark's Future-(Reynold Xin, Databri...
From DataFrames to Tungsten: A Peek into Spark's Future-(Reynold Xin, Databri...
Spark Summit
 
Memory Management in Apache Spark
Memory Management in Apache SparkMemory Management in Apache Spark
Memory Management in Apache Spark
Databricks
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guide
Ryan Blue
 
Best Practices for Enabling Speculative Execution on Large Scale Platforms
Best Practices for Enabling Speculative Execution on Large Scale PlatformsBest Practices for Enabling Speculative Execution on Large Scale Platforms
Best Practices for Enabling Speculative Execution on Large Scale Platforms
Databricks
 
Apache Spark & Streaming
Apache Spark & StreamingApache Spark & Streaming
Apache Spark & Streaming
Fernando Rodriguez
 
High-Volume Data Collection and Real Time Analytics Using Redis
High-Volume Data Collection and Real Time Analytics Using RedisHigh-Volume Data Collection and Real Time Analytics Using Redis
High-Volume Data Collection and Real Time Analytics Using Redis
cacois
 
Spark streaming
Spark streamingSpark streaming
Spark streaming
Whiteklay
 
Apache Spark at Airbnb
Apache Spark at AirbnbApache Spark at Airbnb
Apache Spark at Airbnb
Databricks
 
Productizing Structured Streaming Jobs
Productizing Structured Streaming JobsProductizing Structured Streaming Jobs
Productizing Structured Streaming Jobs
Databricks
 
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...
Edureka!
 
Delta Lake: Optimizing Merge
Delta Lake: Optimizing MergeDelta Lake: Optimizing Merge
Delta Lake: Optimizing Merge
Databricks
 
Apache Spark 2.0: A Deep Dive Into Structured Streaming - by Tathagata Das
Apache Spark 2.0: A Deep Dive Into Structured Streaming - by Tathagata Das Apache Spark 2.0: A Deep Dive Into Structured Streaming - by Tathagata Das
Apache Spark 2.0: A Deep Dive Into Structured Streaming - by Tathagata Das
Databricks
 
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
Simplilearn
 
Beyond SQL: Speeding up Spark with DataFrames
Beyond SQL: Speeding up Spark with DataFramesBeyond SQL: Speeding up Spark with DataFrames
Beyond SQL: Speeding up Spark with DataFrames
Databricks
 

La actualidad más candente (20)

Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
 
Physical Plans in Spark SQL
Physical Plans in Spark SQLPhysical Plans in Spark SQL
Physical Plans in Spark SQL
 
Apache Impalaパフォーマンスチューニング #dbts2018
Apache Impalaパフォーマンスチューニング #dbts2018Apache Impalaパフォーマンスチューニング #dbts2018
Apache Impalaパフォーマンスチューニング #dbts2018
 
Apache Spark Architecture
Apache Spark ArchitectureApache Spark Architecture
Apache Spark Architecture
 
A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...
A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...
A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...
 
Apache Sparkのご紹介 (後半:技術トピック)
Apache Sparkのご紹介 (後半:技術トピック)Apache Sparkのご紹介 (後半:技術トピック)
Apache Sparkのご紹介 (後半:技術トピック)
 
From DataFrames to Tungsten: A Peek into Spark's Future-(Reynold Xin, Databri...
From DataFrames to Tungsten: A Peek into Spark's Future-(Reynold Xin, Databri...From DataFrames to Tungsten: A Peek into Spark's Future-(Reynold Xin, Databri...
From DataFrames to Tungsten: A Peek into Spark's Future-(Reynold Xin, Databri...
 
Memory Management in Apache Spark
Memory Management in Apache SparkMemory Management in Apache Spark
Memory Management in Apache Spark
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guide
 
Best Practices for Enabling Speculative Execution on Large Scale Platforms
Best Practices for Enabling Speculative Execution on Large Scale PlatformsBest Practices for Enabling Speculative Execution on Large Scale Platforms
Best Practices for Enabling Speculative Execution on Large Scale Platforms
 
Apache Spark & Streaming
Apache Spark & StreamingApache Spark & Streaming
Apache Spark & Streaming
 
High-Volume Data Collection and Real Time Analytics Using Redis
High-Volume Data Collection and Real Time Analytics Using RedisHigh-Volume Data Collection and Real Time Analytics Using Redis
High-Volume Data Collection and Real Time Analytics Using Redis
 
Spark streaming
Spark streamingSpark streaming
Spark streaming
 
Apache Spark at Airbnb
Apache Spark at AirbnbApache Spark at Airbnb
Apache Spark at Airbnb
 
Productizing Structured Streaming Jobs
Productizing Structured Streaming JobsProductizing Structured Streaming Jobs
Productizing Structured Streaming Jobs
 
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...
 
Delta Lake: Optimizing Merge
Delta Lake: Optimizing MergeDelta Lake: Optimizing Merge
Delta Lake: Optimizing Merge
 
Apache Spark 2.0: A Deep Dive Into Structured Streaming - by Tathagata Das
Apache Spark 2.0: A Deep Dive Into Structured Streaming - by Tathagata Das Apache Spark 2.0: A Deep Dive Into Structured Streaming - by Tathagata Das
Apache Spark 2.0: A Deep Dive Into Structured Streaming - by Tathagata Das
 
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
 
Beyond SQL: Speeding up Spark with DataFrames
Beyond SQL: Speeding up Spark with DataFramesBeyond SQL: Speeding up Spark with DataFrames
Beyond SQL: Speeding up Spark with DataFrames
 

Similar a What s new in spark 2.3 and spark 2.4

What’s new in Apache Spark 2.3 and Spark 2.4
What’s new in Apache Spark 2.3 and Spark 2.4What’s new in Apache Spark 2.3 and Spark 2.4
What’s new in Apache Spark 2.3 and Spark 2.4
DataWorks Summit
 
ORC improvement in Apache Spark 2.3
ORC improvement in Apache Spark 2.3ORC improvement in Apache Spark 2.3
ORC improvement in Apache Spark 2.3
Dongjoon Hyun
 
ORC improvement in Apache Spark 2.3
ORC improvement in Apache Spark 2.3ORC improvement in Apache Spark 2.3
ORC improvement in Apache Spark 2.3
DataWorks Summit
 
Row/Column- Level Security in SQL for Apache Spark
Row/Column- Level Security in SQL for Apache SparkRow/Column- Level Security in SQL for Apache Spark
Row/Column- Level Security in SQL for Apache Spark
DataWorks Summit/Hadoop Summit
 
Apache Spark 2.3 boosts advanced analytics and deep learning with Python
Apache Spark 2.3 boosts advanced analytics and deep learning with PythonApache Spark 2.3 boosts advanced analytics and deep learning with Python
Apache Spark 2.3 boosts advanced analytics and deep learning with Python
DataWorks Summit
 
What's new in Apache Spark 2.4
What's new in Apache Spark 2.4What's new in Apache Spark 2.4
What's new in Apache Spark 2.4
boxu42
 
Performance Update: When Apache ORC Met Apache Spark
Performance Update: When Apache ORC Met Apache SparkPerformance Update: When Apache ORC Met Apache Spark
Performance Update: When Apache ORC Met Apache Spark
DataWorks Summit
 
Intro to Big Data Analytics using Apache Spark and Apache Zeppelin
Intro to Big Data Analytics using Apache Spark and Apache ZeppelinIntro to Big Data Analytics using Apache Spark and Apache Zeppelin
Intro to Big Data Analytics using Apache Spark and Apache Zeppelin
Alex Zeltov
 
Intro to Spark with Zeppelin
Intro to Spark with ZeppelinIntro to Spark with Zeppelin
Intro to Spark with Zeppelin
Hortonworks
 
ORC Improvement & Roadmap in Apache Spark 2.3 and 2.4
ORC Improvement & Roadmap in Apache Spark 2.3 and 2.4ORC Improvement & Roadmap in Apache Spark 2.3 and 2.4
ORC Improvement & Roadmap in Apache Spark 2.3 and 2.4
Dongjoon Hyun
 
Apache spark 2.4 and beyond
Apache spark 2.4 and beyondApache spark 2.4 and beyond
Apache spark 2.4 and beyond
Xiao Li
 
YARN Ready: Apache Spark
YARN Ready: Apache Spark YARN Ready: Apache Spark
YARN Ready: Apache Spark
Hortonworks
 
Spark and Object Stores —What You Need to Know: Spark Summit East talk by Ste...
Spark and Object Stores —What You Need to Know: Spark Summit East talk by Ste...Spark and Object Stores —What You Need to Know: Spark Summit East talk by Ste...
Spark and Object Stores —What You Need to Know: Spark Summit East talk by Ste...
Spark Summit
 
Running Spark in Production
Running Spark in ProductionRunning Spark in Production
Running Spark in Production
DataWorks Summit/Hadoop Summit
 
Sparc solaris servers
Sparc solaris serversSparc solaris servers
Sparc solaris servers
solarisyougood
 
Apache Spark: Lightning Fast Cluster Computing
Apache Spark: Lightning Fast Cluster ComputingApache Spark: Lightning Fast Cluster Computing
Apache Spark: Lightning Fast Cluster Computing
All Things Open
 
Hive on spark berlin buzzwords
Hive on spark berlin buzzwordsHive on spark berlin buzzwords
Hive on spark berlin buzzwords
Szehon Ho
 
Hive on spark is blazing fast or is it final
Hive on spark is blazing fast or is it finalHive on spark is blazing fast or is it final
Hive on spark is blazing fast or is it final
Hortonworks
 
The structured streaming upgrade to Apache Spark and how enterprises can bene...
The structured streaming upgrade to Apache Spark and how enterprises can bene...The structured streaming upgrade to Apache Spark and how enterprises can bene...
The structured streaming upgrade to Apache Spark and how enterprises can bene...
Impetus Technologies
 
What’s new in Apache Spark 2.3
What’s new in Apache Spark 2.3What’s new in Apache Spark 2.3
What’s new in Apache Spark 2.3
DataWorks Summit
 

Similar a What s new in spark 2.3 and spark 2.4 (20)

What’s new in Apache Spark 2.3 and Spark 2.4
What’s new in Apache Spark 2.3 and Spark 2.4What’s new in Apache Spark 2.3 and Spark 2.4
What’s new in Apache Spark 2.3 and Spark 2.4
 
ORC improvement in Apache Spark 2.3
ORC improvement in Apache Spark 2.3ORC improvement in Apache Spark 2.3
ORC improvement in Apache Spark 2.3
 
ORC improvement in Apache Spark 2.3
ORC improvement in Apache Spark 2.3ORC improvement in Apache Spark 2.3
ORC improvement in Apache Spark 2.3
 
Row/Column- Level Security in SQL for Apache Spark
Row/Column- Level Security in SQL for Apache SparkRow/Column- Level Security in SQL for Apache Spark
Row/Column- Level Security in SQL for Apache Spark
 
Apache Spark 2.3 boosts advanced analytics and deep learning with Python
Apache Spark 2.3 boosts advanced analytics and deep learning with PythonApache Spark 2.3 boosts advanced analytics and deep learning with Python
Apache Spark 2.3 boosts advanced analytics and deep learning with Python
 
What's new in Apache Spark 2.4
What's new in Apache Spark 2.4What's new in Apache Spark 2.4
What's new in Apache Spark 2.4
 
Performance Update: When Apache ORC Met Apache Spark
Performance Update: When Apache ORC Met Apache SparkPerformance Update: When Apache ORC Met Apache Spark
Performance Update: When Apache ORC Met Apache Spark
 
Intro to Big Data Analytics using Apache Spark and Apache Zeppelin
Intro to Big Data Analytics using Apache Spark and Apache ZeppelinIntro to Big Data Analytics using Apache Spark and Apache Zeppelin
Intro to Big Data Analytics using Apache Spark and Apache Zeppelin
 
Intro to Spark with Zeppelin
Intro to Spark with ZeppelinIntro to Spark with Zeppelin
Intro to Spark with Zeppelin
 
ORC Improvement & Roadmap in Apache Spark 2.3 and 2.4
ORC Improvement & Roadmap in Apache Spark 2.3 and 2.4ORC Improvement & Roadmap in Apache Spark 2.3 and 2.4
ORC Improvement & Roadmap in Apache Spark 2.3 and 2.4
 
Apache spark 2.4 and beyond
Apache spark 2.4 and beyondApache spark 2.4 and beyond
Apache spark 2.4 and beyond
 
YARN Ready: Apache Spark
YARN Ready: Apache Spark YARN Ready: Apache Spark
YARN Ready: Apache Spark
 
Spark and Object Stores —What You Need to Know: Spark Summit East talk by Ste...
Spark and Object Stores —What You Need to Know: Spark Summit East talk by Ste...Spark and Object Stores —What You Need to Know: Spark Summit East talk by Ste...
Spark and Object Stores —What You Need to Know: Spark Summit East talk by Ste...
 
Running Spark in Production
Running Spark in ProductionRunning Spark in Production
Running Spark in Production
 
Sparc solaris servers
Sparc solaris serversSparc solaris servers
Sparc solaris servers
 
Apache Spark: Lightning Fast Cluster Computing
Apache Spark: Lightning Fast Cluster ComputingApache Spark: Lightning Fast Cluster Computing
Apache Spark: Lightning Fast Cluster Computing
 
Hive on spark berlin buzzwords
Hive on spark berlin buzzwordsHive on spark berlin buzzwords
Hive on spark berlin buzzwords
 
Hive on spark is blazing fast or is it final
Hive on spark is blazing fast or is it finalHive on spark is blazing fast or is it final
Hive on spark is blazing fast or is it final
 
The structured streaming upgrade to Apache Spark and how enterprises can bene...
The structured streaming upgrade to Apache Spark and how enterprises can bene...The structured streaming upgrade to Apache Spark and how enterprises can bene...
The structured streaming upgrade to Apache Spark and how enterprises can bene...
 
What’s new in Apache Spark 2.3
What’s new in Apache Spark 2.3What’s new in Apache Spark 2.3
What’s new in Apache Spark 2.3
 

Más de DataWorks Summit

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
DataWorks Summit
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
DataWorks Summit
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
DataWorks Summit
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
DataWorks Summit
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
DataWorks Summit
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
DataWorks Summit
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
DataWorks Summit
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
DataWorks Summit
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
DataWorks Summit
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
DataWorks Summit
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
DataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
DataWorks Summit
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
DataWorks Summit
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
DataWorks Summit
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
DataWorks Summit
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
DataWorks Summit
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
DataWorks Summit
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
DataWorks Summit
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
DataWorks Summit
 

Más de DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Último

Building Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and MilvusBuilding Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and Milvus
Zilliz
 
Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024
Jason Packer
 
Artificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopmentArtificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopment
Octavian Nadolu
 
5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides
DanBrown980551
 
GraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracyGraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracy
Tomaz Bratanic
 
Driving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success StoryDriving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success Story
Safe Software
 
June Patch Tuesday
June Patch TuesdayJune Patch Tuesday
June Patch Tuesday
Ivanti
 
Skybuffer SAM4U tool for SAP license adoption
Skybuffer SAM4U tool for SAP license adoptionSkybuffer SAM4U tool for SAP license adoption
Skybuffer SAM4U tool for SAP license adoption
Tatiana Kojar
 
How to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptxHow to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptx
danishmna97
 
Ocean lotus Threat actors project by John Sitima 2024 (1).pptx
Ocean lotus Threat actors project by John Sitima 2024 (1).pptxOcean lotus Threat actors project by John Sitima 2024 (1).pptx
Ocean lotus Threat actors project by John Sitima 2024 (1).pptx
SitimaJohn
 
20240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 202420240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 2024
Matthew Sinclair
 
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdfUnlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Malak Abu Hammad
 
Fueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte WebinarFueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte Webinar
Zilliz
 
OpenID AuthZEN Interop Read Out - Authorization
OpenID AuthZEN Interop Read Out - AuthorizationOpenID AuthZEN Interop Read Out - Authorization
OpenID AuthZEN Interop Read Out - Authorization
David Brossard
 
Monitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdfMonitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdf
Tosin Akinosho
 
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial IntelligenceAI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
IndexBug
 
WeTestAthens: Postman's AI & Automation Techniques
WeTestAthens: Postman's AI & Automation TechniquesWeTestAthens: Postman's AI & Automation Techniques
WeTestAthens: Postman's AI & Automation Techniques
Postman
 
HCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAUHCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAU
panagenda
 
National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
Quotidiano Piemontese
 
20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
Matthew Sinclair
 

Último (20)

Building Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and MilvusBuilding Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and Milvus
 
Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024
 
Artificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopmentArtificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopment
 
5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides
 
GraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracyGraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracy
 
Driving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success StoryDriving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success Story
 
June Patch Tuesday
June Patch TuesdayJune Patch Tuesday
June Patch Tuesday
 
Skybuffer SAM4U tool for SAP license adoption
Skybuffer SAM4U tool for SAP license adoptionSkybuffer SAM4U tool for SAP license adoption
Skybuffer SAM4U tool for SAP license adoption
 
How to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptxHow to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptx
 
Ocean lotus Threat actors project by John Sitima 2024 (1).pptx
Ocean lotus Threat actors project by John Sitima 2024 (1).pptxOcean lotus Threat actors project by John Sitima 2024 (1).pptx
Ocean lotus Threat actors project by John Sitima 2024 (1).pptx
 
20240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 202420240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 2024
 
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdfUnlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
 
Fueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte WebinarFueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte Webinar
 
OpenID AuthZEN Interop Read Out - Authorization
OpenID AuthZEN Interop Read Out - AuthorizationOpenID AuthZEN Interop Read Out - Authorization
OpenID AuthZEN Interop Read Out - Authorization
 
Monitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdfMonitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdf
 
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial IntelligenceAI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
 
WeTestAthens: Postman's AI & Automation Techniques
WeTestAthens: Postman's AI & Automation TechniquesWeTestAthens: Postman's AI & Automation Techniques
WeTestAthens: Postman's AI & Automation Techniques
 
HCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAUHCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAU
 
National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
 
20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
 

What s new in spark 2.3 and spark 2.4

  • 1. 1 © Hortonworks Inc. 2011–2018. All rights reserved What’s New in Apache Spark 2.3 & 2.4 DWS Melbourne, Australia 2019 Robert Hryniewicz
  • 2. 2 © Hortonworks Inc. 2011–2018. All rights reserved • Apache Spark and Kubernetes • Native Vectorized ORC and SQL Cache Readers • Pandas UDFs for PySpark • Continuous Stream Processing • Barrier Execution • Avro/Image Data Source • Higher-order Functions Agenda Highlights
  • 3. 3 © Hortonworks Inc. 2011–2018. All rights reserved Apache Spark 2.3
  • 4. 4 © Hortonworks Inc. 2011–2018. All rights reserved Native Vectorized ORC Reader • Native ORC read and write: ‘spark.sql.orc.impl’ to ‘native’. • Vectorized ORC reader: ‘spark.sql.orc.enableVectorizedReader’ to ‘true’ See also ORC Improvement in Apache Spark 2.3 by Dongjoon Hyun
  • 5. 5 © Hortonworks Inc. 2011–2018. All rights reserved Normal UDF Pandas UDFs (a.k.a. Vectorized UDFs) Apache Spark Python Worker Internal Spark data Convert to standard Java type Pickled Unpickled Evaluate row by row Convert to Python data
  • 6. 6 © Hortonworks Inc. 2011–2018. All rights reserved Pandas UDF Pandas UDFs (a.k.a. Vectorized UDFs) Internal Spark data Apache Arrow format Convert to Pandas (cheap) Vectorized operation by Pandas API Apache Spark Python Worker
  • 7. 7 © Hortonworks Inc. 2011–2018. All rights reserved Pandas UDF Pandas UDFs (a.k.a. Vectorized UDFs) Internal Spark data Apache Arrow format Convert to Pandas (cheap) Vectorized operation by Pandas API Apache Spark Python Worker
  • 8. 8 © Hortonworks Inc. 2011–2018. All rights reserved Pandas UDF Pandas UDFs (a.k.a. Vectorized UDFs) See also Introducing Pandas UDFs for PySpark by Li Jin
  • 9. 9 © Hortonworks Inc. 2011–2018. All rights reserved Conversion To/From Pandas With Apache Arrow • Enable Apache Arrow optimization: ‘spark.sql.execution.arrow.enabled’ to ‘true’. See also Speeding up PySpark with Apache Arrow by Bryan Cutler
  • 10. 10 © Hortonworks Inc. 2011–2018. All rights reserved Structured Streaming Continuous Stream Processing
  • 11. 11 © Hortonworks Inc. 2011–2018. All rights reserved Structured Streaming: Microbatch Continuous Stream Processing See also Continuous Processing in Structured Streaming by Josh Torres
  • 12. 12 © Hortonworks Inc. 2011–2018. All rights reserved Structured Streaming: Continuous Processing Continuous Stream Processing See also Continuous Processing in Structured Streaming by Josh Torres
  • 13. 13 © Hortonworks Inc. 2011–2018. All rights reserved Structured Streaming: Continuous Processing Continuous Stream Processing See also Spark Summit Keynote Demo by Michael Armbrust https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#continuous-processing
  • 14. 14 © Hortonworks Inc. 2011–2018. All rights reserved Stream-to-Stream Joins See also Introducing Stream-Stream Joins in Apache Spark 2.3 by Tathagata Das and Joseph Torres
  • 15. 15 © Hortonworks Inc. 2011–2018. All rights reserved Apache Spark and Kubernetes See also Running Spark on Kubernetes
  • 16. 16 © Hortonworks Inc. 2011–2018. All rights reserved Image Support in Spark • Convert from compressed Images format (e.g., PNG and JPG) to raw representation of an image for OpenCV • One record per one image file See also SPARK-21866 by Ilya Matiach, and Deep Learning Pipelines for Apache Spark
  • 17. 17 © Hortonworks Inc. 2011–2018. All rights reserved (Stateless) History Server History Server Using K-V Store • Requires storing app lists and UI in the memory • Requires reading/parsing the whole log file See also SPARK-18085 and the proposal by Marcelo Vanzin
  • 18. 18 © Hortonworks Inc. 2011–2018. All rights reserved History Server Using K-V Store History Server Using K-V Store • Store app lists and UI in a persistent K-V store (LevelDB) • Set ‘spark.history.store.path’ to use this feature • The event log written by lower versions is still compatible See also SPARK-18085 and the proposal by Marcelo Vanzin
  • 19. 19 © Hortonworks Inc. 2011–2018. All rights reserved R Structured Streaming https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html See also SSR: Structured Streaming on R for Machine Learning by Felix Cheung
  • 20. 20 © Hortonworks Inc. 2011–2018. All rights reserved R Native Function Execution Stability See also SPARK-21093
  • 21. 21 © Hortonworks Inc. 2011–2018. All rights reserved Apache Spark 2.4
  • 22. 22 © Hortonworks Inc. 2011–2018. All rights reserved Apache Spark 2.4 Barrier Execution Apache Spark 3.0 • [SPARK-24374] barrier execution mode • [SPARK-24374] barrier execution mode • [SPARK-24579] optimized data exchange • [SPARK-24615] accelerator-aware scheduling See also Project Hydrogen: Unifying State-of-the-art AI and Big Data in Apache Spark by Reynold Xin See also Project Hydrogen: State-of-the-Art Deep Learning on Apache Spark by Xiangrui Meng
  • 23. 23 © Hortonworks Inc. 2011–2018. All rights reserved Pandas UDFs: Grouped Aggregate Pandas UDFs https://github.com/apache/spark/commit/9786ce66c https://github.com/apache/spark/commit/b2ce17b4c
  • 24. 24 © Hortonworks Inc. 2011–2018. All rights reserved Pandas UDFs: Grouped Aggregate Pandas UDFs https://github.com/apache/spark/pull/22620/commits/06a7bd0c Internal Spark data Apache Arrow format Convert to Pandas (cheap) Vectorized operation by Pandas API Apache Spark Python Worker
  • 25. 25 © Hortonworks Inc. 2011–2018. All rights reserved Eager Evaluation • Set ‘spark.sql.repl.eagerEval.enabled’ to true to enable eager evaluation in Jupyter
  • 26. 26 © Hortonworks Inc. 2011–2018. All rights reserved Eager Evaluation • Set ‘spark.sql.repl.eagerEval.enabled’ to true to enable eager evaluation in Jupyter See also (ongoing) SPARK-24572 for Eagar Evaluation at R side
  • 27. 27 © Hortonworks Inc. 2011–2018. All rights reserved Flexible Streaming Sink • Exposing output rows of each microbatch as a DataFrame • foreachBatch(f: Dataset[T] => Unit) Scala/Java/Python APIs in DataStreamWriter.
  • 28. 28 © Hortonworks Inc. 2011–2018. All rights reserved Avro Data Source • Apache Avro (https://avro.apache.org) • A data serialization format • Widely used in the Spark and Hadoop ecosystem, especially for Kafka-based data pipelines. • Spark-Avro package (https://github.com/databricks/spark-avro) • Spark SQL can read and write the Avro data. • Inlining Spark-Avro package [SPARK-24768] • Better experience for first-time users of Spark SQL and structured streaming • Expect further improve the adoption of structured streaming
  • 29. 29 © Hortonworks Inc. 2011–2018. All rights reserved Avro Data Source • from_avro/to_avro functions to read and write Avro data within a DataFrame instead of just files. • Example: • Decode the Avro data into a struct • Filter by column `favorite_color` • Encode the column `name` in Avro format
  • 30. 30 © Hortonworks Inc. 2011–2018. All rights reserved Avro Data Source • Refactor Avro Serializer and Deserializer • External • Arrow Data -> Row -> InternalRow • Native • Arrow Data -> InternalRow
  • 31. 31 © Hortonworks Inc. 2011–2018. All rights reserved Avro Data Source • Options: • compression: compression codec in write • ignoreExtension: if ignore .avro or not in read • recordNamespace: record namespace in write • recordName: top root record name in write • avroSchema: avro schema to use • Logical type support: • Date [SPARK-24772] • Decimal [SPARK-24774] • Timestamp [SPARK-24773]
  • 32. 32 © Hortonworks Inc. 2011–2018. All rights reserved Image Data Source • Spark datasource for image format • ImageSchema deprecated use instead: • SQL syntax support • Partition discovery
  • 33. 33 © Hortonworks Inc. 2011–2018. All rights reserved Higher-order Functions • Takes functions to transform complex datatype like map, array and struct
  • 34. 34 © Hortonworks Inc. 2011–2018. All rights reserved Higher-order Functions
  • 35. 35 © Hortonworks Inc. 2011–2018. All rights reserved Built-in Functions • New or extended built-in functions for ArrayTypes and MapTypes • 26 functions for ArrayTypes • transform, filter, reduce, array_distinct, array_intersect, array_union, array_except, array_join, array_max, array_min, ... • 8 functions for MapTypes • map_from_arrays, map_from_entries, map_entries, map_concat, map_filter, map_zip_with, transform_keys, transform_values
  • 36. 36 © Hortonworks Inc. 2011–2018. All rights reserved Apache Spark and Kubernetes • New Spark scheduler backend • PySpark support [SPARK-23984] • SparkR support [SPARK-24433] • Client-mode support [SPARK-23146] • Support for mounting K8S volumes [SPARK-23529] Scala 2.12 (Beta) Support Build Spark against Scala 2.12 [SPARK-14220]
  • 37. 37 © Hortonworks Inc. 2011–2018. All rights reserved PySpark Custom Worker • Configuration to select the modules for daemon and worker in PySpark • Set ‘spark.python.daemon.module’and/or ‘spark.python.worker.module’ to the worker or daemon modules See also Remote Python Debugging 4 Spark
  • 38. 38 © Hortonworks Inc. 2011–2018. All rights reserved Data Source Changes • CSV • Option samplingRatio • for schema inference [SPARK-23846] • Option enforceSchema • for throwing an exception when user- specified schema doesn‘t match the CSV header [SPARK-23786] • Option encoding • for specifying the encoding of outputs. [SPARK-19018] • JSON • Option dropFieldIfAllNull • for ignoring column of all null values or empty array/struct during JSON schema inference [SPARK-23772] • Option lineSep • for defining the line separator that should be used for parsing [SPARK-23765] • Option encoding • for specifying the encoding of inputs and outputs. [SPARK-23723]
  • 39. 39 © Hortonworks Inc. 2011–2018. All rights reserved Data Source Changes • Parquet • Push down • STRING [SPARK-23972] • Decimal [SPARK-24549] • Timestamp [SPARK-24718] • Date [SPARK-23727] • Byte/Short [SPARK-24706] • StringStartsWith [SPARK-24638] • IN [SPARK-17091] • ORC • Native ORC reader is on by default [SPARK-23456] • Turn on ORC filter push-down by default [SPARK-21783] • Use native ORC reader to read Hive serde tables by default [SPARK- 22279]
  • 40. 40 © Hortonworks Inc. 2011–2018. All rights reserved Data Source Changes • JDBC • Option queryTimeout • for the number of seconds the the driver will wait for a Statement object to execute. [SPARK-23856] • Option query • for specifying the query to read from JDBC [SPARK-24423] • Option pushDownFilters • for specifying whether the filter pushdown is allowed [SPARK-24288] • Option cascadeTruncate [SPARK-22880]
  • 41. 41 © Hortonworks Inc. 2011–2018. All rights reserved What About Apache Spark 3.0? Spark 2.2.0 RC1 2017/05 Spark 2.2.0 released 2018/07 Spark 2.2.0 RC2, RC3, RC4, RC5 2017/06 Spark 2.2.0 RC6 2017/07 Spark 2.3.0 RC1 2018/01 Spark 2.3.0 RC2, RC3, RC4, RC5 2018/02 Spark 2.3.0 released 2018/02 Spark 2.4.0 RC1 2018/09 Spark 3.0.0 2019/05 (?) Spark 2.4.0 RC2 2018/10 Spark 2.4.0 2018/11 See also the thread in Spark dev mailing list for Spark 3.0 discussion
  • 42. 42 © Hortonworks Inc. 2011–2018. All rights reserved Newer Integration for Apache Hive with Apache Spark • Apache Hive 3 support: Apache Spark provides a basic Hive compatibility • Apache Hive ACID table support • Structured Streaming Support • Apache Ranger integration support • Use LLAP and vectorized read/write – fast! See also this article for Hive warehouse connector
  • 43. 43 © Hortonworks Inc. 2011–2018. All rights reserved Questions?
  • 44. 44 © Hortonworks Inc. 2011–2018. All rights reserved Thanks! Robert Hryniewicz