QUICK GUIDE TO REFRESH SPARK SKILLS
1. Limitations of Spark
A. Small files problem; expensive (heavy memory use); only
near-real-time processing (micro-batches), not true
record-at-a-time streaming.
2. Difference b/w map and flatMap
A. flatMap(): transforms a collection of length N into another
collection of length M, where M > N, M < N or M = N.
Example - 1
data : ["aa bb,cc", "k,m", "dd"]
data.flatMap(line => line.split(",")) // M > N
[["aa bb","cc"],["k","m"],["dd"]] ->flat->
["aa bb","cc","k","m","dd"]
Example - 2
data.flatMap(line => if (line.contains(",")) Seq(line) else Nil) // M < N
keeps only the elements that contain a comma ->
["aa bb,cc","k,m"]
map(): transforms a collection of length N into another
collection of exactly length N
Example
data.map(line => line.split(","))
[["aa bb","cc"],["k","m"],["dd"]]
3. Difference b/w Map and HashMap
A. A HashMap is a collection of key-value pairs which
are stored internally using a hash table data structure.
HashMap is an implementation of Map: as you can see
in their definitions, HashMap is a class and Map is a trait.
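A minimal Scala sketch of the distinction, assuming the
immutable collections:
import scala.collection.immutable.HashMap
val m: Map[String, Int] = HashMap("a" -> 1, "b" -> 2) // declared type is the Map trait, implementation is the HashMap class
m.get("a") // Some(1)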
4. What is a case class
A. Case classes are special classes in Scala that provide you
with the boilerplate implementation of the constructor,
getters (accessors), equals and hashCode, and
implement Serializable. Case classes work really well to
encapsulate data as objects. Readers familiar with Java
can relate them to plain old Java objects (POJOs) or Java
beans.
Spark 1.6 (Scala 2.10): case classes support only 22 fields
Spark 2.0 (Scala 2.11+): supports any number of fields
Used to apply a schema over an RDD to convert it to a
DataFrame, and a DataFrame to a Dataset.
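A minimal sketch of that conversion path, assuming an
existing SparkSession named spark and a hypothetical
Person case class:
import spark.implicits._
case class Person(name: String, age: Int)
val rdd = spark.sparkContext.parallelize(Seq(Person("ann", 34), Person("bob", 25)))
val df = rdd.toDF()    // RDD -> DataFrame, schema derived from the case class
val ds = df.as[Person] // DataFrame -> Dataset[Person]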
5. What is a trait
A. Traits in Scala are very similar to Java interfaces,
but traits can have methods with implementation. Traits
cannot have constructor parameters.
Advantage - one big advantage of traits is that you can
mix in multiple traits with the "with" clause, but you can
extend only one abstract class (see the sketch after the link).
https://stackoverflow.com/questions/1229743/what-
are-the-pros-of-using-traits-over-abstract-classes
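A minimal sketch of implemented methods and multiple
mixins; all names are hypothetical:
trait Greeter { def greet(name: String): String = "Hello, " + name } // method with implementation
trait Logger { def log(msg: String): Unit = println(msg) }
class Service extends Greeter with Logger // mixes in multiple traits
val s = new Service
s.log(s.greet("Spark")) // prints "Hello, Spark"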
6. Spark 1.6 vs Spark 2.0
A. i. Single entry point with SparkSession instead of
SQLContext, HiveContext and SparkContext (see the
sketch after this list).
ii. Dataset API and DataFrame API are unified. In Scala,
DataFrame = Dataset[Row], while Java API users must
replace DataFrame with Dataset<Row>
iii. Dataset and DataFrame API unionAll has been
deprecated and replaced by union
iv. Dataset and DataFrame API explode has been
deprecated; alternatively, use functions.explode() with
select or flatMap
v. Dataset and DataFrame API registerTempTable has
been deprecated and replaced by
createOrReplaceTempView
vi. From Spark 2.0, CREATE TABLE ... LOCATION is
equivalent to CREATE EXTERNAL TABLE ... LOCATION, in
order to prevent accidentally dropping existing data
in user-provided locations. That means a Hive table
created in Spark SQL with a user-specified location is
always a Hive external table.
vii. unified API for both batch and streaming
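A minimal sketch of the unified entry point and the
renamed temp-view API; the app name is hypothetical:
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().appName("refresher").getOrCreate()
val df = spark.range(5).toDF("id") // no separate SQLContext/HiveContext needed
df.createOrReplaceTempView("ids")  // replaces the deprecated registerTempTable
spark.sql("SELECT count(*) FROM ids").show()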
7. repartition vs coalesce
A. coalesce uses existing partitions to minimize the
amount of data that's shuffled. repartition creates new
partitions and does a full shuffle. coalesce results in
partitions with different amounts of data.
coalesce may run faster than repartition, but unequal
sized partitions are generally slower to work with than
equal sized partitions. You'll usually need to repartition
datasets after filtering a large data set. I've found
repartition to be faster overall because Spark is built to
work with equal sized partitions.
The repartition algorithm doesn't distribute data equally
for very small data sets (see the sketch below).
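A minimal sketch of the filter-then-repartition pattern;
partition counts and the column name are illustrative:
import org.apache.spark.sql.functions.col
val filtered = df.filter(col("amount") > 0) // can leave many near-empty partitions
val balanced = filtered.repartition(200)    // full shuffle into roughly equal-sized partitions
val merged = balanced.coalesce(10)          // merges existing partitions, minimizing shuffle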
8. Spark joins
A. Broadcast hash join - broadcasts the small table so no
shuffle is needed
Shuffle hash join - used if the average size of a single
partition is small enough to build a hash table
Sort-merge join - used if the matching join keys are
sortable
Shuffle hash join is not part of 1.6, but is part of Spark
2.2 and 2.3.
Precedence order in 2.0: broadcast, then shuffle hash,
then sort-merge.
https://sujithjay.com/spark-sql/2018/06/28/Shuffle-
Hash-and-Sort-Merge-Joins-in-Apache-Spark/
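A minimal sketch of forcing a broadcast hash join with
the broadcast() hint; DataFrame names are hypothetical:
import org.apache.spark.sql.functions.broadcast
val joined = ordersDF.join(broadcast(countriesDF), Seq("country_code")) // small side shipped to every executor, big side not shuffled
Spark also broadcasts automatically when the small side
is below spark.sql.autoBroadcastJoinThreshold (10 MB by
default).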
9. Spark joins optimizations
A. There aren't strict rules for optimization:
i. Analyze the data sizes
ii. Analyze the keys
iii. Based on that, use a broadcast hash join or filters
Potential causes of poor join performance:
dealing with key skew in a shuffle hash join;
uneven sharding and limited parallelism - one table is
small and one table is big, and there are few distinct
join keys
Special cases (see the sketch after the link below):
Cartesian join (cross join) - what to do - explicitly enable
cross joins in Spark
One-to-many joins - what to do - use the Parquet format
Theta joins - what to do - create buckets on the keys
https://databricks.com/session/optimizing-apache-
spark-sql-joins
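A minimal sketch of the cross-join special case on Spark
2.x; the DataFrame names are hypothetical:
spark.conf.set("spark.sql.crossJoin.enabled", "true") // Spark 2.x rejects cartesian products unless enabled
val pairs = customersDF.crossJoin(promotionsDF)       // explicit cross join (API available since Spark 2.1)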
10. RDD vs DataFrame vs DataSet
A. The underlying API for DataFrame and Dataset is RDD.
RDD - returns an Iterator[T]. The functionality behind the
iterator is opaque, so optimization is difficult; even T, the
data itself, is opaque, so the RDD doesn't know which
columns are important and which aren't. Only serializers
like Kryo and compression codecs like zlib are used for
optimization.
DataFrame: an abstraction which gives a schema view of
data; a DataFrame is like a table in a database. It offers
custom memory management using Tungsten, so data is
stored in off-heap memory in binary format. This saves a
lot of memory space, and there is no garbage collection
overhead involved. By knowing the schema of the data in
advance and storing it efficiently in binary format,
expensive Java serialization is also avoided. Execution
plans are optimized by the Catalyst optimizer.
Datasets: an extension of the DataFrame API, the latest
abstraction, which tries to provide the best of both RDD
and DataFrame: compile-time safety like RDD as well as
the performance-boosting features of DataFrame. Where
a Dataset scores over a DataFrame is an additional
feature it has:
Encoders - Encoders generate byte code to interact with
off-heap data and provide on-demand access to
individual attributes without having to deserialize an
entire object.
https://www.linkedin.com/pulse/apache-spark-rdd-vs-
dataframe-dataset-chandan-prakash
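A minimal sketch of the compile-time-safety difference;
the Sale schema is hypothetical, and a SparkSession
named spark with import spark.implicits._ is assumed:
case class Sale(item: String, amount: Double)
val ds = Seq(Sale("book", 12.5), Sale("pen", 2.0)).toDS()
// ds.toDF().filter("amonut > 10") -- a column-name typo compiles and fails only at runtime
ds.filter(_.amount > 10) // typed lambda: the same typo would be a compile error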
11. Drawbacks of Spark Streaming
A. Spark Streaming is based on the discretized stream
(DStream), which is represented as a sequence of RDDs.
Micro-batch systems have an inherent problem with
backpressure. If processing a batch takes longer in
downstream operations (because of computational
complexity or just a slow sink) than in the batching
operator (usually the source), the micro-batch will take
longer than configured. This leads either to more and
more batches queueing up, or to a growing micro-batch
size.
With micro-batching, Spark Streaming can thus achieve
high throughput and exactly-once guarantees, but it
gives up low latency, flow control and the native
streaming programming model.
12. How checkpoints are useful
A. Checkpoints freeze the content of your data frames
before you do something else; they're essential for
keeping track of your data frames.
Spark has offered checkpoints on streaming since early
versions (at least v1.2.0), but checkpoints on data
frames are a different beast.
Metadata Checkpointing - metadata means data about
data. It refers to saving the metadata to fault-tolerant
storage like HDFS. Metadata includes configurations,
DStream operations, and incomplete batches.
Configuration refers to the configuration used to create
the streaming application; DStream operations are the
operations which define the streaming application;
incomplete batches are batches which are queued but
not yet complete.
Data Checkpointing - refers to saving the RDD to reliable
storage, which is needed by some stateful
transformations, i.e. when an upcoming RDD depends on
RDDs of previous batches, so the dependency chain
keeps growing with time. To avoid such growth in
recovery time, the intermediate RDDs are periodically
checkpointed to reliable storage, which cuts down the
dependency chain.
Eager Checkpoint
An eager checkpoint will cut the lineage from previous
data frames and will allow you to start "fresh" from this
point on. In other words, Spark will dump your data frame
into a file specified by setCheckpointDir() and will start a
fresh new data frame from it. You will also need to wait
for completion of the operation.
Non-Eager Checkpoint
On the other hand, a non-eager checkpoint will keep the
lineage from previous operations in the data frame.
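A minimal sketch of both flavors; the checkpoint
directory is hypothetical:
spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints") // must be set before checkpointing
val eagerDF = df.checkpoint()              // eager (default): materializes now and cuts the lineage
val lazyDF = df.checkpoint(eager = false)  // non-eager: lineage kept until an action materializes it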
There are various differences between Spark checkpoint
and persist. Let's discuss them one by one:
Persist vs Checkpoint
When we persist an RDD with the DISK_ONLY storage
level, the RDD gets stored in a location from which
subsequent uses of that RDD read it back, rather than
recomputing the lineage.
After persist() is called, Spark still remembers the lineage
of the RDD even though it doesn't recompute it.
Secondly, after the job run is complete, the cache is
cleared and the files are destroyed.
Checkpointing
Checkpointing stores the RDD in HDFS and deletes the
lineage which created it.
On completing the job run, unlike cache, the checkpoint
file is not deleted.
Checkpointing an RDD results in double computation:
the RDD is computed once for the actual job, and
computed a second time when it is written to the
checkpointing directory, so it is common to cache the
RDD before checkpointing.
https://data-flair.training/blogs/apache-spark-
streaming-checkpoint/
13. Unit testing frameworks for Spark
A. The spark-fast-tests library - my preference; also
ScalaTest's FunSuite and spark-testing-base's
DataFrameSuiteBase (see the sketch below).
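A minimal ScalaTest FunSuite sketch with a local
SparkSession; the suite and test names are hypothetical:
import org.apache.spark.sql.SparkSession
import org.scalatest.FunSuite
class SplitSuite extends FunSuite {
  private val spark = SparkSession.builder().master("local[2]").appName("test").getOrCreate()
  import spark.implicits._
  test("flatMap splits comma-separated values") {
    val out = Seq("a,b", "c").toDS().flatMap(_.split(",")).collect().sorted
    assert(out.sameElements(Array("a", "b", "c")))
  }
}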
14. Various challenges faced while coding a Spark app
A. Heap space issue or out of memory - resolved by
increasing executor memory
Running indefinitely long - check the Spark UI for
data-skew problems due to duplicates in join keys or
unequal partitions, and tune by repartitioning the data
for more parallelism
Nulls in join conditions - avoid nulls in join keys or use
null-safe joins with the <=> (eqNullSafe) operator; see
the sketch after this list
Ambiguous column reference issues - occur when a
derived DF is joined with its source DF; an equi-join
condition can be resolved with Seq(join_columns),
otherwise rename the DF2 columns before joining.
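A minimal sketch of a null-safe join and the Seq-based
equi-join; df1/df2 and the key column are hypothetical:
val safe = df1.join(df2, df1("key") <=> df2("key")) // null-safe equality: null <=> null evaluates to true
val equi = df1.join(df2, Seq("key"))                // one "key" column in the output, so no ambiguous reference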
15. Spark Streaming differences in 2.0 and 2.3
A. Structured Streaming in Apache Spark 2.0 decoupled
micro-batch processing from its high-level APIs for a
couple of reasons. First, it made the developer
experience with the APIs simpler: the APIs did not have
to account for micro-batches. Second, it allowed
developers to treat a stream as an infinite table to which
they could issue queries as they would to a static table.
However, to provide developers with different modes of
stream processing, Spark 2.3 introduces a new
millisecond low-latency mode of streaming: continuous
mode.
Structured Streaming in Spark 2.0 supported joins
between a streaming DataFrame/Dataset and a static
one.
Spark 2.3 adds stream-to-stream joins, both inner and
outer, for numerous real-time use cases. The canonical
use case of joining two streams is ad monetization, e.g.
joining ad impressions (views) with ad clicks (see the
sketch after the link below).
https://databricks.com/blog/2018/02/28/introducing-
apache-spark-2-3.html
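A minimal stream-to-stream inner-join sketch using the
built-in rate source; the column names are hypothetical,
and a production job would read Kafka topics and add
watermarks:
import org.apache.spark.sql.functions.col
val impressions = spark.readStream.format("rate").load()
  .select(col("value").as("adId"), col("timestamp").as("impressionTime"))
val clicks = spark.readStream.format("rate").load()
  .select(col("value").as("adId"), col("timestamp").as("clickTime"))
val joined = impressions.join(clicks, Seq("adId")) // stream-to-stream inner join (Spark 2.3+)
joined.writeStream.format("console").start()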