Quick Guide to Refresh Spark Skills
1. Limitations of Spark
A. Small files problem, expensive (memory-intensive), and only near-real-time (not true real-time) processing.
2. Difference b/w map, flat-map
A. flatMap() : transforms data of length N into data of length M, where M can be greater than, less than, or equal to N.
Example - 1
data : ["aa bb,cc", "k,m", "dd"]
data.flatMap(line => line.split(",")) // M > N
[["aa bb","cc"],["k","m"],["dd"]] -> flattened ->
["aa bb","cc","k","m","dd"]
Example - 2
data.flatMap(line => if (line.contains(",")) None else Some(line)) // M < N
[[],[],["dd"]] -> flattened -> ["dd"]
map() : transforms data of length N into data of length N.
Example
data.map(line => line.split(","))
[["aa bb","cc"],["k","m"],["dd"]]
3. Difference b/w Map and HashMap
A. A HashMap is a collection of key-value pairs which are stored internally using a hash table data structure. HashMap is an implementation of Map. As you can see from their definitions, HashMap is a class and Map is a trait.
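A minimal sketch of the distinction, using only the Scala standard library:
import scala.collection.immutable.HashMap
// Map is a trait; Map(...) returns some concrete implementation chosen by the library
val m: Map[String, Int] = Map("a" -> 1, "b" -> 2)
// HashMap is a concrete class backed by a hash table
val hm: HashMap[String, Int] = HashMap("a" -> 1, "b" -> 2)
// Both expose the same Map operations
println(m.getOrElse("c", 0))        // 0
val hm2 = hm + ("c" -> 3)           // immutable: returns a new map with the extra entry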
4. What is a case class
A. Case classes are special classes in Scala that provide you with the boilerplate implementation of the constructor, getters (accessors), equals and hashCode, and implement Serializable. Case classes work really well to encapsulate data as objects. Readers familiar with Java can relate them to plain old Java objects (POJOs) or Java beans.
Spark 1.6 : supports only 22 fields
Spark 2.0 : supports any number of fields
Used to apply a schema over an RDD to convert it to a DataFrame, and to convert a DataFrame to a Dataset.
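A small sketch (assuming an existing SparkSession named spark) of a case class being used to attach a schema:
case class Person(name: String, age: Int)
import spark.implicits._
val rdd = spark.sparkContext.parallelize(Seq(Person("ana", 30), Person("bob", 25)))
val df = rdd.toDF()      // case class fields become the DataFrame schema
val ds = df.as[Person]   // DataFrame -> Dataset[Person]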
5. What is a trait
A. Traits in Scala are very similar to Java interfaces, but traits can have methods with implementations. Traits cannot have constructor parameters.
Advantage - one big advantage of traits is that you can mix in multiple traits using the with clause, but you can extend only one abstract class.
https://stackoverflow.com/questions/1229743/what-
are-the-pros-of-using-traits-over-abstract-classes
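A quick illustration of the points above: a trait can carry implemented methods, and several traits can be mixed in with the with clause, while the class keeps its own constructor parameters.
trait Logger { def log(msg: String): Unit = println(s"LOG: $msg") }  // implemented method
trait Closeable { def close(): Unit }                                // abstract method
class FileReader(path: String) extends Logger with Closeable {
  def close(): Unit = log(s"closing $path")
}
new FileReader("/tmp/data.txt").close()   // prints: LOG: closing /tmp/data.txt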
6. spark 1.6 vs spark 2.0
A. i. Single entry point with SparkSession instead of SQLContext, HiveContext and SparkContext.
ii. Dataset API and DataFrame API are unified. In Scala, DataFrame = Dataset[Row], while Java API users must replace DataFrame with Dataset<Row>.
iii.Dataset and DataFrame API unionAll has been
deprecated and replaced by union
iv.Dataset and DataFrame API explode has been
deprecated, alternatively, use functions.explode() with
select or flatMap
v. Dataset and DataFrame API registerTempTable has
been deprecated and replaced by
createOrReplaceTempView
vi. From Spark 2.0, CREATE TABLE ... LOCATION is equivalent to CREATE EXTERNAL TABLE ... LOCATION, in order to prevent accidentally dropping existing data in user-provided locations. That means a Hive table created in Spark SQL with a user-specified location is always a Hive external table.
vii. unified API for both batch and streaming
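A minimal sketch of the single entry point (application name and path are illustrative):
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder()
  .appName("refresh-guide")
  .enableHiveSupport()   // replaces the separate HiveContext
  .getOrCreate()
val df = spark.read.json("/path/to/events.json")
df.createOrReplaceTempView("events")   // registerTempTable is deprecated
spark.sql("SELECT count(*) FROM events").show()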
7. repartition vs coalesce
A. coalesce uses existing partitions to minimize the
amount of data that's shuffled. repartition creates new
partitions and does a full shuffle. coalesce results in
partitions with different amounts of data.
coalesce may run faster than repartition, but unequal
sized partitions are generally slower to work with than
equal sized partitions. You'll usually need to repartition
datasets after filtering a large data set. I've found
repartition to be faster overall because Spark is built to
work with equal sized partitions.
The repartition algorithm doesn't distribute data equally for very small data sets.
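A sketch of the usual pattern after a selective filter (the DataFrame, column name and partition counts are illustrative):
import org.apache.spark.sql.functions.col
val filtered = df.filter(col("status") === "ACTIVE")   // may leave many near-empty partitions
// coalesce: merges existing partitions, minimal shuffle, possibly uneven partition sizes
val fewer = filtered.coalesce(50)
// repartition: full shuffle, roughly equal-sized partitions
val balanced = filtered.repartition(50)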
8. Spark joins
A. Broadcast hash join - broadcast the small table so no shuffle is needed.
Shuffle hash join - used if the average size of a single partition is small enough to build a hash table.
Sort merge join - used if the matching join keys are sortable.
Shuffle hash join is not part of 1.6, but is part of Spark 2.2 and 2.3.
Precedence order in 2.0: broadcast, shuffle hash, and sort merge.
https://sujithjay.com/spark-sql/2018/06/28/Shuffle-
Hash-and-Sort-Merge-Joins-in-Apache-Spark/
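A sketch of forcing a broadcast hash join for a small dimension table (table and column names are assumptions):
import org.apache.spark.sql.functions.broadcast
val facts = spark.table("sales")    // large fact table
val dims  = spark.table("stores")   // small enough to broadcast
val joined = facts.join(broadcast(dims), Seq("store_id"))   // avoids shuffling the large side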
9. Spark joins optimizations
A. There aren't strict rules for optimization:
i. Analyze the data sizes
ii. Analyze the keys
iii. Based on that, use a broadcast hash join or filters
Potential causes of poor join performance:
Key skew in a shuffle hash join
Uneven sharding and limited parallelism - one table is small and one table is big, and the number of distinct join keys is small
Special cases:
Cartesian join (cross join) - what to do: enable cross joins in Spark (sketched below)
One-to-many joins - what to do: use the Parquet format
Theta joins - what to do: create buckets on the keys
https://databricks.com/session/optimizing-apache-
spark-sql-joins
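Two of the special cases above, sketched with assumed DataFrame, table and column names:
// Cartesian (cross) join has to be asked for explicitly
spark.conf.set("spark.sql.crossJoin.enabled", "true")
val cart = dfA.crossJoin(dfB)
// Bucketing both sides on the join key helps theta/skewed joins
dfA.write.bucketBy(64, "user_id").sortBy("user_id").saveAsTable("a_bucketed")
dfB.write.bucketBy(64, "user_id").sortBy("user_id").saveAsTable("b_bucketed")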
10. RDD vs DataFrame vs DataSet
A. The underlying API for DataFrame and Dataset is RDD.
RDD - returns an Iterator[T]. The functionality underlying the Iterator is opaque, so optimization is difficult. Even T, the data itself, is opaque, so the RDD doesn't know which columns are important and which aren't; only generic serialization and compression (e.g. Kryo, zlib) can be used for optimization.
DataFrame - an abstraction which gives a schema view of the data; a DataFrame is like a table in a database. It offers custom memory management using Tungsten, so data is stored in off-heap memory in a binary format. This saves a lot of memory space and avoids garbage collection overhead. By knowing the schema of the data in advance and storing it efficiently in binary format, expensive Java serialization is also avoided. Execution plans are optimized by the Catalyst optimizer.
Dataset - an extension of the DataFrame API, the latest abstraction, which tries to provide the best of both RDD and DataFrame: compile-time safety like RDD as well as the performance-boosting features of DataFrame. Where Dataset scores over DataFrame is one additional feature: Encoders. Encoders generate byte code to interact with off-heap data and provide on-demand access to individual attributes without having to deserialize an entire object.
https://www.linkedin.com/pulse/apache-spark-rdd-vs-
dataframe-dataset-chandan-prakash
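The same data in all three abstractions (a sketch; assumes a SparkSession named spark):
case class Person(name: String, age: Int)
import spark.implicits._
val rdd = spark.sparkContext.parallelize(Seq(Person("ana", 30)))  // RDD[Person]: contents opaque to the optimizer
val df  = rdd.toDF()                                              // DataFrame: schema known, rows untyped
val ds  = df.as[Person]                                           // Dataset[Person]: schema plus compile-time types
ds.filter(_.age > 21).show()   // typed lambda, handled via Encoders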
11. Drawbacks of spark streaming
A. Spark Streaming is based on the discretized stream, or DStream, which is represented as a sequence of RDDs.
Micro-batch systems have an inherent problem with
backpressure. If processing of a batch takes longer in
downstream operations, because of computational
complexity or just slow sink, than in the batching
operator (usually source), the micro-batch will take
longer than configured. This leads either to more and
more batches queueing up, or to a growing micro-batch
size.
With micro-batching, Spark Streaming can thus achieve high throughput and exactly-once guarantees, but it gives away low latency, flow control and the native streaming programming model.
12. How checkpoints are useful
A. Checkpoints freeze the content of your data frames before you do something else. They're essential to keeping track of your data frames.
Spark has been offering checkpoints on streaming since earlier versions (at least v1.2.0), but checkpoints on data frames are a different beast.
Metadata Checkpointing - Metadata means data about data. It refers to saving the metadata to fault-tolerant storage like HDFS. Metadata includes configuration, DStream operations, and incomplete batches. Configuration refers to the configuration used to create the streaming application, DStream operations are the operations which define the streaming application, and incomplete batches are batches which are in the queue but are not yet complete.
Data Checkpointing - It refers to saving the RDD to reliable storage, because the need arises in some of the stateful transformations. This is the case when an upcoming RDD depends on the RDDs of previous batches, so the dependency chain keeps growing with time. To avoid such an increase in recovery time, the intermediate RDDs are periodically checkpointed to reliable storage, which cuts down the dependency chain.
Eager Checkpoint
An eager checkpoint will cut the lineage from previous data frames and will allow you to start "fresh" from this point on. In other words, Spark will dump your data frame to a file under the directory specified by setCheckpointDir() and will start a fresh new data frame from it. You will also need to wait for completion of the operation.
Non-Eager Checkpoint
On the other hand, a non-eager checkpoint will keep the
lineage from previous operations in the data frame.
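A small sketch of the two variants on a DataFrame (the checkpoint directory is an assumption):
spark.sparkContext.setCheckpointDir("/tmp/checkpoints")
val eager    = df.checkpoint()               // eager by default: materializes now and cuts the lineage
val nonEager = df.checkpoint(eager = false)  // lineage is kept until an action materializes it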
There are various differences between Spark checkpoint and persist. Let's discuss them one by one:
Persist vs Checkpoint
When we persist an RDD with the DISK_ONLY storage level, the RDD gets stored in a location from which subsequent uses of that RDD do not need to recompute the lineage.
After persist() is called, Spark still remembers the lineage of the RDD even though it does not need to recompute it.
Secondly, after the job run is complete, the cache is
cleared and the files are destroyed.
Checkpointing
Checkpointing stores the RDD in HDFS. It deletes the
lineage which created it.
On completing the job run, unlike cache, the checkpoint file is not deleted.
Checkpointing an RDD results in double computation: the operation first calls a cache before accomplishing the actual job of computing, and secondly the RDD is written to the checkpointing directory.
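A minimal comparison sketch on the RDD API (paths are illustrative):
import org.apache.spark.storage.StorageLevel
val rdd = spark.sparkContext.textFile("/tmp/input").map(_.length)
rdd.persist(StorageLevel.DISK_ONLY)   // lineage kept; files removed when the application ends
rdd.count()
spark.sparkContext.setCheckpointDir("hdfs:///checkpoints")
rdd.checkpoint()                      // lineage cut; files survive the job
rdd.count()                           // triggers the computation (and the extra write)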
https://data-flair.training/blogs/apache-spark-
streaming-checkpoint/
13. Unit testing frameworks for Spark
A. spark-fast-tests library - my preference
ScalaTest (FunSuite) and DataFrameSuiteBase (from spark-testing-base)
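A hedged example using a plain ScalaTest 3.x AnyFunSuite; the assertion helpers from spark-fast-tests or DataFrameSuiteBase can replace the manual assert:
import org.apache.spark.sql.SparkSession
import org.scalatest.funsuite.AnyFunSuite
class UppercaseSpec extends AnyFunSuite {
  lazy val spark = SparkSession.builder().master("local[2]").appName("test").getOrCreate()
  import spark.implicits._
  test("uppercases names") {
    val actual = Seq("ana", "bob").toDS().map(_.toUpperCase)
    assert(actual.collect().toSeq == Seq("ANA", "BOB"))
  }
}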
14. Various challenges faced while coding spark app
A. Heap space issue or out of memory – resolved by
increasing executor memory
Running indefinitely long - check the Spark UI for any data-skew problems due to duplicates in join keys or unequal partitions, and tune by repartitioning the data for more parallelism.
Nulls in join conditions - avoid nulls in join keys, or use a null-safe join (the <=> operator).
Ambiguous column reference issues - occur when a derived DF is used in a join with the source DF. For an equi-join the condition can be resolved with Seq(join_columns); otherwise rename the DF2 columns before joining.
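Sketches of the last two fixes (DataFrame and column names are illustrative):
// Null-safe join: <=> treats null == null as true
val joinedNullSafe = df1.join(df2, df1("key") <=> df2("key"))
// Equi-join on a shared column name: Seq(...) keeps a single, unambiguous key column
val joinedSeq = df1.join(df2, Seq("key"))
// Otherwise, rename the right-hand columns before joining
val df2Renamed = df2.withColumnRenamed("key", "key_right")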
15. Spark streaming diff in 2.0 and 2.3
A. Structured Streaming in Apache Spark 2.0 decoupled
micro-batch processing from its high-level APIs for a
couple of reasons. First, it made developer’s experience
with the APIs simpler: the APIs did not have to account
for micro-batches. Second, it allowed developers to
treat a stream as an infinite table to which they could
issue queries as they would a static table.
However, to provide developers with different modes of stream processing, a new millisecond-scale low-latency mode of streaming was introduced: continuous mode.
Structured Streaming in Spark 2.0 has supported joins
between a streaming DataFrame/Dataset and a static
one
Spark 2.3 supports stream-to-stream joins, both inner and outer, for numerous real-time use cases. The canonical use case of joining two streams is ad monetization, e.g. joining ad impressions (views) with ad clicks.
https://databricks.com/blog/2018/02/28/introducing-
apache-spark-2-3.html
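A compact sketch of a stream-to-stream inner join for the ad-monetization example (Kafka topics, column names and time bounds are assumptions):
import org.apache.spark.sql.functions.expr
val impressions = spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", "host:9092").option("subscribe", "impressions").load()
  .selectExpr("CAST(key AS STRING) AS impressionAdId", "timestamp AS impressionTime")
  .withWatermark("impressionTime", "2 hours")
val clicks = spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", "host:9092").option("subscribe", "clicks").load()
  .selectExpr("CAST(key AS STRING) AS clickAdId", "timestamp AS clickTime")
  .withWatermark("clickTime", "3 hours")
// Inner join: a click must arrive within 1 hour of its impression
val joined = impressions.join(clicks, expr(
  "clickAdId = impressionAdId AND clickTime BETWEEN impressionTime AND impressionTime + interval 1 hour"))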