2. RDD DataFrame Dataset
Data Representation
RDD: A distributed set of Java objects representing the data.
DataFrame: A distributed collection of data organized into named columns.
Dataset: An extension of the DataFrame API, but optimized.
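Immutability and Interoperability
RDD: Immutable, but interoperable: an RDD can be converted to a DataFrame (.toDF) and vice versa (.rdd).
DataFrame: Cannot be converted back into domain objects. A DataFrame can be created from an RDD of domain objects, but converting it back (.rdd) yields generic Row objects rather than the original type.
Dataset: Can be converted back to an RDD of the original type, and a Dataset can also be created from a DataFrame.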
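The conversions described above can be sketched as follows. This is a minimal illustration, assuming Spark 3.x on the classpath and a spark-shell or Scala worksheet style session; the `Person` case class and the sample values are hypothetical and not part of the original slides.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}

// Hypothetical domain class, used only for illustration.
case class Person(name: String, age: Int)

val spark = SparkSession.builder().master("local[*]").appName("interop-sketch").getOrCreate()
import spark.implicits._

// RDD: a distributed collection of JVM objects.
val peopleRdd: RDD[Person] =
  spark.sparkContext.parallelize(Seq(Person("Ann", 34), Person("Bob", 19)))

// RDD -> DataFrame: column names are derived from the case class fields.
val peopleDf: DataFrame = peopleRdd.toDF()

// DataFrame -> RDD: rows come back as generic Row objects, not Person.
val rowRdd: RDD[Row] = peopleDf.rdd

// DataFrame -> Dataset: attach the domain type again.
val peopleDs: Dataset[Person] = peopleDf.as[Person]

// Dataset -> RDD: the domain type is preserved.
val typedRdd: RDD[Person] = peopleDs.rdd

// Materialize small results locally before stopping the session.
val columnNames = peopleDf.columns
val names = typedRdd.map(_.name).collect().sorted

spark.stop()
```

Note that `peopleDf.rdd` has type `RDD[Row]`, which is the "cannot convert back into domain objects" limitation from the DataFrame column, while `peopleDs.rdd` keeps `Person`.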
Compile-time type safety
RDD: Yes, provides compile-time type safety.
DataFrame: No. If you refer to a column that does not exist, the DataFrame API does not raise an error at compile time; the error is thrown at runtime.
Dataset: Yes, overcomes the DataFrame's lack of compile-time checking.
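A small sketch of the difference, assuming Spark 3.x in a spark-shell or worksheet style session. The `Person` class and the misspelled `salary` column are hypothetical and chosen only to trigger the error.

```scala
import org.apache.spark.sql.{AnalysisException, SparkSession}
import scala.util.Try

// Hypothetical domain class, used only for illustration.
case class Person(name: String, age: Int)

val spark = SparkSession.builder().master("local[*]").appName("type-safety-sketch").getOrCreate()
import spark.implicits._

val ds = Seq(Person("Ann", 34), Person("Bob", 19)).toDS()
val df = ds.toDF()

// DataFrame: the column name is a plain string, so a wrong name compiles
// and is only rejected when Spark analyzes the query at runtime.
val dfAttempt = Try(df.select("salary"))
val failedAtRuntime =
  dfAttempt.failed.toOption.exists(_.isInstanceOf[AnalysisException])

// Dataset: field access is checked by the Scala compiler.
val ages = ds.map(_.age).collect().sorted
// ds.map(_.salary)  // would not compile: salary is not a member of Person

spark.stop()
```

The commented-out line is left disabled on purpose: enabling it stops the program from compiling, which is the behavior the Dataset column describes.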
3. RDD DataFrame Dataset
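Optimization
RDD: No optimization engine; never uses the Catalyst optimizer or the Tungsten execution engine.
DataFrame: Uses the Catalyst optimizer and the Tungsten execution engine.
Dataset: Uses the same Catalyst optimizer and Tungsten execution engine as the DataFrame, with typed encoders on top.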
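One way to see the optimizer at work is to print the query plans of a DataFrame. The following is a sketch under the assumption of Spark 3.x in a spark-shell or worksheet style session; the sample data and the age threshold are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("plan-sketch").getOrCreate()
import spark.implicits._

val df = Seq(("Ann", 34), ("Bob", 19)).toDF("name", "age")

// A declarative query: Spark is free to reorder and simplify it.
val adults = df.filter($"age" > 21).select("name")

// Prints the parsed, analyzed, and optimized logical plans plus the physical plan.
adults.explain(true)

// The optimized logical plan is also available programmatically.
val optimizedPlan = adults.queryExecution.optimizedPlan.toString

// The equivalent RDD code runs exactly as written; there is no plan to optimize.
val adultNames = spark.sparkContext
  .parallelize(Seq(("Ann", 34), ("Bob", 19)))
  .filter(_._2 > 21)
  .map(_._1)
  .collect()

spark.stop()
```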
Serialization
RDD: Uses Java serialization to store or distribute the data. This is expensive, because it requires sending both the data and its structure between nodes.
DataFrame: Serializes data into off-heap (in-memory) storage in a binary format and applies transformations to it. Uses the Tungsten execution engine to manage memory and to generate bytecode dynamically.
Dataset: The Dataset API uses the concept of encoders and stores a tabular representation using Tungsten.
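The encoder concept mentioned for Datasets can be illustrated with a short sketch. It assumes Spark 3.x in a spark-shell or worksheet style session; `Person` is a hypothetical class. Normally the encoder is picked up implicitly through `spark.implicits._`; it is passed explicitly here only to make it visible.

```scala
import org.apache.spark.sql.{Encoder, Encoders, SparkSession}

// Hypothetical domain class, used only for illustration.
case class Person(name: String, age: Int)

val spark = SparkSession.builder().master("local[*]").appName("encoder-sketch").getOrCreate()

// An encoder describes how a JVM object maps to Spark's tabular format.
val personEncoder: Encoder[Person] = Encoders.product[Person]

// The schema is known from the type alone, without inspecting any data.
val fieldNames = personEncoder.schema.fieldNames

// Build a Dataset with the encoder supplied explicitly.
val ds = spark.createDataset(Seq(Person("Ann", 34)))(personEncoder)
val rowCount = ds.count()

spark.stop()
```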
Garbage Collection
RDD: Incurs garbage collection overhead, since individual objects are created and destroyed.
DataFrame: Avoids the garbage collection costs of constructing individual objects for each row in the dataset.
Dataset: There is also no need for the garbage collector to destroy objects, because serialization takes place through Tungsten, which uses off-heap data serialization.
4. RDD DataFrame Dataset
Efficiency/Memory use
RDD: Not very memory efficient, because of the serialization overhead.
DataFrame: Using off-heap memory for serialization reduces the overhead.
Dataset: Allows operations to be performed directly on serialized data, which improves memory use.
Lazy Evaluation
RDD: Yes
DataFrame: Yes
Dataset: Yes
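Lazy evaluation means that transformations only describe work, and nothing runs until an action is called. The sketch below shows this for an RDD, assuming Spark 3.x in local mode in a spark-shell or worksheet style session; the accumulator name and the input range are illustrative. DataFrames and Datasets defer execution in the same way.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("lazy-sketch").getOrCreate()

// Counts how many elements the map function has actually processed.
val seen = spark.sparkContext.longAccumulator("elements seen")

val numbers = spark.sparkContext.parallelize(1 to 5)

// Transformation: only recorded in the lineage, not executed yet.
val doubled = numbers.map { x => seen.add(1); x * 2 }
val seenBeforeAction = seen.sum

// Action: triggers the job, which runs the map function.
val total = doubled.reduce(_ + _)
val seenAfterAction = seen.sum

spark.stop()
```

The accumulator stays at zero until `reduce` is called, which is what "lazy" means in practice here.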
Aggregation
RDD: The RDD API is slower at simple grouping and aggregation operations.
DataFrame: The DataFrame API is very easy to use. It is faster for exploratory analysis and for creating aggregated statistics on large data sets.
Dataset: Aggregation is faster on large amounts of data.
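For comparison, here is the same grouped average written against each API. This is a sketch assuming Spark 3.x in a spark-shell or worksheet style session; the `Sale` class and its values are hypothetical. It shows only how the code differs and makes no performance measurement.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.avg

// Hypothetical domain class, used only for illustration.
case class Sale(region: String, amount: Double)

val spark = SparkSession.builder().master("local[*]").appName("aggregation-sketch").getOrCreate()
import spark.implicits._

val sales = Seq(Sale("north", 10.0), Sale("north", 30.0), Sale("south", 5.0))

// RDD: the sum/count bookkeeping is written by hand.
val rddAvg = spark.sparkContext.parallelize(sales)
  .map(s => (s.region, (s.amount, 1)))
  .reduceByKey { case ((a1, c1), (a2, c2)) => (a1 + a2, c1 + c2) }
  .mapValues { case (sum, count) => sum / count }
  .collect().toMap

// DataFrame: a declarative groupBy/agg over named columns.
val dfAvg = sales.toDF()
  .groupBy("region")
  .agg(avg("amount").as("avg_amount"))
  .collect()
  .map(r => r.getString(0) -> r.getDouble(1))
  .toMap

// Dataset: typed grouping on the domain object.
val dsAvg = sales.toDS()
  .groupByKey(_.region)
  .mapGroups { (region, rows) =>
    val amounts = rows.map(_.amount).toList
    (region, amounts.sum / amounts.size)
  }
  .collect().toMap

spark.stop()
```

All three produce the same region-to-average mapping; the DataFrame version is the shortest because the aggregation is expressed as a built-in function rather than user code.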