Spark DataFrames provide a more optimized way to work with structured data than RDDs. DataFrames can skip unnecessary data partitions when querying, for example reading only the partitions that match criteria such as a date range. They also integrate well with storage formats like Parquet, which stores data in a columnar layout and lets queries skip unrelated columns to improve performance. The code examples demonstrate loading a CSV file into an RDD, finding and removing duplicate records, and counting duplicate records by key.
3. Spark Core
• Spark Core is the base engine for large-scale parallel and distributed data processing. It is responsible for:
• memory management and fault recovery
• scheduling, distributing and monitoring jobs on a cluster
• interacting with storage systems
4. Spark Core
• Spark introduces the concept of an RDD (Resilient Distributed Dataset):
• an immutable, fault-tolerant, distributed collection of objects that can be operated on in parallel.
• can contain any type of object and is created by loading an external dataset or by distributing a collection from the driver program.
• RDDs support two types of operations (see the sketch below):
• Transformations are operations (such as map, filter, join, union, and so on) that are performed on an RDD and yield a new RDD containing the result.
• Actions are operations (such as reduce, count, first, and so on) that return a value after running a computation on an RDD.
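A minimal spark-shell sketch of the difference (the file path and its contents are hypothetical): transformations only describe new RDDs, while actions trigger the computation and return values to the driver.

// Hypothetical input: a text file with one integer per line
val nums = sc.textFile("/tmp/numbers.txt").map(_.trim.toInt)

// Transformations: lazy, each yields a new RDD
val evens   = nums.filter(_ % 2 == 0)
val doubled = evens.map(_ * 2)

// Actions: run the computation and return values to the driver
val total   = doubled.reduce(_ + _)
val howMany = doubled.count
val firstOne = doubled.first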
5. Spark DataFrames
• The DataFrames API is inspired by data frames in R and Python (pandas), but designed from the ground up to support modern big data and data science applications:
• Ability to scale from kilobytes of data on a single laptop to petabytes on a large cluster
• Support for a wide array of data formats and storage systems
• State-of-the-art optimization and code generation through the Spark SQL Catalyst optimizer
• Seamless integration with all big data tooling and infrastructure via Spark
• APIs for Python, Java, Scala, and R (in development via SparkR)
11. Previously the RDD API was the primary interaction point with Spark; today the DataFrames API is considered the primary one, but RDDs remain available if needed (a small sketch follows).
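A minimal sketch of that relationship, assuming Spark 2.x where spark is the SparkSession provided by spark-shell (the path and header option are illustrative): work through the DataFrame API first, and drop down to the underlying RDD only when needed.

// Read a CSV file as a DataFrame (primary API)
val df = spark.read.option("header", "true").csv("/tmp/colleges.csv")
df.printSchema()
df.select("INSTNM").show(5)

// The underlying RDD is still reachable if needed
val rows = df.rdd            // RDD[org.apache.spark.sql.Row]
rows.take(2)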
12. What is partitioning in Apache Spark?
Partitioning is the key concept for making use of all of your hardware resources while executing a job.
More partitions = more parallelism.
So you should check how many task slots your hardware offers, i.e. how many tasks each executor can handle in parallel. Each partition can live on a different executor (see the sketch below).
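A small sketch of checking and adjusting the partition count (the path and the target of 16 partitions are illustrative):

val college = sc.textFile("/tmp/colleges.csv")
college.getNumPartitions            // how many tasks a stage over this RDD can run in parallel

// Shuffle into more partitions, e.g. to match the total number of executor cores
val repartitioned = college.repartition(16)
repartitioned.getNumPartitions      // 16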
14. DataFrames
• A DataFrame has a columnar structure, and each record is a row.
• You can run statistics naturally, since it works much like SQL or a Python/R data frame.
• With an RDD, to process only the last 7 days of data Spark needs to go through the entire dataset; with a DataFrame you already have a time column to handle that, so Spark won't even look at data older than 7 days (see the sketch below).
• Easier to program.
• Better performance and more compact storage in the executor heap.
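For example, a sketch of the last-7-days case with the DataFrame API (the dataset path and the event_date column are hypothetical):

import org.apache.spark.sql.functions.{col, current_date, date_sub}

val events = spark.read.parquet("/tmp/events")   // assumed to contain an "event_date" date column
// Only rows from the last 7 days are kept; older data can be skipped entirely
val lastWeek = events.filter(col("event_date") >= date_sub(current_date(), 7))
lastWeek.count()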
15. How does a DataFrame read less data?
• You can skip partitions while reading the data using a DataFrame.
• Using Parquet
• Skipping data using statistics (e.g. min, max)
• Using partitioning (e.g. year=2015/month=06/…), as sketched below
• Pushing predicates down into the storage system
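A sketch of the year/month layout mentioned above (paths and column names are hypothetical): data written partitioned by year and month can be read back with a filter that Spark turns into partition pruning, so only the matching directories are scanned.

import org.apache.spark.sql.functions.col

val events = spark.read.parquet("/tmp/events")   // assumed to already carry "year" and "month" columns

// Write the data partitioned by year and month: produces year=.../month=.../ style directories
events.write.partitionBy("year", "month").parquet("/tmp/events_by_month")

// This filter prunes partitions: only the matching year/month directory is read
val june2015 = spark.read.parquet("/tmp/events_by_month")
  .filter(col("year") === 2015 && col("month") === 6)
june2015.count()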
16. What is Parquet
17. • Parquet should be the source for any operation or ETL, so if the data arrives in a different format, the preferred approach is to convert the source to Parquet and then process it.
• If a dataset is in JSON or a comma-separated file, first ETL it to Parquet.
• It limits I/O, so Spark scans/reads only the columns that are needed.
• Parquet uses a columnar layout, so it compresses better and saves space.
• Parquet stores the first column as one chunk, then the next, and so on. So if the data has 3 columns and a SQL query touches only 2 of them, Parquet won't even read the 3rd one (a small ETL sketch follows).
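A minimal ETL sketch along those lines (paths and column names are hypothetical; assumes Spark 2.x): convert a CSV source to Parquet once, then query only the columns that are actually needed.

// One-time conversion: CSV (or JSON) source -> Parquet
val raw = spark.read.option("header", "true").csv("/tmp/CollegeScoreCard.csv")
raw.write.mode("overwrite").parquet("/tmp/college_parquet")

// Later queries read only the referenced column chunks from the columnar files
spark.read.parquet("/tmp/college_parquet")
  .select("INSTNM", "CITY")
  .show(5)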
18. Demo RDD Code
val college = sc.textFile("/Users/marksmith/TulsaTechFest2016/CollegeScoreCard.csv")
college.count
res2: Long = 7805
val collegeNoDups = college.distinct
collegeNoDups.count
res3: Long = 7805
val college = sc.textFile("/Users/marksmith/TulsaTechFest2016/colleges.csv")
college: org.apache.spark.rdd.RDD[String] = /Users/marksmith/TulsaTechFest2016/colleges.csv MapPartitionsRDD[17] at textFile at <console>:27
val cNoDups = college.distinct
cNoDups: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[20] at distinct at <console>:29
cNoDups.count
res7: Long = 7805
college.count
res8: Long = 9000
val cRows = college.map(x => x.split(",",-1))   // split each CSV line into columns, keeping trailing empty fields
cRows: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[21] at map at <console>:29
val cKeyRows = cRows.map(x => "%s_%s_%s_%s".format(x(0),x(1),x(2),x(3)) -> x )   // key = first four columns
cKeyRows: org.apache.spark.rdd.RDD[(String, Array[String])] = MapPartitionsRDD[22] at map at <console>:31
cKeyRows.take(2)
res11: Array[(String, Array[String])] = Array((UNITID_OPEID_opeid6_INSTNM,Array(
val cGrouped = cKeyRows.groupBy(x => x._1).map(x => (x._1,x._2.to[scala.collection.mutable.ArrayBuffer]))   // group records by key
cGrouped: org.apache.spark.rdd.RDD[(String, scala.collection.mutable.ArrayBuffer[(String, Array[String])])] = MapPartitionsRDD[27] at map at <console>:33
val cDups = cGrouped.filter(x => x._2.length > 1)   // keys that occur more than once, i.e. duplicates
cDups: org.apache.spark.rdd.RDD[(String, scala.collection.mutable.ArrayBuffer[(String, Array[String])])] = MapPartitionsRDD[28] at filter at <console>:35
cDups.count
res12: Long = 1195
val cNoDups = cGrouped.map(x => x._2(0))
cNoDups: org.apache.spark.rdd.RDD[(String, Array[String])] = MapPartitionsRDD[29] at map at <console>:35
cNoDups.count
res13: Long = 7805
val cNoDups = cGrouped.map(x => x._2(0)._2)   // keep only the first record for each key
cNoDups: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[30] at map at <console>:35
cNoDups.take(5)
16/08/04 16:44:24 ERROR Executor: Managed memory leak detected; size = 41227428 bytes, TID = 28
res16: Array[Array[String]] = Array(Array(145725, 00169100, 001691, Illinois Institute of Technology, Chicago, IL, www.iit.edu, npc.collegeboard.org/student/app/iit, 0, 3, 2, 11, 0, 0, 0, 0, 0, 0, 0, 0, 0,
NULL, 520, 640, 640, 740, 520, 640, 580, 690, 580, 25, 30, 24, 31, 26, 33, NULL, NULL, 28, 28, 30, NULL, 1252, 1252, 0, 0, 0.2026, 0, 0, 0, 0.1225, 0, 0, 0.4526, 0.0245
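For comparison, a hedged sketch of the same de-duplication with the DataFrame API (assumes Spark 2.x, that the file has a header row, and that the first four columns UNITID, OPEID, opeid6, INSTNM form the key, as in the RDD version above):

import org.apache.spark.sql.functions.col

val collegeDF = spark.read.option("header", "true").csv("/Users/marksmith/TulsaTechFest2016/colleges.csv")
collegeDF.count()                                       // total rows, duplicates included

val key = Seq("UNITID", "OPEID", "opeid6", "INSTNM")
val noDupsDF = collegeDF.dropDuplicates(key)            // keep one row per key
noDupsDF.count()

// Keys that appear more than once, i.e. the duplicated records
val dupKeys = collegeDF.groupBy(key.map(col): _*).count().filter(col("count") > 1)
dupKeys.count()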