An over-ambitious introduction to Spark programming, test and deployment. This slide tries to cover most core technologies and design patterns used in SpookyStuff, the fastest query engine for data collection/mashup from the deep web.
For more information please follow: https://github.com/tribbloid/spookystuff
A bug in PowerPoint used to cause transparent background color not being rendered properly. This has been fixed in a recent upload.
Unlocking the Future of AI Agents with Large Language Models
How to build your query engine in spark
1. How to build your
query engine in Spark
Peng
Engineer@anchorbot
Love machine learning & algorithms
Part-time Mahout committer
2. Prior Knowledge
Scala – not important, it's always changing, so if you don't know it, congratulation you
don't have to learn it again and re-become the grandmaster you were.
Functional Programming – very important!
But not too functional – equally important! Will be explained later
Working with Amazon EC2/S3 – spot instances are dirt-cheap and unleash the power of
auto-scaling. (assuming you want to finish things in short burst AND your 8-hours sleep)
You DON'T have to know Hadoop, YARN OR HDFS (but highly recommended).
You DON'T have to know MapReduce, DAG dependency model OR Apache Akka (I never
did)
You DON'T have to know Machine Learning OR Data Science.
3. Guideline
Basic: RDD, Transformations and Actions.
Basic: Testing, Packaging and Deployment.
Advanced: Partitioning, Distribution and Staging.
Expert: Composite Mapping and Accumulator.
Example: A query engine for distributed web scraping.
Q&A.
4. Programming Bricks
Entities: Generic data abstractions
• RDD[T] (Resilient Distributed Dataset): a collection of java
objects spread across your cluster, inaccessible from your
local computer.
• LAR[T] (Locally Accessible Resources): a data source/sink
you can read/write from your local computer.
• Can be many things including but not limited to: JVM memory
block on your computer, local files, files on HDFS, files on S3,
tables in C* (new!), tables in Hive, Twitter feed(read-only), other
web API feed (read-only)
• This list is still growing.
Mappings: Methods that cast one entity to another
Parallelization: LAR[T] => RDD[T]
Transformation(f: {T} => {K}): RDD[T] => RDD[K],
generalizes map
Action(f: {K} => {K}): RDD[K] => LAR[K], generalizes reduce
RDDs
LARs
Transformation
ParallelizationAction
Plain java code
5. Programming Bricks
These bricks are atomic black boxes, do not attempt to
break or reverse-engineer them! If you want to try ------->
Instead, try to be agnostic and construct your complex
algorithm and framework by wiring them like IC chips.
They form a much larger superset of Map/Reduce.
They have no problem constructing the most complex
distributed algorithms in ML and Graph analysis.
Developers of Spark has made great effort in
abstracting these complex and ugly trivia away from you
so you can concentrate on the beauty of your
algorithm.
6. Advantages
Probably not optimized to the core.
But once you fit into the paradigm…
• No more thread unsafety, racing condition, resource pool, consumer starvation, buffer overflow,
deadlock, JVM OutOfHeapSpaceException, or whatever absurdities.
• No more RPC timeout, service unavailable, 500 internal server error, or Chaos Monkey’s miracles.
• No more weird exception that only happens after being deployed to cluster, local multi-thread
debugging and test capture 95% of them.
• No dependency on any external database, message queue, or a specific file system (pick one from
local, HDFS, S3, CFS and change it later in 5 seconds)
• Your code will be stripped down to its core
• 10~20% of your original code in cluster computing!
• 30~50% of that in multi-thread computing
7. Guideline
Basic: RDD, Transformations and Actions.
Basic: Testing, Packaging and Deployment.
Advanced: Partitioning, Distribution and Staging.
Expert: Composite Mapping and Accumulator.
Example: A query engine for distributed web scraping.
Q&A.
8. Testing
The first thing you should do even before cluster setup because:
On a laptop with 8 cores it is still a savage beast that outperforms most other programs with similar
size.
Does not require packaging and uploading, both are slow.
Read all logs from console output.
Is a self-contained multi-threaded process that fits into any debugger and test framework.
Supports 3 modes, set by ‘--master’ parameter
local[*]: use all local cores, won’t do failover! (better paranoid than sorry)
local[n,t]: use n cores (support for * is missing), will retry each task t-1 times.
local-cluster[n,c,m]: cluster-simulation mode!
Simulate a mini-cluster of size c, each computer has n cores and m megabytes of memory.
Technically no longer a single process, it will simulate everything including data distribution over
network.
As a result, you have to package first, do not support debugging, and better not using it in unit test.
Will expose 100% of your errors in local run.
9. Master
Seed node
Resource negotiator
3 options:
Native: lightweight, well tested, ugly UI,
primary/backup redundancy delegated to
ZooKeeper, support auto-scaling (totally abused
by DataBricks), recommended for beginners
YARN: scalable, heavyweight, threads run in
containers, beautiful UI, swarm redundancy
Mesos: don’t know why its still here
Remember the master URL on its UI after
setup, you are going to use it everywhere.
10. Worker
The muscle and the real deal
Report status to master and shuffle data to each
other.
Cores are segregated and share nothing in
computation, except broadcasted variables.
Disposable! Can be added or removed at will,
enables fluent scaling
3 options:
$SPARK_HOME/bin/spark-class
org.apache.spark.deploy.worker.Worker $MASTER_URL:
both the easiest and most flexible, support auto-scaling
by adding this line into startup script.
…/bin/start-all: launch both master and workers, need
to setup password-less ssh login first.
…/ec2/spark-ec2: launch many things on EC2 including
an in-memory HDFS, too heavyweight and too many
options hardcoded.
11. Driver
Node/JVM that runs your main function.
Merged with a random worker in cluster deploy
mode (see next page)
Distribute data
Control staging
Collect action results and accumulator changes.
Technically not part of cluster but still better to be
close to all other nodes. (Important for iterative
jobs)
Must have a public DNS to master! otherwise will
cause:
WARNING: Initial job has not accepted any resources…
$SPARK_HOME on it has to be identical to that on
workers (This is really sloppy but people no longer
care)
12. Packaging
Generate the all inclusive ‘fat/über’ JAR being distributed to nodes.
Self-contained, should include everything in your program’s dependency tree
This JAR won’t be generated by default, you have to generate it by:
Enable maven-shade plugin and run mvn package
Enable sbt-assembly plugin and run sbt> assembly
… EXCEPT those who overlap with Spark’s dependencies (and all modules’ dependencies,
including but not limited to: SparkSQL, Streaming, Mllib and GrpahX).
Excluding them by setting the scope of Spark artifact(s) in your dependency list to
‘provided’
You don’t have to do it but this decrease your JAR size by 90M+.
They already exist in $SPARK_HOME/lib/*.jar and will always be loaded BEFORE your JAR.
if your program and Spark have overlapping dependencies but in different versions, yours
will be ignored in runtime (Java’s first-found-first-serve principal), and you go straight into...
13. JAR hell
Manifest itself as either one of these errors that only appears after packaging:
NoClassDefFoundError
ClassNotFoundException
NoSuchFieldError
NoSuchMethodError
Unfortunately many dependencies of Spark are severely not up-to-date.
Even more unfortunately the list of these outdated dependencies is still growing, a curse
bestowed by Apache Foundation.
Switching to YARN won’t resolve it! It just box threads with containers but won’t change
class loading sequence.
Only (ugly but working) solution so far: package relocation!
Supported by maven-shade by setting relocation rule, don’t know how to do this in sbt :-<
Probably have third-party plugins that can detect it from dependency, need more testing.
Not very compatible with some IDE, if reporting a classpath error please re-import the
project.
14. Maven vs sbt
Maven
• The most extendable and widely-supported
build tool.
• Native to Java, but all Scala dependencies
are Java bytecode.
• Need maven-scala and maven-shade plugins
• I don’t know why but Spark official repo just
switched from sbt to maven after 0.9.0.
• Apparently slightly faster than ivy
• A personal tool of choice.
Simple Build Tool (used to be simple)
• No abominable xml
• Native to Scala
• Self-contained executable
• Beautiful build report by ivy backend
• Need sbt-assembly plugin (does NOT
support relocation :-<)
15. Deployment
$SPARK_HOME/bin/spark-submit --master $SPARK_MASTER_URL --jar $YOUR_JARS
full.path.to.your.main.object.
This command do everything including distributing JARS and run main function locally
as the driver. (a.k.a. client deploy mode)
Alternatively you can move the driver to a random node by overriding ‘--deploy-
mode’ to ‘cluster’, but it’s not recommended for beginners, reasons:
Don’t know which node until seeing Spark UI
Driver takes extra CPU and bandwidth load.
Cannot use local JAR –you have to upload it to HDFS or S3 first.
spark-submit is dumb, don’t know where to find it in JAR distribution dir.
Useless to any Spark-shell! And a few other things.
If its part of an SOA, have fun pointing all other clients to it.
16. Guideline
Basic: RDD, Transformations and Actions.
Basic: Testing, Packaging and Deployment.
Advanced: Partitioning, Distribution and Staging.
Expert: Composite Mapping and Accumulator.
Example: A query engine for distributed web scraping.
Q&A.
17. Partitioning (a.k.a. shuffling)
RDD[T].Partition: a smallest inseparable chunk of T, each cannot
spread over 2 cores or threads. -> generating each partition only
requires a self-contained single-thread subroutine (called task) that
won’t screw up and induces no overhead on
scheduling/synchronization.
Default number of partitions is the total number of cores in a cluster,
works great if workload on each partition is fairly balanced.
Otherwise some cores will finish first and fence in your cluster ------>
you’d better override this:
Many transformations and parallelizations takes an optional Int
parameter to result in RDD with desired number of partitions.
RDD[T].repartition(n: Int) returns an RDD[T] with identical content
but different number of partitions, also rebalance sizes of
partitions.
RDD[T].coalesce(n: Int) merge closest partitions (ideally on one
node) together. This is an incomplete partitioning (no shuffling)
which makes it faster.
18. Resilience
Partition is also the smallest unit to be discarded and regenerated from scratch
whenever:
The task generating it throws an exception and quit. (regardless you can customize
your mappings to retry locally inside the task thread to avoid discarding the already
succeeded part, Matei confirm this last time)
It is lost in a power outage or being disconnected from the cluster
(When speculative task is enabled) the task generating it takes too long to finish
comparing to the time to finish most other partitions. (In this case the old one
won’t be discarded but race with the new one)
It is being slowly (I mean discouragingly slow) redistributed from another node or
loaded from a remote cache (e.g. HDFS&S3) yet all its dependencies (prior
Partitions and mappings needed to generate it) are available locally. When this
really happens, you and your network admin will have a problem.
19. Rule No. 1:
No. partitions >= No. coresMore partitions/smaller partition =
Higher scheduling and distribution overhead
Higher workload/bandwidth consumption for driver/master node
Lower cost for retry/regeneration
Better scheduling for unbalanced partitioning and speculative task
Easier monitoring of progress
Less partitions/bigger partition =
Lower scheduling and distribution overhead
Lower workload/bandwidth consumption for driver/master node
Higher cost for retry/regeneration (again you can retry inside thread first)
Longer waiting time for unbalanced partitioning and speculative task
Progress bar will stay at 0% until you lose hope
20. Distribution
Both RDD[T] and Mappings are (hopefully evenly) spread across cores.
Supports 3 modes, from fast to slow:
1. (fastest) JAR!
Contains all precompiled java bytecode and artifacts, which include but not limited to:
IMMUTABLE static objects or constants (Mutable static objects are strictly forbidden:
if you modify any of these locally in runtime no other node will know it, and you go
straight into mutable hell), class methods, precompiled anonymous functions
EXCLUDING closures (they are just methods of classes that inherit Function interfaces.
BTW nice lambda operator, Java 8), manifest, anything you package into jar for
whatever reason
Not include: fields of dynamic objects (initialized at runtime), closure of anonymous
functions (same old thing), variables.
Only (and always) happens once before execution and reused in the lifespan of a job.
You can find the distributed jar in $SPARK_HOME/work/$JOB_UUID/… of each node
21. Distribution
2. (faster) Broadcast (bonus feature, prioritized due to importance)
Basically a cluster-wide singleton initialized in run-time.
Happens immediately during singleton’s initialization using an eponymous static function:
val wrapper = spark.broadcast(thing: T <: Serializable)
After which you can define any mapping to read it repeatedly across the entire cluster, by
using this wrapper in the mapping’s parameter:
wrapper.value
You can’t write it, its IMMUTABLE, will cause racing condition if you can anyway.
3. (fast) Shipping (Easiest, also the only way to distribute heterogeneous objects created at
runtime including non-singletons and closures)
also used to distribute broadcast wrappers (very small/fast)
Still much faster than reading from ANY non-local file system. (Giving it a performance
edge over Hadoop MR)
Happens automatically in partitioning, triggered on-demand.
You can see its time cost in ‘Shuffle Read/Write’ column in job UI.
22. Serialization Hell
Broadcast and shipping demands that all objects being distributed are SERIALIZABLE:
...broadcast[T <: Serializable](…
RDD[T <: Serializable]
…map(f <: Function[T,K] with Serializable)
Otherwise deep copy is no-can-do and program throws NotSerializableError.
Easiest (and most effective) solution: Don’t be a too functional! only put simple types and
collections in RDDs and closures. Will also makes shipping faster (very important for iterative
algorithms, beware R programmers.)
If not possible you still have 2 options:
Wrap complex objects with a serializable wrapper (recommended, used by many Spark
parallelizations/actions to distribute HDFS/S3 credentials)
Switch to Kryo Serializer (shipping is faster in most cases and favored by Mahout due to
extremely iterative ML algorithms, I haven’t tried yet)
Happens even at shipping between cores, only becomes useless when broadcasting locally
(singleton is not bind to cores).
One of the rare cases where you cluster-wise deployment fails yet local test succeeds.
23. Staging
A stage contains several mappings that are concatenated into embarrassingly
parallelizable longer tasks.
E.g. map->map = 1 stage, map->reduce = 2 stages
Technically reduce can start after a partition of its preceding map is generated, but
Spark is not that smart (or unnecessarily complex).
Staging can only be triggered by the following mappings:
All actions.
Wide transformations.
CanNOT be triggerd by caching or checkpointing: They are also embarrasingly
parallelizable.
24. Wide Transformations?
Narrow (no
partitioning):
• Map
• FlatMap
• MapPartitions
• Filter
• Sample
• Union
Wide (partitioning):
• Intersection
• Distinct
• ReduceByKey
• GroupByKey
• Join
• Cartesian
• Repartition
• Coalesce (I know its
incomplete but WTH)
25. Guideline
Basic: RDD, Transformations and Actions.
Basic: Testing, Packaging and Deployment.
Advanced: Partitioning, Distribution and Staging.
Expert: Composite Mapping and Accumulator.
Example: A query engine for distributed web scraping.
Q&A.
26. Composite Mapping
Used to create additional mappings and DSL that do complex things in one line, also reduces
jar and closure size. However…
You can’t inherit or extend RDD of which exact type is abstracted away from you.
You can’t break basic mappings as atomic black-boxes.
Only solution: use Scala implicit view!
Define a wrapper of RDD[T], implement all your methods/operators in it by wiring the
plethora of programming bricks.
Create a static object (usually referred as context) with an implicit method that converts
RDD[T] to your wrapper (referred as implicit converter)
In your main function, import the implicit converter by:
Import context._
Voila, you can use those methods/operators like mappings on any RDD[T].
Another reason why Scala is the language of choice.
27. Accumulator
Used to create counters, progress trackers, and performance metrics of things that are
not displayed on UI.
Created from eponymous function:
Val acc = spark.accumulator(i)
Only readable in main function, but can be updated anywhere in parallel by using:
acc += j
Type of i and j must be identical and inherit AccumulatorParam
No constraint on implementation, but order of j should have no impact on final result.
Prefer simple and fast implementation.
Updated in real time, but requires an extra thread or non-blocking function to read locally when
the main thread is blocked by stage execution.
28. Guideline
Basic: RDD, Transformations and Actions.
Basic: Testing, Packaging and Deployment.
Advanced: Partitioning, Distribution and Staging.
Expert: Composite Mapping and Accumulator.
Example: A query engine for distributed web scraping.
Q&A.
30. Implementation
Only 2 entities:
RDD[ActionPlan]: A sequence of “human” actions executable on a browser.
RDD[Page]: A web page or one of its sections.
Query syntax is entirely defined by composite mappings.
Mappings on RDD[ActionPlan] are infix operators
Mappings on RDD[Page] are methods with SQL/LINQ-ish names.
Enabled by importing context._, and work with other mapping extensions.
Web UI is ported from iScala-notebook, which is ported from iPython-notebook.
35. More mutable hell?
Many sources have claimed that RDD[T] enforces immutable pattern for T, namely:
T contains only immutable (val) fields, which themselves are also immutable types and collections.
Content of T cannot be modified in-place inside any mapping, always create a slightly different deep
copy.
However this pattern is perhaps less rigorous by now, reasons:
Serialization and shipping happens ubiquitously between stages which always enforce a deep copy.
2 threads/tasks cannot access one memory block by design to avoid racing.
Any partition modified by a mapping then used by another is either regenerated or loaded from a
previous cache/checkpoint created BEFORE the modifying mapping, in both cases those
modifications are simply discarded without collateral damage.
Immutable pattern requires great discipline in Java and Python.
So, my experience: just make sure your var and shallow copies won’t screw up things INSIDE each
single-threaded task, anticipate discarded changes, and theoretically you’ll be safe.
Of course there is no problem in using immutable pattern if you feel insecure.
Again mutable static object is FORBIDDEN.
36. Lazy Evaluation
RDD is empty at creation, only its type and partition ids are determined.
Can only be redeemed by an action called upon itself or its downstream RDD.
After which a recursive resolve request will be passed along its partitions’
dependency tree(a directed acyclic graph or DAG), resulting in their tasks
being executed in sequence after their respect dependencies are redeemed.
A task with all its dependencies are largely omitted if the partition it generates is already
cached, so always cache if you derive 2+ LARs from 1 RDD
Knowing this is trivial to programming until you start using first() action
Caching an RDD, call first() on it (triggers several mini-stages that has 1
partition/task), then do anything that triggers full-scale staging will take twice
as long as doing it otherwise. Spark’s caching and staging mechanism is not
smart enough to know what will be used in later stages.
37. Anecdotes
•Partitioning of an RDD[T] depends on the data structure of
T and who creates it, and is rarely random (yeah it’s also
called ‘shuffling’ to fool you guys).
• E.g. an RDD[(U,V)] (a.k.a. PairRDD) will use U as partition
keys and distribute (U,V) in a C* token ring-ish fashion for
obvious reasons.
• More complex T and expected usage case will results in
increasingly complex RDD implementations, notably the
SchemaRDD in SparkSQL.
• Again don’t try to reverse-engineer them unless you are
hardcore.