How to build your query engine in Spark
Peng
Engineer @ anchorbot
Loves machine learning & algorithms
Part-time Mahout committer
Prior Knowledge
 Scala – not important; it's always changing, so if you don't know it, congratulations: you
don't have to learn it again and re-become the grandmaster you once were.
 Functional Programming – very important!
 But not too functional – equally important! Will be explained later.
 Working with Amazon EC2/S3 – spot instances are dirt-cheap and unleash the power of
auto-scaling (assuming you want to finish things in short bursts AND keep your 8 hours of sleep).
 You DON'T have to know Hadoop, YARN OR HDFS (but highly recommended).
 You DON'T have to know MapReduce, DAG dependency model OR Apache Akka (I never
did)
 You DON'T have to know Machine Learning OR Data Science.
Guideline
 Basic: RDD, Transformations and Actions.
 Basic: Testing, Packaging and Deployment.
 Advanced: Partitioning, Distribution and Staging.
 Expert: Composite Mapping and Accumulator.
 Example: A query engine for distributed web scraping.
 Q&A.
Programming Bricks
Entities: Generic data abstractions
• RDD[T] (Resilient Distributed Dataset): a collection of java
objects spread across your cluster, inaccessible from your
local computer.
• LAR[T] (Locally Accessible Resources): a data source/sink
you can read/write from your local computer.
• Can be many things including but not limited to: JVM memory
block on your computer, local files, files on HDFS, files on S3,
tables in C* (new!), tables in Hive, Twitter feed(read-only), other
web API feed (read-only)
• This list is still growing.
Mappings: Methods that cast one entity to another
 Parallelization: LAR[T] => RDD[T]
 Transformation(f: {T} => {K}): RDD[T] => RDD[K],
generalizes map
 Action(f: {K} => {K}): RDD[K] => LAR[K], generalizes reduce
[Diagram: Parallelizations go from LARs to RDDs, Transformations map RDDs to RDDs, Actions go from
RDDs back to LARs, and plain Java code works on LARs.]
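A minimal sketch of the three kinds of mappings in Scala; it assumes nothing beyond a local SparkContext, and the names and numbers are purely illustrative:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object BricksDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("bricks-demo").setMaster("local[*]"))

    // Parallelization: LAR[Int] (a local collection) => RDD[Int]
    val numbers = sc.parallelize(1 to 1000)

    // Transformation: RDD[Int] => RDD[Int], lazily defines a new RDD
    val squares = numbers.map(n => n * n)

    // Action: RDD[Int] => LAR[Int], a plain local value back on the driver
    val total = squares.reduce(_ + _)
    println(total)

    sc.stop()
  }
}
```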
Programming Bricks
These bricks are atomic black boxes; do not attempt to
break or reverse-engineer them!
Instead, try to be agnostic and construct your complex
algorithm and framework by wiring them like IC chips.
 They form a much larger superset of Map/Reduce.
 They have no problem constructing the most complex
distributed algorithms in ML and Graph analysis.
 The developers of Spark have made a great effort to
abstract these complex and ugly trivia away from you
so you can concentrate on the beauty of your
algorithm.
Advantages
Probably not optimized to the core.
But once you fit into the paradigm…
• No more thread unsafety, race conditions, resource pools, consumer starvation, buffer overflows,
deadlocks, JVM OutOfHeapSpaceException, or whatever other absurdities.
• No more RPC timeouts, service unavailable, 500 internal server errors, or Chaos Monkey’s miracles.
• No more weird exceptions that only happen after deployment to the cluster; local multi-threaded
debugging and tests capture 95% of them.
• No dependency on any external database, message queue, or a specific file system (pick one from
local, HDFS, S3, CFS and change it later in 5 seconds).
• Your code will be stripped down to its core:
• 10~20% of your original code in cluster computing!
• 30~50% of that in multi-threaded computing
Guideline
 Basic: RDD, Transformations and Actions.
 Basic: Testing, Packaging and Deployment.
 Advanced: Partitioning, Distribution and Staging.
 Expert: Composite Mapping and Accumulator.
 Example: A query engine for distributed web scraping.
 Q&A.
Testing
The first thing you should do even before cluster setup because:
 On a laptop with 8 cores it is still a savage beast that outperforms most other programs of similar
size.
 Does not require packaging and uploading, both of which are slow.
 All logs can be read from console output.
 It is a self-contained multi-threaded process that fits into any debugger and test framework.
Supports 3 modes, set by ‘--master’ parameter
 local[*]: use all local cores, won’t do failover! (better paranoid than sorry)
 local[n,t]: use n cores (support for * is missing), will retry each task t-1 times.
 local-cluster[n,c,m]: cluster-simulation mode!
 Simulate a mini-cluster of size c, each computer has n cores and m megabytes of memory.
 Technically no longer a single process, it will simulate everything including data distribution over
network.
 As a result, you have to package first, it does not support debugging, and you’d better not use it in unit tests.
 It will expose 100% of the errors you would otherwise only see on a real cluster.
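A hedged sketch of how these modes map onto the programmatic equivalent of the ‘--master’ parameter; the app name and figures are arbitrary, and you should check your Spark version’s docs for the exact local-cluster parameter order:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// All local cores, no failover
val confAllCores = new SparkConf().setAppName("local-test").setMaster("local[*]")

// 4 cores, each failed task retried (the second number is the max attempts per task)
val confWithRetry = new SparkConf().setAppName("local-test").setMaster("local[4,4]")

// Cluster-simulation mode: 2 simulated workers with 2 cores and 1024 MB each
// (requires packaging first, so keep it out of unit tests)
val confSimulated = new SparkConf().setAppName("local-test").setMaster("local-cluster[2,2,1024]")

val sc = new SparkContext(confAllCores)
```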
Master
Seed node
Resource negotiator
3 options:
 Native: lightweight, well tested, ugly UI,
primary/backup redundancy delegated to
ZooKeeper, supports auto-scaling (totally abused
by Databricks), recommended for beginners
 YARN: scalable, heavyweight, threads run in
containers, beautiful UI, swarm redundancy
 Mesos: don’t know why it’s still here
Remember the master URL on its UI after
setup, you are going to use it everywhere.
Worker
The muscle and the real deal
Reports status to the master and shuffles data to the other
workers.
Cores are segregated and share nothing in
computation, except broadcast variables.
Disposable! Can be added or removed at will,
enabling fluent scaling.
3 options:
 $SPARK_HOME/bin/spark-class
org.apache.spark.deploy.worker.Worker $MASTER_URL:
both the easiest and most flexible; supports auto-scaling
by adding this line to the startup script.
 …/bin/start-all: launches both master and workers; you need
to set up password-less ssh login first.
 …/ec2/spark-ec2: launches many things on EC2 including
an in-memory HDFS; too heavyweight, with too many
options hardcoded.
Driver
Node/JVM that runs your main function.
Merged with a random worker in cluster deploy
mode (see next page)
Distribute data
Control staging
Collect action results and accumulator changes.
Technically not part of cluster but still better to be
close to all other nodes. (Important for iterative
jobs)
Must have a public DNS reachable by the master! Otherwise it will
cause:
WARNING: Initial job has not accepted any resources…
$SPARK_HOME on it has to be identical to that on
workers (This is really sloppy but people no longer
care)
Packaging
Generates the all-inclusive ‘fat/über’ JAR to be distributed to the nodes.
Self-contained, should include everything in your program’s dependency tree
This JAR won’t be generated by default, you have to generate it by:
Enable maven-shade plugin and run mvn package
Enable sbt-assembly plugin and run sbt> assembly
… EXCEPT those that overlap with Spark’s dependencies (and all modules’ dependencies,
including but not limited to: Spark SQL, Streaming, MLlib and GraphX).
 Exclude them by setting the scope of the Spark artifact(s) in your dependency list to
‘provided’.
 You don’t have to do it, but this decreases your JAR size by 90 MB+.
 They already exist in $SPARK_HOME/lib/*.jar and will always be loaded BEFORE your JAR.
 If your program and Spark have overlapping dependencies but in different versions, yours
will be ignored at runtime (Java’s first-found-first-served principle), and you go straight into...
JAR hell
Manifests itself as one of these errors that only appear after packaging:
 NoClassDefFoundError
 ClassNotFoundException
 NoSuchFieldError
 NoSuchMethodError
 Unfortunately many dependencies of Spark are severely out of date.
 Even more unfortunately, the list of these outdated dependencies is still growing, a curse
bestowed by the Apache Foundation.
 Switching to YARN won’t resolve it! It just boxes threads in containers but won’t change the
class-loading sequence.
Only (ugly but working) solution so far: package relocation!
 Supported by maven-shade by setting a relocation rule; don’t know how to do this in sbt :-<
 There are probably third-party plugins that can detect it from the dependency tree; needs more testing.
 Not very compatible with some IDEs; if a classpath error is reported, re-import the
project.
Maven vs sbt
Maven
• The most extendable and widely-supported
build tool.
• Native to Java, but all Scala dependencies
are Java bytecode.
• Need maven-scala and maven-shade plugins
• I don’t know why, but the official Spark repo just
switched from sbt to Maven after 0.9.0.
• Apparently slightly faster than ivy.
• My personal tool of choice.
Simple Build Tool (used to be simple)
• No abominable xml
• Native to Scala
• Self-contained executable
• Beautiful build report by ivy backend
• Need sbt-assembly plugin (does NOT
support relocation :-<)
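A minimal build.sbt sketch of the ‘provided’ trick mentioned under Packaging, assuming the sbt-assembly plugin is enabled in project/plugins.sbt; the version numbers are illustrative only:

```scala
// build.sbt (sketch): keep Spark itself out of the fat JAR
name := "my-spark-job"

scalaVersion := "2.10.4"

libraryDependencies ++= Seq(
  // Already on every node under $SPARK_HOME/lib, so mark it 'provided'
  "org.apache.spark" %% "spark-core" % "1.0.0" % "provided",
  // Your own dependencies still go into the assembly as usual
  "org.jsoup" % "jsoup" % "1.7.3"
)
```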
Deployment
$SPARK_HOME/bin/spark-submit --master $SPARK_MASTER_URL --jar $YOUR_JARS
full.path.to.your.main.object.
This command does everything, including distributing JARs, and runs the main function locally
as the driver (a.k.a. client deploy mode).
Alternatively you can move the driver to a random node by overriding ‘--deploy-
mode’ to ‘cluster’, but it’s not recommended for beginners, for these reasons:
 You don’t know which node until you see the Spark UI.
 The driver takes extra CPU and bandwidth load.
 Cannot use a local JAR – you have to upload it to HDFS or S3 first.
 spark-submit is dumb and doesn’t know where to find it in the JAR distribution dir.
 Useless to any Spark shell! And a few other things.
 If it’s part of an SOA, have fun pointing all the other clients to it.
Guideline
 Basic: RDD, Transformations and Actions.
 Basic: Testing, Packaging and Deployment.
 Advanced: Partitioning, Distribution and Staging.
 Expert: Composite Mapping and Accumulator.
 Example: A query engine for distributed web scraping.
 Q&A.
Partitioning (a.k.a. shuffling)
RDD[T].Partition: the smallest inseparable chunk of T; each cannot
spread over 2 cores or threads. As a result, generating each partition only
requires a self-contained single-threaded subroutine (called a task) that
won’t screw up and incurs no overhead on
scheduling/synchronization.
The default number of partitions is the total number of cores in the cluster,
which works great if the workload on each partition is fairly balanced.
Otherwise some cores will finish first and sit idle, fencing in your cluster, so
you’d better override this (see the sketch below):
 Many transformations and parallelizations take an optional Int
parameter to produce an RDD with the desired number of partitions.
 RDD[T].repartition(n: Int) returns an RDD[T] with identical content
but a different number of partitions; it also rebalances the sizes of the
partitions.
 RDD[T].coalesce(n: Int) merges the closest partitions (ideally on one
node) together. This is an incomplete repartitioning (no shuffling),
which makes it faster.
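A short sketch of the three overrides above; the input path and partition counts are arbitrary placeholders:

```scala
// Parallelization with an explicit partition count instead of the default
val lines = sc.textFile("hdfs:///some/input", 64)

// Rebalance into 128 evenly sized partitions (full shuffle)
val rebalanced = lines.repartition(128)

// Merge down to 16 partitions without a shuffle (cheaper, but may stay unbalanced)
val merged = rebalanced.coalesce(16)
```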
Resilience
Partition is also the smallest unit to be discarded and regenerated from scratch
whenever:
 The task generating it throws an exception and quits. (Regardless, you can customize
your mappings to retry locally inside the task thread to avoid discarding the already-
succeeded part; Matei confirmed this last time.)
 It is lost in a power outage or its node is disconnected from the cluster.
 (When speculative tasks are enabled) the task generating it takes too long to finish
compared to the time to finish most other partitions. (In this case the old one
won’t be discarded but will race with the new one.)
 It is being slowly (I mean discouragingly slowly) redistributed from another node or
loaded from a remote cache (e.g. HDFS & S3) yet all its dependencies (prior
partitions and mappings needed to generate it) are available locally. When this
really happens, you and your network admin will have a problem.
Rule No. 1:
No. of partitions >= No. of cores
More partitions / smaller partitions =
Higher scheduling and distribution overhead
Higher workload/bandwidth consumption for the driver/master node
Lower cost for retry/regeneration
Better scheduling for unbalanced partitioning and speculative tasks
Easier monitoring of progress
Fewer partitions / bigger partitions =
Lower scheduling and distribution overhead
Lower workload/bandwidth consumption for the driver/master node
Higher cost for retry/regeneration (again, you can retry inside the thread first)
Longer waiting time for unbalanced partitioning and speculative tasks
Progress bar will stay at 0% until you lose hope
Distribution
Both RDD[T] and Mappings are (hopefully evenly) spread across cores.
Supports 3 modes, from fast to slow:
1. (fastest) JAR!
 Contains all precompiled Java bytecode and artifacts, which include but are not limited to:
IMMUTABLE static objects or constants (mutable static objects are strictly forbidden:
if you modify any of these locally at runtime no other node will know it, and you go
straight into mutable hell), class methods, precompiled anonymous functions
EXCLUDING closures (they are just methods of classes that inherit the Function interfaces;
BTW nice lambda operator, Java 8), the manifest, and anything else you package into the JAR for
whatever reason.
 Does not include: fields of dynamic objects (initialized at runtime), closures of anonymous
functions (same old thing), variables.
 Only (and always) happens once before execution and is reused for the lifespan of a job.
 You can find the distributed jar in $SPARK_HOME/work/$JOB_UUID/… of each node
Distribution
2. (faster) Broadcast (bonus feature, prioritized due to importance)
 Basically a cluster-wide singleton initialized in run-time.
 Happens immediately during singleton’s initialization using an eponymous static function:
 val wrapper = spark.broadcast(thing: T <: Serializable)
 After which you can define any mapping to read it repeatedly across the entire cluster, by
using this wrapper in the mapping’s parameter:
 wrapper.value
 You can’t write to it, it’s IMMUTABLE; writing would cause race conditions anyway.
3. (fast) Shipping (Easiest, also the only way to distribute heterogeneous objects created at
runtime including non-singletons and closures)
 also used to distribute broadcast wrappers (very small/fast)
 Still much faster than reading from ANY non-local file system. (Giving it a performance
edge over Hadoop MR)
 Happens automatically in partitioning, triggered on-demand.
 You can see its time cost in ‘Shuffle Read/Write’ column in job UI.
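A minimal sketch of the broadcast pattern above, assuming a SparkContext named sc and an existing RDD[String] called keys; the lookup table is just an example:

```scala
// A cluster-wide, read-only singleton initialized at runtime on the driver
val lookup: Map[String, Int] = Map("a" -> 1, "b" -> 2)
val wrapper = sc.broadcast(lookup)

// Any mapping can read the broadcast value repeatedly, on any node
val tagged = keys.map(k => (k, wrapper.value.getOrElse(k, -1)))
```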
Serialization Hell
Broadcast and shipping demand that all objects being distributed are SERIALIZABLE:
 ...broadcast[T <: Serializable](…
 RDD[T <: Serializable]
 …map(f <: Function[T,K] with Serializable)
Otherwise deep copy is a no-can-do and the program throws NotSerializableException.
Easiest (and most effective) solution: don’t be too functional! Only put simple types and
collections in RDDs and closures. This will also make shipping faster (very important for iterative
algorithms; beware, R programmers).
If not possible you still have 2 options:
 Wrap complex objects with a serializable wrapper (recommended, used by many Spark
parallelizations/actions to distribute HDFS/S3 credentials)
 Switch to Kryo Serializer (shipping is faster in most cases and favored by Mahout due to
extremely iterative ML algorithms, I haven’t tried yet)
Happens even when shipping between cores; it only becomes a non-issue when broadcasting locally
(a singleton is not bound to cores).
 One of the rare cases where your cluster-wide deployment fails yet the local test succeeds.
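For the Kryo option, a hedged configuration sketch; the registrator class is hypothetical and registering classes is optional:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Switch shipping/broadcast serialization from default Java serialization to Kryo
val conf = new SparkConf()
  .setAppName("kryo-demo")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Optional: register the classes you ship most often (MyKryoRegistrator is hypothetical)
  .set("spark.kryo.registrator", "com.example.MyKryoRegistrator")

val sc = new SparkContext(conf)
```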
Staging
A stage contains several mappings that are concatenated into embarrassingly
parallelizable longer tasks.
 E.g. map->map = 1 stage, map->reduce = 2 stages
 Technically reduce can start after a partition of its preceding map is generated, but
Spark is not that smart (or unnecessarily complex).
Staging can only be triggered by the following mappings:
 All actions.
 Wide transformations.
Staging canNOT be triggered by caching or checkpointing: they are also embarrassingly
parallelizable.
Wide Transformations?
Narrow (no
partitioning):
• Map
• FlatMap
• MapPartitions
• Filter
• Sample
• Union
Wide (partitioning):
• Intersection
• Distinct
• ReduceByKey
• GroupByKey
• Join
• Cartesian
• Repartition
• Coalesce (I know it’s
incomplete but WTH)
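A hedged word-count sketch of how narrow and wide mappings split into stages, assuming an RDD[String] named lines; the output path is a placeholder:

```scala
import org.apache.spark.SparkContext._   // pair-RDD functions such as reduceByKey

// Stage 1: two narrow transformations, fused into a single set of tasks
val words = lines.flatMap(_.split(" "))
val pairs = words.map(w => (w, 1))

// ReduceByKey is a wide transformation: it forces a shuffle and opens stage 2
val counts = pairs.reduceByKey(_ + _)

// The action finally triggers execution of both stages
counts.saveAsTextFile("hdfs:///some/output")
```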
Guideline
 Basic: RDD, Transformations and Actions.
 Basic: Testing, Packaging and Deployment.
 Advanced: Partitioning, Distribution and Staging.
 Expert: Composite Mapping and Accumulator.
 Example: A query engine for distributed web scraping.
 Q&A.
Composite Mapping
Used to create additional mappings and DSLs that do complex things in one line; also reduces
JAR and closure size. However…
 You can’t inherit or extend RDD, whose exact type is abstracted away from you.
 You can’t break basic mappings as atomic black-boxes.
Only solution: use Scala implicit view!
 Define a wrapper of RDD[T], implement all your methods/operators in it by wiring the
plethora of programming bricks.
 Create a static object (usually referred to as the context) with an implicit method that converts
RDD[T] to your wrapper (referred to as the implicit converter).
 In your main function, import the implicit converter by:
 import context._
 Voila, you can use those methods/operators like mappings on any RDD[T].
 Another reason why Scala is the language of choice.
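A minimal sketch of the implicit-view pattern described above; the wrapper, method, and object names are hypothetical:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.SparkContext._   // pair-RDD functions such as reduceByKey

// The wrapper: composite mappings built by wiring existing bricks
class RichStringRDD(self: RDD[String]) {
  def wordCount(): RDD[(String, Long)] =
    self.flatMap(_.split(" ")).map(w => (w, 1L)).reduceByKey(_ + _)
}

// The "context" object holding the implicit converter
object context {
  implicit def toRichStringRDD(rdd: RDD[String]): RichStringRDD = new RichStringRDD(rdd)
}

// In your main function:
//   import context._
//   val counts = lines.wordCount()   // now available on any RDD[String]
```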
Accumulator
Used to create counters, progress trackers, and performance metrics for things that are
not displayed on the UI.
 Created from the eponymous function:
 val acc = spark.accumulator(i)
 Only readable in main function, but can be updated anywhere in parallel by using:
 acc += j
 Type of i and j must be identical and inherit AccumulatorParam
 No constraint on implementation, but order of j should have no impact on final result.
 Prefer simple and fast implementation.
 Updated in real time, but requires an extra thread or non-blocking function to read locally when
the main thread is blocked by stage execution.
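A short sketch of the pattern, assuming a SparkContext sc and an RDD[String] called rawLines (both placeholders):

```scala
// A simple counter, readable only on the driver
val badRecords = sc.accumulator(0)

val trimmed = rawLines.map { line =>
  if (line.isEmpty) badRecords += 1   // can be updated anywhere in parallel
  line.trim
}

trimmed.count()                       // an action must run before the value means anything
println("bad records: " + badRecords.value)
```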
Guideline
 Basic: RDD, Transformations and Actions.
 Basic: Testing, Packaging and Deployment.
 Advanced: Partitioning, Distribution and Staging.
 Expert: Composite Mapping and Accumulator.
 Example: A query engine for distributed web scraping.
 Q&A.
Background
Implementation
Only 2 entities:
 RDD[ActionPlan]: A sequence of “human” actions executable on a browser.
 RDD[Page]: A web page or one of its sections.
Query syntax is entirely defined by composite mappings.
 Mappings on RDD[ActionPlan] are infix operators
 Mappings on RDD[Page] are methods with SQL/LINQ-ish names.
 Enabled by importing context._, and works with other mapping extensions.
Web UI is ported from iScala-notebook, which is ported from iPython-notebook.
[Diagram: transformations on RDD[T], each illustrated in the original slides with a small element-level example]
• RDD[T].Map(T=>K), Pipe(./bin), FlatMap(T=>K…), Distinct(), Union(RDD[T]), Filter(T=>y/n?),
Sample(), Intersection(RDD[T]), GroupBy(T~T’?), Cartesian(RDD[V])
[Diagram: transformations on key-value RDDs]
• RDD[(U,V)].GroupByKey(), (Left/right)Join(RDD[(U,K)]), Lookup(RDD[U]), ReduceByKey(V…=>V)
[Diagram: actions and parallelizations]
• RDD[T].Reduce(T…=>T), Collect(), Count(), First(), saveAsTextFile(filePath); fromTextFile(filePath), Parallelize(T…)
Performance Benchmark
[Bar chart: Pages/hour scraped from Amazon.com, Google Image and Iherb at 10, 20, 30 and 40 cores,
compared with import.io; y-axis from 0 to 70,000]
[Bar chart: Pages/(hour*core) for the same three sites and cluster sizes, compared with import.io
(10 cores?); y-axis from 0 to 1,800]
Thanks for tough
questions!
==In Chaos Monkey we trust==
Peng
pc175@uow.edu.au
github.com/tribbloid
Addendum
More Mutable Hell?
Lazy Evaluation
Anecdotes
More mutable hell?
Many sources have claimed that RDD[T] enforces an immutable pattern for T, namely:
 T contains only immutable (val) fields, which are themselves also immutable types and collections.
 The content of T cannot be modified in-place inside any mapping; always create a slightly different deep
copy instead.
However, this pattern perhaps needs to be followed less rigorously by now, for the following reasons:
 Serialization and shipping happen ubiquitously between stages, which always enforces a deep copy.
 2 threads/tasks cannot access one memory block by design, to avoid racing.
 Any partition modified by a mapping and then used by another is either regenerated or loaded from a
previous cache/checkpoint created BEFORE the modifying mapping; in both cases those
modifications are simply discarded without collateral damage.
 The immutable pattern requires great discipline in Java and Python.
So, my experience: just make sure your var and shallow copies won’t screw up things INSIDE each
single-threaded task, anticipate discarded changes, and theoretically you’ll be safe.
Of course there is no problem in using immutable pattern if you feel insecure.
Again, mutable static objects are FORBIDDEN.
Lazy Evaluation
 An RDD is empty at creation; only its type and partition IDs are determined.
 It can only be redeemed by an action called upon itself or a downstream RDD.
 After that, a recursive resolve request will be passed along its partitions’
dependency tree (a directed acyclic graph, or DAG), resulting in their tasks
being executed in sequence after their respective dependencies are redeemed.
 A task, with all its dependencies, is largely omitted if the partition it generates is already
cached, so always cache if you derive 2+ LARs from 1 RDD.
Knowing this is trivial to programming until you start using the first() action:
 Caching an RDD, calling first() on it (which triggers several mini-stages that have 1
partition/task), then doing anything that triggers full-scale staging will take twice
as long as doing it otherwise. Spark’s caching and staging mechanism is not
smart enough to know what will be used in later stages.
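A hedged sketch of the first() pitfall just described; rawPages and parse are placeholders:

```scala
val expensive = rawPages.map(parse).cache()   // mark for caching before any action

// first() only triggers mini-stages with a single partition/task,
// so only that one partition actually ends up cached
println(expensive.first())

// The full-scale staging below still recomputes the remaining partitions from scratch,
// roughly doubling the total time compared to skipping first() altogether
println(expensive.count())
```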
Anecdotes
•Partitioning of an RDD[T] depends on the data structure of
T and who creates it, and is rarely random (yeah it’s also
called ‘shuffling’ to fool you guys).
• E.g. an RDD[(U,V)] (a.k.a. PairRDD) will use U as partition
keys and distribute (U,V) in a C* token ring-ish fashion for
obvious reasons.
• More complex T and expected usage cases will result in
increasingly complex RDD implementations, notably the
SchemaRDD in Spark SQL.
• Again don’t try to reverse-engineer them unless you are
hardcore.

Más contenido relacionado

La actualidad más candente

Scaling Spark Workloads on YARN - Boulder/Denver July 2015
Scaling Spark Workloads on YARN - Boulder/Denver July 2015Scaling Spark Workloads on YARN - Boulder/Denver July 2015
Scaling Spark Workloads on YARN - Boulder/Denver July 2015Mac Moore
 
Intro to apache spark stand ford
Intro to apache spark stand fordIntro to apache spark stand ford
Intro to apache spark stand fordThu Hiền
 
Real time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkReal time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkRahul Jain
 
Building Scalable Data Pipelines - 2016 DataPalooza Seattle
Building Scalable Data Pipelines - 2016 DataPalooza SeattleBuilding Scalable Data Pipelines - 2016 DataPalooza Seattle
Building Scalable Data Pipelines - 2016 DataPalooza SeattleEvan Chan
 
Why your Spark Job is Failing
Why your Spark Job is FailingWhy your Spark Job is Failing
Why your Spark Job is FailingDataWorks Summit
 
Breakthrough OLAP performance with Cassandra and Spark
Breakthrough OLAP performance with Cassandra and SparkBreakthrough OLAP performance with Cassandra and Spark
Breakthrough OLAP performance with Cassandra and SparkEvan Chan
 
Top 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsTop 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationshadooparchbook
 
Transactional writes to cloud storage with Eric Liang
Transactional writes to cloud storage with Eric LiangTransactional writes to cloud storage with Eric Liang
Transactional writes to cloud storage with Eric LiangDatabricks
 
Introduction to Spark Streaming & Apache Kafka | Big Data Hadoop Spark Tutori...
Introduction to Spark Streaming & Apache Kafka | Big Data Hadoop Spark Tutori...Introduction to Spark Streaming & Apache Kafka | Big Data Hadoop Spark Tutori...
Introduction to Spark Streaming & Apache Kafka | Big Data Hadoop Spark Tutori...CloudxLab
 
Top 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark ApplicationsTop 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark ApplicationsCloudera, Inc.
 
Spark after Dark by Chris Fregly of Databricks
Spark after Dark by Chris Fregly of DatabricksSpark after Dark by Chris Fregly of Databricks
Spark after Dark by Chris Fregly of DatabricksData Con LA
 
11. From Hadoop to Spark 2/2
11. From Hadoop to Spark 2/211. From Hadoop to Spark 2/2
11. From Hadoop to Spark 2/2Fabio Fumarola
 
Emr zeppelin & Livy demystified
Emr zeppelin & Livy demystifiedEmr zeppelin & Livy demystified
Emr zeppelin & Livy demystifiedOmid Vahdaty
 
Apache Spark Introduction - CloudxLab
Apache Spark Introduction - CloudxLabApache Spark Introduction - CloudxLab
Apache Spark Introduction - CloudxLabAbhinav Singh
 

La actualidad más candente (20)

spark-kafka_mod
spark-kafka_modspark-kafka_mod
spark-kafka_mod
 
Spark Tips & Tricks
Spark Tips & TricksSpark Tips & Tricks
Spark Tips & Tricks
 
Scaling Spark Workloads on YARN - Boulder/Denver July 2015
Scaling Spark Workloads on YARN - Boulder/Denver July 2015Scaling Spark Workloads on YARN - Boulder/Denver July 2015
Scaling Spark Workloads on YARN - Boulder/Denver July 2015
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Intro to apache spark stand ford
Intro to apache spark stand fordIntro to apache spark stand ford
Intro to apache spark stand ford
 
Real time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkReal time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache Spark
 
Building Scalable Data Pipelines - 2016 DataPalooza Seattle
Building Scalable Data Pipelines - 2016 DataPalooza SeattleBuilding Scalable Data Pipelines - 2016 DataPalooza Seattle
Building Scalable Data Pipelines - 2016 DataPalooza Seattle
 
Why your Spark Job is Failing
Why your Spark Job is FailingWhy your Spark Job is Failing
Why your Spark Job is Failing
 
Breakthrough OLAP performance with Cassandra and Spark
Breakthrough OLAP performance with Cassandra and SparkBreakthrough OLAP performance with Cassandra and Spark
Breakthrough OLAP performance with Cassandra and Spark
 
Top 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsTop 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applications
 
Transactional writes to cloud storage with Eric Liang
Transactional writes to cloud storage with Eric LiangTransactional writes to cloud storage with Eric Liang
Transactional writes to cloud storage with Eric Liang
 
Spark core
Spark coreSpark core
Spark core
 
Introduction to Spark Streaming & Apache Kafka | Big Data Hadoop Spark Tutori...
Introduction to Spark Streaming & Apache Kafka | Big Data Hadoop Spark Tutori...Introduction to Spark Streaming & Apache Kafka | Big Data Hadoop Spark Tutori...
Introduction to Spark Streaming & Apache Kafka | Big Data Hadoop Spark Tutori...
 
Top 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark ApplicationsTop 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark Applications
 
Spark after Dark by Chris Fregly of Databricks
Spark after Dark by Chris Fregly of DatabricksSpark after Dark by Chris Fregly of Databricks
Spark after Dark by Chris Fregly of Databricks
 
11. From Hadoop to Spark 2/2
11. From Hadoop to Spark 2/211. From Hadoop to Spark 2/2
11. From Hadoop to Spark 2/2
 
Emr zeppelin & Livy demystified
Emr zeppelin & Livy demystifiedEmr zeppelin & Livy demystified
Emr zeppelin & Livy demystified
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Apache Spark Introduction - CloudxLab
Apache Spark Introduction - CloudxLabApache Spark Introduction - CloudxLab
Apache Spark Introduction - CloudxLab
 

Destacado

Search Engine Training Institute in Ambala!Batra Computer Centre
Search Engine Training Institute in Ambala!Batra Computer CentreSearch Engine Training Institute in Ambala!Batra Computer Centre
Search Engine Training Institute in Ambala!Batra Computer Centrejatin batra
 
Building Highly Flexible, High Performance Query Engines
Building Highly Flexible, High Performance Query EnginesBuilding Highly Flexible, High Performance Query Engines
Building Highly Flexible, High Performance Query EnginesMapR Technologies
 
The Emerging Role of the Data Lake
The Emerging Role of the Data LakeThe Emerging Role of the Data Lake
The Emerging Role of the Data LakeCaserta
 
Simplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache SparkSimplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache SparkDatabricks
 
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...Spark Summit
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Sparkdatamantra
 
Interactively Querying Large-scale Datasets on Amazon S3
Interactively Querying Large-scale Datasets on Amazon S3Interactively Querying Large-scale Datasets on Amazon S3
Interactively Querying Large-scale Datasets on Amazon S3Amazon Web Services
 
Provenance for Data Munging Environments
Provenance for Data Munging EnvironmentsProvenance for Data Munging Environments
Provenance for Data Munging EnvironmentsPaul Groth
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache SparkRahul Jain
 

Destacado (12)

Search Engine Training Institute in Ambala!Batra Computer Centre
Search Engine Training Institute in Ambala!Batra Computer CentreSearch Engine Training Institute in Ambala!Batra Computer Centre
Search Engine Training Institute in Ambala!Batra Computer Centre
 
search engines
search enginessearch engines
search engines
 
Building Highly Flexible, High Performance Query Engines
Building Highly Flexible, High Performance Query EnginesBuilding Highly Flexible, High Performance Query Engines
Building Highly Flexible, High Performance Query Engines
 
The Emerging Role of the Data Lake
The Emerging Role of the Data LakeThe Emerging Role of the Data Lake
The Emerging Role of the Data Lake
 
Datalake Architecture
Datalake ArchitectureDatalake Architecture
Datalake Architecture
 
Simplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache SparkSimplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache Spark
 
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Interactively Querying Large-scale Datasets on Amazon S3
Interactively Querying Large-scale Datasets on Amazon S3Interactively Querying Large-scale Datasets on Amazon S3
Interactively Querying Large-scale Datasets on Amazon S3
 
Provenance for Data Munging Environments
Provenance for Data Munging EnvironmentsProvenance for Data Munging Environments
Provenance for Data Munging Environments
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Apache Spark Architecture
Apache Spark ArchitectureApache Spark Architecture
Apache Spark Architecture
 

Similar a How to build your query engine in spark

Apache Cassandra and Apche Spark
Apache Cassandra and Apche SparkApache Cassandra and Apche Spark
Apache Cassandra and Apche SparkAlex Thompson
 
Quick Guide to Refresh Spark skills
Quick Guide to Refresh Spark skillsQuick Guide to Refresh Spark skills
Quick Guide to Refresh Spark skillsRavindra kumar
 
Fast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonBenjamin Bengfort
 
Introduction to Apache Spark :: Lagos Scala Meetup session 2
Introduction to Apache Spark :: Lagos Scala Meetup session 2 Introduction to Apache Spark :: Lagos Scala Meetup session 2
Introduction to Apache Spark :: Lagos Scala Meetup session 2 Olalekan Fuad Elesin
 
Big Data Beyond the JVM - Strata San Jose 2018
Big Data Beyond the JVM - Strata San Jose 2018Big Data Beyond the JVM - Strata San Jose 2018
Big Data Beyond the JVM - Strata San Jose 2018Holden Karau
 
High concurrency,
Low latency analytics
using Spark/Kudu
 High concurrency,
Low latency analytics
using Spark/Kudu High concurrency,
Low latency analytics
using Spark/Kudu
High concurrency,
Low latency analytics
using Spark/KuduChris George
 
Hadoop Spark Introduction-20150130
Hadoop Spark Introduction-20150130Hadoop Spark Introduction-20150130
Hadoop Spark Introduction-20150130Xuan-Chao Huang
 
Learn about SPARK tool and it's componemts
Learn about SPARK tool and it's componemtsLearn about SPARK tool and it's componemts
Learn about SPARK tool and it's componemtssiddharth30121
 
Spark For Faster Batch Processing
Spark For Faster Batch ProcessingSpark For Faster Batch Processing
Spark For Faster Batch ProcessingEdureka!
 
Apache Spark Fundamentals Meetup Talk
Apache Spark Fundamentals Meetup TalkApache Spark Fundamentals Meetup Talk
Apache Spark Fundamentals Meetup TalkEren Avşaroğulları
 
Bring the Spark To Your Eyes
Bring the Spark To Your EyesBring the Spark To Your Eyes
Bring the Spark To Your EyesDemi Ben-Ari
 
Optimizations in Spark; RDD, DataFrame
Optimizations in Spark; RDD, DataFrameOptimizations in Spark; RDD, DataFrame
Optimizations in Spark; RDD, DataFrameKnoldus Inc.
 
Spark to Production @Windward
Spark to Production @WindwardSpark to Production @Windward
Spark to Production @WindwardDemi Ben-Ari
 
Performance Optimization Case Study: Shattering Hadoop's Sort Record with Spa...
Performance Optimization Case Study: Shattering Hadoop's Sort Record with Spa...Performance Optimization Case Study: Shattering Hadoop's Sort Record with Spa...
Performance Optimization Case Study: Shattering Hadoop's Sort Record with Spa...Databricks
 
Architecting and productionising data science applications at scale
Architecting and productionising data science applications at scaleArchitecting and productionising data science applications at scale
Architecting and productionising data science applications at scalesamthemonad
 
AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r...
AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r...AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r...
AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r...Omid Vahdaty
 
Spark 101 – First Steps To Distributed Computing - Demi Ben-Ari @ Ofek Alumni
Spark 101 – First Steps To Distributed Computing - Demi Ben-Ari @ Ofek AlumniSpark 101 – First Steps To Distributed Computing - Demi Ben-Ari @ Ofek Alumni
Spark 101 – First Steps To Distributed Computing - Demi Ben-Ari @ Ofek AlumniDemi Ben-Ari
 
Spark Overview and Performance Issues
Spark Overview and Performance IssuesSpark Overview and Performance Issues
Spark Overview and Performance IssuesAntonios Katsarakis
 

Similar a How to build your query engine in spark (20)

Apache Cassandra and Apche Spark
Apache Cassandra and Apche SparkApache Cassandra and Apche Spark
Apache Cassandra and Apche Spark
 
Quick Guide to Refresh Spark skills
Quick Guide to Refresh Spark skillsQuick Guide to Refresh Spark skills
Quick Guide to Refresh Spark skills
 
Fast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and Python
 
Spark on YARN
Spark on YARNSpark on YARN
Spark on YARN
 
Introduction to Apache Spark :: Lagos Scala Meetup session 2
Introduction to Apache Spark :: Lagos Scala Meetup session 2 Introduction to Apache Spark :: Lagos Scala Meetup session 2
Introduction to Apache Spark :: Lagos Scala Meetup session 2
 
Big Data Beyond the JVM - Strata San Jose 2018
Big Data Beyond the JVM - Strata San Jose 2018Big Data Beyond the JVM - Strata San Jose 2018
Big Data Beyond the JVM - Strata San Jose 2018
 
High concurrency,
Low latency analytics
using Spark/Kudu
 High concurrency,
Low latency analytics
using Spark/Kudu High concurrency,
Low latency analytics
using Spark/Kudu
High concurrency,
Low latency analytics
using Spark/Kudu
 
Hadoop Spark Introduction-20150130
Hadoop Spark Introduction-20150130Hadoop Spark Introduction-20150130
Hadoop Spark Introduction-20150130
 
Learn about SPARK tool and it's componemts
Learn about SPARK tool and it's componemtsLearn about SPARK tool and it's componemts
Learn about SPARK tool and it's componemts
 
Spark For Faster Batch Processing
Spark For Faster Batch ProcessingSpark For Faster Batch Processing
Spark For Faster Batch Processing
 
Apache Spark Fundamentals Meetup Talk
Apache Spark Fundamentals Meetup TalkApache Spark Fundamentals Meetup Talk
Apache Spark Fundamentals Meetup Talk
 
Bring the Spark To Your Eyes
Bring the Spark To Your EyesBring the Spark To Your Eyes
Bring the Spark To Your Eyes
 
Optimizations in Spark; RDD, DataFrame
Optimizations in Spark; RDD, DataFrameOptimizations in Spark; RDD, DataFrame
Optimizations in Spark; RDD, DataFrame
 
Spark to Production @Windward
Spark to Production @WindwardSpark to Production @Windward
Spark to Production @Windward
 
Performance Optimization Case Study: Shattering Hadoop's Sort Record with Spa...
Performance Optimization Case Study: Shattering Hadoop's Sort Record with Spa...Performance Optimization Case Study: Shattering Hadoop's Sort Record with Spa...
Performance Optimization Case Study: Shattering Hadoop's Sort Record with Spa...
 
Architecting and productionising data science applications at scale
Architecting and productionising data science applications at scaleArchitecting and productionising data science applications at scale
Architecting and productionising data science applications at scale
 
AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r...
AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r...AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r...
AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r...
 
Module01
 Module01 Module01
Module01
 
Spark 101 – First Steps To Distributed Computing - Demi Ben-Ari @ Ofek Alumni
Spark 101 – First Steps To Distributed Computing - Demi Ben-Ari @ Ofek AlumniSpark 101 – First Steps To Distributed Computing - Demi Ben-Ari @ Ofek Alumni
Spark 101 – First Steps To Distributed Computing - Demi Ben-Ari @ Ofek Alumni
 
Spark Overview and Performance Issues
Spark Overview and Performance IssuesSpark Overview and Performance Issues
Spark Overview and Performance Issues
 

Último

why an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdfwhy an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdfjoe51371421
 
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AISyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AIABDERRAOUF MEHENNI
 
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...OnePlan Solutions
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsAlberto González Trastoy
 
A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxComplianceQuest1
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsArshad QA
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...MyIntelliSource, Inc.
 
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️anilsa9823
 
Test Automation Strategy for Frontend and Backend
Test Automation Strategy for Frontend and BackendTest Automation Strategy for Frontend and Backend
Test Automation Strategy for Frontend and BackendArshad QA
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...ICS
 
Project Based Learning (A.I).pptx detail explanation
Project Based Learning (A.I).pptx detail explanationProject Based Learning (A.I).pptx detail explanation
Project Based Learning (A.I).pptx detail explanationkaushalgiri8080
 
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer DataAdobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer DataBradBedford3
 
Salesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantSalesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantAxelRicardoTrocheRiq
 
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerHow To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerThousandEyes
 
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideBuilding Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideChristina Lin
 
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...kellynguyen01
 
DNT_Corporate presentation know about us
DNT_Corporate presentation know about usDNT_Corporate presentation know about us
DNT_Corporate presentation know about usDynamic Netsoft
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Modelsaagamshah0812
 

Último (20)

why an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdfwhy an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdf
 
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS LiveVip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
 
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AISyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
 
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
 
A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docx
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview Questions
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
 
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
 
Test Automation Strategy for Frontend and Backend
Test Automation Strategy for Frontend and BackendTest Automation Strategy for Frontend and Backend
Test Automation Strategy for Frontend and Backend
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
 
Project Based Learning (A.I).pptx detail explanation
Project Based Learning (A.I).pptx detail explanationProject Based Learning (A.I).pptx detail explanation
Project Based Learning (A.I).pptx detail explanation
 
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer DataAdobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
 
Salesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantSalesforce Certified Field Service Consultant
Salesforce Certified Field Service Consultant
 
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerHow To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
 
Exploring iOS App Development: Simplifying the Process
Exploring iOS App Development: Simplifying the ProcessExploring iOS App Development: Simplifying the Process
Exploring iOS App Development: Simplifying the Process
 
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideBuilding Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
 
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
 
DNT_Corporate presentation know about us
DNT_Corporate presentation know about usDNT_Corporate presentation know about us
DNT_Corporate presentation know about us
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Models
 

How to build your query engine in spark

  • 1. How to build your query engine in Spark Peng Engineer@anchorbot Love machine learning & algorithms Part-time Mahout committer
  • 2. Prior Knowledge  Scala – not important, it's always changing, so if you don't know it, congratulation you don't have to learn it again and re-become the grandmaster you were.  Functional Programming – very important!  But not too functional – equally important! Will be explained later  Working with Amazon EC2/S3 – spot instances are dirt-cheap and unleash the power of auto-scaling. (assuming you want to finish things in short burst AND your 8-hours sleep)  You DON'T have to know Hadoop, YARN OR HDFS (but highly recommended).  You DON'T have to know MapReduce, DAG dependency model OR Apache Akka (I never did)  You DON'T have to know Machine Learning OR Data Science.
  • 3. Guideline  Basic: RDD, Transformations and Actions.  Basic: Testing, Packaging and Deployment.  Advanced: Partitioning, Distribution and Staging.  Expert: Composite Mapping and Accumulator.  Example: A query engine for distributed web scraping.  Q&A.
  • 4. Programming Bricks Entities: Generic data abstractions • RDD[T] (Resilient Distributed Dataset): a collection of java objects spread across your cluster, inaccessible from your local computer. • LAR[T] (Locally Accessible Resources): a data source/sink you can read/write from your local computer. • Can be many things including but not limited to: JVM memory block on your computer, local files, files on HDFS, files on S3, tables in C* (new!), tables in Hive, Twitter feed(read-only), other web API feed (read-only) • This list is still growing. Mappings: Methods that cast one entity to another  Parallelization: LAR[T] => RDD[T]  Transformation(f: {T} => {K}): RDD[T] => RDD[K], generalizes map  Action(f: {K} => {K}): RDD[K] => LAR[K], generalizes reduce RDDs LARs Transformation ParallelizationAction Plain java code
  • 5. Programming Bricks These bricks are atomic black boxes, do not attempt to break or reverse-engineer them! If you want to try -------> Instead, try to be agnostic and construct your complex algorithm and framework by wiring them like IC chips.  They form a much larger superset of Map/Reduce.  They have no problem constructing the most complex distributed algorithms in ML and Graph analysis.  Developers of Spark has made great effort in abstracting these complex and ugly trivia away from you so you can concentrate on the beauty of your algorithm.
  • 6. Advantages Probably not optimized to the core. But once you fit into the paradigm… • No more thread unsafety, racing condition, resource pool, consumer starvation, buffer overflow, deadlock, JVM OutOfHeapSpaceException, or whatever absurdities. • No more RPC timeout, service unavailable, 500 internal server error, or Chaos Monkey’s miracles. • No more weird exception that only happens after being deployed to cluster, local multi-thread debugging and test capture 95% of them. • No dependency on any external database, message queue, or a specific file system (pick one from local, HDFS, S3, CFS and change it later in 5 seconds) • Your code will be stripped down to its core • 10~20% of your original code in cluster computing! • 30~50% of that in multi-thread computing
  • 7. Guideline  Basic: RDD, Transformations and Actions.  Basic: Testing, Packaging and Deployment.  Advanced: Partitioning, Distribution and Staging.  Expert: Composite Mapping and Accumulator.  Example: A query engine for distributed web scraping.  Q&A.
  • 8. Testing The first thing you should do even before cluster setup because:  On a laptop with 8 cores it is still a savage beast that outperforms most other programs with similar size.  Does not require packaging and uploading, both are slow.  Read all logs from console output.  Is a self-contained multi-threaded process that fits into any debugger and test framework. Supports 3 modes, set by ‘--master’ parameter  local[*]: use all local cores, won’t do failover! (better paranoid than sorry)  local[n,t]: use n cores (support for * is missing), will retry each task t-1 times.  local-cluster[n,c,m]: cluster-simulation mode!  Simulate a mini-cluster of size c, each computer has n cores and m megabytes of memory.  Technically no longer a single process, it will simulate everything including data distribution over network.  As a result, you have to package first, do not support debugging, and better not using it in unit test.  Will expose 100% of your errors in local run.
  • 9. Master Seed node Resource negotiator 3 options:  Native: lightweight, well tested, ugly UI, primary/backup redundancy delegated to ZooKeeper, support auto-scaling (totally abused by DataBricks), recommended for beginners  YARN: scalable, heavyweight, threads run in containers, beautiful UI, swarm redundancy  Mesos: don’t know why its still here Remember the master URL on its UI after setup, you are going to use it everywhere.
  • 10. Worker The muscle and the real deal Report status to master and shuffle data to each other. Cores are segregated and share nothing in computation, except broadcasted variables. Disposable! Can be added or removed at will, enables fluent scaling 3 options:  $SPARK_HOME/bin/spark-class org.apache.spark.deploy.worker.Worker $MASTER_URL: both the easiest and most flexible, support auto-scaling by adding this line into startup script.  …/bin/start-all: launch both master and workers, need to setup password-less ssh login first.  …/ec2/spark-ec2: launch many things on EC2 including an in-memory HDFS, too heavyweight and too many options hardcoded.
  • 11. Driver Node/JVM that runs your main function. Merged with a random worker in cluster deploy mode (see next page) Distribute data Control staging Collect action results and accumulator changes. Technically not part of cluster but still better to be close to all other nodes. (Important for iterative jobs) Must have a public DNS to master! otherwise will cause: WARNING: Initial job has not accepted any resources… $SPARK_HOME on it has to be identical to that on workers (This is really sloppy but people no longer care)
  • 12. Packaging Generate the all inclusive ‘fat/über’ JAR being distributed to nodes. Self-contained, should include everything in your program’s dependency tree This JAR won’t be generated by default, you have to generate it by: Enable maven-shade plugin and run mvn package Enable sbt-assembly plugin and run sbt> assembly … EXCEPT those who overlap with Spark’s dependencies (and all modules’ dependencies, including but not limited to: SparkSQL, Streaming, Mllib and GrpahX).  Excluding them by setting the scope of Spark artifact(s) in your dependency list to ‘provided’  You don’t have to do it but this decrease your JAR size by 90M+.  They already exist in $SPARK_HOME/lib/*.jar and will always be loaded BEFORE your JAR.  if your program and Spark have overlapping dependencies but in different versions, yours will be ignored in runtime (Java’s first-found-first-serve principal), and you go straight into...
  • 13. JAR hell
Manifests itself as one of these errors that only appear after packaging:
 NoClassDefFoundError
 ClassNotFoundException
 NoSuchFieldError
 NoSuchMethodError
 Unfortunately many dependencies of Spark are severely out of date.
 Even more unfortunately, the list of these outdated dependencies is still growing, a curse bestowed by the Apache Foundation.
 Switching to YARN won’t resolve it! It just boxes threads into containers but won’t change the class loading sequence.
Only (ugly but working) solution so far: package relocation!
 Supported by maven-shade by setting a relocation rule (see the sketch below); don’t know how to do this in sbt :-<
 Probably there are third-party plugins that can detect it from the dependency tree; needs more testing.
 Not very compatible with some IDEs; if a classpath error is reported, re-import the project.
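For illustration, a maven-shade relocation rule might look roughly like this; the relocated package (Guava) and the shaded prefix are just examples of a commonly conflicting dependency, and the snippet goes inside the shade plugin's configuration in your pom:

  <plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-shade-plugin</artifactId>
    <configuration>
      <relocations>
        <relocation>
          <pattern>com.google.common</pattern>
          <shadedPattern>shaded.com.google.common</shadedPattern>
        </relocation>
      </relocations>
    </configuration>
  </plugin>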
  • 14. Maven vs sbt
Maven:
• The most extensible and widely supported build tool.
• Native to Java, but all Scala dependencies are Java bytecode anyway.
• Needs the maven-scala and maven-shade plugins.
• I don’t know why, but the official Spark repo switched from sbt to Maven after 0.9.0.
• Apparently slightly faster than Ivy.
• A personal tool of choice.
Simple Build Tool (used to be simple):
• No abominable XML.
• Native to Scala.
• Self-contained executable.
• Beautiful build reports from the Ivy backend.
• Needs the sbt-assembly plugin (which does NOT support relocation :-<).
  • 15. Deployment
$SPARK_HOME/bin/spark-submit --master $SPARK_MASTER_URL --class full.path.to.your.main.object $YOUR_JAR
This command does everything, including distributing JARs, and runs the main function locally as the driver (a.k.a. client deploy mode). Alternatively you can move the driver to a random node by overriding ‘--deploy-mode’ to ‘cluster’, but it’s not recommended for beginners, reasons:
 You don’t know which node it is until you see the Spark UI.
 The driver takes extra CPU and bandwidth load from that worker.
 You cannot use a local JAR – you have to upload it to HDFS or S3 first.
 spark-submit is dumb and doesn’t know where to find it in the JAR distribution dir.
 Useless for any Spark shell! And a few other things.
 If it’s part of an SOA, have fun pointing all the other clients to it.
  • 16. Guideline
 Basic: RDD, Transformations and Actions.
 Basic: Testing, Packaging and Deployment.
 Advanced: Partitioning, Distribution and Staging.
 Expert: Composite Mapping and Accumulator.
 Example: A query engine for distributed web scraping.
 Q&A.
  • 17. Partitioning (a.k.a. shuffling)
RDD[T].Partition: the smallest inseparable chunk of T; a partition cannot spread over 2 cores or threads. -> Generating each partition only requires a self-contained single-thread subroutine (called a task) that won’t screw up and induces no scheduling/synchronization overhead.
The default number of partitions is the total number of cores in the cluster, which works great if the workload on each partition is fairly balanced. Otherwise some cores will finish first and fence in your cluster, so you’d better override this (see the sketch below):
 Many transformations and parallelizations take an optional Int parameter to produce an RDD with the desired number of partitions.
 RDD[T].repartition(n: Int) returns an RDD[T] with identical content but a different number of partitions; it also rebalances partition sizes.
 RDD[T].coalesce(n: Int) merges the closest partitions (ideally on one node) together. This is an incomplete repartitioning (no shuffling), which makes it faster.
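A minimal sketch of the three knobs, with made-up partition counts and a placeholder input path (sc is the SparkContext):

  // Ask for 64 partitions up front, then rebalance or shrink later.
  val lines  = sc.textFile("s3n://bucket/input", 64)   // optional numPartitions argument
  val spread = lines.repartition(128)                  // full shuffle, rebalances sizes
  val packed = spread.coalesce(32)                     // merges neighbours, no shuffle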
  • 18. Resilience
A partition is also the smallest unit to be discarded and regenerated from scratch whenever:
 The task generating it throws an exception and quits. (You can still customize your mappings to retry locally inside the task thread, to avoid discarding the part that already succeeded – Matei confirmed this last time.)
 It is lost in a power outage or its node is disconnected from the cluster.
 (When speculative tasks are enabled) the task generating it takes too long to finish compared to the time most other partitions took. (In this case the old one won’t be discarded, but races with the new one.)
 It is being redistributed from another node or loaded from a remote cache (e.g. HDFS & S3) so slowly (I mean discouragingly slowly) that regeneration wins, and all its dependencies (the prior partitions and mappings needed to generate it) are available locally. When this really happens, you and your network admin will have a problem.
  • 19. Rule No. 1: No. of partitions >= No. of cores
More partitions / smaller partitions =
• Higher scheduling and distribution overhead
• Higher workload/bandwidth consumption for the driver/master node
• Lower cost for retry/regeneration
• Better scheduling for unbalanced partitioning and speculative tasks
• Easier monitoring of progress
Fewer partitions / bigger partitions =
• Lower scheduling and distribution overhead
• Lower workload/bandwidth consumption for the driver/master node
• Higher cost for retry/regeneration (again, you can retry inside the thread first)
• Longer waiting time for unbalanced partitioning and speculative tasks
• The progress bar will stay at 0% until you lose hope
  • 20. Distribution
Both RDD[T] and mappings are (hopefully evenly) spread across cores. 3 modes are supported, from fast to slow:
1. (fastest) JAR!
 Contains all precompiled Java bytecode and artifacts, which include but are not limited to: IMMUTABLE static objects and constants (mutable static objects are strictly forbidden: if you modify one locally at runtime, no other node will know it, and you go straight into mutable hell), class methods, precompiled anonymous functions EXCLUDING their closures (they are just methods of classes that implement the Function interfaces – BTW nice lambda operator, Java 8), the manifest, and anything else you package into the JAR for whatever reason.
 Does not include: fields of dynamic objects (initialized at runtime), closures of anonymous functions (same old thing), variables.
 Only (and always) happens once before execution and is reused for the lifespan of a job.
 You can find the distributed JAR in $SPARK_HOME/work/$JOB_UUID/… on each node.
  • 21. Distribution
2. (faster) Broadcast (bonus feature, prioritized here due to its importance)
 Basically a cluster-wide singleton initialized at runtime.
 Happens immediately during the singleton’s initialization using an eponymous static function:
 val wrapper = spark.broadcast(thing: T <: Serializable)
 After which you can define any mapping that reads it repeatedly across the entire cluster, by using this wrapper in the mapping’s parameter:
 wrapper.value
 You can’t write to it, it’s IMMUTABLE – it would cause racing conditions if you could anyway. (See the sketch below.)
3. (fast) Shipping (easiest, also the only way to distribute heterogeneous objects created at runtime, including non-singletons and closures)
 Also used to distribute the broadcast wrappers themselves (very small/fast).
 Still much faster than reading from ANY non-local file system (giving it a performance edge over Hadoop MR).
 Happens automatically during partitioning, triggered on demand.
 You can see its time cost in the ‘Shuffle Read/Write’ column in the job UI.
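A minimal sketch of the broadcast pattern; the lookup table is a made-up example, sc is the SparkContext and rdd is assumed to be an existing RDD[String]:

  // Hypothetical lookup table, broadcast once and read from every task.
  val table   = Map("a" -> 1, "b" -> 2)
  val bcTable = sc.broadcast(table)                                 // cluster-wide read-only singleton
  val tagged  = rdd.map(k => (k, bcTable.value.getOrElse(k, 0)))    // read it, never write it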
  • 22. Serialization Hell
Broadcast and shipping demand that all objects being distributed are SERIALIZABLE:
 ...broadcast[T <: Serializable](…
 RDD[T <: Serializable]
 …map(f <: Function[T,K] with Serializable)
Otherwise a deep copy is a no-can-do and the program throws NotSerializableException. Easiest (and most effective) solution: don’t be too functional! Only put simple types and collections in RDDs and closures. This also makes shipping faster (very important for iterative algorithms – beware, R programmers). If that is not possible you still have 2 options:
 Wrap complex objects in a serializable wrapper (recommended; used by many Spark parallelizations/actions to distribute HDFS/S3 credentials).
 Switch to the Kryo serializer (shipping is faster in most cases; favored by Mahout due to its extremely iterative ML algorithms – I haven’t tried it yet). See the sketch below.
This happens even when shipping between cores, and only becomes a non-issue when broadcasting locally (a singleton is not bound to cores).
 One of the rare cases where your cluster-wide deployment fails yet the local test succeeds.
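For reference, switching to Kryo is a one-line configuration change; the key names below are the Spark 1.x-era ones, and the registrator class is a hypothetical placeholder:

  val conf = new SparkConf()
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    // optionally register classes via a custom registrator:
    // .set("spark.kryo.registrator", "my.package.MyRegistrator")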
  • 23. Staging
A stage contains several mappings that are concatenated into longer, embarrassingly parallelizable tasks.
 E.g. map->map = 1 stage, map->reduce = 2 stages (see the sketch below).
 Technically a reduce could start as soon as a partition of its preceding map is generated, but Spark is not that smart (or unnecessarily complex).
Staging can only be triggered by the following mappings:
 All actions.
 Wide transformations.
It canNOT be triggered by caching or checkpointing: those are also embarrassingly parallelizable.
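A minimal sketch of where the stage boundary falls; the input path is a placeholder:

  // map -> flatMap -> map fuse into a single stage; reduceByKey forces a shuffle and a second stage.
  val words  = sc.textFile("hdfs:///input.txt").flatMap(_.split(" "))   // stage 1
  val counts = words.map(w => (w, 1)).reduceByKey(_ + _)                // stage 2 starts here
  counts.collect()                                                      // action triggers execution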
  • 24. Wide Transformations?
Narrow (no partitioning):
• Map
• FlatMap
• MapPartitions
• Filter
• Sample
• Union
Wide (partitioning):
• Intersection
• Distinct
• ReduceByKey
• GroupByKey
• Join
• Cartesian
• Repartition
• Coalesce
(I know it’s incomplete, but WTH)
  • 25. Guideline
 Basic: RDD, Transformations and Actions.
 Basic: Testing, Packaging and Deployment.
 Advanced: Partitioning, Distribution and Staging.
 Expert: Composite Mapping and Accumulator.
 Example: A query engine for distributed web scraping.
 Q&A.
  • 26. Composite Mapping
Used to create additional mappings and DSLs that do complex things in one line; it also reduces JAR and closure size. However…
 You can’t inherit or extend RDD, whose exact type is abstracted away from you.
 You can’t break the basic mappings open – they are atomic black boxes.
Only solution: use a Scala implicit view!
 Define a wrapper of RDD[T] and implement all your methods/operators in it by wiring together the plethora of programming bricks.
 Create a static object (usually referred to as a context) with an implicit method that converts RDD[T] to your wrapper (referred to as an implicit converter).
 In your main function, import the implicit converter by:
 import context._
 Voilà, you can use those methods/operators like mappings on any RDD[T] (see the sketch below).
 Another reason why Scala is the language of choice.
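A minimal sketch of the pattern; the names context, RichRDD and top3Squared are hypothetical:

  import org.apache.spark.rdd.RDD

  // Hypothetical wrapper: adds a composite mapping to every RDD[Int].
  class RichRDD(val self: RDD[Int]) extends Serializable {
    def top3Squared(): Array[Int] = self.map(x => x * x).top(3)   // wired from basic bricks
  }

  object context {
    implicit def toRichRDD(rdd: RDD[Int]): RichRDD = new RichRDD(rdd)   // implicit converter
  }

  // In the main function:
  //   import context._
  //   sc.parallelize(1 to 10).top3Squared()   // reads like a built-in mapping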
  • 27. Accumulator
Used to create counters, progress trackers, and performance metrics for things that are not displayed on the UI.
 Created from the eponymous function:
 val acc = spark.accumulator(i)
 Only readable in the main function, but can be updated anywhere in parallel using:
 acc += j
 The types of i and j must be identical and covered by an AccumulatorParam.
 No constraint on the implementation, but the order of the j’s should have no impact on the final result.
 Prefer a simple and fast implementation.
 Updated in real time, but reading it locally requires an extra thread or a non-blocking function while the main thread is blocked by stage execution. (See the sketch below.)
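A minimal counting sketch; the blank-line condition and input path are made-up examples, and sc is the SparkContext the deck calls spark:

  // Count empty lines while doing other work; readable only back on the driver.
  val blankLines = sc.accumulator(0)                        // i = 0, an Int accumulator
  val lines = sc.textFile("hdfs:///input.txt")              // placeholder path
  lines.foreach(l => if (l.trim.isEmpty) blankLines += 1)   // updated in parallel
  println(blankLines.value)                                 // read in the main function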
  • 28. Guideline
 Basic: RDD, Transformations and Actions.
 Basic: Testing, Packaging and Deployment.
 Advanced: Partitioning, Distribution and Staging.
 Expert: Composite Mapping and Accumulator.
 Example: A query engine for distributed web scraping.
 Q&A.
  • 30. Implementation
Only 2 entities:
 RDD[ActionPlan]: a sequence of “human” actions executable in a browser.
 RDD[Page]: a web page or one of its sections.
The query syntax is entirely defined by composite mappings:
 Mappings on RDD[ActionPlan] are infix operators.
 Mappings on RDD[Page] are methods with SQL/LINQ-ish names.
 Enabled by importing context._, and they work together with other mapping extensions.
The web UI is ported from iScala-notebook, which is ported from iPython-notebook.
  • 31. [Diagram: cheat sheet of the programming bricks, sketching each mapping’s effect on partitions]
 Transformations on RDD[T]: Map, Pipe, FlatMap, Distinct, Union, Filter, Sample, Intersection, GroupBy, Cartesian.
 Transformations on RDD[(U,V)]: GroupByKey, (Left/right)Join, Lookup, ReduceByKey.
 Actions and parallelizations on RDD[T]: Reduce, Collect, Count, First, saveAsTextFile(filePath), fromTextFile(filePath), Parallelize.
  • 32. Performance Benchmark
[Charts: pages/hour and pages/(hour*core) for scraping Amazon.com, Google Image and Iherb at 10, 20, 30 and 40 cores, compared against import.io (10 cores?)]
  • 33. Thanks for tough questions! ==In Chaos Monkey we trust== Peng pc175@uow.edu.au github.com/tribbloid
  • 34. Addendum
 More Mutable Hell?
 Lazy Evaluation
 Anecdotes
  • 35. More mutable hell?
Many sources have claimed that RDD[T] enforces an immutable pattern for T, namely:
 T contains only immutable (val) fields, which are themselves also immutable types and collections.
 The content of T cannot be modified in place inside any mapping; always create a slightly different deep copy.
However this pattern is perhaps less rigorous than it sounds by now, reasons:
 Serialization and shipping happen ubiquitously between stages, and they always enforce a deep copy.
 2 threads/tasks cannot access one memory block by design, which avoids racing.
 Any partition modified by one mapping and then used by another is either regenerated or loaded from a previous cache/checkpoint created BEFORE the modifying mapping; in both cases those modifications are simply discarded without collateral damage.
 The immutable pattern requires great discipline in Java and Python.
So, my experience: just make sure your vars and shallow copies won’t screw things up INSIDE each single-threaded task, anticipate discarded changes, and theoretically you’ll be safe. Of course there is no problem in using the immutable pattern if you feel insecure. Again, mutable static objects are FORBIDDEN.
  • 36. Lazy Evaluation
 An RDD is empty at creation; only its type and partition ids are determined.
 It can only be redeemed by an action called upon itself or one of its downstream RDDs.
 After which a recursive resolve request is passed along its partitions’ dependency tree (a directed acyclic graph, or DAG), resulting in their tasks being executed in sequence once their respective dependencies are redeemed.
 A task with all its dependencies is largely omitted if the partition it generates is already cached, so always cache if you derive 2+ LARs from 1 RDD (see the sketch below).
Knowing this is trivial to programming until you start using the first() action:
 Caching an RDD, calling first() on it (which triggers several mini-stages with 1 partition/task each), and then doing anything that triggers full-scale staging will take twice as long as doing it the other way round. Spark’s caching and staging mechanism is not smart enough to know what will be used in later stages.
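A minimal sketch of deriving two results from one cached RDD; the input path and the transformation are placeholders:

  // Without cache(), the whole lineage would be recomputed once per action.
  val parsed = sc.textFile("hdfs:///logs.txt").map(_.toLowerCase)
  parsed.cache()                // keep partitions around after the first materialization
  val total  = parsed.count()   // full-scale stage: materializes and caches all partitions
  val sample = parsed.first()   // cheap mini-stage: reuses the cached first partition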
  • 37. Anecdotes
• The partitioning of an RDD[T] depends on the data structure of T and on who creates it, and is rarely random (yeah, it’s also called ‘shuffling’ to fool you guys).
• E.g. an RDD[(U,V)] (a.k.a. PairRDD) will use U as the partition key and distribute (U,V) in a C* token-ring-ish fashion, for obvious reasons.
• More complex T and expected use cases result in increasingly complex RDD implementations, notably the SchemaRDD in SparkSQL.
• Again, don’t try to reverse-engineer them unless you are hardcore.