Apache Spark talk @ The Amsterdam Applied Machine Learning meetup group
1. Apache Spark for applied machine learning
Friso van Vollenhoven
@fzk
frisovanvollenhoven@godatadriven.com
GoDataDriven, proudly part of the Xebia Group
14. Resilient Distributed Dataset
•Immutable set of records (e.g. tuples)
•Distributed across a cluster of workers
•Stored in RAM or on disk (partially)
•Built through transformations
•Automatically rebuilt on failure
•Possibly replicated
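The "immutable", "built through transformations" and "automatically rebuilt on failure" points can be illustrated with a toy model in plain Python. This is only a sketch and not Spark's implementation: `ToyRDD`, `compute_partition` and the `cache` dictionary are hypothetical names, and the source partitions are assumed to live on durable storage.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class ToyRDD:
    """Toy stand-in for an RDD (hypothetical, not the Spark API)."""
    # Source partitions, assumed to be durable (e.g. files on a distributed file system).
    source: Tuple[Tuple[int, ...], ...]
    # Lineage: the per-record functions applied so far, in order.
    lineage: Tuple[Callable[[int], int], ...] = ()

    def map(self, fn: Callable[[int], int]) -> "ToyRDD":
        # A transformation never mutates; it returns a new ToyRDD with a longer lineage.
        return ToyRDD(self.source, self.lineage + (fn,))

    def compute_partition(self, index: int) -> List[int]:
        # Build (or rebuild) one partition by replaying the lineage over its source data.
        records = list(self.source[index])
        for fn in self.lineage:
            records = [fn(r) for r in records]
        return records


base = ToyRDD(source=((1, 2, 3), (4, 5, 6)))
doubled = base.map(lambda x: x * 2)

# Pretend each partition is held in the RAM of a different worker.
cache = {i: doubled.compute_partition(i) for i in range(len(doubled.source))}

# Simulate losing the worker that held partition 1, then rebuild only that partition.
del cache[1]
cache[1] = doubled.compute_partition(1)
recovered = cache[1]
```

Only the lost partition is recomputed, and `base` is untouched by the `map` call, which is the property that makes replaying the lineage safe.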
15. Operations
•Operate on RDDs
•Either create a new RDD (transformations)
•Or materialise an RDD and return data (actions)
•Transformations: map, filter, groupBy, etc.
•Actions: count, collect, reduce, save, etc.
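The split between transformations and actions can be sketched with a small lazy pipeline in plain Python. This is a toy model under stated assumptions, not Spark's API: `LazyDataset` and its internals are hypothetical, and only `map`, `filter`, `collect`, `count` and `reduce` are modelled.

```python
from functools import reduce as fold


class LazyDataset:
    """Toy model of lazy transformations and eager actions (hypothetical)."""

    def __init__(self, records, steps=()):
        self._records = tuple(records)
        self._steps = tuple(steps)  # pending transformations, not yet executed

    # Transformations: return a new dataset and do no work yet.
    def map(self, fn):
        return LazyDataset(self._records, self._steps + (("map", fn),))

    def filter(self, predicate):
        return LazyDataset(self._records, self._steps + (("filter", predicate),))

    def _run(self):
        # Chain the pending steps into one lazy iterator over the records.
        out = iter(self._records)
        for kind, fn in self._steps:
            out = map(fn, out) if kind == "map" else filter(fn, out)
        return out

    # Actions: execute the pending steps and return plain data.
    def collect(self):
        return list(self._run())

    def count(self):
        return sum(1 for _ in self._run())

    def reduce(self, fn):
        return fold(fn, self._run())


calls = []


def square(x):
    calls.append(x)  # record when the work actually happens
    return x * x


numbers = LazyDataset(range(1, 6))
even_squares = numbers.map(square).filter(lambda x: x % 2 == 0)
work_before_action = len(calls)  # nothing has been computed yet

result = even_squares.collect()
total = even_squares.reduce(lambda a, b: a + b)
```

In this sketch every action replays the full chain of steps, so `square` runs again for `reduce`; keeping an intermediate result in memory would avoid that recomputation.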
18. The good parts
•Language bindings for Java, Scala and Python
•Works interactively from a shell:
•Scala + IPython (notebook)
•Plays nicely with Hadoop:
•Deploy on top of YARN cluster manager
•Read data from HDFS
•Hadoop-like fault tolerance