2. ABOUT ME
Didier Marin
PhD in Computer Science (UPMC)
Machine Learning, Reinforcement Learning & Robotics
Co-founder of Heuritech
Likes functional programming and distributed computing
3. We develop tools to make sense of raw text data
Customer insight using the text of visited web pages
6. WHY SPARK?
Performance, in particular when
batch size < total RAM in cluster
More general than MapReduce, with a high-level API
Extensions (ML, streaming) and
connectors (Cassandra)
Growing community
10. CLUSTER CONFIGURATION
LXC + Salt
N containers: 1 master/executor + (N-1) executors
Cassandra node for each Spark executor
Using an "uber"-JAR to submit jobs
Sharing data through NFS
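A job packaged in the uber-JAR might look like the following minimal sketch. It assumes the DataStax spark-cassandra-connector is bundled in the JAR; the object name, application name, keyspace and table are hypothetical placeholders, not taken from the talk.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._ // adds cassandraTable to SparkContext

object VisitedPagesJob {
  // Builds the job configuration. The master URL is deliberately left out:
  // it is expected to be supplied at submission time.
  def buildConf(cassandraHost: String): SparkConf =
    new SparkConf()
      .setAppName("visited-pages") // hypothetical application name
      .set("spark.cassandra.connection.host", cassandraHost)

  def main(args: Array[String]): Unit = {
    // Assumption: the Cassandra contact point is passed as the first argument.
    val cassandraHost = args.headOption.getOrElse("127.0.0.1")
    val sc = new SparkContext(buildConf(cassandraHost))
    try {
      // "weblogs" and "visited_pages" are placeholder keyspace and table names.
      val pages = sc.cassandraTable("weblogs", "visited_pages")
      println(s"rows: ${pages.count()}")
    } finally {
      sc.stop()
    }
  }
}
```

Such a job would typically be submitted with spark-submit, passing the main class through --class, the master URL through --master, and the path of the uber-JAR. With a Cassandra node next to each executor, the connector can try to schedule reads on the node that holds the data.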
12. MANAGING SPARK'S MEMORY
Default: 40% working memory, 60% cache (spark.storage.memoryFraction = 0.6)
20% of cache used to unroll blocks (spark.storage.unrollFraction = 0.2)
Explicit caching for huge RDDs we reuse:
validLogs.persist(StorageLevel.MEMORY_AND_DISK)
Partition tuning may be necessary (spark.default.parallelism)
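To make the percentages on this slide concrete, here is a small sketch of the arithmetic in plain Scala, with no Spark dependency. The MemoryBudget type, the fromHeap helper and the 10000 MB heap are illustrative assumptions; Spark's real accounting also applies safety margins that are ignored here.

```scala
// Hypothetical helper, not part of Spark: splits an executor heap
// according to the default percentages quoted on the slide.
final case class MemoryBudget(workingMb: Long, cacheMb: Long, unrollMb: Long)

object MemoryBudget {
  def fromHeap(heapMb: Long, cachePercent: Int = 60, unrollPercent: Int = 20): MemoryBudget = {
    require(heapMb >= 0, "heap size must be non-negative")
    require(cachePercent >= 0 && cachePercent <= 100, "cachePercent must be in [0, 100]")
    require(unrollPercent >= 0 && unrollPercent <= 100, "unrollPercent must be in [0, 100]")
    val cacheMb  = heapMb * cachePercent / 100   // share reserved for cached blocks
    val unrollMb = cacheMb * unrollPercent / 100 // part of the cache used to unroll blocks
    MemoryBudget(workingMb = heapMb - cacheMb, cacheMb = cacheMb, unrollMb = unrollMb)
  }
}

object MemoryBudgetDemo extends App {
  // Illustrative 10000 MB executor heap.
  println(MemoryBudget.fromHeap(10000L)) // prints MemoryBudget(4000,6000,1200)
}
```

With these defaults, a job that caches little leaves most of the 60% idle, which is one reason to persist only the RDDs that are actually reused.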