Gravy Analytics ingests ~17 billion data records daily, improving and refining that data into many data products at various levels of aggregation. To meet the challenges of our product requirements and scale, we constantly evaluate new technologies. Spark has become central to our ability to process ever-increasing amounts of data through our data factory. In late 2017 and throughout 2018, we improved our ability to work with Spark by migrating all Spark jobs to Scala. In this discussion, we'll cover areas of Spark development that were more difficult in Java than in Scala, as well as some of the challenges we met along the way.
REAL-WORLD CONSUMER BEHAVIOR
Where we go is who we are.
The events consumers attend, the places they visit, and where they spend their time translate into intelligence: life stages, lifestyles, affinities, and interests.
INDUSTRY-LEADING CAPABILITIES
We translate the locations that consumers visit, the places they go, and the events they attend into real-world consumer intelligence.
GRAVY SOLUTIONS
• GRAVY AUDIENCES: lets marketers reach engaged consumers based on what they do in real life
  • Lifestyle • Enthusiast • In-Market • Branded • Custom
• GRAVY INSIGHTS: provides brands with in-depth customer and competitive intelligence
  • Foot Traffic • Competitive • Attribution
• GRAVY DAAS: AdmitOne™ verified Visitation, Attendance, Event data and more for use in unique business applications
  • Visitations • Attendances • IP Address • User Agent
THE GRAVY DIFFERENCE
Gravy's patented AdmitOne verification engine delivers the highest-quality location and attendance data in the industry.
• REACH: billions of daily location signals from 250M+ mobile devices
• EVENTS: the largest events database gives context to millions of places and POIs
• VERIFIED: confirmed, deterministic consumer attendances at places and events
SOLUTION
[Architecture diagram: geo-signals from the cloud are distributed, filtered & verified, and merged; device processing then runs a spatial index, the LCO & attendance algorithm, and the persona generator ("lots of Spark jobs!") to produce attendances, detail records, and personas/audiences. Outputs land in datasets in S3 and in Snowflake, and are consumed via Zeppelin/EMR, SQL/R/Excel, Sisense dashboards, and Matillion.]
SUMMARY OF SPARK JOBS
Some of the major Spark jobs that we run:
• Ingest
  • Also validates, removes, and/or flags data based on LDVS output
• Location and Device Verification Service (LDVS)
• Signal Merge / Device Merge
• Persona Generator
• Spatial Indexer
THE CORE PLATFORM
• Environment
  • We currently run ~30 Spark jobs daily
  • On average, per hour: ~1,300 cores and ~10 TiB memory
  • AWS EMR (and spot instances to control costs)
  • Data storage: S3 and Snowflake
• The Code (Platform)
  • ~200k lines Java, ~30k lines Scala
  • Strong domain-driven design influence
  • Many jobs can be run in Spark or stand-alone
• Central orchestration application
  • Custom DAG scheduler (toy sketch below)
  • Responsible for job scheduling, configuring, launching, monitoring, and failure recovery
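As a rough illustration of what such an orchestrator does, here is a toy topological scheduler in Scala. The job names come from the slide above, but everything else is hypothetical and is not Gravy's actual code:

```scala
// Toy DAG scheduler: launch every job whose dependencies have completed,
// then repeat. A real orchestrator would also configure, monitor, and
// recover failed jobs, as described above.
case class JobSpec(name: String, dependsOn: Set[String] = Set.empty)

object ToyDagScheduler extends App {
  val jobs = Seq(
    JobSpec("ingest"),
    JobSpec("ldvs", Set("ingest")),
    JobSpec("signal-merge", Set("ldvs")),
    JobSpec("spatial-indexer", Set("ldvs")),
    JobSpec("persona-generator", Set("signal-merge")))

  var done = Set.empty[String]
  var pending = jobs
  while (pending.nonEmpty) {
    val (runnable, blocked) = pending.partition(_.dependsOn.subsetOf(done))
    require(runnable.nonEmpty, s"cycle or missing dependency: ${blocked.map(_.name)}")
    runnable.foreach(j => println(s"launching ${j.name}")) // stand-in for spark-submit
    done ++= runnable.map(_.name)
    pending = blocked
  }
}
```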
SOFTWARE ARCHITECTURE EVOLUTION
• 2015-2016
  • Targets: 25M sources, 450M events per day (5,500/sec)
  • Java - microservices, DDD, AWS (Kinesis/SQS/EC2/DynamoDB/Redshift/etc.)
• 2016-2017
  • Targets: 100M sources, 4B events per day (40,000/sec)
  • Java - hybrid: Spark 1.6 / microservices (experiments with storage)
• 2017-2018
  • Targets: 200M sources, 10B events per day (100,000/sec)
  • Java - Spark 2.0 / DynamoDB / S3 / Snowflake
• 2018-2019+
  • Targets: 400M+ sources, 25B+ events per day (300,000/sec)
  • Scala - Spark 2.4 / DynamoDB / S3 / Snowflake
FROM RDDs TO DATASETS AND MORE
• We started using Spark before Datasets were a thing
  • The original Spark code was designed around RDDs
• As data scaled, we targeted (easy) ways to improve efficiency
  • After Spark 2.0+, Datasets became more attractive
• What we did
  • Reduced the size of domain types to reduce memory overhead
  • Refactored monolithic Spark jobs into specialized jobs
  • Migrated JSON data to Parquet (with partitions)
  • Transitioned from the RDD API to the Dataset API (sketch below)
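For illustration, a minimal before-and-after of the kind of RDD-to-Dataset migration described; the Signal type and the data are made up:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

// Hypothetical domain type; small case classes keep per-record overhead low.
case class Signal(deviceId: String, lat: Double, lon: Double, ts: Long)

object RddVsDataset extends App {
  val spark = SparkSession.builder().appName("rdd-vs-dataset").master("local[*]").getOrCreate()
  import spark.implicits._

  val signals = Seq(
    Signal("a", 38.9, -77.0, 1L),
    Signal("a", 38.9, -77.1, 2L),
    Signal("b", 40.7, -74.0, 3L))

  // Before: RDD style. The lambdas are opaque to Spark, so no plan optimization.
  val countsRdd = spark.sparkContext.parallelize(signals)
    .map(s => (s.deviceId, 1L))
    .reduceByKey(_ + _)
  countsRdd.collect().foreach(println)

  // After: Dataset style. Catalyst can optimize the plan, and shuffles use
  // a compact binary row format instead of serialized Java objects.
  signals.toDS()
    .groupBy($"deviceId")
    .agg(count("*").as("signals"))
    .show()

  // Partitioned Parquet output, as in the migration described (path hypothetical):
  // signals.toDS().write.partitionBy("deviceId").parquet("s3://bucket/signals/")

  spark.stop()
}
```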
WHY DATASETS?
• Transformations, aggregations, and filters are easier with Datasets
• Improved Dataset performance from Spark 2.0 onward
• Datasets provide an abstraction layer enabling optimized execution plans
• Easier, more fluent interface
• Datasets provide columnar optimization to improve data and shuffle performance
• Enhanced functionality with functions._
• Support for SQL, when necessary
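A short sketch of the last two points, the functions._ helpers and the SQL escape hatch; the Visit type and data are hypothetical:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

case class Visit(deviceId: String, placeId: String, ts: Long) // hypothetical type

object FunctionsAndSql extends App {
  val spark = SparkSession.builder().appName("why-datasets").master("local[*]").getOrCreate()
  import spark.implicits._

  val visits = Seq(Visit("a", "p1", 1L), Visit("a", "p2", 2L), Visit("b", "p1", 3L)).toDS()

  // functions._ keeps aggregations fluent while staying optimizer-friendly.
  visits.groupBy($"placeId")
    .agg(countDistinct($"deviceId").as("uniqueDevices"))
    .orderBy(desc("uniqueDevices"))
    .show()

  // ...and plain SQL is available when it reads better.
  visits.createOrReplaceTempView("visits")
  spark.sql(
    "SELECT placeId, COUNT(DISTINCT deviceId) AS uniqueDevices FROM visits GROUP BY placeId"
  ).show()

  spark.stop()
}
```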
WHY SCALA?
• The Dataset API is available in Java, so why did we switch?
  • Understanding Spark internals or modifying its functionality was difficult without knowing Scala
  • Scala is a cleanly designed language
  • We wanted to avoid the (often cumbersome) Java API
• Our initial experiments with Scala proved its ease of use
  • Case classes resulted in easier serialization and better shuffle performance (example below)
  • Immutable types provided better garbage collection
  • Use of the Spark REPL enabled faster prototyping
• Scala's tools and libraries have matured significantly
  • Lots of best practices available
• Understanding Scala gives the team a deeper understanding of the underlying Spark code
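One reason case classes are pleasant here: a one-line case class gets a Spark Product encoder for free, where the Java equivalent needs a full bean plus an explicit Encoders.bean(...) call. A minimal sketch (the Device type is made up):

```scala
import org.apache.spark.sql.SparkSession

// One immutable line replaces a Java bean with fields, getters, setters,
// equals/hashCode, and an explicit Encoders.bean(...) registration.
case class Device(id: String, os: String, firstSeen: Long)

object CaseClassEncoders extends App {
  val spark = SparkSession.builder().appName("case-class-encoders").master("local[*]").getOrCreate()
  import spark.implicits._ // brings Product encoders into scope

  // The implicit encoder serializes Device into Spark's compact internal
  // row format for shuffles; no custom serializer registration required.
  val devices = Seq(Device("d1", "ios", 1L), Device("d2", "android", 2L)).toDS()
  devices.filter(_.os == "ios").show()
  spark.stop()
}
```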
CHALLENGES: SCALA
• The switch was worth it - but it wasn't without a cost
1. Lack of Experience
  • Initially we had only one developer with Scala experience
2. Large Amounts of Legacy Java Code
  • We have taken a staged approach; still a large effort
3. Shift in Coding Mentality
  • Embracing a more functional coding style requires changing how we think about problems
UNIT TESTING
• Transitioning from JUnit to ScalaTest
• Lack of Experience
  • Another scenario where the development team needed to ramp up on new technology
• DataMapper
  • We have a homegrown library called the DataMapper which allows us to generate test data at runtime from annotations on our unit tests
  • The Java version of this library relied on reflection and did not play nicely with case classes
  • Eventually we produced a Scala / ScalaTest-compatible, trait-based version (sketched below)
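DataMapper itself is internal to Gravy, but a trait-based test-data mixin of the kind described might look roughly like this; every name below is hypothetical and is not the actual library:

```scala
import org.scalatest.funsuite.AnyFunSuite

case class Signal(deviceId: String, lat: Double, lon: Double, ts: Long)

// Hypothetical stand-in for the trait-based DataMapper: a mixin that builds
// case-class test data at runtime instead of relying on Java annotations
// and reflection.
trait TestDataFactory {
  private val rng = new scala.util.Random(42) // fixed seed keeps tests repeatable
  def randomId(): String = rng.alphanumeric.take(8).mkString
  def makeSignal(deviceId: String = randomId(),
                 lat: Double = -90.0 + rng.nextDouble() * 180.0,
                 lon: Double = -180.0 + rng.nextDouble() * 360.0): Signal =
    Signal(deviceId, lat, lon, ts = rng.nextLong())
}

class SignalSpec extends AnyFunSuite with TestDataFactory {
  test("generated signals have coordinates in range") {
    val s = makeSignal()
    assert(s.lat >= -90.0 && s.lat <= 90.0)
    assert(s.lon >= -180.0 && s.lon <= 180.0)
  }
}
```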
HIRING/GOING FORWARD
• To drive home the fact that we are no longer a Java-only shop, we have modified our job listings to include Scala as a preferred language prerequisite.
• It was challenging at first to evaluate candidates' Scala skills, as we were novices ourselves.
• As we continue to ramp up on Scala, we have started to branch out from using it only for Spark to using it for web services (Play Framework) and to replace some of our legacy utility libraries.
• We think we are now better positioned to quickly take advantage of newer features coming down the Spark pipeline.
Extra: Scala Likes
• Greatly streamlined syntax
• Easier use with Spark
  • Easy, fast serialization of case classes during shuffles
  • Built-in Product type encoders
• Built-in tuple types
• Built-in anonymous functions
• Options instead of nulls
• Pattern matching instead of switch statements
• IntelliJ Scala support
• Simpler Futures
• "Duck-typing"
• Advanced reflection
• Functional exception handling
  • Syntactic sugar
  • Lots of helpers: Option, Try, Success, Failure, Either, etc.
• Everything is a function => more flexibility
• Easier generics (less type erasure)
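A few of these likes in one toy snippet (illustrative only):

```scala
import scala.util.{Failure, Success, Try}

object ScalaLikes extends App {
  // Options instead of nulls, consumed with pattern matching.
  val ids = Map("alpha" -> 1, "beta" -> 2)
  ids.get("gamma") match {
    case Some(id) => println(s"found $id")
    case None     => println("not found")
  }

  // Functional exception handling with Try / Success / Failure.
  def parse(s: String): Try[Int] = Try(s.toInt)
  parse("42") match {
    case Success(n)  => println(s"parsed $n")
    case Failure(ex) => println(s"failed: ${ex.getMessage}")
  }

  // Built-in tuples and anonymous functions.
  val doubled = List(1, 2, 3).map(n => (n, n * 2))
  println(doubled) // List((1,2), (2,4), (3,6))
}
```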
Extra: Scala Dislikes
• Untyped vals
• Lots of special symbols
• Library complexity
  • Akka and Typesafe libraries
  • JSON parsing libraries (incompatibility with Gson, complex Scala libs)
• Java compatibility
  • Companion object wrapping
  • Bean serialization
  • Default to Seq for ordered collections (instead of the ideal data structure for the job)
• Gradle vs. SBT
• Overuse of implicit "magic"
• Difficult learning curve (lots to learn!!)
• Too much flexibility can create inconsistent and confusing code
• Opaque compilation errors
• Missing named tuples (as in Python, e.g.)
• Enumerations are broken
Extra: Scala Challenges
• Immutable types instead of mutable types
• Collection syntax sugar
  • Chaining functions causes lots of type headaches
• Syntactic sugar
• Using recursion (with @tailrec) instead of procedural loops (example below)
• Pattern matching
• Using small functions to keep code readable
• Reflection, type tags, and class tags
• Curried functions
• Partial functions
• Unfamiliar type system
• OO paradigms don't translate well (have to research the correct way of doing things)
• Lots to learn!!
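For instance, the recursion-with-@tailrec idiom that replaces a procedural loop (toy example):

```scala
import scala.annotation.tailrec

object TailRecExample extends App {
  // A while-loop accumulation rewritten as tail recursion; @tailrec makes the
  // compiler verify the recursive call is in tail position, so it compiles
  // down to a loop with no stack growth.
  @tailrec
  def sumDigits(n: Long, acc: Long = 0L): Long =
    if (n == 0) acc
    else sumDigits(n / 10, acc + n % 10)

  println(sumDigits(98765L)) // 35
}
```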