1| gravyanalytics.com
Transitioning from Java to Scala for Spark
Guy DeCorte, Founder & CTO
Aaron Perrin, Senior Software Developer
March 13, 2019
2| gravyanalytics.com
Where we go is who we are.
REAL-WORLD CONSUMER BEHAVIOR
LIFE STAGES
LIFESTYLES
AFFINITIES
INTERESTS
The events consumers attend,
the places they visit,
where they spend their time,
translate into intelligence
3| gravyanalytics.com
We translate the locations that consumers visit, the places they go, and the
events they attend into real-world consumer intelligence
INDUSTRY-LEADING CAPABILITIES
4| gravyanalytics.com
GRAVY SOLUTIONS
GRAVY AUDIENCES
Gravy Audiences let marketers reach engaged
consumers based on what they do in real life
• Lifestyle • Enthusiast
• In-Market • Branded • Custom
GRAVY INSIGHTS
Gravy Insights provides brands with in-depth
customer and competitive intelligence
• Foot Traffic • Competitive
• Attribution
GRAVY DAAS
AdmitOne™ verified Visitation, Attendance,
Event data and more for use in unique business
applications
• Visitations • Attendances
• IP Address • User Agent
5| gravyanalytics.com
Gravy’s patented AdmitOne verification engine delivers the
highest-quality location and attendance data in the industry
THE GRAVY DIFFERENCE
REACH
Billions of daily location signals from 250M+ mobile devices
EVENTS
The largest events database gives context to millions of places and POIs
VERIFIED
Confirmed, deterministic consumer attendances at places and events
6| gravyanalytics.com
SOLUTION
[Architecture diagram. Labels: Geo-signals; Cloud; Distribute; Filter & Verify; Merge;
Spatial Index; LCO & Attendance Algorithm; Persona Generator; Device Processing
("Lots of Spark jobs!"). Outputs: Attendances; Detail Records; Personas / Audiences; Devices.
Storage and tooling: Datasets in S3; Snowflake; Zeppelin/EMR; Matillion;
SQL, R, Excel; Dashboards (Sisense).]
7| gravyanalytics.com
Some of the major Spark jobs that we run:
• Ingest
• Also validates, removes and/or flags data based on LDVS output
• Location and Device Verification Service (LDVS)
• Signal Merge / Device Merge
• Persona Generator
• Spatial Indexer
SUMMARY OF SPARK JOBS
8| gravyanalytics.com
What Does Our Platform Look Like?
9| gravyanalytics.com
• Environment
• We currently run ~30 Spark jobs daily
• On average, per hour: ~1300 cores and ~10 TiB memory
• AWS EMR (and spot instances to control costs)
• Data storage: S3 and Snowflake
• The Code (Platform)
• ~200k lines Java, ~30k lines Scala
• Strong domain-driven-design influence
• Many jobs can be run in Spark or stand-alone
• Central orchestration application
• Custom DAG scheduler
• Responsible for job scheduling, configuring, launching,
monitoring, and failure recovery
THE CORE PLATFORM
10| gravyanalytics.com
• 2015-2016
• Targets: 25M sources, 450M events per day (5500/sec)
• Java - Microservices, DDD, AWS (Kinesis/SQS/EC2/DynamoDB/Redshift/etc)
• 2016-2017
• Targets: 100M sources, 4B events per day (40,000/sec)
• Java - Hybrid: Spark 1.6 / Microservices (experiments with storage)
• 2017-2018
• Targets: 200M sources, 10B events per day (100,000/sec)
• Java - Spark 2.0 / DynamoDB / S3 / Snowflake
• 2018-2019+
• Targets: 400M+ sources, 25B+ events per day (300,000/sec)
• Scala - Spark 2.4 / DynamoDB / S3 / Snowflake
SOFTWARE ARCHITECTURE EVOLUTION
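As a rough check on the scale of these targets, the sketch below converts the events-per-day figures from the slide into average events per second. The object and method names are illustrative assumptions. The per-second figures on the slide are rounded targets, so they differ somewhat from these daily averages.

```scala
object ThroughputTargets {
  // Seconds in one day: 24 * 60 * 60 = 86400.
  val SecondsPerDay: Long = 24L * 60L * 60L

  // Average events per second for a given daily volume (integer division).
  def avgPerSecond(eventsPerDay: Long): Long = eventsPerDay / SecondsPerDay

  // Daily event targets taken from the slide, keyed by period.
  val dailyTargets: Seq[(String, Long)] = Seq(
    "2015-2016"  -> 450000000L,
    "2016-2017"  -> 4000000000L,
    "2017-2018"  -> 10000000000L,
    "2018-2019+" -> 25000000000L
  )

  def main(args: Array[String]): Unit =
    dailyTargets.foreach { case (period, perDay) =>
      println(s"$period: ~${avgPerSecond(perDay)} events/sec on average")
    }
}
```

Averaged over a full day, the four targets come to roughly 5,200, 46,300, 115,700, and 289,400 events per second.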
11| gravyanalytics.com
• We started using Spark before datasets were a thing
• The original Spark code was designed around RDDs
• As data scaled, we targeted (easy) ways to improve efficiency
• From Spark 2.0 onward, Datasets became more attractive
• What we did
• Reduced size of domain types to reduce memory overhead
• Refactored monolithic Spark jobs into specialized jobs
• Migrated JSON data to Parquet (with partitions)
• Transitioned from RDD API to Dataset API
FROM RDDs TO DATASETS AND MORE
12| gravyanalytics.com
• Transformations, aggregations, and filters
are easier with Datasets
• Improved Dataset performance from Spark
2.0 onward
• Datasets provide an abstraction layer
enabling optimized execution plans
• Easier, more fluent interface
• Datasets provide columnar optimization to
improve data and shuffling performance
• Enhanced functionality with functions._
• Support for SQL, when necessary
WHY DATASETS?
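The points above can be illustrated with a minimal sketch of the Dataset API in Scala. It is not code from the Gravy platform: the `Signal` case class, its fields, and the sample rows are assumptions made up for illustration. It assumes Spark 2.x or later with the `spark-sql` module on the classpath, and runs in local mode.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.functions._

// Hypothetical domain type; the field names are assumptions for illustration.
final case class Signal(deviceId: String, placeId: String, dwellSeconds: Long)

object DatasetSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("dataset-sketch")
      .master("local[*]")
      .getOrCreate()
    // Brings in encoders for case classes and the $"column" syntax.
    import spark.implicits._

    val signals: Dataset[Signal] = Seq(
      Signal("d1", "p1", 120L),
      Signal("d1", "p1", 30L),
      Signal("d2", "p2", 600L)
    ).toDS()

    // A typed filter followed by an aggregation built from functions._
    val visits = signals
      .filter(_.dwellSeconds >= 60L)
      .groupBy($"placeId")
      .agg(
        countDistinct($"deviceId").as("devices"),
        sum($"dwellSeconds").as("totalDwell")
      )
    visits.show()

    // The same data can be queried with SQL when that is more convenient.
    signals.createOrReplaceTempView("signals")
    spark.sql("SELECT placeId, COUNT(*) AS n FROM signals GROUP BY placeId").show()

    spark.stop()
  }
}
```

Because the filter, grouping, and aggregation are expressed through the Dataset API rather than opaque RDD closures, Spark can plan and optimize the whole query before running it.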
13| gravyanalytics.com
• The Dataset API is available in Java, so why
did we switch?
• Understanding Spark internals or modifying its
functionality was difficult without knowing Scala
• Scala is a cleanly-designed language
• We wanted to avoid the (often cumbersome) Java API
• Our initial experiments with Scala proved its ease of use
• Case classes resulted in easier serialization and better
shuffling performance
• Immutable types provided better garbage collection
• Use of Spark REPL enabled faster prototyping
• Scala's tools and libraries have matured significantly
• Lots of best practices available
• Understanding Scala gives the team a deeper understanding of
the underlying Spark code
WHY SCALA?
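To make the case-class and immutability points concrete, here is a small plain-Scala sketch with no Spark dependency. The `Visit` type and its fields are hypothetical. The comparable Java class of that era would need a hand-written constructor, getters, `equals`, `hashCode`, and `toString`.

```scala
// Hypothetical domain type; names are assumptions for illustration.
final case class Visit(deviceId: String, placeId: String, durationSec: Long)

object CaseClassSketch {
  def main(args: Array[String]): Unit = {
    val v1 = Visit("d1", "p1", 120L)

    // Fields are immutable; copy() returns a new instance with selected fields changed.
    val v2 = v1.copy(durationSec = 300L)

    // equals, hashCode, and toString are generated by the compiler,
    // so two instances with the same field values compare equal.
    println(v1 == Visit("d1", "p1", 120L))
    println(v2)

    // Case classes are Products, so their field values can be enumerated generically.
    println(v1.productIterator.toList)
  }
}
```

Since every case class is a `Product`, Spark can derive an encoder for it without per-class serialization code.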
14| gravyanalytics.com
• The switch was worth it - but it
wasn't without a cost
1. Lack of Experience
• Initially we had only one developer with
Scala experience
2. Large Amounts of Legacy Java Code
• We have taken a staged approach, but it is still a
large effort
3. Shift in Coding Mentality
• Embracing a more functional coding style
requires changing how we think about
problems
CHALLENGES: SCALA
15| gravyanalytics.com
AN EXAMPLE: JAVA RDD
16| gravyanalytics.com
AN EXAMPLE: SCALA DATASET
17| gravyanalytics.com
UNIT TESTING
• Transitioning from JUnit to
ScalaTest
• Lack of Experience
• Another scenario where the development team
needed to ramp up on new technology
• DataMapper
• We have a homegrown library called the
DataMapper which allows us to generate test data
at runtime from annotations on our unit tests
• The Java version of this library relied on
reflection and did not play nice with case classes
• Eventually we produced a Scala / ScalaTest
compatible trait-based version
18| gravyanalytics.com
HIRING/GOING FORWARD
• Driving home the fact that we are no longer a Java-only shop, we have modified our
job listings to include Scala as a preferred language prerequisite.
• Challenging at first to evaluate candidates' Scala skills as we were novices ourselves.
• As we continue to ramp up on Scala, we have started to branch out from using it only
for Spark to using it for web services (Play Framework) as well as to replace some of
our legacy utility libraries.
• We think we are now better positioned to quickly take advantage of newer features
coming down the Spark pipeline.
19| gravyanalytics.com
DISCUSSION
QUESTIONS?
20| gravyanalytics.com
• Greatly streamlined syntax
• Easier use with Spark
• Easy, fast serialization of case classes during shuffles
• Built-in Product type encoders
• Built-in tuple types
• Built-in anonymous functions
• Options instead of nulls
• Pattern matching instead of switch statements
• IntelliJ Scala support
• Simpler Futures
• “Duck-typing”
• Advanced reflection
• Functional exception handling
• Syntactic sugar
• Lots of helpers: Option, Try, Success, Failure, Either, etc.
• Everything is a function => more flexibility
• Easier generics (class tags and type tags help work around type erasure)
Extra: Scala Likes
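A few of the items above (Options instead of nulls, pattern matching, and functional exception handling with `Try`) can be sketched in plain Scala as follows. The function names and sample data are illustrative assumptions, not part of the original talk.

```scala
import scala.util.{Failure, Success, Try}

object LikesSketch {
  // Option instead of null: a missing key yields None rather than a null reference.
  def lookupPlace(places: Map[String, String], id: String): Option[String] =
    places.get(id)

  // Functional exception handling: Try captures a thrown exception as a value.
  def parseCount(raw: String): Try[Int] = Try(raw.trim.toInt)

  // Pattern matching (with guards) instead of a switch statement.
  def describe(result: Try[Int]): String = result match {
    case Success(n) if n > 0 => s"positive count: $n"
    case Success(n)          => s"non-positive count: $n"
    case Failure(e)          => s"not a number: ${e.getClass.getSimpleName}"
  }

  def main(args: Array[String]): Unit = {
    val places = Map("p1" -> "Stadium")
    println(lookupPlace(places, "p1").getOrElse("unknown"))
    println(lookupPlace(places, "p9").getOrElse("unknown"))
    println(describe(parseCount(" 42 ")))
    println(describe(parseCount("n/a")))
  }
}
```

The caller has to handle the empty and failure cases explicitly, which is the main practical difference from returning `null` or throwing.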
21| gravyanalytics.com
• Untyped vals
• Lots of special symbols
• Library complexity
• Akka and Typesafe libraries
• JSON parsing libraries (incompatibility with Gson, complex Scala libs)
• Java compatibility
• Companion object wrapping
• Bean serialization
• Default to Seq for ordered collections (instead of ideal data structure for the job)
• Gradle vs. SBT
• Overuse of implicit “magic”
• Difficult learning curve (lots to learn!!)
• Too much flexibility can create inconsistent and confusing code
• Opaque compilation errors
• Missing named tuples (as in Python)
• Enumerations are broken
Extra: Scala Dislikes
22| gravyanalytics.com
• Immutable types instead of mutable types
• Collection syntax sugar
• Chaining functions causes lots of type headaches
• Syntactic sugar
• Using recursion (with @tailrec) instead of procedural
• Pattern matching
• Using small functions to keep code readable
• Reflection, type tags, and class tags
• Curried functions
• Partial functions
• Unfamiliar type system
• OO Paradigms don’t translate well (have to research correct way of doing things)
• Lots to learn!!
Extra: Scala challenges
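Three of the idioms listed above that tend to be unfamiliar to Java developers are tail recursion, curried functions, and partial functions. The plain-Scala sketch below shows each in a few lines; all names and values are hypothetical.

```scala
import scala.annotation.tailrec

object ChallengesSketch {
  // Recursion instead of a procedural loop. @tailrec makes the compiler reject
  // the method unless the recursive call is in tail position.
  def sumDurations(durations: List[Long]): Long = {
    @tailrec
    def loop(rest: List[Long], acc: Long): Long = rest match {
      case Nil          => acc
      case head :: tail => loop(tail, acc + head)
    }
    loop(durations, 0L)
  }

  // Curried function: parameters are supplied one parameter list at a time.
  def atLeast(minSec: Long)(durationSec: Long): Boolean = durationSec >= minSec

  // Partial function: defined only for the inputs matched by its cases.
  val longVisitLabel: PartialFunction[Long, String] = {
    case d if d >= 600L => s"long visit ($d s)"
  }

  def main(args: Array[String]): Unit = {
    val durations = List(30L, 120L, 900L)
    println(sumDurations(durations))

    // Supplying only the first parameter list yields a Long => Boolean function.
    val isQualifying: Long => Boolean = atLeast(60L)
    println(durations.filter(isQualifying))

    // collect applies the partial function only where it is defined.
    println(durations.collect(longVisitLabel))
  }
}
```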
23| gravyanalytics.com
Aaron Perrin, Senior Software Developer
703-840-8850
aperrin@gravyanalytics.com

More Related Content

What's hot

50 Shades of Data - how, when and why Big,Relational,NoSQL,Elastic,Event,CQRS...
50 Shades of Data - how, when and why Big,Relational,NoSQL,Elastic,Event,CQRS...50 Shades of Data - how, when and why Big,Relational,NoSQL,Elastic,Event,CQRS...
50 Shades of Data - how, when and why Big,Relational,NoSQL,Elastic,Event,CQRS...Lucas Jellema
 
Deep Learning and Streaming in Apache Spark 2.x with Matei Zaharia
Deep Learning and Streaming in Apache Spark 2.x with Matei ZahariaDeep Learning and Streaming in Apache Spark 2.x with Matei Zaharia
Deep Learning and Streaming in Apache Spark 2.x with Matei ZahariaJen Aman
 
Microservices in the Enterprise
Microservices in the Enterprise Microservices in the Enterprise
Microservices in the Enterprise Jesus Rodriguez
 
Scala Security: Eliminate 200+ Code-Level Threats With Fortify SCA For Scala
Scala Security: Eliminate 200+ Code-Level Threats With Fortify SCA For ScalaScala Security: Eliminate 200+ Code-Level Threats With Fortify SCA For Scala
Scala Security: Eliminate 200+ Code-Level Threats With Fortify SCA For ScalaLightbend
 
The (not so) Dark Art of Atlassian Performance Tuning
The (not so) Dark Art of Atlassian Performance TuningThe (not so) Dark Art of Atlassian Performance Tuning
The (not so) Dark Art of Atlassian Performance Tuningcolleenfry
 
Cisco's MultiCloud Strategy
Cisco's MultiCloud StrategyCisco's MultiCloud Strategy
Cisco's MultiCloud StrategyMaulik Shyani
 
Agile infrastructure
Agile infrastructureAgile infrastructure
Agile infrastructureTarun Rajput
 
Business and IT agility through DevOps and microservice architecture powered ...
Business and IT agility through DevOps and microservice architecture powered ...Business and IT agility through DevOps and microservice architecture powered ...
Business and IT agility through DevOps and microservice architecture powered ...Lucas Jellema
 
Facilitating continuous delivery in a FinTech world with Salt, Jenkins, Nexus...
Facilitating continuous delivery in a FinTech world with Salt, Jenkins, Nexus...Facilitating continuous delivery in a FinTech world with Salt, Jenkins, Nexus...
Facilitating continuous delivery in a FinTech world with Salt, Jenkins, Nexus...Chocolatey Software
 
Scala and Spark are Ideal for Big Data
Scala and Spark are Ideal for Big DataScala and Spark are Ideal for Big Data
Scala and Spark are Ideal for Big DataJohn Nestor
 
Big ideas in small packages - How microservices helped us to scale our vision
Big ideas in small packages  - How microservices helped us to scale our visionBig ideas in small packages  - How microservices helped us to scale our vision
Big ideas in small packages - How microservices helped us to scale our visionSebastian Schleicher
 
Liferay & Big Data Dev Con 2014
Liferay & Big Data Dev Con 2014Liferay & Big Data Dev Con 2014
Liferay & Big Data Dev Con 2014Miguel Pastor
 
Cloudstate - Towards Stateful Serverless
Cloudstate - Towards Stateful ServerlessCloudstate - Towards Stateful Serverless
Cloudstate - Towards Stateful ServerlessLightbend
 
Microservices, DevOps & SRE
Microservices, DevOps & SREMicroservices, DevOps & SRE
Microservices, DevOps & SREAraf Karsh Hamid
 
Automated Configuration & Deployment of Atlassian Applications
Automated Configuration & Deployment of Atlassian ApplicationsAutomated Configuration & Deployment of Atlassian Applications
Automated Configuration & Deployment of Atlassian Applicationscolleenfry
 
A Cloud- and Container-Based Approach to Microservices-Powered Workflows (Cod...
A Cloud- and Container-Based Approach to Microservices-Powered Workflows (Cod...A Cloud- and Container-Based Approach to Microservices-Powered Workflows (Cod...
A Cloud- and Container-Based Approach to Microservices-Powered Workflows (Cod...Lucas Jellema
 
Event Bus as Backbone for Decoupled Microservice Choreography (JFall 2017)
Event Bus as Backbone for Decoupled Microservice Choreography (JFall 2017)Event Bus as Backbone for Decoupled Microservice Choreography (JFall 2017)
Event Bus as Backbone for Decoupled Microservice Choreography (JFall 2017)Lucas Jellema
 
Agile, User Stories, Domain Driven Design
Agile, User Stories, Domain Driven DesignAgile, User Stories, Domain Driven Design
Agile, User Stories, Domain Driven DesignAraf Karsh Hamid
 
It’s All About Adoption: How Gilead Sciences Forged a Path to Accelerate Value
It’s All About Adoption: How Gilead Sciences Forged a Path to Accelerate ValueIt’s All About Adoption: How Gilead Sciences Forged a Path to Accelerate Value
It’s All About Adoption: How Gilead Sciences Forged a Path to Accelerate ValueScout RFP
 
Yow Conference Dec 2013 Netflix Workshop Slides with Notes
Yow Conference Dec 2013 Netflix Workshop Slides with NotesYow Conference Dec 2013 Netflix Workshop Slides with Notes
Yow Conference Dec 2013 Netflix Workshop Slides with NotesAdrian Cockcroft
 

What's hot (20)

50 Shades of Data - how, when and why Big,Relational,NoSQL,Elastic,Event,CQRS...
50 Shades of Data - how, when and why Big,Relational,NoSQL,Elastic,Event,CQRS...50 Shades of Data - how, when and why Big,Relational,NoSQL,Elastic,Event,CQRS...
50 Shades of Data - how, when and why Big,Relational,NoSQL,Elastic,Event,CQRS...
 
Deep Learning and Streaming in Apache Spark 2.x with Matei Zaharia
Deep Learning and Streaming in Apache Spark 2.x with Matei ZahariaDeep Learning and Streaming in Apache Spark 2.x with Matei Zaharia
Deep Learning and Streaming in Apache Spark 2.x with Matei Zaharia
 
Microservices in the Enterprise
Microservices in the Enterprise Microservices in the Enterprise
Microservices in the Enterprise
 
Scala Security: Eliminate 200+ Code-Level Threats With Fortify SCA For Scala
Scala Security: Eliminate 200+ Code-Level Threats With Fortify SCA For ScalaScala Security: Eliminate 200+ Code-Level Threats With Fortify SCA For Scala
Scala Security: Eliminate 200+ Code-Level Threats With Fortify SCA For Scala
 
The (not so) Dark Art of Atlassian Performance Tuning
The (not so) Dark Art of Atlassian Performance TuningThe (not so) Dark Art of Atlassian Performance Tuning
The (not so) Dark Art of Atlassian Performance Tuning
 
Cisco's MultiCloud Strategy
Cisco's MultiCloud StrategyCisco's MultiCloud Strategy
Cisco's MultiCloud Strategy
 
Agile infrastructure
Agile infrastructureAgile infrastructure
Agile infrastructure
 
Business and IT agility through DevOps and microservice architecture powered ...
Business and IT agility through DevOps and microservice architecture powered ...Business and IT agility through DevOps and microservice architecture powered ...
Business and IT agility through DevOps and microservice architecture powered ...
 
Facilitating continuous delivery in a FinTech world with Salt, Jenkins, Nexus...
Facilitating continuous delivery in a FinTech world with Salt, Jenkins, Nexus...Facilitating continuous delivery in a FinTech world with Salt, Jenkins, Nexus...
Facilitating continuous delivery in a FinTech world with Salt, Jenkins, Nexus...
 
Scala and Spark are Ideal for Big Data
Scala and Spark are Ideal for Big DataScala and Spark are Ideal for Big Data
Scala and Spark are Ideal for Big Data
 
Big ideas in small packages - How microservices helped us to scale our vision
Big ideas in small packages  - How microservices helped us to scale our visionBig ideas in small packages  - How microservices helped us to scale our vision
Big ideas in small packages - How microservices helped us to scale our vision
 
Liferay & Big Data Dev Con 2014
Liferay & Big Data Dev Con 2014Liferay & Big Data Dev Con 2014
Liferay & Big Data Dev Con 2014
 
Cloudstate - Towards Stateful Serverless
Cloudstate - Towards Stateful ServerlessCloudstate - Towards Stateful Serverless
Cloudstate - Towards Stateful Serverless
 
Microservices, DevOps & SRE
Microservices, DevOps & SREMicroservices, DevOps & SRE
Microservices, DevOps & SRE
 
Automated Configuration & Deployment of Atlassian Applications
Automated Configuration & Deployment of Atlassian ApplicationsAutomated Configuration & Deployment of Atlassian Applications
Automated Configuration & Deployment of Atlassian Applications
 
A Cloud- and Container-Based Approach to Microservices-Powered Workflows (Cod...
A Cloud- and Container-Based Approach to Microservices-Powered Workflows (Cod...A Cloud- and Container-Based Approach to Microservices-Powered Workflows (Cod...
A Cloud- and Container-Based Approach to Microservices-Powered Workflows (Cod...
 
Event Bus as Backbone for Decoupled Microservice Choreography (JFall 2017)
Event Bus as Backbone for Decoupled Microservice Choreography (JFall 2017)Event Bus as Backbone for Decoupled Microservice Choreography (JFall 2017)
Event Bus as Backbone for Decoupled Microservice Choreography (JFall 2017)
 
Agile, User Stories, Domain Driven Design
Agile, User Stories, Domain Driven DesignAgile, User Stories, Domain Driven Design
Agile, User Stories, Domain Driven Design
 
It’s All About Adoption: How Gilead Sciences Forged a Path to Accelerate Value
It’s All About Adoption: How Gilead Sciences Forged a Path to Accelerate ValueIt’s All About Adoption: How Gilead Sciences Forged a Path to Accelerate Value
It’s All About Adoption: How Gilead Sciences Forged a Path to Accelerate Value
 
Yow Conference Dec 2013 Netflix Workshop Slides with Notes
Yow Conference Dec 2013 Netflix Workshop Slides with NotesYow Conference Dec 2013 Netflix Workshop Slides with Notes
Yow Conference Dec 2013 Netflix Workshop Slides with Notes
 

Similar to Transitioning from Java to Scala for Spark - March 13, 2019

IncQuery Server for Teamwork Cloud - Talk at IW2019
IncQuery Server for Teamwork Cloud - Talk at IW2019IncQuery Server for Teamwork Cloud - Talk at IW2019
IncQuery Server for Teamwork Cloud - Talk at IW2019Istvan Rath
 
Whitepages Practical Experience Converting from Ruby to Reactive
Whitepages Practical Experience Converting from Ruby to ReactiveWhitepages Practical Experience Converting from Ruby to Reactive
Whitepages Practical Experience Converting from Ruby to ReactiveDragos Manolescu
 
Experience Converting from Ruby to Scala
Experience Converting from Ruby to ScalaExperience Converting from Ruby to Scala
Experience Converting from Ruby to ScalaJohn Nestor
 
Stardog 1.1: An Easier, Smarter, Faster RDF Database
Stardog 1.1: An Easier, Smarter, Faster RDF DatabaseStardog 1.1: An Easier, Smarter, Faster RDF Database
Stardog 1.1: An Easier, Smarter, Faster RDF Databasekendallclark
 
Stardog 1.1: Easier, Smarter, Faster RDF Database
Stardog 1.1: Easier, Smarter, Faster RDF DatabaseStardog 1.1: Easier, Smarter, Faster RDF Database
Stardog 1.1: Easier, Smarter, Faster RDF DatabaseClark & Parsia LLC
 
Using SparkML to Power a DSaaS (Data Science as a Service): Spark Summit East...
Using SparkML to Power a DSaaS (Data Science as a Service): Spark Summit East...Using SparkML to Power a DSaaS (Data Science as a Service): Spark Summit East...
Using SparkML to Power a DSaaS (Data Science as a Service): Spark Summit East...Spark Summit
 
Scala and Spark are Ideal for Big Data - Data Science Pop-up Seattle
Scala and Spark are Ideal for Big Data - Data Science Pop-up SeattleScala and Spark are Ideal for Big Data - Data Science Pop-up Seattle
Scala and Spark are Ideal for Big Data - Data Science Pop-up SeattleDomino Data Lab
 
Wikidsmart PM: Requirements Management within Confluence, Integrated with JIRA
Wikidsmart PM: Requirements Management within Confluence, Integrated with JIRAWikidsmart PM: Requirements Management within Confluence, Integrated with JIRA
Wikidsmart PM: Requirements Management within Confluence, Integrated with JIRAzAgile
 
Evolving IGN’s New APIs with Scala
 Evolving IGN’s New APIs with Scala Evolving IGN’s New APIs with Scala
Evolving IGN’s New APIs with ScalaManish Pandit
 
Introduction to Apache Geode (Cork, Ireland)
Introduction to Apache Geode (Cork, Ireland)Introduction to Apache Geode (Cork, Ireland)
Introduction to Apache Geode (Cork, Ireland)Anthony Baker
 
Apache Geode Meetup, Cork, Ireland at CIT
Apache Geode Meetup, Cork, Ireland at CITApache Geode Meetup, Cork, Ireland at CIT
Apache Geode Meetup, Cork, Ireland at CITApache Geode
 
Play Architecture, Implementation, Shiny Objects, and a Proposal
Play Architecture, Implementation, Shiny Objects, and a ProposalPlay Architecture, Implementation, Shiny Objects, and a Proposal
Play Architecture, Implementation, Shiny Objects, and a ProposalMike Slinn
 
Big data pipeline with scala by Rohit Rai, Tuplejump - presented at Pune Scal...
Big data pipeline with scala by Rohit Rai, Tuplejump - presented at Pune Scal...Big data pipeline with scala by Rohit Rai, Tuplejump - presented at Pune Scal...
Big data pipeline with scala by Rohit Rai, Tuplejump - presented at Pune Scal...Thoughtworks
 
Big Data pipeline with Scala by Rohit Rai, Tuplejump - presented at Pune Scal...
Big Data pipeline with Scala by Rohit Rai, Tuplejump - presented at Pune Scal...Big Data pipeline with Scala by Rohit Rai, Tuplejump - presented at Pune Scal...
Big Data pipeline with Scala by Rohit Rai, Tuplejump - presented at Pune Scal...Thoughtworks
 
Sledgehammer to Fine Brush for QA
Sledgehammer to Fine Brush for QASledgehammer to Fine Brush for QA
Sledgehammer to Fine Brush for QAShelley Lambert
 
Stay productive_while_slicing_up_the_monolith
Stay productive_while_slicing_up_the_monolithStay productive_while_slicing_up_the_monolith
Stay productive_while_slicing_up_the_monolithMarkus Eisele
 
Spark introduction and architecture
Spark introduction and architectureSpark introduction and architecture
Spark introduction and architectureSohil Jain
 

Similar to Transitioning from Java to Scala for Spark - March 13, 2019 (20)

IncQuery Server for Teamwork Cloud - Talk at IW2019
IncQuery Server for Teamwork Cloud - Talk at IW2019IncQuery Server for Teamwork Cloud - Talk at IW2019
IncQuery Server for Teamwork Cloud - Talk at IW2019
 
Whitepages Practical Experience Converting from Ruby to Reactive
Whitepages Practical Experience Converting from Ruby to ReactiveWhitepages Practical Experience Converting from Ruby to Reactive
Whitepages Practical Experience Converting from Ruby to Reactive
 
Experience Converting from Ruby to Scala
Experience Converting from Ruby to ScalaExperience Converting from Ruby to Scala
Experience Converting from Ruby to Scala
 
Stardog 1.1: An Easier, Smarter, Faster RDF Database
Stardog 1.1: An Easier, Smarter, Faster RDF DatabaseStardog 1.1: An Easier, Smarter, Faster RDF Database
Stardog 1.1: An Easier, Smarter, Faster RDF Database
 
Stardog 1.1: Easier, Smarter, Faster RDF Database
Stardog 1.1: Easier, Smarter, Faster RDF DatabaseStardog 1.1: Easier, Smarter, Faster RDF Database
Stardog 1.1: Easier, Smarter, Faster RDF Database
 
Using SparkML to Power a DSaaS (Data Science as a Service): Spark Summit East...
Using SparkML to Power a DSaaS (Data Science as a Service): Spark Summit East...Using SparkML to Power a DSaaS (Data Science as a Service): Spark Summit East...
Using SparkML to Power a DSaaS (Data Science as a Service): Spark Summit East...
 
Scala and Spark are Ideal for Big Data - Data Science Pop-up Seattle
Scala and Spark are Ideal for Big Data - Data Science Pop-up SeattleScala and Spark are Ideal for Big Data - Data Science Pop-up Seattle
Scala and Spark are Ideal for Big Data - Data Science Pop-up Seattle
 
Scala Jday 2014
Scala Jday 2014 Scala Jday 2014
Scala Jday 2014
 
Wikidsmart PM: Requirements Management within Confluence, Integrated with JIRA
Wikidsmart PM: Requirements Management within Confluence, Integrated with JIRAWikidsmart PM: Requirements Management within Confluence, Integrated with JIRA
Wikidsmart PM: Requirements Management within Confluence, Integrated with JIRA
 
Evolving IGN’s New APIs with Scala
 Evolving IGN’s New APIs with Scala Evolving IGN’s New APIs with Scala
Evolving IGN’s New APIs with Scala
 
Pig on Spark
Pig on SparkPig on Spark
Pig on Spark
 
Introduction to Apache Geode (Cork, Ireland)
Introduction to Apache Geode (Cork, Ireland)Introduction to Apache Geode (Cork, Ireland)
Introduction to Apache Geode (Cork, Ireland)
 
Apache Geode Meetup, Cork, Ireland at CIT
Apache Geode Meetup, Cork, Ireland at CITApache Geode Meetup, Cork, Ireland at CIT
Apache Geode Meetup, Cork, Ireland at CIT
 
Play Architecture, Implementation, Shiny Objects, and a Proposal
Play Architecture, Implementation, Shiny Objects, and a ProposalPlay Architecture, Implementation, Shiny Objects, and a Proposal
Play Architecture, Implementation, Shiny Objects, and a Proposal
 
Big data pipeline with scala by Rohit Rai, Tuplejump - presented at Pune Scal...
Big data pipeline with scala by Rohit Rai, Tuplejump - presented at Pune Scal...Big data pipeline with scala by Rohit Rai, Tuplejump - presented at Pune Scal...
Big data pipeline with scala by Rohit Rai, Tuplejump - presented at Pune Scal...
 
Big Data pipeline with Scala by Rohit Rai, Tuplejump - presented at Pune Scal...
Big Data pipeline with Scala by Rohit Rai, Tuplejump - presented at Pune Scal...Big Data pipeline with Scala by Rohit Rai, Tuplejump - presented at Pune Scal...
Big Data pipeline with Scala by Rohit Rai, Tuplejump - presented at Pune Scal...
 
Sledgehammer to Fine Brush for QA
Sledgehammer to Fine Brush for QASledgehammer to Fine Brush for QA
Sledgehammer to Fine Brush for QA
 
Stay productive_while_slicing_up_the_monolith
Stay productive_while_slicing_up_the_monolithStay productive_while_slicing_up_the_monolith
Stay productive_while_slicing_up_the_monolith
 
Spark Uber Development Kit
Spark Uber Development KitSpark Uber Development Kit
Spark Uber Development Kit
 
Spark introduction and architecture
Spark introduction and architectureSpark introduction and architecture
Spark introduction and architecture
 

Recently uploaded

Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceSapana Sha
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一fhwihughh
 
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理e4aez8ss
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)jennyeacort
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...limedy534
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一F sss
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfJohn Sterrett
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Seán Kennedy
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 217djon017
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our WorldEduminds Learning
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样vhwb25kk
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPramod Kumar Srivastava
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]📊 Markus Baersch
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改yuu sss
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Seán Kennedy
 
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptxNLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptxBoston Institute of Analytics
 
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSINGmarianagonzalez07
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...Florian Roscheck
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDRafezzaman
 

Transitioning from Java to Scala for Spark - March 13, 2019

  • 1. 1| gravyanalytics.com Transitioning from Java to Scala for Spark Guy DeCorte, Founder & CTO Aaron Perrin, Senior Software Developer March 13, 2019
  • 2. 2| gravyanalytics.com Where we go is who we are. REAL-WORLD CONSUMER BEHAVIOR LIFE STAGES LIFESTYLES AFFINITIES INTERESTS The events consumers attend, the places they visit, and where they spend their time translate into intelligence
  • 3. 3| gravyanalytics.com We translate the locations that consumers visit, the places they go, and the events they attend into real-world consumer intelligence INDUSTRY-LEADING CAPABILITIES
  • 4. 4| gravyanalytics.com GRAVY SOLUTIONS AdmitOne™ verified Visitation, Attendance, Event data and more for use in unique business applications Gravy Insights provides brands with in-depth customer and competitive intelligence Gravy Audiences let marketers reach engaged consumers based on what they do in real life GRAVY AUDIENCES GRAVY INSIGHTS GRAVY DAAS • Lifestyle • Enthusiast • In-Market • Branded • Custom • Foot Traffic • Competitive • Attribution • Visitations • Attendances • IP Address • User Agent
  • 5. 5| gravyanalytics.com Gravy’s patented AdmitOne verification engine delivers the highest-quality location and attendance data in the industry THE GRAVY DIFFERENCE Billions of daily location signals from 250M+ mobile devices The largest events database gives context to millions of places and POIs Confirmed, deterministic consumer attendances at places and events. REACH EVENTS VERIFIED
  • 6. 6| gravyanalytics.com SOLUTION GEO-SIGNALS CLOUD Distribute Filter & Verify Merge Spatial Index LCO & Attendance Algorithm Persona Generator Attendances Detail Records Personas / Audiences Devices Device Processing Lots of Spark jobs! Snowflake Datasets in S3 Zeppelin/EMR Snowflake SQL, R, Excel Dashboards-Sisense Matillion
  • 7. 7| gravyanalytics.com Some of the major Spark jobs that we run: • Ingest • Also validates, removes and/or flags data based on LDVS output • Location and Device Verification Service (LDVS) • Signal Merge / Device Merge • Persona Generator • Spatial Indexer SUMMARY OF SPARK JOBS
  • 8. 8| gravyanalytics.com What's Our Platform Look Like?
  • 9. 9| gravyanalytics.com • Environment • We currently run ~30 Spark jobs daily • On average, per hour: ~1300 cores and ~10 TiB memory • AWS EMR (and spot instances to control costs) • Data storage: S3 and Snowflake • The Code (Platform) • ~200k lines Java, ~30k lines Scala • Strong domain-driven-design influence • Many jobs can be run in Spark or stand-alone • Central orchestration application • Custom DAG scheduler • Responsible for job scheduling, configuring, launching, monitoring, and failure recovery THE CORE PLATFORM
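The slide does not describe how the custom DAG scheduler orders jobs. As a minimal sketch of the general idea only, assuming a plain dependency map, a launch order can be derived with a topological sort. The object name `JobDag`, the method `launchOrder`, and the job names and dependency edges in the usage note are hypothetical and are not taken from the Gravy platform.

```scala
import scala.annotation.tailrec

object JobDag {
  // deps maps each job to the jobs that must finish before it can start.
  // Returns Right(launch order), or Left(jobs that can never start) when
  // the dependencies contain a cycle.
  def launchOrder(deps: Map[String, Set[String]]): Either[Set[String], List[String]] = {
    // Jobs that appear only as dependencies get an empty dependency set.
    val all: Map[String, Set[String]] =
      deps ++ deps.values.flatten.filterNot(j => deps.contains(j)).map(j => j -> Set.empty[String])

    @tailrec
    def loop(remaining: Map[String, Set[String]], done: List[String]): Either[Set[String], List[String]] =
      if (remaining.isEmpty) Right(done.reverse)
      else {
        // Jobs whose dependencies have all completed; sorted for a deterministic order.
        val ready = remaining.collect {
          case (job, ds) if ds.forall(d => done.contains(d)) => job
        }.toList.sorted
        if (ready.isEmpty) Left(remaining.keySet) // nothing can run: cycle
        else loop(remaining -- ready, ready.reverse ::: done)
      }

    loop(all, Nil)
  }
}
```

For example, with the hypothetical edges `Map("ingest" -> Set("ldvs"), "signal-merge" -> Set("ingest"))`, `JobDag.launchOrder` yields `ldvs`, then `ingest`, then `signal-merge`. A production scheduler would also need the configuring, monitoring, and failure-recovery duties the slide lists, which this sketch omits.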
  • 10. 10| gravyanalytics.com • 2015-2016 • Targets: 25M sources, 450M events per day (5500/sec) • Java - Microservices, DDD, AWS (Kinesis/SQS/EC2/DynamoDB/Redshift/etc) • 2016-2017 • Targets: 100M sources, 4B events per day (40,000/sec) • Java - Hybrid: Spark 1.6 / Microservices (experiments with storage) • 2017-2018 • Targets: 200M sources, 10B events per day (100,000/sec) • Java - Spark 2.0 / DynamoDB / S3 / Snowflake • 2018-2019+ • Targets: 400M+ sources, 25B+ events per day (300,000/sec) • Scala - Spark 2.4 / DynamoDB / S3 / Snowflake SOFTWARE ARCHITECTURE EVOLUTION
  • 11. 11| gravyanalytics.com • We started using Spark before Datasets were a thing • The original Spark code was designed around RDDs • As data scaled, we targeted (easy) ways to improve efficiency • After Spark 2.0+, Datasets became more attractive • What we did • Reduced size of domain types to reduce memory overhead • Refactored monolithic Spark jobs into specialized jobs • Migrated JSON data to Parquet (with partitions) • Transitioned from RDD API to Dataset API FROM RDDs TO DATASETS AND MORE
  • 12. 12| gravyanalytics.com • Transformations, aggregations, and filters are easier with Datasets • Improved Dataset performance from Spark 2.0 onward • Datasets provide an abstraction layer enabling optimized execution plans • Easier, more fluent interface • Datasets provide columnar optimization to improve data and shuffling performance • Enhanced functionality with functions._ • Support for SQL, when necessary WHY DATASETS?
  • 13. 13| gravyanalytics.com • The Dataset API is available in Java, so why did we switch? • Understanding Spark internals or modifying its functionality was difficult without knowing Scala • Scala is a cleanly designed language • We wanted to avoid the (often cumbersome) Java API • Our initial experiments with Scala proved its ease of use • Case classes resulted in easier serialization and better serialization and shuffling performance • Immutable types provided better garbage collection • Use of the Spark REPL enabled faster prototyping • Scala's tools and libraries have matured significantly • Lots of best practices available • Understanding Scala gives the team a deeper understanding of the underlying Spark code WHY SCALA?
  • 14. 14| gravyanalytics.com • The switch was worth it, but it wasn't without a cost 1. Lack of Experience • Initially we had only one developer with Scala experience 2. Large Amounts of Legacy Java Code • We have taken a staged approach, but it is still a large effort 3. Shift in Coding Mentality • Embracing a more functional coding style requires changing how we think about problems CHALLENGES: SCALA
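The last two steps on this slide (JSON to partitioned Parquet, and RDD API to Dataset API) might look roughly like the sketch below. It assumes `spark-sql` is on the classpath. The case classes `Signal` and `DeviceDayCount`, their fields, the object `SignalJobs`, and the choice of `day` as the partition column are hypothetical illustrations, not the presenters' actual types or layout.

```scala
import org.apache.spark.sql.{Dataset, Encoders, SaveMode, SparkSession}
import org.apache.spark.sql.functions.{col, count, lit}

// Hypothetical domain types; kept small, in the spirit of the slide's
// "reduced size of domain types".
final case class Signal(deviceId: String, lat: Double, lon: Double, day: String)
final case class DeviceDayCount(deviceId: String, day: String, signals: Long)

object SignalJobs {
  // Dataset-API aggregation: Spark plans this through its optimizer instead of
  // running opaque closures, as an RDD map/reduceByKey would.
  def countsPerDeviceDay(signals: Dataset[Signal]): Dataset[DeviceDayCount] =
    signals
      .groupBy(col("deviceId"), col("day"))
      .agg(count(lit(1)).as("signals"))
      .as[DeviceDayCount](Encoders.product[DeviceDayCount])

  // Reads line-delimited JSON with an explicit schema derived from the case
  // class, then rewrites it as Parquet partitioned by the (assumed) day column.
  def jsonToPartitionedParquet(spark: SparkSession, jsonPath: String, parquetPath: String): Unit =
    spark.read
      .schema(Encoders.product[Signal].schema)
      .json(jsonPath)
      .as[Signal](Encoders.product[Signal])
      .write
      .mode(SaveMode.Overwrite)
      .partitionBy("day")
      .parquet(parquetPath)
}
```

Partitioning by a column that most queries filter on lets Spark skip whole directories at read time; which column suits a given workload is a design decision this sketch does not settle.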
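To illustrate the case-class and immutability points, here is a minimal plain-Scala sketch with no Spark dependency. The type `Visit`, its fields, and the object `VisitExample` are hypothetical. A case class gets structural equality, `copy`, and pattern-matching support without the getters, setters, `equals`, and `hashCode` a Java bean would need.

```scala
// Hypothetical domain type; the field names are illustrative only.
final case class Visit(deviceId: String, placeId: Long, dwellMinutes: Int)

object VisitExample {
  // copy returns a new instance and leaves the original untouched,
  // so values can be shared safely between tasks.
  def extend(v: Visit, extraMinutes: Int): Visit =
    v.copy(dwellMinutes = v.dwellMinutes + extraMinutes)

  // Case classes destructure directly in pattern matches.
  def describe(v: Visit): String = v match {
    case Visit(_, _, minutes) if minutes >= 60 => "long visit"
    case _                                     => "short visit"
  }
}
```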
  • 17. 17| gravyanalytics.com UNIT TESTING • Transitioning from JUnit to ScalaTest • Lack of Experience • Another scenario where the development team needed to ramp up on new technology • DataMapper • We have a homegrown library called the DataMapper which allows us to generate test data at runtime from annotations on our unit tests • The Java version of this library relied on reflection and did not play nice with case classes • Eventually we produced a Scala / ScalaTest compatible trait-based version
  • 18. 18| gravyanalytics.com HIRING/GOING FORWARD • Driving home the fact that we are no longer a Java-only shop, we have modified our job listings to include Scala as a preferred language prerequisite. • It was challenging at first to evaluate candidates' Scala skills, as we were novices ourselves. • As we continue to ramp up on Scala, we have started to branch out from using it only for Spark to using it for web services (Play Framework) as well as to replace some of our legacy utility libraries. • We think we are now better positioned to quickly take advantage of newer features coming down the Spark pipeline.
  • 20. 20| gravyanalytics.com • Greatly streamlined syntax • Easier use with Spark • Easy, fast serialization of case classes during shuffles • Built-in Product type encoders • Built-in tuple types • Built-in anonymous functions • Options instead of nulls • Pattern matching instead of switch statements • IntelliJ Scala support • Simpler Futures • “Duck-typing” • Advanced reflection • Functional exception handling • Syntactic sugar • Lots of helpers: Option, Try, Success, Failure, Either, etc. • Everything is a function => more flexibility • Easier generics (less type erasure) Extra: Scala Likes
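Several of these items (Option instead of nulls, pattern matching, and the Try/Either helpers) can be seen together in a short plain-Scala sketch. The object `Parsing` and the functions `parseLatLon` and `label` are hypothetical examples, not code from the talk.

```scala
import scala.util.{Failure, Success, Try}

object Parsing {
  // Parses "lat,lon". Errors come back as a Left value instead of an
  // exception or a null, so callers must handle them explicitly.
  def parseLatLon(raw: String): Either[String, (Double, Double)] =
    raw.split(",").map(_.trim) match {
      case Array(a, b) =>
        Try((a.toDouble, b.toDouble)) match {
          case Success(pair) => Right(pair)
          case Failure(e)    => Left(s"not a number: ${e.getMessage}")
        }
      case other => Left(s"expected 2 fields, got ${other.length}")
    }

  // Option instead of null for a lookup that may find nothing.
  def label(placeNames: Map[Long, String], placeId: Long): String =
    placeNames.get(placeId).map(_.toUpperCase).getOrElse("UNKNOWN")
}
```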
  • 21. 21| gravyanalytics.com • Untyped vals • Lots of special symbols • Library complexity • Akka and Typesafe libraries • JSON parsing libraries (incompatibility with Gson, complex Scala libs) • Java compatibility • Companion object wrapping • Bean serialization • Default to Seq for ordered collections (instead of ideal data structure for the job) • Gradle vs. SBT • Overuse of implicit “magic” • Difficult learning curve (lots to learn!!) • Too much flexibility can create inconsistent and confusing code • Opaque compilation errors • Missing named tuples (as in Python) • Enumerations are broken Extra: Scala Dislikes
  • 22. 22| gravyanalytics.com • Immutable types instead of mutable types • Collection syntax sugar • Chaining functions causes lots of type headaches • Syntactic sugar • Using recursion (with @tailrec) instead of procedural • Pattern matching • Using small functions to keep code readable • Reflection, type tags, and class tags • Curried functions • Partial functions • Unfamiliar type system • OO Paradigms don’t translate well (have to research correct way of doing things) • Lots to learn!! Extra: Scala challenges
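Three of the items that tend to be unfamiliar to Java developers (recursion with @tailrec, curried functions, and partial functions) are shown in the plain-Scala sketch below. All names in it (`FunctionalStyle`, `sum`, `within`, `isValidHour`, `evenHalves`) are hypothetical.

```scala
import scala.annotation.tailrec

object FunctionalStyle {
  // Tail-recursive sum: compiled to a loop, so it does not grow the stack.
  // @tailrec makes compilation fail if the recursive call is not in tail position.
  def sum(xs: List[Int]): Int = {
    @tailrec
    def loop(rest: List[Int], acc: Int): Int = rest match {
      case Nil          => acc
      case head :: tail => loop(tail, acc + head)
    }
    loop(xs, 0)
  }

  // Curried function: supply the first argument list now, the second later.
  def within(min: Int, max: Int)(x: Int): Boolean = x >= min && x <= max
  val isValidHour: Int => Boolean = within(0, 23)

  // Partial function: defined only for even inputs. Passing it to collect
  // filters and maps in a single pass.
  val evenHalves: PartialFunction[Int, Int] = { case n if n % 2 == 0 => n / 2 }
}
```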
  • 23. 23| gravyanalytics.com Aaron Perrin, Senior Software Developer 703-840-8850 aperrin@gravyanalytics.com