Spark Meetup #2
Tal Sliwowicz
Director, R&D
tal@taboola.com
Who are we?
Ruthy Goldberg
Sr. Software Engineer
ruthy@taboola.com
Collaborative Filtering
Bucketed Consumption Groups
Geo
Region-based
Recommendations
Context
Metadata
Social
Facebook/Twitter API
User Behavior
Cookie Data
Engine Focused on Maximizing CTR & Post Click Engagement
Largest Content Discovery and
Monetization Network
550M Monthly Unique Users
240B Monthly Recommendations
10B+ Daily User Events
5TB+ Incoming Daily Data
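To put the headline figures above in per-second terms, here is a rough back-of-the-envelope sketch. It assumes load is spread evenly over the day, a 30-day month, and decimal terabytes (10^12 bytes). None of these assumptions come from the slides, and real traffic is likely to be peakier than an even spread suggests.

```python
# Back-of-the-envelope rates from the slide's headline numbers.
# Assumptions (not from the slides): uniform load, 30-day month,
# 1 TB = 10**12 bytes. The "+" figures are treated as lower bounds.
SECONDS_PER_DAY = 24 * 60 * 60

daily_events = 10e9              # "10B+ Daily User Events"
daily_bytes = 5e12               # "5TB+ Incoming Daily Data"
monthly_recommendations = 240e9  # "240B Monthly Recommendations"

events_per_sec = daily_events / SECONDS_PER_DAY
mb_per_sec = daily_bytes / SECONDS_PER_DAY / 1e6
recs_per_sec = monthly_recommendations / 30 / SECONDS_PER_DAY

print(f"events/s:          {events_per_sec:,.0f}")
print(f"ingest MB/s:       {mb_per_sec:,.1f}")
print(f"recommendations/s: {recs_per_sec:,.0f}")
```

Under these assumptions the averages come out at roughly 116 thousand events and 93 thousand recommendations per second, with about 58 MB/s of incoming data.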
• Using Spark in production since v0.8
• 6 Data Centers across the globe
• A dedicated Spark & Cassandra (for Spark) cluster consisting of
– 5000+ cores with 35TB of RAM and ~1PB of local SSD
storage, across 2 Data Centers.
• Data must be processed and analyzed in real time, for example:
– Real-time, per user content recommendations
– Real-time expenditure reports
– Automated campaign management
– Automated calibration of recommendation algorithms
– Real-time analytics
What Does it Mean?
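As a rough illustration of what the cluster figures above imply per core, the ratios can be worked out as follows. This is a sketch using the slide's round numbers and decimal units (an assumption). Because the core count is given as "5000+", the per-core values are upper bounds.

```python
# Per-core ratios for the dedicated Spark & Cassandra cluster.
# Inputs are the slide's round figures; decimal units are an assumption.
cores = 5000   # "5000+ cores" (lower bound)
ram_tb = 35    # "35TB of RAM"
ssd_pb = 1     # "~1PB of SSD local storage"

ram_gb_per_core = ram_tb * 1000 / cores
ssd_gb_per_core = ssd_pb * 1_000_000 / cores
ssd_to_ram_ratio = (ssd_pb * 1000) / ram_tb

print(f"RAM per core: {ram_gb_per_core:.0f} GB")
print(f"SSD per core: {ssd_gb_per_core:.0f} GB")
print(f"SSD:RAM ratio: {ssd_to_ram_ratio:.1f}x")
```

That works out to about 7 GB of RAM and 200 GB of local SSD per core, with SSD capacity close to 29 times the RAM.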
SPARK SUMMIT SF 2015
Highlights
• Spark DataFrames: Simple and Fast Analysis of
Structured Data
https://spark-summit.org/2015/events/spark-dataframes-simple-and-fast-analysis-of-structured-data/
DataFrames
• From DataFrames to Tungsten: A Peek into Spark's
Future
https://spark-summit.org/2015/events/keynote-9/
• Deep Dive into Project Tungsten: Bringing Spark
Closer to Bare Metal
https://spark-summit.org/2015/events/deep-dive-into-project-tungsten-bringing-spark-closer-to-bare-metal/
Tungsten
• Spark and Spark Streaming at Netflix
https://spark-summit.org/2015/events/spark-and-spark-streaming-at-netflix/
Interesting Users’ Experience - Netflix
• How Spark Fits into Baidu's Scale
https://spark-summit.org/2015/events/keynote-10/
Interesting Users’ Experience - Baidu
• Recipes for Running Spark Streaming Applications in
Production
https://spark-summit.org/2015/events/recipes-for-running-spark-streaming-applications-in-production/
Databricks Practical Talks – Spark Streaming
• Building, Debugging, and Tuning Spark Machine
Learning Pipelines
https://spark-summit.org/2015/events/practical-machine-learning-pipelines-with-mllib-2/
Databricks Practical Talks – Machine Learning
• Making Sense of Spark Performance
https://spark-summit.org/2015/events/making-sense-of-spark-performance/
• Taming GC Pauses for Humongous Java Heaps in
Spark Graph Computing
https://spark-summit.org/2015/events/taming-gc-pauses-for-humongous-java-heaps-in-spark-graph-computing/
• IndexedRDD: Efficient Fine-Grained Updates for
RDDs
https://spark-summit.org/2015/events/indexedrdd-efficient-fine-grained-updates-for-rdds/
Performance
• All Spark Summit videos and presentations can be
found here: https://spark-summit.org/2015/
Summary
USING SPARK AND C* TOGETHER
FOR DATA ANALYSIS USING DATA
FRAMES AND ZEPPELIN
Newsroom Dashboard
Cassandra
Taboola's main framework classes:
– CassandraTableSchemaProvider
– CassandraDataLoader
Cassandra Table → Spark DataFrame
CassandraTableSchemaProvider
Getting Cassandra Metadata
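The transcript does not show how CassandraTableSchemaProvider works internally. As a loose illustration of the general idea of turning Cassandra column metadata into a DataFrame schema, here is a minimal sketch in plain Python. The function name `to_spark_ddl`, the type mapping, and the example columns are all hypothetical and are not taken from Taboola's framework. The mapping covers only a few common CQL types.

```python
# Hypothetical sketch: turn Cassandra column metadata into a DDL-style
# schema string ("name TYPE, name TYPE, ..."). The mapping and all names
# here are illustrative assumptions, not Taboola's actual code.
CQL_TO_SPARK_SQL = {
    "ascii": "STRING",
    "text": "STRING",
    "varchar": "STRING",
    "uuid": "STRING",
    "timeuuid": "STRING",
    "boolean": "BOOLEAN",
    "int": "INT",
    "bigint": "BIGINT",
    "counter": "BIGINT",
    "float": "FLOAT",
    "double": "DOUBLE",
    "timestamp": "TIMESTAMP",
    "blob": "BINARY",
}


def to_spark_ddl(columns):
    """Build a schema string from (column_name, cql_type) pairs."""
    fields = []
    for name, cql_type in columns:
        spark_type = CQL_TO_SPARK_SQL.get(cql_type.lower())
        if spark_type is None:
            # Fail loudly rather than silently guessing a type.
            raise ValueError(f"no mapping for CQL type {cql_type!r} (column {name!r})")
        fields.append(f"{name} {spark_type}")
    return ", ".join(fields)


# Made-up table metadata, for illustration only.
example_columns = [
    ("session_id", "uuid"),
    ("publisher", "text"),
    ("clicks", "bigint"),
    ("event_time", "timestamp"),
]
print(to_spark_ddl(example_columns))
```

Raising on an unmapped type is a deliberate choice in this sketch: a wrong column type in a DataFrame tends to surface much later and is harder to trace than a failure at schema-building time.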
DF From Cassandra
CassandraDataLoader
CassandraDataLoader
Taboola's main framework classes:
– CassandraTableSchemaProvider
– CassandraDataLoader
Cassandra Table → Spark DataFrame
MySQL Table → Spark DataFrame
Zeppelin
Loading Our Code Into Zeppelin
SparkContext, SQLContext, and ZeppelinContext are
automatically created and exposed as the variables
'sc', 'sqlContext' and 'z', respectively, in both the Scala and
Python environments.
General Variables In Zeppelin
Executing Staging Code
DEMO
• Connect Zeppelin to the cluster (not
standalone)
• Load raw sessions data
• Run code (Python/Scala) for algorithmic
analysis
Zeppelin @Taboola - What’s next?
tal@taboola.com
ruthy@taboola.com
Thank You!

Spark meetup2 final (Taboola)

  • 2. Tal Sliwowicz Director, R&D tal@taboola.com Who are we? Ruthy Goldberg Sr. Software Engineer ruthy@taboola.com
  • 3. Collaborative Filtering Bucketed Consumption Groups Geo Region-based Recommendations Context Metadata Social Facebook/Twitter API User Behavior Cookie Data Engine Focused on Maximizing CTR & Post Click Engagement
  • 4. Largest Content Discovery and Monetization Network: 550M Monthly Unique Users, 240B Monthly Recommendations, 10B+ Daily User Events, 5TB+ Incoming Daily Data
  • 5. • Using Spark in production since v0.8 • 6 Data Centers across the globe • Dedicated Spark & Cassandra (for Spark) cluster consists of – 5000+ cores with 35TB of RAM and ~1PB of local SSD storage, across 2 Data Centers. • Data must be processed and analyzed in real time, for example: – Real-time, per-user content recommendations – Real-time expenditure reports – Automated campaign management – Automated calibration of recommendation algorithms – Real-time analytics What Does it Mean?
  • 6. SPARK SUMMIT SF 2015 Highlights
  • 7. • Spark DataFrames: Simple and Fast Analysis of Structured Data https://spark-summit.org/2015/events/spark-dataframes-simple-and-fast-analysis-of-structured-data/ DataFrames
  • 10. • From DataFrames to Tungsten: A Peek into Spark's Future https://spark-summit.org/2015/events/keynote-9/ • Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal https://spark-summit.org/2015/events/deep-dive-into-project-tungsten-bringing-spark-closer-to-bare-metal/ Tungsten
  • 14. • Spark and Spark Streaming at Netflix https://spark-summit.org/2015/events/spark-and-spark-streaming-at-netflix/ Interesting Users’ Experience - Netflix
  • 16. • How Spark Fits into Baidu's Scale https://spark-summit.org/2015/events/keynote-10/ Interesting Users’ Experience - Baidu
  • 18. • Recipes for Running Spark Streaming Applications in Production https://spark-summit.org/2015/events/recipes-for-running-spark-streaming-applications-in-production/ Databricks Practical Talks – Spark Streaming
  • 20. • Building, Debugging, and Tuning Spark Machine Learning Pipelines https://spark-summit.org/2015/events/practical-machine-learning-pipelines-with-mllib-2/ Databricks Practical Talks – Machine Learning
  • 22. • Making Sense of Spark Performance https://spark-summit.org/2015/events/making-sense-of-spark-performance/ • Taming GC Pauses for Humongous Java Heaps in Spark Graph Computing https://spark-summit.org/2015/events/taming-gc-pauses-for-humongous-java-heaps-in-spark-graph-computing/ • IndexedRDD: Efficient Fine-Grained Updates for RDDs https://spark-summit.org/2015/events/indexedrdd-efficient-fine-grained-updates-for-rdds/ Performance
  • 23. • All Spark summit videos and presentations can be found here https://spark-summit.org/2015/ Summary
  • 24. USING SPARK AND C* TOGETHER FOR DATA ANALYSIS USING DATA FRAMES AND ZEPPELIN
  • 27. Taboola's main framework classes: – CassandraTableSchemaProvider – CassandraDataLoader Cassandra Table → Spark DataFrame
  • 32. Taboola's main framework classes: – CassandraTableSchemaProvider – CassandraDataLoader Cassandra Table → Spark DataFrame
  • 33. MySQL Table → Spark DataFrame
  • 35. Loading Our Code Into Zeppelin
  • 36. SparkContext, SQLContext and ZeppelinContext are created automatically and exposed as the variables 'sc', 'sqlContext' and 'z', respectively, in both the Scala and Python environments. General Variables In Zeppelin
  • 38. DEMO
  • 39. • Connect Zeppelin to the cluster (not standalone) • Load raw sessions data • Run code (Python/Scala) for algorithmic analysis Zeppelin @Taboola - What’s next?

Editor's notes

  1. Tungsten motivation: CPU speeds have stayed about the same for the last 10 years, so the code itself needs to be optimized: (1) runtime code generation, (2) exploiting cache locality, (3) off-heap memory management
  2. ----- Meeting Notes (2/8/15 22:08) ----- Don't forget Spark 1.4 + Kafka
  3. ----- Meeting Notes (3/8/15 14:00) ----- Lior Chaga