Spark meetup2 final (Taboola)

tsliwowicz

Spark Summit highlights; Spark DataFrames and Zeppelin on top of Cassandra

Who are we?
Tal Sliwowicz
Director, R&D
tal@taboola.com
Ruthy Goldberg
Sr. Software Engineer
ruthy@taboola.com

Engine Focused on Maximizing CTR & Post Click Engagement
• Collaborative Filtering: Bucketed Consumption Groups
• Geo: Region-based Recommendations
• Context: Metadata
• Social: Facebook/Twitter API
• User Behavior: Cookie Data

Largest Content Discovery and Monetization Network
• 550M Monthly Unique Users
• 240B Monthly Recommendations
• 10B+ Daily User Events
• 5TB+ Incoming Daily Data

What Does it Mean?
• Using Spark in production since v0.8
• 6 data centers across the globe
• Dedicated Spark & Cassandra (for Spark) cluster consisting of
– 5000+ cores with 35TB of RAM and ~1PB of local SSD storage,
across 2 data centers
• Data must be processed and analyzed in real time, for example:
– Real-time, per-user content recommendations
– Real-time expenditure reports
– Automated campaign management
– Automated calibration of recommendation algorithms
– Real-time analytics

DataFrames
• Spark DataFrames: Simple and Fast Analysis of Structured Data
https://spark-summit.org/2015/events/spark-dataframes-simple-and-fast-analysis-of-structured-data/

Tungsten
• From DataFrames to Tungsten: A Peek into Spark's Future
https://spark-summit.org/2015/events/keynote-9/
• Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal
https://spark-summit.org/2015/events/deep-dive-into-project-tungsten-bringing-spark-closer-to-bare-metal/

Interesting Users’ Experience - Netflix
• Spark and Spark Streaming at Netflix
https://spark-summit.org/2015/events/spark-and-spark-streaming-at-netflix/

Interesting Users’ Experience - Baidu
• How Spark Fits into Baidu's Scale
https://spark-summit.org/2015/events/keynote-10/

Databricks Practical Talks – Spark Streaming
• Recipes for Running Spark Streaming Applications in Production
https://spark-summit.org/2015/events/recipes-for-running-spark-streaming-applications-in-production/

Databricks Practical Talks – Machine Learning
• Building, Debugging, and Tuning Spark Machine Learning Pipelines
https://spark-summit.org/2015/events/practical-machine-learning-pipelines-with-mllib-2/

Performance
• Making Sense of Spark Performance
https://spark-summit.org/2015/events/making-sense-of-spark-performance/
• Taming GC Pauses for Humongous Java Heaps in Spark Graph Computing
https://spark-summit.org/2015/events/taming-gc-pauses-for-humongous-java-heaps-in-spark-graph-computing/
• IndexedRDD: Efficient Fine-Grained Updates for RDDs
https://spark-summit.org/2015/events/indexedrdd-efficient-fine-grained-updates-for-rdds/

Summary
• All Spark Summit videos and presentations can be found at
https://spark-summit.org/2015/

USING SPARK AND C* TOGETHER
FOR DATA ANALYSIS USING DATA
FRAMES AND ZEPPELIN

Cassandra Table → Spark DataFrame
Taboola’s main framework classes:
– CassandraTableSchemaProvider
– CassandraDataLoader
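
The two framework classes above are Taboola's own, and the slides do not show their code. As a rough illustration of the same Cassandra table to DataFrame step, here is a minimal sketch that assumes the open-source DataStax spark-cassandra-connector is on the classpath and Spark 1.4+ (for `sqlContext.read`). The helper names `cassandra_options` and `load_cassandra_table`, and the keyspace and table names in the usage comment, are hypothetical and not part of Taboola's framework.

```python
# Data source name registered by the spark-cassandra-connector.
CASSANDRA_SOURCE = "org.apache.spark.sql.cassandra"


def cassandra_options(keyspace, table):
    """Build the option map that identifies one Cassandra table."""
    return {"keyspace": keyspace, "table": table}


def load_cassandra_table(sql_context, keyspace, table):
    """Return a DataFrame backed by the given Cassandra table.

    sql_context is a Spark SQLContext (in Zeppelin, the predefined
    'sqlContext' variable). The connector derives the DataFrame schema
    from the Cassandra table metadata.
    """
    reader = sql_context.read.format(CASSANDRA_SOURCE)
    return reader.options(**cassandra_options(keyspace, table)).load()


# Hypothetical usage inside a Spark session:
#   df = load_cassandra_table(sqlContext, "my_keyspace", "my_table")
#   df.printSchema()
```

Passing the context in as an argument, rather than relying on a global, keeps the helper usable both from a notebook and from batch code.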

CassandraTableSchemaProvider
Getting Cassandra Metadata

General Variables In Zeppelin
SparkContext, SQLContext and ZeppelinContext are
automatically created and exposed as the variables
'sc', 'sqlContext' and 'z', respectively, in both the Scala and
Python environments.
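
As a small illustration of these variables, the sketch below could be pasted into a `%pyspark` paragraph. It assumes the Spark 1.x API that was current at the time of the talk (`registerTempTable`; newer Spark versions use `createOrReplaceTempView`). The function name, table name and sample words are made up for the example, and the ZeppelinContext `z` is left out.

```python
def register_word_lengths(sc, sqlContext, words, table_name="word_lengths"):
    """Turn a list of words into a DataFrame and expose it to SQL paragraphs."""
    # sc (SparkContext) distributes the local data as (word, length) pairs.
    rows = sc.parallelize([(w, len(w)) for w in words])
    # sqlContext (SQLContext) attaches column names, giving a DataFrame.
    df = sqlContext.createDataFrame(rows, ["word", "length"])
    # A registered temp table can be queried from a %sql paragraph.
    df.registerTempTable(table_name)
    return df


# Inside Zeppelin, sc and sqlContext already exist, so the call would be:
#   register_word_lengths(sc, sqlContext, ["spark", "cassandra", "zeppelin"])
# and a following paragraph could then run:
#   %sql select word, length from word_lengths order by length
```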

Zeppelin @Taboola - What’s next?
• Connect Zeppelin to the cluster (not standalone)
• Load raw sessions data
• Run code (Python/Scala) for algorithmic analysis

Thank You!
tal@taboola.com
ruthy@taboola.com

1. Spark Meetup #2

6. SPARK SUMMIT SF 2015 Highlights

25. Newsroom Dashboard

26. Cassandra

29. DF From Cassandra

30. CassandraDataLoader

31. CassandraDataLoader

33. MySQL Table → Spark DataFrame

34. Zeppelin

35. Loading Our Code Into Zeppelin

37. Executing Staging Code

38. DEMO

Editor's notes

Tungsten motivation: CPUs have stayed about the same for the last 10 years, so the code itself needs to be optimized: (1) runtime code generation, (2) exploiting cache locality, (3) off-heap memory management
----- Meeting Notes (2/8/15 22:08) ----- Don't forget Spark 1.4 + Kafka
----- Meeting Notes (3/8/15 14:00) ----- Lior Chaga
