December 2013 HUG: Spark at Yahoo!

•

10 recomendaciones•5,208 vistas

Yahoo Developer Network

Tecnología Educación

HADOOP AND SPARK
JOIN FORCES AT YAHOO
Andy Feng (afeng@yahoo-inc.com)
Distinguished Architect, Platforms, Yahoo

1
Monday, December 2, 13

YAHOO 2012: TODAY MODULE

http://visualize.yahoo.com/core/#
Monday, December 2, 13

2

YAHOO 2013: PERSONALIZED HOMEPAGE
http://www.yahoo.com

Mobile

3
Monday, December 2, 13

YAHOO 2013: PERSONALIZED PROPERTIES
http://ﬁnance.yahoo.com

Mobile

4
Monday, December 2, 13

YAHOO 2013: IMPROVED WEB SEARCH
W/ VERTICAL CONTENT
http://search.yahoo.com
Mobile

5
Monday, December 2, 13

DATA SCIENCE AT SCALE

Data Scientist

6
Monday, December 2, 13

1. CHALLENGE: SCIENCE
✦ Single model for all items in homepage stream
✴ Millions of items
✴ 1000’s of item/user features
• Yahoo content categories
• Wikipedia entity names

✴ Over 800 million users

✦ Objective function
✴ Relevance & user engagement
✴ Freshness & popularity
✴ Diversity

★ Algorithm exploration
✴ Logistic regression?
✴ Collaborative ﬁltering?
✴ Decision trees?
✴ Hybrid?

7
Monday, December 2, 13

II. CHALLENGE: SPEED
✦Ex. Item CTR in Yahoo homepage Today Module
* Short Lifetimes
* Temporal effect

✦Models should be constructed hourly or faster
8
Monday, December 2, 13

* Breaking news

III. CHALLENGE: SCALE
✦150 PB of data on Yahoo Hadoop clusters
✴Yahoo data scientists need the data for
‣ Model building
‣ BI analytics
✴Such datasets should be accessed efﬁciently
‣ avoid latency caused by data movement
✦35,000 servers in Hadoop cluster
✴Science projects need to leverage
9
Monday, December 2, 13

all these servers for computation

SOLUTION: HADOOP + SPARK

I. science ... Spark API & MLlib ease development of ML algorithms
II.speed ... Spark reduces latency of model training via in-memory RDD etc
III.scale ... YARN brings Hadoop datasets & servers at scientists’ ﬁngertips
10
Monday, December 2, 13

PILOT PROJECT: E-COMMERCE
Yahoo Taiwan Shopping & Auction
✦ Collaborative

ﬁltering algorithms for
✴ Viewed-also-viewed
✴ Bought-also-bought
✴ Bought-after-viewed

✦ 30

LOC in Spark/Scala
✴ 14 min. on 10 servers
- Hadoop-based algorithm: 106 min.

11
Monday, December 2, 13

PILOT PROJECT: STREAM ADS
✦A logistic regression algorithm
✴120 LOC in Spark/Scala
‣ Alternative: Vowpal Wabbit ... Difﬁcult to extend
✴30 min. on model creation for 100M samples
and 13K features with 30 iterations

✦“I used Spark-on-Yarn package
today, works great.” Amit (July 26, 2013 2:51 PM)

✴Initial algorithm was launched within 2 hours
after Spark-YARN package announcement
✴Compare: Several weeks on system setup and
data movement

12
Monday, December 2, 13

SUMMARY
✦Spark plays an important role in machine learning at Yahoo
✴Hadoop continues to be the core of our big-data platform
✦Yahoo is excited about your continued contribution to Apache
✴4 committers
✴ex. Spark-on-Yarn, Shark, security, scalability, operability etc.

13
Monday, December 2, 13

Spark

KEY TECHNOLOGY:
SPARK ON YARN

Thomas Graves (tgraves@yahoo-inc.com)
Spark Committer & Hadoop PMC
Yahoo

14
Monday, December 2, 13

SPARK-ON-YARN: ROADMAP
✦ Spark-0.6 & 0.7 - Experimental support of YARN
✴ Hadoop 0.23.X and early Hadoop 2.X releases
✦ Spark-0.8.0

- Spark-on-Yarn merged into master Spark branch

✦ Spark-0.8.X

- Future integration with Hadoop

✦ Spark-0.9.X

- Client-mode introduced

✴ Secure HDFS access
✴ Use YARN approved directories
✴ Link Spark UI to YARN UI

✴ Add authentication to Spark
✴ Support running spark on YARN from HDFS
✴ Support ﬁles/archives in Hadoop distributed cache
✴ Support Spark Shell on YARN
✴ Spark on YARN running on Hadoop

2.2.X
15

Monday, December 2, 13

ARCHITECTURE: STANDALONE MODE

16
Monday, December 2, 13

ARCHITECTURE: CLIENT MODE

17
Monday, December 2, 13

FUTURE DIRECTIONS
✦Support long running
✴Shark
✴Spark Streaming
✦Dynamic

jobs

resource allocation

✦Integrate with Hadoop enhancements
✴Generic History Server
✴Preemption, etc.
18
Monday, December 2, 13

MORE INFO:
✦http://spark.incubator.apache.org/docs/latest/running-on-yarn.html

19
Monday, December 2, 13

Más contenido relacionado

La actualidad más candente

Apache Spark OverviewVadim Y. Bichutskiy

The Evolution of the Hadoop EcosystemCloudera, Inc.

Apache spark installation [autosaved]Shweta Patnaik

Apache Hadoop YARN - Enabling Next Generation Data ApplicationsHortonworks

Introduction to the Hadoop Ecosystem (FrOSCon Edition)Uwe Printz

Apache Hadoop YARN 2015: Present and FutureDataWorks Summit

Apache Spark at ViadeoCepoi Eugen

Spark architectureGauravBiswas9

Introduction to the Hadoop EcoSystemShivaji Dutta

Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...spinningmatt

Introduction to Apache SparkSamy Dindane

Intro to Apache Spark by Marco VasquezMapR Technologies

Apache Spark & HadoopMapR Technologies

Apache sparkPrashant Pranay

Map reduce vs sparkTudor Lapusan

The hadoop 2.0 ecosystem and yarnMichael Joseph

Let Spark Fly: Advantages and Use Cases for Spark on HadoopMapR Technologies

Hadoop YARNVigen Sahakyan

YARN - Hadoop Next Generation Compute PlatformBikas Saha

Introduction to apache spark Aakashdata

La actualidad más candente (20)

Apache Spark Overview

The Evolution of the Hadoop Ecosystem

Apache spark installation [autosaved]

Apache Hadoop YARN - Enabling Next Generation Data Applications

Introduction to the Hadoop Ecosystem (FrOSCon Edition)

Apache Hadoop YARN 2015: Present and Future

Apache Spark at Viadeo

Spark architecture

Introduction to the Hadoop EcoSystem

Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15...

Introduction to Apache Spark

Intro to Apache Spark by Marco Vasquez

Apache Spark & Hadoop

Apache spark

Map reduce vs spark

The hadoop 2.0 ecosystem and yarn

Let Spark Fly: Advantages and Use Cases for Spark on Hadoop

Hadoop YARN

YARN - Hadoop Next Generation Compute Platform

Introduction to apache spark

Destacado

Deconstructiong Recommendations on Spark-(Ilya Ganelin, Capital One)Spark Summit

Food Recommendation System Using Clustering Analysis for Diabetic patientsMaiyaporn Phanich

Exactly-Once Streaming from Kafka-(Cody Koeninger, Kixer)Spark Summit

Tagging and Processing Data in Real Time-(Hari Shreedharan and Siddhartha Jai...Spark Summit

Some Important Streaming Algorithms You Should Know About-(Ted Dunning, MapR)Spark Summit

Cassandra and Spark: Optimizing for Data Locality-(Russell Spitzer, DataStax)Spark Summit

Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...Spark Summit

Airstream: Spark Streaming At AirbnbJen Aman

Spark and Spark Streaming at Netfix-(Kedar Sedekar and Monal Daxini, Netflix)Spark Summit

A More Scaleable Way of Making Recommendations with MLlib-(Xiangrui Meng, Dat...Spark Summit

iRIS: A Large-Scale Food and Recipe Recommendation System Using Spark-(Joohyu...Spark Summit

Collaborative Filtering with SparkChris Johnson

Cassandra and Spark: Optimizing for Data LocalityRussell Spitzer

Destacado (13)

Deconstructiong Recommendations on Spark-(Ilya Ganelin, Capital One)

Food Recommendation System Using Clustering Analysis for Diabetic patients

Exactly-Once Streaming from Kafka-(Cody Koeninger, Kixer)

Tagging and Processing Data in Real Time-(Hari Shreedharan and Siddhartha Jai...

Some Important Streaming Algorithms You Should Know About-(Ted Dunning, MapR)

Cassandra and Spark: Optimizing for Data Locality-(Russell Spitzer, DataStax)

Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...

Airstream: Spark Streaming At Airbnb

Spark and Spark Streaming at Netfix-(Kedar Sedekar and Monal Daxini, Netflix)

A More Scaleable Way of Making Recommendations with MLlib-(Xiangrui Meng, Dat...

iRIS: A Large-Scale Food and Recipe Recommendation System Using Spark-(Joohyu...

Collaborative Filtering with Spark

Cassandra and Spark: Optimizing for Data Locality

Similar a December 2013 HUG: Spark at Yahoo!

Future of Data Intensive ApplicaitonsMilind Bhandarkar

Drupal: Internet Lego - What is Drupal?Eric Aitala

Proud to be polyglot!NLJUG

Treasure Data Cloud StrategyTreasure Data, Inc.

Productionalizing Spark StreamingRyan Weald

For a Social Local and Mobile DrupalAdyax

Lightweight javaEE with GuicePeerapat Asoktummarungsri

Drupal for Project Managers, Part 3: LaunchingAcquia

Why and How to integrate Hadoop and NoSQL?Tugdual Grall

Cassandra Meetup: Real-time Analytics using Cassandra, Spark and Shark at OoyalaDataStax Academy

Graph Gurus Episode 1: Enterprise GraphTigerGraph

Softshake 2013: Introduction to NoSQL with CouchbaseTugdual Grall

Power of open source, Business perspectiveDirk Frigne

Big data oracle_introduccionFran Navarro

Drupal and the semantic web - SemTechBiz 2012scorlosquet

Python + NoSQL in AnimationsShuen-Huei Guan

Oracle GoldenGate Studio IntroBobby Curtis

Introduction to Drupal for Absolute Beginnerseverlearner

Case Study: Digital Asset Management in DrupalEric Sembrat

Infrastructure as Code with Chef / PuppetEdmund Siegfried Haselwanter

Similar a December 2013 HUG: Spark at Yahoo! (20)

Future of Data Intensive Applicaitons

Drupal: Internet Lego - What is Drupal?

Proud to be polyglot!

Treasure Data Cloud Strategy

Productionalizing Spark Streaming

For a Social Local and Mobile Drupal

Lightweight javaEE with Guice

Drupal for Project Managers, Part 3: Launching

Why and How to integrate Hadoop and NoSQL?

Cassandra Meetup: Real-time Analytics using Cassandra, Spark and Shark at Ooyala

Graph Gurus Episode 1: Enterprise Graph

Softshake 2013: Introduction to NoSQL with Couchbase

Power of open source, Business perspective

Big data oracle_introduccion

Drupal and the semantic web - SemTechBiz 2012

Python + NoSQL in Animations

Oracle GoldenGate Studio Intro

Introduction to Drupal for Absolute Beginners

Case Study: Digital Asset Management in Drupal

Infrastructure as Code with Chef / Puppet

Más de Yahoo Developer Network

Developing Mobile Apps for Performance - Swapnil Patel, Verizon MediaYahoo Developer Network

Athenz - The Open-Source Solution to Provide Access Control in Dynamic Infras...Yahoo Developer Network

Athenz & SPIFFE, Tatsuya Yano, Yahoo JapanYahoo Developer Network

Athenz with Istio - Single Access Control Model in Cloud Infrastructures, Tat...Yahoo Developer Network

CICD at Oath using ScrewdriverYahoo Developer Network

Big Data Serving with Vespa - Jon Bratseth, Distinguished Architect, OathYahoo Developer Network

How @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenuYahoo Developer Network

The Future of Hadoop in an AI World, Milind Bhandarkar, CEO, AmpoolYahoo Developer Network

Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...Yahoo Developer Network

Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shan...Yahoo Developer Network

HDFS Scalability and Security, Daryn Sharp, Senior Engineer, OathYahoo Developer Network

Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...Yahoo Developer Network

Moving the Oath Grid to Docker, Eric Badger, OathYahoo Developer Network

Architecting Petabyte Scale AI ApplicationsYahoo Developer Network

Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...Yahoo Developer Network

Jun 2017 HUG: YARN Scheduling – A Step BeyondYahoo Developer Network

Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies Yahoo Developer Network

February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...Yahoo Developer Network

February 2017 HUG: Exactly-once end-to-end processing with Apache ApexYahoo Developer Network

February 2017 HUG: Data Sketches: A required toolkit for Big Data AnalyticsYahoo Developer Network

Más de Yahoo Developer Network (20)

Developing Mobile Apps for Performance - Swapnil Patel, Verizon Media

Athenz - The Open-Source Solution to Provide Access Control in Dynamic Infras...

Athenz & SPIFFE, Tatsuya Yano, Yahoo Japan

Athenz with Istio - Single Access Control Model in Cloud Infrastructures, Tat...

CICD at Oath using Screwdriver

Big Data Serving with Vespa - Jon Bratseth, Distinguished Architect, Oath

How @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenu

The Future of Hadoop in an AI World, Milind Bhandarkar, CEO, Ampool

Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...

Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shan...

HDFS Scalability and Security, Daryn Sharp, Senior Engineer, Oath

Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...

Moving the Oath Grid to Docker, Eric Badger, Oath

Architecting Petabyte Scale AI Applications

Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...

Jun 2017 HUG: YARN Scheduling – A Step Beyond

Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies

February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...

February 2017 HUG: Exactly-once end-to-end processing with Apache Apex

February 2017 HUG: Data Sketches: A required toolkit for Big Data Analytics

Último

Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingEdi Saputra

MINDCTI Revenue Release Quarter One 2024MIND CTI

DBX First Quarter 2024 Investor PresentationDropbox

Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Angeliki Cooney

Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Zilliz

Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya

Spring Boot vs Quarkus the ultimate battle - DevoxxUKJago de Vreede

ICT role in 21st century education and its challengesrafiqahmad00786416

Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...apidays

AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin

Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays

Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services

Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software

+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@

Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer

[BuildWithAI] Introduction to Gemini.pdfSandro Moreira

Finding Java's Hidden Performance Traps @ DevoxxUK 2024Victor Rentea

Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10

Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1

Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodJuan lago vázquez

December 2013 HUG: Spark at Yahoo!

1. HADOOP AND SPARK JOIN FORCES AT YAHOO Andy Feng (afeng@yahoo-inc.com) Distinguished Architect, Platforms, Yahoo 1 Monday, December 2, 13

2. YAHOO 2012: TODAY MODULE http://visualize.yahoo.com/core/# Monday, December 2, 13 2

3. YAHOO 2013: PERSONALIZED HOMEPAGE http://www.yahoo.com Mobile 3 Monday, December 2, 13

4. YAHOO 2013: PERSONALIZED PROPERTIES http://ﬁnance.yahoo.com Mobile 4 Monday, December 2, 13

5. YAHOO 2013: IMPROVED WEB SEARCH W/ VERTICAL CONTENT http://search.yahoo.com Mobile 5 Monday, December 2, 13

6. DATA SCIENCE AT SCALE Data Scientist 6 Monday, December 2, 13

7. 1. CHALLENGE: SCIENCE ✦ Single model for all items in homepage stream ✴ Millions of items ✴ 1000’s of item/user features • Yahoo content categories • Wikipedia entity names ✴ Over 800 million users ✦ Objective function ✴ Relevance & user engagement ✴ Freshness & popularity ✴ Diversity ★ Algorithm exploration ✴ Logistic regression? ✴ Collaborative ﬁltering? ✴ Decision trees? ✴ Hybrid? 7 Monday, December 2, 13

8. II. CHALLENGE: SPEED ✦Ex. Item CTR in Yahoo homepage Today Module * Short Lifetimes * Temporal effect ✦Models should be constructed hourly or faster 8 Monday, December 2, 13 * Breaking news

9. III. CHALLENGE: SCALE ✦150 PB of data on Yahoo Hadoop clusters ✴Yahoo data scientists need the data for ‣ Model building ‣ BI analytics ✴Such datasets should be accessed efﬁciently ‣ avoid latency caused by data movement ✦35,000 servers in Hadoop cluster ✴Science projects need to leverage 9 Monday, December 2, 13 all these servers for computation

10. SOLUTION: HADOOP + SPARK I. science ... Spark API & MLlib ease development of ML algorithms II.speed ... Spark reduces latency of model training via in-memory RDD etc III.scale ... YARN brings Hadoop datasets & servers at scientists’ ﬁngertips 10 Monday, December 2, 13

11. PILOT PROJECT: E-COMMERCE Yahoo Taiwan Shopping & Auction ✦ Collaborative ﬁltering algorithms for ✴ Viewed-also-viewed ✴ Bought-also-bought ✴ Bought-after-viewed ✦ 30 LOC in Spark/Scala ✴ 14 min. on 10 servers - Hadoop-based algorithm: 106 min. 11 Monday, December 2, 13

12. PILOT PROJECT: STREAM ADS ✦A logistic regression algorithm ✴120 LOC in Spark/Scala ‣ Alternative: Vowpal Wabbit ... Difﬁcult to extend ✴30 min. on model creation for 100M samples and 13K features with 30 iterations ✦“I used Spark-on-Yarn package today, works great.” Amit (July 26, 2013 2:51 PM) ✴Initial algorithm was launched within 2 hours after Spark-YARN package announcement ✴Compare: Several weeks on system setup and data movement 12 Monday, December 2, 13

13. SUMMARY ✦Spark plays an important role in machine learning at Yahoo ✴Hadoop continues to be the core of our big-data platform ✦Yahoo is excited about your continued contribution to Apache ✴4 committers ✴ex. Spark-on-Yarn, Shark, security, scalability, operability etc. 13 Monday, December 2, 13 Spark

14. KEY TECHNOLOGY: SPARK ON YARN Thomas Graves (tgraves@yahoo-inc.com) Spark Committer & Hadoop PMC Yahoo 14 Monday, December 2, 13

15. SPARK-ON-YARN: ROADMAP ✦ Spark-0.6 & 0.7 - Experimental support of YARN ✴ Hadoop 0.23.X and early Hadoop 2.X releases ✦ Spark-0.8.0 - Spark-on-Yarn merged into master Spark branch ✦ Spark-0.8.X - Future integration with Hadoop ✦ Spark-0.9.X - Client-mode introduced ✴ Secure HDFS access ✴ Use YARN approved directories ✴ Link Spark UI to YARN UI ✴ Add authentication to Spark ✴ Support running spark on YARN from HDFS ✴ Support ﬁles/archives in Hadoop distributed cache ✴ Support Spark Shell on YARN ✴ Spark on YARN running on Hadoop 2.2.X 15 Monday, December 2, 13

16. ARCHITECTURE: STANDALONE MODE 16 Monday, December 2, 13

17. ARCHITECTURE: CLIENT MODE 17 Monday, December 2, 13

18. FUTURE DIRECTIONS ✦Support long running ✴Shark ✴Spark Streaming ✦Dynamic jobs resource allocation ✦Integrate with Hadoop enhancements ✴Generic History Server ✴Preemption, etc. 18 Monday, December 2, 13

19. MORE INFO: ✦http://spark.incubator.apache.org/docs/latest/running-on-yarn.html 19 Monday, December 2, 13

December 2013 HUG: Spark at Yahoo!

Recomendados

Recomendados

Más contenido relacionado

La actualidad más candente

La actualidad más candente (20)

Destacado

Destacado (13)

Similar a December 2013 HUG: Spark at Yahoo!

Similar a December 2013 HUG: Spark at Yahoo! (20)

Más de Yahoo Developer Network

Más de Yahoo Developer Network (20)

Último

Último (20)

December 2013 HUG: Spark at Yahoo!