Apache S4: A Distributed Stream
Computing Platform

Presented at Stanford Infolab – Nov 4, 2011

http://incubator.apache.org/projects/s4 (migrating from http://s4.io)


  S4 Committers: {fpj, kishoreg, leoneu, mmorel, robbins}@apache.org
  Presented by Leo Neumeyer (@leoneu)


About Me

 Born in Buenos Aires, Argentina, studied EE.
 School/Work in Canada (Signal Processing, Speech Coding).
 SRI Int'l (Menlo Park) Speech Lab, DARPA benchmarks, lab
 founded speech recognition spin-off Nuance Comm Inc.
 Mindstech: Startup to teach spoken English in Asia using web
 audio/video (before 2-way media was widely available).
 Yahoo! Labs: Search advertising (optimization, auctions).
 Quantbench: mission is to create a marketplace for data
 scientists, data providers, and investment funds.




S4 Project History

 Started as a research project at Yahoo! Labs in August 2008
 out of the need to personalize search ads in real time.
 Open sourced in September 2009.
 Moved to Apache Incubator in October 2011.




Motivation


 Given multiple event streams, extract information using
 data-driven models in real time, with low latency, at scale.

 Application areas:
  Personalized Search
  Twitter Trends
  Online Parameter Optimization
  Predict Market Prices
  Automatic Trading
  Spam Filtering
  Network Intrusion Detection
  Sensor Networks

 It's Fun!
S4 Architecture

 Node: Unlimited number of nodes. Each node has one process.
 Server: There is one server process per node. The server
 loads/unloads apps.
 App: Apps encapsulate units of work. They can consume and
 produce event streams.
 PE Prototype, Stream: An app is a graph composed of PE prototypes
 and streams that produce, consume, and transmit messages.
 PE Instance: PE instances are clones of the prototype. They are
 associated with a unique key and contain the state.

S4 is a general-purpose, real-time, distributed, decentralized, robust, scalable,
event-driven, pluggable platform that allows programmers to easily implement
applications for processing continuous unbounded streams of data.
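The prototype/instance relationship described above can be illustrated with a small Python sketch. This is not the S4 API (S4 itself is a Java platform); the names `CounterPE`, `Stream`, `put`, and `key_of` are hypothetical and chosen only for illustration. The sketch clones a PE prototype the first time a key is seen and routes later events with the same key to that clone, which holds the per-key state.

```python
import copy


class CounterPE:
    """Hypothetical processing element (PE) that counts events for one key."""

    def __init__(self):
        self.key = None
        self.count = 0

    def on_event(self, event):
        self.count += 1


class Stream:
    """Hypothetical stream: routes each event to the PE instance for its key."""

    def __init__(self, prototype, key_of):
        self.prototype = prototype  # the prototype never receives events itself
        self.key_of = key_of        # function that extracts the key from an event
        self.instances = {}         # key -> PE instance, which holds the state

    def put(self, event):
        key = self.key_of(event)
        pe = self.instances.get(key)
        if pe is None:
            # First event for this key: clone the prototype.
            pe = copy.deepcopy(self.prototype)
            pe.key = key
            self.instances[key] = pe
        pe.on_event(event)


stream = Stream(CounterPE(), key_of=lambda e: e["user"])
for user in ["ann", "bob", "ann"]:
    stream.put({"user": user})

print({key: pe.count for key, pe in stream.instances.items()})
```

In this sketch all instances live in one process; in a distributed setting the key would presumably also decide which node hosts the instance, which the sketch does not attempt to model.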
Latency vs. Accuracy


            Zero Errors               Real-Time
Latency     ➔ Unconstrained           ➔ Constrained
Why?        ➔ Reproducible results    ➔ Limited control over inbound data
                                        rate and computing complexity
Use         ➔ Debug                   ➔ Process unstructured data
            ➔ Train Models            ➔ Tolerance to small errors
                                      ➔ Graceful recovery from
                                        inbound data streams
Design

 Actors programming model.
 Probabilistic thinking in both algorithms and systems.
 Run on commodity hardware.
 All in-memory, no disk bottlenecks.
 Pluggable (protocols, applications, serialization, etc.).
 Object-oriented design → POJOs.
 Static typing, no string literals, minimize type casting.
 Science-friendly → constant change, ease of use.




Programming Model


 Example: estimate click-through rate in a web application
 after applying a filter to remove bot traffic.
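A minimal sketch of this example in plain Python, not S4 code: the event fields (`ad`, `type`, `agent`), the `BOT_AGENTS` set, and the function names are assumptions made up for illustration. Bot events are dropped first, then clicks are divided by impressions per ad.

```python
from collections import defaultdict

# Hypothetical bot signatures; a real filter would be far more involved.
BOT_AGENTS = {"crawler", "spider"}


def is_bot(event):
    return event["agent"] in BOT_AGENTS


def estimate_ctr(events):
    """Return {ad: clicks / impressions}, counting only non-bot events."""
    impressions = defaultdict(int)
    clicks = defaultdict(int)
    for event in events:
        if is_bot(event):
            continue  # filter step: bot traffic never reaches the counters
        if event["type"] == "impression":
            impressions[event["ad"]] += 1
        elif event["type"] == "click":
            clicks[event["ad"]] += 1
    return {ad: clicks[ad] / n for ad, n in impressions.items() if n > 0}


events = [
    {"ad": "a1", "type": "impression", "agent": "browser"},
    {"ad": "a1", "type": "impression", "agent": "browser"},
    {"ad": "a1", "type": "click", "agent": "browser"},
    {"ad": "a1", "type": "impression", "agent": "crawler"},
    {"ad": "a1", "type": "click", "agent": "crawler"},
]
ctr = estimate_ctr(events)
print(ctr)  # {'a1': 0.5}
```

The sketch processes a finite list for simplicity. In a streaming setting the same counters would be updated one event at a time, with one counter pair per ad key.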




Coding an App




Research Areas: Systems

 Checkpointing strategies
 Replication strategies
 Dynamic load balancing
 Adaptive load management
 Query languages




Fault Tolerance

Problem                 Approaches                   S4
High Availability       ➔ Warm/hot failover          ➔ Warm failover
                        ➔ Cold failover              ➔ Standby nodes +
                                                       Apache ZooKeeper
State Loss              ➔ Lossy checkpointing        ➔ Lossy checkpointing
(Crashes, system        ➔ Lossless checkpointing
updates)
Low Latency             ➔ Decouple stream            ➔ Asynchronous writes
                          processing from            ➔ Uncoordinated
                          checkpointing                checkpointing

Approach: checkpoints are count-based or time-based; a pluggable backend supports
any data store; PEs are restored lazily; tuning is application-dependent.
Research by M. Morel, F. Junqueira, Yahoo! Research Europe, 2011.
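To make the approach concrete, here is a small Python sketch of count-based, lossy checkpointing with a pluggable store and lazy restore. It is an illustration under assumptions, not the S4 implementation: `DictStore`, `CheckpointedCounter`, and the `every` parameter are hypothetical names, and the write to the store is synchronous here for brevity, whereas the slide lists asynchronous writes.

```python
class DictStore:
    """Hypothetical backend; any object with save/load could be plugged in."""

    def __init__(self):
        self.data = {}

    def save(self, key, state):
        self.data[key] = state

    def load(self, key):
        return self.data.get(key)


class CheckpointedCounter:
    """Counts events per key and checkpoints every `every` events per key."""

    def __init__(self, store, every=3):
        self.store = store
        self.every = every
        self.counts = {}  # in-memory state, lost on a crash

    def on_event(self, key):
        if key not in self.counts:
            # Lazy restore: the store is consulted only when a key is first seen.
            saved = self.store.load(key)
            self.counts[key] = saved if saved is not None else 0
        self.counts[key] += 1
        if self.counts[key] % self.every == 0:
            self.store.save(key, self.counts[key])


store = DictStore()
pe = CheckpointedCounter(store, every=3)
for _ in range(5):
    pe.on_event("word")

# Simulate a crash: the in-memory state is gone, the store survives.
recovered = CheckpointedCounter(store, every=3)
recovered.on_event("word")
print(recovered.counts["word"])  # 4: restored 3 plus this event; events 4 and 5 were lost
```

The loss of the two events after the last checkpoint is what makes the scheme lossy. A smaller `every` narrows that window at the cost of more writes, which is one reason tuning depends on the application.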

Resilience in a Distributed Word Count Task




Research Areas: Algorithms

 Self-adaptive models: adaptive language models using small
 amounts of data.
 Personalization: learn from user feedback (clicks, location,
 behavior) to deliver relevant information in real time.
 Trend detection: find personal Twitter trends relevant to you.
 Intrusion detection: summarize high level state of the network
 and detect unusual patterns.
 Sensor networks: large amounts of audio/video and other
 sources require processing, recognition, detection, and
 tracking. Detect events across sensors.




Personalized Search Ads

 Goal is to maximize:
  Revenue
  Click yield
  User experience

 By controlling:
  Ranking
  Pricing
  Filtering
  Placement

S. Schroedl, A. Kesari, and L. Neumeyer, “Personalized ad placement in web search,” in ADKDD ’10: Proceedings of the 4th Annual
International Workshop on Data Mining and Audience Intelligence for Online Advertising, 2010.

Personalized Search Ads

 Model ad click intent using recent user activity.
 More likely to click → show more North ads (ads placed above the search results).

 Example 1
  First query is digital slr camera
  Next query is canon slr
  More likely than average to click another ad

 Example 2
  Repeated query without previous clicks
  Less likely to click another ad

Personalized Search Ads

 Modeling user session

 Typical features:
   Number of searches/clicks by the user in the past 24 hrs
   User COPC: Ratio of observed clicks to predicted clicks
   Identical query searched before / clicked before
   Time (seconds) since last search/click
   Similarity measures: current vs. previous queries

 Modeling technique: stochastic gradient-boosted decision
 trees (GBDT)
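The User COPC feature listed above is a simple ratio, sketched here in Python. The slide does not say how "predicted clicks" is obtained; this sketch assumes it is the sum of per-impression predicted click probabilities, and the function name `copc` is hypothetical.

```python
def copc(observed_clicks, predicted_click_probs):
    """Ratio of observed clicks to predicted clicks.

    Assumption: predicted clicks is the sum of the model's per-impression
    click probabilities. Returns None when nothing was predicted.
    """
    predicted = sum(predicted_click_probs)
    if predicted == 0:
        return None
    return observed_clicks / predicted


# A user who clicked 3 times where the model expected 1.5 clicks in total.
ratio = copc(3, [0.5, 0.25, 0.25, 0.5])
print(ratio)  # 2.0
```

Under this reading, a ratio above 1 means the user clicks more often than the model predicts, and a ratio below 1 means less often.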

Personalized Search Ads


   Target
      P[CLICK | ad, query, user]

   Approximation
      P[CLICK | ad, query] * ucp[user, session]

   where
      P[CLICK | ad, query]: non-personalized long-term model,
      computed using Hadoop
      ucp[user, session]: User Click Propensity (UCP) for the user
      session, computed using S4

Personalized Search Ads

 Results:

  We can reduce the average number of ads (ad footprint) by
  7% without decreasing click yield and revenue.

                - OR -

  For a given ad footprint we can increase click yield by
  ~2%.




Thank you!
 Join the Apache S4 project:

  s4-user-subscribe@incubator.apache.org

  s4-dev-subscribe@incubator.apache.org




Apache S4: A Distributed Stream Computing Platform for Real-Time Applications

  • 1. Apache S4: A Distributed Stream Computing Platform. Presented at Stanford Infolab – Nov 4, 2011. Project page: http://incubator.apache.org/projects/s4 (migrating from http://s4.io). S4 Committers: {fpj, kishoreg, leoneu, mmorel, robbins}@apache.org. Presented by Leo Neumeyer (@leoneu).
  • 2. About Me. Born in Buenos Aires, Argentina, studied EE. School/Work in Canada (Signal Processing, Speech Coding). SRI Int'l (Menlo Park) Speech Lab, DARPA benchmarks, lab founded speech recognition spin-off Nuance Comm Inc. Mindstech: Startup to teach spoken English in Asia using web audio/video (before 2-way media was widely available). Yahoo! Labs: Search advertising (optimization, auctions). Quantbench: mission is to create a marketplace for data scientists, data providers, and investment funds.
  • 3. S4 Project History. Started as a research project at Yahoo! Labs in August 2008 out of the need to personalize search ads in real-time. Open sourced in September 2009. Moved to Apache Incubator in October 2011.
  • 4. Motivation. Given multiple event streams, extract information using data driven models in real time, with low latency, at scale. Application areas: Personalized Search, Twitter Trends, Online Parameter Optimization, Predict Market Prices, Automatic Trading, Spam Filtering, Network Intrusion Detection, Sensor Networks. It's Fun!
  • 5. S4 Architecture. Diagram labels: Node, Server, App, PE Prototype, PE Instance, Stream.
      Nodes: Unlimited number of nodes. Each node has one process.
      Server: There is one server process per node. The server loads/unloads apps.
      Apps: Apps encapsulate units of work. They can consume and produce event streams.
      PE prototypes and streams: An app is a graph composed of PE prototypes and streams that produce, consume, and transmit msgs.
      PE instances: PE instances are clones of the prototype. They are associated with a unique key and contain the state.
      S4 is a general-purpose, real-time, distributed, decentralized, robust, scalable, event driven, pluggable platform that allows programmers to easily implement applications for processing continuous unbounded streams of data.
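The relationship between a PE prototype, its keyed PE instances, and a stream can be illustrated with a small sketch. This is not the S4 API: the names `CounterPE`, `Stream`, `put`, and `key_field` are hypothetical, and everything runs in one process, whereas S4 distributes instances across nodes.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class CounterPE:
    """Hypothetical processing element: counts the events it receives."""
    count: int = 0

    def on_event(self, event: dict) -> None:
        self.count += 1


@dataclass
class Stream:
    """Routes each event to the PE instance that owns the event's key."""
    prototype: CounterPE
    key_field: str
    instances: dict = field(default_factory=dict)

    def put(self, event: dict) -> None:
        key = event[self.key_field]
        if key not in self.instances:
            # First event for this key: clone the prototype into a new instance.
            self.instances[key] = copy.deepcopy(self.prototype)
        self.instances[key].on_event(event)


stream = Stream(prototype=CounterPE(), key_field="word")
for word in ["s4", "stream", "s4"]:
    stream.put({"word": word})

# Each instance holds the state for exactly one key.
counts = {key: pe.count for key, pe in stream.instances.items()}
print(counts)
```

The prototype itself never receives events; it only serves as the template from which per-key instances are cloned.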
  • 6. Latency vs. Accuracy
      Zero Errors. Latency: unconstrained. Why? Reproducible results. Use: debug, train models.
      Real-Time. Latency: constrained. Why? Limited control over inbound data rate and computing complexity. Use: process unstructured data, tolerance to small errors, graceful recovery from inbound data streams.
  • 7. Design. Actors programming model. Probabilistic thinking in both algorithms and systems. Run on commodity hardware. All in-memory, no disk bottlenecks. Pluggable (protocols, applications, serialization, etc.). Object oriented design → POJOs. Static typing, no string literals, minimize type casting. Science friendly → constant change, ease of use.
  • 8. Programming Model. Example: estimate click-through rate in a web application after applying a filter to remove bot traffic.
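A minimal sketch of the slide 8 example, assuming events are plain dictionaries with `type`, `ad`, and `user_agent` fields. The user-agent heuristic in `is_bot` and the class name `CtrEstimator` are illustrative assumptions, not taken from the talk or from S4.

```python
from collections import defaultdict

# Assumed heuristic for spotting bots; a real filter would be more elaborate.
BOT_MARKERS = ("bot", "crawler", "spider")


def is_bot(event: dict) -> bool:
    """Flag an event as bot traffic based on its user-agent string."""
    agent = event.get("user_agent", "").lower()
    return any(marker in agent for marker in BOT_MARKERS)


class CtrEstimator:
    """Keeps running impression and click counts per ad."""

    def __init__(self) -> None:
        self.impressions = defaultdict(int)
        self.clicks = defaultdict(int)

    def on_event(self, event: dict) -> None:
        if is_bot(event):
            return  # filter step: drop bot traffic before counting
        if event["type"] == "impression":
            self.impressions[event["ad"]] += 1
        elif event["type"] == "click":
            self.clicks[event["ad"]] += 1

    def ctr(self, ad: str) -> float:
        """Clicks divided by impressions, or 0.0 if the ad was never shown."""
        shown = self.impressions.get(ad, 0)
        return self.clicks.get(ad, 0) / shown if shown else 0.0


human, bot = "Mozilla/5.0", "ExampleBot/1.0"
events = [{"type": "impression", "ad": "ad-1", "user_agent": human}] * 4
events += [
    {"type": "click", "ad": "ad-1", "user_agent": human},
    {"type": "impression", "ad": "ad-1", "user_agent": bot},
    {"type": "click", "ad": "ad-1", "user_agent": bot},
]

estimator = CtrEstimator()
for event in events:
    estimator.on_event(event)
print(estimator.ctr("ad-1"))
```

With the bot events dropped, the estimate uses one click over four impressions; without the filter it would use two clicks over five impressions.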
  • 10. Research Areas: Systems. Checkpointing strategies; replication strategies; dynamic load balancing; adaptive load management; query languages.
  • 11. Fault Tolerance (Problem / Approaches / S4)
      High Availability. Approaches: warm/hot failover, cold failover. S4: warm failover, standby nodes + Apache ZooKeeper.
      State Loss (crashes, system updates). Approaches: lossy checkpointing, lossless checkpointing. S4: lossy checkpointing.
      Low Latency. Approaches: decouple stream processing from checkpointing. S4: asynchronous writes, uncoordinated checkpointing.
      Approach: checkpoints are count or time based, pluggable backend to support any data store, lazy PE restore, tuning is application dependent.
      Research by M. Morel, F. Junqueira, Yahoo! Research Europe, 2011.
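The checkpointing approach on slide 11 (count-based triggers, asynchronous writes, a pluggable backend, lazy restore) can be sketched roughly as follows. This is an illustration under stated assumptions, not S4's implementation: `DictStore`, `CheckpointedCounter`, and the `every` parameter are hypothetical names, and a single-threaded executor stands in for the asynchronous writer.

```python
from concurrent.futures import ThreadPoolExecutor


class DictStore:
    """Stand-in backend; any object with save/load could be plugged in."""

    def __init__(self) -> None:
        self.data = {}

    def save(self, key: str, state: dict) -> None:
        self.data[key] = state

    def load(self, key: str):
        state = self.data.get(key)
        return dict(state) if state is not None else None


class CheckpointedCounter:
    """Counting PE that checkpoints its state every `every` events."""

    def __init__(self, key, store, writer, every=3):
        self.key, self.store, self.writer, self.every = key, store, writer, every
        self.state = None  # restored lazily on the first event
        self.since_checkpoint = 0

    def on_event(self) -> None:
        if self.state is None:
            # Lazy restore: read the last checkpoint only when it is needed.
            self.state = self.store.load(self.key) or {"count": 0}
        self.state["count"] += 1
        self.since_checkpoint += 1
        if self.since_checkpoint >= self.every:
            # Asynchronous write: hand a snapshot to the writer thread and
            # keep processing without waiting for the store.
            self.writer.submit(self.store.save, self.key, dict(self.state))
            self.since_checkpoint = 0


store = DictStore()
with ThreadPoolExecutor(max_workers=1) as writer:
    pe = CheckpointedCounter("s4", store, writer)
    for _ in range(8):
        pe.on_event()
# Leaving the `with` block waits for pending writes to finish.

# Simulate a crash: a fresh instance restores from the last checkpoint.
with ThreadPoolExecutor(max_workers=1) as writer:
    recovered = CheckpointedCounter("s4", store, writer)
    recovered.on_event()
print(pe.state, store.data, recovered.state)
```

The scheme is lossy by design: events processed after the last checkpoint are not recovered, which is the trade-off the slide accepts in exchange for low latency.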
  • 12. Resilience in a Distributed Word Count Task
  • 13. Research Areas: Algorithms. Self-adaptive models: adaptive language models using small amounts of data. Personalization: learn from user feedback (clicks, location, behavior) to deliver relevant information in RT. Trend detection: find personal Twitter trends relevant to you. Intrusion detection: summarize high level state of the network and detect unusual patterns. Sensor networks: large amounts of audio/video and other sources require processing, recognition, detection, and tracking. Detect events across sensors.
  • 14. Personalized Search Ads. Goal is to maximize: revenue, click yield, user experience. By controlling: ranking, pricing, filtering, placement. Reference: S. Schroedl, A. Kesari, and L. Neumeyer, “Personalized ad placement in web search,” in ADKDD ’10: Proceedings of the 4th Annual International Workshop on Data Mining and Audience Intelligence for Online Advertising, 2010.
  • 15. Personalized Search Ads. Model ad click intent using recent user activity. More likely to click → show more North ads. Example 1: first query is "digital slr camera", next query is "canon slr"; more likely than average to click another ad. Example 2: repeated query without previous clicks; less likely to click another ad.
  • 16. Personalized Search Ads: Modeling user session. Typical features: number of searches/clicks by user in the past 24 hrs; user COPC (ratio of observed clicks to predicted clicks); identical query searched before / clicked before; time (seconds) since last search/click; similarity measures (current vs. previous queries). Modeling technique: stochastic gradient-descent boosted trees (GDBT).
  • 17. Personalized Search Ads. Target: P[CLICK|ad,query,user]. Approximation: P[CLICK|ad,query] * ucp[user,session]. The first factor, P[CLICK|ad,query], is the non-personalized long-term model, computed using Hadoop. The second factor, ucp[user,session], is the User Click Propensity (UCP) for the user session, computed using S4.
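The approximation on slide 17 multiplies a batch-computed probability by a session-level factor. The sketch below shows only that arithmetic; the lookup tables, their example values, the default UCP of 1.0 for unseen sessions, and the clamp to 1.0 are assumptions added for illustration and do not come from the talk.

```python
# Hypothetical output of the long-term, non-personalized model (batch side).
base_click_prob = {("ad-1", "canon slr"): 0.04}

# Hypothetical per-session propensities maintained by the streaming side.
# A factor above 1 raises the estimate, a factor below 1 lowers it.
ucp = {("user-7", "session-1"): 1.5}


def personalized_click_prob(ad, query, user, session, default_ucp=1.0):
    """Approximate P[CLICK|ad,query,user] as P[CLICK|ad,query] * ucp[user,session]."""
    base = base_click_prob[(ad, query)]
    factor = ucp.get((user, session), default_ucp)
    # Clamp so the product stays a valid probability (an added safeguard).
    return min(1.0, base * factor)


known = personalized_click_prob("ad-1", "canon slr", "user-7", "session-1")
unseen = personalized_click_prob("ad-1", "canon slr", "user-9", "session-4")
print(known, unseen)
```

Splitting the model this way lets the expensive long-term factor be refreshed offline while only the small per-session factor has to be updated in real time.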
  • 18. Personalized Search Ads. Results: we can reduce the average number of ads (ad footprint) by 7% without decreasing click yield and revenue, OR, for a given ad footprint, we can increase click yield by ~2%.
  • 19. Thank you! Join the Apache S4 project: s4-user-subscribe@incubator.apache.org, s4-dev-subscribe@incubator.apache.org