SlideShare una empresa de Scribd logo
1 de 29
Descargar para leer sin conexión
MapReduce:
Beyond Word Count
Jeff Patti
https://github.com/jepatti/mrjob_recipes
What is MapReduce?
“MapReduce is a programming model for processing large
data sets with a parallel, distributed algorithm on a cluster.”
- Wikipedia
Map - given a line of a file, yield key: value pairs
Reduce - given a key and all values with that key from the
prior map phase, yield key: value pairs
Word Count
Problem: count frequencies of words in
documents
Word Count Using mrjob
def mapper(self, key, line):
for word in line.split():
yield word, 1

def reducer(self, word, occurrences):
yield word, sum(occurrences)
Sample Output
"ligula" 4
"ligula." 2
"lorem" 5
"lorem." 4
"luctus" 3
"magna" 5
"magna," 3
"magnis" 1
Monetate Background
● Core products are merchandising,
personalization, testing, etc.
● A/B & Multivariate testing to determine
impact of experiments
● Involved with >20% of ecommerce spend
each holiday season for the past 2 years
running
Monetate Stack
● Distributed across multiple availability zones
and regions for redundancy, scaling, and
lower round trip times
● Real time decision engine using MySQL
● Nightly processing of each days data via
Hadoop using mrjob, a python library for
writing mapreduce jobs
Beyond Word Count
● Activity stream sessionization
● Product recommendations
● User behavior statistics
Activity Stream Sessionization
Goal: collate user activity, splitting into different
sessions if user inactive for more than 5
minutes
Input format: timestamp, user_id
Collate user activity
def mapper(self, key, line):
timestamp, user_id = line.split()
yield user_id, timestamp

def reducer(self, uid, timestamps):
yield uid, sorted(timestamps)
Sample Output
"998" ["1384389407", "1384389417", "1384389422",
"1384389425", "1384390407", "1384390417",
"1384391416", "1384392410", "1384392416",
"1384395420", "1384396405"]
"999" ["1384388414", "1384388425", "1384389419",
"1384389420", "1384390420", "1384391415",
"1384391418", "1384393413", "1384393425",
"1384394426", "1384395416", "1384396415",
"1384396422"]
Segment into Sessions
MAX_SESSION_INACTIVITY = 60 * 5
...
def reducer(self, uid, timestamps):
timestamps = sorted(timestamps)
start_index = 0
for index, timestamp in enumerate(timestamps):
if index > 0:
if timestamp - timestamps[index-1] >
MAX_SESSION_INACTIVITY:
yield uid, timestamps[start_index:index]
start_index = index
yield uid, timestamps[start_index:]
Sample Output
"999"[1384388414, 1384388425]
"999"[1384389419, 1384389420]
"999"[1384390420]
"999"[1384391415, 1384391418]
"999"[1384393413, 1384393425]
"999"[1384394426]
"999"[1384395416]
"999"[1384396415, 1384396422]
Product Recommendations
Goal: For each product a client sells, generate
a ‘people who bought this also bought this’
recommendation
Input: product_id_1, product_id_2, ...
Coincident Purchase Frequency
def mapper(self, key, line):
purchases = set(line.split(','))
for p1, p2 in permutations(purchases, 2):
yield (p1, p2), 1

def reducer(self, pair, occurrences):
p1, p2 = pair
yield p1, (p2, sum(occurrences))
Sample output
"8" ["5", 11]
"8" ["6", 19]
"8" ["7", 14]
"8" ["9", 11]
"9" ["1", 20]
"9" ["10", 22]
"9" ["11", 21]
"9" ["12", 13]
Top Recommendations
def reducer(self, purchase_pair, occurrences):
p1, p2 = purchase_pair
yield p1, (sum(occurrences), p2)

def reducer_find_best_recos(self, p1, p2_occurrences):
top_products = sorted(p2_occurrences, reverse=True)[:5]
top_products = [p2 for occurrences, p2 in top_products]
yield p1, top_products

def steps(self):
return [self.mr(mapper=self.mapper, reducer=self.reducer),
self.mr(reducer=self.reducer_find_best_recos)]
Sample Output
"7"
"8"
"9"

["15", "18", "17", "16", "3"]
["14", "15", "20", "6", "3"]
["15", "17", "19", "6", "3"]
Top Recommendations
Multi Account
def mapper(self, key, line):
account_id, purchases = line.split()
purchases = set(purchases.split(','))
for p1, p2 in permutations(purchases, 2):
yield (account_id, p1, p2), 1

def reducer(self, purchase_pair, occurrences):
account_id, p1, p2 = purchase_pair
yield (account_id, p1), (sum(occurrences), p2)

2nd step reducer unchanged
Sample Output
["9", "20"]
["9", "3"]
["9", "4"]
["9", "5"]
["9", "6"]
["9", "7"]
["9", "8"]
["9", "9"]

["8", "14", "13", "10", "1"]
["2", "4", "16", "11", "17"]
["3", "18", "11", "16", "15"]
["2", "1", "7", "18", "17"]
["12", "3", "2", "17", "16"]
["18", "5", "17", "1", "9"]
["20", "14", "13", "10", "4"]
["18", "7", "6", "5", "4"]
User Behavior Statistics
Goal: compute statistics about user behavior
(conversion rate & time on site) by account and
experiment in an efficient manner
Input:
account_id, campaigns_viewed, user_id, purchased?,
session_start_time, session_end_time
Statistics Primer
With sample count, mean, and variance for
each side of an experiment we can compute all
the statistics our analytics package displays
Statistics Primer (cont.)
y = a sessions metric value, ex: time on site
● Sample count: count the number of sessions
that viewed the experiment
○ sum(y^0)

● Mean: sum the metric / sample count
○ sum(y^1)/sum(y^0)
Statistics Primer (cont.)
● Variance:

○ Variance = mean of square minus square of mean
○ Variance = sum(y^2)/sum(y^0) - (sum(y^1)/sum(y^0)) ^ 2

For each side of an experiment we only need to
generate: sum(y^0), sum(y^1), sum(y^2)
Statistics by account
statistic_rollup/statistic_summarize.py
Sample Output
["8", "average session length"] [99, 24463, 7968891]
["8", "conversion rate"] [99, 45, 45]
["9", "average session length"] [115, 29515, 10071591]
["9", "conversion rate"] [115, 55, 55]
Statistics by experiment
statistic_rollup_by_experiment/statistic_summa
rize.py
Sample Output
["9", 0, "average session length"] [32, 8405, 3031009]
["9", 0, "conversion rate"] [32, 20, 20]
["9", 1, "average session length"] [23, 5405, 1770785]
["9", 1, "conversion rate"] [23, 14, 14]
["9", 2, "average session length"] [39, 9481, 2965651]
["9", 2, "conversion rate"] [39, 20, 20]
["9", 3, "average session length"] [25, 6276, 2151014]
["9", 3, "conversion rate"] [25, 13, 13]
["9", 4, "average session length"] [27, 5721, 1797715]
["9", 4, "conversion rate"] [27, 16, 16]
Questions?

?

Más contenido relacionado

La actualidad más candente

Unit-IV Windowing and Clipping.pdf
Unit-IV Windowing and Clipping.pdfUnit-IV Windowing and Clipping.pdf
Unit-IV Windowing and Clipping.pdfAmol Gaikwad
 
The Production and Visual FX of Killzone Shadow Fall
The Production and Visual FX of Killzone Shadow FallThe Production and Visual FX of Killzone Shadow Fall
The Production and Visual FX of Killzone Shadow FallGuerrilla
 
송창규, unity build로 빌드타임 반토막내기, NDC2010
송창규, unity build로 빌드타임 반토막내기, NDC2010송창규, unity build로 빌드타임 반토막내기, NDC2010
송창규, unity build로 빌드타임 반토막내기, NDC2010devCAT Studio, NEXON
 
jemalloc 세미나
jemalloc 세미나jemalloc 세미나
jemalloc 세미나Jang Hoon
 
2D Transformations(Computer Graphics)
2D Transformations(Computer Graphics)2D Transformations(Computer Graphics)
2D Transformations(Computer Graphics)AditiPatni3
 
Modern C++ 프로그래머를 위한 CPP11/14 핵심
Modern C++ 프로그래머를 위한 CPP11/14 핵심Modern C++ 프로그래머를 위한 CPP11/14 핵심
Modern C++ 프로그래머를 위한 CPP11/14 핵심흥배 최
 
[Glitch版] 0から始めようWebAR/VR入門ハンズオン with 織りなすラボ
[Glitch版] 0から始めようWebAR/VR入門ハンズオン with 織りなすラボ[Glitch版] 0から始めようWebAR/VR入門ハンズオン with 織りなすラボ
[Glitch版] 0から始めようWebAR/VR入門ハンズオン with 織りなすラボTakashi Yoshinaga
 

La actualidad más candente (7)

Unit-IV Windowing and Clipping.pdf
Unit-IV Windowing and Clipping.pdfUnit-IV Windowing and Clipping.pdf
Unit-IV Windowing and Clipping.pdf
 
The Production and Visual FX of Killzone Shadow Fall
The Production and Visual FX of Killzone Shadow FallThe Production and Visual FX of Killzone Shadow Fall
The Production and Visual FX of Killzone Shadow Fall
 
송창규, unity build로 빌드타임 반토막내기, NDC2010
송창규, unity build로 빌드타임 반토막내기, NDC2010송창규, unity build로 빌드타임 반토막내기, NDC2010
송창규, unity build로 빌드타임 반토막내기, NDC2010
 
jemalloc 세미나
jemalloc 세미나jemalloc 세미나
jemalloc 세미나
 
2D Transformations(Computer Graphics)
2D Transformations(Computer Graphics)2D Transformations(Computer Graphics)
2D Transformations(Computer Graphics)
 
Modern C++ 프로그래머를 위한 CPP11/14 핵심
Modern C++ 프로그래머를 위한 CPP11/14 핵심Modern C++ 프로그래머를 위한 CPP11/14 핵심
Modern C++ 프로그래머를 위한 CPP11/14 핵심
 
[Glitch版] 0から始めようWebAR/VR入門ハンズオン with 織りなすラボ
[Glitch版] 0から始めようWebAR/VR入門ハンズオン with 織りなすラボ[Glitch版] 0から始めようWebAR/VR入門ハンズオン with 織りなすラボ
[Glitch版] 0から始めようWebAR/VR入門ハンズオン with 織りなすラボ
 

Similar a Map reduce: beyond word count

Digital analytics with R - Sydney Users of R Forum - May 2015
Digital analytics with R - Sydney Users of R Forum - May 2015Digital analytics with R - Sydney Users of R Forum - May 2015
Digital analytics with R - Sydney Users of R Forum - May 2015Johann de Boer
 
Hadoop France meetup Feb2016 : recommendations with spark
Hadoop France meetup  Feb2016 : recommendations with sparkHadoop France meetup  Feb2016 : recommendations with spark
Hadoop France meetup Feb2016 : recommendations with sparkModern Data Stack France
 
How to Realize an Additional 270% ROI on Snowflake
How to Realize an Additional 270% ROI on SnowflakeHow to Realize an Additional 270% ROI on Snowflake
How to Realize an Additional 270% ROI on SnowflakeAtScale
 
Monitoring Complex Systems: Keeping Your Head on Straight in a Hard World
Monitoring Complex Systems: Keeping Your Head on Straight in a Hard WorldMonitoring Complex Systems: Keeping Your Head on Straight in a Hard World
Monitoring Complex Systems: Keeping Your Head on Straight in a Hard WorldBrian Troutwine
 
Feature Engineering - Getting most out of data for predictive models - TDC 2017
Feature Engineering - Getting most out of data for predictive models - TDC 2017Feature Engineering - Getting most out of data for predictive models - TDC 2017
Feature Engineering - Getting most out of data for predictive models - TDC 2017Gabriel Moreira
 
Feature surfacing - meetup
Feature surfacing  - meetupFeature surfacing  - meetup
Feature surfacing - meetupPredicSis
 
Streaming Solr - Activate 2018 talk
Streaming Solr - Activate 2018 talkStreaming Solr - Activate 2018 talk
Streaming Solr - Activate 2018 talkAmrit Sarkar
 
Building Analytics Applications with Streaming Expressions in Apache Solr - A...
Building Analytics Applications with Streaming Expressions in Apache Solr - A...Building Analytics Applications with Streaming Expressions in Apache Solr - A...
Building Analytics Applications with Streaming Expressions in Apache Solr - A...Lucidworks
 
TSAR (TimeSeries AggregatoR) Tech Talk
TSAR (TimeSeries AggregatoR) Tech TalkTSAR (TimeSeries AggregatoR) Tech Talk
TSAR (TimeSeries AggregatoR) Tech TalkAnirudh Todi
 
Timeseries - data visualization in Grafana
Timeseries - data visualization in GrafanaTimeseries - data visualization in Grafana
Timeseries - data visualization in GrafanaOCoderFest
 
Rethinking metrics: metrics 2.0 @ Lisa 2014
Rethinking metrics: metrics 2.0 @ Lisa 2014Rethinking metrics: metrics 2.0 @ Lisa 2014
Rethinking metrics: metrics 2.0 @ Lisa 2014Dieter Plaetinck
 
Usability testing
Usability testingUsability testing
Usability testinggamelanYK
 
XGBoost @ Fyber
XGBoost @ FyberXGBoost @ Fyber
XGBoost @ FyberDaniel Hen
 
Business Dashboards using Bonobo ETL, Grafana and Apache Airflow
Business Dashboards using Bonobo ETL, Grafana and Apache AirflowBusiness Dashboards using Bonobo ETL, Grafana and Apache Airflow
Business Dashboards using Bonobo ETL, Grafana and Apache AirflowRomain Dorgueil
 
Bootstrapping of PySpark Models for Factorial A/B Tests
Bootstrapping of PySpark Models for Factorial A/B TestsBootstrapping of PySpark Models for Factorial A/B Tests
Bootstrapping of PySpark Models for Factorial A/B TestsDatabricks
 
TransmogrifAI - Automate Machine Learning Workflow with the power of Scala an...
TransmogrifAI - Automate Machine Learning Workflow with the power of Scala an...TransmogrifAI - Automate Machine Learning Workflow with the power of Scala an...
TransmogrifAI - Automate Machine Learning Workflow with the power of Scala an...Chetan Khatri
 
E.D.D.I - Open Source Chatbot Platform
E.D.D.I - Open Source Chatbot PlatformE.D.D.I - Open Source Chatbot Platform
E.D.D.I - Open Source Chatbot PlatformGregor Jarisch
 
Benchmarking and PHPBench
Benchmarking and PHPBenchBenchmarking and PHPBench
Benchmarking and PHPBenchdantleech
 

Similar a Map reduce: beyond word count (20)

Digital analytics with R - Sydney Users of R Forum - May 2015
Digital analytics with R - Sydney Users of R Forum - May 2015Digital analytics with R - Sydney Users of R Forum - May 2015
Digital analytics with R - Sydney Users of R Forum - May 2015
 
Hadoop France meetup Feb2016 : recommendations with spark
Hadoop France meetup  Feb2016 : recommendations with sparkHadoop France meetup  Feb2016 : recommendations with spark
Hadoop France meetup Feb2016 : recommendations with spark
 
How to Realize an Additional 270% ROI on Snowflake
How to Realize an Additional 270% ROI on SnowflakeHow to Realize an Additional 270% ROI on Snowflake
How to Realize an Additional 270% ROI on Snowflake
 
Monitoring Complex Systems: Keeping Your Head on Straight in a Hard World
Monitoring Complex Systems: Keeping Your Head on Straight in a Hard WorldMonitoring Complex Systems: Keeping Your Head on Straight in a Hard World
Monitoring Complex Systems: Keeping Your Head on Straight in a Hard World
 
R console
R consoleR console
R console
 
Feature Engineering - Getting most out of data for predictive models - TDC 2017
Feature Engineering - Getting most out of data for predictive models - TDC 2017Feature Engineering - Getting most out of data for predictive models - TDC 2017
Feature Engineering - Getting most out of data for predictive models - TDC 2017
 
Feature surfacing - meetup
Feature surfacing  - meetupFeature surfacing  - meetup
Feature surfacing - meetup
 
Streaming Solr - Activate 2018 talk
Streaming Solr - Activate 2018 talkStreaming Solr - Activate 2018 talk
Streaming Solr - Activate 2018 talk
 
Building Analytics Applications with Streaming Expressions in Apache Solr - A...
Building Analytics Applications with Streaming Expressions in Apache Solr - A...Building Analytics Applications with Streaming Expressions in Apache Solr - A...
Building Analytics Applications with Streaming Expressions in Apache Solr - A...
 
TSAR (TimeSeries AggregatoR) Tech Talk
TSAR (TimeSeries AggregatoR) Tech TalkTSAR (TimeSeries AggregatoR) Tech Talk
TSAR (TimeSeries AggregatoR) Tech Talk
 
Tsar tech talk
Tsar tech talkTsar tech talk
Tsar tech talk
 
Timeseries - data visualization in Grafana
Timeseries - data visualization in GrafanaTimeseries - data visualization in Grafana
Timeseries - data visualization in Grafana
 
Rethinking metrics: metrics 2.0 @ Lisa 2014
Rethinking metrics: metrics 2.0 @ Lisa 2014Rethinking metrics: metrics 2.0 @ Lisa 2014
Rethinking metrics: metrics 2.0 @ Lisa 2014
 
Usability testing
Usability testingUsability testing
Usability testing
 
XGBoost @ Fyber
XGBoost @ FyberXGBoost @ Fyber
XGBoost @ Fyber
 
Business Dashboards using Bonobo ETL, Grafana and Apache Airflow
Business Dashboards using Bonobo ETL, Grafana and Apache AirflowBusiness Dashboards using Bonobo ETL, Grafana and Apache Airflow
Business Dashboards using Bonobo ETL, Grafana and Apache Airflow
 
Bootstrapping of PySpark Models for Factorial A/B Tests
Bootstrapping of PySpark Models for Factorial A/B TestsBootstrapping of PySpark Models for Factorial A/B Tests
Bootstrapping of PySpark Models for Factorial A/B Tests
 
TransmogrifAI - Automate Machine Learning Workflow with the power of Scala an...
TransmogrifAI - Automate Machine Learning Workflow with the power of Scala an...TransmogrifAI - Automate Machine Learning Workflow with the power of Scala an...
TransmogrifAI - Automate Machine Learning Workflow with the power of Scala an...
 
E.D.D.I - Open Source Chatbot Platform
E.D.D.I - Open Source Chatbot PlatformE.D.D.I - Open Source Chatbot Platform
E.D.D.I - Open Source Chatbot Platform
 
Benchmarking and PHPBench
Benchmarking and PHPBenchBenchmarking and PHPBench
Benchmarking and PHPBench
 

Último

Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Google AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGGoogle AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGSujit Pal
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 

Último (20)

Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Google AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGGoogle AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAG
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 

Map reduce: beyond word count

  • 1. MapReduce: Beyond Word Count Jeff Patti https://github.com/jepatti/mrjob_recipes
  • 2. What is MapReduce? “MapReduce is a programming model for processing large data sets with a parallel, distributed algorithm on a cluster.” - Wikipedia Map - given a line of a file, yield key: value pairs Reduce - given a key and all values with that key from the prior map phase, yield key: value pairs
  • 3. Word Count Problem: count frequencies of words in documents
  • 4. Word Count Using mrjob def mapper(self, key, line): for word in line.split(): yield word, 1 def reducer(self, word, occurrences): yield word, sum(occurrences)
  • 5. Sample Output "ligula" 4 "ligula." 2 "lorem" 5 "lorem." 4 "luctus" 3 "magna" 5 "magna," 3 "magnis" 1
  • 6. Monetate Background ● Core products are merchandising, personalization, testing, etc. ● A/B & Multivariate testing to determine impact of experiments ● Involved with >20% of ecommerce spend each holiday season for the past 2 years running
  • 7. Monetate Stack ● Distributed across multiple availability zones and regions for redundancy, scaling, and lower round trip times ● Real time decision engine using MySQL ● Nightly processing of each days data via Hadoop using mrjob, a python library for writing mapreduce jobs
  • 8. Beyond Word Count ● Activity stream sessionization ● Product recommendations ● User behavior statistics
  • 9. Activity Stream Sessionization Goal: collate user activity, splitting into different sessions if user inactive for more than 5 minutes Input format: timestamp, user_id
  • 10. Collate user activity def mapper(self, key, line): timestamp, user_id = line.split() yield user_id, timestamp def reducer(self, uid, timestamps): yield uid, sorted(timestamps)
  • 11. Sample Output "998" ["1384389407", "1384389417", "1384389422", "1384389425", "1384390407", "1384390417", "1384391416", "1384392410", "1384392416", "1384395420", "1384396405"] "999" ["1384388414", "1384388425", "1384389419", "1384389420", "1384390420", "1384391415", "1384391418", "1384393413", "1384393425", "1384394426", "1384395416", "1384396415", "1384396422"]
  • 12. Segment into Sessions MAX_SESSION_INACTIVITY = 60 * 5 ... def reducer(self, uid, timestamps): timestamps = sorted(timestamps) start_index = 0 for index, timestamp in enumerate(timestamps): if index > 0: if timestamp - timestamps[index-1] > MAX_SESSION_INACTIVITY: yield uid, timestamps[start_index:index] start_index = index yield uid, timestamps[start_index:]
  • 13. Sample Output "999"[1384388414, 1384388425] "999"[1384389419, 1384389420] "999"[1384390420] "999"[1384391415, 1384391418] "999"[1384393413, 1384393425] "999"[1384394426] "999"[1384395416] "999"[1384396415, 1384396422]
  • 14. Product Recommendations Goal: For each product a client sells, generate a ‘people who bought this also bought this’ recommendation Input: product_id_1, product_id_2, ...
  • 15. Coincident Purchase Frequency def mapper(self, key, line): purchases = set(line.split(',')) for p1, p2 in permutations(purchases, 2): yield (p1, p2), 1 def reducer(self, pair, occurrences): p1, p2 = pair yield p1, (p2, sum(occurrences))
  • 16. Sample output "8" ["5", 11] "8" ["6", 19] "8" ["7", 14] "8" ["9", 11] "9" ["1", 20] "9" ["10", 22] "9" ["11", 21] "9" ["12", 13]
  • 17. Top Recommendations def reducer(self, purchase_pair, occurrences): p1, p2 = purchase_pair yield p1, (sum(occurrences), p2) def reducer_find_best_recos(self, p1, p2_occurrences): top_products = sorted(p2_occurrences, reverse=True)[:5] top_products = [p2 for occurrences, p2 in top_products] yield p1, top_products def steps(self): return [self.mr(mapper=self.mapper, reducer=self.reducer), self.mr(reducer=self.reducer_find_best_recos)]
  • 18. Sample Output "7" "8" "9" ["15", "18", "17", "16", "3"] ["14", "15", "20", "6", "3"] ["15", "17", "19", "6", "3"]
  • 19. Top Recommendations Multi Account def mapper(self, key, line): account_id, purchases = line.split() purchases = set(purchases.split(',')) for p1, p2 in permutations(purchases, 2): yield (account_id, p1, p2), 1 def reducer(self, purchase_pair, occurrences): account_id, p1, p2 = purchase_pair yield (account_id, p1), (sum(occurrences), p2) 2nd step reducer unchanged
  • 20. Sample Output ["9", "20"] ["9", "3"] ["9", "4"] ["9", "5"] ["9", "6"] ["9", "7"] ["9", "8"] ["9", "9"] ["8", "14", "13", "10", "1"] ["2", "4", "16", "11", "17"] ["3", "18", "11", "16", "15"] ["2", "1", "7", "18", "17"] ["12", "3", "2", "17", "16"] ["18", "5", "17", "1", "9"] ["20", "14", "13", "10", "4"] ["18", "7", "6", "5", "4"]
  • 21. User Behavior Statistics Goal: compute statistics about user behavior (conversion rate & time on site) by account and experiment in an efficient manner Input: account_id, campaigns_viewed, user_id, purchased?, session_start_time, session_end_time
  • 22. Statistics Primer With sample count, mean, and variance for each side of an experiment we can compute all the statistics our analytics package displays
  • 23. Statistics Primer (cont.) y = a sessions metric value, ex: time on site ● Sample count: count the number of sessions that viewed the experiment ○ sum(y^0) ● Mean: sum the metric / sample count ○ sum(y^1)/sum(y^0)
  • 24. Statistics Primer (cont.) ● Variance: ○ Variance = mean of square minus square of mean ○ Variance = sum(y^2)/sum(y^0) - (sum(y^1)/sum(y^0)) ^ 2 For each side of an experiment we only need to generate: sum(y^0), sum(y^1), sum(y^2)
  • 26. Sample Output ["8", "average session length"] [99, 24463, 7968891] ["8", "conversion rate"] [99, 45, 45] ["9", "average session length"] [115, 29515, 10071591] ["9", "conversion rate"] [115, 55, 55]
  • 28. Sample Output ["9", 0, "average session length"] [32, 8405, 3031009] ["9", 0, "conversion rate"] [32, 20, 20] ["9", 1, "average session length"] [23, 5405, 1770785] ["9", 1, "conversion rate"] [23, 14, 14] ["9", 2, "average session length"] [39, 9481, 2965651] ["9", 2, "conversion rate"] [39, 20, 20] ["9", 3, "average session length"] [25, 6276, 2151014] ["9", 3, "conversion rate"] [25, 13, 13] ["9", 4, "average session length"] [27, 5721, 1797715] ["9", 4, "conversion rate"] [27, 16, 16]