In many instances the terms "big data" and "Hadoop" are reserved for conversations about business analytics. Instead, I posit that these technologies are most powerful when deployed to build new products and to improve existing ones. Measurement is a fundamental part of the process, but more importantly I will walk through an effective tool chain that can be used to: a) build unique new products based on data, and b) test improvements to a product. At Foursquare, we've used a Hadoop-based tool chain to build new products (like social recommendations) and to improve existing features through initiatives such as experimentation and offline data generation. These products and improvements are fundamental to our core business, yet they would not exist without Hadoop. I will pull examples from Foursquare and other companies to demonstrate these points, and outline the infrastructure components needed to accomplish them.
2. 2013
What is Foursquare?
Foursquare helps you explore the world around you.
Meet up with friends, discover new places, and save money using your phone.
• 4bn check-ins
• 35mm users
• 50mm POIs
• 150 employees
• 1tb+ a day of data
The Right Tool for the Job
• Nginx – Serving static files
• Perl – Regular expressions
• XML – Frustrating people
• Hadoop (Map Reduce) – Counting
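"Counting" is the crux of the talk. As a toy illustration (not Foursquare's code), the MapReduce mapper/reducer contract for a word count can be sketched in plain Python:

```python
from collections import Counter
from itertools import chain

def map_phase(line):
    """Mapper: emit a (word, 1) pair for each word in the line."""
    return [(word, 1) for word in line.split()]

def reduce_phase(pairs):
    """Reducer: sum the emitted counts for each word."""
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return counts

lines = ["hadoop is good at counting", "counting is what hadoop does"]
counts = reduce_phase(chain.from_iterable(map_phase(l) for l in lines))
```

Hadoop's real value is running exactly this shape of job over terabytes, with the shuffle between map and reduce handled for you.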
SIPS
• Tokenize data with a language model (into N-grams)
• built using tips, shouts, menu items, likes, etc.
• Apply a TF-IDF algorithm (term frequency, inverse document frequency)
• Global phrase count
• Local phrase count ( in a venue )
• Some Filtering and ranking
• Re-compute & deploy nightly
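The steps above can be sketched end to end. This is an illustrative toy, not Foursquare's pipeline — the real system uses a custom language model and runs as a nightly MapReduce — but it shows the core idea: count phrases locally per venue, discount by how common they are globally.

```python
import math
from collections import Counter

def ngrams(tokens, n=2):
    """All 1..n-grams from a token list (a crude stand-in for the language model)."""
    out = []
    for size in range(1, n + 1):
        for i in range(len(tokens) - size + 1):
            out.append(" ".join(tokens[i:i + size]))
    return out

def sips(venue_tips, top_k=3):
    """Rank phrases per venue by TF-IDF: local phrase count vs global document frequency."""
    df = Counter()          # global: how many venues mention each phrase
    venue_phrases = {}      # local: phrase counts within one venue
    for venue, tips in venue_tips.items():
        phrases = Counter()
        for tip in tips:
            phrases.update(ngrams(tip.lower().split()))
        venue_phrases[venue] = phrases
        df.update(phrases.keys())

    n_venues = len(venue_tips)
    result = {}
    for venue, phrases in venue_phrases.items():
        scored = {p: tf * math.log(n_venues / df[p]) for p, tf in phrases.items()}
        result[venue] = sorted(scored, key=scored.get, reverse=True)[:top_k]
    return result
```

A phrase mentioned in every venue ("great") scores zero; a phrase mentioned often at one venue and rarely elsewhere ("espresso" at a cafe) rises to the top.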
SIPS – Without Hadoop
Potential Problems
• Database Query Throttling
• Venues are out of sync
• Altering the algorithm could take forever to populate for all venues
• Where would you store the results?
• What about debug data?
• Does it scale to 10x, 100x?
• What about other, similar workflows?
SIPS – Hadoop Benefits
• Quick Deployment
• Modular & Reusable
• Arbitrarily complex combination of many datasets
• Every step of the workflow creates value
Apple Store - Downtown San Francisco
1 tip mentions "haircuts"
Search for "haircuts" in "San Francisco" returns the Apple Store???
Fixed by looking at the % of tips and overall frequency
"Hey Apple, how bout less shiny pizzazz and fancy haircuts and more fix-my-f!@#$-imac"
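The fix described above can be sketched as a simple threshold check. The function name and the specific thresholds here are hypothetical, not Foursquare's actual values:

```python
def keep_phrase(local_count, venue_tip_count, global_count,
                min_tip_fraction=0.05, min_global_count=10):
    """Keep a phrase as a SIP only if it appears in a meaningful
    fraction of the venue's tips AND occurs often enough globally
    to be a real phrase. Thresholds are illustrative assumptions."""
    if venue_tip_count == 0:
        return False
    fraction = local_count / venue_tip_count
    return fraction >= min_tip_fraction and global_count >= min_global_count
```

One sarcastic "haircuts" tip among thousands at the Apple Store fails the %-of-tips test, so it never surfaces in search.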
MapReduce Friendly Datastore
A few obvious ones:
• Hbase
• Cassandra
• Voldemort
We built our own; it's very similar to Voldemort and uses the HFile API.
Other reasons not to use Hadoop
• Your idea might not be very good
• Hadoop will slow you down to start with
• You don’t have enough infrastructure yet
• build it when you need it
• V1 might not be that complex
• V1 could be a spreadsheet
SIPS
Version 1
• Off the shelf language model
• A subset of Venues & Tips
• Did not use Map Reduce
• Did not push to production at all
SIPS
Version 2
• Started building our own language model
• Rewritten as a Map Reduce
• Manually loaded data to production
• Filters for English data only.
Tweak, improve, etc
SIPS
Version 3
• Incorporated more data sources into our language model
• Deployment to KV store (auto)
• Incorporated lots of debug output
• Language pipeline also feeds sentiment analysis
Now we’re in the perfect place to iterate & improve
In Summary
• Hadoop is good for counting, so use it for counting
• Move quickly whenever possible and don't worry about automation
• Bring in new production services as you need them
• Freedom!
Friend in finance: spent 2 years building a management platform, then scrapped the project. The fund manager hired a kid to build Excel macros. Right tool for the job.
Great for analytics. Great for your products too.
- TF-IDF: counting globally, counting locally
Use lots of data sources without fear. Each MR step outputs data to HDFS that can be used in other workflows. This makes the workflow naturally modular and easy to test in isolated parts.
Once you’ve solved the MR -> Datastore problem once, you’ve solved it for good.
Every task has requirements: other tasks, directories with _SUCCESS flags. Run on cron.
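The _SUCCESS-flag dependency pattern mentioned here can be sketched as a tiny readiness check. This is a sketch assuming a local filesystem; a real workflow engine would check HDFS paths instead:

```python
import os

def inputs_ready(input_dirs):
    """A task may run only when every upstream output directory has
    been marked complete with a _SUCCESS flag file (the convention
    Hadoop jobs use to signal a finished, complete output)."""
    return all(os.path.exists(os.path.join(d, "_SUCCESS"))
               for d in input_dirs)
```

A cron-driven runner then simply skips any task whose inputs aren't ready and retries on the next tick.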
- Hadoop-friendly datastore
-- we built our own (HFile Service)
-- -- immutable
-- -- downloads data from S3
-- -- reads everything into memory (but doesn't need to)
-- -- create X shards using map-reduce, swap these into X servers; they memory-map the files
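The "create X shards, swap into X servers" step works because the map-reduce writer and the serving layer agree on where each key lives. A minimal sketch of deterministic key-to-shard assignment (my illustration, not the HFile Service's actual code):

```python
import hashlib

def shard_for_key(key: bytes, num_shards: int) -> int:
    """Hash the key and take it modulo the shard count. Because the
    map-reduce job that writes the shards and the serving layer that
    reads them use the same function, a lookup goes straight to the
    one server holding the key."""
    digest = hashlib.md5(key).digest()
    return int.from_bytes(digest[:8], "big") % num_shards
```

Immutability is what makes the swap safe: a nightly job builds a fresh set of shard files, and servers atomically switch to the new generation.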
When in production, Hadoop lets you iterate quickly. Right now, it slows you down. Still work offline; do it without any of the important components I just told you about.
Build an MVP in a spreadsheet, webview, whatever. Even if you deploy it, you can manually load data into a DB to start with. If you're testing a v1 for a limited subset (employees), you probably don't have much data anyway.
This didn’t need any of the key infrastructure components
This needed database dumps. Ran on a cron. Loaded manually.
Needed database dumps. Run with our dependency management engine. Loads to our production datastore.