This document describes how TrustYou processes hotel review data at scale to help travellers choose hotels. TrustYou crawls millions of hotel reviews (2–3 million new reviews per week) in 25+ languages, and applies natural language processing and machine learning to analyze the text and provide recommendations. Workflows are managed with Luigi; tasks include crawling, text processing, modeling word embeddings, and powering a sample application. Hadoop and Python are used extensively to handle the large-scale processing.
Helping Travellers Using 500 Million Hotel Reviews a Month
1. Helping Travellers Make
Better Hotel Choices
500 Million Times a Month
Miguel Cabrera
@mfcabrera
https://www.flickr.com/photos/18694857@N00/5614701858/
3. • Neuberliner
• B.Sc. Systems and Informatics Engineering, Universidad Nacional - Med
• M.Sc. in Informatics, TUM; Hons. Technology Management
• Work for TrustYou as Data (Scientist|Engineer|Juggler)™
• Founder and former organizer of Munich DataGeeks
ABOUT ME
35. • Build your own web crawlers
• Extract data via CSS selectors, XPath, regexes, etc.
• Handles queuing, request parallelism, cookies, throttling …
• Comprehensive and well-designed
• Commercial support by http://scrapinghub.com/
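As a rough illustration of the selector-based extraction described above (not TrustYou's actual spiders), XPath-style queries can pull structured fields out of markup. This sketch uses the standard library's ElementTree on hypothetical review markup rather than Scrapy's selectors:

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed review markup; real pages are much messier.
page = """
<html><body>
  <div class="review"><span class="title">Great hotel</span>
    <p class="text">Rooms were clean.</p></div>
  <div class="review"><span class="title">Terrible rooms</span>
    <p class="text">Never again.</p></div>
</body></html>
"""

root = ET.fromstring(page)

# Collect one record per review block using simple XPath predicates.
reviews = [
    {
        "title": div.find(".//span[@class='title']").text,
        "text": div.find(".//p[@class='text']").text,
    }
    for div in root.iter("div")
    if div.get("class") == "review"
]
```

Scrapy's own `Selector` API offers the same idea with full CSS/XPath support plus the crawling machinery listed above.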
40. • 2 - 3 million new reviews/week
• Customers want alerts 8 - 24h after review publication!
• Smart crawl frequency & depth, but still high overhead
• Pools of constantly refreshed EC2 proxy IPs
• Direct API connections with many sites
Crawling at TrustYou
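The proxy-pool idea above can be sketched as a simple rotation. This is illustrative only: the real pool management, EC2 provisioning, and health checking are not shown, and the class and addresses are hypothetical.

```python
import itertools

class ProxyPool:
    """Round-robin over a refreshable set of proxy addresses (illustrative sketch)."""

    def __init__(self, proxies):
        self._proxies = list(proxies)
        self._cycle = itertools.cycle(self._proxies)

    def next_proxy(self):
        # Hand out proxies in rotation so no single IP takes all requests.
        return next(self._cycle)

    def refresh(self, proxies):
        # In a real system this would be fed by freshly provisioned EC2 instances.
        self._proxies = list(proxies)
        self._cycle = itertools.cycle(self._proxies)

pool = ProxyPool(["10.0.0.1:8080", "10.0.0.2:8080"])
picks = [pool.next_proxy() for _ in range(3)]
```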
41. • Custom framework very similar to Scrapy
• Runs on Hadoop cluster (100 nodes)
• Not 100% suitable for MapReduce
• Nodes mostly waiting
• Coordination/messaging between nodes required:
– Distributed queue
– Rate limiting
Crawling at TrustYou
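Per-site rate limiting, one of the coordination needs listed above, can be sketched as a token bucket. A distributed version would keep this state in a shared store rather than in-process; the class below is a hypothetical single-process sketch:

```python
import time

class TokenBucket:
    """Allow roughly `rate` requests per second, with bursts up to `capacity`."""

    def __init__(self, rate, capacity):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self):
        # Refill tokens in proportion to elapsed time, capped at capacity.
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate=5, capacity=2)
burst = [bucket.allow() for _ in range(4)]  # first two pass, then the bucket is empty
```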
45. • “great rooms” → JJ NN
• “great hotel” → JJ NN
• “rooms are terrible” → NN VB JJ
• “hotel is terrible” → NN VB JJ
>>> nltk.pos_tag(nltk.word_tokenize("hotel is terrible"))
[('hotel', 'NN'), ('is', 'VBZ'), ('terrible', 'JJ')]
Text Processing
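Given tagged tokens like those produced by `nltk.pos_tag` above, opinion phrases can be pulled out with simple POS patterns. This is a toy matcher for the two patterns on the slide, adjective–noun and noun–verb–adjective, not TrustYou's actual grammar:

```python
def extract_opinions(tagged):
    """Find (JJ NN) and (NN VB* JJ) patterns in a POS-tagged sentence."""
    opinions = []
    # Pattern 1: adjective directly before a noun, e.g. "great hotel".
    for (w1, t1), (w2, t2) in zip(tagged, tagged[1:]):
        if t1.startswith("JJ") and t2.startswith("NN"):
            opinions.append((w2, w1))
    # Pattern 2: noun, verb, adjective, e.g. "hotel is terrible".
    for (w1, t1), (w2, t2), (w3, t3) in zip(tagged, tagged[1:], tagged[2:]):
        if t1.startswith("NN") and t2.startswith("VB") and t3.startswith("JJ"):
            opinions.append((w1, w3))
    return opinions

tagged = [("hotel", "NN"), ("is", "VBZ"), ("terrible", "JJ")]
opinions = extract_opinions(tagged)  # [('hotel', 'terrible')]
```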
46. • 25+ languages
• Linguistic system (morphology, taggers, grammars, parsers …)
• Hadoop: scale out CPU
• ~1B opinions in the database
• Python for ML & NLP libraries
Semantic Analysis
69. Luigi
• Dependency definition
• Hadoop / HDFS Integration
• Object oriented abstraction
• Parallelism
• Resume failed jobs
• Visualization of pipelines
• Command line integration
70. Minimal Boilerplate Code
class WordCount(luigi.Task):
    date = luigi.DateParameter()

    def requires(self):
        return [InputText(self.date)]

    def output(self):
        return luigi.LocalTarget('/tmp/%s' % self.date)

    def run(self):
        count = {}
        for f in self.input():
            for line in f.open('r'):
                for word in line.strip().split():
                    count[word] = count.get(word, 0) + 1
        out = self.output().open('w')
        for word, n in six.iteritems(count):
            out.write("%s\t%d\n" % (word, n))
        out.close()
71. Task Parameters
72. Programmatically Defined Dependencies
73. Each Task Produces an Output
74. Write Logic in Python
(Slides 71–74 repeat the WordCount code above, each highlighting the named aspect.)
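Since the logic lives in plain Python, the counting loop inside `run()` can be exercised on its own, without Luigi, Hadoop, or any files; a small standalone version:

```python
def word_count(lines):
    """The same counting loop as in WordCount.run(), over in-memory lines."""
    count = {}
    for line in lines:
        for word in line.strip().split():
            count[word] = count.get(word, 0) + 1
    return count

counts = word_count(["great hotel", "great rooms"])  # {'great': 2, 'hotel': 1, 'rooms': 1}
```

This testability of task bodies is one practical benefit of Luigi's write-logic-in-Python approach.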
94. Snippets from Reviews
“Hips don’t lie”
“Maid was banging”
“Beautiful bowl flowers”
“Irish dance, I love that”
“No ghost sighting”
“One ghost touching”
“Too much cardio, not enough squats in the gym”
“it is like hugging a bony super model”
104. Takeaways
• It is possible to use Python as the primary language for large-scale data processing on Hadoop.
• The setup is not perfect, but it works well most of the time.
• Keep your ecosystem open to other technologies.