This document describes how TrustYou processes hotel review data at scale to help travellers choose hotels. TrustYou crawls millions of hotel reviews (2–3 million new reviews per week) in 25+ languages, and applies natural language processing and machine learning to analyze the text and provide recommendations. Workflows are managed with Luigi; tasks include crawling, text processing, modeling word embeddings, and powering a sample application. Hadoop and Python are used extensively to handle the large-scale processing.
Helping Travellers Using 500 Million Hotel Reviews a Month
1. Helping Travellers Make
Better Hotel Choices
500 Million Times a Month
Miguel Cabrera
@mfcabrera
https://www.flickr.com/photos/18694857@N00/5614701858/
3. • Neuberliner
• B.Sc. Systems and Informatics Engineering, Universidad Nacional - Med
• M.Sc. in Informatics, TUM; Hons. Technology Management
• Work for TrustYou as Data (Scientist|Engineer|Juggler)™
• Founder and former organizer of Munich DataGeeks
ABOUT ME
35. • Build your own web crawlers
• Extract data via CSS selectors, XPath, regexes, etc.
• Handles queuing, request parallelism, cookies, throttling …
• Comprehensive and well-designed
• Commercial support by http://scrapinghub.com/
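As a rough illustration of the selector-based extraction described above (not TrustYou's actual spiders), XPath-style queries can pull structured fields out of markup. This sketch uses the standard library's ElementTree on hypothetical review markup rather than Scrapy's selectors:

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed review markup; real pages are much messier.
page = """
<html><body>
  <div class="review"><span class="title">Great hotel</span>
    <p class="text">Rooms were clean.</p></div>
  <div class="review"><span class="title">Terrible rooms</span>
    <p class="text">Never again.</p></div>
</body></html>
"""

root = ET.fromstring(page)

# Collect one record per review block using simple XPath predicates.
reviews = [
    {
        "title": div.find(".//span[@class='title']").text,
        "text": div.find(".//p[@class='text']").text,
    }
    for div in root.iter("div")
    if div.get("class") == "review"
]
```

Scrapy's own `Selector` API offers the same idea with full CSS/XPath support plus the crawling machinery listed above.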
40. • 2 - 3 million new reviews/week
• Customers want alerts 8 - 24h after review publication!
• Smart crawl frequency & depth, but still high overhead
• Pools of constantly refreshed EC2 proxy IPs
• Direct API connections with many sites
Crawling at TrustYou
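The proxy-pool idea above can be sketched as a simple rotation. This is illustrative only: the real pool management, EC2 provisioning, and health checking are not shown, and the class and addresses are hypothetical.

```python
import itertools

class ProxyPool:
    """Round-robin over a refreshable set of proxy addresses (illustrative sketch)."""

    def __init__(self, proxies):
        self._proxies = list(proxies)
        self._cycle = itertools.cycle(self._proxies)

    def next_proxy(self):
        # Hand out proxies in rotation so no single IP takes all requests.
        return next(self._cycle)

    def refresh(self, proxies):
        # In a real system this would be fed by freshly provisioned EC2 instances.
        self._proxies = list(proxies)
        self._cycle = itertools.cycle(self._proxies)

pool = ProxyPool(["10.0.0.1:8080", "10.0.0.2:8080"])
picks = [pool.next_proxy() for _ in range(3)]
```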
41. • Custom framework very similar to Scrapy
• Runs on Hadoop cluster (100 nodes)
• Not 100% suitable for MapReduce
• Nodes mostly waiting
• Coordination/messaging between nodes required:
– Distributed queue
– Rate limiting
Crawling at TrustYou
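Per-site rate limiting, one of the coordination needs listed above, can be sketched as a token bucket. A distributed version would keep this state in a shared store rather than in-process; the class below is a hypothetical single-process sketch:

```python
import time

class TokenBucket:
    """Allow roughly `rate` requests per second, with bursts up to `capacity`."""

    def __init__(self, rate, capacity):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self):
        # Refill tokens in proportion to elapsed time, capped at capacity.
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate=5, capacity=2)
burst = [bucket.allow() for _ in range(4)]  # first two pass, then the bucket is empty
```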
45. • “great rooms” → JJ NN
• “great hotel” → JJ NN
• “rooms are terrible” → NN VB JJ
• “hotel is terrible” → NN VB JJ
>>> nltk.pos_tag(nltk.word_tokenize("hotel is terrible"))
[('hotel', 'NN'), ('is', 'VBZ'), ('terrible', 'JJ')]
Text Processing
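Given tagged tokens like those produced by `nltk.pos_tag` above, opinion phrases can be pulled out with simple POS patterns. This is a toy matcher for the two patterns on the slide, adjective–noun and noun–verb–adjective, not TrustYou's actual grammar:

```python
def extract_opinions(tagged):
    """Find (JJ NN) and (NN VB* JJ) patterns in a POS-tagged sentence."""
    opinions = []
    # Pattern 1: adjective directly before a noun, e.g. "great hotel".
    for (w1, t1), (w2, t2) in zip(tagged, tagged[1:]):
        if t1.startswith("JJ") and t2.startswith("NN"):
            opinions.append((w2, w1))
    # Pattern 2: noun, verb, adjective, e.g. "hotel is terrible".
    for (w1, t1), (w2, t2), (w3, t3) in zip(tagged, tagged[1:], tagged[2:]):
        if t1.startswith("NN") and t2.startswith("VB") and t3.startswith("JJ"):
            opinions.append((w1, w3))
    return opinions

tagged = [("hotel", "NN"), ("is", "VBZ"), ("terrible", "JJ")]
opinions = extract_opinions(tagged)  # [('hotel', 'terrible')]
```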
46. • 25+ languages
• Linguistic system (morphology, taggers, grammars, parsers …)
• Hadoop: scale out CPU
• ~1B opinions in the database
• Python for ML & NLP libraries
Semantic Analysis
69. Luigi
• Dependency definition
• Hadoop / HDFS Integration
• Object oriented abstraction
• Parallelism
• Resume failed jobs
• Visualization of pipelines
• Command line integration
70. Minimal Boilerplate Code
class WordCount(luigi.Task):
    date = luigi.DateParameter()

    def requires(self):
        return [InputText(self.date)]

    def output(self):
        return luigi.LocalTarget('/tmp/%s' % self.date)

    def run(self):
        count = {}
        for f in self.input():
            for line in f.open('r'):
                for word in line.strip().split():
                    count[word] = count.get(word, 0) + 1
        out = self.output().open('w')
        for word, n in six.iteritems(count):
            out.write("%s\t%d\n" % (word, n))
        out.close()
71. Task Parameters
72. Programmatically Defined Dependencies
73. Each Task Produces an Output
74. Write Logic in Python
(Slides 71–74 repeat the WordCount code above, each highlighting the named aspect.)
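Since the logic lives in plain Python, the counting loop inside `run()` can be exercised on its own, without Luigi, Hadoop, or any files; a small standalone version:

```python
def word_count(lines):
    """The same counting loop as in WordCount.run(), over in-memory lines."""
    count = {}
    for line in lines:
        for word in line.strip().split():
            count[word] = count.get(word, 0) + 1
    return count

counts = word_count(["great hotel", "great rooms"])  # {'great': 2, 'hotel': 1, 'rooms': 1}
```

This testability of task bodies is one practical benefit of Luigi's write-logic-in-Python approach.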
94. Snippets from Reviews
“Hips don’t lie”
“Maid was banging”
“Beautiful bowl flowers”
“Irish dance, I love that”
“No ghost sighting”
“One ghost touching”
“Too much cardio, not enough squats in the gym”
“it is like hugging a bony super model”
104. Takeaways
• It is possible to use Python as the primary language for large-scale data processing on Hadoop.
• The setup is not perfect, but it works well most of the time.
• Keep your ecosystem open to other technologies.