Helping Travellers Using 500 Million Hotel Reviews a Month

TrustYou analyzes online hotel reviews to create a summary for every hotel in the world: http://www.trust-score.com/

The data processed by TrustYou is integrated into services such as Kayak, Trivago, and HipMunk, among others.

Every week we crawl the web to download 3 million reviews. These are analyzed using techniques from computational linguistics, natural language processing, and machine learning, almost exclusively in Python. In this talk, Miguel explains the strategies and tools TrustYou uses to achieve this.

Presenter: Miguel Fernando Cabrera (@mfcabrera). Systems and Informatics Engineer from the Universidad Nacional de Colombia (Medellín), M.Sc. in Computer Science from TU Munich with Honours in Technology Management from the CDTM. He founded and, until early 2015, led Munich DataGeeks, a Machine Learning and Data Science interest group in Bavaria. He currently works as a Data Engineer / Scientist at TrustYou.

  1. Helping Travellers Make Better Hotel Choices 500 Million Times a Month Miguel Cabrera @mfcabrera https://www.flickr.com/photos/18694857@N00/5614701858/
  2. ABOUT ME
  3. ABOUT ME • Neuberliner • Ing. Sistemas e Inf. Universidad Nacional - Med • M.Sc. in Informatics TUM, Hons. Technology Management • Work for TrustYou as Data (Scientist|Engineer|Juggler)™ • Founder and former organizer of Munich DataGeeks
  4. TODAY
  5. AGENDA • What we do • Architecture • Technology • Crawling • Textual Processing • Workflow Management and Scale • Sample Application
  6. WHAT WE DO
  7. For every hotel on the planet, provide a summary of traveler reviews.
  8. Tasks • Crawling • Natural Language Processing / Semantic Analysis • Record Linkage / Deduplication • Ranking • Recommendation • Classification • Clustering
  9. ARCHITECTURE
  10. Data Flow: Crawling → Semantic Analysis → Database → API → Clients (Google, Kayak, TY Analytics)
  11. Stack: Batch Layer on the Hadoop cluster (Hadoop, Python, Pig*, Java*); Service Layer on application machines (PostgreSQL, MongoDB, Redis, Cassandra)
  12. SOME NUMBERS
  13. 25 supported languages
  14. 500,000+ Properties
  15. 30,000,000+ daily crawled reviews
  16. Deduplicated against 250,000,000+ reviews
  17. 300,000+ daily new reviews
  18. Lots of text https://www.flickr.com/photos/22646823@N08/2694765397/
  19. TECHNOLOGY
  20. Python • Numpy • NLTK • Scikit-Learn • Pandas • IPython / Jupyter • Scrapy
  21. Python + Hadoop • Hadoop Streaming • MRJob • Oozie • Luigi • …
  22. Crawling
  23. Crawling
  24. Scrapy • Build your own web crawlers • Extract data via CSS selectors, XPath, regexes, etc. • Handles queuing, request parallelism, cookies, throttling … • Comprehensive and well-designed • Commercial support by http://scrapinghub.com/
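
For a sense of what slide 24 describes, here is a minimal Scrapy spider sketch. The site URL, CSS selectors, and field names are made up for illustration, and the snippet uses the current Scrapy selector API (.get() / response.follow):

      import scrapy

      class ReviewSpider(scrapy.Spider):
          """Hypothetical hotel-review spider; Scrapy handles queuing,
          request parallelism, cookies and throttling for us."""
          name = "hotel_reviews"
          start_urls = ["https://example.com/hotels?page=1"]

          def parse(self, response):
              # One item per review block, extracted via CSS selectors
              for review in response.css("div.review"):
                  yield {
                      "title": review.css("h3::text").get(),
                      "text": review.css("p.body::text").get(),
                  }
              # Follow pagination links; requests are scheduled and deduplicated
              next_page = response.css("a.next::attr(href)").get()
              if next_page:
                  yield response.follow(next_page, callback=self.parse)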
  25. Crawling at TrustYou • 2 - 3 million new reviews/week • Customers want alerts 8 - 24h after review publication! • Smart crawl frequency & depth, but still high overhead • Pools of constantly refreshed EC2 proxy IPs • Direct API connections with many sites
  26. Crawling at TrustYou • Custom framework very similar to Scrapy • Runs on Hadoop cluster (100 nodes) • Not 100% suitable for MapReduce • Nodes mostly waiting • Coordination/messaging between nodes required: – Distributed queue – Rate limiting
  27. Text Processing
  28. Text Processing: Raw text → Sentence splitting → Tokenizing → Stopwords → Stemming → Topic Models / Word Vectors / Classification
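
A rough sketch of that preprocessing chain with NLTK (listed on slide 20), assuming English text and that the punkt and stopwords data packages have been downloaded:

      import nltk
      from nltk.corpus import stopwords
      from nltk.stem import PorterStemmer

      # Requires: nltk.download('punkt'); nltk.download('stopwords')
      text = "The rooms were great. The breakfast, however, was terrible."
      stop = set(stopwords.words("english"))
      stemmer = PorterStemmer()

      for sentence in nltk.sent_tokenize(text):         # sentence splitting
          tokens = nltk.word_tokenize(sentence)         # tokenizing
          words = [t.lower() for t in tokens if t.isalpha()]
          words = [w for w in words if w not in stop]   # stopword removal
          print([stemmer.stem(w) for w in words])       # stemming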
  29. Text Processing
  30. Text Processing • “great/JJ rooms/NN” • “great/JJ hotel/NN” • “rooms/NN are/VB terrible/JJ” • “hotel/NN is/VB terrible/JJ” >>> nltk.pos_tag(nltk.word_tokenize("hotel is terrible")) [('hotel', 'NN'), ('is', 'VBZ'), ('terrible', 'JJ')]
  31. Semantic Analysis • 25+ languages • Linguistic system (morphology, taggers, grammars, parsers …) • Hadoop: Scale out CPU • ~1B opinions in the database • Python for ML & NLP libraries
  32. Word2Vec/Doc2Vec
  33. Group of algorithms
  34. An instance of shallow learning
  35. Feature learning model
  36. Generates real-valued vector representations of words
  37. “king” – “man” + “woman” = “queen”
  38.–43. Word2Vec (illustrations). Source: http://technology.stitchfix.com/blog/2015/03/11/word-is-worth-a-thousand-vectors/
  44. Similar words/documents are nearby vectors
  45. Word2vec offers a similarity metric for words
  46. Can be extended to paragraphs and documents
  47. A fast Python-based implementation is available via Gensim
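
A minimal sketch of that with Gensim's Word2Vec. The toy corpus is invented for illustration (a real model needs millions of sentences), and the argument and attribute names follow the Gensim releases of that era (newer versions renamed size to vector_size and moved lookups under model.wv):

      from gensim.models import Word2Vec

      # Toy corpus; in practice, stream millions of tokenized review sentences
      sentences = [
          ["great", "rooms", "clean", "bathroom"],
          ["great", "hotel", "friendly", "staff"],
          ["rooms", "are", "terrible"],
          ["hotel", "is", "terrible"],
      ]

      model = Word2Vec(sentences, size=100, window=5, min_count=1, workers=4)

      # Similar words are nearby vectors (only meaningful with real data)
      print(model.most_similar("rooms"))
      print(model.similarity("great", "terrible"))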
  48. Workflow Management and Scale
  49. Crawl → Extract → Clean → Stats → ML → NLP
  50. Luigi: “A Python framework for data flow definition and execution”
  51. Luigi • Build complex pipelines of batch jobs • Dependency resolution • Parallelism • Resume failed jobs • Some support for Hadoop
  52. Luigi
  53. Luigi • Dependency definition • Hadoop / HDFS integration • Object-oriented abstraction • Parallelism • Resume failed jobs • Visualization of pipelines • Command line integration
  54. Minimal Boilerplate Code
      import luigi
      import six

      class WordCount(luigi.Task):
          date = luigi.DateParameter()

          def requires(self):
              # Upstream dependencies, resolved by the scheduler
              return [InputText(self.date)]

          def output(self):
              return luigi.LocalTarget('/tmp/%s' % self.date)

          def run(self):
              count = {}
              for f in self.input():
                  for line in f.open('r'):
                      for word in line.strip().split():
                          count[word] = count.get(word, 0) + 1
              out = self.output().open('w')
              for word, n in six.iteritems(count):
                  out.write("%s\t%d\n" % (word, n))
              out.close()
  55.–58. The same WordCount task repeated with different parts highlighted: Task Parameters • Programmatically Defined Dependencies • Each Task produces an output • Write Logic in Python
  59. Hadoop
  60. Hadoop = Java? https://www.flickr.com/photos/12914838@N00/15015146343/
  61. Hadoop Streaming: cat input.txt | ./map.py | sort | ./reduce.py > output.txt
  62. Hadoop Streaming:
      hadoop jar contrib/streaming/hadoop-*streaming*.jar \
          -file /home/hduser/mapper.py -mapper /home/hduser/mapper.py \
          -file /home/hduser/reducer.py -reducer /home/hduser/reducer.py \
          -input /user/hduser/text.txt -output /user/hduser/gutenberg-output
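
The mapper and reducer in that command are ordinary scripts reading stdin and writing stdout. A sketch of what a word-count mapper.py and reducer.py might contain (assumed for illustration, not taken from the slides):

      #!/usr/bin/env python
      # mapper.py: emit "word<TAB>1" for every word on stdin
      import sys

      for line in sys.stdin:
          for word in line.strip().split():
              print("%s\t%d" % (word, 1))

      #!/usr/bin/env python
      # reducer.py: sum counts per word; streaming delivers input sorted by key
      import sys

      current, total = None, 0
      for line in sys.stdin:
          word, count = line.rstrip("\n").split("\t")
          if word != current:
              if current is not None:
                  print("%s\t%d" % (current, total))
              current, total = word, 0
          total += int(count)
      if current is not None:
          print("%s\t%d" % (current, total))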
  63. Luigi + Hadoop/HDFS
      class WordCount(luigi.hadoop.JobTask):
          date = luigi.DateParameter()

          def requires(self):
              return InputText(self.date)

          def output(self):
              return luigi.hdfs.HdfsTarget('%s' % self.date)

          def mapper(self, line):
              for word in line.strip().split():
                  yield word, 1

          def reducer(self, key, values):
              yield key, sum(values)
  64. Go and learn:
  65.–66. Data Flow Visualization (screenshots)
  67. Before • Bash scripts + Cron • Manual cleanup • Manual failure recovery • Hard(er) to debug
  68. Now • Complex nested Luigi job graphs • Automatic retries • Still hard to debug
  69. We use it for… • Standalone executables • Dump data from databases • General Hadoop Streaming • Bash Scripts / MRJob • Pig* Scripts
  70. You can wrap anything
  71. Sample Application
  72. Reviews are boring…
  73. Source: http://www.telegraph.co.uk/travel/hotels/11240430/TripAdvisor-the-funniest-reviews-biggest-controversies-and-best-spoofs.html
  74. Reviews highlight the individuality and personality of users
  75. Snippets from Reviews • “Hips don’t lie” • “Maid was banging” • “Beautiful bowl flowers” • “Irish dance, I love that” • “No ghost sighting” • “One ghost touching” • “Too much cardio, not enough squats in the gym” • “it is like hugging a bony super model”
  76. Hotel Reviews + Gensim + Python + Luigi = ?
  77. Pipeline: ExtractSentences → LearnBigrams → LearnModel → ExtractClusterIds → UploadEmbeddings (Pig)
  78. LearnModelTask:
      import os
      import luigi
      from gensim.models.doc2vec import Doc2Vec

      class LearnModelTask(luigi.Task):
          # Parameters.... blah blah blah

          def output(self):
              return luigi.LocalTarget(os.path.join(self.output_directory, self.model_out))

          def requires(self):
              return LearnBigramsTask()

          def run(self):
              sentences = LabeledClusterIDSentence(self.input().path)
              model = Doc2Vec(sentences=sentences,
                              size=int(self.size),
                              dm=int(self.distmem),
                              negative=int(self.negative),
                              workers=int(self.workers),
                              window=int(self.window),
                              min_count=int(self.min_count),
                              train_words=True)
              model.save(self.output().path)
  79. Word2vec/Doc2vec offer a similarity metric for words
  80. Similarities are useful for non-personalized recommender systems
  81. Non-personalized recommenders recommend items based on what other consumers have said about the items.
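
With a Doc2Vec model like the one trained in slide 78, such lookups are one call. A hedged usage sketch against the Gensim API of that era; the model filename and the "hotel_12345" document tag are hypothetical:

      from gensim.models.doc2vec import Doc2Vec

      model = Doc2Vec.load("doc2vec_reviews.model")  # assumed filename

      # Documents (e.g. hotels, tagged by ID at training time) whose vectors
      # are nearby read similarly: natural "more like this" suggestions
      print(model.docvecs.most_similar("hotel_12345", topn=5))

      # Word-level similarity works the same way
      print(model.most_similar("breakfast", topn=5))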
  82. http://demo.trustyou.com
  83. Takeaways
  84. Takeaways • It is possible to use Python as the primary language for doing large data processing on Hadoop. • It is not a perfect setup but works well most of the time. • Keep your ecosystem open to other technologies.
  85.–86. We are hiring: miguel.cabrera@trustyou.net
  87. Questions?
