
How to Build a Keyword Wizard (如何建置關鍵字精靈)

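Keywords are part of everyday life and of business applications. For example, which keywords should you type to find travel information about Okinawa?
Or suppose you sell advertising and want to place ads for baby products, but do not know which keywords to target. In cases like these, you need a keyword wizard to help you.
The talk introduces some basic Text Mining concepts, techniques for building a keyword association wizard and the packages and resources it requires, and closes with some practical applications.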

Published in: Internet

  1. How to Build a Keyword Wizard (如何建置關鍵字精靈)
  2. Agenda ● What Is a Keyword? ● Why Do We Need It? ● Word Relation & Word Representation ● How to Build This Wizard ● Live Demo
  3. What Is a Keyword? ● Wikipedia: Keyword (computer programming), a word or identifier that has a particular meaning to the programming language
  4. Why Do We Need It? ● Advertisement ● Tags ● Look Me! ● Relation Article Summary
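  5. Word Relation Model ● 琉球潛水 (Ryukyu diving) ● 沖繩潛水 (Okinawa diving) ● 沖繩機場 (Okinawa airport) ● 那霸機場 (Naha airport) ● 琉球機場 (Ryukyu airport) ● 琉球浮潛 (Ryukyu snorkeling) ● 沖繩浮潛 (Okinawa snorkeling) ● 沖繩水族館 (Okinawa aquarium) ● 琉球水族館 (Ryukyu aquarium) ● OKinawa 水族館 (OKinawa aquarium) ● 沖繩 (Okinawa)
  6. Word Relation Model ● 沖繩 (Okinawa)
  7. Word Representation - Vector Space Model
  8. One-Hot vs. Continuous Value ● Better for analysis ● Very high dimension
  9. Word Representation - One-Hot Representation ● Word: one-hot index ● Apple: 00000001 ● how: 00000010 ● Are: 00000100 ● You: 00001000 ● I: 00010000 ● Am: 00100000 ● Fine: 01000000 ● Book: 10000000 ● Sentence: How Are You ? I am Fine . Thank You ● TF (Term Frequency): 01111110 ● You AND I: 00001000 AND 00010000 = 00000000
  10. Word Representation - Context Vector ● P(Wi|Context) ● Context columns: 餐廳 (restaurant), 浮潛 (snorkeling), 美食 (food), 旅遊 (travel), 出國 (going abroad) ● 沖繩 (Okinawa): 0.1, 0.7, 0.5, 0.9, 0.5 ● 好吃 (delicious): 0.6, 0.01, 0.7, 0.01, 0.02 ● Okinawa: 0.2, 0.5, 0.2, 0.8, 0.7 ● 喔伊西 (oishii, "delicious"): 0.3, 0.002, 0.8, 0.02, 0.03 ● Similar pairs: 沖繩 and Okinawa; 好吃 and 喔伊西
  11. Word Context Vector
  12. Co-occurrence Matrix ● Sparse & large ● n ≈ 500K ● Space ≈ n*n ● Time ≈ n*n ● GG!!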
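  13. Word2Vec ● Uses a neural network to build the following model: given the preceding words of a short phrase, predict the next word likely to appear ● Side product: the projection layer is the word vector ● https://www.tensorflow.org/versions/r0.8/tutorials/word2vec/index.html ● Example: 我想要去沖繩潛水 (I want to go diving in Okinawa) ● 潛水 (diving) ● 打球 (play ball), 潛水 (diving), 睡覺 (sleep), 洗臉 (wash face), ...
  14. Word2Vec ● Released by Google in 2013 ● Open source project ● Two-layer neural network ● Another toolkit: Gensim ● pip install --upgrade gensim ● https://www.tensorflow.org/versions/r0.8/tutorials/word2vec/index.html
  15. How to Get Hands-On ???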
  16. Major Process Flow ● Word Collection ● Content Extraction ● Article Selection ● Build Model ● Word Cutting: 花笠麵很好吃 → 花笠麵△很△好吃 ("Hanagasa noodles are delicious", △ marks a word boundary) ● Slack Integration ● Search Log
  17. Article Selection ● High quality: 500K articles from 2015 Q3 and Q4 ● 4.4 Billion ● Spam Classifier ● Ranking ● pip install --upgrade https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-0.8.0-cp27-none-linux_x86_64.whl ● pip install -U scikit-learn ● http://www.wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp/
  18. Content Extraction ● Page layout: Top, Content Body, Bottom, Side, Side ● <div> <p>沖繩哪裡好玩</p> <p>美ら海水族館</p> </div> → 沖繩哪裡好玩 (Where to have fun in Okinawa), 美ら海水族館 (Churaumi Aquarium) ● pip install beautifulsoup4
  19. Content Extraction
  20. Content Extraction
  21. Article Raw Data Preparation ● A ● A1A2A3A4A5A6A7A8A9 ● B1B2B3B4B4B6B7B8B9 ● Z1Z2Z3Z4Z5Z6Z7Z8Z9 ● ….. ● A1 A2 A4 A5 A6 A7 A8 A9 ● B1 B2 B3 B4 B6 B7 B8 B9 ● Z1 Z2 Z3 Z4 Z5 Z6 Z7 Z8 Z9 ● …..
  22. Build Model - Word2Vec
  23. Build Model - Word2Vec
  24. Build Model - Co-occurrence
  25. Term Database ● Search Log ● Major e-commerce sites (e.g. Alibaba) ○ Link1 ○ Link2 ● http://baseterm.com/ ● Input method dictionaries ○ Dictionary, cracking
  26. Term Database - Search Log ● Term Collection ● Search History ● Filter & Counting
  27. Search Log ● Columns: Keyword, URL, Date, Click ● 好吃 (delicious), http://xxx.xxx, 20160520, 33 ● 好吃 (delicious), http://zzz.zzz, 20160520, 22 ● 日本旅遊 (Japan travel), http://xxx.xxx, 20160521, 15 ● http://xxxx.xx.xxx, http://xxxx.xx.xxx, 20160522, 12121
  28. Term Database - Search Log by Count
  29. Term Database - Search Log by Count/Len
  30. Term Database ● Major e-commerce sites (e.g. Alibaba) ○ Link1 ○ Link2
  31. Word Cutting ● Word Cut Tool ○ Jieba: https://github.com/fxsjy/jieba ○ https://github.com/yanyiwu/cppjieba-serve ● C++ Jieba Server: ↑ 30x or more ● pip install jieba
  32. Slack Integration ● Library ○ pip install slackbot ○ pip install slacker ● Get Bot Token ○ https://my.slack.com/services/new/bot
  33. NAS Technology Software Stack ● Redshift ● BigQuery ● Article DB ● Spark ● Worker, Worker, Worker ● Jieba Server ● Gensim Word2Vec ● Flask ● Jupyter ● Scikit Learn ● TensorFlow ● Slack Bot
  34. Live Demo
  35. Q&A
  36. 2016 PIXNET HACKATHON 8/13
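
The one-hot table on slide 9 can be reproduced in a few lines. The sketch below is only an illustration, not the speaker's code: the function names are hypothetical, words are lowercased before lookup, and the slide's "TF" row is treated as a presence bitmask (the OR of the one-hot codes), which is what the value 01111110 suggests.

```python
import re

# Vocabulary in the bit order of slide 9 (rightmost bit = "apple").
VOCAB = ["apple", "how", "are", "you", "i", "am", "fine", "book"]
WIDTH = len(VOCAB)

def one_hot(word):
    """Return the one-hot bit string of a vocabulary word."""
    position = VOCAB.index(word.lower())
    return format(1 << position, "0{}b".format(WIDTH))

def presence_vector(sentence):
    """OR together the one-hot codes of every in-vocabulary word."""
    bits = 0
    for token in re.findall(r"[a-z]+", sentence.lower()):
        if token in VOCAB:
            bits |= 1 << VOCAB.index(token)
    return format(bits, "0{}b".format(WIDTH))

def bitwise_and(code_a, code_b):
    """AND two bit strings; distinct one-hot codes always give all zeros."""
    return format(int(code_a, 2) & int(code_b, 2), "0{}b".format(WIDTH))

print(one_hot("You"))                                          # 00001000
print(presence_vector("How Are You ? I am Fine . Thank You"))  # 01111110
print(bitwise_and(one_hot("You"), one_hot("I")))               # 00000000
```

The last line shows why one-hot codes carry no similarity information: any two different words share no bits.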
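
Slide 10 marks some rows of the context-vector table as "Similar". One common way to measure that is cosine similarity; the slides do not say which measure was used, so treat the choice of cosine and the helper names below as assumptions. The numbers are the ones printed on the slide.

```python
import math

# Rows of the slide 10 table; columns are the contexts
# 餐廳, 浮潛, 美食, 旅遊, 出國.
CONTEXT_VECTORS = {
    "沖繩": [0.1, 0.7, 0.5, 0.9, 0.5],
    "好吃": [0.6, 0.01, 0.7, 0.01, 0.02],
    "Okinawa": [0.2, 0.5, 0.2, 0.8, 0.7],
    "喔伊西": [0.3, 0.002, 0.8, 0.02, 0.03],
}

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(word, vectors=CONTEXT_VECTORS):
    """Return the other word whose context vector is closest to `word`."""
    others = (w for w in vectors if w != word)
    return max(others, key=lambda w: cosine(vectors[word], vectors[w]))

for word in CONTEXT_VECTORS:
    print(word, "->", most_similar(word))
```

With these values, 沖繩 pairs with Okinawa and 好吃 pairs with 喔伊西, which is something one-hot codes cannot express.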
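
Slide 12 notes that a dense co-occurrence matrix over n ≈ 500K words needs about n*n cells. A minimal sketch of the usual workaround is to store only the pairs that actually occur. The toy corpus, the sentence-level definition of co-occurrence, and the function names are assumptions for illustration, not taken from the talk.

```python
from collections import Counter
from itertools import combinations

def cooccurrence(sentences):
    """Count how many sentences contain each unordered word pair."""
    counts = Counter()
    for tokens in sentences:
        # sorted() gives every pair one canonical key, whatever the word order.
        for pair in combinations(sorted(set(tokens)), 2):
            counts[pair] += 1
    return counts

def pair_count(counts, word_a, word_b):
    """Look up a pair regardless of argument order (0 if never seen)."""
    return counts[tuple(sorted((word_a, word_b)))]

# Hypothetical, already-segmented sentences.
corpus = [
    ["沖繩", "浮潛", "旅遊"],
    ["沖繩", "美食", "好吃"],
    ["沖繩", "旅遊"],
]
counts = cooccurrence(corpus)
vocabulary = {word for sentence in corpus for word in sentence}

print(pair_count(counts, "沖繩", "旅遊"))
print(len(counts), "stored pairs vs", len(vocabulary) ** 2, "dense cells")
```

The counting step still visits every pair inside a sentence, so this saves space rather than time on long documents.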
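
Slide 18 extracts paragraph text from a `<div>` and lists beautifulsoup4 as the tool. To stay dependency-free, the sketch below swaps in the standard library's `html.parser` instead, so it is not the speaker's implementation. The class and function names are hypothetical, and it only handles non-nested `<p>` elements.

```python
from html.parser import HTMLParser

class ParagraphExtractor(HTMLParser):
    """Collect the text of every <p> element, ignoring other markup."""

    def __init__(self):
        super().__init__()
        self._inside_p = False
        self._buffer = []
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._inside_p = True
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "p" and self._inside_p:
            text = "".join(self._buffer).strip()
            if text:
                self.paragraphs.append(text)
            self._inside_p = False

    def handle_data(self, data):
        if self._inside_p:
            self._buffer.append(data)

def extract_paragraphs(html):
    """Return the list of paragraph strings found in an HTML fragment."""
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    return parser.paragraphs

page = "<div> <p>沖繩哪裡好玩</p> <p>美ら海水族館</p> </div>"
print(extract_paragraphs(page))  # ['沖繩哪裡好玩', '美ら海水族館']
```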
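
Slides 26 to 29 describe turning a search log into a term list by filtering and counting, then ranking "by Count" and "by Count/Len". The sketch below uses the four rows from slide 27. The slides do not define the filter or the Count/Len score, so two assumptions are made here: keywords that look like URLs are dropped, and Count/Len means total clicks divided by the keyword's character length.

```python
from collections import Counter

# (keyword, url, date, clicks) rows copied from slide 27.
SEARCH_LOG = [
    ("好吃", "http://xxx.xxx", "20160520", 33),
    ("好吃", "http://zzz.zzz", "20160520", 22),
    ("日本旅遊", "http://xxx.xxx", "20160521", 15),
    ("http://xxxx.xx.xxx", "http://xxxx.xx.xxx", "20160522", 12121),
]

def collect_terms(rows):
    """Sum clicks per keyword, skipping keywords that are really URLs."""
    clicks = Counter()
    for keyword, _url, _date, click in rows:
        if keyword.startswith(("http://", "https://")):
            continue  # assumed filter rule: a pasted URL is not a term
        clicks[keyword] += click
    return clicks

def rank_by_count(clicks):
    """Terms ordered by total clicks, highest first."""
    return sorted(clicks.items(), key=lambda item: item[1], reverse=True)

def rank_by_count_per_len(clicks):
    """Terms ordered by clicks per character (assumed meaning of Count/Len)."""
    scored = [(term, count / len(term)) for term, count in clicks.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)

terms = collect_terms(SEARCH_LOG)
print(rank_by_count(terms))          # [('好吃', 55), ('日本旅遊', 15)]
print(rank_by_count_per_len(terms))  # [('好吃', 27.5), ('日本旅遊', 3.75)]
```

Without the filter, the URL row would top the list with 12121 clicks, which is presumably why the slide pairs counting with filtering.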
