The Anatomy of a Small Scale Question Classification Engine by David Curran

The Anatomy of a Small Scale Question Classification Engine by David Curran, Machine Learning Engineer, Openjaw Technologies

Another great presentation on chatbots, with a focus on question classification and the practical issues of deploying chatbots in China.

A great review of the approach to classifying questions for a chatbot in order to determine customers' intents. Think of it like a spam filter that examines incoming emails and decides whether each one is spam or not spam, except that here the decision is made across a number of possible intents (ground truths).

This is an example of supervised learning: a data set of possible customer questions is gathered from customer agents, and humans classify each question to define the ground truths (intents), such as "I need to change my flight", "My luggage is lost", or "I need to book a flight". Check out "How to improve Natural Language Datasets" to understand more about the k-fold test and improving the quality of the training dataset.

David highlights some important points about running chatbots in China: the difficulty of using IBM's or Google's machine learning platforms, and the relatively high cost of AI engines in China given the restricted competition. As a result, many businesses build their own AI engine. He also covers the unique aspects of written Chinese compared to Roman scripts, for example the lack of spaces between words.
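Because written Chinese has no spaces, a segmentation step has to run before any word-based features can be built. The talk uses the Language Technology Platform Segmentor for this; the sketch below is not that tool. It is a minimal forward-maximum-matching segmenter, a simple dictionary-based stand-in, and the vocabulary is a hypothetical toy list chosen so that the example sentence from the slides splits the way the slides show.

```python
def forward_max_match(text, vocabulary):
    """Greedy left-to-right segmentation: take the longest known word at each position."""
    max_len = max(len(word) for word in vocabulary)
    tokens = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - pos), 0, -1):
            candidate = text[pos:pos + size]
            if size == 1 or candidate in vocabulary:
                tokens.append(candidate)
                pos += size
                break
    return tokens


# Hypothetical toy vocabulary; a real segmenter ships with a far larger model.
VOCABULARY = {"为什么", "证件", "一直", "显示", "行程"}

tokens = forward_max_match("为什么我的证件一直显示无行程呢", VOCABULARY)
print(" ".join(tokens))  # 为什么 我 的 证件 一直 显示 无 行程 呢
```

A production segmenter handles ambiguity and unknown words far better than this greedy rule, which is why the slides recommend judging segmenters with metrics and by eyeballing the output.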

David runs through the steps in creating the classifier:

Read in data: utterance, label;
Separate out words;
Turn into a machine-comparable format, e.g. a word vector;
Carry out manipulations: tf-idf, stopwords, bigrams, stemming etc.;
Test the classifier: blind set, k-fold (validation set), which we covered in this presentation ("How to improve Natural Language Datasets").

Tf-idf is term frequency–inverse document frequency, a numerical statistic intended to reflect how important a word is to a document within a collection of documents. For example, the word iPhone being used 5 times in a passage means the passage is likely about the iPhone.
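As a rough illustration of the tf-idf idea, here is a minimal sketch that weights each term by its relative frequency in a document multiplied by the log of the inverse document frequency. The three documents are made up for the example, and this is the plain textbook formula, not necessarily the variant used in the talk (libraries often add smoothing).

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenised document."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical toy documents, already split into words.
docs = [
    "the iphone screen and the iphone battery".split(),
    "the flight was late".split(),
    "the luggage was lost".split(),
]
weights = tf_idf(docs)
print(weights[0])
```

In the first document "iphone" gets the highest weight because it is frequent there and absent elsewhere, while "the" scores zero because it appears in every document.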

David shows how support-vector machines (supervised learning models, with associated learning algorithms, that analyze data for classification and regression analysis) and a RASA pipeline can create a small-scale question classification engine without giving all your data away to Google. In the West, though, the cost of IBM and Google is so low, and their engines so well trained, that it's hard to justify this approach outside China.
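The steps listed above can be sketched end to end with scikit-learn. This is an assumption-laden illustration, not the talk's actual code: the utterances and intent names are invented, `LinearSVC` (scikit-learn's liblinear-backed linear SVM) stands in for the LibLinear classifier named in the slides, and `min_df=1` is used only because the toy data set is tiny (the slides mention a minimum of 3 occurrences and 1000 words kept).

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical ground truth: (utterance, intent) pairs, already split into words.
utterances = [
    "i need to change my flight",
    "please change my flight date",
    "can i change my booking",
    "i want to move my flight",
    "my luggage is lost",
    "my bag is lost",
    "the airline lost my suitcase",
    "where is my lost luggage",
    "i need to book a flight",
    "book a flight to dublin",
    "i want to book a ticket",
    "please book me a flight",
]
intents = ["change_flight"] * 4 + ["lost_luggage"] * 4 + ["book_flight"] * 4

pipeline = make_pipeline(
    # Bag of words over unigrams and bigrams, keeping at most 1000 features.
    CountVectorizer(ngram_range=(1, 2), min_df=1, max_features=1000, binary=True),
    # Linear SVM; multi-class is handled one-vs-rest.
    LinearSVC(C=1.0),
)

# k-fold test: each fold is held out once while the rest is used for training.
cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
scores = cross_val_score(pipeline, utterances, intents, cv=cv)
print("fold accuracies:", scores)

# Train on everything and classify a new question.
pipeline.fit(utterances, intents)
predicted = pipeline.predict(["i lost my luggage"])[0]
print("predicted intent:", predicted)
```

For segmented Chinese text the vectorizer's tokenisation would need adjusting, since the default token pattern drops single-character tokens.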

Published in: Technology

The Anatomy of a Small Scale Question Classification Engine by David Curran

  1. The Anatomy of a Small-Scale Question Classification Engine. David Curran.
  2. What is Question Classification? A question comes in from the user. We have to say which of a class of known questions it belongs to, with a confidence we have in our opinion. This is a supervised machine learning problem. Think of a spam/real email decider, but with loads of classes, not 2.
  3. Ground Truth
  4. Data. Data we have: 3 datasets, each of 5,000-10,000 labelled utterances. Most chatbots I have worked on have 2,000 utterances at the end. Utterance plus intent makes the ground truth training data.
  5. Why we are building an AI engine. The big companies will not deploy in China as they have to share their source code. Medium-sized companies want to control more of the chatbot process. We want some backup if the connection breaks. Big companies want their data centre to be used. The cost of calling an AI engine is not a big reason, except in China.
  6. Steps in Pipeline. Read in data: utterance, label. Separate out words. Turn into a machine-comparable format: word vector etc. Carry out manipulations: tf-idf, stopwords, bigrams, stemming etc. Test the classifier: blind set, k-fold (validation set). This is Python, but Java allows the same code steps daily.
  7. Chinese NLP. Chinese word segmentation. Language Technology Platform Segmentor. String to word vector (bag of words). Tf-idf, stopwords, n-grams, stemming, lemmatization. s̶p̶i̶l̶l̶i̶n̶g̶ spelling mistakes in Chinese.
  8. Chinese Word Segmentation. 为什么我的证件一直显示无行程呢 changes to 为什么 我 的 证件 一直 显示 无 行程 呢. Language Technology Platform Segmentor. There are other choices, and metrics to judge quality. Eyeballing is really useful here.
  9. Pre-processing text. We need to convert the text to something the computer can read. Does this particular question have this word in it? Airport 0, Booking 1, Cargo 0, … String to word vector (bag of words). Tf-idf. Questions are too short for these. Stopwords (a, the, an, ...): not a big deal in Chinese.
  10. N-Grams. Windows of words to get more information. There is a trade-off between more information and having seen examples before.
  11. Other Chinese-Specific Issues. s̶p̶i̶l̶l̶i̶n̶g̶ spelling mistakes in Chinese. Pinyin: Zhou Youguang, who died last year at age 111. Stemming and lemmatization. Traditional Chinese transliteration. Lots of unknown unknowns here.
  12. Simple Pipeline. Using the Language Technology Platform Segmentor. String to word vector with minimum 3 occurrences, 1000 words kept. Bigram tokenizer. LibLinear SVM classifier with L2-regularized logistic regression (primal) SVMtype.
  13. Classification Algorithm
  14. Hacks. Pattern matching entities: PNR, ticket numbers etc. Worth 1+%, and Watson does not do these. Brings us up to standard: 1% off Watson in tests. "I want to cancel ticket 123456789" becomes "I want to cancel ticket 123456789 @ticketnum". Part-of-speech tagging: can sometimes give you 0.5%. Hypernym of the root of the parsed tree. Pretty or Now trade-off.
  15. No Deep Learning? This is to get a quick up-and-running system. Other engines. Baidu ERNIE. spaCy. Word embeddings.
  16. Word Embeddings
  17. Basic PoC. Python Flask application. Returns JSON; web interface also. Takes 0.2 seconds to respond. Stack, resources etc. to be decided.
  18. RASA. Handles entities. Dialog flows. Can deploy as a Docker image. The engine described is just a pipeline in RASA.
  19. The Anatomy of a Small-Scale Question Classification Engine. David Curran.
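The pattern-matching hack from the "Hacks" slide can be sketched with regular expressions: when an utterance contains something that looks like an entity, a tag token such as `@ticketnum` is appended so the classifier sees a consistent feature regardless of the actual number. The patterns below are hypothetical (a ticket number as 9 to 13 digits, a PNR as six upper-case alphanumerics with at least one letter) and the `@pnr` tag name is invented; the talk does not give the exact rules.

```python
import re

# Hypothetical entity patterns; real formats depend on the airline's systems.
ENTITY_PATTERNS = {
    "@ticketnum": re.compile(r"\b\d{9,13}\b"),
    "@pnr": re.compile(r"\b(?=\w*[A-Z])[A-Z0-9]{6}\b"),
}


def tag_entities(utterance):
    """Append one tag token for each entity type found in the utterance."""
    tags = [tag for tag, pattern in ENTITY_PATTERNS.items() if pattern.search(utterance)]
    return " ".join([utterance, *tags])


print(tag_entities("I want to cancel ticket 123456789"))
# I want to cancel ticket 123456789 @ticketnum
```

Utterances with no match pass through unchanged. Note that the simple PNR pattern would also fire on any six-letter upper-case word, so real rules would need to be stricter.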