Lexalytics Text Analytics Workshop: Perfect Text Analytics
- 2. Perfect
  - per·fect [adj., n. pur-fikt; v. per-fekt]
  - 1. conforming absolutely to the description or definition of an ideal type: a perfect sphere; a perfect gentleman.
  - 2. excellent or complete beyond practical or theoretical improvement: There is no perfect legal code. The proportions of this temple are almost perfect.
  - All right reserved © 2010 Lexalytics Inc.
- 3. Text Analytics
  - “The term text analytics describes a set of linguistic, statistical, and machine learning techniques that model and structure the information content of textual sources.” (Wikipedia)
  - In other words: enhancing the value of text content by extracting entities, features, context, relationships, and emotion.
- 4. Perfect is Fast
  - Average human reading speed: 250 wpm
  - Conservative computer reading speed: 6000 wpm/core (our speed on a moderate single core)
  - Each core is equivalent to the reading bandwidth of 12 people.
  - Modern machines have 8 cores. That’s just about 100 people in a box. Nice.
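The reading-bandwidth comparison on this slide is simple arithmetic, sketched below using only the figures quoted on the slide (the variable names are illustrative). Taken literally, 6000 wpm divided by 250 wpm gives 24 reader-equivalents per core, so the slide's figure of 12 per core (roughly 100 per 8-core machine) is the more conservative of the two.

```python
# Figures quoted on the slide; names are illustrative only.
HUMAN_WPM = 250              # average human reading speed
MACHINE_WPM_PER_CORE = 6000  # "conservative" single-core speed
CORES = 8                    # cores in a "modern machine" (2010)

# How many human readers one core replaces, by straight division.
readers_per_core = MACHINE_WPM_PER_CORE / HUMAN_WPM

# Reader-equivalents for the whole machine.
readers_per_box = readers_per_core * CORES

print(f"{readers_per_core:.0f} readers per core, {readers_per_box:.0f} per box")
```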
- 5. Perfect is Useable
  - “I don’t like the results” is not the same as “the results are incorrect”.
  - Understanding the behavior is key to usefulness.
  - Can you make better decisions?
  - Can you make more money or save money?
  - What is the most controversial area of text analytics?
  - Thomson Reuters: trading with sentiment analysis increased alpha (profit over market) by 80 basis points.
- 6. Useable: How much can you differ?
  - “In my shop, that up until now has relied exclusively on human coding, we consider anything below 90% to be unacceptably inaccurate…. There is no doubt that automated sentiment is getting much much better, but to suggest that people should be okay with 20% of their data being wrong is just absurd.” (Katie Delahaye Paine)
  - Why is 10% “wrong” so much less absurd than 20% “wrong”?
  - [Figure labels: 20% Error; 10% Error]
- 7. Perfect is Consistent
  - Same results for same content, every time
  - University of Pittsburgh “Multi-Perspective Question Answering” corpus: 535 documents, 11k+ sentences
  - 40 hours of training for each rater
  - ~80% inter-rater agreement
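To make the "~80% inter-rater agreement" figure concrete, here is a minimal sketch of how agreement between two human raters can be measured. The labels below are made-up toy data, not the MPQA corpus, and the slide does not say which agreement statistic was used; raw percent agreement is shown alongside Cohen's kappa, a common chance-corrected alternative.

```python
def percent_agreement(rater_a, rater_b):
    """Fraction of items on which two raters gave the same label."""
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return matches / len(rater_a)


def cohens_kappa(rater_a, rater_b):
    """Agreement corrected for the agreement expected by chance."""
    n = len(rater_a)
    observed = percent_agreement(rater_a, rater_b)
    labels = set(rater_a) | set(rater_b)
    # Chance agreement: probability both raters pick the same label
    # if each labels independently at their own observed rates.
    expected = sum(
        (rater_a.count(label) / n) * (rater_b.count(label) / n)
        for label in labels
    )
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Toy sentiment labels for ten sentences (hypothetical data).
rater_a = ["pos", "pos", "neg", "neu", "neg", "pos", "neu", "neg", "pos", "neu"]
rater_b = ["pos", "pos", "neg", "neu", "neg", "neg", "neu", "neg", "pos", "pos"]

print(f"percent agreement: {percent_agreement(rater_a, rater_b):.2f}")
print(f"Cohen's kappa:     {cohens_kappa(rater_a, rater_b):.2f}")
```

The two raters here disagree on two of ten sentences, so raw agreement is 80%, while kappa comes out lower because some of that agreement would occur by chance.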
- 8. Perfect is (new) Knowledge
  - Discover the stuff you don’t know.
  - Text analytics is really, really great at telling you the who, the what, and the where. Sometimes the “how”.
  - You have to supply the “why”, but that question is way easier to answer when you know the other “w’s and the h”.
- 9. Perfect Includes Everything
  - Running our top-of-the-line software flat out across one year will cost you about $.002 per document analyzed (news-article-sized content; assuming 3 docs/core-second on an 8-core machine).
  - The more data the better, and the greater the worth your TA has.
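The throughput behind the per-document figure can be worked out from the slide's stated assumptions. The sketch below assumes a 365-day year of uninterrupted processing; the annual total it prints is only what the slide's numbers imply when multiplied together, not a quoted price.

```python
# Assumptions stated on the slide.
DOCS_PER_CORE_SECOND = 3
CORES = 8
COST_PER_DOC_USD = 0.002

# Assumption added here: a 365-day year, running flat out.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60

docs_per_second = DOCS_PER_CORE_SECOND * CORES
docs_per_year = docs_per_second * SECONDS_PER_YEAR

# Total implied by the per-document cost at full utilization.
implied_annual_cost = docs_per_year * COST_PER_DOC_USD

print(f"{docs_per_year:,} documents per year")
print(f"${implied_annual_cost:,.0f} implied per year at ${COST_PER_DOC_USD}/doc")
```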
- 10. Perfect is Trainable
  - Can you solve YOUR business problem with it?
  - Can you optimize to suit different kinds of content and roll those results up into a single reporting system?
- 11. Perfect Text Analytics
  - Fast, Useable, Consistent, Knowledge (that is) Inclusive, Trainable
- 15. Market Intelligence
  - [Architecture diagram labels: Client Employee; Client Company; MI Analyst; User Authentication; Single Sign-on; External Content Providers; SinglePoint; Web 2.0 Collaboration; Search Results; Secondary Research Suppliers; Text Analytics; Integrated Index; News & Journals; NL Search Engine; FIREWALL; Internal Document Repository; Optional Document Repository; Financial analyst reports; Internal research; Content Processing; Custom Web Crawls & Gov. Databases; Trash can crawl, FTP or CD]
- 24. Sarcasm, Twitter
  - Model trained to detect sarcasm
  - Once detected, you can decide what to do with it, because actually determining the sentiment is going to be unreliable
  - New model trained on Twitter content
  - Moving towards a concept of text analytics driven by business logic
- 25. Thesaurus-based Theme Rollup
  - Machine-generated conceptual taxonomy
  - “Gas/Electric Hybrid” and “EV” might roll up to “EV”
  - Fewer themes, but very useful to detect patterns across content
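The rollup idea can be illustrated with a small sketch. The mapping below is a hand-written stand-in for the machine-generated taxonomy the slide describes (only the "Gas/Electric Hybrid" to "EV" pair comes from the slide; the other theme names and the `roll_up` helper are hypothetical), and it is not the product's actual algorithm.

```python
from collections import Counter

# Hypothetical thesaurus: maps a specific theme to its broader parent.
# Only the "Gas/Electric Hybrid" -> "EV" entry is taken from the slide.
ROLLUP = {
    "Gas/Electric Hybrid": "EV",
    "Plug-in Hybrid": "EV",
}


def roll_up(themes):
    """Count themes after replacing each one with its parent, if it has one."""
    return Counter(ROLLUP.get(theme, theme) for theme in themes)


# Themes extracted from a batch of documents (toy data).
extracted = ["EV", "Gas/Electric Hybrid", "Fuel Prices", "Plug-in Hybrid", "EV"]

rolled = roll_up(extracted)
print(rolled.most_common())
```

Four distinct themes collapse to two, which is the "fewer themes" effect: the pattern across documents (most of this content is about EVs) becomes visible in a single count.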
- 26. Foreign Language Support
  - French is first, followed by other Romance languages
  - New stemmer
  - New summarization algorithm
  - New part-of-speech tagger
  - Automatic language detection
  - New sentiment/entity extraction algorithms
  - Also applicable to vertical-specific content
  - Confidence scoring by algorithm
  - Use business logic to meld the results
- 27. Trainable Entity Sentiment
  - New technique for entity sentiment
  - Initial results from testing in English extremely promising
  - Average human scoring overlap of >> 90% for scored sentences
  - Initially used only for French
- 28. Tool Enhancements
  - Eventually use on English content: Twitter, Customer Satisfaction, Others…
  - Entity Management Toolkit (EMT): Part of Speech tagset training; using to train Salience on French
  - Sentiment Toolkit: build your own entity sentiment models; French (first)
  - New Sentiment Toolkit + Maximum Entropy model builder allows new Entity and Sentiment modules
  - New EMT helps us build a new French PoS tagger
  - [Diagram labels: Doc; POS Tagger; Fully Tagged Document; Entity Extraction & Sentiment Models; Themes & Summaries]
- 29. Business Logic + TA Algorithms
  - [Flow diagram labels: Content Source; Search; Business Logic; Other TA System; Sarcasm; Route On: Sports, Finance, Unknown; $; ?; A, B, C, D; Entity: Cisco; Probability Scores; Cisco : Positive]
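One way to read this slide is that business rules decide which algorithm's output to trust for each piece of content. The sketch below is a hypothetical illustration of that routing idea: the `Analysis` fields, the route names, and the 0.5 threshold are all assumptions made for the example and do not describe any Lexalytics API.

```python
from dataclasses import dataclass


@dataclass
class Analysis:
    """Hypothetical per-document output of upstream text analytics."""
    category: str               # e.g. "sports", "finance", "unknown"
    sarcasm_probability: float  # score from a sarcasm detector, 0.0 to 1.0


def route(analysis, sarcasm_threshold=0.5):
    """Pick a downstream handler using simple business rules."""
    # Sentiment on sarcastic text is unreliable, so set it aside first.
    if analysis.sarcasm_probability >= sarcasm_threshold:
        return "manual_review"
    # Otherwise route on content category; unknown content gets a default.
    if analysis.category == "finance":
        return "finance_sentiment_model"
    if analysis.category == "sports":
        return "sports_sentiment_model"
    return "general_sentiment_model"


print(route(Analysis(category="finance", sarcasm_probability=0.1)))
print(route(Analysis(category="sports", sarcasm_probability=0.9)))
```

Keeping the rules outside the models means the same probability scores can be handled differently per business problem, for example by discarding sarcastic content instead of reviewing it.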
- 30. Summary
  - Lots of people making money with text analytics
  - In lots of different verticals
  - Next 12 months brings online a whole host of features to make our software even more flexible
  - Check out tas.lexalytics.com
  - Check out www.lexalytics.com/lexascope