Popular Text Analytics Algorithms

This presentation introduces text analytics, its applications and various tools/algorithms used for this process. Given below are some of the important tools:

- Decision trees
- Naive-Bayes
- K-nearest neighbours
- Artificial Neural Networks
- Fuzzy C-Means
- Latent Dirichlet Allocation

Published in: Data & Analytics

  1. Popular Text Analytics Algorithms
  2. What is text analytics? It is all about deriving high-quality structured data for analysis from unstructured text.
  3. Why is text analytics used? It is used to measure customer opinions, product reviews, and feedback, and to provide search facilities, sentiment analysis, and entity modeling in support of data-backed decision making.
  4. What are the primary steps in text analytics? (1) Text acquisition and preparation; (2) processing and analysis; (3) reporting (visualization/presentation).
  5. For instance, social media chatter around a brand can have a rapidly spiraling impact (remember the post showing a Kentucky man being violently removed from his United Airlines seat on an overbooked flight, and how it led to a social media disaster for the airline?).
  6. In addition to social media data, other examples include e-mail messages, call center notes, and customer records.
  8. What type of information can be extracted? Terms; named entities; concepts; sentiment.
  9. Terms: These are extracted based on keywords (on your own site or a competitor's site).
  10. Named entities: These are extracted to answer the ‘who’, ‘what’, or ‘where’. Some instances include name, location, timestamp, or product.
  11. Concept: These are extracted to answer the ‘about’ of a piece of content. A concept describes the idea behind the content.
  12. Sentiment: This is extracted to gauge the overall feeling around a brand at the moment. The United Airlines example above is evidently a case of negative sentiment, denoting unhappy customers and potential business losses.
  13. What type of tools/algorithms are used for text analytics? Decision trees; Naive-Bayes; Support Vector Machines; K-nearest neighbours; Artificial Neural Networks; Fuzzy C-Means; LDA.
  14. Decision Trees: This is a classifier that repeatedly splits data into groups or classes. It comes in handy for tasks like classification or regression.
  15. Popular algorithms in decision trees. ID3: Iterative Dichotomiser 3 builds a decision tree that splits data on the feature with the highest information gain (that is, the largest reduction in entropy) until every group has homogeneous data. C4.5: This algorithm also uses information gain and entropy to classify data (just like ID3). Unlike ID3, it accepts both continuous and discrete features and handles incomplete data too. CART: Classification and Regression Tree works much like C4.5. One notable difference is that CART uses Gini impurity (to assess the ‘purity’ or homogeneity of a node) instead of the information gain/entropy used by C4.5.
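
The split criteria named above can be sketched in a few lines of plain Python. This is only an illustration of the textbook formulas for entropy, information gain, and Gini impurity; the function names and the toy Sport/Political labels are assumptions made for this example, not part of the slides or of any particular ID3, C4.5, or CART implementation.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def gini(labels):
    """Gini impurity of a list of class labels (the criterion used by CART)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of the child groups."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted


# Hypothetical toy data: four documents, perfectly separated by one split.
parent = ["Sport", "Sport", "Political", "Political"]
split = [["Sport", "Sport"], ["Political", "Political"]]
gain = information_gain(parent, split)
```

A split that produces pure groups, as in the toy data, recovers the full entropy of the parent as information gain; a split that leaves the class mix unchanged has a gain of zero.
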
  16. Naive-Bayes: This is a popular technique for classifying text and documents into categories (for example, whether to label a document as Sport or as Political based on the occurrence of certain words). It is a simple way to assign class or category labels to instances or cases.
  17. Naive-Bayes: Rather than being a single distinct algorithm, it is a family of algorithms that work on one underlying principle: “the value of a given feature is independent of the value of any other feature”, given the class.
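
As a rough illustration of the Sport-versus-Political example, here is a minimal sketch of a multinomial Naive-Bayes text classifier. The training sentences, the function names, whitespace tokenization, and add-one (Laplace) smoothing are all assumptions chosen for this example; the slides do not specify any of them.

```python
from collections import Counter, defaultdict
from math import log


def train(docs):
    """Count documents per class and words per class from (text, label) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in docs:
        class_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocab.add(word)
    return class_counts, word_counts, vocab


def predict(text, model):
    """Return the label with the highest log-probability under the naive assumption."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label in class_counts:
        score = log(class_counts[label] / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocab:
                continue  # ignore words never seen in training
            # Add-one smoothing; each word contributes independently.
            score += log((word_counts[label][word] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training set, invented for illustration only.
model = train([
    ("the team won the match", "Sport"),
    ("the striker scored a goal", "Sport"),
    ("the senate passed the bill", "Political"),
    ("the minister won the election", "Political"),
])
```

Summing per-word log-probabilities is where the independence principle shows up: each word adds its own term without regard to the other words in the document.
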
  18. Support Vector Machines: This is a supervised machine learning algorithm. It can be applied to classification and regression problems. Its essential component is the kernel trick, which uses a kernel function to implicitly map the data into a higher-dimensional feature space, so that data that is not linearly separable in its original form can be separated by a linear boundary.
  19. Applications of SVM: It is used in hypertext categorization, classification of images, and facial recognition applications.
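
The kernel trick can be made concrete with a small numerical check. The sketch below does not train an SVM; it only shows, for the degree-2 polynomial kernel on 2-D vectors (a choice assumed here for illustration), that evaluating the kernel in the original space gives the same number as a dot product in an explicitly constructed 3-D feature space. All function names are hypothetical.

```python
from math import isclose, sqrt


def dot(a, b):
    """Plain dot product of two equal-length sequences."""
    return sum(p * q for p, q in zip(a, b))


def poly_kernel(x, y):
    """Degree-2 polynomial kernel, computed directly in the original 2-D space."""
    return dot(x, y) ** 2


def feature_map(x):
    """Explicit 3-D feature map whose dot product equals poly_kernel."""
    return (x[0] ** 2, sqrt(2) * x[0] * x[1], x[1] ** 2)


x, y = (1.0, 2.0), (3.0, 4.0)
implicit = poly_kernel(x, y)                   # never builds the 3-D vectors
explicit = dot(feature_map(x), feature_map(y)) # builds them, then takes the dot product
same = isclose(implicit, explicit)
```

Because the two values agree, an algorithm that only needs dot products can work in the larger feature space while paying only the cost of the kernel evaluation.
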
  20. K Nearest Neighbors: k-NN is used in search applications where you are looking for items similar to a given one. You determine similarity by creating a vector representation of the items and then comparing how similar or dissimilar they are using a distance metric such as Euclidean distance.
  21. Applications of k-NN: A typical example of k-NN in use is an e-commerce site’s product recommendation feature. You can also use k-NN for concept search (finding semantically similar documents).
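
A minimal sketch of k-NN classification with Euclidean distance follows. The two-number document vectors (imagined here as counts of a sports word and a politics word), the labels, and the function name are assumptions made up for this example; it requires Python 3.8 or later for `math.dist`.

```python
from collections import Counter
from math import dist  # Euclidean distance, available since Python 3.8


def knn_predict(query, examples, k=3):
    """Label a query vector by majority vote among its k nearest examples."""
    nearest = sorted(examples, key=lambda ex: dist(query, ex[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical vectors: (count of "goal", count of "vote") per document.
examples = [
    ((3, 0), "Sport"),
    ((2, 1), "Sport"),
    ((4, 1), "Sport"),
    ((0, 3), "Political"),
    ((1, 4), "Political"),
]
```

The same nearest-neighbour lookup, without the vote, is what a "similar items" search would return.
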
  22. Artificial Neural Networks: ANNs are primarily used for classification with non-linear decision boundaries. Loosely modeled on the working of the human brain, an ANN operates on hidden units (which correspond to the neurons in the brain).
  23. Algorithms to train ANN: gradient descent; evolutionary algorithms; genetic algorithms.
  24. Applications of ANN: Image compression, handwriting analysis, and stock exchange movement prediction are some areas where ANNs are useful.
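
Gradient descent, the first training method listed above, can be sketched on the smallest possible network: a single sigmoid neuron. This is a deliberately reduced illustration under stated assumptions (logical OR as the toy task, zero initial weights, cross-entropy loss, a learning rate of 1.0). A single neuron can only learn a linear boundary; the non-linear boundaries mentioned above require at least one hidden layer, which is omitted here for brevity.

```python
from math import exp


def sigmoid(z):
    """Logistic activation function."""
    return 1.0 / (1.0 + exp(-z))


def train_neuron(samples, lr=1.0, epochs=2000):
    """Fit one sigmoid neuron with full-batch gradient descent on cross-entropy loss."""
    w = [0.0, 0.0]
    b = 0.0
    n = len(samples)
    for _ in range(epochs):
        grad_w = [0.0, 0.0]
        grad_b = 0.0
        for x, y in samples:
            p = sigmoid(w[0] * x[0] + w[1] * x[1] + b)
            err = p - y  # derivative of the loss w.r.t. the pre-activation
            grad_w[0] += err * x[0]
            grad_w[1] += err * x[1]
            grad_b += err
        # Step against the average gradient.
        w = [w[i] - lr * grad_w[i] / n for i in range(2)]
        b -= lr * grad_b / n
    return w, b


# Toy task (an assumption for this sketch): learn logical OR.
samples = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
w, b = train_neuron(samples)
predictions = [round(sigmoid(w[0] * x[0] + w[1] * x[1] + b)) for x, _ in samples]
```
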
  25. Fuzzy C-Means: This is a form of clustering that adds value when items can belong to more than one cluster. It works on the principle that, once clustering is complete, all items in a cluster are as similar as possible to each other.
  26. Steps in Fuzzy C-Means. Pick: choose the number of clusters into which the items can be categorized. Assign: give each data point a coefficient for its degree of membership in each cluster. Repeat: continue until the change in the coefficients between two iterations is no more than the pre-defined sensitivity threshold.
  27. Applications of Fuzzy C-Means: Disciplines like bioinformatics, healthcare, and economics make use of fuzzy c-means with great success.
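
The pick/assign/repeat loop can be sketched for one-dimensional data as follows. This is a minimal sketch using the standard fuzzy c-means update formulas with fuzzifier m = 2; the function name, the toy data, and the choice to start from fixed initial centers (rather than from random coefficients) are assumptions made so that the example is deterministic.

```python
def fuzzy_c_means(points, centers, m=2.0, tol=1e-6, max_iter=100):
    """1-D fuzzy c-means. Returns (centers, memberships), one membership row per point."""
    centers = list(centers)
    previous = None
    memberships = []
    for _ in range(max_iter):
        # Assign: membership coefficients from relative distances to each center.
        memberships = []
        for x in points:
            dists = [abs(x - c) for c in centers]
            if 0.0 in dists:
                # The point sits exactly on a center: full membership there.
                row = [1.0 if d == 0.0 else 0.0 for d in dists]
                total = sum(row)
                row = [r / total for r in row]
            else:
                power = 2.0 / (m - 1.0)
                row = [1.0 / sum((dj / dk) ** power for dk in dists) for dj in dists]
            memberships.append(row)
        # Update each center as the membership-weighted mean of the points.
        for j in range(len(centers)):
            weights = [row[j] ** m for row in memberships]
            centers[j] = sum(wt * x for wt, x in zip(weights, points)) / sum(weights)
        # Repeat until coefficients change by less than the sensitivity threshold.
        if previous is not None:
            change = max(abs(a - b)
                         for row, old in zip(memberships, previous)
                         for a, b in zip(row, old))
            if change < tol:
                break
        previous = memberships
    return centers, memberships


# Toy data (hypothetical): two tight groups and one point halfway between them.
points = [1.0, 1.2, 0.8, 5.0, 9.0, 9.2, 8.8]
centers, memberships = fuzzy_c_means(points, centers=[0.8, 9.2])
```

The point at 5.0 ends up with roughly equal membership in both clusters, which is the "part of more than one cluster" behaviour that distinguishes this method from hard clustering.
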
  28. Latent Dirichlet Allocation (LDA): It is a topic model that describes each document as a mixture of topics and each topic as a distribution over words. (It should not be confused with Linear Discriminant Analysis, which finds a linear combination of features that distinguishes or characterizes multiple classes of events or objects.)
  29. Primary steps in LDA: (1) Provide an estimate of the potential number of topics. (2) The algorithm assigns each word to a topic. (3) The algorithm checks and updates the topic assignments in a loop, which helps ensure coherent topic clustering.
  30. An example of LDA: Suppose there are three separate sentences. 1. I eat chicken and vegetables 2. Chickens are pets 3. My dog loves to eat chicken. With LDA, topic clustering for these three sentences is done as follows: Sentence 1 = 100% Topic B; Sentence 2 = 100% Topic A; Sentence 3 = 33% Topic A and 67% Topic B. We can then infer that there are two clusters for sentence classification: Pets (Topic A) and Food (Topic B).
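
The assign-and-recheck loop for Latent Dirichlet Allocation can be sketched with collapsed Gibbs sampling, one common way to fit the model. The slides do not say which inference method they have in mind, so the sampler, the hyperparameters `alpha` and `beta`, the iteration count, the seed, and the function name are all assumptions for this example. With only three short sentences and no stop-word removal, the resulting proportions are not expected to reproduce the percentages quoted above.

```python
import random
from collections import Counter


def lda_gibbs(docs, num_topics, alpha=0.1, beta=0.01, iterations=200, seed=0):
    """Collapsed Gibbs sampling for LDA. Returns per-document topic proportions."""
    rng = random.Random(seed)
    vocab_size = len({word for doc in docs for word in doc})
    doc_topic = [[0] * num_topics for _ in docs]           # topic counts per document
    topic_word = [Counter() for _ in range(num_topics)]    # word counts per topic
    topic_total = [0] * num_topics                         # total words per topic
    assignments = []
    # Start by assigning every word to a random topic.
    for d, doc in enumerate(docs):
        zs = []
        for word in doc:
            k = rng.randrange(num_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word] += 1
            topic_total[k] += 1
        assignments.append(zs)
    # Repeatedly revisit each word and resample its topic given all the others.
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][word] -= 1
                topic_total[k] -= 1
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][word] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][word] += 1
                topic_total[k] += 1
    return [
        [(count + alpha) / (len(doc) + num_topics * alpha) for count in counts]
        for counts, doc in zip(doc_topic, docs)
    ]


sentences = [
    "i eat chicken and vegetables",
    "chickens are pets",
    "my dog loves to eat chicken",
]
docs = [s.split() for s in sentences]
theta = lda_gibbs(docs, num_topics=2)  # one row of topic proportions per sentence
```

Each row of `theta` plays the role of the "x% Topic A, y% Topic B" breakdown in the example; naming the topics (Pets, Food) remains a manual step based on which words each topic favours.
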