
Best Practices for Large Scale Text Mining Processing

NOW facilitates semantic search by attaching annotations to search strings. How complex does that get, e.g. with wildcards between annotated strings?
NOW’s search box is quite basic at the moment, but it still supports a few scenarios.
1. Pure concept/faceted search: search for all documents containing a concept, or where a set of concepts co-occur. Ranking is based on frequency of occurrence.
2. Concept/faceted + full-text search: search for both concepts and a particular textual term or phrase.
3. Full text search
Search can be customised in almost any direction. For the NOW showcase we’ve kept it fairly simple, as every client usually has a slightly different case and wants to tune search in a slightly different direction.

The search in NOW is faceted, which means that you search with concepts (facets) and retrieve all documents containing mentions of the searched concept. If you search by more than one facet, the engine retrieves documents which contain mentions of all the searched concepts, but there is no restriction that they occur next to each other.
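As a rough illustration of this faceted co-occurrence search, here is a minimal sketch (toy index and hypothetical concept labels, not NOW's actual implementation):

```python
# Toy index: document id -> list of concept mentions found by the annotator.
doc_annotations = {
    "doc1": ["Person:Obama", "Location:London", "Person:Obama"],
    "doc2": ["Location:London", "Organization:UN"],
    "doc3": ["Person:Obama", "Organization:UN"],
}

def faceted_search(query_concepts):
    """Return documents mentioning ALL query concepts (co-occurrence,
    no positional restriction), ranked by total mention frequency."""
    hits = []
    for doc_id, mentions in doc_annotations.items():
        if all(c in mentions for c in query_concepts):
            score = sum(mentions.count(c) for c in query_concepts)
            hits.append((doc_id, score))
    return sorted(hits, key=lambda h: -h[1])

print(faceted_search(["Person:Obama", "Location:London"]))  # [('doc1', 3)]
```

Note there is no positional constraint: doc1 matches because both concepts are mentioned somewhere in it, not because they are adjacent.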
Is the tagging service expandable (say, with custom ontologies)? Also, is it something you offer as a service? It is unclear to me from the website.
The TAG service is used for demonstration purposes only. The models behind it are trained for annotating news articles. The pipeline is customizable for every concrete scenario, for different domains and entities of interest. You can access several of our pipelines as a service through the S4 platform, or you can have them hosted as an on-premises solution. In some cases our clients want domain adaptation, improvements in a particular area, or tagging with their internal dataset; in those cases we again offer an on-premises deployment, as well as a managed service hosted on our hardware.
Does your system accommodate cluster analysis using unsupervised keyword/phrase annotation for knowledge discovery?
Insofar as patterns of user behaviour also count as knowledge discovery, we employ these for suggesting related reads. Beyond that, we have experience tailoring custom clustering pipelines which also rely on features like keywords and named entities.

For topic extraction, how many topics can we extract? What can we infer from a Twitter corpus?
For topic extraction we have determined that we obtain the best results when suggesting 3 categories. These are taken from IPTC, but only the uppermost levels, of which there are fewer than 20.
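The "suggest 3 categories" step can be sketched as picking the top-k labels from a classifier's scores over the top-level IPTC categories. The scores and labels below are illustrative, not from the actual system:

```python
def suggest_categories(scores, k=3):
    """Pick the k highest-scoring top-level categories from a
    label -> confidence mapping produced by a classifier."""
    ranked = sorted(scores.items(), key=lambda kv: -kv[1])
    return [label for label, _ in ranked[:k]]

# Hypothetical classifier output over top-level categories.
scores = {"sport": 0.82, "politics": 0.71, "economy": 0.40, "arts": 0.05}
print(suggest_categories(scores))  # ['sport', 'politics', 'economy']
```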

The Twitter corpus example is from a project Ontotext participates in, called Pheme. The goal of the project is to detect rumours and check their veracity, thus helping journalists in their hunt for attractive news.

Do you provide Processing Resources and JAPE rules for the GATE framework that can be used with GATE Embedded?
We contribute to the GATE framework, and everything that has been wrapped as Processing Resources (PRs) has been included in the corresponding GATE distributions.



  1. Oct 13, 2016. Ivelina Nikolova, Senior NLP Engineer. Best Practices for Large Scale Text Mining Processing
  2. In this webinar you will learn …
     • Industry applications that maximize Return on Investment (ROI) of your text mining process
     • To describe your text mining problem
     • To define the output of the text mining
     • To select the appropriate text analysis techniques
     • To plan the prerequisites for a successful text mining solution
     • DOs and DON’Ts in setting up a text mining process
  3. Outline
     • Business need for text mining solutions
     • Introduction to NLP and information extraction
     • How to tailor your text analysis process
     • Applications and demonstrations
  4. Semantic annotation/enrichment
     • Links mentions in the text to knowledge base concepts
     • Automatic, manual and semi-automatic
  5. Business needs for text mining solutions
     • Semantic annotation facilitates:
       – data search
       – data management
       – data understanding
       – and more abstract modeling of the textual content, like…
  6. Business needs for text mining solutions
     • Text summarization
     • Content recommendation
     • Document classification
     • Topic extraction
     • Document search and retrieval
     • Question answering
     • Sentiment analysis
  7. Some of our customers
  8. NLP and IE
     • Computational Linguistics (CL)
     • Natural Language Processing (NLP)
     • Text Mining (TM)
     • Information Extraction (IE)
     • Named Entity Recognition (NER)
  9. State of the art
     • Named Entity Recognition
       – 60% F1 [OKE-challenge@ESWC2015]
       – 82.9% F1 [Leaman and Lu, 2016] in the biomedical domain
       – above 90% for more specific tasks
 10. Why is NLP so hard?
     • Language and domain dependent
     • The input is free text:
       “President Barack Obama labels Donald Trump comments as 'disturbing'”
       “Barack Obama labels Donald Trump comments as 'disturbing'”
       “President Obama labels Donald Trump comments as 'disturbing'”
     • Natural language ambiguity:
       I cleaned the dishes in my pajamas. / I cleaned the dishes in the sink.
       Georgia was happy with the meal her boyfriend cooked. / Maria is excited about her trip to Georgia next month.
 11. Designing the text mining process
     • Know your business problem
     • Know your data
     • Find appropriate samples
     • Use common formats, or formats which can easily be transformed into them
     • Get together domain experts, technical staff, NLP engineers and potential users
     • Narrow the business problem to an information extraction task
     • Clarify the annotation types
     • Clarify the annotation guidelines
     • Apply the appropriate algorithm for IE
     • Do iterations of evaluation and improvement
     • Ensure continuous adaptation by curation and re-training
 13. Clear problem definition
     • Define clearly your business problem: specific smart search, content recommendation, content enrichment, content aggregation, etc. E.g. the system must do <A, B, C>
     • Define clearly the text analysis problem
     • Reduce the business problem to an information extraction problem
       Business problem: faceted search by Persons, Organizations, Locations
       Information extraction problem: extract mentions of Persons, Organizations and Locations and link them to the corresponding concepts in the knowledge base
 14. Define the annotation types I
     • Annotations: abstract descriptions of the mentions of concepts of interest
       Named entities: Person, Location, Organization; Disease, Symptom, Chemical; SpaceObject, SpaceCraft
       Relations: PersonHasRoleInOrganisation, Causation
 15. Define the annotation types II
     • Annotation types
       – Person, Organization, Location
       – Person, Organization, City
       – Person, Organization, City, Country
     • Annotation features
       Location: string, geonames instance, latitude, longitude
 16. Locations mentioned in Holocaust documents
 17. Define the annotation types II
     • Annotation types
       – Person, Organization, Location
       – Person, Organization, City
       – Person, Organization, City, Country
     • Annotation features
       Location: string, geonames instance, latitude, longitude
       Chemical: string, InChI, SMILES, CAS
       PersonHasRoleInOrganization: person instance, role instance, organization instance, timestamp
     • Example annotation:
       string: the Gulf of Mexico
       startOffset: 71
       endOffset: 89
       type: Location
       inst:
       links: []
       latitude: 25.368611
       longitude: -90.390556
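An annotation like the Gulf of Mexico example can be represented as a small data structure. A minimal Python sketch (the field names follow the slide; the class itself is illustrative, not an actual API):

```python
from dataclasses import dataclass, field

@dataclass
class Annotation:
    string: str         # surface form exactly as it appears in the text
    start_offset: int   # character offset where the mention begins
    end_offset: int     # character offset where the mention ends
    type: str           # annotation type, e.g. "Location"
    features: dict = field(default_factory=dict)  # type-specific features

# The Gulf of Mexico example from the slide as a Location annotation.
ann = Annotation(
    string="the Gulf of Mexico",
    start_offset=71,
    end_offset=89,
    type="Location",
    features={"latitude": 25.368611, "longitude": -90.390556},
)
print(ann.type, ann.features["latitude"])  # Location 25.368611
```

Keeping the type-specific attributes in a separate `features` map lets the same structure carry Location, Chemical or relation annotations without changing the schema.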
 18. Provide examples
     • Realistic
     • Demonstrating the desired output
     • Positive and negative
       “It therefore increases insulin secretion and reduces POS[glucose] levels, especially postprandially.”
       “It acts by increasing POS[NEG[glucose]-induced insulin] release and by reducing glucagon secretion postprandially.”
     • Representative and balanced set of the types of problems
     • In appropriate/commonly used formats: XML, HTML, TXT, CSV, DOC, PDF
 19. Domain model and knowledge
     • Domain model/ontology: describes the types of objects in the problem area and the relations between them
 20. Data
     • Data sources: proprietary data, public data, professional data
     • Data cleanup
     • Data formats
     • Data stores
       – For metadata: GraphDB
       – For content: MongoDB, MarkLogic, etc.
     • Data modeling is an inevitable part of the process of semantic data enrichment
       – Start it as early as possible
       – Keep to the common data formats
       – Mistakes and underestimations are expensive because they influence the whole process of developing a text mining solution
 21. Gold standard
     • Gold standard: annotated data with superior quality
     • Annotation guidelines: used as guidance for manually annotating the documents
       POS[London] universities = universities located in London
       NEG[London] City Council
       NEG[London] Mayor
     • Manual annotation tools: intuitive UI, visualization features, export formats
       – MANT: Ontotext's in-house tool
       – GATE
       – brat
     • Annotation approach
       – Manual vs. semi-automatic
       – Domain experts vs. crowd annotation, e.g. Mechanical Turk
     • Inter-annotator agreement
     • Train:test ratio: 60:40, 70:30
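The train:test split of the gold standard is a simple operation, but it should be reproducible so evaluations can be compared across iterations. A minimal sketch (the fixed seed and 70:30 default are illustrative choices):

```python
import random

def split_gold_standard(docs, train_ratio=0.7, seed=42):
    """Shuffle the annotated corpus reproducibly and split it into
    train and test sets at the given ratio."""
    rng = random.Random(seed)          # fixed seed -> same split every run
    docs = list(docs)
    rng.shuffle(docs)
    cut = int(len(docs) * train_ratio)
    return docs[:cut], docs[cut:]

train, test = split_gold_standard(range(100))
print(len(train), len(test))  # 70 30
```

Keeping the test set untouched during development is what makes the later precision/recall numbers trustworthy.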
 22. Text analysis approach
     • Rule-based approach
       – lower number of clear patterns which do not change, or change only slightly, over time
       – high precision
       – appropriate for domains where it is important to know how the decision to extract a given annotation was taken, e.g. the biomedical domain
     • Machine learning approach
       – higher number of patterns which do change over time
       – requires annotated data
       – allows for retraining over time
     • Neural network approach
       – Deep Neural Networks: getting closer to AI
       – Recent advances promise true natural language understanding via complex neural networks
       – Great results in speech recognition, image recognition and machine translation; a breakthrough is expected in NLP
       – Still unclear why and how it works, thus difficult to optimize
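The rule-based approach can be illustrated with a couple of high-precision patterns; every extracted annotation is traceable to the rule that fired, which is exactly the transparency argued for above. The patterns and entity types here are toy examples, not the deck's actual rules:

```python
import re

# One hand-written, high-precision pattern per entity type.
RULES = {
    "Money": re.compile(r"\$\d+(?:\.\d+)?(?:\s(?:million|billion))?"),
    "Date": re.compile(
        r"\b(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\w* \d{1,2}, \d{4}\b"
    ),
}

def extract(text):
    """Apply every rule to the text; record which rule produced
    each annotation so decisions stay explainable."""
    annotations = []
    for ann_type, pattern in RULES.items():
        for m in pattern.finditer(text):
            annotations.append({"type": ann_type, "string": m.group(),
                                "start": m.start(), "end": m.end()})
    return annotations

text = "On Oct 13, 2016 the company raised $5 million."
for ann in extract(text):
    print(ann["type"], "->", ann["string"])
```

The trade-off described on the slide is visible here: each rule is precise but covers only the surface forms it was written for, so recall grows only as rules are added.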
 23. NER pipeline
     • Preprocessing
     • Keyphrase extraction
     • Gazetteer-based enrichment
     • Named entity recognition and disambiguation
     • Generic entity extraction
     • Result consolidation
     • Relation extraction
 24.-26. NER pipeline (diagram slides)
 27. Results curation / Error analysis
     • Curation of results: domain experts manually assess the work of the text analysis components
     • Testing interfaces
     • Feedback
       – Select a representative set of documents to evaluate manually
       – Provide as full a description of the results and the component used as possible:
         <pipeline version>
         <input as sent for processing>
         <description of the wrong behavior>
         <description of the correct behavior>
     • The earlier this happens, the sooner it triggers revision of the models and improvement of the annotation
 28. Evaluation of the results
     • Gold standard split train:test
       – 70:30
       – 80:20
     • Which task do you want to evaluate?
       – E.g. extraction at document level or inline annotation
     • Evaluation metrics
       – Information extraction tasks: precision, recall, F-measure
       – Recommendations: A/B testing
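The precision/recall/F-measure evaluation mentioned above can be computed by strictly matching predicted annotations against the gold standard. A minimal sketch, representing each annotation as a (start, end, type) tuple:

```python
def precision_recall_f1(gold, predicted):
    """Strict-match evaluation: an annotation counts as correct only
    if its offsets and type all agree with a gold annotation."""
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)                       # true positives
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = {(0, 5, "Person"), (10, 16, "Location"), (20, 24, "Organization")}
pred = {(0, 5, "Person"), (10, 16, "Location"), (30, 34, "Organization")}
p, r, f = precision_recall_f1(gold, pred)
print(round(p, 2), round(r, 2), round(f, 2))  # 0.67 0.67 0.67
```

Strict matching is the harshest variant; lenient (overlap-based) matching is a common alternative when partial spans are acceptable.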
 29. Continuous adaptation
 30. Types of extracted information
     • Document categorization: post, political news, sport news, etc.
     • Topic extraction: important words and phrases in the text
     • Named entity recognition: People, Organizations, Locations, Times, Amounts of money, etc.
     • Keyterm assignment from predefined hierarchies
     • Concept extraction: entities from a knowledge base
     • Relation extraction: relations between types of entities
 31. Applications
     • TAG
     • NOW
     • Patient Insights: contact for credentials
 32. Take-away messages
     • A clearly defined business problem needs to be broken down into a clearly defined information extraction problem
     • Requires combined efforts from business decision makers, domain experts, natural language processing experts and technical staff
     • Data modeling is an inevitable part of the process; consider it as early as possible
     • Create clear annotation guidelines based on real-world examples
     • Start with an initial small set of balanced and representative documents
     • Plan the evaluation of the results in advance
     • Choose an appropriate manual annotation tool
     • While annotating content, check how the quantity influences the performance
     • Select the appropriate text analysis approach
     • Plan iterations of curation by domain experts followed by revision of the text analysis approach
     • Plan the aspects of continuous adaptation: document quantity, timing, temporality of the information fed into the model
 33. Thank you very much for your attention! You are welcome to try our demos at