IPTC EXTRA Spring 2018

An update on the EXTRA project, an open source, rules-based classifier for news content, including the application for additional funding from Google DNI for FRANCIS.

  1. “Extra” by Jeremy Brooks
  2. EXTRA and FRANCIS
     Stuart Myles, Associated Press, 24th April 2018
  3. Rules-Based Classification
     • Rules are better for breaking news than statistical methods
       – You don’t need 50 examples before you can start tagging
       – A rule for a new topic doesn’t require other rules to change
     • More consistent and scalable than hand tagging
     • Easier to explain why rules classify content
       – Machine learning methods can be “black boxes”
       – Easier to precisely explain, and correct, mistakes
     © 2018 IPTC. All rights reserved.
  4. EXTRA
     EXTraction Rules Apparatus
     Rules-based classification of text
     Open source software
     EXTRA was developed by the IPTC
     €50,000 grant from the Digital News Initiative
     You can use your own taxonomy, rules and formats
       - Example rules help us drive development of the EXTRA system
       - You can use the example rules to see how to develop your own
       - Rules could apply IPTC Media Topics or any other taxonomy
  5. Development Process
     The EXTRA software was developed by Infalia
       - All software is open source
     Two linguists creating rules in English and German
       - Sample rules to apply IPTC Media Topics
     Example news corpora licensed for EXTRA
       - English from Thomson Reuters
       - German from APA
  6. EXTRA Components
     Elasticsearch Percolator + Custom Code
     Classification
     Rule authoring
     Corpus Testing
     Schema Management
  7. Classification using Percolator
     • Elasticsearch
       – A sophisticated, open source full-text search engine
       – Lets you query documents stored in an index
     • Elasticsearch Percolator
       – Store queries in an index and match documents to queries
       – Classification uses the percolator to match documents to rules
     • EXTRA Rule Language
       – Rule-writer-friendly language (easier than ES DSL)
       – Access to all ES features, plus custom operators
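
To make the percolator idea concrete, here is a minimal sketch of the JSON bodies involved, built as plain Python dictionaries so it runs without a server. It uses the standard Elasticsearch `percolator` field type and `percolate` query in their typeless (Elasticsearch 7 and later) form, which may differ from the version EXTRA targets. The `topic` field, the label `example-topic`, and the helper names are assumptions for illustration only; they are not part of EXTRA, whose rules are written in its own rule language rather than raw ES DSL.

```python
import json

# Index mapping: the "query" field stores queries (rules) instead of text.
# "topic" is a hypothetical label field, not something defined by EXTRA.
INDEX_MAPPING = {
    "mappings": {
        "properties": {
            "headline": {"type": "text"},
            "body": {"type": "text"},
            "topic": {"type": "keyword"},
            "query": {"type": "percolator"},
        }
    }
}


def rule_document(topic, phrase):
    """Build a stored rule: a phrase query on body, tagged with a topic label."""
    return {"topic": topic, "query": {"match_phrase": {"body": phrase}}}


def percolate_request(headline, body):
    """Build a search body asking which stored rules match this document."""
    return {
        "query": {
            "percolate": {
                "field": "query",
                "document": {"headline": headline, "body": body},
            }
        }
    }


if __name__ == "__main__":
    # Print the bodies that would be sent to the index and search endpoints.
    print(json.dumps(INDEX_MAPPING, indent=2))
    print(json.dumps(rule_document("example-topic", "angela merkel"), indent=2))
    print(json.dumps(percolate_request("A headline", "Some body text"), indent=2))
```

The point of the sketch is the inversion: rules are indexed once, and each incoming story is sent as the `document` of a `percolate` query, so the hits returned are the rules (and hence the topics) that apply to it.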
  8. Schema and Rules Example
     • Two fields, headline and body, with body allowed to be queried by paragraph
         headline
         body
         body_paragraph
     • A rule to require that “angela merkel” and “us elections” appear in the same paragraph
         (prox/unit=paragraph/distance=1
           (body adj "angela merkel")
           (body adj "us elections")
         )
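
The intent of the rule above (both phrases in the same paragraph) can be illustrated with a small pure-Python sketch. This is not how EXTRA or Elasticsearch evaluates the rule; it assumes paragraphs are separated by blank lines and uses simple case-insensitive, whole-word phrase matching. The function names are hypothetical.

```python
import re


def paragraphs(body):
    """Split body text into lowercased paragraphs on blank lines (an assumption)."""
    return [p.strip().lower() for p in re.split(r"\n\s*\n", body) if p.strip()]


def contains_phrase(paragraph, phrase):
    """True if the phrase occurs in the paragraph on word boundaries."""
    pattern = r"\b" + re.escape(phrase.lower()) + r"\b"
    return re.search(pattern, paragraph) is not None


def same_paragraph(body, *phrases):
    """True if at least one paragraph contains every phrase."""
    return any(
        all(contains_phrase(para, phrase) for phrase in phrases)
        for para in paragraphs(body)
    )


if __name__ == "__main__":
    story = "Angela Merkel commented on the US elections.\n\nMarkets rose."
    print(same_paragraph(story, "angela merkel", "us elections"))
```

Word-boundary matching matters here: a plain substring test would wrongly find “us elections” inside “previous elections”.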
  9. FRANCIS*
     Using machine learning to empower rule-based classification of news with semantics.
     • “aboutness” evaluation
       – Given that a story is about a topic, how much is it about it?
     • Rule suggestion
       – Suggest rules based on a pre-tagged corpus
     • Enriched rule operators
       – For example, nested “count” operators
       – Using EXTRA as the foundation
     * St Francis de Sales is the patron saint of writers and journalists