SlideShare una empresa de Scribd logo
1 de 62
Descargar para leer sin conexión
SystemT: an Algebraic Approach to
Declarative Information Extraction
Laura Chiticariu
IBM Research | Almaden
Talk at MIT Computer Science and Artificial Intelligence Laboratory
03/08/2016
Outline
• Emerging Applications: Case Studies
• Enterprise Requirements
• Challenges in Cascading Finite State Automata
• SystemT : an Algebraic approach
– The AQL language
– Performance
– Transparency
– Multilingual Support
– Machine Learning
– Tooling
• Summary 2
Case Study 1: Sentiment Analysis
Text
Analytics
Product catalog, Customer Master Data, …
Social Media
• Products
Interests360o Profile
• Relationships
• Personal
Attributes
• Life
Events
Statistical
Analysis,
Report Gen.
Bank
6
23%
Bank
5
20%
Bank
4
21%
Bank
3
5%
Bank
2
15%
Bank
1
15%
ScaleScaleScaleScale
450M+ tweets a day,
100M+ consumers, …
BreadthBreadthBreadthBreadth
Buzz, intent, sentiment, life
events, personal atts, …
ComplexityComplexityComplexityComplexity
Sarcasm, wishful thinking, …
Customer 360º
3
Case Study 2: Machine Analysis
Web
Servers
Application
Servers
Transaction
Processing
Monitors
Database #1 Database #2
Log
File
Log
File
Log
File
Log
File
Log
File
Log
File
Log
File
Log
File
Log
File
Log
File
Log
File
Log
File
Log
File
Log
File
Log
File
Log
File
Log
File
Log
File
Log FileLog File
DB #2
Log File
DB #2
Log File
• Web site with multi-tier
architecture
• Every component produces its
own system logs
• An error shows up in the log for
Database #2
• What sequence of events ledWhat sequence of events ledWhat sequence of events ledWhat sequence of events led
to this error?to this error?to this error?to this error?
12:34:56 SQL ERROR 43251:
Table CUST.ORDERWZ is not
Operations Analysis
4
Case Study 2: Machine Analysis
Web Server
Logs
CustomCustom
Application
Logs
Database
Logs
Raw Logs
Parsed
Log
Records
Parse Extract
Linkage
Information
End-to-End
Application
Sessions
Integrate
Operations Analysis
Graphical Models
(Anomaly detection)
Analyze
5 min 0.85
CauseCause EffectEffect DelayDelay CorrelationCorrelation
10 min 0.62
Correlation Analysis
• Every customer has unique
components with unique log record
formats
• Need to quickly customize all
stages of analysis for these custom
log data sources
5
Outline
• Emerging Applications: Case Studies
• Enterprise Requirements
• Challenges in Cascading Finite State Automata
• SystemT : an Algebraic approach
– The AQL language
– Performance
– Transparency
– Multilingual Support
– Machine Learning
– Tooling
• Summary 6
Text Analytics vs Information Extraction
7
… hotel close to Seaworld in Orlando?
NNS JJ IN NNP IN NNP
POS Tagging
Text
Normalization
Oh wow. Holy s***. Welp. Ima go see
Godzilla in 3 months. F*** yes
I am going to go see Godzilla in 3
months.
We are neutral on Malaysian banks.
Dependency
Parsing
Information
Extraction
Clustering
Classification
Regression
Applications
7
Enterprise Requirements
• ExpressivityExpressivityExpressivityExpressivity
– Need to express complex NLP algorithms for a variety of tasks and data
sources
• ScalabilityScalabilityScalabilityScalability
– Large data volumes, often orders of magnitude larger than classical NLP
corpora
• Social Media: Twitter alone has 500M+ messages / day; 1TB+ per day
• Financial Data: SEC alone has 20M+ filings, several TBs of data, with documents range from few
KBs to few MBs
• Machine Data: One application server under moderate load at medium logging level 1GB of
logs per day
• TransparencyTransparencyTransparencyTransparency
– Every customer’s data and problems are unique in some way
– Need to easily comprehendcomprehendcomprehendcomprehend, debugdebugdebugdebug and enhanceenhanceenhanceenhance extractors 8
Expressivity Example: Different Kinds of Parses
We are raising our tablet forecast.
S
are
NP
We
S
raising
NP
forecastNP
tablet
DET
our
subj
obj
subj pred
Natural Language Machine Log
Dependency
Tree
Oct 1 04:12:24 9.1.1.3 41865:
%PLATFORM_ENV-1-DUAL_PWR: Faulty
internal power supply B detected
Time Oct 1 04:12:24
Host 9.1.1.3
Process 41865
Category
%PLATFORM_ENV-1-
DUAL_PWR
Message
Faulty internal power
supply B detected
9
1010
Expressivity Example: Fact Extraction (Tables)
Singapore 2012 Annual Report
(136 pages PDF)
Identify note breaking down
Operating expenses line item,
and extract opex components
Identify line item for Operating
expenses from Income statement
(financial table in pdf document)
Expressivity Example: Sentiment Analysis
Mcdonalds mcnuggets are fake as shit but they so delicious.
We should do something cool like go to ZZZZ (kiddingkiddingkiddingkidding).
Makin chicken fries at home bc everyone sucks!
You are never too old for Disney movies.
Bank X got me ****ed up today!
Not a pleasant client experience. Please fix ASAP.
I'm still hearing from clients that Company A's website is better.
X... fixing something that wasn't broken
Intel's 2013 capex is elevated at 23% of sales, above average of 16%
IBM announced 4Q2012 earnings of $5.13 per share, compared with 4Q2011 earnings of $4.62
per share, an increase of 11 percent
We continue to rate shares of MSFT neutral.
Sell EUR/CHF at market for a decline to 1.31000…
FHLMC reported $4.4bn net loss and requested $6bn in capital from Treasury.
Customer SurveysCustomer SurveysCustomer SurveysCustomer Surveys
Social MediaSocial MediaSocial MediaSocial Media
Analyst ResearchAnalyst ResearchAnalyst ResearchAnalyst Research
ReportsReportsReportsReports
11
Transparency Example: Need to incorporate in-situ data
Intel's 2013 capex is elevated at 23% of sales, above average of 16%Intel's 2013 capex is elevated at 23% of sales, above average of 16%
FHLMC reported $4.4bn net loss and requested $6bn in capital from Treasury.FHLMC reported $4.4bn net loss and requested $6bn in capital from Treasury.
I'm still hearing from clients that Merrill's website is better.I'm still hearing from clients that Merrill's website is better.
Customer or
competitor?
Good or bad?
Entity of interest
I need to go back to Walmart, Toys R Us has the sameI need to go back to Walmart, Toys R Us has the same
toy $10 cheaper!
12
A Brief History of IE
• 1978-1997: MUC (Message
Understanding Conference) –
DARPA competition 1987 to 1997
– FRUMP [DeJong82]
– FASTUS [Appelt93],
– TextPro, PROTEUS
• 1998: Common Pattern
Specification Language (CPSL)
standard [Appelt98]
– Standard for subsequent rule-
based systems
• 1999-present: Commercial
products, GATE
• At first: Simple techniques like
Naive Bayes
• 1990’s: Learning Rules
– AUTOSLOG [Riloff93]
– CRYSTAL [Soderland98]
– SRV [Freitag98]
• 2000’s: More specialized models
– Maximum Entropy Models
[Berger96]
– Hidden Markov Models [Leek97]
– Maximum Entropy Markov
Models [McCallum00]
– Conditional Random Fields
[Lafferty01]
– Automatic feature expansion
RuleRuleRuleRule----BasedBasedBasedBased Machine LearningMachine LearningMachine LearningMachine Learning
13
Real Systems: A Practical Perspective
• Entity extraction
• EMNLP, ACL,
NAACL, 2003-
2012
• 54 industrial
vendors (Who’s
Who in Text
Analytics, 2012)
[Chiticariu, Li, Reiss, EMNLP’13]
Fast development, fast adaptation,
better results in limited time,
sophistication
Easy to comprehend, maintain, debug &
optimize performance; lower reliance on
labeled data
14
Background: The SystemT Project
• Early 2000’s: NLP group starts at IBM Research –
Almaden
• Initial focus: Collection-level machine learning problems
• Observation: Most time spent on feature extraction
– Technology used: Java, then Cascading Finite State Automata
15
Outline
• Emerging Applications: Case Studies
• Enterprise Requirements
• Challenges in Cascading Finite State Automata
• SystemT : an Algebraic approach
– The AQL language
– Performance
– Transparency
– Multilingual Support
– Machine Learning
– Tooling
• Summary 16
Finite-state Automata
• Common formalism underlying most rule-based IE
systems
– Input text viewed as a sequence of tokens
– Rules expressed as regular expression patterns over the lexical
features of these tokens
• Several levels of processing Cascading Automata
– Typically, at higher levels of the grammar, larger segments of text are
analyzed and annotated
• Common Pattern Specification Language (CPSL)
– Standard to specify and represent cascading finite state automata
– Each automata accepts a sequence of annotations and outputs a
sequence of annotations
17
Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Proin, in sagittis facilisis, volutpat dapibus, ultrices sit amet, sem , volutpat dapibus, ultrices sit amet,
sem Tomorrow, we will meet Mark Scott, Howard Smith and amet lt arcu tincidunt orci. Pellentesque justo tellus , scelerisque quis, facilisis nunc
volutpat enim, quis viverra lacus nulla sit lectus. Curabitur cursus tincidunt orci. Pellentesque justo tellus , scelerisque quis, facilisis quis, interdum non, ante.
Suspendisse
Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Proin elementum neque at justo. Aliquam erat volutpat. Curabitur a massa. Vivamus luctus, risus in
sagittis facilisis arcu Tomorrow, we will meet Mark Scott, Howard Smith and hendrerit faucibus pede mi ipsum. Curabitur cursus tincidunt orci.
Pellentesque justo tellus , scelerisque quis, facilisis quis, interdum non, ante. Suspendisse feugiat, erat in
Cascading Automata Example: Person Extractor
Tokenization
(preprocessing step)
Level 1
〈Gazetteer〉[type = LastGaz] 〈Last〉
〈Gazetteer〉[type = FirstGaz] 〈First〉
〈Token〉[~ “[A-Z]w+”] 〈Caps〉
Rule priority used to prefer
First over Caps
• Rule priority used to prefer First over Caps.
• Lossy Sequencing: annotations dropped
because input to next stage must be a sequence
– First preferred over Last since it was declared earlier
18
Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Proin, in sagittis facilisis, volutpat dapibus, ultrices sit amet, sem , volutpat dapibus, ultrices sit amet,
sem Tomorrow, we will meet Mark Scott, Howard Smith and amet lt arcu tincidunt orci. Pellentesque justo tellus , scelerisque quis, facilisis nunc
volutpat enim, quis viverra lacus nulla sit lectus. Curabitur cursus tincidunt orci. Pellentesque justo tellus , scelerisque quis, facilisis quis, interdum non, ante.
Suspendisse
Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Proin elementum neque at justo. Aliquam erat volutpat. Curabitur a massa. Vivamus luctus, risus in
sagittis facilisis arcu Tomorrow, we will meet Mark Scott, Howard Smith and hendrerit faucibus pede mi ipsum. Curabitur cursus tincidunt orci.
Pellentesque justo tellus , scelerisque quis, facilisis quis, interdum non, ante. Suspendisse feugiat, erat in
Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Proin elementum neque at justo. Aliquam erat volutpat. Curabitur a massa. Vivamus luctus, risus in e
sagittis Tomorrow, we will meet Mark Scott, Howard Smith and hendrerit faucibus pede mi sed ipsum. Curabitur cursus tincidunt orci.
Pellentesque justo tellus , scelerisque quis, facilisis quis, interdum non, ante. Suspendisse feugiat, erat in feugiat tincidunt, est nunc volutpat enim, quis viverra
lacus nulla sit amet lectus. Nulla odio lorem, feugiat et, volutpat dapibus, ultrices sit amet, sem. Vestibulum quis dui vitae massa euismod faucibus. Pellentesque
id neque id tellus hendrerit tincidunt. Etiam augue. Class aptent
Cascading Automata Example: Person Extractor
Tokenization
(preprocessing step)
Level 1
〈Gazetteer〉[type = LastGaz] 〈Last〉
〈Gazetteer〉[type = FirstGaz] 〈First〉
〈Token〉[~ “[A-Z]w+”] 〈Caps〉
Level 2 〈First〉 〈Last〉 〈Person〉
〈First〉 〈Caps〉 〈Person〉
〈First〉 〈Person〉
Rigid Rule Priority and Lossy
Sequencing in Level 1 caused
partial results
19
Problems with Cascading Automata
• Scalability: Redundant passes over document
• Expressivity: Frequent use of custom code
• Transparency
• Ease of comprehension
• Ease of debugging
• Ease of enhancement
Operational
semantics
+ custom code
= no provenance
20
Bringing Transparency to Feature Extraction
• Our approachOur approachOur approachOur approach: Use a declarative language
– Decouple meaning of extraction rules from execution plan
• Our language:Our language:Our language:Our language: AQL (Annotator Query Language)
– Semantics based on relational calculus
– Syntax based on SQL
21
Outline
• Emerging Applications: Case Studies
• Enterprise Requirements
• Challenges in Cascading Finite State Automata
• SystemT : an Algebraic approach
– The AQL language
– Performance
– Transparency
– Multilingual Support
– Machine Learning
– Tooling
• Summary 22
Document
text: String
Person
last: Spanfirst: Span fullname: Span
AQL Data Model (Simplified)
• Relational data modelRelational data modelRelational data modelRelational data model: data is organized in: data is organized in: data is organized in: data is organized in tuples; tuples have a; tuples have a; tuples have a; tuples have a schema
• SpecialSpecialSpecialSpecial data types necessarydata types necessarydata types necessarydata types necessary for text processing:for text processing:for text processing:for text processing:
– Document consists of a single texttexttexttext attribute
– Annotations are represented by a type called SpanSpanSpanSpan, which consists of beginbeginbeginbegin,
endendendend and documentdocumentdocumentdocument attribute
23
create view FirstCaps as
select CombineSpans(F.name, C.name) as name
from First F, Caps C
where FollowsTok(F.name, C.name, 0, 0);
<First> <Caps>
0 tokens
• Declarative: Specify logical conditions that input tuples should satisfy in order to
generate an output tuple
• Choice of SQL-like syntax for AQL motivated by wider adoption of SQL
• Compiles into SystemT algebra
AQL By Example
24
create view Person as
select S.name as name
from (
( select CombineSpans(F.name, C.name) as name
from First F, Caps C
where FollowsTok(F.name, C.name, 0, 0))
union all
( select CombineSpans(F.name, L.name) as name
from First F, Last L
where FollowsTok(F.name, L.name, 0, 0))
union all
( select *
from First F )
) S
consolidate on name;
<First><Caps>
<First><Last>
<First>
Revisiting the Person Example
25
create view Person as
select S.name as name
from (
( select CombineSpans(F.name, C.name) as name
from First F, Caps C
where FollowsTok(F.name, C.name, 0, 0))
union all
( select CombineSpans(F.name, L.name) as name
from First F, Last L
where FollowsTok(F.name, L.name, 0, 0))
union all
( select *
from First F )
) S
consolidate on name;
Explicit clause for
resolving ambiguity
(No Rigid Priority
problem)
Input may contain
overlapping annotations
(No Lossy Sequencing
problem)
Revisiting the Person Example
26
Compiling and Executing AQL
AQL Language
Optimizer
Operator
Graph
Specify extractor semantics
declaratively (express logic of
computation, not control flow)
Choose efficient execution
plan that implements
semantics
Optimized execution plan
executed at runtime
27
Regular Expression Extraction Operator
28
[A-Z][a-z]+
DocumentInput Tuple
we will meet Mark
Scott and
Output Tuple 2 Span 2Document
Span 1Output Tuple 1 Document
Regex
Expressivity: Rich Set of Operators
Deep Syntactic Parsing ML Training & Scoring
Core Operators
Tokenization Parts of Speech Dictionaries
Regular
Expressions
Span
Operations
Relational
Operations
Semantic Role Labels
Language to express NLP Algorithms AQL
.
Aggregation
Operations
29
package com.ibm.avatar.algebra.util.sentence;
import java.io.BufferedWriter;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.regex.Matcher;
public class SentenceChunker
{
private Matcher sentenceEndingMatcher = null;
public static BufferedWriter sentenceBufferedWriter = null;
private HashSet<String> abbreviations = new HashSet<String> ();
public SentenceChunker ()
{
}
/** Constructor that takes in the abbreviations directly. */
public SentenceChunker (String[] abbreviations)
{
// Generate the abbreviations directly.
for (String abbr : abbreviations) {
this.abbreviations.add (abbr);
}
}
/**
* @param doc the document text to be analyzed
* @return true if the document contains at least one sentence boundary
*/
public boolean containsSentenceBoundary (String doc)
{
String origDoc = doc;
/*
* Based on getSentenceOffsetArrayList()
*/
// String origDoc = doc;
// int dotpos, quepos, exclpos, newlinepos;
int boundary;
int currentOffset = 0;
do {
/* Get the next tentative boundary for the sentenceString */
setDocumentForObtainingBoundaries (doc);
boundary = getNextCandidateBoundary ();
if (boundary != -1) {doc.substring (0, boundary + 1);
String remainder = doc.substring (boundary + 1);
String candidate = /*
* Looks at the last character of the String. If this last
* character is part of an abbreviation (as detected by
* REGEX) then the sentenceString is not a fullSentence and
* "false” is returned
*/
// while (!(isFullSentence(candidate) &&
// doesNotBeginWithCaps(remainder))) {
while (!(doesNotBeginWithPunctuation (remainder)
&& isFullSentence (candidate))) {
/* Get the next tentative boundary for the sentenceString */
int nextBoundary = getNextCandidateBoundary ();
if (nextBoundary == -1) {
break;
}
boundary = nextBoundary;
candidate = doc.substring (0, boundary + 1);
remainder = doc.substring (boundary + 1);
}
if (candidate.length () > 0) {
// sentences.addElement(candidate.trim().replaceAll("n", "
// "));
// sentenceArrayList.add(new Integer(currentOffset + boundary
// + 1));
// currentOffset += boundary + 1;
// Found a sentence boundary. If the boundary is the last
// character in the string, we don't consider it to be
// contained within the string.
int baseOffset = currentOffset + boundary + 1;
if (baseOffset < origDoc.length ()) {
// System.err.printf("Sentence ends at %d of %dn",
// baseOffset, origDoc.length());
return true;
}
else {
return false;
}
}
// origDoc.substring(0,currentOffset));
// doc = doc.substring(boundary + 1);
doc = remainder;
}
}
while (boundary != -1);
// If we get here, didn't find any boundaries.
return false;
}
public ArrayList<Integer> getSentenceOffsetArrayList (String doc)
{
ArrayList<Integer> sentenceArrayList = new ArrayList<Integer> ();
// String origDoc = doc;
// int dotpos, quepos, exclpos, newlinepos;
int boundary;
int currentOffset = 0;
sentenceArrayList.add (new Integer (0));
do {
/* Get the next tentative boundary for the sentenceString */
setDocumentForObtainingBoundaries (doc);
boundary = getNextCandidateBoundary ();
if (boundary != -1) {
String candidate = doc.substring (0, boundary + 1);
String remainder = doc.substring (boundary + 1);
/*
* Looks at the last character of the String. If this last character
* is part of an abbreviation (as detected by REGEX) then the
* sentenceString is not a fullSentence and "false" is returned
*/
// while (!(isFullSentence(candidate) &&
// doesNotBeginWithCaps(remainder))) {
while (!(doesNotBeginWithPunctuation (remainder) &&
isFullSentence (candidate))) {
/* Get the next tentative boundary for the sentenceString */
int nextBoundary = getNextCandidateBoundary ();
if (nextBoundary == -1) {
break;
}
boundary = nextBoundary;
candidate = doc.substring (0, boundary + 1);
remainder = doc.substring (boundary + 1);
}
if (candidate.length () > 0) {
sentenceArrayList.add (new Integer (currentOffset + boundary + 1));
currentOffset += boundary + 1;
}
// origDoc.substring(0,currentOffset));
doc = remainder;
}
}
while (boundary != -1);
if (doc.length () > 0) {
sentenceArrayList.add (new Integer (currentOffset + doc.length ()));
}
sentenceArrayList.trimToSize ();
return sentenceArrayList;
}
private void setDocumentForObtainingBoundaries (String doc)
{
sentenceEndingMatcher = SentenceConstants.
sentenceEndingPattern.matcher (doc);
}
private int getNextCandidateBoundary ()
{
if (sentenceEndingMatcher.find ()) {
return sentenceEndingMatcher.start ();
}
else
return -1;
}
private boolean doesNotBeginWithPunctuation (String remainder)
{
Matcher m = SentenceConstants.punctuationPattern.matcher (remainder);
return (!m.find ());
}
private String getLastWord (String cand)
{
Matcher lastWordMatcher = SentenceConstants.lastWordPattern.matcher (cand);
if (lastWordMatcher.find ()) {
return lastWordMatcher.group ();
}
else {
return "";
}
}
/*
* Looks at the last character of the String. If this last character is
* par of an abbreviation (as detected by REGEX)
* then the sentenceString is not a fullSentence and "false" is returned
*/
private boolean isFullSentence (String cand)
{
// cand = cand.replaceAll("n", " "); cand = " " + cand;
Matcher validSentenceBoundaryMatcher =
SentenceConstants.validSentenceBoundaryPattern.matcher (cand);
if (validSentenceBoundaryMatcher.find ()) return true;
Matcher abbrevMatcher = SentenceConstants.abbrevPattern.matcher (cand);
if (abbrevMatcher.find ()) {
return false; // Means it ends with an abbreviation
}
else {
// Check if the last word of the sentenceString has an entry in the
// abbreviations dictionary (like Mr etc.)
String lastword = getLastWord (cand);
if (abbreviations.contains (lastword)) { return false; }
}
return true;
}
}
Java Implementation of Sentence Boundary Detection
Expressivity: AQL vs. Custom Java Code
30
package com.ibm.avatar.algebra.util.sentence;
import java.io.BufferedWriter;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.regex.Matcher;
public class SentenceChunker
{
private Matcher sentenceEndingMatcher = null;
public static BufferedWriter sentenceBufferedWriter = null;
private HashSet<String> abbreviations = new HashSet<String> ();
public SentenceChunker ()
{
}
/** Constructor that takes in the abbreviations directly. */
public SentenceChunker (String[] abbreviations)
{
// Generate the abbreviations directly.
for (String abbr : abbreviations) {
this.abbreviations.add (abbr);
}
}
/**
* @param doc the document text to be analyzed
* @return true if the document contains at least one sentence boundary
*/
public boolean containsSentenceBoundary (String doc)
{
String origDoc = doc;
/*
* Based on getSentenceOffsetArrayList()
*/
// String origDoc = doc;
// int dotpos, quepos, exclpos, newlinepos;
int boundary;
int currentOffset = 0;
do {
/* Get the next tentative boundary for the sentenceString */
setDocumentForObtainingBoundaries (doc);
boundary = getNextCandidateBoundary ();
if (boundary != -1) {doc.substring (0, boundary + 1);
String remainder = doc.substring (boundary + 1);
String candidate = /*
* Looks at the last character of the String. If this last
* character is part of an abbreviation (as detected by
* REGEX) then the sentenceString is not a fullSentence and
* "false” is returned
*/
// while (!(isFullSentence(candidate) &&
// doesNotBeginWithCaps(remainder))) {
while (!(doesNotBeginWithPunctuation (remainder)
&& isFullSentence (candidate))) {
/* Get the next tentative boundary for the sentenceString */
int nextBoundary = getNextCandidateBoundary ();
if (nextBoundary == -1) {
break;
}
boundary = nextBoundary;
candidate = doc.substring (0, boundary + 1);
remainder = doc.substring (boundary + 1);
}
if (candidate.length () > 0) {
// sentences.addElement(candidate.trim().replaceAll("n", "
// "));
// sentenceArrayList.add(new Integer(currentOffset + boundary
// + 1));
// currentOffset += boundary + 1;
// Found a sentence boundary. If the boundary is the last
// character in the string, we don't consider it to be
// contained within the string.
int baseOffset = currentOffset + boundary + 1;
if (baseOffset < origDoc.length ()) {
// System.err.printf("Sentence ends at %d of %dn",
// baseOffset, origDoc.length());
return true;
}
else {
return false;
}
}
// origDoc.substring(0,currentOffset));
// doc = doc.substring(boundary + 1);
doc = remainder;
}
}
while (boundary != -1);
// If we get here, didn't find any boundaries.
return false;
}
public ArrayList<Integer> getSentenceOffsetArrayList (String doc)
{
ArrayList<Integer> sentenceArrayList = new ArrayList<Integer> ();
// String origDoc = doc;
// int dotpos, quepos, exclpos, newlinepos;
int boundary;
int currentOffset = 0;
sentenceArrayList.add (new Integer (0));
do {
/* Get the next tentative boundary for the sentenceString */
setDocumentForObtainingBoundaries (doc);
boundary = getNextCandidateBoundary ();
if (boundary != -1) {
String candidate = doc.substring (0, boundary + 1);
String remainder = doc.substring (boundary + 1);
/*
* Looks at the last character of the String. If this last character
* is part of an abbreviation (as detected by REGEX) then the
* sentenceString is not a fullSentence and "false" is returned
*/
// while (!(isFullSentence(candidate) &&
// doesNotBeginWithCaps(remainder))) {
while (!(doesNotBeginWithPunctuation (remainder) &&
isFullSentence (candidate))) {
/* Get the next tentative boundary for the sentenceString */
int nextBoundary = getNextCandidateBoundary ();
if (nextBoundary == -1) {
break;
}
boundary = nextBoundary;
candidate = doc.substring (0, boundary + 1);
remainder = doc.substring (boundary + 1);
}
if (candidate.length () > 0) {
sentenceArrayList.add (new Integer (currentOffset + boundary + 1));
currentOffset += boundary + 1;
}
// origDoc.substring(0,currentOffset));
doc = remainder;
}
}
while (boundary != -1);
if (doc.length () > 0) {
sentenceArrayList.add (new Integer (currentOffset + doc.length ()));
}
sentenceArrayList.trimToSize ();
return sentenceArrayList;
}
private void setDocumentForObtainingBoundaries (String doc)
{
sentenceEndingMatcher = SentenceConstants.
sentenceEndingPattern.matcher (doc);
}
private int getNextCandidateBoundary ()
{
if (sentenceEndingMatcher.find ()) {
return sentenceEndingMatcher.start ();
}
else
return -1;
}
private boolean doesNotBeginWithPunctuation (String remainder)
{
Matcher m = SentenceConstants.punctuationPattern.matcher (remainder);
return (!m.find ());
}
private String getLastWord (String cand)
{
Matcher lastWordMatcher = SentenceConstants.lastWordPattern.matcher (cand);
if (lastWordMatcher.find ()) {
return lastWordMatcher.group ();
}
else {
return "";
}
}
/*
* Looks at the last character of the String. If this last character is
* par of an abbreviation (as detected by REGEX)
* then the sentenceString is not a fullSentence and "false" is returned
*/
private boolean isFullSentence (String cand)
{
// cand = cand.replaceAll("n", " "); cand = " " + cand;
Matcher validSentenceBoundaryMatcher =
SentenceConstants.validSentenceBoundaryPattern.matcher (cand);
if (validSentenceBoundaryMatcher.find ()) return true;
Matcher abbrevMatcher = SentenceConstants.abbrevPattern.matcher (cand);
if (abbrevMatcher.find ()) {
return false; // Means it ends with an abbreviation
}
else {
// Check if the last word of the sentenceString has an entry in the
// abbreviations dictionary (like Mr etc.)
String lastword = getLastWord (cand);
if (abbreviations.contains (lastword)) { return false; }
}
return true;
}
}
Java Implementation of Sentence Boundary Detection
Expressivity: AQL vs. Custom Java Code
create dictionary AbbrevDict from file
'abbreviation.dict’;
create view SentenceBoundary as
select R.match as boundary
from ( extract regex /(([.?!]+s)|(ns*n))/
on D.text as match from Document D ) R
where
Not(ContainsDict('AbbrevDict',
CombineSpans(LeftContextTok(R.match, 1),R.match)));
Equivalent AQL Implementation
31
Expressivity: AQL vs CPSL
• [Fagin et al., PODS ’13]
– Formal modeling of core subset of code-free CPSL (regular
spanners) and AQL (core spanners)
– Core spanners strictly more expressive than regular
spanners
• [Fagin et al., PODS ‘14]
– Unified framework for declarative cleaning (i.e., resolving
overlap)
– Cleaning increases expressiveness in general
• But not for SystemT consolidate policies, nor for controls in JAPE
(CPSL implementation in GATE)
32
Outline
• Emerging Applications: Case Studies
• Enterprise Requirements
• Challenges in Cascading Finite State Automata
• SystemT : an Algebraic approach
– The AQL language
– Performance
– Transparency
– Multilingual Support
– Machine Learning
– Tooling
• Summary 33
Scalability
• Performance issues with cascading automata
– Complete pass through tokens for each cascade
– Many of these passes are wasted work
• Dominant approach: Make each pass go faster
– Doesn’t solve root problem!
• Algebraic approach: Build a query optimizer!
34
Scalability: Class of Optimizations in SystemT
• RewriteRewriteRewriteRewrite----basedbasedbasedbased: rewrite
algebraic operator graph
– Shared Dictionary Matching
– Shared Regular Expression
Evaluation
– On-demand tokenization
• CostCostCostCost----basedbasedbasedbased: relies on novel
selectivity estimation for text-
specific operators
– Standard transformations
• E.g., push down selections
– Restricted Span Evaluation
• Evaluate expensive operators
on restricted regions of the
document
35
Tokenization overhead is paid only
once
First
(followed within 0 tokens)
Plan C
Plan A
Join
Caps
Restricted Span
Evaluation
Plan B
First
Identify Caps starting
within 0 tokensExtract text to the right
Caps
Identify First ending
within 0 tokens
Extract text to the left
Performance Comparison (with ANNIE)
0
100
200
300
400
500
600
700
0 20 40 60 80 100
Average document size (KB)
Throughput(KB/sec)
Open Source Entity Tagger
SystemT
ANNIE
TaskTaskTaskTask: Named Entity
DatasetDatasetDatasetDataset : Different document collections from the Enron corpus obtained
by randomly sampling 1000 documents for each size
10~50x faster
[Chiticariu et al., ACL’10] 36
[Chiticariu et al., ACL’10]
Performance Comparison on Larger Documents
Dataset Document Size
Throughput
(KB/sec)
Average Memory
(MB)
Range Average ANNIE SystemT ANNIE SystemT
Web Crawl 68 B – 388 KB 8.8 KB 42.8 498.8 201.8 77.2
Medium
SEC Filings
240 KB – 0.9 MB 401 KB 26.3 703.5 601.8 143.7
Large
SEC Flings
1 MB – 3.4 MB 1.54 MB 21.1 954.5 2683.5 189.6
Datasets : Web crawl and filings from the Securities and Exchanges Commission (SEC)
Throughput benefits carryover for
wide-variety of document sizes
Much lower
memory footprint
37
[Chiticariu et al., ACL’10]
Performance Comparison on Larger Documents
Dataset Document Size
Throughput
(KB/sec)
Average Memory
(MB)
Range Average ANNIE SystemT ANNIE SystemT
Web Crawl 68 B – 388 KB 8.8 KB 42.8 498.8 201.8 77.2
Medium
SEC Filings
240 KB – 0.9 MB 401 KB 26.3 703.5 601.8 143.7
Large
SEC Flings
1 MB – 3.4 MB 1.54 MB 21.1 954.5 2683.5 189.6
Datasets : Web crawl and filings from the Securities and Exchanges Commission (SEC)
Throughput benefits carryover for
wide-variety of document sizes
Much lower
memory footprint
Theorem:Theorem:Theorem:Theorem: For any acyclic tokenFor any acyclic tokenFor any acyclic tokenFor any acyclic token----based FSTbased FSTbased FSTbased FST TTTT,,,,
there exists an operator graphthere exists an operator graphthere exists an operator graphthere exists an operator graph GGGG such that evaluatingsuch that evaluatingsuch that evaluatingsuch that evaluating
TTTT andandandand GGGG has the same computational complexityhas the same computational complexityhas the same computational complexityhas the same computational complexity
38
SystemT Acceleration on FPGA
[Atasu et al. FPL’13, Polig et al., FPL’14]
39
Outline
• Emerging Applications: Case Studies
• Enterprise Requirements
• Challenges in Cascading Finite State Automata
• SystemT : an Algebraic approach
– The AQL language
– Performance
– Transparency
– Multilingual Support
– Machine Learning
– Tooling
• Summary 40
How AQL Solved our Problems
• Expressivity: Complex tasks, no custom
code
• Scalability: Cost-based query optimization
• Transparency
• Ease of comprehension
• Ease of debugging
• Ease of enhancement
Clear and Simple
Provenance
41
42
Computing Provenance
PersonPhone rule:
insert into PersonPhone
select Merge(F.match, P.match) as match
from Person F, Phone P
where Follows(F.match, P.match, 0, 60);
match
Anna at James St. office (555-5555
James St. office (555-5555
PersonPhone
Provenance: Explains output data in terms of the input data, the intermediate data,
and the transformation (e.g., SQL query, ETL, workflow)
– Surveys: [Davidson & Freire, SIGMOD 2008] [Cheney et al., Found. Databases 2009]
For predicate-based rule languages (e.g., SQL), can be computed automatically!
Person PhonePerson
Anna at James St. office (555-5555) .
[Liu et al., VLDB’10]
43
Computing Provenance
Rewritten PersonPhone rule:
insert into PersonPhone
select Merge(F.match, P.match) as match,
GenerateID() as ID,
P.id as nameProv, Ph.id as numberProv
‘AND’ as how
from Person F, Phone P
where Follows(F.match, P.match, 0, 60);
Person PhonePerson
Anna at James St. office (555-5555) .
ID: 1 ID: 2 ID: 3
1 3AND
2 3AND
match
Anna at James St. office (555-5555
James St. office (555-5555
PersonPhone
Provenance
Provenance: Explains output data in terms of the input data, the intermediate data,
and the transformation (e.g., SQL query, ETL, workflow)
– Surveys: [Davidson & Freire, SIGMOD 2008] [Cheney et al., Found. Databases 2009]
For predicate-based rule languages (e.g., SQL), can be computed automatically!
[Liu et al., VLDB’10]
AQL: Going beyond Feature Extraction
Dataset Entity Type System Precision Recall F-measure
CoNLL 2003
Location
SystemT 93.11 91.61 92.35
Florian 90.59 91.73 91.15
Organization
SystemT 92.25 85.31 88.65
Florian 85.93 83.44 84.67
Person
SystemT 96.32 92.39 94.32
Florian 92.49 95.24 93.85
Enron Person
SystemT 87.27 81.82 84.46
Minkov 81.1 74.9 77.9
ExtractionExtractionExtractionExtraction Task:Task:Task:Task: Named-entity extraction
SystemsSystemsSystemsSystems compared:compared:compared:compared: SystemT (customized) vs. [Florian et al.’03] [Minkov et al.’05]
[Chiticariu et al., EMNLP’10]
Transparency without machine learning
outperforms machine learning without
transparency.
44
Outline
• Emerging Applications: Case Studies
• Enterprise Requirements
• Challenges in Cascading Finite State Automata
• SystemT : an Algebraic approach
– The AQL language
– Performance
– Transparency
– Multilingual Support
– Machine Learning
– Tooling
• Summary 45
Multilingual Support
Tokenization • 26+ languages
• All other languages supported via
whitespace/punctuation tokenizer
Part of Speech and Lemmatization • 18+ languages, including English and Western
languages, Arabic, Russian, CJK
Semantic Role Labeling • English
• Ongoing work on Multilingual SRL [Akbik et al,
ACL’15]
Annotator Libraries • Out-of-the-box customizable libraries for
multiple applications and data sources in
multiple languages
46
Outline
• Emerging Applications: Case Studies
• Enterprise Requirements
• Challenges in Cascading Finite State Automata
• SystemT : an Algebraic approach
– The AQL language
– Performance
– Transparency
– Multilingual Support
– Machine Learning
– Tooling
• Summary 47
Machine Learning in SystemT
• AQL provides a foundation of transparency
• Next step: Add machine learning without losing transparency
• Major machine learning efforts:
– Embeddable Models in AQL
– Learning using AQL as target language
48
Machine Learning in SystemT
• AQL provides a foundation of transparency
• Next step: Add machine learning without losing transparency
• Major machine learning efforts:
– Embeddable Models in AQL
• Model Training and Scoring integration with SystemML
• Deep Parsing & Semantic Role Labeling [Akbik et al., ACL’15]
• Text Normalization [Zhang et al. ACL’13, Li & Baldwin, NAACL ‘15]
– Learning using AQL as target language
49
© 2015 IBM Corporation
Input
Documents
Extracted features
SystemT Runtime
SystemT
Operators
Statistical
Learning
Algorithm
Label
Feature Extraction
Training the Parser: Efficient and Powerful Feature Extraction
AQL
create view FirstWord as
select...
from ...
where ...;
...
create view Digits as
extract regex...
from ...
where ... ;
Input
docs
English FirstWord: I, We,
This,
Digits: 360, 2014
Capitalized: We,
IBM,
German FirstWord: Ich,
Wir,Dies,
Digits: 360, 2014
Capitalized: Wir,
IBM,
Statistical
Parser
Annotations
AQL
create function Parse(span Span)
return String ...
external name ...
...
create view Parse as
select Parse(S.text) as parse,...
from sentence S;
create view Actions as
select...
from Parse P;
...
-- Example: We are happy
create view Sentiment_SpeakerIsPositive as
select ‘positive’ as polarity, ...
‘isPositive’ as patternType,...
from Actions A, Roles R
where MatchesDict(‘PositiveVerb.dict’, A.verb)
and MatchesDict(‘Speakers.dict’, R.value)
and ...;
...
Input
doc
Applying the Parser: Easy Incorporation of Parsing Results for Complex Extractors
UDFs
SentimentMention
polarity mention target clue patternType
positive We like IBM from a long timer IBM like SpeakerDoesPosi
tive
negative Microsoft earnings expected
to drop
Microsoft expected to
drop
TargetDoesNegat
ive… …
Parsing + extraction
Cost-based
optimization
SystemT
Operators
SystemT Runtime
Embed
statistical
parser as UDF
Cost-based
optimization
Run
statistical
parser
Build rules
using parser
results
Identify
sentiment
based on
parse tree
patterns
Identify the
first word of
a sentence
Identify all
digits
Simplify Training and Applying StatisticalSimplify Training and Applying StatisticalSimplify Training and Applying StatisticalSimplify Training and Applying Statistical PPPParsersarsersarsersarsers
50
UDFs
SystemML in a Nutshell
• Provides a language for data scientists to implement machine learning algorithms
– Declarative, high-level language with R-like syntax (also Python)
– Also comes with approx. 20 algorithms pre-implemented
• Compiles execution plans ranging from single node (scale up multi threaded) to
scale out (MapReduce, Spark)
– Cost-based optimizer to generate execution plans, parallelize
• Based on data and system characteristics
– Operators for in-memory single node and cluster execution
• Runs in embeddable, standalone, and cluster mode
• Supports various APIs
• Apache SystemML Incubator project: http://systemml.apache.org
• Ongoing research effort at IBM Research - Almaden
51
Machine Learning in SystemT
• AQL provides a foundation of transparency
• Next step: Add machine learning without losing transparency
• Major machine learning efforts:
– Embeddable Models in AQL
– Learning using AQL as target language
• Low-level features: regular expressions [Li et al., EMNLP’08],
dictionaries [Li et al., CIKM’11, Roy et al., SIGMOD’13]
• Rule refinement [Liu et al., VLDB’10]
• Rule induction [Nagesh et al., EMNLP’12]
• Ongoing research
52
Outline
• Emerging Applications: Case Studies
• Enterprise Requirements
• Challenges in Grammar-based IE systems
• SystemT : an Algebraic approach
– The AQL language
– Performance
– Transparency
– Multilingual Support
– Machine Learning
– Tooling
• Summary 53
54
Tooling Research for Productivity
Develop
TestAnalyze
Development
Deploy
Refine
Test
Maintenance
Task Analysis
[ACL’11,12,13,CHI’13]
• Concordance Viewer and
Labeling Tool [ACL ‘13]
•Extraction plan [CHI’13]
• Track provenance [VLDB’10]
• Contextual clue discovery[CIKM’11]
• Regex learning [EMNLP’08]
• Rule induction & refinement [EMNLP’12,VLDB’10]
• Dictionary refinement [SIGMOD’13]
•Visual Programming [VLDB’15]
• Visual Programming: [VLDB’15]
•NE Interface [EMNLP’10]
Eclipse Tools Overview
55
Ease of
Programming
Performance
Tuning
Automatic
Discovery
AQL Editor
Explain
Pattern Discovery
Result Viewer
Regex Learner
AQL Editor: syntax highlighting, auto-complete,
hyperlink navigation
Result Viewer: visualize/compare/evaluate
Explain: show how each result was generated
Workflow UI: end-to-end development wizard
Regex Generator: generate regular expressions
from examples
Pattern Discovery: identify patterns in the
data
Profiler: identify performance bottlenecks to be
hand tuned
[Chiticariu et al., SIGMOD ‘11, Li et al. ACL’12]
Web Tools Overview
56
Ease of
Programming
Ease of
Sharing
Canvas: Visual construction of extractors,
Customization of existing extractors
Result Viewer: visualize/compare/evaluate
Concept catalog: share concepts
Project: share extractor development
[Li et al., VLDB’15]
Tooling for Productivity
Named
Entities
Parts-of-speech
….
Core
Operations
.....
Higher-level
Tasks
Tokenization Span Operators
Dictionary Regular Expressions
Financial
Primitives
Drug
Extraction
Sentiment
Analysis
Corporate Intelligence Life Sciences
Application
Tasks
.....
NLPEngineers
(EclipseTools)
SME
(WebTools)
Relational &Aggregate
Operators
ML Training
& Scoring
AQL Language
Customizable Extractor Libraries developed in AQL
Deep Parsing & SRL
57
Outline
• Emerging Applications: Case Studies
• Enterprise Requirements
• Challenges in Grammar-based IE systems
• SystemT : an Algebraic approach
– The AQL language
– Performance
– Transparency
– Multilingual Support
– Machine Learning
– Tooling
• Summary 58
59
Summary
Declarative AQL language
Development EnvironmentDevelopment EnvironmentDevelopment EnvironmentDevelopment Environment
Cost-based
optimization
............
Discovery tools for AQL development
SystemT RuntimeSystemT RuntimeSystemT RuntimeSystemT RuntimeSystemT RuntimeSystemT RuntimeSystemT RuntimeSystemT Runtime
Input
Documents
Extracted
Objects
AQL: a declarative language that can
be used to build extractors
outperforming the state-of-the-arts
[ACL’10, EMNLP’10]
A suite of novel development tooling
leveraging machine learning and HCI
[EMNLP’08, VLDB’10, ACL’11, CIKM’11, ACL’12,
EMNLP’12, SIGMOD’12, CHI’13, ACL’13, VLDB’15]
Cost-based optimization
for text-centric operations
[ICDE’08, ICDE’11]
Highly embeddable runtime with
high-throughput and small
memory footprint. [SIGMOD
Record’09, SIGMOD’09,’ FPL’13,’14]
A declarativedeclarativedeclarativedeclarative information extraction system with costcostcostcost----based optimizationbased optimizationbased optimizationbased optimization, highhighhighhigh----performanceperformanceperformanceperformance
runtimeruntimeruntimeruntime and novel development toolingnovel development toolingnovel development toolingnovel development tooling based on solid theoretical foundationsolid theoretical foundationsolid theoretical foundationsolid theoretical foundation [PODS’13, 14]. Shipping
with 10+ IBM products, and installed on 400+ client’s sites.
InfoSphere
StreamsStreamsStreamsStreams
InfoSphere
BigInsightsBigInsightsBigInsightsBigInsights
SystemTSystemTSystemTSystemTSystemTSystemTSystemTSystemT
IBM EnginesIBM EnginesIBM EnginesIBM Engines
UIMAUIMAUIMAUIMA
SystemTSystemTSystemTSystemT
…
Find Out More about SystemT!
60
https://ibm.biz/BdF4GQ
Find Out More about SystemT!
61
https://ibm.biz/BdF4GQ
Try out SystemT
Watch a demo
Learn about using
SystemT in
unversity courses
Thank you!
• For more information…
– Visit our website:Visit our website:Visit our website:Visit our website:
http://ibm.co/1Cdm1Mj
– Visit BigInsights Knowledge Center:Visit BigInsights Knowledge Center:Visit BigInsights Knowledge Center:Visit BigInsights Knowledge Center:
http://http://http://http://ibm.co/1DIouEvibm.co/1DIouEvibm.co/1DIouEvibm.co/1DIouEv
– Learning at your own pace: With the lab materialsLearning at your own pace: With the lab materialsLearning at your own pace: With the lab materialsLearning at your own pace: With the lab materials
– Contact meContact meContact meContact me
• chiti@us.ibm.com
SystemT Team:
Laura Chiticariu, Marina Danilevsky, Howard Ho, Rajasekar Krishnamurthy, Yunyao Li
Sriram Raghavan, Frederick Reiss, Shivakumar Vaithyanathan, Huaiyu Zhu
62

Más contenido relacionado

Similar a SystemT: Declarative Information Extraction (invited talk at MIT CSAIL)

Barga Galvanize Sept 2015
Barga Galvanize Sept 2015Barga Galvanize Sept 2015
Barga Galvanize Sept 2015Roger Barga
 
Doing Analytics Right - Building the Analytics Environment
Doing Analytics Right - Building the Analytics EnvironmentDoing Analytics Right - Building the Analytics Environment
Doing Analytics Right - Building the Analytics EnvironmentTasktop
 
The Sky’s the Limit – The Rise of Machine Learnin
The Sky’s the Limit – The Rise of Machine LearninThe Sky’s the Limit – The Rise of Machine Learnin
The Sky’s the Limit – The Rise of Machine LearninInside Analysis
 
Time Difference: How Tomorrow's Companies Will Outpace Today's
Time Difference: How Tomorrow's Companies Will Outpace Today'sTime Difference: How Tomorrow's Companies Will Outpace Today's
Time Difference: How Tomorrow's Companies Will Outpace Today'sInside Analysis
 
Algorithm Marketplace and the new "Algorithm Economy"
Algorithm Marketplace and the new "Algorithm Economy"Algorithm Marketplace and the new "Algorithm Economy"
Algorithm Marketplace and the new "Algorithm Economy"Diego Oppenheimer
 
8 minute intro to data science
8 minute intro to data science 8 minute intro to data science
8 minute intro to data science Mahesh Kumar CV
 
Big data in Private Banking
Big data in Private BankingBig data in Private Banking
Big data in Private BankingJérôme Kehrli
 
A step towards machine learning at accionlabs
A step towards machine learning at accionlabsA step towards machine learning at accionlabs
A step towards machine learning at accionlabsChetan Khatri
 
[Data Meetup] Data Science in Finance - Building a Quant ML pipeline
[Data Meetup] Data Science in Finance -  Building a Quant ML pipeline[Data Meetup] Data Science in Finance -  Building a Quant ML pipeline
[Data Meetup] Data Science in Finance - Building a Quant ML pipelineData Science Society
 
Fst auckland presentation
Fst auckland presentationFst auckland presentation
Fst auckland presentationBohdan Szymanik
 
Gala Webminar September 2013
Gala Webminar September 2013Gala Webminar September 2013
Gala Webminar September 2013pangeanic
 
predictive analysis and usage in procurement ppt 2017
predictive analysis and usage in procurement  ppt 2017predictive analysis and usage in procurement  ppt 2017
predictive analysis and usage in procurement ppt 2017Prashant Bhatmule
 
If You Are Not Embedding Analytics Into Your Day To Day Processes, You Are Do...
If You Are Not Embedding Analytics Into Your Day To Day Processes, You Are Do...If You Are Not Embedding Analytics Into Your Day To Day Processes, You Are Do...
If You Are Not Embedding Analytics Into Your Day To Day Processes, You Are Do...Dell World
 
final oracle presentation
final oracle presentationfinal oracle presentation
final oracle presentationPriyesh Patel
 
Data mining and Machine learning expained in jargon free & lucid language
Data mining and Machine learning expained in jargon free & lucid languageData mining and Machine learning expained in jargon free & lucid language
Data mining and Machine learning expained in jargon free & lucid languageq-Maxim
 
Transformacion del Negocio Financiero por medio de Tecnologias Cloud
Transformacion del Negocio Financiero por medio de Tecnologias CloudTransformacion del Negocio Financiero por medio de Tecnologias Cloud
Transformacion del Negocio Financiero por medio de Tecnologias CloudRaul Goycoolea Seoane
 
AI for Software Engineering
AI for Software EngineeringAI for Software Engineering
AI for Software EngineeringMiroslaw Staron
 
Democratization - New Wave of Data Science (홍운표 상무, DataRobot) :: AWS Techfor...
Democratization - New Wave of Data Science (홍운표 상무, DataRobot) :: AWS Techfor...Democratization - New Wave of Data Science (홍운표 상무, DataRobot) :: AWS Techfor...
Democratization - New Wave of Data Science (홍운표 상무, DataRobot) :: AWS Techfor...Amazon Web Services Korea
 

Similar a SystemT: Declarative Information Extraction (invited talk at MIT CSAIL) (20)

Barga Galvanize Sept 2015
Barga Galvanize Sept 2015Barga Galvanize Sept 2015
Barga Galvanize Sept 2015
 
Doing Analytics Right - Building the Analytics Environment
Doing Analytics Right - Building the Analytics EnvironmentDoing Analytics Right - Building the Analytics Environment
Doing Analytics Right - Building the Analytics Environment
 
The Sky’s the Limit – The Rise of Machine Learnin
The Sky’s the Limit – The Rise of Machine LearninThe Sky’s the Limit – The Rise of Machine Learnin
The Sky’s the Limit – The Rise of Machine Learnin
 
Time Difference: How Tomorrow's Companies Will Outpace Today's
Time Difference: How Tomorrow's Companies Will Outpace Today'sTime Difference: How Tomorrow's Companies Will Outpace Today's
Time Difference: How Tomorrow's Companies Will Outpace Today's
 
Algorithm Marketplace and the new "Algorithm Economy"
Algorithm Marketplace and the new "Algorithm Economy"Algorithm Marketplace and the new "Algorithm Economy"
Algorithm Marketplace and the new "Algorithm Economy"
 
8 minute intro to data science
8 minute intro to data science 8 minute intro to data science
8 minute intro to data science
 
Big data in Private Banking
Big data in Private BankingBig data in Private Banking
Big data in Private Banking
 
A step towards machine learning at accionlabs
A step towards machine learning at accionlabsA step towards machine learning at accionlabs
A step towards machine learning at accionlabs
 
[Data Meetup] Data Science in Finance - Building a Quant ML pipeline
[Data Meetup] Data Science in Finance -  Building a Quant ML pipeline[Data Meetup] Data Science in Finance -  Building a Quant ML pipeline
[Data Meetup] Data Science in Finance - Building a Quant ML pipeline
 
Fst auckland presentation
Fst auckland presentationFst auckland presentation
Fst auckland presentation
 
Gala Webminar September 2013
Gala Webminar September 2013Gala Webminar September 2013
Gala Webminar September 2013
 
predictive analysis and usage in procurement ppt 2017
predictive analysis and usage in procurement  ppt 2017predictive analysis and usage in procurement  ppt 2017
predictive analysis and usage in procurement ppt 2017
 
If You Are Not Embedding Analytics Into Your Day To Day Processes, You Are Do...
If You Are Not Embedding Analytics Into Your Day To Day Processes, You Are Do...If You Are Not Embedding Analytics Into Your Day To Day Processes, You Are Do...
If You Are Not Embedding Analytics Into Your Day To Day Processes, You Are Do...
 
final oracle presentation
final oracle presentationfinal oracle presentation
final oracle presentation
 
A6 big data_in_the_cloud
A6 big data_in_the_cloudA6 big data_in_the_cloud
A6 big data_in_the_cloud
 
Data mining and Machine learning expained in jargon free & lucid language
Data mining and Machine learning expained in jargon free & lucid languageData mining and Machine learning expained in jargon free & lucid language
Data mining and Machine learning expained in jargon free & lucid language
 
Transformacion del Negocio Financiero por medio de Tecnologias Cloud
Transformacion del Negocio Financiero por medio de Tecnologias CloudTransformacion del Negocio Financiero por medio de Tecnologias Cloud
Transformacion del Negocio Financiero por medio de Tecnologias Cloud
 
AI for Software Engineering
AI for Software EngineeringAI for Software Engineering
AI for Software Engineering
 
Democratization - New Wave of Data Science (홍운표 상무, DataRobot) :: AWS Techfor...
Democratization - New Wave of Data Science (홍운표 상무, DataRobot) :: AWS Techfor...Democratization - New Wave of Data Science (홍운표 상무, DataRobot) :: AWS Techfor...
Democratization - New Wave of Data Science (홍운표 상무, DataRobot) :: AWS Techfor...
 
Machine Learning For Stock Broking
Machine Learning For Stock BrokingMachine Learning For Stock Broking
Machine Learning For Stock Broking
 

Último

AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...apidays
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...apidays
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdfSandro Moreira
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxRustici Software
 
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUKSpring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUKJago de Vreede
 
Cyberprint. Dark Pink Apt Group [EN].pdf
Cyberprint. Dark Pink Apt Group [EN].pdfCyberprint. Dark Pink Apt Group [EN].pdf
Cyberprint. Dark Pink Apt Group [EN].pdfOverkill Security
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfOrbitshub
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDropbox
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Victor Rentea
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Zilliz
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Victor Rentea
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...apidays
 

Último (20)

AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUKSpring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
 
Cyberprint. Dark Pink Apt Group [EN].pdf
Cyberprint. Dark Pink Apt Group [EN].pdfCyberprint. Dark Pink Apt Group [EN].pdf
Cyberprint. Dark Pink Apt Group [EN].pdf
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 

SystemT: Declarative Information Extraction (invited talk at MIT CSAIL)

  • 1. SystemT: an Algebraic Approach to Declarative Information Extraction Laura Chiticariu IBM Research | Almaden Talk at MIT Computer Science and Artificial Intelligence Laboratory 03/08/2016
  • 2. Outline • Emerging Applications: Case Studies • Enterprise Requirements • Challenges in Cascading Finite State Automata • SystemT : an Algebraic approach – The AQL language – Performance – Transparency – Multilingual Support – Machine Learning – Tooling • Summary 2
  • 3. Case Study 1: Sentiment Analysis Text Analytics Product catalog, Customer Master Data, … Social Media • Products Interests360o Profile • Relationships • Personal Attributes • Life Events Statistical Analysis, Report Gen. Bank 6 23% Bank 5 20% Bank 4 21% Bank 3 5% Bank 2 15% Bank 1 15% ScaleScaleScaleScale 450M+ tweets a day, 100M+ consumers, … BreadthBreadthBreadthBreadth Buzz, intent, sentiment, life events, personal atts, … ComplexityComplexityComplexityComplexity Sarcasm, wishful thinking, … Customer 360º 3
  • 4. Case Study 2: Machine Analysis Web Servers Application Servers Transaction Processing Monitors Database #1 Database #2 Log File Log File Log File Log File Log File Log File Log File Log File Log File Log File Log File Log File Log File Log File Log File Log File Log File Log File Log FileLog File DB #2 Log File DB #2 Log File • Web site with multi-tier architecture • Every component produces its own system logs • An error shows up in the log for Database #2 • What sequence of events ledWhat sequence of events ledWhat sequence of events ledWhat sequence of events led to this error?to this error?to this error?to this error? 12:34:56 SQL ERROR 43251: Table CUST.ORDERWZ is not Operations Analysis 4
  • 5. Case Study 2: Machine Analysis Web Server Logs CustomCustom Application Logs Database Logs Raw Logs Parsed Log Records Parse Extract Linkage Information End-to-End Application Sessions Integrate Operations Analysis Graphical Models (Anomaly detection) Analyze 5 min 0.85 CauseCause EffectEffect DelayDelay CorrelationCorrelation 10 min 0.62 Correlation Analysis • Every customer has unique components with unique log record formats • Need to quickly customize all stages of analysis for these custom log data sources 5
  • 6. Outline • Emerging Applications: Case Studies • Enterprise Requirements • Challenges in Cascading Finite State Automata • SystemT : an Algebraic approach – The AQL language – Performance – Transparency – Multilingual Support – Machine Learning – Tooling • Summary 6
  • 7. Text Analytics vs Information Extraction 7 … hotel close to Seaworld in Orlando? NNS JJ IN NNP IN NNP POS Tagging Text Normalization Oh wow. Holy s***. Welp. Ima go see Godzilla in 3 months. F*** yes I am going to go see Godzilla in 3 months. We are neutral on Malaysian banks. Dependency Parsing Information Extraction Clustering Classification Regression Applications 7
  • 8. Enterprise Requirements • ExpressivityExpressivityExpressivityExpressivity – Need to express complex NLP algorithms for a variety of tasks and data sources • ScalabilityScalabilityScalabilityScalability – Large data volumes, often orders of magnitude larger than classical NLP corpora • Social Media: Twitter alone has 500M+ messages / day; 1TB+ per day • Financial Data: SEC alone has 20M+ filings, several TBs of data, with documents range from few KBs to few MBs • Machine Data: One application server under moderate load at medium logging level 1GB of logs per day • TransparencyTransparencyTransparencyTransparency – Every customer’s data and problems are unique in some way – Need to easily comprehendcomprehendcomprehendcomprehend, debugdebugdebugdebug and enhanceenhanceenhanceenhance extractors 8
  • 9. Expressivity Example: Different Kinds of Parses We are raising our tablet forecast. S are NP We S raising NP forecastNP tablet DET our subj obj subj pred Natural Language Machine Log Dependency Tree Oct 1 04:12:24 9.1.1.3 41865: %PLATFORM_ENV-1-DUAL_PWR: Faulty internal power supply B detected Time Oct 1 04:12:24 Host 9.1.1.3 Process 41865 Category %PLATFORM_ENV-1- DUAL_PWR Message Faulty internal power supply B detected 9
  • 10. 1010 Expressivity Example: Fact Extraction (Tables) Singapore 2012 Annual Report (136 pages PDF) Identify note breaking down Operating expenses line item, and extract opex components Identify line item for Operating expenses from Income statement (financial table in pdf document)
  • 11. Expressivity Example: Sentiment Analysis Mcdonalds mcnuggets are fake as shit but they so delicious. We should do something cool like go to ZZZZ (kiddingkiddingkiddingkidding). Makin chicken fries at home bc everyone sucks! You are never too old for Disney movies. Bank X got me ****ed up today! Not a pleasant client experience. Please fix ASAP. I'm still hearing from clients that Company A's website is better. X... fixing something that wasn't broken Intel's 2013 capex is elevated at 23% of sales, above average of 16% IBM announced 4Q2012 earnings of $5.13 per share, compared with 4Q2011 earnings of $4.62 per share, an increase of 11 percent We continue to rate shares of MSFT neutral. Sell EUR/CHF at market for a decline to 1.31000… FHLMC reported $4.4bn net loss and requested $6bn in capital from Treasury. Customer SurveysCustomer SurveysCustomer SurveysCustomer Surveys Social MediaSocial MediaSocial MediaSocial Media Analyst ResearchAnalyst ResearchAnalyst ResearchAnalyst Research ReportsReportsReportsReports 11
  • 12. Transparency Example: Need to incorporate in-situ data Intel's 2013 capex is elevated at 23% of sales, above average of 16%Intel's 2013 capex is elevated at 23% of sales, above average of 16% FHLMC reported $4.4bn net loss and requested $6bn in capital from Treasury.FHLMC reported $4.4bn net loss and requested $6bn in capital from Treasury. I'm still hearing from clients that Merrill's website is better.I'm still hearing from clients that Merrill's website is better. Customer or competitor? Good or bad? Entity of interest I need to go back to Walmart, Toys R Us has the sameI need to go back to Walmart, Toys R Us has the same toy $10 cheaper! 12
  • 13. A Brief History of IE • 1978-1997: MUC (Message Understanding Conference) – DARPA competition 1987 to 1997 – FRUMP [DeJong82] – FASTUS [Appelt93], – TextPro, PROTEUS • 1998: Common Pattern Specification Language (CPSL) standard [Appelt98] – Standard for subsequent rule- based systems • 1999-present: Commercial products, GATE • At first: Simple techniques like Naive Bayes • 1990’s: Learning Rules – AUTOSLOG [Riloff93] – CRYSTAL [Soderland98] – SRV [Freitag98] • 2000’s: More specialized models – Maximum Entropy Models [Berger96] – Hidden Markov Models [Leek97] – Maximum Entropy Markov Models [McCallum00] – Conditional Random Fields [Lafferty01] – Automatic feature expansion RuleRuleRuleRule----BasedBasedBasedBased Machine LearningMachine LearningMachine LearningMachine Learning 13
  • 14. Real Systems: A Practical Perspective • Entity extraction • EMNLP, ACL, NAACL, 2003- 2012 • 54 industrial vendors (Who’s Who in Text Analytics, 2012) [Chiticariu, Li, Reiss, EMNLP’13] Fast development, fast adaptation, better results in limited time, sophistication Easy to comprehend, maintain, debug & optimize performance; lower reliance on labeled data 14
  • 15. Background: The SystemT Project • Early 2000’s: NLP group starts at IBM Research – Almaden • Initial focus: Collection-level machine learning problems • Observation: Most time spent on feature extraction – Technology used: Java, then Cascading Finite State Automata 15
  • 16. Outline • Emerging Applications: Case Studies • Enterprise Requirements • Challenges in Cascading Finite State Automata • SystemT : an Algebraic approach – The AQL language – Performance – Transparency – Multilingual Support – Machine Learning – Tooling • Summary 16
  • 17. Finite-state Automata • Common formalism underlying most rule-based IE systems – Input text viewed as a sequence of tokens – Rules expressed as regular expression patterns over the lexical features of these tokens • Several levels of processing Cascading Automata – Typically, at higher levels of the grammar, larger segments of text are analyzed and annotated • Common Pattern Specification Language (CPSL) – Standard to specify and represent cascading finite state automata – Each automata accepts a sequence of annotations and outputs a sequence of annotations 17
  • 18. Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Proin, in sagittis facilisis, volutpat dapibus, ultrices sit amet, sem , volutpat dapibus, ultrices sit amet, sem Tomorrow, we will meet Mark Scott, Howard Smith and amet lt arcu tincidunt orci. Pellentesque justo tellus , scelerisque quis, facilisis nunc volutpat enim, quis viverra lacus nulla sit lectus. Curabitur cursus tincidunt orci. Pellentesque justo tellus , scelerisque quis, facilisis quis, interdum non, ante. Suspendisse Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Proin elementum neque at justo. Aliquam erat volutpat. Curabitur a massa. Vivamus luctus, risus in sagittis facilisis arcu Tomorrow, we will meet Mark Scott, Howard Smith and hendrerit faucibus pede mi ipsum. Curabitur cursus tincidunt orci. Pellentesque justo tellus , scelerisque quis, facilisis quis, interdum non, ante. Suspendisse feugiat, erat in Cascading Automata Example: Person Extractor Tokenization (preprocessing step) Level 1 〈Gazetteer〉[type = LastGaz] 〈Last〉 〈Gazetteer〉[type = FirstGaz] 〈First〉 〈Token〉[~ “[A-Z]w+”] 〈Caps〉 Rule priority used to prefer First over Caps • Rule priority used to prefer First over Caps. • Lossy Sequencing: annotations dropped because input to next stage must be a sequence – First preferred over Last since it was declared earlier 18
  • 19. Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Proin, in sagittis facilisis, volutpat dapibus, ultrices sit amet, sem , volutpat dapibus, ultrices sit amet, sem Tomorrow, we will meet Mark Scott, Howard Smith and amet lt arcu tincidunt orci. Pellentesque justo tellus , scelerisque quis, facilisis nunc volutpat enim, quis viverra lacus nulla sit lectus. Curabitur cursus tincidunt orci. Pellentesque justo tellus , scelerisque quis, facilisis quis, interdum non, ante. Suspendisse Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Proin elementum neque at justo. Aliquam erat volutpat. Curabitur a massa. Vivamus luctus, risus in sagittis facilisis arcu Tomorrow, we will meet Mark Scott, Howard Smith and hendrerit faucibus pede mi ipsum. Curabitur cursus tincidunt orci. Pellentesque justo tellus , scelerisque quis, facilisis quis, interdum non, ante. Suspendisse feugiat, erat in Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Proin elementum neque at justo. Aliquam erat volutpat. Curabitur a massa. Vivamus luctus, risus in e sagittis Tomorrow, we will meet Mark Scott, Howard Smith and hendrerit faucibus pede mi sed ipsum. Curabitur cursus tincidunt orci. Pellentesque justo tellus , scelerisque quis, facilisis quis, interdum non, ante. Suspendisse feugiat, erat in feugiat tincidunt, est nunc volutpat enim, quis viverra lacus nulla sit amet lectus. Nulla odio lorem, feugiat et, volutpat dapibus, ultrices sit amet, sem. Vestibulum quis dui vitae massa euismod faucibus. Pellentesque id neque id tellus hendrerit tincidunt. Etiam augue. Class aptent Cascading Automata Example: Person Extractor Tokenization (preprocessing step) Level 1 〈Gazetteer〉[type = LastGaz] 〈Last〉 〈Gazetteer〉[type = FirstGaz] 〈First〉 〈Token〉[~ “[A-Z]w+”] 〈Caps〉 Level 2 〈First〉 〈Last〉 〈Person〉 〈First〉 〈Caps〉 〈Person〉 〈First〉 〈Person〉 Rigid Rule Priority and Lossy Sequencing in Level 1 caused partial results 19
  • 20. Problems with Cascading Automata • Scalability: Redundant passes over document • Expressivity: Frequent use of custom code • Transparency • Ease of comprehension • Ease of debugging • Ease of enhancement Operational semantics + custom code = no provenance 20
  • 21. Bringing Transparency to Feature Extraction • Our approachOur approachOur approachOur approach: Use a declarative language – Decouple meaning of extraction rules from execution plan • Our language:Our language:Our language:Our language: AQL (Annotator Query Language) – Semantics based on relational calculus – Syntax based on SQL 21
  • 22. Outline • Emerging Applications: Case Studies • Enterprise Requirements • Challenges in Cascading Finite State Automata • SystemT : an Algebraic approach – The AQL language – Performance – Transparency – Multilingual Support – Machine Learning – Tooling • Summary 22
  • 23. Document text: String Person last: Spanfirst: Span fullname: Span AQL Data Model (Simplified) • Relational data modelRelational data modelRelational data modelRelational data model: data is organized in: data is organized in: data is organized in: data is organized in tuples; tuples have a; tuples have a; tuples have a; tuples have a schema • SpecialSpecialSpecialSpecial data types necessarydata types necessarydata types necessarydata types necessary for text processing:for text processing:for text processing:for text processing: – Document consists of a single texttexttexttext attribute – Annotations are represented by a type called SpanSpanSpanSpan, which consists of beginbeginbeginbegin, endendendend and documentdocumentdocumentdocument attribute 23
  • 24. create view FirstCaps as select CombineSpans(F.name, C.name) as name from First F, Caps C where FollowsTok(F.name, C.name, 0, 0); <First> <Caps> 0 tokens • Declarative: Specify logical conditions that input tuples should satisfy in order to generate an output tuple • Choice of SQL-like syntax for AQL motivated by wider adoption of SQL • Compiles into SystemT algebra AQL By Example 24
  • 25. create view Person as select S.name as name from ( ( select CombineSpans(F.name, C.name) as name from First F, Caps C where FollowsTok(F.name, C.name, 0, 0)) union all ( select CombineSpans(F.name, L.name) as name from First F, Last L where FollowsTok(F.name, L.name, 0, 0)) union all ( select * from First F ) ) S consolidate on name; <First><Caps> <First><Last> <First> Revisiting the Person Example 25
  • 26. create view Person as select S.name as name from ( ( select CombineSpans(F.name, C.name) as name from First F, Caps C where FollowsTok(F.name, C.name, 0, 0)) union all ( select CombineSpans(F.name, L.name) as name from First F, Last L where FollowsTok(F.name, L.name, 0, 0)) union all ( select * from First F ) ) S consolidate on name; Explicit clause for resolving ambiguity (No Rigid Priority problem) Input may contain overlapping annotations (No Lossy Sequencing problem) Revisiting the Person Example 26
  • 27. Compiling and Executing AQL AQL Language Optimizer Operator Graph Specify extractor semantics declaratively (express logic of computation, not control flow) Choose efficient execution plan that implements semantics Optimized execution plan executed at runtime 27
  • 28. Regular Expression Extraction Operator 28 [A-Z][a-z]+ DocumentInput Tuple we will meet Mark Scott and Output Tuple 2 Span 2Document Span 1Output Tuple 1 Document Regex
  • 29. Expressivity: Rich Set of Operators Deep Syntactic Parsing ML Training & Scoring Core Operators Tokenization Parts of Speech Dictionaries Regular Expressions Span Operations Relational Operations Semantic Role Labels Language to express NLP Algorithms AQL . Aggregation Operations 29
  • 30. package com.ibm.avatar.algebra.util.sentence; import java.io.BufferedWriter; import java.util.ArrayList; import java.util.HashSet; import java.util.regex.Matcher; public class SentenceChunker { private Matcher sentenceEndingMatcher = null; public static BufferedWriter sentenceBufferedWriter = null; private HashSet<String> abbreviations = new HashSet<String> (); public SentenceChunker () { } /** Constructor that takes in the abbreviations directly. */ public SentenceChunker (String[] abbreviations) { // Generate the abbreviations directly. for (String abbr : abbreviations) { this.abbreviations.add (abbr); } } /** * @param doc the document text to be analyzed * @return true if the document contains at least one sentence boundary */ public boolean containsSentenceBoundary (String doc) { String origDoc = doc; /* * Based on getSentenceOffsetArrayList() */ // String origDoc = doc; // int dotpos, quepos, exclpos, newlinepos; int boundary; int currentOffset = 0; do { /* Get the next tentative boundary for the sentenceString */ setDocumentForObtainingBoundaries (doc); boundary = getNextCandidateBoundary (); if (boundary != -1) {doc.substring (0, boundary + 1); String remainder = doc.substring (boundary + 1); String candidate = /* * Looks at the last character of the String. If this last * character is part of an abbreviation (as detected by * REGEX) then the sentenceString is not a fullSentence and * "false” is returned */ // while (!(isFullSentence(candidate) && // doesNotBeginWithCaps(remainder))) { while (!(doesNotBeginWithPunctuation (remainder) && isFullSentence (candidate))) { /* Get the next tentative boundary for the sentenceString */ int nextBoundary = getNextCandidateBoundary (); if (nextBoundary == -1) { break; } boundary = nextBoundary; candidate = doc.substring (0, boundary + 1); remainder = doc.substring (boundary + 1); } if (candidate.length () > 0) { // sentences.addElement(candidate.trim().replaceAll("n", " // ")); // sentenceArrayList.add(new Integer(currentOffset + boundary // + 1)); // currentOffset += boundary + 1; // Found a sentence boundary. If the boundary is the last // character in the string, we don't consider it to be // contained within the string. int baseOffset = currentOffset + boundary + 1; if (baseOffset < origDoc.length ()) { // System.err.printf("Sentence ends at %d of %dn", // baseOffset, origDoc.length()); return true; } else { return false; } } // origDoc.substring(0,currentOffset)); // doc = doc.substring(boundary + 1); doc = remainder; } } while (boundary != -1); // If we get here, didn't find any boundaries. return false; } public ArrayList<Integer> getSentenceOffsetArrayList (String doc) { ArrayList<Integer> sentenceArrayList = new ArrayList<Integer> (); // String origDoc = doc; // int dotpos, quepos, exclpos, newlinepos; int boundary; int currentOffset = 0; sentenceArrayList.add (new Integer (0)); do { /* Get the next tentative boundary for the sentenceString */ setDocumentForObtainingBoundaries (doc); boundary = getNextCandidateBoundary (); if (boundary != -1) { String candidate = doc.substring (0, boundary + 1); String remainder = doc.substring (boundary + 1); /* * Looks at the last character of the String. If this last character * is part of an abbreviation (as detected by REGEX) then the * sentenceString is not a fullSentence and "false" is returned */ // while (!(isFullSentence(candidate) && // doesNotBeginWithCaps(remainder))) { while (!(doesNotBeginWithPunctuation (remainder) && isFullSentence (candidate))) { /* Get the next tentative boundary for the sentenceString */ int nextBoundary = getNextCandidateBoundary (); if (nextBoundary == -1) { break; } boundary = nextBoundary; candidate = doc.substring (0, boundary + 1); remainder = doc.substring (boundary + 1); } if (candidate.length () > 0) { sentenceArrayList.add (new Integer (currentOffset + boundary + 1)); currentOffset += boundary + 1; } // origDoc.substring(0,currentOffset)); doc = remainder; } } while (boundary != -1); if (doc.length () > 0) { sentenceArrayList.add (new Integer (currentOffset + doc.length ())); } sentenceArrayList.trimToSize (); return sentenceArrayList; } private void setDocumentForObtainingBoundaries (String doc) { sentenceEndingMatcher = SentenceConstants. sentenceEndingPattern.matcher (doc); } private int getNextCandidateBoundary () { if (sentenceEndingMatcher.find ()) { return sentenceEndingMatcher.start (); } else return -1; } private boolean doesNotBeginWithPunctuation (String remainder) { Matcher m = SentenceConstants.punctuationPattern.matcher (remainder); return (!m.find ()); } private String getLastWord (String cand) { Matcher lastWordMatcher = SentenceConstants.lastWordPattern.matcher (cand); if (lastWordMatcher.find ()) { return lastWordMatcher.group (); } else { return ""; } } /* * Looks at the last character of the String. If this last character is * par of an abbreviation (as detected by REGEX) * then the sentenceString is not a fullSentence and "false" is returned */ private boolean isFullSentence (String cand) { // cand = cand.replaceAll("n", " "); cand = " " + cand; Matcher validSentenceBoundaryMatcher = SentenceConstants.validSentenceBoundaryPattern.matcher (cand); if (validSentenceBoundaryMatcher.find ()) return true; Matcher abbrevMatcher = SentenceConstants.abbrevPattern.matcher (cand); if (abbrevMatcher.find ()) { return false; // Means it ends with an abbreviation } else { // Check if the last word of the sentenceString has an entry in the // abbreviations dictionary (like Mr etc.) String lastword = getLastWord (cand); if (abbreviations.contains (lastword)) { return false; } } return true; } } Java Implementation of Sentence Boundary Detection Expressivity: AQL vs. Custom Java Code 30
  • 31. package com.ibm.avatar.algebra.util.sentence; import java.io.BufferedWriter; import java.util.ArrayList; import java.util.HashSet; import java.util.regex.Matcher; public class SentenceChunker { private Matcher sentenceEndingMatcher = null; public static BufferedWriter sentenceBufferedWriter = null; private HashSet<String> abbreviations = new HashSet<String> (); public SentenceChunker () { } /** Constructor that takes in the abbreviations directly. */ public SentenceChunker (String[] abbreviations) { // Generate the abbreviations directly. for (String abbr : abbreviations) { this.abbreviations.add (abbr); } } /** * @param doc the document text to be analyzed * @return true if the document contains at least one sentence boundary */ public boolean containsSentenceBoundary (String doc) { String origDoc = doc; /* * Based on getSentenceOffsetArrayList() */ // String origDoc = doc; // int dotpos, quepos, exclpos, newlinepos; int boundary; int currentOffset = 0; do { /* Get the next tentative boundary for the sentenceString */ setDocumentForObtainingBoundaries (doc); boundary = getNextCandidateBoundary (); if (boundary != -1) {doc.substring (0, boundary + 1); String remainder = doc.substring (boundary + 1); String candidate = /* * Looks at the last character of the String. If this last * character is part of an abbreviation (as detected by * REGEX) then the sentenceString is not a fullSentence and * "false” is returned */ // while (!(isFullSentence(candidate) && // doesNotBeginWithCaps(remainder))) { while (!(doesNotBeginWithPunctuation (remainder) && isFullSentence (candidate))) { /* Get the next tentative boundary for the sentenceString */ int nextBoundary = getNextCandidateBoundary (); if (nextBoundary == -1) { break; } boundary = nextBoundary; candidate = doc.substring (0, boundary + 1); remainder = doc.substring (boundary + 1); } if (candidate.length () > 0) { // sentences.addElement(candidate.trim().replaceAll("n", " // ")); // sentenceArrayList.add(new Integer(currentOffset + boundary // + 1)); // currentOffset += boundary + 1; // Found a sentence boundary. If the boundary is the last // character in the string, we don't consider it to be // contained within the string. int baseOffset = currentOffset + boundary + 1; if (baseOffset < origDoc.length ()) { // System.err.printf("Sentence ends at %d of %dn", // baseOffset, origDoc.length()); return true; } else { return false; } } // origDoc.substring(0,currentOffset)); // doc = doc.substring(boundary + 1); doc = remainder; } } while (boundary != -1); // If we get here, didn't find any boundaries. return false; } public ArrayList<Integer> getSentenceOffsetArrayList (String doc) { ArrayList<Integer> sentenceArrayList = new ArrayList<Integer> (); // String origDoc = doc; // int dotpos, quepos, exclpos, newlinepos; int boundary; int currentOffset = 0; sentenceArrayList.add (new Integer (0)); do { /* Get the next tentative boundary for the sentenceString */ setDocumentForObtainingBoundaries (doc); boundary = getNextCandidateBoundary (); if (boundary != -1) { String candidate = doc.substring (0, boundary + 1); String remainder = doc.substring (boundary + 1); /* * Looks at the last character of the String. If this last character * is part of an abbreviation (as detected by REGEX) then the * sentenceString is not a fullSentence and "false" is returned */ // while (!(isFullSentence(candidate) && // doesNotBeginWithCaps(remainder))) { while (!(doesNotBeginWithPunctuation (remainder) && isFullSentence (candidate))) { /* Get the next tentative boundary for the sentenceString */ int nextBoundary = getNextCandidateBoundary (); if (nextBoundary == -1) { break; } boundary = nextBoundary; candidate = doc.substring (0, boundary + 1); remainder = doc.substring (boundary + 1); } if (candidate.length () > 0) { sentenceArrayList.add (new Integer (currentOffset + boundary + 1)); currentOffset += boundary + 1; } // origDoc.substring(0,currentOffset)); doc = remainder; } } while (boundary != -1); if (doc.length () > 0) { sentenceArrayList.add (new Integer (currentOffset + doc.length ())); } sentenceArrayList.trimToSize (); return sentenceArrayList; } private void setDocumentForObtainingBoundaries (String doc) { sentenceEndingMatcher = SentenceConstants. sentenceEndingPattern.matcher (doc); } private int getNextCandidateBoundary () { if (sentenceEndingMatcher.find ()) { return sentenceEndingMatcher.start (); } else return -1; } private boolean doesNotBeginWithPunctuation (String remainder) { Matcher m = SentenceConstants.punctuationPattern.matcher (remainder); return (!m.find ()); } private String getLastWord (String cand) { Matcher lastWordMatcher = SentenceConstants.lastWordPattern.matcher (cand); if (lastWordMatcher.find ()) { return lastWordMatcher.group (); } else { return ""; } } /* * Looks at the last character of the String. If this last character is * par of an abbreviation (as detected by REGEX) * then the sentenceString is not a fullSentence and "false" is returned */ private boolean isFullSentence (String cand) { // cand = cand.replaceAll("n", " "); cand = " " + cand; Matcher validSentenceBoundaryMatcher = SentenceConstants.validSentenceBoundaryPattern.matcher (cand); if (validSentenceBoundaryMatcher.find ()) return true; Matcher abbrevMatcher = SentenceConstants.abbrevPattern.matcher (cand); if (abbrevMatcher.find ()) { return false; // Means it ends with an abbreviation } else { // Check if the last word of the sentenceString has an entry in the // abbreviations dictionary (like Mr etc.) String lastword = getLastWord (cand); if (abbreviations.contains (lastword)) { return false; } } return true; } } Java Implementation of Sentence Boundary Detection Expressivity: AQL vs. Custom Java Code create dictionary AbbrevDict from file 'abbreviation.dict’; create view SentenceBoundary as select R.match as boundary from ( extract regex /(([.?!]+s)|(ns*n))/ on D.text as match from Document D ) R where Not(ContainsDict('AbbrevDict', CombineSpans(LeftContextTok(R.match, 1),R.match))); Equivalent AQL Implementation 31
  • 32. Expressivity: AQL vs CPSL • [Fagin et al., PODS ’13] – Formal modeling of core subset of code-free CPSL (regular spanners) and AQL (core spanners) – Core spanners strictly more expressive than regular spanners • [Fagin et al., PODS ‘14] – Unified framework for declarative cleaning (i.e., resolving overlap) – Cleaning increases expressiveness in general • But not for SystemT consolidate policies, nor for controls in JAPE (CPSL implementation in GATE) 32
  • 33. Outline • Emerging Applications: Case Studies • Enterprise Requirements • Challenges in Cascading Finite State Automata • SystemT : an Algebraic approach – The AQL language – Performance – Transparency – Multilingual Support – Machine Learning – Tooling • Summary 33
  • 34. Scalability • Performance issues with cascading automata – Complete pass through tokens for each cascade – Many of these passes are wasted work • Dominant approach: Make each pass go faster – Doesn’t solve root problem! • Algebraic approach: Build a query optimizer! 34
  • 35. Scalability: Class of Optimizations in SystemT • RewriteRewriteRewriteRewrite----basedbasedbasedbased: rewrite algebraic operator graph – Shared Dictionary Matching – Shared Regular Expression Evaluation – On-demand tokenization • CostCostCostCost----basedbasedbasedbased: relies on novel selectivity estimation for text- specific operators – Standard transformations • E.g., push down selections – Restricted Span Evaluation • Evaluate expensive operators on restricted regions of the document 35 Tokenization overhead is paid only once First (followed within 0 tokens) Plan C Plan A Join Caps Restricted Span Evaluation Plan B First Identify Caps starting within 0 tokensExtract text to the right Caps Identify First ending within 0 tokens Extract text to the left
  • 36. Performance Comparison (with ANNIE) 0 100 200 300 400 500 600 700 0 20 40 60 80 100 Average document size (KB) Throughput(KB/sec) Open Source Entity Tagger SystemT ANNIE TaskTaskTaskTask: Named Entity DatasetDatasetDatasetDataset : Different document collections from the Enron corpus obtained by randomly sampling 1000 documents for each size 10~50x faster [Chiticariu et al., ACL’10] 36
  • 37. [Chiticariu et al., ACL’10] Performance Comparison on Larger Documents Dataset Document Size Throughput (KB/sec) Average Memory (MB) Range Average ANNIE SystemT ANNIE SystemT Web Crawl 68 B – 388 KB 8.8 KB 42.8 498.8 201.8 77.2 Medium SEC Filings 240 KB – 0.9 MB 401 KB 26.3 703.5 601.8 143.7 Large SEC Flings 1 MB – 3.4 MB 1.54 MB 21.1 954.5 2683.5 189.6 Datasets : Web crawl and filings from the Securities and Exchanges Commission (SEC) Throughput benefits carryover for wide-variety of document sizes Much lower memory footprint 37
  • 38. [Chiticariu et al., ACL’10] Performance Comparison on Larger Documents Dataset Document Size Throughput (KB/sec) Average Memory (MB) Range Average ANNIE SystemT ANNIE SystemT Web Crawl 68 B – 388 KB 8.8 KB 42.8 498.8 201.8 77.2 Medium SEC Filings 240 KB – 0.9 MB 401 KB 26.3 703.5 601.8 143.7 Large SEC Flings 1 MB – 3.4 MB 1.54 MB 21.1 954.5 2683.5 189.6 Datasets : Web crawl and filings from the Securities and Exchanges Commission (SEC) Throughput benefits carryover for wide-variety of document sizes Much lower memory footprint Theorem:Theorem:Theorem:Theorem: For any acyclic tokenFor any acyclic tokenFor any acyclic tokenFor any acyclic token----based FSTbased FSTbased FSTbased FST TTTT,,,, there exists an operator graphthere exists an operator graphthere exists an operator graphthere exists an operator graph GGGG such that evaluatingsuch that evaluatingsuch that evaluatingsuch that evaluating TTTT andandandand GGGG has the same computational complexityhas the same computational complexityhas the same computational complexityhas the same computational complexity 38
  • 39. SystemT Acceleration on FPGA [Atasu et al. FPL’13, Polig et al., FPL’14] 39
  • 40. Outline • Emerging Applications: Case Studies • Enterprise Requirements • Challenges in Cascading Finite State Automata • SystemT : an Algebraic approach – The AQL language – Performance – Transparency – Multilingual Support – Machine Learning – Tooling • Summary 40
  • 41. How AQL Solved our Problems • Expressivity: Complex tasks, no custom code • Scalability: Cost-based query optimization • Transparency • Ease of comprehension • Ease of debugging • Ease of enhancement Clear and Simple Provenance 41
  • 42. 42 Computing Provenance PersonPhone rule: insert into PersonPhone select Merge(F.match, P.match) as match from Person F, Phone P where Follows(F.match, P.match, 0, 60); match Anna at James St. office (555-5555 James St. office (555-5555 PersonPhone Provenance: Explains output data in terms of the input data, the intermediate data, and the transformation (e.g., SQL query, ETL, workflow) – Surveys: [Davidson & Freire, SIGMOD 2008] [Cheney et al., Found. Databases 2009] For predicate-based rule languages (e.g., SQL), can be computed automatically! Person PhonePerson Anna at James St. office (555-5555) . [Liu et al., VLDB’10]
  • 43. 43 Computing Provenance Rewritten PersonPhone rule: insert into PersonPhone select Merge(F.match, P.match) as match, GenerateID() as ID, P.id as nameProv, Ph.id as numberProv ‘AND’ as how from Person F, Phone P where Follows(F.match, P.match, 0, 60); Person PhonePerson Anna at James St. office (555-5555) . ID: 1 ID: 2 ID: 3 1 3AND 2 3AND match Anna at James St. office (555-5555 James St. office (555-5555 PersonPhone Provenance Provenance: Explains output data in terms of the input data, the intermediate data, and the transformation (e.g., SQL query, ETL, workflow) – Surveys: [Davidson & Freire, SIGMOD 2008] [Cheney et al., Found. Databases 2009] For predicate-based rule languages (e.g., SQL), can be computed automatically! [Liu et al., VLDB’10]
  • 44. AQL: Going beyond Feature Extraction Dataset Entity Type System Precision Recall F-measure CoNLL 2003 Location SystemT 93.11 91.61 92.35 Florian 90.59 91.73 91.15 Organization SystemT 92.25 85.31 88.65 Florian 85.93 83.44 84.67 Person SystemT 96.32 92.39 94.32 Florian 92.49 95.24 93.85 Enron Person SystemT 87.27 81.82 84.46 Minkov 81.1 74.9 77.9 ExtractionExtractionExtractionExtraction Task:Task:Task:Task: Named-entity extraction SystemsSystemsSystemsSystems compared:compared:compared:compared: SystemT (customized) vs. [Florian et al.’03] [Minkov et al.’05] [Chiticariu et al., EMNLP’10] Transparency without machine learning outperforms machine learning without transparency. 44
  • 45. Outline • Emerging Applications: Case Studies • Enterprise Requirements • Challenges in Cascading Finite State Automata • SystemT : an Algebraic approach – The AQL language – Performance – Transparency – Multilingual Support – Machine Learning – Tooling • Summary 45
  • 46. Multilingual Support Tokenization • 26+ languages • All other languages supported via whitespace/punctuation tokenizer Part of Speech and Lemmatization • 18+ languages, including English and Western languages, Arabic, Russian, CJK Semantic Role Labeling • English • Ongoing work on Multilingual SRL [Akbik et al, ACL’15] Annotator Libraries • Out-of-the-box customizable libraries for multiple applications and data sources in multiple languages 46
  • 47. Outline • Emerging Applications: Case Studies • Enterprise Requirements • Challenges in Cascading Finite State Automata • SystemT : an Algebraic approach – The AQL language – Performance – Transparency – Multilingual Support – Machine Learning – Tooling • Summary 47
  • 48. Machine Learning in SystemT • AQL provides a foundation of transparency • Next step: Add machine learning without losing transparency • Major machine learning efforts: – Embeddable Models in AQL – Learning using AQL as target language 48
  • 49. Machine Learning in SystemT • AQL provides a foundation of transparency • Next step: Add machine learning without losing transparency • Major machine learning efforts: – Embeddable Models in AQL • Model Training and Scoring integration with SystemML • Deep Parsing & Semantic Role Labeling [Akbik et al., ACL’15] • Text Normalization [Zhang et al. ACL’13, Li & Baldwin, NAACL ‘15] – Learning using AQL as target language 49
  • 50. © 2015 IBM Corporation Input Documents Extracted features SystemT Runtime SystemT Operators Statistical Learning Algorithm Label Feature Extraction Training the Parser: Efficient and Powerful Feature Extraction AQL create view FirstWord as select... from ... where ...; ... create view Digits as extract regex... from ... where ... ; Input docs English FirstWord: I, We, This, Digits: 360, 2014 Capitalized: We, IBM, German FirstWord: Ich, Wir,Dies, Digits: 360, 2014 Capitalized: Wir, IBM, Statistical Parser Annotations AQL create function Parse(span Span) return String ... external name ... ... create view Parse as select Parse(S.text) as parse,... from sentence S; create view Actions as select... from Parse P; ... -- Example: We are happy create view Sentiment_SpeakerIsPositive as select ‘positive’ as polarity, ... ‘isPositive’ as patternType,... from Actions A, Roles R where MatchesDict(‘PositiveVerb.dict’, A.verb) and MatchesDict(‘Speakers.dict’, R.value) and ...; ... Input doc Applying the Parser: Easy Incorporation of Parsing Results for Complex Extractors UDFs SentimentMention polarity mention target clue patternType positive We like IBM from a long timer IBM like SpeakerDoesPosi tive negative Microsoft earnings expected to drop Microsoft expected to drop TargetDoesNegat ive… … Parsing + extraction Cost-based optimization SystemT Operators SystemT Runtime Embed statistical parser as UDF Cost-based optimization Run statistical parser Build rules using parser results Identify sentiment based on parse tree patterns Identify the first word of a sentence Identify all digits Simplify Training and Applying StatisticalSimplify Training and Applying StatisticalSimplify Training and Applying StatisticalSimplify Training and Applying Statistical PPPParsersarsersarsersarsers 50 UDFs
  • 51. SystemML in a Nutshell • Provides a language for data scientists to implement machine learning algorithms – Declarative, high-level language with R-like syntax (also Python) – Also comes with approx. 20 algorithms pre-implemented • Compiles execution plans ranging from single node (scale up multi threaded) to scale out (MapReduce, Spark) – Cost-based optimizer to generate execution plans, parallelize • Based on data and system characteristics – Operators for in-memory single node and cluster execution • Runs in embeddable, standalone, and cluster mode • Supports various APIs • Apache SystemML Incubator project: http://systemml.apache.org • Ongoing research effort at IBM Research - Almaden 51
  • 52. Machine Learning in SystemT • AQL provides a foundation of transparency • Next step: Add machine learning without losing transparency • Major machine learning efforts: – Embeddable Models in AQL – Learning using AQL as target language • Low-level features: regular expressions [Li et al., EMNLP’08], dictionaries [Li et al., CIKM’11, Roy et al., SIGMOD’13] • Rule refinement [Liu et al., VLDB’10] • Rule induction [Nagesh et al., EMNLP’12] • Ongoing research 52
  • 53. Outline • Emerging Applications: Case Studies • Enterprise Requirements • Challenges in Grammar-based IE systems • SystemT : an Algebraic approach – The AQL language – Performance – Transparency – Multilingual Support – Machine Learning – Tooling • Summary 53
  • 54. 54 Tooling Research for Productivity Develop TestAnalyze Development Deploy Refine Test Maintenance Task Analysis [ACL’11,12,13,CHI’13] • Concordance Viewer and Labeling Tool [ACL ‘13] •Extraction plan [CHI’13] • Track provenance [VLDB’10] • Contextual clue discovery[CIKM’11] • Regex learning [EMNLP’08] • Rule induction & refinement [EMNLP’12,VLDB’10] • Dictionary refinement [SIGMOD’13] •Visual Programming [VLDB’15] • Visual Programming: [VLDB’15] •NE Interface [EMNLP’10]
  • 55. Eclipse Tools Overview 55 Ease of Programming Performance Tuning Automatic Discovery AQL Editor Explain Pattern Discovery Result Viewer Regex Learner AQL Editor: syntax highlighting, auto-complete, hyperlink navigation Result Viewer: visualize/compare/evaluate Explain: show how each result was generated Workflow UI: end-to-end development wizard Regex Generator: generate regular expressions from examples Pattern Discovery: identify patterns in the data Profiler: identify performance bottlenecks to be hand tuned [Chiticariu et al., SIGMOD ‘11, Li et al. ACL’12]
  • 56. Web Tools Overview 56 Ease of Programming Ease of Sharing Canvas: Visual construction of extractors, Customization of existing extractors Result Viewer: visualize/compare/evaluate Concept catalog: share concepts Project: share extractor development [Li et al., VLDB’15]
  • 57. Tooling for Productivity Named Entities Parts-of-speech …. Core Operations ..... Higher-level Tasks Tokenization Span Operators Dictionary Regular Expressions Financial Primitives Drug Extraction Sentiment Analysis Corporate Intelligence Life Sciences Application Tasks ..... NLPEngineers (EclipseTools) SME (WebTools) Relational &Aggregate Operators ML Training & Scoring AQL Language Customizable Extractor Libraries developed in AQL Deep Parsing & SRL 57
  • 58. Outline • Emerging Applications: Case Studies • Enterprise Requirements • Challenges in Grammar-based IE systems • SystemT : an Algebraic approach – The AQL language – Performance – Transparency – Multilingual Support – Machine Learning – Tooling • Summary 58
  • 59. 59 Summary Declarative AQL language Development EnvironmentDevelopment EnvironmentDevelopment EnvironmentDevelopment Environment Cost-based optimization ............ Discovery tools for AQL development SystemT RuntimeSystemT RuntimeSystemT RuntimeSystemT RuntimeSystemT RuntimeSystemT RuntimeSystemT RuntimeSystemT Runtime Input Documents Extracted Objects AQL: a declarative language that can be used to build extractors outperforming the state-of-the-arts [ACL’10, EMNLP’10] A suite of novel development tooling leveraging machine learning and HCI [EMNLP’08, VLDB’10, ACL’11, CIKM’11, ACL’12, EMNLP’12, SIGMOD’12, CHI’13, ACL’13, VLDB’15] Cost-based optimization for text-centric operations [ICDE’08, ICDE’11] Highly embeddable runtime with high-throughput and small memory footprint. [SIGMOD Record’09, SIGMOD’09,’ FPL’13,’14] A declarativedeclarativedeclarativedeclarative information extraction system with costcostcostcost----based optimizationbased optimizationbased optimizationbased optimization, highhighhighhigh----performanceperformanceperformanceperformance runtimeruntimeruntimeruntime and novel development toolingnovel development toolingnovel development toolingnovel development tooling based on solid theoretical foundationsolid theoretical foundationsolid theoretical foundationsolid theoretical foundation [PODS’13, 14]. Shipping with 10+ IBM products, and installed on 400+ client’s sites. InfoSphere StreamsStreamsStreamsStreams InfoSphere BigInsightsBigInsightsBigInsightsBigInsights SystemTSystemTSystemTSystemTSystemTSystemTSystemTSystemT IBM EnginesIBM EnginesIBM EnginesIBM Engines UIMAUIMAUIMAUIMA SystemTSystemTSystemTSystemT …
  • 60. Find Out More about SystemT! 60 https://ibm.biz/BdF4GQ
  • 61. Find Out More about SystemT! 61 https://ibm.biz/BdF4GQ Try out SystemT Watch a demo Learn about using SystemT in unversity courses
  • 62. Thank you! • For more information… – Visit our website:Visit our website:Visit our website:Visit our website: http://ibm.co/1Cdm1Mj – Visit BigInsights Knowledge Center:Visit BigInsights Knowledge Center:Visit BigInsights Knowledge Center:Visit BigInsights Knowledge Center: http://http://http://http://ibm.co/1DIouEvibm.co/1DIouEvibm.co/1DIouEvibm.co/1DIouEv – Learning at your own pace: With the lab materialsLearning at your own pace: With the lab materialsLearning at your own pace: With the lab materialsLearning at your own pace: With the lab materials – Contact meContact meContact meContact me • chiti@us.ibm.com SystemT Team: Laura Chiticariu, Marina Danilevsky, Howard Ho, Rajasekar Krishnamurthy, Yunyao Li Sriram Raghavan, Frederick Reiss, Shivakumar Vaithyanathan, Huaiyu Zhu 62