2. Stanford's DeepDive, developed by Professor Chris Ré and a team of PhD students, is a powerful data management and preparation platform that allows users to build highly sophisticated end-to-end data pipelines.
This presentation covers the technicalities of the inference and learning engine behind DeepDive, including how DeepDive differs from traditional data management systems, how to build an application on DeepDive, and how exactly DeepDive works.
“We are just an advanced breed of monkeys
on a minor planet of a very average star. But
we can understand the Universe. That makes
us special”
- Stephen Hawking
3. THE DEEPDIVE OVERVIEW
How Is DeepDive Different?
Source: www.deepdive.stanford.edu
DeepDive is an end-to-end framework for building KBC systems.
[Pipeline diagram] Input (new docs, e.g. a sentence such as “B. Obama and his wife M. Obama”) → Candidate Generation & Feature Extraction → Supervision → Learning & Inference → Output (e.g. the HasSpouse relation). Feature extraction rules, supervision rules, and inference rules drive the corresponding stages, and error analysis feeds back into the rules.
Input: unstructured docs. Developers add new rules to improve quality.
How Does DeepDive Work?
• Candidate Generation and Feature Extraction
• Input data is saved in a relational database
• Feature extractors: a set of user-defined functions
• Supervision
• The DeepDive language is based on Markov Logic
• Training data can serve the same function it does under supervised learning
• Learning and Inference
• Performed over a factor graph
• Error Analysis
• The user inspects errors to decide what to improve
DeepDive Design
Features that make it convenient for non-computer scientists to use:
i) No reference to the underlying machine learning algorithm. Probabilistic semantics provide a way to debug the system independently of the algorithm.
ii) Allows users to write extra features in Python, SQL and Scala.
iii) Fits into the familiar SQL stack, which allows standard tools to inspect and visualize data.
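As a sketch of what such a user-written feature might look like, here is a minimal Python feature extractor. The row format, function name, and feature string are illustrative assumptions, not DeepDive's actual UDF interface:

```python
# Hypothetical feature extractor: given a tokenized sentence and the token
# spans of two candidate mentions, emit the word sequence between them as
# a single feature string.
def words_between(tokens, mention1_span, mention2_span):
    start = mention1_span[1]   # end of the first mention (exclusive)
    end = mention2_span[0]     # start of the second mention
    return "WORDS_BETWEEN=" + "_".join(tokens[start:end])

tokens = ["B.", "Obama", "and", "his", "wife", "M.", "Obama"]
feature = words_between(tokens, (0, 2), (5, 7))
print(feature)  # WORDS_BETWEEN=and_his_wife
```

Features like this one are what the learning phase later weighs, so the developer iterates on extractors rather than on the algorithm itself.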
Source: Incremental Knowledge Base Construction Using DeepDive
Output: structured knowledge base
• Feature Engineering: allows developers to think about features rather than algorithms
• High Quality: applications have achieved higher quality than human volunteers
• Calibration: computes a calibrated probability for every assertion it makes
• Variety of Sources: can extract data from documents, PDFs, web pages, tables and figures
• Domain Knowledge: integrated by writing simple rules that improve quality
• Distant Supervision: does not require tedious training labels for every prediction
4. DEVELOPMENT PROCESS OF
DEEPDIVE APPLICATIONS
Writing The Application
Running The Application
Evaluate / Debug
• Define the data flow in DDlog schema that
describes the input data and data to be produced
• Write User-Defined Functions (data
transformation rules)
• Specify a statistical model in DDlog
• The user can compile and run the application
incrementally
• Actual data is loaded into the database and queried -> User-Defined Functions are executed incrementally
• Model’s parameters can be learned or reused to
make predictions
• Formal error analysis supported by interactive
tools
• DeepDive contains a suite of tools and guides:
Label data products, browse data, monitor
descriptive statistics, calibration etc.
# DDlog is a higher-level language for writing DeepDive applications in a succinct, Datalog-like syntax
# Variable declarations + scoping and supervision rules + inference rules
# A core set of commands that supports precise control of execution
# Several commands on the statistical model, such as its creation, parameter estimation, computation of probabilities, and keeping and reusing the parameters
# User-Defined Functions can be written in any standard programming language
# Produces calibration plots to evaluate the iterative workflow
Tip: start with a basic first version and improve iteratively.
Source: DeepDive: A Data Management System for Automatic Knowledge Base Construction
“It’s okay to have your eggs in one basket as
long as you control what happens to that
basket”
- Elon Musk
5. THE DEEPDIVE FRAMEWORK
End-To-End Framework For Building KBCs
Source: Incremental Knowledge Base Construction Using DeepDive
Knowledge Base Construction (KBC) Systems
The input to a KBC system is a heterogeneous collection of unstructured, semi-structured, and structured data. The output is a relational database containing facts extracted from the input and put into the appropriate schema.
The KBC Model
The standard KBC model seeks to extract four types of objects from input documents:
• Entity: a real person, place, or thing
• Relation: associates two (or more) entities
• Mention: a span of text in an input document that refers to an entity or relation
• Relation Mention: a phrase that connects two mentions that participate in a relation
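The four object types can be sketched as plain data structures; the field names below are illustrative, not DeepDive's actual schema:

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class Entity:
    name: str                  # a real person, place, or thing

@dataclass
class Relation:
    entities: tuple            # associates two (or more) entities

@dataclass
class Mention:
    doc_id: str
    span: Tuple[int, int]      # span of text referring to an entity or relation

@dataclass
class RelationMention:
    phrase: str                # phrase connecting the participating mentions
    mentions: tuple

barack, michelle = Entity("Barack Obama"), Entity("Michelle Obama")
spouse = Relation((barack, michelle))
print(spouse.entities[0].name)  # Barack Obama
```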
6. THE DEEPDIVE FRAMEWORK:
STEP-BY-STEP
Source: Incremental Knowledge Base Construction Using DeepDive
Candidate Generation & Feature Extraction
All data is stored in a relational database. This phase populates the database using a set of SQL queries and User-Defined Functions (feature extractors).
By default, DeepDive stores all documents in the database one sentence per row, with markup produced by standard NLP pre-processing tools, including HTML stripping, part-of-speech tagging, and linguistic parsing.
DeepDive then executes two types of queries:
• Candidate mappings: SQL queries that produce possible mentions, entities, and relations
• Feature extractors: associate features with candidates
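A toy candidate mapping might look like the following; the table name, columns, and data are illustrative, not DeepDive's actual schema:

```python
import sqlite3

# Candidate mapping as a SQL query: pair up person mentions that appear in
# the same sentence, producing candidate HasSpouse relation mentions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person_mention (sentence_id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO person_mention VALUES (?, ?)",
    [(1, "B. Obama"), (1, "M. Obama"), (2, "J. Biden")],
)
candidates = conn.execute(
    """SELECT a.name, b.name
       FROM person_mention a JOIN person_mention b
         ON a.sentence_id = b.sentence_id AND a.name < b.name"""
).fetchall()
print(candidates)  # [('B. Obama', 'M. Obama')]
```

The `a.name < b.name` condition keeps each unordered pair once; feature extractors would then attach features (such as the words between the two mentions) to each candidate row.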
“A breakthrough in machine learning would be
worth ten Microsofts”
- Bill Gates
7. THE DEEPDIVE FRAMEWORK:
STEP-BY-STEP
Source: Incremental Knowledge Base Construction Using DeepDive
Supervision
Just as in Markov Logic, DeepDive can use training data or evidence about any relation. Each user relation is associated with evidence that indicates whether an entry is true or false.
Two standard techniques generate training data: hand-labeling and distant supervision.
Distant Supervision
Traditional machine learning techniques require a set of labeled training data. In distant supervision, DeepDive instead takes an existing database (e.g. a domain-specific database) containing examples of the relations it wants to extract, then uses these examples to automatically generate training data.
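A minimal sketch of this idea, with made-up data standing in for the existing database:

```python
# Distant supervision sketch: a small set of known spouse pairs stands in
# for an existing knowledge base. Candidate pairs found in text that match
# it become positive training examples automatically, with no hand-labeling.
known_spouses = {("Barack Obama", "Michelle Obama")}

def normalize(pair):
    """Order-insensitive key so (a, b) and (b, a) match the same fact."""
    return tuple(sorted(pair))

known = {normalize(p) for p in known_spouses}

def distant_label(pair):
    """1 = positive example; None = unlabeled (negatives need other heuristics)."""
    return 1 if normalize(pair) in known else None

candidates = [("Barack Obama", "Michelle Obama"),
              ("Barack Obama", "Joe Biden")]
training = [(pair, distant_label(pair)) for pair in candidates]
print(training)
```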
8. THE DEEPDIVE FRAMEWORK:
STEP-BY-STEP
Source: Incremental Knowledge Base Construction Using DeepDive
Learning & Inference
In this phase, DeepDive generates a factor graph: a probabilistic graphical model that serves as the abstraction for learning and inference. DeepDive relies heavily on factor graphs.
An example factor graph: there is one user relation containing all tokens, and there are two correlation relations, for adjacent-token correlation (F1) and same-word correlation (F2) respectively.
Raw data: “He said that he would come.”
In-database representation. User relation (Token, Word): (A, He), (B, said), (C, that), (D, he). Correlation relations: F1 (adjacent-token) contains i = (A, B), ii = (B, C), iii = (C, D); F2 (same-word) contains iv = (A, D).
Factor graph: one Boolean variable per token A, B, C, D; the F1 factors connect adjacent tokens, and the F2 factor connects the two occurrences of “he”.
Assignment example: A = 1, B = 0, C = 0, D = 1. Its unnormalized weight is the product of the factors in F1, f1(1,0) × f1(0,0) × f1(0,1), times the factor in F2, f2(1,1); the partition function Z is the sum of these products over all possible assignments.
Source: DeepDive: A Data Management System for Automatic Knowledge Base Construction
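For this four-variable example, the partition function can be computed by brute-force enumeration. The factor weights below are made up for illustration (the example specifies none):

```python
import itertools
import math

# Hypothetical factor functions: the adjacent-token factor f1 mildly rewards
# agreeing labels, the same-word factor f2 strongly rewards agreeing labels.
def f1(x, y):
    return math.exp(0.5) if x == y else 1.0

def f2(x, y):
    return math.exp(2.0) if x == y else 1.0

# Variables A, B, C, D (indices 0..3) label the tokens "He said that he".
F1_EDGES = [(0, 1), (1, 2), (2, 3)]  # adjacent-token correlations i, ii, iii
F2_EDGES = [(0, 3)]                  # same-word correlation iv ("He"/"he")

def weight(assignment):
    """Unnormalized weight of an assignment: product of all factor values."""
    w = 1.0
    for a, b in F1_EDGES:
        w *= f1(assignment[a], assignment[b])
    for a, b in F2_EDGES:
        w *= f2(assignment[a], assignment[b])
    return w

# Partition function Z: sum of weights over all 2^4 assignments.
Z = sum(weight(v) for v in itertools.product([0, 1], repeat=4))

# Probability of the example assignment A=1, B=0, C=0, D=1.
p = weight((1, 0, 0, 1)) / Z
```

Exact enumeration is only feasible for tiny graphs like this one; at scale, DeepDive performs inference over the factor graph rather than enumerating assignments.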
“Problems worthy of attack prove their worth
by fighting back”
- Paul Erdős
9. REFERENCES
Shin, Jaeho, Sen Wu, Feiran Wang, Christopher De Sa, Ce Zhang, and Christopher Ré. “Incremental Knowledge Base Construction Using DeepDive.” Proceedings of the VLDB Endowment 8.11 (2015): 1310–1321. Web.
Ce Zhang. “DeepDive: A Data Management System for Automatic Knowledge Base Construction.” Proceedings of the VLDB Endowment 8.13 (2015): 1310–1321. Web.