We present Caffe con Troll (CcT), a fully compatible end-to-end version of the popular framework Caffe with rebuilt internals. We built CcT to examine the performance characteristics of training and deploying general-purpose convolutional neural networks across different hardware architectures. We find that, by employing standard batching optimizations for CPU training, we achieve a 4.5x throughput improvement over Caffe on popular networks like CaffeNet. Moreover, with these improvements, the end-to-end training time for CNNs is directly proportional to the FLOPS delivered by the CPU, which enables us to efficiently train hybrid CPU-GPU systems for CNNs.
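As a rough illustration of the kind of batching optimization the paragraph above refers to, here is a minimal NumPy sketch (not CcT's actual implementation) of the standard im2col lowering applied to a whole mini-batch, so the convolution becomes one large GEMM that a multithreaded BLAS can drive close to peak FLOPS.

```python
# Minimal sketch of batched convolution lowering ("im2col" + one big GEMM).
# Naive Python loops are kept for clarity; real implementations do the
# lowering in optimized native code.
import numpy as np

def im2col(batch, k):
    """Unroll k x k patches of every image in `batch` (N, C, H, W) into rows."""
    n, c, h, w = batch.shape
    out_h, out_w = h - k + 1, w - k + 1
    cols = np.empty((n * out_h * out_w, c * k * k), dtype=batch.dtype)
    row = 0
    for img in batch:
        for i in range(out_h):
            for j in range(out_w):
                cols[row] = img[:, i:i + k, j:j + k].ravel()
                row += 1
    return cols, out_h, out_w

def conv_forward_batched(batch, kernels):
    """kernels: (num_filters, C, k, k). One GEMM covers the whole mini-batch."""
    num_f, c, k, _ = kernels.shape
    cols, out_h, out_w = im2col(batch, k)
    weights = kernels.reshape(num_f, -1)       # (num_f, C*k*k)
    out = cols @ weights.T                     # single large matrix multiply
    n = batch.shape[0]
    return out.reshape(n, out_h, out_w, num_f).transpose(0, 3, 1, 2)

# Example: 64 images of shape (3, 32, 32) convolved with 16 5x5 filters.
x = np.random.randn(64, 3, 32, 32).astype(np.float32)
w = np.random.randn(16, 3, 5, 5).astype(np.float32)
y = conv_forward_batched(x, w)                 # (64, 16, 28, 28)
```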
5. KBC Applications
Science is built up with facts, as a house is with stones. - Jules Henri Poincaré
Example: Paleontology
[Diagram: scientific facts (taxon, rock, age, location) are aggregated into a biodiversity curve, a macroscopic view that yields insights and knowledge, e.g. on the impact of climate change on biodiversity.]
7. KBC Applications
Example: Paleontology (continued)
[Diagram: input sources published from 1570 to 2015 feed a KB construction process that produces a knowledge base (KB), which supports the biodiversity analysis above.]
8. KBC Applications
[Diagram: KBC across domains. Paleontology (taxon, rock, age, location) for climate & biodiversity; Genomics (gene, drug, disease) for health & medicine; Dark Web (server, service, price, location) for social good. Each domain feeds its own knowledge base.]
10. Challenge of Manual KBC
Effort on Manual KBC (Paleontology: taxon, rock, age, location)
Sepkoski (1982) manually compiled a compendium of 3300 animal families with 396 references in his monograph.
300 professional volunteers (1998-present) spent 8 continuous human years to compile PaleoDB with 55,479 references.
[Chart: number of new paleontology references per year, 2010-2013. Roughly 100K new references per year, i.e. 16 continuous human years every year just to keep up to date.]
13. Case Study - PaleoDeepDive
The Goal: Extract paleobiological facts to build a higher-coverage fossil record.
Example input sentence: "T. Rex are found dating to the upper Cretaceous."
DeepDive extraction: Appears(“T. Rex”, “Cretaceous”)
14. Case Study - PaleoDeepDive
PaleoDB (human-created paleobiology database): 55K documents, 329 geoscientists, 8 years, 126K fossil mentions, 1M relations.
PaleoDeepDive (machine-created paleobiology database): 300K documents, 2000 machine cores, 46 machine years, 3M fossil mentions, 2.1M relations (>90% precision).
[Figure: biodiversity curves derived from both databases.]
On the same relation, PaleoDeepDive achieves precision equal to (or sometimes better than) that of professional human volunteers.
15. Validation on Real Applications
Domains: Paleontology, Geology, Pharmacogenomics, Genomics, Wikipedia-like Relations, Dark Web, Applied Physics.
"It's a little scary, the machines are getting that good."
Recall: 2-10x more extractions than humans. Precision: 92%-97% (human ~84%-92%).
Highest score out of 18 teams and 65 submissions (2nd highest is also DeepDive).
Goal: enable easy engineering of high-quality KBC systems by thinking about features, not algorithms.
16. Can we support more sophisticated image processing in DeepDive?
17. Go Beyond Text-Processing
[Images with questions: "What kind of dinosaur is this?" "Does this patient have short fingers?" "Is this sea star found in 2014 sick?" "What's the clinical outcome of this patient?"]
Images are important to many scientific questions.
[User] Can I run Deep Learning on my datasets with DeepDive?
18. Just before we start the run…
On which machine should we run? CPU or GPU?
"I have a GPU cluster." "I have 5000 CPU cores." "I have $100K to spend on the cloud."
EC2 c4.4xlarge: 8 cores @ 2.90GHz, 0.7 TFLOPS. EC2 g2.2xlarge: 1.5K cores @ 800MHz, 1.2 TFLOPS.
Not a 10x gap? Can we close this gap?
25. One of the four shallow ideas…
[Diagram: 3 CPU cores, 3 images; Strategy 1 vs. Strategy 2 for dividing the work across cores.]
If the amount of data is too small for each core, the process might not be CPU bound.
For AlexNet on Haswell CPUs, Strategy 2 is 3-4x faster.
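The slide does not spell out the two strategies, so the sketch below is only one plausible reading (the names and the partitioning scheme are my assumptions, not CcT's): giving each core its own small per-image GEMM versus stacking the images into one large GEMM that a multithreaded BLAS can split across all cores, which matches the "too small for each core" observation above.

```python
# Hedged sketch of two ways to split convolution GEMMs across CPU cores.
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def strategy_1(lowered_images, weights, n_cores=3):
    """One small GEMM per image, one image per core (may underuse the cores)."""
    with ThreadPoolExecutor(max_workers=n_cores) as pool:
        return list(pool.map(lambda m: m @ weights.T, lowered_images))

def strategy_2(lowered_images, weights):
    """Stack all images and issue a single large GEMM (BLAS uses every core)."""
    stacked = np.vstack(lowered_images)
    out = stacked @ weights.T
    sizes = [m.shape[0] for m in lowered_images]
    return np.split(out, np.cumsum(sizes)[:-1])

# Example: 3 images already lowered by im2col, 16 filters of length 75.
imgs = [np.random.randn(784, 75).astype(np.float32) for _ in range(3)]
w = np.random.randn(16, 75).astype(np.float32)
out1 = strategy_1(imgs, w)
out2 = strategy_2(imgs, w)
```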
27. Application 1: Paleontology
Images without high-quality human labels also contain valuable information.
What can we learn from these images without human labels?
[Figure: a fossil image and the name of the fossil appearing in the same document.]
28. Application 1: Paleontology
We apply distant supervision!
Can we build a system that automatically "reads" a paleontology textbook and learns the difference between sponges (Porifera) and shells (Brachiopoda)?
[Diagram: document → classifier]
29. Application 1: Paleontology
Example caption: "Fig. 387,1a-c. *B. rara, Serpukhovian, Kazakhstan, Dzhezgazgan district; a,b, holotype, viewed ventrally, laterally, MGU 31/342, XI (Litvinovich, 1967);"
[Diagram: DeepDive extracts figure-name mentions (e.g., "Fig. 387") and taxon mentions from the text; these extractions provide labels for the figures, which are used to train a CNN and then tested against human labels.]
3K Brachiopoda images, 2K Porifera images; accuracy = 94%.
Hi everyone, thanks for coming.
Today, I am going to talk about a system that we have been building over the last couple of years called DeepDive, and the deep learning engine for it called Caffe con Troll.
DeepDive is a system we built to support a workload called knowledge base construction. I will talk more about DeepDive, but for now you can think of it as a system that takes as input unstructured documents, for example a text document containing natural language, and outputs relations extracted from the input document. For example, given a text like this, DeepDive can extract relations between rock formations and their age and location, as database relations.
For those of you who are familiar with databases, this might sound
like an information extraction task; and for those of you who are
familiar with machine learning and natural language processing,
this might sound like a relation extraction task. DeepDive learned
a lot from these two communities. On the other hand, DeepDive
tries to extend this ability to other sources of input, like tables,
document layout, and even figures.
This focus on the diversity of input sources requires DeepDive to consume a diverse set of features, whether they are produced by user-defined functions over parse trees of natural-language sentences or extracted from images by state-of-the-art image-processing algorithms like deep learning.
In this talk, I will first tell you more about DeepDive and a deep learning engine we built called Caffe con Troll. These two pieces of software are ready for you to download and play with today. Beyond these two pieces, I will also tell you about one of our ongoing directions, which tries to fuse them more tightly together.
Over the last couple of years, we have learned from many of our scientist collaborators that many pressing scientific questions are macroscopic. That is, to get insights and hints into these questions, one often needs to aggregate a huge amount of facts about a certain domain.
One such example can be found in paleontology. When scientists want to understand questions like the impact of climate change on biodiversity, one way to get some insight is to start from a collection of facts about which fossils appear in which rock formations, and the age and location of each rock formation. Given this collection of facts, they can aggregate it by time to get a biodiversity curve, which essentially tells us how many distinct species there are at a given time. From this macroscopic view, geoscientists can get insights that help them understand scientific questions like the one about climate change.
We can see that if we want to support this workflow from scientific facts to scientific insights, we first need to have this collection of facts ready for analysis, ideally in a structured form that we can query with different analytic tools.
However, in practice, many of these facts are not currently organized in a structured way like relational tables in a database; instead, many are published in journal articles or books that can be more than four hundred years old.
In this talk, I will call this collection of scientific facts organized in structured form a knowledge base, and the process of extracting these facts from input sources knowledge base construction (KBC). As we can see here, this KBC step provides one possible starting point for scientists to understand key scientific questions.
This kind of workload does not appear only in paleontology. In our past experience, we have found that a similar KBC process could be useful for a whole range of other domains. For example, in genomics, if we could build a knowledge base relating genes, drugs, and diseases from published journal articles, it could help genomicists better understand drug repurposing or personalized medicine. If we could build a knowledge base by extracting from bad guys' communications on the Web, we could provide opportunities to make this world a better place.
Now that we have seen that KBC could be a useful process, a natural question to ask is:
Can we just do KBC manually?
Let's still use our paleontology example, where our target knowledge is about fossils, rock formations, their age, and their location.
Actually, building such a knowledge base is so important that people have been trying to build it manually.
More than thirty years ago, Sepkoski manually compiled a knowledge base that contains more than three thousand animal families by manually extracting these facts from about four hundred references. This monograph alone has been cited hundreds of times and led to many discoveries.
The importance of compiling such a knowledge base was noticed by the community, and starting about twenty years ago, more than 300 professional paleontologists spent more than eight continuous human years to compile a knowledge base called PaleoDB that contains more than 55 thousand references. This project is also highly successful and has led to more than 200 papers, many of which were published in Nature or Science.
However, this effort of manually constructing knowledge bases has its own limitations. In one of the largest databases of geoscience-related publications, the number of references is growing at a rate of 100 thousand references per year. If we compare this rate with the total size of PaleoDB, it means it might take 16 continuous human years every year just to keep up to date. Although this estimate is pretty rough, it does show how expensive and time-consuming manual KBC can be given the amount of information produced in recent years. This huge number of publications is not unique to geoscience; every year there are millions of new papers published across all fields of science.
Motivated by the sheer amount of data that we need to extract information from, one question
that we are really interested in is
Can we build a machine to read for us?
Here the word "read" could mean a lot of things for human beings, but let's make it precise for a machine. By "read", we mean:
Can we build a machine that takes as input all these input sources, like journal articles, and automatically fills in a knowledge base stored as database tables?
DeepDive is the system we built to make this process easier. Over the last couple of years, we have been building this type of system for a range of domains, so let me tell you more.
One application we built is called PaleoDeepDive.
The goal is to extract paleobiological facts to build higher coverage fossil records.
The input of the system is a collection of journal articles, and the output is a knowledge base containing
information about fossils.
For example, if we see this sentence in a journal article, we expect the system to output a tuple encoding the fact that the dinosaur T. Rex appears in the Cretaceous.
One of the most interesting aspects of PaleoDeepDive is that it extracts relations in exactly the same schema as PaleoDB, the manually curated knowledge base that I just described. This enables us to compare the quality of PaleoDeepDive with that of professional volunteers on the KBC task.
As we just mentioned, PaleoDB is an effort by three hundred professional paleontologists who together spent 8 continuous human years. PaleoDB contains more than 55 thousand documents, 100 thousand fossil mentions, and more than 1 million relations.
On the other hand, PaleoDeepDive is a machine-curated knowledge base that used more than 2000 machine cores and 46 machine years. It processed more than 300 thousand journal articles and extracted three million fossil mentions and 2 million relations.
We can aggregate both knowledge bases into biodiversity curves, and the two curves are highly correlated.
We also conducted double-blind experiments in which scientists labeled the correctness of facts in PaleoDB and PaleoDeepDive, and found that on the same relation PaleoDeepDive achieves precision equal to, and sometimes better than, that of professional human volunteers.
DeepDive has been used to build similar applications across different domains.
At the early stage of developing DeepDive, we ourselves were the developers of the KBC systems and got a lot of help from domain scientists. Some of these systems achieve pretty high quality. The PaleoDeepDive work that I just mentioned was featured in a news article in the July issue of Nature, and we are pretty excited about it. According to that article, some geoscientists were impressed by how high the quality of our system is.
We also developed a KBC system to extract information from the Web, and it produced the highest score in a popular KBC competition among 18 teams.
Now, KBC systems are also developed by domain experts besides us, including applications in domains such as pharmacogenomics and applied physics.
To let these domain experts actually use our system by themselves, DeepDive models the whole process as a large inference problem, and the only thing the user needs to do is keep providing features to the system, without worrying about what algorithm is actually running inside DeepDive.
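To make the "features, not algorithms" contract concrete, here is a hedged, generic sketch of a user-written feature function over a candidate relation mention; the candidate format and the feature names are illustrative only, not DeepDive's exact UDF interface, and the learning and inference that consume these features are left to the system.

```python
# Generic sketch of a feature function for a candidate Appears(fossil, age)
# relation mention; the system, not the user, learns the feature weights.
def appears_features(words, fossil_idx, age_idx):
    """Yield string features describing the words between the two mentions."""
    lo, hi = sorted((fossil_idx, age_idx))
    between = words[lo + 1:hi]
    yield "num_words_between=%d" % len(between)
    yield "words_between=" + "_".join(between)
    for cue in ("found", "dating", "appears"):
        if cue in between:
            yield "contains_" + cue

# Toy candidate built from the running example sentence.
sentence = "T.Rex are found dating to the upper Cretaceous .".split()
print(list(appears_features(sentence, 0, 7)))
# ['num_words_between=6', 'words_between=are_found_dating_to_the_upper',
#  'contains_found', 'contains_dating']
```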
One thing these applications have in common is that in their early stages many of them could only extract information from text and tables. After we understood how to do these tasks for text and tables, we found that we needed to understand images better.
Although DeepDive can currently support some textual extraction from images, this functionality could be significantly extended and improved. We are interested in figures and images because such sources are important to many scientific questions.
For example, if we could build high-quality image recognition tools with DeepDive, we could automatically classify fossils into different classes or orders; it could also help genomicists automatically identify the phenotypes of a given patient and use this information together with the patient's genotype to decide on a treatment plan.
We talked with a lot of our scientist collaborators about what type of information they want out of images, and one of the most frequent first questions they ask is: can you guys just run deep learning on my set of images and classify them?
At first, we thought this requirement should be easy to support. There are a lot of awesome tools out there, like Caffe or Theano, that make it really easy to specify a deep learning task and run it.
So we started to set up the machines to run this for them, but just before we started the run, one question came up: on which machine should we run? Should we run on a lot of GPUs or just a lot of CPUs?
So we looked at existing papers and systems to try to understand this question, but what we found was a mixed set of information. For some systems it is not uncommon to run 10 times faster on a GPU than on a CPU; but there are also many successful systems built in industry that take advantage of a cluster of CPUs quite well. Moreover, users often have access to a diverse set of resources: some have a GPU cluster, some have a lot of CPU cores, and some have cloud credits they can spend. When we put all this information together, we only got more confused about the question.
So we decided to look more deeply into this question, and we started to investigate the difference between CPUs and GPUs for running deep learning workloads.
If we compare a CPU and a GPU that are available on Amazon's cloud, we can see that the difference in the number of floating-point operations they can do per second is not that large. Therefore, a natural question to ask is: since there is no terrible gap in peak FLOPS, can we actually achieve this peak? Or is there anything special about the CPU that prevents us from achieving it?
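For concreteness, a back-of-the-envelope check of the peak numbers quoted on the earlier slide (c4.4xlarge vs. g2.2xlarge); the FLOPs-per-cycle figures are my assumptions, not from the talk.

```python
# Peak-FLOPS estimate for the two EC2 instances on the slide.
# CPU figure assumes AVX2 FMA units (2 x 8 single-precision FLOPs per cycle
# per core); GPU figure assumes 1 FLOP per cycle per CUDA core.
cpu_cores, cpu_ghz, cpu_flops_per_cycle = 8, 2.90, 32      # c4.4xlarge
gpu_cores, gpu_ghz, gpu_flops_per_cycle = 1536, 0.80, 1    # g2.2xlarge

cpu_tflops = cpu_cores * cpu_ghz * cpu_flops_per_cycle / 1e3   # ~0.74
gpu_tflops = gpu_cores * gpu_ghz * gpu_flops_per_cycle / 1e3   # ~1.23

print("c4.4xlarge peak: %.2f TFLOPS" % cpu_tflops)
print("g2.2xlarge peak: %.2f TFLOPS" % gpu_tflops)
print("ratio: %.1fx (far from 10x)" % (gpu_tflops / cpu_tflops))
```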
Therefore, we built a very simple prototype system called Caffe con Troll to study this question.
We designed Caffe con Troll as a prototype that takes the same input as the popular deep learning framework Caffe and produces the same output.
We find that the performance of the CPU can be optimized to reach nearly 80% of peak FLOPS. With this, our implementation on a single CPU can be more than 5 times faster than Caffe.
Actually, when we add one more CPU to the machine, we get almost a 2x speedup, and these two 8-core Haswell CPUs can match the speed of a single GPU available on Amazon's cloud.
But a more interesting result is that, with our implementation, the speed of running deep learning on a CPU or a GPU is proportional to the FLOPS the device can deliver. This gives us a very simple rule of thumb to guide our users when they ask what type of device they need.
Also, because the CPU is not that slow compared with the GPU, it makes sense to ask how to run deep learning on a machine with both CPUs and GPUs together.
Surprisingly, achieving this speedup is actually pretty simple.
Recall that one of our motivations for building Caffe con Troll was to help our DeepDive users process their images. Now that we have a prototype implementation for running deep learning, how are we going to use it for knowledge base construction?
The answer to this question is still under exploration, so I will just tell you about a very preliminary application that we are building.
One interesting direction that we are exploring is how to combine deep learning with DeepDive. If we are able to run deep learning efficiently, how can it help DeepDive and our scientist users?
One observation is that state-of-the-art image recognition methods often require a corpus with human labels. However, images without high-quality human labels might also contain valuable information.
Take this page from a paleontology journal article, for example: although there are no human labels telling us the name of the fossil, the name actually appears in the same document.
Therefore, one question that we are interested in is: what can we learn from images that have no human labels but are surrounded by rich information?
More concretely, can we build a system that automatically reads a paleontology textbook and learns the difference between different classes of fossils, like sponges and shells? If such a system is possible, the input would be a set of journal articles, and the output would be a classifier that can distinguish between different classes of fossils.
Our hypothesis is that we can extend the idea of distant supervision from text processing to automatically generate labels for image applications.
To study this hypothesis, we have built a very simple prototype that uses DeepDive to extract relations between figure numbers and fossil names from the text and uses them to label the figures as training examples. We then train a convolutional neural network on these distantly-generated labels. Some early results show that we can achieve pretty high accuracy on this simple task.
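A minimal sketch (mine, not the prototype's code) of the distant-supervision join just described: text-side extractions link a figure number to a taxon, the extracted figure images carry the same figure number, and joining the two yields labeled training examples for the CNN. All identifiers and paths below are hypothetical.

```python
# Join text extractions with extracted figure images to build training labels.
def distant_labels(text_extractions, figure_images):
    """
    text_extractions: iterable of (doc_id, figure_no, taxon),
        e.g. ("treatise_vol2", "Fig. 387", "Brachiopoda")
    figure_images:    dict {(doc_id, figure_no): image_path}
    Returns a list of (image_path, taxon) training examples.
    """
    examples = []
    for doc_id, figure_no, taxon in text_extractions:
        img = figure_images.get((doc_id, figure_no))
        if img is not None:
            examples.append((img, taxon))
    return examples

# Toy usage with made-up identifiers.
text_rels = [("treatise_vol2", "Fig. 387", "Brachiopoda")]
figures = {("treatise_vol2", "Fig. 387"): "figs/treatise_vol2_387.png"}
train_set = distant_labels(text_rels, figures)   # feed this to a CNN trainer
```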
This result is pretty preliminary, but we hope to explore further in this direction to understand what framework we should provide to users so that they can build similar applications easily.
Thank you for your attention. Both systems that I talked about today can be downloaded from their websites, and we are actively working on understanding how to fuse these two systems together to support applications that require inference across text and images.