Ongoing and Future Work: Part II
DeepDive & Caffe con Troll:
Knowledge Base Construction from Text and Beyond
Ce Zhang
Stanford University
[Figure: DeepDive extracts structured relations from four kinds of unstructured input: (a) natural language text, (b) tables, (c) document layout, and (d) images. Example source text: "... The Namurian Tsingyuan Fm. from Ningxia, China, is divided into three members ..." Example extractions: Formation-Time (Tsingyuan Fm., Namurian); Formation-Location (Tsingyuan Fm., Ningxia); Taxon-Formation (Euphemites, Tsingyuan Fm.); Taxon-Taxon (Turbonitella semisulcatus, Turbo semisulcatus); Taxon-Real Size (Shasiella tongxinensis, 5cm x 5cm).]
http://deepdive.stanford.edu
DeepDive
Unstructured Inputs
Structured Outputs
Goal: High Quality
DeepDive: Applications to Knowledge Base Construction
Caffe con Troll: A Deep Learning Engine
DeepDive with Caffe con Troll: Ongoing Work
Many pressing scientific questions are macroscopic.
KBC Applications
"Science is built up with facts, as a house is with stones."
- Jules Henri Poincaré
Example: Paleontology
Scientific facts (taxon, rock, age, location) feed a macroscopic view (biodiversity), which yields insights and knowledge: what is the impact of climate change on biodiversity?
[Figure: a timeline (1570-2015) of input sources feeding knowledge base construction.]
Input Sources → KB Construction → Knowledge Base (KB)
KBC Applications
Paleontology: Knowledge Base (Taxon, Rock, Age, Location) - Climate & Biodiversity
Genomics: Knowledge Base (Gene, Drug, Disease) - Health & Medicine
Dark Web: Knowledge Base (Server, Service, Price, Location) - Social Good
Challenge:
Can we just do KBC manually?
Challenge of Manual KBC
Paleontology Knowledge Base (Taxon, Rock, Age, Location)
Effort on Manual KBC
Sepkoski (1982) manually compiled a compendium of 3,300 animal families from 396 references in his monograph.
300 professional volunteers (1998-present) spent 8 continuous human years to compile PaleoDB from 55,479 references.
[Chart: number of new paleontology references per year, 2010-2013, rising from roughly 80 to 120 on the vertical axis (presumably thousands).]
~100K new references per year! It would take 16 continuous human years every year just to keep up to date!
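As a rough sanity check on that figure (a back-of-the-envelope sketch, not from the slides, assuming curation effort scales linearly with the number of references), 100K new references per year at PaleoDB's observed curation rate lands in the same ballpark as the slide's 16 human years:

```python
# Back-of-the-envelope check of the "continuous human years" claim.
# Assumption (not from the slides): curation effort scales linearly
# with the number of references.
paleodb_refs = 55_479          # references curated into PaleoDB
paleodb_effort_years = 8       # continuous human years spent
new_refs_per_year = 100_000    # new paleontology references per year

years_per_ref = paleodb_effort_years / paleodb_refs
effort_to_keep_up = new_refs_per_year * years_per_ref
print(f"~{effort_to_keep_up:.1f} human years per year")  # → ~14.4 human years per year
```

The linear estimate gives roughly 14-15 years; the slide's 16 is consistent once per-year overhead is included.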
Can we build a machine
to read for us?
Automatic KBC
Input Sources
Machine
Knowledge Base
Case Study - PaleoDeepDive
The Goal
Extract paleobiological facts to build a higher-coverage fossil record.
"T. Rex are found dating to the upper Cretaceous."
→ Appears("T. Rex", "Cretaceous")
DeepDive
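DeepDive itself couples user-defined feature extractors with probabilistic inference; the following is only a hypothetical, minimal pattern-based sketch of the extraction step above (the `Appears` relation name comes from the slide; the regex and age list are illustrative):

```python
import re

# Hypothetical, minimal sketch of turning a sentence into an
# Appears(taxon, age) fact. Real DeepDive uses NLP preprocessing,
# user-defined feature extractors, and statistical inference rather
# than a single regex.
GEOLOGIC_AGES = {"Cretaceous", "Jurassic", "Namurian", "Serpukhovian"}

def extract_appears(sentence: str):
    facts = []
    # Candidate taxa: capitalized abbreviated binomials like "T. Rex".
    taxa = re.findall(r"\b[A-Z]\.\s+[A-Z][a-z]+\b", sentence)
    ages = [a for a in GEOLOGIC_AGES if a in sentence]
    for taxon in taxa:
        for age in ages:
            facts.append(("Appears", taxon, age))
    return facts

print(extract_appears("T. Rex are found dating to the upper Cretaceous."))
# → [('Appears', 'T. Rex', 'Cretaceous')]
```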
Case Study - PaleoDeepDive

                  PaleoDB (human-created)       PaleoDeepDive (machine-created)
Documents         55K                           300K
Effort            329 geoscientists, 8 years    2,000 machine cores, 46 machine years
Fossil mentions   126K                          3M
Relations         1M                            2.1M (>90% precision)

Biodiversity Curve: on the same relation, PaleoDeepDive achieves precision equal to (and sometimes better than) that of professional human volunteers.
Validation on Real Applications
Domains: Paleontology, Geology, Pharmacogenomics, Genomics, Wikipedia-like Relations, Dark Web, Applied Physics
Recall: 2-10x more extractions than human volunteers.
Precision: 92%-97% (human: ~84%-92%).
Wikipedia-like relations: highest score out of 18 teams and 65 submissions (the 2nd highest was also DeepDive).
"It's a little scary, the machines are getting that good."
Goal: Enable easy engineering of high-quality KBC systems by thinking about features, not algorithms.
Can we support more sophisticated
image processing in DeepDive?
Go Beyond Text-Processing
What kind of dinosaur is this?
Does this patient have short fingers?
Is this sea star, found in 2014, sick?
What's the clinical outcome of this patient?
Images are important to many scientific questions.
[User] Can I run Deep Learning on my datasets with DeepDive?
Just before we start the run…
On which machine should we run? CPU or GPU?
"I have a GPU cluster." "I have 5000 CPU cores." "I have $100K to spend on the cloud."
EC2 c4.4xlarge: 8 cores @ 2.90GHz, 0.7 TFLOPS
EC2 g2.2xlarge: 1.5K cores @ 800MHz, 1.2 TFLOPS
Not a 10x gap? Can we close this gap?
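Putting the slide's peak-FLOPS numbers next to the hourly EC2 prices quoted later in the deck, a quick hedged calculation shows the raw gap is under 2x, nowhere near 10x:

```python
# Peak-FLOPS comparison from the slides, combined with the hourly
# prices quoted later in the deck. A rough sketch: real training
# throughput also depends on how well each device is utilized.
machines = {
    "c4.4xlarge (CPU)": {"tflops": 0.7, "usd_per_hour": 0.68},
    "g2.2xlarge (GPU)": {"tflops": 1.2, "usd_per_hour": 0.47},
}

for name, m in machines.items():
    per_dollar = m["tflops"] / m["usd_per_hour"]
    print(f"{name}: {m['tflops']} TFLOPS, {per_dollar:.2f} TFLOPS per $/h")

gap = machines["g2.2xlarge (GPU)"]["tflops"] / machines["c4.4xlarge (CPU)"]["tflops"]
print(f"Raw FLOPS gap: {gap:.2f}x")  # → Raw FLOPS gap: 1.71x
```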
Caffe con Troll
http://github.com/HazyResearch/CaffeConTroll
A prototype system to study the
CPU/GPU tradeoff.
Same-input-same-output as Caffe.
What we found…
[Chart: relative speed (0-1.2) of Caffe CPU, CcT CPU, and Caffe GPU across EC2 instances: c4.4xlarge ($0.68/h), g2.2xlarge ($0.47/h), and c4.8xlarge ($1.37/h).]
Speed is proportional to FLOPs!
Four Shallow Ideas Described in Four Pages…
arXiv:1504.04343
One of the four shallow ideas…
[Figure: 3 CPU cores, 3 images; Strategy 1 vs. Strategy 2.]
If the amount of data is too small for each core, the process might not be CPU-bound.
For AlexNet on Haswell CPUs, Strategy 2 is 3-4x faster.
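The two strategies appear only as a figure in the slides; the sketch below is a hypothetical reading of the contrast, assuming Strategy 1 gives each core its own image while Strategy 2 pools the whole batch into one large operation that all cores share, so no core is starved when per-image work is small:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical sketch of the two scheduling strategies (the strategy
# details are an assumption; the original figure is not in the text).

def process_image(img):
    # stand-in for convolution lowering + GEMM on a single image
    return sum(img)

def strategy1(images, n_cores=3):
    """Strategy 1: each core processes one image end-to-end."""
    with ThreadPoolExecutor(max_workers=n_cores) as pool:
        return list(pool.map(process_image, images))

def strategy2(images, n_cores=3):
    """Strategy 2: pool the batch, then split the combined work."""
    flat = [x for img in images for x in img]
    chunk = max(1, len(flat) // n_cores)
    chunks = [flat[i:i + chunk] for i in range(0, len(flat), chunk)]
    with ThreadPoolExecutor(max_workers=n_cores) as pool:
        partials = list(pool.map(sum, chunks))
    return sum(partials)  # one batched result

images = [[1, 2], [3, 4], [5, 6]]
print(strategy1(images))  # → [3, 7, 11]
print(strategy2(images))  # → 21
```

With real convolution kernels, the batched form keeps every core busy even when a single image's work is tiny, which is the effect the slide attributes to Strategy 2.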
Caffe con Troll + DeepDive
(Ongoing Work)
Application 1: Paleontology
Images without high-quality human labels also
contain valuable information.
What can we learn from these
images without human labels?
Name of Fossil
Fossil Image
Application 1: Paleontology
We apply distant supervision!
A document is fed to a classifier that distinguishes Porifera from Brachiopoda.
Can we build a system that automatically "reads" a paleontology textbook and learns the difference between sponges and shells?
Application 1: Paleontology
"Fig. 387,1a-c. *B. rara, Serpukhovian, Kazakhstan, Dzhezgazgan district; a,b, holotype, viewed ventrally, laterally, MGU 31/342, XI (Litvinovich, 1967);"
DeepDive extractions: figure name mention ("Fig. 387") and taxon mention ("*B. rara").
Pipeline: figures plus DeepDive extractions provide labels → train CNN → test with human labels.
Data: 3K Brachiopoda images, 2K Porifera images. Accuracy = 94%.
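A hypothetical sketch of that distant-supervision labeling step (all names, paths, and the taxon-to-class mapping below are illustrative; the actual system joins DeepDive's figure/taxon extractions with figure images to produce CNN training labels):

```python
# Hypothetical sketch of distant supervision for images:
# DeepDive's text extractions (figure mention + taxon mention) supply
# noisy labels for figure images, with no human annotation needed.
# All data below is illustrative, not from the real pipeline.

# Output of DeepDive text extraction: (figure id, taxon mention)
figure_taxon = [("Fig. 387", "B. rara"), ("Fig. 12", "Taxon X")]

# A small taxonomy mapping taxa to the classes we want to learn
# (assumed mapping, for illustration only).
taxon_to_class = {"B. rara": "Brachiopoda", "Taxon X": "Porifera"}

# Extracted figure images, keyed by figure id (paths are placeholders).
figure_images = {"Fig. 387": "img/fig387.png", "Fig. 12": "img/fig12.png"}

def build_training_set(figure_taxon, taxon_to_class, figure_images):
    """Join extractions with images to produce (image, label) pairs."""
    examples = []
    for fig_id, taxon in figure_taxon:
        if fig_id in figure_images and taxon in taxon_to_class:
            examples.append((figure_images[fig_id], taxon_to_class[taxon]))
    return examples

print(build_training_set(figure_taxon, taxon_to_class, figure_images))
# → [('img/fig387.png', 'Brachiopoda'), ('img/fig12.png', 'Porifera')]
```

The resulting (image, label) pairs are what the CNN is trained on; human labels are used only for the final test.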
Thank You
deepdive.stanford.edu
github.com/HazyResearch/CaffeConTroll
Ce Zhang: czhang@cs.stanford.edu
DeepDive Group: contact.hazy@gmail.com

Ce Zhang, Postdoctoral Researcher, Stanford University at MLconf ATL - 9/18/15


Editor's Notes

  1. Hi everyone, thanks for coming. Today, I am going to talk about a system that we have been building over the last couple of years called DeepDive, and the deep learning engine for it called Caffe con Troll.
  2. DeepDive is a system we built to support a workload called knowledge base construction. I will talk more about DeepDive, but for now you can think of it as a system that takes as input unstructured documents, for example, a text document containing natural language, and outputs relations extracted from the input document. For example, given text like this, DeepDive can extract relations between rock formations and their ages and locations, as database relations. For those of you who are familiar with databases, this might sound like an information extraction task; and for those of you who are familiar with machine learning and natural language processing, this might sound like a relation extraction task. DeepDive learned a lot from these two communities. At the same time, DeepDive tries to extend this ability to other sources of input, like tables, document layout, and even figures.
  3. This focus on the diversity of input sources requires DeepDive to consume a diverse set of features, whether they are produced by user-defined functions over parse trees of natural language sentences or are image features extracted by state-of-the-art image processing algorithms like deep learning. In this talk, I will first tell you more about DeepDive and a deep learning engine we built called Caffe con Troll. These two pieces of software are currently ready for you to download and play with. Beyond these two pieces, I will also tell you about one of our ongoing directions, which tries to fuse them more tightly together.
  4. Over the last couple of years, we have learned from many of our scientist collaborators that many pressing scientific questions are macroscopic. That is, to get insights and hints into these questions, one often needs to aggregate a huge number of facts about a certain domain.
  5. One such example can be found in paleontology. When scientists want to understand questions like the impact of climate change on biodiversity, one way to get insights is to start from a collection of facts about which fossils appear in which rock formations, and the age and location of each rock formation. Given this collection of facts, they can aggregate by time to get a biodiversity curve, which essentially tells us how many distinct species existed at a given time. From this macroscopic view, geoscientists can get insights that help them understand scientific questions like the one about climate change. We can see that if we want to support this workflow from scientific facts to scientific insights, we first need to have this collection of facts ready for analysis, ideally in a structured form that we can query with different analytic tools.
  6. However, in practice, many of these facts are not currently organized in a structured way, like relational tables in a database; instead, many are published in journal articles or books that can be more than four hundred years old. In this talk, I will call a collection of scientific facts organized in structured form a knowledge base, and the process of extracting these facts from input sources knowledge base construction. As we can see here, this KBC step provides one possible starting point for scientists to understand key scientific questions.
  7. This workload does not appear only in paleontology. In our past experience, we have found that a similar KBC process can be useful for a whole range of other domains. For example, in genomics, if we could build a knowledge base relating genes, drugs, and diseases from published journal articles, it could help genomicists better understand drug repurposing or personalized medicine. If we could build a knowledge base by extracting from bad guys’ communications on the Web, we could create opportunities to make this world a better place. Now that we have seen that KBC can be a useful process, a natural question to ask is
  8. Can we just do KBC manually?
  9. Let’s stay with our paleontology example, where our target knowledge is about fossils, rock formations, their ages, and locations. Building such a knowledge base is so important that people have actually tried to build it manually. More than thirty years ago, Sepkoski manually compiled a knowledge base containing more than three thousand animal families by extracting these facts from four hundred references. This monograph alone has been cited hundreds of times and has led to many discoveries. The importance of compiling such a knowledge base was noticed by the community, and starting twenty years ago, more than 300 professional paleontologists spent more than eight continuous human years compiling a knowledge base called PaleoDB that covers more than 55 thousand references. This project is also highly successful and has led to more than 200 papers, many of them published in Nature or Science. However, this effort of manually constructing a knowledge base has its own limitations. Take one of the largest databases of geoscience-related publications: the number of references is growing at a rate of 100 thousand references per year. If we compare this rate with the total size of PaleoDB, it means it might take 16 continuous human years just to keep up to date each year. Although this estimate is pretty rough, it does show how expensive and time-consuming manual KBC can be given the amount of information being produced in recent years. This huge number of publications is not unique to geoscience; every year there are millions of new papers published in all fields of science. Motivated by the sheer amount of data that we need to extract information from, one question that we are really interested in is
  10. Can we build a machine to read for us? The word “read” can mean a lot of things for human beings, but let’s make it precise for a machine. By “read”, we mean that
  11. Can we build a machine that takes as input all these sources, like journal articles, and automatically fills in a knowledge base stored as database tables? DeepDive is the system we built to make this process easier. Over the last couple of years, we have been building this type of system for a range of domains, and let me tell you more.
  12. One application we built is called PaleoDeepDive. The goal is to extract paleobiological facts to build a higher-coverage fossil record. The input of the system is a collection of journal articles, and the output is a knowledge base containing information about fossils. For example, if we see this sentence in a journal article, we expect the system to output a tuple encoding the fact that the dinosaur T. rex appears in the Cretaceous.
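The extraction target described here can be illustrated with a toy sketch (all names are hypothetical; DeepDive's real pipeline uses statistical inference over user-defined features, not a hand-written pattern): turning a sentence into a (taxon, age) database tuple.

```python
# Toy sketch of the extraction target: one sentence in, candidate
# (taxon, geologic age) tuples out. The gazetteer and regexes are
# illustrative stand-ins, not DeepDive's actual machinery.
import re

GEOLOGIC_AGES = {"Cretaceous", "Jurassic", "Namurian"}   # tiny stand-in gazetteer

def extract_taxon_age(sentence):
    """Return candidate (taxon, age) tuples from one sentence."""
    taxa = re.findall(r"\b[A-Z]\.\s[a-z]+", sentence)    # abbreviated species names
    ages = [w for w in re.findall(r"\b[A-Z][a-z]+\b", sentence)
            if w in GEOLOGIC_AGES]
    return [(t, a) for t in taxa for a in ages]

print(extract_taxon_age("The dinosaur T. rex appears in the Cretaceous."))
# A run of this sketch yields [('T. rex', 'Cretaceous')]
```

The point of the sketch is only the shape of the output: each extracted pair becomes a row in a relational table that downstream analyses can query.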
  13. One of the most interesting aspects of PaleoDeepDive is that it extracts relations in exactly the same schema as PaleoDB, the manually curated knowledge base that I just described. This lets us compare the quality of PaleoDeepDive against professional volunteers on the same KBC task. As I just mentioned, PaleoDB is an effort by three hundred professional paleontologists who together spent 8 continuous human years. PaleoDB covers more than 55 thousand documents, 100 thousand fossil mentions, and more than 1 million relations. PaleoDeepDive, on the other hand, is a machine-curated knowledge base that used more than 2000 machine cores and 46 machine years. It processed more than 300 thousand journal articles and extracted three million fossil mentions and 2 million relations. Aggregating either knowledge base yields biodiversity curves with high correlation. We conducted double-blind experiments that asked scientists to label facts from PaleoDB or PaleoDeepDive for correctness, and found that on the same relation PaleoDeepDive achieves equal and sometimes better precision than professional human volunteers.
  14. DeepDive has been used to build similar applications across different domains. At the early stage of developing DeepDive, we were ourselves the developers of the KBC systems, with a lot of help from domain scientists. Some of these systems reached pretty high quality. The PaleoDeepDive work that I just mentioned was featured in a news article in the July issue of Nature, and we are pretty excited about it. According to that article, some geoscientists are impressed by how high the quality of our system is. We also developed a KBC system to extract information from the Web, and it produced the highest score in a popular KBC competition among 18 teams. Nowadays, KBC systems are usually developed by domain experts other than us, including applications in domains such as pharmacogenomics and applied physics. To let these domain experts use our system by themselves, DeepDive models the whole process as one large inference problem, and the only thing the user needs to do is keep providing features to the system, without worrying about what algorithm is actually running inside DeepDive. One thing these applications have in common is that at their early stages many of them could only extract information from text and tables. After we understood how to handle text and tables, we found that we needed to understand images better.
  15. Although DeepDive can currently support some textual extractions from images, this functionality could be significantly extended and improved. We are interested in figures and images because such sources are important to many scientific questions. For example, if we could build high-quality image recognition tools with DeepDive, we could automatically classify fossils into different classes or orders; we could also help genomicists automatically identify phenotypes of a given patient and use this information, together with the patient’s genotype, to decide a treatment plan. We talked with many of our scientist collaborators about what type of information they want out of images, and one of the most frequent first questions they ask is: can you guys just run deep learning on my set of images and classify them? At first, we thought this requirement should be easy to support. There are lots of awesome tools out there, like Caffe or Theano, that make it really easy to specify a deep learning task and run it. So we started to set up the machines for them, but just before we started the run, one question came up: on which machine should we run? Should we run on a lot of GPUs or just a lot of CPUs?
  16. So we looked at existing papers and systems to try to understand this question, but what we found was a somewhat mixed set of information. For some systems it is not uncommon to run 10 times faster on a GPU than on a CPU; but there are also many successful systems built in industry that take good advantage of a cluster of CPUs. Moreover, users often have access to a widely diverse set of resources: some have a GPU cluster, some have a lot of CPU cores, and some have credits on the cloud that they can spend. When we put all this information together, we got even more confused. So we decided to look more deeply into this question, and we started to investigate the difference between CPUs and GPUs when running deep learning workloads. If we compare a CPU and a GPU available on Amazon’s cloud, we can see that the difference in the number of floating point operations they can do per second is not that large. Therefore, a natural question to ask is: given that there is not a terrible gap in peak flops, can we achieve this peak? Or is there anything special about the CPU that prevents us from achieving it?
  17. Therefore, we built a very simple prototype system called Caffe con Troll to study this question. We designed Caffe con Troll to be a prototype that takes the same input as a popular deep learning framework called Caffe and produces the same output.
  18. We found that the performance of the CPU could be optimized so that we get nearly 80% of the peak flops on the CPU. On a single CPU, our implementation can be more than 5 times faster than Caffe.
  21. Actually, when we add one more CPU to the machine, we get an almost 2x speed-up, and these two 8-core Haswell CPUs can match the speed of a single GPU available on Amazon’s cloud. But a more interesting result is that under our implementation, the difference in running deep learning on CPU versus GPU is proportional to the number of floating point operations a device can provide. This gives a very simple rule of thumb to guide our users when they ask what type of device they need. Also, because the CPU is not that slow compared with the GPU, it makes sense to ask how to run deep learning on a machine with both CPUs and GPUs together. Surprisingly, achieving this speed-up is actually pretty simple.
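The rule of thumb described here amounts to simple arithmetic over peak-flops numbers (a sketch; the TFlops figures are the EC2 c4.4xlarge and g2.2xlarge numbers quoted on the earlier slide):

```python
# Sketch of the rule of thumb: once both implementations run near peak,
# the expected CPU/GPU speed difference is roughly the ratio of peak flops.
cpu_tflops = 0.7          # EC2 c4.4xlarge, single 8-core Haswell CPU (slide figure)
gpu_tflops = 1.2          # EC2 g2.2xlarge GPU (slide figure)

gap = gpu_tflops / cpu_tflops
print(f"GPU/CPU peak-flops ratio: about {gap:.1f}x")   # ~1.7x, not 10x

# Adding a second CPU roughly doubles CPU peak, closing the gap:
two_cpu_tflops = 2 * cpu_tflops                         # ~1.4 TFlops
print(f"Two CPUs vs. one GPU: about {gpu_tflops / two_cpu_tflops:.2f}x")
```

Under this model, the device question reduces to comparing peak-flops (and price-per-flop) numbers, which is exactly the guidance the talk gives users.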
  22. Recall that one of our motivations for building Caffe con Troll was to help our DeepDive users process their images. Now that we have a prototype implementation for running deep learning, how are we going to use it for knowledge base construction? The answer to this question is still under exploration, and I will just tell you about a very preliminary application that we are building.
  23. One interesting direction that we are exploring is how to combine deep learning with DeepDive. If we are able to run deep learning efficiently, how can it help DeepDive and our scientist users? One observation is that state-of-the-art image recognition methods often require a corpus with human labels. However, images without high-quality human labels might also contain valuable information. Take this page in a paleontology journal article as an example: although there are no human labels telling us the name of the fossil, the name actually appears in the same document. Therefore, one question that we are interested in is: what can we learn from images that have no human labels but are surrounded by rich information?
  24. More concretely, can we build a system that automatically reads a paleontology textbook and learns the difference between different classes of fossils, like sponges and shells? If such a system were possible, the input would be a set of journal articles, and the output would be a classifier that can distinguish between different classes of fossils. Our hypothesis is that we can extend the idea of distant supervision from text processing to automatically generate labels for image applications.
  25. To study this hypothesis, we have built a very simple prototype that uses DeepDive to extract relations between figure numbers and fossil names from the text, and uses them to label figures as training examples. We then train a convolutional neural network on these distantly generated labels. Some early results show that we can achieve pretty high accuracy on this simple task. These results are pretty preliminary, but we hope to explore similar directions further, to understand what framework we should provide so that users can build similar applications easily.
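The distant-supervision labeling step described here can be sketched as a simple join (all names, paths, and data below are hypothetical; this shows only the label-generation step, not the actual prototype or the CNN training):

```python
# Sketch of distant supervision for images: text-extracted
# (figure label -> taxon class) relations are joined with cropped figure
# images to produce CNN training labels without any human labeling.

# (1) Relations DeepDive extracted from the document text (hypothetical).
figure_taxon = {"Fig. 387": "Brachiopoda", "Fig. 102": "Porifera"}

# (2) Figure crops located in the document: (figure label, image path).
figure_images = [("Fig. 387", "crops/p29_fig387.png"),
                 ("Fig. 102", "crops/p11_fig102.png"),
                 ("Fig. 999", "crops/p40_fig999.png")]  # no extraction -> dropped

# (3) Join on the figure label to get (image, class) training examples.
training_set = [(path, figure_taxon[label])
                for label, path in figure_images if label in figure_taxon]

print(training_set)
# The joined pairs would then feed a CNN trainer such as Caffe con Troll.
```

Figures with no matching text extraction simply fall out of the join, which is the usual trade-off of distant supervision: noisier but far cheaper labels than manual annotation.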
  26. Thank you for your attention. Both systems that I talked about today can be downloaded from their websites, and we are actively working on understanding how to fuse these two systems together to support applications that require inference across text and images.