CONTEXT
SEMANTICS!
Danny : isBrotherOf : Nezih
food cart : uses : bicycles
Frank : isFriendsWith : Mohit
Frank : isFriendsWith : Ted
Frank : likes : bicycles
Frank : likes : food carts
Ivy : isFriendsWith : Kushal
Ivy : isFriendsWith : Ted
Ivy : likes : bicycles
Ivy : likes : food carts
Kushal : isFriendsWith : Mohit
Kushal : isFriendsWith : Nezih
Nezih : isFriendsWith : Ted
Ted : likes : bicycles
This model... ... infers this interest.
[Figure: graph of the triples above. Nodes: Ted, Kushal, Mohit, Danny, Ivy, Frank, Nezih, Bicycles, Food Cart. Edges: friends, brothers, likes, uses, plus one inferred edge labeled "Likes?".]
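A minimal sketch of how the triples above could support that kind of inference, assuming a simple vote over what a person's friends like. The slide does not say which inference method the model uses, and `build_index` and `suggest_interests` are hypothetical helper names.

```python
from collections import Counter, defaultdict

# Triples copied from the slide: (subject, predicate, object).
TRIPLES = [
    ("Danny", "isBrotherOf", "Nezih"),
    ("food cart", "uses", "bicycles"),
    ("Frank", "isFriendsWith", "Mohit"),
    ("Frank", "isFriendsWith", "Ted"),
    ("Frank", "likes", "bicycles"),
    ("Frank", "likes", "food carts"),
    ("Ivy", "isFriendsWith", "Kushal"),
    ("Ivy", "isFriendsWith", "Ted"),
    ("Ivy", "likes", "bicycles"),
    ("Ivy", "likes", "food carts"),
    ("Kushal", "isFriendsWith", "Mohit"),
    ("Kushal", "isFriendsWith", "Nezih"),
    ("Nezih", "isFriendsWith", "Ted"),
    ("Ted", "likes", "bicycles"),
]


def build_index(triples):
    """Index friendships (treated as symmetric, an assumption) and likes."""
    friends = defaultdict(set)
    likes = defaultdict(set)
    for subject, predicate, obj in triples:
        if predicate == "isFriendsWith":
            friends[subject].add(obj)
            friends[obj].add(subject)
        elif predicate == "likes":
            likes[subject].add(obj)
    return friends, likes


def suggest_interests(person, triples):
    """Rank things the person's friends like that the person does not yet like."""
    friends, likes = build_index(triples)
    votes = Counter()
    for friend in friends[person]:
        for item in likes[friend] - likes[person]:
            votes[item] += 1
    return votes.most_common()


print(suggest_interests("Ted", TRIPLES))  # [('food carts', 2)]
```

Under this voting assumption, Ted's only candidate interest is food carts, supported by Frank and Ivy.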
Virtuous cycle of data
CLOUD
Richer data to analyze
CLIENTS
Richer data from devices
Richer user experiences
INTELLIGENT SYSTEMS
SEMANTIC INFORMATION IS FUEL FOR THE CYCLE
[Timeline: 1985, 1995, 2005, 2015]
enterprise NoSQL Docs + Semantics RDF
WIDESPREAD MACHINE LEARNING ON THIS
IMAGINE THE POSSIBILITIES
Graph centrality
Program Importance (Centrality): High to Low
Graph of channel viewing behavior
Current popular surfing patterns
SH002463130000 EP005544723744
Changes in surfing behavior may predict customer churn.
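As an illustration of scoring program importance on a viewing graph, here is a minimal sketch that uses PageRank by power iteration. The slide does not name the centrality measure, so PageRank is an assumption, and the channel names and switch edges are hypothetical.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over directed (source, target) edges."""
    nodes = sorted({node for edge in edges for node in edge})
    outgoing = {node: [] for node in nodes}
    for source, target in edges:
        outgoing[source].append(target)

    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        updated = {node: (1.0 - damping) / len(nodes) for node in nodes}
        for node in nodes:
            # A node with no outgoing edges spreads its rank evenly.
            targets = outgoing[node] or nodes
            share = damping * rank[node] / len(targets)
            for target in targets:
                updated[target] += share
        rank = updated
    return rank


# Hypothetical channel-surfing transitions: viewer switched from A to B.
SWITCHES = [
    ("news", "sports"),
    ("movies", "sports"),
    ("kids", "sports"),
    ("sports", "news"),
]

ranks = pagerank(SWITCHES)
most_central = max(ranks, key=ranks.get)
```

Recomputing the ranks on successive time windows and comparing them is one possible way to look for the changes in surfing behavior that the slide mentions.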
Preference and Similarity Recommendations
Node types: User, Movie
1.7MM Nodes
23.9MM Edges
Edge types: prefers, similar cast, similar topic
userId: A0A22A5
title: The Godfather, genre: Crime drama, cast: [M. Brando, Al Pacino]
title: Scarface, genre: Crime drama, cast: [Al Pacino, M. Pfeiffer]
title: The Departed, genre: Crime drama, cast: [L. DiCaprio, M. Damon]
Edge weights shown: 11.8, 0.67, 0.03, 14.98
Min-cost path search
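A minimal sketch of min-cost path search for recommendations, using Dijkstra's algorithm on a tiny graph. The slide lists four edge weights but the extracted text does not say which edge carries which weight, so the assignment below is an assumption, as are the helper names `min_cost_paths` and `recommend`.

```python
import heapq


def min_cost_paths(graph, source):
    """Dijkstra's algorithm: cheapest path cost from source to every reachable node."""
    costs = {source: 0.0}
    queue = [(0.0, source)]
    while queue:
        cost, node = heapq.heappop(queue)
        if cost > costs.get(node, float("inf")):
            continue  # stale queue entry
        for neighbor, weight in graph.get(node, {}).items():
            new_cost = cost + weight
            if new_cost < costs.get(neighbor, float("inf")):
                costs[neighbor] = new_cost
                heapq.heappush(queue, (new_cost, neighbor))
    return costs


# Hypothetical assignment of the slide's weights to edges (lower = closer).
GRAPH = {
    "user:A0A22A5": {"The Godfather": 0.67},                    # prefers
    "The Godfather": {"Scarface": 0.03, "The Departed": 11.8},  # similar cast / topic
    "Scarface": {"The Departed": 14.98},                        # similar topic
}


def recommend(graph, user):
    """Rank movies the user does not already prefer by min path cost."""
    costs = min_cost_paths(graph, user)
    already_preferred = set(graph.get(user, {}))
    candidates = [
        (cost, title)
        for title, cost in costs.items()
        if title != user and title not in already_preferred
    ]
    return [title for _, title in sorted(candidates)]


print(recommend(GRAPH, "user:A0A22A5"))  # ['Scarface', 'The Departed']
```

With these assumed weights, Scarface ranks first because the shared-cast edge from The Godfather is very cheap.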
URL Ground-Truth Data
IP/Domain Reputations
420MM Records
74.5MM Nodes
185MM Edges
URL
Domain
IP Address
Calculation of priors
LBP Messaging
Loopy Belief Propagation on the (semantic) web
84.231.82.93
86.39.155.137
forum.vsichko.com
hermansonskok.se
euskzzbz.nonetheups.com
keesenbep.spaces.live.com
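The two steps named above, calculation of priors and LBP messaging, can be sketched as a small sum-product loopy belief propagation over a binary good/bad label. This is an illustrative sketch, not the deck's implementation: the node names, priors, and the `coupling` parameter are hypothetical, and a single symmetric compatibility table is assumed for every edge.

```python
def loopy_bp(priors, edges, coupling=0.7, iterations=20):
    """Sum-product loopy BP on a pairwise model with states 0 (good) and 1 (bad).

    priors: node -> prior probability of being bad, strictly between 0 and 1.
    edges: undirected (a, b) pairs. coupling is the assumed probability that
    two linked nodes share the same label.
    """
    neighbors = {node: set() for node in priors}
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)

    psi = [[coupling, 1.0 - coupling], [1.0 - coupling, coupling]]
    phi = {node: [1.0 - p, p] for node, p in priors.items()}
    messages = {(a, b): [0.5, 0.5] for a in neighbors for b in neighbors[a]}

    for _ in range(iterations):
        updated = {}
        for source, target in messages:
            out = [0.0, 0.0]
            for x_source in (0, 1):
                incoming = phi[source][x_source]
                for other in neighbors[source] - {target}:
                    incoming *= messages[(other, source)][x_source]
                for x_target in (0, 1):
                    out[x_target] += incoming * psi[x_source][x_target]
            total = out[0] + out[1]
            updated[(source, target)] = [out[0] / total, out[1] / total]
        messages = updated

    beliefs = {}
    for node in neighbors:
        belief = list(phi[node])
        for other in neighbors[node]:
            for x in (0, 1):
                belief[x] *= messages[(other, node)][x]
        beliefs[node] = belief[1] / (belief[0] + belief[1])
    return beliefs


# Hypothetical example: one URL with ground truth "bad", two unknown neighbors.
PRIORS = {"url:bad.example.test/x": 0.95, "domain:example.test": 0.5, "ip:192.0.2.10": 0.5}
EDGES = [
    ("url:bad.example.test/x", "domain:example.test"),
    ("domain:example.test", "ip:192.0.2.10"),
    ("ip:192.0.2.10", "url:bad.example.test/x"),
]

beliefs = loopy_bp(PRIORS, EDGES)
```

The three nodes form a loop, so the result is an approximation. The unknown domain and IP address end up with a belief above 0.5 because they are linked to a URL with a bad prior.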
Loopy Belief Propagation on the (semantic) web
A yoga ball graph. Really!?!
You may actually need this
• When the problem is an information network
• When a graph is a natural way of expressing the algorithm
• When you want to study specific relationships
• When you want faster machine learning or solvers on sparse data
shortest path
central
influence
sub networks
triangle count
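To make one of the listed graph measures concrete, here is a minimal sketch of triangle counting on an undirected graph. `count_triangles` is a hypothetical helper name, the first edge list is made up, and the second reuses the isFriendsWith triples from the opening slide.

```python
from itertools import combinations


def count_triangles(edges):
    """Count sets of three nodes that are all connected to each other."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    corner_hits = 0
    for adjacent in neighbors.values():
        # Each pair of neighbors that is itself linked closes a triangle.
        for u, v in combinations(sorted(adjacent), 2):
            if v in neighbors[u]:
                corner_hits += 1
    return corner_hits // 3  # every triangle is seen once per corner


# Hypothetical graph with exactly one triangle (a, b, c).
SAMPLE_EDGES = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]

# Friendship edges from the opening slide.
FRIEND_EDGES = [
    ("Frank", "Mohit"), ("Frank", "Ted"), ("Ivy", "Kushal"), ("Ivy", "Ted"),
    ("Kushal", "Mohit"), ("Kushal", "Nezih"), ("Nezih", "Ted"),
]

print(count_triangles(SAMPLE_EDGES))  # 1
print(count_triangles(FRIEND_EDGES))  # 0
```

The friendship graph from the opening slide contains no triangles: no two friends of the same person are listed as friends of each other.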
But there are challenges.
Handling all that data.
Finding people good at both handling all that data and data analysis.
Putting exploratory work into production fast enough to keep up with the competition.
Congratulations! You are a data scientist!
It’s a demanding job
Ingest & Clean
Engineer Features
Structure Model
Train Model
Query & Analyze
Learn
Visualize
Skills shortage at intersection of systems engineering and data analysis
Painful data ingestion and preparation
Workflows that are not designed with loopbacks in mind
Few tools for analyzing semantics at scale
Composing pipeline is DIY
Decomposing the “data scientist”
Source: 2013 Report from Accenture Institute for High Performance
IMAGINE A PLATFORM FOR DATA SCIENTISTS
DOCS + SEMANTICS + MACHINE LEARNING
Ease-of-use: Making big data familiar
Python
R
Dataflow
GUI
...
Datacenter / Cloud, Network, Client
BIG DATA API
Connect, Manage, Secure, Analyze (distributed and parallel)
Manage, Secure, Connect, Analyze (local)
Query
Big Data Java/Scala/C++
Computational Frameworks
Big Data Algorithms
Cluster Workload Mgmt
Cluster Storage
Machine Learning & Statistics
Data Wrangling
Analyst Skills
The Other Skills
Delivering it
FILESYSTEMS AND NOSQL STORAGE
HW PLATFORM
APACHE HADOOP, APACHE SPARK
DATA WRANGLING
MACHINE LEARNING AND STATISTICS
Graphical Algorithms
Classical Algorithms
Graph Construction Tools
Useful String Manipulation
Useful Math Operators
BIG DATA API
DATA SCIENCE SERVER (Query and Scripting)
Intel Analytics Toolkit
A UNIFIED DOCUMENT + SEMANTIC STORE
The Ask
Bringing a full spectrum of possibilities
Approach (as labeled on the slide): Graph
Algorithm | Category | Applications/Use Cases
Loopy Belief Propagation (LBP) | Structured Prediction | Personalized recs, image de-noising
Label Propagation | Structured Prediction | Personalized recommendations
Alternating Least Squares (ALS) | Collaborative Filtering | Recommenders
Conjugate Gradient Descent (CGD) | Collaborative Filtering | Recommenders
Connected Components | Graph Analytics | Network manipulation, image analysis
Latent Dirichlet Allocation (LDA) | Topic Modeling | Document Clustering
Structure Attribute | Clustering | Network analysis, consumer seg
K-Truss | Clustering | Social network analysis
KNN* | Clustering | Recommenders
Logistic Regression* | Classification | Fraud detection
Random Forest* | Classification | Fraud detection, consumer seg
Generalized Linear Model (Binomial, Poisson) | Non-linear Curve Fitting | Forecasting, pricing, market mix models
Association Rule Mining | Data Mining | Market basket analysis, recommenders
Frequent Pattern Mining* | Data Mining | Pattern Recognition
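As a small illustration of one entry in the table, Connected Components, here is a minimal breadth-first sketch. It is not the toolkit's implementation; `connected_components` and the sample edges are hypothetical.

```python
from collections import deque


def connected_components(edges):
    """Group the nodes of an undirected graph into connected components."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    seen = set()
    components = []
    for start in sorted(neighbors):
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        component = []
        while queue:
            node = queue.popleft()
            component.append(node)
            for neighbor in neighbors[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        components.append(sorted(component))
    return components


# Hypothetical edges forming two separate groups.
SAMPLE_EDGES = [("a", "b"), ("b", "c"), ("d", "e")]
print(connected_components(SAMPLE_EDGES))  # [['a', 'b', 'c'], ['d', 'e']]
```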
Article Tagging Problem
• Articles are tagged by experts with MeSH terms, drawn from a hierarchical controlled vocabulary of 55,000 keywords
• Process is resource-intensive – can we automate it?
• Categorize articles into a hierarchy that matches the same categorization from the MeSH controlled vocabulary
[Chart: Article Count by Hierarchy Level]
Demo: Graph Analytics For Medical Journal Analysis
INGEST & CLEAN → ENGINEER FEATURES → STRUCTURE GRAPH → QUERY & ANALYZE → LEARN → VISUALIZE
PARSE AND EXTRACT WORDS → CREATE ARTICLE/WORD LIST → BUILD GRAPH → QUERY/VISUALIZE DATA → DETECT CLUSTERS USING LDA
• Medline™ XML
• MeSH Ontology XML
• Create list of unique words
• Stemming and lemmatization
• Index word list
• Transform articles into list of article/word pairs
• Extract vertices
• Assign id columns to vertex property
• Assign year and count edge properties
• Gremlin query for each visual
• Python web server and other libraries
• Select optimization parameters
• Invoke LDA
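The word-list steps above (unique words, stemming, indexing, article/word pairs) can be sketched in plain Python as follows. This is a simplified illustration: `crude_stem` is a hypothetical stand-in for real stemming and lemmatization, the article texts are made up, and the demo's actual parsing of Medline XML is not shown.

```python
import re
from collections import Counter


def crude_stem(word):
    """Very rough suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokenize(text):
    """Lowercase, split on non-letters, and stem each word."""
    return [crude_stem(word) for word in re.findall(r"[a-z]+", text.lower())]


def build_word_index(articles):
    """Assign a stable integer id to every unique stemmed word."""
    vocabulary = sorted({word for text in articles.values() for word in tokenize(text)})
    return {word: index for index, word in enumerate(vocabulary)}


def article_word_pairs(articles, word_index):
    """Emit (article_id, word_id, count) rows, one per article/word pair."""
    pairs = []
    for article_id, text in articles.items():
        counts = Counter(tokenize(text))
        for word, count in sorted(counts.items()):
            pairs.append((article_id, word_index[word], count))
    return pairs


# Hypothetical article texts keyed by a hypothetical article id.
ARTICLES = {
    "a1": "Sleep stages and dreams",
    "a2": "Dreaming during REM sleep",
}

word_index = build_word_index(ARTICLES)
pairs = article_word_pairs(ARTICLES, word_index)
```

Here "dreams" and "Dreaming" collapse to the same stem, so both articles point at the same word id, which is what later lets the graph connect them.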
The Playbook?
PARSE AND EXTRACT WORDS → CREATE ARTICLE/WORD LIST → BUILD GRAPH → QUERY/VISUALIZE DATA → DETECT CLUSTERS USING LDA
Parse → Prepare graph data → Basic analysis → Run LDA → INSIGHTFUL RESULT
This never happens!
The Real Playbook
PARSE AND EXTRACT WORDS → CREATE ARTICLE/WORD LIST → BUILD GRAPH → QUERY/VISUALIZE DATA → DETECT CLUSTERS USING LDA
Parse
Correct mistake
Prepare graph data
Correct schema mistake
Correct aggregation mistake
Data validation
Correct dataset mistake
Guess LDA settings
Tune and re-run
Detect bias in dataset
WE NEED THE AGILITY OF INTERACTIVE SCRIPTING AND THE BRAINS AND BRAWN OF SCALABLE GRAPH ANALYTICS
Build Frame
Build Graph
Query Vertices
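The code shown on these slides was not captured in the text. As a rough stand-in, the sketch below builds article and word vertices with id properties, adds edges carrying year and count properties, and runs a simple vertex query. It uses plain Python dictionaries rather than the toolkit's API, and every name and record in it is hypothetical.

```python
def build_graph(records):
    """Build vertices and edges from (article_id, year, word, count) records."""
    vertices = {}
    edges = []

    def vertex_id(kind, name):
        # One vertex per distinct article or word, keyed by a typed id.
        vid = f"{kind}:{name}"
        vertices.setdefault(vid, {"id": vid, "type": kind, "name": name})
        return vid

    for article_id, year, word, count in records:
        source = vertex_id("article", article_id)
        target = vertex_id("word", word)
        edges.append({"src": source, "dst": target, "year": year, "count": count})
    return vertices, edges


def query_vertices(vertices, **properties):
    """Return vertices whose properties match all the given values."""
    return [
        vertex
        for vertex in vertices.values()
        if all(vertex.get(key) == value for key, value in properties.items())
    ]


# Hypothetical records: article id, publication year, word, occurrence count.
RECORDS = [
    ("a1", 1999, "sleep", 2),
    ("a1", 1999, "dream", 1),
    ("a2", 2001, "sleep", 1),
]

vertices, edges = build_graph(RECORDS)
word_names = sorted(vertex["name"] for vertex in query_vertices(vertices, type="word"))
```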
LDA with 3 Topics
LDA with 5 Topics
LDA with 7 Topics
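To show what re-running LDA with different topic counts looks like in code, here is a small collapsed Gibbs sampler in plain Python. It is a toy sketch under stated assumptions: the documents, the hyperparameters `alpha` and `beta`, and the function name `lda_gibbs` are hypothetical, and the demo itself runs LDA at scale on a graph rather than with this sampler.

```python
import random


def lda_gibbs(docs, num_topics, iterations=50, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA; returns per-document topic mixtures."""
    rng = random.Random(seed)
    vocabulary = sorted({word for doc in docs for word in doc})
    word_id = {word: index for index, word in enumerate(vocabulary)}
    vocab_size = len(vocabulary)

    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [[0] * vocab_size for _ in range(num_topics)]
    topic_total = [0] * num_topics
    assignments = []

    # Random initial topic for every word occurrence.
    for d, doc in enumerate(docs):
        topics = []
        for word in doc:
            z = rng.randrange(num_topics)
            topics.append(z)
            doc_topic[d][z] += 1
            topic_word[z][word_id[word]] += 1
            topic_total[z] += 1
        assignments.append(topics)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                w = word_id[word]
                z = assignments[d][i]
                # Remove the current assignment, then resample it.
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1

    return [
        [(count + alpha) / (len(doc) + num_topics * alpha) for count in counts]
        for counts, doc in zip(doc_topic, docs)
    ]


# Hypothetical tokenized articles.
DOCS = [
    ["sleep", "dream", "rem", "sleep"],
    ["rats", "conditioning", "rats", "attention"],
    ["sleep", "arousal", "wakefulness"],
]

# Re-run with the topic counts shown on the slides.
mixtures = {k: lda_gibbs(DOCS, num_topics=k) for k in (3, 5, 7)}
```

Each entry of `mixtures` holds one topic distribution per document, which is the kind of per-vertex property that could then be written back and queried.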
Query Vertices Again – Now with ML Properties
Following Analysis
[Bar chart: Top MeSH terms that predict which category an article will be assigned. Horizontal axis runs from 0 to 0.08.]
Terms shown: Wakefulness, Sleep, Animals, Electroencephalography, Circadian Rhythm, Arousal, Sleep Stages, REM, Mental Recall, Attention, Rats, Child, Evoked Potentials, Aged, Schizophrenia, Ocular, Conditioning, Infant, Psychophysics, Dreams
Reimagining 2014
New partnerships in big data
Contributions to the open source community
The Intel Analytics Toolkit – COMING SOON
SEMANTICS + MACHINE LEARNING
TOGETHER AT LAST!
INTERESTED IN THE INTEL ANALYTICS TOOLKIT?
THEODORE.L.WILLKE@INTEL.COM
Legal Disclaimers
All products, computer systems, dates, and figures specified are preliminary based on current expectations, and are subject to change without
notice.
Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor family, not across
different processor families. Go to: http://www.intel.com/products/processor_number
Intel, processors, chipsets, and desktop boards may contain design defects or errors known as errata, which may cause the product to deviate
from published specifications. Current characterized errata are available on request.
Intel® Virtualization Technology requires a computer system with an enabled Intel® processor, BIOS, virtual machine monitor
(VMM). Functionality, performance or other benefits will vary depending on hardware and software configurations. Software applications may
not be compatible with all operating systems. Consult your PC manufacturer. For more information, visit http://www.intel.com/go/virtualization
No computer system can provide absolute security under all conditions. Intel® Trusted Execution Technology (Intel® TXT) requires a computer
system with Intel® Virtualization Technology, an Intel TXT-enabled processor, chipset, BIOS, Authenticated Code Modules and an Intel TXT-
compatible measured launched environment (MLE). Intel TXT also requires the system to contain a TPM v1.s. For more information, visit
http://www.intel.com/technology/security
Intel, Intel Xeon, Intel Atom, Intel Xeon Phi, Intel Itanium, the Intel Itanium logo, the Intel Xeon Phi logo, the Intel Xeon logo and the Intel logo are
trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.
Other names and brands may be claimed as the property of others.
Copyright © 2013, Intel Corporation. All rights reserved.

MLconf NYC Ted Willke

  • 2. Danny : isBrotherOf : Nezih food cart : uses : bicycles Frank : isFriendsWith : Mohit Frank : isFriendsWith : Ted Frank : likes : bicycles Frank : likes : food carts Ivy : isFriendsWith : Kushal Ivy : isFriendsWith : Ted Ivy : likes : bicycles Ivy : likes : food carts Kushal : isFriendsWith : Mohit Kushal : isFriendsWith : Nezih Nezih : is FriendsWith : Ted Ted : likes : bicycles
  • 3. This model... ... infers this interest. Ted Kushal Mohit Danny Ivy Frank Nezih friends friends friends brothers friends friends friends friends Food Cart likes likes likesBicycles likes likes likes uses Likes?
  • 4. Virtuous cycle of data CLOUD Richer data to analyze CLIENTS Richer data from devices Richer user experiences INTELLIGENT SYSTEMS
  • 6. 1985 1995 2005 2015 enterprise NoSQL Docs + Semantics RDF WIDESPREAD MACHINE LEARNING ON THIS
  • 8. Graph centrality High Program Importance (Centrality) Low Graph of channel viewing behavior Current popular surfing patterns SH002463130000 EP005544723744 Changes in surfing behavior may predict customer churn.
  • 9. Preference and Similarity Recommendations User Movie 1.7MM Nodes 23.9MM Edges similar cast prefers similar topic userId: A0A22A5 title: The Godfather genre: Crime drama cast: [M. Brando, Al Pacino] title: Scarface genre: Crime drama cast: [Al Pacino, M. Pfeiffer] title: The Departed genre: Crime drama cast: [L. DiCaprio, M. Damon] weight=11.8 weight=0.67 weight=0.03 weight=14.98 Min-cost path search
  • 10. 10 URL Ground-Truth Data IP/Domain Reputations 420MM Records 74.5MM Nodes 185MM Edges URL Domain IP Address Calculation of priors LBP Messaging Loopy Belief Propagation on the (semantic) web 84.231.82.93 86.39.155.137 forum.vsichko.com hermansonskok.se euskzzbz.nonetheups.com keesenbep.spaces.live.com
  • 11. Loopy Belief Propagation on the (semantic) web
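Slide 10 lists two steps, "Calculation of priors" and "LBP Messaging", without further detail. The following is a minimal sketch of sum-product loopy belief propagation on a binary pairwise model, assuming each node carries a prior `[P(benign), P(malicious)]` and that linked nodes tend to share a label. The node names, the priors, and the 0.9 agreement potential are all hypothetical.

```python
def loopy_bp(priors, edges, same=0.9, iterations=20):
    """Sum-product LBP for binary labels; returns normalized beliefs per node."""
    neighbours = {n: set() for n in priors}
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)

    def psi(x, y):
        # Edge potential: linked nodes are assumed more likely to agree.
        return same if x == y else 1.0 - same

    # One message per directed edge, initialized to uniform.
    msgs = {(a, b): [0.5, 0.5] for a in neighbours for b in neighbours[a]}
    for _ in range(iterations):
        new_msgs = {}
        for (i, j) in msgs:
            out = []
            for xj in (0, 1):
                total = 0.0
                for xi in (0, 1):
                    term = priors[i][xi] * psi(xi, xj)
                    for k in neighbours[i]:
                        if k != j:  # exclude the message that came from j
                            term *= msgs[(k, i)][xi]
                    total += term
                out.append(total)
            z = out[0] + out[1]
            new_msgs[(i, j)] = [out[0] / z, out[1] / z]
        msgs = new_msgs

    beliefs = {}
    for i in priors:
        belief = list(priors[i])
        for k in neighbours[i]:
            for x in (0, 1):
                belief[x] *= msgs[(k, i)][x]
        z = sum(belief)
        beliefs[i] = [v / z for v in belief]
    return beliefs


# Hypothetical priors: one IP with a bad reputation, two nodes with no evidence.
priors = {"ip-a": [0.05, 0.95], "domain-a": [0.5, 0.5], "url-a": [0.5, 0.5]}
edges = [("ip-a", "domain-a"), ("domain-a", "url-a"), ("url-a", "ip-a")]
beliefs = loopy_bp(priors, edges)
print({name: round(b[1], 3) for name, b in beliefs.items()})
```

The three nodes form a cycle, so the result is an approximation; on graphs with loops, LBP is not guaranteed to converge or to return exact marginals.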
  • 13. You may actually need this • When the problem is an information network • When a graph is a natural way of expressing the algorithm • When you want to study specific relationships • When you want faster machine learning or solvers on sparse data shortest path central influence sub networks triangle count
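Of the graph measures listed on this slide, triangle count is the shortest to show. The sketch below counts triangles in an undirected graph by intersecting neighbour sets. The edge list is a hypothetical toy example.

```python
from collections import defaultdict


def count_triangles(edges):
    """Count triangles in an undirected simple graph given as (a, b) pairs."""
    adjacency = defaultdict(set)
    for a, b in edges:
        if a != b:  # ignore self-loops
            adjacency[a].add(b)
            adjacency[b].add(a)

    total = 0
    for a in adjacency:
        for b in adjacency[a]:
            if a < b:
                # Count each triangle once by requiring a < b < c.
                total += sum(1 for c in adjacency[a] & adjacency[b] if b < c)
    return total


toy_edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]
triangles = count_triangles(toy_edges)
print(triangles)  # 1: only a-b-c closes a triangle
```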
  • 14. But there are challenges. Handling all that data. Finding people good at both handling all that data and data analysis. Putting exploratory work into production fast enough to keep up with the competition.
  • 16. It’s a demanding job. Stages: Ingest & Clean, Engineer Features, Structure Model, Train Model, Query & Analyze, Learn, Visualize. Pain points: skills shortage at the intersection of systems engineering and data analysis; painful data ingestion and preparation; workflows that are not designed with loopbacks in mind; few tools for analyzing semantics at scale; composing the pipeline is DIY.
  • 17. Decomposing the “data scientist” Source: 2013 Report from Accenture Institute for High Performance
  • 18. IMAGINE A PLATFORM FOR DATA SCIENTISTS DOCS + SEMANTICS + MACHINE LEARNING
  • 19. Ease-of-use: Making big data familiar Python R Dataflow GUI ... Datacenter / Cloud Network Client BIG DATA API Connect Manage Secure Analyze distributed and parallel Manage Secure Connect Analyze local Query Big Data Java/Scala/C++ Computational Frameworks Big Data Algorithms Cluster Workload Mgmt Cluster Storage Machine Learning & Statistics Data Wrangling Analyst Skills The Other Skills
  • 20. Delivering it FILESYSTEMS AND NOSQL STORAGE HW PLATFORM APACHE HADOOP APACHE SPARK DATA WRANGLING MACHINE LEARNING AND STATISTICS Graphical Algorithms Classical Algorithms Graph Construction Tools Useful String Manipulation Useful Math Operators BIG DATA API DATA SCIENCE SERVER (Query and Scripting) Intel Analytics Toolkit A UNIFIED DOCUMENT + SEMANTIC STORE The Ask
  • 21. Bringing a full spectrum of possibilities
      Approach (as labeled on the slide): Graph
      Algorithm | Category | Applications/Use Cases
      Loopy Belief Propagation (LBP) | Structured Prediction | Personalized recs, image de-noising
      Label Propagation | Structured Prediction | Personalized recommendations
      Alternating Least Squares (ALS) | Collaborative Filtering | Recommenders
      Conjugate Gradient Descent (CGD) | Collaborative Filtering | Recommenders
      Connected Components | Graph Analytics | Network manipulation, image analysis
      Latent Dirichlet Allocation (LDA) | Topic Modeling | Document Clustering
      Structure Attribute | Clustering | Network analysis, consumer seg
      K-Truss | Clustering | Social network analysis
      KNN* | Clustering | Recommenders
      Logistic Regression* | Classification | Fraud detection
      Random Forest* | Classification | Fraud detection, consumer seg
      Generalized Linear Model (Binomial, Poisson) | Non-linear Curve Fitting | Forecasting, pricing, market mix models
      Association Rule Mining | Data Mining | Market basket analysis, recommenders
      Frequent Pattern Mining* | Data Mining | Pattern Recognition
  • 22. Article Tagging Problem • Articles are tagged by experts with MeSH terms, drawn from a hierarchical controlled vocabulary of 55,000 keywords • Process is resource-intensive – can we automate it? • Categorize articles into a hierarchy that matches the same categorization from the MeSH controlled vocabulary
  • 24. Demo: Graph Analytics For Medical Journal Analysis INGEST & CLEAN ENGINEER FEATURES STRUCTURE GRAPH QUERY & ANALYZE LEARN VISUALIZE PARSE AND EXTRACT WORDS CREATE ARTICLE/ WORD LIST BUILD GRAPH QUERY/ VISUALIZE DATA DETECT CLUSTERS USING LDA • Medline™ XML • MeSH Ontology XML • Create list of unique words • Stemming and lemmatization • Index word list • Transform articles into list of article/word pairs • Extract vertices • Assign id columns to vertex property • Assign year and count edge properties • Gremlin query for each visual • Python web server and other libraries • Select optimization parameters • Invoke LDA
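The demo's data-preparation steps (unique word list, word index, article/word pairs, and year and count edge properties) can be sketched in a few lines. This is not the toolkit's code: the function names, the regex tokenizer, and the two sample articles are hypothetical, and stemming and lemmatization are left out.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphabetic tokens (no stemming in this sketch)."""
    return re.findall(r"[a-z]+", text.lower())


def build_article_word_edges(articles):
    """Turn {article_id: (year, text)} into a word index and a list of edges.

    Each edge links an article vertex to a word vertex and carries the
    year and count properties mentioned on the slide.
    """
    vocabulary = {}
    edges = []
    for article_id, (year, text) in sorted(articles.items()):
        counts = Counter(tokenize(text))
        for word, count in sorted(counts.items()):
            word_id = vocabulary.setdefault(word, len(vocabulary))
            edges.append({
                "article": article_id,
                "word": word,
                "word_id": word_id,
                "year": year,
                "count": count,
            })
    return vocabulary, edges


# Hypothetical stand-ins for parsed Medline records.
articles = {
    "a1": (2001, "Sleep and REM sleep"),
    "a2": (2003, "Attention in sleep"),
}
vocabulary, edges = build_article_word_edges(articles)
for edge in edges:
    print(edge)
```

An edge list of this shape is the bipartite article/word graph that a topic model such as LDA takes as input.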
  • 25. The Playbook? PARSE AND EXTRACT WORDS CREATE ARTICLE/ WORD LIST BUILD GRAPH QUERY/ VISUALIZE DATA DETECT CLUSTERS USING LDA Parse Prepare graph data Basic analysis Run LDA INSIGHTFUL RESULT This never happens!
  • 26. The Real Playbook PARSE AND EXTRACT WORDS CREATE ARTICLE/ WORD LIST BUILD GRAPH QUERY/ VISUALIZE DATA DETECT CLUSTERS USING LDA Parse Correct mistake Prepare graph data Correct schema mistake Correct aggregation mistake Data validation Correct dataset mistake Guess LDA settings Tune and re-run Detect bias in dataset
  • 27. WE NEED THE AGILITY OF INTERACTIVE SCRIPTING AND THE BRAINS AND BRAWN OF SCALABLE GRAPH ANALYTICS
  • 31. LDA with 3 Topics
  • 33. LDA with 7 Topics
  • 34. Query Vertices Again – Now with ML Properties
  • 35. Following Analysis. [Bar chart, values from 0 to 0.08] Top MeSH terms that predict which category an article will be assigned. Chart labels: Wakefulness Sleep Animals Electroencephalography Circadian Rhythm Arousal Sleep Stages REM Mental Recall Attention Rats Child Evoked Potentials Aged Schizophrenia Ocular Conditioning Infant Psychophysics Dreams
  • 36. Reimagining 2014 New partnerships in big data Contributions to the open source community The Intel Analytics Toolkit – COMING SOON SEMANTICS + MACHINE LEARNING TOGETHER AT LAST!
  • 37. INTERESTED IN THE INTEL ANALYTICS TOOLKIT? THEODORE.L.WILLKE@INTEL.COM
  • 39. Legal Disclaimers All products, computer systems, dates, and figures specified are preliminary based on current expectations, and are subject to change without notice. Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor family, not across different processor families. Go to: http://www.intel.com/products/processor_number Intel, processors, chipsets, and desktop boards may contain design defects or errors known as errata, which may cause the product to deviate from published specifications. Current characterized errata are available on request. Intel® Virtualization Technology requires a computer system with an enabled Intel® processor, BIOS, virtual machine monitor (VMM). Functionality, performance or other benefits will vary depending on hardware and software configurations. Software applications may not be compatible with all operating systems. Consult your PC manufacturer. For more information, visit http://www.intel.com/go/virtualization No computer system can provide absolute security under all conditions. Intel® Trusted Execution Technology (Intel® TXT) requires a computer system with Intel® Virtualization Technology, an Intel TXT-enabled processor, chipset, BIOS, Authenticated Code Modules and an Intel TXT- compatible measured launched environment (MLE). Intel TXT also requires the system to contain a TPM v1.s. For more information, visit http://www.intel.com/technology/security Intel, Intel Xeon, Intel Atom, Intel Xeon Phi, Intel Itanium, the Intel Itanium logo, the Intel Xeon Phi logo, the Intel Xeon logo and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. Other names and brands may be claimed as the property of others. Copyright © 2013, Intel Corporation. All rights reserved.