Streaming Hypothesis Reasoning - William Smith, Jan 2016
1. SHyRe
Streaming Hypothesis Reasoning
WILLIAM SMITH, PATRICK PAULSON, MARK BORKUM,
DEBORAH MCGUINNESS, BRENDA PRAGGASTIS, RUI YAN, YUE LIU
DAML 2016 – Seattle, WA
Smart Data Conference, 2015 – San Jose, California
January 26, 2016
The legends PROTECTED INFORMATION and PROPRIETARY INFORMATION apply to information describing Subject Inventions as defined in
Contract No. DE-AC05-76RL01830 and any other information which may be properly withheld from public disclosure thereunder
3. Mission
Drivers
Analyzing Changing
Online Landscapes
Seed LDRD Projects
- Signatures of Communities & Change
- Digital Currency Graph Forensics
- DarkNet Characterization
- Signatures in the Cloud
Signature Discovery
Initiative (SDI)
Analysis in Motion
(AIM)
National
Security
Computing
Disrupting Illicit
Trafficking
Nuclear Security
National Defense
Homeland Security
Special Programs
Seattle Innovation
District
Asymmetric Resilient
Cybersecurity (ARC)
Cyber-Physical
Systems
Ubiquitous
Sensing
4. Analysis in Motion
Streaming Data Characterization & Processing
Library of foundational streaming algorithms, methods for extracting features from streams
Data reduction techniques like semantic characterization
Hypothesis Generation & Testing
Scalable symbolic deduction & incremental machine learning to track a stream
Generate, update, and validate human-understandable hypotheses from streaming classifiers
Human-Machine Feedback
Interaction with human interfaces to implicitly weight, tune, and modify underlying models
Visual strategies for bidirectional communication of and interaction with multiple hypotheses
Work Environments
Integration framework and testing range
Instrumentation to measure overall accuracy, utility, and throughput
5. January 27, 2016
AIM Program Area 1
Streaming Data Characterization
Compression Analysis (CA)
Video compression algorithms provide an
efficient means of detecting and
classifying events in a stream
Nonstandard features
Became full project at mid-year
Scalable Feature Extraction and
Sampling (SFE)
Given a dataset, can we find a minimum
subset that provides similar accuracy as
the entire dataset?
Parallel setting using MPI
Open source library (MaTEX)
6.
AIM Program Area 3
Human-Machine Feedback
User-Centered Hypothesis Definition
(UCHD)
Transitioned to new PM and new
technical focus in February
What does a machine-generated
hypothesis look like to a human
analyst?
Science of Interaction (SOI)
Use user clickstream data as an
indicator of user sensemaking
Developed and open-sourced the
Streaming Canvas software
UI engineering for use cases
User studies
7.
AIM Program Area 3
Human-Machine Feedback
Mitigating Cognitive Depletion in Streaming Environments (CD)
Predict and mitigate human performance degradation
Quantify increase in error and impulsivity based on time from last break
Studies using Halo and exam data
User study planned
Kills / Deaths
Halo: Reach
8. Streaming Analytics
CHALLENGE
____________________________________________________________________
Craft machine-generated hypotheses as data
arrive, steering data collection and using human
feedback to tune a multi-classifier system.
PNNL IMPACT
____________________________________________________________
Developing a niche in interactive streaming
analytics at scale; basis for invited keynotes at
IEEE HCBDR, AAAS Big Data in Life Science,
Data Science Innovation Summit, Science of
Multi-INT.
Developed streaming automated detection of the
first point of failure in a lithium battery through
electron microscopy.
PNNL streaming architecture used as reference
model for special programs sponsors.
Collaborators: Rensselaer Polytechnic,
Laboratory for Analytic Sciences.
TXT VIS STREAM GRAPH STATS DATA PROV CYBER
9. Data Provenance & Workflow at Extreme Scale
CHALLENGE
____________________________________________________________________
Ensuring reliable performance and
reproducibility of complex and adaptive
workflows in extreme scale environments.
PNNL IMPACT
____________________________________________________________
Workflow Performance Provenance
ontology captures performance and
reproducibility metrics across the complete
system and application stack, helping to
identify causal relationships.
ProvEn uses PNNL’s provenance ontology
to record, correlate, and analyze events;
distinguished from mainstream provenance
by focusing on process not just data
heritage.
PNNL is informing ASCR directions for
future provenance investments.
11. National Security Computing Program Areas
INFRASTRUCTURE
Data and workflow
management
HPC programming models
and libraries
Power, performance, and
reliability modeling
Resiliency theory
Mobile and edge computing
Embedded systems
Systems engineering and
agile development
Cloud and streaming
architectures
Modeling and simulation
Data quality and
provenance
Sampling strategies
Experimental design
Human language
technology
Computer vision
Large graph analysis
Recommender systems
Social and behavioral
science
ANALYTICS DECISION SUPPORT
Visualization
Human-computer
interaction
User experience design
Semantic computing
Operations research
Test environments
Analytic tradecraft and
critical thinking
Situational awareness
Collaborative systems
Training systems
MISSION AREAS AND OPERATIONAL DEPLOYMENT
Cyber analysis | Bio-surveillance | Social media analysis | Forensics | Emergency preparedness and response
Law enforcement | Critical infrastructure resiliency | Trafficking networks | Power grid management
12.
Project Goals
Research Question
How do we structure the Semantic technology stack to consume and
reason over a volatile data stream, and what are the effects of this
configuration when expressing streaming data models through common
off-the-shelf (COTS) reasoners?
Goals of Project
Build prototype frameworks to consume streaming data into a
Semantic Web stack
Model streaming data in a Description Logic (DL) ontology and reason over
the new graph using a set of DL compliant reasoners
Model streaming data into an ontology, DL or comparable rule set, that can
be compared across reasoning clients
Study the effects of cache maintenance, primarily data eviction, on the
Semantic Web stack and results across reasoners
Develop an engineering proposal to convert the prototypes into a single
platform that can be deployed on cloud networks (AWS, PIC)
13.
Project Approach
Propositional data are streaming in at a certain rate, and we can only see
some “window” of them at any given time.
We sample the data in the window and add them to a fixed-size cache.
We need effective methods of sampling.
The fixed-size cache differentiates our framing of the problem from
agglomerative databases (i.e., “just store everything”).
Deductive reasoning is continuously performed over the cache to answer
queries and to corroborate or refute hypotheses as quickly as possible.
Low-latency, high-throughput reasoning on ephemeral data is a hard, open
problem.
There will likely be many conclusions to bring to the attention of the user,
and so ranking is needed in order to prioritize attention.
The idea of ranking is not so hard, but determining the correct ordering is.
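A minimal Java sketch of the fixed-size cache described above. The eviction policy here is assumed to be simple FIFO for illustration; the project experimented with several caching and eviction algorithms, so this is one possible framing, not the actual implementation:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Fixed-size cache for sampled stream propositions.
// Oldest entries are evicted once capacity is exceeded (FIFO).
public class StreamCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    public StreamCache(int capacity) {
        super(16, 0.75f, false); // insertion order => FIFO eviction
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity; // evict the oldest proposition when full
    }

    public static void main(String[] args) {
        StreamCache<Integer, String> cache = new StreamCache<>(3);
        for (int i = 1; i <= 5; i++) {
            cache.put(i, "proposition-" + i);
        }
        System.out.println(cache.keySet()); // entries 1 and 2 were evicted
    }
}
```

The fixed capacity is what distinguishes this framing from "just store everything": the reasoner only ever sees the cache's current window of the stream.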
21.
SHyRe Decision Tree
5 Possible Outcomes:
1. Query Pellet with built-in Jena RDF functionality
2. Query Pellet with a SPARQL query
3. Encode the SPARQL query into URL format and cURL a triplestore endpoint
4. Use the SNARL protocol to query StarDog with a SPARQL query
5. Use the AGQuery protocol to query AllegroGraph with a SPARQL query
a. *RDFS++ reasoning
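Outcome 3 above (URL-encoding a SPARQL query and cURLing a triplestore endpoint) can be sketched with the JDK's built-in HTTP classes. The endpoint URL and query below are placeholders, not the project's actual deployment:

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

// Build the GET request for a SPARQL-over-HTTP endpoint
// (the programmatic equivalent of cURLing the triplestore).
public class SparqlHttpQuery {
    public static URI buildQueryUri(String endpoint, String sparql) {
        String encoded = URLEncoder.encode(sparql, StandardCharsets.UTF_8);
        return URI.create(endpoint + "?query=" + encoded);
    }

    public static void main(String[] args) {
        String sparql = "SELECT ?s WHERE { ?s a ?type } LIMIT 10";
        URI uri = buildQueryUri("http://localhost:8080/sparql", sparql);
        HttpRequest request = HttpRequest.newBuilder(uri)
                .header("Accept", "application/sparql-results+json")
                .GET()
                .build();
        System.out.println(request.uri()); // fully URL-encoded query string
    }
}
```

Outcomes 4 and 5 replace this generic HTTP step with StarDog's SNARL and AllegroGraph's AGQuery client protocols, respectively.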
22.
Engineering Approach
J2EE Pipeline
AVRO Packet Stream
JAVA Stream “Pull” Client
Use Case JAVA - Streaming Design Pattern
JAVA Pellet Reasoner
StarDog TripleStore / Reasoner
AllegroGraph TripleStore / Reasoner
Not Implemented / Reasoner
23. Use Case 1: Nuclear Magnetic
Resonance
Protected Information | Proprietary Information
25.
NMR Accomplishments to Date
Research Question Answered
By consuming an undefined count of scans, can we assemble an NMR run,
model compounds within an ontology of background data, and then reason
across this new combined model of compound and spectrum ontology?
Logic Constraints Answered
Streaming data – When is a spectrum fully assembled?
How do we decide which functions to model in the ontology, and which to
apply to a query?
SHyRe NMR Model
Description Logic background ontology of compound classes and peaks
(Pellet implementation)
RDFS background ontology of compound classes and peaks (StarDog /
AllegroGraph implementations)
Consume and model an NMR run from a stream of spectrum scans
Query the NMR run after applying the compound background ontology
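The "when is a spectrum fully assembled?" constraint above was handled by tracking run numbers plus a query-result-completeness test. A Java sketch of that completeness test follows; the class name and quiet-query threshold are illustrative assumptions, not the project's actual code:

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Treat an NMR run as fully assembled once several consecutive
// queries over the growing graph stop returning new results.
public class RunCompletenessDetector {
    private final Set<String> seenResults = new HashSet<>();
    private int quietQueries = 0;      // consecutive queries with no new results
    private final int quietThreshold;  // how many quiet queries mean "complete"

    public RunCompletenessDetector(int quietThreshold) {
        this.quietThreshold = quietThreshold;
    }

    /** Feed in one query's result set; returns true once the run looks complete. */
    public boolean observe(List<String> queryResults) {
        boolean anyNew = false;
        for (String r : queryResults) {
            if (seenResults.add(r)) {
                anyNew = true;
            }
        }
        quietQueries = anyNew ? 0 : quietQueries + 1;
        return quietQueries >= quietThreshold;
    }
}
```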
28. Use Case 2: Shipping a
Strategic Surprise
29.
How do we detect a
Strategic Surprise?
Ford
Exemplar
HS-10237
HS-10238
HS-10239
HS-10240
HS-10241
HS-10242
HS-10237
HS-10238
HS-10239
HS-10246
HS-10248
HS-10243
Import
Stream
HS-10243
HS-10244
HS-10245
HS-10246
HS-10247
HS-10248
Nike
Exemplar
HS-10301
HS-10302
HS-10303
HS-10304
HS-10305
HS-10306
HS-10307
HS-10308
HS-10309
HS-10310
HS-10311
HS-10312
30.
How do we detect a
Strategic Surprise?
Ford
Exemplar
HS-10237
HS-10238
HS-10239
HS-10240
HS-10241
HS-10242
HS-10237
HS-10238
HS-10239
HS-10246
HS-10248
HS-10243
Import
Stream
HS-10243
HS-10244
HS-10245
HS-10246
HS-10247
HS-10248
Nike
Exemplar
HS-10301
HS-10302
HS-10303
HS-10304
HS-10305
HS-10306
HS-10307
HS-10308
HS-10309
HS-10310
HS-10311
HS-10312
31.
How do we detect a
Strategic Surprise?
Ford
Exemplar
HS-10237
HS-10238
HS-10239
HS-10240
HS-10241
HS-10242
HS-10246
HS-10248
HS-10243
HS-10303
HS-10311
HS-10307
Import
Stream
HS-10243
HS-10244
HS-10245
HS-10246
HS-10247
HS-10248
Nike
Exemplar
HS-10301
HS-10302
HS-10303
HS-10304
HS-10305
HS-10306
HS-10307
HS-10308
HS-10309
HS-10310
HS-10311
HS-10312
32.
How do we detect a
Strategic Surprise?
Ford
Exemplar
HS-10237
HS-10238
HS-10239
HS-10240
HS-10241
HS-10242
HS-10303
HS-10311
HS-10307
HS-10304
HS-10305
HS-10312
Import
Stream
HS-10243
HS-10244
HS-10245
HS-10246
HS-10247
HS-10248
Nike
Exemplar
HS-10301
HS-10302
HS-10303
HS-10304
HS-10305
HS-10306
HS-10307
HS-10308
HS-10309
HS-10310
HS-10311
HS-10312
33.
Strategic Surprise Accomplishments
to Date
Research Question Answered
Based on a company’s import records, can we determine if it is entering
a new line of business (LOB)?
Logic Constraints Answered
Streaming data – have to determine if record might be important in future
Explain reasoning to enable user intervention / interaction and integration
with other models
SHyRe Strategic Surprise Model
Model each company by the HSCODEs it imports
Identify companies that represent all companies in a LOB
Exemplar of the LOB
Use training data to get HSCODEs used by each exemplar
Count the number of matching HSCODEs between monitored company and
exemplars
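The exemplar-matching step above can be sketched as a set intersection over HSCODEs. The exemplar profiles below are illustrative stand-ins, not the actual training data:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Score a monitored company against each LOB exemplar by
// counting the HSCODEs it shares with that exemplar's profile.
public class ExemplarMatcher {
    public static Map<String, Long> scores(Set<String> companyCodes,
                                           Map<String, Set<String>> exemplars) {
        Map<String, Long> result = new HashMap<>();
        exemplars.forEach((lob, codes) -> result.put(lob,
                companyCodes.stream().filter(codes::contains).count()));
        return result;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> exemplars = Map.of(
                "automotive", Set.of("HS-10237", "HS-10238", "HS-10239"),
                "athletic",   Set.of("HS-10301", "HS-10302", "HS-10303"));
        Set<String> monitored = Set.of("HS-10301", "HS-10303", "HS-10239");
        // athletic matches 2 codes, automotive matches 1
        System.out.println(scores(monitored, exemplars));
    }
}
```

A rising match count against an exemplar outside the company's established LOB is what flags a potential strategic surprise.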
34.
Strategic Surprise Accomplishments
to Date
[Chart: Required Input Records to Produce Output. Output record counts of 0, 0, 15, 88, and 129 plotted against input record counts on a 0 to 45,000 scale.]
35.
Strategic Surprise Accomplishments
to Date
Required Input Records to Produce Output

Input Import Records   Output Results   CPU (seconds)   Throughput (inputs/second)
0                      0                1.292           -
1                      0                1.693           -
10,000                 15               77.619          128.834
20,000                 88               185.553         107.786
30,000                 129              330.895         90.663
40,000                 169              508.902         78.601
37. Challenges
Reasoning Differences in Standards (RDFS / OWL EL/DL / RDFS++)
Reasoner Difficulty
Pellet Nearly complete OWL DL, but not currently maintained.
StarDog Strict separation of A-Box / T-Box reasoning within OWL DL across
the embedded Pellet and StarDog systems, which creates oddly formed,
verbose SPARQL queries.
AllegroGraph Proprietary reasoning with inconsistent standards.
Complex cache eviction algorithms and unsupported SPARQL standards
Reasoner Difficulty
Pellet Requires complex internal storage algorithms to manipulate memory
graphs
StarDog SPARQL DELETE can only support literal triples. Variables within a
DELETE invoke background graph indexing and frequently fail.
38.
Conclusions
Contract with Rensselaer Polytechnic Institute
Rui Yan and Yue Liu joined the SHyRe team, advised by Prof. Deborah McGuinness
Complete: International Conference for Biomedical Ontologies Paper
William Smith, Alan Chappell, Courtney Corley
Complete: Smart Data 2015 Conference
William Smith, Deborah McGuinness, Rui Yan
Complete: Conference on Information Knowledge Management 2015 Paper
Mark Borkum, William Smith, Deborah McGuinness, Rui Yan, Yue Liu
Complete: ISWC 2015 Workshop Paper
Rui Yan, Brenda Praggastis, William Smith, Deborah McGuinness
In Progress: Skolemization/Currying to enable decidable reasoning
Patrick Paulson
In Progress: Journal of Web Semantics, Streaming Edition Paper
39. William Smith
Human Centered Analytics
william.smith@pnnl.gov
+1.206.528.3356
SHyRe: Streaming
Hypothesis Reasoning
aim.pnnl.gov
Editor’s notes
2 Outside views -
Civil servants – Amusing during election season, which stopped being a season long ago and is now just the perpetual state of things
Department of Defense – Controversial when at peace, and necessary when somebody is somewhere they’re not supposed to be and it’s going to take a pretty penny to get them out.
Third Arm of Government
DOE – Infrastructure and Science. 17 labs, national highway system, power plants, green energy, smart grids, environmental regulation, cyber security, disease tracking…
Point out some of the labs
PNNL
Nuclear Labs
Sandia (Z machine)
Fermilab (collider)
NREL
Energy Grid Lab
We are the support system for the myriad of other internal US departments that support state governments and national projects.
PNNL strengths and cultural focus on …
Focus on strengthening & leveraging the science base
Focus on impacting mission & developing next generation
History – 3 innovations that won WWII
DOE labs overall role
Accelerate the rate of innovation (user facilities, next gen, scientific leadership)
Address enduring, S&T centered mission challenges (naval reactors, nonpro, energy, stockpile)
Ensure ability to react to rapid change or crises (critical materials, Fukushima, 9/11, cyber)
Achieve and prevent technology surprise (security mission-centric)
Enhance economic competitiveness
Oppy here for analytic provenance in SP sponsors – big challenge there.
one other thing that makes ProvEn research different is that we are more focused on how the captured provenance can provide real actionable insights for the users e.g. why is my workflow performance so variable, what was different in this process from different instances, how can I improve workflow performance etc.
A strategic surprise is a material that doesn’t relate to the line of business of a company, or that is embargoed from that specific company receiving.
Infant incubators. Great for premature babies, and also not bad for creating bio weapons. So if you’re a country with 10 hospitals and 4 premature babies why would your state company need 10,000 infant incubators? Same with glove boxes. Great for specific industries… that your country doesn’t have… so why do you need 10,000 glove boxes?
Company that needs ONE industrial piece of equipment very rarely. Like a french bakery. Once every 20 years it will need an industrial HVAC system… so why did this one just import 50 of them?
Loading dock switches. Let’s say I buy up the loading dock right next to a company that does need to import something, and make a deal about things moving from loading dock 1-AC to 1-AD.
We took a much broader point of view before the fine grained loading dock switch…
We started with a central research question like all PNNL projects… <read question>
Through this question, and a slight project reorganization, the following goals were aligned to FY15. <read slides>
Data pipe & sampling window – Provided by outside Java enterprise framework.
Fixed size cache. There were a couple different of caching algorithms depending on in memory caching, application caching, and triple store caching
Several reasoning platforms were selected, depending mainly on claims from manuals and marketing resources
End user. Human in the loop was a large requirement, but we did focus on an analyst function.
We chose 3 core areas for FY15:
We selected three COTS – common off the shelf – reasoners with different degrees of the OWL and RDFS specification implemented and expressivity for testing.
We researched cache maintenance and have placed emphasis on cache eviction and how do you maintain a stable, but relevant, graph for queries
Communication with the underlying infrastructure. The original build of SHYRE was good at consuming and producing metrics about the query operation. FY15 focused not only on the use cases, but providing results back to the infrastructure for user review:
a. Talk to automated entity extraction agent
b. Provided conclusions to analytics UIs
c. Create propositions for future use within Shyre
The mile high picture of the design pattern is a workflow engine (or state machine) that runs four concurrent processes. MOST IMPORTANTLY, after starting each process they MUST run independently and not require a synchronization mechanisms beyond thread safe programming.
The ingestion client is responsible for connecting to a stream, decoding a byte-stream-encoded packet, and completing an initial test of data conformity to the use case upon initialization. It then provides the decoded data to the annotation mechanism. Ingestion is not responsible for establishing a medium-term storage solution, and all decoded data are immediately stored in a FIFO list for annotation.
This is the state composed of processes for encoding data into an RDF graph and providing the graph to a reasoner. Data annotation to Semantic Web RDF identifiers creates a unique challenge, as each use case generally requires a different markup for decoded data.
The ability to create a question and propose it in a way a reasoner can answer using an RDF graph or triplestore. Querying the Semantic Web, after annotating and storing a data stream, is a variable state intended to run in tandem with all states after the initial ingestion. Because the graphs are created and modified as the stream arrives, annotation and query design must be composed for a streaming architecture
Cache maintenance… that’s really really hard.
Ingestion can run really really fast
Querying can run really really slow, especially depending on how you structure the query and logic. DL adds almost exponentially to query time as more triples are added to the graph, and with RDFS you have to take special care in how you construct the query to ensure something is always returned for cache management
And cache management never works.
These 4 states create a decision tree -
Data consumption from architecture
Now that we have the data we have to make a decision based on annotation requirements, use case, and reasoner package
In-memory systems quickly assembling RDF models, no metering of FIFO cache consumption is necessary
Metering access to the temporary cache is necessary to provide time to complete the string-building process
String manipulation into RDF triples
Template variable substitution into valid RDF/XML/TURTLE/ETC Graphs
Query Pellet with built-in Jena/OpenRDF functionality
Query Pellet across the file system with a SPARQL query
Encode the SPARQL query into URL format and cURL a triplestore endpoint.
Use SNARL protocol to query StarDog with SPARQL Query
Use AGQuery protocol to query AllegroGraph with SPARQL Query
a. *RDFS++ Reasoning
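The template variable substitution step mentioned above can be sketched in Java. The prefix, class, and predicate names here are illustrative placeholders, not the project's actual vocabulary:

```java
// Slot decoded stream values into a Turtle template to produce
// a valid RDF graph fragment ready for loading into a reasoner.
public class TurtleTemplate {
    private static final String TEMPLATE =
            "@prefix ex: <http://example.org/shyre#> .%n" +
            "ex:%s ex:importsCode \"%s\" .%n";

    public static String annotate(String company, String hsCode) {
        return String.format(TEMPLATE, company, hsCode);
    }

    public static void main(String[] args) {
        System.out.print(annotate("CompanyA", "HS-10243"));
    }
}
```

In practice each use case needed a different markup for decoded data, so the template itself varied per use case while the substitution mechanism stayed the same.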
And here are the 5 outcomes you have to account for when evaluating this kind of system.
1 and 2: while in memory (or on the file system if you’re particularly pressed for time) Pellet is slow, but it isn’t handcuffed when it comes to pretty much complete DL logic. This is the easiest place to create a static query that always returns an expectation of true / false per entity within the query and ontology.
3 is the grand mystery, highly dependent on the triplestore. How did you populate it? Could you cURL a query? Was there a proprietary loading method like ISQL or the MarkLogic pipeline demoed yesterday? How much file system is used? How do we know when we can query?
4-5 Both StarDog and AllegroGraph have similar custom protocols for populating the graphs, and it becomes much more of an issue of query composition, especially with Standard SPARQL, RDFS or OWL EL.
Now that the broad overview is covered, we will be focusing on this yellow strip as our use case applications. This is where the majority of the custom SHYRE LOGIC – not stream ingestion - had to be created, and where the majority of the design pattern and decisions we just discussed were deployed.
Nuclear Magnetic Resonance (NMR) spectroscopy is an analytic technique that exploits the magnetic properties of certain atomic nuclei in order to determine the physical and chemical properties of the molecules in which they are contained (e.g., the chemical structure).
Let’s run through our accomplishments with each use case - we will begin with NMR.
1. <read question 1> - Yes, we can. However, there is a large scalability issue as scans become more complex, ballooning our query time from 10 seconds to 19 minutes.
2. <read logic constraints> -
a. Yes, we tracked run numbers and did an additional test using query result completeness, i.e., when does a query stop returning any new results?
b. OWL DL vs. RDFS – This was decided for us generally by the reasoner being used. StarDog has an interesting quirk where it breaks queries into A-Box / T-Box reasoning forcing the query author to be careful when composing queries and modeling an ontology
3. Finally, this is what we came up with - <read slide>
MAJOR POINTS
25 Time Linear Runs out of 25,000+
This should be a straight line with no bumps. There should not have been this much change in such a small amount of time so something isn’t tracking between the graphs, ontology and queries.
Roughly 1,730 triples seems to be the golden graph size where we start to lose results and compound confirmations
However, 1,698 / 1,722 returns all 11 positives here. This brings up the question of graph utilization – every triple in this graph applies to a compound we’re searching
Each of these queries took roughly 20 seconds on an NMR run composed of between 30 and 50 scans. As scans increase query time increases dramatically due to ranging functions in the DL ontology having to search every triple 30 times.
By the time we’re at 250+ scans (~20K triples in graph) queries take around 19 minutes and only return “possible” for all 30 compounds. This brings up the question of a “data deluge” never negating or affirming a chemical, while also providing enough information the probability is high – or nearly certain.
A strategic surprise is a material that doesn’t relate to the line of business of a company, or that is embargoed from that specific company receiving.
Infant incubators. Great for premature babies, and also not bad for creating bio weapons. So if you’re a country with 10 hospitals and 4 premature babies why would your state company need 10,000 infant incubators? Same with glove boxes. Great for specific industries… that your country doesn’t have… so why do you need 10,000 glove boxes?
Company that needs ONE industrial piece of equipment very rarely. Like a french bakery. Once every 20 years it will need an industrial HVAC system… so why did this one just import 50 of them?
Loading dock switches. Let’s say I buy up the loading dock right next to a company that does need to import something, and make a deal about things moving from loading dock 1-AC to 1-AD.
We took a much broader point of view before the fine grained loading dock switch…
Import stream – just your normal company pulling things in
Ford is good for automotive
Nike is good for athletic gear
Poor examples:
Samsung
Toyota
General Electric
Walmart is actually on the line, but can be an exemplar. Sure they import a lot of finished consumer goods, but how many stores do they own and how often would they need raw materials and construction systems replaced (HVAC)? Walmart doesn’t sell cars or reactors, and they don’t need the raw goods to create their consumer goods. So there is a superstore category for finished consumer goods.
Company A looks like Exemplar Ford so they’re importing automotive goods
Well…. Now they don’t…
Company A looks like Exemplar Nike so they’re importing athletic goods
Next we move onto the Strategic Surprise accomplishments
From the questions earlier in the talk, and the established Strategic Surprise DL ontology, our problem became how to model so much data (record packets of 30+ values) and decide which values are useful for determining drift in business category. This was accomplished by establishing a set of exemplar companies, or companies that never or very infrequently import outside of a given business category.
Constraints on logic included record decomposition, a common problem in streaming data. More important, and the problem I frequently have, is explaining when and why the SHyRe reasoner creates an output record. Records are created ONLY when a company begins aligning itself with an exemplar, and this drift toward a specific industry provides the metrics for SOI from SHyRe. That means in the SOI demonstration you’re only going to see SHyRe lines stepping up as a company becomes more and more like a specific exemplar company within an industry.
3. Finally, this is what we came up with - <read slide>
This follows common logic in the fact most companies aren’t rapidly entering different lines of business and importing goods that don’t pertain to their business model.
CPU seconds are nearly doubling as company import tendencies are modeled.
Processing throughput is nearly halved by the time we ingest roughly four times the original 10,000 records. Good because it doesn’t happen at 20,000, but an obvious bottleneck at some point.
30 seconds it starts –
What is OPA
What is Shyre
What is the movement in between
Talk threshold
1:30 shyre detects appliances
2:10 Detects a change
Go to the background
5:20
Talk console
So... what challenges did we have, beyond the aforementioned scalability issues covered previously?
I’m just reading this slide because it’s my favorite slide and the bane of my last year of research.
Read Slide