Signals from outer space

GraphAware®
Vlasta Kůs, Data Scientist @ GraphAware
graphaware.com
@graph_aware, @VlastaKus
How NASA Benefits from Graph-Powered NLP

‣ Database of learned knowledge across NASA’s programs & projects
‣ Unstructured text with basic metadata
‣ Collected since the late 1950s (100s of millions of documents)
‣ Public dataset: ~1600 documents
NASA’s Lessons Learned

"1406",420,"Roberts, J “,
"VO'75 Pressure Regulator Leakage and Work-Around
Procedures (~1976)",
"The pressure regulator in the Viking Orbiter Propulsion
Subsystem started leaking following a pyro firing that
occurred prior to the near-Mars TCM. Likely causes were
corrosion or residue from propellant migration or pyro
valve blowby, or particulate contamination. Recommendations
included using separate regulators for the fuel and
oxidizer sides, incorporating a bellows in the pyro valve
to eliminate blowby, and adding a isolation valve between
the regulator and propellant tank.",
" The micro-scale effects of long-term propellant exposure
should be investigated in order to better critique
regulator design. ",
"JPL",1996-07-08,"",TRUE,"",1460,7,NA,"https://nen.nasa.gov/web/11/viewall/-/viewall/420"
NASA’s Lessons Learned Database

"673",1326,"Relvini, Kristine ",
"Lessons Learned Not Being Inputted Into Lessons
Learned Information System (LLIS) Database",
"",
"If you don't document the lessons learned, you loose
knowledgeable, shared information and tracking capacity
across programs.",
"KSC",2002-10-11,"Aeronautics Research, Science,
Exploration Systems, Space Operations, ",FALSE,"",
702,6,NA,"https://nen.nasa.gov/web/11/viewall/-/viewall/1326"

Graph database = isolated data silos -> connected knowledge
‣ Efficient search
‣ Relationships among various areas
   Apollo, Space Shuttle, Orion, …
‣ Pattern recognition (clusters, communities, correlations, …)
Example: correlation between corrosion of valves & topics involving batteries
‣ Useful for planning future projects and preventing/solving issues

What is a Graph?
G = (V, E)
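
As a rough illustration of the definition above, a graph can be held as a set of vertices V and a set of edges E, where each edge is a pair of vertices. The vertex names below are hypothetical stand-ins for two lessons and two topics; they are not taken from the NASA dataset.

```python
# Vertices: hypothetical identifiers (assumed for illustration only).
V = {"lesson_420", "lesson_1326", "topic_valve", "topic_documentation"}

# Edges: unordered pairs of vertices, i.e. an undirected graph for simplicity.
E = {
    frozenset({"lesson_420", "topic_valve"}),
    frozenset({"lesson_1326", "topic_documentation"}),
}


def neighbours(vertex, edges):
    """Return all vertices that share an edge with `vertex`."""
    result = set()
    for edge in edges:
        if vertex in edge:
            result |= edge - {vertex}
    return result


lesson_topics = neighbours("lesson_420", E)
```

A property graph, as used in Neo4j, additionally attaches labels and key-value properties to both vertices and edges.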

WHY NEO4J?
It is a proper graph database
It is a proper database

Graph-Based Architecture: Knowledge Graph

‣ NLP = machine learning tools that allow computers to process, and perhaps understand, human languages
‣ Basic steps
   Sentence segmentation
   Tokenisation
   Lemmatisation
   Part of Speech (POS) tagging
   Parsing
   Named Entity Recognition (NER)
   Sentiment analysis
   …
Natural Language Processing
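
The first two basic steps can be sketched with regular expressions alone. This is a deliberately naive stand-in, assuming plain English text without abbreviations; toolkits such as Stanford CoreNLP use far more robust models, and the function names here are made up for the example.

```python
import re


def split_sentences(text):
    """Naive sentence segmentation: split after '.', '!' or '?' plus whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenise(sentence):
    """Naive tokenisation: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence)


text = "The regulator started leaking. Likely causes were corrosion or residue."
sentences = split_sentences(text)
tokens = [tokenise(s) for s in sentences]
```

The later steps (lemmatisation, POS tagging, parsing, NER) depend on trained models and are not attempted here.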

Currently supported toolkits for human language processing
‣ Stanford CoreNLP
   ‣ developed at Stanford University
   ‣ fast, robust, production ready
   ‣ many pre-built models
   ‣ license: GPL v3+
‣ Apache OpenNLP
   ‣ developed by volunteers
   ‣ many pre-built models
   ‣ license: Apache License v2.0
NLP: Text Processors

‣ Named Entity Recognition (NER) = classification of words into predefined classes
‣ Examples: Dr. Who -> Person, May 2018 -> Date, EU -> Country …
‣ Stanford NLP default entities: Person, Location, Date, Organisation, Number, Money, Percentage
‣ Custom NE classes -> training on a large tokenised & labeled corpus
‣ Wikipedia, Wikidata - rich sources of multilingual training data that can be extracted automatically
Named Entity Recognition

Custom Named Entities based on Wikipedia
NASA use case: identify names of space missions
Training - crawling Wikipedia & identifying relevant information

Universal Dependencies: cross-linguistically consistent grammatical relations among words in a sentence
Examples:
‣ amod (adjectival modifier)
   Matt likes red wine.
‣ appos (appositional modifier)
   Mars Global Surveyor (MGS) was an American robotic spacecraft …
‣ conj (conjunct)
   It failed to respond to messages and commands.
‣ …
Universal Dependencies

‣ Stanford CoreNLP: Dependency & Part of Speech analysis of a single sentence
Source: http://nlp.stanford.edu:8080/corenlp/process
Either find an efficient representation in some traditional database, or …
Graph-Powered NLP

Graph-Powered NLP
NLP and property graphs: a natural fit
… use a property graph!
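
To make the fit concrete, here is a minimal sketch of one parsed sentence held as a property graph: tokens become nodes with properties, and dependency relations become typed edges. The POS tags and relations for "Matt likes red wine." are written by hand following Universal Dependencies conventions rather than produced by a parser, and the data layout is an assumption for illustration, not the schema used in the talk.

```python
# Token nodes, keyed by position in the sentence, with hand-assigned properties.
nodes = {
    1: {"text": "Matt", "pos": "PROPN"},
    2: {"text": "likes", "pos": "VERB"},
    3: {"text": "red", "pos": "ADJ"},
    4: {"text": "wine", "pos": "NOUN"},
}

# Dependency edges as (head id, relation, dependent id).
edges = [
    (2, "nsubj", 1),  # Matt is the subject of likes
    (2, "obj", 4),    # wine is the object of likes
    (4, "amod", 3),   # red modifies wine
]


def dependents(head_id, relation):
    """Return the text of all tokens attached to `head_id` by `relation`."""
    return [nodes[d]["text"] for h, r, d in edges if h == head_id and r == relation]


def phrase(head_id):
    """Join a noun with its adjectival modifiers into a short phrase."""
    return " ".join(dependents(head_id, "amod") + [nodes[head_id]["text"]])


noun_phrase = phrase(4)
```

Questions such as "which adjectives modify this noun?" then become simple edge traversals instead of joins.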

Unsupervised techniques tend to be underestimated, while …
‣ No need for time & money to get massive labeled training datasets
‣ Often faster to train & faster to predict
‣ Unsupervised deep learning
Unsupervised ML Algorithms

PageRank
PageRank = a measure of importance of a web page based on the quality
of links from other pages
The formula reﬂects a model of a random surfer.
Source: https://en.wikipedia.org/wiki/PageRank
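
The random-surfer model can be sketched as a simple power iteration. This is a minimal version, assuming a damping factor of 0.85 and a tiny made-up link structure in which every link target also appears as a key; it is not the implementation used in the talk.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power iteration over a dict mapping each page to the pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # With probability (1 - damping) the surfer jumps to a random page.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outgoing in links.items():
            if outgoing:
                # Otherwise the surfer follows one of the outgoing links.
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # A page without outgoing links spreads its rank evenly.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical four-page web: C is linked from every other page.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(links)
```

Page C ends up with the highest rank because it collects links from all other pages, while D, which nothing links to, keeps only the random-jump share.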

Keyword Extraction: TextRank
Keywords = words/phrases that capture the semantic essence of a text
Graph-Based Unsupervised Algorithm:
‣ Construct a graph of word co-occurrences
‣ Assess the importance of words with the PageRank algorithm
‣ Use top 1/3 of words as keyword candidates
‣ Use universal dependencies to construct key phrases
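
The first three steps above can be sketched as follows. This is a simplified illustration: it assumes the text has already been tokenised and filtered, uses a co-occurrence window of two following words (an arbitrary choice), and leaves out the final step of building key phrases from universal dependencies. The word list is invented for the example.

```python
from collections import defaultdict


def cooccurrence_graph(words, window=2):
    """Link each word to the words that follow it within `window` positions."""
    graph = defaultdict(set)
    for i, word in enumerate(words):
        for other in words[i + 1 : i + 1 + window]:
            if other != word:
                graph[word].add(other)
                graph[other].add(word)
    return graph


def rank_words(graph, damping=0.85, iterations=50):
    """Undirected PageRank: each word shares its score equally among neighbours."""
    score = {w: 1.0 for w in graph}
    for _ in range(iterations):
        score = {
            w: (1.0 - damping)
            + damping * sum(score[n] / len(graph[n]) for n in graph[w])
            for w in graph
        }
    return score


def keyword_candidates(words):
    """Keep the top third of the ranked words as keyword candidates."""
    score = rank_words(cooccurrence_graph(words))
    ranked = sorted(score, key=score.get, reverse=True)
    return ranked[: max(1, len(ranked) // 3)]


words = (
    "pressure regulator leak valve pressure regulator "
    "propellant valve regulator tank"
).split()
candidates = keyword_candidates(words)
```

With six distinct words, the top third contains two candidates; here they are the two best-connected words in the co-occurrence graph.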

Rada Mihalcea, Paul Tarau. TextRank: Bringing Order into Texts. Proceedings of EMNLP 2004, pages 404–411, Barcelona,
Spain. Association for Computational Linguistics. http://www.aclweb.org/anthology/W04-3252.
Keyword Extraction: TextRank
Despite its simplicity, TextRank provides state-of-the-art results on a wide range of unstructured texts.
Leveraging universal dependencies allowed us to surpass the precision & recall of the original TextRank paper.
NASA examples: “space shuttle”, “flight hardware”, “launch vehicle”, …

Automatic text summarisation
‣ Abstractive
‣ Extractive
TextRank can be adapted for efficient sentence ranking for extractive summarisation.
Summarisation: TextRank
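
A minimal sketch of sentence ranking in the spirit of TextRank: sentences are nodes, edge weights are a word-overlap similarity normalised by sentence length (the measure proposed in the TextRank paper), and a weighted PageRank scores the sentences. Whitespace tokenisation and the three example sentences are assumptions made for brevity.

```python
import math


def similarity(a, b):
    """Shared words, normalised by the log lengths of both sentences."""
    overlap = len(set(a) & set(b))
    norm = math.log(len(a)) + math.log(len(b))
    return overlap / norm if norm > 0 else 0.0


def rank_sentences(sentences, damping=0.85, iterations=50):
    """Score sentences with weighted PageRank over the similarity graph."""
    tokens = [s.lower().split() for s in sentences]
    n = len(tokens)
    weight = [
        [similarity(tokens[i], tokens[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    total = [sum(row) for row in weight]
    score = [1.0] * n
    for _ in range(iterations):
        score = [
            (1.0 - damping)
            + damping
            * sum(weight[j][i] / total[j] * score[j] for j in range(n) if total[j] > 0)
            for i in range(n)
        ]
    return score


sentences = [
    "the regulator started leaking after the pyro firing",
    "the leaking regulator was caused by corrosion",
    "crew training schedules were updated",
]
scores = rank_sentences(sentences)
```

An extractive summary would then keep the highest-scoring sentences in their original order; the third sentence shares no words with the others and therefore ranks lowest.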

ConceptNet 5 = semantic network for understanding the meaning of words
‣ Relational knowledge from MIT’s Open Mind Common Sense project
‣ DBpedia (information from Wikipedia infoboxes)
‣ Wiktionary (free multilingual dictionary)
‣ …
Knowledge Enrichment: ConceptNet 5
Microsoft Concept Graph = semantic network introducing knowledge
about concepts
‣ harnessed from billions of web pages and years’ worth of search logs

Expand the knowledge from external or other internal sources.
Knowledge Enrichment

‣ Latent Dirichlet Allocation (LDA) - generative statistical model that
describes documents as a probabilistic mixture of a small number of topics
‣ Each topic described by a list of most relevant words
‣ Sample of topics from the NASA dataset
   ["design", "failure", "test", "result", "flight", "hardware", "mission", "testing", "system", "due"]
   ["pressure", "system", "cause", "valve", "propellant", "leak", "operation", "shuttle", "space", "gas"]
   ["space", "shuttle", "NASA", "operation", "safety", "iss", "crew", "ISS", "astronaut", "program"]
Topic Extraction: Latent Dirichlet Allocation

‣ Word embeddings = representation of words as multi-dimensional
semantic vectors which encode linguistic patterns
‣ Use cases: word sense disambiguation, new distance functions between
documents, starting point for further ML (e.g. NN classification)
‣ Word2vec = shallow two-layer neural network model for producing word
embeddings
‣ ConceptNet Numberbatch - consists of state-of-the-art word embeddings
Word Embeddings

Word Embeddings: word2vec
Tomas Mikolov et al.: https://arxiv.org/abs/1301.3781

Word Embeddings: word2vec
Kusner et al.: http://mkusner.github.io/publications/WMD.pdf
Document distance (Word Mover's Distance): minimum cumulative distance that all words of one document need to travel to reach the words of another
Semantic patterns representable as linear translations:
distance(Oslo -> Norway) similar to distance(Berlin -> Germany)
vec(Germany) - vec(Berlin) + vec(Oslo) ≈ vec(Norway)
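
The linear-translation idea can be tried out on toy data. The two-dimensional vectors below are invented so that the capital-to-country offset is identical for every pair; real word2vec vectors have hundreds of dimensions and satisfy the relation only approximately.

```python
import math

# Toy vectors, invented for illustration: axis 0 separates capitals from
# countries, axis 1 distinguishes the three regions.
vectors = {
    "Berlin": [0.0, 1.0],
    "Germany": [1.0, 1.0],
    "Oslo": [0.0, 2.0],
    "Norway": [1.0, 2.0],
    "Paris": [0.0, 3.0],
    "France": [1.0, 3.0],
}


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def analogy(a, b, c):
    """Return the word closest to vec(b) - vec(a) + vec(c), excluding the inputs."""
    target = [y - x + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in {a, b, c}]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


# "Berlin is to Germany as Oslo is to ...?"
result = analogy("Berlin", "Germany", "Oslo")
```

Excluding the three input words from the candidates is common practice, since the target vector often lies closest to one of its own inputs.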

Document Embeddings
Q. Le, T. Mikolov: Distributed representations of sentences and documents, arXiv:1405.4053v2
Paragraph Vector (doc2vec): extension of word2vec
The additional paragraph node represents context (topic) of the current document.
Paragraph vectors have the same behaviour towards linear vector translations as
word vectors.

Document Embeddings
doc2vec vectors of dimension 300, NASA sentences -> dimensionality reduction (PCA + t-SNE)

Document Embeddings
doc2vec vectors of dimension 2000, 30k Wikipedia pages -> dimensionality reduction (PCA + t-SNE)

Some of the neural networks applicable to text processing
‣ Shallow networks (word & document embeddings)
‣ Deep Auto-Encoders
‣ Convolutional Neural Networks
‣ Recurrent Neural Networks (LSTMs)
Deep Learning for Text Processing

Self-supervised Auto-Encoders: useful for vector embeddings (images, texts)
DeepLearning4J - Java-based deep learning library
Example of auto-encoder (e.g. stacked RBMs) …
Deep Learning: Auto-encoders
Works well for images, but problematic for texts (sparsity).

Convolutional Neural Networks
Y. Zhang, B. Wallace: arXiv:1510.03820
Classification of documents based on word embeddings and CNN

Deep Learning: Summarisation
S. Narayan et al.: Ranking Sentences for Extractive Summarisation with Reinforcement learning, arXiv:1802.08636

Deep Learning: Summarisation
Extractive summarisation (sentence ranking) notably outperforms abstractive.

Knowledge Graphs are a powerful problem-solving tool
‣ Augmented search
‣ Actionable knowledge
‣ Machine Learning
‣ Chatbots and Question answering systems
‣ Foundational to AI
Conclusion

www.graphaware.com @graph_aware
