Talk at IKNOW 2013, describing the Semantic Pattern Transformation.
This process transforms feature vectors, which are commonly used in machine learning, into a semantic representation. The advantage is that the same model can be used across all domains, which is not possible with raw feature vectors without cumbersome preprocessing.
2. IAIK
Our Background
Topics
Mobile device security
Cloud security
Security consulting for public institutions (Austria)
IT security research
IT security lectures
e-Government
A-SIT
3. IAIK
Why does he talk about Knowledge Discovery?
How does IT security relate to knowledge discovery?
eGov - eParticipation: document analysis, Twitter etc.
intrusion detection systems (network traffic analysis)
malware detection (network traffic, mobile phones)
mobile application analysis (metadata, market descriptions)
mobile application security (hot topic, BYOD, etc.)
4. IAIK
What to expect?
Motivation for the Semantic Pattern Transformation
Basic concepts, techniques
How does it work? Evaluation?
Applications, results, current topics!
5. IAIK
Environment
Arbitrary features
No a priori knowledge
Heterogeneous domains
Clustering
Supervised learning
Anomaly Detection
Semantic search
Visualization
Extracting knowledge
Text analysis
Android market descriptions
histograms
flexible deployment
new domains
terms
numbers
6. IAIK
Process...
•Different processing steps
•From defining the goals
•To extracting the desired knowledge
•Machine learning algorithms are often used within KDD
•However, the complete machine learning process is quite similar to KDD
Figure: knowledge discovery process (Fayyad et al.) next to the machine learning process
KDT: Knowledge discovery goals, Target data set, Preprocessing, Data extraction, Data mining method, Data mining algorithm, Data mining, Knowledge extraction, Knowledge processing
ML-KDT: Domain-specific data set, Machine learning goals, Instance extraction, Feature selection/construction, Instance selection, Machine learning algorithm, Preprocessing, Algorithm application, Interpretation
7. IAIK
ADAPTATION COMPLEXITY?
•Assuming an arbitrary data set (e-Participation, Android Market applications)
•Further assuming a knowledge discovery goal, e.g., unsupervised clustering
•Then: we need to adapt the steps of the machine learning process (see diagram)
•And: we need to adapt this setup whenever the data changes, even when the knowledge discovery goals remain the same!
•Android Market applications vs. text documents vs. network traffic vs. malware detection?
Figure: machine learning process steps, rated by their dependence on domain data and goals (high, medium, low)
Steps: Domain-specific data set, Machine learning goals, Instance extraction, Feature selection/construction, Instance selection, Algorithm selection, Preprocessing, Algorithm application, Interpretation
8. IAIK
TOWARDS A SEMANTIC REPRESENTATION
•Finding a new representation...
•New representation is called Semantic Patterns
•Key properties:
•Still a vector representation (compatible with the old representation)
•Not the feature values themselves, but their semantic relations are represented
•All values have the same meaning and feature type (activation)
•Transformation from raw data into Semantic Patterns: Semantic Pattern Transformation
9. IAIK
SEMANTIC PATTERN TRANSFORMATION
•The Semantic Pattern Transformation is arranged in five layers
•Layer 1 - Feature extraction
•Layer 2 - Associative network - Node generation
•Layer 3 - Associative network - Link generation
•Layer 4 - Spreading activation (SA)
•Layer 5 - Analysis (machine learning, semantic search etc.)
Figure: overview of the Semantic Pattern Transformation
Layer 1 - Feature Extraction: data set, relations (FROM, TO, TIME) and instances with features (SF 1, SF 2, DF 1, DF 2)
Layer 2 - 3 - Associative Network Generation: instances are mapped to network nodes (SV, MV)
Layer 4 - Spreading Activation: patterns P 1 to P 4
Layer 5 - Analysis: supervised learning, unsupervised clustering, semantic relations, feature value relevance, anomaly detection, semantic development over time, pattern similarity
10. IAIK
SPT: Layer 1 - Feature extraction
Extract features and their values, and determine the type of each feature (categorical, distance-based)
Categorical: Exports
Distance-based: Unemployment rate, fertility rate
Country  Exports               Unemployment rate  Fertility rate
C1       coffee                20%                5
C2       cacao                 20%                5
C3       coffee, cacao         20%                5
C4       machinery             5%                 2
C5       chemicals             5%                 2
C6       chemicals, machinery  5%                 2
C7       chemicals, cacao      20%                missing data
C8       missing data          20%                5
C9       coffee, cacao         missing data       missing data
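The extraction step for this example table could be sketched as follows. This is a minimal illustration, not the authors' implementation: the dictionary layout, the names FEATURE_TYPES, COUNTRIES and extract_features, and the use of None for missing data are all assumptions made for this sketch.

```python
# Hypothetical encoding of the example table; None marks missing data.
CATEGORICAL = "categorical"
DISTANCE_BASED = "distance-based"

# Feature types as stated on the slide.
FEATURE_TYPES = {
    "exports": CATEGORICAL,
    "unemployment_rate": DISTANCE_BASED,
    "fertility_rate": DISTANCE_BASED,
}

COUNTRIES = {
    "C1": {"exports": ["coffee"], "unemployment_rate": 20.0, "fertility_rate": 5.0},
    "C2": {"exports": ["cacao"], "unemployment_rate": 20.0, "fertility_rate": 5.0},
    "C3": {"exports": ["coffee", "cacao"], "unemployment_rate": 20.0, "fertility_rate": 5.0},
    "C4": {"exports": ["machinery"], "unemployment_rate": 5.0, "fertility_rate": 2.0},
    "C5": {"exports": ["chemicals"], "unemployment_rate": 5.0, "fertility_rate": 2.0},
    "C6": {"exports": ["chemicals", "machinery"], "unemployment_rate": 5.0, "fertility_rate": 2.0},
    "C7": {"exports": ["chemicals", "cacao"], "unemployment_rate": 20.0, "fertility_rate": None},
    "C8": {"exports": None, "unemployment_rate": 20.0, "fertility_rate": 5.0},
    "C9": {"exports": ["coffee", "cacao"], "unemployment_rate": None, "fertility_rate": None},
}


def extract_features(instance):
    """Return (feature, type, value) triples, skipping missing data."""
    triples = []
    for feature, value in instance.items():
        if value is None:
            continue  # missing data contributes no feature value
        values = value if isinstance(value, list) else [value]
        for v in values:
            triples.append((feature, FEATURE_TYPES[feature], v))
    return triples
```

For C9, only the two export values survive, which shows that instances with missing data can still be processed.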
11. IAIK
SPT: Layer 2 - Node generation
Associative network nodes: 20%, 5%, coffee, cocoa, machinery, chemicals, 5, 2
Categorical feature values: one node for each value
Distance-based feature values: map value ranges to single nodes
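The two node-generation rules can be sketched as below. The slides do not say how value ranges are chosen, so the fixed, user-supplied range boundaries (edges) are an assumption of this sketch, as are the function names and the node label format.

```python
import bisect


def categorical_node(feature, value):
    """Categorical feature values: one node for each distinct value."""
    return f"{feature}={value}"


def range_node(feature, value, edges):
    """Distance-based feature values: map a value to the node of its range.

    `edges` is a sorted list of range boundaries (assumed to be given);
    each range is half-open: [low, high).
    """
    i = bisect.bisect_right(edges, value)
    low = "-inf" if i == 0 else edges[i - 1]
    high = "inf" if i == len(edges) else edges[i]
    return f"{feature}=[{low},{high})"
```

With a single hypothetical boundary at 10.0, unemployment rates of 20.0 and 5.0 fall into two different nodes, while any two values inside the same range share one node.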
13. IAIK
SPT: Layer 4 - Spreading activation
Creating a Semantic Pattern: in this case for “coffee” and “cacao”
Set activation value of the two nodes to 1.0
Spread this activation value to neighboring nodes via the weighted links
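Figure: associative network with the nodes 20%, 5%, 5, 2, coffee, cocoa, machinery and chemicals; the coffee and cocoa nodes are set to an activation value of 1.0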
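A minimal spreading-activation sketch for the "coffee" and "cacao" example follows. The link weights and the spreading rule are not given in this part of the talk, so everything here is an assumption: links are weighted by how often two values co-occur in an instance (scaled so the strongest link is 1.0), and activation is passed on for a fixed number of steps with a decay factor. The names INSTANCES, build_links and spread are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Feature values per country (C1..C9), with missing data left out.
INSTANCES = [
    ["coffee", "20%", "5"],
    ["cacao", "20%", "5"],
    ["coffee", "cacao", "20%", "5"],
    ["machinery", "5%", "2"],
    ["chemicals", "5%", "2"],
    ["chemicals", "machinery", "5%", "2"],
    ["chemicals", "cacao", "20%"],
    ["20%", "5"],
    ["coffee", "cacao"],
]


def build_links(instances):
    """Assumed link rule: weight = co-occurrence count / largest count."""
    counts = defaultdict(float)
    for nodes in instances:
        for a, b in combinations(sorted(set(nodes)), 2):
            counts[(a, b)] += 1.0
    top = max(counts.values())
    links = defaultdict(dict)
    for (a, b), c in counts.items():
        links[a][b] = c / top  # undirected: store both directions
        links[b][a] = c / top
    return links


def spread(links, seeds, decay=0.5, steps=2):
    """Set the seed nodes to 1.0 and spread activation via weighted links."""
    activation = defaultdict(float)
    frontier = {node: 1.0 for node in seeds}
    for node, act in frontier.items():
        activation[node] += act
    for _ in range(steps):
        received = defaultdict(float)
        for node, act in frontier.items():
            for neighbour, weight in links[node].items():
                received[neighbour] += act * weight * decay
        for node, act in received.items():
            activation[node] += act
        frontier = received  # only newly received activation spreads on
    return dict(activation)


pattern = spread(build_links(INSTANCES), ["coffee", "cacao"])
```

Under these assumptions the nodes 20% and 5 receive far more activation than 5%, 2 or machinery, because they co-occur with coffee and cacao in the data set.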
14. IAIK
SPT: Layer 4 - Spreading activation
Typically, one would create Semantic Patterns for all instances within the data set
E.g. a pattern for C1 by activating coffee, 20% and 5
However, we can also create patterns for feature values: e.g. “coffee”
15. IAIK
SPT: Layer 4 - Spreading activation
After SA: each node in the network has an activation value
By representing the nodes and their activation values as a vector, we gain a Semantic Pattern
Node:        coffee  cocoa  machinery  chemicals  20%   5%    5     2
Activation:  1.15    1.15   0.00       0.08       0.38  0.00  0.30  0.00
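Turning the activated network into a vector only requires a fixed node order. The sketch below uses the activation values shown on the slide; the names NODE_ORDER and to_pattern are hypothetical.

```python
# Fixed node order, so every pattern has the same layout.
NODE_ORDER = ["coffee", "cocoa", "machinery", "chemicals", "20%", "5%", "5", "2"]


def to_pattern(activations, node_order=NODE_ORDER):
    """Read the activation values in node order; inactive nodes are 0.0."""
    return [activations.get(node, 0.0) for node in node_order]


# Activation values from the slide (nodes with 0.00 omitted).
slide_activations = {
    "cocoa": 1.15,
    "coffee": 1.15,
    "20%": 0.38,
    "5": 0.30,
    "chemicals": 0.08,
}

pattern = to_pattern(slide_activations)
```

Because every position always refers to the same node, patterns of different instances or feature values can be compared element by element.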
16. IAIK
Figure: unsorted Semantic Patterns for the feature values Export: Cacao, Export: Coffee and Fertility: 2 (bar charts over the nodes coffee, cacao, machinery, chemicals, 20%, 5%, 5, 2; activation axis from 0 to 0.50)
Each feature value is represented by a semantic fingerprint
Allows for an instant analysis of semantic relations to other feature values
Operations on patterns: sort, mean, variance, adding, subtracting
17. IAIK
SPT: Layer 5 - Analysis
Calculating the distance between two patterns (Euclidean distance, Cosine similarity)
For unsupervised clustering, semantic-aware search algorithms
Keyword search for coffee
C1  coffee                20%           5
C3  coffee, cacao         20%           5
C9  coffee, cacao         missing data  missing data
Semantic-aware search for coffee
C9  coffee, cacao         missing data  missing data
C1  coffee                20%           5
C3  coffee, cacao         20%           5
C2  cacao                 20%           5
C8  missing data          20%           5
C7  chemicals, cacao      20%           missing data
C5  chemicals             5%            2
C6  chemicals, machinery  5%            2
C4  machinery             5%            2
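The two distance measures named above, and a ranking built on them, can be sketched in a few lines. The ranking function is an illustration of how a semantic-aware search could order instances by pattern similarity; it is not taken from the talk, and the function names are hypothetical.

```python
import math


def euclidean_distance(p, q):
    """Euclidean distance between two patterns of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def cosine_similarity(p, q):
    """Cosine similarity; defined as 0.0 if either pattern is all zeros."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0


def rank_by_similarity(query, patterns):
    """Return the pattern names, most similar to the query pattern first."""
    return sorted(
        patterns,
        key=lambda name: cosine_similarity(query, patterns[name]),
        reverse=True,
    )
```

Unlike a keyword search, such a ranking returns every instance, ordered by how closely its pattern matches the pattern of the query, which is why C2 and C8 appear in the semantic-aware result list although they do not export coffee.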
19. IAIK
Benefits?
Figure: two side-by-side copies of the machine learning process (Domain-specific data set, Machine learning goals, Instance extraction, Feature selection/construction, Instance selection, Algorithm selection, Preprocessing, Algorithm application, Interpretation), rated by their dependence on domain data and goals (high, medium, low)
Application in heterogeneous domains regardless of the nature of the data
Except for Layer 1, we do not need any manual setup for the layers
Regardless of the analyzed data, the Semantic Patterns always use the same model
This means: Regardless of the deployed knowledge discovery method, we can always use the same methods for knowledge extraction!
22. IAIK
DOES IT WORK?
•Applications described in several publications, which analyze
•e-Participation (Egyptian revolution, Fukushima, Mitmachen): text documents
•Intrusion detection: event correlation
•RDF data analysis (semantic web)
•WiFi privacy (analyzing captured emails)
•Android Market application analysis
23. IAIK
Current Project
Android application security
Container applications for BYOD (require encryption, secure communication, key derivation functions, root checks etc.)
Manual analysis is cumbersome
Semantic Patterns
Extract Dalvik VM code, features (opcodes, methods, local variables etc.)
Apply Semantic Patterns technique
Clustering, supervised learning, anomaly detection etc.
25. IAIK
Current Project
Also works directly on the phone...
Detecting SMS catchers/sniffers
More fine-grained detection
asymmetric cryptography
symmetric cryptography
26. IAIK
Outlook
Publish the Java API...
basically a converter from arbitrary feature vectors to Semantic Patterns (e.g. in/out in ARFF format)
Deep learning...
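As an illustration of the ARFF output mentioned for the converter, the sketch below writes Semantic Patterns as an ARFF file with one numeric attribute per network node. It is not the announced Java API, which is unpublished here; the function name to_arff and the attribute names are hypothetical, and attribute names are assumed to be plain identifiers that need no quoting.

```python
def to_arff(relation, node_names, patterns):
    """Serialise Semantic Patterns as ARFF text with numeric attributes."""
    lines = [f"@relation {relation}", ""]
    for name in node_names:
        # One numeric attribute per node of the associative network.
        lines.append(f"@attribute {name} numeric")
    lines += ["", "@data"]
    for pattern in patterns:
        lines.append(",".join(f"{value:.2f}" for value in pattern))
    return "\n".join(lines) + "\n"


# Hypothetical attribute names for four of the example nodes.
arff_text = to_arff(
    "semantic_patterns",
    ["coffee", "cacao", "unemployment_20", "fertility_5"],
    [[1.15, 1.15, 0.38, 0.30]],
)
```

Since every value in a Semantic Pattern is an activation, all attributes share the same numeric type regardless of the original feature types.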