The penetration of mobile devices equipped with various embedded sensors makes it possible to capture the physical and virtual context of the user and the surrounding environment. Modeling human behavior from these data is becoming increasingly important with the growing popularity of context-aware computing and people-centric applications, which use users' behavior patterns to improve existing services or enable new ones. In many natural settings, however, broader application is hindered by three main challenges: the rarity of labels, the uncertainty of activity granularities, and the difficulty of multi-dimensional sensor fusion.
2. Study the fundamental scientific problem of modeling an individual's behavior from heterogeneous sensory time-series
• Data collected from physical and soft sensors
• Apply the behavioral models to real applications
• Security: Accountable Mobility Model
• Mobile Security: SenSec
• Psychological status estimation: StressSens
3. • Behaviometrics: derived from Behavioral Biometrics
• Behavioral: the way a human subject behaves
• Biometrics: technologies and methods that measure and analyze biological characteristics of the human body
• Fingerprints, eye retina, voice patterns
• BehavioMetrics: measurable behavior used to recognize or verify
• the identity of a human subject, or
• certain behaviors of the subject
4. [Pipeline diagram: Raw Data → Preprocessing → Modeling → Evaluation, with Ground Truth feeding Evaluation; each stage connects to Applications.]
5. [Framework diagram: Heterogeneous Sensor Data (MobiSens) → Behavioral Text Representation (n-gram, skipped n-gram, Helix / Helix Tree, DT, RF, SVM, …) → Applications (Accountable Mobility, SenSec, StressSens); evaluated with simulated attacks, controlled experiments, authentication records, and memory tests, using metrics such as precision, recall, accuracy, error, and false positives.]
6. • Human behaviors/activities share common properties with natural languages
• Meanings are composed from the meanings of building blocks
• An underlying structure (grammar) exists
• Both are expressed as sequences (time-series)
• This lets us apply a rich set of statistical NLP techniques to mobile sensory data
8. • Generative language model: P(English sentence | model)
P("President Obama has signed the Bill of …" | Politics) >>
P("President Obama has signed the Bill of …" | Sports)
• The LM reflects the n-gram distribution of the training data: domain, genre, topics
• With labeled behavior text data, we can train an LM for each activity type ("walking"-LM, "running"-LM, …) and classify an observed activity as the one whose LM assigns it the highest probability
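The per-activity language-model classification described on this slide can be sketched in a few lines. This is only a toy illustration, not the system's implementation: it uses simple add-alpha smoothing rather than the Katz smoothing discussed later, and the activity symbols and model names are invented.

```python
import math
from collections import Counter

def train_lm(sequences, n=2):
    """Count n-grams and (n-1)-gram contexts over activity token sequences."""
    ngrams, contexts, vocab = Counter(), Counter(), set()
    for seq in sequences:
        vocab.update(seq)
        for i in range(len(seq) - n + 1):
            gram = tuple(seq[i:i + n])
            ngrams[gram] += 1
            contexts[gram[:-1]] += 1
    return ngrams, contexts, vocab

def log_prob(seq, lm, n=2, alpha=0.1):
    """Log P(seq | lm) with add-alpha smoothing so unseen grams get mass."""
    ngrams, contexts, vocab = lm
    V = max(len(vocab), 1)
    lp = 0.0
    for i in range(len(seq) - n + 1):
        gram = tuple(seq[i:i + n])
        lp += math.log((ngrams[gram] + alpha) / (contexts[gram[:-1]] + alpha * V))
    return lp

# Toy "behavior text": tokens are sensor-derived activity symbols.
models = {
    "walking": train_lm([list("ababab"), list("bababa")]),
    "running": train_lm([list("cdcdcd"), list("dcdcdc")]),
}

def classify(seq):
    """Pick the activity whose LM assigns the sequence the highest probability."""
    return max(models, key=lambda m: log_prob(seq, models[m]))

print(classify(list("abab")))  # → walking
```

The same `argmax` over per-class model likelihoods carries over directly to the behavior-text classification on the later slides.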
9. • User activity at time t depends only on the last n-1 locations
• The sequence of activities can be predicted from the n-1 consecutive activities in the past
• Maximum likelihood estimation from training data by counting:
P(l_i | l_{i-n+1} … l_{i-1}) = C(l_{i-n+1} … l_i) / C(l_{i-n+1} … l_{i-1})
• MLE assigns zero probability to unseen n-grams
→ Incorporate a smoothing function (Katz):
discount the probability of observed grams,
reserve probability mass for unseen grams
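The counting estimate above, and the zero-probability problem that motivates smoothing, can be seen in a minimal sketch. The pseudo-location trace is hypothetical, and no smoothing is applied here on purpose, to expose the problem Katz smoothing fixes.

```python
from collections import Counter

def mle_ngram(locs, n=3):
    """MLE by counting: P(l_i | history) = C(n-gram) / C((n-1)-gram context)."""
    num, den = Counter(), Counter()
    for i in range(len(locs) - n + 1):
        num[tuple(locs[i:i + n])] += 1
        den[tuple(locs[i:i + n - 1])] += 1
    return lambda gram: num[gram] / den[gram[:-1]] if den[gram[:-1]] else 0.0

trace = list("ABCABCABD")  # toy pseudo-location labels
p = mle_ngram(trace, n=3)
print(p(("A", "B", "C")))  # 2/3: context "AB" seen 3 times, "ABC" twice
print(p(("A", "B", "X")))  # 0.0: unseen n-gram → motivates smoothing
```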
10. • Long-distance dependency of words in sentences
• tri-grams for "I hit the tennis ball": "I hit the", "hit the tennis", "the tennis ball"
• "I hit ball" is not captured
• Future activities depend on activities far in the past; intermediate behavior has little relevance or influence
• Noise in the data sets: "ping-pong" effects in time-series, interference, sampling errors, etc.
• Model size
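A minimal sketch of how skipped n-grams recover long-distance grams like "I hit ball". The window and maximum skip are illustrative parameters, not values from the actual system.

```python
from itertools import combinations

def skipped_ngrams(tokens, n=3, max_skip=2):
    """Enumerate n-grams that may skip up to max_skip intermediate tokens,
    capturing long-distance dependencies a continuous n-gram misses."""
    grams = set()
    for i in range(len(tokens)):
        # pick the remaining n-1 positions from a window allowing skips
        window = range(i + 1, min(i + n + max_skip, len(tokens)))
        for rest in combinations(window, n - 1):
            grams.add((tokens[i],) + tuple(tokens[j] for j in rest))
    return grams

sent = "I hit the tennis ball".split()
grams = skipped_ngrams(sent, n=3, max_skip=2)
print(("I", "hit", "ball") in grams)  # → True, recovered by skipping 2 tokens
```

Continuous tri-grams are the special case with zero skips, so they remain in the set; skipping d intermediate grams effectively lowers the required n-gram order to (n-d), as the slide notes.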
11. • Build BehavioMetrics models for M classes P_0, P_1, …, P_{M-1}
• Genders, age groups, occupations
• Behaviors, activities, actions
• Health and mental status
• For a new behavioral text string L, calculate the probability that L was generated by model m
• The classification problem is formulated as m* = argmax_m P(L | P_m)
12. • Is this play Shakespeare's work?
• Compare the play to Shakespeare's known library of works
• Track word and phrase patterns in the data
• Calculate the probability of the unknown work U given all of Shakespeare's known works {S}
• Compare with a threshold θ
• Authentic work (a = 1)
• Fake, forgery, or plagiarism (a = 0)
13. • A special binary classification problem
• Given a normal BehavioMetrics model P_n, a new behavior text sequence L, and a threshold θ, calculate the likelihood that L was generated by P_n and compare it with θ
• If the likelihood falls below θ, flag an anomaly alert
• Variation caused by noise can be smoothed out statistically
• Some feedback is needed to handle false positives, usually caused by unseen behaviors or a sub-optimal threshold
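The threshold test on this slide can be sketched as a sliding-window detector over per-symbol log probabilities. The log-probability values and threshold below are hypothetical; a real deployment would pick θ from an ROC curve, as discussed later.

```python
def sliding_anomaly(logprobs, window=5, theta=-3.0):
    """Flag sliding-window positions whose average log probability falls below theta."""
    alerts = []
    for i in range(len(logprobs) - window + 1):
        if sum(logprobs[i:i + window]) / window < theta:
            alerts.append(i)
    return alerts

# Hypothetical per-symbol log probabilities under the owner's model; the
# run of -6.0 values simulates an anomalous segment (a different user).
lp = [-1.0] * 10 + [-6.0] * 5 + [-1.0] * 5
print(sliding_anomaly(lp))  # → [8, 9, 10, 11, 12]
```

Averaging over a window is what smooths out noise-induced variation statistically, so a single noisy reading does not trigger an alert.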
14. [Figure: average log probability of the behavior text over a sliding window position. An anomaly occurring at point A is detected at point B with the low threshold; the high threshold produces false positives at points C and D.]
16. • Induce the underlying grammar of human activities
• Identify atomic activities through bracketing and collocation
• Generalize semantically similar activities into higher-level activities
17. 1. Vocabulary Initialization using Time-series Motifs
2. Super-Activity Discovery by Statistical Collocation
3. Vocabulary Generalization via Aggregated Similarity
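The collocation step (2) can be illustrated with a simplified stand-in. The slide does not spell out Helix's actual collocation statistic, so this sketch scores adjacent activity pairs by pointwise mutual information (PMI); the activity labels, counts threshold, and PMI threshold are all illustrative.

```python
import math
from collections import Counter

def collocations(seq, min_count=2, threshold=1.0):
    """Score adjacent activity pairs by PMI; high-PMI pairs are candidate
    super-activities to merge into the vocabulary."""
    uni = Counter(seq)
    bi = Counter(zip(seq, seq[1:]))
    n = len(seq)
    found = {}
    for (a, b), c in bi.items():
        if c < min_count:
            continue
        pmi = math.log((c / (n - 1)) / ((uni[a] / n) * (uni[b] / n)))
        if pmi > threshold:
            found[(a, b)] = pmi
    return found

acts = ["sit", "stand", "walk", "sit", "stand", "walk", "run", "sit", "stand"]
print(sorted(collocations(acts)))  # pairs that co-occur unusually often
```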
21. • Collect the RSS of devices at multiple WAPs, with timestamps
• Aggregate and serialize into a time series of RSS vectors
* Lin, et al., "WASP: An enhanced indoor location algorithm for a congested wi-fi environment"
22. • The dimensionality of the RSS vector is too fine-grained for modeling
• Proximity in location results in similar RSS vectors
• K-means clustering with a distance function similar to WASP [1]; each cluster is assigned a pseudo-location label
[1] Lin, et al., "WASP: An enhanced indoor location algorithm for a congested wi-fi environment"
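The quantization step can be sketched with plain k-means. Note the simplifications: this toy uses Euclidean distance rather than the WASP-like distance function the slide describes, and the RSS vectors are made up.

```python
import random

def kmeans(vectors, k, iters=20, seed=0):
    """Plain k-means over RSS vectors; returns a pseudo-location label per vector."""
    rng = random.Random(seed)
    centroids = rng.sample(vectors, k)
    labels = [0] * len(vectors)
    for _ in range(iters):
        # assign each vector to the nearest centroid (squared Euclidean distance)
        for i, v in enumerate(vectors):
            labels[i] = min(range(k), key=lambda c: sum(
                (a - b) ** 2 for a, b in zip(v, centroids[c])))
        # move each centroid to the mean of its members
        for c in range(k):
            members = [vectors[i] for i in range(len(vectors)) if labels[i] == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels

# Two tight groups of 3-AP RSS vectors (dBm) → two pseudo locations
rss = [(-40, -70, -90), (-42, -71, -88), (-80, -45, -60), (-79, -44, -62)]
labels = kmeans(rss, k=2)
print(labels)  # the first two vectors share one label, the last two the other
```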
23. Dataset
Users: 40
Location: Cisco SJC 14 1F, Alpha networks
RSS sampling rate: 13 sec
Period: 5 days
Number of WAPs: 87
Device: Cisco Aironet 1500 + MSE
Dataset size: 3.2 mil points
• RSS vector clustering: run a small subset of the trace with different K and evaluate clustering performance by average distance to centroids
• K = 3× #WAPs has the best trade-offs
• Yields ~260 pseudo locations
24. • Testing samples
Positive samples: simulated anomalies created by splicing traces from two different users
Negative samples: traces from the device's "owner"
25. [Figure: ROC curves (true positive rate vs. false positive rate) for 8-hour and 12-hour training data sizes.]
26. [Figure: detection accuracy vs. n-gram order for training data sizes of 4, 8, and 12 hours.]
27. [Architecture diagram: Quantization/Clustering → Sensor Fusion and Segmentation → Activity Recognition; a Risk Analysis Tree produces a certainty of risk, which is compared with the application's sensitivity to drive Application Access Control.]
28. [Architecture diagram: Sensing → Preprocessing (Feature Construction, Behavior Text Generation) → Modeling (N-gram Model, User Classifier), followed by Classification, Threshold Inference, and Binary Authentication.]
• SenSec collects sensor data
• Motion sensors
• GPS and WiFi scanning
• In-use applications and their traffic patterns
• The SenSec module builds user behavior models
• Unsupervised activity segmentation; the sequence is modeled with a language model
• A Risk Analysis Tree (DT) is built to detect anomalies
• The above are combined to estimate risk online: a certainty score
• The Application Access Control module activates authentication based on the score and a customizable threshold
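The access-control decision at the end of this pipeline can be sketched as a simple gate. The slide does not specify the exact rule for combining the certainty score with application sensitivity, so the product-vs-threshold rule, the app names, and all numeric values below are hypothetical illustrations of the idea only.

```python
def needs_authentication(certainty, sensitivity, threshold=0.5):
    """Toy SenSec-style gate: challenge the user when the estimated risk
    certainty, weighted by how sensitive the app is, crosses a threshold.
    (The real combination rule is not specified on the slide.)"""
    return certainty * sensitivity > threshold

apps = {"email": 0.9, "game": 0.2}   # hypothetical per-app sensitivity values
risk = 0.7                           # hypothetical certainty score from the model
for app, sens in sorted(apps.items()):
    print(app, needs_authentication(risk, sens))
```

A sensitive app (email) triggers an authentication challenge at this risk level, while a low-sensitivity app (a game) does not, matching the customizable-threshold behavior described above.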
29. • Accelerometer
• Used to summarize the acceleration stream
• Calculated separately for each dimension [x, y, z, m]
• Meta features: total time, window size
• GPS: location string from the Google Maps API, and the mobility path
• WiFi: SSIDs, RSSIs, and path
• Applications: bitmap of well-known applications
• Application traffic pattern: TCP/UDP traffic pattern vectors: [remote host, port, rate]
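The per-dimension accelerometer summary can be sketched as follows. The slide does not list the exact statistics computed per dimension, so mean and standard deviation are used here as illustrative examples, with m as the magnitude of the acceleration vector.

```python
import math
import statistics

def window_features(xs, ys, zs):
    """Summarize one window of an acceleration stream: per-dimension statistics
    for [x, y, z, m], where m = sqrt(x^2 + y^2 + z^2)."""
    ms = [math.sqrt(x * x + y * y + z * z) for x, y, z in zip(xs, ys, zs)]
    feats = {}
    for name, dim in (("x", xs), ("y", ys), ("z", zs), ("m", ms)):
        feats[f"{name}_mean"] = statistics.fmean(dim)
        feats[f"{name}_std"] = statistics.pstdev(dim)
    return feats

# A device lying roughly flat: gravity dominates the y axis in this toy window.
f = window_features([0.1, 0.2, 0.1], [9.7, 9.8, 9.9], [0.0, 0.1, 0.0])
print(round(f["y_mean"], 2))  # → 9.8
```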
31. • Offline data collection (for training and testing)
Pick up the device from a desk
Unlock the device using the right slide pattern
Invoke Email app from the "Home Screen"
Lock the device by pressing the "Power" button
Put the device back on the desk
33. • Alpha test in Jun 2012; first Google Play Store release in Oct 2012
• False positives: a 13% FPR still annoys users occasionally
• Use an adaptive model
• Add the trace data from shortly before a false positive to the training data and update the model
• Change passcode validation to a sliding pattern
• A false positive grants a "free ride" for a configurable duration
• Assumption: a just-authenticated user should control the device for a given period of time
• The "free ride" period ends immediately if an abrupt context change is detected
• A newer version is scheduled for release in Jan 2013
34. • Human stress needs to be properly handled
• DARPA - Detection and Computational Analysis of Psychological Signals
• Develop analytical tools to assess the psychological status of war fighters
• Improve psychological health awareness and enable them to seek timely help
• Measuring stress is expensive and time-consuming
• Expensive medical procedures: EKG, EEG
• Self-report: questionnaires, interviews, surveys
• BehavioMetrics-based estimation
• Monitor mouse movements, screen touches (Windows 8), keystrokes, active applications, and network traffic patterns to build Behaviometrics
• Use memory tests and other mental exercise results as ground truth
• Perform classification and regression to build behavior-stress models
38. "MobiSens: A Versatile Mobile Sensing Platform for Real-world Applications," MONE, 2013. [with P. Wu, J. Zhang]
"SenSec: Mobile Application Security through Passive Sensing," in Proceedings of the International Conference on Computing, Networking and Communications (ICNC 2013), San Diego, USA, January 28-31, 2013. [with P. Wu, X. Wang, J. Zhang]
"Towards Accountable Mobility Model: A Language Approach on User Behavior Modeling in Office WiFi Networks," ICCCN 2011, Maui, HI, Aug 1-5, 2011. [with Y. Zhang]
"Retweet Modeling Using Conditional Random Fields," in Proceedings of DMCCI 2011: ICDM 2011 Workshop on Data Mining Technologies for Computational Collective Intelligence, December 11, 2011. [with H. Peng, D. Piao, R. Yan, Y. Zhang]
"Mobile Lifelogger - Recording, Indexing, and Understanding a Mobile User's Life," in Proceedings of the Second International Conference on Mobile Computing, Applications, and Services, Santa Clara, CA, Oct 25-28, 2010. [with S. Chennuru, P. Cheng, Y. Zhang]
"SensCare: Semi-Automatic Activity Summarization System for Elderly Care," MobiCase 2011, Los Angeles, CA, October 24-27, 2011. [with P. Wu, H. Peng, J. Y. Zhang]
"Helix: Unsupervised Grammar Induction for Structured Human Activity Recognition," in Proceedings of the IEEE International Conference on Data Mining (ICDM), Vancouver, Canada, Dec 11-14, 2011. [with H. Peng, P. Wu, Y. Zhang]
"Statistically Modeling the Effectiveness of Disaster Information in Social Media," in Proceedings of the IEEE Global Humanitarian Technology Conference (GHTC), Seattle, Washington, Oct 30 - Nov 1, 2011. [with F. Xiong, D. Piao, Y. Liu, Y. Zhang]
"A Dissipative Network Model with Neighboring Activation," The European Physical Journal B. [with F. Xiong, Y. Liu, J. Zhu, Z. J. Zhang, Y. C. Zhang, J. Zhang]
"Opinion Formation with the Evolution of Network," in Proceedings of the 2011 Cross-Strait Conference on Information Science and Technology and iCube, Taipei, China, Dec 8-9, 2011. [with F. Xiong, Y. Liu, Y. Zhang]
Building along this line, we use a continuous n-gram model to learn the sequence of locations from a user's WiFi traces. The n-gram model works under the assumption that the next location in the sequence depends only on the last n-1 locations. Once the n-gram model is trained, we can use it to calculate the probability of all possible next locations given the past n-1 locations, and see which one is the most likely. To train the model, we use maximum likelihood estimation on the training sequences to estimate these conditional probabilities, just by counting. As shown in the equation, the MLE probability of being in a location at time i, conditioned on the past n-1 locations, is the count of the full n-location sequence in the data divided by the count of the (n-1)-location history. There is one small problem with this approach: if the model comes across a location sequence not seen in training, it assigns a zero probability, which may push the system to trigger an anomaly alert. Luckily, the n-gram model is very robust in handling unseen labels if we use smoothing. Smoothing algorithms such as Katz's take some probability mass from the seen labels and reserve it for the unseen ones.
In natural language, words in a sentence may have long-distance dependencies. For example, the sentence "I hit the tennis ball" has three tri-grams: "I hit the", "hit the tennis", and "the tennis ball". An equally important tri-gram, "I hit ball", is not captured by the continuous n-gram model, because the separators "the" and "tennis" sit in the middle. If we could skip the separators, we could form this important tri-gram: "I hit ball". Similarly, in the continuous n-gram model just described, a user's next location depends only on the n-1 previous locations; in many cases this may not be true. Using the same example: if a user is leaving the break room and entering the hallway that leads to his office, we can predict he will be in his office soon. The intermediate locations along the hallway, before entering the office, are not that important, and those locations can be skipped in the modeling. In the diagram, ABC is the break room, ACD is the entrance of the hallway, and EDB is the office; anything in the middle can be skipped and still give the same result. By skipping the d intervening grams, the effective n-gram order becomes (n-d). We can therefore reduce the size of the model in terms of computation and storage, because the n-gram model performs better at a lower value of n.
Once we have constructed a model of a user's behaviometrics through learning, we can continue monitoring the user's behaviometrics and compare them with the learned model. If the new behaviometrics deviate from the learned model, we may choose to trigger an anomaly alert. However, variations in sensory data streams can also be caused by noise and new behaviors, in addition to anomalous behaviors. Variations caused by noise are less significant and can be smoothed out statistically. To distinguish between anomalous and new behaviors, on the other hand, we need to evaluate whether the unseen patterns can be incorporated into the model over time. Failing to make this distinction may yield false positives temporarily, but if feedback mechanisms are in place to correct those false positives, we can still build a robust anomaly detection system in application domains such as theft detection and prevention, casual authentication, emergency detection, and healthcare monitoring.
To illustrate this process, let's look at an example. The blue curve is the log probability just described. Say an anomaly happens at point A. If we set the threshold low, like the red line, the system detects the anomaly at point B with a reasonable delay. But if we set the threshold too high, like the pink line, we mistakenly flag anomalies for sequences of normal behavior text, which count as false positives at points C and D. The way to find the right threshold for different applications is to use the receiver operating characteristic (ROC) curve, which we will look at in more detail later in the talk.
Consider a simple example, where the red trace on this office floor represents the usual mobility of a user: he is finishing a meeting in a conference room and going back to his cubicle. <<hit enter>> Now look at another path the user might take: instead of going this way, he heads in the other direction, <<hit enter>> then deviates further and further. In such a case, we would want to flag an anomaly. It could be that a visitor who attended the meeting took the device the employee forgot in the conference room and walked away. The device may still have access to the company's internal network and other data sources; on receiving this alert, the infrastructure can revoke its authentication credentials temporarily until the user authenticates himself again. <<hit enter>> Now, if instead of going further away he goes back to his cubicle, just by an alternate path, we probably do not want to flag an anomaly.
The management, control, and data frames from a device are heard by multiple APs. In our particular setup, these APs record the received signal strength (RSS) of those frames along with the identity of the device and timing information. These traces are aggregated at a central location, where we serialize them by timestamp and group them by device ID. So, for a particular device, we can build a time series of RSS vectors, where each element of a vector is the RSS from a particular AP. This series of RSS vectors, along with other context information, serves as the input to the preprocessing module, where it is converted to a text representation before being fed into our n-gram model.
From the signal propagation model, if two vectors are very similar, the locations where they were measured should be within reasonable proximity. Based on this assumption, we partition the RSS vector space into many "pseudo locations" and assign each pseudo location a unique label. By pseudo, we mean that we do not need to know the exact location of a reading; we only need to distinguish between two different locations. This can be done with a clustering algorithm, for example k-means. In the k-means runs, we use a distance function similar to RedPin and WASP, in addition to the standard cosine function, to reduce the noise caused by interference. Once the clustering is done, we assign the same label to all members of a cluster.
We collected the RSS traces from 87 WAPs in an office building over 5 days, with RSS samples at 13-second precision. These traces contain complete data for 40 users, about 3.2 million data points in total. Backup data points: pseudo location from RSS (other schem not very ….); 1500 data points (RSS) per user on average, with RSS from 3-7 WAPs; assuming users are up half of the time, that is 80k data points per user for 5 days; 3.2 million data points collected for 40 users, 20 million RSS readings; for each of the 40 users, 16K RSS vectors in total.
To validate our system, we need testing data. Fortunately, the traces we collected contain no recorded anomalies, so we created simulated device-stolen events by splicing two users' trace segments at their intersection points, where similar labels or label sequences are shared. We combined these simulated traces with normal traces to create a testing data set.
Now that we have gained some insight into our approach, it is time to explore the design parameters mentioned at the beginning. The first set of experiments finds the best anomaly detection threshold. In fact there is no single best threshold; the threshold depends on the application we are running. What are the requirements on detection accuracy? How many false positives can we tolerate? Do we have enough training data? To provide a guideline for answering these questions, we plot the receiver operating characteristic (ROC) curve. Essentially, the ROC curve captures the trade-off between the true positive rate and the false positive rate of our anomaly detection. We perform the experiments with different training data sizes, plotting the ROC curve by varying the threshold and recording the TPR and FPR. With the ROC curve, we can decide the threshold for a particular application depending on the amount of data the model must see before it can detect anomalies, the required TPR, or the acceptable FPR. For example, if we want to use an 8-hour training size with a false positive rate below 0.1, we locate that point and obtain the threshold that generated it (0.4); we need a threshold below 0.4 to fulfill the FPR requirement. Another example: with the same FPR requirement but a TPR above 0.8, we have to use more than 8 hours of training data to achieve the goal.
We plot these graphs with different training sizes and n-gram orders, and several things stand out. A higher-order model captures more context and in turn increases accuracy. But accuracy saturates beyond order 5, which suggests that a user's behavior depends mostly on the last 5 pseudo locations. This resonates with the past work mentioned at the beginning, and it tells us that increasing model complexity beyond this point will not bring significant improvement. Second, a training size as small as 4 hours may not capture a user's mobility behavior thoroughly enough for accurate detection. The closeness of the 8-hour and 12-hour curves also suggests that our system provides relatively good results once it has observed a user's behavior for 8 hours. One interesting point: the 12-hour and 8-hour curves cross over at the lower n-gram orders. While this could be due to errors in handling the data, our explanation leans toward the larger training set exposing more common locations that the shorter training size does not capture. With these common locations, people share many short sequences, so more simulated anomalies go undetected, bringing the accuracy down.
SenSec constantly collects sensory data from the accelerometer, gyroscope, GPS, WiFi, microphone, or even the camera. By analyzing the sensory data, it constructs the context in which the mobile device is used, including locations, movements, and usage patterns. From this context, the system calculates the certainty that the system is at risk. Each application on the mobile device is assigned a sensitivity value, either manually or automatically. When the user invokes an application, SenSec compares the certainty with that application's sensitivity level; if the sensitivity passes the certainty threshold, an authentication mechanism is employed to enforce the security policy for that application.
That brings me to the end of my presentation. Thank you very much for your attention.