The penetration of mobile devices equipped with various embedded sensors makes it possible to capture the physical and virtual context of the user and the surrounding environment. Modeling human behavior from these data is becoming increasingly important with the growing popularity of context-aware computing and people-centric applications, which use users' behavior patterns to improve existing services or enable new ones. In many natural settings, however, broader application is hindered by three main challenges: the rarity of labels, the uncertainty of activity granularities, and the difficulty of multi-dimensional sensor fusion.
2. Study the fundamental scientific problem of modeling an individual's behavior from heterogeneous sensory time-series
• Data collected from physical and soft sensors
• Apply the behavioral models to real applications
• Security: Accountable Mobility Model
• Mobile Security: SenSec
• Psychological status estimation: StressSens
3. • Behaviometrics: derived from Behavioral Biometrics
• Behavioral: the way a human subject behaves
• Biometrics: technologies and methods that measure and analyze biological characteristics of the human body
• Fingerprints, eye retina, voice patterns
• BehavioMetrics: measurable behavior used to recognize or verify
• the identity of a human subject, or
• certain behaviors of the subject
4. [Pipeline diagram: Raw Data → Preprocessing → Modeling → Evaluation, with Ground Truth feeding Evaluation; each stage connects to Applications.]
5. [Framework diagram: Heterogeneous Sensor Data (MobiSens) → Behavioral Text Representation (n-gram, skipped n-gram, Helix / Helix Tree, DT, RF, SVM, …) → Applications (Accountable Mobility, SenSec, StressSens); evaluated with simulated attacks, controlled experiments, authentication records, and memory tests, using metrics such as precision, recall, accuracy, error, and false positives.]
6. • Human behaviors/activities share common properties with natural languages
• Meanings are composed from the meanings of building blocks
• An underlying structure (grammar) exists
• Both are expressed as sequences (time-series)
• This lets us apply a rich set of statistical NLP techniques to mobile sensory data
8. • Generative language model: P(English sentence | model)
P("President Obama has signed the Bill of …" | Politics) >>
P("President Obama has signed the Bill of …" | Sports)
• The LM reflects the n-gram distribution of the training data: domain, genre, topics
• With labeled behavior text data, we can train an LM for each activity type ("walking"-LM, "running"-LM, …) and classify an observed activity as the one whose LM assigns it the highest probability
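The per-activity language-model classification described on this slide can be sketched in a few lines. This is only a toy illustration, not the system's implementation: it uses simple add-alpha smoothing rather than the Katz smoothing discussed later, and the activity symbols and model names are invented.

```python
import math
from collections import Counter

def train_lm(sequences, n=2):
    """Count n-grams and (n-1)-gram contexts over activity token sequences."""
    ngrams, contexts, vocab = Counter(), Counter(), set()
    for seq in sequences:
        vocab.update(seq)
        for i in range(len(seq) - n + 1):
            gram = tuple(seq[i:i + n])
            ngrams[gram] += 1
            contexts[gram[:-1]] += 1
    return ngrams, contexts, vocab

def log_prob(seq, lm, n=2, alpha=0.1):
    """Log P(seq | lm) with add-alpha smoothing so unseen grams get mass."""
    ngrams, contexts, vocab = lm
    V = max(len(vocab), 1)
    lp = 0.0
    for i in range(len(seq) - n + 1):
        gram = tuple(seq[i:i + n])
        lp += math.log((ngrams[gram] + alpha) / (contexts[gram[:-1]] + alpha * V))
    return lp

# Toy "behavior text": tokens are sensor-derived activity symbols.
models = {
    "walking": train_lm([list("ababab"), list("bababa")]),
    "running": train_lm([list("cdcdcd"), list("dcdcdc")]),
}

def classify(seq):
    """Pick the activity whose LM assigns the sequence the highest probability."""
    return max(models, key=lambda m: log_prob(seq, models[m]))

print(classify(list("abab")))  # → walking
```

The same `argmax` over per-class model likelihoods carries over directly to the behavior-text classification on the later slides.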
9. • User activity at time t depends only on the last n-1 locations
• The sequence of activities can be predicted from the n-1 consecutive activities in the past
• Maximum likelihood estimation from training data by counting:
P(l_i | l_{i-n+1} … l_{i-1}) = C(l_{i-n+1} … l_i) / C(l_{i-n+1} … l_{i-1})
• MLE assigns zero probability to unseen n-grams
→ Incorporate a smoothing function (Katz):
discount the probability of observed grams,
reserve probability mass for unseen grams
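The counting estimate above, and the zero-probability problem that motivates smoothing, can be seen in a minimal sketch. The pseudo-location trace is hypothetical, and no smoothing is applied here on purpose, to expose the problem Katz smoothing fixes.

```python
from collections import Counter

def mle_ngram(locs, n=3):
    """MLE by counting: P(l_i | history) = C(n-gram) / C((n-1)-gram context)."""
    num, den = Counter(), Counter()
    for i in range(len(locs) - n + 1):
        num[tuple(locs[i:i + n])] += 1
        den[tuple(locs[i:i + n - 1])] += 1
    return lambda gram: num[gram] / den[gram[:-1]] if den[gram[:-1]] else 0.0

trace = list("ABCABCABD")  # toy pseudo-location labels
p = mle_ngram(trace, n=3)
print(p(("A", "B", "C")))  # 2/3: context "AB" seen 3 times, "ABC" twice
print(p(("A", "B", "X")))  # 0.0: unseen n-gram → motivates smoothing
```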
10. • Long-distance dependency of words in sentences
• tri-grams for "I hit the tennis ball": "I hit the", "hit the tennis", "the tennis ball"
• "I hit ball" is not captured
• Future activities depend on activities far in the past; intermediate behavior has little relevance or influence
• Noise in the data sets: "ping-pong" effects in time-series, interference, sampling errors, etc.
• Model size
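A minimal sketch of how skipped n-grams recover long-distance grams like "I hit ball". The window and maximum skip are illustrative parameters, not values from the actual system.

```python
from itertools import combinations

def skipped_ngrams(tokens, n=3, max_skip=2):
    """Enumerate n-grams that may skip up to max_skip intermediate tokens,
    capturing long-distance dependencies a continuous n-gram misses."""
    grams = set()
    for i in range(len(tokens)):
        # pick the remaining n-1 positions from a window allowing skips
        window = range(i + 1, min(i + n + max_skip, len(tokens)))
        for rest in combinations(window, n - 1):
            grams.add((tokens[i],) + tuple(tokens[j] for j in rest))
    return grams

sent = "I hit the tennis ball".split()
grams = skipped_ngrams(sent, n=3, max_skip=2)
print(("I", "hit", "ball") in grams)  # → True, recovered by skipping 2 tokens
```

Continuous tri-grams are the special case with zero skips, so they remain in the set; skipping d intermediate grams effectively lowers the required n-gram order to (n-d), as the slide notes.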
11. • Build BehavioMetrics models for M classes P_0, P_1, …, P_{M-1}
• Genders, age groups, occupations
• Behaviors, activities, actions
• Health and mental status
• For a new behavioral text string L, calculate the probability that L was generated by model m
• The classification problem is formulated as m* = argmax_m P(L | P_m)
12. • Is this play Shakespeare's work?
• Compare the play to Shakespeare's known library of works
• Track word and phrase patterns in the data
• Calculate the probability of the unknown work U given all of Shakespeare's known works {S}
• Compare with a threshold θ
• Authentic work (a = 1)
• Fake, forgery, or plagiarism (a = 0)
13. • A special binary classification problem
• Given a normal BehavioMetrics model P_n, a new behavior text sequence L, and a threshold θ, calculate the likelihood that L was generated by P_n and compare it with θ
• If the likelihood falls below θ, flag an anomaly alert
• Variation caused by noise can be smoothed out statistically
• Some feedback is needed to handle false positives, usually caused by unseen behaviors or a sub-optimal threshold
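The threshold test on this slide can be sketched as a sliding-window detector over per-symbol log probabilities. The log-probability values and threshold below are hypothetical; a real deployment would pick θ from an ROC curve, as discussed later.

```python
def sliding_anomaly(logprobs, window=5, theta=-3.0):
    """Flag sliding-window positions whose average log probability falls below theta."""
    alerts = []
    for i in range(len(logprobs) - window + 1):
        if sum(logprobs[i:i + window]) / window < theta:
            alerts.append(i)
    return alerts

# Hypothetical per-symbol log probabilities under the owner's model; the
# run of -6.0 values simulates an anomalous segment (a different user).
lp = [-1.0] * 10 + [-6.0] * 5 + [-1.0] * 5
print(sliding_anomaly(lp))  # → [8, 9, 10, 11, 12]
```

Averaging over a window is what smooths out noise-induced variation statistically, so a single noisy reading does not trigger an alert.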
14. [Figure: average log probability of the behavior text over a sliding window position. An anomaly occurring at point A is detected at point B with the low threshold; the high threshold produces false positives at points C and D.]
16. • Induce the underlying grammar of human activities
• Identify atomic activities through bracketing and collocation
• Generalize semantically similar activities into higher-level activities
17. 1. Vocabulary Initialization using Time-series Motifs
2. Super-Activity Discovery by Statistical Collocation
3. Vocabulary Generalization via Aggregated Similarity
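The collocation step (2) can be illustrated with a simplified stand-in. The slide does not spell out Helix's actual collocation statistic, so this sketch scores adjacent activity pairs by pointwise mutual information (PMI); the activity labels, counts threshold, and PMI threshold are all illustrative.

```python
import math
from collections import Counter

def collocations(seq, min_count=2, threshold=1.0):
    """Score adjacent activity pairs by PMI; high-PMI pairs are candidate
    super-activities to merge into the vocabulary."""
    uni = Counter(seq)
    bi = Counter(zip(seq, seq[1:]))
    n = len(seq)
    found = {}
    for (a, b), c in bi.items():
        if c < min_count:
            continue
        pmi = math.log((c / (n - 1)) / ((uni[a] / n) * (uni[b] / n)))
        if pmi > threshold:
            found[(a, b)] = pmi
    return found

acts = ["sit", "stand", "walk", "sit", "stand", "walk", "run", "sit", "stand"]
print(sorted(collocations(acts)))  # pairs that co-occur unusually often
```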
21. • Collect the RSS of devices at multiple WAPs, with timestamps
• Aggregate and serialize into a time series of RSS vectors
* Lin, et al., "WASP: An enhanced indoor location algorithm for a congested wi-fi environment"
22. • The dimensionality of the RSS vector is too fine-grained for modeling
• Proximity in location results in similar RSS vectors
• K-means clustering with a distance function similar to WASP [1]; each cluster is assigned a pseudo-location label
[1] Lin, et al., "WASP: An enhanced indoor location algorithm for a congested wi-fi environment"
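The quantization step can be sketched with plain k-means. Note the simplifications: this toy uses Euclidean distance rather than the WASP-like distance function the slide describes, and the RSS vectors are made up.

```python
import random

def kmeans(vectors, k, iters=20, seed=0):
    """Plain k-means over RSS vectors; returns a pseudo-location label per vector."""
    rng = random.Random(seed)
    centroids = rng.sample(vectors, k)
    labels = [0] * len(vectors)
    for _ in range(iters):
        # assign each vector to the nearest centroid (squared Euclidean distance)
        for i, v in enumerate(vectors):
            labels[i] = min(range(k), key=lambda c: sum(
                (a - b) ** 2 for a, b in zip(v, centroids[c])))
        # move each centroid to the mean of its members
        for c in range(k):
            members = [vectors[i] for i in range(len(vectors)) if labels[i] == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels

# Two tight groups of 3-AP RSS vectors (dBm) → two pseudo locations
rss = [(-40, -70, -90), (-42, -71, -88), (-80, -45, -60), (-79, -44, -62)]
labels = kmeans(rss, k=2)
print(labels)  # the first two vectors share one label, the last two the other
```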
23. Dataset
Users: 40
Location: Cisco SJC 14 1F, Alpha networks
RSS sampling rate: 13 sec
Period: 5 days
Number of WAPs: 87
Device: Cisco Aironet 1500 + MSE
Dataset size: 3.2 mil points
• RSS vector clustering: run a small subset of the trace with different K and evaluate clustering performance by average distance to centroids
• K = 3× #WAPs has the best trade-offs
• Yields ~260 pseudo locations
24. • Testing samples
Positive samples: simulated anomalies created by splicing traces from two different users
Negative samples: traces from the device's "owner"
25. [Figure: ROC curves (true positive rate vs. false positive rate) for 8-hour and 12-hour training data sizes.]
26. [Figure: detection accuracy vs. n-gram order for training data sizes of 4, 8, and 12 hours.]
27. [Architecture diagram: Quantization/Clustering → Sensor Fusion and Segmentation → Activity Recognition; a Risk Analysis Tree produces a certainty of risk, which is compared with the application's sensitivity to drive Application Access Control.]
28. [Architecture diagram: Sensing → Preprocessing (Feature Construction, Behavior Text Generation) → Modeling (N-gram Model, User Classifier), followed by Classification, Threshold Inference, and Binary Authentication.]
• SenSec collects sensor data
• Motion sensors
• GPS and WiFi scanning
• In-use applications and their traffic patterns
• The SenSec module builds user behavior models
• Unsupervised activity segmentation; the sequence is modeled with a language model
• A Risk Analysis Tree (DT) is built to detect anomalies
• The above are combined to estimate risk online: a certainty score
• The Application Access Control module activates authentication based on the score and a customizable threshold
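The access-control decision at the end of this pipeline can be sketched as a simple gate. The slide does not specify the exact rule for combining the certainty score with application sensitivity, so the product-vs-threshold rule, the app names, and all numeric values below are hypothetical illustrations of the idea only.

```python
def needs_authentication(certainty, sensitivity, threshold=0.5):
    """Toy SenSec-style gate: challenge the user when the estimated risk
    certainty, weighted by how sensitive the app is, crosses a threshold.
    (The real combination rule is not specified on the slide.)"""
    return certainty * sensitivity > threshold

apps = {"email": 0.9, "game": 0.2}   # hypothetical per-app sensitivity values
risk = 0.7                           # hypothetical certainty score from the model
for app, sens in sorted(apps.items()):
    print(app, needs_authentication(risk, sens))
```

A sensitive app (email) triggers an authentication challenge at this risk level, while a low-sensitivity app (a game) does not, matching the customizable-threshold behavior described above.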
29. • Accelerometer
• Used to summarize the acceleration stream
• Calculated separately for each dimension [x, y, z, m]
• Meta features: total time, window size
• GPS: location string from the Google Maps API, and the mobility path
• WiFi: SSIDs, RSSIs, and path
• Applications: bitmap of well-known applications
• Application traffic pattern: TCP/UDP traffic pattern vectors: [remote host, port, rate]
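The per-dimension accelerometer summary can be sketched as follows. The slide does not list the exact statistics computed per dimension, so mean and standard deviation are used here as illustrative examples, with m as the magnitude of the acceleration vector.

```python
import math
import statistics

def window_features(xs, ys, zs):
    """Summarize one window of an acceleration stream: per-dimension statistics
    for [x, y, z, m], where m = sqrt(x^2 + y^2 + z^2)."""
    ms = [math.sqrt(x * x + y * y + z * z) for x, y, z in zip(xs, ys, zs)]
    feats = {}
    for name, dim in (("x", xs), ("y", ys), ("z", zs), ("m", ms)):
        feats[f"{name}_mean"] = statistics.fmean(dim)
        feats[f"{name}_std"] = statistics.pstdev(dim)
    return feats

# A device lying roughly flat: gravity dominates the y axis in this toy window.
f = window_features([0.1, 0.2, 0.1], [9.7, 9.8, 9.9], [0.0, 0.1, 0.0])
print(round(f["y_mean"], 2))  # → 9.8
```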
31. • Offline data collection (for training and testing)
Pick up the device from a desk
Unlock the device using the right slide pattern
Invoke Email app from the "Home Screen"
Lock the device by pressing the "Power" button
Put the device back on the desk
33. • Alpha test in Jun 2012; first Google Play Store release in Oct 2012
• False positives: a 13% FPR still annoys users occasionally
• Use an adaptive model
• Add the trace data from shortly before a false positive to the training data and update the model
• Change passcode validation to a sliding pattern
• A false positive grants a "free ride" for a configurable duration
• Assumption: a just-authenticated user should control the device for a given period of time
• The "free ride" period ends immediately if an abrupt context change is detected
• A newer version is scheduled for release in Jan 2013
34. • Human stress needs to be properly handled
• DARPA - Detection and Computational Analysis of Psychological Signals
• Develop analytical tools to assess the psychological status of war fighters
• Improve psychological health awareness and enable them to seek timely help
• Measuring stress is expensive and time-consuming
• Expensive medical procedures: EKG, EEG
• Self-report: questionnaires, interviews, surveys
• BehavioMetrics-based estimation
• Monitor mouse movements, screen touches (Windows 8), keystrokes, active applications, and network traffic patterns to build Behaviometrics
• Use memory tests and other mental exercise results as ground truth
• Perform classification and regression to build behavior-stress models
38. "MobiSens: A Versatile Mobile Sensing Platform for Real-world Applications," MONE, 2013. [with P. Wu, J. Zhang]
"SenSec: Mobile Application Security through Passive Sensing," in Proceedings of the International Conference on Computing, Networking and Communications (ICNC 2013), San Diego, USA, January 28-31, 2013. [with P. Wu, X. Wang, J. Zhang]
"Towards Accountable Mobility Model: A Language Approach on User Behavior Modeling in Office WiFi Networks," ICCCN 2011, Maui, HI, Aug 1-5, 2011. [with Y. Zhang]
"Retweet Modeling Using Conditional Random Fields," in Proceedings of DMCCI 2011: ICDM 2011 Workshop on Data Mining Technologies for Computational Collective Intelligence, December 11, 2011. [with H. Peng, D. Piao, R. Yan, Y. Zhang]
"Mobile Lifelogger - Recording, Indexing, and Understanding a Mobile User's Life," in Proceedings of the Second International Conference on Mobile Computing, Applications, and Services, Santa Clara, CA, Oct 25-28, 2010. [with S. Chennuru, P. Cheng, Y. Zhang]
"SensCare: Semi-Automatic Activity Summarization System for Elderly Care," MobiCase 2011, Los Angeles, CA, October 24-27, 2011. [with P. Wu, H. Peng, J. Y. Zhang]
"Helix: Unsupervised Grammar Induction for Structured Human Activity Recognition," in Proceedings of the IEEE International Conference on Data Mining (ICDM), Vancouver, Canada, Dec 11-14, 2011. [with H. Peng, P. Wu, Y. Zhang]
"Statistically Modeling the Effectiveness of Disaster Information in Social Media," in Proceedings of the IEEE Global Humanitarian Technology Conference (GHTC), Seattle, Washington, Oct 30 - Nov 1, 2011. [with F. Xiong, D. Piao, Y. Liu, Y. Zhang]
"A Dissipative Network Model with Neighboring Activation," The European Physical Journal B. [with F. Xiong, Y. Liu, J. Zhu, Z. J. Zhang, Y. C. Zhang, J. Zhang]
"Opinion Formation with the Evolution of Network," in Proceedings of the 2011 Cross-Strait Conference on Information Science and Technology and iCube, Taipei, China, Dec 8-9, 2011. [with F. Xiong, Y. Liu, Y. Zhang]
Building along this line, we use a continuous n-gram model to learn the sequence of locations from a user's WiFi traces. The n-gram model works under the assumption that the next location in the sequence depends only on the last n-1 locations. Once the n-gram model is trained, we can use it to calculate the probability of all possible next locations given the past n-1 locations, and see which one is the most likely. To train the model, we use maximum likelihood estimation on the training sequences to estimate these conditional probabilities, just by counting. As shown in the equation, the MLE probability of being in a location at time i, conditioned on the past n-1 locations, is the count of the full n-location sequence in the data divided by the count of the (n-1)-location history. There is one small problem with this approach: if the model comes across a location sequence not seen in training, it assigns a zero probability, which may push the system to trigger an anomaly alert. Luckily, the n-gram model is very robust in handling unseen labels if we use smoothing. Smoothing algorithms such as Katz's take some probability mass from the seen labels and reserve it for the unseen ones.
In natural language, words in a sentence may have long-distance dependencies. For example, the sentence "I hit the tennis ball" has three tri-grams: "I hit the", "hit the tennis", and "the tennis ball". An equally important tri-gram, "I hit ball", is not captured by the continuous n-gram model, because the separators "the" and "tennis" sit in the middle. If we could skip the separators, we could form this important tri-gram: "I hit ball". Similarly, in the continuous n-gram model just described, a user's next location depends only on the n-1 previous locations; in many cases this may not be true. Using the same example: if a user is leaving the break room and entering the hallway that leads to his office, we can predict he will be in his office soon. The intermediate locations along the hallway, before entering the office, are not that important, and those locations can be skipped in the modeling. In the diagram, ABC is the break room, ACD is the entrance of the hallway, and EDB is the office; anything in the middle can be skipped and still give the same result. By skipping the d intervening grams, the effective n-gram order becomes (n-d). We can therefore reduce the size of the model in terms of computation and storage, because the n-gram model performs better at a lower value of n.
Once we have constructed a model of a user's behaviometrics through learning, we can continue monitoring the user's behaviometrics and compare them with the learned model. If the new behaviometrics deviate from the learned model, we may choose to trigger an anomaly alert. However, variations in sensory data streams can also be caused by noise and new behaviors, in addition to anomalous behaviors. Variations caused by noise are less significant and can be smoothed out statistically. To distinguish between anomalous and new behaviors, on the other hand, we need to evaluate whether the unseen patterns can be incorporated into the model over time. Failing to make this distinction may yield false positives temporarily, but if feedback mechanisms are in place to correct those false positives, we can still build a robust anomaly detection system in application domains such as theft detection and prevention, casual authentication, emergency detection, and healthcare monitoring.
To illustrate this process, let's look at an example. The blue curve is the log probability just described. Say an anomaly happens at point A. If we set the threshold low, like the red line, the system detects the anomaly at point B with a reasonable delay. But if we set the threshold too high, like the pink line, we mistakenly flag anomalies for sequences of normal behavior text, which count as false positives at points C and D. The way to find the right threshold for different applications is to use the receiver operating characteristic (ROC) curve, which we will look at in more detail later in the talk.
Consider a simple example, where the red trace on this office floor represents the usual mobility of a user: he is finishing a meeting in a conference room and going back to his cubicle. <<hit enter>> Now look at another path the user might take: instead of going this way, he heads in the other direction, <<hit enter>> then deviates further and further. In such a case, we would want to flag an anomaly. It could be that a visitor who attended the meeting took the device the employee forgot in the conference room and walked away. The device may still have access to the company's internal network and other data sources; on receiving this alert, the infrastructure can revoke its authentication credentials temporarily until the user authenticates himself again. <<hit enter>> Now, if instead of going further away he goes back to his cubicle, just by an alternate path, we probably do not want to flag an anomaly.
The management, control, and data frames from a device are heard by multiple APs. In our particular setup, these APs record the received signal strength (RSS) of those frames along with the identity of the device and timing information. These traces are aggregated at a central location, where we serialize them by timestamp and group them by device ID. So, for a particular device, we can build a time series of RSS vectors, where each element of a vector is the RSS from a particular AP. This series of RSS vectors, along with other context information, serves as the input to the preprocessing module, where it is converted to a text representation before being fed into our n-gram model.
From the signal propagation model, if two vectors are very similar, the locations where they were measured should be within reasonable proximity. Based on this assumption, we partition the RSS vector space into many "pseudo locations" and assign each pseudo location a unique label. By pseudo, we mean that we do not need to know the exact location of a reading; we only need to distinguish between two different locations. This can be done with a clustering algorithm, for example k-means. In the k-means runs, we use a distance function similar to RedPin and WASP, in addition to the standard cosine function, to reduce the noise caused by interference. Once the clustering is done, we assign the same label to all members of a cluster.
We collected the RSS traces from 87 WAPs in an office building over 5 days, with RSS samples at 13-second precision. These traces contain complete data for 40 users, about 3.2 million data points in total. Backup data points: pseudo location from RSS (other schem not very ….); 1500 data points (RSS) per user on average, with RSS from 3-7 WAPs; assuming users are up half of the time, that is 80k data points per user for 5 days; 3.2 million data points collected for 40 users, 20 million RSS readings; for each of the 40 users, 16K RSS vectors in total.
To validate our system, we need testing data. Fortunately, the traces we collected contain no recorded anomalies, so we created simulated device-stolen events by splicing two users' trace segments at their intersection points, where similar labels or label sequences are shared. We combined these simulated traces with normal traces to create a testing data set.
Now that we have gained some insight into our approach, it is time to explore the design parameters mentioned at the beginning. The first set of experiments finds the best anomaly detection threshold. In fact there is no single best threshold; the threshold depends on the application we are running. What are the requirements on detection accuracy? How many false positives can we tolerate? Do we have enough training data? To provide a guideline for answering these questions, we plot the receiver operating characteristic (ROC) curve. Essentially, the ROC curve captures the trade-off between the true positive rate and the false positive rate of our anomaly detection. We perform the experiments with different training data sizes, plotting the ROC curve by varying the threshold and recording the TPR and FPR. With the ROC curve, we can decide the threshold for a particular application depending on the amount of data the model must see before it can detect anomalies, the required TPR, or the acceptable FPR. For example, if we want to use an 8-hour training size with a false positive rate below 0.1, we locate that point and obtain the threshold that generated it (0.4); we need a threshold below 0.4 to fulfill the FPR requirement. Another example: with the same FPR requirement but a TPR above 0.8, we have to use more than 8 hours of training data to achieve the goal.
We plot these graphs with different training sizes and n-gram orders, and several things stand out. A higher-order model captures more context and in turn increases accuracy. But accuracy saturates beyond order 5, which suggests that a user's behavior depends mostly on the last 5 pseudo locations. This resonates with the past work mentioned at the beginning, and it tells us that increasing model complexity beyond this point will not bring significant improvement. Second, a training size as small as 4 hours may not capture a user's mobility behavior thoroughly enough for accurate detection. The closeness of the 8-hour and 12-hour curves also suggests that our system provides relatively good results once it has observed a user's behavior for 8 hours. One interesting point: the 12-hour and 8-hour curves cross over at the lower n-gram orders. While this could be due to errors in handling the data, our explanation leans toward the larger training set exposing more common locations that the shorter training size does not capture. With these common locations, people share many short sequences, so more simulated anomalies go undetected, bringing the accuracy down.
SenSec constantly collects sensory data from the accelerometer, gyroscope, GPS, WiFi, microphone, or even the camera. By analyzing the sensory data, it constructs the context in which the mobile device is used, including locations, movements, and usage patterns. From this context, the system calculates the certainty that the system is at risk. Each application on the mobile device is assigned a sensitivity value, either manually or automatically. When the user invokes an application, SenSec compares the certainty with that application's sensitivity level; if the sensitivity passes the certainty threshold, an authentication mechanism is employed to enforce the security policy for that application.
That brings me to the end of my presentation. Thank you very much for your attention.