Jiang Zhu
jiang.zhu@sv.cmu.edu
December 13th, 2012




                       1
Study the fundamental scientific problem of modeling an individual’s behavior from heterogeneous sensory time-series
• Data collected from physical and soft sensors

• Apply the behavioral models to real applications
   • Security: Accountable Mobility Model
   • Mobile Security: SenSec
   • Psychological status estimation: StressSens

                                                     2
• BehavioMetrics: derived from Behavioral Biometrics
• Behavioral: the way a human subject behaves

• Biometrics: technologies and methods that measure and analyze
 biological characteristics of the human body
   • Fingerprints, retina, voice patterns

• BehavioMetrics: Measurable behavior to Recognize or to Verify
   • Identity of a human subject, or
   • Subject’s certain behaviors


                                                                   3
[Diagram: framework overview with the blocks Raw Data, Preprocessing, Modeling, Ground Truth and Evaluation, each row leading to Applications]




                                        4
• Raw data: heterogeneous sensor data (MobiSens)
• Preprocessing: behavioral text representation
• Modeling: n-gram, skipped n-gram, Helix, Helix Tree, DT, RF, SVM…
• Ground truth: Sim. Attacks, Ctrl. Exp., Auth. Records, Mem. Test
• Evaluation: Precision, Recall, Accuracy, Error & FP
• Applications: Accountable Mobility, SenSec, StressSens




                                                  5
• Human behavior/activities share some common properties
  with natural languages
     • Meanings are composed from the meanings of building blocks
     • An underlying structure (grammar) exists
     • Expressed as a sequence (time-series)

• Apply the rich set of statistical NLP techniques to mobile sensory data




                                                                6
[Diagram: Quantization and Clustering]




                            7
• Generative language model: P(English sentence) given a model
   P(“President Obama has signed the Bill of … ” | Politics) >>
   P(“President Obama has signed the Bill of … ” | Sports)
   An LM reflects the n-gram distribution of its training data:
   domain, genre, topics.
• With labeled behavior text data, we can train an LM for
 each activity type (“walking”-LM, “running”-LM) and
 classify the activity as the one whose LM assigns the
 highest probability to the observed behavior text
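The per-activity language model idea can be sketched in a few lines. This is a minimal illustration, not the author's implementation: it assumes bigram models with add-one smoothing, and the token names (A1, A2, ...) and training sequences are hypothetical.

```python
import math
from collections import Counter

def train_bigram_lm(sequences):
    """Count bigrams and their one-token histories over label sequences."""
    histories, bigrams, vocab = Counter(), Counter(), set()
    for seq in sequences:
        vocab.update(seq)
        for prev, cur in zip(seq, seq[1:]):
            histories[prev] += 1
            bigrams[(prev, cur)] += 1
    return histories, bigrams, vocab

def log_prob(seq, lm, vocab_size):
    """Add-one smoothed bigram log-likelihood of seq under one LM."""
    histories, bigrams, _ = lm
    return sum(
        math.log((bigrams[(prev, cur)] + 1) / (histories[prev] + vocab_size))
        for prev, cur in zip(seq, seq[1:])
    )

def classify(seq, lms):
    """Pick the activity whose LM gives seq the highest log-likelihood."""
    vocab_size = len(set().union(*(lm[2] for lm in lms.values())))
    return max(lms, key=lambda name: log_prob(seq, lms[name], vocab_size))

# Hypothetical behavior text: each token is a quantized sensor label.
training = {
    "walking": [["A1", "A2", "A1", "A2", "A1", "A2"]],
    "running": [["A3", "A4", "A3", "A4", "A3", "A4"]],
}
lms = {name: train_bigram_lm(seqs) for name, seqs in training.items()}
predicted = classify(["A1", "A2", "A1"], lms)
```

Log-probabilities are summed rather than multiplying raw probabilities, which avoids numeric underflow on long sequences.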




                                                                  8
• User activity at time t depends only on the last n-1 locations

• A sequence of activities can be predicted from n consecutive
 activities in the past


• Maximum Likelihood Estimation from training data by counting
 n-gram occurrences

• MLE assigns zero probability to unseen n-grams
   • Incorporate a smoothing function (Katz)
   • Discount probability for observed grams
   • Reserve probability for unseen grams
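The counting formula on this slide was an image and is not preserved. The standard maximum-likelihood estimate for an n-gram model is count(history + token) / count(history), and the sketch below applies it to a hypothetical location trace. It also shows the zero-probability problem. Note that add-one (Laplace) smoothing is swapped in here as a simpler stand-in for the Katz smoothing the slide names; it reserves mass for unseen grams in the same spirit but is not Katz back-off.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count n-grams and their (n-1)-token histories in one sequence."""
    grams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    histories = Counter()
    for gram, count in grams.items():
        histories[gram[:-1]] += count
    return grams, histories

def mle(gram, grams, histories):
    """Maximum-likelihood estimate: count(gram) / count(history)."""
    h = histories[gram[:-1]]
    return grams[gram] / h if h else 0.0

def add_one(gram, grams, histories, vocab_size):
    """Add-one smoothing (a stand-in, not Katz): unseen grams get mass."""
    return (grams[gram] + 1) / (histories[gram[:-1]] + vocab_size)

# Hypothetical trace of pseudo-location labels.
trace = ["L1", "L2", "L3", "L1", "L2", "L4"]
grams, histories = ngram_counts(trace, 2)
vocab_size = len(set(trace))

seen = mle(("L1", "L2"), grams, histories)      # observed twice after L1
unseen = mle(("L2", "L1"), grams, histories)    # never observed
smoothed = add_one(("L2", "L1"), grams, histories, vocab_size)
```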



                                                                   9
• Long-distance dependency of words in sentences
   • Tri-grams for “I hit the tennis ball”: “I hit the”, “hit the tennis”, “the tennis ball”
   • “I hit ball” is not captured

• Future activities depend on activities far in the past; intermediate
 behavior has little relevance or influence
   • Noise in the data sets: “ping-pong” effects in time-series,
   interference, sampling errors, etc.
   • Model size
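A skipped n-gram can be generated by letting the words of a gram skip over intermediate tokens. The following is a minimal sketch under one common definition (up to `max_skip` tokens skipped in total per gram); the function name and the `max_skip` parameter are illustrative and not taken from the slides.

```python
from itertools import combinations

def skip_grams(tokens, n, max_skip):
    """Return n-grams whose words may skip up to max_skip tokens in total."""
    out = []
    for start in range(len(tokens)):
        # Candidate followers of tokens[start] within the allowed span.
        window = tokens[start + 1:start + n + max_skip]
        for rest in combinations(range(len(window)), n - 1):
            out.append((tokens[start],) + tuple(window[i] for i in rest))
    return out

sentence = "I hit the tennis ball".split()
grams = skip_grams(sentence, 3, 2)
```

With `max_skip=0` the function reduces to ordinary contiguous n-grams, so the plain tri-gram model is a special case.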




                                                                                              10
• Build BehavioMetrics models for M classes: P_0, P_1, P_2, …, P_{M-1}
   • Genders, age groups, occupations
   • Behaviors, activities, actions
   • Health and mental status

• For a new behavioral text string L, we calculate the probability that L
 is generated by model m




• Classification is then formulated as choosing the model with the highest probability
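The two equations on this slide were images and are not preserved in the transcript. One form consistent with the surrounding text, assuming each class model P_m is an n-gram model and writing L = l_1 … l_T (the symbols l_t, T and \hat{m} are notation introduced here, not the author's), would be:

```latex
P(L \mid P_m) = \prod_{t=1}^{T} P_m\left(l_t \mid l_{t-n+1}, \dots, l_{t-1}\right),
\qquad
\hat{m} = \operatorname*{arg\,max}_{0 \le m \le M-1} P(L \mid P_m)
```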




                                                                     11
• Is this play Shakespeare’s work?

• Compare the play to Shakespeare’s known
 library of works
• Track word and phrase patterns in the data

• Calculate the probability of the unknown play U
 given all of Shakespeare’s known works {S}
• Compare with a threshold θ
   • Authentic work (a=1)
   • Fake, forgery or plagiarism (a=0)




                                                12
• A special binary classification problem

• Given a normal BehavioMetrics model Pn, a new behavior text
 sequence L, and a threshold θ, calculate the likelihood that L is
 generated by Pn and compare it with θ




• If the outcome is -1, flag an anomaly alert

• Variation caused by noise can be smoothed out statistically

• Feedback is needed to handle false positives, which are usually caused
 by unseen behaviors or a sub-optimal threshold.
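The thresholding step can be sketched as follows. This is an illustration under simplifying assumptions: the "normal" model is a hypothetical unigram table rather than an n-gram model, a single threshold θ is used, and scores are averaged over a sliding window so that isolated noisy tokens do not trigger an alert on their own.

```python
import math

def average_log_prob(window, prob):
    """Mean log probability of the tokens in a window under the normal model."""
    return sum(math.log(prob(tok)) for tok in window) / len(window)

def detect(tokens, prob, theta, width):
    """Return +1 (normal) or -1 (anomaly) for each sliding-window position."""
    outcomes = []
    for i in range(len(tokens) - width + 1):
        score = average_log_prob(tokens[i:i + width], prob)
        outcomes.append(1 if score >= theta else -1)
    return outcomes

# Hypothetical normal model, with a small floor probability for unseen labels.
normal = {"L1": 0.5, "L2": 0.4, "L3": 0.1}

def prob(tok):
    return normal.get(tok, 1e-3)

stream = ["L1", "L2", "L1", "L9", "L9", "L9", "L1", "L2"]
alerts = detect(stream, prob, theta=math.log(0.05), width=3)
```

In this toy stream a single unseen label inside a window stays above the threshold, while two or more push the window average below it and produce -1.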


                                                                  13
[Figure: average log probability (y-axis) versus sliding window position (x-axis), showing the log probability curve with a low and a high threshold; regions labeled A, B, C and D]


                                                                             14
• Convert feature vector series to label streams – dimension reduction

• Step window with assigned length



[Figure: four label streams on a shared time axis]
   A: A1 A2 A1 A4
   G: G2 G5 G2 G2
   W: W2 W1 W2
   P: P1 P3 P6 P1
   Resulting behavior text: A2 G2G5 W1 P1P3 A1A4 G2 W1W2 P1
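One way to turn a feature vector series into a label stream is nearest-centroid quantization over fixed step windows. The sketch below is an assumption about how that could work, not the author's method: the centroids, the averaging of each window and the non-overlapping step are all illustrative choices.

```python
def nearest_label(vector, centroids):
    """Label of the centroid closest to vector (squared Euclidean distance)."""
    return min(
        centroids,
        key=lambda lab: sum((v - c) ** 2 for v, c in zip(vector, centroids[lab])),
    )

def to_label_stream(vectors, centroids, window):
    """Average each non-overlapping step window, then map it to one label."""
    labels = []
    for i in range(0, len(vectors) - window + 1, window):
        chunk = vectors[i:i + window]
        mean = [sum(col) / window for col in zip(*chunk)]
        labels.append(nearest_label(mean, centroids))
    return labels

# Hypothetical 2-D feature vectors and two labeled centroids.
centroids = {"A1": (0.0, 0.0), "A2": (1.0, 1.0)}
vectors = [(0.1, 0.0), (0.0, 0.2), (0.9, 1.1), (1.0, 0.8), (0.2, 0.1), (0.0, 0.0)]
stream = to_label_stream(vectors, centroids, window=2)
```

Each window of multi-dimensional readings collapses to a single symbol, which is the dimension reduction the slide refers to.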
                                                                   15
• Induce underlying grammar of human activities
   • Identify atomic activities through bracketing and collocation
   • Generalize semantically similar activities into higher level activities.




                                                                                16
1. Vocabulary Initialization using Time-series Motifs

2. Super-Activity Discovery by Statistical Collocation

3. Vocabulary Generalization via Aggregated Similarity
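Step 2 can be illustrated with a simple collocation test. The slides do not say which statistic Helix uses, so this sketch assumes pointwise mutual information (PMI) over adjacent activity labels, with a minimum count to filter out rare pairs; the activity names and thresholds are hypothetical.

```python
import math
from collections import Counter

def collocations(tokens, min_pmi, min_count=2):
    """Adjacent pairs seen at least min_count times with PMI >= min_pmi."""
    unigrams = Counter(tokens)
    pairs = Counter(zip(tokens, tokens[1:]))
    n_uni, n_pair = len(tokens), len(tokens) - 1
    found = {}
    for (a, b), count in pairs.items():
        if count < min_count:
            continue  # PMI is unreliable for pairs seen only once
        p_pair = count / n_pair
        p_indep = (unigrams[a] / n_uni) * (unigrams[b] / n_uni)
        pmi = math.log2(p_pair / p_indep)
        if pmi >= min_pmi:
            found[(a, b)] = pmi
    return found

activity_text = ["sit", "stand", "walk", "sit", "stand", "walk",
                 "type", "sit", "stand", "walk"]
merged = collocations(activity_text, min_pmi=1.5)
```

Pairs that pass the test would be candidates for merging into a single higher-level "super-activity" symbol before the next pass.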


                                                         17
18
ACM MONE Journal, 2012
                         19
20
• Collect the received signal strength (RSS) of the devices at multiple wireless access points (WAPs), with timestamps

• Aggregate and serialize into a time series of RSS vectors
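The aggregation step might look like the sketch below, which buckets timestamped readings into fixed intervals and emits one RSS vector per interval. The tuple layout, the WAP names, the interval and the floor value used for access points that did not hear the device are all assumptions made for illustration.

```python
from collections import defaultdict

def serialize_rss(readings, waps, interval, floor=-100):
    """Group (timestamp, wap, rss) readings into per-interval RSS vectors."""
    buckets = defaultdict(dict)
    for ts, wap, rss in readings:
        buckets[ts // interval][wap] = rss  # last reading in an interval wins
    return [
        (slot * interval, [buckets[slot].get(w, floor) for w in waps])
        for slot in sorted(buckets)
    ]

# Hypothetical readings: (seconds, access point, RSS in dBm).
waps = ["wap-a", "wap-b", "wap-c"]
readings = [(0, "wap-a", -40), (3, "wap-b", -70),
            (14, "wap-a", -55), (20, "wap-c", -60)]
series = serialize_rss(readings, waps, interval=13)
```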




* Lin, et al “WASP: An enhanced indoor location algorithm for a congested wi-fi environment”
                                                                                               21
• Raw RSS vectors are high-dimensional and too fine-grained for modeling

• Proximity in location results in similar RSS vectors

• K-means clustering with a distance function similar to
   WASP [1]; each cluster is assigned a pseudo-location label
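A minimal k-means sketch for turning RSS vectors into pseudo-location labels follows. It uses plain squared Euclidean distance as a stand-in, since the WASP-like distance function is not described in the slides, and a deterministic initialization (the first k points) chosen only to keep the example reproducible.

```python
def kmeans(points, k, iterations=20):
    """Plain Lloyd's k-means with squared Euclidean distance."""
    centroids = [list(p) for p in points[:k]]  # deterministic init: first k points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        for i, p in enumerate(points):
            labels[i] = min(
                range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])),
            )
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster empties
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Hypothetical 2-WAP RSS vectors (dBm) from two distinct spots.
rss_vectors = [(-40, -80), (-75, -45), (-42, -78), (-77, -43), (-41, -82)]
labels, centroids = kmeans(rss_vectors, k=2)
pseudo_locations = ["loc-%d" % lab for lab in labels]
```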




[1] Lin, et al “WASP: An enhanced indoor location algorithm for a congested wi-fi environment”
                                                                                                 22
Dataset
   Users:              40
   Location:           Cisco SJC 14 1F Alpha networks
   RSS sampling rate:  13 sec
   Period:             5 days
   Number of WAPs:     87
   Device:             Cisco Aironet 1500 + MSE
   Dataset size:       3.2 mil points

• RSS vector clustering
• Run a small subset of the trace with different K and evaluate
  clustering performance by average distance to centroids
• K = 3 × #WAPs has the best trade-offs
• Yields ~260 pseudo locations

                                                                     23
• Testing samples
   • Positive sample: simulated anomaly made by splicing traces from two different users
   • Negative sample: trace from the “owner”
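Constructing the two kinds of test samples can be sketched as below. The cut point and the trace contents are hypothetical; the slides do not say where or how the two users' traces were joined.

```python
def splice(owner_trace, other_trace, cut):
    """Positive (anomalous) sample: owner's trace up to cut, then another user's."""
    return owner_trace[:cut] + other_trace[cut:]

# Hypothetical pseudo-location traces for two users.
owner = ["L1", "L2", "L1", "L2", "L1", "L2"]
intruder = ["L7", "L8", "L9", "L7", "L8", "L9"]

positive = splice(owner, intruder, cut=3)  # simulated anomaly
negative = list(owner)                     # unmodified owner trace
```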




                                                                               24
[Figure: ROC curves, true positive rate versus false positive rate, for data sizes of 8 hours and 12 hours]
                                                                                                                    25
[Figure: accuracy versus n-gram order (0 to 10) for data sizes of 4, 8 and 12 hours]
                                                                                            26
[Diagram blocks: Quantization, Clustering; Risk Analysis Tree, Sensor Fusion and Segmentation, Activity Recognition; Certainty of Risk < Application Sensitivity; Application Access Control]




                                                                                                  27
[Diagram: pipeline stages Sensing, Preprocessing (Feature Construction, Behavior Text Generation), Modeling (N-gram Model) and Inference (User Classifier for user classification; Binary Classifier with Threshold for user authentication)]

• SenSec collects sensor data
   • Motion sensors
   • GPS and WiFi scanning
   • In-use applications and their traffic patterns

• The SenSec module builds user behavior models
   • Unsupervised activity segmentation, with the sequence modeled by a
   language model
   • Build a Risk Analysis Tree (DT) to detect anomalies
   • Combine the above to estimate risk (online): certainty score

• The Application Access Control Module activates authentication based
 on the score and a customizable threshold.
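The access-control decision can be sketched as a comparison between the certainty score and a per-application threshold. The direction of the comparison, the score scale (0 to 1) and the application names are assumptions made for illustration; the slides only state that a score and a customizable threshold are involved.

```python
def needs_authentication(risk_score, app, thresholds, default=0.5):
    """True when the risk certainty score exceeds the app's tolerated level."""
    return risk_score > thresholds.get(app, default)

# Hypothetical per-application thresholds: lower means more sensitive.
thresholds = {"email": 0.3, "browser": 0.7}
decisions = {
    app: needs_authentication(0.5, app, thresholds)
    for app in ("email", "browser")
}
```

With the same score, the more sensitive application asks for authentication while the less sensitive one does not.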

                                                                                                                          28
• Accelerometer
   • Features used to summarize the acceleration stream
   • Calculated separately for each dimension [x, y, z, m]
   • Meta features: Total Time, Window Size

• GPS: location string from Google Map API and mobility path

• WiFi: SSIDs, RSSIs and path

• Applications: Bitmap of well-known applications

• Application Traffic Pattern: TCP/UDP traffic pattern vectors:
 [remote host, port, rate]
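Per-dimension accelerometer features could be computed as in the sketch below. The list of statistics on the slide was not preserved, so mean and standard deviation are assumed here, and m is assumed to be the magnitude of the (x, y, z) vector; both are illustrative choices.

```python
import math
from statistics import mean, pstdev

def accel_features(samples):
    """Summarize one window of (x, y, z) samples per dimension; m is magnitude."""
    dims = {
        "x": [s[0] for s in samples],
        "y": [s[1] for s in samples],
        "z": [s[2] for s in samples],
        "m": [math.sqrt(s[0] ** 2 + s[1] ** 2 + s[2] ** 2) for s in samples],
    }
    features = {}
    for name, values in dims.items():
        features[name + "_mean"] = mean(values)
        features[name + "_std"] = pstdev(values)
    features["window_size"] = len(samples)  # meta feature named on the slide
    return features

# Hypothetical window of four accelerometer samples.
window = [(0.0, 0.0, 1.0), (0.0, 0.0, 1.0), (3.0, 4.0, 0.0), (3.0, 4.0, 0.0)]
feats = accel_features(window)
```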
                                                                    29
30
• Offline data collection (for training and testing)
   • Pick up the device from a desk
   • Unlock the device using the right slide pattern
   • Invoke the Email app from the "Home Screen"
   • Lock the device by pressing the "Power" button
   • Put the device back on the desk




                                                       31
• 71.3% true positive rate with a 13.1% false positive rate

                                                       32
• Alpha test in Jun 2012; first Google Play Store release in Oct 2012

• False positives: a 13% FPR still annoys users at times

• Use an adaptive model
   • Add the trace data from shortly before a false positive to the training data and
     update the model

• Change passcode validation to a sliding pattern

• A false positive grants a “free ride” for a configurable duration
   • Assumption: a user who has just authenticated should control the device for a given
     period of time

• The “free ride” period ends immediately if an abrupt context change is
  detected.
• A newer version is scheduled to be released in Jan 2013.

                                                                                  33
• Human stress needs to be properly handled
   • DARPA - Detection and Computational Analysis of Psychological Signals
   • Develop analytical tools to assess the psychological status of war fighters
   • Improve psychological health awareness and enable them to seek timely
     help

• Measuring stress is expensive and time-consuming
   • Expensive medical procedures: EKG, EEG
   • Self-report: questionnaires, interviews, surveys

• BehavioMetrics-based estimation
   • Monitor mouse movements, screen touches (Windows 8), keystrokes,
     active applications and network traffic patterns to build BehavioMetrics.
   • Use memory test and other mental exercise results as ground truth.
   • Perform classification and regression to build Behavior-Stress models.

                                                                                  34
35
• Raw data: heterogeneous sensor data (MobiSens)
• Preprocessing: behavioral text representation
• Modeling: n-gram, skipped n-gram, Helix, Helix Tree, DT, RF, SVM…
• Ground truth: Sim. Attacks, Ctrl. Exp., Auth. Records, Mem. Test
• Evaluation: Precision, Recall, ROC, Accuracy, FP
• Applications: Accountable Mobility, SenSec, StressSens



                                                  36
• Language approach in modeling user behavior via textual
  representation of heterogeneous time-series
• Evaluate and adapt NLP techniques to BehavioMetrics in activity
  segmentation, recognition, classification and anomaly detection from
  sequential data
• Unsupervised Helix and Helix-TF to discover hierarchical structure in
  BehavioMetrics for general classification and anomaly detection
• Build and release 3 applications
   • MobiSens
   • SenSec
   • StressSens
• Gain insights from experiments and provide guidelines in selecting
  models, tuning parameters and improving UX
• Valuable labeled or partially labeled data sets to enable other
  BehavioMetric research
• "MobiSens: A Versatile Mobile Sensing Platform for Real-world Applications", MONE, 2013 [with P.Wu, J.Zhang]
• "SenSec: Mobile Application Security through Passive Sensing," to appear in the Proceedings of International Conference on Computing, Networking and Communications (ICNC 2013), San Diego, USA, January 28-31, 2013 [with P.Wu, X.Wang, J.Zhang]
• "Towards Accountable Mobility Model: A Language Approach on User Behavior Modeling in Office WiFi Networks", accepted to ICCCN 2011, Maui, HI, Aug 1-5, 2011 [with Y.Zhang]
• "Retweet Modeling Using Conditional Random Fields," in the Proceedings of DMCCI 2011: ICDM 2011 Workshop on Data Mining Technologies for Computational Collective Intelligence, December 11, 2011 [with H.Peng, D.Piao, R.Yan and Y.Zhang]
• "Mobile Lifelogger - recording, indexing, and understanding a mobile user's life", in the Proceedings of The Second International Conference on Mobile Computing, Applications, and Services, Santa Clara, CA, Oct 25-28, 2010 [with S.Chennuru, P.Cheng, Y.Zhang]
• "SensCare: Semi-Automatic Activity Summarization System for Elderly Care", MobiCase 2011, Los Angeles, CA, October 24-27, 2011 [with Pang Wu, Huan-kai Peng, Joy Ying Zhang]
• "Helix: Unsupervised Grammar Induction for Structured Human Activity Recognition," to appear in the Proceedings of The IEEE International Conference on Data Mining series (ICDM), Vancouver, Canada, Dec 11-14, 2011 [with Huan-Kai Peng, Pang Wu, and Ying Zhang]
• "Statistically Modeling the Effectiveness of Disaster Information in Social Media," to appear in the Proceedings of IEEE Global Humanitarian Technology Conference (GHTC), Seattle, Washington, Oct. 30 - Nov. 1st, 2011 [with Fei Xiong, Dongzhen Piao, Yun Liu, and Ying Zhang]
• "A dissipative network model with neighboring activation," to appear in The European Physical Journal B [with F. Xiong, Y. Liu, J. Zhu, Z. J. Zhang, Y. C. Zhang, and J. Zhang]
• "Opinion Formation with the Evolution of Network," to appear in the Proceedings of 2011 Cross-Strait Conference on Information Science and Technology and iCube, TaiBei, China, Dec 8-9, 2011 [with F.Xiong, Y.Liu, Y.Zhang]

                                                                                                                           38
Thank you.

🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 

Último (20)

Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 

Modeling individual behavior from heterogeneous sensory time-series

  • 2. Study the fundamental scientific problem of modeling an individual’s behavior from heterogeneous sensory time-series • Data collected from physical and soft sensors • Apply the behavioral models to real applications • Security: Accountable Mobility Model • Mobile Security: SenSec • Psychological status estimation: StressSens
  • 3. BehavioMetrics • Derived from Behavioral Biometrics • Behavioral: the way a human subject behaves • Biometrics: technologies and methods that measure and analyze biological characteristics of the human body • Fingerprints, eye retina, voice patterns • BehavioMetrics: measurable behavior used to recognize or to verify • Identity of a human subject, or • Certain behaviors of the subject
  • 4. [Diagram: pipeline with boxes labeled Raw Data, Preprocessing, Modeling, Ground Truth, Evaluation, and Applications]
  • 5. [Diagram: three columns. Left: Heterogeneous Sensor Data; MobiSens; Sim. Attacks, Ctrl. Exp., Auth. Records, Mem. Test. Middle: Behavioral Text Representation; n-gram, Skipped n-gram, Helix, Helix Tree, DT, RF, SVM…; Prec., Recall, Accuracy, Error & FP. Right: Accountable Mobility; SenSec; StressSens]
  • 6. • Human behavior and activities share some common properties with natural languages • Meanings are composed from the meanings of building blocks • There is an underlying structure (grammar) • Expressed as a sequence (time-series) • Apply the rich set of statistical NLP techniques to mobile sensory data
  • 7. [Diagram: Quantization, Clustering]
  • 8. • Generative language model: P(English sentence) given a model • P(“President Obama has signed the Bill of …” | Politics) >> P(“President Obama has signed the Bill of …” | Sports) • The LM reflects the n-gram distribution of the training data: domain, genre, topics • With labeled behavior text data, we can train an LM for each activity type (“walking”-LM, “running”-LM) and classify the activity as the type whose LM assigns the highest probability to the observed behavior text
  • 9. • User activity at time t depends only on the last n-1 locations • A sequence of activities can be predicted from n consecutive activities in the past • Maximum Likelihood Estimation from training data by counting: $P(l_i \mid l_{i-n+1}, \dots, l_{i-1}) = C(l_{i-n+1}, \dots, l_i) / C(l_{i-n+1}, \dots, l_{i-1})$, where $C(\cdot)$ is the number of times a sequence occurs in the training data • MLE assigns zero probability to unseen n-grams • Incorporate a smoothing function (Katz) • Discount probability for observed grams • Reserve probability for unseen grams
  • 10. • Long-distance dependency of words in sentences • Tri-grams for “I hit the tennis ball”: “I hit the”, “hit the tennis”, “the tennis ball” • “I hit ball” is not captured • Future activities depend on activities far in the past; intermediate behavior has little relevance or influence • Noise in the data sets: “ping-pong” effects in time-series, interference, sampling errors, etc. • Model size
  • 11. • Build BehavioMetrics models for M classes P_0, P_1, P_2, …, P_{M-1} • Genders, age groups, occupations • Behaviors, activities, actions • Health and mental status • For a new behavioral text string L, we calculate the probability that L is generated by model m • The classification problem is formulated as choosing the class whose model gives L the highest probability
  • 12. • Is this play Shakespeare’s work? • Compare the play to Shakespeare’s known library of works • Track word and phrase patterns in the data • Calculate the probability of the unknown work U given all of Shakespeare’s known works {S} • Compare with a threshold θ • Authentic work (a=1) • Fake, forgery or plagiarism (a=0)
  • 13. • A special binary classification problem • Given a normal BehavioMetrics model Pn, a new behavior text sequence L, and a threshold θ, calculate the likelihood that L is generated by Pn and compare it with θ • If the outcome is -1, flag an anomaly alert • Variation caused by noise can be smoothed out statistically • Some feedback is needed to handle false positives, which are usually caused by unseen behaviors or a sub-optimal threshold
  • 14. [Figure: average log probability versus sliding window position, with a low threshold and a high threshold marked and points A, B, C and D labeled]
  • 15. • Convert feature vector series to label streams (dimension reduction) • Step window with assigned length • [Diagram: label streams A1 A2 A1 A4 / G2 G5 G2 G2 / W2 W1 W2 / P1 P3 P6 P1, grouped by step window into “A2 G2G5 W1 P1P3” and “A1A4 G2 W1W2 P1”]
  • 16. • Induce underlying grammar of human activities • Identify atomic activities through bracketing and collocation • Generalize semantically similar activities into higher level activities. 16
  • 17. 1. Vocabulary Initialization using Time-series Motifs 2. Super-Activity Discovery by Statistical Collocation 3. Vocabulary Generalization via Aggregated Similarity 17
  • 18. 18
  • 19. ACM MONE Journal, 2012 19
  • 20. 20
  • 21. • Collect RSS of the devices on multiple WAPs with timestamps • Aggregate and serialize into time series of RSS vectors * Lin, et al “WASP: An enhanced indoor location algorithm for a congested wi-fi environment” 21
  • 22. • Dimensionality in RSS vector – too fine for modeling • Proximity in location results in similar RSS vector • K-means clustering algorithm with distance function similar to WASP[1] and each cluster assigned a pseudo location label [1] Lin, et al “WASP: An enhanced indoor location algorithm for a congested wi-fi environment” 22
  • 23. • RSS vector clustering • Run a small subset of the trace with different K and evaluate clustering performance by average distance to centroids • K = 3 × #WAPs has the best trade-offs • Yields ~260 pseudo locations • Dataset: Users: 40; Location: Cisco SJC 14 1F, Alpha networks; RSS sampling rate: 13 sec; Period: 5 days; Number of WAPs: 87; Device: Cisco Aironet 1500 + MSE; Dataset size: 3.2 mil points
  • 24. • Testing samples Positive sample: simulated anomaly by splicing traces from two different users Negative sample: trace from “owner” 24
  • 25. [Figure: ROC curves, True Positive Rate versus False Positive Rate, for training data sizes of 8 hrs and 12 hrs]
  • 26. [Figure: accuracy versus n-gram order (0 to 10) for training data sizes of 4 hr, 8 hr and 12 hr]
  • 27. [Diagram: blocks labeled Quantization, Clustering, Sensor Fusion and Segmentation, Activity Recognition, Risk Analysis Tree, Certainty of Risk, Application Sensitivity, and Application Access Control]
  • 28. [Diagram: stages Sensing, Preprocessing, Modeling and Inference, with blocks Feature Construction, Behavior Text Generation, N-gram Model, User Classifier, Classification, Binary Authentication, and Threshold] • SenSec collects sensor data • Motion sensors • GPS and WiFi scanning • In-use applications and their traffic patterns • SenSec modules build user behavior models • Unsupervised activity segmentation, with the sequence modeled using a language model • Building a Risk Analysis Tree (DT) to detect anomalies • Combine the above to estimate risk (online): certainty score • The Application Access Control Module activates authentication based on the score and a customizable threshold
  • 29. • Accelerometer • Used to summarize acceleration stream • Calculated separately for each dimension [x,y,z,m] • Meta features: Total Time, Window Size • GPS: location string from Google Map API and mobility path • WiFi: SSIDs, RSSIs and path • Applications: Bitmap of well-known applications • Application Traffic Pattern: TCP UDP traffic pattern vectors: [ remote host, port, rate ] 29
  • 30. 30
  • 31. • Offline data collection (for training and testing) Pick up the device from a desk Unlock the device using the right slide pattern Invoke Email app from the "Home Screen" Lock the device by pressing the "Power" button Put the device back on the desk 31
  • 32. • 71.3% True-Positive Rate with 13.1% False Positive 32
  • 33. • Alpha test in Jun 2012, first Google Play Store release in Oct 2012 • False positives: a 13% FPR still annoys users sometimes • Use an adaptive model • Add the trace data from shortly before a false positive to the training data and update the model • Change passcode validation to a sliding pattern • A false positive grants a “free ride” for a configurable duration • Assumption: a user who has just authenticated should control the device for a given period of time • The “free ride” period ends immediately if an abrupt context change is detected • A newer version is scheduled to be released in Jan 2013
  • 34. • Human stress needs to be properly handled • DARPA: Detection and Computational Analysis of Psychological Signals • Develop analytical tools to assess the psychological status of war fighters • Improve psychological health awareness and enable them to seek timely help • Measurement of stress is expensive and time-consuming • Expensive medical procedures: EKG, EEG • Self-report: questionnaires, interviews, surveys • BehavioMetrics-based estimation • Monitor mouse movements, screen touches (Windows 8), keystrokes, active applications and network traffic patterns to build BehavioMetrics • Use memory test and other mental exercise results as ground truth • Perform classification and regression to build Behavior-Stress models
  • 35. 35
  • 36. [Diagram: three columns. Left: Heterogeneous Sensor Data; MobiSens; Sim. Attacks, Ctrl. Exp., Auth. Records, Mem. Test. Middle: Behavioral Text Representation; n-gram, Skipped n-gram, Helix, Helix Tree, DT, RF, SVM…; Prec., Recall, ROC, Accuracy, FP. Right: Accountable Mobility; SenSec; StressSens]
  • 37. • Language approach in modeling user behavior via textual representation of heterogeneous time-series • Evaluate and adapt NLP techniques to BehavioMetrics in activity segmentation, recognition, classification and anomaly detection from sequential data • Unsupervised Helix and Helix-TF to discover hierarchical structure in BehavioMetrics for general classification and anomaly detection • Build and release 3 applications: MobiSens, SenSec, StressSens • Gain insights from experiments and provide guidelines in selecting models, tuning parameters and improving UX • Valuable labeled or partially labeled data sets to enable other BehavioMetric research • © 2010 Cisco and/or its affiliates. All rights reserved. Cisco Confidential
  • 38. Publications
      • "MobiSens: A Versatile Mobile Sensing Platform for Real-world Applications", MONE, 2013. [with P. Wu, J. Zhang]
      • "SenSec: Mobile Application Security through Passive Sensing", to appear in the Proceedings of the International Conference on Computing, Networking and Communications (ICNC 2013), San Diego, USA, January 28-31, 2013. [with P. Wu, X. Wang, J. Zhang]
      • "Towards Accountable Mobility Model: A Language Approach on User Behavior Modeling in Office WiFi Networks", accepted to ICCCN 2011, Maui, HI, Aug 1-5, 2011. [with Y. Zhang]
      • "Retweet Modeling Using Conditional Random Fields", in the Proceedings of DMCCI 2011: ICDM 2011 Workshop on Data Mining Technologies for Computational Collective Intelligence, December 11, 2011. [with H. Peng, D. Piao, R. Yan and Y. Zhang]
      • "Mobile Lifelogger - recording, indexing, and understanding a mobile user's life", in the Proceedings of The Second International Conference on Mobile Computing, Applications, and Services, Santa Clara, CA, Oct 25-28, 2010. [with S. Chennuru, P. Cheng, Y. Zhang]
      • "SensCare: Semi-Automatic Activity Summarization System for Elderly Care", MobiCase 2011, Los Angeles, CA, October 24-27, 2011. [with Pang Wu, Huan-Kai Peng, Joy Ying Zhang]
      • "Helix: Unsupervised Grammar Induction for Structured Human Activity Recognition", to appear in the Proceedings of The IEEE International Conference on Data Mining series (ICDM), Vancouver, Canada, Dec 11-14, 2011. [with Huan-Kai Peng, Pang Wu, and Ying Zhang]
      • "Statistically Modeling the Effectiveness of Disaster Information in Social Media", to appear in the Proceedings of IEEE Global Humanitarian Technology Conference (GHTC), Seattle, Washington, Oct. 30 - Nov. 1, 2011. [with Fei Xiong, Dongzhen Piao, Yun Liu, and Ying Zhang]
      • "A dissipative network model with neighboring activation", to appear in The European Physical Journal B. [with F. Xiong, Y. Liu, J. Zhu, Z. J. Zhang, Y. C. Zhang, and J. Zhang]
      • "Opinion Formation with the Evolution of Network", to appear in the Proceedings of 2011 Cross-Strait Conference on Information Science and Technology and iCube, TaiBei, China, Dec 8-9, 2011. [with F. Xiong, Y. Liu, Y. Zhang]

Editor's notes

  1. So building along this line, we use a continuous n-gram model to learn the sequence of locations from the user’s WiFi traces. The n-gram model works under the assumption that the next location in the sequence depends on just the last n-1 locations. Once the n-gram model is trained, we can use it to calculate the probability of all possible next locations given the past n-1 locations, and see which one is the most likely. To train the model, we use maximum likelihood estimation on the training sequences to estimate these conditional probabilities, just by counting. As shown in this equation, the MLE probability of being in a location at time i, conditioned on the past n-1 locations, is just the count of that sequence of n locations in the data divided by the count of the sequence of n-1 history locations. There is one small problem with this approach. Let’s say our model comes across a location that has not been seen in training. It just assumes a zero probability. This may push the system to trigger an anomaly alert. Luckily, the n-gram model is very robust in handling unseen labels if we use smoothing. Smoothing algorithms such as Katz take some probability mass from the seen labels and reserve it for the unseen labels.
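The counting step described above can be sketched in a few lines of Python. This is a minimal illustration, not the talk's implementation: the function names and the toy trace are hypothetical, and add-k smoothing is used as a simpler stand-in for Katz smoothing, only to show how probability mass gets reserved for unseen n-grams.

```python
from collections import Counter


def train_ngram(sequence, n):
    """Count each n-gram and its (n-1)-label history in a label sequence."""
    spans = range(len(sequence) - n + 1)
    ngrams = Counter(tuple(sequence[i:i + n]) for i in spans)
    histories = Counter(tuple(sequence[i:i + n - 1]) for i in spans)
    return ngrams, histories


def mle_prob(ngrams, histories, history, label):
    """MLE estimate: count(history + label) / count(history)."""
    h = tuple(history)
    if histories[h] == 0:
        return 0.0
    return ngrams[h + (label,)] / histories[h]


def add_k_prob(ngrams, histories, history, label, vocab_size, k=1.0):
    """Add-k smoothing (a stand-in for Katz): unseen n-grams get non-zero mass."""
    h = tuple(history)
    return (ngrams[h + (label,)] + k) / (histories[h] + k * vocab_size)


# Hypothetical pseudo-location trace for one user.
trace = ["A", "B", "C", "A", "B", "D", "A", "B", "C"]
ngrams, histories = train_ngram(trace, n=3)

print(mle_prob(ngrams, histories, ["A", "B"], "C"))       # seen trigram
print(mle_prob(ngrams, histories, ["A", "B"], "A"))       # unseen: zero under MLE
print(add_k_prob(ngrams, histories, ["A", "B"], "A", 4))  # unseen: small but non-zero
```

With plain MLE the unseen trigram gets probability zero, which is the case that would trigger a spurious alert; the smoothed estimate keeps it small but positive.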
  2. In natural language, words in a sentence may have long-distance dependencies. For example, the sentence “I hit the tennis ball” has 3 tri-grams: “I hit the”, “hit the tennis” and “the tennis ball”. It is clear that an equally important tri-gram, “I hit ball”, is not normally captured by the continuous n-gram, because the separators “the” and “tennis” are in the middle. If we could skip the separators, we can form this important tri-gram: “I hit ball”. Similarly, in the continuous n-gram model I just described, the user’s next location depends only on his n-1 previous locations. However, in many cases this may not be true. Using the same example, if a user is leaving the break room and entering the hallway that leads to his office, we can predict he will be in his office soon. The intermediate locations along the hallway and before entering the office are not that important. Those locations can be skipped in the modeling. As shown in the diagram here, ABC is the break room, ACD is the entrance of the hallway and EDB is the office. Anything in the middle can be skipped and still give the same results. By skipping d distracting grams, the effective n-gram order becomes (n-d). Therefore, we can reduce the size of the model in terms of computation and storage, because the n-gram model is cheaper to compute and store for a lower value of n.
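One common way to realize the skipping idea is a generic skip-gram extractor, sketched below. This is an assumption about how skipping could be implemented, not the talk's skipped n-gram model; the function name `skip_grams` and the `max_skip` parameter are hypothetical.

```python
from itertools import combinations


def skip_grams(tokens, n, max_skip):
    """Return all n-grams that skip at most max_skip tokens in total.

    Each gram starts at a window's first token and picks the remaining
    n-1 tokens, in order, from the next n + max_skip - 1 positions.
    """
    grams = set()
    for start in range(len(tokens)):
        window = tokens[start:start + n + max_skip]
        for rest in combinations(range(1, len(window)), n - 1):
            grams.add((window[0],) + tuple(window[i] for i in rest))
    return grams


sentence = ["I", "hit", "the", "tennis", "ball"]

contiguous = skip_grams(sentence, n=3, max_skip=0)
skipped = skip_grams(sentence, n=3, max_skip=2)

print(("I", "hit", "ball") in contiguous)  # the contiguous tri-grams miss it
print(("I", "hit", "ball") in skipped)     # skipping "the" and "tennis" recovers it
```

With `max_skip=0` the function reduces to ordinary contiguous n-grams, so the same code covers both the continuous and the skipped case.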
  3. Once we have constructed a model of a user's behaviometrics through learning, we can continue monitoring the user's behaviometrics and compare them with the learned model. If the new behaviometrics deviate from the learned model, we may choose to trigger an anomaly alert. However, variations in sensory data streams could also be caused by noise and new behaviors, in addition to anomalous behaviors. Variations caused by noise are less significant and can be smoothed out statistically. On the other hand, to distinguish between anomalous and new behaviors, we need to evaluate whether those unseen patterns can be incorporated into the model over time. Failing to identify such a distinction might yield false positives temporarily, but if certain feedback mechanisms are in place to correct those false positives, we are still able to build a robust anomaly detection system in various application domains such as theft detection and prevention, casual authentication, emergency detection and healthcare monitoring.
  4. To illustrate this process, let’s take a look at an example. The blue curve is the log probability we just described. Let’s say an anomaly happens at point A. If we set the threshold lower, like the red line, the system will detect the anomaly at point B with a reasonable delay. But if we set the threshold too high, like the pink line, we will mistakenly flag an anomaly for a sequence of normal behavior text, which counts as false positives at points C and D. The way to find the right threshold for different applications is to use the receiver operating characteristic curve, or ROC curve. We will look at this in more detail later in the talk.
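The thresholding just described can be sketched as follows. This is a minimal illustration with made-up numbers: the per-label probabilities, the window length and the two thresholds are all hypothetical, and the helper names are not from the talk.

```python
import math


def sliding_avg_log_prob(log_probs, window):
    """Average log probability over each sliding window position."""
    return [
        sum(log_probs[i:i + window]) / window
        for i in range(len(log_probs) - window + 1)
    ]


def first_alert(scores, threshold):
    """Index of the first window whose score falls below the threshold."""
    for position, score in enumerate(scores):
        if score < threshold:
            return position
    return None


# Hypothetical per-label probabilities from a trained model: six labels of
# normal behavior, then four labels the model finds very unlikely.
probs = [0.5] * 6 + [0.01] * 4
scores = sliding_avg_log_prob([math.log(p) for p in probs], window=3)

print(first_alert(scores, threshold=-1.0))  # higher threshold: earliest alert
print(first_alert(scores, threshold=-3.0))  # lower threshold: alert one window later
```

Averaging over a window is what smooths out isolated noisy labels; the price of a lower threshold is the detection delay between points A and B on the slide.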
  5. Think of a simple example, where the red traces on this office floor represent the usual mobility of a user. In this case, this user is finishing a meeting in a conference room and is going back to his cubicle. << hit enter >> Now, if we look at another path the user is taking, instead of going this way, he is going in the other direction. << hit enter >> Then he deviates further and further, like this. In such a case, we would want to flag this as an anomaly. It could be that a visitor who attended the meeting took the device the employee forgot in the conference room and went away. The device may still have access to the company's internal network and other data sources. On receiving this alert, the infrastructure would revoke his authentication credentials temporarily until the user can authenticate himself again. << hit enter >> Now, suppose that instead of going further away, he is going back to his cubicle, just by taking an alternate path. In this case, we probably do not want to flag this as an anomaly.
  6. The management, control and data frames from a device will be heard by multiple APs. In our particular setup, these APs record the received signal strength (RSS) of those frames along with the identity of the device and timing information. These traces are aggregated at a central location, where we can serialize them based on the timestamp and classify them using the device IDs. So, for a particular device, we can build a time series of RSS vectors, where each element in a vector is the RSS from a particular AP. This series of RSS vectors, along with other context information, serves as the input to the preprocessing module, where we convert it to a text representation before feeding it into our n-gram model.
  7. From the signal propagation model, if two vectors are very similar, we know that the locations where these vectors were measured should be within a reasonable proximity. Based on this assumption, we want to partition the RSS vector space into many “pseudo locations” and assign each “pseudo location” a unique label. By pseudo, we mean we don’t need to know the exact location of the reading; we just need to distinguish between two different locations. Well, this can be easily done by a clustering algorithm, for example K-means clustering. In the K-means clustering runs, we use a distance function similar to Redpin and WASP, in addition to the standard cosine function, to reduce the noise caused by interference. Once the clustering is done, we assign the same label to all the members that belong to the same cluster.
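A small pure-Python sketch of the clustering step is given below. It is an illustration under stated assumptions: plain squared Euclidean distance replaces the Redpin/WASP-like distance mentioned above, the initialization is a naive "first k vectors" choice, and the RSS values are invented.

```python
def squared_distance(a, b):
    """Squared Euclidean distance (a stand-in for the WASP-like distance)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(vector, centroids):
    """Index of the centroid closest to the vector."""
    return min(range(len(centroids)), key=lambda c: squared_distance(vector, centroids[c]))


def kmeans(vectors, k, iterations=20):
    """Basic K-means; returns one cluster index per input vector."""
    centroids = [list(v) for v in vectors[:k]]  # naive deterministic init
    for _ in range(iterations):
        labels = [nearest(v, centroids) for v in vectors]
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return [nearest(v, centroids) for v in vectors]


# Hypothetical RSS vectors (dBm) seen by three APs at four time steps.
rss_series = [
    [-40, -80, -90],
    [-90, -75, -45],
    [-42, -78, -91],
    [-88, -77, -43],
]

clusters = kmeans(rss_series, k=2)
behavior_text = ["L%d" % c for c in clusters]  # pseudo-location labels
print(behavior_text)
```

The resulting label sequence is the "behavior text" that the n-gram model consumes; the labels carry no physical meaning beyond telling two places apart.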
  8. So, we collected the RSS traces from 87 WAPs in an office building over 5 days. The time precision of the RSS samples is at the 13-second level. These traces contain complete data for 40 users, and in total we have about 3.2 million data points. Backup data points: pseudo location from RSS (other schemes not very ….); 1500 data points (RSS) per user on average; RSS from 3-7 WAPs; assume user up half of the time -> 80k data points per user for 5 days; 3.2 mil data points collected for 40 users; 20 mil RSS readings; for each of these 40 users, 16K RSS vectors total.
  9. To validate our system, we need some testing data. However, fortunately, there are no recorded anomalies in the traces we collected. We created simulated stolen-device events by splicing two users’ trace segments at their intersection points, where a similar label or label sequence is shared. We combined these simulated traces with normal traces to create a testing data set.
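The splicing procedure can be sketched like this. It is a simplified illustration: it joins the traces at the first single shared label rather than at a shared label sequence, and the function name and traces are hypothetical.

```python
def splice_at_shared_label(owner, other):
    """Join the owner's trace to another user's trace at a shared label.

    Returns (spliced_trace, splice_index), or None when the two traces
    never share a label after the owner's first position.
    """
    for i in range(1, len(owner)):
        if owner[i] in other:
            j = other.index(owner[i])
            return owner[:i] + other[j:], i
    return None


owner_trace = ["A", "B", "C", "D", "E"]
other_trace = ["X", "Y", "C", "Z", "W"]

result = splice_at_shared_label(owner_trace, other_trace)
print(result)  # owner's labels up to the shared "C", then the other user's labels
```

The splice index doubles as ground truth: everything from that position on is the simulated anomaly, so detection delay can be measured against it.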
  10. Now that we have gained some insights into our approach, it is time to explore some of the design parameters we mentioned in the beginning. The first set of experiments is to find the best anomaly detection threshold. Actually, there is no single best threshold; the threshold depends on the application we are running. What are the requirements on detection accuracy? Can we tolerate many false positives? Do we have enough training data? To provide a guideline in answering these questions, we plot the receiver operating characteristic curve (or ROC curve). Essentially, the ROC curve is about the trade-off between the true-positive rate and the false-positive rate in our anomaly detection. We perform the experiments with different training data sizes. We plot the ROC curve by varying the threshold and recording the TPR and FPR. With the ROC curve, we can decide the threshold for a particular application depending on the amount of data the model should see before it can detect an anomaly, the required TPR, or the acceptable FPR. For example, if we want to use an 8-hour training size and want a false positive rate below 0.1, then we just need to locate this point and obtain the threshold by which this data point is generated (0.4). We need to use a threshold < 0.4 in order to fulfill the FPR requirement. Another example: let’s say we have the same FPR requirement but want TPR > 0.8; then we have to use more than 8 hours of training data to achieve this goal.
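The threshold sweep behind an ROC curve can be sketched as below. The scores, labels and thresholds are invented for illustration, and the helper names are hypothetical; the only assumption carried over from the talk is that a sample is flagged when its score falls below the threshold.

```python
def roc_points(scores, is_anomaly, thresholds):
    """For each threshold, return (threshold, FPR, TPR).

    A sample is flagged as anomalous when its score is below the threshold.
    """
    positives = sum(is_anomaly)
    negatives = len(is_anomaly) - positives
    points = []
    for t in thresholds:
        tp = sum(1 for s, y in zip(scores, is_anomaly) if y and s < t)
        fp = sum(1 for s, y in zip(scores, is_anomaly) if not y and s < t)
        points.append((t, fp / negatives, tp / positives))
    return points


def pick_threshold(points, max_fpr):
    """Highest-TPR operating point whose FPR stays within the budget."""
    allowed = [p for p in points if p[1] <= max_fpr]
    return max(allowed, key=lambda p: p[2]) if allowed else None


# Hypothetical average log-probability scores for 4 normal and 4 spliced traces.
scores = [-0.5, -0.7, -0.9, -1.5, -2.5, -3.0, -1.2, -0.8]
is_anomaly = [False, False, False, False, True, True, True, True]

points = roc_points(scores, is_anomaly, thresholds=[-2.0, -1.0, -0.6])
for threshold, fpr, tpr in points:
    print(threshold, fpr, tpr)

print(pick_threshold(points, max_fpr=0.25))
```

Running the same sweep once per training size gives one curve per size, which is how the 8-hour and 12-hour curves on the slide can be compared at a fixed FPR budget.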
  11. We plot this graph with different training sizes and n-gram orders. From the graph, we can see several things. First, a higher-order model captures more context and in turn increases accuracy. But accuracy saturates beyond order 5, which means the user’s behavior is most likely to depend on his last 5 pseudo locations. This resonates with the past work we mentioned in the beginning. It also tells us that increasing the model complexity beyond this point will NOT bring about significant improvement. Second, it shows that if the training size is as small as 4 hours, it may not capture users’ mobility behavior thoroughly enough to make an accurate detection. Also, the closeness between the 8-hour and 12-hour curves suggests that our system will provide relatively good results if we have observed users’ behavior for 8 hours. One interesting point to make here is that the 12-hour and 8-hour curves cross over at the lower n-gram orders. While this could be due to errors in handling the data, our explanation leans towards the bigger training data set exposing more common locations that are not captured with the shorter training size. With these common locations, people share a lot of shorter sequences, so more simulated anomalies go undetected, which brings down the accuracy.
  12. SenSec is constantly collecting sensory data from the accelerometer, gyroscope, GPS, WiFi, microphone or even camera. By analyzing the sensory data, it constructs the context under which the mobile device is used. This includes locations, movements, usage patterns, etc. From the context, the system can calculate the certainty that the system is at risk. Different applications on the mobile device are assigned a sensitivity value, either manually or automatically. When the user invokes an application, SenSec compares the certainty with this application’s sensitivity level. If the sensitivity passes the certainty threshold, an authentication mechanism is employed to enforce the security policy for that application.
  13. That brings me to the end of my presentation. Thank you very much for your attention.