FINDING SENSITIVE INFORMATION
IN TEXT DATA
Jan Neerbek
Senior IT Solutions Architect
ABOUT ME
14/06/2016 Side 2
• Jan Neerbek
• Alexandra Institute
• Senior IT Solutions Architect
• PhD student (text mining)
About Alexandra Institute
• Tech Transfer
• General IT; one focus area is data science
• Both commercial and research activities
• Offices in København and Aarhus
MINING TEXT
• Old field (1950s)
• First applications: Automatic classification of news and
patents
• Unstructured
• Large data sets
• Good automatic understanding is hard
SENSITIVE INFORMATION IN TEXT
• Industry: Protect business secrets
• Healthcare: Protect private information
• Military: Protect plans and invented equipment
• Government: Protect citizen data and closed agendas
What counts as sensitive is domain-specific
DATA LEAK PREVENTION (DLP)
• Accidental or malicious leak?
[Diagram labels: Production; Sensitivity checking; Governance; Publication; Sensitive information detection; Censoring; Redaction; Obfuscation]
SENSITIVE SUBSETS
[Diagram labels: "You know this"; "I know this"; "Together we know this"]
I know A, you know B
We know that A+B implies C
So now we know C
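The A+B implies C argument can be sketched as a small forward-chaining loop. This is only an illustration: the function name `derive` and the rule encoding (a set of antecedents paired with one consequent) are assumptions made for this sketch, not something taken from the slides.

```python
def derive(known, rules):
    """Forward-chain: add every consequent whose antecedents are all known."""
    known = set(known)
    changed = True
    while changed:
        changed = False
        for antecedents, consequent in rules:
            if antecedents <= known and consequent not in known:
                known.add(consequent)
                changed = True
    return known


# One rule: A and B together imply C.
rules = [(frozenset({"A", "B"}), "C")]

mine = derive({"A"}, rules)            # I know only A
yours = derive({"B"}, rules)           # you know only B
together = derive({"A", "B"}, rules)   # pooled knowledge
```

Neither party can reach C alone, but the pooled set contains it, which is why a text that is harmless on its own can still be sensitive once it is combined with what a reader already knows.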
SENSITIVE SUBSETS
[Diagram labels: "Attacker knowledge"; "Can be published"; "Sensitive information"]
EARLY APPROACHES
(OFTEN STILL USED)
• Censoring approach (e.g. manual censoring)
• Keyword-based
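A keyword-based check might look roughly like the sketch below. The keyword list and the function name `find_keywords` are hypothetical, chosen only for illustration; a real deployment would use a domain-specific list.

```python
import re


def find_keywords(text, keywords):
    """Return the keywords that occur in text as whole words (case-insensitive)."""
    hits = []
    for kw in keywords:
        pattern = r"\b" + re.escape(kw) + r"\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            hits.append(kw)
    return hits


# Hypothetical keyword list, for illustration only.
KEYWORDS = ["confidential", "merger", "diagnosis"]

flagged = find_keywords("Draft: the Merger plan is confidential.", KEYWORDS)
missed = find_keywords("We will join forces with them next quarter.", KEYWORDS)
```

The second sentence arguably describes the same event as the first but contains none of the listed words, so it is not flagged. That is the basic weakness of keyword lists.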
AUTOMATED APPROACHES
14/06/2016 Side 9
Want:
• As little human intervention as possible
• Accuracy as high as possible
• Only relevant alerts
• We will look at two approaches
KNOWN DATA WITH “SENSITIVE”
INFORMATION
• Enron corpus
• WikiLeaks
• Panama Papers
• …
FINDING WHARTON (ENRON
CORPUS)
• Want: B+C+D
• Got: B+C+A
Source: Chow, Richard, Philippe Golle, and Jessica Staddon. "Detecting privacy leaks using corpus-based association rules." ACM SIGKDD, 2008.
N-GRAM
• Used to solve many NLP problems
• N-grams are an old technology (1950s)
• However, high-order n-gram models have seen a revival
because of faster hardware (allowing for bigger corpora)
• We assign probabilities to sentences
• What is the probability of the sentence “Your mother and I are
going to …”?
• An n-gram model considers the probability of the next word, e.g.
P(“divorce”) vs. P(“disney”)
• For 2-grams, the probability of a sentence becomes
$P(\text{sentence}) = \prod_i P(v_i \mid v_{i-1})$
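The 2-gram product can be sketched in a few lines. The probability table below uses made-up numbers, and the start symbol `<s>` is an assumption of this sketch (the slide does not say how the first word is handled).

```python
def sentence_probability(words, bigram_prob, start="<s>"):
    """Multiply P(word | previous word) over the sentence.

    Unseen bigrams get probability 0.0 here; real models smooth instead.
    """
    prob = 1.0
    prev = start
    for word in words:
        prob *= bigram_prob.get((prev, word), 0.0)
        prev = word
    return prob


# Made-up probabilities, for illustration only.
bigram_prob = {
    ("<s>", "we"): 0.5,
    ("we", "are"): 0.4,
    ("are", "going"): 0.25,
}

p = sentence_probability(["we", "are", "going"], bigram_prob)
p_unseen = sentence_probability(["we", "going"], bigram_prob)
```

Here `p` is the product 0.5 × 0.4 × 0.25, while `p_unseen` is zero because the pair ("we", "going") is not in the table.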
HOW TO EXTRACT WORD-TO-WORD
PROBABILITIES?
• Want $P(v_i \mid v_{i-1})$
• Usually we would look in our training data and use
$\frac{\text{count}(v_{i-1}\, v_i)}{\text{count}(v_{i-1})}$
• But for sensitive information that is not good enough
• Because we want to model an attacker’s knowledge
• One solution: use search queries over the Web
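Both estimates can be sketched with the same ratio. The first function counts in a local corpus; the second takes its counts from a search backend. The `hit_count` callable is a hypothetical stand-in (no real search API is implied), and the hit numbers are invented.

```python
from collections import Counter


def bigram_estimates(tokens):
    """Estimate count(prev next) / count(prev) from a list of tokens."""
    unigram = Counter(tokens)
    bigram = Counter(zip(tokens, tokens[1:]))
    return {(prev, nxt): c / unigram[prev] for (prev, nxt), c in bigram.items()}


def web_estimate(prev, nxt, hit_count):
    """Same ratio, with counts supplied by a search backend.

    hit_count is a hypothetical callable mapping a query string to a hit count.
    """
    denom = hit_count(prev)
    return hit_count(f"{prev} {nxt}") / denom if denom else 0.0


corpus_probs = bigram_estimates("the cat sat on the mat".split())

# Invented hit counts standing in for real search results.
fake_hits = {"the": 1000, "the cat": 20}
web_prob = web_estimate("the", "cat", lambda query: fake_hits.get(query, 0))
```

The point of the second variant is that the counts reflect what is publicly findable, which is closer to what an attacker can look up than the contents of one's own training data.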
NUMBER OF WORD-TO-WORD
RELATIONS FOUND USING WEB
SEARCH
INFERENCE RULES
• Want to model rules like:
What word should come next:
$\text{current\_word}(\text{“the”}) \rightarrow \neg\, \text{next\_word}(\text{“the”})$
Is this text about Wharton University:
“Wharton” + “University” → YES
• Widely used in recommender systems, where they are known
as association rules
• Like expert systems, but here the rules are generated automatically
AUTOMATIC INFERENCE RULES
• How to find them?
Again, we count. For a rule $A \rightarrow B$:
Confidence of rule: $\frac{\text{count}(A \cap B)}{\text{count}(A)}$
Support of rule: $\frac{\text{count}(A)}{N}$
For sensitive data we want all rules with high confidence!
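A minimal sketch of the two measures follows, assuming that each document is represented as a set of words, that N is the number of documents, and that count(A ∩ B) means the number of documents containing both A and B. Support follows the definition on the slide, count(A)/N; many textbooks instead define rule support as count(A ∩ B)/N. The toy documents are invented.

```python
def count(docs, items):
    """Number of documents that contain every item in `items`."""
    return sum(1 for doc in docs if items <= doc)


def confidence(docs, a, b):
    """count(A and B together) / count(A)."""
    n_a = count(docs, a)
    return count(docs, a | b) / n_a if n_a else 0.0


def support(docs, a):
    """count(A) / N, following the slide's definition."""
    return count(docs, a) / len(docs)


# Invented toy documents, each reduced to a set of words.
docs = [
    {"wharton", "university", "mba"},
    {"wharton", "university"},
    {"wharton", "enron"},
    {"enron", "energy"},
]
A = frozenset({"wharton"})
B = frozenset({"university"})

conf_ab = confidence(docs, A, B)   # 2 of the 3 "wharton" documents
supp_a = support(docs, A)          # 3 of 4 documents
```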
AUTOMATIC INFERENCE RULES
• Count all combinations -> grows exponentially!
• Current algorithms (Apriori, FP-growth, the implementations in
Mahout) use an invariant to reduce running time:
Apriori invariant:
If a rule has high support, then its sub-rules also have high support
Example:
$\text{HighSupport}(A \wedge B \rightarrow C)$ implies $\text{HighSupport}(A \rightarrow C)$ and $\text{HighSupport}(B \rightarrow C)$
For sensitive data we want all rules with high confidence!
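The invariant is what lets Apriori-style algorithms prune. The sketch below is a simplified level-wise search, not the Mahout implementation, and it states the invariant on itemsets (a set can only be frequent if all of its subsets are), which is the usual formulation. The documents and the threshold are invented.

```python
from itertools import combinations


def frequent_itemsets(docs, min_support):
    """Level-wise search for all itemsets with support >= min_support."""
    n = len(docs)

    def supp(itemset):
        return sum(1 for doc in docs if itemset <= doc) / n

    items = sorted({item for doc in docs for item in doc})
    level = [frozenset([i]) for i in items if supp(frozenset([i])) >= min_support]
    result = list(level)
    k = 2
    while level:
        prev = set(level)
        # Join step: unions of two frequent (k-1)-sets that have exactly k items.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step (the invariant): every (k-1)-subset must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        level = [c for c in candidates if supp(c) >= min_support]
        result.extend(level)
        k += 1
    return result


docs = [
    {"a", "b", "c"},
    {"a", "b"},
    {"a", "c"},
    {"b", "c"},
    {"a", "b", "c"},
]
frequent = frequent_itemsets(docs, min_support=0.6)
```

With this toy data every single item and every pair clears the threshold, while the triple does not. Note that the pruning is driven by support only; a rule that is rare but has high confidence would be discarded, which is the tension the slide points at.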
AUTOMATIC INFERENCE RULES
• Count all combinations -> Grows exponentially!
• We need to restrict the search space:
– Length of rules
– Search distance between words
– Only consider single-clause consequents
– Only consider conjunctions
– Subsampling
– Approximation
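Several of these restrictions can be combined in one small miner. The sketch below is an assumption-laden illustration rather than the method from the cited paper: antecedents are conjunctions of at most `max_antecedent` words, the consequent is a single word, and all words of a rule must fall within a sliding window of `window` tokens. The parameter names, thresholds, and documents are all invented.

```python
from collections import Counter
from itertools import combinations


def mine_rules(docs, max_antecedent=2, window=5, min_confidence=0.8):
    """docs: list of token lists. Returns {(antecedent, consequent): confidence}."""
    antecedent_count = Counter()
    rule_count = Counter()
    for tokens in docs:
        seen_ante, seen_rule = set(), set()
        for start in range(len(tokens)):
            span = sorted(set(tokens[start:start + window]))
            for size in range(1, max_antecedent + 1):
                for ante in combinations(span, size):
                    seen_ante.add(ante)
                    for cons in span:
                        if cons not in ante:
                            seen_rule.add((ante, cons))
        # Count each antecedent and rule at most once per document.
        antecedent_count.update(seen_ante)
        rule_count.update(seen_rule)
    rules = {}
    for (ante, cons), c in rule_count.items():
        conf = c / antecedent_count[ante]
        if conf >= min_confidence:
            rules[(ante, cons)] = conf
    return rules


docs = [
    "alice studied at wharton university".split(),
    "wharton university offers an mba".split(),
    "wharton called about the enron deal".split(),
]
rules = mine_rules(docs, max_antecedent=2, window=3, min_confidence=0.6)
```

The filter is on confidence only, in line with the goal of keeping all high-confidence rules; the length and window limits are what keep the number of candidates manageable.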
QUESTIONS
How do I select a sensitive text detection system?
• Still early days
• Difficult to compare solutions out there
– No uniform performance measure yet
– Lack of public labeled datasets for calculating performance
measures
So when should I consider a sensitive text detection system?
• When you have a large censoring effort -> current
systems will lessen the effort
FUTURE
“NLP is kind of like a rabbit in the headlights of the Deep
Learning machine, waiting to be flattened.”
- Neil Lawrence (U. Sheffield) @ ICML panel 2015
Thank you for your attention!
