2. ABOUT ME
• Jan Neerbek
• Alexandra Institute
• Senior IT Solutions Architect
• PhD student – text mining
About Alexandra Institute
• Tech Transfer
• General IT; one focus area is Data Science
• Both commercial and research activities
• Offices in Copenhagen and Aarhus
3. MINING TEXT
• Old field (1950s)
• First applications: automatic classification of news and patents
• Unstructured data
• Large data sets
• Good automatic understanding is hard
4. SENSITIVE INFORMATION IN TEXT
• Industry: Protect business secrets
• Healthcare: Protect private information
• Military: Protect plans and invented equipment
• Government: Protect citizen data and closed agendas
What counts as sensitive is domain-specific
5. DATA LEAK PREVENTION (DLP)
• Accidental or malicious leak?
[Diagram: document pipeline – Production → Sensitivity checking → Governance → Publication, with sensitive information detection feeding censoring (redaction, obfuscation)]
6. SENSITIVE SUBSETS
[Diagram: overlapping circles – “You know this”, “I know this”, “Together we know this”]
• I know A, you know B
• We know that A + B implies C
• So now we know C
9. AUTOMATED APPROACHES
Want:
• As little human intervention as possible
• As high accuracy as possible
• Only relevant alerts
• We will look at two approaches
10. KNOWN DATA WITH “SENSITIVE” INFORMATION
• Enron corpus
• Wikileaks
• Panama papers
• …
11. FINDING WHARTON (ENRON CORPUS)
• Want: B+C+D
• Got: B+C+A
Chow, Richard, Philippe Golle, and Jessica Staddon. “Detecting privacy leaks using corpus-based association rules.” ACM SIGKDD, 2008.
12. N-GRAM
• Used to solve many NLP problems
• N-grams are an old technique (1950s)
• However, high-order n-gram models have seen a revival because of faster hardware (allowing for bigger corpora)
• We assign probabilities to sentences.
• What is the probability of the sentence “Your mother and I are going to …”?
• An n-gram model considers the probability of the next word, e.g. P(“divorce”) vs. P(“Disney”)
• For 2-grams the probability of a sentence becomes
$P(\text{sentence}) = \prod_i P(v_i \mid v_{i-1})$
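A minimal sketch of scoring a sentence under a 2-gram model, assuming the conditional probabilities already sit in a dictionary (the words and probability values below are invented for illustration):

```python
# Hypothetical bigram probabilities P(v_i | v_{i-1}); in practice these
# are estimated from a corpus (see the next slide).
bigram_prob = {
    ("are", "going"): 0.3,
    ("going", "to"): 0.8,
    ("to", "divorce"): 0.001,
    ("to", "disney"): 0.01,
}

def sentence_probability(words, probs, unseen=1e-8):
    """P(sentence) = product over i of P(v_i | v_{i-1})."""
    p = 1.0
    for prev, cur in zip(words, words[1:]):
        # Back off to a tiny constant for pairs never seen in training.
        p *= probs.get((prev, cur), unseen)
    return p

print(sentence_probability("we are going to disney".split(), bigram_prob))
```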
14. HOW TO EXTRACT WORD-TO-WORD PROBABILITIES?
• Want $P(v_i \mid v_{i-1})$
• Usually we would look in our training data and use
$\dfrac{count(v_{i-1}\,v_i)}{count(v_{i-1})}$
• But for sensitive information that is not good enough
• Because we want to model an attacker’s knowledge
• One solution: use search queries over the Web
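A sketch of the count-based estimate above on a toy training corpus (the corpus is made up); to model an attacker, the slide’s suggestion would replace these corpus counts with counts obtained from Web search queries:

```python
from collections import Counter

corpus = [
    "your mother and i are going to disney".split(),
    "your mother and i are going to the movies".split(),
]

unigram = Counter()   # count(v)
bigram = Counter()    # count(v_{i-1} v_i)
for sentence in corpus:
    unigram.update(sentence)
    bigram.update(zip(sentence, sentence[1:]))

def p(cur, prev):
    """Maximum-likelihood estimate count(v_{i-1} v_i) / count(v_{i-1})."""
    return bigram[(prev, cur)] / unigram[prev]

print(p("disney", "to"))  # 0.5: "to" is followed by "disney" in 1 of 2 cases
```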
16. INFERENCE RULES
• Want to model rules like:
What word should come next?
current_word(“the”) → not(next_word(“the”))
Is this text about Wharton University?
“Wharton” + “University” → YES
• Widely used in recommender systems → known as association rules
• Like expert systems, but with automatic rule generation
18. AUTOMATIC INFERENCE RULES
• How to find them? Again, we count.
$A \rightarrow B$
Confidence of the rule:
$\dfrac{count(A \cap B)}{count(A)}$
Support of the rule:
$\dfrac{count(A)}{N}$
For sensitive data we want all rules with high confidence!
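A sketch of these two quantities over a toy document set, treating each document as the set of terms it contains (data invented for illustration):

```python
docs = [
    {"wharton", "university", "merger"},
    {"wharton", "university"},
    {"wharton", "golf"},
    {"enron", "merger"},
]

def support(a):
    """count(A) / N: fraction of documents containing all terms in A."""
    return sum(a <= d for d in docs) / len(docs)

def confidence(a, b):
    """count(A and B) / count(A): how often B holds when A does."""
    count_a = sum(a <= d for d in docs)
    count_ab = sum(a <= d and b <= d for d in docs)
    return count_ab / count_a

print(support({"wharton"}))                     # 0.75
print(confidence({"wharton"}, {"university"}))  # 0.67: 2 of the 3 "wharton" docs
```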
20. AUTOMATIC INFERENCE RULES
• Counting all combinations → grows exponentially!
• Current algorithms (Apriori, FP-growth, the implementations in Mahout) use an invariant to reduce running time:
Apriori invariant:
If a rule has high support, then its sub-rules also have high support.
Example:
$HighSupport(A \wedge B \rightarrow C)$
implies
$HighSupport(A \rightarrow C)$ and $HighSupport(B \rightarrow C)$
For sensitive data we want all rules with high confidence!
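A compact, pure-Python sketch of the level-wise search this invariant enables: a k-itemset is only counted if all of its (k−1)-subsets were already frequent, which prunes the exponentially growing candidate space (toy data; an illustration, not the Mahout implementation):

```python
from itertools import combinations

docs = [
    {"wharton", "university", "merger"},
    {"wharton", "university"},
    {"wharton", "golf"},
    {"enron", "merger"},
]

def frequent_itemsets(docs, min_support=0.5):
    n = len(docs)
    def sup(s):
        return sum(s <= d for d in docs) / n
    # Level 1: frequent single items.
    items = {frozenset([w]) for d in docs for w in d}
    level = {s for s in items if sup(s) >= min_support}
    freq, k = {}, 1
    while level:
        freq.update({s: sup(s) for s in level})
        k += 1
        # Candidate k-itemsets built from unions of frequent (k-1)-itemsets.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Apriori pruning: keep a candidate only if every (k-1)-subset is frequent.
        level = {c for c in candidates
                 if all(frozenset(s) in freq for s in combinations(c, k - 1))
                 and sup(c) >= min_support}
    return freq

print(frequent_itemsets(docs))
# Includes frozenset({'wharton', 'university'}) with support 0.5.
```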
21. AUTOMATIC INFERENCE RULES
• Counting all combinations → grows exponentially!
• We need to restrict the search space (see the sketch after this list):
– Length of rules
– Search distance between words
– Only consider single-clause consequents
– Only consider conjunctions
– Subsampling
– Approximation
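A sketch of how two of these restrictions look when turning frequent itemsets into rules: cap the rule length and emit only single-clause consequents, keeping just the high-confidence rules the slides argue for (the itemset supports below are toy values of the kind an Apriori pass produces):

```python
# Frequent-itemset supports, e.g. from the Apriori sketch above (toy values).
freq = {
    frozenset({"wharton"}): 0.75,
    frozenset({"university"}): 0.5,
    frozenset({"wharton", "university"}): 0.5,
}

def rules_from_itemsets(freq, min_confidence=0.9, max_len=3):
    """Emit rules A -> b with a single-item consequent b."""
    rules = []
    for itemset, sup in freq.items():
        if not 2 <= len(itemset) <= max_len:
            continue  # restriction: bound the length of rules
        for b in itemset:  # restriction: single-clause consequent
            antecedent = itemset - {b}
            confidence = sup / freq[antecedent]  # support(A ∪ {b}) / support(A)
            if confidence >= min_confidence:
                rules.append((antecedent, b, confidence))
    return rules

for a, b, c in rules_from_itemsets(freq, min_confidence=0.6):
    print(set(a), "->", b, f"(confidence {c:.2f})")
```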
22. QUESTIONS
How do I select a sensitive-text detection system?
• Still early days
• Difficult to compare the solutions out there
– No uniform performance measure yet
– Lack of public labeled datasets for computing performance measures
So when should I consider a sensitive-text detection system?
• When you have a large censoring effort → current systems will lessen the effort
23. FUTURE
“NLP is kind of like a rabbit in the headlights of the Deep Learning machine, waiting to be flattened.”
– Neil Lawrence (U. Sheffield), ICML panel 2015