Easier, Faster, Smarter
Friday, October 18, 2013
How to compute Column Dependencies on a Data Stream using MapReduce
Hans-Henning Gabriel

© 2013 Datameer, Inc. All rights reserved.

Relationship Between Attributes

Some Basic Theory

From Entropy To Mutual Information
A   B   C
x   a   just
x   b   some
y   a   random
x   a   text
z   b   in
z   b   this
y   a   column

Relationship Between A and B?
A == z ➔ B == b
B == b ➔ A == ?
C ➔ A?

How strongly do A, B and C determine each other?
From Entropy To Mutual Information
Entropy: how mixed up are the values?
$$H(X) = \sum_{x} p(x) \log \frac{1}{p(x)}$$
• H(X) ≥ 0
• maximum entropy is log |X|
• the more uniformly X is distributed, the higher the entropy
[Figure: example distributions labeled H(Y) = 0.54, H(Y) = 1 and H(Z) = 1.41]
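The entropy formula can be tried directly on the example columns A and B from this deck. This is a minimal sketch, not the speaker's code: the function name `entropy` is illustrative, and the base-2 logarithm is an assumption that happens to reproduce the values H(A) = 1.56 and H(B) = 0.985 quoted on the joint entropy slide.

```python
from collections import Counter
from math import log2

def entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of `values`."""
    n = len(values)
    # p(x) = c / n, so log(1 / p(x)) = log(n / c)
    return sum((c / n) * log2(n / c) for c in Counter(values).values())

A = list("xxyxzzy")
B = list("abaabba")
print(round(entropy(A), 2), round(entropy(B), 3))  # 1.56 0.985
```

A column with a single repeated value has entropy 0, and a column with all distinct values reaches the maximum log |X|.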
From Entropy To Mutual Information
A: x x y x z z y
B: a b a a b b a

Joint Entropy:
$$H(X, Y) = \sum_{x} \sum_{y} p(x, y) \log \frac{1}{p(x, y)}$$

Joint distribution p(A, B) with marginals:
         x      y      z     p(B)
a       2/7    2/7     0     4/7
b       1/7     0     2/7    3/7
p(A)    3/7    2/7    2/7

H(A, B) = 1.95
H(A) = 1.56
H(B) = 0.985
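The joint entropy is the same computation applied to pairs of values. The sketch below is illustrative only (`joint_entropy` is a made-up name, and base 2 is assumed because it matches the H(A, B) = 1.95 shown on the slide).

```python
from collections import Counter
from math import log2

def joint_entropy(xs, ys):
    """Entropy (in bits) of the empirical joint distribution of two paired columns."""
    n = len(xs)
    pair_counts = Counter(zip(xs, ys))  # counts of (x, y) combinations
    return sum((c / n) * log2(n / c) for c in pair_counts.values())

A = list("xxyxzzy")
B = list("abaabba")
print(round(joint_entropy(A, B), 2))  # 1.95
```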
From Entropy To Mutual Information
A: x x y x z z y
B: a b a a b b a

Conditional Entropy:
how much uncertainty remains about Y when we know the value of X?
$$H(Y \mid X) = \sum_{x} p(x) H(Y \mid X = x)$$

Conditional distributions p(A | B):
       x      y      z     total
a     2/4    2/4     0      1.0
b     1/3     0     2/3     1.0
• compute entropies on the conditional distributions
• compute the weighted average
$$H(A \mid B) = \frac{4}{7} H(A \mid B = a) + \frac{3}{7} H(A \mid B = b) = 0.965$$
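The two steps on this slide (entropy per conditional distribution, then a weighted average) translate almost literally into code. This is a sketch under the same assumptions as before: base-2 logarithms and illustrative function names.

```python
from collections import Counter, defaultdict
from math import log2

def entropy(values):
    n = len(values)
    return sum((c / n) * log2(n / c) for c in Counter(values).values())

def conditional_entropy(xs, ys):
    """H(X | Y): entropy of xs inside each group of equal y, weighted by p(y)."""
    groups = defaultdict(list)
    for x, y in zip(xs, ys):
        groups[y].append(x)
    n = len(xs)
    return sum(len(g) / n * entropy(g) for g in groups.values())

A = list("xxyxzzy")
B = list("abaabba")
print(round(conditional_entropy(A, B), 3))  # 0.965
```

Here the group B = a contributes 4/7 of an entropy of 1.0, and the group B = b contributes 3/7 of about 0.918.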
From Entropy To Mutual Information
A: x x y x z z y
B: a b a a b b a

Mutual Information:
reduction of uncertainty of X due to the knowledge of Y
$$I(X; Y) = H(Y) - H(Y \mid X) = H(X) - H(X \mid Y) = \sum_{x} \sum_{y} p(x, y) \log \frac{p(x, y)}{p(x) p(y)}$$
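The double-sum form of mutual information can be evaluated from three count tables. The following is a minimal sketch (function name and base-2 logarithm are assumptions); for the example columns it gives roughly 0.59 bits, which agrees with H(A) − H(A|B) computed from the earlier slides.

```python
from collections import Counter
from math import log2

def mutual_information(xs, ys):
    """I(X; Y) in bits from the empirical joint and marginal counts."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    # p(x,y) / (p(x) p(y)) simplifies to c * n / (count_x * count_y)
    return sum((c / n) * log2(c * n / (px[x] * py[y])) for (x, y), c in pxy.items())

A = list("xxyxzzy")
B = list("abaabba")
print(round(mutual_information(A, B), 3))  # 0.592
```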
Further Conditions
• data arrives as a stream
• data is big
• as little user interaction as possible

Outline
Partition Incremental Discretization (PiD)
• original
• adjusted
• as MapReduce

2-D histograms on a data stream
• how to create
• handle discrete data
• mutual information

QA

Partition Incremental Discretization (PiD)

PiD - 2 layer approach
[Figure: layer 1 holds counts per bin between breaks (step=1). Split: a bin exceeding the alpha threshold is divided in two (breaks 2, 3, 4, 5, 6 become 2, 3, 3.5, 4, 5, 6). Border Extension: bins are added at the edges. Below, the Histogram of Values (x-axis: Values, y-axis: Frequency).]
PiD - dropping parameters
splitting threshold alpha: split a bin when
$$\frac{\text{count} + 1}{\text{total} + 2} > \alpha$$
what is a good value?
parameter step:
• maintain min and max values
• extend border breaks based on min and max

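To make the split rule concrete, here is a hedged sketch of a layer 1 histogram that splits a bin as soon as (count + 1) / (total + 2) exceeds alpha and gives each half of the bin half of its count. The class name `Layer1`, its constructor arguments, and the clamping of out-of-range values (used here in place of the border extension from the slides) are all assumptions made for illustration.

```python
from bisect import bisect_right

class Layer1:
    """Hypothetical layer 1 histogram: equal-width bins that split when crowded."""

    def __init__(self, lo, hi, n_bins, alpha):
        width = (hi - lo) / n_bins
        self.breaks = [lo + i * width for i in range(n_bins + 1)]
        self.counts = [0.0] * n_bins
        self.total = 0
        self.alpha = alpha

    def add(self, value):
        # Find the bin; values outside [lo, hi] are clamped to the border bins
        # (a simplification: the slides extend the borders instead).
        i = min(max(bisect_right(self.breaks, value) - 1, 0), len(self.counts) - 1)
        self.counts[i] += 1
        self.total += 1
        if (self.counts[i] + 1) / (self.total + 2) > self.alpha:
            self._split(i)

    def _split(self, i):
        # New break in the middle; each half inherits half of the count.
        mid = (self.breaks[i] + self.breaks[i + 1]) / 2
        half = self.counts[i] / 2
        self.breaks.insert(i + 1, mid)
        self.counts[i:i + 1] = [half, half]

h = Layer1(lo=0.0, hi=10.0, n_bins=5, alpha=0.3)
for v in [1, 2, 2.5, 7, 7.2, 7.4, 7.6, 9]:
    h.add(v)
print(len(h.counts), sum(h.counts))
```

Splits only redistribute counts, so the sum of all bin counts always equals the number of records seen.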
PiD - number of bins
split when: $$\text{total} < \frac{\text{count} + 1}{\alpha} - 2$$
[Figure: number of bins (0 to 300) against number of records (0 to 1000), one curve each for alpha = 0.01, 0.02, 0.04, 0.08, 0.16 and 0.32.]
PiD - MapReduce
[Figure: bins A1 to A8 from partial histograms are combined into A1, A2 + A5, A3 + A6, A4 + A7, A8.]

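The merge pictured here (A2 + A5, A3 + A6, and so on) amounts to adding the counts of bins that cover the same interval. A minimal sketch, assuming each mapper emits its partial histogram as a dictionary keyed by the left break of a bin on a shared grid; the function name and the counts are made up for illustration.

```python
from collections import Counter

def merge_histograms(*partials):
    """Reducer-style merge: add up counts of bins that share the same left break."""
    merged = Counter()
    for hist in partials:
        merged.update(hist)  # Counter.update adds counts instead of replacing them
    return dict(sorted(merged.items()))

mapper_1 = {0: 3, 1: 1, 2: 2, 3: 5}  # bins A1..A4 (made-up counts)
mapper_2 = {1: 4, 2: 6, 3: 1, 4: 2}  # bins A5..A8 (made-up counts)
merged = merge_histograms(mapper_1, mapper_2)
print(merged)  # {0: 3, 1: 5, 2: 8, 3: 6, 4: 2}
```

Because addition is associative and commutative, the same function can serve as a combiner and as the final reducer.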
PiD - Evaluation
Percentage Error
$$\text{Percentage Error}(P, S) = \frac{\sum_{i=1}^{k} |P_i - S_i|}{\sum_{i=1}^{k} S_i}$$

Affinity Coefficient
$$\delta(P, S) = \sum_{i=1}^{k} \sqrt{P_i \ast S_i}$$
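Both evaluation measures are one-liners over paired bin values. The sketch below assumes P and S are lists of relative frequencies of equal length k, and it assumes the affinity coefficient has the usual square-root form; the function names are illustrative.

```python
from math import sqrt

def percentage_error(p, s):
    """Sum of absolute bin differences relative to the total mass of s."""
    return sum(abs(pi - si) for pi, si in zip(p, s)) / sum(s)

def affinity_coefficient(p, s):
    """Sum over bins of sqrt(P_i * S_i); equals 1 for identical distributions."""
    return sum(sqrt(pi * si) for pi, si in zip(p, s))

S = [0.2, 0.3, 0.5]    # reference histogram (made-up relative frequencies)
P = [0.25, 0.25, 0.5]  # approximated histogram (made-up)
print(percentage_error(P, S), affinity_coefficient(P, S))
```

A good approximation drives the percentage error towards 0 and the affinity coefficient towards 1.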
PiD - Evaluation
[Figure: Varying Distributions. Panels: Uniform Distribution, Log Normal Distribution, Normal Distribution; each overlays the original histogram with PiD and aPiD.]
Annotated percentage error and affinity coefficient (δ), in the order they appear:
• percentage error PiD = 0.0010695, aPiD = 0.0044543; δ PiD = 0.9999998, aPiD = 0.9999959
• percentage error PiD = 0.0934349, aPiD = 0.0369968; δ PiD = 0.9869035, aPiD = 0.9956227
• percentage error PiD = 0.0153203, aPiD = 0.0197731; δ PiD = 0.9993737, aPiD = 0.9958205
PiD - Evaluation
[Figure: Varying Alpha.]

Two-Dimensional Histograms

Building a Quadtree
[Figure: points in the plane grouped into quadtree cells, each cell labeled with its count.]
• how to choose bin width?
• how to merge?
• equal frequencies or equal width?
Distributed Merge
• start with a unit square
• extend by doubling; split by halving
➔ logarithmic number of splits/extensions
• merge by aligning unit squares
[Figure: two quadtrees with per-cell counts and the merged result.]
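The "extend by doubling" step can be sketched as a small loop: the root square doubles its side until it contains the new point, so the number of extensions grows only logarithmically with the distance. This is an illustration under assumptions: the function name, the half-open square convention and the rule for choosing the growth direction are not taken from the slides.

```python
def extend_to_cover(x0, y0, size, px, py):
    """Double the square [x0, x0+size) x [y0, y0+size) until it contains (px, py)."""
    doublings = 0
    while not (x0 <= px < x0 + size and y0 <= py < y0 + size):
        # Grow towards the point: shift the origin left/down by the old size
        # if the point lies on that side, otherwise keep the origin.
        if px < x0:
            x0 -= size
        if py < y0:
            y0 -= size
        size *= 2
        doublings += 1
    return x0, y0, size, doublings

print(extend_to_cover(0, 0, 1, 5.5, -2.2))  # (0, -3, 8, 3)
```

Starting from a unit square with integer origin, every shift is a whole number, so the unit squares of independently built trees stay on the same integer grid, which is what makes aligning them for a merge straightforward.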
Deriving the Layer 2 Histogram
[Figure: merged quadtree cell counts aggregated into a layer 2 grid.]
Equal Width: cell counts 2.5, 2.5, 7.25, 5.25 / 2.5, 2.5, 6.25, 5.25, in total = 34
Equal Frequency: 34 over 8 bins ➔ 4.25 per bin

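The equal-frequency target is simply the total count divided by the number of bins (34 / 8 = 4.25 on this slide). How the breaks are then placed is not spelled out in the extracted text; the sketch below shows one common approach in one dimension, assuming counts are spread uniformly inside each fine bin and interpolating linearly. The function name and the example numbers are made up.

```python
def equal_frequency_breaks(breaks, counts, n_bins):
    """Place breaks so each output bin holds about total / n_bins (1-D sketch)."""
    target = sum(counts) / n_bins
    out = [breaks[0]]
    cum = 0.0      # count accumulated before the current fine bin
    need = target  # cumulative count at which the next break is placed
    for left, right, c in zip(breaks, breaks[1:], counts):
        while c > 0 and cum + c >= need and len(out) < n_bins:
            frac = (need - cum) / c  # position inside the fine bin
            out.append(left + frac * (right - left))
            need += target
        cum += c
    out.append(breaks[-1])
    return out

print(34 / 8)                                                # 4.25
print(equal_frequency_breaks([0, 1, 2, 3, 4], [1, 1, 2, 4], 2))  # [0, 3.0, 4]
```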
How to deal with discrete data
PiD and Map per bin
[Figure: numeric column A (e.g. 2.3, 3.6, 4.1, 2.9) is discretized with PiD, and each bin keeps a map counting the values of discrete column B, e.g. {a:3, e:1}, {e:2, g:2, h:1}, {a:1, b:1}, {e:2}, {a:0.5, b:0.5}, {a:2, b:0.5, e:0.5}.]
Layer 2: number of bins = |vocabulary|

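A "map per bin" can be sketched with one counter per numeric bin. The breaks, the rows and the helper names below are hypothetical, and halving every map entry on a split is an assumption that mirrors how plain counts are halved (it is consistent with maps such as {a:0.5, b:0.5} on the slide).

```python
from bisect import bisect_right
from collections import Counter

breaks = [2.0, 3.0, 4.0, 5.0]                # hypothetical breaks for numeric column A
bin_maps = [Counter() for _ in breaks[:-1]]  # one value-count map per bin

rows = [(2.3, "e"), (3.6, "g"), (4.1, "e"), (2.9, "a")]  # made-up (A, B) pairs
for a, b in rows:
    i = min(max(bisect_right(breaks, a) - 1, 0), len(bin_maps) - 1)
    bin_maps[i][b] += 1

def split_map(m):
    """Split a bin's map in two by halving every count (assumed behaviour)."""
    half = {k: v / 2 for k, v in m.items()}
    return half, dict(half)

print(bin_maps[0])                  # counts of B values seen with 2.0 <= A < 3.0
print(split_map({"a": 1, "b": 1}))  # ({'a': 0.5, 'b': 0.5}, {'a': 0.5, 'b': 0.5})
```

The distinct keys across all maps form the vocabulary, which gives the number of layer 2 bins for the discrete column.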
Mutual Information
[Figure: 2-D histograms built with equal width and with equal frequency bins.]
[Figure: six scatter plots, each annotated with its mutual information: 0.396 (0.919), 0.023 (0.026), 0.171 (0.131), 0.102 (0.022), 0.013 (0.03), 0.35 (0.544).]
Normalization
$$\frac{I(X; Y)}{\sqrt{H(X) H(Y)}}$$
• penalize variables with large cardinality
• scale value between 0 and 1

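A normalized score can be computed by dividing the mutual information by a term built from the two entropies. The sketch below assumes the square-root form sqrt(H(X) H(Y)), which keeps the result between 0 and 1; function names, the base-2 logarithm and the handling of constant columns are assumptions.

```python
from collections import Counter
from math import log2, sqrt

def entropy(values):
    n = len(values)
    return sum((c / n) * log2(n / c) for c in Counter(values).values())

def mutual_information(xs, ys):
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * log2(c * n / (px[x] * py[y])) for (x, y), c in pxy.items())

def normalized_mi(xs, ys):
    """I(X; Y) / sqrt(H(X) * H(Y)); defined as 0 when a column is constant."""
    hx, hy = entropy(xs), entropy(ys)
    if hx == 0 or hy == 0:
        return 0.0
    return mutual_information(xs, ys) / sqrt(hx * hy)

A = list("xxyxzzy")
B = list("abaabba")
print(normalized_mi(A, B), normalized_mi(A, A))
```

A column compared with itself scores 1, while a high-cardinality column is penalized through its large entropy in the denominator.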
@Datameer
hgabriel@datameer.com

Big Data TechCon: How to Compute Column Dependencies on a Data Stream Using MapReduce
