1. CS315: Principles of Database Systems
Big Data
Arnab Bhattacharya
arnabb@cse.iitk.ac.in
Computer Science and Engineering,
Indian Institute of Technology, Kanpur
http://web.cse.iitk.ac.in/~cs315/
2nd semester, 2013-14
Tue, Fri 1530-1700 at CS101
Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS315: Big Data 2013-14 1 / 9
8. Big Data
What is big data?
Data which is big
How big is “big”?
For sociologists, 10 subjects is big
For social networks, terabytes is normal
For astrophysicists, terabytes is a day’s work
So, no absolute definition or threshold
When data is bigger than most machines can store or most
algorithms can handle
9. Characterization of big data
Having large volumes of data requires
Newer techniques
Newer tools
Newer architectures
Allows solving newer problems
Can also solve older problems better
11. Properties of big data
3 V’s: volume, variety, velocity
Volume: When data is extremely large in size, how to load it, index it
or query it
Variety: Data can be semi-structured or unstructured as well; how to
query
Velocity: Data can arrive in real time and can be streaming
Extended V’s: veracity, validity, visibility, variability
12. Enablers of big data
Increased storage volume and type
Increased processing power
Increased data
Increased network speed
Increased capital
Increased business
19. Too much data can lead to spurious results
Is it sensible to try and detect possible terror links among people?
Setting: assume terrorists meet at least twice in a hotel to plot
something sinister
Government method: scan hotel logs to identify such occurrences
Data assumptions
Number of people: $10^{9}$
Tracked over $10^{3}$ days (about 3 years)
A person stays in a hotel on any given day with a probability of 1%
Each hotel hosts $10^{2}$ people at a time
Deductions
A person stays in a hotel for 10 days
Total number of hotels is $10^{5}$
Each day, $10^{7}$ people stay in a hotel
Per hotel, $10^{2}$ people stay
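The deductions follow from the data assumptions by simple arithmetic. A minimal Python sketch, assuming the 1% probability means one hotel day in every 100 days; all variable names are illustrative and not from the slides:

```python
# Data assumptions from the slide (variable names are illustrative).
num_people = 10**9       # people being tracked
num_days = 10**3         # length of the tracking period, in days
one_day_in = 100         # 1% probability: a person is in a hotel one day in 100
hotel_capacity = 10**2   # people a single hotel hosts at a time

# Expected number of days one person spends in a hotel.
days_in_hotel = num_days // one_day_in

# Expected number of people staying in some hotel on any given day.
guests_per_day = num_people // one_day_in

# Number of hotels needed to host one day's guests.
num_hotels = guests_per_day // hotel_capacity

print(days_in_hotel, guests_per_day, num_hotels)
```

Integer arithmetic is used here only to keep the powers of ten exact.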
32. Bonferroni’s principle
In a day, the probability that persons A and B stay in the same hotel is $10^{-9}$
Probability that A stays in a hotel that day is $10^{-2}$
Probability that B stays in a hotel that day is $10^{-2}$
Probability that B chooses A’s hotel is $10^{-5}$
Probability that A and B meet on two given days is $10^{-18}$
Two independent events: $10^{-9} \times 10^{-9}$
Total pairs of days is (roughly) $5 \times 10^{5}$
Any 2 out of $10^{3}$: $\binom{10^{3}}{2}$
Probability that A and B meet twice in some pair of days is $5 \times 10^{-13}$
$10^{-18} \times 5 \times 10^{5}$
Total pairs of people is (roughly) $5 \times 10^{17}$
Any 2 out of $10^{9}$: $\binom{10^{9}}{2}$
Expected number of suspicions, i.e., number of pairs of people meeting twice
on any pair of days is $2.5 \times 10^{5}$
$5 \times 10^{-13} \times 5 \times 10^{17}$
Bonferroni’s principle: if you look in more places for interesting
patterns than your amount of data supports, you are bound to “find”
something “interesting” (most likely spurious)
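The chain of estimates on this slide can be checked with exact arithmetic. The following is a sketch under the slide's assumptions (independent hotel stays, a uniformly random choice among the hotels); the variable names are illustrative, and the per-pair probability is the same first-order approximation the slide uses (probability for two given days times the number of day pairs):

```python
from fractions import Fraction
from math import comb

# Assumptions taken from the slides; names are illustrative.
num_people = 10**9
num_days = 10**3
num_hotels = 10**5
p_hotel = Fraction(1, 100)   # a person is in a hotel on a given day

# A and B are both in a hotel that day, and B picks A's hotel.
p_same_hotel_one_day = p_hotel * p_hotel * Fraction(1, num_hotels)

# The same coincidence on two given days (independent events).
p_two_given_days = p_same_hotel_one_day ** 2

# Number of ways to choose the two days and the two people.
pairs_of_days = comb(num_days, 2)
pairs_of_people = comb(num_people, 2)

# First-order estimate: a given pair meets twice on some pair of days.
p_pair_meets_twice = p_two_given_days * pairs_of_days

# Expected number of "suspicious" pairs among purely random travellers.
expected_suspicious_pairs = p_pair_meets_twice * pairs_of_people

print(float(expected_suspicious_pairs))
```

The result is close to the slide's rounded figure of $2.5 \times 10^{5}$; the small difference comes from using the exact binomial coefficients instead of the rounded pair counts. Every one of these pairs arises by chance alone, which is the point of Bonferroni’s principle.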
37. Tools for Big Data
Hosting: Distributed servers or Cloud
Amazon EC2
File system: Scalable and distributed
HDFS, Amazon S3
Programming model: Distributed scalable processing
Map-reduce framework, Hadoop
Database: NoSQL
HBase, MongoDB, Cassandra
Operations: Querying, indexing, analytics
Data mining, Information retrieval
Machine learning: Mahout on top of Hadoop
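To give a feel for the map-reduce programming model listed above, here is a single-machine sketch in plain Python. It does not use Hadoop or any of its APIs; the word-count task and the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are illustrative assumptions, not part of the course material:

```python
from collections import defaultdict


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: combine all values of one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence on a list of documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    groups = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in groups.items())


counts = word_count(["big data is big", "data arrives fast"])
print(counts)
```

In a distributed framework the map and reduce calls would run in parallel on different machines, with the shuffle moving data between them; here everything runs in one process so that only the structure of the model is visible.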
41. Discussion
Emerging term: data science
Big data does not always need large investment
Many open-source tools
Cloud, etc. can be rented
Most applications do not require big data
“Big Data” is currently too hyped