SlideShare una empresa de Scribd logo
1 de 41
Descargar para leer sin conexión
CS315: Principles of Database Systems
Big Data
Arnab Bhattacharya
arnabb@cse.iitk.ac.in
Computer Science and Engineering,
Indian Institute of Technology, Kanpur
http://web.cse.iitk.ac.in/˜cs315/
2nd
semester, 2013-14
Tue, Fri 1530-1700 at CS101
Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS315: Big Data 2013-14 1 / 9
Big Data
What is big data?
Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS315: Big Data 2013-14 2 / 9
Big Data
What is big data?
Data which is big
How big is “big”?
Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS315: Big Data 2013-14 2 / 9
Big Data
What is big data?
Data which is big
How big is “big”?
For sociologists, 10 subjects is big
Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS315: Big Data 2013-14 2 / 9
Big Data
What is big data?
Data which is big
How big is “big”?
For sociologists, 10 subjects is big
For social networks, terabytes is normal
Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS315: Big Data 2013-14 2 / 9
Big Data
What is big data?
Data which is big
How big is “big”?
For sociologists, 10 subjects is big
For social networks, terabytes is normal
For astrophysicists, terrabytes is a day’s work
Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS315: Big Data 2013-14 2 / 9
Big Data
What is big data?
Data which is big
How big is “big”?
For sociologists, 10 subjects is big
For social networks, terabytes is normal
For astrophysicists, terrabytes is a day’s work
So, no absolute definition or threshold
Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS315: Big Data 2013-14 2 / 9
Big Data
What is big data?
Data which is big
How big is “big”?
For sociologists, 10 subjects is big
For social networks, terabytes is normal
For astrophysicists, terrabytes is a day’s work
So, no absolute definition or threshold
When data is bigger than most machines can store or most
algorithms can handle
Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS315: Big Data 2013-14 2 / 9
Characterization of big data
Having large volumes of data requires
Newer techniques
Newer tools
Newer architectures
Allows solving newer problems
Can also solve older problems better
Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS315: Big Data 2013-14 3 / 9
Properties of big data
3 V’s: volume, variety, velocity
Volume: When data is extremely large in size, how to load it, index it
or query it
Variety: Data can be semi-structured or unstructured as well; how to
query
Velocity: Data can arrive at real time and can be streaming
Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS315: Big Data 2013-14 4 / 9
Properties of big data
3 V’s: volume, variety, velocity
Volume: When data is extremely large in size, how to load it, index it
or query it
Variety: Data can be semi-structured or unstructured as well; how to
query
Velocity: Data can arrive at real time and can be streaming
Extended V’s: veracity, validity, visibility, variability
Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS315: Big Data 2013-14 4 / 9
Enablers of big data
Increased storage volume and type
Increased processing power
Increased data
Increased network speed
Increased capital
Increased business
Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS315: Big Data 2013-14 5 / 9
Too much data can lead to spurious results
Is it sensible to try and detect possible terror links among people?
Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS315: Big Data 2013-14 6 / 9
Too much data can lead to spurious results
Is it sensible to try and detect possible terror links among people?
Setting: assume terrorists meet at least twice in a hotel to plot
something sinister
Government method: scan hotel logs to identify such occurrences
Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS315: Big Data 2013-14 6 / 9
Too much data can lead to spurious results
Is it sensible to try and detect possible terror links among people?
Setting: assume terrorists meet at least twice in a hotel to plot
something sinister
Government method: scan hotel logs to identify such occurrences
Data assumptions
Number of people: 109
Tracked over 103
days (about 3 years)
A person stays in a hotel with a probability of 1%
Each hotel hosts 102
people at a time
Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS315: Big Data 2013-14 6 / 9
Too much data can lead to spurious results
Is it sensible to try and detect possible terror links among people?
Setting: assume terrorists meet at least twice in a hotel to plot
something sinister
Government method: scan hotel logs to identify such occurrences
Data assumptions
Number of people: 109
Tracked over 103
days (about 3 years)
A person stays in a hotel with a probability of 1%
Each hotel hosts 102
people at a time
Deductions
A person stays in hotel for
Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS315: Big Data 2013-14 6 / 9
Too much data can lead to spurious results
Is it sensible to try and detect possible terror links among people?
Setting: assume terrorists meet at least twice in a hotel to plot
something sinister
Government method: scan hotel logs to identify such occurrences
Data assumptions
Number of people: 109
Tracked over 103
days (about 3 years)
A person stays in a hotel with a probability of 1%
Each hotel hosts 102
people at a time
Deductions
A person stays in hotel for 10 days
Total number of hotels is
Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS315: Big Data 2013-14 6 / 9
Too much data can lead to spurious results
Is it sensible to try and detect possible terror links among people?
Setting: assume terrorists meet at least twice in a hotel to plot
something sinister
Government method: scan hotel logs to identify such occurrences
Data assumptions
Number of people: 109
Tracked over 103
days (about 3 years)
A person stays in a hotel with a probability of 1%
Each hotel hosts 102
people at a time
Deductions
A person stays in hotel for 10 days
Total number of hotels is 105
Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS315: Big Data 2013-14 6 / 9
Too much data can lead to spurious results
Is it sensible to try and detect possible terror links among people?
Setting: assume terrorists meet at least twice in a hotel to plot
something sinister
Government method: scan hotel logs to identify such occurrences
Data assumptions
Number of people: 109
Tracked over 103
days (about 3 years)
A person stays in a hotel with a probability of 1%
Each hotel hosts 102
people at a time
Deductions
A person stays in hotel for 10 days
Total number of hotels is 105
Each day, 107
people stay in a hotel
Per hotel, 102
people stay
Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS315: Big Data 2013-14 6 / 9
Bonferroni’s principle
In a day, probability that person A and B stays in same hotel is
Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS315: Big Data 2013-14 7 / 9
Bonferroni’s principle
In a day, probability that person A and B stays in same hotel is 10−9
Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS315: Big Data 2013-14 7 / 9
Bonferroni’s principle
In a day, probability that person A and B stays in same hotel is 10−9
Probability that A stays in a hotel that day is 10−2
Probability that B stays in a hotel that day is 10−2
Probability that B chooses A’s hotel is 10−5
Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS315: Big Data 2013-14 7 / 9
Bonferroni’s principle
In a day, probability that person A and B stays in same hotel is 10−9
Probability that A stays in a hotel that day is 10−2
Probability that B stays in a hotel that day is 10−2
Probability that B chooses A’s hotel is 10−5
Probability that A and B meet twice is
Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS315: Big Data 2013-14 7 / 9
Bonferroni’s principle
In a day, probability that person A and B stays in same hotel is 10−9
Probability that A stays in a hotel that day is 10−2
Probability that B stays in a hotel that day is 10−2
Probability that B chooses A’s hotel is 10−5
Probability that A and B meet twice is 10−18
Two independent events: 10−9
× 10−9
Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS315: Big Data 2013-14 7 / 9
Bonferroni’s principle
In a day, probability that person A and B stays in same hotel is 10−9
Probability that A stays in a hotel that day is 10−2
Probability that B stays in a hotel that day is 10−2
Probability that B chooses A’s hotel is 10−5
Probability that A and B meet twice is 10−18
Two independent events: 10−9
× 10−9
Total pairs of days is
Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS315: Big Data 2013-14 7 / 9
Bonferroni’s principle
In a day, probability that person A and B stays in same hotel is 10−9
Probability that A stays in a hotel that day is 10−2
Probability that B stays in a hotel that day is 10−2
Probability that B chooses A’s hotel is 10−5
Probability that A and B meet twice is 10−18
Two independent events: 10−9
× 10−9
Total pairs of days is (roughly) 5 × 105
Any 2 out of 103
: (103
2
)
Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS315: Big Data 2013-14 7 / 9
Bonferroni’s principle
In a day, probability that person A and B stays in same hotel is 10−9
Probability that A stays in a hotel that day is 10−2
Probability that B stays in a hotel that day is 10−2
Probability that B chooses A’s hotel is 10−5
Probability that A and B meet twice is 10−18
Two independent events: 10−9
× 10−9
Total pairs of days is (roughly) 5 × 105
Any 2 out of 103
: (103
2
)
Probability that A and B meet twice in some pair of days is
Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS315: Big Data 2013-14 7 / 9
Bonferroni’s principle
In a day, probability that person A and B stays in same hotel is 10−9
Probability that A stays in a hotel that day is 10−2
Probability that B stays in a hotel that day is 10−2
Probability that B chooses A’s hotel is 10−5
Probability that A and B meet twice is 10−18
Two independent events: 10−9
× 10−9
Total pairs of days is (roughly) 5 × 105
Any 2 out of 103
: (103
2
)
Probability that A and B meet twice in some pair of days is 10−13
10−18
× 5 × 105
Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS315: Big Data 2013-14 7 / 9
Bonferroni’s principle
In a day, probability that person A and B stays in same hotel is 10−9
Probability that A stays in a hotel that day is 10−2
Probability that B stays in a hotel that day is 10−2
Probability that B chooses A’s hotel is 10−5
Probability that A and B meet twice is 10−18
Two independent events: 10−9
× 10−9
Total pairs of days is (roughly) 5 × 105
Any 2 out of 103
: (103
2
)
Probability that A and B meet twice in some pair of days is 10−13
10−18
× 5 × 105
Total pairs of people is
Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS315: Big Data 2013-14 7 / 9
Bonferroni’s principle
In a day, probability that person A and B stays in same hotel is 10−9
Probability that A stays in a hotel that day is 10−2
Probability that B stays in a hotel that day is 10−2
Probability that B chooses A’s hotel is 10−5
Probability that A and B meet twice is 10−18
Two independent events: 10−9
× 10−9
Total pairs of days is (roughly) 5 × 105
Any 2 out of 103
: (103
2
)
Probability that A and B meet twice in some pair of days is 10−13
10−18
× 5 × 105
Total pairs of people is (roughly) 5 × 1017
Any 2 out of 109
: (109
2
)
Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS315: Big Data 2013-14 7 / 9
Bonferroni’s principle
In a day, probability that person A and B stays in same hotel is 10−9
Probability that A stays in a hotel that day is 10−2
Probability that B stays in a hotel that day is 10−2
Probability that B chooses A’s hotel is 10−5
Probability that A and B meet twice is 10−18
Two independent events: 10−9
× 10−9
Total pairs of days is (roughly) 5 × 105
Any 2 out of 103
: (103
2
)
Probability that A and B meet twice in some pair of days is 10−13
10−18
× 5 × 105
Total pairs of people is (roughly) 5 × 1017
Any 2 out of 109
: (109
2
)
Expected number of suspicions, i.e., number of people meeting twice
on any pair of days is
Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS315: Big Data 2013-14 7 / 9
Bonferroni’s principle
In a day, probability that person A and B stays in same hotel is 10−9
Probability that A stays in a hotel that day is 10−2
Probability that B stays in a hotel that day is 10−2
Probability that B chooses A’s hotel is 10−5
Probability that A and B meet twice is 10−18
Two independent events: 10−9
× 10−9
Total pairs of days is (roughly) 5 × 105
Any 2 out of 103
: (103
2
)
Probability that A and B meet twice in some pair of days is 10−13
10−18
× 5 × 105
Total pairs of people is (roughly) 5 × 1017
Any 2 out of 109
: (109
2
)
Expected number of suspicions, i.e., number of people meeting twice
on any pair of days is 2.5 × 105
5 × 10−13
× 5 × 1017
Bonferroni’s principle: if you look in more places for interesting
patterns than your amount of data supports, you are bound to “find”
something “interesting” (most likely spurious)
Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS315: Big Data 2013-14 7 / 9
Tools for Big Data
Hosting: Distributed servers or Cloud
Amazon EC2
Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS315: Big Data 2013-14 8 / 9
Tools for Big Data
Hosting: Distributed servers or Cloud
Amazon EC2
File system: Scalable and distributed
HDFS, Amazon S3
Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS315: Big Data 2013-14 8 / 9
Tools for Big Data
Hosting: Distributed servers or Cloud
Amazon EC2
File system: Scalable and distributed
HDFS, Amazon S3
Programming model: Distributed scalable processing
Map-reduce framework, Hadoop
Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS315: Big Data 2013-14 8 / 9
Tools for Big Data
Hosting: Distributed servers or Cloud
Amazon EC2
File system: Scalable and distributed
HDFS, Amazon S3
Programming model: Distributed scalable processing
Map-reduce framework, Hadoop
Database: NoSQL
HBase, MongoDB, Cassandra
Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS315: Big Data 2013-14 8 / 9
Tools for Big Data
Hosting: Distributed servers or Cloud
Amazon EC2
File system: Scalable and distributed
HDFS, Amazon S3
Programming model: Distributed scalable processing
Map-reduce framework, Hadoop
Database: NoSQL
HBase, MongoDB, Cassandra
Operations: Querying, indexing, analytics
Data mining, Information retrieval
Machine learning: Mahout on top of Hadoop
Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS315: Big Data 2013-14 8 / 9
Discussion
Emerging term: data science
Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS315: Big Data 2013-14 9 / 9
Discussion
Emerging term: data science
Big data does not always need large investment
Many open-source tools
Cloud, etc. can be rented
Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS315: Big Data 2013-14 9 / 9
Discussion
Emerging term: data science
Big data does not always need large investment
Many open-source tools
Cloud, etc. can be rented
Most applications do not require big data
Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS315: Big Data 2013-14 9 / 9
Discussion
Emerging term: data science
Big data does not always need large investment
Many open-source tools
Cloud, etc. can be rented
Most applications do not require big data
“Big Data” is currently too hyped
Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS315: Big Data 2013-14 9 / 9

Más contenido relacionado

Destacado

Destacado (13)

An Introduction to Goel Community
An Introduction to Goel CommunityAn Introduction to Goel Community
An Introduction to Goel Community
 
La Via delle Bocchette e le Ferrate del Brenta
La Via delle Bocchette e le Ferrate del BrentaLa Via delle Bocchette e le Ferrate del Brenta
La Via delle Bocchette e le Ferrate del Brenta
 
Escursioni a Cortina e Misurina
Escursioni a Cortina e MisurinaEscursioni a Cortina e Misurina
Escursioni a Cortina e Misurina
 
Teoria della relatività
Teoria della relativitàTeoria della relatività
Teoria della relatività
 
Il pianeta terra
Il pianeta terra Il pianeta terra
Il pianeta terra
 
El rol del Service Manager
El rol del Service ManagerEl rol del Service Manager
El rol del Service Manager
 
Focus on customers the psychology of buying behaviour
Focus on customers the psychology of buying behaviourFocus on customers the psychology of buying behaviour
Focus on customers the psychology of buying behaviour
 
Archi Modelling
Archi ModellingArchi Modelling
Archi Modelling
 
Secondary Memory
Secondary MemorySecondary Memory
Secondary Memory
 
Safe use of power tools
Safe use of power tools Safe use of power tools
Safe use of power tools
 
Factory act(1948)
Factory act(1948)Factory act(1948)
Factory act(1948)
 
Factories act ppt
Factories act pptFactories act ppt
Factories act ppt
 
Tettonica Regionale della Pianura Padana Emiliano-Romagnola
Tettonica Regionale della Pianura Padana Emiliano-RomagnolaTettonica Regionale della Pianura Padana Emiliano-Romagnola
Tettonica Regionale della Pianura Padana Emiliano-Romagnola
 

Último

Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfhans926745
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilV3cube
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 

Último (20)

Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdf
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of Brazil
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 

17 bigdata

  • 1. CS315: Principles of Database Systems Big Data Arnab Bhattacharya arnabb@cse.iitk.ac.in Computer Science and Engineering, Indian Institute of Technology, Kanpur http://web.cse.iitk.ac.in/˜cs315/ 2nd semester, 2013-14 Tue, Fri 1530-1700 at CS101 Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS315: Big Data 2013-14 1 / 9
  • 2. Big Data What is big data? Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS315: Big Data 2013-14 2 / 9
  • 3. Big Data What is big data? Data which is big How big is “big”? Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS315: Big Data 2013-14 2 / 9
  • 4. Big Data What is big data? Data which is big How big is “big”? For sociologists, 10 subjects is big Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS315: Big Data 2013-14 2 / 9
  • 5. Big Data What is big data? Data which is big How big is “big”? For sociologists, 10 subjects is big For social networks, terabytes is normal Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS315: Big Data 2013-14 2 / 9
  • 6. Big Data What is big data? Data which is big How big is “big”? For sociologists, 10 subjects is big For social networks, terabytes is normal For astrophysicists, terrabytes is a day’s work Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS315: Big Data 2013-14 2 / 9
  • 7. Big Data What is big data? Data which is big How big is “big”? For sociologists, 10 subjects is big For social networks, terabytes is normal For astrophysicists, terrabytes is a day’s work So, no absolute definition or threshold Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS315: Big Data 2013-14 2 / 9
  • 8. Big Data What is big data? Data which is big How big is “big”? For sociologists, 10 subjects is big For social networks, terabytes is normal For astrophysicists, terrabytes is a day’s work So, no absolute definition or threshold When data is bigger than most machines can store or most algorithms can handle Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS315: Big Data 2013-14 2 / 9
  • 9. Characterization of big data Having large volumes of data requires Newer techniques Newer tools Newer architectures Allows solving newer problems Can also solve older problems better Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS315: Big Data 2013-14 3 / 9
  • 10. Properties of big data 3 V’s: volume, variety, velocity Volume: When data is extremely large in size, how to load it, index it or query it Variety: Data can be semi-structured or unstructured as well; how to query Velocity: Data can arrive at real time and can be streaming Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS315: Big Data 2013-14 4 / 9
  • 11. Properties of big data 3 V’s: volume, variety, velocity Volume: When data is extremely large in size, how to load it, index it or query it Variety: Data can be semi-structured or unstructured as well; how to query Velocity: Data can arrive at real time and can be streaming Extended V’s: veracity, validity, visibility, variability Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS315: Big Data 2013-14 4 / 9
  • 12. Enablers of big data Increased storage volume and type Increased processing power Increased data Increased network speed Increased capital Increased business Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS315: Big Data 2013-14 5 / 9
  • 13. Too much data can lead to spurious results Is it sensible to try and detect possible terror links among people? Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS315: Big Data 2013-14 6 / 9
  • 14. Too much data can lead to spurious results Is it sensible to try and detect possible terror links among people? Setting: assume terrorists meet at least twice in a hotel to plot something sinister Government method: scan hotel logs to identify such occurrences Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS315: Big Data 2013-14 6 / 9
  • 15. Too much data can lead to spurious results Is it sensible to try and detect possible terror links among people? Setting: assume terrorists meet at least twice in a hotel to plot something sinister Government method: scan hotel logs to identify such occurrences Data assumptions Number of people: 109 Tracked over 103 days (about 3 years) A person stays in a hotel with a probability of 1% Each hotel hosts 102 people at a time Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS315: Big Data 2013-14 6 / 9
  • 16. Too much data can lead to spurious results Is it sensible to try and detect possible terror links among people? Setting: assume terrorists meet at least twice in a hotel to plot something sinister Government method: scan hotel logs to identify such occurrences Data assumptions Number of people: 109 Tracked over 103 days (about 3 years) A person stays in a hotel with a probability of 1% Each hotel hosts 102 people at a time Deductions A person stays in hotel for Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS315: Big Data 2013-14 6 / 9
  • 17. Too much data can lead to spurious results Is it sensible to try and detect possible terror links among people? Setting: assume terrorists meet at least twice in a hotel to plot something sinister Government method: scan hotel logs to identify such occurrences Data assumptions Number of people: 109 Tracked over 103 days (about 3 years) A person stays in a hotel with a probability of 1% Each hotel hosts 102 people at a time Deductions A person stays in hotel for 10 days Total number of hotels is Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS315: Big Data 2013-14 6 / 9
  • 18. Too much data can lead to spurious results Is it sensible to try and detect possible terror links among people? Setting: assume terrorists meet at least twice in a hotel to plot something sinister Government method: scan hotel logs to identify such occurrences Data assumptions Number of people: 109 Tracked over 103 days (about 3 years) A person stays in a hotel with a probability of 1% Each hotel hosts 102 people at a time Deductions A person stays in hotel for 10 days Total number of hotels is 105 Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS315: Big Data 2013-14 6 / 9
  • 19. Too much data can lead to spurious results Is it sensible to try and detect possible terror links among people? Setting: assume terrorists meet at least twice in a hotel to plot something sinister Government method: scan hotel logs to identify such occurrences Data assumptions Number of people: 109 Tracked over 103 days (about 3 years) A person stays in a hotel with a probability of 1% Each hotel hosts 102 people at a time Deductions A person stays in hotel for 10 days Total number of hotels is 105 Each day, 107 people stay in a hotel Per hotel, 102 people stay Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS315: Big Data 2013-14 6 / 9
  • 20. Bonferroni’s principle In a day, probability that person A and B stays in same hotel is Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS315: Big Data 2013-14 7 / 9
  • 21. Bonferroni’s principle In a day, probability that person A and B stays in same hotel is 10−9 Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS315: Big Data 2013-14 7 / 9
  • 22. Bonferroni’s principle In a day, probability that person A and B stays in same hotel is 10−9 Probability that A stays in a hotel that day is 10−2 Probability that B stays in a hotel that day is 10−2 Probability that B chooses A’s hotel is 10−5 Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS315: Big Data 2013-14 7 / 9
  • 23. Bonferroni’s principle In a day, probability that person A and B stays in same hotel is 10−9 Probability that A stays in a hotel that day is 10−2 Probability that B stays in a hotel that day is 10−2 Probability that B chooses A’s hotel is 10−5 Probability that A and B meet twice is Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS315: Big Data 2013-14 7 / 9
  • 24. Bonferroni’s principle In a day, probability that person A and B stays in same hotel is 10−9 Probability that A stays in a hotel that day is 10−2 Probability that B stays in a hotel that day is 10−2 Probability that B chooses A’s hotel is 10−5 Probability that A and B meet twice is 10−18 Two independent events: 10−9 × 10−9 Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS315: Big Data 2013-14 7 / 9
  • 25. Bonferroni’s principle In a day, probability that person A and B stays in same hotel is 10−9 Probability that A stays in a hotel that day is 10−2 Probability that B stays in a hotel that day is 10−2 Probability that B chooses A’s hotel is 10−5 Probability that A and B meet twice is 10−18 Two independent events: 10−9 × 10−9 Total pairs of days is Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS315: Big Data 2013-14 7 / 9
  • 26. Bonferroni’s principle In a day, probability that person A and B stays in same hotel is 10−9 Probability that A stays in a hotel that day is 10−2 Probability that B stays in a hotel that day is 10−2 Probability that B chooses A’s hotel is 10−5 Probability that A and B meet twice is 10−18 Two independent events: 10−9 × 10−9 Total pairs of days is (roughly) 5 × 105 Any 2 out of 103 : (103 2 ) Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS315: Big Data 2013-14 7 / 9
  • 27. Bonferroni’s principle In a day, probability that person A and B stays in same hotel is 10−9 Probability that A stays in a hotel that day is 10−2 Probability that B stays in a hotel that day is 10−2 Probability that B chooses A’s hotel is 10−5 Probability that A and B meet twice is 10−18 Two independent events: 10−9 × 10−9 Total pairs of days is (roughly) 5 × 105 Any 2 out of 103 : (103 2 ) Probability that A and B meet twice in some pair of days is Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS315: Big Data 2013-14 7 / 9
  • 28. Bonferroni’s principle In a day, probability that person A and B stays in same hotel is 10−9 Probability that A stays in a hotel that day is 10−2 Probability that B stays in a hotel that day is 10−2 Probability that B chooses A’s hotel is 10−5 Probability that A and B meet twice is 10−18 Two independent events: 10−9 × 10−9 Total pairs of days is (roughly) 5 × 105 Any 2 out of 103 : (103 2 ) Probability that A and B meet twice in some pair of days is 10−13 10−18 × 5 × 105 Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS315: Big Data 2013-14 7 / 9
  • 29. Bonferroni’s principle In a day, probability that person A and B stays in same hotel is 10−9 Probability that A stays in a hotel that day is 10−2 Probability that B stays in a hotel that day is 10−2 Probability that B chooses A’s hotel is 10−5 Probability that A and B meet twice is 10−18 Two independent events: 10−9 × 10−9 Total pairs of days is (roughly) 5 × 105 Any 2 out of 103 : (103 2 ) Probability that A and B meet twice in some pair of days is 10−13 10−18 × 5 × 105 Total pairs of people is Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS315: Big Data 2013-14 7 / 9
  • 30. Bonferroni’s principle In a day, probability that person A and B stays in same hotel is 10−9 Probability that A stays in a hotel that day is 10−2 Probability that B stays in a hotel that day is 10−2 Probability that B chooses A’s hotel is 10−5 Probability that A and B meet twice is 10−18 Two independent events: 10−9 × 10−9 Total pairs of days is (roughly) 5 × 105 Any 2 out of 103 : (103 2 ) Probability that A and B meet twice in some pair of days is 10−13 10−18 × 5 × 105 Total pairs of people is (roughly) 5 × 1017 Any 2 out of 109 : (109 2 ) Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS315: Big Data 2013-14 7 / 9
  • 31. Bonferroni’s principle In a day, probability that person A and B stays in same hotel is 10−9 Probability that A stays in a hotel that day is 10−2 Probability that B stays in a hotel that day is 10−2 Probability that B chooses A’s hotel is 10−5 Probability that A and B meet twice is 10−18 Two independent events: 10−9 × 10−9 Total pairs of days is (roughly) 5 × 105 Any 2 out of 103 : (103 2 ) Probability that A and B meet twice in some pair of days is 10−13 10−18 × 5 × 105 Total pairs of people is (roughly) 5 × 1017 Any 2 out of 109 : (109 2 ) Expected number of suspicions, i.e., number of people meeting twice on any pair of days is Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS315: Big Data 2013-14 7 / 9
  • 32. Bonferroni’s principle In a day, probability that person A and B stays in same hotel is 10−9 Probability that A stays in a hotel that day is 10−2 Probability that B stays in a hotel that day is 10−2 Probability that B chooses A’s hotel is 10−5 Probability that A and B meet twice is 10−18 Two independent events: 10−9 × 10−9 Total pairs of days is (roughly) 5 × 105 Any 2 out of 103 : (103 2 ) Probability that A and B meet twice in some pair of days is 10−13 10−18 × 5 × 105 Total pairs of people is (roughly) 5 × 1017 Any 2 out of 109 : (109 2 ) Expected number of suspicions, i.e., number of people meeting twice on any pair of days is 2.5 × 105 5 × 10−13 × 5 × 1017 Bonferroni’s principle: if you look in more places for interesting patterns than your amount of data supports, you are bound to “find” something “interesting” (most likely spurious) Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS315: Big Data 2013-14 7 / 9
  • 33. Tools for Big Data Hosting: Distributed servers or Cloud Amazon EC2 Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS315: Big Data 2013-14 8 / 9
  • 34. Tools for Big Data Hosting: Distributed servers or Cloud Amazon EC2 File system: Scalable and distributed HDFS, Amazon S3 Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS315: Big Data 2013-14 8 / 9
  • 35. Tools for Big Data Hosting: Distributed servers or Cloud Amazon EC2 File system: Scalable and distributed HDFS, Amazon S3 Programming model: Distributed scalable processing Map-reduce framework, Hadoop Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS315: Big Data 2013-14 8 / 9
  • 36. Tools for Big Data Hosting: Distributed servers or Cloud Amazon EC2 File system: Scalable and distributed HDFS, Amazon S3 Programming model: Distributed scalable processing Map-reduce framework, Hadoop Database: NoSQL HBase, MongoDB, Cassandra Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS315: Big Data 2013-14 8 / 9
  • 37. Tools for Big Data Hosting: Distributed servers or Cloud Amazon EC2 File system: Scalable and distributed HDFS, Amazon S3 Programming model: Distributed scalable processing Map-reduce framework, Hadoop Database: NoSQL HBase, MongoDB, Cassandra Operations: Querying, indexing, analytics Data mining, Information retrieval Machine learning: Mahout on top of Hadoop Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS315: Big Data 2013-14 8 / 9
  • 38. Discussion Emerging term: data science Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS315: Big Data 2013-14 9 / 9
  • 39. Discussion Emerging term: data science Big data does not always need large investment Many open-source tools Cloud, etc. can be rented Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS315: Big Data 2013-14 9 / 9
  • 40. Discussion Emerging term: data science Big data does not always need large investment Many open-source tools Cloud, etc. can be rented Most applications do not require big data Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS315: Big Data 2013-14 9 / 9
  • 41. Discussion Emerging term: data science Big data does not always need large investment Many open-source tools Cloud, etc. can be rented Most applications do not require big data “Big Data” is currently too hyped Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS315: Big Data 2013-14 9 / 9