Google was processing 400 petabytes of data every month way back in 2007! With users generating massive amounts of data on social networking sites like Facebook and Twitter, and an increase in the use of sensor devices, the amount of data generated is only going to go up. Further, with the cost of hard disks going down, such data being made available to everyone, and the advent of cloud computing, we now have the power to process such data ourselves.
What are the challenges of processing such massive amounts of data? With such data being available to every corporation, big or small, how does this change how we have been perceiving data? The talk takes you through some of the technologies used to tackle these challenges.
The talk has been tailored to suit students. It helps them relate to and appreciate the subjects they learn in their curriculum - data structures, programming languages, databases, operating systems, networking etc. At the same time, it describes some of the interesting work being done in the software industry in the areas of databases, data analysis, cloud computing etc.
2. A Few Guidelines
Ask questions – be active
What I cover depends on how active you are
Learn concepts before technology
You will be bombarded with several concepts, tools and technologies – just remember that you are learning to bridge concepts and technology.
After this program, you should be comfortable dabbling with these concepts on your own – even reading things that are not covered today.
http://jnaapti.com/
4. The Different Vases :(
[Image: three vases, labeled "Not preferable", "Ideal!" and "Sufficient"]
Source: http://www.flickr.com/photos/bachmont/1382572541/
5. Quick Poll
How many of you are from a CS background?
Knowledge of:
Data Structures
Algorithms
Databases
Have heard of:
NoSQL
Key-Value Stores
Cloud Computing
MapReduce
Hadoop
7. What is this talk about?
2 themes in this talk:
About data – how is it stored, how do we work with it
About understanding technology via concepts learnt
8. How much data are we talking about really?
200 million Tweets per day – as of Jun 2011
Wikipedia dump
current revisions only - 31GB uncompressed
entire history runs into multiple TBs uncompressed
Common Crawl data – 10s of TBs
Tumblr – adding 3TB of new data every day
Google processes 25PB of data per day
Facebook – 135+ billion messages a month
Facebook – 130TB of logs generated per day
Vestas - Wind data - 18 to 24 petabytes of data to be processed
9. We are dealing with a lot more data...
Increase in the number of sensor devices
A larger audience of users using our applications via the web and social networks results in increased data generation
Cost of storage is falling – so we never discard any of the data
10. What's in it for me?
Scrabulous case study
Built by 2 young chaps from Kolkata
Both were in their early 20s when they built it
One was still in college.
500,000 users daily – back in 2008, $25,000 in ad revenues per month
These days lots of apps are being built by college undergraduates.
If they can do it, you can do it too!
Source: Wikipedia
11. You have all it takes
You have access to a lot of the tools that big corporations use, for free
You have computing power available cheaply
You have access to a lot of the data for free
12. What do I need then?
All you need is a little intelligence and a lot of
perseverance and you are on your way!
13. Questions to ask
Ok, you have the resources
You build a cool web application
It is an overnight hit (the Slashdot effect) – can you handle it?
What happens if the server has a disk crash?
Can we prevent website outages on account of hardware failures?
14. Looking for answers
What do technology companies like Google/Facebook/Twitter
use to manage data? What challenges do they face in managing
such huge volumes of data? How do they analyze such data?
Image Source: http://opencompute.org/
15. From concept to technology
We learn quite a few subjects in Computer Science – data structures, algorithms, databases, networking, operating systems, graph theory, etc.
Are we ever going to use this/need this as engineers?
How do I use my knowledge of CS to understand the latest developments in the industry?
Image Source:http://www.flickr.com/photos/nics_events/2223583947/
16. From concept to technology
This talk is about connecting concepts to real world examples
Image Source:http://www.flickr.com/photos/nics_events/2223583947/
17. A few snappy examples
Analysis of question papers from various companies
Analysis of image patterns in your photos and movie
collections
Analysis of your Facebook friends
2nd degree connections
Who is active at what time?
Who talks about what?
19. What is this section all about?
Before dealing with big-data problems, we first need to know how data is handled.
This section tries to answer questions like:
How is it that 0's and 1's are sufficient to do anything that a computer does?
Why do we need data structures?
Why do we need databases – why can't I just store all data as flat files?
20. Computers – A Bit Processor
Computers only understand bits
They have a way to store and process these bits
It is up to users to give the bits a “meaning”
[Image: a grid of 0s and 1s]
21. Data Structures
A data structure is like a cast
Pour your bits into it and a 'shape' is created
The 'shape' helps us provide a meaning to the bits
Image Source: http://www.flickr.com/photos/andrein/3020194734/
22. Programming Languages
The human mind does not understand bits. We need higher level constructs to process bits. This is where programming languages come in. They act as a bridge between what humans want to do and what machines understand.
Image Source: http://www.flickr.com/photos/jurvetson/5872448596/
23. Programming Languages
Variables
Types
Operators
Conditionals
Looping
Libraries
---
a = 10, b = 20
c = a + b

if condition:
    do_this()

for i in range(10):
    do_this()

urllib.urlopen('http://yahoo.com/').read()

[str.lower() for str in list_of_strings]
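The constructs listed above can be sketched together in a few lines of Python. This is an illustrative snippet, not from the talk itself; the variable names are made up:

```python
# Variables, types and operators
a = 10
b = 20
c = a + b  # arithmetic on integers

# Conditionals
if c > 25:
    message = "big"
else:
    message = "small"

# Looping
total = 0
for i in range(10):
    total += i  # sums 0..9

# Libraries and comprehensions: lowercase every string in a list
list_of_strings = ["Bangalore", "Mumbai"]
lowered = [s.lower() for s in list_of_strings]
```

The `urllib` call from the slide is omitted here because it needs network access; everything else runs as-is.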
24. Primitive Types
Languages usually have two primitive types
Numbers – Integers, Floats, Doubles etc
Strings – A sequence of characters put together
Why these two types? Why not just strings?
---
'bangalore'
123
567.89
0
-123
-567.89
'123'
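One way to see why "just strings" would not be enough: the same symbols behave differently as a number and as a string. A quick sketch in Python:

```python
# Numbers support arithmetic; strings of digits do not
sum_as_numbers = 123 + 1      # integer addition -> 124
sum_as_strings = '123' + '1'  # string concatenation -> '1231'

# The two primitive families from the slide
num = 567.89        # a float
name = 'bangalore'  # a string: a sequence of characters
```

Treating everything as a string would force us to parse values back into numbers every time we wanted to compute with them.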
25. Composite Types (or Collections)
The world is complex
We cannot model everything with only strings and numbers
We need ways to put primitive values together to form more complex types
Collections are a bag of values put together
Bottom up v/s Top down
---
Name → First Name + Last Name
Phone No → (Country Code) Area Code + Subscriber Number
Address → Door No + Street + City + State + Pin Code
Composite of composites: Person → Name + Phone No + Address
Group of People
26. Collections – General Object Containers
We can represent anything in the world using collections
Collections can be mapped to bits
Computers can interpret those bits
As a matter of fact, this is what JSON allows you to do
28. Collections – Lists
Grocery shopping example
Order of items matters
Do items need to be of the same type?
The key identifier is the position of the item in the list
Operations on a list:
add an item to list
remove an item from the list
get an item from the list at a specific position
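The grocery-list operations above map directly onto a Python list. The item names are just examples:

```python
# A grocery list: order matters, items are addressed by position
groceries = []
groceries.append("toothpaste")   # add an item to the list
groceries.append("matchbox")
groceries.append("tomatoes")

item = groceries[1]              # get the item at a specific position
groceries.remove("matchbox")     # remove an item from the list
```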
29. Collections – Sets
Items in a set are unique
There is no definite order
Operations on a set:
Add items to the set
Test if an item exists in the set
Remove an item from the set
30. Collections - Maps
Lots of maps in the real world
Indices are not always integers in the real world
We may want to identify properties of an item using some name
---
Toothpaste - 1, Rs. 54
Matchbox - 10, Rs. 15
Tomatoes - 1kg, Rs. 10
Chips - 1, Rs. 15
---
Dictionary of word definitions
---
Phone book containing phone numbers
31. Collections – Maps contd...
Maps allow us to associate a key with a value
The name that is used to identify the set of properties is called the key
The properties identified are called the value
---
Grocery list: Item is the key, properties are values
---
Dictionary as a map: keys are the words, values are the definitions
---
Phone book as a map: keys are the names, values are the phone numbers
32. Collections – Maps contd...
Keys don't have a definite order
Operations on a map:
Put a key, value pair
Get a value for a key
Get me all the keys and I will look at them one by one
---
Important: the analogy breaks here – don't get confused by the way a map works. Keys don't have an order...
You look up keys, not values – you don't say “get me the word whose definition is ...”
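The three map operations, sketched as a Python dict using the phone-book example from earlier (the names and numbers are invented):

```python
# A phone book as a map: names are keys, numbers are values
phone_book = {}
phone_book["Alice"] = "080-1234"   # put a key, value pair
phone_book["Bob"] = "080-5678"

number = phone_book["Alice"]       # get the value for a key

# get all the keys and look at them one by one
names = sorted(phone_book.keys())
```

Note that we sort the keys before looking at them, precisely because a map promises no order of its own.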
33. More composite types
List of lists
List of maps
Map of maps
...
---
A list of people is a list of maps
---
Mailboxes containing mails is a map of maps
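Both examples above can be built by nesting the collections we already have. A hypothetical sketch (the people and messages are made up):

```python
# A list of people is a list of maps
people = [
    {"name": "Asha", "city": "Bangalore"},
    {"name": "Ravi", "city": "Mumbai"},
]

# Mailboxes containing mails: a map of maps
mailboxes = {
    "inbox": {"msg-1": "Hello!", "msg-2": "Meeting at 5"},
    "spam":  {"msg-3": "You won a prize"},
}

first_city = people[0]["city"]          # drill into the list of maps
spam_count = len(mailboxes["spam"])     # drill into the map of maps
```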
35. Hashtables
Run the key through a magic function that gives you a number
The number is a unique slot into an array
The magic function is called a “hash function” – it is chosen such that there are minimal collisions and the most uniform distribution
Image Source: Wikipedia
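A toy version of the idea in Python. The "magic function" here is deliberately simple (a sum of character codes folded into the slot range) and collision handling is ignored, so this is a sketch of the concept, not a real hash table:

```python
NUM_SLOTS = 8

def slot_for(key):
    # a toy hash function: fold the character codes into a slot number
    return sum(ord(ch) for ch in key) % NUM_SLOTS

slots = [None] * NUM_SLOTS
slots[slot_for("toothpaste")] = 54      # store a value at the key's slot

value = slots[slot_for("toothpaste")]   # jump straight to the slot
```

Real hash functions are chosen far more carefully, but the lookup still works the same way: hash the key, jump to the slot.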
36. Gmail – An Example
What data structures do we use here?
Mail
Mailbox
Person
Label
A mailbox has a list of mails
A mail can be represented using a map
37. Gmail – An Example
What is the mailbox size? How much RAM does a system have?
If all the data of the world could fit into the RAM of a single machine,
we wouldn't have a lot of the problems we face
Luckily, that's not the case!
Properties of RAM:
Limited in capacity
Volatile (data disappears on reboot)
Even high-end servers topped out at around 256GB of memory at the time
Conclusion: We need the disk
38. Hmm... Our First “Big” Data Problem
Let us say the data is present as a huge 7 GB file on the disk.
How much time does it take to read this file into memory?
How do I measure disk speeds?
40. Disk Read Speed
We can get disk read speeds close to 80MB/s
Let's round it off to 100MB/s
Reading 7000MB would take 70 seconds
Would you wait if Gmail took 70 seconds to fetch your mails?
Remember, parallel read accesses and writes slow it down further.
Hmm, ok, this doesn't work, we need something faster, solution?
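The back-of-the-envelope arithmetic from this slide, written out:

```python
# Sequential read time for a ~7 GB file at the rounded-off disk speed
read_speed_mb_per_s = 100   # rounded up from ~80 MB/s
file_size_mb = 7000         # ~7 GB

seconds = file_size_mb / read_speed_mb_per_s  # 70 seconds
```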
41. How do we solve this?
Imagine a world where there are no databases - you
have a hard-disk and you are asked to solve this
problem.
We need to be able to read only the data we want as
quickly as we can.
How do we solve this?
42. Solution
Store data in fixed sized records and then have a way to
jump to the starting location of a specific record
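A minimal sketch of the fixed-size-record idea in Python: because every record occupies exactly `RECORD_SIZE` bytes, record N starts at byte `N * RECORD_SIZE`, so we can `seek()` straight to it instead of scanning the whole file. The record size and names are illustrative:

```python
import os
import tempfile

RECORD_SIZE = 16  # every record is padded to exactly 16 bytes

def write_records(path, records):
    with open(path, "wb") as f:
        for rec in records:
            f.write(rec.encode().ljust(RECORD_SIZE, b" "))

def read_record(path, n):
    with open(path, "rb") as f:
        f.seek(n * RECORD_SIZE)            # jump, don't scan
        return f.read(RECORD_SIZE).decode().rstrip()

path = os.path.join(tempfile.mkdtemp(), "records.dat")
write_records(path, ["alice", "bob", "carol"])
second = read_record(path, 1)              # "bob", read without touching the rest
```

This is essentially what lets a database fetch one row without reading the whole table.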
44. A word about Abstraction
Reading from a disk
Instruct the hardware to move the read head to a specific location, now
read the data
Reading from a file
Open the file, Read it, Close it
Reading from a database
Connect to the DB, query for data, Close connection
One of the skills you can pick up as an engineer is being able to define an operation at every level of abstraction
45. Relational Database Design
Define Entities and their Relationships
Handling 1..1, 1..n and m..n relationships
Perform normalization
Take the entities and their relationships and come up
with tables, fields, primary keys and foreign keys
Define queries to add, update, fetch and delete data
46. Mapping Design to Implementation
Data is stored in tables (which map to entities)
Tables contain records (rows) and fields (columns)
Records are of fixed length
Records are stored sequentially
47. Relational Databases – Storage Structure
Use hash-tables to point to records in the tables – so
individual records can be retrieved without having to
search the entire dataset.
This process is called “indexing”.
In theory you can have many such indexes.
Foreign keys are also indexed to speed up the lookup.
53. Problem 1 – Too Many Requests
What if a thousand users access my server at the same time?
If the server can handle 200 such requests in parallel in one
second, what if I have 400 requests per second?
1st second → 200 requests
2nd second → 600 requests (200 are from the previous second)
Results in server thrashing
Solution: Load Balanced Setup
55. Load Balancing
Load balancing is a way of parallelizing processing
across multiple machines
The load balancer acts as a proxy that streams
requests and responses between the client and the
processing server.
Eg: HAProxy
Stateful and Stateless Architectures
56. Problem 2 – Even More Requests
What if the Load Balancer itself becomes the
bottleneck?
Solution:
Round Robin DNS
Building multiple independent clusters
59. Problem 3 – The Stateful Database
A single database cannot handle all requests from all
users.
Unlike front-end servers, databases are not “stateless”
If we are trying to only read information, it's fine, but
if we are trying to write information, this is a problem.
60. Scale Up v/s Scale Out
Scale up means to add resources (CPUs or memory) to a single system in order to increase its processing capabilities
Scale up has limitations in how much we can scale – but is easier to do
Scale out means to add more nodes to a system
Scale out provides linear scalability and is less expensive, but is complex compared to scale-up
62. Scale Up Solution to the DB Problem
Increase the system's capacity by adding more
resources to the system – faster disks, more RAM,
faster processors, more cores etc
Introduce on-the-fly compression of data in the
database
Scale up is not scalable enough
65. Scale Out Solutions to the DB Problem
Until the virtualization revolution, and until we hit the limits of hardware, we were looking at scale-up solutions rather than scale-out solutions
Partition your data and put it on multiple systems – a subset of the rows on each system
This is called Sharding
66. Issues with Sharding
No clear way of partitioning the data
Maintaining ACID (Atomicity, Consistency, Isolation,
Durability) properties is complex
Joining data across machines is complex
Re-sharding is complex
67. Other Issues with Relational Databases
Data could be unstructured/semi-structured
Impedance mismatch (ORM issues)
Sparse values are not handled well – results in wastage of storage (although some engines handle this today)
Changes in schema are difficult
Not all data require ACID/Transactional support
Normalization results in more queries and that means
more disk accesses - some apps can do without them
68. The NoSQL Revolution
The NoSQL revolution happened to solve the many issues faced with storing web-scale data in relational databases
NoSQL stores, as the name suggests, don't use SQL to store and retrieve data
Widely adopted in web applications these days, with several solutions available
Still evolving – no clear winner, and therefore difficult to choose among alternatives
69. Advantages of NoSQL Stores
They don't require fixed schemas
Avoid joins
Sharding (Scale out) is easier – some even do it
automatically
Many of the implementations replicate the data and
thus avoid SPOFs (Single Point of Failure)
74. Examples of Web Scale Data Analysis
Distributed Grep - Look for a pattern in all the Tweets
Inverted Index Building - This is what is used by search
engines
Sentiment Analysis
Competition Analysis
Log Analysis
75. Understanding the problem of Analysis
Unlike in the case of retrieving data, in the case of analysis we need to read through everything, but reads from disk are slow.
Let's do some simple math:
1 hard disk reads at 100MB/s
100 hard disks reading in parallel give 10GB/s!
Can we exploit this parallelism?
76. The Coin Counting Example
You have a sack full of coins, and you are asked to separate
them into 1, 2, 5 and 10 Rs coins and tell how many of each
are present.
Now, let's say you have a few sacks full of coins and it will take you a lot of time to count them yourself – so you call a few other people to help you out.
Now, let's say there are a few rooms full of coins (like in some large temples in India) – how will you count them?
77. Coin Counting Problem – in depth
You can't add more people to the same room – the
room is already full.
You can get a few more rooms, ask people to take some
coins to the other room and then do the counting
there, and come back with the coins and the final count.
This will mean a lot of “traffic” in the corridor.
So what's a better solution?
78. A Possible Solution to the Coin Counting Problem
Unload the coins in different rooms rather than in the
same room.
Then get workers in different rooms. With an increase
in coins, increase the number of rooms and workers.
Let the workers in each room work independently.
This is how Map/Reduce frameworks work
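The coin-counting scheme above can be sketched in a few lines of Python: each "room" tallies its own pile independently (the map step), and the per-room tallies are then merged (the reduce step). The coin piles are invented for illustration:

```python
from collections import Counter

# Each room counts its own pile independently (the "map" step)
def count_room(coins):
    return Counter(coins)

rooms = [
    [1, 2, 5, 1, 10],    # coins unloaded into room 1
    [2, 2, 5, 10, 10],   # coins unloaded into room 2
]
partial_counts = [count_room(room) for room in rooms]

# Merge the per-room tallies (the "reduce" step)
total = Counter()
for partial in partial_counts:
    total.update(partial)
```

Only the small per-room counts travel between rooms, not the coins themselves – which is exactly the "less traffic in the corridor" insight.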
79. Traditional Parallel Processing
Use of threads, sharing data, synchronization
Results in Deadlocks, Livelocks, Starvation etc
Handling failures is complex
Parallel Programming is hard this way.
80. Requirements from a parallel processing framework
Higher level programming constructs – don't need to deal with sockets,
threading, locking, sharing data etc
Manage failures - if a task fails or a system breaks down, we want the
framework to transparently manage it
Recoverability - If a system fails, another system must be able to pick up
its workload
Replication – if a system fails, we don't lose data – the framework
should replicate data in multiple nodes
Scalability – Adding more compute nodes should help us increase the
compute capacity
81. Pulling data Or Pushing Computations?
Pulling data for computation results in a bottleneck
Every “database store” also has a “processor”.
Instead of pulling the data for computation, can we
think about pushing the computation out to where the
data resides?
A computation is just bytes – maybe a few MB of object code – which is trivial compared to the data it works on
82. MapReduce
Concept introduced by Google in 2004
Framework is inspired by map and reduce functions
found in functional programming languages
Hadoop is an open-source implementation of MapReduce
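The functional-programming origins of the names are easy to see in Python itself, which has a built-in `map` and a `reduce` in `functools`:

```python
from functools import reduce

# map transforms every element; reduce folds the results into one value
numbers = [1, 2, 3, 4]
squares = list(map(lambda x: x * x, numbers))        # [1, 4, 9, 16]
total = reduce(lambda acc, x: acc + x, squares, 0)   # 30
```

MapReduce frameworks generalize this pattern: the map step runs in parallel across machines, and the reduce step merges the partial results.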
83. MapReduce Frameworks
Data is spread throughout machines before starting
the task
Computation is done in the nodes where data is stored
Data is replicated in multiple machines to increase
reliability
Tasks are executed on multiple nodes just in case one
of them is running slow
84. Using the Common Crawl Data – A Case Study
The dump is a few 10s of TBs in size
Where/How do you download it?
Answer: You don't need to download it
Instead you push your computation to where the data
exists, perform your computation and then only fetch
results you are interested in!
85. Recap
My knowledge of computer science:
Am I ever going to use this/need this as an
engineer?
How do I use this knowledge to understand the
latest developments in software engineering?
Hope you have an answer now!
86. Parting Thoughts
Technology changes very rapidly – don't expect to be
spoon-fed
Practise, Practise, Practise - Katas
Concept before Technology
Try out new things – even if they are not related to your
project/curriculum
Read and understand other people's code
Read a lot, for example: http://highscalability.com/
87. We at jnaapti conduct workshops and provide
training on these technologies – contact us at
http://jnaapti.com/ for more details