Big Data & Hadoop Introduction

BIG DATA
Enlightening Big Data

- Jayant

What is BIG Data ? ? ?

How BIG is BIG Data ?

How to define BIG Data ? ? ?

The “V” dimensions used to define Big Data trace back to a 2001 research report by Doug Laney (then at META Group, later acquired by Gartner).

Velocity

Facebook
• 300m photos uploaded / day
• 2.5b content items shared / day
• 70K queries executed / day
• 500+ TB / day

Twitter
• 340m tweets / day
• 140m active users

Google
• 4.7b search queries / day
• Processing 20 PB data / day

Walmart
• 1m transactions / hour
• 2.5+ petabytes of data held in its databases

Variety

Structured Analysis
• Responses to Pledge, multiple choice questions

Unstructured Analysis
Responses to following questions
• Share your story
• Ask a question to Aamir
• Send a message of hope
• Share your solution

Content Filtering Rating Tagging System (CFRTS)
L0, L1, L2 phased analytics

Impact Analysis
Crawling general internet for measuring the before & after scenario on a particular topic

Value

It is a capital mistake to theorize
before one has data.

-Sherlock Holmes

Variability
Who enjoys the fastest internet? Where does our energy come from?

Living longer with fewer children

http://www.google.com/publicdata/directory

Other Effect – Geo, Event …

3 I’s for Big Data
Ill-Defined
• “data that’s an order of magnitude greater than data you’re accustomed to.”
- Gartner analyst Doug Laney
• “data that exceeds the processing capacity of conventional database systems. The data is too big,
moves too fast, or doesn’t fit the strictures of your database architectures. To gain value from
this data, you must choose an alternative way to process it.”
- Edd Dumbill, program chair for the O’Reilly Strata Conference

Intimidating
• How do you make Big Data approachable?
• There are lots of challenges in leveraging Big Data, from managing the data to having the right
tools to get you the insights that matter.
• Companies like Splunk and Sumo Logic offer Big Data apps for machine data.
• Marketing relevance company BloomReach processes more than 100 million web pages,
generating 94% average annual incremental traffic as a result.

Immediate
• What’s actionable about big data?
• “the analytic value of data decays rapidly.”
- Andrew Rogers, founder and CTO of SpaceCurve
• That means being able to analyze your data as fast as possible is critical to gaining competitive
advantage: “strike while the iron is hot.”

Managing BIG Data
• Distributed Computing
• Multiprocessing Unit
• Parallel processing

• SMP (Symmetric MultiProcessing solutions) :
SMP systems use multiple processors that share a common operating system
(OS) and memory.
e.g. Microsoft SQL Server 2008 R2 Fast Track Data Warehouse platform

• MPP (Massively Parallel Processing) :
MPP systems harness numerous processors, each with its own OS and memory,
working on different parts of an operation in a coordinated way.
e.g. Microsoft’s Parallel Data Warehouse solution

• NoSQL Platforms :
They increase performance at a lower cost, with linear scalability, true
commodity hardware, a schema-free structure, and more relaxed data-
consistency validation.
e.g. Hadoop

Evolution – Distributed System
ACID
• Atomicity
• Consistency
• Isolation
• Durability
For the internet workload, with distributed computing, ACID properties are too strong.

BASE
• Basically Available
• Soft-state
• Eventual consistency
Rather than requiring consistency after every transaction, it is enough for the database
to eventually be in a consistent state -- BASE.
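
As a rough illustration of eventual consistency, the sketch below keeps two in-memory replicas and reconciles them with a last-write-wins rule. Everything in it (the `Replica` and `Versioned` classes, `syncFrom`, the logical timestamps) is a hypothetical construction for this example and not how any particular database implements BASE.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch only: two in-memory replicas with last-write-wins reconciliation.
// All class and method names here are hypothetical.
class EventualConsistencySketch {

    /** A value tagged with a logical timestamp so replicas can pick the newer write. */
    static final class Versioned {
        final String value;
        final long timestamp;

        Versioned(String value, long timestamp) {
            this.value = value;
            this.timestamp = timestamp;
        }
    }

    static final class Replica {
        final Map<String, Versioned> store = new HashMap<>();

        /** Accept a write locally without waiting for other replicas. */
        void write(String key, String value, long timestamp) {
            store.put(key, new Versioned(value, timestamp));
        }

        /** Read whatever this replica currently holds; may be stale or null. */
        String read(String key) {
            Versioned v = store.get(key);
            return v == null ? null : v.value;
        }

        /** Pull entries from a peer, keeping whichever version is newer. */
        void syncFrom(Replica peer) {
            for (Map.Entry<String, Versioned> e : peer.store.entrySet()) {
                Versioned mine = store.get(e.getKey());
                if (mine == null || mine.timestamp < e.getValue().timestamp) {
                    store.put(e.getKey(), e.getValue());
                }
            }
        }
    }

    public static void main(String[] args) {
        Replica a = new Replica();
        Replica b = new Replica();
        a.write("cart", "1 item", 1L);
        System.out.println("b before sync: " + b.read("cart")); // stale read
        b.syncFrom(a);
        System.out.println("b after sync:  " + b.read("cart")); // replicas have converged
    }
}
```

Between the write and the sync, the two replicas disagree (soft state) yet both keep answering requests (basically available); after the sync they agree again.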

• Consistent – Reads always pick up the latest write.
• Available – can always read and write.
• Partition tolerant – The system can be split across
multiple machines and datacenters

Can do at most two of these three.
Brewer’s CAP Theorem for Distributed Systems

Path to DataStack 3.0
Must support Variety, Volume and Velocity

Data Stack 1.0                  Data Stack 2.0                    Data Stack 3.0
Relational Database Systems     Enterprise Data Warehouse         Dynamic Data Platform
Recording Business Events       Support for Decision Making       Uncovering Key Insights
Highly Normalized Data          Denormalized Dimensional Model    Schema-less Approach
GBs of Data                     TBs of Data                       PBs of Data
End User Access thru Ent Apps   End User Access Through Reports   End User Direct Access
Structured                      Structured                        Structured + Semi Structured

Hadoop
• A scalable fault-tolerant grid operating system for data
storage and processing
• Its scalability comes from the marriage of:
• HDFS: Self-Healing High-Bandwidth Clustered Storage
• MapReduce: Fault-Tolerant Distributed Processing
• Operates on unstructured and structured data
• A large and active ecosystem (many developers and additions
like HBase, Hive, Pig, …)
• Open source under the friendly Apache License
• http://wiki.apache.org/hadoop/

Hadoop Design Axioms:-
• System Shall Manage and Heal Itself
• Performance Shall Scale Linearly
• Compute Should Move to Data
• Simple Core, Modular and Extensible

Hadoop
• Hadoop’s Inspiration – Google’s MapReduce
• Google’s GFS & GMR -> Hadoop’s HDFS & HMR
• Hadoop was created by Doug Cutting and Michael J. Cafarella.
• Hadoop is written in the Java programming language and is a top-level Apache project
being built and used by a global community of contributors.

Timeline
2002-2004: Doug Cutting and Mike Cafarella started working on Nutch
2003-2004: Google publishes GFS and MapReduce papers
2004: Cutting adds DFS & MapReduce support to Nutch
2006: Yahoo! hires Cutting, Hadoop spins out of Nutch
2007: NY Times converts 4TB of archives over 100 EC2s
2008: Web-scale deployments at Y!, Facebook, Last.fm
April 2008: Yahoo does fastest sort of a TB, 3.5mins over 910 nodes
May 2009: Yahoo does fastest sort of a TB, 62secs over 1460 nodes
May 2009: Yahoo sorts a PB in 16.25hours over 3658 nodes
June 2009, Oct 2009: Hadoop Summit (750), Hadoop World (500)

HDFS Hadoop Distributed File System
Block Size = 64MB
Replication Factor = 3
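
To make the two settings above concrete, here is a small arithmetic sketch of how a file maps onto blocks and replicas. It is plain Java, not the HDFS API; the class name `HdfsBlockMath`, its methods, and the 1 GB example file are assumptions made for illustration.

```java
// Sketch only: plain arithmetic, not the HDFS API. Names are hypothetical.
class HdfsBlockMath {
    static final long MB = 1024L * 1024L;
    static final long BLOCK_SIZE = 64 * MB; // block size from the slide
    static final int REPLICATION = 3;       // replication factor from the slide

    /** Number of blocks needed for a file (the last block may be only partly filled). */
    static long blockCount(long fileBytes) {
        return (fileBytes + BLOCK_SIZE - 1) / BLOCK_SIZE; // ceiling division
    }

    /** Raw bytes consumed across the cluster once every block is replicated. */
    static long rawBytes(long fileBytes) {
        return fileBytes * REPLICATION;
    }

    public static void main(String[] args) {
        long oneGb = 1024 * MB; // hypothetical 1 GB file
        System.out.println("blocks         = " + blockCount(oneGb));
        System.out.println("block replicas = " + blockCount(oneGb) * REPLICATION);
        System.out.println("raw MB used    = " + rawBytes(oneGb) / MB);
    }
}
```

Under these assumptions a 1 GB file splits into 16 blocks, stored as 48 block replicas that occupy 3 GB of raw disk.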

Cost/GB is a few ¢/month vs $/month

MapReduce Distributed Processing

Working of Hadoop – I (Map Reduce)

Working of Hadoop – I (MR Code)

public void map(Object key, Text value, …. ) {
    // Split the input line into whitespace-separated tokens
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);   // emit (word, 1)
    }
}

public void reduce(Text key, Iterable<IntWritable> values, ……… ) {
    // Sum every count emitted for this word
    int sum = 0;
    for (IntWritable val : values) {
        sum += val.get();
    }
    result.set(sum);
    context.write(key, result);     // emit (word, total)
}
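
The map and reduce methods above depend on Hadoop types and a running cluster. As a rough, Hadoop-free illustration of the same word-count flow, the sketch below simulates the map, shuffle (group by key) and reduce steps inside a single JVM. The class `WordCountSketch` and its methods are hypothetical stand-ins, and the in-memory grouping is only an approximation of what the framework does across nodes.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Sketch only: simulates map -> shuffle -> reduce in one JVM, no Hadoop classes.
class WordCountSketch {

    /** Map phase: emit a (word, 1) pair for every token in one input line. */
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            out.add(new AbstractMap.SimpleEntry<String, Integer>(itr.nextToken(), 1));
        }
        return out;
    }

    /** Reduce phase: sum all counts emitted for one word. */
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int val : values) {
            sum += val;
        }
        return sum;
    }

    /** Runs map on every line, groups pairs by key (the "shuffle"), then reduces each group. */
    static Map<String, Integer> countWords(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> pair : map(line)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            result.put(e.getKey(), reduce(e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // Two hypothetical input lines, standing in for two input splits
        System.out.println(countWords(Arrays.asList("big data", "big hadoop")));
    }
}
```

Each call to `map` touches only its own line and each call to `reduce` touches only one word's values, which is what lets the real framework spread those calls over many machines.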

Hadoop - Economics
• Typical Hardware:
• Two Quad Core Nehalems
• 24GB RAM
• 12 * 1TB SATA disks (JBOD mode, no need for RAID)
• 1 Gigabit Ethernet card
• Cost/node: $5K/node
• Effective HDFS Space:
• ¼ reserved for temp shuffle space, which leaves 9TB/node
• 3 way replication leads to 3TB effective HDFS space/node
• But assuming 7x compression that becomes ~ 20TB/node
Effective Cost per user TB: $250/TB
Other solutions cost in the range of $5K to $100K per user TB
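
The cost figure above follows from a short chain of arithmetic, sketched here in plain Java. The class and method names are hypothetical; the inputs (12 TB raw, ¼ temp space, 3-way replication, 7x compression, $5K per node) are the slide's own assumptions.

```java
// Sketch only: reproduces the slide's back-of-envelope arithmetic. Names are hypothetical.
class HadoopNodeEconomics {

    /** Effective (user-visible) TB per node after temp space, replication and compression. */
    static double effectiveTb(double rawTb, double tempFraction, int replication, double compression) {
        double afterTemp = rawTb * (1.0 - tempFraction);   // 12 TB -> 9 TB
        double afterReplication = afterTemp / replication; // 9 TB -> 3 TB
        return afterReplication * compression;             // 3 TB -> 21 TB (slide rounds to ~20)
    }

    /** Hardware cost divided by the user-visible capacity of one node. */
    static double costPerUserTb(double nodeCostUsd, double userTb) {
        return nodeCostUsd / userTb;
    }

    public static void main(String[] args) {
        double tb = effectiveTb(12.0, 0.25, 3, 7.0);
        System.out.printf("effective TB/node = %.1f%n", tb);
        System.out.printf("cost per user TB (exact) = $%.0f%n", costPerUserTb(5000.0, tb));
        System.out.printf("cost per user TB (slide's rounded 20 TB) = $%.0f%n", costPerUserTb(5000.0, 20.0));
    }
}
```

The $250/TB figure depends heavily on the 7x compression assumption; without compression the same node yields 3 TB, or roughly $1,667 per user TB.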
Powered by Hadoop:
• Facebook
• 1100-nodes cluster with 8800 cores
• store copies of internal log and dimension data sources and use it as a
source for reporting/analytics and machine learning
• Yahoo
• Biggest cluster: 4000 nodes
• Search Marketing, People you may know, Search Assist, and many more…
• eBay
• 532 nodes cluster (8 * 532 cores, 5.3PB).
• Using it for Search optimization and Research

http://wiki.apache.org/hadoop/PoweredBy

RDBMS and Hadoop
            RDBMS                    MapReduce
Data size   Gigabytes                Petabytes
Access      Interactive and batch    Batch
Structure   Fixed schema             Unstructured schema
Language    SQL                      Procedural (Java, C++, Ruby, etc)
Integrity   High                     Low
Scaling     Nonlinear                Linear
Updates     Read and write           Write once, read many times
Latency     Low                      High

Hadoopable Problem Types
1 Batchable
• They are batchable into the two-phase Map/Reduce sequence(s)

2 Massive Volume
• There is a need to analyze massive data volumes, which precludes solving them on more traditional platforms.

3 No Data Dependency
• They exhibit little or no data dependence, meaning that work being done by one computational node is largely done on
data locally accessible to that computational node.

4 No Process Dependency
• They are amenable to massive parallelism in that there is little process dependence across computations. The tasks do
not have to be “sequentialized,” meaning that those tasks really can be executed at the same time without having to
wait for each other to provide interim results, except during the transition between the map and reduce phases.

5 Unstructure++
• They are not limited to data managed within a structured environment, and in fact unstructured data analysis and
analyzing combinations of structured and unstructured data are suitable.

6 No Inter-Process Communication
• Individually-assigned tasks require limited inter-process communication, reducing any latency delays associated with
injecting data into and pulling data out of a network.

6 Super Scale Hadoop Deployments

Myths
1 Big Data is Only About Massive Data Volume
• Volume is just one key element in defining Big Data, and it is arguably the least important of three elements. The other
two are variety and velocity.
• Experts consider PBs of data volume as the starting point for Big Data, although this volume indicator is a moving target.

2 Big Data Means Hadoop
• Hadoop is the Apache open-source software framework for working with Big Data. It was derived from Google
technology and put to practice by Yahoo and others.
• Big Data is too varied and complex for a one-size-fits-all solution.

3 Big Data Means Unstructured Data
• The term “unstructured” is imprecise and doesn’t account for the many varying and subtle structures typically
associated with Big Data types. Big Data is probably better termed “multi-structured” as it could include text strings,
documents of all types, audio and video files, metadata, web pages, email messages, social media feeds, form data, etc.

4 Big Data is for Social Media Feeds and Sentiment Analysis
• The early pioneers of Big Data were the largest web-based and social media companies (Google, Yahoo, Facebook),
but it was the volume, variety, and velocity of data generated by their services that required a radically new
solution, not the need to analyze social feeds or gauge audience sentiment.

5 NoSQL means No SQL
• NoSQL means “not only” SQL because these types of data stores offer domain-specific access
• Technologies in this NoSQL category include key value stores, document-oriented databases, graph databases, big table
structures, and caching data stores.

Where/How it's used

Business
• Behavioral analysis
• Targeting marketing offers
• Analyzing marketing effectiveness
• Root cause analysis
• Sentiment Analysis
• Fraud Analysis
• Risk Mitigation

Technical
• Staging area for Data warehouse / analytics
• Analytics Sandbox
• Unstructured / semi-structured content storage and analysis
• Total data analysis
• Commodity based Storage

Case Study
Rigorous Weekly Operation Cycle producing instant analytics
Killer combo of Human + Software to analyze the data efficiently

• Topic opens on Sunday
• Data capture from SMS, phone calls, social media, website
• System runs L0 Analysis; L1, L2 Analysts continue
• Episode Tags are refined and messages are re-ingested for another pass
• Live Analytics report is sent during the show
• JSONs are created for the external and internal dashboards
• Featured content is delivered thrice a day throughout the week

“With too little data, you won’t be able to make any conclusions that you trust.
With loads of data you will find relationships that aren’t real…

Big data isn’t about bits, it’s about talent”

– Douglas Merrill

Q&A

Torture the data, and it will confess to anything.
-Ronald Coase, Economics, Nobel Prize Laureate

Thank You
