1. Big Data and Hadoop
Submitted To: Mr. B. Varghese
Submitted By: Chanchal Tripathi
Roll No.: 36
MCA-5
2. Index
• Big Data
• How Much Data?
• Types of Data
• Apache Hadoop
• Design Principles of Hadoop
• Hadoop vs. Other Systems
• Hadoop vs. Databases
• About Cloudera
• Other Elements of Big Data
3. Big Data Everywhere!
• Lots of data is being collected and warehoused:
  – Web data, e-commerce
  – Purchases at department/grocery stores
  – Bank/credit-card transactions
  – Social networks
4. How Much Data?
• Google processes 20 PB a day (2008)
• The Wayback Machine holds 3 PB, growing by 100 TB/month
• Facebook holds 2.5 PB of user data, growing by 15 TB/day
• eBay holds 6.5 PB of user data, growing by 50 TB/day
• CERN’s Large Hadron Collider (LHC) generates 15 PB a year
5. Types of Data
• Relational data (tables/transactions/legacy data)
• Text data (Web)
• Semi-structured data (XML)
• Graph data
  – Social networks, Semantic Web (RDF), …
6. Apache Hadoop
• An open-source software framework for the distributed processing of large datasets across large clusters of computers
• Hadoop is based on a simple programming model called MapReduce (see the sketch below)
• Hadoop is based on a simple data model: any data, in any format, will fit
• Hadoop is one of the tools designed to handle big data
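To make the MapReduce model concrete, here is a minimal word-count sketch in Java against the standard Hadoop MapReduce API (org.apache.hadoop.mapreduce); the class names WordCountMapper and WordCountReducer are illustrative, not part of Hadoop itself.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map step: emit (word, 1) for every word in an input line.
class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}

// Reduce step: sum the counts emitted for each word.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();
        }
        context.write(word, new IntWritable(sum));
    }
}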
7. Design Principles of Hadoop
• Automatic parallelization & distribution
  – Hidden from the end user
• Fault tolerance and automatic recovery
  – Nodes/tasks will fail and will recover automatically
• Clean and simple programming abstraction
  – Users only provide two functions, “map” and “reduce” (see the driver sketch below)
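To illustrate how little the user supplies beyond those two functions, here is a minimal job driver under the same word-count assumptions as above (WordCountDriver and the input/output argument order are illustrative); input splitting, scheduling, and failure recovery are left to the framework.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        // Wire up the user-supplied map and reduce classes; the framework
        // handles parallelization, distribution, and automatic recovery.
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}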
8. Hadoop vs. Other Systems
• Computing model
  – Distributed databases: notion of transactions; the transaction is the unit of work; ACID properties and concurrency control
  – Hadoop: notion of jobs; the job is the unit of work; no concurrency control
• Data model
  – Distributed databases: structured data with a known schema; read/write mode
  – Hadoop: any data will fit, in any format, (un)(semi)structured; read-only mode
• Cost model
  – Distributed databases: expensive servers
  – Hadoop: cheap commodity machines
• Fault tolerance
  – Distributed databases: failures are rare; recovery mechanisms
  – Hadoop: failures are common over thousands of machines; simple yet efficient fault tolerance
• Key characteristics
  – Distributed databases: efficiency, optimizations, fine-tuning
  – Hadoop: scalability, flexibility, fault tolerance
9. Why Is Hadoop Able to Compete?
• Hadoop
  – Flexibility in accepting all data formats (no schema)
  – Commodity, inexpensive hardware
  – Efficient and simple fault-tolerance mechanism
  – Scalability (petabytes of data, thousands of machines)
• Databases
  – Performance (tons of indexing, tuning, and data-organization techniques)
  – Features: provenance tracking, annotation management
10. About Cloudera
• Cloudera is the commercial Hadoop company
• Founded by leading experts on Hadoop from Facebook, Google, Oracle, and Yahoo
• Provides consulting and training services for Hadoop users
11. BIG DATA is not just HADOOP
• Manage & store huge volumes of any data: Hadoop File System, MapReduce
• Manage streaming data: Stream Computing
• Analyze unstructured data: Text Analytics Engine
• Structure and control data: Data Warehousing
• Integrate and govern all data sources: Integration, Data Quality, Security, Lifecycle Management, MDM
• Understand and navigate federated big data sources: Federated Discovery and Navigation
12. Conclusion
• Big data is one of the fastest-emerging areas of business-intelligence analysis, and Hadoop is a tool for handling big data very efficiently.
• Big data is making BIG changes in the business-analysis world.
13. References
• Tom White, Hadoop: The Definitive Guide
• “hadoop” on Google Scholar
• http://hadoop.apache.org/#Who+Uses+Hadoop%3F
• http://hadoop.apache.org/
• http://www.cloudera.com