Hadoop is an open source platform for storing and processing large amounts of unstructured data across clusters of commodity hardware. It provides flexibility in storing various data types without schemas, and scales out workload by distributing data and processing across nodes. Hadoop is also fault tolerant, continuing operations even if nodes fail, and moves computation to where the data resides for efficiency. Key components include Hadoop Common, HDFS for storage, and MapReduce for distributed processing.
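The MapReduce flow mentioned above can be sketched in a few lines. This is a minimal, single-process illustration of the map → shuffle → reduce model (real Hadoop distributes these phases across nodes); the function names here are illustrative, not the Hadoop API.

```python
from collections import defaultdict

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle: group values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["hadoop stores data", "hadoop processes data"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts)  # {'hadoop': 2, 'stores': 1, 'data': 2, 'processes': 1}
```

In real Hadoop, many map tasks run in parallel near their input blocks, and the shuffle moves intermediate pairs across the network to the reducers.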
13. Hadoop is an Open Source (Java-based), scalable, fault-tolerant platform for storing and processing large amounts of unstructured data, distributed across machines.
14. Flexibility: a single repository for storing and analyzing any kind of data, not bounded by a schema.
Scalability: scale-out architecture divides the workload across multiple nodes using a flexible distributed file system.
Low Cost: deployed on commodity hardware and an open source platform.
Fault Tolerant: continues working even if node(s) go down.
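The fault-tolerance point above comes from replication: HDFS keeps several copies of each block on different nodes (the default replication factor is 3), so losing a node loses no data. A toy sketch of that idea, with made-up node and block names:

```python
# Illustrative HDFS-style block placement; not actual HDFS code.
REPLICATION = 3

def place_blocks(blocks, nodes, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes, round-robin."""
    placement = {}
    for i, block in enumerate(blocks):
        placement[block] = [nodes[(i + r) % len(nodes)] for r in range(replication)]
    return placement

def reachable_blocks(placement, live_nodes):
    """Blocks still readable when only `live_nodes` are up."""
    return {b for b, replicas in placement.items()
            if any(n in live_nodes for n in replicas)}

nodes = ["node1", "node2", "node3", "node4"]
placement = place_blocks(["blk_0", "blk_1", "blk_2"], nodes)

# Take node1 down: every block still has at least one live replica.
survivors = reachable_blocks(placement, {"node2", "node3", "node4"})
print(sorted(survivors))  # ['blk_0', 'blk_1', 'blk_2']
```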
15. A system that moves computation to where the data is.
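Moving computation to the data amounts to locality-aware scheduling: run each task on a node that already holds its input block, so the large data stays put and only the small program travels. A rough sketch of that preference (illustrative only, not the actual YARN scheduler):

```python
def schedule(task_block, block_locations, free_nodes):
    """Return (node, is_local) for a task that reads `task_block`.

    Prefer a free node that already stores the block; otherwise fall
    back to any free node, which would force a network read.
    """
    local = [n for n in block_locations.get(task_block, []) if n in free_nodes]
    if local:
        return local[0], True            # computation moved to the data
    return sorted(free_nodes)[0], False  # remote read as a last resort

# Hypothetical cluster state: blk_7 is replicated on node2 and node3.
block_locations = {"blk_7": ["node2", "node3"]}
node, is_local = schedule("blk_7", block_locations, {"node1", "node2"})
print(node, is_local)  # node2 True
```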
23. Cloudera Impala / Hortonworks Tez
Impala uses C++-based in-memory processing of HDFS data through SQL-like statements to expedite data processing.
Use cases include user collaborative filtering, user recommendations, clustering, and classification.
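The "SQL-like statements" Impala accepts are ordinary analytic queries over tables backed by HDFS files. Impala itself is a C++ daemon, so as a stand-in this sketch runs the same kind of aggregation through SQLite purely to show the query shape; the table and column names are made up for the example.

```python
import sqlite3

# SQLite stands in for Impala here only to demonstrate the SQL; the
# schema and data are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (user_id INTEGER, url TEXT)")
conn.executemany("INSERT INTO page_views VALUES (?, ?)",
                 [(1, "/home"), (1, "/docs"), (2, "/home")])

# An aggregation of the sort you would submit to an Impala shell:
rows = conn.execute(
    "SELECT url, COUNT(*) AS views FROM page_views "
    "GROUP BY url ORDER BY views DESC"
).fetchall()
print(rows)  # [('/home', 2), ('/docs', 1)]
```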
Editor's Notes
Slides are for reference only; we can understand and learn more through live examples and live discussion. Let's see who is who in the room: how many coders? Program managers? Any Hadoop stories? How about: where is Hadoop's headquarters?