1. Why do we need a grid? Storage: a single disk cannot hold all the data. Computation: a single CPU cannot meet all the computing needs. Parallel jobs: serial execution is no longer a viable option.
2. What we expect from a framework: distributed storage; a job specification platform; job splitting and merging; job execution and monitoring.
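The split, execute, and merge expectations above can be sketched on a single machine. This is only an illustration under stated assumptions: the function names (`split_job`, `run_task`, `merge_results`, `run_job`) are hypothetical, a thread pool stands in for cluster nodes, and word counting stands in for a real job.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_job(lines, num_tasks):
    """Split the input into at most num_tasks roughly equal chunks."""
    size = max(1, -(-len(lines) // num_tasks))  # ceiling division
    return [lines[i:i + size] for i in range(0, len(lines), size)]


def run_task(chunk):
    """One independent task: count the words in its own chunk only."""
    counts = Counter()
    for line in chunk:
        counts.update(line.lower().split())
    return counts


def merge_results(partials):
    """Merge the per-task counts into one final result."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def run_job(lines, num_tasks=4):
    """Split, run the tasks in parallel, then merge.

    Assumption: threads stand in for worker nodes; a real framework
    would also schedule, monitor, and retry the tasks.
    """
    chunks = split_job(lines, num_tasks)
    with ThreadPoolExecutor(max_workers=num_tasks) as pool:
        partials = list(pool.map(run_task, chunks))
    return merge_results(partials)


if __name__ == "__main__":
    print(run_job(["a b a", "b c", "a"], num_tasks=2))
```

Because each task reads only its own chunk, the tasks need no coordination until the merge step, which is what makes this kind of job easy to spread across machines.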
3.
4.
5. HDFS attributes: Distributed, reliable, scalable. Optimized for streaming reads of very large data sets. Assumes write once, read many times. No local caching, because files are large and read as streams. High data replication. Fits logically with MapReduce. Access to metadata is synchronized through the NameNode. Metadata (edit log, FsImage) is stored in the NameNode's local OS file system.
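The relationship between the edit log and the FsImage can be illustrated with a toy model. This is a sketch under assumptions, not the HDFS implementation: the class `ToyNameNode` and its methods are hypothetical, and the namespace is reduced to a mapping from path to block ids.

```python
class ToyNameNode:
    """Toy metadata store: a checkpoint image plus a log of later edits."""

    def __init__(self):
        self.image = {}      # last checkpoint: path -> list of block ids
        self.edit_log = []   # mutations recorded since the last checkpoint
        self.namespace = {}  # current in-memory state

    def create(self, path, blocks):
        # Write-once assumption: an existing path cannot be overwritten.
        if path in self.namespace:
            raise FileExistsError(path)
        self.edit_log.append(("create", path, list(blocks)))
        self.namespace[path] = list(blocks)

    def delete(self, path):
        if path not in self.namespace:
            raise FileNotFoundError(path)
        self.edit_log.append(("delete", path, None))
        del self.namespace[path]

    def checkpoint(self):
        """Fold the current state into a new image and truncate the log."""
        self.image = {p: list(b) for p, b in self.namespace.items()}
        self.edit_log = []

    @staticmethod
    def recover(image, edit_log):
        """Rebuild the namespace by replaying the log on top of the image."""
        namespace = {p: list(b) for p, b in image.items()}
        for op, path, blocks in edit_log:
            if op == "create":
                namespace[path] = list(blocks)
            elif op == "delete":
                namespace.pop(path, None)
        return namespace


if __name__ == "__main__":
    nn = ToyNameNode()
    nn.create("/a", [1, 2])
    nn.checkpoint()
    nn.create("/b", [3])
    nn.delete("/a")
    print(ToyNameNode.recover(nn.image, nn.edit_log))
```

Replaying the log over the last image reproduces the in-memory state, which is why both pieces together are enough to restart after a crash.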
14. Scalability goal: flat scalability, meaning that adding or removing a node is fairly straightforward.
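One way to see why node changes can stay simple is a toy cluster that keeps a fixed number of replicas per block. This is a sketch under assumptions: `ToyCluster` and its methods are hypothetical, placement simply prefers the least-loaded node, and no rebalancing is done when a node joins.

```python
class ToyCluster:
    """Toy block store that keeps `replication` copies of every block."""

    def __init__(self, replication=3):
        self.replication = replication
        self.nodes = {}  # node name -> set of block ids held by that node

    def add_node(self, name):
        self.nodes.setdefault(name, set())
        self._repair()  # a new node may let under-replicated blocks catch up

    def remove_node(self, name):
        self.nodes.pop(name, None)
        self._repair()  # re-create the copies that lived on the lost node

    def store_block(self, block_id):
        self._place(block_id)

    def replicas(self, block_id):
        return sorted(n for n, blocks in self.nodes.items() if block_id in blocks)

    def _place(self, block_id):
        """Copy the block to the least-loaded nodes until it has enough replicas."""
        holders = set(self.replicas(block_id))
        candidates = sorted(
            (n for n in self.nodes if n not in holders),
            key=lambda n: (len(self.nodes[n]), n),
        )
        missing = max(0, self.replication - len(holders))
        for name in candidates[:missing]:
            self.nodes[name].add(block_id)

    def _repair(self):
        """Bring every block that still has a surviving copy back to full replication."""
        all_blocks = set().union(*self.nodes.values())
        for block_id in sorted(all_blocks):
            self._place(block_id)


if __name__ == "__main__":
    cluster = ToyCluster(replication=2)
    for name in ("n1", "n2", "n3"):
        cluster.add_node(name)
    cluster.store_block("b1")
    cluster.store_block("b2")
    cluster.remove_node("n1")
    print(cluster.replicas("b1"), cluster.replicas("b2"))
```

In this model, removing a node only triggers re-replication of the blocks it held, and adding a node requires no change to existing data.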
15. Subprojects: ZooKeeper for small shared information (useful for synchronization, locks, leader election, and many other sharing problems in distributed systems). HBase for semi-structured data (an implementation of the Google Bigtable design). Hive for ad hoc query analysis (currently supports insertion into multiple tables, group by, and multi-table selection; order by is under construction). Avro for data serialization, applicable to MapReduce.