Apache Hadoop India Summit 2011 talk "Provisioning Hadoop’s MapReduce in cloud for Effective Storage as a Service" by S. M. Shalinie
1. Provisioning Hadoop’s MapReduce in Cloud for Effective Storage as a Service. Dr. S. M. Shalinie, Associate Professor and Head, Department of Computer Science and Engineering, Thiagarajar College of Engineering, Madurai 625 015
3. The explosive growth of audio, video and user-generated content clearly implies that maintaining data center hardware infrastructure is one of the biggest challenges
9. Impact of Data Growth
- According to a recent Gartner survey, 47% of enterprises identified data growth as their top challenge; the next two challenges were system performance and scalability (37%) and network congestion and connectivity (36%)
- Data growth is particularly costly because it drives up spending on hardware, software, associated maintenance, administration and services
Source: http://www.gartner.com/it/page.jsp?id=1460213
10. Traditional Datacenters
- High performance and a high degree of control
- Building scalable and reliable storage requires an experienced, skillful engineering team
- Upfront and maintenance costs are high; using resources efficiently is the key factor in saving cost
- Consume heavy Internet bandwidth; additional Internet connections and equipment are needed for redundancy or load balancing
- By Moore's law, the hardware price per gigabyte is dropping every day; if a company has deployed too much storage equipment without fully utilizing it, that equipment is wasted
15. Provision for archives [diagram: clients Put and Get objects against S3]
16. Data at Rest
- Integrity: accuracy and consistency of data
- Confidentiality: ensuring privacy of data; ensuring data is accessed only by authorized users
- Information Assurance: measures to ensure availability
- Information Security: protecting data from unauthorized access, use, disclosure, disruption and modification [2]
18. Parallelizing the Encryption Process
- Encryption consumes large amounts of resources and time
- Abundant utilization of resources makes the encryption process effective
- Hadoop's MapReduce provides a large-scale parallel data processing framework for high-end computing applications
- A suitable algorithm is required to perform the encryption process
22. The key has to be generated such that the user has control over the data
23. The key should be strong enough that it is not vulnerable to attacks (like brute force) [diagram: a valid user logs in with user name and password (e.g. user name: hadoop) and supplies a separate file password]
24. Key Management: a unique key is generated per user [diagram: the username and 128-bit file password are hashed with SHA-1 (digest e.g. 1ff360f124b6e2453597010ea589ee6871681840) and fed to DES, producing a key such as so5y/8WBOZlSg4d8]
25. Overall Process [diagram: the user logs in (user name: hadoop, password, file password); the user name and 128-bit file password pass through SHA-1 and DES to produce the encryption key]
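The key-management slides above can be sketched in a few lines. This is a minimal illustration, not the talk's implementation: the slides show the username and file password feeding SHA-1 before DES, but the exact concatenation order and truncation are assumptions made here.

```python
import hashlib

def derive_user_key(username: str, file_password: str) -> bytes:
    # Per-user key as sketched on the slides: SHA-1 over the username
    # and file password, truncated to 128 bits (16 bytes) for use as
    # the cipher key. Concatenation order is an assumption; the slide
    # only shows both values feeding SHA-1.
    digest = hashlib.sha1((username + file_password).encode("utf-8")).digest()
    return digest[:16]

key = derive_user_key("hadoop", "file-password")
```

Because the key is derived only from values the user supplies, the user retains control over the data, as the earlier slide requires.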
27. To adapt the algorithm for a particular application
28. For parallelisation, the mode should support encryption of subsequent blocks independent of each other
- Electronic Code Book (ECB) mode: plaintext is handled one block at a time, and each block is encrypted using the same key
- XEX-TCB-CTS (XTS) mode: each block is encrypted using 2 different keys; the tweak key varies based on the position of the block, and the mode handles a last incomplete block of plaintext [1]
[diagram of XTS: X = AES_K2(tweak) ⊗ α^j; C = AES_K1(P ⊕ X) ⊕ X]
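The block-independence property that makes ECB (and XTS) parallelisable can be shown with a toy sketch. The `toy_encrypt_block` function below is a deliberately insecure stand-in for AES, used only to demonstrate that each ciphertext block depends solely on the key and its own plaintext block:

```python
import hashlib

BLOCK = 16  # 128-bit blocks, as in AES

def toy_encrypt_block(key: bytes, block: bytes) -> bytes:
    # Stand-in for AES on one block: XOR with a key-derived pad.
    # Illustrative only -- not a real cipher.
    pad = hashlib.sha256(key).digest()[:BLOCK]
    return bytes(b ^ p for b, p in zip(block, pad))

def ecb_encrypt(key: bytes, plaintext: bytes) -> bytes:
    # ECB property the talk exploits: every ciphertext block depends
    # only on (key, plaintext block), so these per-block calls can be
    # farmed out to independent map tasks in any order.
    blocks = [plaintext[i:i + BLOCK] for i in range(0, len(plaintext), BLOCK)]
    return b"".join(toy_encrypt_block(key, b) for b in blocks)
```

XTS keeps the same independence but mixes a position-dependent tweak into each block before and after the cipher, so the same parallelisation applies while identical plaintext blocks no longer produce identical ciphertext.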
30. A MapReduce job includes a set of mappers (M1, M2, …, Mm) and reducers (R1, R2, …, Rr)
31. The input is given to mapper in the form of <block_id,object>
32. The object is the data stored in the corresponding block id [3][4]
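The map and reduce phases described above can be sketched as plain Python functions. This is an illustrative model of the dataflow, not the talk's Hadoop job: `toy_encrypt` stands in for the AES-XTS step, and the reducer simply regroups blocks by id.

```python
import hashlib

def toy_encrypt(key: bytes, data: bytes) -> bytes:
    # Placeholder for the AES-XTS step: XOR with a key-derived stream.
    pad = hashlib.sha256(key).digest()
    return bytes(b ^ pad[i % len(pad)] for i, b in enumerate(data))

def map_phase(records, key):
    # Input is a sequence of <block_id, object> pairs; each mapper
    # encrypts its block independently of every other block.
    for block_id, obj in records:
        yield block_id, toy_encrypt(key, obj)

def reduce_phase(mapped):
    # The reducer only regroups encrypted blocks under their ids.
    return dict(mapped)

records = [(0, b"first block"), (1, b"second block")]
ciphertext = reduce_phase(map_phase(records, b"user-key"))
```

Because no mapper reads another mapper's block, the framework is free to schedule the map tasks across racks, which is what the next slide's diagram depicts.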
34. Encryption using MapReduce [diagram: the name node distributes blocks across Rack 1 and Rack 2 to map tasks Map 1 … Map N, each performing AES+XTS; a reducer collects the encrypted output]
35. Storage as a Service [diagram: plaintext arrives at a web server and is encrypted through a MapReduce cluster before being stored in HDFS across Rack 1, Rack 2 and Rack 3]
36. Performance of the Algorithm [plots of time (mins) vs. data size (GB): (i) AES-ECB with mapper only; (ii) AES-XTS with reducer; (iii) AES-XTS with mapper only]
37. Deduplication
- Technique to improve storage utilization by eliminating coarse-grained redundant data
- The process deletes duplicates, leaving only one copy of the data
- The unique copy of the data is referenced using a symbolic link
- By default Hadoop does not support data deduplication
[diagram: File1 <abcd>, File2 <Wxyz> and File3 <abcd> in HDFS; File3 becomes a symbolic link to the single stored copy of <abcd>]
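The slide's scheme can be sketched with content hashing. Since HDFS has no built-in deduplication, this hypothetical sketch models the idea on a local filesystem: identical content is stored once under its digest, and each uploaded file name becomes a symbolic link to that single copy.

```python
import hashlib
import os

def dedup_store(src_path: str, store_dir: str) -> str:
    # Hash-based deduplication sketch (local-filesystem model, not HDFS).
    with open(src_path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    unique = os.path.join(store_dir, digest)           # one copy per content
    link = os.path.join(store_dir, os.path.basename(src_path))
    if not os.path.exists(unique):                     # first copy keeps the bytes
        with open(src_path, "rb") as s, open(unique, "wb") as d:
            d.write(s.read())
    if not os.path.lexists(link):
        os.symlink(unique, link)                       # duplicates cost only a link
    return link
```

As in the slide's diagram, File1 and File3 with content <abcd> would end up as two links pointing at one stored copy.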
39. Among many algorithms, analysis showed that bzip2 gives a better compression ratio for text files
40. MapReduce can be used to compress a set of large text files in an efficient manner [diagram: a MapReduce framework performing compression maps <file_name, hdfs_path, uncompressed file> to <file_name, hdfs_path, compressed file>]
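The mapper in that diagram reduces to one call per record. A minimal sketch using Python's standard `bz2` module (the file-record shape is an assumption; the talk only shows the key/value pairs):

```python
import bz2

def compress_mapper(file_name: str, data: bytes):
    # Map-phase sketch: one record per file. <file_name, bytes> in,
    # <file_name, bz2-compressed bytes> out; bzip2 is the codec the
    # talk found best for text.
    return file_name, bz2.compress(data)

name, packed = compress_mapper("File1", b"abcd " * 10000)
ratio = (5 * 10000) / len(packed)  # compression ratio for this repetitive input
```

Compressing before the encryption step matters because ciphertext is effectively incompressible, which is why the deck applies compression first.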
41. Deduplication and Compression Using MapReduce [diagram: plaintext, audio and video from User1, User2 and User3 pass through a MapReduce job performing deduplication and compression, then a MapReduce job performing encryption, producing compressed output per user]
42. Text Data Results [plots: compression ratio vs. data size (MB); time (mins) vs. data size (GB) for AES+XTS encryption without compression and with compression for text data]
43. Image Data Results [plots: compression ratio and time (mins) vs. data size (GB)]
46. Compression results show that storage requirements are reduced by a ratio of 1:10 for text data and 1:2 for image data
48. Integrate with the bucket system to store objects in the bucket securely and efficiently
49. Validate the results using standard data sets such as Enron
56. Storage space can be managed efficiently by applying a compression technique before performing the encryption strategy
57. Experimental results show that compression followed by encryption using MapReduce is well suited to securing data at rest in the cloud
58. Other projects
1. TCE MR Simulator
- To reduce the execution time of MapReduce jobs
- To design a scheduler with pre-emption support
- To address the HDFS scalability issue
- To index larger files before searching
2. Securing Hadoop Environment
- To develop a bucket management system
- To maintain the integrity of data between nodes during the MapReduce process
3. Parallelization of Machine Learning Algorithms
- To generate frequent item sets using MapReduce for large datasets
59. References
[1] M. Dworkin, "Recommendation for Block Cipher Modes of Operation: The XTS-AES Mode for Confidentiality on Storage Devices", NIST Special Publication 800-38E, US Nat'l Inst. of Standards and Tech., 2010.
[2] Lori M. Kaufman, "Data Security in the World of Cloud Computing", IEEE Security and Privacy, Vol. 2, pp. 61-64, 2010.
[3] Jeffrey Dean and Sanjay Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters", Communications of the ACM, Vol. 51, No. 1, 2008.
[4] http://hadoop.apache.org
[5] Bruce Schneier and Doug Whiting, "A Performance Comparison of the Five AES Finalists", Second AES Candidate Conference, 2000.