A data lake is an architectural approach that allows you to store massive amounts of data in a central location, so that it is readily available to be categorized, processed, analyzed, and consumed by diverse groups within an organization.
In this session, we will introduce the Data Lake concept and its implementation on AWS.
We will explain the different roles our services play and how they fit into the Data Lake picture.
21. Data Lake - The Goal
Emilykil (Own work) [CC BY-SA 4.0 (https://creativecommons.org/licenses/by-sa/4.0)], via Wikimedia Commons
22. Data Lake - What Sometimes Happens
NatalieMaynor from Jackson, Mississippi, USA - Winter Ugliness, CC BY 2.0, https://commons.wikimedia.org/w/index.php?curid=5503067
23. Data Lake vs. Data Mart
Emilykil (Own work) [CC BY-SA 4.0 (https://creativecommons.org/licenses/by-sa/4.0)], via Wikimedia Commons
Abras2010 - Walmart, uploaded by SchuminWeb, CC BY 2.0, https://commons.wikimedia.org/w/index.php?curid=14571617
25. Alooma Usage of S3 as a Data Lake
● Separate the data of different tenants
○ IAM role-based access ensures data isolation
● Allow Alooma tenants to replay their data from any data source or point in time
● Staging area before loading into the data warehouse
● Storage for things that need indefinite retention (e.g. audit logs)
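As an illustration of the IAM-based tenant isolation mentioned above, here is a minimal sketch that builds a policy document limiting a role to a single tenant's key prefix. The bucket name `example-data-lake`, the `tenants/<tenant_id>/` key layout, and the helper name `tenant_policy` are assumptions made for this example; the slides do not describe Alooma's actual layout or policies.

```python
import json


def tenant_policy(bucket: str, tenant_id: str) -> dict:
    """Build an IAM policy document scoped to one tenant's prefix.

    Assumes (hypothetically) that each tenant's objects live under
    tenants/<tenant_id>/ in a shared bucket.
    """
    prefix = f"tenants/{tenant_id}/"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing is allowed only for keys under the tenant's prefix.
                "Sid": "ListOwnPrefix",
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}*"]}},
            },
            {
                # Reads and writes are allowed only on the tenant's objects.
                "Sid": "ReadWriteOwnObjects",
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject"],
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}*",
            },
        ],
    }


if __name__ == "__main__":
    # Hypothetical bucket and tenant names, for illustration only.
    print(json.dumps(tenant_policy("example-data-lake", "tenant-a"), indent=2))
```

Attaching one such policy to each tenant's role means a role can only see and touch keys under its own prefix, which is one way to get the isolation the slide describes.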
27. S3 as Data Lake - Tips and Tricks
● Use server-side encryption to get automatic encryption at rest, but note that it does impact performance
● Loading data in high volume
○ Keys in S3 are partitioned by prefix
○ Use randomly prefixed, or at least sharded, filenames
● Use Object Expiration to avoid storing unnecessary data
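The prefix tip above can be sketched as follows: derive a short hash from the key and put it at the front, so that keys which would otherwise share a common leading prefix (for example, a date) are spread across many prefixes. The helper name `sharded_key`, the 4-character shard length, and the sample key are assumptions made for this example, not something the slides prescribe.

```python
import hashlib


def sharded_key(key: str, shard_chars: int = 4) -> str:
    """Prepend a short, deterministic hash to spread keys across prefixes.

    The hash is derived from the key itself, so the same input always
    maps to the same sharded key and can be recomputed when reading.
    """
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return f"{digest[:shard_chars]}-{key}"


if __name__ == "__main__":
    # Hypothetical key, for illustration only.
    print(sharded_key("2017-11-29/events/part-0001.json"))
```

Because the prefix is computed rather than random, a reader that knows the original key can rebuild the full object key without a lookup table. The trade-off is that listing objects by date is no longer a simple prefix scan.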
Important resource:
https://docs.aws.amazon.com/AmazonS3/latest/dev/request-rate-perf-considerations.html
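The encryption and expiration tips on this slide can be sketched as plain request parameters. The example below only builds the dictionaries; the bucket name, the `staging/` prefix, the 30-day window, and the helper name `expiration_rule` are assumptions made for illustration. With boto3 installed and credentials configured, these dictionaries have the shape expected by `put_bucket_lifecycle_configuration` and `put_object`.

```python
def expiration_rule(prefix: str, days: int) -> dict:
    """Build one S3 lifecycle rule that expires objects under a prefix."""
    rule_id = f"expire-{prefix.strip('/').replace('/', '-')}-after-{days}d"
    return {
        "ID": rule_id,
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Expiration": {"Days": days},
    }


# Expire only the (hypothetical) staging area; data that needs to be kept
# indefinitely, such as audit logs, is simply not covered by any rule.
lifecycle = {"Rules": [expiration_rule("staging/", 30)]}

# Request arguments for an upload encrypted at rest with S3-managed keys.
put_object_args = {
    "Bucket": "example-data-lake",  # hypothetical bucket name
    "Key": "staging/events.json",   # hypothetical object key
    "Body": b"{}",
    "ServerSideEncryption": "AES256",
}

# Assumed usage with boto3 (not executed here):
#   s3 = boto3.client("s3")
#   s3.put_bucket_lifecycle_configuration(
#       Bucket="example-data-lake", LifecycleConfiguration=lifecycle)
#   s3.put_object(**put_object_args)

if __name__ == "__main__":
    print(lifecycle)
```

Scoping the rule to a prefix keeps expiration away from data that must be retained, while the `ServerSideEncryption` argument requests encryption at rest per object.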