By using a Data Lake, you no longer need to worry about structuring or transforming data before storing it. A Data Lake on AWS enables your organization to analyze data more rapidly, helping you quickly discover new business insights. Join us for our webinar to learn about the benefits of building a Data Lake on AWS and how your organization can begin reaping its rewards. In this session, we will share a methodology for implementing a Data Lake on AWS and best practices for getting the most from your Data Lake.
Speaker: Russell Nash,
APAC Solution Architect, DW, AWS APAC
16. Compute flexibility — matching EC2 instance families to workloads (compute / memory / storage):
Compute optimized (C4, C3 families): machine learning
Memory optimized (X1, R3 families): interactive analysis
Storage optimized (D2, I2 families): large HDFS
General purpose (M4, M3 families): batch processing
36. Row format vs. column format for the table:

  ID   Age  State
  123  20   NSW
  345  25   WA
  678  40   VIC
  999  21   WA

ROW FORMAT:    123 20 NSW | 345 25 WA | 678 40 VIC | 999 21 WA
COLUMN FORMAT: 123 345 678 999 | 20 25 40 21 | NSW WA VIC WA
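The difference between the two layouts can be sketched in Python. This is a toy illustration of the idea only, not how columnar formats such as Parquet or ORC actually encode data:

```python
# Toy illustration of row-oriented vs. column-oriented layouts,
# using the small ID/Age/State table from the slide above.
rows = [
    (123, 20, "NSW"),
    (345, 25, "WA"),
    (678, 40, "VIC"),
    (999, 21, "WA"),
]

# ROW FORMAT: all values of each record are stored together.
row_format = [value for record in rows for value in record]

# COLUMN FORMAT: all values of each column are stored together.
column_format = [value for column in zip(*rows) for value in column]

print(row_format)     # [123, 20, 'NSW', 345, 25, 'WA', ...]
print(column_format)  # [123, 345, 678, 999, 20, 25, 40, 21, ...]

# An analytic query like AVG(Age) needs only the Age column.
# In the columnar layout that is one contiguous slice; in the
# row layout every record must be touched to reach each Age.
ages = column_format[4:8]
print(sum(ages) / len(ages))  # 26.5
```

This is why analytic engines scan far less data when tables are stored column-wise: a query that reads two columns out of fifty skips the other forty-eight entirely.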
41. AWS Solution Builder – Data Lake on AWS
Reference Architecture deployment
via CloudFormation
Configures core services to tag,
search and catalogue datasets
Deploys a console to search and
browse available datasets
http://amzn.to/2nTVjcp
Editor's Notes
Let’s look at a traditional analytics pipeline that is characterized by an ETL process that occurs before the data is loaded into the Data Warehouse.
The problem with this is that everyone sees the same curated data, which may be summarized and aggregated.
Users don’t get access to raw data.
The data lake contains the raw data and this then allows different users to have their own ETL processes to format the data the way they need it.
When you look at the requirements for a data lake, Hadoop seems like the perfect choice: it is scalable, extensible, and very flexible. It can run on commodity hardware, has a vast ecosystem of tools, and appears to be cost effective to run.
However, there is one issue that makes Hadoop by itself less appealing as a data lake.
If you use Hadoop’s storage layer (HDFS) to store your data then you are coupling the storage with the compute.
If you need more storage space, you have to add more machines (virtual or otherwise), and this increases your compute capacity as well.
For maximum flexibility and cost effectiveness you need to separate compute and storage and scale them both independently. Using S3 for storage and Amazon EMR as your compute layer allows you to do that.
Amazon EMR simplifies big data processing, providing a managed Hadoop framework that makes it easy, fast, and cost-effective for you to distribute and process vast amounts of your data across dynamically scalable Amazon EC2 instances.
The EMR File System allows EMR clusters to efficiently and securely use Amazon S3 as an object store for Hadoop. You can store your data in Amazon S3 and use multiple Amazon EMR clusters to process the same data set. Each cluster can be optimized for a particular workload, which can be more efficient than a single cluster serving multiple workloads with different requirements. For example, you might have one cluster that is optimized for I/O and another that is optimized for CPU, each processing the same data set in Amazon S3. Additionally, by storing your input and output data in Amazon S3, you can shut down clusters when they are no longer needed.
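A minimal sketch of what a transient, S3-backed cluster looks like through boto3's EMR API (run_job_flow). The bucket name, instance types, counts, and release label below are placeholders, not recommendations:

```python
def build_emr_request(name, log_bucket, core_count, task_count):
    """Build a run_job_flow request for a transient EMR cluster.
    Input and output live at s3:// paths via EMRFS, so the cluster
    can be terminated without losing any data."""
    return {
        "Name": name,
        "ReleaseLabel": "emr-5.36.0",  # placeholder release label
        "LogUri": f"s3://{log_bucket}/logs/",
        "Applications": [{"Name": "Spark"}, {"Name": "Hive"}],
        "Instances": {
            "InstanceGroups": [
                {"InstanceRole": "MASTER", "InstanceType": "m4.large",
                 "InstanceCount": 1},
                # Core nodes run HDFS; keep these On-Demand.
                {"InstanceRole": "CORE", "InstanceType": "m4.large",
                 "InstanceCount": core_count},
                # Task nodes hold no HDFS data, so Spot is safe here.
                {"InstanceRole": "TASK", "InstanceType": "m4.large",
                 "InstanceCount": task_count, "Market": "SPOT"},
            ],
            # Terminate the cluster when its steps finish; the data
            # set stays in S3 for the next cluster to pick up.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }

# With boto3 installed and AWS credentials configured, the request
# would be submitted as:
#   boto3.client("emr").run_job_flow(**build_emr_request(...))
request = build_emr_request("io-optimized-cluster", "my-datalake-bucket", 2, 4)
```

Because each cluster is just a request like this, nothing stops you from running an I/O-optimized and a CPU-optimized cluster side by side against the same S3 data set.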
Amazon EMR makes it easy to use Spot instances so you can save both time and money. Amazon EMR clusters include 'core nodes' that run HDFS and 'task nodes' that do not; task nodes are ideal for Spot because if the Spot price increases and you lose those instances, you will not lose data stored in HDFS.
Amazon EMR supports powerful and proven Hadoop tools such as Hive, Pig, HBase, and Impala. Additionally, it can run distributed computing frameworks besides Hadoop MapReduce such as Spark or Presto using bootstrap actions. You can also use Hue and Zeppelin as GUIs for interacting with applications on your cluster.
At the very heart of solving the constraints customers face is the notion of decoupling storage from compute.
For maximum flexibility and cost effectiveness separating compute and storage allows each to scale independently.
And this is the very first step in building a data lake on AWS.
Athena is a fully managed serverless service.
There is no provisioning or administration to be performed and the service is available instantly.
Pricing is per query, based on the amount of data scanned.
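Since there is no cluster to stand up, querying the lake reduces to a single API call. A minimal sketch using Athena's start_query_execution operation; the SQL, database name, and output bucket are placeholders, and the client is passed in so the same function works with a real boto3 client or a test stub:

```python
def run_athena_query(athena, sql, database, output_bucket):
    """Submit a query to Athena; nothing to provision first.
    `athena` is an Athena client, e.g. boto3.client("athena")."""
    response = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        # Athena writes its result files to this S3 location.
        ResultConfiguration={
            "OutputLocation": f"s3://{output_bucket}/results/"
        },
    )
    # The ID is used to poll status and fetch results later.
    return response["QueryExecutionId"]
```

Because the charge is driven by bytes scanned, the columnar, compressed formats discussed earlier make the same query both faster and cheaper.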