Most companies are overrun with data, yet they lack the critical insights needed to make timely and accurate business decisions. They are missing the opportunity to combine large amounts of new, unstructured big data that resides outside their data warehouse with the trusted, structured data inside it. In this session, we’ll take an in-depth look at how modern data warehousing blends and analyzes all your data, inside and outside your data warehouse, without moving the data, to give you deeper insights to run your business. We’ll cover best practices on how to design optimal schemas, load data efficiently, and optimize your queries to deliver high throughput and performance.
22. Why Amazon S3 for the Data Lake?
Durable
§ Designed for 11 9s of durability
Available
§ Designed for 99.99% availability
High performance
§ Multipart upload
§ Range GET
Scalable
§ Store as much as you need
§ Scale storage and compute independently
§ No minimum usage commitments
Integrated
§ Amazon Redshift / Spectrum
§ Amazon EMR
§ Amazon Athena
§ Amazon DynamoDB
Easy to use
§ Simple REST API
§ AWS SDKs
§ Read-after-create consistency
§ Event notification
§ Lifecycle policies
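The Range GET item on this slide can be illustrated with a minimal sketch. It only computes the HTTP Range header values a client might use to fetch one large object in parallel parts; the function name `byte_ranges` and the part size are hypothetical, chosen for illustration, and no request is actually sent.

```python
from typing import List


def byte_ranges(object_size: int, part_size: int) -> List[str]:
    """Split an object of object_size bytes into HTTP Range header values.

    byte_ranges is a hypothetical helper, not part of any AWS SDK.
    """
    if object_size < 0 or part_size <= 0:
        raise ValueError("object_size must be >= 0 and part_size must be > 0")
    ranges = []
    for start in range(0, object_size, part_size):
        # HTTP byte ranges are inclusive on both ends.
        end = min(start + part_size, object_size) - 1
        ranges.append(f"bytes={start}-{end}")
    return ranges


if __name__ == "__main__":
    # A 250-byte object fetched in 100-byte parts (illustrative sizes).
    for header in byte_ranges(250, 100):
        print(header)
```

Each value could then be passed as the `Range` argument of a GET Object request (for example, the `Range` parameter of boto3's `get_object`), with the parts fetched concurrently and reassembled in order.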
25. But why not use Amazon Athena?
• No Infrastructure or administration
• Zero spin up time
• Transparent upgrades
• Query data in its raw format
• AVRO, Text, CSV, JSON, weblogs, AWS service logs
• Convert to an optimized form like ORC or Parquet for the best performance and lowest cost
• No loading of data, no ETL required
• Stream data directly from Amazon S3, take advantage of Amazon S3 durability and availability
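As a rough sketch of what "no loading of data" looks like in practice, the snippet below builds a Hive-style `CREATE EXTERNAL TABLE` statement that points a table definition at Parquet files already sitting in S3. The helper name `external_table_ddl`, the table and column names, and the bucket path are all assumptions made for illustration; the statement is only generated as text, not submitted to Athena.

```python
from typing import List, Tuple

Column = Tuple[str, str]  # (column name, column type)


def external_table_ddl(
    table: str,
    columns: List[Column],
    partitions: List[Column],
    location: str,
) -> str:
    """Build a CREATE EXTERNAL TABLE statement over Parquet data in S3.

    Hypothetical helper for illustration; all names passed in are examples.
    """
    cols = ",\n  ".join(f"`{name}` {type_}" for name, type_ in columns)
    parts = ", ".join(f"`{name}` {type_}" for name, type_ in partitions)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n  {cols}\n)\n"
        f"PARTITIONED BY ({parts})\n"
        "STORED AS PARQUET\n"
        f"LOCATION '{location}'"
    )


if __name__ == "__main__":
    print(
        external_table_ddl(
            table="weblogs",  # hypothetical table name
            columns=[("url", "string"), ("status", "int")],
            partitions=[("dt", "string")],
            location="s3://example-bucket/weblogs/",  # hypothetical bucket
        )
    )
```

The data stays where it is; only the table metadata is created, and the partition column lets queries skip S3 prefixes they do not need.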
26. Amazon Redshift Spectrum or Amazon Athena?
Amazon Athena
• Interactive, ad-hoc queries using SQL and S3
• Serverless architecture
• Structured and unstructured data
• Reduce the amount of data scanned to reduce cost and increase performance (use compression, partitioning, or convert to columnar format)
• Charged on S3 data scanned
• Fast, simple queries on S3
• Integrates with BI, SQL clients and JDBC tools
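Because Athena is charged on S3 data scanned, the effect of compression, partitioning, and columnar formats can be estimated with simple arithmetic. The sketch below is illustrative only: the function names are hypothetical, and the price per terabyte and the reduction factors are assumed example values, not quoted rates or measured results.

```python
TIB = 2 ** 40  # bytes in one tebibyte


def scan_cost(bytes_scanned: int, price_per_tb: float) -> float:
    """Cost of a query that scans bytes_scanned bytes at price_per_tb per TB."""
    return bytes_scanned / TIB * price_per_tb


def bytes_after_pruning(raw_bytes: int, partition_fraction: float,
                        column_fraction: float, compression_ratio: float) -> int:
    """Estimate bytes scanned once partitioning, column pruning, and
    compression each cut the data down by the given fraction (0 to 1)."""
    return int(raw_bytes * partition_fraction * column_fraction * compression_ratio)


if __name__ == "__main__":
    price = 5.0  # ASSUMED example rate per TB scanned; check current pricing
    raw = TIB    # 1 TB of raw text data (illustrative)
    # ASSUMED reductions: query touches 1/4 of partitions, 1/2 of columns,
    # and the compressed files are 1/2 the raw size.
    pruned = bytes_after_pruning(raw, 0.25, 0.5, 0.5)
    print(f"raw scan:    {scan_cost(raw, price):.4f}")
    print(f"pruned scan: {scan_cost(pruned, price):.4f}")
```

The three factors multiply, which is why combining partitioning with a compressed columnar format tends to matter more than any one of them alone.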
Amazon Redshift Spectrum
• Large sets of structured data
• Combine data in S3 and Amazon Redshift
• Limitless concurrency
• No contention on Redshift Cluster
• Amazon manages cluster scaling to thousands of instances
• Cost-effective S3 storage
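To make "combine data in S3 and Amazon Redshift" concrete, here is a minimal sketch that builds a SQL join between an external (S3-backed) table and a local Redshift table. The helper name `spectrum_join_sql`, the schema names `spectrum` and `public`, and the table and column names are all hypothetical; the query is only produced as a string and is not run against a cluster.

```python
def spectrum_join_sql(external_table: str, local_table: str,
                      join_key: str, group_column: str) -> str:
    """Build a query joining an external table in S3 with a local table.

    Hypothetical helper; the identifiers passed in are illustrative only.
    """
    return (
        f"SELECT l.{group_column}, COUNT(*) AS events\n"
        f"FROM {external_table} AS e\n"
        f"JOIN {local_table} AS l ON e.{join_key} = l.{join_key}\n"
        f"GROUP BY l.{group_column}"
    )


if __name__ == "__main__":
    # 'spectrum.clickstream' stands for a table in an external schema over S3;
    # 'public.customers' stands for a regular table stored in the cluster.
    print(spectrum_join_sql("spectrum.clickstream", "public.customers",
                            "customer_id", "segment"))
```

From the query author's point of view the external table is referenced like any other table; only its schema marks it as living in S3.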
Amazon Redshift
• Multiple and complex joins
• Low IO queries
• Lower variability in latency for use cases with strict SLAs
33. [Diagram labels: Kinesis Firehose; Athena Query Service; Glue]
Data Access & Authorisation
Give your users easy and secure access
Data Ingestion
Get your data into S3 quickly and securely
Processing & Analytics
Use of predictive and prescriptive analytics to gain better understanding
Storage & Catalog
Secure, cost-effective storage in Amazon S3. Robust metadata in AWS Catalog
Protect and Secure
Use entitlements to ensure data is secure and users’ identities are verified
Machine Learning
Predictive analytics
Amazon AI