1. Querying and analyzing
data in Amazon S3
• April 2017
• Dario Rivera, Solutions Architect, AWS
You can find this presentation here: http://tinyurl.com/sfloft-bigdataday-2017-ws1
2. Your Big Data Application Architecture
• Amazon S3 bucket: raw web logs from Firehose
• Amazon EMR: ad-hoc analysis of web logs
• Amazon Redshift: run SQL queries on processed web logs
• Amazon Athena: interactive querying of web logs
• Amazon QuickSight: visualize web logs to discover insights
3. What is qwikLABS?
• Provides access to AWS services for this bootcamp
• No need to provide a credit card
• Lab resources are automatically deleted when you’re finished
http://events-aws.qwiklab.com
• Create an account with the same email that you used to register for this
bootcamp
4. Sign in and start the lab
Once the lab is started, you will see a “Create in Progress” message in the upper right-hand corner.
5. Navigating qwikLABS
• Connect tab: access and login information
• Addl. Info tab: links to interfaces
• Lab Instructions tab: scripts for your labs
6. Everything you need for the lab
• Open the AWS Console, log in, and verify that the following AWS resources have been created:
• One Amazon EMR cluster
• One Amazon Redshift cluster
• Sign up (later) for Amazon QuickSight
8. Relational data warehouse
• Massively parallel; petabyte scale
• Fully managed
• HDD and SSD platforms
• $1,000/TB/year; starts at $0.25/hour
• Amazon Redshift: a lot faster, a lot simpler, a lot cheaper
9. Amazon Redshift architecture
• Leader node
• Simple SQL endpoint
• Stores metadata
• Optimizes query plan
• Coordinates query execution
• Compute nodes
• Local columnar storage
• Parallel/distributed execution of all queries, loads, backups, restores, and resizes
• Start at just $0.25/hour, grow to 2 PB (compressed)
• DC1: SSD; scale from 160 GB to 326 TB
• DS2: HDD; scale from 2 TB to 2 PB
• Clients connect to the leader node via JDBC/ODBC; compute nodes communicate over 10 GigE (HPC) networking and handle ingestion, backup, and restore against Amazon S3
10. Benefit #1: Amazon Redshift is fast
• Parallel and distributed: queries, loads, exports, backups, restores, and resizes all scale across the cluster
11. Benefit #2: Amazon Redshift is fully managed
• Multiple copies of data within the cluster
• Continuous and incremental backups to Amazon S3
• Continuous and incremental backups across regions
• Streaming restore
12. Benefit #3: Security is built-in
• Load encrypted from S3
• SSL to secure data in transit
• ECDHE for perfect forward secrecy
• Amazon VPC for network isolation
• Encryption to secure data at rest
• All blocks on disks & in Amazon S3 encrypted
• Block key, Cluster key, Master key (AES-256)
• On-premises HSM & AWS CloudHSM support
• Audit logging and AWS CloudTrail integration
• SOC 1/2/3, PCI-DSS, FedRAMP, BAA
13. Benefit #4: Amazon Redshift is powerful
• Approximate functions
• User defined functions
• Machine Learning
• Data Science
14. Benefit #5: Amazon Redshift has a large ecosystem
Data integration, systems integrators, and business intelligence partners
15. Activity 1: Deliver data to Redshift using the S3 COPY command
• Time: 5 minutes
• We are going to:
A. Connect to the Redshift cluster and create a table to hold web log data
B. COPY data from S3 into Redshift
C. Run queries against the newly copied data
16. Activity 1A: Connect to Amazon Redshift
• You can connect with pgweb
• Installed and configured for the Redshift Cluster
• Just navigate to pgweb and start interacting
Note: Click on the Addl. Info tab in qwikLABS and then open the pgWeb link in a
new window.
• Or, Use any JDBC/ODBC/libpq client
• Aginity Workbench for Amazon Redshift
• SQL Workbench/J
• DBeaver
• Datagrip
17. Activity 1B: Create table in Redshift
• Create table weblogs to capture the incoming data from the Firehose delivery stream (a hypothetical schema sketch follows below)
Note: You can download the Redshift SQL code from qwikLABS. Click on the Lab Instructions tab in qwikLABS and then download the Redshift SQL file.
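If you don’t have the file handy, a minimal sketch of what the weblogs DDL might look like is below. The column names and types are assumptions, inferred from the fields used later in the lab, not the lab’s exact schema:

CREATE TABLE weblogs (
  host_address    VARCHAR(512),   -- client IP/host (assumed)
  request_time    TIMESTAMP,      -- time of the request (assumed)
  request_type    VARCHAR(32),    -- HTTP method, e.g. GET (assumed)
  request_path    VARCHAR(2048),  -- requested URI path (assumed)
  response_code   INTEGER,        -- HTTP status, e.g. 200/404
  response_size   BIGINT          -- bytes returned (assumed)
);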
18. Activity 1C: Deliver Data to Redshift from S3
• Run the COPY command on Redshift to load data into the weblogs table from S3
1. Clear the last query from the pgweb query window
2. Run the COPY command below (get the access/secret key from the qwikLABS Connect tab) in the query window
COPY weblogs
FROM 's3://bigdataworkshop-sfloft/processed/processed-logs-1.gz'
CREDENTIALS
'aws_access_key_id=<account_access_key>;aws_secret_access_key=<account_secret_key>'
DELIMITER ','
REMOVEQUOTES
MAXERROR 0
GZIP;
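As an aside: outside of a lab you would typically authorize COPY with an IAM role instead of embedding keys. A hedged sketch, where the role ARN is a placeholder and not part of this lab:

COPY weblogs
FROM 's3://bigdataworkshop-sfloft/processed/processed-logs-1.gz'
IAM_ROLE 'arn:aws:iam::<account-id>:role/<redshift-copy-role>'  -- placeholder ARN
DELIMITER ','
REMOVEQUOTES
MAXERROR 0
GZIP;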
19. Review: Amazon Redshift Test Queries
• Find the distribution of response codes over days
• Count the number of 404 response codes
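The deck shows these queries as screenshots; hedged sketches of the two, assuming the weblogs schema above, might look like:

-- Distribution of response codes over days (column names are assumptions)
SELECT TRUNC(request_time) AS request_day,
       response_code,
       COUNT(*) AS requests
FROM weblogs
GROUP BY TRUNC(request_time), response_code
ORDER BY request_day, response_code;

-- Count the number of 404 response codes
SELECT COUNT(*) AS not_found
FROM weblogs
WHERE response_code = 404;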
20. Review: Amazon Redshift Test Queries
• Show all request paths with status “PAGE NOT FOUND” (404)
• Exercise: change ‘request_path’ to ‘request_uri’ in the query
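A sketch of that query under the same assumed schema:

-- Request paths that returned 404; swap request_path for request_uri per the exercise
SELECT request_path, COUNT(*) AS hits
FROM weblogs
WHERE response_code = 404
GROUP BY request_path
ORDER BY hits DESC;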
23. Familiar Technologies Under the Covers
• Presto: used for SQL queries
• In-memory distributed query engine
• ANSI-SQL compatible with extensions
• Hive: used for DDL functionality
• Complex data types
• Multitude of formats
• Supports data partitioning
24. Comparing performance and cost savings for
compression and columnar format
Dataset                        | Size on Amazon S3     | Query run time | Data scanned          | Cost
Data stored as text files      | 1 TB                  | 236 seconds    | 1.15 TB               | $5.75
Data stored in Apache Parquet* | 130 GB                | 6.78 seconds   | 2.51 GB               | $0.013
Savings / speedup              | 87% less with Parquet | 34x faster     | 99% less data scanned | 99.7% savings
(*compressed using Snappy compression)
https://aws.amazon.com/blogs/big-data/analyzing-data-in-s3-using-amazon-athena/
26. Activity 2A: Interactive Querying with Athena
• From the AWS Management Console, click on All Services
27. Activity 2A: Interactive Querying with Athena
• Select Athena from the Analytics section and click on Get Started on the next
page
28. Activity 2A: Interactive Querying with Athena
• Dismiss the window for running the Athena tutorial.
• Dismiss any other tutorial window
29. Activity 2A: Interactive Querying with Athena
• Enter the SQL command to create a table, as follows. The SQL DDL for this exercise can be found on the Lab Instructions tab in the file Athena.sql. Make sure to replace <YOUR-KINESIS-FIREHOSE-DESTINATION-BUCKET> with the bucket location ‘s3://bigdataworkshop-sfloft/raw/’. A hypothetical sketch of such a DDL statement follows below.
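The authoritative DDL ships in Athena.sql; this is only an illustrative sketch. The column names and the regex serde are assumptions, not the lab’s exact file:

CREATE EXTERNAL TABLE IF NOT EXISTS sampledb.weblogs_raw (
  host           STRING,   -- assumed column
  request_time   STRING,   -- assumed column
  request        STRING,   -- assumed column
  response_code  INT,      -- assumed column
  response_size  BIGINT    -- assumed column
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  'input.regex' = '([^ ]*) \\[([^\\]]*)\\] "([^"]*)" ([^ ]*) ([^ ]*).*'
)
LOCATION 's3://bigdataworkshop-sfloft/raw/';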
30. Activity 2A: Interactive Querying with Athena
• Notice that the table will be created in the sample database (sampledb). Click
on Run Query to create the table
31. Activity 2B: Interactive Querying with Athena
• The SQL DDL in the previous step creates a table in Athena based on the data streamed from Kinesis Firehose to S3
• Select sampledb from the database section and click on the eye icon to sample a
few rows of the S3 data
32. Activity 2C: Interactive Querying with Athena
• Run interactive queries (copy the SQL queries from Athena.sql under the Lab Instructions tab) and see the results on the console; an illustrative example follows below
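The real queries are in Athena.sql; as an assumption-laden example of the kind of query you might run against the raw table:

-- Top response codes in the raw logs (table and column names assumed)
SELECT response_code, COUNT(*) AS requests
FROM sampledb.weblogs_raw
GROUP BY response_code
ORDER BY requests DESC
LIMIT 10;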
33. Activity 2D: Interactive Querying with Athena
• Optionally, you can save the results of a query to CSV by choosing the
file icon on the Results pane.
• You can also view the results of previous queries or of queries that may take some time to complete. Choose History, then either search for your query or choose View or Download to see or save the results of previously completed queries. The History view also displays the status of queries that are currently running.
34. Activity 2D: Interactive Querying with Athena
• Exercise: Query results are also stored in Amazon S3 in a bucket called aws-athena-query-results-ACCOUNTID-REGION. Where can you change the default location in the console?
39. Easy to use Spot Instances
• On-demand for core nodes: standard Amazon EC2 pricing for on-demand capacity. Meet your SLA at predictable cost.
• Spot Instances for task nodes: up to 90% off Amazon EC2 on-demand pricing. Exceed your SLA at lower cost.
40. Amazon S3 as your persistent data store
• Separate compute and storage
• Resize and shut down Amazon EMR
clusters with no data loss
• Point multiple Amazon EMR clusters at same
data in Amazon S3
41. EMRFS makes it easier to leverage S3
• Better performance and error handling options
• Transparent to applications – Use “s3://”
• Consistent view
• For consistent list and read-after-write for new puts
• Support for Amazon S3 server-side and client-side encryption
• Faster listing using EMRFS metadata
42. Apache Spark
• Fast, general-purpose engine for large-
scale data processing
• Write applications quickly in Java, Scala,
or Python
• Combine SQL, streaming, and complex
analytics
43. Apache Zeppelin
• Web-based notebook for interactive
analytics
• Multiple language backends
• Apache Spark integration
• Data visualization
• Collaboration
https://zeppelin.incubator.apache.org/
45. Activity 3: Process and Query data using Amazon EMR
• Time: 20 minutes
• We are going to:
A. Use a Zeppelin notebook to interact with the Amazon EMR cluster
B. Process the data delivered to Amazon S3 by Firehose using Apache Spark
C. Query the data processed in the earlier stage and create simple charts
46. Activity 3A: Open the Zeppelin interface
1. Click on the Lab Instructions tab in
qwikLABS and then download the
Zeppelin Notebook
2. Click on the Addl. Info tab in qwikLABS and then open the zeppelin link in a new window.
3. Import the Notebook using the
Import Note link on Zeppelin
interface
48. Activity 3B: Run the notebook
• Enter the S3 bucket name where the logs are delivered by Kinesis Firehose. The bucket name begins with bigdataworkshop-sfloft
• Execute Step 1
• Enter the bucket name (bigdataworkshop-sfloft)
• Execute Step 2
• Change the ‘/*/*/*/*/*.gz’ suffix to ‘/raw/*.gz’
• Create a DataFrame from the dataset delivered by Firehose
• Execute Step 3
• Sample a few rows
49. Activity 3B: Run the notebook
• Execute Step 4 to process the data
• Notice how the ‘REQUEST’ field consists of both the ‘REQUEST PROTOCOL’ and the ‘REQUEST PATH’. Let’s fix that.
• Create a UDF that will split the column and add the new fields to the DataFrame (see the Spark SQL sketch after this list)
• Print the new DataFrame
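The notebook implements this with a UDF; as an illustrative alternative, the same split can be expressed in Spark SQL with the built-in split() function. The view and column names are assumptions, as is the ‘GET /path HTTP/1.1’ request format:

-- Split the combined request string into its parts (illustrative names)
SELECT split(request, ' ')[0] AS request_type,      -- e.g. GET
       split(request, ' ')[1] AS request_path,      -- e.g. /index.html
       split(request, ' ')[2] AS request_protocol   -- e.g. HTTP/1.1
FROM weblogs_raw;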
50. Activity 3B: Run the notebook
• Execute Step 6
• Register the DataFrame as a temporary table
• Now you can run SQL queries on the temporary table.
• Execute the next 3 steps and observe the charts created
• What did you learn about the dataset?
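For reference, an illustrative Spark SQL query against the temporary table, of the kind that drives the Zeppelin charts (assuming it is registered as weblogs with the columns sketched earlier):

-- Requests per response code, suitable for a bar chart
SELECT response_code, COUNT(*) AS requests
FROM weblogs
GROUP BY response_code
ORDER BY requests DESC;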
51. Review : Ad-hoc analysis using Amazon EMR
• You just learned how to process and query data using Amazon EMR with Apache Spark
• Amazon EMR has many other frameworks available for you to use
• Hive, Presto, Flink, Pig, MapReduce
• Hue, Oozie, HBase
53. Fast, Easy Ad-Hoc Analytics for
Anyone, Everywhere
• Ease of use targeted at business users.
• Blazing fast performance powered by SPICE.
• Broad connectivity with AWS data services,
on-premises data, files and business
applications.
• Cloud-native solution that scales
automatically.
• 1/10th the cost of traditional BI solutions.
• Create, share and collaborate with anyone
in your organization, on the web or on
mobile.
54. Connect, SPICE, Analyze
QuickSight allows you to connect to data from a wide variety of AWS, third-party, and on-premises sources and import it into SPICE or query it directly. Users can then easily explore, analyze, and share their insights with anyone.
Amazon RDS
Amazon S3
Amazon Redshift
56. Activity 4: Visualization with QuickSight
• We are going to:
A. Register for a QuickSight account
B. Connect to the Redshift Cluster
C. Create visualizations for analysis to answer questions like:
• What are the most common HTTP requests and how successful (response code of 200) are they?
• Which are the most requested URIs?
57. Activity 4A: QuickSight Registration
• Go to AWS Console, click on
QuickSight from the Analytics
section.
• Click on Signup in the next window
• Make sure the subscription type is
Standard and click Continue on the
next screen
58. Activity 4A: QuickSight Registration
• On the Subscription Type page, enter
the account name (see note below)
• Enter your email address
• Select US West region
• Check the S3 (all buckets) box
Note: QuickSight Account name is the
AWS account number from qwikLABS in
the Connect tab
59. Activity 4A: QuickSight Registration
• If a pop-up box to choose S3 buckets appears, click Select buckets
• Click on Go To Amazon Quicksight
• Dismiss the next screen
60. Activity 4B: Connect to data source
• Click on Manage Data to
create a new data set in
QuickSight
• Choose Redshift (Auto-discovered) as the data source. QuickSight auto-discovers databases associated with your AWS account (the Redshift database in this case)
61. Activity 4B: Connect to Amazon Redshift
Note: You can get the Redshift database
password from qwikLABS by navigating to
the “Custom Connection Details” section in
the Connect tab
63. Activity 4D: Ingest data into SPICE
• SPICE is Amazon QuickSight’s in-memory optimized calculation engine, designed specifically for fast, ad-hoc data visualization
• You can improve the performance of
database data sets by importing the
data into SPICE instead of using a
direct query to the database
64. Activity 4E: Creating your first analysis
• What are the most requested
http request types and their
corresponding response codes
for this site?
• Simply select request_type and response_code and let AutoGraph create the optimal visualization
65. Review – Creating your Analysis
• Exercise: Add a visual to show which URIs are the most requested.
66. Your Big Data Application Architecture
• Amazon S3 bucket: raw web logs from Firehose
• Amazon EMR: ad-hoc analysis of web logs
• Amazon Redshift: run SQL queries on processed web logs
• Amazon Athena: interactive querying of web logs
• Amazon QuickSight: visualize web logs to discover insights
Amazon EMR is more than just MapReduce.
Bootstrap actions available on GitHub
In the next few slides, we’ll talk about data persistence models with Amazon EMR. The first pattern is Amazon S3 in place of HDFS. With this persistence model, data is stored on Amazon S3; HDFS plays no role in storing it and is only there for temporary storage. Another common claim is that storing data on Amazon S3 instead of HDFS slows jobs down a lot because data has to be copied to HDFS/disk before processing starts. That’s incorrect: if you tell Hadoop that your data is on Amazon S3, Hadoop reads directly from Amazon S3 and streams data to the mappers without touching the disk. To be completely correct, data does touch HDFS when it has to shuffle from mappers to reducers, but as mentioned, HDFS acts as the temp space and nothing more.
EMRFS is an implementation of HDFS used for reading and writing regular files from Amazon EMR directly to Amazon S3. EMRFS provides the convenience of storing persistent data in Amazon S3 for use with Hadoop while also providing features like Amazon S3 server-side encryption, read-after-write consistency, and list consistency.
And every other feature that comes with Amazon S3, such as SSE, lifecycle policies, etc. And again, keep in mind that Amazon S3 as the storage layer is the main reason why we can build elastic clusters where nodes get added and removed dynamically without any data loss.
Write programs in terms of transformations on distributed data sets.
The SSH command below enables “port forwarding” on TCP 9026 so you can use http://localhost:9026 from a web browser on your local machine to view cluster details and job progress
QuickSight is a fast, easy-to-use, cloud-powered business analytics service that lets business users quickly visualize, explore, and share insights from their data with anyone in their organization, on the web or on mobile.
QuickSight combines an elegant, easy-to-use interface with blazing-fast performance powered by SPICE to provide business analytics at 1/10th the cost of traditional BI solutions.