1. Querying and analyzing
data in Amazon S3
• April 2017
• Dario Rivera, Solutions Architect, AWS
You can find this presentation here: http://tinyurl.com/sfloft-bigdataday-2017-ws1
2. Your Big Data Application Architecture
• Amazon S3 bucket: raw web logs from Firehose
• Amazon EMR: ad-hoc analysis of web logs
• Amazon Redshift: run SQL queries on processed web logs
• Amazon Athena: interactive querying of web logs
• Amazon QuickSight: visualize web logs to discover insights
3. What is qwikLABS?
• Provides access to AWS services for this bootcamp
• No need to provide a credit card
• Lab resources are automatically deleted when you’re finished
http://events-aws.qwiklab.com
• Create an account with the same email that you used to register for this
bootcamp
4. Sign in and start the lab
Once the lab is started, you will see a “Create in Progress” message in the upper right-hand corner.
5. Navigating qwikLABS
• Connect tab: access and login information
• Addl. Info tab: links to interfaces
• Lab Instructions tab: scripts for your labs
6. Everything you need for the lab
• Open the AWS Console, log in, and verify that the following AWS resources have been created:
• One Amazon EMR cluster
• One Amazon Redshift cluster
• Sign up (later) for Amazon QuickSight
8. Relational data warehouse
• Massively parallel; petabyte scale
• Fully managed
• HDD and SSD platforms
• $1,000/TB/year; starts at $0.25/hour
• Amazon Redshift: a lot faster, a lot simpler, a lot cheaper
9. Amazon Redshift architecture
• Leader node
• Simple SQL endpoint
• Stores metadata
• Optimizes query plan
• Coordinates query execution
• Compute nodes
• Local columnar storage
• Parallel/distributed execution of all queries, loads, backups, restores, and resizes
• Start at just $0.25/hour, grow to 2 PB (compressed)
• DC1: SSD; scale from 160 GB to 326 TB
• DS2: HDD; scale from 2 TB to 2 PB
• Clients connect to the leader node via JDBC/ODBC; compute nodes communicate over 10 GigE (HPC) networking and handle ingestion, backup, and restore against Amazon S3
10. Benefit #1: Amazon Redshift is fast
• Parallel and distributed: queries, loads, exports, backups, restores, and resizes all scale across the cluster
11. Benefit #2: Amazon Redshift is fully managed
• Multiple copies of data within the cluster
• Continuous and incremental backups to Amazon S3
• Continuous and incremental backups across regions
• Streaming restore
12. Benefit #3: Security is built-in
• Load encrypted from S3
• SSL to secure data in transit
• ECDHE for perfect forward secrecy
• Amazon VPC for network isolation
• Encryption to secure data at rest
• All blocks on disks & in Amazon S3 encrypted
• Block key, Cluster key, Master key (AES-256)
• On-premises HSM & AWS CloudHSM support
• Audit logging and AWS CloudTrail integration
• SOC 1/2/3, PCI-DSS, FedRAMP, BAA
13. Benefit #4: Amazon Redshift is powerful
• Approximate functions
• User defined functions
• Machine Learning
• Data Science
14. Benefit #5: Amazon Redshift has a large ecosystem
Data integration, systems integrators, and business intelligence partners
15. Activity 1: Deliver data to Redshift using the S3 COPY command
• Time: 5 minutes
• We are going to:
A. Connect to the Redshift cluster and create a table to hold web log data
B. COPY data from S3 into Redshift
C. Run queries against the newly copied data
16. Activity 1A: Connect to Amazon Redshift
• You can connect with pgweb
• Installed and configured for the Redshift Cluster
• Just navigate to pgweb and start interacting
Note: Click on the Addl. Info tab in qwikLABS and then open the pgWeb link in a
new window.
• Or, Use any JDBC/ODBC/libpq client
• Aginity Workbench for Amazon Redshift
• SQL Workbench/J
• DBeaver
• Datagrip
17. Activity 1B: Create table in Redshift
• Create table weblogs to capture the incoming data from the Firehose delivery stream (a hypothetical schema sketch follows below)
Note: You can download the Redshift SQL code from qwikLABS. Click on the Lab Instructions tab in qwikLABS and then download the Redshift SQL file.
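If you don’t have the file handy, a minimal sketch of what the weblogs DDL might look like is below. The column names and types are assumptions, inferred from the fields used later in the lab, not the lab’s exact schema:

CREATE TABLE weblogs (
  host_address    VARCHAR(512),   -- client IP/host (assumed)
  request_time    TIMESTAMP,      -- time of the request (assumed)
  request_type    VARCHAR(32),    -- HTTP method, e.g. GET (assumed)
  request_path    VARCHAR(2048),  -- requested URI path (assumed)
  response_code   INTEGER,        -- HTTP status, e.g. 200/404
  response_size   BIGINT          -- bytes returned (assumed)
);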
18. Activity 1C: Deliver Data to Redshift from S3
• Run the COPY command on Redshift to load data into the weblogs table from S3
1. Clear the last query from the pgweb query window
2. Run the COPY command below (get the access/secret key from the qwikLABS Connect tab) in the query window
COPY weblogs
FROM 's3://bigdataworkshop-sfloft/processed/processed-logs-1.gz'
CREDENTIALS
'aws_access_key_id=<account_access_key>;aws_secret_access_key=<account_secret_key>'
DELIMITER ','
REMOVEQUOTES
MAXERROR 0
GZIP;
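As an aside: outside of a lab you would typically authorize COPY with an IAM role instead of embedding keys. A hedged sketch, where the role ARN is a placeholder and not part of this lab:

COPY weblogs
FROM 's3://bigdataworkshop-sfloft/processed/processed-logs-1.gz'
IAM_ROLE 'arn:aws:iam::<account-id>:role/<redshift-copy-role>'  -- placeholder ARN
DELIMITER ','
REMOVEQUOTES
MAXERROR 0
GZIP;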
19. Review: Amazon Redshift Test Queries
• Find the distribution of response codes over days
• Count the number of 404 response codes
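The deck shows these queries as screenshots; hedged sketches of the two, assuming the weblogs schema above, might look like:

-- Distribution of response codes over days (column names are assumptions)
SELECT TRUNC(request_time) AS request_day,
       response_code,
       COUNT(*) AS requests
FROM weblogs
GROUP BY TRUNC(request_time), response_code
ORDER BY request_day, response_code;

-- Count the number of 404 response codes
SELECT COUNT(*) AS not_found
FROM weblogs
WHERE response_code = 404;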
20. Review: Amazon Redshift Test Queries
• Show all request paths with status “PAGE NOT FOUND” (404)
• Exercise: change ‘request_path’ to ‘request_uri’ in the query
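A sketch of that query under the same assumed schema:

-- Request paths that returned 404; swap request_path for request_uri per the exercise
SELECT request_path, COUNT(*) AS hits
FROM weblogs
WHERE response_code = 404
GROUP BY request_path
ORDER BY hits DESC;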
23. Familiar Technologies Under the Covers
• Presto: used for SQL queries
• In-memory distributed query engine
• ANSI-SQL compatible with extensions
• Hive: used for DDL functionality
• Complex data types
• Multitude of formats
• Supports data partitioning
24. Comparing performance and cost savings for
compression and columnar format
Dataset                        | Size on Amazon S3     | Query run time | Data scanned          | Cost
Data stored as text files      | 1 TB                  | 236 seconds    | 1.15 TB               | $5.75
Data stored in Apache Parquet* | 130 GB                | 6.78 seconds   | 2.51 GB               | $0.013
Savings / speedup              | 87% less with Parquet | 34x faster     | 99% less data scanned | 99.7% savings
(*compressed using Snappy compression)
https://aws.amazon.com/blogs/big-data/analyzing-data-in-s3-using-amazon-athena/
26. Activity 2A: Interactive Querying with Athena
• From the AWS Management Console, click on All Services
27. Activity 2A: Interactive Querying with Athena
• Select Athena from the Analytics section and click on Get Started on the next
page
28. Activity 2A: Interactive Querying with Athena
• Dismiss the window for running the Athena tutorial.
• Dismiss any other tutorial window
29. Activity 2A: Interactive Querying with Athena
• Enter the SQL command to create a table, as follows. The SQL DDL for this exercise can be found on the Lab Instructions tab in the file Athena.sql. Make sure to replace <YOUR-KINESIS-FIREHOSE-DESTINATION-BUCKET> with the bucket location ‘s3://bigdataworkshop-sfloft/raw/’. A hypothetical sketch of such a DDL statement follows below.
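The authoritative DDL ships in Athena.sql; this is only an illustrative sketch. The column names and the regex serde are assumptions, not the lab’s exact file:

CREATE EXTERNAL TABLE IF NOT EXISTS sampledb.weblogs_raw (
  host           STRING,   -- assumed column
  request_time   STRING,   -- assumed column
  request        STRING,   -- assumed column
  response_code  INT,      -- assumed column
  response_size  BIGINT    -- assumed column
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  'input.regex' = '([^ ]*) \\[([^\\]]*)\\] "([^"]*)" ([^ ]*) ([^ ]*).*'
)
LOCATION 's3://bigdataworkshop-sfloft/raw/';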
30. Activity 2A: Interactive Querying with Athena
• Notice that the table will be created in the sample database (sampledb). Click
on Run Query to create the table
31. Activity 2B: Interactive Querying with Athena
• The SQL DDL in the previous step creates a table in Athena based on the data streamed from Kinesis Firehose to S3
• Select sampledb from the database section and click on the eye icon to sample a
few rows of the S3 data
32. Activity 2C: Interactive Querying with Athena
• Run interactive queries (copy the SQL queries from Athena.sql under the Lab Instructions tab) and see the results on the console; an illustrative example follows below
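The real queries are in Athena.sql; as an assumption-laden example of the kind of query you might run against the raw table:

-- Top response codes in the raw logs (table and column names assumed)
SELECT response_code, COUNT(*) AS requests
FROM sampledb.weblogs_raw
GROUP BY response_code
ORDER BY requests DESC
LIMIT 10;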
33. Activity 2D: Interactive Querying with Athena
• Optionally, you can save the results of a query to CSV by choosing the
file icon on the Results pane.
• You can also view the results of previous queries or of queries that may take some time to complete. Choose History, then either search for your query or choose View or Download to see or save the results of previously completed queries. The History view also displays the status of queries that are currently running.
34. Activity 2D: Interactive Querying with Athena
• Exercise: Query results are also stored in Amazon S3 in a bucket called aws-athena-query-results-ACCOUNTID-REGION. Where can you change the default location in the console?
39. Easy to use Spot Instances
• On-demand for core nodes: standard Amazon EC2 pricing for on-demand capacity. Meet your SLA at predictable cost.
• Spot Instances for task nodes: up to 90% off Amazon EC2 on-demand pricing. Exceed your SLA at lower cost.
40. Amazon S3 as your persistent data store
• Separate compute and storage
• Resize and shut down Amazon EMR
clusters with no data loss
• Point multiple Amazon EMR clusters at same
data in Amazon S3
41. EMRFS makes it easier to leverage S3
• Better performance and error handling options
• Transparent to applications – Use “s3://”
• Consistent view
• For consistent list and read-after-write for new puts
• Support for Amazon S3 server-side and client-side encryption
• Faster listing using EMRFS metadata
42. Apache Spark
• Fast, general-purpose engine for large-
scale data processing
• Write applications quickly in Java, Scala,
or Python
• Combine SQL, streaming, and complex
analytics
43. Apache Zeppelin
• Web-based notebook for interactive
analytics
• Multiple language backends
• Apache Spark integration
• Data visualization
• Collaboration
https://zeppelin.incubator.apache.org/
45. Activity 3: Process and Query data using Amazon EMR
• Time: 20 minutes
• We are going to:
A. Use a Zeppelin notebook to interact with the Amazon EMR cluster
B. Process the data delivered to Amazon S3 by Firehose using Apache Spark
C. Query the data processed in the earlier stage and create simple charts
46. Activity 3A: Open the Zeppelin interface
1. Click on the Lab Instructions tab in
qwikLABS and then download the
Zeppelin Notebook
2. Click on the Addl. Info tab in qwikLABS and then open the zeppelin link in a new window.
3. Import the Notebook using the
Import Note link on Zeppelin
interface
48. Activity 3B: Run the notebook
• Enter the S3 bucket name where the logs are delivered by Kinesis Firehose. The bucket name begins with bigdataworkshop-sfloft
• Execute Step 1
• Enter the bucket name (bigdataworkshop-sfloft)
• Execute Step 2
• Change the ‘/*/*/*/*/*.gz’ suffix to ‘/raw/*.gz’
• Create a DataFrame from the dataset delivered by Firehose
• Execute Step 3
• Sample a few rows
49. Activity 3B: Run the notebook
• Execute Step 4 to process the data
• Notice how the ‘REQUEST’ field consists of both the ‘REQUEST PROTOCOL’ and the ‘REQUEST PATH’. Let’s fix that.
• Create a UDF that will split the column and add the new fields to the DataFrame (see the Spark SQL sketch after this list)
• Print the new DataFrame
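The notebook implements this with a UDF; as an illustrative alternative, the same split can be expressed in Spark SQL with the built-in split() function. The view and column names are assumptions, as is the ‘GET /path HTTP/1.1’ request format:

-- Split the combined request string into its parts (illustrative names)
SELECT split(request, ' ')[0] AS request_type,      -- e.g. GET
       split(request, ' ')[1] AS request_path,      -- e.g. /index.html
       split(request, ' ')[2] AS request_protocol   -- e.g. HTTP/1.1
FROM weblogs_raw;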
50. Activity 3B: Run the notebook
• Execute Step 6
• Register the DataFrame as a temporary table
• Now you can run SQL queries on the temporary table.
• Execute the next 3 steps and observe the charts created
• What did you learn about the dataset?
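For reference, an illustrative Spark SQL query against the temporary table, of the kind that drives the Zeppelin charts (assuming it is registered as weblogs with the columns sketched earlier):

-- Requests per response code, suitable for a bar chart
SELECT response_code, COUNT(*) AS requests
FROM weblogs
GROUP BY response_code
ORDER BY requests DESC;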
51. Review : Ad-hoc analysis using Amazon EMR
• You just learned how to process and query data using Amazon EMR with Apache Spark
• Amazon EMR has many other frameworks available for you to use
• Hive, Presto, Flink, Pig, MapReduce
• Hue, Oozie, HBase
53. Fast, Easy Ad-Hoc Analytics for
Anyone, Everywhere
• Ease of use targeted at business users.
• Blazing fast performance powered by SPICE.
• Broad connectivity with AWS data services,
on-premises data, files and business
applications.
• Cloud-native solution that scales
automatically.
• 1/10th the cost of traditional BI solutions.
• Create, share and collaborate with anyone
in your organization, on the web or on
mobile.
54. Connect, SPICE, Analyze
QuickSight allows you to connect to data from a wide variety of AWS, third-party, and on-premises sources and import it into SPICE or query it directly. Users can then easily explore, analyze, and share their insights with anyone.
Amazon RDS
Amazon S3
Amazon Redshift
56. Activity 4: Visualization with QuickSight
• We are going to:
A. Register for a QuickSight account
B. Connect to the Redshift Cluster
C. Create visualizations for analysis to answer questions like:
• What are the most common HTTP requests and how successful (response code of 200) are they?
• Which are the most requested URIs?
57. Activity 4A: QuickSight Registration
• Go to AWS Console, click on
QuickSight from the Analytics
section.
• Click on Signup in the next window
• Make sure the subscription type is
Standard and click Continue on the
next screen
58. Activity 4A: QuickSight Registration
• On the Subscription Type page, enter
the account name (see note below)
• Enter your email address
• Select US West region
• Check the S3 (all buckets) box
Note: QuickSight Account name is the
AWS account number from qwikLABS in
the Connect tab
59. Activity 4A: QuickSight Registration
• If a pop-up box to choose S3 buckets appears, click Select buckets
• Click on Go To Amazon Quicksight
• Dismiss the next screen
60. Activity 4B: Connect to data source
• Click on Manage Data to
create a new data set in
QuickSight
• Choose Redshift (Auto-discovered) as the data source. QuickSight auto-discovers databases associated with your AWS account (the Redshift database in this case)
61. Activity 4B: Connect to Amazon Redshift
Note: You can get the Redshift database
password from qwikLABS by navigating to
the “Custom Connection Details” section in
the Connect tab
63. Activity 4D: Ingest data into SPICE
• SPICE is Amazon QuickSight’s in-memory optimized calculation engine, designed specifically for fast, ad-hoc data visualization
• You can improve the performance of
database data sets by importing the
data into SPICE instead of using a
direct query to the database
64. Activity 4E: Creating your first analysis
• What are the most requested
http request types and their
corresponding response codes
for this site?
• Simply select request_type and response_code and let AutoGraph create the optimal visualization
65. Review – Creating your Analysis
• Exercise: Add a visual to show which URIs are the most requested.
66. Your Big Data Application Architecture
• Amazon S3 bucket: raw web logs from Firehose
• Amazon EMR: ad-hoc analysis of web logs
• Amazon Redshift: run SQL queries on processed web logs
• Amazon Athena: interactive querying of web logs
• Amazon QuickSight: visualize web logs to discover insights
Amazon EMR is more than just MapReduce.
Bootstrap actions available on GitHub
In the next few slides, we’ll talk about data persistence models with Amazon EMR. The first pattern is Amazon S3 in place of HDFS. With this persistence model, data is stored on Amazon S3; HDFS plays no role in storing it and is only there for temporary storage. Another common claim is that storing data on Amazon S3 instead of HDFS slows jobs down a lot because data has to be copied to HDFS/disk before processing starts. That’s incorrect: if you tell Hadoop that your data is on Amazon S3, Hadoop reads directly from Amazon S3 and streams data to the mappers without touching the disk. To be completely correct, data does touch HDFS when it has to shuffle from mappers to reducers, but as mentioned, HDFS acts as the temp space and nothing more.
EMRFS is an implementation of HDFS used for reading and writing regular files from Amazon EMR directly to Amazon S3. EMRFS provides the convenience of storing persistent data in Amazon S3 for use with Hadoop while also providing features like Amazon S3 server-side encryption, read-after-write consistency, and list consistency.
And every other feature that comes with Amazon S3, such as SSE, lifecycle policies, etc. And again, keep in mind that Amazon S3 as the storage layer is the main reason why we can build elastic clusters where nodes get added and removed dynamically without any data loss.
Write programs in terms of transformations on distributed data sets.
The SSH command below enables “port forwarding” on TCP 9026 so you can use http://localhost:9026 from a web browser on your local machine to view cluster details and job progress
QuickSight is a fast, easy-to-use, cloud-powered business analytics service that lets business users quickly visualize, explore, and share insights from their data with anyone in their organization, on the web or on mobile.
QuickSight combines an elegant, easy-to-use interface with blazing-fast performance powered by SPICE to provide business analytics at 1/10th the cost of traditional BI solutions.