AWS November Webinar Series - Advanced Analytics with Amazon Redshift and the Machine Learning Service

© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Wangechi Doble, Solutions Architect
Radhika Ravirala, Solutions Architect
November 19, 2015
Advanced Analytics with Amazon Redshift and Amazon Machine Learning

What to Expect from the Webinar
Current Trends
Amazon Redshift
• New SQL Functions
• User Defined Functions (UDFs)
• Connecting R with Amazon Redshift
Amazon Machine Learning
• Amazon ML Overview
• Developing with Amazon ML
Visualizing with Amazon QuickSight
Demo
Q&A

Data is part of the fabric of the applications
• Front-end and UX
• Mobile
• Back-end and operations
• Data and analytics

Three types of data-driven development
• Retrospective: analysis and reporting
• Here-and-now: real-time processing and dashboards
• Predictions: to enable smart applications

Machine learning and smart applications
Machine learning is the technology that automatically finds patterns in your data and uses them to make predictions for new data points as they become available
Your data + machine learning = smart applications

Smart applications by example
• Based on what you know about the user: Will they use your product?
• Based on what you know about an order: Is this order fraudulent?
• Based on what you know about a news article: What other articles are interesting?

Machine Learning Use Cases
• Personalization: Recommending content, predictive content loading, improving user experience, …
• Targeted marketing: Matching customers and offers, choosing marketing campaigns, cross-selling and up-selling, …
• Content classification: Categorizing documents, matching hiring managers and resumes, …
• Churn prediction: Finding customers who are likely to stop using the service, free-tier upgrade targeting, …
• Customer support: Predictive routing of customer emails, social media listening, …
• Fraud detection: Detecting fraudulent transactions, filtering spam emails, flagging suspicious reviews, …

Amazon Redshift
Fast, simple, petabyte-scale data warehousing for less than $1,000/TB/year

Amazon Redshift system architecture
Leader node
• SQL endpoint
• Stores metadata
• Coordinates query execution
Compute nodes
• Local, columnar storage
• Execute queries in parallel
• Load, backup, restore via Amazon S3; load from Amazon DynamoDB, Amazon EMR, or SSH
Two hardware platforms
• Optimized for data processing
• DS2: HDD; scale from 2TB to 2PB
• DC1: SSD; scale from 160GB to 326TB
[Diagram: SQL clients/BI tools connect over JDBC/ODBC to the leader node; the leader node and the compute nodes (each 16 cores, 128GB RAM, 16TB disk) are linked by 10 GigE (HPC); compute nodes exchange data with S3 / EMR / DynamoDB / SSH]
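The division of labor between the leader node and the compute nodes can be illustrated with a toy scatter/gather aggregate. This is only a sketch of the general idea, not how Amazon Redshift is implemented: the function names, the round-robin row distribution, and the use of threads to stand in for compute nodes are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def compute_node_sum(rows):
    """Work done by one hypothetical compute node: aggregate its local rows."""
    return sum(rows), len(rows)


def leader_average(values, node_count=4):
    """Toy leader node: distribute rows, aggregate in parallel, merge results."""
    # Round-robin distribution of rows across nodes (an assumed distribution style).
    slices = [values[i::node_count] for i in range(node_count)]
    # Each "compute node" produces a partial (sum, count) for its own slice.
    with ThreadPoolExecutor(max_workers=node_count) as pool:
        partials = list(pool.map(compute_node_sum, slices))
    # The leader merges the partial aggregates into the final answer.
    total = sum(partial_sum for partial_sum, _ in partials)
    count = sum(partial_count for _, partial_count in partials)
    return total / count if count else None


sales = [120.0, 80.0, 100.0, 60.0, 140.0, 100.0]
average = leader_average(sales)
print(average)
```

The point of the sketch is that only small partial results travel back to the coordinator, while the row-level work stays on the nodes that hold the data.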

New SQL Functions
We add SQL functions regularly to expand Amazon Redshift’s query capabilities
Added 25+ window and aggregate functions since launch, including:
LISTAGG
[APPROXIMATE] COUNT
DROP IF EXISTS, CREATE IF NOT EXISTS
REGEXP_SUBSTR, _COUNT, _INSTR, _REPLACE
PERCENTILE_CONT, _DISC, MEDIAN
PERCENT_RANK, RATIO_TO_REPORT
We’ll continue iterating but also want to enable you to write your own
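For readers less familiar with the window and aggregate functions listed above, the following plain-Python sketch shows what RATIO_TO_REPORT, PERCENT_RANK, and MEDIAN compute over a single partition. It follows the standard SQL definitions as I understand them and is meant as a mental model only; the Python function names are illustrative and this is not Redshift code.

```python
def ratio_to_report(values):
    """Each value divided by the sum of all values in the partition."""
    total = sum(values)
    return [value / total for value in values]


def percent_rank(values):
    """(rank - 1) / (rows - 1), where tied values share the lowest rank."""
    row_count = len(values)
    if row_count == 1:
        return [0.0]
    ordered = sorted(values)
    # ordered.index(v) counts the strictly smaller values, which equals rank - 1.
    return [ordered.index(value) / (row_count - 1) for value in values]


def median(values):
    """Middle value; the mean of the two middle values for an even row count."""
    ordered = sorted(values)
    middle = len(ordered) // 2
    if len(ordered) % 2:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2


monthly_sales = [10, 30, 30, 60]
print(ratio_to_report(monthly_sales))
print(percent_rank(monthly_sales))
print(median(monthly_sales))
```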

Scalar User Defined Functions
You can write UDFs using Python 2.7
• Syntax is largely identical to PostgreSQL UDF
• Python execution is performed in parallel
• System and network calls within UDFs are prohibited
Comes integrated with Pandas, NumPy, SciPy, DateUtil and Pytz analytic libraries
• Import your own libraries for even more flexibility
• Take advantage of thousands of functions available through Python libraries to perform operations not easily expressed in SQL

Scalar UDF example – URL parsing
Template
CREATE [ OR REPLACE ] FUNCTION f_function_name
( [ argument_name arg_type, ... ] )
RETURNS data_type
{ VOLATILE | STABLE | IMMUTABLE }
AS $$
python_program
$$ LANGUAGE plpythonu;
Example
CREATE FUNCTION f_hostname (url VARCHAR)
RETURNS varchar
IMMUTABLE AS $$
    import urlparse
    return urlparse.urlparse(url).hostname
$$ LANGUAGE plpythonu;
Without the UDF, extracting the hostname takes a regular expression:
SELECT REGEXP_REPLACE(url, '(https?)://([^@]*@)?([^:/]*)([/:].*|$)', '\3')
FROM table;
With the UDF:
SELECT f_hostname(url)
FROM table;
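Because the body of a scalar UDF is ordinary Python, it can be tried out locally before it is wrapped in CREATE FUNCTION. The sketch below does that for the hostname logic. It assumes a local Python 3 interpreter, where the Python 2.7 `urlparse` module used inside Redshift lives at `urllib.parse`; the `None` guard and the sample URLs are additions for illustration.

```python
from urllib.parse import urlparse  # Python 3 location of the 2.7 `urlparse` module


def f_hostname(url):
    """Local stand-in for the UDF body: return the hostname part of a URL."""
    if url is None:
        # Assumption: treat a SQL NULL (None in Python) as NULL in the result.
        return None
    return urlparse(url).hostname


sample_urls = [
    "https://user@example.com:8080/path?q=1",
    "http://docs.aws.amazon.com/redshift/",
]
for url in sample_urls:
    print(url, "->", f_hostname(url))
```

Compared with the regular expression, the library call already handles credentials and port numbers in the URL, which is the readability argument the slide is making.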

Statistical UDF Example
CREATE FUNCTION f_z_test_by_pval (alpha float,
x_bar float, test_val float, sigma float, n float)
RETURNS varchar
STABLE AS $$
    import scipy.stats as st
    import math
    # z statistic for the sample mean against the tested value
    z = (x_bar - test_val) / (sigma / math.sqrt(n))
    # lower-tail p-value from the standard normal distribution
    p = st.norm.cdf(z)
    if p <= alpha:
        return 'Statistically significant'
    else:
        return 'May have occurred by random chance'
$$ LANGUAGE plpythonu;
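The arithmetic inside this UDF can be checked outside the database. The sketch below reproduces the same lower-tail z-test with only the standard library, replacing `scipy.stats.norm.cdf` with the equivalent closed form based on `math.erf`; the function names and the sample numbers are made up for illustration.

```python
import math


def normal_cdf(z):
    """Standard normal CDF, written with the error function instead of SciPy."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def z_test_by_pval(alpha, x_bar, test_val, sigma, n):
    """Same logic as the UDF body: lower-tail z-test against test_val."""
    z = (x_bar - test_val) / (sigma / math.sqrt(n))
    p = normal_cdf(z)
    if p <= alpha:
        return "Statistically significant"
    return "May have occurred by random chance"


# Sample mean 98 vs. tested value 100, sigma 5, n 100 gives z = -4.
low_mean = z_test_by_pval(0.05, 98.0, 100.0, 5.0, 100.0)
# Sample mean 100.5 gives z = 1, so the lower-tail p-value is large.
high_mean = z_test_by_pval(0.05, 100.5, 100.0, 5.0, 100.0)
print(low_mean)
print(high_mean)
```

Note that the test is one-sided: only sample means well below the tested value are reported as significant.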

Scalar UDFs – APN Partner Periscope
JSON Support
• json_array_sort
• json_array_reverse
• json_array_pop
• json_array_push
MySQL Date Helpers
• mysql_year
• mysql_yearweek
Varchar Utilities
• str_multiply
• str_count
• titlecase
Number Utilities
• format_num
• second_max
Connecting R to Amazon Redshift to add predictive modeling

What is R?
• Open source programming language and software environment designed for statistical computing, data analysis, and visualization
• Open source IDE for R
• Shiny Server: visualization R package for creating interactive dashboards

Querying Amazon Redshift with RJDBC
install.packages("RJDBC")
library(RJDBC)
# download Amazon Redshift JDBC driver
download.file('http://s3.amazonaws.com/redshift-downloads/drivers/RedshiftJDBC41-1.1.6.1006.jar',
              'RedshiftJDBC41-1.1.6.1006.jar')
# connect to Amazon Redshift
driver <- JDBC("com.amazon.redshift.jdbc41.Driver",
               "RedshiftJDBC41-1.1.6.1006.jar",
               identifier.quote="`")
url <- "jdbc:redshift://example.abcxyz.us-east-1.redshift.amazonaws.com:5439/demo?user=XXX&password=XXX"
conn <- dbConnect(driver, url)
# get some data from the Redshift table
dbGetQuery(conn, "select count(*) from sales")
# close connection
dbDisconnect(conn)
[Diagram: an R user runs R on Amazon EC2 in the AWS cloud, alongside a raw dataset in Amazon S3, user profiles in Amazon RDS, and Amazon Redshift]

Analysis with dplyr R Package
# Run analyses with the dplyr package on Amazon Redshift
install.packages(c("dplyr", "RPostgreSQL", "ggplot2"))
library(dplyr)
library(RPostgreSQL)
library(ggplot2)
# connect to Redshift via the RPostgreSQL package (host name only, no jdbc: prefix)
myRedshift <- src_postgres('demo',
                           host = 'example.abcxyz.us-east-1.redshift.amazonaws.com',
                           port = 5439, user = "demo", password = "mypassword")
# create table reference
sales <- tbl(myRedshift, "sales")
# analyze: average per month and state (grouping assumed from the plot aesthetics),
# then pull the summarized rows into R
avgsales_by_month <- collect(summarize(group_by(sales, month, state),
                                       avgsales = mean(store)))
# plot
ggplot(aes(month, avgsales, fill = state), data = avgsales_by_month) +
  geom_bar(stat = "identity")

Predictive Modeling with R
# Generate predictions with the rpart package
install.packages(c("rpart", "caret"))
library(rpart)
library(caret)  # provides createDataPartition
# split dataset
splitdata <- createDataPartition(mydata$category,
                                 times = 1,
                                 p = 0.5,
                                 list = FALSE)
trainingdata <- mydata[splitdata, ]
testingdata <- mydata[-splitdata, ]
# create model
model <- rpart(category ~ attr_1 + attr_2 + attr_3,
               method = "class", data = trainingdata)
# generate predictions
predictions <- predict(model, testingdata, type = "class")
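The split step is worth a closer look, since `createDataPartition` samples within each class so that the training and testing sets keep similar class proportions. The following standard-library Python sketch shows that idea under simplifying assumptions: the function name, the fixed seed, the rounding rule, and the toy rows are all illustrative and do not reproduce caret's exact behavior.

```python
import random
from collections import defaultdict


def stratified_split(rows, label_key, p=0.5, seed=42):
    """Split rows into training/testing sets, sampling a fraction p per class."""
    rng = random.Random(seed)
    indices_by_label = defaultdict(list)
    for index, row in enumerate(rows):
        indices_by_label[row[label_key]].append(index)

    training_indices = set()
    for indices in indices_by_label.values():
        rng.shuffle(indices)
        # Take roughly p of each class so class proportions are preserved.
        take = int(round(p * len(indices)))
        training_indices.update(indices[:take])

    training = [row for i, row in enumerate(rows) if i in training_indices]
    testing = [row for i, row in enumerate(rows) if i not in training_indices]
    return training, testing


# Toy data: 20 rows, evenly divided between categories "a" and "b".
mydata = [{"attr_1": i, "category": "a" if i % 2 else "b"} for i in range(20)]
trainingdata, testingdata = stratified_split(mydata, "category", p=0.5)
print(len(trainingdata), len(testingdata))
```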

Easy to use, managed machine learning service built for developers
Robust, powerful machine learning technology based on Amazon’s internal systems
Create models using your data already stored in the AWS cloud
Deploy models to production in seconds

Easy to use and developer-friendly
Powerful machine learning technology
Integrated with AWS data ecosystem
Fully managed model and prediction services

Building smart applications with Amazon ML
1. Build & train model
2. Evaluate and optimize
3. Retrieve predictions

1. Build & train model
• Create a data source object pointing to your data
• Explore and understand your data
• Transform data and train your model

2. Evaluate and optimize
• Understand model quality
• Adjust model interpretation
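For a binary classifier, adjusting model interpretation typically means moving the score threshold above which a prediction counts as positive, trading precision against recall. The sketch below illustrates that trade-off in plain Python. The scores and labels are made-up toy data, the function name is illustrative, and treating a score equal to the threshold as positive is an assumption.

```python
def metrics_at_threshold(scores, labels, threshold):
    """Precision and recall when scores at or above threshold predict class 1."""
    true_pos = false_pos = false_neg = 0
    for score, label in zip(scores, labels):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and label == 1:
            true_pos += 1
        elif predicted == 1 and label == 0:
            false_pos += 1
        elif predicted == 0 and label == 1:
            false_neg += 1
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall


# Toy model scores and the true labels they should have predicted.
scores = [0.95, 0.80, 0.60, 0.40, 0.20, 0.05]
labels = [1, 1, 0, 1, 0, 0]

for threshold in (0.3, 0.5, 0.7):
    precision, recall = metrics_at_threshold(scores, labels, threshold)
    print(threshold, round(precision, 2), round(recall, 2))
```

A lower threshold catches more true positives at the cost of more false alarms; which side to favor depends on the application, for example fraud detection versus marketing offers.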

3. Retrieve predictions
• Batch predictions
• Real-time predictions
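A real-time prediction is a single synchronous request for one record. The sketch below wraps that call in a small helper. It assumes the client object behaves like `boto3.client("machinelearning")`, whose `predict` operation takes `MLModelId`, `Record`, and `PredictEndpoint`; the stub client, model ID, endpoint URL, and record fields are hypothetical stand-ins so the example runs without AWS credentials.

```python
def predict_realtime(ml_client, model_id, endpoint_url, record):
    """Request one real-time prediction and return the Prediction part of the reply."""
    # Record attribute values are sent as strings.
    string_record = {name: str(value) for name, value in record.items()}
    response = ml_client.predict(
        MLModelId=model_id,
        Record=string_record,
        PredictEndpoint=endpoint_url,
    )
    return response["Prediction"]


class FakeMLClient:
    """Hypothetical stub standing in for boto3.client("machinelearning")."""

    def predict(self, MLModelId, Record, PredictEndpoint):
        self.last_record = Record
        # Canned reply shaped like a binary-classification response.
        return {
            "Prediction": {
                "predictedLabel": "1",
                "predictedScores": {"1": 0.87},
                "details": {"PredictiveModelType": "BINARY", "Algorithm": "SGD"},
            }
        }


fake_client = FakeMLClient()
prediction = predict_realtime(
    fake_client,
    model_id="ml-EXAMPLE",                      # hypothetical model ID
    endpoint_url="https://realtime.example",    # hypothetical endpoint URL
    record={"age": 42, "plan": "free"},         # hypothetical attributes
)
print(prediction["predictedLabel"], prediction["predictedScores"])
```

Batch predictions, by contrast, run asynchronously over a whole data source and write their results to Amazon S3.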

Batch predictions with Amazon Redshift
• Structured data in Amazon Redshift
• Query for predictions with Amazon ML batch API
• Predictions in Amazon S3
• Load predictions into Amazon Redshift, or read prediction results directly from Amazon S3
• Your application
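Loading the batch results back into Amazon Redshift comes down to a COPY from the S3 location that holds the prediction files. The sketch below only builds such a statement as a string. It assumes the results are gzip-compressed CSV files with a header row and that the cluster is authorized through an IAM role; the table name, S3 prefix, and role ARN are hypothetical placeholders.

```python
def build_copy_statement(table, s3_prefix, iam_role_arn):
    """Build a Redshift COPY statement for gzip-compressed CSV prediction files."""
    for value in (table, s3_prefix, iam_role_arn):
        # Refuse quotes and semicolons rather than trying to escape them.
        if "'" in value or ";" in value:
            raise ValueError("unexpected character in: %r" % value)
    return (
        "COPY {table}\n"
        "FROM '{prefix}'\n"
        "IAM_ROLE '{role}'\n"
        "CSV GZIP IGNOREHEADER 1;"
    ).format(table=table, prefix=s3_prefix, role=iam_role_arn)


statement = build_copy_statement(
    table="predictions",                                           # hypothetical table
    s3_prefix="s3://example-bucket/batch-prediction/result/",      # hypothetical prefix
    iam_role_arn="arn:aws:iam::123456789012:role/ExampleCopyRole", # hypothetical role
)
print(statement)
```

The resulting statement would be run through any SQL client connected to the cluster, after which the predictions can be joined against the original structured data.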

Visualize predictions with Amazon QuickSight

Amazon QuickSight
Fast, easy to use, cloud powered business intelligence

Automatic Data Discovery and Intelligence
• Discover data sources automatically
• Recommend analyses
• Select the best visualization for the data automatically
• Inspect data types and relationships

[Diagram: Amazon QuickSight architecture]
• Access: QuickSight UI (mobile devices, web browsers), partner BI products, QuickSight API
• Components: Data Prep, Metadata, Suggestions, Connectors, SPICE
• AWS data sources: Amazon S3, Amazon Kinesis, Amazon DynamoDB, Amazon EMR, Amazon Redshift, Amazon RDS
• Other data sources: files, apps, on-premises data (direct connect, JDBC/ODBC)
• BI users

Visualizing batch predictions
• Structured data in Amazon Redshift
• Query for predictions with Amazon ML batch API
• Predictions in S3
• Load predictions into Amazon Redshift, or read prediction results directly from S3
• Visualization: Amazon QuickSight

Demo Overview
• Raw dataset: Amazon S3
• Structured data: Amazon Redshift
• Predictions: Amazon S3
• Prediction generation
• Visualize: Amazon QuickSight

Resources
Amazon Redshift Getting Started Guide: http://docs.aws.amazon.com/redshift/latest/gsg/getting-started.html
Scalar UDF documentation: http://docs.aws.amazon.com/redshift/latest/dg/user-defined-functions.html
User Defined Functions for Amazon Redshift blog: https://aws.amazon.com/blogs/aws/user-defined-functions-for-amazon-redshift/
Connecting R with Amazon Redshift: https://blogs.aws.amazon.com/bigdata/post/Tx1G8828SPGX3PK/Connecting-R-with-Amazon-Redshift
Amazon ML Getting Started guide: https://aws.amazon.com/machine-learning/getting-started/
Amazon QuickSight (Preview Registration): https://aws.amazon.com/quicksight/
