Amazon Machine Learning is a service that makes it easy for developers of all skill levels to use machine learning technology. Amazon Redshift is a fast, fully managed, petabyte-scale data warehouse that makes it simple and cost-effective to analyze all your data using your existing business intelligence tools. Combined, the two can power advanced analytics that not only describe what has happened in the past but also make intelligent predictions about the future. Please join this webinar to learn how to get the most value from your data for your data-driven business.
Learning Objectives:
How to scale your Redshift queries with user-defined functions (UDFs)
How to apply Machine learning to historical data in Amazon Redshift
How to visualize your data with Amazon QuickSight
Present a reference architecture for advanced analytics
Who Should Attend:
Application developers looking to add UDFs or predictive analytics to their applications; database administrators who need to meet the demands of data-driven organizations; decision makers looking to derive more insight from their data
2. What to Expect from the Webinar
Current Trends
Amazon Redshift
•New SQL Functions
•User Defined Functions (UDFs)
•Connecting R with Amazon Redshift
Amazon Machine Learning
•Amazon ML Overview
•Developing with Amazon ML
Visualizing with Amazon QuickSight
Demo
Q&A
3. Data is part of the fabric of the applications
Front-end and UX
Mobile
Back-end and operations
Data and analytics
4. Three types of data-driven development
Retrospective analysis and reporting
Here-and-now real-time processing and dashboards
Predictions to enable smart applications
5. Machine learning and smart applications
Machine learning is the technology that automatically finds patterns in your data and uses them to make predictions for new data points as they become available
Your data + machine learning = smart applications
6. Smart applications by example
Based on what you know about the user: Will they use your product?
Based on what you know about an order: Is this order fraudulent?
Based on what you know about a news article: What other articles are interesting?
7. Machine Learning Use Cases
Personalization: Recommending content, predictive content loading, improving user experience, …
Targeted marketing: Matching customers and offers, choosing marketing campaigns, cross-selling and up-selling, …
Content classification: Categorizing documents, matching hiring managers and resumes, …
Churn prediction: Finding customers who are likely to stop using the service, free-tier upgrade targeting, …
Customer support: Predictive routing of customer emails, social media listening, …
Fraud detection: Detecting fraudulent transactions, filtering spam emails, flagging suspicious reviews, …
10. Amazon Redshift system architecture
Leader node
• SQL endpoint
• Stores metadata
• Coordinates query execution
Compute nodes
• Local, columnar storage
• Execute queries in parallel
• Load, backup, restore via Amazon S3; load from Amazon DynamoDB, Amazon EMR, or SSH
Two hardware platforms
• Optimized for data processing
• DS2: HDD; scale from 2TB to 2PB
• DC1: SSD; scale from 160GB to 326TB
[Diagram: SQL clients/BI tools connect over JDBC/ODBC to the leader node; the leader node and the compute nodes (each shown with 128GB RAM, 16TB disk, 16 cores) communicate over a 10 GigE (HPC) network; the compute nodes exchange data with S3 / EMR / DynamoDB / SSH]
11. Customers are from a variety of industries & sizes
NTT Docomo | Telecom; FINRA | Financial Svcs; Philips | Healthcare; Yelp | Technology; NASDAQ | Financial Svcs
The Weather Company | Media; Nokia | Telecom; Pinterest | Technology; Foursquare | Technology; Coursera | Education
Coinbase | Bitcoin; Amazon | E-Commerce; Etix | Entertainment; Spuul | Entertainment; Vivaki | Ad Tech
Z2 | Gaming; Neustar | Ad Tech; SoundCloud | Technology; BeachMint | E-Commerce; Civis | Technology
13. New SQL Functions
We add SQL functions regularly to expand Amazon Redshift’s query capabilities
Added 25+ window and aggregate functions since launch, including:
LISTAGG
[APPROXIMATE] COUNT
DROP IF EXISTS, CREATE IF NOT EXISTS
REGEXP_SUBSTR, _COUNT, _INSTR, _REPLACE
PERCENTILE_CONT, _DISC, MEDIAN
PERCENT_RANK, RATIO_TO_REPORT
We’ll continue iterating but also want to enable you to write your own
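To make two of the listed functions concrete, here is a minimal Python sketch of what PERCENTILE_CONT and MEDIAN compute. It assumes the standard SQL definition (linear interpolation between neighboring ordered values, NULLs ignored, MEDIAN equal to the 0.5 continuous percentile). The helper names percentile_cont and median are illustrative only and are not part of any Amazon Redshift API.

```python
def percentile_cont(values, fraction):
    """Continuous percentile with linear interpolation (SQL-style sketch)."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    # Aggregates skip NULLs; None plays the role of NULL in this sketch.
    ordered = sorted(v for v in values if v is not None)
    if not ordered:
        return None
    # Zero-based fractional position of the requested percentile.
    position = fraction * (len(ordered) - 1)
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    weight = position - lower
    # Interpolate between the two neighboring ordered values.
    return ordered[lower] + (ordered[upper] - ordered[lower]) * weight


def median(values):
    """MEDIAN is the 0.5 continuous percentile."""
    return percentile_cont(values, 0.5)
```

With an even number of values the result falls between the two middle values, which is why median([1, 2, 3, 4]) is 2.5 rather than 2 or 3.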
14. Scalar User Defined Functions
You can write UDFs using Python 2.7
• Syntax is largely identical to PostgreSQL UDF
• Python execution is performed in parallel
• System and network calls within UDFs are prohibited
Comes integrated with Pandas, NumPy, SciPy, DateUtil and
Pytz analytic libraries
• Import your own libraries for even more flexibility
• Take advantage of thousands of functions available through Python
libraries to perform operations not easily expressed in SQL
15. Scalar UDF example – URL parsing
Template
CREATE [ OR REPLACE ] FUNCTION f_function_name
( [ argument_name arg_type, ... ] )
RETURNS data_type
{ VOLATILE | STABLE | IMMUTABLE }
AS $$
    python_program
$$ LANGUAGE plpythonu;
Example
CREATE FUNCTION f_hostname (url VARCHAR)
RETURNS varchar
IMMUTABLE AS $$
    import urlparse
    return urlparse.urlparse(url).hostname
$$ LANGUAGE plpythonu;
-- Extracting the hostname with a built-in regular expression function
SELECT REGEXP_REPLACE(url, '(https?)://([^@]*@)?([^:/]*)([/:].*|$)', '\3')
FROM table;
-- Extracting the hostname with the UDF
SELECT f_hostname(url)
FROM table;
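Because a scalar UDF body is ordinary Python, it can be tried locally before creating the function in Amazon Redshift. The sketch below wraps the same logic in a plain function. The import fallback to Python 3 and the explicit None guard are additions made for local testing and are not part of the slide's UDF.

```python
try:
    from urlparse import urlparse  # Python 2.7, as used inside the UDF
except ImportError:
    from urllib.parse import urlparse  # Python 3 equivalent, for local testing


def f_hostname(url):
    """Local stand-in for the UDF body: return the host part of a URL."""
    if url is None:
        # Guard added for this sketch; the UDF on the slide has no such check.
        return None
    return urlparse(url).hostname


if __name__ == "__main__":
    print(f_hostname("https://user@www.example.com:8080/path?q=1"))
```

The hostname attribute strips any user information and port and lowercases the host, which is the part the regular expression version has to spell out by hand.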
16. Statistical UDF Example
CREATE FUNCTION f_z_test_by_pval (alpha float, x_bar float, test_val float, sigma float, n float)
RETURNS varchar
STABLE AS $$
    import scipy.stats as st
    import math as math
    # z statistic for the sample mean against the tested value
    z = (x_bar - test_val) / (sigma / math.sqrt(n))
    # lower-tail probability of observing a z this small
    p = st.norm.cdf(z)
    if p <= alpha:
        return 'Statistically significant'
    else:
        return 'May have occurred by random chance'
$$ LANGUAGE plpythonu;
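The same logic can be checked outside the database. This is a minimal sketch that swaps scipy.stats.norm.cdf for an equivalent standard-library formula based on math.erf, so it runs without SciPy installed. The helper name normal_cdf is an assumption of this sketch, not something from the slides.

```python
import math


def normal_cdf(z):
    """Standard normal CDF via the error function (stand-in for scipy.stats.norm.cdf)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def f_z_test_by_pval(alpha, x_bar, test_val, sigma, n):
    """Plain-Python version of the UDF body for local experimentation."""
    z = (x_bar - test_val) / (sigma / math.sqrt(n))
    p = normal_cdf(z)  # lower-tail probability, as in the UDF
    if p <= alpha:
        return 'Statistically significant'
    return 'May have occurred by random chance'


if __name__ == "__main__":
    # Sample mean of 98 against a tested value of 100, sigma 5, n 100.
    print(f_z_test_by_pval(0.05, 98.0, 100.0, 5.0, 100))
```

Note that the test as written is one-sided: because it uses the lower-tail probability, only sample means sufficiently below test_val are reported as significant.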
17. Scalar UDFs – APN Partner Periscope
JSON Support
json_array_sort
json_array_reverse
json_array_pop
json_array_push
MySQL Date Helpers
mysql_year
mysql_yearweek
Varchar Utilities
str_multiply
str_count
titlecase
Number Utilities
format_num
second_max
20. What is R?
Open source programming language and software environment designed for statistical computing, data analysis, and visualization
Open source IDE for R
Shiny Server - Visualization R package for creating interactive dashboards
21. Querying Amazon Redshift with RJDBC
install.packages("RJDBC")
library(RJDBC)
# download Amazon Redshift JDBC driver
download.file('http://s3.amazonaws.com/redshift-downloads/drivers/RedshiftJDBC41-1.1.6.1006.jar',
  'RedshiftJDBC41-1.1.6.1006.jar')
# connect to Amazon Redshift
driver <- JDBC("com.amazon.redshift.jdbc41.Driver",
  "RedshiftJDBC41-1.1.6.1006.jar",
  identifier.quote="`")
url <- "jdbc:redshift://example.abcxyz.us-east-1.redshift.amazonaws.com:5439/demo?user=XXX&password=XXX"
conn <- dbConnect(driver, url)
# get some data from the Redshift table
dbGetQuery(conn, "select count(*) from sales")
# close connection
dbDisconnect(conn)
[Diagram: an R user works with R on Amazon EC2 inside the AWS cloud, drawing on a raw dataset in Amazon S3, user profiles in Amazon RDS, and Amazon Redshift]
22. Analysis with dplyr R Package
# Run analyses with the dplyr package on Amazon Redshift
install.packages("dplyr")
library(dplyr)
library(RPostgreSQL)
library(ggplot2)  # needed for the plot below; assumed to be installed
# connect to Redshift via the RPostgreSQL package (host is a plain hostname, not a JDBC URL)
myRedshift <- src_postgres('demo',
  host = 'example.abcxyz.us-east-1.redshift.amazonaws.com',
  port = 5439, user = "demo", password = "mypassword")
# create table reference
sales <- tbl(myRedshift, "sales")
# analyze: average per month and state, computed in Redshift, then pulled into R
avg_sales <- sales %>%
  group_by(month, state) %>%
  summarize(avgsales = mean(store)) %>%
  collect()
# plot
ggplot(avg_sales, aes(month, avgsales, fill = state)) +
  geom_bar(stat = "identity")
23. Predictive Modeling with R
# Generate predictions with the rpart package
install.packages(c("rpart", "caret"))
library(rpart)
library(caret)  # provides createDataPartition
# split dataset
splitdata <- createDataPartition(mydata$category,
  times = 1,
  p = 0.5,
  list = FALSE)
trainingdata <- mydata[splitdata,]
testingdata <- mydata[-splitdata,]
# create model
model <- rpart(category ~ attr_1 + attr_2 + attr_3,
  method="class", data=trainingdata)
# generate predictions
predictions <- predict(model, testingdata, type="class")
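The split step above holds out half of the rows while keeping the category mix similar in both halves. A minimal Python sketch of that idea follows, using only the standard library. It is similar in spirit to createDataPartition but is not a port of it, and the function name stratified_split and its parameters are hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(rows, label_key, p=0.5, seed=0):
    """Split rows into (training, testing), sampling a fraction p within each label."""
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    by_label = defaultdict(list)
    for index, row in enumerate(rows):
        by_label[row[label_key]].append(index)
    training_idx = set()
    for label in sorted(by_label):
        indices = by_label[label]
        take = int(round(len(indices) * p))
        # Sample the training share separately for every label value.
        training_idx.update(rng.sample(indices, take))
    training = [row for i, row in enumerate(rows) if i in training_idx]
    testing = [row for i, row in enumerate(rows) if i not in training_idx]
    return training, testing


if __name__ == "__main__":
    data = [{"category": "a", "attr_1": i} for i in range(6)]
    data += [{"category": "b", "attr_1": i} for i in range(4)]
    train, test = stratified_split(data, "category", p=0.5, seed=42)
    print(len(train), len(test))
```

Sampling within each label, rather than over the whole dataset, keeps rare categories represented in both the training and the testing rows.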
25. Amazon Machine Learning
Easy to use, managed machine learning service built for developers
Robust, powerful machine learning technology based on Amazon’s internal systems
Create models using your data already stored in the AWS cloud
Deploy models to production in seconds
26. Easy to use and developer-friendly
Powerful machine learning technology
Integrated with AWS data ecosystem
Fully managed model and prediction services
31. Batch predictions with Amazon Redshift
Structured data in Amazon Redshift
Query for predictions with Amazon ML batch API
Predictions in Amazon S3
Load predictions into Amazon Redshift -or- Read prediction results directly from Amazon S3
Your application
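The flow on this slide can be sketched in Python as two small helpers: one builds the arguments for the create_batch_prediction call on the boto3 machinelearning client, the other builds a Redshift COPY statement that loads the resulting files from Amazon S3. This is a sketch under stated assumptions: the batch output is taken to be gzip-compressed CSV with a header row, and every identifier, bucket, table name, and role ARN shown is a placeholder.

```python
def batch_prediction_request(prediction_id, model_id, datasource_id, output_uri):
    """Keyword arguments for the boto3 machinelearning create_batch_prediction call."""
    return {
        "BatchPredictionId": prediction_id,
        "MLModelId": model_id,
        "BatchPredictionDataSourceId": datasource_id,
        "OutputUri": output_uri,
    }


def copy_predictions_sql(table, s3_prefix, iam_role_arn):
    """COPY statement that loads gzipped CSV prediction files from S3 into Redshift."""
    # Assumes trusted inputs; plain string formatting does no SQL escaping.
    return (
        "COPY {table} FROM '{prefix}' "
        "IAM_ROLE '{role}' "
        "CSV GZIP IGNOREHEADER 1;"
    ).format(table=table, prefix=s3_prefix, role=iam_role_arn)


if __name__ == "__main__":
    request = batch_prediction_request(
        "bp-example", "ml-example", "ds-example", "s3://example-bucket/predictions/")
    # With AWS credentials configured, the request could be submitted like this:
    #   import boto3
    #   boto3.client("machinelearning").create_batch_prediction(**request)
    print(copy_predictions_sql(
        "predictions", request["OutputUri"],
        "arn:aws:iam::123456789012:role/ExampleRedshiftRole"))
```

Loading the predictions back into Amazon Redshift lets them be joined with the source data in SQL, while reading the files directly from Amazon S3 avoids the extra load step when the application only needs the raw scores.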
34. Automatic Data Discovery and Intelligence
Discover data sources automatically
Recommend analyses
Select the best visualization for the data automatically
Inspect data types and relationships
35. QuickSight API
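[Diagram: BI users reach Amazon QuickSight from mobile devices, web browsers, and partner BI products through the QuickSight UI and the QuickSight API. Underneath sit the Connectors, Data Prep, Metadata, Suggestions, and SPICE components. Data sources: Amazon S3, Amazon Kinesis, Amazon DynamoDB, Amazon EMR, Amazon Redshift, Amazon RDS, files, apps, and on-premises data, with Direct Connect and JDBC/ODBC shown as connection paths.]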
36. Visualizing batch predictions
Structured data in Amazon Redshift
Query for predictions with Amazon ML batch API
Predictions in S3
Load predictions into Amazon Redshift -or- Read prediction results directly from S3
Visualization: Amazon QuickSight