Quick and Dirty: Scaling Out Predictive Models Using Revolution Analytics on Teradata

[Presentation by Skylar Lyon at DataWeek 2014, September 17, 2014.] I recently faced the task of scaling out an existing analytics process. The schedule was compressed - it always is in my world. The data was big - 400+ million rows waiting in the database. What did I do? I offered my favorite type of solution - quick and dirty. At the outset, I wasn't sure how easy it would be. Nor was I certain what performance gains would be realized. But the concept seemed sound and the exercise fun. Let's move the compute to the data via Revolution R Enterprise for Teradata. This presentation outlines my approach in leveraging a colleague's R models as I experimented with running R in-database. Would my path lead to significant improvement? Could it be used to productionalize the workflow?

Rapid Productionalization of Predictive Models
In-database Modeling with Revolution Analytics on Teradata
Skylar Lyon
Accenture Analytics

Introduction
Skylar Lyon
Accenture Analytics
• 7 years of experience with focus on big data
and predictive analytics - using discrete choice
modeling, random forest classification,
ensemble modeling, and clustering
• Technology experience includes: Hadoop,
Accumulo, PostgreSQL, qGIS, JBoss, Tomcat,
R, GeoMesa, and more
• Worked from Army installations across the
nation and also had the opportunity to travel
twice to Baghdad to deploy solutions
downrange.
Copyright © 2014 Accenture. All rights reserved. 2

How we got here
Project background and my involvement
• New Customer Analytics team for Silicon Valley Internet eCommerce
giant
• Data scientists developing predictive models
• Deferred focus on productionalization
• Joined as Big Data Infrastructure and Analytics Lead

Colleague’s CRAN R model
Binomial logistic regression
• 50+ independent variables, including categorical variables encoded as
indicator variables
• Train from small sample (many thousands) – not a problem in and of
itself
• Scoring across entire corpus (many hundreds of millions) – slightly more
challenging
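
The train-small, score-large pattern on this slide can be sketched in base R. This is only an illustration on simulated data, not the colleague's model: the column names (spend, segment, converted), the sample sizes, and the chunk size are all assumptions. It shows glm() expanding a categorical variable into indicator columns, and scoring done in chunks so memory use stays bounded.

```r
# Minimal sketch with simulated data; all names here are hypothetical.
set.seed(42)

# Simulated "corpus": one numeric and one categorical predictor.
n_corpus <- 100000
corpus <- data.frame(
  spend   = rnorm(n_corpus),
  segment = factor(sample(c("a", "b", "c"), n_corpus, replace = TRUE))
)
# Simulated binary outcome, used only to give the model something to fit.
true_logit <- -1 + 0.8 * corpus$spend + 0.5 * (corpus$segment == "b")
corpus$converted <- rbinom(n_corpus, 1, plogis(true_logit))

# Train on a small sample; glm() expands the factor into indicator columns.
train_rows <- sample(n_corpus, 5000)
fit <- glm(converted ~ spend + segment, data = corpus[train_rows, ],
           family = binomial)

# Score the full corpus in fixed-size chunks to bound memory use.
chunk_size <- 25000
starts <- seq(1, n_corpus, by = chunk_size)
scores <- unlist(lapply(starts, function(s) {
  rows <- s:min(s + chunk_size - 1, n_corpus)
  unname(predict(fit, newdata = corpus[rows, ], type = "response"))
}))
```

Even with chunking, this approach still pulls every row out of the database to wherever R is running, which is the data movement the in-database approach is meant to avoid.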

We optimized the current productionalization process
We moved compute to data
Before After
Reduced 5+ hour process to 40 seconds
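
As a rough check on the size of that improvement, the stated figures can be turned into a speedup factor. The slide says "5+ hours", so taking exactly 5 hours gives a lower bound.

```r
# Lower bound on the speedup implied by "5+ hours to 40 seconds".
before_seconds <- 5 * 60 * 60   # 5 hours, the stated minimum
after_seconds  <- 40
speedup <- before_seconds / after_seconds
print(speedup)  # 450
```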

Benchmarking our optimized process
5+ hours to 40 seconds: Recommendation is that this now become
the de facto productionalization process
[Chart: benchmark of processing time; axis labels: rows, minutes]

Optimization process
Recode CRAN R to Rx R
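Before
# CRAN R: fit in local memory, then score with predict()
trainit <- glm(as.formula(specs[[i]]), data = training.data,
               family = 'binomial', maxit = iters)
fits <- predict(trainit, newdata = test.data, type = 'response')
After
# RevoScaleR: rxGlm() names the iteration cap maxIterations,
# and rxPredict() takes the rows to score via data, not newdata
trainit <- rxGlm(as.formula(specs[[i]]), data = training.data,
                 family = 'binomial', maxIterations = iters)
fits <- rxPredict(trainit, data = test.data, type = 'response')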
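
The slide code refers to specs, i, iters, training.data, and test.data without defining them. The sketch below makes the "Before" version runnable. It assumes that specs is a list of formula strings looped over by index i, and that iters caps the fitting iterations. The data, the two formulas, and the value of iters are hypothetical stand-ins. It uses base R only, so it runs without Revolution R Enterprise.

```r
# Hypothetical stand-ins for the objects the slide code assumes.
set.seed(1)
n <- 2000
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- rbinom(n, 1, plogis(0.5 * dat$x1 - 0.25 * dat$x2))

training.data <- dat[1:1500, ]
test.data     <- dat[1501:n, ]
specs <- list("y ~ x1", "y ~ x1 + x2")  # candidate model formulas as strings
iters <- 25                             # cap on IRLS iterations

# Fit each specification and score the held-out rows.
all_fits <- lapply(seq_along(specs), function(i) {
  trainit <- glm(as.formula(specs[[i]]), data = training.data,
                 family = "binomial", maxit = iters)
  predict(trainit, newdata = test.data, type = "response")
})
```

Because the two function pairs take nearly the same arguments, a loop like this changes very little when recoded to the Rx versions.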

Additional benefits to new process
Technology is increasing the data science team’s options and
opportunities
• Train in-database on much larger set – reduces need to sample
• Nearly “native” R language – decreases deployment time
• Hadoop support – score in multiple data warehouses

Appendix
Table of Contents
• Technical Considerations

Technical considerations
Environment setup
• Teradata environment – 4 node, 1700 series appliance server
• Revolution R Enterprise – version 7.1, running R 3.0.2

Editor's Notes

Problem statement
Gabi’s binomial logistic regression model. Admittedly, it could be recoded to SQL, but that is not so easy with random forest and more powerful ensemble models
Lots of data movement; 6+ hour process
Show some CRAN R versus Rx R code
