SQL Server to Redshift
Background
RealityMine provides digital behaviour analytics.
Our applications passively measure the activity of opt-in users on all digital platforms.
This could be focused on
• how to direct marketing
• how to direct product development
• questioning individuals who undertake certain behaviour patterns
Starting State
• SQL Server DW on in-house server
• SQL Server 2008 R2 Enterprise Edition
• Single 4 core (8 thread) i7 w/ 16GB RAM
• 2 × 960GB PCIe SSDs for DBs
• 1 × 240GB PCIe SSD for TempDb

SQL Server to Redshift - @joeharris76
Data Environment
• ~20 billion rows in active use
• Largest table is also the widest
• Volume is doubling more than annually
• Data is in many languages
• Starts as JSON, ends as Star Schema DW

Pain Points
• Biggest cost is SQL Server license
• Biggest bottleneck is single threaded perf.
• Hand tuning needed to push CPU / disks
• SSD reliability is not perfect
• SSD performance degrades over time

Why Redshift
• Vertica wanted £45k per terabyte
• 16 SQL Server Enterprise cores even more!
• Teradata, Netezza, etc. don’t want <5TB sales
• SAP HANA not viable for this volume on AWS
• Infobright does not support incremental loads
• Hadoop/Impala slow & requires lots of learning
Data Processing Approach

• No ETL tool truly supports Redshift
– Requirement to load from S3 is a killer
– Tried SSIS, Pentaho, Talend and others
• You’re stuck with ELT
– Load data then transform as needed
– Keep data as raw as possible from source
War of Encodings
The road to heaven goes
through ÜÑÎÇØDÈ hell

Redshift: UTF-8 Only
• Redshift has zero-tolerance for certain chars
– NUL/0x00 => Treated as EOR, documented
– DEL/0x7F => Treated as EOR, undocumented
– 0xBFEFEF => UTF-8 spec "guaranteed non-char"
– These must be removed before loading data
• Other control characters can be loaded by escaping
– You cannot escape a single column, all or nothing
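The removal step above can be sketched in Python. This is a minimal sketch, not the process used in the talk: `FORBIDDEN`, `strip_forbidden` and `to_redshift_bytes` are hypothetical names, and reading the slide's "0xBFEFEF" as the non-character U+FFFF (UTF-8 bytes EF BF BF) is an assumption.

```python
# Code points the slide reports as fatal to a Redshift load.
# 0x00 = NUL, 0x7F = DEL. 0xFFFF is an ASSUMPTION about what the
# slide's "0xBFEFEF" refers to (U+FFFF encodes to EF BF BF in UTF-8).
FORBIDDEN = {0x00: None, 0x7F: None, 0xFFFF: None}


def strip_forbidden(text: str) -> str:
    """Delete every forbidden code point from an already-decoded string."""
    # str.translate removes any character whose ordinal maps to None.
    return text.translate(FORBIDDEN)


def to_redshift_bytes(text: str) -> bytes:
    """Clean a value and encode it as UTF-8, the only encoding Redshift loads."""
    return strip_forbidden(text).encode("utf-8")
```

Working on decoded strings rather than raw bytes avoids accidentally cutting a multi-byte UTF-8 sequence in half.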

SQL Server: UTF-16LE Only
• NVARCHAR takes 2x as much space as a VARCHAR
• Makes functions consistent across ASCII & Unicode
– N/VARCHAR(32) = 32 chars / Redshift VARCHAR(32) = 32 bytes
• SQL Server tolerates anything in character columns
• Input and output are not sanitized against the UTF-16 spec
– Invalid or "guaranteed non-chars" are stored as is
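Because a Redshift VARCHAR(n) counts bytes while SQL Server counts characters, column widths have to be re-derived from the UTF-8 byte length of the data. A rough sketch of that check, done client-side in Python; `redshift_varchar_width` is a hypothetical helper, not part of any library.

```python
def redshift_varchar_width(values):
    """Smallest byte width that fits every value once encoded as UTF-8.

    Returns 0 for an empty input. The result is a byte count, which is
    how the slide says Redshift measures VARCHAR(n).
    """
    return max((len(value.encode("utf-8")) for value in values), default=0)


def fits(value: str, width: int) -> bool:
    """True if the value's UTF-8 encoding fits in VARCHAR(width)."""
    return len(value.encode("utf-8")) <= width
```

For example, a three-character Japanese string needs nine bytes, so it fits NVARCHAR(3) on SQL Server but not VARCHAR(3) on Redshift.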

SQL Extract: The Hard Way
• BCP is the “standard” way to extract data
• Using BCP your process looks something like this:
– Extract data as a huge UTF-16LE file using bcp
– Convert to a new UTF-8 file using iconv
– Remove or escape problem chars using sed
– Compress the final file using gzip
– All steps are heavily constrained by disk speed
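The convert and compress steps above each write a full intermediate file, which is where the disk cost comes from. As an illustration only, the same two steps can be done in one streaming pass in Python; `utf16le_to_gzip_utf8` is a hypothetical name, and this is not the tooling used in the talk. Strict decoding is assumed, so invalid UTF-16 of the kind SQL Server will happily store raises an error here rather than passing through.

```python
import codecs
import gzip
import io


def utf16le_to_gzip_utf8(source, sink, chunk_size=1 << 20):
    """Stream UTF-16LE bytes from source into sink as gzip-compressed UTF-8.

    source and sink are binary file-like objects. No intermediate file
    is written; only one chunk is held in memory at a time.
    """
    # The incremental decoder buffers a code unit that is split
    # across two chunks instead of failing on it.
    decoder = codecs.getincrementaldecoder("utf-16-le")()
    with gzip.GzipFile(fileobj=sink, mode="wb") as gz:
        while True:
            chunk = source.read(chunk_size)
            if not chunk:
                break
            gz.write(decoder.decode(chunk).encode("utf-8"))
        # Raises UnicodeDecodeError if the input ended mid-character.
        gz.write(decoder.decode(b"", final=True).encode("utf-8"))
```

Usage would look like opening the bcp output with `open(path, "rb")` as `source` and the target `.gz` file with `open(path, "wb")` as `sink`.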

SQL Extract: The Easy Way

SQLCMD one-liner for extracts (command fragment on the left, purpose on the right):

  chcp 65001 &                        Set the cmd code page to UTF-8
  sqlcmd -E -Q                        Interactive SQL terminal
  "SET NOCOUNT ON;                    Prevent summary in output
  SELECT * FROM Db.Schema.Table;"     Select from the table / view
  -h-1                                No column headers
  -k1                                 Remove special characters
  -s"|"                               Delimit output with 1 ASCII char
  -W                                  No padding in output
  -u                                  Output in Unicode
  | gzip > "C:file.gz"                Pipe stdout to gzip

Data Encryption
• On SQL Server we use TDE
• Redshift offers AES encrypted data on disk
• Redshift can load client-side encrypted data
• Client side encryption only applies while on S3
• “Small performance penalty” for using AES

Security
• S3 Access => Create bucket(s) just for Redshift staging
• Redshift admin => Use IAM, create automation user(s)
• Redshift database =>
– Do not use the admin user; it’s like SQL Server’s ‘sa’
• Database objects =>
– Must actively GRANT access to each object
– Use groups to make management easier

Sizing your cluster

• Redshift is over-provisioned on storage
• Redshift is super efficient at compression
– Compression not affected by the data model
• Redshift scale out is almost perfectly linear
– 2 nodes is twice as fast as 1 node
• You'll be sizing your cluster for speed!
Performance
• Redshift speed depends on node count
– A single node is not particularly fast
• Loading speed appears to be linked to S3 speed
– You must use multiple files for bulk loads
• Query speed appears to be CPU constrained
– Vacuum runs at 250 MB/s, queries at <20 MB/s
• Data modeling matters for complex query speed
– Use a star schema & well chosen distribution key
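The "multiple files" requirement can be met by splitting an extract before it is uploaded to S3. A small sketch in Python, assuming the extract is available as an iterable of text lines; `split_into_gzip_parts` is a hypothetical helper, and how many parts to use (for instance one per slice) is left as an assumption for the reader to tune.

```python
import gzip
import itertools


def split_into_gzip_parts(lines, part_count):
    """Distribute text lines round-robin across part_count gzip blobs.

    Each returned bytes object is a complete gzip file holding UTF-8
    text, ready to be uploaded as its own object.
    """
    if part_count < 1:
        raise ValueError("part_count must be at least 1")
    buckets = [[] for _ in range(part_count)]
    # zip stops when lines runs out; cycle keeps handing out buckets.
    for line, bucket in zip(lines, itertools.cycle(buckets)):
        bucket.append(line)
    return [gzip.compress("".join(bucket).encode("utf-8")) for bucket in buckets]
```

Round-robin keeps the parts close to equal in size, so no single file dominates the load time.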
Data Modeling

2 main concepts to learn
• Distribution key
– Where data is placed, which node & slice
– Needs to be common across most tables
• Sort key
– How data is ordered on disk within the slice
– Good sort keys simplify expensive joins
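Why the distribution key should be common across tables is easier to see with a toy model. The sketch below is an illustration only: `slice_for` and `place_rows` are hypothetical, and CRC32 stands in for whatever hash Redshift really uses, which this document does not describe.

```python
import zlib


def slice_for(dist_key_value, slice_count):
    """Map a distribution key value to a slice number.

    ASSUMPTION: CRC32 is an illustrative stand-in, not Redshift's hash.
    """
    digest = zlib.crc32(str(dist_key_value).encode("utf-8"))
    return digest % slice_count


def place_rows(rows, key_column, slice_count):
    """Group rows (dicts) by the slice their distribution key hashes to."""
    slices = {number: [] for number in range(slice_count)}
    for row in rows:
        slices[slice_for(row[key_column], slice_count)].append(row)
    return slices
```

If two tables are both distributed on `user_id`, every row for a given user lands on the same slice in both tables, so a join on `user_id` needs no data movement between nodes.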
Database Maintenance
• Data loaded to non-empty tables is not sorted
• Data loaded to non-empty tables may kill their stats
• ANALYZE rebuilds the stats without making changes
• VACUUM re-sorts the physical data and rebuilds stats
– Needed to get the best performance
– Very similar to a REBUILD in SQL Server

Database Backups
• Redshift ‘backups’ are snapshots of the system
• Taken very quickly, much slower to restore
• Redshift automatically takes intra-day snapshots
• Manual snapshots can be run using AWS cmd line
• Snapshot storage is free up to size of cluster storage
• Snapshots must be restored to an identical cluster
• Snapshots cannot be restored to a running cluster

Code Changes

Code changes required so far
• ROW_NUMBER() missing in Redshift
• We gain LAG() and LEAD() which helps
• But very difficult to persist an order value
• DATETIMEOFFSET (e.g. timezone) not avail.
• DATETIMEs now split into 2 columns
• Work in progress…
That’s all folks!

Come Work With Me!
http://www.realitymine.com/careers/
• Currently trying to fill the following roles:
• Business Intelligence Architect (Redshift!)
• Business Intelligence Developer (Tableau!)
• Test Engineer (Quality!)
• Server Developer (C#!)
• Mobile App Developer (Android! iOS!)
• Project Manager

Editor's notes

  1. Data and Log are always on different disks. Criss-cross pattern used to balance wear. TempDb split across 8 files (1 per thread).
  2. TDE required for data encryption. Compression used to maximise SSD speed. A lot of tuning done to push CPU and disks harder. We've seen silent partial failures without any indication. Now have to regularly run DBCC to verify databases. So far we've seen a ~20% perf loss over a year.
  3. We’re actually using our existing SQL Server automation setup to run batch scripts that execute SQL on Redshift.
  4. Four-byte character support was recently added and that makes things a little easier. SQL Server's REPLACE() function is **broken** and ***cannot remove any of these values***! Yes, really. I can't tell you how fun it was to figure that out. Because it wasn't fun at all. All escape-sensitive data must be escaped in all columns. Embedded newlines **must** be escaped as '\n'.
  5. vs Oracle, which has LENGTH() for characters and LENGTHB() for bytes. vs Redshift, which has only LENGTH() and no way to get the byte length. SQL Server will tolerate _anything_ inside a character column. No sanitisation of inputs or outputs. UTF-16LE *compatible*, rather than *compliant*. I know this from painful experience.
  6. All web searches will suggest using BCP. All ETL tools actually wrap BCP to get data out. **Forget about BCP. BCP is the enemy.** BCP DOES NOT SUPPORT STDOUT!!!
  7. Voila! UTF-8 output from SQL Server directly to a gzip file.
  8. * On SQL Server we use TDE (transparent encryption)
       * Data on disk is AES encrypted, transparently.
     * Redshift offers AES encryption of the data on disk.
       * Not actively encrypted during use, same as SQL Server.
     * Redshift supports loading client-side 'envelope' encrypted data.
       * Good luck with that!
       * Slow: You'll have to land your data on disk and then reprocess it.
       * Custom: You'll have to write your own encrypter using OpenSSL or some such.
       * Client-side encryption is somewhat moot as it only applies while data is on S3.
       * My 2p: Enable AES on both S3 and Redshift. Call it a day.
     * Amazon says there is a 'small performance penalty' for using AES.
       * In practice it seems to be acceptable.
       * I have *not actually tested* it without AES because I don't want to generate 10 billion rows of sample data.
  9. * Managing user and admin access is kind of a pain in Redshift
     1. Access to S3
        * Create bucket(s) just for Redshift staging data.
     2. Access to Redshift admin
        * Use IAM access controls to limit individuals' access.
        * Create users just for automation and enforce password rotation.
     3. Access to Redshift database
        * **Do not allow** use of the admin user - it's like SQL Server's `sa`.
        * Create 1:1 map of external users to Redshift users (no LDAP/AD support)
     4. Access to specific database objects
        * You must actively `GRANT` access to each object.
        * Use groups to make this task easier.
        * We have just 2 groups: "admin" (`GRANT ALL`) and "readers" (`GRANT SELECT`)
  10. * Redshift nodes are waaaaaay over-provisioned on storage
        * 2 TB of storage available per node
      * Redshift is suuuuuper efficient at compression
        * Our data in Redshift is roughly 2x the gzipped UTF8 input.
        * The size varies depending on how we sort the tables.
        * Therefore you'll be sizing the cluster for **speed**.
        * You add nodes to go faster _not when you run out of disk._
      * Tough to get your head around.
  11. Still faster than SQL Server on PCIe SSDs for our data. You must use multiple files for bulk loads.
  12. You cannot schedule these AFAICT. They are auto-deleted on a schedule you can set. Default auto-delete is 1 day. Priced same as S3 beyond cluster size.