4. Increase in the number and variety of data sources generating large amounts of data.
Realization that data is "too valuable" to delete.
Dramatic drop in hardware costs, especially storage.
11. Allows SQL Server 2016 to run T-SQL queries against relational data in SQL Server and "semi-structured" data in HDFS or Azure.
(Diagram: Hadoop (non-relational data) and Windows Azure Blob Storage (WASB) queried from an SMP SQL Server, returning SELECT results.)
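A minimal, hypothetical sketch of what this looks like in practice: a single T-SQL statement joins an ordinary SQL Server table with a PolyBase external table defined over files in HDFS/WASB. The table and column names below are illustrative placeholders, not from the deck.

-- Hypothetical names: dbo.Customers is a regular SQL Server table,
-- dbo.SensorData is a PolyBase external table over HDFS data.
SELECT c.CustomerName, AVG(s.Speed) AS AvgSpeed
FROM dbo.Customers AS c
JOIN dbo.SensorData AS s
    ON s.CustomerKey = c.CustomerKey
WHERE s.Speed > 65
GROUP BY c.CustomerName;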
12. (Timeline, 2012-2016: PolyBase in SQL Server 16 (CTP3), PolyBase in SQL DW, PolyBase in SQL Server 2016.)
13. (Timeline, 2012-2016: PolyBase in SQL Server 16 (CTP3), PolyBase in SQL DW, PolyBase in SQL Server 2016.)
14. Cleaning data before loading it.
Joining relational tables with streams of tweets.
Sensor data for predictive analytics.
15.
16. SQL Product                        Load data       Query data      Age-out data
                                       Hadoop  ASB     Hadoop  ASB     Hadoop  ASB
    SQL Server 2016                    Y       Y       Y       Y       Y       Y
    Analytic Platform System (APS)     Y       Y       Y       Y       Y       Y
    Azure SQL DW                       n       Y       n       n       Y
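As a hedged illustration of the "age-out" column for SQL Server 2016: cold rows can be exported into an external table that lives in Hadoop or Azure blob storage, assuming the export option is enabled. The table names here are placeholders.

-- Enable PolyBase export (off by default in SQL Server 2016).
EXEC sp_configure 'allow polybase export', 1;
RECONFIGURE;

-- dbo.ColdOrders_External is an assumed external table over HDFS/WASB;
-- dbo.Orders is an assumed local table holding the hot data.
INSERT INTO dbo.ColdOrders_External
SELECT * FROM dbo.Orders
WHERE OrderDate < '2014-01-01';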
23. Example 1: pricing policies (based on driver behavior).
Structured customer data: relational data, kept in SQL Server PDW / APS.
Non-relational sensor data: kept in Hadoop.
24. Example #2: Shopping Basket Analysis (based on social media behavior).
Structured product data: relational data, kept in SQL Server PDW / APS.
Non-relational social media data: kept in Hadoop.
25. Example #3: Drilling rig analysis.
Most recent data: kept in SQL Server PDW / APS.
Historical rig monitoring and operations data: kept in Hadoop.
26.
27. (Timeline, 2012-2016, past to future, with TODAY marked.) PolyBase is now part of SQL Server.
29. PolyBase = SQL Server PDW V2 querying HDFS/Azure data in situ.
Standard T-SQL query language; eliminates the need to write MapReduce jobs.
Leverages PDW's parallel query execution framework.
Moves data in parallel directly between Hadoop data nodes and PDW compute nodes.
Exploits PDW's parallel query optimizer to selectively push computation over HDFS data down as MapReduce jobs.
30. Control node (SQL Server) plus compute nodes (each running SQL Server).
Client connections / user queries arrive over JDBC, OLE DB, ODBC, ADO.NET.
The control node: parses the SQL, validates and authorizes, optimizes and builds the execution plan, runs the query in parallel, and returns the results to the client.
Data Movement Service (DMS): a separate process on each node; handles the intermediate tables moved between nodes during query execution.
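In SQL Server 2016 the same roles can be inspected through PolyBase DMVs; a small, hedged example (the output columns vary by version):

-- Head node and compute nodes known to this PolyBase installation.
SELECT * FROM sys.dm_exec_compute_nodes;

-- The Data Movement Service (DMS) instance running on each node.
SELECT * FROM sys.dm_exec_dms_services;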
31. Scalable SQL Server DW offering.
Highly competitive performance; available as the SQL Azure DW service.
Key components:
One control node: Engine Service + DMS; compiles and controls query execution.
Many compute nodes: each with SQL Server + DMS.
33. Azure Storage Blob (ASB) exposes an HDFS layer.
PolyBase reads from and writes to ASB using the Hadoop APIs.
There is no push-down support against ASB.
(Diagram: Azure storage volumes in Azure.)
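A hedged sketch of pointing PolyBase at Azure blob storage in SQL Server 2016; the account, container, and credential names are placeholders, and a database master key must already exist for the scoped credential.

-- Credential holding the storage account access key (placeholder values).
CREATE DATABASE SCOPED CREDENTIAL AzureStorageCredential
WITH IDENTITY = 'polybase_user', SECRET = '<storage-account-access-key>';

-- External data source over a WASB container; PolyBase goes through the
-- Hadoop APIs and, as noted above, no push-down is available here.
CREATE EXTERNAL DATA SOURCE AzureBlobStorage
WITH (
    TYPE = HADOOP,
    LOCATION = 'wasbs://mycontainer@myaccount.blob.core.windows.net',
    CREDENTIAL = AzureStorageCredential
);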
38. 1. Install several SQL Server instances with PolyBase.
2. Choose a head node (PolyBase Engine + PolyBase DMS).
3. Configure the remaining instances as compute nodes (PolyBase DMS), as sketched below:
a. Run sp_polybase_join_group
b. Restart the PolyBase DMS service
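Step 3a on a compute node might look like the following hedged sketch; the machine name and instance name are placeholders, and 16450 is the usual DMS control channel port.

-- Run on each compute-node instance, pointing it at the head node.
EXEC sp_polybase_join_group N'HEADNODE01', 16450, N'MSSQLSERVER';
-- Then restart the "SQL Server PolyBase Data Movement" service on that
-- node (step 3b above).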
40. Step 4 - Choose the Hadoop distribution.
Latest Hadoop distributions supported in SQL Server 2016 RTM:
• Cloudera CDH 5.5 on Linux
• Hortonworks HDP 2.3 on Linux and Windows Server
How does it work internally?
• Loads the correct JARs to connect to the chosen Hadoop distribution
- different configuration values represent the different Hadoop distributions
- example: value 4 represents HDP 2.0 on Windows or ASB,
value 5 for HDP 2.0 on Linux,
value 6 for CDH 5.1 / 5.5 on Linux,
value 7 for HDP 2.1 / 2.2 / 2.3 on Linux / Windows or ASB
(a configuration sketch follows below)
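For example, selecting value 7 (HDP 2.1/2.2/2.3 on Linux/Windows or ASB) is a plain sp_configure call; a hedged sketch:

EXEC sp_configure @configname = 'hadoop connectivity', @configvalue = 7;
RECONFIGURE;
-- The PolyBase Engine and Data Movement services must be restarted
-- for the new connectivity value to take effect.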
46. Main technical challenges:
Arbitrary file formats in HDFS (e.g., Text, RC, ORC, ...)
Parallelizing data transfer between compute nodes and HDFS data nodes
Imposing structure on unstructured data in HDFS, using the external-table concept
Exploiting the computational resources of Hadoop clusters
47. HDFS Bridge in PolyBase.
(Diagram: on each compute node, SQL Server plus a DMS augmented with an HDFS Bridge that talks to the HDFS data nodes of the Hadoop cluster.)
Hides the complexity of HDFS.
Uses the Hadoop "RecordReaders/Writers".
Used to transfer data to and from Hadoop in parallel.
48. Main technical challenges (recap): arbitrary file formats in HDFS (e.g., Text, RC, ORC, ...); parallelizing data transfer between compute nodes and HDFS data nodes; imposing structure on unstructured data in HDFS via external tables; exploiting the Hadoop cluster's computational resources.
50. Main technical challenges (recap): arbitrary file formats in HDFS (e.g., Text, RC, ORC, Parquet, ...); parallelizing data transfer between compute nodes and HDFS data nodes; imposing structure on unstructured data in HDFS via external tables; exploiting the Hadoop cluster's computational resources.
51. CREATE EXTERNAL DATA SOURCE HadoopCluster
WITH (TYPE = HADOOP, LOCATION = 'hdfs://10.193.26.177:8020',
      RESOURCE_MANAGER_LOCATION = '10.193.26.178:8050');

CREATE EXTERNAL FILE FORMAT TextFile
WITH (FORMAT_TYPE = DELIMITEDTEXT,
      DATA_COMPRESSION = 'org.apache.hadoop.io.compress.GzipCodec',
      FORMAT_OPTIONS (FIELD_TERMINATOR = '|', USE_TYPE_DEFAULT = TRUE));

CREATE EXTERNAL TABLE [dbo].[Customer] (
    [SensorKey] int NOT NULL,
    [NationKey] int NOT NULL,   -- column name missing on the slide; NationKey assumed from the later queries
    [Speed] float NOT NULL
)
WITH (LOCATION = '/Sensor_Data/May2014/sensordata.tbl',
      DATA_SOURCE = HadoopCluster,
      FILE_FORMAT = TextFile);

One external data source per Hadoop cluster
One external file format per file format
LOCATION: HDFS file path
52. CREATE EXTERNAL DATA SOURCE GSL_HDFS_CLUSTER
WITH (TYPE = HADOOP, LOCATION = 'hdfs://10.xxx.xx.xx:8020',
      JOB_TRACKER_LOCATION = '10.xxx.xx.xx:5020');

CREATE EXTERNAL FILE FORMAT TEXT_FORMAT
WITH (FORMAT_TYPE = DELIMITEDTEXT,
      DATA_COMPRESSION = 'org.apache.hadoop.io.compress.GzipCodec',
      FORMAT_OPTIONS (FIELD_TERMINATOR = '\t'));

CREATE EXTERNAL TABLE CUSTOMER
(   c_custkey bigint not null,
    c_name varchar(25) not null,
    c_address varchar(40) not null,
    c_nationkey integer not null,
    …
)
WITH (LOCATION = '/tpch1gb/customer.tbl', DATA_SOURCE = GSL_HDFS_CLUSTER,
      FILE_FORMAT = TEXT_FORMAT);

LOCATION: HDFS file path
53. -- select on external table (sensor data in HDFS)
SELECT * FROM SensorData
WHERE Speed > 65;

Execution plan:
1. CREATE temp table T (executed on the compute nodes)
2. IMPORT FROM HDFS: the HDFS Customer file is read into T
3. EXECUTE QUERY: SELECT * FROM T WHERE T.Speed > 65
54. Main technical challenges (recap): arbitrary file formats in HDFS (e.g., Text, RC, ORC, Parquet, ...); parallelizing data transfer between compute nodes and HDFS data nodes; imposing structure on unstructured data in HDFS via external tables; exploiting the Hadoop cluster's computational resources.
55. Engine Service: Parser, Query Optimizer, Query Plan Generator (SQL query to logical operator tree to physical operator tree).
The query is parsed.
"External tables" stored on HDFS are identified.
Parallel query optimization is performed; statistics on HDFS tables are used in the standard way.
The query plan generator optimizes the plan by converting subtrees whose inputs are all HDFS files into a sequence of MapReduce jobs.
The Engine Service submits the MapReduce jobs (as a JAR file) to the Hadoop cluster, leveraging the Hadoop cluster's computational capabilities.
56. (Diagram: a PolyBase query flows between the database and the Hadoop/HDFS cluster, steps 1-7, with a map job running on the cluster.)
Cost-based decision on how much processing to push.
SQL operations on HDFS data are "pushed" down into Hadoop as MapReduce jobs.
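One hedged way to observe this behavior in SQL Server 2016 is through the PolyBase DMVs, which expose the distributed steps of a query and the work performed against the external source:

-- Recent PolyBase (distributed) requests and their overall status.
SELECT * FROM sys.dm_exec_distributed_requests;

-- Per-node work done against the external data source (HDFS splits read, etc.).
SELECT * FROM sys.dm_exec_external_work;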
57. Cost-based decision (for split-based query execution):
• An important factor in the decision is how much the data volume is reduced.
• Hadoop takes 20-30 seconds to spin up a MapReduce job.
  o The spin-up time varies with the distribution and the operating system.
• The cardinality of the predicate matters.
  o No push-down happens if the query can run in less than 20-30 seconds.
  o Create statistics on the external table (they are not auto-created); see the sketch below.
(Diagram: External Table, External Data Source, External File Format; Your Apps, PowerPivot, Power View; PDW Engine Service; PolyBase storage layer (PPAX); HDFS Bridge as part of DMS; Job Submitter.)
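A hedged sketch of both knobs mentioned above, reusing the SensorData names from the later query examples: create statistics on the external table so the optimizer can cost the push-down decision, and, if needed, override the cost-based choice per query.

-- Statistics on an external table are not created automatically.
CREATE STATISTICS Stat_SensorData_Speed ON SensorData (Speed) WITH FULLSCAN;

-- Force (or, with DISABLE EXTERNALPUSHDOWN, prevent) the MapReduce push-down.
SELECT AVG(Speed), NationKey
FROM SensorData
WHERE Speed > 65
GROUP BY NationKey
OPTION (FORCE EXTERNALPUSHDOWN);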
58. -- select and aggregate on external table (sensor data in HDFS)
SELECT AVG(Speed), NationKey FROM SensorData
WHERE Speed > 65 GROUP BY NationKey;

Execution plan:
Run MR job on Hadoop: apply the filter and compute the aggregate.
What happens?
Step 1: The query optimizer compiles the predicate into Java.
Step 2: The engine sends the MR job to the Hadoop cluster. The output is left in hdfsTemp.

hdfsTemp:
<US, 75>
<FRA, 67>
<UK, 72>
59. -- select and aggregate on external table (sensor data in HDFS)
SELECT AVG(Speed), NationKey FROM SensorData
WHERE Speed > 65 GROUP BY NationKey;

Execution plan:
1. The query optimizer made a cost-based decision about which operators to push.
2. The predicate and the aggregation are pushed to the Hadoop cluster as a MapReduce job.

Run MR job on Hadoop: apply the filter and compute the aggregate over SensorData; the output is left in hdfsTemp.
CREATE temp table T: on the compute nodes.
IMPORT hdfsTemp: read hdfsTemp into T.
RETURN OPERATION: read from T and do the final aggregation.

hdfsTemp:
<US, 75>
<FRA, 67>
<UK, 72>
60. PolyBase (in SQL Server APS)
Simplicity: query data in Hadoop and/or data in APS via standard T-SQL.
Highest possible performance: parallelized data transfers between PDW and the Hadoop cluster; push-down of SQL operations to Hadoop.
Open: compatible with the most popular Hadoop distributions for Linux and Windows.
Full integration with Microsoft Office & BI: Excel PowerPivot, Power View, Cognos, SQL Server Reporting Services, and Analysis Services.
Editor's notes
Customers need a way to easily work with big data as well as relational data.
In the last few years, there has been an enormous increase in semi-structured data. THERE'S A LOT OF DATA IN THE WORLD! For three key reasons!
Data has gravity.
PolyBase was first created 3 years ago for SQL Server PDW because we thought that customers needed a better way of combining relational data and HDFS data in a parallel, scalable way.
The technology was tried, tested, and hardened in PDW, and then brought into SQL Server with CTP2 in 2015.
At the same time, PolyBase was enabled in SQL DW, the PaaS cloud offering of SQL Server PDW.
In the cloud, customers use PolyBase every day to load data into SQL DW.
Now we are happy to announce that PolyBase will be launched with SQL Server 2016 this year.
Load data:
- Use Hadoop as ETL.
- Bring data in once for intensive processing; this reduces network latency.
Interactively query data:
- Combine relational data in SQL Server and semi-structured data in Hadoop with the full power of T-SQL.
Age out data:
- Server storage is expensive and management is hard. Offload some of the cold data to Hadoop or Azure blob storage.
- Use Hadoop or WASB as an archival location.
- Table-level backup. Allows an offsite duplicate of the data in append-only format to ensure that nothing bad happens during deployments.
We are going to go through three customer scenarios, one for APS, one for SQL Server, and one for SQL DW, that we find to be great examples of the usefulness of PolyBase and how customers can use relational data and semi-structured data to truly revolutionize their markets.
The two things that I was most excited about when I turned 25 was being able to get a rental car without the extra fee and the drop in my car insurance payments. I know it’s sad, but I was really excited about these two things. Both of which originate from the same problem. How do car insurance companies minimize their risk while ensuring the best customer value?
In the past, this was done by joining a bunch of relational data together like customer demographic information, insurance claims, and maybe even some anecdotal evidence for color. This process would create an equation where parameters would be put in and a price would come out that the customer has to pay. It just so happens that as a young single guy, my price was quite high.
Which had some interesting consequences on my side.
I felt that all I had to do to keep my insurance at the current level was not get a speeding ticket and not get in a wreck. It had become a game in the worst way: how much could I do without getting caught? Which, for an insurance company, isn't the game you want your customers playing. You want them to be "good" drivers, thus reducing claims across the board.
While I was being a secret bad driver, I was also hunting around for the cheapest insurance I could find because I had no way to get a cheaper rate with the same company.
With PolyBase, sensor data, and relational data, this whole dynamic changes.
Insurance companies can install sensors into individual drivers' cars, and that data is sent to HDFS. Your individual driving data is analyzed to determine how good or bad of a driver you are based on a set of criteria, and you get a personalized insurance rate based on good driving practices.
By becoming their ideal customer, they reward you with a lower rate, probably the lowest rate you’d find on the market. This develops loyalty from good drivers, who in turn are good insurance customers. It’s a win-win for the insurance company and their customers.