2. Maximiliano Accotto
Owner, TriggerDB Consulting SRL | www.triggerdb.com
Microsoft MVP Data Platform since 2005, member of SQLPass Argentina, and speaker for Microsoft at various events since 2005.
Data Platform specialist with more than 15 years of experience in tuning, security, migrations, development, BI, high availability, Big Data, and Machine Learning.
https://twitter.com/maxiaccotto
https://www.linkedin.com/in/maxiaccotto/
3. About TriggerDB Consulting
Founded in Argentina in 2005, we have since helped organizations around the world manage, and train on, Microsoft data platforms (SQL Server, Power BI, Azure, Big Data, etc.), with the goal of passing on to our clients the knowledge we acquire day to day.
We are a certified Microsoft Partner in Data Platform, Data Analytics, and Power BI.
Contact: www.triggerdb.com
Info@triggerdb.com
https://www.facebook.com/triggerdb/
4. What Is Machine Learning?
It is a branch of artificial intelligence whose goal is to develop techniques that allow computers to learn.
6. Why ML in SQL Server?
Eliminate data movement
Operationalize ML scripts and models
Enterprise performance and scalability
SQL Transformations
Relational data
Analytics library
8. SQL Server Machine Learning Services
• SQL Server 2016
  • R support (version 3.2.2)
  • Microsoft R Server
• SQL Server 2017
  • Native scoring in T-SQL using the PREDICT function (+ Linux support)
  • EXTERNAL LIBRARY DDL for managing R packages
  • Batch execution for input data
  • R support (version 3.3.3)
  • Python support (Python 3.5.2, Anaconda distribution)
9. Machine Learning Server
• Multi-platform support
  • Windows, Linux, Hadoop, SQL Server
• Microsoft R Server
  • RevoScaleR, MicrosoftML, olapR, sqlrutils packages
  • Web services for operationalization
• Microsoft Machine Learning Server
  • R and Python support
  • revoscalepy, microsoftml Python libraries
  • rxExecBy
10. ScaleR Performance Comparison
US flight data for 20 years
Linear Regression on Arrival Delay
Run on 4 core laptop, 16GB RAM & 500GB SSD
12. Application Developer - Model Operationalization
The stored procedure contains R or Python code and executes in-database:
exec sp_execute_external_script
@language = N'Python'
, @script =
-- Python code --
[Diagram: (1) the application makes a stored procedure call to SQL Server; (2) SQL Server runs the script in the R/Python runtime of Machine Learning Services; (3) the results are returned to the application.]
13. DBA Task: Enable ML in SQL Server
Enable external scripts
– EXEC sp_configure 'external scripts enabled', 1
– RECONFIGURE
Requires SQL Server 2016 or later
14. sp_execute_external_script
EXEC sp_execute_external_script
@language = N'R',
@script = N'[Code]',
@input_data_1 = N'[SQL input]'
[ , @input_data_1_name = N'InputDataSet' ]
[ , @output_data_1_name = N'OutputDataSet' ]
[ , @params = N'parameter' ]
WITH RESULT SETS (([SQL output]));
@input_data_1_name and @output_data_1_name are optional and default to InputDataSet and OutputDataSet, respectively.
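As a rough illustration of how the pieces of this call fit together, the Python sketch below assembles the statement text from its parts. The helper names (tsql_literal, build_external_script_call), the sample query, and the result column definition are assumptions made for this example and are not part of the deck. The point it shows is that each argument is a Unicode string literal (N'...') written with straight quotes, and that any single quote inside the script or the input query must be doubled.

```python
def tsql_literal(text: str) -> str:
    """Return text as a T-SQL Unicode literal (N'...'), doubling single quotes."""
    return "N'" + text.replace("'", "''") + "'"


def build_external_script_call(language: str, script: str,
                               input_query: str, result_columns: str) -> str:
    """Assemble the EXEC statement text (hypothetical helper, illustration only)."""
    return (
        "EXEC sp_execute_external_script\n"
        f"    @language = {tsql_literal(language)},\n"
        f"    @script = {tsql_literal(script)},\n"
        f"    @input_data_1 = {tsql_literal(input_query)}\n"
        f"WITH RESULT SETS (({result_columns}));"
    )


# Illustrative values: the query and the column definition are assumptions.
# The script simply echoes the input data set back as the output data set.
statement = build_external_script_call(
    language="Python",
    script="OutputDataSet = InputDataSet",
    input_query="SELECT TOP (10) name FROM sys.objects WHERE type = 'U'",
    result_columns="name nvarchar(128)",
)
print(statement)
```

This only shows the shape of the statement and the quoting rules; values that come from users should be passed through @params rather than concatenated into the text.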
15. Operationalized R
EXEC sp_execute_external_script
@language = N'R',
@script = N'[R code goes here]',
@input_data_1 = N'[SQL input]'
[ , @input_data_1_name = N'InputDataSet' ]
[ , @output_data_1_name = N'OutputDataSet' ]
[ , @params = N'parameter' ]
WITH RESULT SETS (([SQL output]));
If the output is a model or a plot, specify varbinary(max) in WITH RESULT SETS.
16. Output Types
1. Dataset
• Standard resultset of rows and columns
• Data types will vary
2. Plot
• Static images
• Binary
3. Model
• Trained models such as linear regression, naive Bayes, etc.
• Binary
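To make the "model as binary" output type concrete, here is a minimal Python sketch, assuming a toy least-squares model and the standard pickle module as the serializer. The training data and the fit_line helper are made up for illustration. It shows a trained model being turned into bytes (the kind of value a varbinary(max) column would hold) and later restored to score a new value.

```python
import pickle


def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return {"slope": slope, "intercept": mean_y - slope * mean_x}


# Toy training data (made up for illustration): y = 2x + 1 exactly.
model = fit_line([1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0])

# Serialize the trained model to bytes; this is the binary form
# that could be stored in, or returned as, a varbinary(max) value.
model_bytes = pickle.dumps(model)

# Later: restore the model from the stored bytes and score a new input.
restored = pickle.loads(model_bytes)
prediction = restored["slope"] * 5.0 + restored["intercept"]
print(type(model_bytes).__name__, prediction)
```

Real models produced with RevoScaleR or revoscalepy would use their own serialization functions; pickle is used here only because it ships with Python.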
20. Real-Time Predictions Using Native Scoring
• PREDICT function
  • Does not depend on the R or Python runtime
  • Available in SQL Server on both Windows and Linux
• Use cases
  • Scoring a single row or a small number of rows
  • Highly concurrent scoring scenarios
  • Predicting during INSERT, UPDATE, and MERGE statements
• Requirements
  • Models built using RevoScaleR or revoscalepy
    • rxLinMod, rxLogit, rxBTrees, rxDTree, rxDForest
  • Serialized using rxSerializeModel (R) or rx_serialize_model (Python)
Microsoft R Server is not limited by the amount of available RAM. When open source R operates on data sets that exceed RAM, it fails. In contrast, Microsoft R Server scales linearly well beyond RAM limits, and its parallel algorithms are much faster.
The figure on the left shows the results of a benchmark comparing open source R (OSR) with a ScaleR algorithm. As the figure shows, OSR fails once the data set exceeds RAM (at about 300K observations in a data frame).
ScaleR, on the other hand, shows no data size limit relative to RAM in the plot: the algorithm scales linearly well beyond the limits of RAM (over 5M observations in a data frame), and the parallel algorithms are much faster.
The table on the right highlights the speed boost Microsoft R Server provides when running a linear regression on 20 years of US flight data. Note also that, with open source R, data sets exceeding the 16 GB of available RAM failed with memory errors.
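One general way an algorithm can handle data larger than RAM is to read it in chunks and keep only running totals. The Python sketch below illustrates that idea for a one-variable linear regression; it is an assumption-laden illustration and not ScaleR's actual implementation. The synthetic data generator, the chunk size, and all function names are made up for the example.

```python
def chunked(rows, chunk_size):
    """Yield successive lists of at most chunk_size rows from an iterable."""
    chunk = []
    for row in rows:
        chunk.append(row)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk


def fit_line_streaming(rows, chunk_size=1000):
    """Fit y = slope * x + intercept while holding one chunk in memory at a time."""
    n = sum_x = sum_y = sum_xx = sum_xy = 0.0
    for chunk in chunked(rows, chunk_size):
        # Per-chunk partial sums; these could also be computed in parallel
        # and combined afterwards, since addition is order independent.
        n += len(chunk)
        sum_x += sum(x for x, _ in chunk)
        sum_y += sum(y for _, y in chunk)
        sum_xx += sum(x * x for x, _ in chunk)
        sum_xy += sum(x * y for x, y in chunk)
    slope = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
    intercept = (sum_y - slope * sum_x) / n
    return slope, intercept


def synthetic_rows(count):
    """Lazily generate (x, y) pairs; stands in for a file or table too large for RAM."""
    for i in range(count):
        yield float(i), 3.0 * i + 2.0


slope, intercept = fit_line_streaming(synthetic_rows(100_000), chunk_size=5_000)
print(slope, intercept)
```

Memory use depends on the chunk size rather than on the total number of rows, which is why this style of algorithm does not fail when the data set outgrows RAM.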