This document covers Big Data and the Azure tools for processing and analyzing large volumes of data. It explains concepts such as Data Lake, HDInsight, and Azure Data Lake Analytics, and describes how these Azure services use open-source technologies such as Apache Hadoop and Spark, together with U-SQL, to process and analyze large data sets at scale in the cloud.
2. #GlobalAzure
What is Big Data?
"Big Data is like teenage sex:
everyone talks about it,
nobody really knows how to do it,
everyone thinks everyone else is doing it,
so everyone claims to be doing it."
Dan Ariely
7. #GlobalAzure
The 3 Azure Data Lake Services
HDInsight: clusters as a service
Analytics: big data queries as a service
Store: hyper-scale storage optimized for analytics
10. #GlobalAzure
Map Reduce
• A divide-and-conquer strategy:
• Map() – split the work into smaller problems
• Reduce() – combine the results
(Diagram: parallel "Do work()" tasks whose partial results are then combined)
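As a concrete illustration of the divide-and-conquer idea above, here is a minimal MapReduce-style word count in Python. The function names (`map_fn`, `reduce_fn`, `map_reduce`) are illustrative only, not part of any Hadoop or Azure API, and the "parallel" phase runs sequentially here for simplicity:

```python
from collections import defaultdict

def map_fn(document):
    """Map(): split the problem -- emit (word, 1) pairs for one document."""
    return [(word, 1) for word in document.split()]

def reduce_fn(key, values):
    """Reduce(): combine the partial results for one key."""
    return key, sum(values)

def map_reduce(documents):
    # "Do work()" phase: map each input chunk independently
    # (in a real cluster these tasks run in parallel on many nodes)
    grouped = defaultdict(list)
    for doc in documents:
        for key, value in map_fn(doc):
            grouped[key].append(value)
    # Combine phase: reduce each group of intermediate values
    return dict(reduce_fn(k, v) for k, v in grouped.items())

counts = map_reduce(["big data", "big clusters"])
print(counts)  # {'big': 2, 'data': 1, 'clusters': 1}
```

The shuffle step (grouping intermediate pairs by key) is what lets the reduce phase also run in parallel, one key per task.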
11. #GlobalAzure
Big Data on Azure
• Infrastructure as a Service (IaaS)
Hadoop on a VM:
Hortonworks
Cloudera
MapR
• Platform as a Service (PaaS)
Azure HDInsight (cluster as a service)
Azure Data Lake Store and Analytics
13. #GlobalAzure
What is a Data Lake?
“A single store of all the data... from raw
data (an exact copy of the source data)
to transformed data used in various ways,
including reporting, visualization,
analytics, and machine learning.”
14. #GlobalAzure
Azure Data Lake
An integrated Big Data storage + analytics platform
Designed from real-world experience
Office 365, Skype, Bing, etc.
Leverage existing technologies and skills
Benefits of a native Azure service
Elasticity: dynamically provision only the resources you need
Virtually unlimited storage capacity
Focus on extracting meaning from the data, not on the infrastructure
17. #GlobalAzure
Azure Data Lake Store
HDFS as a service
Durable storage
Supports a variety of scenarios
◦ High capacity
◦ High frequency
◦ High throughput
Data is stored in its native format
◦ Structured, semi-structured, and unstructured storage formats
18. #GlobalAzure
Azure HDInsight
Managed, scalable, cloud-based Hadoop as a Service
Full complement of Apache technologies:
Spark, Storm, HBase, etc.
Focus on queries and data, not on infrastructure
Pay only for what you use
Leverage existing tools
◦ Hive, Pig, Sqoop, R, etc.
19. #GlobalAzure
Azure Data Lake Analytics
Complements the HDInsight and Hadoop ecosystem.
Scales dynamically to match data size and query
complexity
Built on Apache YARN
The unit of interaction is an analytics job.
U-SQL: a query language rooted in both SQL and C#
20. #GlobalAzure
U-SQL
Based on SQL and C#
◦ C# types and expressions
◦ Tables, views, windowing functions
◦ User-defined functions/operators/aggregates in C#
A typical job:
1. Read data from files/tables/federated sources
2. Transform the rows in a pipeline
3. Output rows to tables or files
21. #GlobalAzure #GABogota
@orders =
EXTRACT
OrderId int,
Customer string,
Date DateTime,
Amount float
FROM "/input/orders.txt"
USING Extractors.Tsv();
OUTPUT @orders
TO "/output/orders_copy.txt"
USING Outputters.Tsv();
Slide callouts: the schema is applied on read; the input is a file in an ADL Store; Extractors.Tsv() gives easy delimited-text handling; @orders is a rowset; OUTPUT reads the input and writes it directly back out (just a simple copy).
22. #GlobalAzure #GABogota
@customers =
SELECT Customer.ToUpper() AS Customer
FROM @orders
WHERE Customer.Contains("Contoso");
Slide callouts: Customer.ToUpper() and Customer.Contains("Contoso") are C# expressions; WHERE is used for filtering. Together they transform the rowset @orders into @customers.
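For readers without an Azure subscription, the same schema-on-read, filter, and transform steps can be sketched in plain Python. The inline TSV data and column layout below are hypothetical, chosen only to mirror the orders.txt example from the slides; this is not how ADL Analytics executes a job:

```python
import csv
import io

# Hypothetical TSV input mirroring "/input/orders.txt" from the slides.
# Columns: OrderId, Customer, Date, Amount.
raw = "1\tContoso Ltd\t2017-04-22\t100.0\n2\tFabrikam\t2017-04-23\t50.0\n"

# EXTRACT ... USING Extractors.Tsv(): the schema is applied on read,
# not when the file was written.
orders = []
for order_id, customer, date, amount in csv.reader(io.StringIO(raw),
                                                   delimiter="\t"):
    orders.append({"OrderId": int(order_id), "Customer": customer,
                   "Date": date, "Amount": float(amount)})

# SELECT Customer.ToUpper() ... WHERE Customer.Contains("Contoso"):
# filter the rowset, then transform the surviving rows.
customers = [o["Customer"].upper() for o in orders
             if "Contoso" in o["Customer"]]
print(customers)  # ['CONTOSO LTD']
```

The point of the comparison is that U-SQL's EXTRACT/SELECT/OUTPUT pipeline is conceptually just read, transform, write, with the schema imposed only at read time.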
33. #GlobalAzure
References:
MVA Iniciando con Big Data
https://mva.microsoft.com/es-es/training-courses/iniciando-con-big-data-16872?l=PumTmRs8C_6405192810
Big Data University
https://bigdatauniversity.com/
Microsoft Azure HDInsight Big Data Analyst - edx
https://www.edx.org/xseries/microsoft-azure-hdinsight-big-data-analyst
Processing Big Data with Hadoop in Azure - edx
https://www.edx.org/course/processing-big-data-hadoop-azure-microsoft-dat202-1x-0
Key point here is that data can be stored in raw source form and then transformed as needed to support various use cases. Data can exist in binary, tabular, document, delimited text, etc. form.
Not mentioned here is one of the key capabilities of ADL… the ability to federate data from external sources and query over it without explicit copying (which might be useful/necessary in situations involving sensitive or frequently changing data).
Key points:
ADL was not invented out of thin air… based on real-world experiences of real-world product teams at Microsoft
Based on open-source technologies to which you can immediately apply existing skills like Pig, Hive, etc… AND/OR you can choose to opt into newer MS-specific offerings like U-SQL that offer unique features and potentially smaller learning curve
ADL abstracts infrastructure so that your analytics storage and jobs scale with your needs… literally Big Data-as-a-service
ADL Store can be used on its own (without ADL Analytics)… it's just “HDFS-in-the-cloud” at its heart.
Similarly, ADL Analytics can be used without touching data in an ADL Store (to query storage blobs, or perhaps against federated SQL Databases).
Of course, they work very well together and will typically be used together.
Federate data from external sources - SQL Data Warehouse, SQL Database, IaaS-hosted SQL Server
Move data into Data Lake Store - Azure Data Factory for ETL, Azure Stream Analytics for streaming data
Power BI for query visualization
Azure Data Catalog for data publishing and discovery
Active Directory for user management and permissions
Beyond ADL Analytics, ADL Store can work as a source data repository for HDInsight; there are plans to further integrate it with other standalone open-source technologies in the Hadoop ecosystem like Spark, Storm, etc.
HDInsight is based on the popular Hortonworks Hadoop platform and provides a wide range of data storage and analysis capabilities, ranging from real-time stream processing to OLTP, predictive modeling, and interactive analytics.
Like ADL Store and ADL Analytics, HDInsight abstracts away infrastructure configuration and management and lets you focus on core data analysis tasks like ingest, transformation, query, and visualization. It scales up or down automatically as data size and query complexity requires.
HDInsight also supports both Windows and Linux cluster types, which maximizes opportunities to port existing code and skillsets.
http://hortonworks.com/apache/yarn/
“YARN is the architectural center of Hadoop that allows multiple data processing engines such as interactive SQL, real-time streaming, data science and batch processing to handle data stored in a single platform, unlocking an entirely new approach to analytics.”
From http://hadoop.apache.org/: “[YARN is] a framework for job scheduling and cluster resource management.”
The above also happens to be a good abstract description of ADL Analytics… it schedules and manages jobs and the cluster of machines used to execute those jobs. In the case of ADL Analytics a “job” is a U-SQL query issued against one or more configured data sources.
Good overview of U-SQL can be found here: http://usql.io/
“U-SQL is built on the learnings from Microsoft’s internal experience with SCOPE and existing languages such as T-SQL, ANSI SQL, and Hive. For example, we base our SQL and programming language integration and the execution and optimization framework for U-SQL on SCOPE, which currently runs hundred thousands of jobs each day internally. We also align the metadata system (databases, tables, etc.), the SQL syntax, and language semantics with T-SQL and ANSI SQL, the query languages most of our SQL Server customers are familiar with. And we use C# data types and the C# expression language so you can seamlessly write C# predicates and expressions inside SELECT statements and use C# to add your custom logic. Finally, we looked to Hive and other Big Data languages to identify patterns and data processing requirements and integrate them into our framework.”