BIG DATA NEWS WITH AZURE HDINSIGHT
MVP Nicolás Nakasone
@nicolasnakasone
THE AZURE DATA LANDSCAPE
AZURE SDK
AZURE DATA FACTORY
AZURE IMPORT EXPORT SERVICE
AZURE CLI
COGNITIVE SERVICES
BOT SERVICE
AZURE SEARCH
AZURE DATA CATALOG
AZURE EXPRESSROUTE
AZURE NETWORK SECURITY GROUPS
AZURE FUNCTIONS
VISUAL STUDIO
OPERATIONS MANAGEMENT SUITE
AZURE ACTIVE DIRECTORY
AZURE KEY MANAGEMENT SERVICE
AZURE STORAGE BLOBS
AZURE DATA LAKE STORAGE
AZURE IOT HUB
AZURE EVENT HUBS
KAFKA ON AZURE HDINSIGHT
AZURE SQL DATA WAREHOUSE
AZURE SQL DB
AZURE COSMOS DB
AZURE ANALYSIS SERVICES
POWER BI
AZURE HDINSIGHT
AZURE DATABRICKS
AZURE STREAM ANALYTICS
AZURE ML
ML SERVER
AZURE DATABRICKS
WHY HDINSIGHT?
Preserve existing investments in Hive, Pig, Storm & HBase
Preference towards managed offering
Multiple workloads (Realtime, Batch, ML etc.)
Similar on-prem and cloud codebase [Apache Open Source]
Row and Column level security
Low cost
AZURE HDINSIGHT
A SECURE AND MANAGED APACHE HADOOP AND SPARK PLATFORM FOR BUILDING DATA LAKES IN THE CLOUD
• The most trusted and compliant platform
Event Processing: Spark Streaming or Storm?
ETL: Pig, Hive or Spark?
Serving Layer: Presto or Hive LLAP?
Storage: Which storage system? How to transfer the data?
Orchestration: ADF/Airflow or Oozie?
Monitoring & Security
Criterion | Spark | Pig | Hive
Designed for | ETL | ETL | Data warehousing
Adoption | High, increasing | Low, decreasing | Stable
Number of connectors | Highest | High | High
Languages | Python, R, Scala, Java, SQL | Pig | SQL
Performance | High | Medium | Medium
STREAMING ENGINE TECHNOLOGY CHOICES
Criterion | Spark Structured Streaming | Storm
Adoption | High, increasing | Decreasing
Event processing guarantee | Exactly once | At least once
Throughput | High | Low
Processing model | Micro-batch | Real-time
Latency | High | Low
Event time support | Yes | Yes
Languages | Python, R, Scala, Java, SQL | Java
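The difference between the two event processing guarantees in the table can be illustrated with a small sketch. It uses neither the Spark nor the Storm API; the event ids, amounts, and function names are hypothetical. It only shows why a redelivered event inflates a result under at-least-once delivery unless the consumer deduplicates.

```python
from typing import Dict, Iterable, Tuple

# (event_id, amount); both fields are hypothetical illustration data.
Event = Tuple[str, int]


def sum_at_least_once(events: Iterable[Event]) -> int:
    """Add every delivery, so a redelivered event is counted again."""
    return sum(amount for _, amount in events)


def sum_effectively_once(events: Iterable[Event]) -> int:
    """Remember processed ids so a redelivered event has no further effect."""
    seen: Dict[str, int] = {}
    for event_id, amount in events:
        seen.setdefault(event_id, amount)  # ignore ids already processed
    return sum(seen.values())


# "e2" arrives twice, as can happen after a retry under at-least-once delivery.
deliveries = [("e1", 10), ("e2", 5), ("e2", 5), ("e3", 1)]
naive_total = sum_at_least_once(deliveries)
deduplicated_total = sum_effectively_once(deliveries)
```

With an at-least-once engine, this kind of idempotent handling is typically the application's responsibility.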
Capability | Hive LLAP | Spark SQL | Presto
Interactive query speed | High | High | Medium
Scale | High | High | Low
Caching | Yes | Yes | Early support
Result caching | Yes | No | No
Intelligent cache eviction | Yes | No | No
Materialized views | Yes | No | No
Complex fact-to-fact joins | Yes | Yes | No
Transactions | Yes | No | No
Query concurrency | High | Low | Low
Row, column level security | Yes [Apache Ranger + AAD] | Medium | Medium
Rich end user tools | Yes | Yes | Yes
Language support | SQL, UDF | SQL, Scala, Python | SQL
Data source connector support | Storage handlers | Data sources | Connectors
Hive Metadata
Spark Metadata
Hive Metadata
Azure HDInsight 3.6 with Hadoop 2.6
Azure HDInsight 4.0 with Hadoop 3.x
Hive Metastore migration tool: https://azure.microsoft.com/en-us/blog/hdinsight-metastore-migration-tool-open-source-release-now-available/
Criterion | ADF | Airflow | Oozie
Service management | Azure PaaS | IaaS VM | HDInsight
Code | JSON | Python | Java
GUI | ADF V2 has great UX | Good UX | Below average UX
Community | Microsoft | Growing (12,133 stars) | Declining (483 stars)
On-demand clusters | Yes | No, but extensible | No
Extensibility | Custom action only | Full, graph + actions | Custom action only
Pipeline definition | JSON/UX | Python/UX | XML/Java/UX
DevOps-first design | Yes | Yes | Yes
Pipeline monitoring | Yes | Yes | Yes
Scheduling | Event, time | Event | Event, time
Data movement
Storage options and tradeoffs
Caching
Transfer time by data quantity and network bandwidth
Data quantity | 45 Mbps (T3) | 100 Mbps | 1 Gbps
1 TB | 2 days | 1 day | 2 hours
10 TB | 22 days | 10 days | 1 day
35 TB | 76 days | 34 days | 3 days
80 TB | 173 days | 78 days | 8 days
100 TB | 216 days | 97 days | 10 days
200 TB | 1 year | 194 days | 19 days
500 TB | 3 years | 1 year | 49 days
1 PB | 6 years | 3 years | 97 days
2 PB | 12 years | 5 years | 194 days
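The order of magnitude of these figures can be checked with a back-of-the-envelope calculation. The sketch below assumes decimal units (1 TB = 10^12 bytes) and a fully saturated link. The `efficiency` parameter is a hypothetical knob for protocol overhead and contention, not something taken from the slide, so the results will not match every table cell exactly.

```python
TB = 10**12  # decimal terabyte, an assumption; the slide does not state its units
SECONDS_PER_DAY = 86_400


def transfer_seconds(data_bytes: float, link_bits_per_second: float,
                     efficiency: float = 1.0) -> float:
    """Idealized transfer time: total bits divided by usable link rate."""
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    return data_bytes * 8 / (link_bits_per_second * efficiency)


def transfer_days(data_bytes: float, link_bits_per_second: float,
                  efficiency: float = 1.0) -> float:
    """Same estimate expressed in days."""
    return transfer_seconds(data_bytes, link_bits_per_second, efficiency) / SECONDS_PER_DAY


one_tb_over_t3_days = transfer_days(1 * TB, 45e6)       # roughly 2 days
one_tb_over_gigabit_s = transfer_seconds(1 * TB, 1e9)   # 8000 s, a little over 2 hours
```

When the estimate runs to months or years, shipping the data offline becomes the practical choice.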
Network transfer with TLS
• Over the Internet
• ExpressRoute
• Data Box online transfer
Shipping data offline
• Data Box offline data transfer
USB 3.1 SSD disks: 8 TB each, order up to 5 in each pack (up to 40 TB)
Ruggedized, self-contained appliances: 100 TB or 1 PB
Use Azure Data Box to migrate data from an on-premises HDFS store to Azure Storage
Storage option | Type | Latency (consistency of latency) | Workloads | Bandwidth | Key benefits
ADLS Gen 2 | Hierarchical | 10-50 ms (Medium) | HDInsight 3.6 & 4.0 | Unconstrained | Atomic rename, file/folder level ACLs
Standard Blob | Object store | 10-50 ms (Medium) | HDInsight 3.6 & 4.0 | Unconstrained | Mature
Premium Blob | Object store | ~5 ms (High) | HBase in preview | Unconstrained | Fast
Premium Managed Disks | Hierarchical | ~5 ms (High) | Kafka, HBase in preview | Based on disk | Consistent latency
ADLS Gen 1 | Hierarchical | 10-100 ms (Low) | HDInsight 3.6 (no HBase) | High | Atomic rename, file/folder level ACLs
REMOTE STORAGE: CACHING OPTIONS
Workload | Caching options | Key benefits
Spark | Spark IO Cache | Up to ~8-10x perf improvements
HBase & Phoenix | Bucket cache | Up to 5-10x perf gains on recently read or written data
Hive + LLAP | LLAP intelligent cache / result cache | Up to ~4-100x gain on cached data
Azure Data Lake Storage
INSTANCE | CORES | RAM | TEMP SSD
D1 v2 | 1 | 3.50 GiB | 50 GiB
D2 v2 | 2 | 7.00 GiB | 100 GiB
D3 v2 | 4 | 14.00 GiB | 200 GiB
D4 v2 | 8 | 28.00 GiB | 400 GiB
D5 v2 | 16 | 56.00 GiB | 800 GiB
• Significant Spark performance speed up with IO cache (up to 9X perf gains)
• Automatic cache resource management
• DRAM + Temp SSD makes large cache pool
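To get a feel for how large the "DRAM + Temp SSD" pool can become, the sketch below sums the local RAM and temporary SSD from the table across a hypothetical number of worker nodes. This is only an upper bound on local capacity. The share that the IO cache actually uses depends on cluster configuration, which the slide does not specify.

```python
from typing import Dict, Tuple

# (RAM GiB, temp SSD GiB) per instance, copied from the table above.
WORKER_SIZES: Dict[str, Tuple[float, float]] = {
    "D1 v2": (3.5, 50.0),
    "D2 v2": (7.0, 100.0),
    "D3 v2": (14.0, 200.0),
    "D4 v2": (28.0, 400.0),
    "D5 v2": (56.0, 800.0),
}


def local_capacity_gib(instance: str, worker_nodes: int) -> float:
    """Upper bound on local RAM + temp SSD across all worker nodes."""
    if worker_nodes < 0:
        raise ValueError("worker_nodes must be non-negative")
    ram_gib, ssd_gib = WORKER_SIZES[instance]
    return worker_nodes * (ram_gib + ssd_gib)


# Hypothetical cluster: four D4 v2 workers.
four_d4_workers_gib = local_capacity_gib("D4 v2", 4)
```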
AZURE HDINSIGHT: ENTERPRISE GRADE SECURITY
DEFENSE IN DEPTH
PERIMETER
Isolate clusters within VNETs
Service Endpoint support for WASB, Azure DB, Cosmos DB
Restrict outbound traffic using NVAs*
AUTHENTICATION
Azure Active Directory
Kerberos with Active Directory
AUTHORIZATION
Role-Based Access Control
Apache Ranger based Access Control
DATA PROTECTION
Encryption on-the-wire with HTTPS enforced
Encryption at Rest using Azure Key Vault
Auditing of all data operations and configuration changes
AZURE HDINSIGHT: AUTHENTICATION & ACCESS CONTROL
Scenario | Authorizing component
YARN: Submit-App | Apache Ranger: YARN plugin
Hive operations: Select, Drop, Index, Lock, Read, Write, Masking, Row level filter on Hive database, table & columns | Apache Ranger: Hive plugin
Create/Alter Table with storage location reference | Apache Ranger + ADLS Gen 2 ACLs
Spark SQL access with Hive Metastore | Apache Ranger: Hive plugin
HBase access policies | Apache Ranger: HBase plugin
Kafka access policies | Apache Ranger: Kafka plugin
Access Azure Data Lake Storage Gen2 using the Spark DataFrame API | ADLS Gen 2 ACLs
Access Azure Data Lake Storage Gen2 using the RDD API | ADLS Gen 2 ACLs
HDFS operations: mkdir, ls, put, copyFromLocal, get, cat, mv, cp etc. | ADLS Gen 2 ACLs
Running MapReduce jobs | ADLS Gen 2 ACLs
Set up Autoscale
Customize to your own scenario
Pay for ONLY what you need
Monitor scaling history easily
Graceful scale down
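The idea behind load-based scaling with a graceful scale down can be sketched as follows. This is not the HDInsight autoscale algorithm. The policy fields, the containers-per-worker figure, and the node names are all hypothetical. The sketch only shows a target size clamped to configured bounds, and a scale down that releases idle nodes only, so running work is not interrupted.

```python
import math
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class AutoscalePolicy:
    """Hypothetical policy; not the service's real configuration schema."""
    min_workers: int
    max_workers: int
    containers_per_worker: int


def target_workers(policy: AutoscalePolicy, running: int, pending: int) -> int:
    """Workers needed for running + pending containers, clamped to the bounds."""
    needed = math.ceil((running + pending) / policy.containers_per_worker)
    return max(policy.min_workers, min(policy.max_workers, needed))


def nodes_safe_to_remove(busy_containers: Dict[str, int], target: int) -> List[str]:
    """Graceful scale down: release only idle nodes, never more than the surplus."""
    surplus = max(0, len(busy_containers) - target)
    idle = sorted(node for node, count in busy_containers.items() if count == 0)
    return idle[:surplus]


policy = AutoscalePolicy(min_workers=2, max_workers=10, containers_per_worker=4)
scale_up_target = target_workers(policy, running=10, pending=7)
removable = nodes_safe_to_remove({"wn0": 3, "wn1": 0, "wn2": 0, "wn3": 1}, target=2)
```

Busy nodes stay in the cluster even when the target is lower, which is the trade-off a graceful scale down accepts: cost drops more slowly, but no running job is killed.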
DR OPTIONS BY WORKLOADS
Workload | DR option
Spark / Hive | Manual, partner solution
HBase | HBase replication, snapshot export, Import/Export, CopyTable
Kafka | MirrorMaker
https://github.com/anagha-microsoft/hdi-spark-dr
https://github.com/anagha-microsoft/hdi-kafka-dr
https://docs.microsoft.com/en-us/azure/hdinsight/hbase/apache-hbase-backup-replication
HDINSIGHT MONITORING OPTIONS
Apache Ambari
Azure Log Analytics integration
HDInsight cluster metrics
DEMO TIME