Azure Data Platform Services
HDInsight Clusters in Azure
Data Storage: Apache Hive, Apache Hbase, Azure Data Catalog
Data Transformations: Apache Storm, Apache Spark, Azure Data Factory
Healthcare / Life Sciences Use Cases
Azure Data Platform Services
HDInsight Clusters in Azure
Data Storage: Apache Hive, Apache Hbase, Azure Data Catalog
Data Transformations: Apache Storm, Apache Spark, Azure Data Factory
Healthcare / Life Sciences Use Cases
1.
Azure Data Platform Services
Mostafa Elzoghbi
http://mostafa.rocks
Twitter: @MostafaElzoghbi
2.
Session takeaways and objectives
• Azure Data Platform Services
• HDInsight Clusters in Azure
• Data Storage: Apache Hive, Apache Hbase, Azure Data Catalog
• Data Transformations: Apache Storm, Apache Spark, Azure Data Factory
• Healthcare / Life Sciences Use Cases
4.
HDInsight is a cloud implementation on Microsoft Azure of the rapidly expanding Apache
Hadoop technology stack that is the go-to solution for big data analysis.
It includes implementations of Apache Spark, HBase, Storm, Pig, Hive, Sqoop, Oozie,
Ambari, and so on.
HDInsight also integrates with business intelligence (BI) tools such as Power BI, Excel, SQL
Server Analysis Services, and SQL Server Reporting Services.
HDInsight is available on Windows and Linux
HDInsight on Linux: A Hadoop cluster on Ubuntu
HDInsight on Windows: A Hadoop cluster on Win Server 2012 R2
What is HDInsight
5.
HDInsight provides cluster Types & custom configurations for:
• Hadoop (HDFS)
• HBase
• Storm
• Spark
• R Server
• Hive Interactive, Kafka (Preview)
HDInsight has powerful programming extensions for languages including C#, Java,
and .NET.
Business Value Proposition: Skip maintaining and purchasing hardware
HDInsight clusters on Azure
7.
Apache HBase is an open-source, NoSQL database that is built on Hadoop and modeled after
Google BigTable.
HBase provides random access and strong consistency for large amounts of unstructured and
semistructured data in a schemaless database organized by column families
Data is stored in the rows of a table, and data within a row is grouped by column family.
The open-source code scales linearly to handle petabytes of data on thousands of nodes. It can
rely on data redundancy, batch processing, and other features that are provided by distributed
applications in the Hadoop ecosystem.
What is HBase
8.
Order No Customer
Name
Customer
Phone
Company Name Company
Address
12012015 Mostafa 101-232-2345 Microsoft Redmond, WA
Customer Company
Order No Customer
Name
Customer
Phone
Company Name Company
Address
12012015 Mostafa 101-232-2345 Microsoft Redmond, WA
9.
HBase Commands:
create Equivalent to Create table in T-SQL
get Equivalent to Select statements in T-SQL
put Equivalent to Update, Insert statement in T-SQL
scan Equivalent to Select (no where condition) in T-SQL
HBase shell is your query tool to execute in CRUD commands to a HBase cluster.
Data can also be managed using the HBase C# API, which provides a client library on top
of the HBase REST API.
An HBase database can also be queried by using Hive using SQLHive.
What is HBase
10.
Apache Hive is a data warehouse system for Hadoop, which enables data summarization,
querying, and analysis of data by using HiveQL (a query language similar to SQL).
Hive understands how to work with structured and semi-structured data, such as text files
where the fields are delimited by specific characters.
Hive also supports custom serializer/deserializers (SerDe) for complex or irregularly
structured data.
Hive can also be extended through user-defined functions (UDF).
A UDF allows you to implement functionality or logic that isn't easily modeled in HiveQL.
Support for multiple execution engine: WebHCat, HiveServer2 (faster) or Tez.
What is Hive
11.
Apache Storm is a distributed, fault-tolerant, open-source computation system that allows
you to process data in real-time with Hadoop.
Apache Storm on HDInsight allows you to create distributed, real-time analytics solutions in
the Azure environment by using Apache Hadoop.
Ability to write Storm components in C#, JAVA and Python.
Azure Scale up or Scale down without an impact for running Storm topologies.
Ease of provision and use in Azure portal & development templates in Visual Studio.
What is Apache Storm
12.
Apache Storm apps are submitted as Topologies.
A topology is a graph of computation that processes streams
Stream: An unbound collection of tuples. Streams are produced by spouts and bolts, and
they are consumed by bolts.
Tuple: A named list of dynamically typed values.
Spout: Consumes data from a data source and emits one or more streams.
Bolt: Consumes streams, performs processing on tuples, and may emit streams. Bolts are
also responsible for writing data to external storage, such as a queue, HDInsight, HBase, a
blob, or other data store.
Nimbus: JobTracker in Hadoop that distribute jobs, monitoring failures.
Apache Storm Components
13.
Apache Spark™ is a fast and general engine for large-scale data processing.
Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.
Write applications quickly in Java, Scala, Python, R.
Combine SQL, streaming, and complex analytics.
Spark's in-memory computation capabilities
make it a good choice for iterative algorithms in
ML and graph computations.
Support for R Server & Azure Data Lake.
What is Apache Spark
15.
Azure Data Lake includes all the capabilities required to make it easy for developers, data
scientists, and analysts to store data of any size, shape, and speed, and do all types of
processing and analytics across platforms and languages.
It removes the complexities of ingesting and storing all of your data while making it faster
to get up and running with batch, streaming, and interactive analytics.
Azure Data Lake works with existing IT investments for identity, management, and security
for simplified data management and governance.
Data Lake Analytics—a no-limits analytics job service to power intelligent action
Azure Data Lake (ADL)
17.
Azure Data Lake Analytics is a new service, built to make big data analytics easy.
Key capabilities:
Dynamic scaling: do analytics on terabytes or even exabytes of data
Develop faster, debug, and optimize smarter using familiar tools: U-SQL Jobs.
U-SQL: simple and familiar, powerful, and extensible: SQL language with the power of
expressive C#, process & analyze data with skills you already have.
Affordable and cost effective.
Works with all your Azure Data: Data Lake Analytics is optimized to work with Azure
Data Lake - providing the highest level of performance, throughput, and parallelization
for your big data workloads. Data Lake Analytics can also work with Azure Blob
storage and Azure SQL database.
Azure Data Lake Analytics
20.
Provider of enzymes for Food, Pharma Industry
Pharmaceuticals
Global provider of natural
ingredients such as cultures and
enzymes for the food,
pharmaceutical, nutritional, and
agricultural industries.
Part 1: What They Did | Gene Analysis
Challenge
Collect clinical trial data from electronic laboratory notebooks (ELNs)
Collect structured and unstructured data from sources like automated equipment (robots,
temperature sensors, and other devices)
Need to do analysis and look for patterns on this data
Solution
Chose Azure HDInsight, SQL Server on-premises
Examine patterns in chemical composition and physical properties of cultures and
enzymes e.g. examine the gene composition of our bacteria and how it actually
behaves in yogurt
Gene Analysis
21.
BK1
Provider of enzymes for Food, Pharma Industry
Part 2: How They Did It | Gene Analysis
How They Did It
Collect data in Azure Blobs
• Extract data from all the files (GBs in size)
• Transpose into JSON using .NET
HDInsight processes data for insights
• Hive is used to run queries
• Mainly Select statements (joins/unions)
• Maximum 20 lines of code
Use SQL Server for reporting
Use Hive ODBC connector
Gene Analysis
SQL Server
On-premises
Lab Information
Management
System
Automated equipment
(robots, temperature sensors, etc)
22.
Healthcare Application Provider
Healthcare
Leading provider of healthcare
software applications for clinical,
financial, pharmacy, etc.
Part 1: What They Did | Data Lake to deliver better analytics
Challenge
Multiple healthcare applications operating in product silos
Want to implement a data lake and provide distributed processing for better analytics and
predictions for their end customers:
• Population, risk, and Care management
• Clinical decision support using predictions
• Real time quality measures to assist providers reach their regulatory requirements
• Capacity management predictions
• Enrich clinical data with NLP on unstructured physicians notes to reduce over/under treatment,
readmissions, and faster claims processing
Solution
Chose Azure HDInsight, HBase, and custom .NET application
Begin by storing all medical records in Azure
Acquire, clean data and insert into data lake
Data Lake
Analytics
23.
BK1
Healthcare Application Provider
Part 2: How They Did It | Data Lake to deliver better analytics
How They Did It
Collect data in Azure Blobs
• Have medical records in JSON format
HDInsight processes data for insights
• MapReduce preprocesses data
• used to insert data into data lake
• Developed .NET application to interface with HBase
Data Lake Analytics
Azure Blobs
Azure HDInsight
Medical Records
in JSON
.NET Application
Insert into Blobs
24.
References
• HDInsight Documentation (R, Storm, Spark, Kafka, ADL,..etc)
https://azure.microsoft.com/en-us/services/hdinsight/
• Spark Programming Guide
http://spark.apache.org/docs/latest/programming-guide.html
• edx.org: Free Apache Spark courses
• Get started with Data Science VMs in Azure
https://blogs.technet.microsoft.com/machinelearning/tag/dsvm/
Los recortes son una forma práctica de recopilar diapositivas importantes para volver a ellas más tarde. Ahora puedes personalizar el nombre de un tablero de recortes para guardar tus recortes.
Crear un tablero de recortes
Compartir esta SlideShare
¿Odia los anuncios?
Consiga SlideShare sin anuncios
Acceda a millones de presentaciones, documentos, libros electrónicos, audiolibros, revistas y mucho más. Todos ellos sin anuncios.
Oferta especial para lectores de SlideShare
Solo para ti: Prueba exclusiva de 60 días con acceso a la mayor biblioteca digital del mundo.
La familia SlideShare crece. Disfruta de acceso a millones de libros electrónicos, audiolibros, revistas y mucho más de Scribd.
Parece que tiene un bloqueador de anuncios ejecutándose. Poniendo SlideShare en la lista blanca de su bloqueador de anuncios, está apoyando a nuestra comunidad de creadores de contenidos.
¿Odia los anuncios?
Hemos actualizado nuestra política de privacidad.
Hemos actualizado su política de privacidad para cumplir con las cambiantes normativas de privacidad internacionales y para ofrecerle información sobre las limitadas formas en las que utilizamos sus datos.
Puede leer los detalles a continuación. Al aceptar, usted acepta la política de privacidad actualizada.