Aucfanlab Datalake
- Big Data Management Platform -
1. Abstract
This report introduces the “Aucfan Datalake”, a cloud-based big data management
platform developed at Aucfan. The Aucfan Datalake provides easy job orchestration
and management for data ingestion, storage and creation of datamarts. While it uses
cutting-edge big data technologies for these workloads, all technical details are
abstracted away from the user, who only has to know a simple high-level interface
for working with his data. The system already handles enormous daily workloads and
caters to a broad range of use cases at Aucfan. Individual scalability of its
components and the use of a cloud-based architecture make it adaptable for all
future workloads.
2. Introduction
The big data ecosystem has evolved at a fast pace with new technologies and
updates to existing technologies being released at an increasing rate in recent years.
The rise of Spark, the influx of new databases, the advent of fully managed cloud
services and similar developments have all contributed to creating a complex and
convoluted ecosystem that is hard to navigate. The user has to learn about different technologies like Hadoop and Spark, about orchestrating his data workloads, and about many other topics. Additionally, he has to manage all of these resources and take care of scaling them to his needs.
To make these technologies easily accessible to internal developers and to spread the use of big data technology in the company, the Datalake project introduced here was born.
3. Design goals
Based on the above-mentioned difficulties, the following design goals were identified for the system:
- Enable the user to easily schedule data workloads
- Provide a scalable central repository for the company’s data
- Abstract all big data technology related details away
- Let the user work with high-level concepts like “datasets”, “import”, “export” etc.
4. High-level system overview
Given the goals stated above, a system design was conceived that encapsulates the actual implementation from the user and provides an accessible API. From a bird’s-eye view the final system can be described as three parts: the data that the user owns and wants to work with; the user, who uses the Datalake API to execute data workflows; and finally the Datalake, which provides the API and implements the respective actions.
The provided API features entry points for importing data from the user’s data sources, managing metadata, and exporting datasets to internal and external datamarts.
Additional features include scheduling workflows, categorizing data and partitioning.
5. API workflow for user
In this section the use of the API is explained from the user’s point of view. The user
works exclusively with the API and its high-level concepts and thus does not need to
know anything about the internally used big data technologies. The high-level concepts represent, among other things, data sources, datasets, import and export jobs, as well as a dataset-specific parser.
A detailed description of a typical workflow from importing the data to exporting it to
a specific datamart follows:
1. The user registers a datasource by defining the type of storage, location and
necessary authentication information for his data. This might, for example, be an Azure Blob Storage account.
2. He then creates a dataset, which might include metadata like a description or tags. Information like the size of the data and the row count is added automatically later, when the actual data is imported.
3. Next, the user creates an Importer object, which contains a reference to the datasource created above, to schedule the data import job. Additional options defining the temporal partitioning of the data may be added.
4. In the final step for the import workflow the user creates a Parser object
specific to his dataset. This parser contains information regarding the
format of the data, its schema and custom parsing instructions.
For example, for parsing CSV files the user would set CSV as the data
format, set CSV-related properties like the column separator or escape
character and then define the schema of his data.
5. Once the data is parsed and stored as a dataset it can later be exported to
various kinds of datamarts. For this the user only needs to create an
Exporter object that references a datasource (for example an S3 bucket or
Azure SQL database) as the sink.
The user can monitor the execution of each of the above steps to check whether a job is still running, has succeeded or has failed. In the latter case, he can access the error logs of the respective job.
Since all of the above actions are simple REST-API calls, shell tools and a web-based GUI were built on top of them to allow users to interact with the system.
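To make this concrete, the sketch below shows how the five workflow steps might look as plain REST calls issued from Java. The endpoint paths, JSON fields and values are illustrative assumptions and do not reflect the actual Datalake API.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical sketch of the five workflow steps as plain REST calls.
// Endpoint paths and JSON payloads are assumptions for illustration only.
public class DatalakeWorkflowSketch {

    private static final String BASE = "https://datalake.example.internal/api"; // placeholder host
    private static final HttpClient CLIENT = HttpClient.newHttpClient();

    static String post(String path, String json) throws Exception {
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(BASE + path))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(json))
                .build();
        return CLIENT.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }

    public static void main(String[] args) throws Exception {
        // 1. Register a datasource (e.g. an Azure Blob Storage account).
        post("/datasources", "{\"type\":\"AZURE_BLOB\",\"account\":\"mystore\",\"container\":\"raw\",\"key\":\"...\"}");
        // 2. Create a dataset with descriptive metadata.
        post("/datasets", "{\"name\":\"auction_items\",\"description\":\"Daily EC crawl\",\"tags\":[\"ec\",\"daily\"]}");
        // 3. Create an Importer referencing the datasource, with temporal partitioning options.
        post("/importers", "{\"dataset\":\"auction_items\",\"datasource\":\"mystore\",\"partitionBy\":\"DAY\"}");
        // 4. Attach a Parser describing the data format and schema.
        post("/parsers", "{\"dataset\":\"auction_items\",\"format\":\"CSV\",\"separator\":\",\","
                + "\"schema\":[{\"name\":\"price\",\"type\":\"LONG\"}]}");
        // 5. Export the parsed dataset to a datamart sink (e.g. an Azure SQL database).
        post("/exporters", "{\"dataset\":\"auction_items\",\"sink\":\"azure_sql_datamart\"}");
    }
}
```

The shell tools and the web-based GUI mentioned above essentially wrap calls of this kind.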
6. System architecture
The technical architecture of the system encompasses a Java application for the API and system logic as well as an ensemble of cloud services for orchestration and execution of the data workflows. Here, the exclusive use of cloud services allows for a cost-effective and scalable solution. Microsoft Azure was chosen as the primary
cloud provider for the Datalake, as it provides an extensive portfolio of data services
and integrates well with other systems in the company.
6.1 REST-API
The Datalake project and other projects in the company are designed according
to a microservice architecture with a REST-API server as interface between the
services and users. The API server is implemented as a Spring-based Java application and deployed using Azure as a Platform as a Service (PaaS) provider. For the application’s persistence layer and the main metadata repository, a NoSQL database was chosen. This allows for adaptability to changes and new features as well as easy scaling.
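As a minimal sketch of this setup (with hypothetical class, endpoint and repository names that do not reflect the actual Datalake code), a Spring-style controller for dataset metadata might look as follows:

```java
import org.springframework.web.bind.annotation.*;
import java.util.List;
import java.util.Map;

// Minimal sketch of a Spring REST controller in the style described above.
// Class, path and repository names are illustrative assumptions.
@RestController
@RequestMapping("/api/datasets")
public class DatasetController {

    private final DatasetRepository repository; // hypothetical NoSQL-backed repository

    public DatasetController(DatasetRepository repository) {
        this.repository = repository;
    }

    @PostMapping
    public Map<String, Object> create(@RequestBody Map<String, Object> dataset) {
        // Persist dataset metadata (description, tags, ...) in the NoSQL store.
        return repository.save(dataset);
    }

    @GetMapping
    public List<Map<String, Object>> list() {
        return repository.findAll();
    }
}

// Hypothetical repository interface; in practice an implementation backed by the NoSQL database would be provided.
interface DatasetRepository {
    Map<String, Object> save(Map<String, Object> dataset);
    List<Map<String, Object>> findAll();
}
```

Constructor injection keeps the controller decoupled from the concrete store, which is what makes adapting or scaling the persistence layer straightforward.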
6.2 Orchestration
The main job orchestration and monitoring is implemented using Azure’s Data
Factory service. It provides methods to manage and monitor various pipelines for
moving data or executing Hadoop / Spark jobs. Creation of pipelines and resources
in Azure Data Factory is done by REST-API calls from the Datalake’s Java backend server to the Azure Resource Management API.
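The general pattern of such a call is sketched below: the pipeline definition is sent as JSON to the Azure Resource Management endpoint with an Azure AD bearer token. The resource path, API version, token and payload are placeholders, since the exact values depend on the Data Factory version and the pipeline being created.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hedged sketch: a Data Factory pipeline is created by PUT-ing its JSON definition
// to the Azure Resource Management API. Resource path, api-version and token are
// placeholders; the real values depend on the Data Factory version in use.
public class PipelineDeployerSketch {

    public static void main(String[] args) throws Exception {
        String armUrl = "https://management.azure.com"
                + "/subscriptions/SUBSCRIPTION_ID/resourceGroups/RESOURCE_GROUP"
                + "/providers/Microsoft.DataFactory/FACTORY_AND_PIPELINE_PATH"
                + "?api-version=API_VERSION";

        // Placeholder pipeline definition; real pipelines would list copy or Hive activities here.
        String pipelineJson = "{ \"properties\": { \"activities\": [] } }";
        String accessToken = "AZURE_AD_TOKEN"; // obtained via Azure Active Directory in practice

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(armUrl))
                .header("Authorization", "Bearer " + accessToken)
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(pipelineJson))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```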
6.3 Data Storage
All data is stored in Azure Blob Storage, which allows the storage and the processing power of the Hadoop / Spark clusters to be scaled independently of each other. The system’s main storage is divided into two logical locations:
1. The “Datalake”, which stores raw data that is imported from various sources as-is
2. The “DataPlatform”, which serves as a structured and optimized data store
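A purely illustrative layout of these two locations is sketched below; the account, container and path naming conventions are assumptions and not the system’s actual conventions.

```java
// Illustrative only: hypothetical blob paths for the two logical storage locations.
public final class StorageLayoutSketch {
    // "Datalake": raw, as-imported data, organized by dataset and import date.
    public static final String RAW_PATH =
            "wasbs://datalake@ACCOUNT.blob.core.windows.net/auction_items/2017/01/01/";
    // "DataPlatform": parsed, schema-conformant data (gzipped Avro behind external Hive tables).
    public static final String PARSED_PATH =
            "wasbs://dataplatform@ACCOUNT.blob.core.windows.net/auction_items/";

    private StorageLayoutSketch() { }
}
```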
6.4 Hive
Hive serves as the main big data warehouse and is used to parse and store datasets according to the format and schema information provided by the user. Parsed datasets are persisted as external Hive tables, with the actual data being stored as gzipped Avro files. This allows for cost-effective storage and efficient
access to the data.
The schema of these Hive tables is defined by mapping the schema that the user
provided to the corresponding Hive datatypes. Here, the user-defined schema uses
common datatypes defined for the Datalake, which are then mapped to the
corresponding datatypes for each used data store (e.g. Hive, Azure SQL, etc.).
Additionally, custom parsing instructions in the form of HiveQL statements can be
defined on a per-column basis.
Data that is parsed and stored like this allows for fast access and can directly be
used for the datamart export function.
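The sketch below illustrates this mapping under stated assumptions: a hypothetical translation table from common Datalake datatypes to Hive datatypes is used to assemble a CREATE EXTERNAL TABLE statement over Avro-backed data. The type names, the example table and the storage location are illustrative, not the actual Datalake definitions.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

// Sketch of the schema mapping described above: common Datalake datatypes are
// translated to Hive datatypes and an external, Avro-backed table is declared.
// Type names, table name and location are illustrative assumptions.
public class HiveSchemaSketch {

    // Hypothetical mapping from Datalake datatypes to Hive datatypes.
    private static final Map<String, String> TO_HIVE = Map.of(
            "STRING", "string",
            "LONG",   "bigint",
            "DOUBLE", "double",
            "DATE",   "timestamp");

    public static String createTableStatement(String table, Map<String, String> schema, String location) {
        String columns = schema.entrySet().stream()
                .map(e -> e.getKey() + " " + TO_HIVE.get(e.getValue()))
                .collect(Collectors.joining(", "));
        // External table over gzipped Avro files, partitioned by a temporal column.
        return "CREATE EXTERNAL TABLE " + table + " (" + columns + ") "
                + "PARTITIONED BY (dt string) STORED AS AVRO LOCATION '" + location + "'";
    }

    public static void main(String[] args) {
        Map<String, String> schema = new LinkedHashMap<>();
        schema.put("item_id", "LONG");
        schema.put("title",   "STRING");
        schema.put("price",   "LONG");
        System.out.println(createTableStatement("auction_items", schema,
                "wasbs://dataplatform@ACCOUNT.blob.core.windows.net/auction_items/"));
    }
}
```

Because the table is external, dropping it only removes the table definition while the underlying Avro files remain in blob storage.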
6.5 Hadoop / Spark cluster
All Hive and Hadoop jobs are run on a fully managed and scalable Azure HDInsight cluster. For this, a Linux cluster running Spark is deployed. Hive queries are executed with vectorization enabled, and Tez is used as the execution engine for better performance. To further improve performance, data is partitioned according to its temporal characteristics.
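As an illustration, the settings mentioned here correspond to standard Hive session properties. The sketch below applies them over the Hive JDBC driver (assumed to be on the classpath) and runs a query that prunes on a temporal partition; the cluster host, table and partition column are placeholders.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

// Sketch of the performance settings mentioned above, applied per session over
// the Hive JDBC driver. The cluster host, table and partition column are placeholders.
public class HiveSessionSketch {

    public static void main(String[] args) throws Exception {
        String url = "jdbc:hive2://CLUSTER_HOST:10000/default"; // placeholder HiveServer2 endpoint
        try (Connection conn = DriverManager.getConnection(url, "user", "");
             Statement stmt = conn.createStatement()) {
            // Run on Tez with vectorized query execution (standard Hive settings).
            stmt.execute("SET hive.execution.engine=tez");
            stmt.execute("SET hive.vectorized.execution.enabled=true");
            // Temporal partitioning lets the query prune to the requested date.
            stmt.execute("SELECT count(*) FROM auction_items WHERE dt = '2017-01-01'");
        }
    }
}
```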
6.6 Internal datamarts
The system additionally manages internal Azure SQL Databases and an Azure
SQL Data Warehouse instance. These can be used for datamarts created from
datasets.
6.7 Cloud services
An advantage of using cloud services exclusively is that all components can be scaled individually. This allows appropriate scaling according to the performance needs of each use case. The cluster is a scalable HDInsight cluster whose number of nodes can easily be increased or decreased. Additionally, the data storage automatically adjusts to its usage. The same is true for the Azure SQL Data Warehouse, which provides near-infinite scaling for SQL workloads. Overall, this allows the system to react to changing workloads in all of its components.
7. Summary
The Datalake is currently used for several use cases in the company. The most important is its role as the central storage component for the company’s EC market data crawlers. The ingested data is exposed to BI tools through datamarts created from it. Additionally, the Datalake functions as the central repository that gathers the company’s data and as the backend for the “Aucfan Dataexchange”, a web service that allows users to browse the contained data in a web browser.
The multitude of possible use cases and the diverse range of users who can easily manage their data workloads with the system further ensure its adoption for many additional use cases in the future. Finally, the individual scalability of all components provides for a cost-effective and efficient system that handles these use cases with good performance.
