Aucfanlab Datalake
- Big Data Management Platform -
1. Abstract
This report introduces the “Aucfan Datalake”, a cloud-based big data management
platform developed at Aucfan. The Aucfan Datalake provides easy job orchestration
and management for data ingestion, storage, and the creation of datamarts. While it
uses cutting-edge big data technologies for these workloads, all technical details
are abstracted away from users, who only need to know a simple high-level interface
for working with their data. The system already handles enormous daily workloads and
caters to a broad range of use cases at Aucfan. The individual scalability of its
components and its cloud-based architecture make it adaptable to all future
workloads.
2. Introduction
The big data ecosystem has evolved at a fast pace, with new technologies and
updates to existing ones being released at an increasing rate in recent years.
The rise of Spark, the influx of new databases, the advent of fully managed cloud
services, and similar developments have all contributed to a complex and convoluted
ecosystem that is hard for users to navigate. Users have to learn about different
technologies like Hadoop and Spark, about orchestrating their data workloads, and
about many other topics. Additionally, they have to manage all these resources and
take care of scaling them to their needs.
The Datalake project introduced here was born to make these technologies easily
accessible to internal developers and to spread the use of big data technology in
the company.
3. Design goals
Based on the difficulties mentioned above, the following design goals were
identified for the system:
- Enable the user to easily schedule data workloads
- Provide a scalable central repository for the company’s data
- Abstract all big data technology related details away
- Let the user work with high-level concepts like “datasets”, “import”, “export” etc.
4. High-level system overview
Given the goals stated above, a system design was conceived that encapsulates the
actual implementation from the user and provides an accessible API. From a
bird’s-eye view the final system can be described in three parts: the data that the
user owns and wants to work with; the user, who calls the Datalake API to execute
data workflows; and finally the Datalake itself, which provides the API and
implements the respective actions.
The provided API features entry points for importing data from the user’s data
sources, managing metadata, and exporting datasets to internal and external
datamarts. Additional features include scheduling workflows, categorizing data, and
partitioning.
5. API workflow for the user
In this section the use of the API is explained from the user’s point of view. Users
work exclusively with the API and its high-level concepts and thus do not need to
know anything about the big data technologies used internally. The high-level
concepts represent, among other things, data sources, datasets, import and export
jobs, as well as a dataset-specific parser.
A detailed description of a typical workflow from importing the data to exporting it to
a specific datamart follows:
1. The user registers a datasource by defining the storage type, location and
necessary authentication information for the data. This might, for example, be
an Azure Blob Storage account.
2. The user then creates a dataset, which may include metadata such as a
description or tags. Information like data size and row count is added
automatically later, when the actual data is imported.
3. Next, the user creates an Importer object, which contains a reference to the
datasource created above, to schedule the data import job. Additional
options defining the temporal partitioning of the data may be added.
4. In the final step of the import workflow the user creates a Parser object
specific to the dataset. This parser contains information regarding the
format of the data, its schema and custom parsing instructions.
For example, to parse CSV files the user would set CSV as the data
format, set CSV-related properties like the column separator or escape
character and then define the schema of the data.
5. Once the data is parsed and stored as a dataset it can later be exported to
various kinds of datamarts. For this the user only needs to create an
Exporter object that references a datasource (for example an S3 bucket or
an Azure SQL database) as the sink.
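As a concrete illustration, the import workflow above boils down to a handful of JSON payloads sent to the API. The following sketch builds two such payloads; all endpoint paths, field names and values are hypothetical assumptions for illustration, not the actual Datalake API.

```java
// Hypothetical payloads for the Datalake import workflow described above.
// Every endpoint and field name here is an illustrative assumption.
public class WorkflowSketch {

    /** Step 1: register an Azure Blob Storage account as a datasource. */
    static String datasourcePayload() {
        return "{ \"type\": \"azure-blob\","
             + " \"container\": \"raw-data\","
             + " \"accountName\": \"example\","
             + " \"accountKey\": \"<secret>\" }";
    }

    /** Step 4: a CSV parser with format options and a one-column schema. */
    static String parserPayload() {
        return "{ \"format\": \"csv\","
             + " \"separator\": \",\","
             + " \"escapeChar\": \"\\\\\","
             + " \"schema\": [ { \"name\": \"price\", \"type\": \"long\" } ] }";
    }

    public static void main(String[] args) {
        System.out.println("POST /datasources  " + datasourcePayload());
        System.out.println("POST /parsers      " + parserPayload());
    }
}
```

The dataset, importer and exporter payloads would follow the same pattern, each referencing the objects created in the earlier steps.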
The user can monitor the execution of each of the above steps to check whether a
job is still running, has succeeded, or has failed. In the latter case the user can
access the error logs of the respective job.
As the above actions are simple REST-API calls, shell tools and a web-based GUI
were created to let users interact with the system.
6. System architecture
The technical architecture of the system encompasses a Java application for the API
and system logic as well as an ensemble of cloud services for orchestrating and
executing the data workflows. The exclusive use of cloud services allows for a
cost-effective and scalable solution. Microsoft Azure was chosen as the primary
cloud provider for the Datalake, as it provides an extensive portfolio of data
services and integrates well with other systems in the company.
6.1 REST-API
The Datalake project, like other projects in the company, is designed according
to a microservice architecture with a REST-API server as the interface between
services and users. The API server is implemented as a Spring-based Java
application and deployed using Azure as a Platform as a Service (PaaS) provider.
For the persistence layer of the application and the main repository for metadata,
a NoSQL database was chosen. This allows for adaptability to changes and new
features as well as easy scaling.
6.2 Orchestration
The main job orchestration and monitoring is implemented using Azure’s Data
Factory service, which provides methods to manage and monitor various pipelines
for moving data or executing Hadoop / Spark jobs. Pipelines and resources in
Azure Data Factory are created by REST-API calls from the Datalake’s Java backend
server to the Azure Resource Management API.
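A minimal sketch of such a call from the Java backend follows. The subscription, resource group, factory and pipeline names are placeholders, and the `api-version` value is an assumption; the request is only built, not sent, so no Azure AD token is acquired.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Sketch: building a PUT request against the Azure Resource Management API
// to create a Data Factory pipeline. All identifiers are placeholders.
public class PipelineRequestSketch {

    static HttpRequest buildCreatePipeline(String subscription, String resourceGroup,
                                           String factory, String pipeline, String body) {
        URI uri = URI.create("https://management.azure.com/subscriptions/" + subscription
                + "/resourcegroups/" + resourceGroup
                + "/providers/Microsoft.DataFactory/datafactories/" + factory
                + "/datapipelines/" + pipeline + "?api-version=2015-10-01");
        return HttpRequest.newBuilder(uri)
                .header("Authorization", "Bearer <token>")   // token from Azure AD (omitted)
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(body))
                .build();
    }

    public static void main(String[] args) {
        HttpRequest req = buildCreatePipeline("sub-id", "rg", "datalake-df",
                "import-sales", "{ \"properties\": { } }");
        System.out.println(req.method() + " " + req.uri());
    }
}
```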
6.3 Data Storage
All data is stored in Azure Blob Storage, which allows storage to scale on its own
and makes it possible to scale storage and processing power for Hadoop / Spark
clusters separately. The system’s main storage is divided into two logical locations:
1. The “Datalake”, which stores raw data that is imported from various sources as-is
2. The “DataPlatform”, which serves as a structured and optimized data store
6.4 Hive
Hive is used as the main big data warehouse; it parses and stores datasets
according to the format and schema information provided by the user. Parsed
datasets are persisted as external Hive tables, with the actual data stored as
gzipped Avro files. This allows for cost-effective storage and efficient access
to the data.
The schema of these Hive tables is defined by mapping the user-provided schema
to the corresponding Hive datatypes. Here, the user-defined schema uses common
datatypes defined for the Datalake, which are then mapped to the corresponding
datatypes of each data store used (e.g. Hive, Azure SQL, etc.). Additionally,
custom parsing instructions in the form of HiveQL statements can be defined on a
per-column basis.
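The datatype mapping can be pictured as a simple lookup from Datalake common types to Hive types. The type names below are illustrative assumptions, not the actual mapping tables used by the system.

```java
import java.util.Map;

// Sketch of the common-type-to-Hive-type mapping described above.
// The concrete type names are assumptions for illustration.
public class HiveTypeMapping {

    static final Map<String, String> TO_HIVE = Map.of(
            "string", "STRING",
            "long",   "BIGINT",
            "double", "DOUBLE",
            "bool",   "BOOLEAN",
            "date",   "TIMESTAMP");

    /** Renders one column of a user-defined schema as a Hive column definition. */
    static String hiveColumn(String name, String datalakeType) {
        String hiveType = TO_HIVE.get(datalakeType);
        if (hiveType == null) {
            throw new IllegalArgumentException("unsupported type: " + datalakeType);
        }
        return "`" + name + "` " + hiveType;
    }

    public static void main(String[] args) {
        System.out.println(hiveColumn("price", "long"));   // `price` BIGINT
    }
}
```

An analogous table would exist for each other data store, such as Azure SQL, so that one user-defined schema can be rendered for every sink.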
Data that is parsed and stored like this allows for fast access and can directly be
used for the datamart export function.
6.5 Hadoop / Spark cluster
All Hive and Hadoop jobs are run on a fully managed and scalable Azure
HDInsight cluster; for this a Linux cluster running Spark is deployed. Hive queries
are executed vectorized, and Tez is used for better performance. To further improve
performance, data is partitioned according to its temporal characteristics.
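Temporal partitioning amounts to placing each slice of data under a date-derived partition directory, so queries restricted to a time range only scan the matching partitions. The `dt=YYYY-MM-DD` layout below is an assumption; the report does not specify the exact scheme.

```java
import java.time.LocalDate;

// Sketch of temporal partitioning: data lands under Hive-style partition
// directories derived from its date. The dt=... layout is an assumption.
public class PartitionPath {

    static String partitionFor(String dataset, LocalDate day) {
        return dataset + "/dt=" + day;   // LocalDate prints as ISO yyyy-MM-dd
    }

    public static void main(String[] args) {
        System.out.println(partitionFor("market-data", LocalDate.of(2018, 5, 1)));
    }
}
```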
6.6 Internal datamarts
The system additionally manages internal Azure SQL Databases and an Azure
SQL Data Warehouse instance. These can be used for datamarts created from
datasets.
6.7 Cloud services
An advantage of using cloud services exclusively is that all components can be
scaled individually, allowing appropriate scaling according to the performance
needs of each use case. The HDInsight cluster’s node count can easily be grown or
shrunk, and the data storage adjusts to its use. The same is true for the Azure
DWH, which provides near-infinite scale for SQL workloads. Overall, this allows
the system to react to changing workloads in all of its components.
7. Summary
The Datalake is currently used for several use cases in the company, the most
important being its role as the central storage component for the company’s EC
market data crawlers. The ingested data is exposed to BI tools via datamarts.
Additionally, the Datalake functions as the central repository for gathering the
company’s data and as the backend for the “Aucfan Dataexchange”, a web service
that lets users browse the contained data in a web browser.
The multitude of possible use cases and the diverse range of users who can easily
manage their data workloads with the system ensure its future use in many
additional scenarios. Finally, the individual scalability of all components makes
for a cost-effective and efficient system that handles these use cases with good
performance.