The objective of the Finance Data Lake is to create a centralized enterprise data repository for all Finance and Supply Chain data. It serves as the single source of truth and provides a self-service discovery analytics platform on which business users can answer ad hoc business questions and derive critical insights. The data lake is built on the open source Hadoop big data platform and is a very cost-effective way to break down ERP data silos and simplify the enterprise data architecture.
Proofs of concept (POCs) were conducted on an in-house Hortonworks Hadoop data platform to validate cluster performance at production volumes. Based on business priorities, an initial roadmap was defined around 3 data sources, comprising 2 SAP ERPs and PeopleSoft (OLTP systems). A development environment was established in the AWS Cloud for agile delivery. The near-real-time data ingestion architecture for the data lake was defined using replication tools and a custom Sqoop-based micro-batching framework, with data persisted in Apache Hive in ORC format. Data and user security is implemented using Apache Ranger, and sensitive data is stored at rest in encryption zones. Business data sets were developed as Hive scripts and scheduled using Oozie. Connectivity for multiple reporting tools, including SQL tools, Excel and Tableau, was enabled for self-service analytics. Following the successful implementation of the initial phase, a full roadmap has been established to extend the Finance Data Lake to over 25 data sources, to scale up data ingestion, and to enable OLAP tools on Hadoop.
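
As a rough illustration of the Sqoop-based micro-batching idea, the sketch below builds the command line for one incremental pull of a source table. It is not the framework described above, whose internals are not given here. The connection string, table name, check column, high-water mark and staging path are all hypothetical; only the `sqoop import` options themselves are standard Sqoop 1 flags. The sketch assumes each batch lands in its own staging directory and that a later step (not shown) merges the staged rows into the ORC-backed Hive table.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class MicroBatch:
    """One incremental pull of a source table (all values are hypothetical)."""
    jdbc_url: str      # JDBC connection string of the OLTP source
    table: str         # source table to pull from
    check_column: str  # column that changes on every update, e.g. a timestamp
    last_value: str    # high-water mark recorded by the previous batch
    target_dir: str    # HDFS staging directory used only by this batch


def sqoop_import_args(batch: MicroBatch, mappers: int = 4) -> List[str]:
    """Build the argument list for one `sqoop import` run in lastmodified mode."""
    return [
        "sqoop", "import",
        "--connect", batch.jdbc_url,
        "--table", batch.table,
        "--incremental", "lastmodified",       # pull only rows changed since last_value
        "--check-column", batch.check_column,
        "--last-value", batch.last_value,
        "--target-dir", batch.target_dir,
        "--num-mappers", str(mappers),
    ]


# Hypothetical example batch; none of these names come from the actual project.
example = MicroBatch(
    jdbc_url="jdbc:oracle:thin:@//erp-host.example:1521/FIN",
    table="GL_JOURNAL_LINES",
    check_column="LAST_UPDATE_TS",
    last_value="2016-01-01 00:00:00",
    target_dir="/staging/gl_journal_lines/batch_0001",
)
args = sqoop_import_args(example)
print(" ".join(args))
```

A scheduler such as Oozie could run one such command per table every few minutes, storing the new high-water mark after each successful batch so that the next run picks up where the previous one stopped.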