Enterprises have been rapidly adopting data lakes as a complement to, or replacement for, data warehouses. Many data lake implementations ignore the inherent drawbacks and limitations of data lakes and end up as data swamps with little or no benefit to the business. In this session we will go through some of the challenges and the key aspects that need to be considered for a successful data lake implementation.
What is a Data Lake
“A data lake is an enterprise-wide system for storing and analyzing disparate sources of data in their native formats.”
“A data lake is a central location in which to store all your data, regardless of its source or format.”
“Is a data lake a replacement for, or complementary to, the EDW?”
“Is a data lake just a storage layer?”
“Does just having a Hadoop environment make it a data lake?”
Data Lake Attributes
• Data Democratization
• Data Discovery
• Data Lineage
• Self-Service capabilities
• Metadata Management
Data Governance
• Data Acquisition – the what, when, and where of data
• Data Organization – Structure, format
• Data Catalog – what data exists in the lake
• Capturing Metadata
• Data Lineage
• Data Quality
• Data Profile
• Provenance of data at file and record levels
• Business names, descriptions
• Data Provisioning
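
The governance items above (catalog, lineage, profile, provenance, business names) can be pictured as fields on a single catalog record. The following is a minimal sketch only: `CatalogEntry`, `Catalog`, and every field name are hypothetical and do not correspond to the API of any product named in this deck.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class CatalogEntry:
    """One dataset in a hypothetical data lake catalog."""
    name: str                 # technical name of the dataset
    business_name: str        # human-readable business name
    description: str          # business description
    source_system: str        # provenance: where the data came from
    location: str             # where the data sits in the lake
    data_format: str          # e.g. "csv", "json", "parquet"
    upstream: List[str] = field(default_factory=list)        # lineage: parent datasets
    profile: Dict[str, float] = field(default_factory=dict)  # e.g. row counts, null ratios


class Catalog:
    """In-memory catalog answering 'what data exists in the lake'."""

    def __init__(self) -> None:
        self._entries: Dict[str, CatalogEntry] = {}

    def register(self, entry: CatalogEntry) -> None:
        self._entries[entry.name] = entry

    def search(self, term: str) -> List[str]:
        """Data discovery: case-insensitive match on names and description."""
        needle = term.lower()
        return sorted(
            e.name
            for e in self._entries.values()
            if needle in f"{e.name} {e.business_name} {e.description}".lower()
        )

    def lineage(self, name: str) -> List[str]:
        """Return every ancestor of a dataset, nearest ancestors first."""
        seen: List[str] = []
        queue = list(self._entries[name].upstream)
        while queue:
            parent = queue.pop(0)
            if parent in seen:
                continue
            seen.append(parent)
            if parent in self._entries:
                queue.extend(self._entries[parent].upstream)
        return seen
```

Keeping lineage as a list of parent names is the simplest choice that still lets a discovery portal walk back from a provisioned dataset to its raw source.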
Guidelines
• Expect structured, semi-structured, and unstructured data
  • Store metadata or a tag for the schema location, including for unstructured data
• Store a copy of the raw input
  • Keep a raw, first-mile copy of the data so that we can recover, or almost replay, the business if we need to
• Data Standardization – data cleansing as a workflow after ingest
• Use a format that supports your data
• Automate metadata management
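
As an illustration of the "store a copy of raw input" and "automate metadata management" guidelines, the sketch below writes the payload byte-for-byte into a raw zone and records a small metadata file next to it. The directory layout, the `.meta.json` suffix, and the names `ingest_raw` and `sidecar_path` are assumptions made for this example, not a prescribed convention.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


def sidecar_path(raw_path: Path) -> Path:
    """Metadata file stored alongside the raw file (hypothetical naming)."""
    return raw_path.with_name(raw_path.name + ".meta.json")


def ingest_raw(payload: bytes, source: str, lake_root: Path,
               schema_location: Optional[str] = None) -> Path:
    """Store an unmodified copy of the input plus automatically captured metadata."""
    digest = hashlib.sha256(payload).hexdigest()
    target_dir = lake_root / "raw" / source
    target_dir.mkdir(parents=True, exist_ok=True)

    # The raw copy is written exactly as received; cleansing happens later.
    raw_path = target_dir / f"{digest}.dat"
    raw_path.write_bytes(payload)

    metadata = {
        "source": source,
        "sha256": digest,
        "size_bytes": len(payload),
        "ingested_at": datetime.now(timezone.utc).isoformat(),
        # Tag pointing at the schema, or None when no schema is known
        # (for example, unstructured input).
        "schema_location": schema_location,
    }
    sidecar_path(raw_path).write_text(json.dumps(metadata, indent=2))
    return raw_path
```

Naming the file after its checksum means re-ingesting the same input does not create a second copy, and standardization workflows can always be re-run from the untouched original.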
Implementation Challenges
• Change Data Capture
  • MySQL – binlog readers
  • Oracle – Tungsten
  • Applying the deltas to the data lake
• Reusable data movement workflows
  • One workflow per table? (Generate dynamic workflows based on metadata)
  • Needs to be driven by metadata
• Schema changes on the source end
• Streaming Data
• Partitioning strategies on the data lake
  • Configure them into metadata
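
One way to picture "driven by metadata": a single generic workflow reads each table's key and partition column from a metadata record, applies the captured deltas, and lays the result out by partition. This is a sketch under stated assumptions: the `TABLE_METADATA` layout, the `{"op": ..., "row": ...}` change format, and all function names are hypothetical, and real binlog or Tungsten output looks different. It also assumes every update carries the full row.

```python
from typing import Dict, Iterable, List, Optional

# Hypothetical per-table metadata: merge key and partition column.
TABLE_METADATA: Dict[str, Dict[str, Optional[str]]] = {
    "orders": {"key": "order_id", "partition_by": "order_date"},
    "customers": {"key": "customer_id", "partition_by": None},
}


def partition_path(table: str, row: dict, meta: dict) -> str:
    """Build the partition directory for a row from the configured column."""
    column = meta[table]["partition_by"]
    if column is None:
        return f"{table}/"
    return f"{table}/{column}={row[column]}/"


def apply_deltas(snapshot: List[dict], deltas: Iterable[dict], key: str) -> List[dict]:
    """Apply insert/update/delete changes, in order, to the current snapshot."""
    current = {row[key]: row for row in snapshot}
    for change in deltas:
        row = change["row"]
        if change["op"] == "delete":
            current.pop(row[key], None)
        else:
            # Insert and update are both treated as an upsert of the full row.
            current[row[key]] = row
    return list(current.values())


def run_workflow(table: str, snapshot: List[dict], deltas: Iterable[dict],
                 meta: dict) -> Dict[str, List[dict]]:
    """Generic workflow: the same code serves every table listed in the metadata."""
    merged = apply_deltas(snapshot, deltas, meta[table]["key"])
    partitions: Dict[str, List[dict]] = {}
    for row in merged:
        partitions.setdefault(partition_path(table, row, meta), []).append(row)
    return partitions
```

Adding a new source table then means adding one metadata entry rather than writing another workflow, and a change of partitioning strategy is a metadata edit.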
Tools / Products
• Smart Catalogs
  • Waterline Data Inventory
  • Collibra Catalog
• Data Lake Management
  • Zaloni Bedrock
  • Informatica Intelligent Data Lake
• Data Governance and Metadata Management
  • Cloudera Navigator
  • Apache Atlas
  • Collibra Data Governance
  • Oracle BigData Catalog
Data Lake Trends
• Data Lakes on Cloud
• IoT Data Lakes
• Logical Data Lakes
  • Unified view of data that exists across data stores
• Data Discovery Portals