SlideShare una empresa de Scribd logo
1 de 21
Data Ingestion Engine
Design Overview for Big Data Meetup
Agenda
• Introduction
• Data Ingestion Engine
• Why do this? Isn’t data ingestion just ETL?
• Overview of Engine design
• Advantages of this approach
• Future enhancements
• Q/A & Discussion (remainder)
Who is this guy?
- Steven Frobish, Senior Consultant at Daugherty
- 17 years in Data and Analytics
- 3 years doing big data on mainframe before it was called “Big Data”
- 4 years Informatica support
- 4 years Data Integration Architect
- 6 years in the Hadoop space
- Ingestion, Cataloging, and Provisioning
- Operational Analytics
- Machine Learning
Motivation
• Analysts – Just give me the raw data!
• Auditors – Where is your sensitive information? (including free text)
• Records Retention – How long is your data going to stick around?
• Project Managers – I need 1800 tables moved in the next 2 weeks!
• DBA’s – Your schema is going to change (and you won’t know when)
Motivation - contd
• What did I learn:
• As “Big Iron” analyst
• Disk I/O is painfully slow, but ’parallelizable’
• Joins are expensive
• As Informatica Support Engineer:
• Roughly 30% of mappings were ‘pass-thru’
• Another 50% were similarly “doing the same thing over and over”
• ...but on slightly different data
• As Data Integration Architect:
• Most consumers can make sense of data on their own
• They just want everything, all the time
• Even highly modeled stores have common features that can be generalized
Motivation - contd
• What did I learn:
• As Data Lake Builder
• Lines of Business almost always need change data if available
• They are happy with schema on read
• but don’t break their stuff
• They will inevitably have their favorite format
• which they will discard immediately in favor of consistency
• Audits happen
• Security will always tell you that you are doing it wrong
• unless you can plug in any level of encryption they can think of…
• (then literally anything other than plaintext will do)
Why not ETL?
• Connecting to data is the easy part
• JDBC
• SFTP
• REST API’s
• Loading to Hadoop from edge node
• ETL tool’s main product is ‘T’ (no ‘T’ = no ETL)
• 1 table = 1 mapping
• Schema definition is part and parcel to the mapping
• Resource based Scheduling is difficult (impossible?)
Why not Spark?
• Spark is a parallel processing framework
• It is not free performance or instant throughput
• “Accessing the power of the cluster” is not needed here
• Problem of ingesting data is inherently not parallel
• Read a single source of data and write to a single target
• Any parallelism added in the middle is only adding complexity, not improving
throughput
What about NiFi?
• NiFi is cool
• Has many features Engine does not explore
• Back-pressure, fine-grained access, write-ahead logs
• An event driven architecture
• Typically requires more “plumbing” to set up
• Flows are still 1:1 to sources and actions are
not metadata driven
• Costly to setup many flows at once
What about SQOOP?
• Actually SQOOP is in the current Engine as an Access Module
• It has many features we want:
• Incremental extracts
• Captures schema information
• And several shortcomings:
• No resource based scheduling
• Doesn’t track inventory
• Field level masking is very difficult
The Data Ingestion Engine
• Is really just a driver program that:
• Queries a repository
• Decides to run jobs
• Time based
• ...but conscious of existing work
• Launches worker threads
Workers
• Find work to do from the queue
• Load up appropriate access module
• Track:
• Logging
• Runtime metadata
• Inventory metadata
• Occasionally:
• Transform sensitive data
• Capture profile stats
• Report errors
• Find more work to do from queue
Engine Design
Metadata Repository
Access Modules
• The code necessary to access a given source
• Oracle
• SFTP
• Kafka
• Each module is responsible for:
• Pulling data
• Acquiring schema information
• Recording basic statistics - row/record count, byte size, time taken
Transform Modules
• Use schema information to:
• Mask data
• Transform data to a given format
• Optionally capture profile stats
• Did someone mention sensitive data?
• HIPAA, PCI, PII, etc.
• Would some column tags help?
• Explicitly tag column name (e.g. ‘colA’ tagged as “is_sensitive”)
• Regex on column name (e.g. like ‘.*phone.*’, or ‘b(irth)?_?date’)
• Have some regexes for free text
• Regex on content (e.g. ‘d{3}-?d{2}-?d{4}’)
Metadata Driven Approach Benefits
• Generate many rules quickly
• Imagine an Access Module for which the output is more rules
metadata
• Scan Oracle Catalog for a given schema
• Generate rules/datasets for every table
• Let the access module do the heavy lifting
• You just need to tell it where to go and provide credentials
Metadata Driven Approach Benefits
• Capture schema changes immediately
• Use inventory data to trigger consumer processes
• Full lineage & Audit logs accessible in a single place
• Resource based scheduling and connection throttling
• Information Retention driven by metadata instead of scanning data
• … many more that become apparent only after you have built it
Some other stuff we are working on...
• Using KSQL to enrich data from Kafka
• Engine will start (if necessary) KSQL server
• With metadata generated KSQL***
• Provision data to an outside location***
• Launch Oozie workflow
*** Future feature
Questions?
Contact
steven.frobish@daugherty.com
Linked In

Más contenido relacionado

La actualidad más candente

Model Monitoring at Scale with Apache Spark and Verta
Model Monitoring at Scale with Apache Spark and VertaModel Monitoring at Scale with Apache Spark and Verta
Model Monitoring at Scale with Apache Spark and VertaDatabricks
 
Flink and NiFi, Two Stars in the Apache Big Data Constellation
Flink and NiFi, Two Stars in the Apache Big Data ConstellationFlink and NiFi, Two Stars in the Apache Big Data Constellation
Flink and NiFi, Two Stars in the Apache Big Data ConstellationMatthew Ring
 
Streaming with Oracle Data Integration
Streaming with Oracle Data IntegrationStreaming with Oracle Data Integration
Streaming with Oracle Data IntegrationMichael Rainey
 
the tooling of a modern and agile oracle dba
the tooling of a modern and agile oracle dbathe tooling of a modern and agile oracle dba
the tooling of a modern and agile oracle dbaBertrandDrouvot
 
Misusing MLflow To Help Deduplicate Data At Scale
Misusing MLflow To Help Deduplicate Data At ScaleMisusing MLflow To Help Deduplicate Data At Scale
Misusing MLflow To Help Deduplicate Data At ScaleDatabricks
 
Building a MLOps Platform Around MLflow to Enable Model Productionalization i...
Building a MLOps Platform Around MLflow to Enable Model Productionalization i...Building a MLOps Platform Around MLflow to Enable Model Productionalization i...
Building a MLOps Platform Around MLflow to Enable Model Productionalization i...Databricks
 
Ultralight data movement for IoT with SDC Edge. Guglielmo Iozzia - Optum
Ultralight data movement for IoT with SDC Edge. Guglielmo Iozzia - OptumUltralight data movement for IoT with SDC Edge. Guglielmo Iozzia - Optum
Ultralight data movement for IoT with SDC Edge. Guglielmo Iozzia - OptumData Driven Innovation
 
Feature store Overview St. Louis Big Data IDEA Meetup aug 2020
Feature store Overview   St. Louis Big Data IDEA Meetup aug 2020Feature store Overview   St. Louis Big Data IDEA Meetup aug 2020
Feature store Overview St. Louis Big Data IDEA Meetup aug 2020Adam Doyle
 
Agile Lab_BigData_Meetup_AKKA
Agile Lab_BigData_Meetup_AKKAAgile Lab_BigData_Meetup_AKKA
Agile Lab_BigData_Meetup_AKKAPaolo Platter
 
DataStax: Making a Difference with Smart Analytics
DataStax: Making a Difference with Smart AnalyticsDataStax: Making a Difference with Smart Analytics
DataStax: Making a Difference with Smart AnalyticsDataStax Academy
 
Brokering Data: Accelerating Data Evaluation with Databricks White Label
Brokering Data: Accelerating Data Evaluation with Databricks White LabelBrokering Data: Accelerating Data Evaluation with Databricks White Label
Brokering Data: Accelerating Data Evaluation with Databricks White LabelDatabricks
 
Free Servers to Build Big Data System on: Bing’s Approach
Free Servers to Build Big Data System on: Bing’s ApproachFree Servers to Build Big Data System on: Bing’s Approach
Free Servers to Build Big Data System on: Bing’s ApproachDataWorks Summit
 
Lightning Fast Analytics with Hive LLAP and Druid
Lightning Fast Analytics with Hive LLAP and DruidLightning Fast Analytics with Hive LLAP and Druid
Lightning Fast Analytics with Hive LLAP and DruidDataWorks Summit
 
Joe Witt presentation on Apache NiFi
Joe Witt presentation on Apache NiFiJoe Witt presentation on Apache NiFi
Joe Witt presentation on Apache NiFiMark Kerzner
 
10 Big Data Technologies you Didn't Know About
10 Big Data Technologies you Didn't Know About 10 Big Data Technologies you Didn't Know About
10 Big Data Technologies you Didn't Know About Jesus Rodriguez
 
Distributed Data Quality - Technical Solutions for Organizational Scaling
Distributed Data Quality - Technical Solutions for Organizational ScalingDistributed Data Quality - Technical Solutions for Organizational Scaling
Distributed Data Quality - Technical Solutions for Organizational ScalingJustin Cunningham
 
Semantic Image Logging Using Approximate Statistics & MLflow
Semantic Image Logging Using Approximate Statistics & MLflowSemantic Image Logging Using Approximate Statistics & MLflow
Semantic Image Logging Using Approximate Statistics & MLflowDatabricks
 
Innovation in the Enterprise Rent-A-Car Data Warehouse
Innovation in the Enterprise Rent-A-Car Data WarehouseInnovation in the Enterprise Rent-A-Car Data Warehouse
Innovation in the Enterprise Rent-A-Car Data WarehouseDataWorks Summit
 

La actualidad más candente (20)

Model Monitoring at Scale with Apache Spark and Verta
Model Monitoring at Scale with Apache Spark and VertaModel Monitoring at Scale with Apache Spark and Verta
Model Monitoring at Scale with Apache Spark and Verta
 
Flink and NiFi, Two Stars in the Apache Big Data Constellation
Flink and NiFi, Two Stars in the Apache Big Data ConstellationFlink and NiFi, Two Stars in the Apache Big Data Constellation
Flink and NiFi, Two Stars in the Apache Big Data Constellation
 
Streaming with Oracle Data Integration
Streaming with Oracle Data IntegrationStreaming with Oracle Data Integration
Streaming with Oracle Data Integration
 
the tooling of a modern and agile oracle dba
the tooling of a modern and agile oracle dbathe tooling of a modern and agile oracle dba
the tooling of a modern and agile oracle dba
 
Misusing MLflow To Help Deduplicate Data At Scale
Misusing MLflow To Help Deduplicate Data At ScaleMisusing MLflow To Help Deduplicate Data At Scale
Misusing MLflow To Help Deduplicate Data At Scale
 
Building a MLOps Platform Around MLflow to Enable Model Productionalization i...
Building a MLOps Platform Around MLflow to Enable Model Productionalization i...Building a MLOps Platform Around MLflow to Enable Model Productionalization i...
Building a MLOps Platform Around MLflow to Enable Model Productionalization i...
 
Ultralight data movement for IoT with SDC Edge. Guglielmo Iozzia - Optum
Ultralight data movement for IoT with SDC Edge. Guglielmo Iozzia - OptumUltralight data movement for IoT with SDC Edge. Guglielmo Iozzia - Optum
Ultralight data movement for IoT with SDC Edge. Guglielmo Iozzia - Optum
 
Data platform evolution
Data platform evolutionData platform evolution
Data platform evolution
 
Feature store Overview St. Louis Big Data IDEA Meetup aug 2020
Feature store Overview   St. Louis Big Data IDEA Meetup aug 2020Feature store Overview   St. Louis Big Data IDEA Meetup aug 2020
Feature store Overview St. Louis Big Data IDEA Meetup aug 2020
 
Agile Lab_BigData_Meetup_AKKA
Agile Lab_BigData_Meetup_AKKAAgile Lab_BigData_Meetup_AKKA
Agile Lab_BigData_Meetup_AKKA
 
DataStax: Making a Difference with Smart Analytics
DataStax: Making a Difference with Smart AnalyticsDataStax: Making a Difference with Smart Analytics
DataStax: Making a Difference with Smart Analytics
 
Brokering Data: Accelerating Data Evaluation with Databricks White Label
Brokering Data: Accelerating Data Evaluation with Databricks White LabelBrokering Data: Accelerating Data Evaluation with Databricks White Label
Brokering Data: Accelerating Data Evaluation with Databricks White Label
 
Tracking data lineage at Stitch Fix
Tracking data lineage at Stitch FixTracking data lineage at Stitch Fix
Tracking data lineage at Stitch Fix
 
Free Servers to Build Big Data System on: Bing’s Approach
Free Servers to Build Big Data System on: Bing’s ApproachFree Servers to Build Big Data System on: Bing’s Approach
Free Servers to Build Big Data System on: Bing’s Approach
 
Lightning Fast Analytics with Hive LLAP and Druid
Lightning Fast Analytics with Hive LLAP and DruidLightning Fast Analytics with Hive LLAP and Druid
Lightning Fast Analytics with Hive LLAP and Druid
 
Joe Witt presentation on Apache NiFi
Joe Witt presentation on Apache NiFiJoe Witt presentation on Apache NiFi
Joe Witt presentation on Apache NiFi
 
10 Big Data Technologies you Didn't Know About
10 Big Data Technologies you Didn't Know About 10 Big Data Technologies you Didn't Know About
10 Big Data Technologies you Didn't Know About
 
Distributed Data Quality - Technical Solutions for Organizational Scaling
Distributed Data Quality - Technical Solutions for Organizational ScalingDistributed Data Quality - Technical Solutions for Organizational Scaling
Distributed Data Quality - Technical Solutions for Organizational Scaling
 
Semantic Image Logging Using Approximate Statistics & MLflow
Semantic Image Logging Using Approximate Statistics & MLflowSemantic Image Logging Using Approximate Statistics & MLflow
Semantic Image Logging Using Approximate Statistics & MLflow
 
Innovation in the Enterprise Rent-A-Car Data Warehouse
Innovation in the Enterprise Rent-A-Car Data WarehouseInnovation in the Enterprise Rent-A-Car Data Warehouse
Innovation in the Enterprise Rent-A-Car Data Warehouse
 

Similar a Data Ingestion Engine

1. SQL Server forSharePoint geeksA gentle introductionThomas Vochten • Septem...
1. SQL Server forSharePoint geeksA gentle introductionThomas Vochten • Septem...1. SQL Server forSharePoint geeksA gentle introductionThomas Vochten • Septem...
1. SQL Server forSharePoint geeksA gentle introductionThomas Vochten • Septem...BIWUG
 
Data modeling trends for analytics
Data modeling trends for analyticsData modeling trends for analytics
Data modeling trends for analyticsIke Ellis
 
Jethro for tableau webinar (11 15)
Jethro for tableau webinar (11 15)Jethro for tableau webinar (11 15)
Jethro for tableau webinar (11 15)Remy Rosenbaum
 
Presto Summit 2018 - 02 - LinkedIn
Presto Summit 2018  - 02 - LinkedInPresto Summit 2018  - 02 - LinkedIn
Presto Summit 2018 - 02 - LinkedInkbajda
 
O365Con19 - Tips and Tricks for Complex Migrations to SharePoint Online - And...
O365Con19 - Tips and Tricks for Complex Migrations to SharePoint Online - And...O365Con19 - Tips and Tricks for Complex Migrations to SharePoint Online - And...
O365Con19 - Tips and Tricks for Complex Migrations to SharePoint Online - And...NCCOMMS
 
Tips and tricks for complex migrations to SharePoint Online
Tips and tricks for complex migrations to SharePoint OnlineTips and tricks for complex migrations to SharePoint Online
Tips and tricks for complex migrations to SharePoint OnlineAndries den Haan
 
Build a modern data platform.pptx
Build a modern data platform.pptxBuild a modern data platform.pptx
Build a modern data platform.pptxIke Ellis
 
Tips and tricks for complex migrations to SharePoint Online
Tips and tricks for complex migrations to SharePoint OnlineTips and tricks for complex migrations to SharePoint Online
Tips and tricks for complex migrations to SharePoint OnlineAndries den Haan
 
Data Vault Automation at the Bijenkorf
Data Vault Automation at the BijenkorfData Vault Automation at the Bijenkorf
Data Vault Automation at the BijenkorfRob Winters
 
Designing and Building Next Generation Data Pipelines at Scale with Structure...
Designing and Building Next Generation Data Pipelines at Scale with Structure...Designing and Building Next Generation Data Pipelines at Scale with Structure...
Designing and Building Next Generation Data Pipelines at Scale with Structure...Databricks
 
Demystifying data engineering
Demystifying data engineeringDemystifying data engineering
Demystifying data engineeringThang Bui (Bob)
 
ETL for the masses with Power Query and M
ETL for the masses with Power Query and METL for the masses with Power Query and M
ETL for the masses with Power Query and MRégis Baccaro
 
Introduction to SharePoint for SQLserver DBAs
Introduction to SharePoint for SQLserver DBAsIntroduction to SharePoint for SQLserver DBAs
Introduction to SharePoint for SQLserver DBAsSteve Knutson
 
Using ELK Explore Defect Data
Using ELK Explore Defect DataUsing ELK Explore Defect Data
Using ELK Explore Defect Dataatf117
 
UsingELKExploreDefectData
UsingELKExploreDefectDataUsingELKExploreDefectData
UsingELKExploreDefectDataYabin Xu
 
Meta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinarMeta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinarMichael Hiskey
 
Data Modeling for NoSQL
Data Modeling for NoSQLData Modeling for NoSQL
Data Modeling for NoSQLTony Tam
 
Building a modern data platform with scala, akka, apache beam
Building a modern data platform with scala, akka, apache beamBuilding a modern data platform with scala, akka, apache beam
Building a modern data platform with scala, akka, apache beamRaymond Tay
 

Similar a Data Ingestion Engine (20)

1. SQL Server forSharePoint geeksA gentle introductionThomas Vochten • Septem...
1. SQL Server forSharePoint geeksA gentle introductionThomas Vochten • Septem...1. SQL Server forSharePoint geeksA gentle introductionThomas Vochten • Septem...
1. SQL Server forSharePoint geeksA gentle introductionThomas Vochten • Septem...
 
Data modeling trends for analytics
Data modeling trends for analyticsData modeling trends for analytics
Data modeling trends for analytics
 
Jethro for tableau webinar (11 15)
Jethro for tableau webinar (11 15)Jethro for tableau webinar (11 15)
Jethro for tableau webinar (11 15)
 
Presto Summit 2018 - 02 - LinkedIn
Presto Summit 2018  - 02 - LinkedInPresto Summit 2018  - 02 - LinkedIn
Presto Summit 2018 - 02 - LinkedIn
 
O365Con19 - Tips and Tricks for Complex Migrations to SharePoint Online - And...
O365Con19 - Tips and Tricks for Complex Migrations to SharePoint Online - And...O365Con19 - Tips and Tricks for Complex Migrations to SharePoint Online - And...
O365Con19 - Tips and Tricks for Complex Migrations to SharePoint Online - And...
 
Tips and tricks for complex migrations to SharePoint Online
Tips and tricks for complex migrations to SharePoint OnlineTips and tricks for complex migrations to SharePoint Online
Tips and tricks for complex migrations to SharePoint Online
 
Build a modern data platform.pptx
Build a modern data platform.pptxBuild a modern data platform.pptx
Build a modern data platform.pptx
 
Tips and tricks for complex migrations to SharePoint Online
Tips and tricks for complex migrations to SharePoint OnlineTips and tricks for complex migrations to SharePoint Online
Tips and tricks for complex migrations to SharePoint Online
 
Data Vault Automation at the Bijenkorf
Data Vault Automation at the BijenkorfData Vault Automation at the Bijenkorf
Data Vault Automation at the Bijenkorf
 
Designing and Building Next Generation Data Pipelines at Scale with Structure...
Designing and Building Next Generation Data Pipelines at Scale with Structure...Designing and Building Next Generation Data Pipelines at Scale with Structure...
Designing and Building Next Generation Data Pipelines at Scale with Structure...
 
Demystifying data engineering
Demystifying data engineeringDemystifying data engineering
Demystifying data engineering
 
ETL for the masses with Power Query and M
ETL for the masses with Power Query and METL for the masses with Power Query and M
ETL for the masses with Power Query and M
 
Introduction to SharePoint for SQLserver DBAs
Introduction to SharePoint for SQLserver DBAsIntroduction to SharePoint for SQLserver DBAs
Introduction to SharePoint for SQLserver DBAs
 
Using ELK Explore Defect Data
Using ELK Explore Defect DataUsing ELK Explore Defect Data
Using ELK Explore Defect Data
 
UsingELKExploreDefectData
UsingELKExploreDefectDataUsingELKExploreDefectData
UsingELKExploreDefectData
 
Architecting a datalake
Architecting a datalakeArchitecting a datalake
Architecting a datalake
 
Meta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinarMeta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinar
 
Data Modeling for NoSQL
Data Modeling for NoSQLData Modeling for NoSQL
Data Modeling for NoSQL
 
Breaking data
Breaking dataBreaking data
Breaking data
 
Building a modern data platform with scala, akka, apache beam
Building a modern data platform with scala, akka, apache beamBuilding a modern data platform with scala, akka, apache beam
Building a modern data platform with scala, akka, apache beam
 

Más de Adam Doyle

Data Engineering Roles
Data Engineering RolesData Engineering Roles
Data Engineering RolesAdam Doyle
 
Managed Cluster Services
Managed Cluster ServicesManaged Cluster Services
Managed Cluster ServicesAdam Doyle
 
Delta lake and the delta architecture
Delta lake and the delta architectureDelta lake and the delta architecture
Delta lake and the delta architectureAdam Doyle
 
Great Expectations Presentation
Great Expectations PresentationGreat Expectations Presentation
Great Expectations PresentationAdam Doyle
 
May 2021 Spark Testing ... or how to farm reputation on StackOverflow
May 2021 Spark Testing ... or how to farm reputation on StackOverflowMay 2021 Spark Testing ... or how to farm reputation on StackOverflow
May 2021 Spark Testing ... or how to farm reputation on StackOverflowAdam Doyle
 
Apache Iceberg Presentation for the St. Louis Big Data IDEA
Apache Iceberg Presentation for the St. Louis Big Data IDEAApache Iceberg Presentation for the St. Louis Big Data IDEA
Apache Iceberg Presentation for the St. Louis Big Data IDEAAdam Doyle
 
Localized Hadoop Development
Localized Hadoop DevelopmentLocalized Hadoop Development
Localized Hadoop DevelopmentAdam Doyle
 
The new big data
The new big dataThe new big data
The new big dataAdam Doyle
 
Snowflake Data Science and AI/ML at Scale
Snowflake Data Science and AI/ML at ScaleSnowflake Data Science and AI/ML at Scale
Snowflake Data Science and AI/ML at ScaleAdam Doyle
 
Operationalizing Data Science St. Louis Big Data IDEA
Operationalizing Data Science St. Louis Big Data IDEAOperationalizing Data Science St. Louis Big Data IDEA
Operationalizing Data Science St. Louis Big Data IDEAAdam Doyle
 
Retooling on the Modern Data and Analytics Tech Stack
Retooling on the Modern Data and Analytics Tech StackRetooling on the Modern Data and Analytics Tech Stack
Retooling on the Modern Data and Analytics Tech StackAdam Doyle
 
Stl meetup cloudera platform - january 2020
Stl meetup   cloudera platform  - january 2020Stl meetup   cloudera platform  - january 2020
Stl meetup cloudera platform - january 2020Adam Doyle
 
How stlrda does data
How stlrda does dataHow stlrda does data
How stlrda does dataAdam Doyle
 
Tailoring machine learning practices to support prescriptive analytics
Tailoring machine learning practices to support prescriptive analyticsTailoring machine learning practices to support prescriptive analytics
Tailoring machine learning practices to support prescriptive analyticsAdam Doyle
 
Synthesis of analytical methods data driven decision-making
Synthesis of analytical methods data driven decision-makingSynthesis of analytical methods data driven decision-making
Synthesis of analytical methods data driven decision-makingAdam Doyle
 
Big Data IDEA 101 2019
Big Data IDEA 101 2019Big Data IDEA 101 2019
Big Data IDEA 101 2019Adam Doyle
 
Data Engineering and the Data Science Lifecycle
Data Engineering and the Data Science LifecycleData Engineering and the Data Science Lifecycle
Data Engineering and the Data Science LifecycleAdam Doyle
 
Data engineering Stl Big Data IDEA user group
Data engineering   Stl Big Data IDEA user groupData engineering   Stl Big Data IDEA user group
Data engineering Stl Big Data IDEA user groupAdam Doyle
 
Cloudera - Docker on hadoop
Cloudera - Docker on hadoopCloudera - Docker on hadoop
Cloudera - Docker on hadoopAdam Doyle
 

Más de Adam Doyle (20)

ML Ops.pptx
ML Ops.pptxML Ops.pptx
ML Ops.pptx
 
Data Engineering Roles
Data Engineering RolesData Engineering Roles
Data Engineering Roles
 
Managed Cluster Services
Managed Cluster ServicesManaged Cluster Services
Managed Cluster Services
 
Delta lake and the delta architecture
Delta lake and the delta architectureDelta lake and the delta architecture
Delta lake and the delta architecture
 
Great Expectations Presentation
Great Expectations PresentationGreat Expectations Presentation
Great Expectations Presentation
 
May 2021 Spark Testing ... or how to farm reputation on StackOverflow
May 2021 Spark Testing ... or how to farm reputation on StackOverflowMay 2021 Spark Testing ... or how to farm reputation on StackOverflow
May 2021 Spark Testing ... or how to farm reputation on StackOverflow
 
Apache Iceberg Presentation for the St. Louis Big Data IDEA
Apache Iceberg Presentation for the St. Louis Big Data IDEAApache Iceberg Presentation for the St. Louis Big Data IDEA
Apache Iceberg Presentation for the St. Louis Big Data IDEA
 
Localized Hadoop Development
Localized Hadoop DevelopmentLocalized Hadoop Development
Localized Hadoop Development
 
The new big data
The new big dataThe new big data
The new big data
 
Snowflake Data Science and AI/ML at Scale
Snowflake Data Science and AI/ML at ScaleSnowflake Data Science and AI/ML at Scale
Snowflake Data Science and AI/ML at Scale
 
Operationalizing Data Science St. Louis Big Data IDEA
Operationalizing Data Science St. Louis Big Data IDEAOperationalizing Data Science St. Louis Big Data IDEA
Operationalizing Data Science St. Louis Big Data IDEA
 
Retooling on the Modern Data and Analytics Tech Stack
Retooling on the Modern Data and Analytics Tech StackRetooling on the Modern Data and Analytics Tech Stack
Retooling on the Modern Data and Analytics Tech Stack
 
Stl meetup cloudera platform - january 2020
Stl meetup   cloudera platform  - january 2020Stl meetup   cloudera platform  - january 2020
Stl meetup cloudera platform - january 2020
 
How stlrda does data
How stlrda does dataHow stlrda does data
How stlrda does data
 
Tailoring machine learning practices to support prescriptive analytics
Tailoring machine learning practices to support prescriptive analyticsTailoring machine learning practices to support prescriptive analytics
Tailoring machine learning practices to support prescriptive analytics
 
Synthesis of analytical methods data driven decision-making
Synthesis of analytical methods data driven decision-makingSynthesis of analytical methods data driven decision-making
Synthesis of analytical methods data driven decision-making
 
Big Data IDEA 101 2019
Big Data IDEA 101 2019Big Data IDEA 101 2019
Big Data IDEA 101 2019
 
Data Engineering and the Data Science Lifecycle
Data Engineering and the Data Science LifecycleData Engineering and the Data Science Lifecycle
Data Engineering and the Data Science Lifecycle
 
Data engineering Stl Big Data IDEA user group
Data engineering   Stl Big Data IDEA user groupData engineering   Stl Big Data IDEA user group
Data engineering Stl Big Data IDEA user group
 
Cloudera - Docker on hadoop
Cloudera - Docker on hadoopCloudera - Docker on hadoop
Cloudera - Docker on hadoop
 

Último

Top profile Call Girls In Dindigul [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Dindigul [ 7014168258 ] Call Me For Genuine Models ...Top profile Call Girls In Dindigul [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Dindigul [ 7014168258 ] Call Me For Genuine Models ...gajnagarg
 
Microsoft Azure Arc Customer Deck Microsoft
Microsoft Azure Arc Customer Deck MicrosoftMicrosoft Azure Arc Customer Deck Microsoft
Microsoft Azure Arc Customer Deck MicrosoftAanSulistiyo
 
一比一原版(Curtin毕业证书)科廷大学毕业证原件一模一样
一比一原版(Curtin毕业证书)科廷大学毕业证原件一模一样一比一原版(Curtin毕业证书)科廷大学毕业证原件一模一样
一比一原版(Curtin毕业证书)科廷大学毕业证原件一模一样ayvbos
 
APNIC Policy Roundup, presented by Sunny Chendi at the 5th ICANN APAC-TWNIC E...
APNIC Policy Roundup, presented by Sunny Chendi at the 5th ICANN APAC-TWNIC E...APNIC Policy Roundup, presented by Sunny Chendi at the 5th ICANN APAC-TWNIC E...
APNIC Policy Roundup, presented by Sunny Chendi at the 5th ICANN APAC-TWNIC E...APNIC
 
Russian Call girls in Abu Dhabi 0508644382 Abu Dhabi Call girls
Russian Call girls in Abu Dhabi 0508644382 Abu Dhabi Call girlsRussian Call girls in Abu Dhabi 0508644382 Abu Dhabi Call girls
Russian Call girls in Abu Dhabi 0508644382 Abu Dhabi Call girlsMonica Sydney
 
20240510 QFM016 Irresponsible AI Reading List April 2024.pdf
20240510 QFM016 Irresponsible AI Reading List April 2024.pdf20240510 QFM016 Irresponsible AI Reading List April 2024.pdf
20240510 QFM016 Irresponsible AI Reading List April 2024.pdfMatthew Sinclair
 
20240508 QFM014 Elixir Reading List April 2024.pdf
20240508 QFM014 Elixir Reading List April 2024.pdf20240508 QFM014 Elixir Reading List April 2024.pdf
20240508 QFM014 Elixir Reading List April 2024.pdfMatthew Sinclair
 
APNIC Updates presented by Paul Wilson at ARIN 53
APNIC Updates presented by Paul Wilson at ARIN 53APNIC Updates presented by Paul Wilson at ARIN 53
APNIC Updates presented by Paul Wilson at ARIN 53APNIC
 
Russian Escort Abu Dhabi 0503464457 Abu DHabi Escorts
Russian Escort Abu Dhabi 0503464457 Abu DHabi EscortsRussian Escort Abu Dhabi 0503464457 Abu DHabi Escorts
Russian Escort Abu Dhabi 0503464457 Abu DHabi EscortsMonica Sydney
 
2nd Solid Symposium: Solid Pods vs Personal Knowledge Graphs
2nd Solid Symposium: Solid Pods vs Personal Knowledge Graphs2nd Solid Symposium: Solid Pods vs Personal Knowledge Graphs
2nd Solid Symposium: Solid Pods vs Personal Knowledge GraphsEleniIlkou
 
Best SEO Services Company in Dallas | Best SEO Agency Dallas
Best SEO Services Company in Dallas | Best SEO Agency DallasBest SEO Services Company in Dallas | Best SEO Agency Dallas
Best SEO Services Company in Dallas | Best SEO Agency DallasDigicorns Technologies
 
PowerDirector Explination Process...pptx
PowerDirector Explination Process...pptxPowerDirector Explination Process...pptx
PowerDirector Explination Process...pptxgalaxypingy
 
20240507 QFM013 Machine Intelligence Reading List April 2024.pdf
20240507 QFM013 Machine Intelligence Reading List April 2024.pdf20240507 QFM013 Machine Intelligence Reading List April 2024.pdf
20240507 QFM013 Machine Intelligence Reading List April 2024.pdfMatthew Sinclair
 
20240509 QFM015 Engineering Leadership Reading List April 2024.pdf
20240509 QFM015 Engineering Leadership Reading List April 2024.pdf20240509 QFM015 Engineering Leadership Reading List April 2024.pdf
20240509 QFM015 Engineering Leadership Reading List April 2024.pdfMatthew Sinclair
 
Nagercoil Escorts Service Girl ^ 9332606886, WhatsApp Anytime Nagercoil
Nagercoil Escorts Service Girl ^ 9332606886, WhatsApp Anytime NagercoilNagercoil Escorts Service Girl ^ 9332606886, WhatsApp Anytime Nagercoil
Nagercoil Escorts Service Girl ^ 9332606886, WhatsApp Anytime Nagercoilmeghakumariji156
 
Power point inglese - educazione civica di Nuria Iuzzolino
Power point inglese - educazione civica di Nuria IuzzolinoPower point inglese - educazione civica di Nuria Iuzzolino
Power point inglese - educazione civica di Nuria Iuzzolinonuriaiuzzolino1
 
在线制作约克大学毕业证(yu毕业证)在读证明认证可查
在线制作约克大学毕业证(yu毕业证)在读证明认证可查在线制作约克大学毕业证(yu毕业证)在读证明认证可查
在线制作约克大学毕业证(yu毕业证)在读证明认证可查ydyuyu
 
Story Board.pptxrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrr
Story Board.pptxrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrStory Board.pptxrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrr
Story Board.pptxrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrHenryBriggs2
 
Trump Diapers Over Dems t shirts Sweatshirt
Trump Diapers Over Dems t shirts SweatshirtTrump Diapers Over Dems t shirts Sweatshirt
Trump Diapers Over Dems t shirts Sweatshirtrahman018755
 
Real Men Wear Diapers T Shirts sweatshirt
Real Men Wear Diapers T Shirts sweatshirtReal Men Wear Diapers T Shirts sweatshirt
Real Men Wear Diapers T Shirts sweatshirtrahman018755
 

Último (20)

Top profile Call Girls In Dindigul [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Dindigul [ 7014168258 ] Call Me For Genuine Models ...Top profile Call Girls In Dindigul [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Dindigul [ 7014168258 ] Call Me For Genuine Models ...
 
Microsoft Azure Arc Customer Deck Microsoft
Microsoft Azure Arc Customer Deck MicrosoftMicrosoft Azure Arc Customer Deck Microsoft
Microsoft Azure Arc Customer Deck Microsoft
 
一比一原版(Curtin毕业证书)科廷大学毕业证原件一模一样
一比一原版(Curtin毕业证书)科廷大学毕业证原件一模一样一比一原版(Curtin毕业证书)科廷大学毕业证原件一模一样
一比一原版(Curtin毕业证书)科廷大学毕业证原件一模一样
 
APNIC Policy Roundup, presented by Sunny Chendi at the 5th ICANN APAC-TWNIC E...
APNIC Policy Roundup, presented by Sunny Chendi at the 5th ICANN APAC-TWNIC E...APNIC Policy Roundup, presented by Sunny Chendi at the 5th ICANN APAC-TWNIC E...
APNIC Policy Roundup, presented by Sunny Chendi at the 5th ICANN APAC-TWNIC E...
 
Russian Call girls in Abu Dhabi 0508644382 Abu Dhabi Call girls
Russian Call girls in Abu Dhabi 0508644382 Abu Dhabi Call girlsRussian Call girls in Abu Dhabi 0508644382 Abu Dhabi Call girls
Russian Call girls in Abu Dhabi 0508644382 Abu Dhabi Call girls
 
20240510 QFM016 Irresponsible AI Reading List April 2024.pdf
20240510 QFM016 Irresponsible AI Reading List April 2024.pdf20240510 QFM016 Irresponsible AI Reading List April 2024.pdf
20240510 QFM016 Irresponsible AI Reading List April 2024.pdf
 
20240508 QFM014 Elixir Reading List April 2024.pdf
20240508 QFM014 Elixir Reading List April 2024.pdf20240508 QFM014 Elixir Reading List April 2024.pdf
20240508 QFM014 Elixir Reading List April 2024.pdf
 
APNIC Updates presented by Paul Wilson at ARIN 53
APNIC Updates presented by Paul Wilson at ARIN 53APNIC Updates presented by Paul Wilson at ARIN 53
APNIC Updates presented by Paul Wilson at ARIN 53
 
Russian Escort Abu Dhabi 0503464457 Abu DHabi Escorts
Russian Escort Abu Dhabi 0503464457 Abu DHabi EscortsRussian Escort Abu Dhabi 0503464457 Abu DHabi Escorts
Russian Escort Abu Dhabi 0503464457 Abu DHabi Escorts
 
2nd Solid Symposium: Solid Pods vs Personal Knowledge Graphs
2nd Solid Symposium: Solid Pods vs Personal Knowledge Graphs2nd Solid Symposium: Solid Pods vs Personal Knowledge Graphs
2nd Solid Symposium: Solid Pods vs Personal Knowledge Graphs
 
Best SEO Services Company in Dallas | Best SEO Agency Dallas
Best SEO Services Company in Dallas | Best SEO Agency DallasBest SEO Services Company in Dallas | Best SEO Agency Dallas
Best SEO Services Company in Dallas | Best SEO Agency Dallas
 
PowerDirector Explination Process...pptx
PowerDirector Explination Process...pptxPowerDirector Explination Process...pptx
PowerDirector Explination Process...pptx
 
20240507 QFM013 Machine Intelligence Reading List April 2024.pdf
20240507 QFM013 Machine Intelligence Reading List April 2024.pdf20240507 QFM013 Machine Intelligence Reading List April 2024.pdf
20240507 QFM013 Machine Intelligence Reading List April 2024.pdf
 
20240509 QFM015 Engineering Leadership Reading List April 2024.pdf
20240509 QFM015 Engineering Leadership Reading List April 2024.pdf20240509 QFM015 Engineering Leadership Reading List April 2024.pdf
20240509 QFM015 Engineering Leadership Reading List April 2024.pdf
 
Nagercoil Escorts Service Girl ^ 9332606886, WhatsApp Anytime Nagercoil
Nagercoil Escorts Service Girl ^ 9332606886, WhatsApp Anytime NagercoilNagercoil Escorts Service Girl ^ 9332606886, WhatsApp Anytime Nagercoil
Nagercoil Escorts Service Girl ^ 9332606886, WhatsApp Anytime Nagercoil
 
Power point inglese - educazione civica di Nuria Iuzzolino
Power point inglese - educazione civica di Nuria IuzzolinoPower point inglese - educazione civica di Nuria Iuzzolino
Power point inglese - educazione civica di Nuria Iuzzolino
 
在线制作约克大学毕业证(yu毕业证)在读证明认证可查
在线制作约克大学毕业证(yu毕业证)在读证明认证可查在线制作约克大学毕业证(yu毕业证)在读证明认证可查
在线制作约克大学毕业证(yu毕业证)在读证明认证可查
 
Story Board.pptxrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrr
Story Board.pptxrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrStory Board.pptxrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrr
Story Board.pptxrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrr
 
Trump Diapers Over Dems t shirts Sweatshirt
Trump Diapers Over Dems t shirts SweatshirtTrump Diapers Over Dems t shirts Sweatshirt
Trump Diapers Over Dems t shirts Sweatshirt
 
Real Men Wear Diapers T Shirts sweatshirt
Real Men Wear Diapers T Shirts sweatshirtReal Men Wear Diapers T Shirts sweatshirt
Real Men Wear Diapers T Shirts sweatshirt
 

Data Ingestion Engine

  • 1. Data Ingestion Engine Design Overview for Big Data Meetup
  • 2. Agenda • Introduction • Data Ingestion Engine • Why do this? Isn’t data ingestion just ETL? • Overview of Engine design • Advantages of this approach • Future enhancements • Q/A & Discussion (remainder)
  • 3. Who is this guy? - Steven Frobish, Senior Consultant at Daugherty - 17 years in Data and Analytics - 3 years doing big data on mainframe before it was called “Big Data” - 4 years Informatica support - 4 years Data Integration Architect - 6 years in the Hadoop space - Ingestion, Cataloging, and Provisioning - Operational Analytics - Machine Learning
  • 4. Motivation • Analysts – Just give me the raw data! • Auditors – Where is your sensitive information? (including free text) • Records Retention – How long is your data going to stick around? • Project Managers – I need 1800 tables moved in the next 2 weeks! • DBA’s – Your schema is going to change (and you won’t know when)
  • 5. Motivation - contd • What did I learn: • As “Big Iron” analyst • Disk I/O is painfully slow, but ’parallelizable’ • Joins are expensive • As Informatica Support Engineer: • Roughly 30% of mappings were ‘pass-thru’ • Another 50% were similarly “doing the same thing over and over” • ...but on slightly different data • As Data Integration Architect: • Most consumers can make sense of data on their own • They just want everything, all the time • Even highly modeled stores have common features that can be generalized
  • 6. Motivation - contd • What did I learn: • As Data Lake Builder • Lines of Business almost always need change data if available • They are happy with schema on read • but don’t break their stuff • They will inevitably have their favorite format • which they will discard immediately in favor of consistency • Audits happen • Security will always tell you that you are doing it wrong • unless you can plug in any level of encryption they can think of… • (then literally anything other than plaintext will do)
  • 7. Why not ETL? • Connecting to data is the easy part • JDBC • SFTP • REST API’s • Loading to Hadoop from edge node • ETL tool’s main product is ‘T’ (no ‘T’ = no ETL) • 1 table = 1 mapping • Schema definition is part and parcel to the mapping • Resource based Scheduling is difficult (impossible?)
  • 8. Why not Spark? • Spark is a parallel processing framework • It is not free performance or instant throughput • “Accessing the power of the cluster” is not needed here • Problem of ingesting data is inherently not parallel • Read a single source of data and write to a single target • Any parallelism added in the middle is only adding complexity, not improving throughput
  • 9. What about NiFi? • NiFi is cool • Has many features Engine does not explore • Back-pressure, fine-grained access, write-ahead logs • An event driven architecture • Typically requires more “plumbing” to set up • Flows are still 1:1 to sources and actions are not metadata driven • Costly to setup many flows at once
  • 10. What about SQOOP? • Actually SQOOP is in the current Engine as an Access Module • It has many features we want: • Incremental extracts • Captures schema information • And several shortcomings: • No resource based scheduling • Doesn’t track inventory • Field level masking is very difficult
  • 11. The Data Ingestion Engine • Is really just a driver program that: • Queries a repository • Decides to run jobs • Time based • ...but conscious of existing work • Launches worker threads
  • 12. Workers • Find work to do from the queue • Load up appropriate access module • Track: • Logging • Runtime metadata • Inventory metadata • Occasionally: • Transform sensitive data • Capture profile stats • Report errors • Find more work to do from queue
  • 15. Access Modules • The code necessary to access a given source • Oracle • SFTP • Kafka • Each module is responsible for: • Pulling data • Acquiring schema information • Recording basic statistics - row/record count, byte size, time taken
  • 16. Transform Modules • Use schema information to: • Mask data • Transform data to a given format • Optionally capture profile stats • Did someone mention sensitive data? • HIPAA, PCI, PII, etc. • Would some column tags help? • Explicitly tag column name (e.g. ‘colA’ tagged as “is_sensitive”) • Regex on column name (e.g. like ‘.*phone.*’, or ‘b(irth)?_?date’) • Have some regexes for free text • Regex on content (e.g. ‘d{3}-?d{2}-?d{4}’)
  • 17. Metadata Driven Approach Benefits • Generate many rules quickly • Imagine an Access Module for which the output is more rules metadata • Scan Oracle Catalog for a given schema • Generate rules/datasets for every table • Let the access module do the heavy lifting • You just need to tell it where to go and provide credentials
  • 18. Metadata Driven Approach Benefits • Capture schema changes immediately • Use inventory data to trigger consumer processes • Full lineage & Audit logs accessible in a single place • Resource based scheduling and connection throttling • Information Retention driven by metadata instead of scanning data • … many more that become apparent only after you have built it
  • 19. Some other stuff we are working on... • Using KSQL to enrich data from Kafka • Engine will start (if necessary) KSQL server • With metadata generated KSQL*** • Provision data to an outside location*** • Launch Oozie workflow *** Future feature