Clearing the hurdle from designing a machine learning model to putting it into production is the key to getting value back – and the roadblock that stops many promising machine learning projects. After the data scientists have done their part, engineering a robust production data pipeline has its own set of tough problems to solve. Syncsort’s solutions help the data engineer every step of the way.
The first step is consolidating data from sources all over the enterprise. The data for machine learning models comes from a wide variety of physical locations, technical platforms and storage formats. The first challenge is onboarding: it requires parallel ingest capability and connectivity to sources from mainframe to streaming to Cloud, to get all that data onto the cluster. The next challenge is transforming all the data from its source storage format to the target, whether that is Hive, Impala, HDFS, ORC, Parquet, Kudu or something else entirely. The final challenge is getting the data normalized, aggregated – or otherwise changed – and the features filtered down.
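That last step – normalize, aggregate, filter down to features – is conceptually simple but easy to underestimate at scale. A minimal, stdlib-only sketch of the idea (the record layout and field names here are invented for illustration; in a real pipeline this logic runs distributed):

```python
from collections import defaultdict

# Hypothetical raw records consolidated from several sources; note the
# inconsistent casing and string-typed numbers typical of mixed sources.
raw = [
    {"customer": "A17", "region": "emea", "amount": "120.50", "channel": "web"},
    {"customer": "A17", "region": "EMEA", "amount": "80.00",  "channel": "pos"},
    {"customer": "B22", "region": "amer", "amount": "15.25",  "channel": "web"},
]

# Normalize: consistent casing and numeric types.
normalized = [
    {**r, "region": r["region"].upper(), "amount": float(r["amount"])}
    for r in raw
]

# Aggregate: total spend per customer.
totals = defaultdict(float)
for r in normalized:
    totals[r["customer"]] += r["amount"]

# Filter features: keep only the columns the model will train on.
features = [{"customer": c, "total_spend": t} for c, t in sorted(totals.items())]
print(features)
# [{'customer': 'A17', 'total_spend': 200.5}, {'customer': 'B22', 'total_spend': 15.25}]
```

Each of the three stages maps to a step in the paragraph above; multiply by dozens of sources and formats and the engineering effort becomes clear.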
This is only the first part of creating robust production data pipelines, and if you’re not careful it can take weeks or even months of Sqoop scripts, shell scripting, and Scala or Java code. Syncsort has helped data engineers solve this problem for years.
View this 15-minute webcast on-demand for a deeper look at a better way to get high-performance data access and integration on your production cluster – without spending a lot of time coding or tuning. These 15 minutes could save you weeks!
Engineering Machine Learning Data Pipeline Series: Pulling in Data from Multiple Sources
1. Engineering Machine Learning Data Pipelines
Data from Multiple Sources
Paige Roberts
Integrate Product Marketing Manager
2. Common Machine Learning Applications
Engineering Machine Learning Data Pipelines
• Anti-money laundering
• Fraud detection
• Cybersecurity
• Targeted marketing
• Recommendation engine
• Next best action
• Customer churn prevention
• Know your customer
3. “For want of a nail, the kingdom was lost.
For want of a data cleansing and integration tool,
the whole AI superstructure can fall down.”
James Kobielus
SiliconANGLE Wikibon
Lead Analyst for Data Science, Deep Learning, App Development
2018
4. Data Engineer to the Rescue
Data Scientist
• Expert in statistical analysis, machine learning techniques, and finding answers to business questions buried in datasets.
• Does NOT want to spend 50–90% of their time tinkering with data, getting it into good shape to train models – but frequently does, especially if there’s no data engineer on their team.
• Job is complete when the machine learning model is trained, tested, and proven to accomplish the goal. Not skilled at taking the model from a test sandbox into production.
Data Engineer
• Expert in data structures, data manipulation, and constructing production data pipelines.
• WANTS to spend all of their time working with data.
• First gathers, cleans and standardizes data, helps the data scientist with feature engineering, and provides top-notch data, ready to train models.
• After the model is tested, builds robust, high-scale data pipelines to feed the models the data they need, in the correct format, in production, to provide ongoing business value.
5. Five Big Challenges of Engineering ML Data Pipelines
1. Scattered and Difficult-to-Access Datasets
Much of the necessary data is trapped in mainframes, or streams in from POS, web clicks, etc., all in incompatible formats, making it difficult to gather and prepare the data for model training.
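“Incompatible formats” is concrete on the mainframe side: extracts often arrive in EBCDIC rather than ASCII/UTF-8, with numbers stored as packed decimal (COMP-3). A hedged, stdlib-only illustration of the kind of low-level conversion involved (the sample bytes are invented):

```python
# Even "plain text" mainframe fields need a code-page conversion before
# they can join data from other sources. Code page 037 is EBCDIC (US).
ebcdic_bytes = b"\xc8\x85\x93\x93\x96"   # "Hello" in EBCDIC cp037
print(ebcdic_bytes.decode("cp037"))      # Hello

def unpack_comp3(data: bytes) -> int:
    """Decode an IBM packed-decimal (COMP-3) field: two digits per byte,
    with the last nibble holding the sign (0xD negative, else positive)."""
    digits = ""
    for b in data[:-1]:
        digits += f"{b >> 4}{b & 0x0F}"
    digits += str(data[-1] >> 4)          # last byte: one digit + sign nibble
    sign = -1 if (data[-1] & 0x0F) == 0x0D else 1
    return sign * int(digits)

print(unpack_comp3(b"\x12\x34\x5c"))      # 12345
```

Multiply this by COBOL copybook layouts, variable-length records, and dozens of files, and the “trapped in mainframes” challenge is less about access and more about interpretation.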
6. Onboard Relational Data Quickly
• Db2, Oracle, Teradata, Netezza, S3, Redshift, …
• Onboard hundreds of tables into your cluster
• Onboard whole database schemas at once
• Create target tables automatically in Hive or Impala
• Filter unwanted tables, rows, data types, or columns with a mouse click
• Transform data in flight
DMX DataFunnel™
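To make “onboard a whole schema at once” concrete, here is a hedged sketch of what that means when hand-coded: enumerate every table in a source database, create each target table automatically, and copy the rows. The example uses SQLite purely so it is self-contained; a tool like DataFunnel adds the parallelism, cross-database type mapping, and filtering that this toy loop lacks.

```python
import sqlite3

# Invented source schema, for illustration only.
src = sqlite3.connect(":memory:")
src.executescript("""
    CREATE TABLE customers (id INTEGER, name TEXT);
    INSERT INTO customers VALUES (1, 'Acme'), (2, 'Globex');
    CREATE TABLE orders (id INTEGER, customer_id INTEGER, total REAL);
    INSERT INTO orders VALUES (10, 1, 99.5);
""")

tgt = sqlite3.connect(":memory:")
for name, ddl in src.execute(
        "SELECT name, sql FROM sqlite_master WHERE type = 'table'"):
    tgt.execute(ddl)                              # auto-create target table
    rows = src.execute(f"SELECT * FROM {name}").fetchall()
    if rows:
        placeholders = ",".join("?" * len(rows[0]))
        tgt.executemany(f"INSERT INTO {name} VALUES ({placeholders})", rows)

print([r[0] for r in tgt.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")])
# ['customers', 'orders']
```

Even in this toy form, per-table DDL and row copying is the boilerplate that turns into weeks of Sqoop and shell scripting when the source is a few hundred Oracle or Db2 tables instead of two SQLite ones.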
7. Onboard ALL Enterprise Data – Mainframe to Streaming Data Sources
Access data from streaming and batch sources outside the cluster.
8. Onboard data, modify it on-the-fly to match the Hadoop storage model, or store it unchanged for archive and compliance.
9. Data Lake: Transform, join, cleanse, and enhance data in the cluster with MapReduce, EMR, or Spark.
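The in-cluster transform/join/cleanse step is classic ETL logic. A hedged, stdlib-only illustration of a tiny join-and-cleanse pass (the tables are invented; in production this would run distributed via MapReduce or Spark rather than in a single process):

```python
# Invented lookup table: enrich transactions with customer attributes.
customers = {"A17": {"segment": "enterprise"}, "B22": {"segment": "smb"}}

transactions = [
    {"customer": "A17", "amount": 200.5},
    {"customer": "B22", "amount": 15.25},
    {"customer": "ZZZ", "amount": 10.0},   # no matching customer record
]

# Join + cleanse: enrich each row, and drop rows that fail the lookup
# rather than passing nulls downstream to model training.
enriched = [
    {**t, **customers[t["customer"]]}
    for t in transactions
    if t["customer"] in customers
]
print(enriched)
```

The same decision shown here – drop, default, or quarantine unmatched rows – has to be made explicitly for every join in a production pipeline.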
10. Design Once, Deploy Anywhere
Intelligent Execution – insulate your organization from the underlying complexities of Hadoop.
Get excellent performance every time, without tuning, load balancing, etc.
No re-design, re-compile, or re-work – ever:
• Future-proof job designs for emerging compute frameworks, e.g. Spark 2.x
• Move from dev to test to production
• Move from on-premise to Cloud
• Move from one Cloud to another
Use existing ETL skills – no parallel programming (Java, MapReduce, Spark …)
No worries about:
• Mappers, Reducers
• Big side or small side of joins …
[Diagram: Design once in a visual GUI, deploy anywhere – on-premise or Cloud; MapReduce, Spark, or future platforms; Windows, Unix, or Linux; batch or streaming; single node or cluster.]
11. Same Solution – On-Premise or In the Cloud
Big Data + Cloud + Syncsort = Powerful, Flexible, Cost Effective
• ETL engine on AWS Marketplace
• Available on EC2, EMR, Google Cloud
• S3 and Redshift connectivity
• Google GCS and Amazon S3 support
• First & only leading ETL engine on Docker Hub
• Partner with all major public Cloud providers
12. Bring ALL Enterprise Data Securely to the Data Lake
• Collect virtually any data, from mainframe to Cloud, relational to NoSQL
• Batch & streaming sources – Kafka, MapR Streams
• Access, re-format and load data directly into Hive & Impala. No staging required!
• Pull hundreds of tables at once into your data hub, whole DB schemas at the push of a button
• Load more data into Hadoop in less time
Build Your Enterprise Data Hub