Clearing the hurdle from designing a machine learning model to putting it into production is the key to getting value back – and the roadblock that stops many promising machine learning projects. After the data scientists have done their part, engineering a robust production data pipeline has its own set of tough problems to solve. Syncsort’s solutions help the data engineer every step of the way.
The first step is consolidating data from sources all over the enterprise. The data for machine learning models comes from a wide variety of physical locations, technical platforms and storage formats. The first challenge is onboarding: it requires parallel ingest capability and connectivity to sources from mainframe to streaming to Cloud, to get all that data onto the cluster. The next challenge is transforming all the data from its source storage format to the target, whether that is Hive, Impala, HDFS, ORC, Parquet, Kudu or something else entirely. The final challenge is getting the data normalized, aggregated – or otherwise changed – and the features filtered down.
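That last step – normalize, aggregate, filter down to features – is conceptually simple but easy to underestimate at scale. A minimal, stdlib-only sketch of the idea (the record layout and field names here are invented for illustration; in a real pipeline this logic runs distributed):

```python
from collections import defaultdict

# Hypothetical raw records consolidated from several sources; note the
# inconsistent casing and string-typed numbers typical of mixed sources.
raw = [
    {"customer": "A17", "region": "emea", "amount": "120.50", "channel": "web"},
    {"customer": "A17", "region": "EMEA", "amount": "80.00",  "channel": "pos"},
    {"customer": "B22", "region": "amer", "amount": "15.25",  "channel": "web"},
]

# Normalize: consistent casing and numeric types.
normalized = [
    {**r, "region": r["region"].upper(), "amount": float(r["amount"])}
    for r in raw
]

# Aggregate: total spend per customer.
totals = defaultdict(float)
for r in normalized:
    totals[r["customer"]] += r["amount"]

# Filter features: keep only the columns the model will train on.
features = [{"customer": c, "total_spend": t} for c, t in sorted(totals.items())]
print(features)
# [{'customer': 'A17', 'total_spend': 200.5}, {'customer': 'B22', 'total_spend': 15.25}]
```

Each of the three stages maps to a step in the paragraph above; multiply by dozens of sources and formats and the engineering effort becomes clear.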
This is only the first part of creating robust production data pipelines, and if you’re not careful it can take weeks or even months of Sqoop scripts, shell scripting, and Scala or Java code. Syncsort has helped data engineers solve this problem for years.
View this 15-minute webcast on-demand for a deeper look at a better way to get high-performance data access and integration on your production cluster – without spending a lot of time coding or tuning. These 15 minutes could save you weeks!
Engineering Machine Learning Data Pipeline Series: Pulling in Data from Multiple Sources
1. Engineering Machine Learning Data Pipelines
Data from Multiple Sources
Paige Roberts
Integrate Product Marketing Manager
2. Common Machine Learning Applications
Engineering Machine Learning Data Pipelines
• Anti-money laundering
• Fraud detection
• Cybersecurity
• Targeted marketing
• Recommendation engine
• Next best action
• Customer churn prevention
• Know your customer
3. “For want of a nail, the kingdom was lost.
For want of a data cleansing and integration tool,
the whole AI superstructure can fall down.”
James Kobielus
SiliconANGLE Wikibon
Lead Analyst for Data Science, Deep Learning, App Development
2018
4. Data Engineer to the Rescue
Data Scientist
• Expert in statistical analysis, machine learning techniques, and finding answers to business questions buried in datasets.
• Does NOT want to spend 50–90% of their time tinkering with data, getting it into good shape to train models – but frequently does, especially if there’s no data engineer on their team.
• Job is complete when the machine learning model is trained, tested, and proven to accomplish the goal. Not skilled at taking the model from a test sandbox into production.
Data Engineer
• Expert in data structures, data manipulation, and constructing production data pipelines.
• WANTS to spend all of their time working with data.
• First gathers, cleans and standardizes data, helps the data scientist with feature engineering, and provides top-notch data, ready to train models.
• After the model is tested, builds robust, high-scale data pipelines to feed the models the data they need, in the correct format, in production, to provide ongoing business value.
5. Five Big Challenges of Engineering ML Data Pipelines
1. Scattered and Difficult-to-Access Datasets
Much of the necessary data is trapped in mainframes, or streams in from POS, web clicks, etc., all in incompatible formats, making it difficult to gather and prepare the data for model training.
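“Incompatible formats” is concrete on the mainframe side: extracts often arrive in EBCDIC rather than ASCII/UTF-8, with numbers stored as packed decimal (COMP-3). A hedged, stdlib-only illustration of the kind of low-level conversion involved (the sample bytes are invented):

```python
# Even "plain text" mainframe fields need a code-page conversion before
# they can join data from other sources. Code page 037 is EBCDIC (US).
ebcdic_bytes = b"\xc8\x85\x93\x93\x96"   # "Hello" in EBCDIC cp037
print(ebcdic_bytes.decode("cp037"))      # Hello

def unpack_comp3(data: bytes) -> int:
    """Decode an IBM packed-decimal (COMP-3) field: two digits per byte,
    with the last nibble holding the sign (0xD negative, else positive)."""
    digits = ""
    for b in data[:-1]:
        digits += f"{b >> 4}{b & 0x0F}"
    digits += str(data[-1] >> 4)          # last byte: one digit + sign nibble
    sign = -1 if (data[-1] & 0x0F) == 0x0D else 1
    return sign * int(digits)

print(unpack_comp3(b"\x12\x34\x5c"))      # 12345
```

Multiply this by COBOL copybook layouts, variable-length records, and dozens of files, and the “trapped in mainframes” challenge is less about access and more about interpretation.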
6. Onboard Relational Data Quickly
• Db2, Oracle, Teradata, Netezza, S3, Redshift, …
• Onboard hundreds of tables into your cluster
• Onboard whole database schemas at once
• Create target tables automatically in Hive or Impala
• Filter unwanted tables, rows, data types, or columns with a mouse click
• Transform data in flight
DMX DataFunnel™
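To make “onboard a whole schema at once” concrete, here is a hedged sketch of what that means when hand-coded: enumerate every table in a source database, create each target table automatically, and copy the rows. The example uses SQLite purely so it is self-contained; a tool like DataFunnel adds the parallelism, cross-database type mapping, and filtering that this toy loop lacks.

```python
import sqlite3

# Invented source schema, for illustration only.
src = sqlite3.connect(":memory:")
src.executescript("""
    CREATE TABLE customers (id INTEGER, name TEXT);
    INSERT INTO customers VALUES (1, 'Acme'), (2, 'Globex');
    CREATE TABLE orders (id INTEGER, customer_id INTEGER, total REAL);
    INSERT INTO orders VALUES (10, 1, 99.5);
""")

tgt = sqlite3.connect(":memory:")
for name, ddl in src.execute(
        "SELECT name, sql FROM sqlite_master WHERE type = 'table'"):
    tgt.execute(ddl)                              # auto-create target table
    rows = src.execute(f"SELECT * FROM {name}").fetchall()
    if rows:
        placeholders = ",".join("?" * len(rows[0]))
        tgt.executemany(f"INSERT INTO {name} VALUES ({placeholders})", rows)

print([r[0] for r in tgt.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")])
# ['customers', 'orders']
```

Even in this toy form, per-table DDL and row copying is the boilerplate that turns into weeks of Sqoop and shell scripting when the source is a few hundred Oracle or Db2 tables instead of two SQLite ones.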
7. Onboard ALL Enterprise Data – Mainframe to Streaming Data Sources
Access data from streaming and batch sources outside the cluster.
8. Onboard data, modify it on-the-fly to match the Hadoop storage model, or store it unchanged for archive and compliance.
9. Data Lake: Transform, join, cleanse, and enhance data in the cluster with MapReduce, EMR, or Spark.
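The in-cluster transform/join/cleanse step is classic ETL logic. A hedged, stdlib-only illustration of a tiny join-and-cleanse pass (the tables are invented; in production this would run distributed via MapReduce or Spark rather than in a single process):

```python
# Invented lookup table: enrich transactions with customer attributes.
customers = {"A17": {"segment": "enterprise"}, "B22": {"segment": "smb"}}

transactions = [
    {"customer": "A17", "amount": 200.5},
    {"customer": "B22", "amount": 15.25},
    {"customer": "ZZZ", "amount": 10.0},   # no matching customer record
]

# Join + cleanse: enrich each row, and drop rows that fail the lookup
# rather than passing nulls downstream to model training.
enriched = [
    {**t, **customers[t["customer"]]}
    for t in transactions
    if t["customer"] in customers
]
print(enriched)
```

The same decision shown here – drop, default, or quarantine unmatched rows – has to be made explicitly for every join in a production pipeline.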
10. Design Once, Deploy Anywhere
Intelligent Execution – insulate your organization from the underlying complexities of Hadoop.
Get excellent performance every time, without tuning, load balancing, etc.
No re-design, re-compile, or re-work – ever:
• Future-proof job designs for emerging compute frameworks, e.g. Spark 2.x
• Move from dev to test to production
• Move from on-premise to Cloud
• Move from one Cloud to another
Use existing ETL skills – no parallel programming (Java, MapReduce, Spark …)
No worries about:
• Mappers, Reducers
• Big side or small side of joins …
[Diagram: Design once in a visual GUI, deploy anywhere – on-premise or Cloud; MapReduce, Spark, or future platforms; Windows, Unix, or Linux; batch or streaming; single node or cluster.]
11. Same Solution – On-Premise or In the Cloud
Big Data + Cloud + Syncsort = Powerful, Flexible, Cost Effective
• ETL engine on AWS Marketplace
• Available on EC2, EMR, Google Cloud
• S3 and Redshift connectivity
• Google GCS and Amazon S3 support
• First & only leading ETL engine on Docker Hub
• Partner with all major public Cloud providers
12. Bring ALL Enterprise Data Securely to the Data Lake
• Collect virtually any data, from mainframe to Cloud, relational to NoSQL
• Batch & streaming sources – Kafka, MapR Streams
• Access, re-format and load data directly into Hive & Impala. No staging required!
• Pull hundreds of tables at once into your data hub, whole DB schemas at the push of a button
• Load more data into Hadoop in less time
Build Your Enterprise Data Hub