This document discusses feature stores and their role in modern machine learning infrastructure. It begins with an introduction and agenda. It then covers challenges with modern data platforms and emerging architectural shifts towards things like data meshes and feature stores. The remainder discusses what a feature store is, reference architectures, and recommendations for adopting feature stores including leveraging existing AWS services for storage, catalog, query, and more.
Feature Store as a Data Foundation for Machine Learning
1. Feature Store
as a Data Foundation for ML
Presented by:
Stepan Pushkarev, CTO @ Provectus
Gandhi Raketla, Senior Solutions Architect @ AWS
2. Agenda
1. Introductions
2. Modern Data Lakes and Modern ML Infrastructure
3. Emerging Architectural Shifts
4. Feature Store: 200 LOD overview and reference architecture on AWS
5. AWS Perspective on Feature Store
4. AI-First Consultancy & Solutions Provider
We are obsessed with leveraging cloud, data, and AI to reimagine the way businesses operate, compete, and deliver customer value
● Clients ranging from fast-growing startups to large enterprises
● 450 employees and growing
● Established in 2010
● HQ in Palo Alto
● Offices across the US, Canada, and Europe
5. Our Clients
● Innovative Tech Vendors: seeking niche expertise to differentiate and win the market
● Midsize to Large Enterprises: seeking to accelerate innovation and achieve operational excellence
8. Common Challenges:
Data Access and Discoverability
1. Data is scattered across multiple data sources
and technologies
2. Tedious process of managing AWS IAM roles,
Amazon S3 policies, API Gateways, Database
permissions
3. Gets even more complicated in an AWS multi-account setup
4. Metadata is not discoverable
5. As a result - all the investments into Data and
ML are killed by data access issues
9. Common Challenges:
Monolithic Data Teams
1. Lack of ownership and domain context: a disconnect between data producers and data consumers
2. Backlogged data team struggling to keep pace with business demands
3. No contracts between Data and ML Engineering
4. As a result, fast end-to-end experimentation is killed by complex dependencies between teams
https://martinfowler.com/articles/data-monolith-to-mesh.html
10. Common Challenges:
ML Experimentation Infrastructure
1. Inherited issues with Data Discovery and
Data Access
2. Reproducibility of datasets, ML pipelines,
ML Environments, and offline experiments
is still an issue
3. Production Experimentation frameworks
are still fairly immature
4. As a result, the cost of an end-to-end
experiment from data to production ML
metric is 3-6 months
https://hbr.org/2020/03/building-a-culture-of-experimentation
11. Common Challenges:
Scaling ML Adoption in Production
1. Online serving. There is no unified and consistent
way to access features during model serving.
2. Impossible to reuse features between multiple
training pipelines and ML applications.
3. Monitoring and maintenance of ML Applications.
4. As a result, the time and cost to scale from 1 to 100
models in production grow exponentially.
What is your cost per
ML Model in Production?
13. Emerging Architectural Shifts
Data Lake -> Hudi/Delta Lakes
Hudi/Delta Lakes bring managed ingestion, ACID transactions
and point in time queries into traditional Data Lakes
Data Lake -> Data Mesh
Ownership of data domains, data pipelines, metadata, and API
is shifting from centralized teams to product teams
Data Lake -> Data Infrastructure as a platform
Unified reusable platform components and frameworks across
enterprise
Endpoint Protection -> Global Data Governance
Data Security and privacy measures are becoming centralized
as part of Data Platform
Metadata Store -> Global Data Catalog
User Experience around data discovery, lineage, and versioning
requires investments into metadata-rich Data Catalog
Feature Store
Scaling ML Experimentation and Operations requires a
separate data management layer for ML Features
ML Toolkit -> Complete ML Infrastructure
ML capabilities are democratized for ML Engineers and citizen
Data Scientists
14. ACID Data Lakes
● Managed Ingestion
● Dataset versioning for ML training
● Cheap “Deletes” (common GDPR use case)
● Audit log of any changes in datasets
● Brings ACID transactions to your data lake
● “Upserts” strategy on data ingestion
● Enables schemas to enforce data quality
Delta/Hudi Lakes
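As a rough illustration of the upsert, delete, dataset versioning, and audit log ideas listed on this slide, here is a toy in-memory table in Python. It is not Hudi or Delta Lake code; the class `VersionedTable` and all of its methods are hypothetical names invented for this sketch.

```python
from copy import deepcopy


class VersionedTable:
    """Toy in-memory table: every commit produces a new immutable snapshot."""

    def __init__(self):
        self._snapshots = [{}]  # snapshot 0 is the empty table
        self._audit_log = []    # one entry per commit

    @property
    def version(self):
        return len(self._snapshots) - 1

    def _commit(self, operation, new_state, keys):
        # The new snapshot becomes visible only once it is appended as a whole.
        self._snapshots.append(new_state)
        self._audit_log.append(
            {"version": self.version, "op": operation, "keys": sorted(keys)}
        )
        return self.version

    def upsert(self, rows, key="id"):
        """Insert new rows and overwrite existing rows that share the same key."""
        state = deepcopy(self._snapshots[-1])
        for row in rows:
            state[row[key]] = dict(row)
        return self._commit("upsert", state, [row[key] for row in rows])

    def delete(self, keys):
        """Remove rows by key in a new snapshot."""
        state = deepcopy(self._snapshots[-1])
        for k in keys:
            state.pop(k, None)
        return self._commit("delete", state, keys)

    def read(self, version=None):
        """Read the latest snapshot, or an older one (point-in-time query)."""
        chosen = self.version if version is None else version
        return [dict(r) for r in self._snapshots[chosen].values()]

    def history(self):
        return list(self._audit_log)


table = VersionedTable()
v1 = table.upsert([{"id": 1, "city": "Palo Alto"}, {"id": 2, "city": "Toronto"}])
v2 = table.upsert([{"id": 2, "city": "Vancouver"}])  # updates the existing key 2
v3 = table.delete([1])                                # e.g. an erasure request
```

Reading `table.read(v1)` returns the dataset exactly as it was at the first commit, which is the property that makes training runs reproducible. Note that this toy keeps deleted rows in older snapshots; a real erasure workflow would also need to purge history.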
15. Global Data Governance
Accelerate privacy operations with data you already
have.
Automate business processes, data mapping, and PI
discovery and classification for privacy workflows.
Operationalize policies in a central location.
Govern privacy policies to ensure policies are effectively
managed across the enterprise. Define and document
workflows, traceability views, and business process
registers.
Scale compliance across multiple regulations.
Use a platform designed and built with privacy in mind
that is easily extensible to support new regulations.
AWS Config
AWS Lake Formation
16. Global Data Catalog
Meta-metadata store:
● Does this data exist? Where is it?
● What is the source of truth of this data?
● Do I have access?
● Who is the owner?
● Who are the users of this data?
● Are there existing assets I can reuse?
● Can I trust this data?
* There are no established leaders in open
source
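The questions on this slide map fairly directly onto the fields of a catalog record. The following is a minimal sketch under that assumption; `CatalogEntry`, `Catalog`, the field names, and the example bucket path are all hypothetical and do not correspond to any particular catalog product.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    name: str                   # Does this data exist?
    location: str               # Where is it?
    source_of_truth: str        # Which system holds the canonical copy?
    owner: str                  # Who is the owner?
    consumers: list = field(default_factory=list)    # Who are the users?
    allowed_roles: set = field(default_factory=set)  # Do I have access?
    quality_checks_passed: bool = False              # Can I trust this data?


class Catalog:
    def __init__(self):
        self._entries = {}

    def register(self, entry):
        self._entries[entry.name] = entry

    def find(self, keyword):
        """Return entries whose name contains the keyword (case-insensitive)."""
        kw = keyword.lower()
        return [e for e in self._entries.values() if kw in e.name.lower()]

    def can_access(self, name, role):
        entry = self._entries.get(name)
        return entry is not None and role in entry.allowed_roles


catalog = Catalog()
catalog.register(
    CatalogEntry(
        name="orders_daily",
        location="s3://example-bucket/orders/",  # hypothetical path
        source_of_truth="orders-service",
        owner="checkout-team",
        consumers=["pricing-model"],
        allowed_roles={"ml-engineer"},
        quality_checks_passed=True,
    )
)
matches = catalog.find("ORDERS")
```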
17. The Core of MLOps and Reproducible
Experimentation Pipelines
Model Code
ML Pipeline Code
Infrastructure
as Code
Versioned
Dataset
Production
Metrics & Alerts
Model Artifacts
Prediction
Service
ML Metrics
Automated Pipeline Execution
Pipeline Metadata
Alerts Reports
Feature Store
Orchestration: Idempotent Execution
Feedback Loop for Production Data
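One common way to read "Orchestration: Idempotent Execution" is that re-running a pipeline step with the same inputs must not redo or duplicate its work. The sketch below assumes that interpretation; `StepCache` and `train` are hypothetical stand-ins, not part of any orchestration framework.

```python
import hashlib
import json


class StepCache:
    """Remembers step outputs keyed by a fingerprint of the step's inputs."""

    def __init__(self):
        self._results = {}
        self.executions = 0  # how many times a step body actually ran

    def run(self, step_name, func, **params):
        # Identical step name and parameters always yield the same fingerprint.
        payload = json.dumps({"step": step_name, "params": params}, sort_keys=True)
        key = hashlib.sha256(payload.encode("utf-8")).hexdigest()
        if key not in self._results:
            self._results[key] = func(**params)
            self.executions += 1
        return self._results[key]


def train(dataset_version, learning_rate):
    # Stand-in for a real training job; returns a deterministic artifact name.
    return f"model-{dataset_version}-lr{learning_rate}"


cache = StepCache()
first = cache.run("train", train, dataset_version="v3", learning_rate=0.1)
again = cache.run("train", train, dataset_version="v3", learning_rate=0.1)
other = cache.run("train", train, dataset_version="v4", learning_rate=0.1)
```

The second call returns the cached artifact without executing `train` again, while a new dataset version triggers a fresh run. This is why versioned datasets and idempotent orchestration appear side by side in the diagram.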
19. Feature Store Value Proposition
A data management layer for machine learning features.
1. Better ROI from feature engineering through a lower cost per model: facilitates collaboration, sharing, and reuse of features
2. Faster time to market for new models through increased productivity of ML Engineers: storage implementation is decoupled from the feature serving API
20. Feature Store: Canonical Use Cases
● Personalization & Recommendation Engines
● Dynamic Pricing Optimization
● Supply Chain Optimization
● Logistics and Transportation Optimization
● Fraud Detection
● Predictive Maintenance
● Demand Forecasting
* All the use cases where ML models need a stateful, ever-changing representation of the system
21. Feature Store: Concepts
● Online Feature Store
Online applications look up a feature vector that is sent to an ML model for predictions
● ML-specific Metadata
Enables feature discoverability and reuse
● ML-specific API and SDK
High-level operations for fetching training feature sets and online access
● Materialized Versioned Datasets
Maintains versions of feature sets used to train ML models
[Diagram: Raw Data → Feature Engineering → Feature Store → Training, Serving, Discovery]
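The concepts on this slide (online lookup, metadata for discovery, and versioned data for training) can be sketched in a few lines of Python. This is an in-memory toy under stated assumptions, not the API of Feast, Hopsworks, or any AWS service; `MiniFeatureStore` and its method names are hypothetical.

```python
from bisect import bisect_right
from collections import defaultdict


class MiniFeatureStore:
    def __init__(self):
        # (feature, entity_id) -> list of (timestamp, value), kept sorted by time
        self._history = defaultdict(list)
        self._metadata = {}

    def register_feature(self, name, description, owner):
        """ML-specific metadata that makes features discoverable."""
        self._metadata[name] = {"description": description, "owner": owner}

    def search(self, keyword):
        kw = keyword.lower()
        return sorted(
            name for name, meta in self._metadata.items()
            if kw in name.lower() or kw in meta["description"].lower()
        )

    def ingest(self, feature, entity_id, timestamp, value):
        if feature not in self._metadata:
            raise KeyError(f"unregistered feature: {feature}")
        rows = self._history[(feature, entity_id)]
        rows.append((timestamp, value))
        rows.sort(key=lambda r: r[0])

    def get_online_features(self, entity_id, features):
        """Serving path: the latest known value of each feature."""
        vector = {}
        for f in features:
            rows = self._history.get((f, entity_id))
            vector[f] = rows[-1][1] if rows else None
        return vector

    def get_training_features(self, entity_id, features, as_of):
        """Training path: the value that was valid at time `as_of`."""
        row = {}
        for f in features:
            rows = self._history.get((f, entity_id), [])
            idx = bisect_right([ts for ts, _ in rows], as_of)
            row[f] = rows[idx - 1][1] if idx else None
        return row


store = MiniFeatureStore()
store.register_feature("orders_7d", "Orders placed in the last 7 days", "growth-team")
store.ingest("orders_7d", "u1", timestamp=10, value=3)
store.ingest("orders_7d", "u1", timestamp=20, value=5)

online = store.get_online_features("u1", ["orders_7d"])
past = store.get_training_features("u1", ["orders_7d"], as_of=15)
```

The training path deliberately ignores values recorded after `as_of`, so a model trained on historical labels never sees feature values from the future.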
23. Feast
Pros:
● Battle-tested with GoJek, Farfetch, Postmates, and Zulily
● Integrated with Kubeflow
● Good community
Cons (to be addressed in the roadmap):
● GCP only
● Infrastructure-heavy
● Lacks composability
● No Data Versioning
* Now backed by Tecton
* https://blog.feast.dev/post/a-state-of-feast
[Architecture diagram components: Ingestion, Ingestion API, Feature Registry, Offline Store (BigQuery), Historical Serving, Online Store (Redis), Online Serving; consumers: Training, Discovery, Serving]
24. Hopsworks
Pros:
● Integrates with most Python libs for ingestion and training
● Supports offline store with time travel
● AWS / GCP / Azure / On-Prem Ready
Cons:
● Hard to use outside of the HopsML infrastructure
● Online store might not fit all latency requirements
* Online serving is part of the Enterprise version
[Architecture diagram components: Ingestion API (Spark, Pandas), Feature Registry, Offline Store (Hudi/Hive), Historical Serving, Online Store (MySQL), Online Serving; consumers: Training, Discovery, Serving]
27. Lessons Learned
1. Start with designing a consistent ACID Data Lake before investing in a Feature Store
2. Value from existing open source products does not justify investments in integration and the dependencies they bring
3. A Feature Store must not bring about new infrastructure and data storage solutions. It has to be a lightweight API and SDK integrated into your existing data infrastructure.
4. Data Catalog, Data Governance, and Data Quality components are horizontal for the whole Data Infrastructure, including the Feature Store
5. There are no mature open source or cloud solutions for Global Data Catalog and Data Quality monitoring.
28. Data Infrastructure with Feature Store
[Architecture diagram components: Raw Data, Event Data, Stream Processing, Batch Processing, Hot Storage, Cold Storage, Workflow Automation, Feature Store API, Training, Serving, BI Tools, API, Data Catalog, Data Quality, Data Security]
31. Recommended Strategy
Recommendations for going forward with Feature Store:
1. Make sure your existing Data Infrastructure covers 90% of Feature Store requirements (Streaming Ingestion, Consistency, Catalog, Versioning)
2. Build a lightweight in-house Feature Store API on top of your existing storage solutions
3. Collaborate with the community and cloud vendors to maintain compatibility with standards and the state-of-the-art ecosystem
4. Be ready to migrate to a managed service or an open source alternative as the market matures
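One possible shape for the lightweight in-house API recommended above is a thin client that talks to whatever storage already exists through a small interface. The sketch below assumes that design; `KeyValueBackend`, `InMemoryBackend`, and `FeatureStoreClient` are hypothetical names, and the in-memory backend merely stands in for an existing store.

```python
from typing import Any, Dict, List, Optional, Protocol


class KeyValueBackend(Protocol):
    """Minimal contract an existing store would need to satisfy (assumption)."""

    def get(self, key: str) -> Optional[Dict[str, Any]]: ...

    def put(self, key: str, value: Dict[str, Any]) -> None: ...


class InMemoryBackend:
    """Stand-in for an existing key-value store already in the platform."""

    def __init__(self) -> None:
        self._data: Dict[str, Dict[str, Any]] = {}

    def get(self, key: str) -> Optional[Dict[str, Any]]:
        value = self._data.get(key)
        return dict(value) if value is not None else None

    def put(self, key: str, value: Dict[str, Any]) -> None:
        self._data[key] = dict(value)


class FeatureStoreClient:
    """Thin API layer: no storage of its own, only key layout and merging."""

    def __init__(self, backend: KeyValueBackend, feature_group: str) -> None:
        self._backend = backend
        self._feature_group = feature_group

    def _key(self, entity_id: str) -> str:
        return f"{self._feature_group}:{entity_id}"

    def write(self, entity_id: str, features: Dict[str, Any]) -> None:
        # Merge new values into whatever is already stored for this entity.
        current = self._backend.get(self._key(entity_id)) or {}
        current.update(features)
        self._backend.put(self._key(entity_id), current)

    def read(self, entity_id: str, names: List[str]) -> Dict[str, Any]:
        stored = self._backend.get(self._key(entity_id)) or {}
        return {name: stored.get(name) for name in names}


client = FeatureStoreClient(InMemoryBackend(), feature_group="user_activity")
client.write("u1", {"orders_7d": 5})
client.write("u1", {"avg_basket": 42.5})
vector = client.read("u1", ["orders_7d", "avg_basket", "missing"])
```

Because the client depends only on the two-method interface, swapping the backend for a managed service later (recommendation 4) would not change the calling code.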
36. Amazon DynamoDB
Fast and flexible key-value database service for any scale
● Performance at scale: Consistent, single-digit millisecond response times at any scale; build applications with virtually unlimited throughput
● Serverless architecture: No hardware provisioning, software patching, or upgrades; scales up or down automatically; continuously backs up your data
● Global replication: You can build global applications with fast access to local data by easily replicating tables across multiple AWS Regions
● Enterprise security: Encrypts all data by default and fully integrates with AWS Identity and Access Management for robust security
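If a key-value service like this were used as an online feature store, each entity's feature vector could be stored as one item keyed by entity ID. The sketch below only builds request parameters in DynamoDB's typed attribute-value format (`S`, `N`, `BOOL`) and makes no network calls. The table name `online_features`, the key attribute `entity_id`, and the helper functions are assumptions for illustration; the resulting dictionaries are shaped so they could be passed to a low-level client's `put_item` and `get_item` operations.

```python
def to_attribute(value):
    """Convert a Python value to DynamoDB's typed attribute-value format."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return {"BOOL": value}
    if isinstance(value, (int, float)):
        return {"N": str(value)}  # numbers are transmitted as strings
    if isinstance(value, str):
        return {"S": value}
    raise TypeError(f"unsupported feature type: {type(value).__name__}")


def build_put_item(table, entity_id, features):
    """Parameters for writing one entity's feature vector as a single item."""
    item = {"entity_id": {"S": entity_id}}
    for name, value in features.items():
        item[name] = to_attribute(value)
    return {"TableName": table, "Item": item}


def build_get_item(table, entity_id):
    """Parameters for the single-key lookup done at prediction time."""
    return {"TableName": table, "Key": {"entity_id": {"S": entity_id}}}


put_params = build_put_item(
    "online_features", "u1", {"orders_7d": 5, "avg_basket": 42.5, "is_new": False}
)
get_params = build_get_item("online_features", "u1")
```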
37. Amazon ElastiCache
Managed, Redis- or Memcached-compatible in-memory data store
● Unlimited scale: Read scaling with replicas; write and memory scaling with sharding; nondisruptive scaling
● Fully managed: AWS manages all hardware and software setup, configuration, and monitoring
● Consistent high performance: In-memory data store and cache for sub-millisecond response times
38. Amazon Aurora
MySQL and PostgreSQL-compatible relational database built for the cloud
● Performance & scalability: 5x throughput of standard MySQL and 3x of standard PostgreSQL; scale out up to 15 read replicas
● Availability & durability: Fault-tolerant, self-healing storage; 6 copies of data across 3 AZs; continuous backup to Amazon S3
● Highly secure: Network isolation, encryption at rest / in transit
● Fully managed: Managed by Amazon RDS; on your part, no server provisioning, software patching, setup, configuration, or backups
42. Amazon Athena
Serverless, interactive query service (Analytics)
● Query instantly: Zero setup cost; point to S3 and start querying
● Pay per query: Pay only for queries run; save 30–90% on per-query costs through compression; use S3 storage
● SQL: ANSI SQL; JDBC/ODBC drivers; multiple formats, compression types, and complex joins and data types
● Easy: Serverless, with zero infrastructure and zero administration; integrated with QuickSight
43. Questions, details?
We would be happy to answer!
125 University Avenue
Suite 290, Palo Alto
California, 94301
provectus.com