Accelerating Data Science through Feature Platform, Transformers and GenAI

FeatureByte
Accelerating Data Science
through Feature Platform,
Transformers and GenAI
Sept 2023
Xavier Conort
with the help of Google Bard, ChatGPT and FeatureByte!
FeatureByte
Agenda
2
● Define feature engineering
● The features I like to extract from transactional data
● Why feature engineering can be hard
● Pain points solved by Feature Platforms
● The magic of Transformers
● How Generative AI helped FeatureByte build a context-aware
automated feature ideation
FeatureByte
A little about me
3
Actuary
Learnt to formulate
mathematically
insurance problems and
solved them with data.
Kaggle addict
Learnt to have fun with
data.
Chief Data Scientist,
DataRobot
Learnt to codify best practices.
Co-founder,
FeatureByte
Solving remaining pain points
and planning to have fun with
data again!
FeatureByte
Define feature
engineering
FeatureByte 5
FeatureByte
2 types of features
6
1. Transformed features:
a. combine columns, binning…
b. model specific: scaling, one-hot encoding, embedding, reducing, …
2. Features from raw data:
a. From regular time series
b. From transactional data
c. From slowly changing dimension tables
Well covered by books and blogs, well supported by
tools like sklearn and automated by Auto-ML
Pretty well covered by books and blogs and
automated by Auto-ML, TS package or deep learning
The most exciting but challenging feature
engineering task.
FeatureByte
The features I like to
extract from
transactional data
FeatureByte 8
across a variety of signal types and entities associated with the use case
1. Attribute: gets the attribute of the entity at a point-in-time
2. Frequency: counts the occurrence of events
3. Recency: measures the time since the latest event
4. Timing: relates to when the events happened
5. Latest event: attributes of the latest event
6. Stats: aggregates a numeric column's values
7. Diversity: measures the variability of data values
8. Stability: compares recent events to those of earlier periods
9. Similarity: compares an individual entity feature to a group
10. Most frequent: gets the most frequent value of a categorical column
11. Bucketing: aggregates a column's values across categories of a categorical column.
12. Attribute stats: collects stats for an attribute of the entity
13. Attribute change: measures the occurrence or magnitude of changes to slowly changing attributes
14. Location…
The MEANINGFUL ones!
FeatureByte 9
Dimension table with
static description on
products
Item table with details on
purchased products in
customer invoices
Event table
where each
row indicates
an invoice in a
Grocery shop
Slowly Changing Dimension table that contains
data on customers that change over time.
Simple example of transaction data: A Grocery Dataset
FeatureByte 10
Invoice x Similarity
Customer x Diversity
Customer x Attribute change
Customer x Similarity
FeatureByte 11
Customer x Timing
Features inspired by Owen Zhang, Kaggle
Grandmaster
FeatureByte 12
More than 100 examples in the tutorials of FeatureByte’s free and
source available package
FeatureByte
Why feature
engineering can be hard
FeatureByte
Feature Engineering with transactional data is Complex and Challenging
14
1. Transactional data can be
XXXL
2. Feature space can be
infinite
3. High risk of time leakage
and training-serving
inconsistencies: need to be
point-in-time accurate
4. Feature engineering
strategy varies with the
data semantics
5. Garbage-in Garbage-out
FeatureByte
3 skills come into play
15
DOMAIN
EXPERTISE
DATA
ENGINEERING
DATA
SCIENCE
DW
SAAS
SOURCES
MODEL
TRAINING
MODEL
PREDICTION
STREAMING
FeatureByte
… while 3 innovations are already radically simplifying the effort
16
DOMAIN
EXPERTISE
DATA
ENGINEERING
DATA
SCIENCE
Generative AI is
getting close to
emulate a
Domain Expert
Feature
Platforms are
simplifying data
engineering
Off the Shelf
Transformers are
revolutionizing NLP
FeatureByte
Pain points solved by
feature platforms
FeatureByte 18
Pain Point How a Feature Platform is addressing it
High latency in feature computation Pre-computes features and stores them in
an online store for low latency
Disappointing production accuracy caused
by inconsistency of feature values between
training and production
Ensures point-in-time correctness and
training-serving consistency
Code for efficient time travel operations in
SQL is complex
Declarative framework in Python to simplify
the creation of features
Waiting to experiment with our new feature
ideas can be frustrating
Automated backfilling to compute features at
any point in time in the past
Painful search for existing features Feature Catalog to share and reuse features
Uncertainties when our new features will be
finally deployed
Unified solution to ensure smooth transition
from experimentation to deployment
FeatureByte
The magic of
Transformers
FeatureByte 20
FeatureByte
What do I think about Transformers?
21
1. As a scientist: they are beautiful.
2. As a data scientist: they are awesome.
3. As a MLOps engineer: another transformation to operationalize.
4. As a product manager: they are opening up new horizons!
FeatureByte
Immediate benefits of Transformers
22
1. Radically simplify a complex task (NLP)
2. In many cases, don’t need any training for great outcomes! Can be used off
the shelf to transform raw text columns of transactional data:
a. Text embedding,
b. Name Entity Recognition,
c. Sentiment analysis,
d. Keyword extraction,
e. Extract specific information in a table structure…
3. I just need to aggregate the outputs in a meaningful way
FeatureByte
Do I still need Feature Platforms for Transformers?
23
Feature Platforms can improve Transformers operationalization
1. Point-in-Time correctness
2. Low latency thanks to the pre-computation of features that involve
transformers
3. Caching (via partial aggregations or other mechanisms) to reduce expensive
transformer calls
4. Transformer Library
FeatureByte
Can Deep Learning also replace traditional feature engineering?
24
Maybe in the future, but less magic expected…
1. I have not seen anything convincing yet apart from:
a. regular time series
b. sequences of event (successful results for recommendation system)
2. If solutions emerge, it won’t be pre-trained Off the Shelf Transformers like
for NLP and you will need a lot of data to train them to get meaningful
results.
3. And for many of us, it may not pass the model validation process:
a. not explainable enough
b. and maybe not robust enough if your data is not XXXL
FeatureByte 25
"Semantics is to artificial intelligence as physics is
to engineering."
John McCarthy, considered to be one of the most important figures in the
history of artificial intelligence
"Creativity is just connecting things."
Steve Jobs
So why do I think Transformers are revolutionary?
FeatureByte
How Generative AI can
help feature
engineering
FeatureByte
Areas where Generative AI is already doing great
27
Generative AI:
1. is already very familiar with data modeling concepts
2. can recognize the semantic of the data columns well beyond numeric and
string if meaningful names or good descriptions are provided
3. already possesses deep domain knowledge that can help
a. take some important decisions in the feature engineering process
b. assess how relevant a feature is to a use case
FeatureByte
How FeatureByte’s Copilot is currently leveraging GenAI
28
Description of Data
(tables, columns, entities)
Data Semantics &
Domain Insights
Feature Ideation
Engine
Feature code &
descriptions
Relevance Score
and Explanation
GenAI GenAI
Context-aware
Feature suggestions
Description of Use Case
(domain, context, target)
that are interpretable and
relevant! And editable.
FeatureByte 29
FeatureByte
Generative AI is already
familiar with data
modeling concepts
FeatureByte
Knows best practices
31
FeatureByte
Knows what is an entity which is a key concept in feature engineering
32
FeatureByte
Generative AI can
recognize data columns
semantics
FeatureByte
Make the difference between different types of numeric columns
34
FeatureByte
Generative AI can help
design your feature
engineering strategy
FeatureByte
Example: Can recommend strategy to filter data
36
FeatureByte
Generative AI can
assess the feature
relevance to a use case
FeatureByte
Plain English explanation
38
FeatureByte
… even for complex features
39
FeatureByte
… and is sharing its domain knowledge during the process
40
FeatureByte
Conclusion
FeatureByte
The 3 innovations are accelerating Data Science by
42
1. Reducing time-to-value thanks to:
a. quicker experimentation, and smooth transition to deployment
2. Increasing production accuracy thanks to:
a. training / serving consistency
b. transformers
c. context aware feature generation
3. Increasing Transparency thanks to:
a. Plain English descriptions of features
b. Relevance explanations to a use case
FeatureByte
and are opening up new horizons
43
New opportunities to:
● bridge the gap between features and non-technical stakeholders and
contribute to better Model Explainability and stronger regulation
● boost your creativity and domain knowledge
● build solutions and not only models!
● and more time to explore other advanced solutions like causal modelling +
A/B testing that may be better addressing your use cases
FeatureByte
Thanks!
If you are curious to know more about Featurebyte
White Paper on FeatureByte Copilot
Video of FeatureByte Enterprise
Github Repo of the free and source available engine: https://github.com/featurebyte/featurebyte
Documentation: https://docs.featurebyte.com/0.5/
1 de 44

Recomendados

Accelerating Data Science through Feature Platform, Transformers, and GenAI por
Accelerating Data Science through Feature Platform, Transformers, and GenAIAccelerating Data Science through Feature Platform, Transformers, and GenAI
Accelerating Data Science through Feature Platform, Transformers, and GenAIFeatureByte
189 vistas46 diapositivas
Streamlining Data Science Workflows with a Feature Catalog por
Streamlining Data Science Workflows with a Feature CatalogStreamlining Data Science Workflows with a Feature Catalog
Streamlining Data Science Workflows with a Feature CatalogGoDataDriven
122 vistas21 diapositivas
Considerations for Abstracting Complexities of a Real-Time ML Platform, Zhenz... por
Considerations for Abstracting Complexities of a Real-Time ML Platform, Zhenz...Considerations for Abstracting Complexities of a Real-Time ML Platform, Zhenz...
Considerations for Abstracting Complexities of a Real-Time ML Platform, Zhenz...HostedbyConfluent
314 vistas40 diapositivas
Scaling & Transforming Stitch Fix's Visibility into What Folks will love por
Scaling & Transforming Stitch Fix's Visibility into What Folks will loveScaling & Transforming Stitch Fix's Visibility into What Folks will love
Scaling & Transforming Stitch Fix's Visibility into What Folks will loveJune Andrews
76 vistas43 diapositivas
Agile Mumbai 2022 - Rohit Handa | Combining Human and Artificial Intelligence... por
Agile Mumbai 2022 - Rohit Handa | Combining Human and Artificial Intelligence...Agile Mumbai 2022 - Rohit Handa | Combining Human and Artificial Intelligence...
Agile Mumbai 2022 - Rohit Handa | Combining Human and Artificial Intelligence...AgileNetwork
12 vistas21 diapositivas
Simplify Feature Engineering in Your Data Warehouse por
Simplify Feature Engineering in Your Data WarehouseSimplify Feature Engineering in Your Data Warehouse
Simplify Feature Engineering in Your Data WarehouseFeatureByte
64 vistas28 diapositivas

Más contenido relacionado

Similar a Accelerating Data Science through Feature Platform, Transformers and GenAI

How we integrate Machine Learning Algorithms into our IT Platform at Outfittery por
How we integrate Machine Learning Algorithms into our IT Platform at OutfitteryHow we integrate Machine Learning Algorithms into our IT Platform at Outfittery
How we integrate Machine Learning Algorithms into our IT Platform at OutfitteryOUTFITTERY
1.6K vistas81 diapositivas
Feature store: Solving anti-patterns in ML-systems por
Feature store: Solving anti-patterns in ML-systemsFeature store: Solving anti-patterns in ML-systems
Feature store: Solving anti-patterns in ML-systemsAndrzej Michałowski
2.5K vistas30 diapositivas
Ml ops and the feature store with hopsworks, DC Data Science Meetup por
Ml ops and the feature store with hopsworks, DC Data Science MeetupMl ops and the feature store with hopsworks, DC Data Science Meetup
Ml ops and the feature store with hopsworks, DC Data Science MeetupJim Dowling
497 vistas42 diapositivas
BSSML16 L10. Summary Day 2 Sessions por
BSSML16 L10. Summary Day 2 SessionsBSSML16 L10. Summary Day 2 Sessions
BSSML16 L10. Summary Day 2 SessionsBigML, Inc
303 vistas29 diapositivas
Continuum Analytics and Python por
Continuum Analytics and PythonContinuum Analytics and Python
Continuum Analytics and PythonTravis Oliphant
4.7K vistas94 diapositivas
Practical data science por
Practical data sciencePractical data science
Practical data scienceDing Li
142 vistas41 diapositivas

Similar a Accelerating Data Science through Feature Platform, Transformers and GenAI(20)

How we integrate Machine Learning Algorithms into our IT Platform at Outfittery por OUTFITTERY
How we integrate Machine Learning Algorithms into our IT Platform at OutfitteryHow we integrate Machine Learning Algorithms into our IT Platform at Outfittery
How we integrate Machine Learning Algorithms into our IT Platform at Outfittery
OUTFITTERY1.6K vistas
Feature store: Solving anti-patterns in ML-systems por Andrzej Michałowski
Feature store: Solving anti-patterns in ML-systemsFeature store: Solving anti-patterns in ML-systems
Feature store: Solving anti-patterns in ML-systems
Andrzej Michałowski2.5K vistas
Ml ops and the feature store with hopsworks, DC Data Science Meetup por Jim Dowling
Ml ops and the feature store with hopsworks, DC Data Science MeetupMl ops and the feature store with hopsworks, DC Data Science Meetup
Ml ops and the feature store with hopsworks, DC Data Science Meetup
Jim Dowling497 vistas
BSSML16 L10. Summary Day 2 Sessions por BigML, Inc
BSSML16 L10. Summary Day 2 SessionsBSSML16 L10. Summary Day 2 Sessions
BSSML16 L10. Summary Day 2 Sessions
BigML, Inc303 vistas
Continuum Analytics and Python por Travis Oliphant
Continuum Analytics and PythonContinuum Analytics and Python
Continuum Analytics and Python
Travis Oliphant4.7K vistas
Practical data science por Ding Li
Practical data sciencePractical data science
Practical data science
Ding Li142 vistas
Serverless machine learning architectures at Helixa por Data Science Milan
Serverless machine learning architectures at HelixaServerless machine learning architectures at Helixa
Serverless machine learning architectures at Helixa
Data Science Milan259 vistas
Applying linear regression and predictive analytics por MariaDB plc
Applying linear regression and predictive analyticsApplying linear regression and predictive analytics
Applying linear regression and predictive analytics
MariaDB plc248 vistas
ML Platform Q1 Meetup: Airbnb's End-to-End Machine Learning Infrastructure por Fei Chen
ML Platform Q1 Meetup: Airbnb's End-to-End Machine Learning InfrastructureML Platform Q1 Meetup: Airbnb's End-to-End Machine Learning Infrastructure
ML Platform Q1 Meetup: Airbnb's End-to-End Machine Learning Infrastructure
Fei Chen3.9K vistas
ODSC West 2022 – Kitbashing in ML por Bryan Bischof
ODSC West 2022 – Kitbashing in MLODSC West 2022 – Kitbashing in ML
ODSC West 2022 – Kitbashing in ML
Bryan Bischof74 vistas
Agile Data Rationalization for Operational Intelligence por Inside Analysis
Agile Data Rationalization for Operational IntelligenceAgile Data Rationalization for Operational Intelligence
Agile Data Rationalization for Operational Intelligence
Inside Analysis1.9K vistas
Digital_IOT_(Microsoft_Solution).pdf por ssuserd23711
Digital_IOT_(Microsoft_Solution).pdfDigital_IOT_(Microsoft_Solution).pdf
Digital_IOT_(Microsoft_Solution).pdf
ssuserd237114 vistas
Balancing Infrastructure with Optimization and Problem Formulation por Alex D. Gaudio
Balancing Infrastructure with Optimization and Problem FormulationBalancing Infrastructure with Optimization and Problem Formulation
Balancing Infrastructure with Optimization and Problem Formulation
Alex D. Gaudio847 vistas
Managers guide to effective building of machine learning products por Gianmario Spacagna
Managers guide to effective building of machine learning productsManagers guide to effective building of machine learning products
Managers guide to effective building of machine learning products
Gianmario Spacagna683 vistas
The Data Science Process - Do we need it and how to apply? por Ivo Andreev
The Data Science Process - Do we need it and how to apply?The Data Science Process - Do we need it and how to apply?
The Data Science Process - Do we need it and how to apply?
Ivo Andreev6K vistas
The Art Of Performance Tuning - with presenter notes! por Jonathan Ross
The Art Of Performance Tuning - with presenter notes!The Art Of Performance Tuning - with presenter notes!
The Art Of Performance Tuning - with presenter notes!
Jonathan Ross111 vistas
DIGITAL DESIGN THROUGH VERILOG por Aliyahh King
DIGITAL DESIGN THROUGH VERILOGDIGITAL DESIGN THROUGH VERILOG
DIGITAL DESIGN THROUGH VERILOG
Aliyahh King5 vistas
Data Engineer's Lunch #85: Designing a Modern Data Stack por Anant Corporation
Data Engineer's Lunch #85: Designing a Modern Data StackData Engineer's Lunch #85: Designing a Modern Data Stack
Data Engineer's Lunch #85: Designing a Modern Data Stack
Anant Corporation147 vistas
Doing Analytics Right - Building the Analytics Environment por Tasktop
Doing Analytics Right - Building the Analytics EnvironmentDoing Analytics Right - Building the Analytics Environment
Doing Analytics Right - Building the Analytics Environment
Tasktop144 vistas

Último

HTTP headers that make your website go faster - devs.gent November 2023 por
HTTP headers that make your website go faster - devs.gent November 2023HTTP headers that make your website go faster - devs.gent November 2023
HTTP headers that make your website go faster - devs.gent November 2023Thijs Feryn
28 vistas151 diapositivas
NTGapps NTG LowCode Platform por
NTGapps NTG LowCode Platform NTGapps NTG LowCode Platform
NTGapps NTG LowCode Platform Mustafa Kuğu
141 vistas30 diapositivas
Mitigating Common CloudStack Instance Deployment Failures - Jithin Raju - Sha... por
Mitigating Common CloudStack Instance Deployment Failures - Jithin Raju - Sha...Mitigating Common CloudStack Instance Deployment Failures - Jithin Raju - Sha...
Mitigating Common CloudStack Instance Deployment Failures - Jithin Raju - Sha...ShapeBlue
74 vistas18 diapositivas
ESPC 2023 - Protect and Govern your Sensitive Data with Microsoft Purview in ... por
ESPC 2023 - Protect and Govern your Sensitive Data with Microsoft Purview in ...ESPC 2023 - Protect and Govern your Sensitive Data with Microsoft Purview in ...
ESPC 2023 - Protect and Govern your Sensitive Data with Microsoft Purview in ...Jasper Oosterveld
28 vistas49 diapositivas
Centralized Logging Feature in CloudStack using ELK and Grafana - Kiran Chava... por
Centralized Logging Feature in CloudStack using ELK and Grafana - Kiran Chava...Centralized Logging Feature in CloudStack using ELK and Grafana - Kiran Chava...
Centralized Logging Feature in CloudStack using ELK and Grafana - Kiran Chava...ShapeBlue
48 vistas17 diapositivas
Network Source of Truth and Infrastructure as Code revisited por
Network Source of Truth and Infrastructure as Code revisitedNetwork Source of Truth and Infrastructure as Code revisited
Network Source of Truth and Infrastructure as Code revisitedNetwork Automation Forum
42 vistas45 diapositivas

Último(20)

HTTP headers that make your website go faster - devs.gent November 2023 por Thijs Feryn
HTTP headers that make your website go faster - devs.gent November 2023HTTP headers that make your website go faster - devs.gent November 2023
HTTP headers that make your website go faster - devs.gent November 2023
Thijs Feryn28 vistas
NTGapps NTG LowCode Platform por Mustafa Kuğu
NTGapps NTG LowCode Platform NTGapps NTG LowCode Platform
NTGapps NTG LowCode Platform
Mustafa Kuğu141 vistas
Mitigating Common CloudStack Instance Deployment Failures - Jithin Raju - Sha... por ShapeBlue
Mitigating Common CloudStack Instance Deployment Failures - Jithin Raju - Sha...Mitigating Common CloudStack Instance Deployment Failures - Jithin Raju - Sha...
Mitigating Common CloudStack Instance Deployment Failures - Jithin Raju - Sha...
ShapeBlue74 vistas
ESPC 2023 - Protect and Govern your Sensitive Data with Microsoft Purview in ... por Jasper Oosterveld
ESPC 2023 - Protect and Govern your Sensitive Data with Microsoft Purview in ...ESPC 2023 - Protect and Govern your Sensitive Data with Microsoft Purview in ...
ESPC 2023 - Protect and Govern your Sensitive Data with Microsoft Purview in ...
Jasper Oosterveld28 vistas
Centralized Logging Feature in CloudStack using ELK and Grafana - Kiran Chava... por ShapeBlue
Centralized Logging Feature in CloudStack using ELK and Grafana - Kiran Chava...Centralized Logging Feature in CloudStack using ELK and Grafana - Kiran Chava...
Centralized Logging Feature in CloudStack using ELK and Grafana - Kiran Chava...
ShapeBlue48 vistas
Updates on the LINSTOR Driver for CloudStack - Rene Peinthor - LINBIT por ShapeBlue
Updates on the LINSTOR Driver for CloudStack - Rene Peinthor - LINBITUpdates on the LINSTOR Driver for CloudStack - Rene Peinthor - LINBIT
Updates on the LINSTOR Driver for CloudStack - Rene Peinthor - LINBIT
ShapeBlue91 vistas
Business Analyst Series 2023 - Week 4 Session 7 por DianaGray10
Business Analyst Series 2023 -  Week 4 Session 7Business Analyst Series 2023 -  Week 4 Session 7
Business Analyst Series 2023 - Week 4 Session 7
DianaGray1080 vistas
KVM Security Groups Under the Hood - Wido den Hollander - Your.Online por ShapeBlue
KVM Security Groups Under the Hood - Wido den Hollander - Your.OnlineKVM Security Groups Under the Hood - Wido den Hollander - Your.Online
KVM Security Groups Under the Hood - Wido den Hollander - Your.Online
ShapeBlue102 vistas
PharoJS - Zürich Smalltalk Group Meetup November 2023 por Noury Bouraqadi
PharoJS - Zürich Smalltalk Group Meetup November 2023PharoJS - Zürich Smalltalk Group Meetup November 2023
PharoJS - Zürich Smalltalk Group Meetup November 2023
Noury Bouraqadi141 vistas
Ransomware is Knocking your Door_Final.pdf por Security Bootcamp
Ransomware is Knocking your Door_Final.pdfRansomware is Knocking your Door_Final.pdf
Ransomware is Knocking your Door_Final.pdf
Security Bootcamp76 vistas
Keynote Talk: Open Source is Not Dead - Charles Schulz - Vates por ShapeBlue
Keynote Talk: Open Source is Not Dead - Charles Schulz - VatesKeynote Talk: Open Source is Not Dead - Charles Schulz - Vates
Keynote Talk: Open Source is Not Dead - Charles Schulz - Vates
ShapeBlue119 vistas
What’s New in CloudStack 4.19 - Abhishek Kumar - ShapeBlue por ShapeBlue
What’s New in CloudStack 4.19 - Abhishek Kumar - ShapeBlueWhat’s New in CloudStack 4.19 - Abhishek Kumar - ShapeBlue
What’s New in CloudStack 4.19 - Abhishek Kumar - ShapeBlue
ShapeBlue131 vistas
Hypervisor Agnostic DRS in CloudStack - Brief overview & demo - Vishesh Jinda... por ShapeBlue
Hypervisor Agnostic DRS in CloudStack - Brief overview & demo - Vishesh Jinda...Hypervisor Agnostic DRS in CloudStack - Brief overview & demo - Vishesh Jinda...
Hypervisor Agnostic DRS in CloudStack - Brief overview & demo - Vishesh Jinda...
ShapeBlue63 vistas
Developments to CloudStack’s SDN ecosystem: Integration with VMWare NSX 4 - P... por ShapeBlue
Developments to CloudStack’s SDN ecosystem: Integration with VMWare NSX 4 - P...Developments to CloudStack’s SDN ecosystem: Integration with VMWare NSX 4 - P...
Developments to CloudStack’s SDN ecosystem: Integration with VMWare NSX 4 - P...
ShapeBlue82 vistas
Zero to Cloud Hero: Crafting a Private Cloud from Scratch with XCP-ng, Xen Or... por ShapeBlue
Zero to Cloud Hero: Crafting a Private Cloud from Scratch with XCP-ng, Xen Or...Zero to Cloud Hero: Crafting a Private Cloud from Scratch with XCP-ng, Xen Or...
Zero to Cloud Hero: Crafting a Private Cloud from Scratch with XCP-ng, Xen Or...
ShapeBlue88 vistas
Backup and Disaster Recovery with CloudStack and StorPool - Workshop - Venko ... por ShapeBlue
Backup and Disaster Recovery with CloudStack and StorPool - Workshop - Venko ...Backup and Disaster Recovery with CloudStack and StorPool - Workshop - Venko ...
Backup and Disaster Recovery with CloudStack and StorPool - Workshop - Venko ...
ShapeBlue77 vistas
GDG Cloud Southlake 28 Brad Taylor and Shawn Augenstein Old Problems in the N... por James Anderson
GDG Cloud Southlake 28 Brad Taylor and Shawn Augenstein Old Problems in the N...GDG Cloud Southlake 28 Brad Taylor and Shawn Augenstein Old Problems in the N...
GDG Cloud Southlake 28 Brad Taylor and Shawn Augenstein Old Problems in the N...
James Anderson133 vistas

Accelerating Data Science through Feature Platform, Transformers and GenAI

  • 1. FeatureByte Accelerating Data Science through Feature Platform, Transformers and GenAI Sept 2023 Xavier Conort with the help of Google Bard, ChatGPT and FeatureByte!
  • 2. FeatureByte Agenda 2 ● Define feature engineering ● The features I like to extract from transactional data ● Why feature engineering can be hard ● Pain points solved by Feature Platforms ● The magic of Transformers ● How Generative AI helped FeatureByte build a context-aware automated feature ideation
  • 3. FeatureByte A little about me 3 Actuary Learnt to formulate mathematically insurance problems and solved them with data. Kaggle addict Learnt to have fun with data. Chief Data Scientist, DataRobot Learnt to codify best practices. Co-founder, FeatureByte Solving remaining pain points and planning to have fun with data again!
  • 6. FeatureByte 2 types of features 6 1. Transformed features: a. combine columns, binning… b. model specific: scaling, one-hot encoding, embedding, reducing, … 2. Features from raw data: a. From regular time series b. From transactional data c. From slowly changing dimension tables Well covered by books and blogs, well supported by tools like sklearn and automated by Auto-ML Pretty well covered by books and blogs and automated by Auto-ML, TS package or deep learning The most exciting but challenging feature engineering task.
  • 7. FeatureByte The features I like to extract from transactional data
  • 8. FeatureByte 8 across a variety of signal types and entities associated with the use case 1. Attribute: gets the attribute of the entity at a point-in-time 2. Frequency: counts the occurrence of events 3. Recency: measures the time since the latest event 4. Timing: relates to when the events happened 5. Latest event: attributes of the latest event 6. Stats: aggregates a numeric column's values 7. Diversity: measures the variability of data values 8. Stability: compares recent events to those of earlier periods 9. Similarity: compares an individual entity feature to a group 10. Most frequent: gets the most frequent value of a categorical column 11. Bucketing: aggregates a column's values across categories of a categorical column. 12. Attribute stats: collects stats for an attribute of the entity 13. Attribute change: measures the occurrence or magnitude of changes to slowly changing attributes 14. Location… The MEANINGFUL ones!
  • 9. FeatureByte 9 Dimension table with static description on products Item table with details on purchased products in customer invoices Event table where each row indicates an invoice in a Grocery shop Slowly Changing Dimension table that contains data on customers that change over time. Simple example of transaction data: A Grocery Dataset
  • 10. FeatureByte 10 Invoice x Similarity Customer x Diversity Customer x Attribute change Customer x Similarity
  • 11. FeatureByte 11 Customer x Timing Features inspired by Owen Zhang, Kaggle Grandmaster
  • 12. FeatureByte 12 More than 100 examples in the tutorials of FeatureByte’s free and source available package
  • 14. FeatureByte Feature Engineering with transactional data is Complex and Challenging 14 1. Transactional data can be XXXL 2. Feature space can be infinite 3. High risk of time leakage and training-serving inconsistencies: need to be point-in-time accurate 4. Feature engineering strategy varies with the data semantics 5. Garbage-in Garbage-out
  • 15. FeatureByte 3 skills come into play 15 DOMAIN EXPERTISE DATA ENGINEERING DATA SCIENCE DW SAAS SOURCES MODEL TRAINING MODEL PREDICTION STREAMING
  • 16. FeatureByte … while 3 innovations are already radically simplifying the effort 16 DOMAIN EXPERTISE DATA ENGINEERING DATA SCIENCE Generative AI is getting close to emulate a Domain Expert Feature Platforms are simplifying data engineering Off the Shelf Transformers are revolutionizing NLP
  • 17. FeatureByte Pain points solved by feature platforms
  • 18. FeatureByte 18 Pain Point How a Feature Platform is addressing it High latency in feature computation Pre-computes features and stores them in an online store for low latency Disappointing production accuracy caused by inconsistency of feature values between training and production Ensures point-in-time correctness and training-serving consistency Code for efficient time travel operations in SQL is complex Declarative framework in Python to simplify the creation of features Waiting to experiment with our new feature ideas can be frustrating Automated backfilling to compute features at any point in time in the past Painful search for existing features Feature Catalog to share and reuse features Uncertainties when our new features will be finally deployed Unified solution to ensure smooth transition from experimentation to deployment
  • 21. FeatureByte What do I think about Transformers? 21 1. As a scientist: they are beautiful. 2. As a data scientist: they are awesome. 3. As a MLOps engineer: another transformation to operationalize. 4. As a product manager: they are opening up new horizons!
  • 22. FeatureByte Immediate benefits of Transformers 22 1. Radically simplify a complex task (NLP) 2. In many cases, don’t need any training for great outcomes! Can be used off the shelf to transform raw text columns of transactional data: a. Text embedding, b. Name Entity Recognition, c. Sentiment analysis, d. Keyword extraction, e. Extract specific information in a table structure… 3. I just need to aggregate the outputs in a meaningful way
  • 23. FeatureByte Do I still need Feature Platforms for Transformers? 23 Feature Platforms can improve Transformers operationalization 1. Point-in-Time correctness 2. Low latency thanks to the pre-computation of features that involve transformers 3. Caching (via partial aggregations or other mechanisms) to reduce expensive transformer calls 4. Transformer Library
  • 24. FeatureByte Can Deep Learning also replace traditional feature engineering? 24 Maybe in the future, but less magic expected… 1. I have not seen anything convincing yet apart from: a. regular time series b. sequences of event (successful results for recommendation system) 2. If solutions emerge, it won’t be pre-trained Off the Shelf Transformers like for NLP and you will need a lot of data to train them to get meaningful results. 3. And for many of us, it may not pass the model validation process: a. not explainable enough b. and maybe not robust enough if your data is not XXXL
  • 25. FeatureByte 25 "Semantics is to artificial intelligence as physics is to engineering." John McCarthy, considered to be one of the most important figures in the history of artificial intelligence "Creativity is just connecting things." Steve Jobs So why do I think Transformers are revolutionary?
  • 26. FeatureByte How Generative AI can help feature engineering
  • 27. FeatureByte Areas where Generative AI is already doing great 27 Generative AI: 1. is already very familiar with data modeling concepts 2. can recognize the semantic of the data columns well beyond numeric and string if meaningful names or good descriptions are provided 3. already possesses deep domain knowledge that can help a. take some important decisions in the feature engineering process b. assess how relevant a feature is to a use case
  • 28. FeatureByte How FeatureByte’s Copilot is currently leveraging GenAI 28 Description of Data (tables, columns, entities) Data Semantics & Domain Insights Feature Ideation Engine Feature code & descriptions Relevance Score and Explanation GenAI GenAI Context-aware Feature suggestions Description of Use Case (domain, context, target) that are interpretable and relevant! And editable.
  • 30. FeatureByte Generative AI is already familiar with data modeling concepts
  • 32. FeatureByte Knows what is an entity which is a key concept in feature engineering 32
  • 33. FeatureByte Generative AI can recognize data columns semantics
  • 34. FeatureByte Make the difference between different types of numeric columns 34
  • 35. FeatureByte Generative AI can help design your feature engineering strategy
  • 36. FeatureByte Example: Can recommend strategy to filter data 36
  • 37. FeatureByte Generative AI can assess the feature relevance to a use case
  • 39. FeatureByte … even for complex features 39
  • 40. FeatureByte … and is sharing its domain knowledge during the process 40
  • 42. FeatureByte The 3 innovations are accelerating Data Science by 42 1. Reducing time-to-value thanks to: a. quicker experimentation, and smooth transition to deployment 2. Increasing production accuracy thanks to: a. training / serving consistency b. transformers c. context aware feature generation 3. Increasing Transparency thanks to: a. Plain English descriptions of features b. Relevance explanations to a use case
  • 43. FeatureByte and are opening up new horizons 43 New opportunities to: ● bridge the gap between features and non-technical stakeholders and contribute to better Model Explainability and stronger regulation ● boost your creativity and domain knowledge ● build solutions and not only models! ● and more time to explore other advanced solutions like causal modelling + A/B testing that may be better addressing your use cases
  • 44. FeatureByte Thanks! If you are curious to know more about Featurebyte White Paper on FeatureByte Copilot Video of FeatureByte Enterprise Github Repo of the free and source available engine: https://github.com/featurebyte/featurebyte Documentation: https://docs.featurebyte.com/0.5/