PyBabe
Eat whatever data you wanna eat




                                  Dataikuℱ
Project Goal


 Integrate game logs for a large social-gaming actor

       IsCool Entertainment (Euronext: ALWEK), 70 people,
       €10M in revenues.

 Around 30 GB of raw logs per day across 7 games (web, mobile)

       That's about 10 TB per year.

       At the end, some Hadoop'ing + analytics SQL, but
       in the middle lots of data integration.

 Any kind of logs and data

       Partial database extracts

       Apache/Nginx logs

       Tracking logs (web analytics, etc.)

       Application logs

       REST APIs (currency exchange, geo data,
       Facebook APIs, ...)

                                                            Dataikuℱ
As a reminder
What do most data scientists do?

    On LinkedIn & Twitter:
       "Data Science", "Recommendation", "Clustering algorithms",
       "Big Data", "Machine Learning", "Hidden Markov Model",
       "Predictive Analytics", "Logistic Regression"

    In real life:
       80% of their time is spent getting the data right
       19% analytics
       1% Twitter & LinkedIn
                                           Dataikuℱ
Goal

 A project based on an ETL solution had
 previously failed

 Need for

     Agility

     To manage any data

     To be quick

 The answer is...

     PYTHON!!!



                                          Dataikuℱ
Step 1: Open your favorite
editor, write a .py file

 Script for data parsing, filling up the
 database, enrichment, cleanup,
 etc.

 Around 2,000 lines of code

 5 man-days of work

      Good, but hard to maintain
      in the long run

      Not fun

 I switched from emacs to
 SublimeText2 in the meantime, which
 was cool.


                                          Dataikuℱ
Step 2: Abstract and
Generalize. PyBabe

 Micro-ETL in Python

 Can read and write: FTP, HTTP, SQL, filesystem, Amazon S3, e-mail, ZIP,
 GZIP, MongoDB, Excel, etc.

 Basic file filters and transformations (filters, regular expressions, date parsing,
 geoip, transpose, sort, group, ...)

 Uses yield and named tuples (see the toy sketch below)

 Open source

     https://github.com/fdouetteau/PyBabe

 And the old project?

     The old project became 200 lines of specific code

                                                                       Dataikuℱ
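
To make the "yield and named tuples" point concrete before the real samples: a toy sketch, not PyBabe's actual API, in which rows are named tuples and each transformation is a generator, so nothing is materialized until a sink iterates the stream.

    from collections import namedtuple

    # Toy illustration (not PyBabe's actual API): rows are namedtuples,
    # each step is a generator, and work only happens when the sink
    # iterates the stream.
    Visit = namedtuple("Visit", ["user_id", "country"])

    def source():
        yield Visit("u1", "FR")
        yield Visit("u2", "US")

    def keep_country(rows, country):
        for row in rows:
            if row.country == country:
                yield row

    for row in keep_country(source(), "FR"):
        print(row)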
Sample PyBabe script
(1) Fetch log files from S3 and integrate them into a database

babe = Babe()

## Fetch multiple CSV files from S3, cache them locally
babe = babe.pull(url="s3://myapp/mydir/2012-07-07_*.csv.gz", cache=True)

## Read the IP from the "ip" field, resolve the country with GeoIP
babe = babe.geoip_country_code(field="ip", country_code="country", ignore_error=True)

## Parse the user agent and store the browser name
babe = babe.user_agent(field="user_agent", browser="browser")

## Keep only the relevant fields
babe = babe.filterFields(fields=["user_id", "date", "country", "user_agent"])

## Store the result in a database
babe.push_sql(database="mydb", table="mytable", username="...")


                                                                                Dataikuℱ
Sample PyBabe script
    (2) Large file sort, join

babe = Babe()

## Fetch a large CSV file
babe = babe.pull(filename="mybigfile.csv")

## Perform a disk-based sort, batching 100k lines in memory
babe = babe.sortDiskBased(field="uid", nsize=100000)

## Group by uid and sum the revenue per user (see the sketch below)
babe = babe.groupBy(field="uid", reducer=lambda x, y: (x.uid, x.amount + y.amount))

## Join this stream on "uid" with a stream pulled from a SQL table
babe = babe.join(Babe().pull_sql(database="mydb", table="user_info"), "uid", "uid")

## Store the result in an Excel file
babe.push(filename="reports.xlsx")



                                                                                    Dataikuℱ
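
The groupBy step above relies on the stream already being sorted by uid, hence the preceding sortDiskBased. A minimal sketch of that sort-then-reduce pattern with itertools.groupby, using a hypothetical Row tuple rather than PyBabe internals:

    from collections import namedtuple
    from functools import reduce
    from itertools import groupby

    Row = namedtuple("Row", ["uid", "amount"])

    def group_by_uid(rows, reducer):
        # Assumes rows arrive already sorted by uid, as after a
        # disk-based sort; consecutive rows sharing a uid are folded
        # together with the reducer.
        for uid, group in groupby(rows, key=lambda r: r.uid):
            yield reduce(reducer, group)

    rows = [Row(1, 10.0), Row(1, 5.0), Row(2, 7.0)]
    for r in group_by_uid(rows, lambda x, y: Row(x.uid, x.amount + y.amount)):
        print(r)  # Row(uid=1, amount=15.0), then Row(uid=2, amount=7.0)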
Sample PyBabe script
   (3) Mail a report

babe = Babe()

## Pull the result of a SQL query
babe = babe.pull(database="mydb", name="First Query", query="SELECT ...")

## Pull the result of a second SQL query
babe = babe.pull(database="mydb", name="Second Query", query="SELECT ...")

## Send the overall (concatenated) stream as an email, with the content
## attached as an Excel file and some sample data in the body
babe = babe.sendmail(subject="Your Report", recipients="fd@me.com", data_in_body=True,
                     data_in_body_row_limit=10, attach_formats="xlsx")




                                                                                Dataikuℱ
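
What such a sendmail sink boils down to, sketched with today's standard library rather than PyBabe's implementation; send_report, xlsx_bytes, and sample_rows are hypothetical names:

    import smtplib
    from email.message import EmailMessage

    def send_report(subject, recipient, xlsx_bytes, sample_rows):
        # Put a few sample rows in the body and attach the full
        # result rendered as an Excel file.
        msg = EmailMessage()
        msg["Subject"] = subject
        msg["From"] = "reports@example.com"
        msg["To"] = recipient
        msg.set_content("Sample rows:\n" + "\n".join(map(str, sample_rows)))
        msg.add_attachment(
            xlsx_bytes, maintype="application",
            subtype="vnd.openxmlformats-officedocument.spreadsheetml.sheet",
            filename="report.xlsx")
        with smtplib.SMTP("localhost") as smtp:
            smtp.send_message(msg)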
Some Design Choices

 Use collections.namedtuple

 Use generators

     Nice and easy programming style

          def filter(stream, f):
              for data in stream:
                  if isinstance(data, StreamMeta):
                      yield data
                  elif f(data):
                      yield data

 IO streaming whenever possible

     An HTTP-downloaded file begins to be processed as it starts downloading

 Use bulk loaders (SQL) or an external program when faster than the Python
 implementation (e.g. gzip; sketched below)

                                                                          Dataikuℱ
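
For the last point, a minimal sketch of delegating decompression to the external gzip binary while still streaming its output line by line; gunzip_lines is a hypothetical helper, not PyBabe API:

    import subprocess

    def gunzip_lines(path):
        # Spawn "gzip -dc" and yield lines as they are produced, so
        # downstream steps run while decompression is still going.
        proc = subprocess.Popen(["gzip", "-dc", path],
                                stdout=subprocess.PIPE)
        try:
            for line in proc.stdout:
                yield line
        finally:
            proc.stdout.close()
            proc.wait()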
PyBabe data model

 A Babe works on a generator that contains
 a sequence of partitions.

 A partition is composed of a header
 (StreamHeader), rows, and a footer
 (StreamFooter).

    def sample_pull():
        header = StreamHeader(name="visits",
                              partition={'day': '2012-09-14'},
                              fields=["name", "day"])
        yield header
        yield header.makeRow('Florian', '2012-09-14')
        yield header.makeRow('John', '2012-09-14')
        yield StreamFooter()

        yield header.replace(partition={'day': '2012-09-15'})
        yield header.makeRow('Phil', '2012-09-15')
        yield StreamFooter()
                                                                     Dataikuℱ
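
Reading such a stream back is symmetric; a sketch of a consumer, assuming StreamHeader and StreamFooter are the classes above and rows are named tuples:

    def consume(stream):
        current = None
        for item in stream:
            if isinstance(item, StreamHeader):
                current = item          # a new partition begins
                print("partition:", current.partition)
            elif isinstance(item, StreamFooter):
                current = None          # the partition is complete
            else:
                print("row:", item)     # a namedtuple row

    consume(sample_pull())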
Some thoughts and
associated projects

 strptime and performance

     Parsing a date with time.strptime or datetime.strptime:
     30 microseconds, vs. 3 microseconds for regexp matching
     (see the timing sketch after this list)

     "Tarpys", a date-parsing library with date guessing

 Charset management (pyencoding_cleaner)

     Sniff ISO or UTF-8 charsets over a fragment

     Optionally try to fix bad encoding (ĂƒÆ’Ă‚Âź, ĂƒÆ’Ă‚Â­, ĂƒÆ’Ă‚ÂŒ)

 python2.X csv module is OK, but...

     No Unicode support

     Separator sniffing buggy on edge cases

                                                                      Dataikuℱ
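
The strptime claim is easy to reproduce; a minimal benchmark sketch (exact numbers depend on the machine and Python version):

    import re
    import timeit
    from datetime import datetime

    DATE_RE = re.compile(r"(\d{4})-(\d{2})-(\d{2})")

    def parse_strptime(s="2012-09-14"):
        return datetime.strptime(s, "%Y-%m-%d")

    def parse_regex(s="2012-09-14"):
        y, m, d = DATE_RE.match(s).groups()
        return int(y), int(m), int(d)

    for fn in (parse_strptime, parse_regex):
        secs = timeit.timeit(fn, number=100000)
        # total seconds for 100k calls -> microseconds per call
        print(fn.__name__, "%.2f microseconds/call" % (secs * 10.0))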
Future

 Need to separate the GitHub project into core and plugins

 Rewrite a CSV module in C?

 Configurable error system: should an error row fail the
 whole stream, fail the whole babe, send a warning, or
 be skipped?

 Pandas/NumPy integration

 A homepage, docs, etc.

                                                   Dataikuℱ
Questions?

                  babe = Babe().pull("questions.csv")

                  babe = babe.filter(smart=True)

                  babe = babe.mapTo(oracle)

                  babe.push("answers.csv")

Florian Douetteau
@fdouetteau
CEO, Dataiku

Dataiku: Our Goal

    Leverage and provide the best of open
    source technologies to help people build
    their own data science platform                     Dataikuℱ
