2. Project Goal was …
Integrate game logs for a large actor in social gaming
IsCool Entertainment (Euronext: ALWEK), 70 people,
10M€ revenue.
Around 30 GB of raw logs per day for 7 games (web, mobile).
That's about 10 TB per year.
At the end, some Hadoop-ing + analytics SQL, but
in the middle, lots of data integration
Any kind of logs and data:
Partial database extracts
Apache/NGinx logs
Tracking logs (web analytics stuff, etc.)
Application Logs
REST APIs (currency exchange, geo data,
Facebook APIs, …)
Dataiku™
3. As a reminder
What do most data scientists do?
On LinkedIn & Twitter: "Data Science", "Recommendation", "Clustering algorithms", "Big Data", "Machine Learning", "Hidden Markov Model", "Predictive Analytics", "Logistic Regression"
In real life: 80% of the time is spent getting the data right, 19% analytics, 1% Twitter & LinkedIn
4. Goal
A project based on an ETL solution had
previously failed.
Need for
Agility
To manage any data
To be quick
The answer is …
PYTHON!!!
5. Step 1: Open your favorite
editor, write a .py file
Script for data parsing, filling up the
database, enrichment, cleanup,
etc.
Around 2,000 lines of code
5 man-days of work
Good, but hard to maintain
in the long run
Not fun
I switched from emacs to
SublimeText2 in the meantime, that
was cool.
6. Step 2: Abstract and
Generalize. PyBabe
Micro-ETL in Python
Can read and write: FTP, HTTP, SQL, filesystem, Amazon S3, e-mail, ZIP,
GZIP, MongoDB, Excel, etc.
Basic file filters and transformations (filters, regular expressions, date parsing,
geoip, transpose, sort, group, …)
Uses yield and named tuples
Open source
https://github.com/fdouetteau/PyBabe
And the old project?
The old project became 200 lines of specific code
8. Sample PyBabe script
(2) Large file sort, join
babe = Babe()
## Fetch a large CSV file
babe = babe.pull(filename='mybigfile.csv')
## Perform a disk-based sort, batching 100k lines in memory
babe = babe.sortDiskBased(field='uid', nsize=100000)
## Group by uid and sum revenue per user
babe = babe.groupBy(field='uid', reducer=lambda x, y: (x.uid, x.amount + y.amount))
## Join this stream on 'uid' with a SQL table pull
babe = babe.join(Babe().pull_sql(database='mydb', table='user_info'), 'uid', 'uid')
## Store the result in an Excel file
babe.push(filename='reports.xlsx')
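The sort-then-reduce pattern in this script can be sketched with the standard library. This is an in-memory stand-in (sorted + itertools.groupby), not PyBabe's disk-based implementation; the Row type and sample data are my own for illustration:

```python
from collections import namedtuple
from itertools import groupby
from operator import attrgetter

Row = namedtuple('Row', ['uid', 'amount'])

rows = [Row('u2', 5), Row('u1', 10), Row('u2', 7), Row('u1', 3)]

# Sort by the grouping key first (sortDiskBased would spill batches to disk)
rows.sort(key=attrgetter('uid'))

# A group-by over sorted input reduces each run of equal keys in one pass
totals = [Row(uid, sum(r.amount for r in run))
          for uid, run in groupby(rows, key=attrgetter('uid'))]
print(totals)  # [Row(uid='u1', amount=13), Row(uid='u2', amount=12)]
```

Sorting first is what makes the group-by a single streaming pass: every key's rows are adjacent, so only one group has to be held in memory at a time.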
9. Sample PyBabe script
(3) Mail a report
babe = Babe()
## Pull the result of a SQL query
babe = babe.pull(database='mydb', name='First Query', query='SELECT …')
## Pull the result of a second SQL query
babe = babe.pull(database='mydb', name='Second Query', query='SELECT …')
## Send the overall (concatenated) stream as an email, with the content attached as Excel and
## some sample data in the body
babe = babe.sendmail(subject='Your Report', recipients='fd@me.com', data_in_body=True,
                     data_in_body_row_limit=10, attach_formats='xlsx')
10. Some Design Choices
Use collections.namedtuple
Use generators
Nice and easy programming style
def filter(stream, f):
    for data in stream:
        if isinstance(data, StreamMeta):
            yield data
        elif f(data):
            yield data
IO streaming whenever possible
An HTTP-downloaded file begins to be processed as it starts downloading
Use bulk loaders (SQL) or an external program when faster than the Python
implementation (e.g. gzip)
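A minimal, self-contained sketch of this namedtuple + generator style. The StreamMeta class, the source generator, and the keep_if name are stand-ins of mine, mirroring the filter shown above:

```python
from collections import namedtuple

Row = namedtuple('Row', ['uid', 'amount'])

class StreamMeta:
    """Stand-in for PyBabe's stream metadata marker (headers/footers)."""
    pass

def source():
    # A stream mixes metadata objects and data rows
    yield StreamMeta()
    yield Row(uid=1, amount=10)
    yield Row(uid=2, amount=0)
    yield Row(uid=3, amount=25)

def keep_if(stream, f):
    # Same shape as the filter above: metadata always passes through,
    # rows are kept only if the predicate accepts them
    for data in stream:
        if isinstance(data, StreamMeta):
            yield data
        elif f(data):
            yield data

# Chain generators: nothing is materialized, rows stream through lazily
kept = [d for d in keep_if(source(), lambda r: r.amount > 0)
        if not isinstance(d, StreamMeta)]
print(kept)  # [Row(uid=1, amount=10), Row(uid=3, amount=25)]
```

Because each stage is a generator, memory stays flat no matter how large the stream is, which is what makes the HTTP-download-while-processing behaviour possible.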
11. PyBabe data model
A Babe works on a generator that contains a sequence of partitions.
A Partition is composed of a header (StreamHeader), rows, and a Footer.
def sample_pull():
    header = StreamHeader(name='visits',
                          partition={'day': '2012-09-14'},
                          fields=['name', 'day'])
    yield header
    yield header.makeRow('Florian', '2012-09-14')
    yield header.makeRow('John', '2012-09-14')
    yield StreamFooter()
    yield header.replace(partition={'day': '2012-09-15'})
    yield header.makeRow('Phil', '2012-09-15')
    yield StreamFooter()
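To make the data model concrete, here is a runnable sketch with stand-in StreamHeader/StreamFooter classes (assumptions of mine, not PyBabe's actual implementation) plus a consumer that groups rows by partition:

```python
from collections import namedtuple

class StreamFooter:
    """Marks the end of the current partition."""
    pass

class StreamHeader:
    """Stand-in: carries partition info and builds namedtuple rows."""
    def __init__(self, name, partition, fields):
        self.name = name
        self.partition = partition
        self.fields = fields
        self._row_type = namedtuple(name, fields)

    def makeRow(self, *values):
        return self._row_type(*values)

    def replace(self, partition):
        # A new partition reuses the schema with different partition info
        return StreamHeader(self.name, partition, self.fields)

def sample_pull():
    header = StreamHeader(name='visits',
                          partition={'day': '2012-09-14'},
                          fields=['name', 'day'])
    yield header
    yield header.makeRow('Florian', '2012-09-14')
    yield header.makeRow('John', '2012-09-14')
    yield StreamFooter()
    yield header.replace(partition={'day': '2012-09-15'})
    yield header.makeRow('Phil', '2012-09-15')
    yield StreamFooter()

# Consume the stream partition by partition
rows_per_day = {}
day = None
for item in sample_pull():
    if isinstance(item, StreamHeader):
        day = item.partition['day']
        rows_per_day[day] = []
    elif isinstance(item, StreamFooter):
        pass  # partition closed
    else:
        rows_per_day[day].append(item.name)

print(rows_per_day)
# {'2012-09-14': ['Florian', 'John'], '2012-09-15': ['Phil']}
```

The header-rows-footer framing is what lets a single flat generator carry several logical datasets (here, one per day) without nesting.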
12. Some thoughts and
associated projects
strptime and performance
Parsing a date with time.strptime or datetime.strptime:
30 microseconds, vs. 3 microseconds for regexp matching!
"Tarpys": a date-parsing library, with date guessing
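The comparison above can be reproduced with the standard library. Exact timings vary by machine and Python version, so treat this as a rough sketch rather than a benchmark:

```python
import re
import timeit
from datetime import datetime

DATE_RE = re.compile(r'(\d{4})-(\d{2})-(\d{2})')

def parse_strptime(s='2012-09-14'):
    # Full format-string parser: flexible, but relatively slow
    return datetime.strptime(s, '%Y-%m-%d')

def parse_regex(s='2012-09-14'):
    # Hand-rolled regexp for one known format: much less work per call
    y, m, d = DATE_RE.match(s).groups()
    return datetime(int(y), int(m), int(d))

# Both parsers must agree on the result
assert parse_strptime() == parse_regex()

n = 100000
t_strptime = timeit.timeit(parse_strptime, number=n) / n
t_regex = timeit.timeit(parse_regex, number=n) / n
print('strptime: %.1f us/call, regex: %.1f us/call'
      % (t_strptime * 1e6, t_regex * 1e6))
```

The trade-off: the regexp path only handles the one format it was written for, which is why a guessing layer like Tarpys sits on top.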
Charset management (pyencoding_cleaner)
Sniff ISO or UTF-8 charset over a fragment
Optionally try to fix bad encoding (mojibake such as Ã®, Ã©)
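pyencoding_cleaner itself is not shown in the slides, but the usual repair trick for this kind of mojibake (UTF-8 bytes mistakenly decoded as Latin-1) can be sketched as:

```python
def fix_mojibake(text):
    """Repair UTF-8 text that was wrongly decoded as Latin-1.
    E.g. the two characters 'Ã©' become 'é' again.
    Returns the input unchanged if the round-trip fails."""
    try:
        return text.encode('latin-1').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text

print(fix_mojibake('caf\u00c3\u00a9'))  # café
print(fix_mojibake('plain ascii'))      # plain ascii
```

Re-encoding through Latin-1 recovers the original UTF-8 byte sequence, so decoding it as UTF-8 restores the intended characters; strings that were never mangled pass through untouched.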
The Python 2.x csv module is OK, but …
No Unicode support
Separator sniffing is buggy on edge cases
13. Future
Need to separate the GitHub project into core and plugins
Rewrite the CSV module in C? …
Configurable error system: should an error row fail the
whole stream, fail the whole babe, send a warning, or
be skipped?
Pandas/NumPy integration
A homepage, docs, etc.
14. Ask questions?
babe = Babe().pull('questions.csv')
babe = babe.filter(smart=True)
babe = babe.mapTo(oracle)
babe.push('answers.csv')
Florian Douetteau
@fdouetteau
CEO, Dataiku
Dataiku: Our Goal
- Leverage and provide the best of open
source technologies to help people build
their own data science platform