2. Romain Dorgueil
@rdorgueil
CTO/Hacker in Residence
Technical Co-founder
(Solo) Founder
Eng. Manager
Developer
L’Atelier BNP Paribas
WeAreTheShops
RDC Dist. Agency
Sensio/SensioLabs
AffiliationWizard
Fell into a Linux cauldron too young
Dismantler of Atari computers
Basic literacy using a Minitel
Guitars & accordions
Off by one baby
Inception
8. Extract Transform Load
• Not new. Popular concept in the 1970s [1] [2]
• Everywhere. Commerce, websites, marketing, finance, …
• Update secondary data-stores from a master data-store.
• Insert initial data somewhere
• Bi-directional API calls
• Every day: compute time reports, issue invoices when a threshold is reached, then send an e-mail.
• …
[1] https://en.wikipedia.org/wiki/Extract,_transform,_load
[2] https://www.sas.com/en_us/insights/data-management/what-is-etl.html
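The "update secondary data-stores from a master data-store" case can be sketched in plain Python. This is only an illustration under assumptions: the two stores are in-memory dictionaries standing in for real databases, and the `email` field and its normalization rule are hypothetical.

```python
def extract(master):
    """Yield (key, record) pairs read from the master store."""
    for key, record in master.items():
        yield key, record


def transform(rows):
    """Normalize each record: strip whitespace and lowercase the e-mail."""
    for key, record in rows:
        yield key, {"email": record["email"].strip().lower()}


def load(rows, secondary):
    """Write every transformed record into the secondary store."""
    for key, record in rows:
        secondary[key] = record
    return secondary


# Hypothetical in-memory stores standing in for real databases.
master_store = {
    1: {"email": " Alice@Example.COM "},
    2: {"email": "bob@example.com"},
}
secondary_store = {}

# The three steps are chained lazily: rows flow one at a time.
load(transform(extract(master_store)), secondary_store)
```

Because each step is a generator, no step needs the full dataset in memory before the next one starts.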
14. Data Integration Tools
• Java + IDE based, for most of them
• Data transformations are blocks
• IO flow managed by connections
• Execution
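The "blocks" and "connections" vocabulary maps fairly directly onto functions and queues. The sketch below is an assumption about how such a tool could be wired, not a description of any specific product: `run_block`, `END`, and the two sample blocks are hypothetical names.

```python
import queue
import threading

END = object()  # sentinel marking the end of a stream


def run_block(func, inbox, outbox):
    """Read items from inbox, apply func, and push each result to outbox."""
    while True:
        item = inbox.get()
        if item is END:
            outbox.put(END)  # propagate end-of-stream downstream
            return
        for result in func(item):
            outbox.put(result)


def double(n):
    yield n * 2


def keep_multiples_of_four(n):
    if n % 4 == 0:
        yield n


# Connections between blocks are plain FIFO queues.
source, middle, sink = queue.Queue(), queue.Queue(), queue.Queue()

workers = [
    threading.Thread(target=run_block, args=(double, source, middle)),
    threading.Thread(target=run_block, args=(keep_multiples_of_four, middle, sink)),
]
for worker in workers:
    worker.start()

for n in range(5):
    source.put(n)
source.put(END)

for worker in workers:
    worker.join()

# Drain the last connection to collect the output.
results = []
while True:
    item = sink.get()
    if item is END:
        break
    results.append(item)
```

Each block runs in its own thread, and order is preserved because every connection has exactly one writer and one reader.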
15. In the Python world …
• Bubbles (https://github.com/stiivi/bubbles)
• PETL (https://github.com/alimanfoo/petl)
• and now… Bonobo (https://www.bonobo-project.org/)
You can also use amazing libraries including
Joblib, Dask, Pandas, Toolz,
but ETL is not their main focus.
16. Big Data Tools
• Can do anything. And probably more. Fast.
• Either need their own infrastructure or are cloud-based.
23. I want…
• A data integration / ETL tool using code as configuration.
• Preferably Python code.
• Something that can be tested (I mean, by a machine).
• Something that can use inheritance.
• Fast to install on a laptop, designed to run on servers too.
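To make "testable by a machine" and "can use inheritance" concrete, here is a minimal sketch in plain Python. The class names are hypothetical and nothing here relies on a particular ETL library; it only assumes that a transformation is a callable that yields zero or more values.

```python
class KeepIf:
    """A transformation that passes through values matching a predicate."""

    def predicate(self, value):
        raise NotImplementedError

    def __call__(self, value):
        if self.predicate(value):
            yield value


class KeepOdd(KeepIf):
    def predicate(self, value):
        return value % 2 == 1


class KeepOddAbove(KeepOdd):
    """Inheritance: reuse the parent's rule and add a threshold."""

    def __init__(self, threshold):
        self.threshold = threshold

    def predicate(self, value):
        return super().predicate(value) and value > self.threshold


def test_keep_odd_above():
    """A plain unit test: no server, no IDE, no data store required."""
    node = KeepOddAbove(threshold=3)
    assert list(node(5)) == [5]
    assert list(node(3)) == []
    assert list(node(4)) == []


test_keep_odd_above()
```

Since the transformation is ordinary code, any test runner can exercise it in isolation.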
40. Let’s rewrite main.py
import bonobo


def extract():
    for i in range(42):
        yield i


def transform(n):
    if n % 2:
        yield n


def load(n):
    print(n)


graph = bonobo.Graph(
    extract,
    transform,
    load,
)

if __name__ == '__main__':
    bonobo.run(graph)
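To see what data flows through such a three-node chain, the sketch below pushes values through it by hand. It is a sequential stand-in and not Bonobo's executor: `run_chain` and `pipe_through` are hypothetical helpers, and `load` collects into a list instead of printing so the result can be inspected.

```python
def extract():
    for i in range(42):
        yield i


def transform(n):
    if n % 2:
        yield n


collected = []


def load(n):
    # Stand-in for print(n): keep the values so they can be checked.
    collected.append(n)


def pipe_through(node, stream):
    """Call node on every item and forward whatever it yields, if anything."""
    for item in stream:
        result = node(item)
        if result is not None:
            yield from result


def run_chain(source, *nodes):
    """Feed every value from source through each node, in order."""
    stream = source()
    for node in nodes:
        stream = pipe_through(node, stream)
    # Consuming the final stream is what actually drives the chain.
    return list(stream)


leftover = run_chain(extract, transform, load)
```

The first node takes no input and produces values, each following node is called once per incoming value, and the last node returns nothing, so `leftover` is empty while `collected` holds the odd numbers.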
44. Let’s rewrite main.py
import bonobo


def extract():
    for i in range(42):
        yield i


def transform(n):
    if n % 2:
        yield n


def load(n):
    print(n)


graph = bonobo.Graph(
    extract,
    transform,
    load,
)

if __name__ == '__main__':
    bonobo.run(graph)
Each function has a shorter equivalent:
• extract → range(42)
• transform → bonobo.Filter(lambda n: n % 2)
• load → print
graph = bonobo.Graph(
    range(42),
    bonobo.Filter(filter=lambda x: x % 2),
    print,
)

if __name__ == '__main__':
    bonobo.run(graph)
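The long and short forms should produce the same values. The check below uses only the standard library, with the built-in `filter` standing in for `bonobo.Filter` (an assumption made so the snippet runs without Bonobo installed).

```python
def extract():
    for i in range(42):
        yield i


def transform(n):
    if n % 2:
        yield n


# Long form: hand-written generator functions.
long_form = [out for n in extract() for out in transform(n)]

# Short form: an iterable plus a predicate-based filter.
short_form = list(filter(lambda n: n % 2, range(42)))

same_output = long_form == short_form
```

Both lists contain the odd integers below 42, which is what either graph would hand to its last node.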