GoDataDriven
PROUDLY PART OF THE XEBIA GROUP
Real-time data driven applications
Giovanni Lanzani
Data Whisperer
Using Python + pandas as back end
Who am I?
2008-2012: PhD Theoretical Physics
2012-2013: KPMG
2013-Now: GoDataDriven
Feedback
@gglanzani
Real-time, data driven app?
• No store and retrieve;
• Store, {transform, enrich, analyse} and retrieve;
• Real-time: retrieve is not a batch process;
• App: something your mother could use:

SELECT attendees
FROM pydataberlin2014
WHERE password = '1234';
Get insight about event impact
Is it Big Data?
• Raw logs are in the order of 40TB;
• We use Hadoop for storing, enriching and pre-processing;
• (10 nodes, 24TB per node)
Challenges
1. Privacy;	

2. Huge pile of data;	

3. Real-time retrieval;	

4. Some real-time analysis.
1. Privacy
3. Real-time retrieval
• Harder than it looks;
• Large data;
• Retrieval is by giving a date, a center location and a radius.
4. (Some) real-time analysis
Architecture
[Diagram: AngularJS front-end ⇄ REST (JSON) ⇄ Flask back-end (app.py + helper.py)]
Flask
from flask import Flask
app = Flask(__name__)

@app.route("/hello")
def hello():
    return "Hello World!"

if __name__ == "__main__":
    app.run()
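
A quick way to sanity-check the route without starting a server is Flask's built-in test client; a minimal sketch, not part of the slides:

with app.test_client() as client:
    response = client.get("/hello")
    print(response.data)  # the "Hello World!" body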
app.py example
@app.route('/api/<postcode>/<date>/<radius>', methods=['GET'])
@app.route('/api/<postcode>/<date>', methods=['GET'])
def datapoints(postcode, date, radius=1.0):
    ...
    stats, timeline, points = helper.get_json(postcode, date, radius)
    return …  # returns a JSON object for AngularJS
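
A hedged sketch of what the elided return could look like, using Flask's jsonify; the keyword names in the payload are illustrative, not the talk's actual response shape:

from flask import jsonify

@app.route('/api/<postcode>/<date>/<radius>', methods=['GET'])
@app.route('/api/<postcode>/<date>', methods=['GET'])
def datapoints(postcode, date, radius=1.0):
    radius = float(radius)  # URL parameters arrive as strings
    stats, timeline, points = helper.get_json(postcode, date, radius)
    # jsonify builds the JSON response the AngularJS front end consumes
    return jsonify(stats=stats, timeline=timeline, points=points)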
data example
date        hour  id_activity  postcode  hits  delta  sbi
2013-01-01  12    1234         1234AB    35    22     1
2013-01-08  12    1234         1234AB    45    35     1
2013-01-01  11    2345         5555ZB     2     1     2
2013-01-08  11    2345         5555ZB    55     2     2
helper.py example
def get_json(postcode, date, radius):
    ...

    lat, lon = get_lat_lon(postcode)
    postcodes = get_postcodes(postcode, radius)

    data = get_data(postcodes, dates)

    stats = get_statistics(data, sbi)
    timeline = get_timeline(data, sbi)

    return stats, timeline, data.to_json(orient='records')
helper.py example
def get_statistics(data, sbi):
    sbi_df = data[data.sbi == sbi]   # filter by sbi

    hits = sbi_df.hits.sum()         # sum the hits
    delta_hits = sbi_df.delta.sum()  # sum the delta hits

    if delta_hits:
        percentage = (hits - delta_hits) / delta_hits
    else:
        percentage = 0

    return {"sbi": sbi, "total": hits, "percentage": percentage}
helper.py example
def get_timeline(data, sbi):
    df_sbi = data.groupby(["date", "hour", "sbi"]).aggregate(sum)
    return df_sbi
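
The grouped frame comes back indexed by (date, hour, sbi); to ship it to the front end it still has to be flattened and serialized. A minimal sketch of that step (the helper name and the orient choice are assumptions, not the talk's code):

def timeline_to_json(df_sbi):
    # flatten the (date, hour, sbi) MultiIndex back into columns
    return df_sbi.reset_index().to_json(orient='records')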
helper.py example
def get_json(postcode, date, radius):
    ...

    lat, lon = get_lat_lon(postcode)
    postcodes = get_postcodes(postcode, radius)
    dates = date.split(';')

    data = get_data(postcodes, dates)

    stats = get_statistics(data)
    timeline = get_timeline(data, dates)

    return stats, timeline, data.to_json(orient='records')
Who has my data?
• First iteration was a (pre-)POC, less data (3GB vs 500GB);
• Time constraints;
• Oops:
import pandas as pd
...
source_data = pd.read_csv("data.csv", …)
...
def get_data(postcodes, dates):
    result = filter_data(source_data, postcodes, dates)
    return result
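
A hedged sketch of what filter_data can look like with plain boolean masks, assuming the columns from the data example above; the implementation is illustrative, not the talk's actual code:

def filter_data(df, postcodes, dates):
    # boolean masks over the postcode and date columns
    mask = df["postcode"].isin(postcodes) & df["date"].isin(dates)
    return df[mask]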
Advantage of “everything is a df”
Pro:
• Fast!!
• Use what you know
• NO DBAs!
• We all love CSVs!

Contra:
• Doesn’t scale;
• Huge startup time;
• NO DBAs!
• We all hate CSVs!
If you want to go down this path
• Set the dataframe index wisely (see the sketch below);
• Align the data to the index:
  source_data.sort_index(inplace=True)
• Beware of modifications of the original dataframe!
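
A minimal sketch of what that can look like, assuming the columns from the data example above; the index choice and this filter_data variant are illustrative, not the talk's exact code:

import pandas as pd

source_data = pd.read_csv("data.csv", parse_dates=["date"])
# index on the columns we filter on, then sort so lookups can use the index
source_data = source_data.set_index(["postcode", "date"]).sort_index()

def filter_data(df, postcodes, dates):
    # selecting on a sorted MultiIndex is far cheaper than boolean masks
    return df.loc[(postcodes, pd.to_datetime(dates)), :]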
The reason pandas is faster is because I came up with a better algorithm
If you don’t…
data = get_data(postcodes, dates)
[Diagram: AngularJS front-end ⇄ REST (JSON) ⇄ Flask back-end (app.py + helper.py)]
If you don’t…
data = get_data(postcodes, dates)
[Diagram: AngularJS front-end ⇄ REST (JSON) ⇄ Flask back-end (app.py + helper.py) ⇄ database.py (psycopg2) ⇄ Data]
If you don’t…
data = get_data(db, postcodes, dates)
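
A hedged sketch of what database.py could provide, using psycopg2 plus pandas.read_sql so that helper.py still receives a DataFrame; the connection parameters, table and column names are assumptions:

import pandas as pd
import psycopg2

def get_connection():
    # illustrative connection parameters
    return psycopg2.connect(dbname="events", user="app", host="localhost")

def get_data(db, postcodes, dates):
    query = """
        SELECT date, hour, id_activity, postcode, hits, delta, sbi
        FROM datapoints
        WHERE postcode = ANY(%s) AND date = ANY(%s);
    """
    # psycopg2 adapts Python lists to Postgres arrays for ANY(%s)
    return pd.read_sql(query, db, params=(postcodes, dates))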
Handling geo-data
def get_json(postcode, date, radius):
    """
    ...
    """
    lat, lon = get_lat_lon(postcode)
    postcodes = get_postcodes(postcode, radius)
    dates = date.split(';')

    data = get_data(postcodes, dates)

    stats = get_statistics(data)
    timeline = get_timeline(data, dates)

    return stats, timeline, data.to_json(orient='records')
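
The geo part is get_postcodes, which has to turn a centre postcode plus a radius into a list of postcodes. A minimal sketch using a haversine filter over a postcode-to-coordinate lookup table (the postcodes.csv file and its columns are assumptions):

import numpy as np
import pandas as pd

# assumed lookup table: one row per postcode with its centroid
postcode_coords = pd.read_csv("postcodes.csv")  # columns: postcode, lat, lon

def get_postcodes(postcode, radius_km):
    lat, lon = get_lat_lon(postcode)
    lat1, lon1 = np.radians(lat), np.radians(lon)
    lat2 = np.radians(postcode_coords["lat"].values)
    lon2 = np.radians(postcode_coords["lon"].values)
    # haversine great-circle distance in kilometres
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    dist_km = 2 * 6371 * np.arcsin(np.sqrt(a))
    return postcode_coords.loc[dist_km <= radius_km, "postcode"].tolist()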
Issues?!
• With a radius of 10km, in Amsterdam, you get 10k postcodes. You need to do this in your SQL:
• Index on date and postcode, but single queries running for more than 20 minutes.

SELECT * FROM datapoints
WHERE date IN date_array
  AND postcode IN postcode_array;
Postgres + PostGIS (2.x)
PostGIS is a spatial database extender for PostgreSQL.
It supports geographic objects, allowing location queries.

SELECT *
FROM datapoints
WHERE ST_DWithin(lon, lat, 1500)
  AND dates IN ('2013-02-30', '2013-02-31');
-- every point within 1.5km
-- from (lat, lon) on imaginary dates
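
The ST_DWithin call above is shorthand: the real function takes two geometries (or geographies) plus a distance in metres. A hedged sketch of how the back end could run the corrected query, assuming datapoints carries a geography column named location (the column name, SRID and date values are assumptions):

import pandas as pd

def get_data_within(db, lat, lon, metres, dates):
    query = """
        SELECT *
        FROM datapoints
        WHERE ST_DWithin(
                location,
                ST_SetSRID(ST_MakePoint(%(lon)s, %(lat)s), 4326)::geography,
                %(metres)s)
          AND date = ANY(%(dates)s);
    """
    params = {"lat": lat, "lon": lon, "metres": metres, "dates": dates}
    return pd.read_sql(query, db, params=params)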
Other DBs?
Steps to solve it
1. Align data on disk by date;
2. Use the temporary table trick (SQL below; a Python sketch follows):
3. Lose precision: 1234AB → 1234
4. (Compression)

CREATE TEMPORARY TABLE tmp (postcodes TEXT NOT NULL PRIMARY KEY);
INSERT INTO tmp (postcodes) VALUES postcode_array;

SELECT * FROM tmp
JOIN datapoints d
  ON d.postcode = tmp.postcodes
WHERE d.dt IN dates_array;
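
A hedged sketch of driving the temporary-table trick from Python with psycopg2; execute_values does the bulk insert, and everything apart from the dt column name shown above is an assumption:

import pandas as pd
import psycopg2.extras

def get_data(db, postcodes, dates):
    with db.cursor() as cur:
        # ON COMMIT DROP cleans the table up when the transaction ends
        cur.execute("CREATE TEMPORARY TABLE tmp "
                    "(postcodes TEXT NOT NULL PRIMARY KEY) ON COMMIT DROP;")
        # bulk-insert the (possibly ~10k) postcodes in one round trip
        psycopg2.extras.execute_values(
            cur, "INSERT INTO tmp (postcodes) VALUES %s",
            [(p,) for p in postcodes])
        cur.execute("""
            SELECT d.* FROM tmp
            JOIN datapoints d ON d.postcode = tmp.postcodes
            WHERE d.dt = ANY(%s);""", (dates,))
        # first field of each description entry is the column name
        columns = [col[0] for col in cur.description]
        rows = cur.fetchall()
    db.commit()  # end the transaction, dropping tmp
    return pd.DataFrame(rows, columns=columns)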
GoDataDriven
We’re hiring / Questions? / Thank you!
@gglanzani	

giovannilanzani@godatadriven.com
Giovanni Lanzani	

Data Whisperer
