SlideShare una empresa de Scribd logo
1 de 74
Descargar para leer sin conexión
How to interactively visualise and
explore a billion objects (with vaex)
Maarten Breddels
&
Amina Helmi
PyData Paris 2016
Outline
• Motivation
• Technical
• Vaex
• Conclusions
Copyright ESA/ATG medialab; background: ESO/S. Brunier
• Gaia satellite
• launched by ESA in december 2013
• determines positions, velocities and
astrophysical parameters of >109
stars of the Milky Way
• First catalogue in September 2016
• Final catalogue ~2020
Credit: NASA/Adler/U. Chicago/Wesleyan/JPL-Caltech
image:Devin Powell
Motivation
• We will soon have the Gaia catalogue
• > 10
9
objects/stars
• Can we visualise and explore this?
• We want to ‘see’ the data
• Data checks (reduction issues)
• Science: trends, relations, clustering
• You are the (biological) neutral
network
Motivation
• We will soon have the Gaia catalogue
• > 10
9
objects/stars
• Can we visualise and explore this?
• We want to ‘see’ the data
• Data checks (reduction issues)
• Science: trends, relations, clustering
• You are the (biological) neutral
network
• Problem
• Scatter plots do not work well for 10
9
rows/objects (like Gaia)
• Work with densities/averages in 1,2 and
3d
• Interactive?
• Zoom, pan etc
• Explore
• selections/queries
• Subspace ranking
Motivation
• We will soon have the Gaia catalogue
• > 10
9
objects/stars
• Can we visualise and explore this?
• We want to ‘see’ the data
• Data checks (reduction issues)
• Science: trends, relations, clustering
• You are the (biological) neutral
network
• Problem
• Scatter plots do not work well for 10
9
rows/objects (like Gaia)
• Work with densities/averages in 1,2 and
3d
• Interactive?
• Zoom, pan etc
• Explore
• selections/queries
• Subspace ranking
Motivation
• We will soon have the Gaia catalogue
• > 10
9
objects/stars
• Can we visualise and explore this?
• We want to ‘see’ the data
• Data checks (reduction issues)
• Science: trends, relations, clustering
• You are the (biological) neutral
network
• Problem
• Scatter plots do not work well for 10
9
rows/objects (like Gaia)
• Work with densities/averages in 1,2 and
3d
• Interactive?
• Zoom, pan etc
• Explore
• selections/queries
• Subspace ranking
Motivation
• We will soon have the Gaia catalogue
• > 10
9
objects/stars
• Can we visualise and explore this?
• We want to ‘see’ the data
• Data checks (reduction issues)
• Science: trends, relations, clustering
• You are the (biological) neutral
network
• Problem
• Scatter plots do not work well for 10
9
rows/objects (like Gaia)
• Work with densities/averages in 1,2 and
3d
• Interactive?
• Zoom, pan etc
• Explore
• selections/queries
• Subspace ranking
Motivation
• We will soon have the Gaia catalogue
• > 10
9
objects/stars
• Can we visualise and explore this?
• We want to ‘see’ the data
• Data checks (reduction issues)
• Science: trends, relations, clustering
• You are the (biological) neutral
network
• Problem
• Scatter plots do not work well for 10
9
rows/objects (like Gaia)
• Work with densities/averages in 1,2 and
3d
• Interactive?
• Zoom, pan etc
• Explore
• selections/queries
• Subspace ranking
Motivation
• We will soon have the Gaia catalogue
• > 10
9
objects/stars
• Can we visualise and explore this?
• We want to ‘see’ the data
• Data checks (reduction issues)
• Science: trends, relations, clustering
• You are the (biological) neutral
network
• Problem
• Scatter plots do not work well for 10
9
rows/objects (like Gaia)
• Work with densities/averages in 1,2 and
3d
• Interactive?
• Zoom, pan etc
• Explore
• selections/queries
• Subspace ranking
• Python
• rich set of libraries, becoming the default in
science
Situation
• TOPCAT comes close, not fast enough, works with
individual rows/particles, no exploratory tools,
written in Java.
• Your own IDL/Python code: a lot to consider to do it
optimal (multicore, efficient storage, efficient
algorithms, interactive becomes complex)
• DataShader?
• We want something to visualize 109 rows/objects in
~1 second
• Do we need to resort to Big Data solutions?
Big data?
• Big data
• definitions vary
• Practical question to ask
• Do you need Big Data solutions?
• Many computers / grid
• Invest in software
• Or
• Can we do it on a single computer ‘old style’
Outline
• Motivation
• Technical
• Vaex
• Conclusions
Interactive? How?
• Interactive?
• 10
9
* 2 * 8 bytes = 15 GiB (double is 8 bytes)
• Memory bandwidth: 10-20 GiB/s: ~1 second
• CPU: 3 Ghz (but multicore, say 4): 12 cycles/
second
• Few cycles per row/object, simple algorithm
• Histograms/Density grids
• Yes, but
• If it fits/cached in memory, otherwise sdd/hdd
speeds (30-100 seconds)
• proper storage and reading of data
• simple and fast algorithm for binning
•Aquarius A-2: pure dark matter N
body simulation of Milky Way like Halo
•6. *108 particles
• < 1 second
Interactive? How?
• Interactive?
• 10
9
* 2 * 8 bytes = 15 GiB (double is 8 bytes)
• Memory bandwidth: 10-20 GiB/s: ~1 second
• CPU: 3 Ghz (but multicore, say 4): 12 cycles/
second
• Few cycles per row/object, simple algorithm
• Histograms/Density grids
• Yes, but
• If it fits/cached in memory, otherwise sdd/hdd
speeds (30-100 seconds)
• proper storage and reading of data
• simple and fast algorithm for binning
•Aquarius A-2: pure dark matter N
body simulation of Milky Way like Halo
•6. *108 particles
• < 1 second
• Storage: native, column based (hdf5, fits)
• Normal (POSIX read) method:
• Allocate memory
• read from disk to memory
• Actually: from disk, to OS cache, to memory (if
unbuffered, otherwise another copy)
• Wastes memory (actually disk cache)
• 15 GB data, requires 30 GB is you want to use the
file system cache
cache
How to store and read the data
• Storage: native, column based (hdf5, fits)
• Normal (POSIX read) method:
• Allocate memory
• read from disk to memory
• Actually: from disk, to OS cache, to memory (if
unbuffered, otherwise another copy)
• Wastes memory (actually disk cache)
• 15 GB data, requires 30 GB is you want to use the
file system cache
cache
How to store and read the data
• Memory mapping:
• get direct access to OS memory cache, no copy, no
setup (apart from the kernel doing setting up the
pages)
• avoid memory copies, more cache available
• In previous example:
• copying 15 GB will take about 0.5-1.0 second, at
10-20 GB/s
• Can be 2-3x slower (cpu cache helps a bit)
• Storage: native, column based (hdf5, fits)
• Normal (POSIX read) method:
• Allocate memory
• read from disk to memory
• Actually: from disk, to OS cache, to memory (if
unbuffered, otherwise another copy)
• Wastes memory (actually disk cache)
• 15 GB data, requires 30 GB is you want to use the
file system cache
cache
How to store and read the data
• Memory mapping:
• get direct access to OS memory cache, no copy, no
setup (apart from the kernel doing setting up the
pages)
• avoid memory copies, more cache available
• In previous example:
• copying 15 GB will take about 0.5-1.0 second, at
10-20 GB/s
• Can be 2-3x slower (cpu cache helps a bit)
Translate to something
visual
• 1d: histogram
• 2d:
• histogram
• use colormap to map
to a color
• 3d: volume rendering
Translate to something
visual
• 1d: histogram
• 2d:
• histogram
• use colormap to map
to a color
• 3d: volume rendering
Translate to something
visual
• 1d: histogram
• 2d:
• histogram
• use colormap to map
to a color
• 3d: volume rendering
How to get good
performance
• Pure python
• slow, GIL
• Numpy
• numpy.histogram slow
• C extension
• fast, no GIL, slower development
• less boilerplate code: cffi
• Numba @jit(nopython=True, nogil=True)
• Python code, c performance
• (nogil didn’t exist)
How to get good
performance
• Pure python
• slow, GIL
• Numpy
• numpy.histogram slow
• C extension
• fast, no GIL, slower development
• less boilerplate code: cffi
• Numba @jit(nopython=True, nogil=True)
• Python code, c performance
• (nogil didn’t exist)
How to get good
performance
• Pure python
• slow, GIL
• Numpy
• numpy.histogram slow
• C extension
• fast, no GIL, slower development
• less boilerplate code: cffi
• Numba @jit(nopython=True, nogil=True)
• Python code, c performance
• (nogil didn’t exist)
How to get good
performance
• Pure python
• slow, GIL
• Numpy
• numpy.histogram slow
• C extension
• fast, no GIL, slower development
• less boilerplate code: cffi
• Numba @jit(nopython=True, nogil=True)
• Python code, c performance
• (nogil didn’t exist)
600 times faster
Great… 2d histograms…
1 32 4 .. 9 11x
4 7 41 .. 91 61
y
+1
Great… 2d histograms…
• There is more
• Weighted histogram
• Don’t sums 1’s
• Sum values
1 32 4 .. 9 11x
4 7 41 .. 91 61
y
+1
Great… 2d histograms…
• There is more
• Weighted histogram
• Don’t sums 1’s
• Sum values
1 32 4 .. 9 11x
4 7 41 .. 91 61
y
+1
2 1 4 .. 8 3v
+v, or v2
Great… 2d histograms…
• There is more
• Weighted histogram
• Don’t sums 1’s
• Sum values
1 32 4 .. 9 11x
4 7 41 .. 91 61
y
+1
2 1 4 .. 8 3v
+v, or v2
• Possibilities
• Total: flux, mass
• Mean: velocity, metallicity
• Dispersions: velocity…
• Basically statistics on a grid
Examples
Examples
Examples
Exploration
• Selections
• expressions
• visual
• linked views
• Subspace ranking
Exploration
• Selections
• expressions
• visual
• linked views
• Subspace ranking
Exploration
• Selections
• expressions
• visual
• linked views
• Subspace ranking
Exploration
• Selections
• expressions
• visual
• linked views
• Subspace ranking
Outline
• Motivation
• Technical
• Vaex
• Conclusions
Vaex: Visualization And EXploration
• A library
• python package
• ‘import vaex’
• reading of data
• multithreading
• binning (1,2,3, Nd)
• selections/queries
• server/client
• integrates with IPython notebook
• A GUI program
• Gives interactive navigation, zoom,
pan
• interactive selection (lasso,
rectangle)
• client
• undo/redo
Example data / vaex.example()
• Helmi and de Zeeuw 2000
• build up of a MW (stellar) halo from 33 satellites
• 3.3 * 106 particles
• Almost smooth in configuration space (x,y,z)
• Structure visible in E-Lz-L space
Exploration:Automated
• High dimensionality → many subspaces
• E-L, E-y, E-x, E-vx, E-vy, E-vz, E-z, E-Lz, L-y, L-
x, L-vx, L-vy, L-vz, L-z, L-Lz, y-x, y-vx, y-vy, y-
vz, y-z, y-Lz, x-vx, x-vy, x-vz, x-z, x-Lz, vx-vy,
vx-vz, vx-z, vx-Lz, vy-vz, vy-z, vy-Lz, vz-z, vz-
Lz, z-Lz
• Can we automate this / at least help?
• Ranking subspaces using Mutual Information
Mutual information
• Measure the information loss between
p(x,y) and p(x)p(y)
• information loss measured using the KL
divergence:
• In short:
• How much does breaking up correlation
(not just linear) change density
distribution?
Ranking by Mutual information
p(x,y) p(x)p(y) I(X;Y)
0.003
0.837
x,y
Lz,L
Ranking by Mutual information
p(x,y) p(x)p(y) I(X;Y)
0.003
0.837
x,y
Lz,L
Expressions/Virtual columns
• Not just visualize existing columns
• mathematical operations
• sqrt(x**2+y**2)
• log(r)
• Virtual columns
• r=sqrt(x**2+y**2)
• Suggest common ones
• equatorial to galactic
coordinates
How to get the data
• Gaia: ~1-2 TB
• download
• torrent
• Do you need to?
• server/client?
Client/Server
• 2d histogram example
• raw data:15 GB
• binned 256x256 grid
• 500KB, or 10-100kb compressed
• Proof of concept
• over http / websocket
• vaex webserver [file …]
Random subset / shuffled storage
Random subset / shuffled storage
Random subset / shuffled storage
Random subset / shuffled storage
Random subset / shuffled storage
…
dask?
wendelin?
Get vaex
• standalone binary (OS X , Linux) (just download and start)
• www.astro.rug.nl/~breddels/vaex (or google ‘vaex
visualisation’)
• www.github.com/maartenbreddels/vaex
• In Python
• Vanilla Python
• pip install —user —pre vaex
• Anaconda
• conda install -c maartenbreddels vaex
Summary
• vaex: visualisation and exploration
• of large datasets 10
6-9
objects
• fast: > 10
9
objects/second
• 1-2 and 3d visualisation using densities
• ~6d with vector field overlaid
• Explore using selections+linked views or
automated ranking of subspaces
• Can run on a single computer, zero setup
• Or as a server
• Main goal is Gaia catalogue, but tested/suitable for
• other catalogues
• simulations (N body)
Summary
• vaex: visualisation and exploration
• of large datasets 10
6-9
objects
• fast: > 10
9
objects/second
• 1-2 and 3d visualisation using densities
• ~6d with vector field overlaid
• Explore using selections+linked views or
automated ranking of subspaces
• Can run on a single computer, zero setup
• Or as a server
• Main goal is Gaia catalogue, but tested/suitable for
• other catalogues
• simulations (N body)
Summary
• vaex: visualisation and exploration
• of large datasets 10
6-9
objects
• fast: > 10
9
objects/second
• 1-2 and 3d visualisation using densities
• ~6d with vector field overlaid
• Explore using selections+linked views or
automated ranking of subspaces
• Can run on a single computer, zero setup
• Or as a server
• Main goal is Gaia catalogue, but tested/suitable for
• other catalogues
• simulations (N body)
Summary
• vaex: visualisation and exploration
• of large datasets 10
6-9
objects
• fast: > 10
9
objects/second
• 1-2 and 3d visualisation using densities
• ~6d with vector field overlaid
• Explore using selections+linked views or
automated ranking of subspaces
• Can run on a single computer, zero setup
• Or as a server
• Main goal is Gaia catalogue, but tested/suitable for
• other catalogues
• simulations (N body)
Summary
• vaex: visualisation and exploration
• of large datasets 10
6-9
objects
• fast: > 10
9
objects/second
• 1-2 and 3d visualisation using densities
• ~6d with vector field overlaid
• Explore using selections+linked views or
automated ranking of subspaces
• Can run on a single computer, zero setup
• Or as a server
• Main goal is Gaia catalogue, but tested/suitable for
• other catalogues
• simulations (N body)
Summary
• vaex: visualisation and exploration
• of large datasets 10
6-9
objects
• fast: > 10
9
objects/second
• 1-2 and 3d visualisation using densities
• ~6d with vector field overlaid
• Explore using selections+linked views or
automated ranking of subspaces
• Can run on a single computer, zero setup
• Or as a server
• Main goal is Gaia catalogue, but tested/suitable for
• other catalogues
• simulations (N body)
How to interactively visualise and explore a billion objects (wit vaex)

Más contenido relacionado

Similar a How to interactively visualise and explore a billion objects (wit vaex)

Vaex pygrunn
Vaex pygrunnVaex pygrunn
Vaex pygrunn
Maarten Breddels
 

Similar a How to interactively visualise and explore a billion objects (wit vaex) (20)

Vaex pygrunn
Vaex pygrunnVaex pygrunn
Vaex pygrunn
 
Webinar: Diagnosing Apache Cassandra Problems in Production
Webinar: Diagnosing Apache Cassandra Problems in ProductionWebinar: Diagnosing Apache Cassandra Problems in Production
Webinar: Diagnosing Apache Cassandra Problems in Production
 
Webinar: Diagnosing Apache Cassandra Problems in Production
Webinar: Diagnosing Apache Cassandra Problems in ProductionWebinar: Diagnosing Apache Cassandra Problems in Production
Webinar: Diagnosing Apache Cassandra Problems in Production
 
Introduction to Google BigQuery
Introduction to Google BigQueryIntroduction to Google BigQuery
Introduction to Google BigQuery
 
Decima Engine: Visibility in Horizon Zero Dawn
Decima Engine: Visibility in Horizon Zero DawnDecima Engine: Visibility in Horizon Zero Dawn
Decima Engine: Visibility in Horizon Zero Dawn
 
Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox...
Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox...Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox...
Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox...
 
2014 pycon-talk
2014 pycon-talk2014 pycon-talk
2014 pycon-talk
 
Big Data Analytics: Finding diamonds in the rough with Azure
Big Data Analytics: Finding diamonds in the rough with AzureBig Data Analytics: Finding diamonds in the rough with Azure
Big Data Analytics: Finding diamonds in the rough with Azure
 
Cassandra Day Atlanta 2015: Diagnosing Problems in Production
Cassandra Day Atlanta 2015: Diagnosing Problems in ProductionCassandra Day Atlanta 2015: Diagnosing Problems in Production
Cassandra Day Atlanta 2015: Diagnosing Problems in Production
 
Cassandra Day Chicago 2015: Diagnosing Problems in Production
Cassandra Day Chicago 2015: Diagnosing Problems in ProductionCassandra Day Chicago 2015: Diagnosing Problems in Production
Cassandra Day Chicago 2015: Diagnosing Problems in Production
 
Cassandra Day London 2015: Diagnosing Problems in Production
Cassandra Day London 2015: Diagnosing Problems in ProductionCassandra Day London 2015: Diagnosing Problems in Production
Cassandra Day London 2015: Diagnosing Problems in Production
 
Big Data Analysis : Deciphering the haystack
Big Data Analysis : Deciphering the haystack Big Data Analysis : Deciphering the haystack
Big Data Analysis : Deciphering the haystack
 
2013 py con awesome big data algorithms
2013 py con awesome big data algorithms2013 py con awesome big data algorithms
2013 py con awesome big data algorithms
 
Gephi, Graphx, and Giraph
Gephi, Graphx, and GiraphGephi, Graphx, and Giraph
Gephi, Graphx, and Giraph
 
Introduction to Big Data
Introduction to Big Data Introduction to Big Data
Introduction to Big Data
 
Using Containers and HPC to Solve the Mysteries of the Universe by Deborah Bard
Using Containers and HPC to Solve the Mysteries of the Universe by Deborah BardUsing Containers and HPC to Solve the Mysteries of the Universe by Deborah Bard
Using Containers and HPC to Solve the Mysteries of the Universe by Deborah Bard
 
A Production Quality Sketching Library for the Analysis of Big Data
A Production Quality Sketching Library for the Analysis of Big DataA Production Quality Sketching Library for the Analysis of Big Data
A Production Quality Sketching Library for the Analysis of Big Data
 
Session 10 handling bigger data
Session 10 handling bigger dataSession 10 handling bigger data
Session 10 handling bigger data
 
Session 10 handling bigger data
Session 10 handling bigger dataSession 10 handling bigger data
Session 10 handling bigger data
 
Bring Satellite and Drone Imagery into your Data Science Workflows
Bring Satellite and Drone Imagery into your Data Science WorkflowsBring Satellite and Drone Imagery into your Data Science Workflows
Bring Satellite and Drone Imagery into your Data Science Workflows
 

Último

Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
panagenda
 

Último (20)

Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 

How to interactively visualise and explore a billion objects (wit vaex)

  • 1. How to interactively visualise and explore a billion objects (with vaex) Maarten Breddels & Amina Helmi PyData Paris 2016
  • 3. Copyright ESA/ATG medialab; background: ESO/S. Brunier • Gaia satellite • launched by ESA in december 2013 • determines positions, velocities and astrophysical parameters of >109 stars of the Milky Way • First catalogue in September 2016 • Final catalogue ~2020
  • 6. Motivation • We will soon have the Gaia catalogue • > 10 9 objects/stars • Can we visualise and explore this? • We want to ‘see’ the data • Data checks (reduction issues) • Science: trends, relations, clustering • You are the (biological) neutral network
  • 7. Motivation • We will soon have the Gaia catalogue • > 10 9 objects/stars • Can we visualise and explore this? • We want to ‘see’ the data • Data checks (reduction issues) • Science: trends, relations, clustering • You are the (biological) neutral network • Problem • Scatter plots do not work well for 10 9 rows/objects (like Gaia) • Work with densities/averages in 1,2 and 3d • Interactive? • Zoom, pan etc • Explore • selections/queries • Subspace ranking
  • 8. Motivation • We will soon have the Gaia catalogue • > 10 9 objects/stars • Can we visualise and explore this? • We want to ‘see’ the data • Data checks (reduction issues) • Science: trends, relations, clustering • You are the (biological) neutral network • Problem • Scatter plots do not work well for 10 9 rows/objects (like Gaia) • Work with densities/averages in 1,2 and 3d • Interactive? • Zoom, pan etc • Explore • selections/queries • Subspace ranking
  • 9. Motivation • We will soon have the Gaia catalogue • > 10 9 objects/stars • Can we visualise and explore this? • We want to ‘see’ the data • Data checks (reduction issues) • Science: trends, relations, clustering • You are the (biological) neutral network • Problem • Scatter plots do not work well for 10 9 rows/objects (like Gaia) • Work with densities/averages in 1,2 and 3d • Interactive? • Zoom, pan etc • Explore • selections/queries • Subspace ranking
  • 10. Motivation • We will soon have the Gaia catalogue • > 10 9 objects/stars • Can we visualise and explore this? • We want to ‘see’ the data • Data checks (reduction issues) • Science: trends, relations, clustering • You are the (biological) neutral network • Problem • Scatter plots do not work well for 10 9 rows/objects (like Gaia) • Work with densities/averages in 1,2 and 3d • Interactive? • Zoom, pan etc • Explore • selections/queries • Subspace ranking
  • 11. Motivation • We will soon have the Gaia catalogue • > 10 9 objects/stars • Can we visualise and explore this? • We want to ‘see’ the data • Data checks (reduction issues) • Science: trends, relations, clustering • You are the (biological) neutral network • Problem • Scatter plots do not work well for 10 9 rows/objects (like Gaia) • Work with densities/averages in 1,2 and 3d • Interactive? • Zoom, pan etc • Explore • selections/queries • Subspace ranking
  • 12. Motivation • We will soon have the Gaia catalogue • > 10 9 objects/stars • Can we visualise and explore this? • We want to ‘see’ the data • Data checks (reduction issues) • Science: trends, relations, clustering • You are the (biological) neutral network • Problem • Scatter plots do not work well for 10 9 rows/objects (like Gaia) • Work with densities/averages in 1,2 and 3d • Interactive? • Zoom, pan etc • Explore • selections/queries • Subspace ranking • Python • rich set of libraries, becoming the default in science
  • 13. Situation • TOPCAT comes close, not fast enough, works with individual rows/particles, no exploratory tools, written in Java. • Your own IDL/Python code: a lot to consider to do it optimal (multicore, efficient storage, efficient algorithms, interactive becomes complex) • DataShader? • We want something to visualize 109 rows/objects in ~1 second • Do we need to resort to Big Data solutions?
  • 14. Big data? • Big data • definitions vary • Practical question to ask • Do you need Big Data solutions? • Many computers / grid • Invest in software • Or • Can we do it on a single computer ‘old style’
  • 16. Interactive? How? • Interactive? • 10 9 * 2 * 8 bytes = 15 GiB (double is 8 bytes) • Memory bandwidth: 10-20 GiB/s: ~1 second • CPU: 3 Ghz (but multicore, say 4): 12 cycles/ second • Few cycles per row/object, simple algorithm • Histograms/Density grids • Yes, but • If it fits/cached in memory, otherwise sdd/hdd speeds (30-100 seconds) • proper storage and reading of data • simple and fast algorithm for binning •Aquarius A-2: pure dark matter N body simulation of Milky Way like Halo •6. *108 particles • < 1 second
  • 17. Interactive? How? • Interactive? • 10 9 * 2 * 8 bytes = 15 GiB (double is 8 bytes) • Memory bandwidth: 10-20 GiB/s: ~1 second • CPU: 3 Ghz (but multicore, say 4): 12 cycles/ second • Few cycles per row/object, simple algorithm • Histograms/Density grids • Yes, but • If it fits/cached in memory, otherwise sdd/hdd speeds (30-100 seconds) • proper storage and reading of data • simple and fast algorithm for binning •Aquarius A-2: pure dark matter N body simulation of Milky Way like Halo •6. *108 particles • < 1 second
  • 18. • Storage: native, column based (hdf5, fits) • Normal (POSIX read) method: • Allocate memory • read from disk to memory • Actually: from disk, to OS cache, to memory (if unbuffered, otherwise another copy) • Wastes memory (actually disk cache) • 15 GB data, requires 30 GB is you want to use the file system cache cache How to store and read the data
  • 19. • Storage: native, column based (hdf5, fits) • Normal (POSIX read) method: • Allocate memory • read from disk to memory • Actually: from disk, to OS cache, to memory (if unbuffered, otherwise another copy) • Wastes memory (actually disk cache) • 15 GB data, requires 30 GB is you want to use the file system cache cache How to store and read the data • Memory mapping: • get direct access to OS memory cache, no copy, no setup (apart from the kernel doing setting up the pages) • avoid memory copies, more cache available • In previous example: • copying 15 GB will take about 0.5-1.0 second, at 10-20 GB/s • Can be 2-3x slower (cpu cache helps a bit)
  • 20. • Storage: native, column based (hdf5, fits) • Normal (POSIX read) method: • Allocate memory • read from disk to memory • Actually: from disk, to OS cache, to memory (if unbuffered, otherwise another copy) • Wastes memory (actually disk cache) • 15 GB data, requires 30 GB is you want to use the file system cache cache How to store and read the data • Memory mapping: • get direct access to OS memory cache, no copy, no setup (apart from the kernel doing setting up the pages) • avoid memory copies, more cache available • In previous example: • copying 15 GB will take about 0.5-1.0 second, at 10-20 GB/s • Can be 2-3x slower (cpu cache helps a bit)
  • 21. Translate to something visual • 1d: histogram • 2d: • histogram • use colormap to map to a color • 3d: volume rendering
  • 22. Translate to something visual • 1d: histogram • 2d: • histogram • use colormap to map to a color • 3d: volume rendering
  • 23. Translate to something visual • 1d: histogram • 2d: • histogram • use colormap to map to a color • 3d: volume rendering
  • 24. How to get good performance • Pure python • slow, GIL • Numpy • numpy.histogram slow • C extension • fast, no GIL, slower development • less boilerplate code: cffi • Numba @jit(nopython=True, nogil=True) • Python code, c performance • (nogil didn’t exist)
  • 25. How to get good performance • Pure python • slow, GIL • Numpy • numpy.histogram slow • C extension • fast, no GIL, slower development • less boilerplate code: cffi • Numba @jit(nopython=True, nogil=True) • Python code, c performance • (nogil didn’t exist)
  • 26. How to get good performance • Pure python • slow, GIL • Numpy • numpy.histogram slow • C extension • fast, no GIL, slower development • less boilerplate code: cffi • Numba @jit(nopython=True, nogil=True) • Python code, c performance • (nogil didn’t exist)
  • 27. How to get good performance • Pure python • slow, GIL • Numpy • numpy.histogram slow • C extension • fast, no GIL, slower development • less boilerplate code: cffi • Numba @jit(nopython=True, nogil=True) • Python code, c performance • (nogil didn’t exist) 600 times faster
  • 28.
  • 29.
  • 30. Great… 2d histograms… 1 32 4 .. 9 11x 4 7 41 .. 91 61 y +1
  • 31. Great… 2d histograms… • There is more • Weighted histogram • Don’t sums 1’s • Sum values 1 32 4 .. 9 11x 4 7 41 .. 91 61 y +1
  • 32. Great… 2d histograms… • There is more • Weighted histogram • Don’t sums 1’s • Sum values 1 32 4 .. 9 11x 4 7 41 .. 91 61 y +1 2 1 4 .. 8 3v +v, or v2
  • 33. Great… 2d histograms… • There is more • Weighted histogram • Don’t sums 1’s • Sum values 1 32 4 .. 9 11x 4 7 41 .. 91 61 y +1 2 1 4 .. 8 3v +v, or v2 • Possibilities • Total: flux, mass • Mean: velocity, metallicity • Dispersions: velocity… • Basically statistics on a grid
  • 37. Exploration • Selections • expressions • visual • linked views • Subspace ranking
  • 38. Exploration • Selections • expressions • visual • linked views • Subspace ranking
  • 39. Exploration • Selections • expressions • visual • linked views • Subspace ranking
  • 40. Exploration • Selections • expressions • visual • linked views • Subspace ranking
  • 42. Vaex: Visualization And EXploration • A library • python package • ‘import vaex’ • reading of data • multithreading • binning (1,2,3, Nd) • selections/queries • server/client • integrates with IPython notebook • A GUI program • Gives interactive navigation, zoom, pan • interactive selection (lasso, rectangle) • client • undo/redo
  • 43. Example data / vaex.example() • Helmi and de Zeeuw 2000 • build up of a MW (stellar) halo from 33 satellites • 3.3 * 106 particles • Almost smooth in configuration space (x,y,z) • Structure visible in E-Lz-L space
  • 44.
  • 45.
  • 46.
  • 47.
  • 48. Exploration:Automated • High dimensionality → many subspaces • E-L, E-y, E-x, E-vx, E-vy, E-vz, E-z, E-Lz, L-y, L- x, L-vx, L-vy, L-vz, L-z, L-Lz, y-x, y-vx, y-vy, y- vz, y-z, y-Lz, x-vx, x-vy, x-vz, x-z, x-Lz, vx-vy, vx-vz, vx-z, vx-Lz, vy-vz, vy-z, vy-Lz, vz-z, vz- Lz, z-Lz • Can we automate this / at least help? • Ranking subspaces using Mutual Information
  • 49. Mutual information • Measure the information loss between p(x,y) and p(x)p(y) • information loss measured using the KL divergence: • In short: • How much does breaking up correlation (not just linear) change density distribution?
  • 50. Ranking by Mutual information p(x,y) p(x)p(y) I(X;Y) 0.003 0.837 x,y Lz,L
  • 51. Ranking by Mutual information p(x,y) p(x)p(y) I(X;Y) 0.003 0.837 x,y Lz,L
  • 52. Expressions/Virtual columns • Not just visualize existing columns • mathematical operations • sqrt(x**2+y**2) • log(r) • Virtual columns • r=sqrt(x**2+y**2) • Suggest common ones • equatorial to galactic coordinates
  • 53.
  • 54.
  • 55.
  • 56.
  • 57. How to get the data • Gaia: ~1-2 TB • download • torrent • Do you need to? • server/client?
  • 58. Client/Server • 2d histogram example • raw data:15 GB • binned 256x256 grid • 500KB, or 10-100kb compressed • Proof of concept • over http / websocket • vaex webserver [file …]
  • 59.
  • 60.
  • 61. Random subset / shuffled storage
  • 62. Random subset / shuffled storage
  • 63. Random subset / shuffled storage
  • 64. Random subset / shuffled storage
  • 65. Random subset / shuffled storage
  • 67. Get vaex • standalone binary (OS X , Linux) (just download and start) • www.astro.rug.nl/~breddels/vaex (or google ‘vaex visualisation’) • www.github.com/maartenbreddels/vaex • In Python • Vanilla Python • pip install —user —pre vaex • Anaconda • conda install -c maartenbreddels vaex
  • 68. Summary • vaex: visualisation and exploration • of large datasets 10 6-9 objects • fast: > 10 9 objects/second • 1-2 and 3d visualisation using densities • ~6d with vector field overlaid • Explore using selections+linked views or automated ranking of subspaces • Can run on a single computer, zero setup • Or as a server • Main goal is Gaia catalogue, but tested/suitable for • other catalogues • simulations (N body)
  • 69. Summary • vaex: visualisation and exploration • of large datasets 10 6-9 objects • fast: > 10 9 objects/second • 1-2 and 3d visualisation using densities • ~6d with vector field overlaid • Explore using selections+linked views or automated ranking of subspaces • Can run on a single computer, zero setup • Or as a server • Main goal is Gaia catalogue, but tested/suitable for • other catalogues • simulations (N body)
  • 70. Summary • vaex: visualisation and exploration • of large datasets 10 6-9 objects • fast: > 10 9 objects/second • 1-2 and 3d visualisation using densities • ~6d with vector field overlaid • Explore using selections+linked views or automated ranking of subspaces • Can run on a single computer, zero setup • Or as a server • Main goal is Gaia catalogue, but tested/suitable for • other catalogues • simulations (N body)
  • 71. Summary • vaex: visualisation and exploration • of large datasets 10 6-9 objects • fast: > 10 9 objects/second • 1-2 and 3d visualisation using densities • ~6d with vector field overlaid • Explore using selections+linked views or automated ranking of subspaces • Can run on a single computer, zero setup • Or as a server • Main goal is Gaia catalogue, but tested/suitable for • other catalogues • simulations (N body)
  • 72. Summary • vaex: visualisation and exploration • of large datasets 10 6-9 objects • fast: > 10 9 objects/second • 1-2 and 3d visualisation using densities • ~6d with vector field overlaid • Explore using selections+linked views or automated ranking of subspaces • Can run on a single computer, zero setup • Or as a server • Main goal is Gaia catalogue, but tested/suitable for • other catalogues • simulations (N body)
  • 73. Summary • vaex: visualisation and exploration • of large datasets 10 6-9 objects • fast: > 10 9 objects/second • 1-2 and 3d visualisation using densities • ~6d with vector field overlaid • Explore using selections+linked views or automated ranking of subspaces • Can run on a single computer, zero setup • Or as a server • Main goal is Gaia catalogue, but tested/suitable for • other catalogues • simulations (N body)