Large Data Analysis with PyTables
Personal Profile:
● Ali Hallaji
● Parallel Processing and Large Data Analysis
● Senior Python Developer at innfinision Cloud Solutions
● Ali.Hallaji@innfinision.net
● innfinision.net
innfinision Cloud Solutions:
● Providing Cloud, Virtualization and Data Center Solutions
● Developing Software for Cloud Environments
● Providing Services to Telecom, Education, Broadcasting & Health Fields
● Supporting OpenStack Foundation as the First Iranian Company
● First Supporter of IRAN OpenStack Community
Large Data Analysis with PyTables innfinision.net

Agenda:
● Outline
● What is PyTables?
● Numexpr & NumPy
● Compressing Data
● What is HDF5?
● Querying your data in many different ways, fast
● Design goals

Outline
The Starving CPU Problem
● Getting the Most Out of Computers
● Caches and Data Locality
● Techniques For Fighting Data Starvation

High Performance Libraries
● Why Should You Use Them?
● In-Core High Performance Libraries
● Out-of-Core High Performance Libraries
Getting the Most Out of Computers
Computers nowadays are very powerful:
● Extremely fast CPUs (multicore)
● Large amounts of RAM
● Huge disk capacities

But they are facing a pervasive problem: an ever-increasing mismatch between CPU, memory and disk speeds (the so-called "Starving CPU" problem). This introduces tremendous difficulties in getting the most out of computers.
CPU vs Memory Cycle Trend

Cycle time is the time, usually measured in nanoseconds, from the start of one random access memory (RAM) access to the time when the next access can be started.

History:
● In the 1970s and 1980s the memory subsystem was able to deliver all the data that processors required in time.
● In the good old days, the processor was the key bottleneck.
● But in the 1990s things started to change...
The CPU Starvation Problem

Known facts (in 2010):
● Memory latency is much higher (around 250x) than processor cycle times, and it has been an essential bottleneck for the past twenty years.
● Memory throughput is improving at a better rate than memory latency, but it is also much slower than processors (about 25x).

The result is that CPUs in our current computers suffer from a serious data starvation problem: they could consume (much!) more data than the system can possibly deliver.
What Is the Industry Doing to Alleviate CPU Starvation?
● They are improving memory throughput: cheap to implement (more data is transmitted on each clock cycle).
● They are adding big caches in the CPU dies.
Why Is a Cache Useful?
● Caches are closer to the processor (normally on the same die), so both latency and throughput are improved.
● However: the faster they run, the smaller they must be.
● They are effective mainly in a couple of scenarios:
  ● Time locality: when the dataset is reused.
  ● Spatial locality: when the dataset is accessed sequentially.
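The two locality patterns can be sketched with NumPy (an illustrative example, not from the original slides): a C-ordered array is laid out row by row in memory, so traversing a row is sequential (good spatial locality) while walking down a column jumps by a whole row of bytes on every step.

```python
import numpy as np

# A C-ordered (row-major) NumPy array stores each row contiguously, so
# row-wise traversal touches memory sequentially, while column-wise
# traversal advances by a full row of bytes per element.
a = np.arange(1_000_000, dtype=np.float64).reshape(1000, 1000)

assert a.flags['C_CONTIGUOUS']       # rows are contiguous in memory
row_stride, col_stride = a.strides   # bytes to step to the next row / column
# col_stride is 8 (one float64); row_stride is 8000 (a whole row)

row_sums = a.sum(axis=1)  # row-wise: cache-friendly, sequential reads
col_sums = a.sum(axis=0)  # column-wise: larger strides between elements
```

(NumPy internally optimizes both reductions, but the stride values show which access order the memory layout favors.)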
Time Locality

Parts of the dataset are reused.
Spatial Locality

Dataset is accessed sequentially.
Why High Performance Libraries?
● High performance libraries are made by people who know the different optimization techniques very well.
● You may be tempted to create original algorithms that can be faster than these, but in general it is very difficult to beat them.
● In some cases it may take some time to get used to them, but the effort pays off in the long run.
Some In-Core High Performance Libraries
● ATLAS/MKL (Intel's Math Kernel Library): uses memory-efficient algorithms as well as SIMD and multi-core algorithms → linear algebra operations.
● VML (Intel's Vector Math Library): uses SIMD and multi-core to compute basic math functions (sin, cos, exp, log...) on vectors.
● Numexpr: performs potentially complex operations with NumPy arrays without the overhead of temporaries. Can make use of multiple cores.
● Blosc: a multi-threaded compressor that can transmit data from caches to memory, and back, at speeds that can be higher than an OS memcpy().
What is PyTables?

PyTables
PyTables is a package for managing hierarchical datasets, designed to efficiently and easily cope with extremely large amounts of data. You can download PyTables and use it for free, and you can find documentation, examples of use and presentations in the HowToUse section.

PyTables is built on top of the HDF5 library, using the Python language and the NumPy package. It features an object-oriented interface that, combined with C extensions for the performance-critical parts of the code (generated using Cython), makes it a fast, yet extremely easy to use, tool for interactively browsing, processing and searching very large amounts of data. One important feature of PyTables is that it optimizes memory and disk resources so that data takes much less space (especially if on-the-fly compression is used) than other solutions such as relational or object-oriented databases.
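A minimal sketch of the basic workflow (the file name, table name and column names here are illustrative only): create an HDF5 file, define a table, append a few rows and read a column back as a NumPy array.

```python
import tables  # PyTables

# Table schema: each class attribute is a column (hypothetical columns).
class Particle(tables.IsDescription):
    name = tables.StringCol(16)    # 16-character string
    energy = tables.Float64Col()   # double-precision float

with tables.open_file('demo.h5', mode='w') as h5:
    table = h5.create_table('/', 'particles', Particle, title='Demo table')
    row = table.row
    for i in range(10):
        row['name'] = f'particle-{i}'
        row['energy'] = i * 1.5
        row.append()           # buffered; written in blocks
    table.flush()              # make sure buffered rows reach disk

with tables.open_file('demo.h5', mode='r') as h5:
    # Read one column back as a NumPy array
    energies = h5.root.particles.col('energy')
```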
Numexpr & NumPy
Numexpr: Dealing with Complex Expressions
● Uses a specialized virtual machine for evaluating expressions.
● It accelerates computations by using blocking and by avoiding temporaries.
● Multi-threaded: can use several cores automatically.
● It has support for Intel's VML (Vector Math Library), so you can accelerate the evaluation of transcendental functions (sin, cos, atanh, sqrt...) too.
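In practice this looks like the sketch below: the whole expression string is compiled for numexpr's virtual machine and evaluated block by block, using several cores, instead of materializing one temporary array per operation.

```python
import numpy as np
import numexpr as ne

a = np.random.rand(1_000_000)
b = np.random.rand(1_000_000)

# The expression is evaluated in blocks: no full-size temporaries
# are created for the intermediate results 2*a and 3*b.
result = ne.evaluate('2*a + 3*b')

# Same answer as plain NumPy, which builds two temporaries along the way.
assert np.allclose(result, 2*a + 3*b)
```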
NumPy: A Powerful Data Container for Python

NumPy provides a very powerful, object-oriented, multidimensional data container:
● array[index]: retrieves a portion of a data container
● (array1**3 / array2) - sin(array3): evaluates potentially complex expressions
● numpy.dot(array1, array2): access to optimized BLAS (*GEMM) functions
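The three access patterns above can be exercised in a few lines (illustrative arrays):

```python
import numpy as np

array1 = np.array([1.0, 2.0, 3.0])
array2 = np.array([4.0, 5.0, 6.0])
array3 = np.array([0.0, np.pi / 2, np.pi])

# array[index]: slicing retrieves a portion of the container (a view, not a copy)
part = array1[1:]                       # the last two elements

# Potentially complex expressions are evaluated element-wise
expr = (array1**3 / array2) - np.sin(array3)

# numpy.dot dispatches to the optimized BLAS routines
dot = np.dot(array1, array2)            # 1*4 + 2*5 + 3*6 = 32.0
```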
NumPy and Temporaries

Computing "a*b+c" with NumPy. Temporaries go to memory.
Numexpr Avoids (Big) Temporaries

Computing "a*b+c" with numexpr. Temporaries in memory are avoided.
Numexpr Performance (Using Multiple Threads)

Time to evaluate the polynomial ((.25*x + .75)*x - 1.5)*x - 2
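The benchmark polynomial can be evaluated both ways (a sketch; array size is arbitrary). With plain NumPy, each intermediate result such as (.25*x + .75) becomes a full-size temporary array in memory; numexpr evaluates the whole expression block by block instead.

```python
import numpy as np
import numexpr as ne

x = np.linspace(-1, 1, 1_000_000)

# NumPy: three full-size temporaries are created along the way
y_numpy = ((.25 * x + .75) * x - 1.5) * x - 2

# numexpr: same polynomial, evaluated blockwise with no big temporaries
y_numexpr = ne.evaluate('((.25*x + .75)*x - 1.5)*x - 2')

assert np.allclose(y_numpy, y_numexpr)
```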
Compression
Why Compression?
● Lets you store more data using the same space.
● Uses more CPU, but CPU time is cheap compared with disk access.
● Different compressors for different uses: Bzip2, zlib, LZO, Blosc.
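In PyTables the compressor is chosen per dataset through a Filters object (a sketch; the file and array names are illustrative). Decompression on read is transparent.

```python
import numpy as np
import tables

# complib selects the compressor ('zlib', 'lzo', 'bzip2', 'blosc');
# complevel 0-9 trades CPU time for compression ratio.
filters = tables.Filters(complevel=5, complib='blosc')

data = np.arange(1_000_000, dtype=np.int64)

with tables.open_file('compressed.h5', mode='w') as h5:
    # The chunked array is compressed chunk by chunk as it is written
    h5.create_carray('/', 'data', obj=data, filters=filters)

with tables.open_file('compressed.h5', mode='r') as h5:
    # Reading decompresses transparently
    assert (h5.root.data[:] == data).all()
```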
Why Compression?

3X more data
Why Compression?

Less data needs to be transmitted to the CPU.
Transmission + decompression faster than direct transfer?

Blosc: Compressing Faster than Memory Speed
What is HDF5?

HDF5
HDF5 is a data model, library, and file format for storing and managing data. It supports an unlimited variety of datatypes, and is designed for flexible and efficient I/O and for high-volume and complex data. HDF5 is portable and extensible, allowing applications to evolve in their use of HDF5. The HDF5 technology suite includes tools and applications for managing, manipulating, viewing, and analyzing data in the HDF5 format.
The HDF5 technology suite includes:
● A versatile data model that can represent very complex data objects and a wide variety of metadata.
● A completely portable file format with no limit on the number or size of data objects in the collection.
● A software library that runs on a range of computational platforms, from laptops to massively parallel systems, and implements a high-level API with C, C++, Fortran 90, and Java interfaces.
● A rich set of integrated performance features that allow for access-time and storage-space optimizations.
● Tools and applications for managing, manipulating, viewing, and analyzing the data in the collection.
Data structures

High level of flexibility for structuring your data:
● Datatypes: scalars (numerical & strings), records, enumerated, time...
● Tables support multidimensional cells and nested records
● Multidimensional arrays
● Variable-length arrays
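Nested records and multidimensional cells are declared directly in a table description. A sketch (all names here are hypothetical): a nested IsDescription class becomes a record-valued column, and a shape argument makes a cell multidimensional.

```python
import tables

class Info(tables.IsDescription):
    # A nested record: this column has sub-columns x and y
    class position(tables.IsDescription):
        x = tables.Float64Col()
        y = tables.Float64Col()
    # A multidimensional cell: each row stores a 2x2 integer block
    grid = tables.Int32Col(shape=(2, 2))
    name = tables.StringCol(8)

# When filling rows, nested columns are addressed with a '/' path,
# e.g. row['position/x'] = 1.0
```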
Attributes: metadata about data

Dataset hierarchy
Querying your data in many different ways, fast

PyTables Query
One characteristic that sets PyTables apart from similar tools is its capability to perform extremely fast queries on your tables, in order to facilitate as much as possible your main goal: getting important information *out* of your datasets. PyTables achieves this via a very flexible and efficient query iterator, named Table.where(). This, in combination with OPSI, the powerful indexing engine that comes with PyTables, and the efficiency of underlying tools like NumPy, HDF5, Numexpr and Blosc, makes PyTables one of the fastest and most powerful query engines available.
Different query modes

Regular query:
[ r['c1'] for r in table if r['c2'] > 2.1 and r['c3'] == True ]

In-kernel query:
[ r['c1'] for r in table.where('(c2 > 2.1) & (c3 == True)') ]

Indexed query:
table.cols.c2.createIndex()
table.cols.c3.createIndex()
[ r['c1'] for r in table.where('(c2 > 2.1) & (c3 == True)') ]
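The three modes can be tried end to end on a small table (a sketch; the file name and data are illustrative, and note that PyTables 3.x spells createIndex() as create_index()). All three return the same rows; they differ only in where the condition is evaluated.

```python
import tables

class Row(tables.IsDescription):
    c1 = tables.Int64Col()
    c2 = tables.Float64Col()
    c3 = tables.BoolCol()

with tables.open_file('query.h5', mode='w') as h5:
    table = h5.create_table('/', 'demo', Row)
    table.append([(i, i * 0.1, i % 2 == 0) for i in range(100)])

    # Regular query: every row is pulled into Python and tested there
    regular = [r['c1'] for r in table if r['c2'] > 2.1 and r['c3'] == True]

    # In-kernel query: the condition string is evaluated by numexpr
    # inside PyTables, without materializing the rows in Python
    in_kernel = [r['c1'] for r in table.where('(c2 > 2.1) & (c3 == True)')]

    # Indexed query: same call, but the OPSI indexes accelerate it
    table.cols.c2.create_index()
    table.cols.c3.create_index()
    indexed = [r['c1'] for r in table.where('(c2 > 2.1) & (c3 == True)')]

    assert regular == in_kernel == indexed
```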
This presentation has been collected from several other presentations (PyTables presentations). For more presentations, refer to this link (http://pytables.org/moin/HowToUse#Presentations).

Ali Hallaji
Ali.Hallaji@innfinision.net
innfinision.net

Thank you

Más contenido relacionado

La actualidad más candente

First NL-HUG: Large-scale data processing at SARA with Apache Hadoop
First NL-HUG: Large-scale data processing at SARA with Apache HadoopFirst NL-HUG: Large-scale data processing at SARA with Apache Hadoop
First NL-HUG: Large-scale data processing at SARA with Apache HadoopEvert Lammerts
 
Apache Spark 101 - Demi Ben-Ari
Apache Spark 101 - Demi Ben-AriApache Spark 101 - Demi Ben-Ari
Apache Spark 101 - Demi Ben-AriDemi Ben-Ari
 
Machine learning at scale with Google Cloud Platform
Machine learning at scale with Google Cloud PlatformMachine learning at scale with Google Cloud Platform
Machine learning at scale with Google Cloud PlatformMatthias Feys
 
Large-Scale Data Storage and Processing for Scientists with Hadoop
Large-Scale Data Storage and Processing for Scientists with HadoopLarge-Scale Data Storage and Processing for Scientists with Hadoop
Large-Scale Data Storage and Processing for Scientists with HadoopEvert Lammerts
 
TensorFrames: Google Tensorflow on Apache Spark
TensorFrames: Google Tensorflow on Apache SparkTensorFrames: Google Tensorflow on Apache Spark
TensorFrames: Google Tensorflow on Apache SparkDatabricks
 
Scipy 2011 Time Series Analysis in Python
Scipy 2011 Time Series Analysis in PythonScipy 2011 Time Series Analysis in Python
Scipy 2011 Time Series Analysis in PythonWes McKinney
 
Beyond Kaggle: Solving Data Science Challenges at Scale
Beyond Kaggle: Solving Data Science Challenges at ScaleBeyond Kaggle: Solving Data Science Challenges at Scale
Beyond Kaggle: Solving Data Science Challenges at ScaleTuri, Inc.
 
Data Science and Deep Learning on Spark with 1/10th of the Code with Roope As...
Data Science and Deep Learning on Spark with 1/10th of the Code with Roope As...Data Science and Deep Learning on Spark with 1/10th of the Code with Roope As...
Data Science and Deep Learning on Spark with 1/10th of the Code with Roope As...Databricks
 
Tokyo Webmining Talk1
Tokyo Webmining Talk1Tokyo Webmining Talk1
Tokyo Webmining Talk1Kenta Oono
 
Lessons learned from designing a QA Automation for analytics databases (big d...
Lessons learned from designing a QA Automation for analytics databases (big d...Lessons learned from designing a QA Automation for analytics databases (big d...
Lessons learned from designing a QA Automation for analytics databases (big d...Omid Vahdaty
 
Machine learning at Scale with Apache Spark
Machine learning at Scale with Apache SparkMachine learning at Scale with Apache Spark
Machine learning at Scale with Apache SparkMartin Zapletal
 
Lessons learned from designing QA automation event streaming platform(IoT big...
Lessons learned from designing QA automation event streaming platform(IoT big...Lessons learned from designing QA automation event streaming platform(IoT big...
Lessons learned from designing QA automation event streaming platform(IoT big...Omid Vahdaty
 
Scaling Python to CPUs and GPUs
Scaling Python to CPUs and GPUsScaling Python to CPUs and GPUs
Scaling Python to CPUs and GPUsTravis Oliphant
 
Modeling with Hadoop kdd2011
Modeling with Hadoop kdd2011Modeling with Hadoop kdd2011
Modeling with Hadoop kdd2011Milind Bhandarkar
 
Making Machine Learning Scale: Single Machine and Distributed
Making Machine Learning Scale: Single Machine and DistributedMaking Machine Learning Scale: Single Machine and Distributed
Making Machine Learning Scale: Single Machine and DistributedTuri, Inc.
 
Surge: Rise of Scalable Machine Learning at Yahoo!
Surge: Rise of Scalable Machine Learning at Yahoo!Surge: Rise of Scalable Machine Learning at Yahoo!
Surge: Rise of Scalable Machine Learning at Yahoo!DataWorks Summit
 
Research Papers Recommender based on Digital Repositories Metadata
Research Papers Recommender based on Digital Repositories MetadataResearch Papers Recommender based on Digital Repositories Metadata
Research Papers Recommender based on Digital Repositories MetadataRicard de la Vega
 

La actualidad más candente (18)

First NL-HUG: Large-scale data processing at SARA with Apache Hadoop
First NL-HUG: Large-scale data processing at SARA with Apache HadoopFirst NL-HUG: Large-scale data processing at SARA with Apache Hadoop
First NL-HUG: Large-scale data processing at SARA with Apache Hadoop
 
eScience Cluster Arch. Overview
eScience Cluster Arch. OvervieweScience Cluster Arch. Overview
eScience Cluster Arch. Overview
 
Apache Spark 101 - Demi Ben-Ari
Apache Spark 101 - Demi Ben-AriApache Spark 101 - Demi Ben-Ari
Apache Spark 101 - Demi Ben-Ari
 
Machine learning at scale with Google Cloud Platform
Machine learning at scale with Google Cloud PlatformMachine learning at scale with Google Cloud Platform
Machine learning at scale with Google Cloud Platform
 
Large-Scale Data Storage and Processing for Scientists with Hadoop
Large-Scale Data Storage and Processing for Scientists with HadoopLarge-Scale Data Storage and Processing for Scientists with Hadoop
Large-Scale Data Storage and Processing for Scientists with Hadoop
 
TensorFrames: Google Tensorflow on Apache Spark
TensorFrames: Google Tensorflow on Apache SparkTensorFrames: Google Tensorflow on Apache Spark
TensorFrames: Google Tensorflow on Apache Spark
 
Scipy 2011 Time Series Analysis in Python
Scipy 2011 Time Series Analysis in PythonScipy 2011 Time Series Analysis in Python
Scipy 2011 Time Series Analysis in Python
 
Beyond Kaggle: Solving Data Science Challenges at Scale
Beyond Kaggle: Solving Data Science Challenges at ScaleBeyond Kaggle: Solving Data Science Challenges at Scale
Beyond Kaggle: Solving Data Science Challenges at Scale
 
Data Science and Deep Learning on Spark with 1/10th of the Code with Roope As...
Data Science and Deep Learning on Spark with 1/10th of the Code with Roope As...Data Science and Deep Learning on Spark with 1/10th of the Code with Roope As...
Data Science and Deep Learning on Spark with 1/10th of the Code with Roope As...
 
Tokyo Webmining Talk1
Tokyo Webmining Talk1Tokyo Webmining Talk1
Tokyo Webmining Talk1
 
Lessons learned from designing a QA Automation for analytics databases (big d...
Lessons learned from designing a QA Automation for analytics databases (big d...Lessons learned from designing a QA Automation for analytics databases (big d...
Lessons learned from designing a QA Automation for analytics databases (big d...
 
Machine learning at Scale with Apache Spark
Machine learning at Scale with Apache SparkMachine learning at Scale with Apache Spark
Machine learning at Scale with Apache Spark
 
Lessons learned from designing QA automation event streaming platform(IoT big...
Lessons learned from designing QA automation event streaming platform(IoT big...Lessons learned from designing QA automation event streaming platform(IoT big...
Lessons learned from designing QA automation event streaming platform(IoT big...
 
Scaling Python to CPUs and GPUs
Scaling Python to CPUs and GPUsScaling Python to CPUs and GPUs
Scaling Python to CPUs and GPUs
 
Modeling with Hadoop kdd2011
Modeling with Hadoop kdd2011Modeling with Hadoop kdd2011
Modeling with Hadoop kdd2011
 
Making Machine Learning Scale: Single Machine and Distributed
Making Machine Learning Scale: Single Machine and DistributedMaking Machine Learning Scale: Single Machine and Distributed
Making Machine Learning Scale: Single Machine and Distributed
 
Surge: Rise of Scalable Machine Learning at Yahoo!
Surge: Rise of Scalable Machine Learning at Yahoo!Surge: Rise of Scalable Machine Learning at Yahoo!
Surge: Rise of Scalable Machine Learning at Yahoo!
 
Research Papers Recommender based on Digital Repositories Metadata
Research Papers Recommender based on Digital Repositories MetadataResearch Papers Recommender based on Digital Repositories Metadata
Research Papers Recommender based on Digital Repositories Metadata
 

Destacado

Marineland catalunya aqualeon decalogo verano 12 (2)
Marineland catalunya aqualeon decalogo verano 12 (2)Marineland catalunya aqualeon decalogo verano 12 (2)
Marineland catalunya aqualeon decalogo verano 12 (2)evercom
 
materias de lina
materias de linamaterias de lina
materias de linalimaherher
 
Кредит в один клик!
Кредит в один клик!Кредит в один клик!
Кредит в один клик!KreditMarket
 
Microbial bioprocessing
Microbial bioprocessingMicrobial bioprocessing
Microbial bioprocessingShivangi Gupta
 

Destacado (10)

Un Buen Perro
Un Buen PerroUn Buen Perro
Un Buen Perro
 
Marineland catalunya aqualeon decalogo verano 12 (2)
Marineland catalunya aqualeon decalogo verano 12 (2)Marineland catalunya aqualeon decalogo verano 12 (2)
Marineland catalunya aqualeon decalogo verano 12 (2)
 
TT LinkedIn
TT LinkedInTT LinkedIn
TT LinkedIn
 
MSM catalogo
MSM catalogoMSM catalogo
MSM catalogo
 
materias de lina
materias de linamaterias de lina
materias de lina
 
Chalk Dust Final PDF
Chalk Dust Final PDFChalk Dust Final PDF
Chalk Dust Final PDF
 
Кредит в один клик!
Кредит в один клик!Кредит в один клик!
Кредит в один клик!
 
Microbial bioprocessing
Microbial bioprocessingMicrobial bioprocessing
Microbial bioprocessing
 
Civilizacion egipcia historia
Civilizacion egipcia   historiaCivilizacion egipcia   historia
Civilizacion egipcia historia
 
Haber-Bosch Process
Haber-Bosch ProcessHaber-Bosch Process
Haber-Bosch Process
 

Similar a PyTables

Fast and Scalable Python
Fast and Scalable PythonFast and Scalable Python
Fast and Scalable PythonTravis Oliphant
 
ABCI: AI Bridging Cloud Infrastructure for Scalable AI/Big Data
ABCI: AI Bridging Cloud Infrastructure for Scalable AI/Big DataABCI: AI Bridging Cloud Infrastructure for Scalable AI/Big Data
ABCI: AI Bridging Cloud Infrastructure for Scalable AI/Big DataHitoshi Sato
 
AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned Omid Vahdaty
 
Machine learning and big data @ uber a tale of two systems
Machine learning and big data @ uber a tale of two systemsMachine learning and big data @ uber a tale of two systems
Machine learning and big data @ uber a tale of two systemsZhenxiao Luo
 
Build an Open Source Data Lake For Data Scientists
Build an Open Source Data Lake For Data ScientistsBuild an Open Source Data Lake For Data Scientists
Build an Open Source Data Lake For Data ScientistsShawn Zhu
 
It's the memory, stupid! CodeJam 2014
It's the memory, stupid!  CodeJam 2014It's the memory, stupid!  CodeJam 2014
It's the memory, stupid! CodeJam 2014Francesc Alted
 
Hive + Amazon EMR + S3 = Elastic big data SQL analytics processing in the cloud
Hive + Amazon EMR + S3 = Elastic big data SQL analytics processing in the cloudHive + Amazon EMR + S3 = Elastic big data SQL analytics processing in the cloud
Hive + Amazon EMR + S3 = Elastic big data SQL analytics processing in the cloudJaipaul Agonus
 
High Performance With Java
High Performance With JavaHigh Performance With Java
High Performance With Javamalduarte
 
Explore big data at speed of thought with Spark 2.0 and Snappydata
Explore big data at speed of thought with Spark 2.0 and SnappydataExplore big data at speed of thought with Spark 2.0 and Snappydata
Explore big data at speed of thought with Spark 2.0 and SnappydataData Con LA
 
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016Dan Lynn
 
Data engineering Stl Big Data IDEA user group
Data engineering   Stl Big Data IDEA user groupData engineering   Stl Big Data IDEA user group
Data engineering Stl Big Data IDEA user groupAdam Doyle
 
Scaling Security on 100s of Millions of Mobile Devices Using Apache Kafka® an...
Scaling Security on 100s of Millions of Mobile Devices Using Apache Kafka® an...Scaling Security on 100s of Millions of Mobile Devices Using Apache Kafka® an...
Scaling Security on 100s of Millions of Mobile Devices Using Apache Kafka® an...confluent
 
Mauricio breteernitiz hpc-exascale-iscte
Mauricio breteernitiz hpc-exascale-iscteMauricio breteernitiz hpc-exascale-iscte
Mauricio breteernitiz hpc-exascale-isctembreternitz
 
Big data distributed processing: Spark introduction
Big data distributed processing: Spark introductionBig data distributed processing: Spark introduction
Big data distributed processing: Spark introductionHektor Jacynycz García
 
Lessons learned from embedding Cassandra in xPatterns
Lessons learned from embedding Cassandra in xPatternsLessons learned from embedding Cassandra in xPatterns
Lessons learned from embedding Cassandra in xPatternsClaudiu Barbura
 
Big data berlin
Big data berlinBig data berlin
Big data berlinkammeyer
 
Big Data in 200 km/h | AWS Big Data Demystified #1.3
Big Data in 200 km/h | AWS Big Data Demystified #1.3  Big Data in 200 km/h | AWS Big Data Demystified #1.3
Big Data in 200 km/h | AWS Big Data Demystified #1.3 Omid Vahdaty
 

Similar a PyTables (20)

PyTables
PyTablesPyTables
PyTables
 
Fast and Scalable Python
Fast and Scalable PythonFast and Scalable Python
Fast and Scalable Python
 
PyData Boston 2013
PyData Boston 2013PyData Boston 2013
PyData Boston 2013
 
ABCI: AI Bridging Cloud Infrastructure for Scalable AI/Big Data
ABCI: AI Bridging Cloud Infrastructure for Scalable AI/Big DataABCI: AI Bridging Cloud Infrastructure for Scalable AI/Big Data
ABCI: AI Bridging Cloud Infrastructure for Scalable AI/Big Data
 
AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned
 
Machine learning and big data @ uber a tale of two systems
Machine learning and big data @ uber a tale of two systemsMachine learning and big data @ uber a tale of two systems
Machine learning and big data @ uber a tale of two systems
 
Build an Open Source Data Lake For Data Scientists
Build an Open Source Data Lake For Data ScientistsBuild an Open Source Data Lake For Data Scientists
Build an Open Source Data Lake For Data Scientists
 
It's the memory, stupid! CodeJam 2014
It's the memory, stupid!  CodeJam 2014It's the memory, stupid!  CodeJam 2014
It's the memory, stupid! CodeJam 2014
 
Wolfgang Lehner Technische Universitat Dresden
Wolfgang Lehner Technische Universitat DresdenWolfgang Lehner Technische Universitat Dresden
Wolfgang Lehner Technische Universitat Dresden
 
Hive + Amazon EMR + S3 = Elastic big data SQL analytics processing in the cloud
Hive + Amazon EMR + S3 = Elastic big data SQL analytics processing in the cloudHive + Amazon EMR + S3 = Elastic big data SQL analytics processing in the cloud
Hive + Amazon EMR + S3 = Elastic big data SQL analytics processing in the cloud
 
High Performance With Java
High Performance With JavaHigh Performance With Java
High Performance With Java
 
Explore big data at speed of thought with Spark 2.0 and Snappydata
Explore big data at speed of thought with Spark 2.0 and SnappydataExplore big data at speed of thought with Spark 2.0 and Snappydata
Explore big data at speed of thought with Spark 2.0 and Snappydata
 
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
 
Data engineering Stl Big Data IDEA user group
Data engineering   Stl Big Data IDEA user groupData engineering   Stl Big Data IDEA user group
Data engineering Stl Big Data IDEA user group
 
Scaling Security on 100s of Millions of Mobile Devices Using Apache Kafka® an...
Scaling Security on 100s of Millions of Mobile Devices Using Apache Kafka® an...Scaling Security on 100s of Millions of Mobile Devices Using Apache Kafka® an...
Scaling Security on 100s of Millions of Mobile Devices Using Apache Kafka® an...
 
Mauricio breteernitiz hpc-exascale-iscte
Mauricio breteernitiz hpc-exascale-iscteMauricio breteernitiz hpc-exascale-iscte
Mauricio breteernitiz hpc-exascale-iscte
 
Big data distributed processing: Spark introduction
Big data distributed processing: Spark introductionBig data distributed processing: Spark introduction
Big data distributed processing: Spark introduction
 
Lessons learned from embedding Cassandra in xPatterns
Lessons learned from embedding Cassandra in xPatternsLessons learned from embedding Cassandra in xPatterns
Lessons learned from embedding Cassandra in xPatterns
 
Big data berlin
Big data berlinBig data berlin
Big data berlin
 
Big Data in 200 km/h | AWS Big Data Demystified #1.3
Big Data in 200 km/h | AWS Big Data Demystified #1.3  Big Data in 200 km/h | AWS Big Data Demystified #1.3
Big Data in 200 km/h | AWS Big Data Demystified #1.3
 

Más de Ali Hallaji

Más de Ali Hallaji (8)

Next Generation FIDS for airports
Next Generation FIDS for airportsNext Generation FIDS for airports
Next Generation FIDS for airports
 
Next Generation FIDS for airports
Next Generation FIDS for airportsNext Generation FIDS for airports
Next Generation FIDS for airports
 
Sharding
ShardingSharding
Sharding
 
MongoDB
MongoDBMongoDB
MongoDB
 
M101
M101M101
M101
 
M102
M102M102
M102
 
M202
M202M202
M202
 
Py tables
Py tablesPy tables
Py tables
 

PyTables

  • 2. Personal Profile: ● Ali Hallaji ● Parallel Processing and Large Data Analyze ● Senior Python Developer at innfinision Cloud Solutions ● Ali.Hallaji@innfinision.net ● Innfinision.net
  • 3. innfinision Cloud Solutions: ● Providing Cloud, Virtualization and Data Center Solutions ● Developing Software for Cloud Environments ● Providing Services to Telecom, Education, Broadcasting & Health Fields ● Supporting OpenStack Foundation as the First Iranian Company ● First Supporter of IRAN OpenStack Community
  • 4. Large Data Analyze with PyTables innfinision.net ● Outline ● What is PyTables ? ● Numexpr & Numpy ● Compressing Data ● What is HDF5? ● Querying your data in many different ways, fast ● Design goals Agenda:
  • 6. Outline innfinision.netLarge Data Analyze with PyTables The Starving CPU Problem ● Getting the Most Out of Computers ● Caches and Data Locality ● Techniques For Fighting Data Starvation High Performance Libraries ● Why Should You Use Them? ● In-Core High Performance ● Out-of-Core High Performance Libraries
  • 7. Getting the Most Out of Computers innfinision.netLarge Data Analyze with PyTables
  • 8. Getting the Most Out of Computers innfinision.netLarge Data Analyze with PyTables Computers nowadays are very powerful: ● Extremely fast CPU’s (multicores) ● Large amounts of RAM ● Huge disk capacities But they are facing a pervasive problem: An ever-increasing mismatch between CPU, memory and disk speeds (the so-called “Starving CPU problem”) This introduces tremendous difficulties in getting the most out of computers.
  • 9. CPU vs Memory cycle Trend innfinision.netLarge Data Analyze with PyTables Cycle time is the time, usually measured in nanosecond s, between the start of one random access memory ( RAM ) access to the time when the next access can be started History ● In the 1970s and 1980s the memory subsystem was able to deliver all the data that processors required in time. ● In the good old days, the processor was the key bottleneck. ● But in the 1990s things started to change...
  • 10. CPU vs Memory cycle Trend innfinision.netLarge Data Analyze with PyTables dd
  • 11. The CPU Starvation Problem innfinision.netLarge Data Analyze with PyTables Known facts (in 2010): ● Memory latency is much higher (around 250x) than processors and it has been an essential bottleneck for the past twenty years. ● Memory throughput is improving at a better rate than memory latency, but it is also much slower than processors (about 25x). The result is that CPUs in our current computers are suffering from a serious data starvation problem: they could consume (much!) more data than the system can possibly deliver.
  • 12. What Is the Industry Doing to Alleviate CPU Starvation? innfinision.netLarge Data Analyze with PyTables ● They are improving memory throughput: cheap to implement (more data is transmitted on each clock cycle). ● They are adding big caches in the CPU dies.
  • 13. Why Is a Cache Useful? innfinision.netLarge Data Analyze with PyTables ● Caches are closer to the processor (normally in the same die), so both the latency and throughput are improved. ● However: the faster they run the smaller they must be. ● They are effective mainly in a couple of scenarios: ● Time locality: when the dataset is reused. ● Spatial locality: when the dataset is accessed sequentially.
  • 14. Time Locality innfinision.netLarge Data Analyze with PyTables Parts of the dataset are reused
  • 15. Spatial Locality innfinision.netLarge Data Analyze with PyTables Dataset is accessed sequentially
  • 16. Why High Performance Libraries? innfinision.netLarge Data Analyze with PyTables ● High performance libraries are made by people that knows very well the different optimization techniques. ● You may be tempted to create original algorithms that can be faster than these, but in general, it is very difficult to beat them. ● In some cases, it may take some time to get used to them, but the effort pays off in the long run.
  • 17. Some In-Core High Performance Libraries
● ATLAS/MKL (Intel's Math Kernel Library): uses memory-efficient algorithms as well as SIMD and multi-core code → linear algebra operations.
● VML (Intel's Vector Math Library): uses SIMD and multi-core to compute basic math functions (sin, cos, exp, log...) on vectors.
● Numexpr: performs potentially complex operations with NumPy arrays without the overhead of temporaries. Can make use of multiple cores.
● Blosc: a multi-threaded compressor that can transmit data from caches to memory, and back, at speeds that can be faster than an OS memcpy().
  • 18. What is PyTables?
  • 19. PyTables
PyTables is a package for managing hierarchical datasets, designed to efficiently and easily cope with extremely large amounts of data. You can download PyTables and use it for free; documentation, examples of use and presentations are available in the HowToUse section.
PyTables is built on top of the HDF5 library, using the Python language and the NumPy package. It features an object-oriented interface that, combined with C extensions for the performance-critical parts of the code (generated using Cython), makes it a fast, yet extremely easy to use, tool for interactively browsing, processing and searching very large amounts of data.
One important feature of PyTables is that it optimizes memory and disk resources so that data takes much less space (especially if on-the-fly compression is used) than in other solutions such as relational or object-oriented databases.
  • 20. Numexpr & Numpy
  • 21. Numexpr: Dealing with Complex Expressions
● Uses a specialized virtual machine for evaluating expressions.
● It accelerates computations by using blocking and by avoiding temporaries.
● Multi-threaded: can use several cores automatically.
● It has support for Intel's VML (Vector Math Library), so you can accelerate the evaluation of transcendental functions (sin, cos, atanh, sqrt...) too.
  • 22. NumPy: A Powerful Data Container for Python
NumPy provides a very powerful, object-oriented, multidimensional data container:
● array[index]: retrieves a portion of a data container
● (array1**3 / array2) - sin(array3): evaluates potentially complex expressions
● numpy.dot(array1, array2): access to optimized BLAS (*GEMM) functions
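The three idioms listed above can be shown in a few lines. A minimal sketch with made-up sample values:

```python
import numpy as np

array1 = np.array([1.0, 2.0, 3.0])
array2 = np.array([4.0, 5.0, 6.0])
array3 = np.array([0.0, np.pi / 2, np.pi])

# array[index]: retrieve a portion of the container (here, a slice).
part = array1[1:]

# Evaluate a potentially complex expression, element-wise.
result = (array1**3 / array2) - np.sin(array3)

# numpy.dot: optimized BLAS (*GEMM) product; 1*4 + 2*5 + 3*6 = 32.
dot = np.dot(array1, array2)
```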
  • 23. NumPy And Temporaries
Computing "a*b+c" with NumPy: temporaries go to memory.
  • 24. Numexpr Avoids (Big) Temporaries
Computing "a*b+c" with numexpr: temporaries in memory are avoided.
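The "a*b+c" comparison from the two slides above can be reproduced directly, assuming the numexpr package is installed. NumPy materializes the full-size temporary `a*b` in memory before adding `c`, while numexpr's virtual machine evaluates the whole expression in cache-sized blocks, avoiding the big temporary (and using several threads):

```python
import numpy as np
import numexpr as ne  # assumes numexpr is installed

a = np.random.rand(1_000_000)
b = np.random.rand(1_000_000)
c = np.random.rand(1_000_000)

with_numpy = a * b + c                 # creates a full-size temporary for a*b
with_numexpr = ne.evaluate("a*b + c")  # blockwise evaluation, no big temporary
```

On expressions with several operands, the gap widens, because each extra NumPy operator adds another array-sized temporary while numexpr's cost stays roughly constant per block.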
  • 25. Numexpr Performance (Using Multiple Threads)
Time to evaluate the polynomial: ((.25*x + .75)*x - 1.5)*x - 2
  • 27. Why Compression
● Lets you store more data using the same space
● Uses more CPU, but CPU time is cheap compared with disk access
● Different compressors for different uses: Bzip2, zlib, lzo, Blosc
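The space/CPU trade-off above can be demonstrated with zlib, one of the compressors listed, using only the Python standard library. This is an illustrative sketch, not taken from the slides; real numeric datasets are often regular enough to compress well:

```python
import zlib
import numpy as np

# 100,000 int64 values = 800,000 bytes of raw binary data.
data = np.arange(100_000, dtype=np.int64).tobytes()

compressed = zlib.compress(data, level=6)   # spend CPU time...
restored = zlib.decompress(compressed)      # ...to read/write fewer bytes

ratio = len(data) / len(compressed)
```

Inside PyTables the same idea is exposed declaratively: a `tables.Filters(complevel=..., complib=...)` object attached to a dataset selects the compressor (zlib, bzip2, lzo or blosc) and level, and (de)compression then happens transparently on every read and write.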
  • 28. Why Compression: 3X more data in the same space
  • 29. Why Compression
Less data needs to be transmitted to the CPU. Can transmission + decompression be faster than a direct transfer?
  • 30. Blosc: Compressing Faster than Memory Speed
  • 31. What is HDF5?
  • 32. HDF5
HDF5 is a data model, library, and file format for storing and managing data. It supports an unlimited variety of datatypes, and is designed for flexible and efficient I/O and for high-volume, complex data. HDF5 is portable and extensible, allowing applications to evolve in their use of HDF5. The HDF5 technology suite includes tools and applications for managing, manipulating, viewing, and analyzing data in the HDF5 format.
  • 33. The HDF5 technology suite includes:
● A versatile data model that can represent very complex data objects and a wide variety of metadata.
● A completely portable file format with no limit on the number or size of data objects in the collection.
● A software library that runs on a range of computational platforms, from laptops to massively parallel systems, and implements a high-level API with C, C++, Fortran 90, and Java interfaces.
● A rich set of integrated performance features that allow for access time and storage space optimizations.
● Tools and applications for managing, manipulating, viewing, and analyzing the data in the collection.
  • 34. Data structures
High level of flexibility for structuring your data:
● Datatypes: scalars (numerical & strings), records, enumerated, time...
● Tables support multidimensional cells and nested records
● Multidimensional arrays
● Variable-length arrays
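The structures listed above can be sketched with a small PyTables table, assuming PyTables is installed; the `Particle` description and all field names are illustrative, not from the slides. The row description carries both a multidimensional cell and a nested record, and attributes (next slide) attach metadata to the node:

```python
import numpy as np
import tables  # assumes PyTables is installed

class Particle(tables.IsDescription):
    name = tables.StringCol(16)
    grid = tables.Float64Col(shape=(2, 2))   # multidimensional cell
    class position(tables.IsDescription):    # nested record
        x = tables.Float64Col()
        y = tables.Float64Col()

with tables.open_file("demo.h5", mode="w") as h5:
    table = h5.create_table("/", "particles", Particle)
    row = table.row
    row["name"] = b"p0"
    row["grid"] = np.eye(2)
    row["position/x"] = 1.0    # nested fields use path-like names
    row["position/y"] = 2.0
    row.append()
    table.flush()
    table.attrs.units = "meters"   # attribute: metadata about the data
```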
  • 35. Attributes: metadata about data
  • 36. Dataset hierarchy
  • 37. Querying your data in many different ways, fast
  • 38. PyTables Query
One characteristic that sets PyTables apart from similar tools is its capability to perform extremely fast queries on your tables, in order to facilitate as much as possible your main goal: getting important information *out* of your datasets. PyTables achieves this via a very flexible and efficient query iterator, named Table.where(). This, in combination with OPSI, the powerful indexing engine that comes with PyTables, and the efficiency of underlying tools like NumPy, HDF5, Numexpr and Blosc, makes PyTables one of the fastest and most powerful query engines available.
  • 39. Different query modes
Regular query:
● [ r['c1'] for r in table if r['c2'] > 2.1 and r['c3'] == True ]
In-kernel query:
● [ r['c1'] for r in table.where('(c2>2.1)&(c3==True)') ]
Indexed query:
● table.cols.c2.createIndex()
● table.cols.c3.createIndex()
● [ r['c1'] for r in table.where('(c2>2.1)&(c3==True)') ]
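The three query modes can be run end to end on a toy table, assuming PyTables is installed (note that current PyTables spells `createIndex()` as `create_index()`; the table layout below is illustrative). The regular query iterates rows in Python, the in-kernel query hands the condition string to Numexpr, and the indexed query lets OPSI indexes accelerate the same `where()` call:

```python
import tables  # assumes PyTables is installed

class RowDesc(tables.IsDescription):
    c1 = tables.Int32Col()
    c2 = tables.Float64Col()
    c3 = tables.BoolCol()

with tables.open_file("query_demo.h5", "w") as h5:
    table = h5.create_table("/", "t", RowDesc)
    # 100 rows: c1 = i, c2 = i/10, c3 = True for even i.
    table.append([(i, i / 10.0, i % 2 == 0) for i in range(100)])
    table.flush()

    # Regular query: pure-Python row iteration.
    regular = [r["c1"] for r in table if r["c2"] > 2.1 and r["c3"]]

    # In-kernel query: the condition runs inside Numexpr, not Python.
    in_kernel = [r["c1"] for r in table.where("(c2 > 2.1) & (c3 == True)")]

    # Indexed query: same call, now served by OPSI indexes.
    table.cols.c2.create_index()
    table.cols.c3.create_index()
    indexed = [r["c1"] for r in table.where("(c2 > 2.1) & (c3 == True)")]
```

All three return the same rows; they differ only in where the condition is evaluated, which is what makes the in-kernel and indexed modes progressively faster on large tables.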
  • 40. This presentation has been collected from several other presentations (PyTables presentations). For more presentations, refer to this link (http://pytables.org/moin/HowToUse#Presentations).