Wrapping C++ Arrow - Why and How?
2 Sep 2019
Yoni Davidson
TG-17
About me!
Generalist working at TG-17, a stealth-mode startup.
Currently working in Kotlin, JavaScript, and Python.
Before:
Sears Israel - Mobile/Backend team.
Eyesight mobile - Mobile, Platform, and IoT teams.
Alvarion - WiMAX and Wi-Fi teams.
Motivation
Data is getting bigger (Hadoop, S3); Parquet provides efficient storage.
Data scientists need a way to work without running out of memory.
Big data infrastructure is based on the JVM. Other languages want to work on the same data, but serialization is expensive; Python is the best example.
Moving data around is expensive (serialization and deserialization): IO between services, GPU -> CPU.
Building all this for each framework and each language (Java + Python) is a lot of work and blocks innovation.
What is Apache Arrow?
Apache Arrow is a cross-language development platform for in-memory data. It specifies a standardized language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware. It also provides computational libraries and zero-copy streaming messaging and interprocess communication.
Languages currently supported include C, C++, C#, Go, Java, JavaScript, MATLAB, Python, R, Ruby, and Rust.
https://arrow.apache.org/
Moving data in memory between languages and
between services
What is Apache Arrow?
Performance Advantage of Columnar In-Memory
Columnar memory layout allows applications to avoid unnecessary IO and accelerate analytical processing performance on modern
CPUs (Multi core) and GPUs.
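To make the row vs. column distinction concrete, here is a minimal Go sketch. The TripRow and TripColumns types are made up for illustration, and this is not the Arrow format itself (Arrow adds validity bitmaps, offset buffers and more); it only shows why an analytic scan over one field touches less memory when that field is stored contiguously.

```go
package main

import "fmt"

// TripRow is a hypothetical row-oriented record: all fields of one
// record sit next to each other in memory.
type TripRow struct {
	ID       int64
	Distance float64
	City     string
}

// TripColumns is the same data in a column-oriented layout: each field
// lives in its own contiguous slice.
type TripColumns struct {
	ID       []int64
	Distance []float64
	City     []string
}

// ToColumns converts row-oriented records into the columnar layout.
func ToColumns(rows []TripRow) TripColumns {
	cols := TripColumns{
		ID:       make([]int64, 0, len(rows)),
		Distance: make([]float64, 0, len(rows)),
		City:     make([]string, 0, len(rows)),
	}
	for _, r := range rows {
		cols.ID = append(cols.ID, r.ID)
		cols.Distance = append(cols.Distance, r.Distance)
		cols.City = append(cols.City, r.City)
	}
	return cols
}

// SumDistanceRows walks whole records even though it needs one field.
func SumDistanceRows(rows []TripRow) float64 {
	var sum float64
	for _, r := range rows {
		sum += r.Distance
	}
	return sum
}

// SumDistanceColumns scans a single contiguous slice of float64 values.
func SumDistanceColumns(cols TripColumns) float64 {
	var sum float64
	for _, d := range cols.Distance {
		sum += d
	}
	return sum
}

func main() {
	rows := []TripRow{{1, 2.5, "Haifa"}, {2, 4.0, "Tel Aviv"}, {3, 1.5, "Eilat"}}
	cols := ToColumns(rows)
	fmt.Println(SumDistanceRows(rows), SumDistanceColumns(cols))
}
```

Both functions return the same sum; the columnar version simply reads only the bytes it needs, which is the property the slide refers to.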
What is Apache Arrow?
Advantages of a Common Data Layer
Without a common data layer:
● Each system has its own internal memory format
● 70-80% of computation is wasted on serialization and deserialization
● Similar functionality is implemented in multiple projects
With Arrow:
● All systems utilize the same memory format
● No overhead for cross-system communication
● Projects can share functionality (e.g., a Parquet-to-Arrow reader)
Who is leading the work on Apache Arrow?
How fast is it?
● https://github.com/apache/spark/pull/22954 - Enables Arrow optimization from R DataFrame to Spark DataFrame
How fast is it?
● https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html
Where is Apache Arrow going?
Using Arrow to let TensorFlow (TF) work natively with local and remote datasets:
https://medium.com/tensorflow/tensorflow-with-apache-arrow-datasets-cdbcfe80a59f
pandas2 will be based on Apache Arrow, allowing native work with pandas data on other platforms.
Language bindings status:
Apache Arrow Python bindings
Based on the CPP project.
Built with Cython.
Allows integration with the massive Python ecosystem - Pandas.
What is Language Binding?
In the context of software libraries, bindings are wrapper libraries that bridge two programming languages, so that a library written for one
language can be used in another language (Wiki).
Where do we find language bindings?
What do we want to do with our Go implementation?
Share a table with Python in the same memory space, with zero serialization.
First approach - implement spec in pure Go
Pros:
1. It’s a very closed problem - read the spec, write tests and implement.
2. It gives you all the advantages of Arrow (up to the implementation date).
3. Go allows us to improve the implementation by providing better tools for concurrent work (easier than C++).
Cons:
1. Every improvement in the main branch needs to be reimplemented in Go. Especially when it’s not an “API” change, you’ll need to understand the C++ code and then write it in Go.
2. Point 1 makes the project harder to maintain.
3. If the Go version adds improvements, it will be harder to export them back to the C++ project (and to Python, which is bound to it), since the core project is not the Go one.
First approach - implement spec in pure Go
Python project also enjoys C++ improvements.
carrow - Go bindings to Apache Arrow via C++-API
https://github.com/353solutions/carrow
carrow - Go bindings to Apache Arrow via C++-API
Pros:
1. This project gets all the improvements of the C++ main branch.
2. Anything we add in the Go project can be exported back to the Python/C++ projects (we ran an experiment reading pandas data from our Go project).
carrow - Go bindings to Apache Arrow via C++-API
Cons:
1. It's much harder to build (compared to a pure native Go implementation).
Challenge 1 - Go and CPP - don’t link
C++ compilers mangle symbol names (to support C++ features such as overloading and namespaces).
cgo doesn’t understand mangled names, so a C wrapper is needed.
Challenge 1 - Go and CPP - don’t link - example
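// C-callable constructor: builds an arrow::Table and returns it as an opaque handle.
// sp points to the Schema wrapper, cp to a vector of columns.
void *table_new(void *sp, void *cp) {
  auto schema = (Schema *)sp;
  auto columns = (std::vector<std::shared_ptr<arrow::Column>> *)cp;
  auto table = arrow::Table::Make(schema->ptr, *columns);
  if (table == nullptr) {
    return nullptr;
  }
  // Heap-allocate the wrapper so the shared_ptr outlives this call.
  auto wrapper = new Table;
  wrapper->table = table;
  return wrapper;
}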
Challenge 1 - Go and CPP - don’t link - example
#ifndef _CARROW_H_
#define _CARROW_H_

#ifdef __cplusplus
extern "C" {
#endif

void *table_new(void *sp, void *cp);

#ifdef __cplusplus
} // extern "C"
#endif

#endif // _CARROW_H_
Challenge 2 - Building a CPP/Go project
C++ libraries and headers are required, which makes the dev environment more complex than that of a pure Go project.
The solution is a Dockerfile that includes the native C++ libraries plus the Python bindings for E2E tests.
Challenge 2 - Building a CPP/Go project - Dockerfile
FROM ubuntu:18.04
# Tools
RUN apt-get update && apt-get install -y \
    gdb \
    git \
    make \
    vim \
    wget \
    && rm -rf /var/lib/apt/lists/*
# Go installation
RUN cd /tmp && \
    wget https://dl.google.com/go/go1.12.9.linux-amd64.tar.gz && \
    tar -C /usr/local -xzf go1.12.9.linux-amd64.tar.gz && \
    rm go1.12.9.linux-amd64.tar.gz
ENV PATH="/usr/local/go/bin:${PATH}"
Challenge 2 - Building a CPP/Go project - Dockerfile
# Python bindings
RUN cd /tmp && \
    wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh && \
    bash Miniconda3-latest-Linux-x86_64.sh -b -p /miniconda && \
    rm Miniconda3-latest-Linux-x86_64.sh
ENV PATH="/miniconda/bin:${PATH}"
RUN conda install -y \
    Cython \
    conda-forge::compilers \
    conda-forge::pyarrow=0.14 \
    ipython \
    numpy \
    pkg-config
ENV LD_LIBRARY_PATH=/miniconda/lib
WORKDIR /src/carrow
Challenge 3 - Wrapper for each type
Since this is a wrapper lib, it takes a lot of “copy pasta” code to wrap each type.
The solution was to use Go templates and generate some of the code.
Challenge 3 - Wrapper for each type - example
func main() {
	arrowTypes := []string{"Bool", "Float64", "Integer64", "String", "Timestamp"}
	.
	.
	.
// Supported data types
var (
{{- range $val := .ArrowTypes}}
	{{$val}}Type = DType(C.{{$val | ToUpper }}_DTYPE)
{{- end}}
)
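A runnable sketch of the same idea, assuming the standard text/template package and a ToUpper helper mapped to strings.ToUpper (the slide does not show how the template is wired up, so the Generate function and the FuncMap below are assumptions). DType and the C.*_DTYPE constants appear only inside the generated text.

```go
package main

import (
	"fmt"
	"os"
	"strings"
	"text/template"
)

// dtypeTemplate mirrors the template from the slide. Indentation of the
// generated lines is left to gofmt.
const dtypeTemplate = `// Supported data types
var (
{{- range $val := .ArrowTypes}}
{{$val}}Type = DType(C.{{$val | ToUpper}}_DTYPE)
{{- end}}
)
`

// Generate renders the declarations for the given Arrow type names.
func Generate(arrowTypes []string) (string, error) {
	// ToUpper is registered by hand; text/template has no such built-in.
	funcs := template.FuncMap{"ToUpper": strings.ToUpper}
	tmpl, err := template.New("dtypes").Funcs(funcs).Parse(dtypeTemplate)
	if err != nil {
		return "", err
	}
	data := struct{ ArrowTypes []string }{ArrowTypes: arrowTypes}
	var out strings.Builder
	if err := tmpl.Execute(&out, data); err != nil {
		return "", err
	}
	return out.String(), nil
}

func main() {
	code, err := Generate([]string{"Bool", "Float64", "Integer64", "String", "Timestamp"})
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Print(code)
}
```

Adding a new wrapped type then means adding one string to the list and regenerating, rather than copying another block of declarations by hand.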
Challenge 4 - Logger
Do we send all our errors up to the Go package for logging?
We could also create a Go logger and pass it down to the C++ code for logging.
Challenge 5 - Error handling
Where are errors handled?
Where is the best place to log and handle them?
For now, every call returns this result_t:
typedef struct {
  const char *err;
  void *ptr;
  int64_t i;
} result_t;
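On the Go side, a struct like this can be unpacked into Go's usual (value, error) pair. The sketch below is an assumption about how that might look, not carrow's code: Result is a hypothetical pure-Go mirror of result_t, with an empty Err standing in for a NULL err pointer and uintptr standing in for void*. In a real binding the fields would come from cgo (for example via C.GoString for the message).

```go
package main

import (
	"errors"
	"fmt"
)

// Result is a hypothetical pure-Go mirror of result_t.
// An empty Err stands in for a NULL err pointer on the C side.
type Result struct {
	Err string
	Ptr uintptr
	I   int64
}

// Int unpacks a call whose payload is the integer field.
func (r Result) Int() (int64, error) {
	if r.Err != "" {
		return 0, errors.New(r.Err)
	}
	return r.I, nil
}

// Pointer unpacks a call whose payload is a handle.
func (r Result) Pointer() (uintptr, error) {
	if r.Err != "" {
		return 0, errors.New(r.Err)
	}
	if r.Ptr == 0 {
		return 0, errors.New("nil handle returned without an error message")
	}
	return r.Ptr, nil
}

func main() {
	n, err := Result{I: 42}.Int()
	fmt.Println(n, err)

	_, err = Result{Err: "schema: invalid field"}.Pointer()
	fmt.Println(err)
}
```

Keeping this conversion in one place means every wrapped call reports failures the same way, whichever field carries its payload.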
Challenge 666 - Memory management
2 memory managers.
1. Go runtime - Automatic memory management.
2. CPP runtime - Apache Arrow uses std::shared_ptr extensively:
std::shared_ptr is a smart pointer that retains shared ownership of an object through a pointer. Several shared_ptr objects may own the same object. The
object is destroyed and its memory deallocated when either of the following happens:
■ the last remaining shared_ptr owning the object is destroyed;
■ the last remaining shared_ptr owning the object is assigned another pointer via operator= or reset().
Challenge 666 - Memory management - solution
Wrap std::shared_ptr with a struct - so we know who owns the memory.
struct Table {
  std::shared_ptr<arrow::Table> table;
};
Challenge 666 - Memory management - solution
Use a finalizer to free memory.
// NewSchema creates a new schema
func NewSchema(fields []*Field) (*Schema, error) {
	fieldsList, err := NewFieldList()
	if err != nil {
		return nil, fmt.Errorf("can't create schema, failed creating fields list: %v", err)
	}
	.
	.
	.
	schema := &Schema{ptr}
	// Release the C++ side when the Go object is garbage collected.
	runtime.SetFinalizer(schema, func(s *Schema) {
		C.schema_free(s.ptr)
	})
	return schema, nil
}
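Finalizers run at some point after an object becomes unreachable and are not guaranteed to run before the program exits, so a common complement is an explicit Close with the finalizer kept as a safety net. The sketch below shows that pattern in pure Go; Resource and its free callback are hypothetical stand-ins for a wrapper type and its C free function, not part of carrow.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// Resource is a hypothetical wrapper around native memory.
// free stands in for a call such as C.schema_free.
type Resource struct {
	once sync.Once
	free func()
}

// NewResource registers a finalizer as a safety net in case the caller
// forgets to call Close.
func NewResource(free func()) *Resource {
	r := &Resource{free: free}
	runtime.SetFinalizer(r, (*Resource).Close)
	return r
}

// Close releases the native memory exactly once, whether it is called
// by the user, by the finalizer, or by both.
func (r *Resource) Close() {
	r.once.Do(func() {
		// The finalizer is no longer needed once memory is released.
		runtime.SetFinalizer(r, nil)
		r.free()
	})
}

func main() {
	freed := 0
	r := NewResource(func() { freed++ })
	r.Close()
	r.Close() // second call is a no-op
	fmt.Println("free calls:", freed)
}
```

With this shape, deterministic cleanup is available through Close, and a double free is ruled out by sync.Once.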
Challenge 7 - cgo is FFI
FFI - Foreign function interface
https://github.com/dyu/ffi-overhead Results (500M calls)
c: 1182, 1182
cpp: 1182, 1183
Go: 37975 (about 32x slower)
Challenge 7 - cgo is FFI
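Try to reduce unneeded cgo calls:
Use the Builder pattern for appending data to an array.
func TestAppendInt64(t *testing.T) {
	// Assertions bound to t (testify), so t is not passed to each call.
	require := require.New(t)
	bld := NewInteger64ArrayBuilder()
	const size = 20913
	for i := int64(0); i < size; i++ {
		err := bld.Append(i)
		require.NoErrorf(err, "append %d", i)
	}
	arr, err := bld.Finish()
	require.NoError(err, "finish")
	require.NotNil(arr)
}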
Our benchmarks show that this implementation is 7 times faster than calling a cgo function for every append.
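The slides do not show how the builder cuts down on cgo calls. One common way to get this kind of effect is to buffer values on the Go side and cross the boundary once per batch; the sketch below illustrates that idea in pure Go. BatchBuilder, its limit and the Crossings counter are all hypothetical, and the native slice merely stands in for memory owned by the C++ builder.

```go
package main

import "fmt"

// BatchBuilder buffers appended values in Go memory and hands them to
// the (simulated) native side in batches.
type BatchBuilder struct {
	buf       []int64 // values waiting on the Go side
	native    []int64 // stand-in for memory owned by the native builder
	limit     int     // flush once this many values are buffered
	Crossings int     // number of simulated Go-to-C calls
}

// NewBatchBuilder creates a builder that flushes every limit values.
func NewBatchBuilder(limit int) *BatchBuilder {
	if limit < 1 {
		limit = 1
	}
	return &BatchBuilder{buf: make([]int64, 0, limit), limit: limit}
}

// Append stores the value locally and only crosses the boundary when
// the buffer is full.
func (b *BatchBuilder) Append(v int64) {
	b.buf = append(b.buf, v)
	if len(b.buf) >= b.limit {
		b.flush()
	}
}

// flush simulates one cgo call that copies the whole batch.
func (b *BatchBuilder) flush() {
	if len(b.buf) == 0 {
		return
	}
	b.Crossings++
	b.native = append(b.native, b.buf...)
	b.buf = b.buf[:0]
}

// Finish flushes what is left and returns the accumulated values.
func (b *BatchBuilder) Finish() []int64 {
	b.flush()
	out := b.native
	b.native = nil
	return out
}

func main() {
	b := NewBatchBuilder(1024)
	const size = 20913
	for i := int64(0); i < size; i++ {
		b.Append(i)
	}
	arr := b.Finish()
	fmt.Println(len(arr), "values,", b.Crossings, "boundary crossings")
}
```

The trade-off is extra Go-side memory for the buffer in exchange for paying the per-call FFI overhead once per batch instead of once per value.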
Challenge 8 - Making the package go get-able
This lib is linked against a specific Arrow version on a specific OS and architecture (Linux AMD64, for example).
Do we precompile for each OS?
Do we list in the README which packages need to be installed alongside?
carrow status
Adding more features (more data types).
Building good use cases: where and how should we use this?
Adding our project to the main Apache Arrow repo.
Questions?
Thank you
https://github.com/yonidavidson
https://www.linkedin.com/in/yoni-davidson-35b53222/
https://twitter.com/yonidavidson

Der Spagat zwischen BIAS und FAIRNESS (2024)Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)
 
Professional Resume Template for Software Developers
Professional Resume Template for Software DevelopersProfessional Resume Template for Software Developers
Professional Resume Template for Software Developers
 
Project Based Learning (A.I).pptx detail explanation
Project Based Learning (A.I).pptx detail explanationProject Based Learning (A.I).pptx detail explanation
Project Based Learning (A.I).pptx detail explanation
 
Test Automation Strategy for Frontend and Backend
Test Automation Strategy for Frontend and BackendTest Automation Strategy for Frontend and Backend
Test Automation Strategy for Frontend and Backend
 

carrow - Go bindings to Apache Arrow via C++-API

  • 1. Wrapping C++ Arrow - Why and How? 2 Sep 2019 Yoni Davidson TG-17
  • 2. About me! Generalist working at TG-17, a stealth-mode startup. Currently I am working in Kotlin, JavaScript, and Python. Before: Sears Israel - Mobile/Backend team. Eyesight mobile - Mobile, Platform, and IoT teams. Alvarion - WiMAX and Wi-Fi teams.
  • 3. Motivation Data is getting bigger (Hadoop, S3) - Parquet for efficient storage. Data scientists need a way to work without running out of memory. Big data infrastructure is based on the JVM; other languages also need to work on the data, and serialization is expensive - Python is the best example. Moving data around is expensive (serialization and deserialization) - IO between services / GPU->CPU. Building all this for each framework and each language (Java + Python) is a lot of work and blocks innovation.
  • 4. What is Apache Arrow? Apache Arrow is a cross-language development platform for in-memory data. It specifies a standardized language- independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware. It also provides computational libraries and zero-copy streaming messaging and interprocess communication. Languages currently supported include C, C++, C#, Go, Java, JavaScript, MATLAB, Python, R, Ruby, and Rust. https://arrow.apache.org/
  • 5. Moving data in memory between languages and between services
  • 6. What is Apache Arrow? Performance Advantage of Columnar In-Memory Columnar memory layout allows applications to avoid unnecessary IO and accelerate analytical processing performance on modern CPUs (Multi core) and GPUs.
  • 7. What is Apache Arrow? Advantages of a Common Data Layer Columnar memory layout allows applications to avoid unnecessary IO and accelerate analytical processing performance on modern CPUs and GPUs. ● Each system has its own internal memory format ● 70-80% computation wasted on serialization and deserialization ● Similar functionality implemented in multiple projects ● All systems utilize the same memory format ● No overhead for cross-system communication ● Projects can share functionality (eg, Parquet-to-Arrow reader)
  • 8. Who is leading the work on Apache Arrow?
  • 9. How fast is it? ● https://github.com/apache/spark/pull/22954 - Enables Arrow optimization from R DataFrame to Spark DataFrame
  • 10. How fast is it? https://databricks.com/blog/2017/10/30/introducing- vectorized-udfs-for-pyspark.html
  • 11. Where is Apache Arrow going? Using arrow to allow TF to natively work with local and remote datasets https://medium.com/tensorflow/tensorflow-with-apache-arrow-datasets-cdbcfe80a59f pandas2 will be based on Apache Arrow - native work with Pandas on other platforms:
  • 13. Apache Arrow Python bindings Based on the CPP project. Built with Cython. Allows integration with the massive Python ecosystem - Pandas.
  • 14. What is Language Binding? In the context of software libraries, bindings are wrapper libraries that bridge two programming languages, so that a library written for one language can be used in another language (Wiki).
  • 15. Where do we find language bindings ?
  • 16. What do we want to do with our Go implementation? Sharing table with Python in same memory space. 0 serialization
  • 17. First approach - implement spec in pure Go Pros: 1. It’s a very closed problem - read the spec, write tests and implement. 2. It gives you all the advantages of Arrow (up to the implementation date). 3. Go allows us to improve the implementation by providing better tools for concurrent work (easier than C++).
  • 18. First approach - implement spec in pure Go Cons: 1. Every improvement made on the main branch needs to be reimplemented in Go. If it is not an “API” change, you need to understand the C++ code first and then rewrite it in Go. 2. Point 1 makes the project harder to maintain. 3. If the Go version adds improvements, it is harder to export them back to the C++ project (and to Python, which is bound to it), since the core project is not the native one.
  • 19. First approach - implement spec in pure Go Python project also enjoys C++ improvements.
  • 20. carrow - Go bindings to Apache Arrow via C++-API https://github.com/353solutions/carrow
  • 21. carrow - Go bindings to Apache Arrow via C++-API Pros: 1. This project enjoys all the CPP main branch improvements. 2. Any addition we create in the Go project can be exported back to the Python/CPP projects (we ran an experiment reading pandas data from our Go project).
  • 22. carrow - Go bindings to Apache Arrow via C++-API Cons: 1. It's much harder to build (compared to a pure native Go implementation).
  • 23. Challenge 1 - Go and CPP - don’t link CPP compilers mangle symbol names (to support CPP features such as overloading). cgo does not support mangled names, so a C wrapper is needed.
  • 24. Challenge 1 - Go and CPP - don’t link - example
      void *table_new(void *sp, void *cp) {
        auto schema = (Schema *)sp;
        auto columns = (std::vector<std::shared_ptr<arrow::Column>> *)cp;
        auto table = arrow::Table::Make(schema->ptr, *columns);
        if (table == nullptr) {
          return nullptr;
        }
        auto wrapper = new Table;
        wrapper->table = table;
        return wrapper;
      }
  • 25. Challenge 1 - Go and CPP - don’t link - example
      #ifndef _CARROW_H_
      #define _CARROW_H_
      #ifdef __cplusplus
      extern "C" {
      #endif
      void *table_new(void *sp, void *cp);
      #ifdef __cplusplus
      }
      #endif // extern "C"
      #endif // #ifndef _CARROW_H_
  • 26. Challenge 2 - Building a CPP/Go project CPP libs and headers are required, so the development environment is more complex than that of a pure Go project. The solution is a Dockerfile that contains the native CPP libraries plus the Python bindings for E2E tests.
  • 27. Challenge 2 - Building a CPP/Go project - Dockerfile
      FROM ubuntu:18.04
      # Tools
      RUN apt-get update && apt-get install -y gdb git make vim wget && rm -rf /var/lib/apt/lists/*
      # Go installation
      RUN cd /tmp && wget https://dl.google.com/go/go1.12.9.linux-amd64.tar.gz && tar -C /usr/local -xzf go1.12.9.linux-amd64.tar.gz && rm go1.12.9.linux-amd64.tar.gz
      ENV PATH="/usr/local/go/bin:${PATH}"
  • 28. Challenge 2 - Building a CPP/Go project - Dockerfile
      # Python bindings
      RUN cd /tmp && wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh && bash Miniconda3-latest-Linux-x86_64.sh -b -p /miniconda && rm Miniconda3-latest-Linux-x86_64.sh
      ENV PATH="/miniconda/bin:${PATH}"
      RUN conda install -y Cython conda-forge::compilers conda-forge::pyarrow=0.14 ipython numpy pkg-config
      ENV LD_LIBRARY_PATH=/miniconda/lib
      WORKDIR /src/carrow
  • 29. Challenge 3 - Wrapper for each type Since this is a wrapper lib, wrapping each type requires a lot of “copy pasta” code. The solution was to use Go templates to generate some of the code.
  • 30. Challenge 3 - Wrapper for each type - example
      func main() {
          arrowTypes := []string{"Bool", "Float64", "Integer64", "String", "Timestamp"}
          . . .

      // Supported data types
      var(
      {{- range $val := .ArrowTypes}}
          {{$val}}Type = DType(C.{{$val | ToUpper }}_DTYPE)
      {{- end}}
      )
  • 31. Challenge 4 - Logger Do we send all our errors up to the Go package for logging? Alternatively, we can create a Go logger and pass it down to the CPP code for logging.
  • 32. Challenge 5 - Error handling Where are errors handled? Where is the best place to log and handle them? For now, every call returns this result_t:
      typedef struct {
        const char *err;
        void *ptr;
        int64_t i;
      } result_t;
  • 33. Challenge 666 - Memory management Two memory managers:
      1. Go runtime - automatic memory management.
      2. CPP runtime - Apache Arrow uses std::shared_ptr extensively.
      std::shared_ptr is a smart pointer that retains shared ownership of an object through a pointer. Several shared_ptr objects may own the same object. The object is destroyed and its memory deallocated when either of the following happens:
      ■ the last remaining shared_ptr owning the object is destroyed;
      ■ the last remaining shared_ptr owning the object is assigned another pointer via operator= or reset().
  • 34. Challenge 666 - Memory management - solution Wrap std::shared_ptr with a struct, so we know who owns the memory.
      struct Table {
        std::shared_ptr<arrow::Table> table;
      };
  • 35. Challenge 666 - Memory management - solution Use a finalizer to free memory.
      // NewSchema creates a new schema
      func NewSchema(fields []*Field) (*Schema, error) {
          fieldsList, err := NewFieldList()
          if err != nil {
              return nil, fmt.Errorf("can't create schema,failed creating fields list")
          }
          . . .
          schema := &Schema{ptr}
          runtime.SetFinalizer(schema, func(s *Schema) {
              C.schema_free(s.ptr)
          })
          return schema, nil
      }
  • 36. Challenge 7 - cgo is FFI FFI - Foreign function interface https://github.com/dyu/ffi-overhead Results (500M calls): c: 1182, 1182; cpp: 1182, 1183; Go: 37975 (x32, i.e. roughly 32 times the C figure).
  • 37. Challenge 7 - cgo is FFI Try to reduce unneeded cgo calls by using the Builder pattern for appending data to an array.
      func TestAppendInt64(t *testing.T) {
          require := require.New(t) // testify assertions; assumed, not shown on the slide
          bld := NewInteger64ArrayBuilder()
          const size = 20913
          for i := int64(0); i < size; i++ {
              err := bld.Append(i)
              require.NoErrorf(err, "append %d", i)
          }
          arr, err := bld.Finish()
          . . .
      }
      Our benchmarks show that this implementation is 7 times faster than calling a cgo function for each data append.
  • 38. Challenge 8 - Making the package go-gettable This lib is linked to a specific Arrow version on a specific OS (Linux AMD64, for example). Do we precompile for each OS? Do we list in the README which packages need to be installed alongside it?
  • 39. carrow status Adding more features (more data types). Building good use cases: where and how should we use this? Adding our project to the main Apache Arrow repo.
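
The idea on slide 37 of cutting down cgo calls can be illustrated in pure Go, without cgo. This is a minimal sketch and not carrow's implementation: `BatchBuilder`, `flushFn`, and `countCalls` are hypothetical names, and the `flush` callback only stands in for a single cgo call that would hand a whole slice to the C++ side. It shows how buffering on the Go side reduces the number of boundary crossings.

```go
package main

import "fmt"

// flushFn stands in for ONE cgo call that hands a whole batch to C++.
// Hypothetical placeholder; carrow's real API is not reproduced here.
type flushFn func(batch []int64) error

// BatchBuilder buffers values on the Go side and crosses the
// (simulated) language boundary once per full batch.
type BatchBuilder struct {
	buf   []int64
	size  int
	flush flushFn
}

// NewBatchBuilder creates a builder that flushes every size values.
func NewBatchBuilder(size int, flush flushFn) *BatchBuilder {
	return &BatchBuilder{buf: make([]int64, 0, size), size: size, flush: flush}
}

// Append stores v locally and flushes when the buffer is full.
func (b *BatchBuilder) Append(v int64) error {
	b.buf = append(b.buf, v)
	if len(b.buf) < b.size {
		return nil
	}
	return b.Flush()
}

// Flush sends any buffered values across the boundary in one call.
// The callee must not keep the slice: the buffer is reused afterwards.
func (b *BatchBuilder) Flush() error {
	if len(b.buf) == 0 {
		return nil
	}
	err := b.flush(b.buf)
	b.buf = b.buf[:0]
	return err
}

// countCalls appends n values with the given batch size and reports how
// many boundary crossings happened and how many values arrived.
func countCalls(n, batchSize int) (calls, received int, err error) {
	bld := NewBatchBuilder(batchSize, func(batch []int64) error {
		calls++
		received += len(batch)
		return nil
	})
	for i := 0; i < n; i++ {
		if err = bld.Append(int64(i)); err != nil {
			return calls, received, err
		}
	}
	err = bld.Flush() // push the final partial batch
	return calls, received, err
}

func main() {
	perValue, _, _ := countCalls(20913, 1)
	batched, _, _ := countCalls(20913, 1024)
	fmt.Println("boundary calls, batch size 1:", perValue)
	fmt.Println("boundary calls, batch size 1024:", batched)
}
```

The sketch only counts simulated calls. How much time batching saves in practice depends on the per-call cgo overhead and on the cost of copying the batch, so the 7x figure on the slide should be read as the authors' own benchmark result, not something this example reproduces.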

Editor's notes

  1. This blog post introduces the Pandas UDFs (a.k.a. Vectorized UDFs) feature in the upcoming Apache Spark 2.3 release that substantially improves the performance and usability of user-defined functions (UDFs) in Python. Over the past few years, Python has become the default language for data scientists. Packages such as pandas, numpy, statsmodel, and scikit-learn have gained great adoption and become the mainstream toolkits. At the same time, Apache Spark has become the de facto standard in processing big data. To enable data scientists to leverage the value of big data, Spark added a Python API in version 0.7, with support for user-defined functions. These user-defined functions operate one-row-at-a-time, and thus suffer from high serialization and invocation overhead. As a result, many data pipelines define UDFs in Java and Scala and then invoke them from Python. Pandas UDFs built on top of Apache Arrow bring you the best of both worlds—the ability to define low-overhead, high-performance UDFs entirely in Python. In Spark 2.3, there will be two types of Pandas UDFs: scalar and grouped map. Next, we illustrate their usage using four example programs: Plus One, Cumulative Probability, Subtract Mean, Ordinary Least Squares Linear Regression.