2. About me!
Generalist working at TG-17, a stealth-mode startup.
Currently working in Kotlin, JavaScript, and Python.
Before:
Sears Israel - Mobile/Backend team.
Eyesight Mobile - Mobile, Platform, and IoT teams.
Alvarion - WiMAX and Wi-Fi teams.
3. Motivation
Data is getting bigger (Hadoop, S3) - Parquet enables efficient storage.
Data scientists need a way to work with this data without running out of memory.
Big data infrastructure is based on the JVM; other languages would like to work on the data, and serialization is expensive - Python is the best example.
Moving data around is expensive (serialization and deserialization) - IO between services, GPU->CPU.
Building all of this for each framework and each language (Java + Python) is a lot of work and blocks innovation.
4. What is Apache Arrow?
Apache Arrow is a cross-language development platform for in-memory data. It specifies a standardized language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware. It also provides computational libraries and zero-copy streaming messaging and interprocess communication. Languages currently supported include C, C++, C#, Go, Java, JavaScript, MATLAB, Python, R, Ruby, and Rust.
https://arrow.apache.org/
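For a quick feel of the columnar API, a minimal sketch using the official Arrow Go package (a sketch, assuming a 2019-era release of the Go library):
package main

import (
	"fmt"

	"github.com/apache/arrow/go/arrow/array"
	"github.com/apache/arrow/go/arrow/memory"
)

func main() {
	pool := memory.NewGoAllocator()
	// Build an int64 column in Arrow's standard columnar layout.
	bld := array.NewInt64Builder(pool)
	defer bld.Release()
	bld.AppendValues([]int64{1, 2, 3, 4}, nil)
	arr := bld.NewInt64Array()
	defer arr.Release()
	fmt.Println(arr) // [1 2 3 4]
}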
5. Moving data in memory between languages and between services
6. What is Apache Arrow?
Performance Advantage of Columnar In-Memory
Columnar memory layout allows applications to avoid unnecessary IO and accelerates analytical processing on modern multi-core CPUs and GPUs.
7. What is Apache Arrow?
Advantages of a Common Data Layer
Without a common data layer:
● Each system has its own internal memory format
● 70-80% of computation is wasted on serialization and deserialization
● Similar functionality is implemented in multiple projects
With Arrow as a common data layer:
● All systems utilize the same memory format
● No overhead for cross-system communication
● Projects can share functionality (e.g., a Parquet-to-Arrow reader)
9. How fast is it?
● https://github.com/apache/spark/pull/22954 - Enables Arrow optimization from R DataFrame to Spark DataFrame
10. How fast is it?
https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html
11. Where is Apache Arrow going?
Using Arrow to allow TensorFlow to work natively with local and remote datasets:
https://medium.com/tensorflow/tensorflow-with-apache-arrow-datasets-cdbcfe80a59f
pandas2 will be based on Apache Arrow, enabling native work with pandas data on other platforms.
13. Apache Arrow Python bindings
Based on the C++ project.
Built with Cython.
Allows integration with the massive Python ecosystem - e.g., pandas.
14. What is Language Binding?
In the context of software libraries, bindings are wrapper libraries that bridge two programming languages, so that a library written for one language can be used in another language (Wikipedia).
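As a minimal illustration of a binding via cgo (the add function is a toy example, not part of Arrow):
package main

/*
// A tiny C function we bind to from Go.
static int add(int a, int b) { return a + b; }
*/
import "C"
import "fmt"

func main() {
	// cgo exposes C symbols to Go under the C pseudo-package.
	fmt.Println(C.add(2, 3)) // 5
}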
16. What do we want to do with our Go implementation?
Sharing a table with Python in the same memory space - zero serialization.
17. First approach - implement spec in pure Go
Pros:
1. It's a well-bounded problem - read the spec, write tests, and implement.
2. It gives you all the advantages of Arrow (as of the implementation date).
3. Go allows us to improve the implementation by providing better tools for concurrent work (easier than C++).
18. First approach - implement spec in pure Go
Cons:
1. Every improvement in the main branch needs to be re-implemented in Go; especially if it's not an "API" change, you'll need to understand the C++ code and then write it in Go.
2. Because of (1), the project is harder to maintain.
3. If the Go version adds improvements, it will be harder to export them back to the C++ project (and to Python, which is bound to it), since the Go port is not the canonical implementation.
19. First approach - implement spec in pure Go
The Python project, by contrast, enjoys C++ improvements automatically, since it binds to the C++ core.
20. carrow - Go bindings to Apache Arrow via C++-API
https://github.com/353solutions/carrow
21. carrow - Go bindings to Apache Arrow via C++-API
Pros:
1. This project enjoys all the CPP main branch improvements.
2. Any addition we create in the Go project can be exported back to the Python/C++ projects (we did an experiment reading pandas data from our Go project).
22. carrow - Go bindings to Apache Arrow via C++-API
Cons:
1. It's much harder to build (compared to a pure native Go implementation).
23. Challenge 1 - Go and CPP - don’t link
C++ compilers mangle symbol names (to support C++ features like overloading and namespaces); cgo doesn't support mangled symbols, so a C wrapper is needed.
24. Challenge 1 - Go and CPP - don’t link - example
void *table_new(void *sp, void *cp) {
    // Cast the opaque pointers handed over from Go back to their C++ types.
    auto schema = (Schema *)sp;
    auto columns = (std::vector<std::shared_ptr<arrow::Column>> *)cp;
    auto table = arrow::Table::Make(schema->ptr, *columns);
    if (table == nullptr) {
        return nullptr;
    }
    // Wrap the shared_ptr in a plain struct so Go can hold it as a void *.
    auto wrapper = new Table;
    wrapper->table = table;
    return wrapper;
}
25. Challenge 1 - Go and CPP - don’t link - example
#ifndef _CARROW_H_
#define _CARROW_H_
#ifdef __cplusplus
extern "C" {
#endif
void *table_new(void *sp, void *cp);
#ifdef __cplusplus
} // extern "C"
#endif
#endif // _CARROW_H_
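On the Go side, a minimal sketch of calling the wrapper through cgo (the link flags and the NewTable name are illustrative):
package carrow

/*
#cgo LDFLAGS: -lcarrow -larrow
#include "carrow.h"
*/
import "C"
import "unsafe"

// NewTable forwards opaque schema/columns pointers (obtained from other
// wrapper calls) to the C shim and returns the opaque table pointer.
func NewTable(schemaPtr, columnsPtr unsafe.Pointer) unsafe.Pointer {
	return C.table_new(schemaPtr, columnsPtr)
}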
26. Challenge 2 - Building a CPP/Go project
C++ libs and headers are required, which means the dev environment is more complex than a typical Go project's.
The solution is a Dockerfile with the native C++ library plus the Python bindings, for end-to-end tests.
27. Challenge 2 - Building a CPP/Go project - Dockerfile
FROM ubuntu:18.04
# Tools
RUN apt-get update && apt-get install -y \
    gdb \
    git \
    make \
    vim \
    wget \
    && rm -rf /var/lib/apt/lists/*
# Go installation
RUN cd /tmp && \
    wget https://dl.google.com/go/go1.12.9.linux-amd64.tar.gz && \
    tar -C /usr/local -xzf go1.12.9.linux-amd64.tar.gz && \
    rm go1.12.9.linux-amd64.tar.gz
ENV PATH="/usr/local/go/bin:${PATH}"
28. Challenge 2 - Building a CPP/Go project - Dockerfile
# Python bindings
RUN cd /tmp && \
    wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh && \
    bash Miniconda3-latest-Linux-x86_64.sh -b -p /miniconda && \
    rm Miniconda3-latest-Linux-x86_64.sh
ENV PATH="/miniconda/bin:${PATH}"
RUN conda install -y \
    Cython \
    conda-forge::compilers \
    conda-forge::pyarrow=0.14 \
    ipython \
    numpy \
    pkg-config
ENV LD_LIBRARY_PATH=/miniconda/lib
WORKDIR /src/carrow
29. Challenge 3 - Wrapper for each type
Since this is a wrapper lib, a lot of "copy pasta" code is needed to wrap each type.
The solution was to use Go's text/template package to generate some of the code.
30. Challenge 3 - Wrapper for each type - example
func main() {
	arrowTypes := []string{"Bool", "Float64", "Integer64", "String", "Timestamp"}
	...
}

// Supported data types
var (
{{- range $val := .ArrowTypes}}
	{{$val}}Type = DType(C.{{$val | ToUpper}}_DTYPE)
{{- end}}
)
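A hedged sketch of how such a generator could be wired up with Go's text/template (writing to stdout for brevity; a real generator would write a .go file):
package main

import (
	"os"
	"strings"
	"text/template"
)

// One declaration is generated per Arrow type; ToUpper is exposed to the template.
var tmpl = template.Must(template.New("dtypes").
	Funcs(template.FuncMap{"ToUpper": strings.ToUpper}).
	Parse(`// Supported data types
var (
{{- range $val := .ArrowTypes}}
	{{$val}}Type = DType(C.{{$val | ToUpper}}_DTYPE)
{{- end}}
)
`))

func main() {
	data := struct{ ArrowTypes []string }{
		[]string{"Bool", "Float64", "Integer64", "String", "Timestamp"},
	}
	// Write the generated declarations to stdout.
	if err := tmpl.Execute(os.Stdout, data); err != nil {
		panic(err)
	}
}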
31. Challenge 4 - Logger
Do we send all our errors up the stack to the Go package for logging?
Alternatively, we can create a Go logger and pass it down to the C++ code for logging.
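A hedged sketch of the second option - exporting a Go function that the C++ side can call back for logging (goLog is an illustrative name):
package carrow

import "C"
import "log"

// goLog is exported to C; the C++ side can declare
// extern void goLog(char *msg); and call it to log through Go.
//
//export goLog
func goLog(msg *C.char) {
	log.Println(C.GoString(msg))
}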
32. Challenge 5 - Error handling
Where are errors handled?
Where is the best place to log and handle them?
For now, every call returns this result_t:
typedef struct {
    const char *err;
    void *ptr;
    int64_t i;
} result_t;
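A hedged sketch of how the Go side might unpack result_t (checkResult is an illustrative helper, not from the carrow API):
package carrow

/*
#include <stdint.h>
typedef struct {
	const char *err;
	void *ptr;
	int64_t i;
} result_t;
*/
import "C"
import (
	"errors"
	"unsafe"
)

// checkResult converts a C result_t into idiomatic Go return values.
func checkResult(r C.result_t) (unsafe.Pointer, int64, error) {
	if r.err != nil {
		// Copy the C error string into a Go error.
		return nil, 0, errors.New(C.GoString(r.err))
	}
	return r.ptr, int64(r.i), nil
}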
33. Challenge 666 - Memory management
Two memory managers:
1. Go runtime - automatic memory management (garbage collection).
2. C++ runtime - Apache Arrow uses std::shared_ptr extensively:
"std::shared_ptr is a smart pointer that retains shared ownership of an object through a pointer. Several shared_ptr objects may own the same object. The object is destroyed and its memory deallocated when either of the following happens:
■ the last remaining shared_ptr owning the object is destroyed;
■ the last remaining shared_ptr owning the object is assigned another pointer via operator= or reset()."
34. Challenge 666 - Memory management - solution
Wrap std::shared_ptr with a struct - so we know who owns the memory.
struct Table {
std::shared_ptr<arrow::Table> table;
};
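On the Go side, one hedged option is to tie the wrapper's lifetime to Go's GC with a finalizer (table_release is a hypothetical C shim that deletes the wrapper, dropping its shared_ptr reference):
package carrow

/*
// Hypothetical shim: delete the Table wrapper, releasing its shared_ptr.
void table_release(void *vp);
*/
import "C"
import (
	"runtime"
	"unsafe"
)

// Table owns the C wrapper around std::shared_ptr<arrow::Table>.
type Table struct {
	ptr unsafe.Pointer
}

func newTable(ptr unsafe.Pointer) *Table {
	t := &Table{ptr: ptr}
	// When the Go object becomes unreachable, drop the C++ reference.
	runtime.SetFinalizer(t, func(t *Table) {
		C.table_release(t.ptr)
	})
	return t
}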
36. Challenge 7 - cgo is FFI
FFI - foreign function interface.
https://github.com/dyu/ffi-overhead - results for 500M calls (milliseconds):
c: 1182, 1182
cpp: 1182, 1183
go: 37975 (~32x slower)
37. Challenge 7 - cgo is FFI
Try to reduce unneeded cgo calls:
Use the Builder pattern for appending data to an array.
import (
	"testing"

	"github.com/stretchr/testify/require"
)

func TestAppendInt64(t *testing.T) {
	bld := NewInteger64ArrayBuilder()
	const size = 20913
	for i := int64(0); i < size; i++ {
		err := bld.Append(i)
		require.NoErrorf(t, err, "append %d", i)
	}
	arr, err := bld.Finish()
	require.NoError(t, err)
	require.NotNil(t, arr)
}
Our benchmarks show that this implementation is about 7x faster than making a cgo call for each data append.
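A hedged sketch of the batching idea behind the builder - buffer appends in Go memory and cross the cgo boundary once (append_int64s stands in for a real shim over arrow::Int64Builder::AppendValues):
package carrow

/*
#include <stdint.h>
// Stand-in for a real shim that forwards to the C++ Arrow builder.
static void append_int64s(const int64_t *vals, int64_t n) { (void)vals; (void)n; }
*/
import "C"
import "unsafe"

// Integer64ArrayBuilder buffers values in Go and makes a single cgo call
// in Finish, instead of one call per Append.
type Integer64ArrayBuilder struct {
	vals []int64
}

func NewInteger64ArrayBuilder() *Integer64ArrayBuilder {
	return &Integer64ArrayBuilder{}
}

// Append is pure Go - no cgo boundary crossing per value.
func (b *Integer64ArrayBuilder) Append(v int64) error {
	b.vals = append(b.vals, v)
	return nil
}

// Finish hands the whole buffer to C++ in one cgo call.
// (The C side must not retain the pointer past the call, per cgo rules.)
func (b *Integer64ArrayBuilder) Finish() {
	if len(b.vals) == 0 {
		return
	}
	C.append_int64s((*C.int64_t)(unsafe.Pointer(&b.vals[0])), C.int64_t(len(b.vals)))
}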
38. Challenge 8 - Making package Go getable
This lib is linked against a specific Arrow version on a specific OS (for example, Linux amd64).
Do we precompile for each OS?
Or add to the README which packages need to be installed alongside?
39. carrow status
Adding more features (more data types).
Building good use cases - where and how should we use this?
Contributing our project to the main Apache Arrow repo.