2. About me!
Generalist working at TG-17, a stealth-mode startup.
Currently working in Kotlin, JavaScript, and Python.
Before:
Sears Israel - Mobile/Backend team.
Eyesight Mobile - Mobile, Platform, and IoT teams.
Alvarion - WiMAX and Wi-Fi teams.
3. Motivation
Data is getting bigger (Hadoop, S3) - Parquet enables efficient storage.
Data scientists need a way to work with this data without running out of memory.
Big data infrastructure is based on the JVM; other languages would like to work on the data, and serialization is expensive - Python is the best example.
Moving data around is expensive (serialization and deserialization) - IO between services, GPU->CPU.
Building all of this for each framework and each language (Java + Python) is a lot of work and blocks innovation.
4. What is Apache Arrow?
Apache Arrow is a cross-language development platform for in-memory data. It specifies a standardized language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware. It also provides computational libraries and zero-copy streaming messaging and interprocess communication. Languages currently supported include C, C++, C#, Go, Java, JavaScript, MATLAB, Python, R, Ruby, and Rust.
https://arrow.apache.org/
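For a quick feel of the columnar API, a minimal sketch using the official Arrow Go package (a sketch, assuming a 2019-era release of the Go library):
package main

import (
	"fmt"

	"github.com/apache/arrow/go/arrow/array"
	"github.com/apache/arrow/go/arrow/memory"
)

func main() {
	pool := memory.NewGoAllocator()
	// Build an int64 column in Arrow's standard columnar layout.
	bld := array.NewInt64Builder(pool)
	defer bld.Release()
	bld.AppendValues([]int64{1, 2, 3, 4}, nil)
	arr := bld.NewInt64Array()
	defer arr.Release()
	fmt.Println(arr) // [1 2 3 4]
}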
5. Moving data in memory between languages and between services
6. What is Apache Arrow?
Performance Advantage of Columnar In-Memory
Columnar memory layout allows applications to avoid unnecessary IO and accelerates analytical processing on modern multi-core CPUs and GPUs.
7. What is Apache Arrow?
Advantages of a Common Data Layer
Without a common data layer:
● Each system has its own internal memory format
● 70-80% of computation is wasted on serialization and deserialization
● Similar functionality is implemented in multiple projects
With Arrow as a common data layer:
● All systems utilize the same memory format
● No overhead for cross-system communication
● Projects can share functionality (e.g., a Parquet-to-Arrow reader)
9. How fast is it?
● https://github.com/apache/spark/pull/22954 - Enables Arrow optimization from R DataFrame to Spark DataFrame
10. How fast is it?
https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html
11. Where is Apache Arrow going?
Using Arrow to allow TensorFlow to work natively with local and remote datasets:
https://medium.com/tensorflow/tensorflow-with-apache-arrow-datasets-cdbcfe80a59f
pandas2 will be based on Apache Arrow, enabling native work with pandas data on other platforms.
13. Apache Arrow Python bindings
Based on the C++ project.
Built with Cython.
Allows integration with the massive Python ecosystem - e.g., pandas.
14. What is Language Binding?
In the context of software libraries, bindings are wrapper libraries that bridge two programming languages, so that a library written for one language can be used in another language (Wikipedia).
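As a minimal illustration of a binding via cgo (the add function is a toy example, not part of Arrow):
package main

/*
// A tiny C function we bind to from Go.
static int add(int a, int b) { return a + b; }
*/
import "C"
import "fmt"

func main() {
	// cgo exposes C symbols to Go under the C pseudo-package.
	fmt.Println(C.add(2, 3)) // 5
}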
16. What do we want to do with our Go implementation?
Sharing a table with Python in the same memory space - zero serialization.
17. First approach - implement spec in pure Go
Pros:
1. It's a well-bounded problem - read the spec, write tests, and implement.
2. It gives you all the advantages of Arrow (as of the implementation date).
3. Go allows us to improve the implementation by providing better tools for concurrent work (easier than C++).
18. First approach - implement spec in pure Go
Cons:
1. Every improvement in the main branch needs to be re-implemented in Go; especially if it's not an "API" change, you'll need to understand the C++ code and then write it in Go.
2. Because of (1), the project is harder to maintain.
3. If the Go version adds improvements, it will be harder to export them back to the C++ project (and to Python, which is bound to it), since the Go port is not the canonical implementation.
19. First approach - implement spec in pure Go
The Python project, by contrast, enjoys C++ improvements automatically, since it binds to the C++ core.
20. carrow - Go bindings to Apache Arrow via C++-API
https://github.com/353solutions/carrow
21. carrow - Go bindings to Apache Arrow via C++-API
Pros:
1. This project enjoys all the CPP main branch improvements.
2. Any addition we create in the Go project can be exported back to the Python/C++ projects (we did an experiment reading pandas data from our Go project).
22. carrow - Go bindings to Apache Arrow via C++-API
Cons:
1. It's much harder to build (compared to a pure native Go implementation).
23. Challenge 1 - Go and CPP - don’t link
C++ compilers mangle symbol names (to support C++ features like overloading and namespaces); cgo doesn't support mangled symbols, so a C wrapper is needed.
24. Challenge 1 - Go and CPP - don’t link - example
void *table_new(void *sp, void *cp) {
    // Cast the opaque pointers handed over from Go back to their C++ types.
    auto schema = (Schema *)sp;
    auto columns = (std::vector<std::shared_ptr<arrow::Column>> *)cp;
    auto table = arrow::Table::Make(schema->ptr, *columns);
    if (table == nullptr) {
        return nullptr;
    }
    // Wrap the shared_ptr in a plain struct so Go can hold it as a void *.
    auto wrapper = new Table;
    wrapper->table = table;
    return wrapper;
}
25. Challenge 1 - Go and CPP - don’t link - example
#ifndef _CARROW_H_
#define _CARROW_H_
#ifdef __cplusplus
extern "C" {
#endif
void *table_new(void *sp, void *cp);
#ifdef __cplusplus
} // extern "C"
#endif
#endif // _CARROW_H_
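On the Go side, a minimal sketch of calling the wrapper through cgo (the link flags and the NewTable name are illustrative):
package carrow

/*
#cgo LDFLAGS: -lcarrow -larrow
#include "carrow.h"
*/
import "C"
import "unsafe"

// NewTable forwards opaque schema/columns pointers (obtained from other
// wrapper calls) to the C shim and returns the opaque table pointer.
func NewTable(schemaPtr, columnsPtr unsafe.Pointer) unsafe.Pointer {
	return C.table_new(schemaPtr, columnsPtr)
}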
26. Challenge 2 - Building a CPP/Go project
C++ libs and headers are required, which means the dev environment is more complex than a typical Go project's.
The solution is a Dockerfile with the native C++ library plus the Python bindings, for end-to-end tests.
27. Challenge 2 - Building a CPP/Go project - Dockerfile
FROM ubuntu:18.04
# Tools
RUN apt-get update && apt-get install -y \
    gdb \
    git \
    make \
    vim \
    wget \
    && rm -rf /var/lib/apt/lists/*
# Go installation
RUN cd /tmp && \
    wget https://dl.google.com/go/go1.12.9.linux-amd64.tar.gz && \
    tar -C /usr/local -xzf go1.12.9.linux-amd64.tar.gz && \
    rm go1.12.9.linux-amd64.tar.gz
ENV PATH="/usr/local/go/bin:${PATH}"
28. Challenge 2 - Building a CPP/Go project - Dockerfile
# Python bindings
RUN cd /tmp && \
    wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh && \
    bash Miniconda3-latest-Linux-x86_64.sh -b -p /miniconda && \
    rm Miniconda3-latest-Linux-x86_64.sh
ENV PATH="/miniconda/bin:${PATH}"
RUN conda install -y \
    Cython \
    conda-forge::compilers \
    conda-forge::pyarrow=0.14 \
    ipython \
    numpy \
    pkg-config
ENV LD_LIBRARY_PATH=/miniconda/lib
WORKDIR /src/carrow
29. Challenge 3 - Wrapper for each type
Since this is a wrapper lib, a lot of "copy pasta" code is needed to wrap each type.
The solution was to use Go's text/template package to generate some of the code.
30. Challenge 3 - Wrapper for each type - example
func main() {
	arrowTypes := []string{"Bool", "Float64", "Integer64", "String", "Timestamp"}
	...
}

// Supported data types
var (
{{- range $val := .ArrowTypes}}
	{{$val}}Type = DType(C.{{$val | ToUpper}}_DTYPE)
{{- end}}
)
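A hedged sketch of how such a generator could be wired up with Go's text/template (writing to stdout for brevity; a real generator would write a .go file):
package main

import (
	"os"
	"strings"
	"text/template"
)

// One declaration is generated per Arrow type; ToUpper is exposed to the template.
var tmpl = template.Must(template.New("dtypes").
	Funcs(template.FuncMap{"ToUpper": strings.ToUpper}).
	Parse(`// Supported data types
var (
{{- range $val := .ArrowTypes}}
	{{$val}}Type = DType(C.{{$val | ToUpper}}_DTYPE)
{{- end}}
)
`))

func main() {
	data := struct{ ArrowTypes []string }{
		[]string{"Bool", "Float64", "Integer64", "String", "Timestamp"},
	}
	// Write the generated declarations to stdout.
	if err := tmpl.Execute(os.Stdout, data); err != nil {
		panic(err)
	}
}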
31. Challenge 4 - Logger
Do we send all our errors up the stack to the Go package for logging?
Alternatively, we can create a Go logger and pass it down to the C++ code for logging.
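A hedged sketch of the second option - exporting a Go function that the C++ side can call back for logging (goLog is an illustrative name):
package carrow

import "C"
import "log"

// goLog is exported to C; the C++ side can declare
// extern void goLog(char *msg); and call it to log through Go.
//
//export goLog
func goLog(msg *C.char) {
	log.Println(C.GoString(msg))
}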
32. Challenge 5 - Error handling
Where are errors handled?
Where is the best place to log and handle them?
For now, every call returns this result_t:
typedef struct {
    const char *err;
    void *ptr;
    int64_t i;
} result_t;
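A hedged sketch of how the Go side might unpack result_t (checkResult is an illustrative helper, not from the carrow API):
package carrow

/*
#include <stdint.h>
typedef struct {
	const char *err;
	void *ptr;
	int64_t i;
} result_t;
*/
import "C"
import (
	"errors"
	"unsafe"
)

// checkResult converts a C result_t into idiomatic Go return values.
func checkResult(r C.result_t) (unsafe.Pointer, int64, error) {
	if r.err != nil {
		// Copy the C error string into a Go error.
		return nil, 0, errors.New(C.GoString(r.err))
	}
	return r.ptr, int64(r.i), nil
}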
33. Challenge 666 - Memory management
Two memory managers:
1. Go runtime - automatic memory management (garbage collection).
2. C++ runtime - Apache Arrow uses std::shared_ptr extensively:
"std::shared_ptr is a smart pointer that retains shared ownership of an object through a pointer. Several shared_ptr objects may own the same object. The object is destroyed and its memory deallocated when either of the following happens:
■ the last remaining shared_ptr owning the object is destroyed;
■ the last remaining shared_ptr owning the object is assigned another pointer via operator= or reset()."
34. Challenge 666 - Memory management - solution
Wrap std::shared_ptr with a struct - so we know who owns the memory.
struct Table {
std::shared_ptr<arrow::Table> table;
};
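On the Go side, one hedged option is to tie the wrapper's lifetime to Go's GC with a finalizer (table_release is a hypothetical C shim that deletes the wrapper, dropping its shared_ptr reference):
package carrow

/*
// Hypothetical shim: delete the Table wrapper, releasing its shared_ptr.
void table_release(void *vp);
*/
import "C"
import (
	"runtime"
	"unsafe"
)

// Table owns the C wrapper around std::shared_ptr<arrow::Table>.
type Table struct {
	ptr unsafe.Pointer
}

func newTable(ptr unsafe.Pointer) *Table {
	t := &Table{ptr: ptr}
	// When the Go object becomes unreachable, drop the C++ reference.
	runtime.SetFinalizer(t, func(t *Table) {
		C.table_release(t.ptr)
	})
	return t
}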
36. Challenge 7 - cgo is FFI
FFI - foreign function interface.
https://github.com/dyu/ffi-overhead - results for 500M calls (milliseconds):
c: 1182, 1182
cpp: 1182, 1183
go: 37975 (~32x slower)
37. Challenge 7 - cgo is FFI
Try to reduce unneeded cgo calls:
Use the Builder pattern for appending data to an array.
import (
	"testing"

	"github.com/stretchr/testify/require"
)

func TestAppendInt64(t *testing.T) {
	bld := NewInteger64ArrayBuilder()
	const size = 20913
	for i := int64(0); i < size; i++ {
		err := bld.Append(i)
		require.NoErrorf(t, err, "append %d", i)
	}
	arr, err := bld.Finish()
	require.NoError(t, err)
	require.NotNil(t, arr)
}
Our benchmarks show that this implementation is about 7x faster than making a cgo call for each data append.
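A hedged sketch of the batching idea behind the builder - buffer appends in Go memory and cross the cgo boundary once (append_int64s stands in for a real shim over arrow::Int64Builder::AppendValues):
package carrow

/*
#include <stdint.h>
// Stand-in for a real shim that forwards to the C++ Arrow builder.
static void append_int64s(const int64_t *vals, int64_t n) { (void)vals; (void)n; }
*/
import "C"
import "unsafe"

// Integer64ArrayBuilder buffers values in Go and makes a single cgo call
// in Finish, instead of one call per Append.
type Integer64ArrayBuilder struct {
	vals []int64
}

func NewInteger64ArrayBuilder() *Integer64ArrayBuilder {
	return &Integer64ArrayBuilder{}
}

// Append is pure Go - no cgo boundary crossing per value.
func (b *Integer64ArrayBuilder) Append(v int64) error {
	b.vals = append(b.vals, v)
	return nil
}

// Finish hands the whole buffer to C++ in one cgo call.
// (The C side must not retain the pointer past the call, per cgo rules.)
func (b *Integer64ArrayBuilder) Finish() {
	if len(b.vals) == 0 {
		return
	}
	C.append_int64s((*C.int64_t)(unsafe.Pointer(&b.vals[0])), C.int64_t(len(b.vals)))
}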
38. Challenge 8 - Making package Go getable
This lib is linked against a specific Arrow version on a specific OS (for example, Linux amd64).
Do we precompile for each OS?
Or add to the README which packages need to be installed alongside?
39. carrow status
Adding more features (more data types).
Building good use cases - where and how should we use this?
Contributing our project to the main Apache Arrow repo.