To effectively support deep learning at LinkedIn, we first need to address data processing issues. Most of the datasets used by our ML algorithms (e.g., Photon-ML, LinkedIn's large-scale personalization engine) are in Avro format. Each record in an Avro dataset is essentially a sparse vector that can be easily consumed by most modern classifiers. However, the format cannot be directly used by TensorFlow, the leading deep learning package. The main blocker is that the sparse vector is not in the same format as a Tensor.
Many companies have vast amounts of ML data in a similar sparse vector format, while the Tensor format is still relatively new to them. Avro2TF bridges this gap by providing a scalable, Spark-based transformation and extension mechanism that efficiently converts the data into TFRecords ready to be consumed by TensorFlow. With this technology, engineers can improve their productivity by focusing on model building rather than data processing.
In this talk, we will go over the data processing issues common to many machine learning pipelines and how we solve them. We will then take a deep dive into the open-sourced tool, Avro2TF: how it works, its technical architecture, and its usage.
1. Avro2TF: A Data Processing
Engine for TensorFlow
Xuhong Zhang, Wensheng Sun, Chenya Zhang, Yiming Ma
AI Computing Foundation Team
2. Tensor Data Preparation: Avro2TF
- A data preprocessing component under TensorFlowIn.
- Read raw user input data in any format supported by Spark.
- Generate tensor metadata (e.g., shape, cardinality, dtype).
- Generate Avro- or TFRecord-based training data.
- Make your training data ready to be consumed by TF!
7. Tensor Data Preparation: Avro2TF
Deep dive into the user config.
- Example of a tensorizeIn config.
(Annotations on the example config: columnConfig applies to NTV features only; one field holds an array of word tokens; Id conversion is applied if needed.)
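A minimal sketch of what a tensorizeIn config might look like, written as a Python dict for readability (the real config is a JSON file; field names here are illustrative and should be checked against the Avro2TF README):

```python
# Illustrative tensorizeIn config sketch -- field names are assumptions,
# not the authoritative Avro2TF schema.
tensorize_in_config = {
    "features": [
        {
            # An NTV (name-term-value) sparse feature; columnConfig
            # applies to NTV features only.
            "inputFeatureInfo": {
                "columnExpr": "userFeatures",
                "columnConfig": {},  # placeholder for NTV-only options
            },
            "outputTensorInfo": {
                "name": "user_features",
                "dtype": "sparseVector",
                "shape": [-1],  # 1D array of any length
            },
        },
        {
            # An array of word tokens that needs String-to-Id conversion.
            "inputFeatureInfo": {"columnExpr": "queryTokens"},
            "outputTensorInfo": {
                "name": "query_tokens",
                "dtype": "long",
                "shape": [-1],
            },
        },
    ],
    "labels": [
        {
            "inputFeatureInfo": {"columnExpr": "label"},
            "outputTensorInfo": {"name": "response", "dtype": "float", "shape": []},
        }
    ],
}
```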
8. Tensor Data Preparation: Avro2TF
Deep dive into the user config.
Shape:
- []: a scalar;
- [-1]: a 1D array of any length;
- [6]: a 1D array of length 6;
- [2, 3]: a matrix with 2 rows and 3 columns.
(Where is “shape” used?)
Notice:
We don’t do any reshaping here. The shape is just the shape of your raw data.
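The shape rules above can be sketched as a small checker (an illustrative helper, not part of Avro2TF):

```python
def matches_shape(data, shape):
    """Check whether raw nested-list data matches a shape spec:
    [] = scalar, [-1] = any length in that dimension, [n] = exact length.
    The rule applies recursively, one dimension per shape entry."""
    if not shape:                       # [] -> scalar
        return not isinstance(data, list)
    if not isinstance(data, list):
        return False
    dim, rest = shape[0], shape[1:]
    if dim != -1 and len(data) != dim:
        return False
    return all(matches_shape(x, rest) for x in data)

print(matches_shape(3.0, []))                          # True: scalar
print(matches_shape([1, 2, 3], [-1]))                  # True: any-length 1D
print(matches_shape([1, 2, 3, 4, 5, 6], [6]))          # True: length 6
print(matches_shape([[1, 2, 3], [4, 5, 6]], [2, 3]))   # True: 2x3 matrix
```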
9. Tensor Data Preparation: Avro2TF
Deep dive into the user config.
- For categorical/sparse features, we require them to be represented in NTV
(name-term-value) format.
- We also support the following primitive types:
int, long, float, double, String, bytes (for multimedia data such as image, audio, and video), boolean,
enum, Array[NTV].
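For concreteness, one record of an NTV feature can be sketched as plain Python dicts (the real data lives in Avro records; the feature names are made up):

```python
# One categorical/sparse feature in NTV (name-term-value) form.
ntv_feature = [
    {"name": "title",   "term": "engineer",      "value": 1.0},
    {"name": "skill",   "term": "deep_learning", "value": 1.0},
    {"name": "company", "term": "linkedin",      "value": 0.5},
]

# Each unique (name, term) pair identifies one dimension of the sparse
# vector; `value` is the weight at that dimension.
dimensions = [(e["name"], e["term"]) for e in ntv_feature]
```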
10. Tensor Data Preparation: Avro2TF
Deep dive into the tensorize process.
Feature Mapping Table Generation.
- Requirement:
String-based indices → numerical Id-based indices.
- Usage:
1) The mapping table is limited (built only from the training data); Ids start from 0.
2) If a feature is not in the table, it maps to the same Id as the UNK (unknown) token (= cardinality).
3) Users can provide their own mapping table.
4) Different trainings can use the same mapping table.
Talk later!
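The rules above can be sketched in a few lines of Python (an illustrative toy, not the actual Spark job):

```python
def build_mapping_table(training_rows):
    """Assign each unique (name, term) pair seen in the training data a
    numerical Id, starting from 0."""
    table = {}
    for row in training_rows:
        for entry in row:
            key = (entry["name"], entry["term"])
            if key not in table:
                table[key] = len(table)
    return table

def lookup(table, name, term):
    """Unseen features map to the UNK Id, which equals the cardinality
    (one past the largest Id assigned from the training data)."""
    return table.get((name, term), len(table))

train = [
    [{"name": "skill", "term": "spark",      "value": 1.0}],
    [{"name": "skill", "term": "tensorflow", "value": 1.0}],
]
table = build_mapping_table(train)
print(lookup(table, "skill", "spark"))   # 0
print(lookup(table, "skill", "scala"))   # 2 -> UNK Id = cardinality
```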
11. Tensor Data Preparation: Avro2TF
Deep dive into the tensorize process.
Feature Mapping Table Generation.
- Feature in NTV format: each row = name + term.
- Feature as a list of Strings: each row = a single word.
12. Tensor Data Preparation: Avro2TF
Deep dive into the tensor metadata.
Cardinality Computation:
- sparseVector: the number of unique name + term pairs across all records.
- String: the number of unique Strings across all records.
- long or int: the maximum long/int value.
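The three cardinality rules can be sketched as follows (an illustrative helper mirroring the rules above, not the actual Spark implementation):

```python
def cardinality(values, dtype):
    """Compute tensor cardinality per dtype, following the slide's rules."""
    if dtype == "sparseVector":
        # unique (name, term) pairs across all records
        return len({(e["name"], e["term"]) for rec in values for e in rec})
    if dtype == "String":
        # unique strings across all records
        return len(set(values))
    if dtype in ("int", "long"):
        # maximum value observed
        return max(values)
    raise ValueError(f"no cardinality rule for dtype {dtype!r}")

print(cardinality(["a", "b", "a"], "String"))  # 2
print(cardinality([3, 7, 5], "long"))          # 7
```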
13. Tensor Data Preparation: Avro2TF
Deep dive into the output tensor.
- For categorical/sparse features, we provide a special data type: sparseVector.
Indices: all the Ids are put into an array.
Values: all the values from the NTV tuples are put into a parallel array.
- We also support the following output tensor data type:
int, long, float, double, String, boolean, bytes, sparseVector.
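A sketch of how NTV entries become a sparseVector's parallel indices/values arrays (illustrative toy code; the mapping table and record here are made up):

```python
def to_sparse_vector(ntv_entries, mapping_table):
    """The Id of each NTV entry goes into `indices`, and its value into the
    parallel `values` array. Unknown (name, term) pairs get the UNK Id,
    which equals the mapping table's size (= cardinality)."""
    indices = [mapping_table.get((e["name"], e["term"]), len(mapping_table))
               for e in ntv_entries]
    values = [e["value"] for e in ntv_entries]
    return {"indices": indices, "values": values}

table = {("skill", "spark"): 0, ("skill", "tensorflow"): 1}
record = [
    {"name": "skill", "term": "tensorflow", "value": 1.0},
    {"name": "skill", "term": "scala",      "value": 0.5},  # unseen -> UNK
]
print(to_sparse_vector(record, table))
# {'indices': [1, 2], 'values': [1.0, 0.5]}
```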