[WHUG] Big Brother Is Watching: How We Collect Data About Allegro Users

Big Brother Is Watching
Or how we collect data about Allegro users

About us
Marcin Kuthan
mkuthan.github.io
Maciej Arciuch
too lazy to have a GitHub account

We will talk about
● what clickstream is (what?)
● business and technical needs (why?)
● the overall system architecture and its technical aspects (how?)

Clickstream at Allegro
● What is clickstream?
● Collected from the frontend, web and mobile
● Over 400 million events per day
● The basis for many business decisions
● Several Big Data teams

How it should be
● Data available immediately: low latency
● Well described, easily accessible to others
● Efficient data format
● Stable
● Scalable

Let's try again
● Need no. 1: faster!
● Queue + stream processing: after 2 s
● Stable and scalable
● New use cases:
○ the “scam” detection department
○ recommendations, search engine

Need no. 2: space. Solution: the Avro format
● mature solution
● takes up (quite) little space
● schemas: structure + documentation
● for batch and stream processing
● compatibility

● Uncompressed data: Avro takes 45% of the size of JSON
● Avro is a binary format: some compression algorithms are a poor fit (see Snappy: a compression ratio 9 times lower for Avro than for JSON)
● The realistic choice: GZip vs LZ4
○ we chose LZ4: lower CPU usage at the cost of 20% worse compression
● Can be changed “on the fly” (Kafka)

Problem no. 3: mess. Solution: a central schema repository.
● single source of truth
● every component reads the latest schema from the repo
● compatibility check on commit
● we know what to compare against
● propagation to the metastore, files, etc.

Schema repository
● “schema review”: work on a schema through pull requests
● merge, deployment to DEV
● promotion to TEST and PROD
● it does not matter which is deployed first: code or schema

Schema repository
● Two competing implementations
○ https://github.com/schema-repo/schema-repo
○ https://github.com/confluentinc/schema-registry
● We use schema-repo
○ it was first, it came out of AVRO-1124
○ it keeps schemas in ZK, not in Kafka

Clickstream Ingestion System
[Diagram: Source (Pageviews, Mobile, Events, Errors), Buffer, Kafka, ..., Clients, Kafka]

Clickstream Ingestion System (cont)
[Diagram: Pageviews, Mobile, Events, Errors, ..., Clients, Kafka, camus, camus2hive.sh]

Requirements
● Scalability
● At least once delivery
● Fault tolerance
● Back pressure

At least once - end2end
“Offset Store”, “Streaming Engine”, Kafka In, Kafka Out
1. get_offsets(topic)
2. fetch_data(topic, partitions, offset_begins, offset_ends)
3. process_data
4. publish_results(topic, results)
5. commit_offsets(topic, partitions, offsets)
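
The at-least-once loop can be sketched in plain Scala. This is a minimal simulation, not the system from the talk: `OffsetStore`, `runBatch` and the in-memory `in`/`out` collections are hypothetical stand-ins for the Offset Store, the Streaming Engine and the two Kafka topics. The point it illustrates is that results are published before the offset is committed, so a crash between the two steps leads to a replay (duplicates) rather than data loss.

```scala
import scala.collection.mutable

// Hypothetical in-memory stand-ins for "Kafka In", "Kafka Out" and the "Offset Store".
object AtLeastOnceSketch {
  final class OffsetStore {
    private var committed = 0
    def getOffset: Int = committed
    def commit(offset: Int): Unit = { committed = offset }
  }

  // Runs one batch: get offset -> fetch -> process -> publish -> commit.
  // When crashBeforeCommit is true, results are published but the offset is never committed.
  def runBatch(
      in: Vector[String],
      out: mutable.Buffer[String],
      store: OffsetStore,
      batchSize: Int,
      crashBeforeCommit: Boolean = false
  ): Unit = {
    val begin = store.getOffset                       // get_offsets
    val end = math.min(begin + batchSize, in.length)
    val fetched = in.slice(begin, end)                // fetch_data
    val results = fetched.map(_.toUpperCase)          // process_data
    out ++= results                                   // publish_results
    if (!crashBeforeCommit) store.commit(end)         // commit_offsets
  }

  def demo(): (Vector[String], Int) = {
    val in = Vector("a", "b", "c", "d")
    val out = mutable.ArrayBuffer.empty[String]
    val store = new OffsetStore
    runBatch(in, out, store, batchSize = 2, crashBeforeCommit = true) // publishes A, B; no commit
    runBatch(in, out, store, batchSize = 2)                           // replays A, B; commits 2
    runBatch(in, out, store, batchSize = 2)                           // publishes C, D; commits 4
    (out.toVector, store.getOffset)
  }
}
```

In `demo()` the first batch is delivered twice, which is exactly what "at least once" permits; downstream consumers have to tolerate or deduplicate such repeats.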

Exactly once
“Checkpoint Store”, “Streaming Engine”
fetch_data(topic, partitions, offset_begins, offset_ends)
process_data
store_checkpoint(metadata, results)
publish_results(results)
Kafka Out Kafka In
load_checkpoint()
transactional
non-transactional
exactly once

Exactly once - end2end
Apache Kafka
⇓
Streaming Engine
⇓
Apache Kafka

Fault tolerance
● yarn-cluster
● spark.yarn.maxAppAttempts
● spark.yarn.max.executor.failures
● spark.task.maxFailures
● spark.streaming.kafka.maxRetries

Fault tolerance
● min.insync.replicas = 2
● acks = all
● topics rep factor >= 3
● kafka clusters >= 4 nodes
● retries > 0
● retry.backoff.ms > 0
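
A sketch of how these settings might be split between the producer and the topic. `acks`, `retries` and `retry.backoff.ms` are producer properties, while `min.insync.replicas` and the replication factor belong to the topic or broker. The concrete values for `retries` and `retry.backoff.ms` are assumptions (the slide only asks for values greater than zero), and `ProducerFaultToleranceConf` is a hypothetical name.

```scala
import java.util.Properties

object ProducerFaultToleranceConf {
  // Producer-side settings from the slide.
  def producerProps(bootstrapServers: String): Properties = {
    val props = new Properties()
    props.setProperty("bootstrap.servers", bootstrapServers)
    props.setProperty("acks", "all")              // wait for all in-sync replicas
    props.setProperty("retries", "3")             // assumed value; the slide says > 0
    props.setProperty("retry.backoff.ms", "100")  // assumed value; the slide says > 0
    props
  }

  // Topic-side settings from the slide.
  val topicConfig: Map[String, String] = Map("min.insync.replicas" -> "2")
  val replicationFactor: Int = 3

  // With acks=all, writes keep succeeding while at least min.insync.replicas
  // replicas are in sync, so this many replicas may be lost without blocking writes.
  def tolerableReplicaFailures(replicationFactor: Int, minInsyncReplicas: Int): Int =
    replicationFactor - minInsyncReplicas
}
```

With a replication factor of 3 and `min.insync.replicas = 2`, one replica can be down while the producer still gets acknowledgements.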

Back pressure
input rate == processing rate
Source Sink

Back pressure
Source Sink
input rate > processing rate

Back pressure
Source Sink
steady state again

Back pressure
Source Sink
initial state
spark.streaming.kafka.maxRatePerPartition*
* effectively requires single topic
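
`spark.streaming.kafka.maxRatePerPartition` is a single value in records per second per Kafka partition, so the upper bound on a batch follows from simple arithmetic. The helper below is a hypothetical illustration with made-up numbers; because one value applies to every partition the stream reads, topics with very different traffic cannot be tuned separately.

```scala
object RateLimitSketch {
  // Upper bound on the number of records a single batch can contain.
  def maxRecordsPerBatch(
      maxRatePerPartition: Long, // records per second per partition
      partitions: Int,           // total Kafka partitions read by the stream
      batchIntervalSeconds: Long
  ): Long =
    maxRatePerPartition * partitions * batchIntervalSeconds
}
```

For example, 1000 records/s per partition, 8 partitions and a 5-second batch interval cap each batch at 40000 records.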

Back pressure
● Pull from source
● Process
● Push to sink (sync)
buffer.memory=not too much
block.on.buffer.full=true
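
The effect of a small buffer combined with blocking on a full buffer can be imitated with a bounded queue. This is an analogy in plain Scala, not the Kafka producer itself: `BackPressureSketch` and its parameters are hypothetical. The puller blocks on `put` whenever the slow sink has not drained the buffer, so the input rate is forced down to the processing rate.

```scala
import java.util.concurrent.ArrayBlockingQueue
import java.util.concurrent.atomic.AtomicInteger

object BackPressureSketch {
  // Returns (records delivered to the sink, largest buffer size seen by the puller).
  def simulate(records: Int, bufferSize: Int): (Int, Int) = {
    val buffer = new ArrayBlockingQueue[Int](bufferSize) // "buffer.memory = not too much"
    val delivered = new AtomicInteger(0)

    // Slow sink: drains roughly one record per millisecond.
    val sink = new Thread(new Runnable {
      def run(): Unit = {
        var i = 0
        while (i < records) {
          buffer.take()
          Thread.sleep(1)
          delivered.incrementAndGet()
          i += 1
        }
      }
    })
    sink.start()

    var maxBuffered = 0
    for (record <- 1 to records) {
      buffer.put(record) // blocks while the buffer is full, like block.on.buffer.full=true
      maxBuffered = math.max(maxBuffered, buffer.size)
    }
    sink.join()
    (delivered.get, maxBuffered)
  }
}
```

Calling `BackPressureSketch.simulate(50, 4)` delivers all 50 records while never holding more than 4 of them in memory.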

Spark Streaming & Apache Kafka integration
● Receiver based approach / high
level Kafka consumer
● Direct streams approach / low level
Kafka consumer

Receiver based approach
[Diagram: Source, Spark Executor (Receiver), HDFS (WAL), Spark Driver (Streaming Context), Offset Store, Sink]
1: pull data
2: store Write Ahead Log
3, 4: send blocks’ ids and commit offset
5, 6: process data and publish results

Driver checkpointing
[Diagram: Spark Driver (Streaming Context, blocks’ ids), HDFS (checkpoint)]
[Diagram: Failed Spark Driver, HDFS (checkpoint), Restarted Spark Driver (Streaming Context, blocks’ ids)]

StreamingContext.getOrCreate(
checkpointDir,
functionToCreateContext
)

Problems
1. Receiver occupies 1 core / executor
2. Data duplication
3. Additional latency
4. HDFS load
5. Complex back pressure
6. Controversial checkpointing

Other gotchas
● High Level Consumer rebalancing
● Spark partition != Kafka partition
// one receiver-based stream per unit of read parallelism
val kafkaDStreams = (1 to readParallelism).map { _ =>
  KafkaUtils.createStream(...)
}
// merge them into a single DStream
val unionDStream = ssc.union(kafkaDStreams)

Direct stream approach
[Diagram: Offset Store, Spark Driver (Streaming Context), Spark Executor, Source, Sink]
1, 2: fetch offset & distribute work
3, 4: fetch, process & publish data
5, 5’, 6: wait for completion & commit offset

Direct stream - good parts
● Low level Kafka consumer
● Straightforward fault tolerance for
at least once
● Built-in natural back pressure
● No WAL
● Kafka partition == Spark partition

Direct stream - bad parts
● Built-in at least once
auto.offset.reset=smallest
● Offset Store based at least once
DIY

Direct stream - bad parts
● Lack of kafka connections pool
● Less mature/mainstream than
receiver based approach

Key takeaways
● Avro schemas and a central schema repo
- a way to reduce confusion
● Spark Streaming & Apache Kafka - almost
perfect couple
● Use Direct Streams

Thank you!
http://github.com/allegro
http://allegro.tech
