The functional paradigm is applicable to more than programming. There is even more reason to use functional patterns at the architectural level. MapReduce is the most famous example of such a pattern. In this talk, we will go through a few other architectural patterns and their corresponding stateful anti-patterns.
2. Who’s talking?
Swedish Institute of Comp. Sc. (test tools)
Sun Microsystems (very large machines)
Google (Hangouts, productivity)
Recorded Future (NLP startup)
Cinnober Financial Tech. (trading systems)
Spotify (data processing & modelling)
Schibsted (data processing & modelling)
3. Why functional?
Verbs
... has made ... expanding ...
... flourishes ... merged ... has been unable to escape lingering ... built ...
... are ... placed ... say ... are ... to explode ...
... are considering ... to reopen ... to recall ...
4. Or object-oriented?
Nouns, pronouns
... bankruptcy ... government bailout ... automaker Chrysler ... comeback ... sales ... Jeep sport utility vehicles.
... Chrysler ... part ... Fiat Chrysler Automobiles, it ... concerns ... the safety ... Jeeps ...
... Jeeps ... gas tanks ... regulators ... safety advocates ... rear-end crash.
... regulators ... an investigation ... those Jeeps ... Fiat Chrysler’s agreement ... models.
5. Functional benefits? My version.
Matches a few problems
Data processing
Matches a few computer properties
Consistency through immutability
Deterministic - replay for resilience
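A minimal sketch of "deterministic - replay for resilience", assuming a hypothetical `count_views` job and made-up page view records (neither is from the talk). Because the function depends only on its immutable input, replaying it after a fault should reproduce the same output.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

# Hypothetical input record: (user_id, page). Tuples are immutable.
PageView = Tuple[str, str]


def count_views(events: Iterable[PageView]) -> Dict[str, int]:
    """Pure function: the output depends only on the input events."""
    return dict(Counter(page for _user, page in events))


# Illustrative raw data, stored as an immutable tuple.
raw_events = (("u1", "/home"), ("u2", "/home"), ("u1", "/buy"))

first_run = count_views(raw_events)
# Simulate recovery after a fault by replaying the same input.
replayed = count_views(raw_events)
```

Since no state outside `raw_events` is read or written, a failed run can simply be repeated.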
6. Local vs distributed properties
Local
Hardware provides strong consistency
Faults -> death
Distributed
Eventual consistency
Faults must be survived
10. Anti-pattern - isolated batch jobs
Get data (more on that later)
Cron an ETL batch job (function)
Output solidifies. Mostly.
Steps in isolation - often different teams
What to do on ETL code changes?
[Figure labels: Sales with demographics; Views with demographics]
11. Pattern: data pipeline
End-to-end sequences/DAG of jobs
Not only exist, but treated end-to-end
Input is raw, original data
Separate raw data from generated
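The pipeline idea can be sketched as one end-to-end function over raw inputs. This is an illustration only: the record shapes and the names `join_demographics`, `conversion` and `run_pipeline` are assumptions, loosely following the dataset names on the slide.

```python
from typing import Dict, List

# Raw, original data (hypothetical records). Never modified by the jobs.
USERS = [{"user": "u1", "age": 31}, {"user": "u2", "age": 45}]
PAGE_VIEWS = [{"user": "u1", "page": "/buy"}, {"user": "u2", "page": "/home"}]
SALES = [{"user": "u1", "amount": 10}]


def join_demographics(rows: List[Dict], users: List[Dict]) -> List[Dict]:
    """Job: return new rows enriched with the user's age."""
    by_user = {u["user"]: u for u in users}
    return [{**row, "age": by_user[row["user"]]["age"]} for row in rows]


def conversion(views: List[Dict], sales: List[Dict]) -> float:
    """Job: fraction of viewing users who also bought something."""
    viewers = {v["user"] for v in views}
    buyers = {s["user"] for s in sales}
    return len(viewers & buyers) / len(viewers) if viewers else 0.0


def run_pipeline(users: List[Dict], page_views: List[Dict], sales: List[Dict]) -> Dict:
    """The whole DAG as one function of the raw inputs."""
    views_demo = join_demographics(page_views, users)
    sales_demo = join_demographics(sales, users)
    return {
        "views_with_demographics": views_demo,
        "sales_with_demographics": sales_demo,
        "conversion_analytics": conversion(views_demo, sales_demo),
    }


result = run_pipeline(USERS, PAGE_VIEWS, SALES)
```

Generated datasets live only in `result`, so they can be discarded and regenerated from the raw data at any time.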
[Figure labels: Users; Page views; Sales with demographics; Views with demographics; Conversion analytics]
12. Lambda architecture, part 1
Save all collected data without preprocessing
But timestamp on generation, register, arrival
Rerun everything downstream on code change
Human fault tolerance
In conflict with privacy management?
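One possible shape for a raw record that is saved without preprocessing but carries the three timestamps. The class and field names are assumptions, and reading "register" as the time a collection service first recorded the event is an interpretation, not something the slide states.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class RawEvent:
    """Collected data kept as-is, plus three timestamps (names are hypothetical)."""
    payload: bytes            # unparsed, unmodified original data
    generated_at: datetime    # when the event happened at the source
    registered_at: datetime   # assumed: when a collection service recorded it
    arrived_at: datetime      # when it reached long-term storage


def arrival_delay_seconds(event: RawEvent) -> float:
    """How long the event took from generation to arrival in storage."""
    return (event.arrived_at - event.generated_at).total_seconds()


event = RawEvent(
    payload=b'{"page": "/home"}',
    generated_at=datetime(2015, 6, 1, 12, 0, 0, tzinfo=timezone.utc),
    registered_at=datetime(2015, 6, 1, 12, 0, 30, tzinfo=timezone.utc),
    arrived_at=datetime(2015, 6, 1, 12, 1, 30, tzinfo=timezone.utc),
)
delay = arrival_delay_seconds(event)
```

Keeping all three lets downstream jobs choose which notion of time to aggregate on when they are rerun.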
13. Pipeline workflow orchestration
Ideally: Good old make + cluster + IDE + xUnit
Test end-to-end
Rebuild on upstream changes (but not all)
State of practice: Luigi, Pinball, Azkaban
Don’t take you all the way :-(
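A rough sketch of the make-like behaviour "rebuild on upstream changes (but not all)", using the standard library's `graphlib` (Python 3.9+). The DAG contents and the `rebuild_plan` function are hypothetical, and this is not how Luigi, Pinball or Azkaban work internally.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

# Hypothetical DAG: derived dataset -> the upstream datasets it reads.
DAG: Dict[str, Set[str]] = {
    "views_with_demographics": {"users", "page_views"},
    "sales_with_demographics": {"users", "sales"},
    "conversion_analytics": {"views_with_demographics", "sales_with_demographics"},
}


def rebuild_plan(dag: Dict[str, Set[str]], changed: Set[str]) -> List[str]:
    """Return the derived datasets to rebuild, in dependency order."""
    stale = set(changed)
    plan: List[str] = []
    # static_order() yields every node after all of its upstream nodes.
    for job in TopologicalSorter(dag).static_order():
        upstream = dag.get(job, set())
        if job in stale or upstream & stale:
            stale.add(job)
            if job in dag:  # raw inputs are never rebuilt, only derived data
                plan.append(job)
    return plan


plan_after_view_change = rebuild_plan(DAG, {"page_views"})
```

Here a change to `page_views` leaves `sales_with_demographics` untouched, which is the "but not all" part.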
14. Lambda architecture, part 2
Parallel batch and real-time pipelines
Batch more accurate, overrides
Real-time for window of recent data
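A toy illustration of how the two pipelines might be combined at query time, assuming both produce counts keyed by hour bucket. The function name and the data are made up.

```python
from typing import Dict


def merged_view(batch_view: Dict[int, int], realtime_view: Dict[int, int]) -> Dict[int, int]:
    """Combine per-hour counts; where both have a value, batch overrides."""
    merged = dict(realtime_view)  # start from the recent, approximate data
    merged.update(batch_view)     # the more accurate batch results win
    return merged


# Hypothetical per-hour counts. Hour 11 exists in both pipelines.
batch = {10: 100, 11: 120}
realtime = {11: 118, 12: 40}
serving = merged_view(batch, realtime)
```

The real-time value is only served for hours the batch pipeline has not yet covered.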
15. Obtaining data
Log things. Conceptually stable, but collection is challenging at scale.
Have legacy code and master data in databases? Let us have a look.
16. Database dimensioned for online traffic
Hadoop = herd of elephants
Load spike
Height = #mapper nodes
Area = #users
Anti-pattern: direct dump
17. Direct dumps in the trenches
Company successful - #users increasing
More Sqoop mappers - higher DB load
Daily dump jobs went to 25h
Devops firewalled off Hadoop to recover
18. Anti-pattern: dump through API
SOA/microservice culture
DB protected by throttling
API not used to elephants
Query area is still large
Herd of elephants through gate - 1-2 weeks
20. All dumps are non-deterministic
HDFS down? Dump later.
State is gone - dump not accurate
Slave replication down?
Dump not accurate
21. Anti-pattern: deterministic mirror
Replay commit log until full day/hour
Discovered through archaeology :-)
Not scalable, point of failure
Hourly dump took 45 minutes, increasing...
22. (Anti-)pattern: better dumping
Netflix Aegisthus
Snapshot Cassandra (fast, atomic, reliable)
Transfer SSTables to HDFS
Replicate compaction in MapReduce
Other DBs? Depends on atomic snapshot.
23. All dumps are anti-patterns?
Typical use: Join activity events with user info
Event time != dump time
Aggregation discards information
Which users enabled X, tried, and disabled?
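To make the last question concrete, here is a small sketch with invented events. It contrasts what a dump of current state can say with what the event history can say. The event names and both functions are hypothetical.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical events: (timestamp, user, action).
EVENTS: List[Tuple[int, str, str]] = [
    (1, "u1", "enable_x"),
    (2, "u1", "use_x"),
    (3, "u1", "disable_x"),
    (4, "u2", "enable_x"),
    (5, "u3", "enable_x"),
    (6, "u3", "disable_x"),
]


def snapshot(events: List[Tuple[int, str, str]]) -> Dict[str, bool]:
    """What a dump sees: only whether X is currently enabled per user."""
    state: Dict[str, bool] = {}
    for _ts, user, action in sorted(events):
        if action == "enable_x":
            state[user] = True
        elif action == "disable_x":
            state[user] = False
    return state


def enabled_tried_disabled(events: List[Tuple[int, str, str]]) -> Set[str]:
    """Users whose history contains enable, then use, then disable, in order."""
    wanted = ["enable_x", "use_x", "disable_x"]
    progress: Dict[str, int] = {}
    for _ts, user, action in sorted(events):
        step = progress.get(user, 0)
        if step < len(wanted) and action == wanted[step]:
            progress[user] = step + 1
    return {user for user, step in progress.items() if step == len(wanted)}


dump = snapshot(EVENTS)
answer = enabled_tried_disabled(EVENTS)
```

In the dump, `u1` and `u3` look identical (both disabled). Only the event history shows that `u1` actually tried the feature first.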
24. Pattern: Event source
All facts are events. Immutable, timestamped
Event stream is source of truth
No explicit “current state”
The functional data architecture?
25. Event source incarnated: unified log
Pour events into pub/sub bus, with long history.
Kafka de-facto standard.
Tap from bus to HDFS/S3 in time buckets.
Camus/Secor
Stream processing pipelines to dest topics
Replay on code changes
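The "time buckets" step could look roughly like the following. The `YYYY-MM-DD/HH` key layout and the function names are assumptions for illustration and are not the layout Camus or Secor actually write.

```python
from collections import defaultdict
from datetime import datetime, timezone
from typing import Dict, Iterable, List, Tuple


def bucket_key(ts: datetime) -> str:
    """Hourly bucket name in UTC (hypothetical layout). Expects an aware datetime."""
    return ts.astimezone(timezone.utc).strftime("%Y-%m-%d/%H")


def bucket_events(events: Iterable[Tuple[datetime, str]]) -> Dict[str, List[str]]:
    """Group (timestamp, payload) events by the hour they belong to."""
    buckets: Dict[str, List[str]] = defaultdict(list)
    for ts, payload in events:
        buckets[bucket_key(ts)].append(payload)
    return dict(buckets)


# Made-up events on either side of an hour boundary.
events = [
    (datetime(2015, 6, 1, 10, 59, tzinfo=timezone.utc), "a"),
    (datetime(2015, 6, 1, 11, 0, tzinfo=timezone.utc), "b"),
    (datetime(2015, 6, 1, 11, 30, tzinfo=timezone.utc), "c"),
]
buckets = bucket_events(events)
```

Because bucketing depends only on the event timestamps, replaying the same events after a code change lands them in the same buckets.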
26. Unified log, practical considerations
Long history necessary
Must have time to fix stream process bugs
Use 3+ months and use stream as temp DB
Unified log also useful for meta and control
Tweak Kafka for low latency
27. Event source + views
View = snapshot of aggregated state @ time
For ETL, choice of hourly/daily aggregates or exact views
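A view as "snapshot of aggregated state @ time" can be sketched as a fold over the events up to that time. The event shape (timestamp, user, delta) and the name `view_at` are assumptions.

```python
from typing import Dict, List, Tuple

# Hypothetical events: (timestamp, user, change in some per-user counter).
Event = Tuple[int, str, int]


def view_at(events: List[Event], at: int) -> Dict[str, int]:
    """Aggregate all events with timestamp <= at into per-user totals."""
    totals: Dict[str, int] = {}
    for ts, user, delta in sorted(events):
        if ts > at:
            break  # events are sorted, so everything after this is later
        totals[user] = totals.get(user, 0) + delta
    return totals


log = [(1, "u1", 5), (2, "u2", 3), (3, "u1", -2)]
view_t2 = view_at(log, 2)
view_t3 = view_at(log, 3)
```

Any number of views at different times can be derived from the same log without the log itself changing.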
[Figure labels: Logs; View; View]
28. Event source + database
Business logic may demand “current state”
Event stream is truth, keep DB in sync
29. Event source, synced database
A. Service interface generates events and DB transactions
B. Generate stream from DB commit log. Postgres, MySQL -> Kafka
C. Build DB with stream processing
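Option C can be sketched with an in-memory SQLite database standing in for the real one. The event types, the table schema and `build_db` are hypothetical, and a real setup would consume from the stream continuously rather than from a list.

```python
import sqlite3
from typing import Dict, List

# Hypothetical event stream, ordered by offset.
EVENTS: List[Dict] = [
    {"offset": 0, "type": "user_created", "user": "u1", "email": "a@example.com"},
    {"offset": 1, "type": "user_created", "user": "u2", "email": "c@example.com"},
    {"offset": 2, "type": "email_changed", "user": "u1", "email": "b@example.com"},
    {"offset": 3, "type": "user_deleted", "user": "u2"},
]


def build_db(events: List[Dict]) -> sqlite3.Connection:
    """Derive a 'current state' table purely from the event stream."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE users (user TEXT PRIMARY KEY, email TEXT)")
    for e in sorted(events, key=lambda ev: ev["offset"]):
        if e["type"] == "user_created":
            db.execute("INSERT OR REPLACE INTO users VALUES (?, ?)", (e["user"], e["email"]))
        elif e["type"] == "email_changed":
            db.execute("UPDATE users SET email = ? WHERE user = ?", (e["email"], e["user"]))
        elif e["type"] == "user_deleted":
            db.execute("DELETE FROM users WHERE user = ?", (e["user"],))
    db.commit()
    return db


db = build_db(EVENTS)
rows = db.execute("SELECT user, email FROM users ORDER BY user").fetchall()
```

Since the database is a function of the stream, it can be dropped and rebuilt by replaying the events.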
30. Deployment & orchestration
System = many machines
Desired system state = code + config
Actual state = Orchestrator(current, desired)
32. Stateful orchestration in the trench
Desired = { case roleA: install(x,y)
case roleB: install(z) }
Current = x installed on roleB. Old x. Zombie woke up when B load decreased.
Puppet+apt = No simple way to remove undesired state
33. Pattern: artifacts from source
Orchestrator = Docker|Packer {
delete current
return Image(desired)
}
No state leak from existing state. Sort of.
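A toy model of the difference, with machine state reduced to a set of installed packages. Both functions are illustrative assumptions. They do not describe how Puppet, apt, Docker or Packer behave in detail.

```python
from typing import FrozenSet


def stateful_converge(current: FrozenSet[str], desired: FrozenSet[str]) -> FrozenSet[str]:
    """Install-only orchestration: adds what is missing, never removes."""
    return current | desired


def rebuild_from_source(_current: FrozenSet[str], desired: FrozenSet[str]) -> FrozenSet[str]:
    """Image-style orchestration: the existing state is discarded entirely."""
    return frozenset(desired)


# Hypothetical roleB machine with a leftover old package.
current_role_b = frozenset({"x-old"})
desired_role_b = frozenset({"z"})

after_stateful = stateful_converge(current_role_b, desired_role_b)
after_rebuild = rebuild_from_source(current_role_b, desired_role_b)
```

In the first case the result is a function of (current, desired), so the leftover survives. In the second it is a function of desired only.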
34. Deterministic, predictable?
Image building leaky on purpose
E.g. “apt-get update && apt-get install”
Imports external state
Ephemeral databases preserve state
Ability to rebuild from unified log is valuable
35. Jay Kreps, Confluent: Unified log
Martin Kleppmann: Unified log, Bottled Water
Nathan Marz: Lambda
Sander Mak @ Jfokus: Event sourcing
Datomic
Questions?
More?