Pravega is a storage substrate designed and implemented from the ground up to meet the requirements of modern data analytics applications. Its architecture lets applications store stream data permanently while still processing that data with low latency. Permanent storage of stream data is highly desirable because it enables applications to process such data both in near real time and months later, in the same way and through the same interface. Writes to Pravega can be transactional, so an application can make data visible for reading atomically; this is important for guaranteeing exactly-once semantics when writing data into Pravega. As the volume of writes on a stream can grow and shrink over time, Pravega lets a stream's capacity adapt by scaling streams automatically: the auto-scaling feature increases and decreases the number of segments of a stream according to load, where a segment is the unit of parallelism of a Pravega stream. Pravega by itself does not provide any capability to process data, however; it requires an engine with advanced data analytics features, such as Apache Flink, to process its streams. To connect Pravega and Flink, we have been working with the community to implement a source and a sink. The source uses Pravega checkpoints to save the state of incoming streams and guarantees exactly-once semantics, while not requiring a sticky assignment of source workers to segments. Pravega checkpoints integrate and align well with Flink checkpoints, making this a unique approach compared to existing sources. As scaling is a key feature of Pravega, we also envision Pravega emitting scaling signals to the source workers to indicate that a job needs to scale up or down, a feature currently under development and [...]
10. A typical architecture
[Diagram: Source → Messaging substrate → Stream data processor, over one or more stages]
Messaging substrate:
• Ingests and buffers data
• Decouples the source from the engine processing the data
11. A typical architecture
[Diagram: Source → Messaging substrate → Stream data processor → Messaging substrate → Stream data processor, over multiple stages]
Messaging substrate:
• Ingests and buffers data
• Decouples the source from the engine processing the data
Limitations:
• Data stored temporarily
• Not able to store an unbounded amount of stream data
12. A typical architecture
[Diagram: Source → Messaging substrate → Stream data processor → Messaging substrate → Stream data processor, over multiple stages, with a separate bulk data store serving historical reads]
Messaging substrate:
• Ingests and buffers data
• Decouples the source from the engine processing the data
Limitations:
• Data stored temporarily
• Not able to store an unbounded amount of stream data
• Separate bulk store for historical reads
13. Target of Pravega is a stream store able to:
• Store stream data permanently
• Preserve order
• Accommodate unbounded streams
27. Scaling a stream
[Diagram, two steps:]
1. Stream has one segment
2. Say the input load has increased and we need more parallelism:
   • Seal the current segment
   • Create new ones
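How aggressively a stream splits and merges segments is governed by its scaling policy. As a rough illustration, the sketch below creates a stream that scales by event rate; the endpoint, names, and thresholds are placeholders, and the builder-style calls reflect the Pravega client API as best I recall it, so treat this as a sketch rather than the definitive interface.

import java.net.URI;
import io.pravega.client.admin.StreamManager;
import io.pravega.client.stream.ScalingPolicy;
import io.pravega.client.stream.StreamConfiguration;

public class CreateAutoScalingStream {
    public static void main(String[] args) {
        // Placeholder controller endpoint and names.
        URI controller = URI.create("tcp://127.0.0.1:9090");
        try (StreamManager manager = StreamManager.create(controller)) {
            manager.createScope("myScope");
            // Target ~1000 events/s per segment; split hot segments in two,
            // never shrink below one segment.
            StreamConfiguration config = StreamConfiguration.builder()
                    .scalingPolicy(ScalingPolicy.byEventRate(1000, 2, 1))
                    .build();
            manager.createStream("myScope", "myStream", config);
        }
    }
}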
37. Events
• Internally, Pravega is all about bytes
• Current API focused on events
  • Some encapsulation of application bytes
  • Serializer interface
public interface Serializer<T> {
    /**
     * Serializes the given event.
     *
     * @param value The event to be serialized.
     * @return The serialized form of the event.
     *         NOTE: buffers returned should not exceed {@link #MAX_EVENT_SIZE}.
     */
    ByteBuffer serialize(T value);

    /**
     * Deserializes the given ByteBuffer into an event.
     *
     * @param serializedValue An event that has been previously serialized.
     * @return The event object.
     */
    T deserialize(ByteBuffer serializedValue);
}
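To make the contract concrete, here is a minimal Serializer implementation that round-trips UTF-8 strings. The class name is mine, and the client library ships comparable ready-made serializers, so this is only meant to show the shape of the interface.

import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Illustrative only: a Serializer<String> that round-trips UTF-8 bytes.
public class Utf8Serializer implements Serializer<String> {
    @Override
    public ByteBuffer serialize(String value) {
        return ByteBuffer.wrap(value.getBytes(StandardCharsets.UTF_8));
    }

    @Override
    public String deserialize(ByteBuffer serializedValue) {
        byte[] bytes = new byte[serializedValue.remaining()];
        serializedValue.get(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }
}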
38. EventStreamWriter API
String scope = "myScope";
WriterConfig config = new WriterConfig();
String streamName = "myStream";
ClientFactory factory = ClientFactory.withScope(scope, new URI("//demo.pravega.io:3333"));
EventStreamWriter<String> writer = factory.createEventWriter(streamName, serializer, config);

while (!worldEnd) {
    /* E.g., getNewRecord() reads the next line in a file */
    String record = getNewRecord();
    String key = extractKey(record);
    String event = extractEvent(record);
    writer.writeEvent(key, event);
}
39. Using the EventStreamReader API
String scope = "myScope";
String myReaderId = "myId";
String myReadGroup = "myGroup";
ReaderConfig config = new ReaderConfig(props);
String streamName = "myStream";
long timeout = 1000; // milliseconds to wait for the next event
ClientFactory factory = ClientFactory.withScope(scope, new URI("//127.0.0.1:3333"));
EventStreamReader<String> reader = factory.createEventReader(myReaderId,
                                                             myReadGroup,
                                                             serializer,
                                                             config);
while (!worldEnd) {
    EventRead<String> event = reader.readNextEvent(timeout);
    // The application consumes the event and is supposed to persist the Position
    // object to guarantee exactly-once semantics.
    consumeEvent(event.getEvent(), event.getPosition());
}
reader.close();
43. Guarantees on the write path
• Order
  • Writer appends in application order
• Duplicates
  • Writer IDs map to the last appended data on the segment store
  • Writer does not persist its ID, so a crash needs more than the reconnect mechanism
• Txns (see the sketch below)
  • Atomicity at the stream level
  • If anything goes wrong with the writes, either abort or let the txn time out
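A minimal sketch of the txn path, assuming a beginTxn-style API on the event writer and a Transaction handle with writeEvent/commit/abort; the exact method signatures may differ from the actual client.

// Sketch only: write two events atomically, aborting on failure.
Transaction<String> txn = writer.beginTxn();
try {
    txn.writeEvent(key, event1);
    txn.writeEvent(key, event2);
    txn.commit();   // both events become visible to readers atomically
} catch (Exception e) {
    txn.abort();    // or simply let the controller time the txn out
}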
44. Avoiding duplicates – Reconnect
[Diagram, four panels: a writer with ID 10 appends to the segment store, which tracks the last event number per writer ID. (1) Append acked; store records 10: 1. (2) Store records 10: 2. (3) The writer reconnects and sends Setup 10; the store replies 10: 2, the last event number it has for that writer. (4) The writer resumes from there and the append is acked, so nothing is duplicated.]
45. Avoiding duplicates – Transactional writes
[Diagram, four panels: a writer with ID 10 asks the controller to beginTxn, appends the transaction data to the segment store (store records 10: 1), and receives acks. If the writer fails before committing, the controller eventually times out and aborts the txn.]
46. Avoiding duplicates – Transactional writes
[Diagram, continued: a writer with ID 26 asks the controller to beginTxn, appends into the new transaction on the segment store (store records 26: 1), and receives acks; only committed txn data ever becomes visible.]
48. Reader groups
• Group of event readers (set-up sketched after this list)
  • Read events from a set of streams
  • Load distributed across readers of the group
• Segments
  • A given reader reads from a set of segments
  • Coordination of segment assignment done via a state synchronizer
• State synchronizer
  • General facility for synchronizing state across processes
  • Uses a revisioned Pravega segment
  • Advanced topic: not explained further in this talk
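For orientation, creating such a group could look roughly like the sketch below. The ReaderGroupManager and ReaderGroupConfig names follow my understanding of the Pravega client, and the scope, group, and stream names are placeholders.

import java.net.URI;
import io.pravega.client.admin.ReaderGroupManager;
import io.pravega.client.stream.ReaderGroupConfig;
import io.pravega.client.stream.Stream;

// Sketch: create a reader group over one stream; API shape assumed.
URI controller = URI.create("tcp://127.0.0.1:9090");
try (ReaderGroupManager manager = ReaderGroupManager.withScope("myScope", controller)) {
    ReaderGroupConfig groupConfig = ReaderGroupConfig.builder()
            .stream(Stream.of("myScope", "myStream"))
            .build();
    manager.createReaderGroup("myGroup", groupConfig);
}
// Each reader then joins with factory.createEventReader(readerId, "myGroup", ...)
// as on slide 39, and the group balances segments across the joined readers.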
53. Resetting to a checkpoint
[Diagram, three panels: (1) Reset to checkpoint C1: Pravega holds segments 1 and 2, each read by one of two readers. (2) Reinitialize the readers. (3) The readers resume from C1; the segment-to-reader assignment can be different from last time.]
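A rough sketch of driving this cycle from the application, assuming initiateCheckpoint/resetReadersToCheckpoint-style methods on the reader group (names and signatures per my reading of the client API):

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import io.pravega.client.stream.Checkpoint;
import io.pravega.client.stream.ReaderGroup;

// Sketch: take checkpoint C1, then later roll the whole group back to it.
ScheduledExecutorService executor = Executors.newScheduledThreadPool(1);
ReaderGroup group = readerGroupManager.getReaderGroup("myGroup");
Checkpoint c1 = group.initiateCheckpoint("C1", executor).join();
// ... persist c1, keep reading; on failure or replay, reset:
group.resetReadersToCheckpoint(c1);
// Readers reinitialize from C1; segment assignment may differ from last time.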
55. Take-away messages
• Pravega is all about
  • Unbounded stream data
  • Permanently stored
  • Elasticity for streams
  • Scaling producers and consumers independently
• Under active development
• Looking at first use cases
56. Ongoing work
• Performance tuning
• Scaling support
• Event-time support
• Geo-distribution
• Security
• … and much more
http://pravega.io
E-mail: fpj@apache.org
Twitter: @fpjunqueira
Join the community!
61. Reading via ReaderGroup
[Diagram: segments A–G in Pravega are assigned to Readers running inside Flink Task Managers; the Controller & Reader Group manage the assignment.]
• Readers do not choose their own segments
• ReaderGroup automatically assigns and re-balances segments
• Leaving the ReaderGroup in charge is key to automatic scaling
62. Reading via ReaderGroup
[Diagram: segments A–D, F, and G are assigned to Readers in the Task Managers while segment E is being re-assigned by the Controller & Reader Group.]
• At points in time, some segments may not be assigned to any reader
• Example: new segments, re-balancing segments, …
63. Flink + Pravega Checkpoints
[Diagram: Flink Task Managers read segments A–F from Pravega (segment G is being assigned); the Job Manager's Checkpoint-Coordinator tracks read positions and uses a checkpoint store (HDFS, S3, NFS, …).]
64. Flink + Pravega Checkpoints
[Diagram, checkpoint flow: (1) the Job Manager's Checkpoint-Coordinator triggers a checkpoint in Pravega at the current read positions; (2a) Pravega persists segment checkpoint data at the checkpoint positions and (2b) injects a barrier into the streams read by the Task Managers; (3) operators take snapshots; (4) the checkpoint metadata is stored in the checkpoint store (HDFS, S3, NFS, …). Segment G is being assigned.]
66. The FlinkPravegaWriter
• Regular Flink SinkFunction (see the sketch below)
• No partitioner, but a "routing key"
  • Remember: no partitions in Pravega
  • Just dynamically created segments
• Same key always goes to the same segment
  • Order of elements guaranteed per key!
[Diagram: the Flink application writes to Pravega nodes, which route events to segments 1–4 by routing key.]
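Wiring the sink into a job could look roughly like this; the FlinkPravegaWriter constructor arguments are assumptions for illustration (the connector's real builder may differ), and extractKey is a placeholder routing-key function.

import java.net.URI;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

// Sketch: attach the Pravega sink to a Flink pipeline; constructor shape assumed.
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
DataStream<String> events = env.socketTextStream("localhost", 9999);

events.addSink(new FlinkPravegaWriter<>(
        URI.create("tcp://127.0.0.1:9090"),   // Pravega controller (placeholder)
        "myScope", "myStream",
        new Utf8Serializer(),                 // serializer from the earlier sketch
        event -> extractKey(event)));         // routing key: same key -> same segment

env.execute("pravega-sink-example");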
67. Exactly-once via Transactions
• Similar to a distributed 2-phase commit
• Coordinated by asynchronous checkpoints, no voting delays
• Basic algorithm (sketched below):
  • Between checkpoints: produce into a transaction
  • On operator snapshot: flush the local transaction (vote-to-commit)
  • On checkpoint complete: commit transactions
  • On recovery: check and commit any pending transactions
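The algorithm maps naturally onto Flink's checkpoint hooks. Below is a minimal, hypothetical sketch of that mapping; the Txn/TxnClient types are invented to keep the example self-contained, and the real FlinkPravegaWriter handles recovery and failure cases in more depth.

import org.apache.flink.runtime.state.CheckpointListener;
import org.apache.flink.runtime.state.FunctionInitializationContext;
import org.apache.flink.runtime.state.FunctionSnapshotContext;
import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;

// Hypothetical transactional client, standing in for Pravega's txn API.
interface Txn { void write(String event); void flush(); void commit(); void abort(); }
interface TxnClient { Txn beginTxn(); }

public class TwoPhaseSink extends RichSinkFunction<String>
        implements CheckpointedFunction, CheckpointListener {

    private transient TxnClient client;   // initialization omitted in this sketch
    private transient Txn currentTxn;     // receives data between checkpoints
    private transient Txn pendingTxn;     // flushed, awaiting commit

    @Override
    public void invoke(String value) {
        currentTxn.write(value);          // between checkpoints: produce into txn
    }

    @Override
    public void snapshotState(FunctionSnapshotContext ctx) {
        currentTxn.flush();               // vote-to-commit
        pendingTxn = currentTxn;
        currentTxn = client.beginTxn();   // fresh txn for the next period
    }

    @Override
    public void notifyCheckpointComplete(long checkpointId) {
        pendingTxn.commit();              // checkpoint complete: commit
    }

    @Override
    public void initializeState(FunctionInitializationContext ctx) {
        // On recovery: check and commit any pending transactions (omitted),
        // then open a fresh transaction for new data.
        currentTxn = client.beginTxn();
    }
}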
73. Looking ahead…
• Automatic scaling: Flink follows Pravega's scaling (at least the first stage)
• High availability through Pravega: use synchronizers instead of ZooKeeper (leader election, distributed atomic counters, …)