Safety is one of the most crucial concerns for Uber's ride-sharing platform. To respond to safety issues more quickly, Uber runs a Flink pipeline that joins multiple high-volume (>10 TB/day) streams of sensor data with trip information to extract contextual features. On top of these features, a TensorFlow deep learning model deployed inside Flink detects potential car crashes and identifies general trip and driving anomalies. The results are passed to business operations teams, who proactively reach out to riders and drivers to check on their experience and provide prompt safety assistance if needed.
5. Agenda
■ A First Model
■ Choosing and Integrating Flink
■ 1st Iteration: A Modular, Light Topology
■ 2nd Iteration: On a Reusable Sensor Platform
■ 3rd Iteration: On-Trip Detection
8. Trip Context Features
● Distance from dropoff to destination
● Rider/Driver cancellation vs. normal trip completion
● Overall length of trip (time, distance)
● Location context (highway, movie theatre, airport, etc.)
10. Building a Model
● Uber has very high-accuracy labels
● Extremely imbalanced dataset
● Model trained using Apache Spark
● Can host the model for streaming scoring using a platform called Michelangelo
11. Agenda
■ A First Model
■ Choosing and Integrating Flink
■ 1st Iteration: A Modular, Light Topology
■ 2nd Iteration: On a Reusable Sensor Platform
■ 3rd Iteration: On-Trip Detection
12. Why Flink?
● Uber migrated from Samza to Flink
● Rich API: keyBy, join, window, etc.
● Supports batch processing
● Exactly-once guarantees
13. Uber Infrastructure: Schema’d Kafka
● Many Kafka topics at Uber have enforced schemas
● Centralized schema registry that stores Avro schemas
● Wrote a custom, config-driven SourceFunction/SinkFunction that loads into and out of generated Java classes
(Config excerpt showing keys such as topic-name and feature-name.)
// Attach the config-driven source for this topic, with Avro type
// information so Flink can (de)serialize the generated Java class.
DataStream<T> inputStream = env.addSource(
    (SourceFunction<T>) getInputs().get(topicName),
    new AvroTypeInfo<>(tClass)
);
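The sink side is symmetric. A minimal sketch, assuming a getOutputs() registry analogous to the getInputs() call shown above (the registry and outputTopicName are hypothetical names):

// Hypothetical sink-side counterpart: write the Avro-generated records
// back out to a schema'd Kafka topic via the config-driven registry.
outputStream.addSink((SinkFunction<T>) getOutputs().get(outputTopicName));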
14. Uber Infrastructure: M3 Metrics
● In-house, open-source metrics system, M3, roughly compatible with Prometheus
● Implemented a custom MetricReporter, lightly adapted from Flink's PrometheusReporter
● A Prometheus scraper then ingests into Uber's metrics system
● Utilize M3 with our internal alerting and monitoring
(Slide shows an example query in the M3 query language.)
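A minimal sketch of such a reporter against Flink's standard MetricReporter interface; the class name and forwarding details are hypothetical, and the real reporter is adapted from Flink's PrometheusReporter:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import org.apache.flink.metrics.Metric;
import org.apache.flink.metrics.MetricConfig;
import org.apache.flink.metrics.MetricGroup;
import org.apache.flink.metrics.reporter.MetricReporter;

// Minimal sketch: track registered metrics under their full identifiers so
// a Prometheus-style scrape endpoint can read them and forward to M3.
public class M3StyleReporter implements MetricReporter {
    private final Map<String, Metric> metrics = new ConcurrentHashMap<>();

    @Override
    public void open(MetricConfig config) {
        // e.g. read the scrape port and common tags from the reporter config
    }

    @Override
    public void close() {
        // shut down the scrape endpoint
    }

    @Override
    public void notifyOfAddedMetric(Metric metric, String metricName, MetricGroup group) {
        metrics.put(group.getMetricIdentifier(metricName), metric);
    }

    @Override
    public void notifyOfRemovedMetric(Metric metric, String metricName, MetricGroup group) {
        metrics.remove(group.getMetricIdentifier(metricName));
    }
}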
15. Agenda
■ A First Model
■ Choosing and Integrating Flink
■ 1st Iteration: A Modular, Light Topology
■ 2nd Iteration: On a Reusable Sensor Platform
■ 3rd Iteration: On-Trip Detection
16. Sensor Data at Uber
● GPS: points sent up one at a time; 0.5 Hz, latitude/longitude/speed, ~3 TB/day
● Accelerometer: ~5-minute batched payloads; 25 Hz, 3 dimensions, ~10 TB/day
● Uber operates tens of millions of trips daily
● Sensor data is MBs per trip
(Diagram: batched payloads of varying length, e.g. 5, 3, and 6 minutes, landing in Hive and Cassandra.)
17. Joining TBs of Sensor Streams
● Managing state is difficult; state is sensitive to failures
● Trade-offs between state size and data coverage
● Focus on reducing stream joins
(Diagram: how do we join GPS, points sent up one or two at a time, 0.5 Hz, latitude/longitude/speed, ~3 TB/day, with accelerometer, ~5-minute payloads, 25 Hz, 3 dimensions, ~10 TB/day?)
19. Condensing Prior to Trip Joins/Aggregations
● Accelerometer payloads (~10 TB) → Detect Spikes → Aggregate Spikes by Trip (~60 GB)
● Trip GPS (~3 TB) → Detect Stops (~1 GB)
● Join Stops and Spikes by Trip
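A minimal Flink sketch of the condensation step; the AccelPayload, Reading, Spike, and TripSpikes types, the threshold, and the aggregator are all hypothetical stand-ins for the real detector:

// Turn ~10 TB/day of raw 25 Hz payloads into a small stream of spike records.
DataStream<Spike> spikes = accelPayloads
    .flatMap((AccelPayload p, Collector<Spike> out) -> {
        for (Reading r : p.getReadings()) {
            if (r.getMagnitude() > SPIKE_THRESHOLD) {  // hypothetical threshold
                out.collect(new Spike(p.getTripId(), r.getTimestamp(), r.getMagnitude()));
            }
        }
    })
    .returns(Spike.class);

// Aggregate the condensed spikes per trip before any cross-stream join.
DataStream<TripSpikes> spikesByTrip = spikes
    .keyBy(Spike::getTripId)
    .window(EventTimeSessionWindows.withGap(Time.minutes(30)))  // gap is illustrative
    .aggregate(new SpikeAggregator());  // hypothetical AggregateFunction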
20. A Modular Post-Trip Crash Detection Topology
(Diagram: accelerometer payloads flow through Detect Spikes and Aggregate Spikes by Trip; a trip-end-event Kafka topic drives Fetch GPS Route via the Location Service, Detect Stops, and Fetch Trip Context via the Trip Service; Stops and Spikes are joined, Scored by Model on Michelangelo, the machine learning platform, and sent to the RideCheck service.)
Why so many jobs?
● Resource isolation
● "Paper trail"/debuggability
● Reuse intermediate features
● Facilitates cross-team collaboration
21. The Power of Flink: Joining by Trip ID
● Use SessionWindow
● First ensure that both streams are deduplicated by trip ID
● The configured "gap" roughly acts as an expiry time
● Shows the power of windows in Flink (see the sketch below):
○ Triggered the moment both sides have arrived, immediately freeing state
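A minimal sketch of the trip-ID join, assuming the hypothetical Stop, TripSpikes, and ScoringInput types from earlier sketches. The standard session-window join below shows the shape; on top of it, a custom Trigger can fire as soon as one element from each deduplicated side is present:

// Join the two condensed, deduplicated streams by trip ID. The session gap
// doubles as an expiry: if one side never arrives, the window and its state
// are discarded once the gap elapses.
DataStream<ScoringInput> joined = stops
    .join(spikesByTrip)
    .where(Stop::getTripId)
    .equalTo(TripSpikes::getTripId)
    .window(EventTimeSessionWindows.withGap(Time.minutes(30)))  // gap is illustrative
    .apply((JoinFunction<Stop, TripSpikes, ScoringInput>) ScoringInput::new);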
22. Agenda
■ A First Model
■ Choosing and Integrating Flink
■ 1st Iteration: A Modular, Light Topology
■ 2nd Iteration: On a Reusable Sensor Platform
■ 3rd Iteration: On-Trip Detection
23. Platformizing: When Use Cases Diverge
● Example: stop detection is defined differently for RideCheck's crash model than for trip-anomaly detection
● Different products have different criteria for data latency, data quality, precision, and definitions of contextual features
● The same feature may be engineered differently for different applications
24. Platformizing: Demand for Sensor Data
● The effort of joining large data streams to add context is not unique
● Other use cases: fraud detection, ETA calculation
● Example aggregations: per trip, rider/driver match, geolocation (street segments, region), time
25. Adding Sensor Embeddings to the Model
● Use deep learning to learn features from raw sensor data:
○ GPS
○ Accelerometer
○ Gyroscope
● Produce a 100-dimension embedding
● Add the output as features for the existing model
● TensorFlow sub-model runs within Flink
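A minimal sketch of hosting a TensorFlow sub-model inside a Flink operator via the TensorFlow Java API; the model path, tensor names, and output shape are hypothetical:

import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;
import org.tensorflow.SavedModelBundle;
import org.tensorflow.Tensor;

// Load the SavedModel once per task in open(), then score per element.
public class EmbeddingMapper extends RichMapFunction<float[][], float[]> {
    private transient SavedModelBundle model;

    @Override
    public void open(Configuration parameters) {
        model = SavedModelBundle.load("/models/sensor_embedding", "serve");  // hypothetical path
    }

    @Override
    public float[] map(float[][] sensorWindow) {
        try (Tensor<?> input = Tensor.create(sensorWindow);
             Tensor<?> output = model.session().runner()
                 .feed("sensor_input", input)   // hypothetical input tensor name
                 .fetch("embedding")            // hypothetical output: a flat [100] vector
                 .run().get(0)) {
            float[] embedding = new float[100]; // 100-dimension embedding per the slide
            output.copyTo(embedding);
            return embedding;
        }
    }

    @Override
    public void close() {
        if (model != null) {
            model.close();
        }
    }
}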
26. Sensor Trip Aggregation
● A few things make this more feasible:
○ More demand for clean, convenient sensor data from other teams within Uber
○ Reliable GPS now included in batched sensor payloads
● Time for a Sensor Platform that does the aggregation once for everybody
● Unlocks:
○ Full-trip raw data analysis
○ Easy use of trip context data
○ Data quality guarantees
27. Consolidated Crash Detection
(Diagram, "Consolidation": accel/gyro/GPS payloads and trip-event Kafka topics feed a Trip Aggregation job, which emits per-trip sensor data and trip events; a single consolidated job extracts stops, spikes, and embeddings, fetches trip context from the Trip Service, is scored by the model via Michelangelo's hosted ML models, and sends results to RideCheck.)
Why move to a single job now?
● The platform has simplified things
● Much more stable now; less need to isolate
● Rapid iteration has slowed; less need for debugging
28. Agenda
■ A First Model
■ Choosing and Integrating Flink
■ 1st Iteration: A Modular, Light Topology
■ 2nd Iteration: On a Reusable Sensor Platform
■ 3rd Iteration: On-Trip Detection
29. On-Trip Crash Detection: A Hybrid Solution
● Hardest part is forgoing some valuable trip context
● Model performance is inevitably lower due to:
○ Giving up post-trip features
○ Considering only a sliding window of data
● Meant to be run in tandem with the post-trip pipeline
30. On-Trip Crash Detection
(Diagram: Trip Aggregation now also emits 1-minute payloads, while still emitting the original per-trip sensor data; the On-Trip Crash Detection job consumes those payloads and trip events, retains at most 5 minutes of data, and sends detections to RideCheck.)
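A minimal sketch of the 5-minute retention, assuming a KeyedProcessFunction over trip-keyed 1-minute payloads; the MinutePayload and CrashAlert types and the scoring call are hypothetical:

import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Keep a rolling buffer of at most 5 minutes of payloads per trip.
public class OnTripDetector extends KeyedProcessFunction<String, MinutePayload, CrashAlert> {
    private transient ListState<MinutePayload> recent;

    @Override
    public void open(Configuration parameters) {
        recent = getRuntimeContext().getListState(
            new ListStateDescriptor<>("recent-payloads", MinutePayload.class));
    }

    @Override
    public void processElement(MinutePayload payload,
                               Context ctx,
                               Collector<CrashAlert> out) throws Exception {
        // Rebuild the buffer, dropping anything older than 5 minutes.
        long cutoff = payload.getTimestamp() - Duration.ofMinutes(5).toMillis();
        List<MinutePayload> window = new ArrayList<>();
        for (MinutePayload p : recent.get()) {
            if (p.getTimestamp() >= cutoff) {
                window.add(p);
            }
        }
        window.add(payload);
        recent.update(window);

        // Hypothetical scoring over the retained window.
        CrashAlert alert = scoreWindow(ctx.getCurrentKey(), window);
        if (alert != null) {
            out.collect(alert);
        }
    }

    private CrashAlert scoreWindow(String tripId, List<MinutePayload> window) {
        return null; // placeholder: model scoring goes here
    }
}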
31. The Future
● Drive down the delay further
○ There would be enormous value in being able to respond in seconds
● On-device heuristics/model
○ Trigger early upload of batched sensor data
○ Backend still does the heavy lifting