This document discusses extending the metrics capabilities of Apache Flink for real-time business intelligence (BI) on streaming data pipelines at Walmart Labs. It proposes adding metric calculators and decorators to Flink operators to compute BI metrics as data flows through the pipelines. Metrics can be handled in-process for small dimension cardinalities, or via side outputs for large ones, which require downstream aggregation. The metrics database and aggregation techniques should be chosen based on dimensionality.
Smart Pricing @ Walmart Labs
• Algorithmic: competitive analysis, economic modeling, cost/profit margin/sales/inventory data
• Rich ingestion: Walmart Stores, Walmart.com, Merchants, Competition, 3rd party data
• Large catalog size: ~ 5M 1st party catalog; ~100M 3rd party/marketplace catalog
• Multi-strategy: competitive leadership/revenue management/liquidation/trending/bidding
• Real-time: essential inputs ingestion (competition/trends/availability)
• Real-time: any 1P item price is refreshed within 5 minutes on all important input triggers
• Quasi real-time: push-rate-controlled 3P catalog price valuation
• Throughput control: probabilistic filtering/caching/micro-batching/streaming API/backpressure
• Regular batch jobs for quasi-static data: sales forecast/demand elasticity/marketplace statistics
Pipeline flow (diagram):
• Item data: Merchant app, Catalog, 3P sources
• Competition data: Matching, Crawling, 3P sources
• Walmart data: Walmart Stores, Walmart.com, SCM/WMS
• Smart pricing data: Forecasts, Elasticity, Profit margin
• Pricing algo run: ML, Optimization, Rule-based
• Price push: Portal push, Cache refresh, Merchant push
Tech Stack: Flink, Kafka, Cassandra, Redis, gRPC, Hive, ElasticSearch, Druid, PySpark, Scikit, Azkaban
Monitoring & BI
• Health-check: metrics/Graphite/Prometheus/Grafana/New Relic
• Reporting: Hive/Tableau
• Auditing: Hive/Presto
• BI: Druid
Going beyond Druid: real-time BI
• Multi-dimensional KPI monitoring & alerting: meet/beat score, stop-loss category P&L
• Categorized Top-K counters: merchant ‘hot-item’ search ranking, trending items
• Anomaly detection: stateful outlier input detectors, categorized tail statistics
• Price strategy selection: item-level descriptive/prescriptive time-window snapshot
If all the data is already flowing through Flink pipelines, why not enrich them to compute real-time BI metrics as well?
• Pros: scalability of existing Flink pipelines; data locality; incremental code decoration with metric aspects; powerful Flink function APIs
• Cons: metric dimensionality curse/abuse; Flink performance side effects
Quick tour of Flink-metrics
• Dropwizard-like metrics: counters/gauges/histograms/meters
• Metric key scoping: system-level(TM/task/operator) + custom user-level (Metric Group feature)
• Flink Runtime Context: lets you register operator-level metric groups via the Flink RichFunction API (see the sketch below)
• Flink Metric Reporter: is notified of metric registry changes and micro-batches the metrics push
• Built-in metrics and reporters: scheduled Dropwizard reporter; Kafka metrics; IO/latency metrics
• Flink web-GUI: basic metric navigation & charting (no aggregates)
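To make the Runtime Context point concrete, here is a minimal sketch of registering a custom counter from a RichFunction; the class, group, and metric names are illustrative, while getRuntimeContext().getMetricGroup(), addGroup(), and counter() are the standard Flink metric APIs:

import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.metrics.Counter;

public class CountingMapper extends RichMapFunction<String, String> {
  private transient Counter itemsProcessed;

  @Override
  public void open(Configuration parameters) {
    // register a counter inside a custom user-level metric group
    this.itemsProcessed = getRuntimeContext()
        .getMetricGroup()
        .addGroup("smartPricing")        // custom scope (Metric Group feature)
        .counter("itemsProcessed");
  }

  @Override
  public String map(String value) {
    itemsProcessed.inc();  // picked up and pushed by the configured Metric Reporter
    return value;
  }
}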
✓ Good fit for health-checking
✓ Simple and extensible
✗ How to define metrics logic?
✗ How to decorate/attach metrics to existing Flink operators?
✗ How to make adding more/new metrics a quick/simple exercise?
✗ How to handle dynamic/BI-like metric dimensions?
✗ How to handle metrics aggregation?
Million-dollar question: KSQL vs Flink for real-time BI?
Extending Flink metrics:
• Define metric logic: add a hierarchy of stateful Flink Metric Calculators; offer built-in calculator bundles
• Attach to existing Flink operators: add a hierarchy of rich Flink Metered Operator Provider decorators; overload calls to StreamingExecutionEnvironment.addOperator() to implicitly decorate pipelines with metrics (wiring sketch below)
• Rapid metric development: offer metric calculator base classes, extensible with FP lambda arguments
• Ease of new metric set-up: programmatic and/or annotation and/or configuration
• Dynamic metric key dimension extraction: FP lambda to extract key; state per key; register per key
Calculator class hierarchy (diagram): MetricCalculator<>, StatefulMetricCalculator<>, ScalarMetricCalculator<>, VectorMetricCalculator<>
MeteredFunction<T> extends RichFunction,
    ResultTypeQueryable<T>, Supplier<MetricCalculatorsRegistry>

public interface MetricCalculator<T> extends
    Consumer<MetricsInvokationContext<T, ?>>,
    Supplier<Map<List<String>, Metric>>, Serializable {
  // called once per operator instance to wire the calculator into Flink metrics
  void bootstrap(RuntimeContext ctx);
}

abstract class MeteredOperatorProvider<OPER extends Function>
    implements Supplier<OPER> {
  protected OPER innerFunction;                            // the wrapped business function
  protected MetricCalculatorsRegistry metricsCalculators;  // calculators to attach
  protected List<String> userMetricGroupScope;             // custom metric key scope
  // ...
}
Operator provider class hierarchy (diagram): MeteredOperatorProvider<>, MeteredOperator1Provider<>, MeteredOperator2Provider<>, SourceOperatorProvider<>, SinkOperatorProvider<>, MapOperatorProvider<>
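A hypothetical wiring sketch follows; only the Supplier<OPER> contract above comes from the slides, while wrap(), withCalculator(), withScope(), PricePushMapper, and priceChangeCounter() are illustrative assumptions:

// hypothetical builder-style wiring; provider.get() returns the inner function
// decorated as a MeteredFunction whose open() bootstraps every registered
// calculator via the RuntimeContext
MeteredOperatorProvider<MapFunction<ItemPushRequest, ItemPushRequest>> provider =
    MapOperatorProvider.wrap(new PricePushMapper())         // inner business function
        .withCalculator(priceChangeCounter())               // any MetricCalculator
        .withScope(Arrays.asList("smartPricing", "push"));  // user metric group scope

stream.map(provider.get());  // the pipeline code itself stays unchanged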
BI metrics: in-process
In-process: fits small BI dimension cardinality & domain cases
• just a regular operator metric extending VectorMetricCalculator, with a custom key-extractor FP lambda
• metric keys are extracted on the fly from the data, and new Metrics are registered & reported dynamically (see the sketch below)
• keyed metric state is kept per metric key
• keyed metric state is updated in-process within each operator invocation (only for the extracted key)
• each task operator keeps a partition of the metric keys it has observed
• no aggregation in Flink: BI metrics are kept & reported without aggregation, as regular Metric Groups
• dimensionality abuse: may bloat TM memory and/or operator state in Flink and/or the Flink metrics reporter
public static MetricCalculator<ItemPushRequest> priceChangePackage() {
  return new ResettingVectorCounterCalculator<ItemPushRequest>()
      .withSimpleEvaluator(
          // key extractor: counter name plus the dynamic dimension value
          (data, s) -> Arrays.asList("PriceChangeDescTotalCounter",
              data.getItemPricePushData().getPriceChangeDesc()),
          // value evaluator: count 1 for a non-empty key, skip otherwise
          (k, s) -> !k.get().isEmpty() ? 1L : null);
}
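For context, a minimal sketch of how the dynamically extracted keys might be registered with Flink's metric system; dynamicCounters, ctx, and the group name are assumptions (fields set up in bootstrap(RuntimeContext)), while MetricGroup.addGroup() and counter() are the real Flink APIs:

private transient Map<String, Counter> dynamicCounters;  // metric key -> Counter
private transient RuntimeContext ctx;

void count(String key) {
  // lazily register a Counter the first time a new key is observed
  dynamicCounters.computeIfAbsent(key, k ->
      ctx.getMetricGroup()
         .addGroup("priceChangeDesc")  // user-level metric group per BI dimension
         .counter(k))                  // one Counter per extracted key
      .inc();
}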
BI metrics: side output
Side output: fits large BI dimension cardinality & domain cases
• just a regular operator metric extending VectorMetricCalculator, with a custom key-extractor FP lambda
• metric keys are extracted on the fly from the data
• no new Metrics are registered & reported dynamically
• a Flink ProcessFunction Operator Provider is used
• keyed metric state is updated as an async side effect of each operator invocation (only for the extracted key)
• aggregation in Flink: the BI metric side-stream must be explicitly handled in Flink, with aggregation & metric push done by another downstream Flink stage (scheduled independently, typically involving a data shuffle); see the sketch below
• dimensionality abuse: either use proper aggregation or a firehose sink in the downstream Flink metrics stage
https://ci.apache.org/projects/flink/flink-docs-release-1.3/dev/stream/side_output.html
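A minimal sketch of this side-output pattern, assuming the Flink 1.3 APIs from the page above; ItemPushRequest comes from the slides, while metricsTag, the key extractor, and MetricsPushSink are illustrative assumptions:

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;
import org.apache.flink.util.OutputTag;

// given: DataStream<ItemPushRequest> input
final OutputTag<Tuple2<String, Long>> metricsTag =
    new OutputTag<Tuple2<String, Long>>("bi-metrics") {};

SingleOutputStreamOperator<ItemPushRequest> main = input.process(
    new ProcessFunction<ItemPushRequest, ItemPushRequest>() {
      @Override
      public void processElement(ItemPushRequest value, Context ctx,
                                 Collector<ItemPushRequest> out) {
        // extract the dynamic metric key on the fly and emit an increment
        String key = value.getItemPricePushData().getPriceChangeDesc();
        ctx.output(metricsTag, Tuple2.of(key, 1L));
        out.collect(value);  // the business data continues unchanged
      }
    });

// aggregation happens in a separate downstream stage (implies a data shuffle)
main.getSideOutput(metricsTag)
    .keyBy(0)                          // shuffle by metric key
    .timeWindow(Time.minutes(1))       // aggregate per time window
    .sum(1)                            // count per key
    .addSink(new MetricsPushSink());   // hypothetical sink pushing aggregates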
Make a proper choice of the metrics DB/aggregation tech: match it to your metric dimensionality
A final bit of advice: pay attention to the limits of your metric dashboarding tools