The document outlines how to implement BigPetStore, a blueprint for Flink users, in under 500 lines of code. It describes generating sample data, performing ETL with both the DataSet and Table APIs, training a matrix factorization model with FlinkML for recommendations, and serving recommendations with the DataStream API. The goal is to demonstrate end-to-end workflows in Flink that go beyond WordCount by mixing APIs for data generation, cleaning, machine learning, and streaming predictions.
2. Outline
• BigPetStore model
• Data generator with the DataSet API
• ETL with the DataSet & Table API
• Matrix factorization with FlinkML
• Recommendation with the DataStream API
• Summary
3. BigPetStore
Blueprints for Big Data applications
Consists of:
• Data generators
• Examples that use tools from the Big Data ecosystem to process data
• A build system and tests for integrating tools and multiple JVM languages
Part of the Apache Bigtop project
4. BigPetStore model
• Location-based model: customers visit pet stores and generate transactions
Nowling, R.J.; Vyas, J., "A Domain-Driven, Generative Data Model for Big Pet Store," in Big Data and Cloud Computing (BDCloud), 2014 IEEE Fourth International Conference on, pp. 49-55, 3-5 Dec. 2014
5. Data generation
val env = ExecutionEnvironment.getExecutionEnvironment
val (stores, products, customers) = getData()
val startTime = getCurrentMillis()
// Generate each customer's transactions, then offset their timestamps by startTime
val transactions = env.fromCollection(customers)
  .flatMap(new TransactionGenerator(products))
  .withBroadcastSet(stores, "stores")
  .map { t => t.setDateTime(t.getDateTime + startTime); t }
transactions.writeAsText(output)
• Use RJ Nowling’s Java generator classes
• Write transactions to JSON
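The timestamp offset and the JSON output can be illustrated without Flink. The following is only a sketch under stated assumptions: `Transaction` is a hypothetical stand-in for the generator's transaction class, its field names and the JSON layout are invented for illustration, and product names are assumed to need no escaping.

```scala
// Hypothetical stand-in for the generator's transaction type (not the real BigPetStore class).
final case class Transaction(customerId: Int, storeId: Int, products: List[String], dateTime: Long)

object GenerationSketch {
  // Offset a generator-relative timestamp by startTime, as the map step on the slide does.
  def shift(t: Transaction, startTime: Long): Transaction =
    t.copy(dateTime = t.dateTime + startTime)

  // Render one transaction as a single JSON line (assumed layout, no string escaping).
  def toJson(t: Transaction): String = {
    val products = t.products.map(p => "\"" + p + "\"").mkString("[", ",", "]")
    s"""{"customerId":${t.customerId},"storeId":${t.storeId},"products":$products,"dateTime":${t.dateTime}}"""
  }
}
```

For example, `GenerationSketch.toJson(GenerationSketch.shift(Transaction(1, 7, List("dog food"), 1000L), 5000L))` yields one JSON line whose `dateTime` is 6000.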
6. ETL with the DataSet API
val env = ExecutionEnvironment.getExecutionEnvironment
val transactions = env.readTextFile(json).map(new FlinkTransaction(_))
// Give every distinct product a unique numeric id
val productsWithIndex = transactions.flatMap(_.getProducts)
  .distinct
  .zipWithUniqueId
// Join on the product to produce distinct (customerId, productId) pairs
val customerAndProductPairs = transactions
  .flatMap(t => t.getProducts.map(p => (t.getCustomer.getId, p)))
  .join(productsWithIndex).where(_._2).equalTo(_._2)
  .map(pair => (pair._1._1, pair._2._1))
  .distinct
customerAndProductPairs.writeAsCsv(output)
• Read the dirty JSON
• Output (customer, product) pairs for the recommender
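The same ETL logic can be followed on plain Scala collections. This is a minimal sketch, not the deck's code: `Purchase` is a hypothetical simplified transaction, and `zipWithIndex` is used in place of Flink's `zipWithUniqueId`, so the ids here are consecutive, which the Flink operator does not guarantee.

```scala
object EtlSketch {
  // Hypothetical minimal transaction: a customer id plus the purchased product names.
  final case class Purchase(customerId: Int, products: List[String])

  // Assign each distinct product an index, in order of first appearance.
  def indexProducts(purchases: List[Purchase]): Map[String, Int] =
    purchases.flatMap(_.products).distinct.zipWithIndex.toMap

  // Emit distinct (customerId, productIndex) pairs for the recommender.
  def customerProductPairs(purchases: List[Purchase]): List[(Int, Int)] = {
    val index = indexProducts(purchases)
    purchases
      .flatMap(p => p.products.map(name => (p.customerId, index(name))))
      .distinct
  }
}
```

The map lookup plays the role of the join on the slide: both replace a product with its numeric id so that the recommender only sees integer pairs.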
7. ETL with the Table API
val env = ExecutionEnvironment.getExecutionEnvironment
val transactions = env.readTextFile(json).map(new FlinkTransaction(_))
val table = transactions.map(toCaseClass(_)).toTable
// Transactions per store
val storeTransactionCount = table.groupBy('storeId)
  .select('storeId, 'storeName, 'storeId.count as 'count)
// Join against the per-store counts and keep the rows where count = max
val bestStores = table.groupBy('storeId)
  .select('storeId.max as 'max)
  .join(storeTransactionCount)
  .where("count = max")
  .select('storeId, 'storeName, 'storeId.count as 'count)
  .toDataSet[StoreCount]
• Read the dirty JSON
• SQL style queries
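One reading of the query's intent is "count transactions per store, then keep the stores with the highest count". The sketch below expresses that reading on plain Scala collections; it is an assumption about the intent rather than a translation of the Table API calls, and `Row` is a hypothetical flattened record holding only the two store columns.

```scala
object StoreCountSketch {
  // Hypothetical flattened transaction row: only the store columns used by the query.
  final case class Row(storeId: Int, storeName: String)
  final case class StoreCount(storeId: Int, storeName: String, count: Int)

  // Roughly: SELECT storeId, storeName, COUNT(*) GROUP BY storeId, storeName
  def countPerStore(rows: List[Row]): List[StoreCount] =
    rows.groupBy(r => (r.storeId, r.storeName))
      .map { case ((id, name), rs) => StoreCount(id, name, rs.size) }
      .toList
      .sortBy(_.storeId)

  // Keep the stores whose count equals the maximum count.
  def bestStores(rows: List[Row]): List[StoreCount] = {
    val counts = countPerStore(rows)
    if (counts.isEmpty) Nil
    else {
      val max = counts.map(_.count).max
      counts.filter(_.count == max)
    }
  }
}
```

Ties are kept, so several stores can be returned when they share the top count.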
8. A little recommender theory
[Figure: the User-Item matrix R factored into User factors (P) and Item factors (Q); the diagram also carries the labels User side information, Item side information, U and I]
• R is potentially huge, approximate it with P∗Q
• Prediction is TopK(user’s row ∗ Q)
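The prediction rule can be sketched on a single machine. This is an illustrative sketch only: it assumes Q is stored as a map from item id to that item's factor vector, and the names `dot` and `topK` are made up for the example.

```scala
object PredictSketch {
  // Dot product of two equally sized factor vectors.
  def dot(a: Vector[Double], b: Vector[Double]): Double =
    a.zip(b).map { case (x, y) => x * y }.sum

  // Score every item against the user's row of P and keep the ids of the k best items.
  // Ties on score are broken by the smaller item id.
  def topK(userRow: Vector[Double], q: Map[Int, Vector[Double]], k: Int): List[Int] =
    q.toList
      .map { case (itemId, factors) => (itemId, dot(userRow, factors)) }
      .sortWith((a, b) => a._2 > b._2 || (a._2 == b._2 && a._1 < b._1))
      .take(k)
      .map(_._1)
}
```

Each score is one entry of the user's row of P∗Q, so R itself never has to be materialized.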
9. Matrix factorization with FlinkML
val env = ExecutionEnvironment.getExecutionEnvironment
// Treat every observed (customer, product) pair as a rating of 1.0
val input = env.readCsvFile[(Int,Int)](inputFile)
  .map(pair => (pair._1, pair._2, 1.0))
val model = ALS()
  .setNumFactors(numFactors)
  .setIterations(iterations)
  .setLambda(lambda)
model.fit(input)
// After fitting, the model exposes the user factors (P) and item factors (Q)
val (p, q) = model.factorsOption.get
p.writeAsText(pOut)
q.writeAsText(qOut)
• Read the (customer, product) pairs
• Write P and Q to file
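The slide does not spell out what ALS minimizes. As a hedged reading in the deck's notation, and assuming the weighted-λ regularization described in the FlinkML ALS documentation, the objective can be written as follows, where Ω is the set of observed (customer, product) pairs, r_ui is the rating (1.0 for every pair here), p_u and q_i are the factor vectors of user u and item i, and n_u and n_i count the observed entries of user u and item i:

```latex
\min_{P,\,Q} \; \sum_{(u,i) \in \Omega} \left( r_{ui} - p_u^{\top} q_i \right)^2
  + \lambda \left( \sum_{u} n_u \lVert p_u \rVert^2 + \sum_{i} n_i \lVert q_i \rVert^2 \right)
```

Alternating least squares fixes Q and solves a least-squares problem for each p_u, then fixes P and solves for each q_i, repeating for the configured number of iterations.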
10. Recommendation with the DataStream API
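StreamExecutionEnvironment env = StreamExecutionEnvironment
    .getExecutionEnvironment();
env.socketTextStream("localhost", 9999)
    .map(new GetUserVector())    // userID -> the user's row
    .broadcast()                 // send the row to all parallel instances
    .map(new PartialTopK())      // TopK per parallel instance
    .keyBy(0)
    .flatMap(new GlobalTopK())   // merge into the global TopK
    .print();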
• Get the user’s row for a userID
• Compute the distributed TopK of the user’s row ∗ Q
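The two-stage TopK pattern can be sketched without Flink. This is a sketch under assumptions, not the code behind `PartialTopK` and `GlobalTopK`: each partition is assumed to have already scored its own items, and the function names and the `(itemId, score)` representation are chosen for illustration.

```scala
object TopKSketch {
  type Scored = (Int, Double) // (itemId, score)

  // Order by descending score, breaking ties by the smaller item id.
  private def better(a: Scored, b: Scored): Boolean =
    a._2 > b._2 || (a._2 == b._2 && a._1 < b._1)

  // Local step: one partition keeps only its k best candidates.
  def partialTopK(scored: Seq[Scored], k: Int): List[Scored] =
    scored.toList.sortWith(better).take(k)

  // Global step: merge the per-partition candidates and keep the overall k best.
  def globalTopK(partials: Seq[List[Scored]], k: Int): List[Scored] =
    partials.flatten.toList.sortWith(better).take(k)
}
```

Because every item in the global top k is also in the top k of its own partition, the merge step only needs to look at k candidates per partition instead of every scored item.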
11. Summary
• Go beyond WordCount with BigPetStore
• Feel free to mix the DataSet, DataStream, FlinkML, and Table APIs in your Flink workflows
• Data generation, cleaning, ETL, machine learning, and streaming prediction on top of one engine in under 500 lines of code
• Java and Scala APIs work well together
• A Flink pet project is always fun. No pun intended.
12. Big thanks to
• The BigPetStore folks:
Suneel Marthi
Ronald J. Nowling
Jay Vyas
• Squirrels helping with the code:
Gyula Fóra
Gábor Gevay
Gábor Hermann
Fabian Hueske
Aljoscha Krettek
• And to the whole Flink community
13. Check out the code
https://github.com/mbalassi/bigpetstore-flink
Márton Balassi
mbalassi@apache.org / @MartonBalassi
Hungarian Academy of Sciences