Warp 10 - Time Series Analysis on top of Hadoop - HUG France - Paris Spark Meetup 2017-05-16

Tour of the Open Source Warp 10 Time Series Platform including its integration into the Hadoop Ecosystem (Spark, Pig, Flink, Storm).

  1. 1. Spark Meetup - 2017-05-16, Paris ■ Mathias @Herberts - CTO, Cityzen Data ■ Warp 10 - Simplifying analysis of time series data on top of Hadoop
  2. 2. `whoami` ■ Former Senior SRE on Big Table at Google ■ Former head of Big Data at Crédit Mutuel Arkéa ■ Pioneer in the use of Hadoop & HBase in production since 2009 ■ Co-Founder and CTO of Cityzen Data, maker of Warp 10 ■ @herberts
  3. 3. Time Series
  4. 4. Time Series are everywhere
  5. 5. IoT & time series data management and analysis
  6. 6. Versatile Data Model
  7. 7. Geo Time Series®
  8. 8. Geo Time Series®
  9. 9. Digital Twin Paradigm
  10. 10. Multiple Versions
  11. 11. Embeddable for Edge Analytics
  12. 12. Standalone Version ■ HA ■ Datalog ■ in-memory
  13. 13. Distributed Version
  14. 14. Secure Solution
  15. 15. Security ■ Encryption and authentication/authorization mechanisms ■ Sandboxed environment for analytics
  16. 16. Rich Analytics
  17. 17. Analytics: a stack-based language dedicated to time series analytics
  18. 18. Advanced stack based language ■ Result is a JSON array of the various stack levels ■ Support for variables and context saving ■ Code serialization ■ Loops, conditionals, macros - Data Flow model ■ Secure code execution, resource limits
  19. 19. 5 high level frameworks ■ BUCKETIZE - transform a series so it has regularly spaced ticks ■ MAP - apply a function on a sliding window ■ REDUCE - tick by tick computation on multiple series, producing a single one ■ FILTER - select series based on various criteria ■ APPLY - tick by tick application of an n-ary function
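The three most used frameworks above can be illustrated outside of WarpScript. The following is a minimal Python sketch, not Warp 10 code: a series is assumed to be a list of `(tick, value)` pairs, and the function names `bucketize_last`, `reduce_sum` and `map_window` are hypothetical stand-ins that only mimic the general idea of BUCKETIZE, REDUCE and MAP.

```python
from collections import defaultdict


def bucketize_last(series, last_bucket, span):
    """Align ticks: keep the last value seen in each bucket (end - span, end]."""
    buckets = {}
    for tick, value in sorted(series):
        if tick > last_bucket:
            continue  # ignore points after the end of the last bucket
        k = (last_bucket - tick) // span   # how many buckets back from the last one
        end = last_bucket - k * span       # end tick of the bucket holding this point
        buckets[end] = value               # later ticks overwrite earlier ones
    return sorted(buckets.items())


def reduce_sum(series_list):
    """Tick by tick sum of several series, producing a single one."""
    totals = defaultdict(float)
    for series in series_list:
        for tick, value in series:
            totals[tick] += value
    return sorted(totals.items())


def map_window(series, fn, pre, post):
    """Apply fn on a sliding window of `pre` points before and `post` after each tick."""
    out = []
    for i, (tick, _) in enumerate(series):
        window = series[max(0, i - pre): i + post + 1]
        out.append((tick, fn(window)))
    return out


a = [(10, 1.0), (50, 2.0), (70, 3.0), (120, 4.0)]
b = [(30, 10.0), (110, 20.0)]

aligned = [bucketize_last(s, last_bucket=120, span=60) for s in (a, b)]
summed = reduce_sum(aligned)
smoothed = map_window(summed, lambda w: sum(v for _, v in w) / len(w), pre=1, post=0)
```

Aligning the ticks first is what makes the tick by tick sum meaningful: the two input series never share a timestamp until they have been bucketized.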
  21. 21. Compact expressiveness

          <%
            'Display write requests count for each region' DOC
            SAVE 'context' STORE
            'cell' STORE
            'PT60m' DURATION 'duration' STORE
            '@TOKEN_READ@' 'TOKEN' STORE
            NOW 'now' STORE
            [ $TOKEN 'writeRequestCount' { 'cell' $cell 'Context' 'regionserver' } $now $duration ] FETCH
            // Remove resets
            false RESETS
            // Align ticks
            [ SWAP bucketizer.last $now 60 STU * 0 ] BUCKETIZE
            // Sum by hname
            [ SWAP [ 'hname' ] reducer.sum ] REDUCE
            FILLNEXT FILLPREVIOUS
            // Compute rates
            [ SWAP mapper.rate 1 0 0 ] MAP
            $context RESTORE
          %>
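Two steps of the script above, removing counter resets and computing rates, can be sketched in plain Python. This is an illustration under stated assumptions, not the Warp 10 implementation: the helper names `remove_resets` and `rate` are hypothetical, a reset is assumed to be any decrease of a counter that should only grow, and the rate is assumed to be the difference between consecutive points divided by the elapsed time.

```python
def remove_resets(series):
    """Compensate counter restarts so the series only increases.

    Assumption: when a value drops, the counter restarted from zero, so the
    last value seen before the drop is added to every later point.
    """
    out, offset, prev = [], 0, None
    for tick, value in series:
        if prev is not None and value < prev:
            offset += prev
        prev = value
        out.append((tick, value + offset))
    return out


def rate(series, units_per_second=1):
    """Rate of change per second between each point and the previous one.

    The first point has no predecessor and is dropped in this sketch.
    """
    out = []
    for (t0, v0), (t1, v1) in zip(series, series[1:]):
        seconds = (t1 - t0) / units_per_second
        out.append((t1, (v1 - v0) / seconds))
    return out


# A counter sampled every 60 seconds that restarts between tick 60 and tick 120
counter = [(0, 0), (60, 120), (120, 30), (180, 90)]
monotonic = remove_resets(counter)
rates = rate(monotonic)
```

Without the reset compensation, the restart would show up as a large negative rate at tick 120.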
  22. 22. Extensibility
  23. 23. WarpScript Server Side Macros

          <%
            <'
            This macro does such and such…
            @param xxx
            @param yyy
            '>
            DOC
            // Store the current context so we can create symbols freely
            SAVE '_context' STORE
            // Insert your code here
            // Restore original context
            $_context RESTORE
          %>
          'macro' STORE
          // Unit tests
          // Leave the macro on the stack
          $macro
          // Use via @path/to/macro in your scripts
  24. 24. WarpScript Extensions

          import java.util.HashMap;
          import java.util.Map;

          import io.warp10.script.sdk.WarpScriptExtension;
          import io.warp10.script.NamedWarpScriptFunction;
          import io.warp10.script.WarpScriptException;
          import io.warp10.script.WarpScriptStack;
          import io.warp10.script.WarpScriptStackFunction;

          public class MyExtension extends WarpScriptExtension {
            private static Map<String,Object> functions = new HashMap<String,Object>();

            private static class MyStackFunction extends NamedWarpScriptFunction implements WarpScriptStackFunction {
              // Pass the function name up to NamedWarpScriptFunction
              public MyStackFunction(String name) {
                super(name);
              }

              @Override
              public Object apply(WarpScriptStack stack) throws WarpScriptException {
                ….
                return stack;
              }
            }

            static {
              // Register the function under the name used in WarpScript code
              functions.put("XXX", new MyStackFunction("XXX"));
            }

            @Override
            public Map<String, Object> getFunctions() {
              return functions;
            }
          }
  25. 25. CALLing external programs

          #!/usr/bin/env python -u
          import cPickle, sys, urllib, base64

          # Output the maximum number of instances of this 'callable' to spawn
          print 10

          # Loop, reading stdin, doing our stuff and outputting to stdout
          while True:
              try:
                  line = sys.stdin.readline()
                  line = line.strip()
                  line = urllib.unquote(line.decode('utf-8'))
                  # Remove Base64 encoding
                  data = base64.b64decode(line)
                  args = cPickle.loads(data)
                  # Do our stuff
                  output = ….
                  # Output result (URL encoded UTF-8).
                  print urllib.quote(output.encode('utf-8'))
              except Exception as err:
                  # Errors are reported as a leading space followed by the URL encoded message
                  print ' ' + urllib.quote(repr(err).encode('utf-8'))

          ... ->PICKLE 'UTF-8' BYTES-> ->B64 'path/to/file' CALL B64-> PICKLE-> ....
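The slide's callable is written for Python 2. As a minimal sketch of the same line protocol in Python 3 (URL encoding around Base64 around a pickled payload, errors prefixed with a space), the encoding steps could look as follows. The helper names `decode_request`, `encode_response` and `encode_error` are hypothetical and not part of Warp 10.

```python
import base64
import pickle
from urllib.parse import quote, unquote


def decode_request(line: str):
    """URL-decode, Base64-decode, then unpickle one line read from stdin."""
    raw = base64.b64decode(unquote(line.strip()))
    # Only unpickle data coming from a trusted caller
    return pickle.loads(raw)


def encode_response(output: str) -> str:
    """URL-encode the result; quote() encodes str as UTF-8 by default."""
    return quote(output)


def encode_error(err: Exception) -> str:
    """Errors are signalled by a leading space before the URL-encoded message."""
    return " " + quote(repr(err))


# Round trip: build a request line the way a caller would, then decode it
request_line = quote(base64.b64encode(pickle.dumps([1, 2, 3])).decode("ascii")) + "\n"
args = decode_request(request_line)
```

A full callable would still print the maximum number of instances first, then loop over `sys.stdin`, writing one encoded line per request.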
  26. 26. Visualization
  27. 27. Quantum IDE
  28. 28. Quantum IDE
  29. 29. QuantumViz Web Component

          <!doctype html>
          <html>
          <head>
            <meta name="viewport" content="width=device-width, minimum-scale=1.0, initial-scale=1.0, user-scalable=yes">
            <script src=""></script>
            <link rel="import" href="">
            <link rel="import" href="">
          </head>
          <body>
            <warp10-quantumviz width="500" height="400" show-axis="true" tooltip="true" line-width="2" reload="0" host="">
              NEWGTS
              1 720 <% DUP 'i' STORE 10000000 * NaN NaN NaN $i TORADIANS COS ADDVALUE %> FOR
              [ SWAP ] 'gts' STORE
              [ { 'color' '#00d4ff' 'key' 'Sine' } ] 'params' STORE
              { 'interpolate' 'linear' } 'globalParams' STORE
              { 'gts' $gts 'params' $params 'globalParams' $globalParams }
            </warp10-quantumviz>
          </body>
          </html>
  30. 30. Grafana Integration
  31. 31. Timelion Integration
  32. 32. Processing Integration

          800 'width' STORE
          800 'height' STORE
          400.0 'maxspeed' STORE
          40000.0 'maxalt' STORE
          3.0 2.0 2.0 @orbit/heatmap/kernel/triangular 'kernel' STORE
          @orbit/heatmap/palette/classic 'palette' STORE
          'TOKEN' 'token' STORE
          $width $height '2D' PGraphics
          'MULTIPLY' PblendMode
          'CENTER' PimageMode
          [ $token '~(ALT|CAS)' {} NOW -2000000 ] FETCH
          DUP 0 GET LASTTICK 'now' STORE
          [ SWAP bucketizer.last $now STU 0 ] BUCKETIZE
          // Create heatmap
          <%
            7 GET LIST-> DROP 'CAS' STORE 'ALT' STORE
            <% $CAS ISNULL NOT $ALT ISNULL NOT && %>
            <% $kernel $CAS $maxspeed / $width * $ALT $maxalt / 1.0 SWAP - $height * Pimage %>
            IFT
            0 NaN NaN NaN NULL
          %> MACROREDUCER 'GRAPHER' STORE
          [ SWAP [] $GRAPHER ] REDUCE DROP
          // Colorize
          Ppixels <% DROP Palpha $palette SWAP GET %> LMAP PupdatePixels
          Pencode Pdecode
          $width $height '2D' PGraphics
          // Do the grid
          PnoFill 0 0 $width 1 - $height 1 - Prect
          2.0 PstrokeWeight
          200.0 Pcolor Pstroke
          250.0 $maxspeed / $width * DUP 0 SWAP $height Pline
          0 10000 $maxalt / 1.0 SWAP - $height * DUP $width SWAP Pline
          SWAP 0 0 Pimage
          Pencode
  33. 33. QuantumImg Web Component

          <!doctype html>
          <html>
          <head>
            <link rel="stylesheet" href="//">
            <link rel="stylesheet" href="//">
            <script src="//"></script>
            <script src="//"></script>
            <script src=""></script>
            <link rel="import" href="">
            <link rel="import" href="">
          </head>
          <body>
            <warp10-img width="300" height="300" reload="0" host="">
              200 'width' CSTORE
              200 'height' CSTORE
              $width $height '2D' PGraphics
              Ppixels <% DROP DROP RAND 0xFFFFFFFF * TOLONG %> LMAP PupdatePixels
              Pencode
            </warp10-img>
          </body>
          </html>
  34. 34. Ok, what about Hadoop?
  35. 35. Dealing with time series data in Hadoop is difficult!
  36. 36. Most, if not all approaches do it wrong!
  37. 37. Either too narrow in focus... think econometric time series
  38. 38. ...or providing too little value... because moving average is simply a beginning
  39. 39. ...or limited to a specific tool... think xxxRDD
  40. 40. Warp 10 brings the power of to
  41. 41. Warp10InputFormat ■ Read data stored in Warp 10 at millions of datapoints per second ■ Standard Hadoop InputFormat ■ Compatible with any tool relying on such an InputFormat ■ Compact representation of time series, lower memory footprint
  42. 42. Integration with Spark ■ Enable the use of WarpScript code in the Spark DAG ■ Provide both WarpScriptFunction and WarpScriptFlatMapFunction ■ Manipulate RDD/DataSet/DataFrame elements on the WarpScript stack ■ Extend WarpScript to support custom types if needed ■ Load time series data from any source (Parquet, SQL, …)
  43. 43. Integration with Spark

          DataFrame df = …;
          RDD<Row> rdd = df.rdd();
          JavaRDD<Row> jrdd = rdd.toJavaRDD();

          // Run a server side macro on each partition
          JavaRDD<Row> out = jrdd.mapPartitions(
            new WarpScriptFlatMapFunction<Iterator<Row>,Row>("@ext-macro.mc2"));

          // Group rows by their first two columns
          JavaPairRDD<Row, Iterable<Row>> grouped = out.groupBy(
            new WarpScriptFunction<Row, Row>("[ 0 1 ] SUBLIST ->SPARKROW"));

          // Merge the wrapped series of each group into a single row
          JavaRDD<Row> merged = grouped.map(
            new WarpScriptFunction<Tuple2<Row,Iterable<Row>>, Row>("LIST-> DROP 0 GET [] SWAP <% SPARK-> 2 GET UNWRAP +! %> FOREACH MERGE WRAPRAW + 2 GET 1 ->LIST ->SPARKROW"));

          List<StructField> fields = new ArrayList<StructField>();
          fields.add(DataTypes.createStructField("wrapper", DataTypes.BinaryType, false));
          StructType st = new StructType(fields.toArray(new StructField[0]));

          DataFrame df2 = sqlc.createDataFrame(merged, st);
          df2.write().parquet("/path/to/output/parquetfile");
  44. 44. Integration with Pig ■ Enable the use of WarpScript code in Pig scripts ■ Provide a WarpScriptRun UDF ■ Manipulate Pig types (tuples, bags, …) on the WarpScript stack ■ Represent time series in a very compact form to speed up processing ■ Load time series data from any source
  45. 45. Integration with Pig

          REGISTER warp10-pig-0.0.10-rc2.jar;
          SET warp.timeunits 'us';
          DEFINE WarpScriptRun io.warp10.pig.WarpScriptRun();

          GTS = LOAD '$input' USING PigStorage() AS (gts: chararray);

          -- Retain only the 'frequency' GTS and chunk them by 5 minutes
          FREQCHUNKS = FOREACH GTS GENERATE FLATTEN(
            WarpScriptRun('DUP UNWRAPEMPTY NAME "frequency" == <% UNWRAP 0 5 m 0 0 "chunkid" false CHUNK WRAP %> <% [] %> IFTE ->V ', gts));

          -- Flatten the bag
          CHUNKS = FOREACH FREQCHUNKS GENERATE FLATTEN($0);

          -- Generate station id, chunk id, gts
          BYSTATIONCHUNK = FOREACH CHUNKS GENERATE FLATTEN(
            WarpScriptRun('DUP UNWRAP LABELS DUP "chunkid" GET SWAP "stationid" GET ', $0))
            AS (stationid: chararray, chunkid: chararray, gts: chararray);

          -- Group by station id, chunk id
          STATIONCHUNKGROUP = GROUP BYSTATIONCHUNK BY (stationid, chunkid) PARALLEL 20;

          -- Merge the GTS to reconstruct the chunk and emit station id, chunk id, gts
          FULLCHUNKS = FOREACH STATIONCHUNKGROUP GENERATE FLATTEN(
            WarpScriptRun('V-> <% DROP 2 GET UNWRAP %> LMAP MERGE DUP LABELS SWAP WRAP SWAP DUP "chunkid" GET SWAP "stationid" GET ', BYSTATIONCHUNK))
            AS (stationid: chararray, chunkid: chararray, gts: chararray);

          STORE FULLCHUNKS INTO '$output' USING PigStorage('\t');
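The Pig script above follows a chunk, group, merge pattern: split each series into 5 minute chunks, group the pieces by station and chunk, then merge each group back into one series. A minimal Python sketch of that pattern, assuming ticks in microseconds and input points given as `(station, tick, value)` tuples, might look like this. The names `chunk_id` and `chunk_and_group` are hypothetical, and the chunk identifier used here (an integer index) is an assumption that need not match what the WarpScript CHUNK function stores in the "chunkid" label.

```python
from collections import defaultdict

CHUNK_US = 5 * 60 * 1_000_000  # 5 minutes expressed in microseconds


def chunk_id(tick_us, width=CHUNK_US):
    """Index of the fixed-width chunk that contains the given tick."""
    return tick_us // width


def chunk_and_group(points):
    """Group (station, tick, value) points by (station, chunk id).

    Each group is the merged, time-ordered series for one station and one chunk.
    """
    groups = defaultdict(list)
    for station, tick, value in points:
        groups[(station, chunk_id(tick))].append((tick, value))
    return {key: sorted(series) for key, series in groups.items()}


points = [
    ("A", 300_000_000, 49.9),
    ("A", 0, 50.0),
    ("B", 10, 50.2),
    ("A", 299_999_999, 50.1),
]
chunks = chunk_and_group(points)
```

Keying on both station and chunk is what lets the grouping step run in parallel, as the `PARALLEL 20` clause does in the Pig version.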
  46. 46. Integration with stream processing engines

          {
            'type' 'spout'
            'id' 'spout-0'
            'output' { 'stream-0' [ 'field-2' 'field-1' ] }
            'parallelism' 1
            'every' 500
            'debug' true
            'macro'
            0 'counter' STORE
            <%
              $counter 1 + 'counter' STORE
              'NOW' 'https://host:port/api/v0/exec' REXEC 'now' STORE
              { 'stream-0' [ [ 'now' $now ] ] }
            %>
          }
          {
            'type' 'bolt'
            'id' 'bolt-0'
            'parallelism' 2
            'debug' true
            'input' { 'spout-0' { 'stream-0' 'shuffle' } }
            'output' { 'stream-1' [ 'outfield' ] }
            'macro'
            <%
              SNAPSHOT [ SWAP ] 'value' STORE
              $value 0 GET _storm.LOG
              { 'stream-1' [ $value ] }
            %>
          }
  47. 47. And also... ■ Integration with Flink ■ Integration with Zeppelin via a WarpScript interpreter ■ Warp 10 sink to push data to Warp 10 once it has been processed ■ Coherent approach in ad-hoc, batch, and streaming modes ■ Reduce amount of code needed to be written, focus on business problems
  48. 48. Open Source Distribution
  49. 49. Thank you! 3 steps to get you started with Warp 10:

          curl -O -L
          tar zxpf warp10-1.2.7.tar.gz
          export JAVA_HOME=/path/to/java/home; cd warp10-1.2.7; ./bin/warp10-standalone.init start

      A set of resources to learn, ask and share: @warp10io !forum/warp10-users
  50. 50. contact @ cityzendata . com