
Spark zeppelin-cassandra at synchrotron

Spark/Zeppelin/Cassandra for particle accelerator metrics storage and aggregation



  1. Spark/Cassandra/Zeppelin for particle accelerator metrics storage and aggregation (DuyHai DOAN, Apache Cassandra Evangelist)
  2. @doanduyhai Who am I? Duy Hai DOAN, Apache Cassandra Evangelist • talks, meetups, confs … • open-source projects (Achilles, Apache Zeppelin …) • OSS Cassandra point of contact ☞ duy_hai.doan@datastax.com ☞ @doanduyhai
  3. The HDB++ project • What is a synchrotron? • HDB++ project presentation • Why Spark, Cassandra and Zeppelin?
  4. @doanduyhai What is a synchrotron? • a particle accelerator (electrons) • electron beams used for crystallography analysis of: • materials • molecular biology • …
  5. @doanduyhai What is a synchrotron? (photo)
  6. @doanduyhai (photo)
  7. @doanduyhai The HDB++ project • a sub-project of TANGO, a software toolkit to • connect • control/monitor • integrate sensor devices • HDB++ = the new TANGO event-driven archiving system • historically used MySQL • now stores data into Cassandra
  8. @doanduyhai The HDB++ project (architecture diagram)
  9. @doanduyhai The HDB++ project (status as of Sept 2015)
  10. @doanduyhai The HDB++ GUI (screenshot)
  11. @doanduyhai The HDB++ GUI (screenshot)
  12. @doanduyhai The HDB++ hardware specs (table)
  13. Q & A
  14. The HDB++ Cassandra data model
  15. @doanduyhai Metrics table
      CREATE TABLE hdb.att_scalar_devshort_ro (
          att_conf_id timeuuid,
          period text,
          data_time timestamp,
          data_time_us int,
          error_desc text,
          insert_time timestamp,
          insert_time_us int,
          quality int,
          recv_time timestamp,
          recv_time_us int,
          value_r int,
          PRIMARY KEY((att_conf_id, period), data_time, data_time_us)
      );
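     Both time columns are clustering columns under the (att_conf_id, period) partition key, so point lookups and time-range scans stay within a single partition. A minimal read sketch, assuming day-shaped period buckets like the statistics table uses; the timeuuid value is a placeholder:

         -- a sketch: fetch one attribute's raw points for part of a day;
         -- the att_conf_id value and the period bucket format are assumptions
         SELECT data_time, data_time_us, value_r, quality
         FROM hdb.att_scalar_devshort_ro
         WHERE att_conf_id = aaaaaaaa-bbbb-11e6-8080-808080808080
           AND period = '2016-06-28'
           AND data_time >= '2016-06-28 00:00:00+0000'
           AND data_time < '2016-06-28 12:00:00+0000';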
  16. @doanduyhai Statistics table
      CREATE TABLE hdb.stat_scalar_devshort_ro (
          att_conf_id text,
          type_period text,   // HOUR, DAY, MONTH, YEAR
          period text,        // yyyy-MM-dd:HH, yyyy-MM-dd, yyyy-MM, yyyy
          count_distinct_error bigint,
          count_error bigint,
          count_point bigint,
          value_r_max int,
          value_r_min int,
          value_r_mean double,
          value_r_sd double,
          PRIMARY KEY ((att_conf_id, type_period), period)
      );
  17. @doanduyhai Statistics table
      INSERT INTO hdbtest.stat_scalar_devshort_ro(att_conf_id, type_period, period, value_r_mean)
      VALUES(xxxx, 'DAY', '2016-06-28', 123.456);
      INSERT INTO hdbtest.stat_scalar_devshort_ro(att_conf_id, type_period, period, value_r_mean)
      VALUES(xxxx, 'HOUR', '2016-06-28:01', 123.456);
      INSERT INTO hdbtest.stat_scalar_devshort_ro(att_conf_id, type_period, period, value_r_mean)
      VALUES(xxxx, 'MONTH', '2016-06', 123.456);
      // Request by period of time
      SELECT * FROM hdbtest.stat_scalar_devshort_ro
      WHERE att_conf_id = xxx AND type_period = 'DAY'
        AND period > '2016-06-20' AND period < '2016-06-28';
  18. Q & A
  19. The Spark jobs
  20. @doanduyhai Source code
      val devShortRoTable = sqlContext
        .read
        .format("org.apache.spark.sql.cassandra")
        .options(Map("table" -> "att_scalar_devshort_ro", "keyspace" -> "hdbtest"))
        .load()
      devShortRoTable.registerTempTable("att_scalar_devshort_ro")
  21. @doanduyhai Source code
      val devShortRo = sqlContext.sql(s"""
        SELECT "DAY" AS type_period, att_conf_id, period,
          count(att_conf_id) AS count_point,
          count(error_desc) AS count_error,
          count(DISTINCT error_desc) AS count_distinct_error,
          min(value_r) AS value_r_min,
          max(value_r) AS value_r_max,
          avg(value_r) AS value_r_mean,
          stddev(value_r) AS value_r_sd
        FROM att_scalar_devshort_ro
        WHERE period = "${day}"
        GROUP BY att_conf_id, period""")
  22. @doanduyhai Source code
      devShortRo.write
        .format("org.apache.spark.sql.cassandra")
        .options(Map("table" -> "stat_scalar_devshort_ro", "keyspace" -> "hdbtest"))
        .mode(SaveMode.Append)
        .save()
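     Putting slides 20 to 22 together, here is a minimal, self-contained sketch of how the daily aggregation could be packaged for spark-submit. The object name DailyStatsJob, the app name, and the Cassandra host are hypothetical; it assumes Spark 1.6.x with the Spark-Cassandra connector on the classpath:

         // hypothetical packaging of the read/aggregate/write steps above
         import org.apache.spark.{SparkConf, SparkContext}
         import org.apache.spark.sql.SaveMode
         import org.apache.spark.sql.hive.HiveContext

         object DailyStatsJob {
           def main(args: Array[String]): Unit = {
             val day = args(0) // e.g. "2016-06-28"
             val conf = new SparkConf()
               .setAppName("hdb-daily-stats")
               .set("spark.cassandra.connection.host", "cassandra-host") // placeholder
             val sc = new SparkContext(conf)
             // HiveContext rather than SQLContext so that stddev() resolves (see slide 41)
             val sqlContext = new HiveContext(sc)

             // register the raw metrics table as a temp table
             sqlContext.read
               .format("org.apache.spark.sql.cassandra")
               .options(Map("table" -> "att_scalar_devshort_ro", "keyspace" -> "hdbtest"))
               .load()
               .registerTempTable("att_scalar_devshort_ro")

             // aggregate one day per attribute
             val devShortRo = sqlContext.sql(s"""
               SELECT "DAY" AS type_period, att_conf_id, period,
                 count(att_conf_id) AS count_point,
                 count(error_desc) AS count_error,
                 count(DISTINCT error_desc) AS count_distinct_error,
                 min(value_r) AS value_r_min, max(value_r) AS value_r_max,
                 avg(value_r) AS value_r_mean, stddev(value_r) AS value_r_sd
               FROM att_scalar_devshort_ro
               WHERE period = "${day}"
               GROUP BY att_conf_id, period""")

             // append the aggregates into the statistics table
             devShortRo.write
               .format("org.apache.spark.sql.cassandra")
               .options(Map("table" -> "stat_scalar_devshort_ro", "keyspace" -> "hdbtest"))
               .mode(SaveMode.Append)
               .save()
             sc.stop()
           }
         }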
  23. Demo: Zeppelin
  24. @doanduyhai Zeppelin visualization (export as iframe)
  25. Q & A
  26. Spark/Cassandra/Zeppelin tricks and traps • Zeppelin/Spark/Cassandra • Spark/Cassandra
  27. @doanduyhai Zeppelin/Spark/Cassandra • legend: 💣 = trap, 💡 = trick
  28. @doanduyhai Zeppelin/Spark/Cassandra • Zeppelin build mode • standard • with the Spark-Cassandra connector (Maven profile -Pcassandra-spark-1.x) • Spark run mode • local • against a stand-alone Spark cluster co-located with Cassandra
  29. @doanduyhai Zeppelin/Spark/Cassandra • Zeppelin build mode standard, Spark run mode local • you need to add the Spark-Cassandra connector as a dependency of the Spark interpreter 💡
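     For example (a sketch; the connector version is an assumption to match your Spark/Scala build), the Maven coordinate can be registered in the Spark interpreter's dependency settings, or loaded through a %dep paragraph before the Spark interpreter starts:

         %dep
         z.load("com.datastax.spark:spark-cassandra-connector_2.10:1.6.0")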
  30. @doanduyhai Zeppelin/Spark/Cassandra • Zeppelin build mode standard, Spark run mode local • on Spark interpreter init, all declared dependencies are fetched from the declared repositories (default = Maven Central + the local Maven repo) • beware of the corporate firewall! 💣 • where are the downloaded dependencies (jars) stored? 💡
  31. @doanduyhai Zeppelin/Spark/Cassandra • Zeppelin build mode standard, Spark run mode cluster • Zeppelin uses the spark-submit command • the Spark interpreter is run by bin/interpreter.sh 💡
  32. @doanduyhai Zeppelin/Spark/Cassandra • Zeppelin build mode standard, Spark run mode cluster • run at least ONCE in local mode so that Zeppelin can download the dependencies into its local repo (zeppelin.interpreter.localRepo)! 💣
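     Answering the question from slide 30: a sketch, assuming Zeppelin's default layout, where jars fetched at interpreter init are cached per interpreter id:

         # zeppelin.interpreter.localRepo resolves under this directory
         ls "$ZEPPELIN_HOME/local-repo/"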
  33. @doanduyhai Zeppelin/Spark/Cassandra • Zeppelin build mode with connector, Spark run mode local or cluster • runs smoothly because all Spark-Cassandra connector dependencies are merged into the interpreter/spark/dep/zeppelin-spark-dependencies-x.y.z.jar fat jar during the build 💡
  34. @doanduyhai Zeppelin/Spark/Cassandra • OSS Spark • needs the Spark-Cassandra connector dependencies added • in conf/spark-env.sh … otherwise: Caused by: java.lang.NoClassDefFoundError: com/datastax/driver/core/ConsistencyLevel
  35. @doanduyhai Zeppelin/Spark/Cassandra • OSS Spark • needs ALL transitive dependencies of the Spark-Cassandra connector! • in conf/spark-env.sh • or use the spark-submit --packages groupId:artifactId:version option 💣
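     A sketch of the --packages route, which resolves the connector and its transitive dependencies from Maven at submit time (the coordinates, host, class, and jar names are assumptions; DailyStatsJob is the hypothetical job sketched after slide 22):

         spark-submit \
           --packages com.datastax.spark:spark-cassandra-connector_2.10:1.6.0 \
           --conf spark.cassandra.connection.host=cassandra-host \
           --class DailyStatsJob \
           daily-stats-job.jar 2016-06-28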
  36. @doanduyhai Zeppelin/Spark/Cassandra • DSE Spark • runs smoothly because the Spark-Cassandra connector dependencies are already embedded in the package ($DSE_HOME/resources/spark/lib)
  37. @doanduyhai Spark/Cassandra • Spark deploy mode (spark-submit --deploy-mode) • client • cluster • Zeppelin deploys in client mode by default 💡
  38. @doanduyhai Spark/Cassandra • Spark client deploy mode • the default • needs to ship all driver-program dependencies to the workers (network intensive) • suitable for a REPL (Spark shell, Zeppelin) • suitable for one-shot jobs and testing
  39. @doanduyhai Spark/Cassandra • Spark cluster deploy mode • the driver program runs on a worker node • all driver-program dependencies must be reachable by every worker • usually dependencies are stored in HDFS, but they can live on the local FS of every worker • suitable for recurrent jobs • needs a consistent build & deploy process for your jobs
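     A sketch of a recurrent job submitted in cluster deploy mode, assuming a stand-alone master (host and paths are placeholders) and the job jar pre-copied to the same path on every worker:

         spark-submit \
           --deploy-mode cluster \
           --master spark://spark-master:7077 \
           --class DailyStatsJob \
           /opt/jobs/daily-stats-job.jar 2016-06-28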
  40. @doanduyhai Spark/Cassandra • the job fails when run with spark-submit • but succeeds in Zeppelin … • error: value stddev not found
      val devShortRo = sqlContext.sql(s"""
        SELECT "DAY" AS type_period, att_conf_id, period,
          count(att_conf_id) AS count_point,
          count(error_desc) AS count_error,
          count(DISTINCT error_desc) AS count_distinct_error,
          min(value_r) AS value_r_min,
          max(value_r) AS value_r_max,
          avg(value_r) AS value_r_mean,
          stddev(value_r) AS value_r_sd
        FROM att_scalar_devshort_ro
        WHERE period = "${day}"
        GROUP BY att_conf_id, period""")
  41. @doanduyhai Spark/Cassandra • indeed, Zeppelin uses a HiveContext by default … • the fix: create a HiveContext in the spark-submit job too, as sketched below 💣
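     A minimal sketch of that fix: build the job's SQL context the same way Zeppelin does. With the Spark 1.x setup described here, stddev only resolves through the Hive function registry, so a plain SQLContext cannot find it while a HiveContext can:

         import org.apache.spark.{SparkConf, SparkContext}
         import org.apache.spark.sql.hive.HiveContext

         val sc = new SparkContext(new SparkConf().setAppName("hdb-daily-stats"))
         val sqlContext = new HiveContext(sc) // instead of new SQLContext(sc)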
  42. Q & A
  43. @doanduyhai Cassandra Summit 2016, September 7-9, San Jose, CA • get 15% off with code DoanDuy15 • cassandrasummit.org
  44. @doanduyhai duy_hai.doan@datastax.com https://academy.datastax.com/ Thank you
