Apache Spark
Chris Fregly
Solution Architect
  
	
  
Use Cases
•  Relevant
    –  Operations that affect all data in a data set
    –  Batch analytics
•  Less relevant
    –  Fine-grained updates to shared state
    –  Web apps
  
Performance
•  20x faster than Hadoop for iterative ML and graph apps
    –  Avoids IO and deserialization by storing data in memory as direct Java objects (versus binary data)
•  40x faster than Hadoop for analytics
•  Query a 1TB dataset interactively with 5-7s latency
•  25-100 m1.xlarge EC2 instances
    –  4 cores, 15GB RAM, HDFS w/ 256MB blocks
•  10 iterations on 100GB datasets
  
Benchmarks: Scan
Benchmarks: Aggregation
Benchmarks: Join
Benchmarks: Script
  
Interactive
•  Scala REPL
•  Python interpreter
  
RDD
•  Resilient Distributed Dataset
    –  Immutable
    –  Flexible storage strategies
        •  In-memory, spill-to-disk priority
    –  Partitioned by key
        •  Optimizes joins with data locality
    –  Parallel
    –  Fault-tolerant
    –  Supports coarse-grained transformations
        •  Versus fine-grained updates to shared state
    –  Enables in-memory data reuse between iterations
        •  Common in machine learning and graph algorithms
        •  No disk IO or serialization needed between iterations
    –  Rich set of operations including map, flatMap, filter, join, groupByKey, reduceByKey, etc.
  
Lineage
•  Logs coarse-grained RDD transformations
•  Enables fault tolerance
•  Reconstruct by replaying log of parent/derived RDDs and their transformations
    –  Recomputed in parallel on separate nodes
•  Lineage log is ~10KB, versus 2x replication of the complete dataset
•  Large lineages may benefit from checkpointing*
*more later
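
The replay idea on this slide can be sketched in plain Python (this is not Spark's API). The class `ToyRDD` and its methods are hypothetical names made up for illustration. Each dataset keeps only its parent and the coarse-grained function that derived it, so a lost partition is rebuilt by re-applying that function to the matching parent partition. The sketch assumes narrow, partition-to-partition transformations only, and it computes eagerly for brevity.

```python
from typing import Callable, List, Optional


class ToyRDD:
    """Hypothetical stand-in for an RDD: partitions plus a lineage record."""

    def __init__(self, partitions: Optional[List[list]] = None,
                 parent: Optional["ToyRDD"] = None,
                 fn: Optional[Callable[[list], list]] = None):
        self.parent = parent  # lineage: where this dataset came from
        self.fn = fn          # lineage: how each partition was derived
        if partitions is not None:
            self.partitions = partitions
        else:
            self.partitions = [fn(p) for p in parent.partitions]

    def map(self, f):
        return ToyRDD(parent=self, fn=lambda part: [f(x) for x in part])

    def filter(self, pred):
        return ToyRDD(parent=self, fn=lambda part: [x for x in part if pred(x)])

    def lose_partition(self, i):
        """Simulate a node failure that drops partition i."""
        self.partitions[i] = None

    def recover_partition(self, i):
        """Rebuild partition i by replaying lineage from the nearest surviving ancestor."""
        if self.partitions[i] is None:
            if self.parent is None:
                raise RuntimeError("source partition lost; re-read it from stable storage")
            self.partitions[i] = self.fn(self.parent.recover_partition(i))
        return self.partitions[i]


base = ToyRDD(partitions=[[1, 2, 3], [4, 5, 6]])
derived = base.map(lambda x: x * 10).filter(lambda x: x > 10)
derived.lose_partition(1)
print(derived.recover_partition(1))
```

Only the lost partition is recomputed; the surviving partition is untouched, which is why recovery can run in parallel across nodes.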
  
Dependencies
•  Two types of parent-child RDD dependencies
•  Narrow
    –  Each partition of the parent RDD is used by at most one partition of the child RDD
    –  Allows pipelined execution on a single node
    –  Node failure still allows partition-level recovery (parallelizable)
    –  e.g. map(), union(), sample()
•  Wide
    –  Each partition of the parent RDD is used by many partitions of the child RDD
    –  Requires partitions of each parent RDD to be shuffled across nodes
    –  Node failure may lead to complete re-execution if loss affects partitions of all parent RDDs (bad)
    –  e.g. join()*
*unless parents are hash-partitioned
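
As a rough illustration of the narrow/wide distinction, the plain-Python sketch below classifies a dependency from a list of (parent partition, child partition) edges. The function name `classify_dependency` and the edge-list representation are assumptions made for this example; Spark does not expose dependencies this way.

```python
from collections import defaultdict


def classify_dependency(edges):
    """Return "narrow" if every parent partition feeds at most one child partition.

    edges: iterable of (parent_partition_id, child_partition_id) pairs.
    """
    children_of = defaultdict(set)
    for parent, child in edges:
        children_of[parent].add(child)
    if all(len(children) <= 1 for children in children_of.values()):
        return "narrow"
    return "wide"


# map(): partition i of the parent feeds only partition i of the child.
map_edges = [(0, 0), (1, 1), (2, 2)]

# union(): each partition of parents A and B feeds exactly one child partition.
union_edges = [(("A", 0), 0), (("A", 1), 1), (("B", 0), 2), (("B", 1), 3)]

# Shuffle (e.g. a join on parents that are not hash-partitioned): every parent
# partition may contribute records to every child partition.
shuffle_edges = [(p, c) for p in range(3) for c in range(2)]

print(classify_dependency(map_edges))
print(classify_dependency(union_edges))
print(classify_dependency(shuffle_edges))
```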
  
RDD Dependencies
  
Operations
•  Two types of RDD operations
    –  Transformations
        •  Define RDDs
        •  Lazy initialization
    –  Actions
        •  Compute
        •  Store
•  Some operations are only available for key-value pairs
    –  join(), groupByKey(), reduceByKey()
•  map()
    –  One-to-one transformation
•  flatMap()
    –  Zero-to-many transformation
    –  Most similar to map() in MapReduce
•  Partitioning
    –  sort(), groupByKey(), reduceByKey()
    –  Results in hash or range-partitioned RDD
  
RDD Operations
  
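Transformations
•  RDD -> RDD

// Define an RDD backed by a file in HDFS; nothing is read yet
val lines = spark.textFile("hdfs://…")
// Derive a second RDD that keeps only the error lines
val errors = lines.filter(_.startsWith("ERROR"))
// Ask Spark to keep errors in memory once it is first computed
errors.persist()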
  
Actions
•  RDDs are lazily materialized when an action is invoked
•  count()
    –  Counts the elements
•  collect()
    –  Returns the elements
•  save()
    –  Persists to storage
•  persist()
    –  Reuse in future iterations
    –  Can spill to disk if memory is low
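
Lazy materialization can be shown with a small plain-Python sketch. `LazyDataset` is a hypothetical class, not part of Spark: transformations only append to a recorded plan, and the source is not read until an action such as `count()` or `collect()` runs. Persistence is not modelled, so each action re-reads the source.

```python
class LazyDataset:
    """Hypothetical sketch: transformations record a plan, actions execute it."""

    def __init__(self, source, steps=()):
        self._source = source        # zero-argument callable that loads the data
        self._steps = tuple(steps)   # recorded (kind, function) transformations

    def map(self, f):
        return LazyDataset(self._source, self._steps + (("map", f),))

    def filter(self, pred):
        return LazyDataset(self._source, self._steps + (("filter", pred),))

    def _iterate(self):
        items = iter(self._source())
        for kind, f in self._steps:
            items = map(f, items) if kind == "map" else filter(f, items)
        return items

    def collect(self):
        """Action: run the plan and return all elements."""
        return list(self._iterate())

    def count(self):
        """Action: run the plan and count the elements."""
        return sum(1 for _ in self._iterate())


reads = []  # records each time the source is actually loaded


def load_lines():
    reads.append("read")
    return ["ERROR disk full", "INFO ok", "ERROR timeout"]


lines = LazyDataset(load_lines)
errors = lines.filter(lambda s: s.startswith("ERROR"))
print(len(reads))       # the source has not been touched yet
print(errors.count())   # the action triggers the read
print(len(reads))
```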
  
Partitioning
•  Useful for grouping related data onto same node
•  Improves join() performance
•  Partitions
    –  One partition for each block
•  PreferredLocations
    –  Block-to-node mapping
•  Iterator
    –  Reads the blocks
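
Why partitioning helps join() can be sketched in plain Python, under the assumption that both datasets are hash-partitioned with the same function and partition count. All function names here are hypothetical, and `zlib.crc32` merely stands in for whatever hash a real partitioner uses. Because equal keys land in the same partition index on both sides, each pair of partitions can be joined locally with no cross-partition data movement.

```python
import zlib


def partition_of(key, num_partitions):
    """Deterministic stand-in for a hash partitioner."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def hash_partition(pairs, num_partitions):
    parts = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        parts[partition_of(key, num_partitions)].append((key, value))
    return parts


def local_join(left_part, right_part):
    """Join two partitions that are known to hold the same key range."""
    right_index = {}
    for key, value in right_part:
        right_index.setdefault(key, []).append(value)
    return [(key, (lv, rv))
            for key, lv in left_part
            for rv in right_index.get(key, [])]


def copartitioned_join(left_parts, right_parts):
    # Partition i only ever meets partition i: no shuffle is needed.
    return [local_join(l, r) for l, r in zip(left_parts, right_parts)]


users = [("alice", "SF"), ("bob", "NYC"), ("carol", "LA")]
visits = [("alice", "/home"), ("bob", "/cart"), ("alice", "/checkout")]

joined = copartitioned_join(hash_partition(users, 4), hash_partition(visits, 4))
flat = sorted(row for part in joined for row in part)
print(flat)
```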
  
Job Scheduling
•  Driver
    –  Defines RDDs by performing transformations
•  Worker
    –  Long-lived processes
    –  Perform actions on RDD partitions
    –  Data locality is optimized
•  Driver passes closures (function literals) to worker nodes for execution
  
Job Scheduling
•  Actions cause the scheduler to do the following:
    –  Examine the RDD's lineage graph
    –  Build a DAG of stages to execute
    –  Compute missing partitions at each stage until the target RDD is reached
•  Stages
    –  Optimized to pipeline transformations of narrow dependencies
    –  Shuffle operations for wide dependencies
        •  Materializes intermediate records on the node holding the partition of the parent RDD
    –  Short-circuits for already-computed partitions of parent RDD
•  Delay scheduling (TODO: Understand this better)
•  Data locality (in-memory or on-disk) is optimized
•  Node failures
    –  Resubmits tasks to compute the missing partitions in parallel
•  Scheduler failures are currently not handled
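
The stage-building step can be approximated in a few lines of plain Python. This sketch assumes a linear lineage chain rather than a general DAG, and `build_stages` is a hypothetical helper, not the scheduler's real interface. Consecutive narrow-dependency operations are pipelined into one stage, and each wide dependency starts a new stage because it needs a shuffle.

```python
def build_stages(lineage):
    """Split a linear lineage into stages at wide (shuffle) dependencies.

    lineage: list of (operation_name, dependency_type) from source to target,
             where dependency_type is "narrow" or "wide".
    """
    stages, current = [], []
    for operation, dependency in lineage:
        if dependency == "wide" and current:
            stages.append(current)  # close the pipelined stage before the shuffle
            current = []
        current.append(operation)
    if current:
        stages.append(current)
    return stages


lineage = [
    ("textFile", "narrow"),
    ("filter", "narrow"),
    ("map", "narrow"),
    ("reduceByKey", "wide"),
    ("map", "narrow"),
]
print(build_stages(lineage))
```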
  
Memory
•  LRU eviction policy for RDDs
•  Cycles are prevented
•  Persistence priority can be set for each RDD
•  Spill to disk
    –  Worst-case performance is similar to existing MapReduce performance due to IO
•  Each Spark instance has its own memory space*
*Future improvement may include unified memory manager to share RDDs across Spark instances
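
A minimal sketch of LRU eviction, in plain Python with a hypothetical `LruPartitionCache` class. Capacity is counted in partitions rather than bytes to keep the example short, evicted keys are only recorded in a `spilled` list rather than written to disk, and the cycle-prevention and per-RDD priority rules from the slide are not modelled.

```python
from collections import OrderedDict


class LruPartitionCache:
    """Hypothetical in-memory partition cache with least-recently-used eviction."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._entries = OrderedDict()  # oldest access first, newest last
        self.spilled = []              # keys evicted from memory, in order

    def get(self, key):
        if key not in self._entries:
            return None
        self._entries.move_to_end(key)  # mark as most recently used
        return self._entries[key]

    def put(self, key, partition):
        self._entries[key] = partition
        self._entries.move_to_end(key)
        while len(self._entries) > self.capacity:
            evicted_key, _ = self._entries.popitem(last=False)
            self.spilled.append(evicted_key)


cache = LruPartitionCache(capacity=2)
cache.put("rdd1_p0", [1, 2])
cache.put("rdd1_p1", [3, 4])
cache.get("rdd1_p0")          # touch p0 so that p1 becomes the oldest entry
cache.put("rdd2_p0", [5, 6])  # exceeds capacity, so p1 is evicted
print(cache.spilled)
```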
  
Checkpointing
•  Snapshot of lineage graph
•  Asynchronously replicate RDDs to other nodes
    –  Immutability allows this to happen in the background without a stop-the-world event
•  Speeds up recovery of lost RDD partitions
    –  Useful for large lineage graphs containing wide dependencies which could require full re-computation
        •  Lineage graphs containing narrow dependencies are less likely to require full re-computation
•  Lineage is forgotten after checkpoint occurs to avoid unbounded lineage graphs
•  User can determine checkpoint strategy*
*Future improvements may include an automatic checkpoint policy based on data set characteristics (size, dependency types, and initial time to compute)
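
The effect of forgetting lineage at a checkpoint can be illustrated with a plain-Python sketch. `LineageNode`, `checkpoint`, and `iterate` are hypothetical names, and the every-10-iterations policy is an arbitrary assumption chosen for the example. An iterative job that derives a new dataset per iteration grows its lineage without bound unless a checkpoint cuts the chain.

```python
class LineageNode:
    """Hypothetical lineage record: a dataset name and a pointer to its parent."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent


def lineage_length(node):
    """Number of transformations that would be replayed to rebuild this dataset."""
    steps = 0
    while node.parent is not None:
        steps += 1
        node = node.parent
    return steps


def checkpoint(node):
    """After the data is saved reliably, the lineage behind it can be dropped."""
    node.parent = None


def iterate(num_iterations, checkpoint_every=None):
    current = LineageNode("ranks_0")
    for i in range(1, num_iterations + 1):
        current = LineageNode(f"ranks_{i}", parent=current)
        if checkpoint_every and i % checkpoint_every == 0:
            checkpoint(current)
    return current


print(lineage_length(iterate(53)))                       # no checkpoints
print(lineage_length(iterate(53, checkpoint_every=10)))  # last checkpoint at 50
```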
  
Runtime
•  Runs standalone, on YARN, or on Hadoop v1.x
•  Mesos
    –  Cluster-resource isolation across distributed applications
    –  Fine-grained sharing
  
Data Storage
•  HDFS, HBase, S3, local files
  	
  
Development Tools
•  (TODO: Verify all of this)
•  Scala 2.10
•  Java 1.6+
•  Python 2.7 + NumPy
•  Spark Debugger
    –  Lineage graph inspector/replayer
  
Shark
•  Hive on Spark
•  Run unmodified Hive queries on existing data warehouse
•  Supports UDFs, Metastore, etc.
•  Call MLlib directly on Hive tables
•  Use the same Hive queries for real-time, near-real-time, and batch!
  
MLlib
•  Machine learning algorithms
    –  Naïve Bayes classifier
    –  K-means clustering
    –  Linear and logistic regression
    –  Alternating least squares collaborative filtering
        •  Explicit ratings
        •  Implicit feedback
    –  Stochastic gradient descent
  
Real-world Uses
•  Dynamic stream-switching based on real-time network conditions
•  Predict traffic congestion using expectation maximization (EM) machine learning algorithm
•  Monarch Project: Twitter spam classification using logistic regression machine learning algorithm
  
GraphX
•  Data-parallel and graph-parallel system
•  Ability to join graph data and table data
•  "Think like a vertex"
•  Vertex-locality is optimized across the cluster
  
Streaming
•  Sources
    –  Kafka, Flume, Twitter Firehose, ZeroMQ, Kinesis
        •  Custom connectors
        •  Data must be stored reliably in case recompute is needed upon node failure
•  Discretized streams of RDDs (D-Streams)
•  Series of deterministic batch computations over small time intervals (versus record-at-a-time like Storm and S4)
•  Determinism allows parallel re-computation on failure and speculative execution on stragglers
•  Speculative execution happens if a task runs more than 1.4x longer than the median task in its job stage
•  0.5-2s latencies
    –  Good enough, but not meant for high-frequency trading
    –  Allows better fault-tolerance and efficiency
•  Clear, exactly-once consistency semantics
    –  Each record passed through the whole deterministic process
        •  Distributed state is far easier to reason about
        •  No long-lived, stateful operations or cross-stream ordering
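
The 1.4x straggler rule can be written down as a short plain-Python check. The function name `speculation_candidates` and the dictionary of per-task running times are assumptions for illustration; only the 1.4x-of-median threshold comes from the slide.

```python
from statistics import median


def speculation_candidates(running_seconds, threshold=1.4):
    """Return the tasks whose running time exceeds threshold x the stage median.

    running_seconds: dict mapping task id to seconds the task has been running.
    """
    cutoff = threshold * median(running_seconds.values())
    return sorted(task for task, secs in running_seconds.items() if secs > cutoff)


stage = {"t1": 1.0, "t2": 1.1, "t3": 0.9, "t4": 2.0}
# Median is 1.05s, so the cutoff is 1.47s and only t4 qualifies.
print(speculation_candidates(stage))
```

Because each batch computation is deterministic, running a second copy of a slow task elsewhere produces the same result, so whichever copy finishes first can be used.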
  
Streaming
•  Seamlessly interoperate with batch and interactive processing, all based on RDDs
    –  Real-time, ad-hoc queries against live system
    –  Helps avoid over-fitting in certain ML algorithms by combining real-time (sparse) data with historical data
•  Timing Issues
    –  Supports "slack time" to allow late-arriving records to be part of the correct batch
        •  Adds fixed latency, of course
    –  Application-level incremental reduce() to add the new records to a previous batch
•  Sliding Window
    –  Incremental aggregation
    –  Incremental reduceByWindow to aggregate within a given window
    –  If aggregation function is invertible, you can subtract values to avoid duplicate calculations
•  State Tracking
    –  track() operator
        •  Transforms streams of (key, event) into streams of (key, state)
        •  initialize, update, timeout functions
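
The invertible-aggregation point can be made concrete with a plain-Python sketch that uses addition as the aggregation function. `windowed_sums` is a hypothetical helper operating on per-batch totals, not Spark's reduceByWindow. Instead of re-summing every batch in the window, each step adds the newest batch and subtracts the one that just slid out.

```python
from collections import deque


def windowed_sums(batch_totals, window):
    """Sliding-window sums over per-batch totals, updated incrementally.

    Relies on addition being invertible: the batch leaving the window is
    subtracted rather than re-adding everything still inside it.
    """
    recent = deque()
    running = 0
    results = []
    for total in batch_totals:
        recent.append(total)
        running += total                 # add the newest batch
        if len(recent) > window:
            running -= recent.popleft()  # subtract the batch that left the window
        results.append(running)
    return results


print(windowed_sums([3, 1, 4, 1, 5], window=3))
```

Each update costs a constant amount of work regardless of window length. A non-invertible function such as max would not allow the subtraction step.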
  
Streaming
•  Master
    –  D-Stream lineage graph
    –  Task scheduler
    –  Block tracker
•  Workers
    –  Receives input RDDs
    –  Clock-sync'd with NTP
    –  Executes tasks
        •  Notifies master of new input RDDs on regular intervals
        •  Receives tasks from master
        •  Computes new RDD partitions
    –  Manages in-memory Block Store
        •  Stores partitions of immutable input RDDs and computed RDDs
        •  Each block is given a unique ID
        •  LRU cache
  
Streaming
•  Pipelined operators can be grouped into a single task
    –  map().map()
•  Task placement on workers
    –  Data-locality aware
        •  Choose a worker that contains the data block
    –  Partition-aware
        •  Same keys on same node to avoid shuffling
•  Block placement based on worker load
•  Master not fault-tolerant (yet)*
*Future work to persist the D-Streams graph to allow a standby to take over
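
One possible reading of the placement rules above, sketched in plain Python. `place_task`, the block-location map, and the load numbers are all hypothetical. The tie-breaking by load among workers that hold the block is an assumption of this sketch and is not stated on the slide.

```python
def place_task(block_id, block_locations, worker_load):
    """Pick a worker for a task that reads block_id.

    Prefers workers that already hold the block (data locality). If no worker
    holds it, falls back to the least-loaded worker overall.
    """
    holders = block_locations.get(block_id, [])
    candidates = holders if holders else list(worker_load)
    # Least load first; the worker name breaks ties deterministically.
    return min(candidates, key=lambda worker: (worker_load[worker], worker))


block_locations = {"b1": ["w2", "w3"]}
worker_load = {"w1": 0, "w2": 5, "w3": 2}

print(place_task("b1", block_locations, worker_load))  # a holder of b1
print(place_task("b9", block_locations, worker_load))  # no holder known
```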
  
Streaming Performance
•  100 m1.xlarge EC2 instances
    –  4 cores, 15GB RAM
•  100 byte input records
•  1 second latency target
    –  500 ms input intervals
•  6GB/s throughput!
•  Linear scalability
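
As a back-of-the-envelope check on these figures, the arithmetic below derives per-node and per-core rates. It assumes decimal gigabytes (1 GB = 10^9 bytes) and that load is spread evenly across all nodes and cores. Neither assumption is stated on the slide.

```python
NODES = 100
CORES_PER_NODE = 4
RECORD_BYTES = 100
THROUGHPUT_BYTES_PER_S = 6 * 10**9  # 6GB/s, assuming decimal gigabytes

records_per_s = THROUGHPUT_BYTES_PER_S // RECORD_BYTES
bytes_per_node_per_s = THROUGHPUT_BYTES_PER_S // NODES
records_per_core_per_s = records_per_s // (NODES * CORES_PER_NODE)

print(f"{records_per_s:,} records/s across the cluster")
print(f"{bytes_per_node_per_s // 10**6} MB/s per node")
print(f"{records_per_core_per_s:,} records/s per core")
```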
  
Other Streaming Solutions
•  Replication
    –  Requires synchronization protocol (Flux, DPC) to preserve order
•  Upstream backup
    –  Upon failure, parent replays messages since last checkpoint to the standby
    –  Storm
        •  At-least-once delivery
        •  Requires application code to recover state
    –  Storm + Trident
        •  Keeps state in replicated DB
        •  Commits updates in batches
        •  Requires updates to be replicated across the network within a transaction
        •  Costly, but recovers quickly
•  Doesn't handle stragglers/slow nodes
    –  Will slow down the whole system given the required synchronization
  
Streaming Comparison
  
Streaming Future Work
•  Dynamically adjust level of parallelism at each stage
•  Dynamically adjust interval size based on load
    –  Lower interval size at low load for lower latencies
•  Automatic checkpointing based on data set characteristics (size, dependency types, and initial time to compute)
•  Limit ad-hoc query resources to avoid slowing the overall streaming system
•  Return partial results in case of failure
    –  Launch a child task before its parents are done
    –  Provide lineage data to know which parents are missing
  
Kinesis Connector Demo!
  
References
•  http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
•  http://www.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-259.pdf
•  http://spark.apache.org/
  
	
  
Available Training
•  Spark Streaming
    –  AWS Kinesis-Spark Streaming integration
•  Shark
    –  Hive queries with Spark/HDFS
•  MLlib
    –  Anomaly detection with Spark
•  GraphX
    –  Distributed Graph Processing with Spark
•  Installing Spark on Mesos
•  Installing Spark on YARN
  
Upcoming Talks
•  AWS Meetup (SF, Oakland, and Berkeley)
    –  http://www.meetup.com/AWS-Berkeley
•  Spark Meetup (SF)
    –  http://www.meetup.com/spark-users/
•  Advanced AWS Meetup (SF and Peninsula)
    –  http://www.meetup.com/AdvancedAWS/
•  Netflix Internal CloudU (Los Gatos)
•  Scala Meetup (SF)
    –  http://www.meetup.com/SF-Scala/
  

Más contenido relacionado

La actualidad más candente

Apache Spark MLlib 2.0 Preview: Data Science and Production
Apache Spark MLlib 2.0 Preview: Data Science and ProductionApache Spark MLlib 2.0 Preview: Data Science and Production
Apache Spark MLlib 2.0 Preview: Data Science and ProductionDatabricks
 
Huawei Advanced Data Science With Spark Streaming
Huawei Advanced Data Science With Spark StreamingHuawei Advanced Data Science With Spark Streaming
Huawei Advanced Data Science With Spark StreamingJen Aman
 
Scaling Apache Spark MLlib to Billions of Parameters: Spark Summit East talk ...
Scaling Apache Spark MLlib to Billions of Parameters: Spark Summit East talk ...Scaling Apache Spark MLlib to Billions of Parameters: Spark Summit East talk ...
Scaling Apache Spark MLlib to Billions of Parameters: Spark Summit East talk ...Spark Summit
 
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Databricks
 
Machine Learning by Example - Apache Spark
Machine Learning by Example - Apache SparkMachine Learning by Example - Apache Spark
Machine Learning by Example - Apache SparkMeeraj Kunnumpurath
 
Machine Learning as a Service: Apache Spark MLlib Enrichment and Web-Based Co...
Machine Learning as a Service: Apache Spark MLlib Enrichment and Web-Based Co...Machine Learning as a Service: Apache Spark MLlib Enrichment and Web-Based Co...
Machine Learning as a Service: Apache Spark MLlib Enrichment and Web-Based Co...Databricks
 
Relationship Extraction from Unstructured Text-Based on Stanford NLP with Spa...
Relationship Extraction from Unstructured Text-Based on Stanford NLP with Spa...Relationship Extraction from Unstructured Text-Based on Stanford NLP with Spa...
Relationship Extraction from Unstructured Text-Based on Stanford NLP with Spa...Spark Summit
 
Apache Spark and Online Analytics
Apache Spark and Online Analytics Apache Spark and Online Analytics
Apache Spark and Online Analytics Databricks
 
Apache con big data 2015 - Data Science from the trenches
Apache con big data 2015 - Data Science from the trenchesApache con big data 2015 - Data Science from the trenches
Apache con big data 2015 - Data Science from the trenchesVinay Shukla
 
Performance Optimization Case Study: Shattering Hadoop's Sort Record with Spa...
Performance Optimization Case Study: Shattering Hadoop's Sort Record with Spa...Performance Optimization Case Study: Shattering Hadoop's Sort Record with Spa...
Performance Optimization Case Study: Shattering Hadoop's Sort Record with Spa...Databricks
 
Spark Summit EU talk by Oscar Castaneda
Spark Summit EU talk by Oscar CastanedaSpark Summit EU talk by Oscar Castaneda
Spark Summit EU talk by Oscar CastanedaSpark Summit
 
Random Walks on Large Scale Graphs with Apache Spark with Min Shen
Random Walks on Large Scale Graphs with Apache Spark with Min ShenRandom Walks on Large Scale Graphs with Apache Spark with Min Shen
Random Walks on Large Scale Graphs with Apache Spark with Min ShenDatabricks
 
Scalable Data Science in Python and R on Apache Spark
Scalable Data Science in Python and R on Apache SparkScalable Data Science in Python and R on Apache Spark
Scalable Data Science in Python and R on Apache Sparkfelixcss
 
SparkApplicationDevMadeEasy_Spark_Summit_2015
SparkApplicationDevMadeEasy_Spark_Summit_2015SparkApplicationDevMadeEasy_Spark_Summit_2015
SparkApplicationDevMadeEasy_Spark_Summit_2015Lance Co Ting Keh
 
Data science on big data. Pragmatic approach
Data science on big data. Pragmatic approachData science on big data. Pragmatic approach
Data science on big data. Pragmatic approachPavel Mezentsev
 
Large-Scale Data Science in Apache Spark 2.0
Large-Scale Data Science in Apache Spark 2.0Large-Scale Data Science in Apache Spark 2.0
Large-Scale Data Science in Apache Spark 2.0Databricks
 
Sumedh Wale's presentation
Sumedh Wale's presentationSumedh Wale's presentation
Sumedh Wale's presentationpunesparkmeetup
 
Building, Debugging, and Tuning Spark Machine Leaning Pipelines-(Joseph Bradl...
Building, Debugging, and Tuning Spark Machine Leaning Pipelines-(Joseph Bradl...Building, Debugging, and Tuning Spark Machine Leaning Pipelines-(Joseph Bradl...
Building, Debugging, and Tuning Spark Machine Leaning Pipelines-(Joseph Bradl...Spark Summit
 
Introduction to Apache Spark Developer Training
Introduction to Apache Spark Developer TrainingIntroduction to Apache Spark Developer Training
Introduction to Apache Spark Developer TrainingCloudera, Inc.
 
MLeap: Deploy Spark ML Pipelines to Production API Servers
MLeap: Deploy Spark ML Pipelines to Production API ServersMLeap: Deploy Spark ML Pipelines to Production API Servers
MLeap: Deploy Spark ML Pipelines to Production API ServersDataWorks Summit
 

La actualidad más candente (20)

Apache Spark MLlib 2.0 Preview: Data Science and Production
Apache Spark MLlib 2.0 Preview: Data Science and ProductionApache Spark MLlib 2.0 Preview: Data Science and Production
Apache Spark MLlib 2.0 Preview: Data Science and Production
 
Huawei Advanced Data Science With Spark Streaming
Huawei Advanced Data Science With Spark StreamingHuawei Advanced Data Science With Spark Streaming
Huawei Advanced Data Science With Spark Streaming
 
Scaling Apache Spark MLlib to Billions of Parameters: Spark Summit East talk ...
Scaling Apache Spark MLlib to Billions of Parameters: Spark Summit East talk ...Scaling Apache Spark MLlib to Billions of Parameters: Spark Summit East talk ...
Scaling Apache Spark MLlib to Billions of Parameters: Spark Summit East talk ...
 
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
 
Machine Learning by Example - Apache Spark
Machine Learning by Example - Apache SparkMachine Learning by Example - Apache Spark
Machine Learning by Example - Apache Spark
 
Machine Learning as a Service: Apache Spark MLlib Enrichment and Web-Based Co...
Machine Learning as a Service: Apache Spark MLlib Enrichment and Web-Based Co...Machine Learning as a Service: Apache Spark MLlib Enrichment and Web-Based Co...
Machine Learning as a Service: Apache Spark MLlib Enrichment and Web-Based Co...
 
Relationship Extraction from Unstructured Text-Based on Stanford NLP with Spa...
Relationship Extraction from Unstructured Text-Based on Stanford NLP with Spa...Relationship Extraction from Unstructured Text-Based on Stanford NLP with Spa...
Relationship Extraction from Unstructured Text-Based on Stanford NLP with Spa...
 
Apache Spark and Online Analytics
Apache Spark and Online Analytics Apache Spark and Online Analytics
Apache Spark and Online Analytics
 
Apache con big data 2015 - Data Science from the trenches
Apache con big data 2015 - Data Science from the trenchesApache con big data 2015 - Data Science from the trenches
Apache con big data 2015 - Data Science from the trenches
 
Performance Optimization Case Study: Shattering Hadoop's Sort Record with Spa...
Performance Optimization Case Study: Shattering Hadoop's Sort Record with Spa...Performance Optimization Case Study: Shattering Hadoop's Sort Record with Spa...
Performance Optimization Case Study: Shattering Hadoop's Sort Record with Spa...
 
Spark Summit EU talk by Oscar Castaneda
Spark Summit EU talk by Oscar CastanedaSpark Summit EU talk by Oscar Castaneda
Spark Summit EU talk by Oscar Castaneda
 
Random Walks on Large Scale Graphs with Apache Spark with Min Shen
Random Walks on Large Scale Graphs with Apache Spark with Min ShenRandom Walks on Large Scale Graphs with Apache Spark with Min Shen
Random Walks on Large Scale Graphs with Apache Spark with Min Shen
 
Scalable Data Science in Python and R on Apache Spark
Scalable Data Science in Python and R on Apache SparkScalable Data Science in Python and R on Apache Spark
Scalable Data Science in Python and R on Apache Spark
 
SparkApplicationDevMadeEasy_Spark_Summit_2015
SparkApplicationDevMadeEasy_Spark_Summit_2015SparkApplicationDevMadeEasy_Spark_Summit_2015
SparkApplicationDevMadeEasy_Spark_Summit_2015
 
Data science on big data. Pragmatic approach
Data science on big data. Pragmatic approachData science on big data. Pragmatic approach
Data science on big data. Pragmatic approach
 
Large-Scale Data Science in Apache Spark 2.0
Large-Scale Data Science in Apache Spark 2.0Large-Scale Data Science in Apache Spark 2.0
Large-Scale Data Science in Apache Spark 2.0
 
Sumedh Wale's presentation
Sumedh Wale's presentationSumedh Wale's presentation
Sumedh Wale's presentation
 
Building, Debugging, and Tuning Spark Machine Leaning Pipelines-(Joseph Bradl...
Building, Debugging, and Tuning Spark Machine Leaning Pipelines-(Joseph Bradl...Building, Debugging, and Tuning Spark Machine Leaning Pipelines-(Joseph Bradl...
Building, Debugging, and Tuning Spark Machine Leaning Pipelines-(Joseph Bradl...
 
Introduction to Apache Spark Developer Training
Introduction to Apache Spark Developer TrainingIntroduction to Apache Spark Developer Training
Introduction to Apache Spark Developer Training
 
MLeap: Deploy Spark ML Pipelines to Production API Servers
MLeap: Deploy Spark ML Pipelines to Production API ServersMLeap: Deploy Spark ML Pipelines to Production API Servers
MLeap: Deploy Spark ML Pipelines to Production API Servers
 

Destacado

Petabridge: The New .NET Enterprise Stack
Petabridge: The New .NET Enterprise StackPetabridge: The New .NET Enterprise Stack
Petabridge: The New .NET Enterprise StackDataStax Academy
 
DataStax: The Whys of NoSQL
DataStax: The Whys of NoSQLDataStax: The Whys of NoSQL
DataStax: The Whys of NoSQLDataStax Academy
 
DataStax: Steps to successfully implementing NoSQL in the enterprise
DataStax: Steps to successfully implementing NoSQL in the enterpriseDataStax: Steps to successfully implementing NoSQL in the enterprise
DataStax: Steps to successfully implementing NoSQL in the enterpriseDataStax Academy
 
DataStax: Setting Your Database Management on Autopilot with OpsCenter
DataStax: Setting Your Database Management on Autopilot with OpsCenterDataStax: Setting Your Database Management on Autopilot with OpsCenter
DataStax: Setting Your Database Management on Autopilot with OpsCenterDataStax Academy
 
DataStax: Making a Difference with Smart Analytics
DataStax: Making a Difference with Smart AnalyticsDataStax: Making a Difference with Smart Analytics
DataStax: Making a Difference with Smart AnalyticsDataStax Academy
 
DataStax: Ramping up Cassandra QA
DataStax: Ramping up Cassandra QADataStax: Ramping up Cassandra QA
DataStax: Ramping up Cassandra QADataStax Academy
 
Infosys Ltd: Performance Tuning - A Key to Successful Cassandra Migration
Infosys Ltd: Performance Tuning - A Key to Successful Cassandra MigrationInfosys Ltd: Performance Tuning - A Key to Successful Cassandra Migration
Infosys Ltd: Performance Tuning - A Key to Successful Cassandra MigrationDataStax Academy
 
Reltio: Powering Enterprise Data-driven Applications with Cassandra
Reltio: Powering Enterprise Data-driven Applications with CassandraReltio: Powering Enterprise Data-driven Applications with Cassandra
Reltio: Powering Enterprise Data-driven Applications with CassandraDataStax Academy
 
Glassbeam: Ad-hoc Analytics on Internet of Complex Things with Apache Cassand...
Glassbeam: Ad-hoc Analytics on Internet of Complex Things with Apache Cassand...Glassbeam: Ad-hoc Analytics on Internet of Complex Things with Apache Cassand...
Glassbeam: Ad-hoc Analytics on Internet of Complex Things with Apache Cassand...DataStax Academy
 
DataStax: What's New in Apache TinkerPop - the Graph Computing Framework
DataStax: What's New in Apache TinkerPop - the Graph Computing FrameworkDataStax: What's New in Apache TinkerPop - the Graph Computing Framework
DataStax: What's New in Apache TinkerPop - the Graph Computing FrameworkDataStax Academy
 
Target: Escaping Disco-Era Data Modeling
Target: Escaping Disco-Era Data ModelingTarget: Escaping Disco-Era Data Modeling
Target: Escaping Disco-Era Data ModelingDataStax Academy
 
DataStax: Datastax Enterprise - The Multi-Model Platform
DataStax: Datastax Enterprise - The Multi-Model PlatformDataStax: Datastax Enterprise - The Multi-Model Platform
DataStax: Datastax Enterprise - The Multi-Model PlatformDataStax Academy
 
Cisco UCS Integrated Infrastructure for Big Data with Cassandra
Cisco UCS Integrated Infrastructure for Big Data with CassandraCisco UCS Integrated Infrastructure for Big Data with Cassandra
Cisco UCS Integrated Infrastructure for Big Data with CassandraDataStax Academy
 
Timeli: Believing Cassandra: Our Big-Data Journey To Enlightenment under the ...
Timeli: Believing Cassandra: Our Big-Data Journey To Enlightenment under the ...Timeli: Believing Cassandra: Our Big-Data Journey To Enlightenment under the ...
Timeli: Believing Cassandra: Our Big-Data Journey To Enlightenment under the ...DataStax Academy
 
TupleJump: Breakthrough OLAP performance on Cassandra and Spark
TupleJump: Breakthrough OLAP performance on Cassandra and SparkTupleJump: Breakthrough OLAP performance on Cassandra and Spark
TupleJump: Breakthrough OLAP performance on Cassandra and SparkDataStax Academy
 

Destacado (15)

Petabridge: The New .NET Enterprise Stack
Petabridge: The New .NET Enterprise StackPetabridge: The New .NET Enterprise Stack
Petabridge: The New .NET Enterprise Stack
 
DataStax: The Whys of NoSQL
DataStax: The Whys of NoSQLDataStax: The Whys of NoSQL
DataStax: The Whys of NoSQL
 
DataStax: Steps to successfully implementing NoSQL in the enterprise
DataStax: Steps to successfully implementing NoSQL in the enterpriseDataStax: Steps to successfully implementing NoSQL in the enterprise
DataStax: Steps to successfully implementing NoSQL in the enterprise
 
IBM Spark Technology Center: Real-time Advanced Analytics and Machine Learning with Spark and Cassandra

  • 1. Apache
    Chris Fregly
    Solution Architect
  • 2. Use Cases
    • Relevant
      – Operations that affect all data in data set
      – Batch Analytics
    • Less relevant
      – Fine-grained updates to shared state
      – Web apps
  • 3. Performance
    • 20x Hadoop for iterative ML and graph apps
      – Avoids IO and deserialization by storing data in memory as direct Java objects (versus binary data)
    • 40x Hadoop for analytics
    • Query 1TB dataset interactively with 5-7s latency
    • 25-100 m1.xlarge EC2 instances
      – 4 cores, 15GB RAM, HDFS w/ 256MB blocks
    • 10 iterations on 100GB datasets
  • 8. Interactive
    • Scala REPL
    • Python interpreter
  • 9. RDD
    • Resilient Distributed Dataset
      – Immutable
      – Flexible storage strategies
        • In-memory, spill-to-disk priority
      – Partitioned by key
        • Optimizes joins with data locality
      – Parallel
      – Fault-tolerant
      – Supports coarse-grained transformations
        • Versus fine-grained updates to shared state
      – Enables in-memory data reuse between iterations
        • Common in machine learning and graph algorithms
        • No disk IO serialization needed between iterations
      – Rich set of operations including map, flatMap, filter, join, groupByKey, reduceByKey, etc.
  • 10. Lineage
    • Logs coarse-grained RDD transformations
    • Enables fault tolerance
    • Reconstruct by replaying log of parent/derived RDDs and their transformations
      – Recomputed in parallel on separate nodes
    • ~10KB versus 2x replication of complete dataset
    • Large lineages may benefit from checkpointing*
    *more later
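The replay idea on this slide can be illustrated with a small plain-Python model. This is only a sketch, not Spark's API: the `LineageRDD` class and its `compute` and `lose` methods are hypothetical names, and it assumes the source partitions themselves are stored durably.

```python
from typing import Callable, Dict, List


class LineageRDD:
    """Toy dataset that logs how it was derived instead of replicating its data."""

    def __init__(self, source: List[List[int]], steps=()):
        self.source = source          # durable input, one list per partition (assumption)
        self.steps = tuple(steps)     # lineage log of coarse-grained transformations
        self.cache: Dict[int, List[int]] = {}

    def map(self, fn: Callable[[int], int]) -> "LineageRDD":
        return LineageRDD(self.source, self.steps + (("map", fn),))

    def filter(self, pred: Callable[[int], bool]) -> "LineageRDD":
        return LineageRDD(self.source, self.steps + (("filter", pred),))

    def compute(self, part: int) -> List[int]:
        """Return one partition, replaying the log if it is not in memory."""
        if part not in self.cache:
            data = list(self.source[part])
            for kind, fn in self.steps:
                if kind == "map":
                    data = [fn(x) for x in data]
                else:
                    data = [x for x in data if fn(x)]
            self.cache[part] = data
        return self.cache[part]

    def lose(self, part: int) -> None:
        """Simulate a node failure that drops one in-memory partition."""
        self.cache.pop(part, None)


rdd = LineageRDD([[1, 2, 3], [4, 5, 6]]).map(lambda x: x * 10).filter(lambda x: x > 20)
before = rdd.compute(1)
rdd.lose(1)
after = rdd.compute(1)   # rebuilt from the log; partition 0 is never touched
```

Only the lost partition is recomputed, which is why the log can stay small compared with a full replica of the data.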
  • 11. Dependencies
    • Two types of parent-child RDD dependencies
    • Narrow
      – Each partition of the parent RDD is used by at most one partition of the child RDD
      – Allows pipelined execution on a single node
      – Node failure still allows partition-level recovery (parallelizable)
      – e.g. map(), union(), sample()
    • Wide
      – Each partition of the parent RDD is used by many partitions of the child RDD
      – Requires partition of each parent RDD to be shuffled across nodes
      – Node failure may lead to complete re-execution if loss affects partitions of all parent RDDs (bad)
      – e.g. join()*
    *unless parents are hash-partitioned
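The narrow/wide distinction can be stated as a check over a partition mapping. The following is a plain-Python sketch under the slide's definition; `classify_dependency` and `parents_needed` are illustrative names, not part of any Spark API.

```python
from typing import Dict, Set


def classify_dependency(used_by: Dict[int, Set[int]]) -> str:
    """used_by maps each parent partition to the child partitions that read it."""
    if all(len(children) <= 1 for children in used_by.values()):
        return "narrow"
    return "wide"


def parents_needed(child: int, used_by: Dict[int, Set[int]]) -> Set[int]:
    """Parent partitions that must be available to rebuild one lost child partition."""
    return {parent for parent, children in used_by.items() if child in children}


map_like = {0: {0}, 1: {1}, 2: {2}}       # one-to-one, as in map()
shuffle_like = {0: {0, 1}, 1: {0, 1}}     # every parent feeds every child

narrow_kind = classify_dependency(map_like)
wide_kind = classify_dependency(shuffle_like)
narrow_recovery = parents_needed(1, map_like)
wide_recovery = parents_needed(1, shuffle_like)
```

With the narrow mapping, rebuilding a child partition needs a single parent partition; with the wide mapping it needs all of them, which is the re-execution risk the slide flags.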
  • 13. Operations
    • Two Types of RDD Operations
      – Transformations
        • Define RDDs
        • Lazy initialization
      – Actions
        • Compute
        • Store
    • Some operations are only available for key-value pairs
      – join(), groupByKey(), reduceByKey()
    • map()
      – One-to-one transformation
    • flatMap()
      – Zero-to-many transformation
      – Most similar to map() in MapReduce
    • Partitioning
      – sort(), groupByKey(), reduceByKey()
      – Results in hash or range-partitioned RDD
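The difference between map() and flatMap() can be shown with ordinary Python iterables. This sketch does not use Spark; `rdd_map` and `rdd_flat_map` are hypothetical helper names chosen for the illustration.

```python
from itertools import chain


def rdd_map(fn, items):
    """One-to-one: exactly one output per input (lazy generator)."""
    return (fn(x) for x in items)


def rdd_flat_map(fn, items):
    """Zero-to-many: each input yields a sequence, and the sequences are flattened."""
    return chain.from_iterable(fn(x) for x in items)


lines = ["to be", "", "or not"]
lengths = list(rdd_map(len, lines))            # one length per line
words = list(rdd_flat_map(str.split, lines))   # the empty line contributes nothing
```

Both helpers return lazy iterables, so nothing is computed until `list(...)` consumes them, which loosely mirrors the lazy initialization of transformations.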
  • 15. Transformations
    • RDD -> RDD
      lines = spark.textFile("hdfs://…")
      errors = lines.filter(_.startsWith("ERROR"))
      errors.persist()
  • 16. Actions
    • RDDs are lazily materialized when an action is invoked
    • count()
      – Counts the elements
    • collect()
      – Returns the elements
    • save()
      – Persists to storage
    • persist()
      – Reuse in future iterations
      – Can spill to disk if memory is low
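Lazy materialization and the effect of persist() can be modeled in a few lines of plain Python. This is a sketch under simplifying assumptions (a single in-process dataset, no partitions); `LazyDataset` and `load` are hypothetical names and none of this is Spark's implementation.

```python
class LazyDataset:
    """Toy dataset whose contents are only computed when an action runs."""

    def __init__(self, compute):
        self._compute = compute   # zero-argument function producing a list
        self._persist = False
        self._cached = None

    def filter(self, pred):
        # Transformation: returns a new dataset, runs nothing yet.
        return LazyDataset(lambda: [x for x in self._materialize() if pred(x)])

    def persist(self):
        self._persist = True
        return self

    def _materialize(self):
        if self._cached is not None:
            return self._cached
        result = self._compute()
        if self._persist:
            self._cached = result
        return result

    def count(self):      # action
        return len(self._materialize())

    def collect(self):    # action
        return list(self._materialize())


reads = []   # records every time the source is actually read


def load():
    reads.append(1)
    return ["ERROR disk", "INFO ok", "ERROR net"]


lines = LazyDataset(load)
errors = lines.filter(lambda s: s.startswith("ERROR")).persist()
reads_before_action = len(reads)   # still zero: only transformations so far
error_count = errors.count()       # first action triggers the read
error_lines = errors.collect()     # served from the persisted result
reads_after_actions = len(reads)
```

The source is read once even though two actions ran, because the filtered result was persisted after the first action.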
  • 17. Partitioning
    • Useful for grouping related data onto same node
    • Improves join() performance
    • Partitions
      – One partition for each block
    • PreferredLocations
      – Block-to-node mapping
    • Iterator
      – Reads the blocks
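Why co-partitioning helps join() can be sketched in plain Python: if both datasets are split with the same hash function and partition count, matching keys land in the same partition index and each pair of partitions can be joined locally. The helper names and sample data below are assumptions for illustration, and integer keys are used so that Python's `hash` is deterministic.

```python
from collections import defaultdict


def hash_partition(pairs, num_partitions):
    """Place each (key, value) pair in the partition chosen by the key's hash."""
    parts = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        parts[hash(key) % num_partitions].append((key, value))
    return parts


def local_join(left_part, right_part):
    """Join two co-located partitions without looking at any other partition."""
    index = defaultdict(list)
    for key, value in left_part:
        index[key].append(value)
    return [(k, (lv, rv)) for k, rv in right_part for lv in index.get(k, [])]


users = [(1, "ann"), (2, "bob"), (3, "cy")]
visits = [(1, "/home"), (3, "/cart"), (1, "/pay")]
num_partitions = 2

joined = []
for left, right in zip(hash_partition(users, num_partitions),
                       hash_partition(visits, num_partitions)):
    joined.extend(local_join(left, right))   # no cross-partition traffic needed
joined.sort()
```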
  • 18. Job Scheduling
    • Driver
      – Defines RDDs by performing transformations
    • Worker
      – Long-lived processes
      – Perform actions on RDD partitions
      – Data locality is optimized
    • Driver passes Closures (Function Literals) to Worker nodes for execution
  • 19. Job Scheduling
    • Actions cause the scheduler to do the following:
      – Examine the RDD's lineage graph
      – Build a DAG of stages to execute
      – Computes missing partitions at each stage until the target RDD is reached
    • Stages
      – Optimized to pipeline transformations of narrow dependencies
      – Shuffle operations for wide dependencies
        • Materializes intermediate records on the node holding the partition of the parent RDD
      – Short-circuits for already-computed partitions of parent RDD
    • Delay scheduling (TODO: Understand this better)
    • Data locality (in-memory or on-disk) is optimized
    • Node failures
      – Resubmits tasks to compute the missing partitions in parallel
    • Scheduler failures are currently not handled
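The way stages follow from dependency types can be sketched for the simplest case. This plain-Python example assumes the lineage is a linear chain (a real lineage graph is a DAG) and uses a hypothetical `build_stages` helper: narrow operations are pipelined into the current stage, and each wide operation starts a new one.

```python
def build_stages(chain):
    """chain: list of (operation, dependency) from source to target,
    where dependency is 'narrow' or 'wide'. Returns a list of stages."""
    stages = [[]]
    for op, dependency in chain:
        if dependency == "wide" and stages[-1]:
            stages.append([])      # a shuffle boundary closes the current stage
        stages[-1].append(op)
    return stages


job = [
    ("textFile", "narrow"),
    ("map", "narrow"),
    ("filter", "narrow"),
    ("reduceByKey", "wide"),
    ("map", "narrow"),
]
stages = build_stages(job)
```

Here the first three operations run as one pipelined stage, and the shuffle for `reduceByKey` begins a second stage that also absorbs the trailing `map`.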
  • 20. Memory
    • LRU eviction policy for RDDs
    • Cycles are prevented
    • Persistence priority can be set for each RDD
    • Spill to disk
      – Worst case performance is similar to existing MapReduce performance due to IO
    • Each Spark instance has its own memory space*
    *Future improvement may include unified memory manager to share RDDs across Spark instances
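LRU eviction with spill-to-disk can be modeled with `collections.OrderedDict`. This is a toy sketch, not Spark's memory manager: `LruPartitionCache` is a hypothetical name, capacity is counted in entries rather than bytes, and a plain dict stands in for disk.

```python
from collections import OrderedDict


class LruPartitionCache:
    """Keeps the most recently used partitions in memory and spills the rest."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.memory = OrderedDict()   # oldest entry first
        self.spilled = {}             # stands in for disk (assumption)

    def put(self, key, value):
        self.memory[key] = value
        self.memory.move_to_end(key)
        while len(self.memory) > self.capacity:
            old_key, old_value = self.memory.popitem(last=False)
            self.spilled[old_key] = old_value

    def get(self, key):
        if key in self.memory:
            self.memory.move_to_end(key)    # mark as recently used
            return self.memory[key]
        value = self.spilled.pop(key)       # slow path: read back from "disk"
        self.put(key, value)
        return value


cache = LruPartitionCache(capacity=2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")        # "a" becomes the most recently used entry
cache.put("c", 3)     # evicts "b", the least recently used entry
in_memory = list(cache.memory)
on_disk = dict(cache.spilled)
```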
  • 21. Checkpointing
    • Snapshot of lineage graph
    • Asynchronously replicate RDDs to other nodes
      – Immutability allows this to happen in the background without a stop-the-world event
    • Speeds up recovery of lost RDD partitions
      – Useful for large lineage graphs containing wide dependencies which could require full re-computation
    • Lineage graphs containing narrow dependencies are less likely to require full re-computation
    • Lineage is forgotten after checkpoint occurs to avoid unbounded lineage graphs
    • User can determine checkpoint strategy*
    *Future improvements may include an automatic checkpoint policy based on data set characteristics (size, dependency types, and initial time to compute)
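The statement that lineage is forgotten after a checkpoint can be illustrated with a small plain-Python model. It is a sketch only: `CheckpointedValue` is a hypothetical class, it mutates itself for brevity (real RDDs are immutable), and the checkpoint here is synchronous rather than asynchronous.

```python
class CheckpointedValue:
    """Toy dataset with a durable snapshot plus the transformations applied since."""

    def __init__(self, base):
        self.base = list(base)   # last durable snapshot
        self.lineage = []        # transformations logged since that snapshot

    def transform(self, fn):
        self.lineage.append(fn)
        return self

    def recompute(self):
        """Recovery cost grows with the number of logged transformations."""
        data = list(self.base)
        for fn in self.lineage:
            data = [fn(x) for x in data]
        return data

    def checkpoint(self):
        """Save the current contents and drop the lineage that produced them."""
        self.base = self.recompute()
        self.lineage = []


ranks = CheckpointedValue([1.0, 2.0])
for _ in range(10):                       # e.g. ten iterations of an iterative job
    ranks.transform(lambda x: x * 0.5 + 1)

depth_before = len(ranks.lineage)
value_before = ranks.recompute()
ranks.checkpoint()
depth_after = len(ranks.lineage)
value_after = ranks.recompute()
```

After the checkpoint, recovery no longer replays ten steps; it starts from the saved snapshot and yields the same values.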
  • 22. Runtime
    • Runs standalone, YARN, Hadoop v1.x
    • Mesos
      – Cluster-resource isolation across distributed applications
      – Fine-grained sharing
  • 23. Data Storage
    • HDFS, HBase, S3, local files
  • 24. Development Tools
    • (TODO: Verify all of this)
    • Scala 2.10
    • Java 1.6+
    • Python 2.7 + NumPy
    • Spark Debugger
      – Lineage graph inspector/replayer
  • 25. Shark
    • Hive on Spark
    • Run unmodified Hive queries on existing data warehouse
    • Supports UDFs, Metastore, etc.
    • Call MLlib directly on Hive tables
    • Use the same Hive queries for real-time, near-real-time, and batch!
  • 26. MLlib
    • Machine learning algorithms
      – Naïve-Bayes Classifier
      – K-means clustering
      – Linear and logistic regression
      – Alternating least squares collaborative filtering
        • Explicit ratings
        • Implicit feedback
      – Stochastic gradient descent
  • 27. Real-world Uses
    • Dynamic stream-switching based on real-time network conditions
    • Predict traffic congestion using expectation maximization (EM) machine learning algorithm
    • Monarch Project: Twitter spam classification using logistic regression machine learning algorithm
  • 28. GraphX
    • Data-parallel and graph-parallel system
    • Ability to join graph data and table data
    • "Think like a vertex"
    • Vertex-locality is optimized across the cluster
  • 29. Streaming
    • Sources
      – Kafka, Flume, Twitter Firehose, ZeroMQ, Kinesis
    • Custom connectors
    • Data must be stored reliably in case recompute is needed upon node failure
    • Discretized streams of RDDs (D-Streams)
    • Series of deterministic batch computations over small time intervals (versus record-at-a-time like Storm and S4)
    • Determinism allows parallel re-computation on failure and speculative execution on stragglers
    • Speculative execution happens if a task runs more than 1.4x longer than the median task in its job stage
    • 0.5-2s latencies
      – Good enough, but not meant for high-frequency trading
      – Allows better fault-tolerance and efficiency
    • Clear, exactly-once consistency semantics
      – Each record passed through the whole deterministic process
    • Distributed state is far easier to reason about
    • No long-lived, stateful operations or cross-stream ordering
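Two ideas from this slide, cutting a stream into small time-interval batches and the 1.4x-median rule for speculative execution, can be sketched in plain Python. The function names, sample records, and runtimes are assumptions for illustration; this is not the Spark Streaming API.

```python
from statistics import median


def discretize(records, interval):
    """Group (timestamp, value) records into batches of `interval` seconds."""
    batches = {}
    for timestamp, value in records:
        batches.setdefault(int(timestamp // interval), []).append(value)
    return [batches[index] for index in sorted(batches)]


def stragglers(runtimes, factor=1.4):
    """Tasks running more than `factor` times the median task runtime in the stage."""
    threshold = factor * median(runtimes.values())
    return sorted(task for task, seconds in runtimes.items() if seconds > threshold)


records = [(0.1, "a"), (0.4, "b"), (0.6, "c"), (1.2, "d")]
batches = discretize(records, interval=0.5)

runtimes = {"t0": 1.0, "t1": 1.1, "t2": 0.9, "t3": 2.0}
to_speculate = stragglers(runtimes)
```

Because each batch is processed deterministically, re-running a straggling task on another node produces the same result, which is what makes speculation safe.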
  • 30. Streaming
    •  Seamlessly interoperates with batch and interactive processing, all based on RDDs
      –  Real-time, ad-hoc queries against the live system
      –  Helps avoid overfitting in certain ML algorithms by combining real-time (sparse) data with historical data
    •  Timing Issues
      –  Supports “slack time” to allow late-arriving records to be part of the correct batch
        •  Adds fixed latency, of course
      –  Application-level incremental reduce() to add the new records to a previous batch
    •  Sliding Window
      –  Incremental aggregation
      –  Incremental reduceByWindow to aggregate within a given window
      –  If the aggregation function is invertible, you can subtract values to avoid duplicate calculations
    •  State Tracking
      –  track() operator
        •  Transforms streams of (key, event) into streams of (key, state)
        •  initialize, update, timeout functions
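The invertible-aggregation point under Sliding Window can be illustrated with a small plain-Python sketch. It assumes the aggregate is a sum (whose inverse is subtraction) and that each batch has already been reduced to a single number; the class name `InvertibleWindowSum` and the window size are hypothetical and do not come from the Spark API.

```python
from collections import deque


class InvertibleWindowSum:
    """Sliding-window sum that updates incrementally instead of re-adding the window."""

    def __init__(self, window_batches):
        self.window_batches = window_batches  # window length, in batches
        self.batches = deque()                # per-batch sums currently in the window
        self.total = 0

    def add_batch(self, batch_sum):
        # Add the newest batch to the running total.
        self.batches.append(batch_sum)
        self.total += batch_sum
        # Subtract the batch that just slid out of the window (the "inverse" step).
        if len(self.batches) > self.window_batches:
            self.total -= self.batches.popleft()
        return self.total


window = InvertibleWindowSum(window_batches=3)
totals = [window.add_batch(s) for s in [5, 3, 4, 10, 1]]
print(totals)  # [5, 8, 12, 17, 15]
```

Each new interval costs one addition and one subtraction regardless of window length, which is the saving the slide refers to. A non-invertible aggregate such as max would need the whole window to be re-reduced.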
  • 31. Streaming
    •  Master
      –  D-Stream lineage graph
      –  Task scheduler
      –  Block tracker
    •  Workers
      –  Receive input RDDs
      –  Clock-synced with NTP
      –  Execute tasks
        •  Notify the master of new input RDDs at regular intervals
        •  Receive tasks from the master
        •  Compute new RDD partitions
      –  Manage the in-memory Block Store
        •  Stores partitions of immutable input RDDs and computed RDDs
        •  Each block is given a unique ID
        •  LRU cache
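The worker's Block Store, as described above, keeps blocks under unique IDs and evicts them in LRU order. A minimal sketch of that behavior in plain Python follows. The class name `BlockStore`, the block IDs, and measuring capacity as a number of blocks rather than bytes are assumptions for illustration, not details from the deck.

```python
from collections import OrderedDict


class BlockStore:
    """In-memory store of blocks keyed by unique ID, evicting least recently used first."""

    def __init__(self, capacity):
        self.capacity = capacity     # maximum number of blocks held (assumed unit)
        self.blocks = OrderedDict()  # oldest entry first, most recently used last

    def put(self, block_id, data):
        if block_id in self.blocks:
            self.blocks.move_to_end(block_id)
        self.blocks[block_id] = data
        # Evict least recently used blocks until we are back under capacity.
        while len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)

    def get(self, block_id):
        if block_id not in self.blocks:
            return None  # caller would have to recompute or refetch the block
        self.blocks.move_to_end(block_id)  # mark as most recently used
        return self.blocks[block_id]


store = BlockStore(capacity=2)
store.put("rdd1-p0", [1, 2])
store.put("rdd1-p1", [3, 4])
store.get("rdd1-p0")        # touch p0 so that p1 becomes the eviction candidate
store.put("rdd2-p0", [5])   # exceeds capacity, evicts "rdd1-p1"
print(list(store.blocks))   # ['rdd1-p0', 'rdd2-p0']
```

Since the stored RDD partitions are immutable, an evicted block never needs to be written back; it can simply be dropped.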
  • 32. Streaming
    •  Pipelined operators can be grouped into a single task
      –  map().map()
    •  Task placement on workers
      –  Data-locality aware
        •  Choose a worker that contains the data block
      –  Partition-aware
        •  Same keys on the same node to avoid shuffling
    •  Block placement based on worker load
    •  Master not fault-tolerant (yet)*
    *Future work to persist the D-Streams graph to allow a standby to take over
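The data-locality rule above (prefer a worker that already holds the block) can be sketched as a simple selection function. This is a hedged illustration in plain Python: `place_task`, the worker names, and the fallback to the least-loaded worker when no local copy exists are assumptions, not the actual scheduler logic.

```python
def place_task(block_id, block_locations, worker_load):
    """Pick a worker for a task that reads block_id.

    block_locations: dict of block id -> list of workers holding that block
    worker_load:     dict of worker -> number of tasks currently assigned
    """
    # Prefer live workers that already hold the block (data locality).
    local = [w for w in block_locations.get(block_id, []) if w in worker_load]
    # Assumed fallback: any worker, if the block is not held locally anywhere.
    candidates = local or list(worker_load)
    # Among the candidates, take the least loaded; the name breaks ties.
    return min(candidates, key=lambda w: (worker_load[w], w))


block_locations = {"b1": ["w2", "w3"], "b2": []}
worker_load = {"w1": 0, "w2": 5, "w3": 2}

print(place_task("b1", block_locations, worker_load))  # w3: holds b1, lighter than w2
print(place_task("b2", block_locations, worker_load))  # w1: no local copy, least loaded
```

Note that `w1` is idle yet is not chosen for `b1`: reading the block locally is treated as more valuable than balancing load, which matches the slide's ordering of concerns.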
  • 33. Streaming Performance
    •  100 m1.xlarge EC2 instances
      –  4 cores, 15GB RAM
    •  100-byte input records
    •  1 second latency target
      –  500 ms input intervals
    •  6GB/s throughput!
    •  Linear scalability
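To put the throughput figure in per-record terms, the slide's numbers can be worked through as below. The calculation assumes that 6GB/s is the aggregate rate across all 100 instances, that GB means 10^9 bytes, and that load is spread evenly over nodes and cores; none of these is stated on the slide.

```python
NODES = 100
CORES_PER_NODE = 4
RECORD_BYTES = 100
CLUSTER_BYTES_PER_SEC = 6 * 10**9  # assumes decimal gigabytes, cluster-wide

records_per_sec = CLUSTER_BYTES_PER_SEC // RECORD_BYTES   # records/s for the cluster
records_per_node = records_per_sec // NODES               # records/s per instance
records_per_core = records_per_node // CORES_PER_NODE     # records/s per core
mb_per_node = CLUSTER_BYTES_PER_SEC / NODES / 10**6       # MB/s per instance

print(records_per_sec)   # 60000000
print(records_per_node)  # 600000
print(records_per_core)  # 150000
print(mb_per_node)       # 60.0
```

Under those assumptions each instance handles about 60 MB/s, or 600,000 records per second.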
  • 34. Other Streaming Solutions
    •  Replication
      –  Requires a synchronization protocol (Flux, DPC) to preserve order
    •  Upstream backup
      –  Upon failure, parent replays messages since the last checkpoint to the standby
      –  Storm
        •  At-least-once delivery
        •  Requires application code to recover state
      –  Storm + Trident
        •  Keeps state in a replicated DB
        •  Commits updates in batches
        •  Requires updates to be replicated across the network within a transaction
        •  Costly, but recovers quickly
    •  Doesn’t handle stragglers/slow nodes
      –  Will slow down the whole system given the required synchronization
  • 36. Streaming Future Work
    •  Dynamically adjust level of parallelism at each stage
    •  Dynamically adjust interval size based on load
      –  Lower interval size at low load for lower latencies
    •  Automatic checkpointing based on data set characteristics (size, dependency types, and initial time to compute)
    •  Limit ad-hoc query resources to avoid slowing the overall streaming system
    •  Return partial results in case of failure
      –  Launch a child task before its parents are done
      –  Provide lineage data to know which parents are missing
  • 38. References
    •  http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
    •  http://www.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-259.pdf
    •  http://spark.apache.org/
  • 39. Available Training
    •  Spark Streaming
      –  AWS Kinesis-Spark Streaming integration
    •  Shark
      –  Hive queries with Spark/HDFS
    •  MLlib
      –  Anomaly detection with Spark
    •  GraphX
      –  Distributed graph processing with Spark
    •  Installing Spark on Mesos
    •  Installing Spark on YARN
  • 40. Upcoming Talks
    •  AWS Meetup (SF, Oakland, and Berkeley)
      –  http://www.meetup.com/AWS-Berkeley
    •  Spark Meetup (SF)
      –  http://www.meetup.com/spark-users/
    •  Advanced AWS Meetup (SF and Peninsula)
      –  http://www.meetup.com/AdvancedAWS/
    •  Netflix Internal CloudU (Los Gatos)
    •  Scala Meetup (SF)
      –  http://www.meetup.com/SF-Scala/