Artmosphere
Discover the beauty not yet seen
Keira Zhou
October, 2015

Demo
• www.artmosphere.nyc
• http://54.215.136.187:5000
• Video: https://youtu.be/skzZ7sosC8c
  
Data Pipeline
Data Source:
• Web API: JSON files
• Self-engineered user activity log

Cluster Setup
• 8GB Memory, 50GB Storage
• 8GB Memory, 1TB Storage
• 8GB Memory, 1TB Storage
• 8GB Memory, 1TB Storage
  
Dataset
• Artsy.net:
  • About 26K artworks
  • About 45K artists
  • JSON
• Generated User Log:
  • Simulated user “collect” activities
  • Multiplied real user location

Search Artwork by Title
• Raw JSON of 45K artworks
• Spark - Elasticsearch
  
Real-Time Streaming
• Track the trend of each artwork
• Spark Streaming - Cassandra
  
Batch Processing
• Artists Location
• Similar Artworks
• Spark - Cassandra
  
Batch Processing: Artists Location
• Spark: processed 507GB of data

Challenges
• Find the right database for your problem
  • Elasticsearch for search
  • Cassandra for time series
• Computers are multilingual: Python, Scala, Java…
• But challenges make life interesting

About Me
• MS & BS in Systems Engineering, University of Virginia
• Machine Learning
• Natural Language Processing
• High Performance Computing
• Enjoy: sketching, adventures, laughing, life

Backup Slides

Batch Processing: Art Similarity
• Artworks are manually tagged
• Input format:
  • [art1]: [tag1][tag2][tag3]…
• Compute common tags between two artworks
  • Spark - Cassandra
  • Could also use Collaborative Filtering (MLlib in Spark)
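
The common-tag idea can be sketched in a few lines of plain Python. This is only an illustration of the pairwise comparison, not the Spark job used in the project: the function names and the sample tags are made up, and the real job would distribute the pairing across the cluster.

```python
from itertools import combinations


def parse_line(line):
    """Parse '[art1]: [tag1][tag2]' into (art_id, set_of_tags)."""
    art_part, tag_part = line.split(":", 1)
    art_id = art_part.strip().strip("[]")
    # Splitting on ']' leaves pieces like ' [tag1'; strip the bracket and spaces
    tags = {t.strip("[ ") for t in tag_part.split("]") if t.strip("[ ")}
    return art_id, tags


def common_tag_counts(lines):
    """Return {(art_a, art_b): number_of_shared_tags} for pairs sharing a tag."""
    artworks = dict(parse_line(line) for line in lines)
    result = {}
    for a, b in combinations(sorted(artworks), 2):
        shared = artworks[a] & artworks[b]
        if shared:
            result[(a, b)] = len(shared)
    return result


# Hypothetical sample input in the slide's format
sample = [
    "[art1]: [cubism][portrait][oil]",
    "[art2]: [cubism][oil][landscape]",
    "[art3]: [photography][portrait]",
]
scores = common_tag_counts(sample)
for pair, count in scores.items():
    print(pair, count)
```

Comparing every pair grows quadratically with the number of artworks, which is one reason to run this step as a distributed batch job.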
  
Benchmark Reads/Writes
Source: “Benchmarking Top NoSQL Databases”, End Point:
http://www.datastax.com/wp-content/themes/datastax-2014-08/files/NoSQL_Benchmarks_EndPoint.pdf
[Charts: operations / sec for Cassandra | Couchbase | HBase | MongoDB, under two workloads: "Most Writes" and "Writes/Reads Balanced"]
  
Cassandra Time Series

            time_stamp_1  time_stamp_2  time_stamp_3  …
art_id_1    3             1             2             …
art_id_2    5             3             1             …
art_id_3    1             4             2             …
…           …             …             …             …

• Rows (art_id): Primary Key (Partition Key)
• Columns (time_stamp): Primary Key (Clustering Key), with Clustering Order By (Desc)
• Compound Primary Key (art_id, time_stamp)
  • art_id: Partition key, responsible for data distribution across nodes
  • time_stamp: Clustering key, responsible for data sorting within the partition
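
To make the roles of the two key parts concrete, here is a toy in-memory model in Python. It is not Cassandra and does not use any Cassandra driver; the class and method names are invented for illustration. It only mimics the layout: one partition per art_id, rows inside a partition kept sorted by time_stamp and read newest first.

```python
import bisect
from collections import defaultdict


class TimeSeriesTable:
    """Toy model of a table keyed by (art_id, time_stamp)."""

    def __init__(self):
        # partition key -> list of (time_stamp, count), kept sorted ascending
        self._partitions = defaultdict(list)

    def insert(self, art_id, time_stamp, count):
        rows = self._partitions[art_id]
        keys = [ts for ts, _ in rows]
        i = bisect.bisect_left(keys, time_stamp)
        if i < len(rows) and rows[i][0] == time_stamp:
            rows[i] = (time_stamp, count)  # same primary key: overwrite
        else:
            rows.insert(i, (time_stamp, count))

    def latest(self, art_id, limit):
        """Newest rows first, mimicking a descending clustering order."""
        rows = self._partitions.get(art_id, [])
        return list(reversed(rows))[:limit]


table = TimeSeriesTable()
table.insert("art_id_1", 2, 1)
table.insert("art_id_1", 1, 3)
table.insert("art_id_1", 3, 2)
table.insert("art_id_2", 1, 5)

recent = table.latest("art_id_1", 2)
print(recent)
```

Because a read for one artwork touches a single partition and the rows are already ordered by time, fetching the most recent counts needs no extra sorting.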
  
Transactions in Cassandra
• Write “Life is good” with Consistency = all: the write is sent to Node 1 and Node 2
• Node 1: “Life is good” is written, Report SUCCESS
• Node 2: “Life is …” <Job failed>, Report FAIL -> Rollback
• Final report: FAIL
• Result: Node 1 still holds “Life is good”; Node 2 holds nothing
• Write atomicity is at the partition-level
Datastax: http://docs.datastax.com/en/cassandra/2.1/cassandra/dml/dml_atomicity_c.html

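
The scenario on this slide can be replayed with a small simulation. This is a simplified sketch, not Cassandra's implementation: the Node class and write_all function are hypothetical, and the only point is that a failed overall report does not undo the write on the replica that succeeded.

```python
class Node:
    """Hypothetical replica that can be told to fail its writes."""

    def __init__(self, name, fail=False):
        self.name = name
        self.fail = fail
        self.data = {}

    def write(self, key, value):
        if self.fail:
            return False  # nothing is stored on this replica
        self.data[key] = value
        return True


def write_all(nodes, key, value):
    """Send the write to every replica; report success only if all acknowledge."""
    acks = [node.write(key, value) for node in nodes]
    # No attempt is made to undo writes that succeeded on other replicas
    return all(acks)


node1 = Node("Node 1")
node2 = Node("Node 2", fail=True)
ok = write_all([node1, node2], "quote", "Life is good")

print("final report:", "SUCCESS" if ok else "FAIL")
print(node1.name, node1.data)
print(node2.name, node2.data)
```

The client sees FAIL, yet Node 1 keeps the value, so the application has to be prepared to retry or reconcile.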
Batch Processing: Artists Location
• Spark: processed 507GB of data

File Size in GB   4.7   14.5   28.5   43    101     202.5   318.5   507
Time in min       1.5   5.3    6.5    9.5   21.25   42.87   67.1    110
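
The two rows of measurements can be turned into throughput with a little arithmetic. The sketch below only divides the slide's file sizes by the slide's run times; the variable names are chosen here for illustration.

```python
# Values copied from the slide's table
sizes_gb = [4.7, 14.5, 28.5, 43, 101, 202.5, 318.5, 507]
minutes = [1.5, 5.3, 6.5, 9.5, 21.25, 42.87, 67.1, 110]

# Throughput in GB per minute for each run
throughput = [round(size / time, 2) for size, time in zip(sizes_gb, minutes)]

for size, time, rate in zip(sizes_gb, minutes, throughput):
    print(f"{size:>6} GB in {time:>6} min -> {rate} GB/min")
```

Every run lands between roughly 2.7 and 4.8 GB per minute, which suggests run time grew close to linearly with input size on this cluster.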
Spark vs. Hadoop

Fault Tolerance
• Spark: via RDDs (they rebuild lost data on failure using lineage: each RDD remembers how it was built from other datasets so it can rebuild itself)
• Hadoop MapReduce: via replication
Cache Data into Memory
• Spark: Yes. Supports in-memory data sharing across directed acyclic graphs (RDD); well-suited to machine learning algorithms
• Hadoop MapReduce: No. Each job reads data from stable storage (e.g. file systems)
Write Intermediate Files into Hard Disk
• Spark: Yes, during shuffle
• Hadoop MapReduce: Yes, during MapReduce

Spark Streaming vs. Storm

Processing Model
• Spark Streaming: Micro batches
• Storm: One record at a time
Latency
• Spark Streaming: Few seconds
• Storm: Sub-second
Fault tolerance (every record processed)
• Spark Streaming: Exactly once (tracks processing at the batch level)
• Storm: At least once (may have duplicates when recovering from a fault)
Implemented In
• Spark Streaming: Scala
• Storm: Clojure

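
The micro-batch processing model can be illustrated without any streaming framework. The sketch below is plain Python, not the Spark Streaming API: the function name, the event tuples, and the 2-second interval are assumptions made for the example. Events are grouped into fixed-length time windows and each window is processed as one unit, which is where the few seconds of latency come from.

```python
from collections import Counter, defaultdict


def micro_batches(events, batch_seconds):
    """Group (timestamp_seconds, art_id) events into fixed-length batches
    and count the events per artwork within each batch."""
    batches = defaultdict(list)
    for ts, art_id in events:
        batch_start = (ts // batch_seconds) * batch_seconds
        batches[batch_start].append(art_id)
    # Each batch is processed as a whole, oldest batch first
    return {start: Counter(ids) for start, ids in sorted(batches.items())}


# Hypothetical "collect" events: (seconds since start, artwork id)
events = [(0, "art_1"), (1, "art_2"), (1, "art_1"), (2, "art_1"), (5, "art_2")]
counts = micro_batches(events, batch_seconds=2)

for start, per_art in counts.items():
    print(f"batch starting at {start}s:", dict(per_art))
```

A record-at-a-time system would instead update the count as each event arrives, trading the batch bookkeeping for lower latency.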
Cassandra vs. PostgreSQL

Database Model
• Cassandra: NoSQL
• PostgreSQL: Relational DBMS
Scale
• Cassandra: Horizontally (More data = More servers)
• PostgreSQL: Vertically (More data = Bigger server)
Distributed
• Cassandra: Distributed
• PostgreSQL: Not distributed
Normalization
• Cassandra: Better with denormalized tables: increases writes but simplifies reads
• PostgreSQL: Better with normalized tables (store additional redundant information on disk to optimize query response; but still, not distributed)
Consistency
• Cassandra: Developer’s job
• PostgreSQL: Software handles it

Cassandra vs. HBase

Google Bigtable
• Both: follow the Google Bigtable design
Distributed
• Both: Yes
Internode communications
• Cassandra: Integrated Gossip protocol
• HBase: Relies on Zookeeper
Availability
• Cassandra: Multiple seed nodes (concentration points for intra-cluster communication)
• HBase: Standby master node
Consistency
• Cassandra: Richer consistency support: you can configure how many replica nodes must successfully complete the operation before it is acknowledged (you can require all replica nodes)
• HBase: Strong row-level consistency
Query
• Cassandra: SQL-like query
• HBase: hbase> create 't1', {NAME => 'f1'}, {NAME => 'f2'}, {NAME => 'f3'}

Elasticsearch vs. Solr

Lucene Index
• Both use Lucene index
Distributed
• Elasticsearch: Yes
• Solr: Yes, but depends on Zookeeper
Nested Object
• Elasticsearch: Supports complex nested objects
• Solr: Can be implemented as a flat object, but hard to update
Change # of Shards
• Elasticsearch: No (hash(doc_id) % #_of_primary_shards)
• Solr: Yes
Automatic shard rebalancing (after adding new nodes)
• Elasticsearch: Yes
• Solr: No
Faceting
• Elasticsearch: Supports richer faceting (exclude terms using Regex)
• Solr: Supports faceting

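
The routing rule hash(doc_id) % #_of_primary_shards shows why the shard count is hard to change after indexing. The sketch below uses zlib.crc32 as a stand-in hash (Elasticsearch uses its own hash function), and the document ids are made up; it counts how many documents would map to a different shard if the number of primary shards went from 5 to 6.

```python
import zlib


def shard_for(doc_id, num_primary_shards):
    """Pick a shard with hash(doc_id) % number_of_primary_shards.
    crc32 is only a stand-in for the real hash function."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_primary_shards


# Hypothetical document ids
doc_ids = [f"art_{i}" for i in range(1000)]

# Documents whose shard changes when the shard count changes
moved = sum(1 for d in doc_ids if shard_for(d, 5) != shard_for(d, 6))
print(f"{moved} of {len(doc_ids)} documents would live on a different shard")
```

Since lookups recompute the same formula, any document whose result changes could no longer be found on its original shard without reindexing.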

Artmosphere Demo

  • 1. Artmosphere: Discover the beauty not yet seen. Keira Zhou, October 2015
  • 2. Demo: www.artmosphere.nyc; http://54.215.136.187:5000; Video: https://youtu.be/skzZ7sosC8c
  • 3. Data Pipeline. Data source: Web API (JSON files); self-engineered user activity log
  • 4. Cluster Setup: four nodes with 8GB memory each; one node with 50GB storage, three nodes with 1TB storage each
  • 5. Dataset. Artsy.net: about 26K artworks; about 45K artists; JSON
  • 6. Dataset. Artsy.net: about 26K artworks; about 45K artists; JSON. Generated user log: simulated user "collect" activities; multiplied real user location
  • 7. Search Artwork by Title: raw JSON of 45K artwork; Spark - Elasticsearch
  • 8. Real-Time Streaming: track the trend of each art; Spark Streaming - Cassandra
  • 9. Batch Processing: artists location; similar artworks; Spark - Cassandra
  • 10. Batch Processing, Artists Location. Spark: processed 507GB of data
  • 11. Challenges: find the right database for your problem (Elasticsearch for search, Cassandra for time series); computers are multilingual (Python, Scala, Java…); but challenges make life interesting
  • 12. About Me: MS & BS in Systems Engineering, University of Virginia; Machine Learning; Natural Language Processing; High Performance Computing
  • 13. About Me: MS & BS in Systems Engineering, University of Virginia; Machine Learning; Natural Language Processing; High Performance Computing. Enjoy: sketching, adventures, laughing, life
  • 15. Batch Processing, Art Similarity: artworks are manually tagged; input format: [art1]: [tag1][tag2][tag3]…; compute common tags between two artworks; Spark - Cassandra; could also use Collaborative Filtering (MLlib in Spark)
  • 16. Benchmark Reads/Writes. [Charts: operations/sec for Cassandra, Couchbase, HBase and MongoDB; panels "Most Writes" and "Writes/Reads Balanced"]. Source: "Benchmarking Top NoSQL Databases", End Point: http://www.datastax.com/wp-content/themes/datastax-2014-08/files/NoSQL_Benchmarks_EndPoint.pdf
  • 17. Cassandra Time Series. Compound primary key (art_id, time_stamp): art_id is the partition key, responsible for data distribution across nodes; time_stamp is the clustering key (with Clustering Order By, descending), responsible for data sorting within the partition. Example layout:
        art_id (partition key) | time_stamp_1 | time_stamp_2 | time_stamp_3 | …
        art_id_1               | 3            | 1            | 2            | …
        art_id_2               | 5            | 3            | 1            | …
        art_id_3               | 1            | 4            | 2            | …
        …                      | …            | …            | …            | …
  • 18. Transactions in Cassandra. Write atomicity is at the partition level. Example: write "Life is good" to Node 1 and Node 2 with consistency = ALL. Node 1 writes "Life is good" and reports SUCCESS; Node 2 writes "Life is …", the job fails, and it reports FAIL and rolls back. Final report: FAIL. Resulting state: Node 1 holds "Life is good"; Node 2 holds nothing. Source: Datastax, http://docs.datastax.com/en/cassandra/2.1/cassandra/dml/dml_atomicity_c.html
  • 19. Batch Processing, Artists Location. Spark: processed 507GB of data
        File size (GB) | 4.7 | 14.5 | 28.5 | 43  | 101   | 202.5 | 318.5 | 507
        Time (min)     | 1.5 | 5.3  | 6.5  | 9.5 | 21.25 | 42.87 | 67.1  | 110
  • 20. Spark vs. Hadoop
        Feature | Spark | Hadoop MapReduce
        Fault tolerance | Via RDD (lost data is rebuilt on failure using lineage: each RDD remembers how it was built from other datasets) | Via replication
        Cache data into memory | Yes: supports in-memory data sharing across directed acyclic graphs (RDD); well suited to machine learning algorithms | No: each job reads data from stable storage (e.g. file systems)
        Write intermediate files to hard disk | Yes, during shuffle | Yes, during MapReduce
  • 21. Spark Streaming vs. Storm
        Feature | Spark Streaming | Storm
        Processing model | Micro batches | One record at a time
        Latency | Few seconds | Sub-second
        Fault tolerance (every record processed) | Exactly once (processing is tracked at the batch level) | At least once (may have duplicates when recovering from a fault)
        Implemented in | Scala | Clojure
  • 22. Cassandra vs. PostgreSQL
        Feature | Cassandra | PostgreSQL
        Database model | NoSQL | Relational DBMS
        Scale | Horizontally (more data = more servers) | Vertically (more data = bigger server)
        Distributed | Distributed | Not distributed
        Normalization | Better with denormalized tables: increases writes but simplifies reads | Better with normalized tables (can store additional redundant information on disk to optimize query response, but still not distributed)
        Consistency | Developer's job | Software handles it
  • 23. Cassandra vs. HBase
        Feature | Cassandra | HBase
        Google Bigtable | Adopts Google Bigtable | Adopts Google Bigtable
        Distributed | Yes | Yes
        Internode communications | Integrated Gossip protocol | Relies on Zookeeper
        Availability | Multiple seed nodes (concentration points for intercluster communication) | Standby master node
        Consistency | Richer consistency support: you can configure how many replica nodes must successfully complete the operation before it is acknowledged (you can require all replica nodes) | Strong row-level consistency
        Query | SQL-like query | hbase> create 't1', {NAME => 'f1'}, {NAME => 'f2'}, {NAME => 'f3'}
  • 24. Elasticsearch vs. Solr
        Feature | Elasticsearch | Solr
        Lucene index | Uses Lucene index | Uses Lucene index
        Distributed | Yes | Yes, but depends on Zookeeper
        Nested object | Supports complex nested objects | Can be implemented as a flat object but hard to update
        Change # of shards | No (hash(doc_id) % #_of_primary_shards) | Yes
        Automatic shard rebalancing (after adding new nodes) | Yes | No
        Faceting | Supports richer faceting (exclude terms using regex) | Supports faceting
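
The art-similarity step on slide 15 (count the tags two artworks have in common) can be sketched in a few lines. This is a minimal plain-Python illustration, not the Spark job from the talk: the function names `parse_line` and `common_tag_counts` and the sample tags are assumptions made up for the example, and only the `[art1]: [tag1][tag2][tag3]…` input format comes from the slide.

```python
import re
from itertools import combinations


def parse_line(line):
    """Parse '[art1]: [tag1][tag2]' into (art_id, set_of_tags)."""
    art_part, _, tag_part = line.partition(":")
    art_id = art_part.strip().strip("[]")
    tags = set(re.findall(r"\[([^\]]+)\]", tag_part))
    return art_id, tags


def common_tag_counts(lines):
    """Return {(art_a, art_b): number_of_shared_tags} for pairs sharing a tag."""
    arts = dict(parse_line(line) for line in lines if line.strip())
    counts = {}
    # Compare every unordered pair of artworks exactly once.
    for a, b in combinations(sorted(arts), 2):
        shared = arts[a] & arts[b]
        if shared:
            counts[(a, b)] = len(shared)
    return counts


# Hypothetical sample input; the tag values are invented for illustration.
sample_lines = [
    "[art1]: [blue][abstract][oil]",
    "[art2]: [blue][oil][portrait]",
    "[art3]: [sculpture]",
]

if __name__ == "__main__":
    for pair, shared in common_tag_counts(sample_lines).items():
        print(pair, shared)
```

Comparing all pairs is quadratic in the number of artworks, which is one reason a distributed job (or the collaborative-filtering alternative the slide mentions) becomes attractive at larger scale.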

Editor's notes

  1. Artsy: https://developers.artsy.net/docs
  2. Artsy: https://developers.artsy.net/docs
  3. Artsy: https://developers.artsy.net/docs
  4. Search by title
  5. Update as a plot of Spark performance
  6. Update as a plot of Spark performance
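
As a rough reading of the Spark timings on slide 19, per-run throughput can be computed directly from the two rows of figures. The sketch below is plain Python and not part of the original project; the names `sizes_gb`, `minutes`, and `throughput` are illustrative.

```python
# File sizes (GB) and run times (minutes) as listed on slide 19.
sizes_gb = [4.7, 14.5, 28.5, 43, 101, 202.5, 318.5, 507]
minutes = [1.5, 5.3, 6.5, 9.5, 21.25, 42.87, 67.1, 110]


def throughput(sizes, times):
    """Return GB processed per minute for each (size, time) pair."""
    return [round(size / time, 2) for size, time in zip(sizes, times)]


rates = throughput(sizes_gb, minutes)

if __name__ == "__main__":
    for size, rate in zip(sizes_gb, rates):
        print(f"{size:>6} GB: {rate} GB/min")
```

On these figures the rate sits at roughly 4.5 to 4.8 GB/min for inputs of 43GB and above, which suggests run time grew about linearly with input size on this cluster.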