Demonstration
Outline
● Some comments on what we're trying to
  show
  ○ high level cluster configuration
  ○ an example application that might use this config
    ■ based on a Gowalla data set
● Launch cluster nodes on EC2
● Launch/configure Cassandra on cluster
● Demonstrate use of Cassandra
  ○ cassandra-cli, pycassa scripts to interact with db
● Demonstrate use of Hadoop
● Demonstrate use of Pig on the real data
Cluster configuration
● Four EC2 nodes
  ○ m1.medium instances
    ■ realistically a bit small for real world
● 3 nodes form the Cassandra cluster
  ○ data can be input dynamically into db via Thrift API
● All Cassandra nodes run a Hadoop TaskTracker
● MapReduce runs close to (Cassandra) data
● JobTracker on separate node
Cluster config

  Node 1: Job Tracker
  Node 2: Cassandra + Task Tracker
  Node 3: Cassandra + Task Tracker
  Node 4: Cassandra + Task Tracker

All nodes m1.small for demo
Let's get the cluster up...
       ...over to Lamine!
Let's get Cassandra
      running...
  ...and show the basic cli...
Application data
● Used Gowalla data in this test application
● Gowalla provides anonymized data for
  test/research purposes:
  ○ Graph of UID connections
  ○ List of checkins - UID, LocID
● Size of data set:
  ○ 400MB checkins
    ■ 6.4m checkins
  ○ ~200k users
● Also generated a simpler variant of this data
  for demonstration
  ○ more realistic user information
  ○ more realistic location information
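A minimal sketch of reading the checkin list into (UID, LocID) records before import. The slide only names UID and LocID; the tab-separated layout assumed here (uid, timestamp, latitude, longitude, location id) and the sample rows are illustrative assumptions, not the actual file contents.

```python
from typing import Iterable, Iterator, NamedTuple


class Checkin(NamedTuple):
    uid: int
    timestamp: str
    loc_id: int


def parse_checkins(lines: Iterable[str]) -> Iterator[Checkin]:
    """Yield one Checkin per well-formed line; skip blank or short lines.

    Assumed layout (hypothetical): uid <TAB> time <TAB> lat <TAB> lon <TAB> loc_id
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue
        uid, timestamp, _lat, _lon, loc_id = fields[:5]
        yield Checkin(int(uid), timestamp, int(loc_id))


# Made-up sample rows, only to show the shape of the data.
sample = [
    "1\t2010-10-04T09:00:00\t10.0\t20.0\t501\n",
    "2\t2010-10-05T18:45:00\t11.0\t21.0\t502\n",
    "\n",
]
checkins = list(parse_checkins(sample))
```

Because the parser is a generator, a 400MB file can be streamed line by line (for example `parse_checkins(open(path))`) rather than loaded into memory at once.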
Application data - User Graph
 Simple graph structure -
 unidirectional graph with
 UIDs as nodes
Application Data - Checkin info
How this data can be used
● Application interested in:
   ○   my checkins
   ○   list my friends
   ○   checkins at given location
   ○   my friends' checkins
● Analytics:
   ○ top ten most active users - most checkins
   ○ aggregate checkins per week
   ○ aggregate checkins per week per city
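The two simplest analytics above can be sketched in plain Python on a handful of rows. This is only an in-memory illustration of the intended results, not the Hadoop/Pig jobs used in the demo; the sample rows and function names are hypothetical.

```python
from collections import Counter
from datetime import datetime

# Made-up (uid, ISO timestamp) rows for illustration.
rows = [
    (1, "2010-10-04T09:00:00"),
    (1, "2010-10-05T12:30:00"),
    (2, "2010-10-05T18:45:00"),
    (1, "2010-10-12T08:15:00"),
    (3, "2010-10-13T20:00:00"),
]


def top_users(rows, n=10):
    """Return the n users with the most checkins as (uid, count) pairs."""
    return Counter(uid for uid, _ in rows).most_common(n)


def checkins_per_week(rows):
    """Count checkins per (ISO year, ISO week number)."""
    weeks = Counter()
    for _, timestamp in rows:
        iso = datetime.fromisoformat(timestamp).isocalendar()
        weeks[(iso[0], iso[1])] += 1
    return dict(weeks)


most_active = top_users(rows, n=1)
weekly = checkins_per_week(rows)
```

The per-city variant would use the same grouping with the city added to the key, which requires looking up each checkin's location first.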
Cassandra data models
● The following data models were used:
  ○ User
  ○ Location
  ○ Checkin
  ○ FriendRels
    ■ graph of friend relationships
  ○ UserCheckins
    ■ checkins by user
  ○ LocationCheckins
    ■ checkins by location
  ○ FriendCheckins
    ■ checkins by friends
Cassandra data models
● Use of valueless columns
  ○ FriendRels, UserCheckins, LocationCheckins,
    FriendCheckins are just sets of valueless columns
● FriendRels:
  ○ row_key: {friendid1: '', friendid2: '', friendid3: '', ...}
    ■ row_key is a uid
● UserCheckins:
  ○ row_key: {checkinid1: '', checkinid2: '', ...}
    ■ row_key is uid
● LocationCheckins use LocID as row key
● FriendCheckins use my UID to get my
  friends' checkins
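A rough in-memory model of the valueless-column layout, using nested dicts in place of Cassandra column families (no pycassa calls, so it runs without a cluster). How FriendCheckins was populated in the demo is not stated; the fan-out on write shown here is an assumption, and all function names are hypothetical.

```python
from collections import defaultdict

# Each "column family" maps row_key -> {column_name: ''}.
# The column names carry the data; the values stay empty.
friend_rels = defaultdict(dict)        # uid    -> {friend_uid: ''}
user_checkins = defaultdict(dict)      # uid    -> {checkin_id: ''}
location_checkins = defaultdict(dict)  # loc_id -> {checkin_id: ''}
friend_checkins = defaultdict(dict)    # uid    -> {checkin_id: ''}


def add_friend(uid, friend_uid):
    """Record a one-way edge: friend_uid is a friend of uid."""
    friend_rels[uid][friend_uid] = ''


def add_checkin(checkin_id, uid, loc_id):
    """Index one checkin by user, by location, and for the user's followers."""
    user_checkins[uid][checkin_id] = ''
    location_checkins[loc_id][checkin_id] = ''
    # Assumed fan-out on write: copy the checkin id into the FriendCheckins
    # row of every user who lists `uid` as a friend.
    for follower, friends in friend_rels.items():
        if uid in friends:
            friend_checkins[follower][checkin_id] = ''


add_friend("u1", "u2")
add_checkin("c1", "u2", "loc9")
add_checkin("c2", "u1", "loc9")
```

With this layout each application query is a single row read: `user_checkins["u2"]` for my checkins, `location_checkins["loc9"]` for a location, and `friend_checkins["u1"]` for my friends' checkins.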
Let's import the data into
       Cassandra...
You deserve a coffee...
Using Hadoop and Pig
 ...and we can do some analytics...
