Data + Algorithms = Knowledge




Facebook Analytics


                  With Elastic Map/Reduce
                      – a Hands-on Workshop

                                            November 12, 2012
                                        J Singh, DataThinks.org




Take-away Messages

• Map Reduce is simple, Hadoop is one implementation of MR…
   – …made even simpler by services like Elastic Map Reduce


• But Map Reduce requires a different style of programming…
   – …and a different set of techniques for debugging


• Facebook data can get big very quickly…
   – …and storage and bandwidth costs can dominate your solution


• Analytics is an iterative (agile) process…
   – …each iteration requires evaluating results, tuning the algorithms,
     and possibly acquiring more data

                       © J Singh, 2012                                  2
Signing Up for AWS

The steps required to obtain an AWS account
   • Create an AWS account (http://aws.amazon.com).
    –   http://www.slideshare.net/AmazonWebServices/video-how-to-sign-up-for-amazon-web-services-8700872
    –   Requires a valid credit card and phone-based identification.
   • Sign in to the AWS Management Console
    – http://aws.amazon.com/console




Elastic Map Reduce Resources

• Summary of the offering

• Elastic MapReduce Training

• Getting Started Guide

• Developers Guide




MapReduce Conceptual Underpinnings

• Based on Functional Programming model
   – From Lisp
       • (map square '(1 2 3 4))  →  (1 4 9 16)
       • (reduce plus '(1 4 9 16))  →  30
   – From APL
       • +/ N, where N ← 1 2 3 4


• Easy to distribute (based on each element of the vector)

• New for Map/Reduce: Nice failure/retry semantics
   – Hundreds and thousands of low-end servers are running at
     the same time
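
The Lisp example above can be sketched in Python to show why the model distributes easily. This is a minimal illustration, not part of the original workshop material; the names `square`, `chunks`, and `partial_sums` are chosen here for the example.

```python
from functools import reduce
from operator import add


def square(x):
    """Map step: operates on a single element, with no shared state."""
    return x * x


numbers = [1, 2, 3, 4]

# Same computation as the Lisp example: map, then reduce.
squares = list(map(square, numbers))
total = reduce(add, squares)

# Because square() looks at one element at a time, the input can be split
# into chunks, each chunk processed independently (e.g. on another server),
# and the partial results combined afterwards.
chunks = [numbers[:2], numbers[2:]]
partial_sums = [reduce(add, map(square, chunk)) for chunk in chunks]
distributed_total = reduce(add, partial_sums)
```

If one chunk's worker fails, only that chunk needs to be rerun, which is what makes the retry semantics straightforward.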
MapReduce Flow




Elastic Map Reduce – Summary

• Hadoop installed and maintained by Amazon
   – We can focus on programming
   – Offers a few options on map and reduce programs

• Streaming
   – Map and Reduce programs
     connect through stdin and
     stdout
   – Allows Map and Reduce to be
     written in any language
• Hive, Pig
   – Translates to Map/Reduce JARs
   – Can cascade M/R pipelines
• Custom JAR – for special cases
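
As a rough sketch of the Streaming option, a mapper and a reducer are ordinary programs that read lines and write tab-separated key/value lines. The function names below (`map_lines`, `reduce_lines`, `run`) are assumptions for illustration; the sketch relies on the framework delivering reducer input sorted by key.

```python
import sys
from itertools import groupby


def map_lines(lines):
    """Mapper: emit one 'word<TAB>1' record per word in the input."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reduce_lines(sorted_records):
    """Reducer: sum the values for each key. Input must be sorted by key."""
    parsed = (rec.rstrip("\n").split("\t", 1) for rec in sorted_records)
    for key, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{key}\t{sum(int(value) for _, value in group)}"


def run(step, stdin=sys.stdin, stdout=sys.stdout):
    """Connect a step to stdin/stdout, as a streaming job would."""
    for record in step(stdin):
        stdout.write(record + "\n")
```

In a real job the mapper script would end with `run(map_lines)` and the reducer script with `run(reduce_lines)`. Locally, `reduce_lines(sorted(map_lines(lines)))` stands in for the whole pipeline.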

Elastic Map Reduce – Architecture

• Starting with data in S3

• EMR Service initiates the job
• Hadoop Master coordinates
  operation
• Slave nodes are initiated and
  data loaded into them
• Extra nodes can be invoked if
  needed

• Results are copied back into S3
   – Nodes are destroyed

Elastic Map Reduce – Word Count

• Use the AWS Management Console >> Elastic MapReduce
  – Define Job Flow
      • Hadoop Version 1.0.3
      • Run your own application
          – Streaming
  – Specify Parameters
      • For input files,
        elasticmapreduce/samples/wordcount/input
      • For output files, you need to define your own S3 bucket
          – In a separate browser tab, AWS Management Console >> S3
          – Bucket names can include lowercase letters, numbers, periods, and dashes
      • Mapper code can be seen at http://goo.gl/EbCme
          – Copy this code to one of your buckets
          – Specify path <your-bucket>/wordSplitter.py
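
The linked mapper is not reproduced here. The following is only a sketch of what a word-splitting streaming mapper might look like, assuming the job's reducer is Hadoop Streaming's built-in `aggregate` (the slide does not name the reducer). With that reducer, a `LongValueSum:` prefix on the key asks for the values to be summed. The function name `split_words` and the token pattern are assumptions.

```python
import re

# Assumed token pattern: a letter followed by letters or digits.
WORD = re.compile(r"[a-zA-Z][a-zA-Z0-9]*")


def split_words(lines):
    """Emit 'LongValueSum:<word><TAB>1' for every word found in the input."""
    for line in lines:
        for word in WORD.findall(line):
            yield "LongValueSum:" + word.lower() + "\t1"
```

A script built on this would write each yielded record to stdout, one per line, while iterating over `sys.stdin`.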
Elastic Map Reduce – Word Count (p2)

• Configure EC2 Instances
• Advanced Options
   – Optional: Amazon EC2 Key Pair
       • To log into the master and make changes to a running job
          – E.g., add extra nodes to speed up processing
   – Amazon S3 Log Path
       • <your-bucket>/log-2012-11-12--19-30
• Accept all other defaults and go!




Monitoring Operation

• AWS Management Console provides a view into the
  operation




  – These screen-shots were taken at minute 27 of a 30-minute
    run
  – Configuration default in this case was for 2 map slots
  – First slot became available at 12:00, second around 12:10

Elastic Map Reduce – Debugging

• AWS console and the log files provide clues on what went
  wrong and how to fix it

• Make a change that will break the operation and examine
  the AWS console to find the error you introduced
   – Introduce a parsing error in the mapper program
   – Uncomment these lines to have it raise an exception (a ZeroDivisionError
     whenever randint returns 0)
                 import random
                 x = 1 / random.randint(0,1000)
   – Save the file to an S3 bucket and run
   – Can you find where EMR reveals what happened?
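
Once the failure has been located in the logs, one possible way to make a streaming mapper easier to debug is to catch errors per input line and report them on stderr, which ends up in the task's stderr log. The sketch below is an illustration, not the workshop's mapper; `safe_map`, the `process` callback, and the counter names `Workshop` and `BadRecords` are hypothetical. The `reporter:counter:` line format is the one Hadoop Streaming reads from stderr to update job counters.

```python
import sys


def safe_map(lines, process, log=sys.stderr.write):
    """Apply process(line) to each line; report failures instead of crashing.

    process is any callable returning an iterable of output records.
    """
    for number, line in enumerate(lines, start=1):
        try:
            yield from process(line)
        except Exception as exc:
            # Diagnostic text for whoever reads the task's stderr log.
            log(f"line {number}: {type(exc).__name__}: {exc}\n")
            # Counter update in Hadoop Streaming's stderr format.
            log("reporter:counter:Workshop,BadRecords,1\n")
```

Whether to skip bad records or let the task fail is a design choice: skipping keeps a long job alive, but the counter should then be checked after every run.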


Facebook Analytics – Summary

• Extend the architecture
   – Import Facebook data into S3
   – Change Map Reduce programs as required




Facebook Analytics – Observations

• Fetching and staging data is the real challenge in putting
  together an analytics solution
   – For unstructured data, it requires
       • An understanding of the data model at the source
       • Custom code to read it


   – For structured data, consider Pig/Hive (higher-level Hadoop
     components)
       • Pig/Hive can read/write tables formatted as CSV/TSV files in S3
          – Either we need to bring files into S3
          – Or point Pig/Hive at a JDBC connection
       • An opportunity to rethink the ETL pipeline?


Facebook Analytics – Data Collection

• The exercise is based on everyone's Facebook data
• Log into http://apps.facebook.com/map-reduce-workshop
   – Requires permission to get
       • Information about you,
       • Your friends,
       • Your likes, your friends' likes.
   – Randomly selects 10 of those friends
   – Randomly selects 25 of their likes
   – Anonymizes your friends' Facebook IDs before storing into
     S3
• All data, even though opaque, will be deleted at the end of
 the workshop

Facebook Analytics – Data Collected




Original = 75   Friends = 750        Likes = up to about 20,000
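
The figures above are consistent with the sampling described earlier (10 friends per participant, 25 likes per friend). The slides do not say how the "about 20,000" figure was derived, so the arithmetic below is only a plausible reading, assuming the 25 likes are sampled per friend.

```python
participants = 75             # "Original" in the figure
friends_per_participant = 10  # friends sampled per participant
likes_per_friend = 25         # assumed: likes sampled per friend

friends = participants * friends_per_participant  # 750
max_likes = friends * likes_per_friend            # 18750 at most
```

Two levels of fan-out turn 75 participants into close to 20,000 records, which illustrates how quickly this kind of data grows.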

• Each user record shows anonymized user ID and their likes
   –   4110002004281   ['21506845769', '345722385482735', '93433060687']
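
A record in this shape could be parsed as sketched below. The sketch assumes the user ID and the list are separated by whitespace and that the list is written as a Python literal, as in the example line; the function name `parse_record` is hypothetical.

```python
import ast


def parse_record(line):
    """Split '<user_id> <python list of like IDs>' into its two parts."""
    user_id, likes_text = line.strip().split(None, 1)
    # literal_eval only accepts Python literals, so it is safer than eval().
    likes = ast.literal_eval(likes_text)
    return user_id, [str(like) for like in likes]
```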




Facebook Analytics – Likes Count

• Use the AWS Management Console >> Elastic MapReduce
  – Define Job Flow
      • Hadoop Version 1.0.3
      • Run Your Own Application
         – Streaming
  – Specify Parameters
      • For input files, use bucket datathinks-users
      • For output files, you need to define your own S3 bucket
         – In a separate browser tab, AWS Management Console >> S3
      • Mapper: copy goo.gl/PcLK4 into a bucket you own
  – Advanced options:
      • Choose a fresh log file location
  – Accept all other defaults and go!
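
The mapper at the link is not shown in the slides. As a hedged sketch of the idea, a likes-count mapper emits one record per like ID and leaves the summing to the reducer. The names `map_likes` and `count_locally` are assumptions, and the input format is assumed to match the user records shown earlier (ID, whitespace, Python-style list).

```python
import ast
from collections import Counter


def map_likes(lines):
    """Mapper: emit '<like_id><TAB>1' for every like in every user record."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        _user_id, likes_text = line.split(None, 1)
        for like_id in ast.literal_eval(likes_text):
            yield f"{like_id}\t1"


def count_locally(lines):
    """Stand-in for the shuffle and reduce phases, for local testing only."""
    counts = Counter()
    for record in map_likes(lines):
        like_id, value = record.split("\t")
        counts[like_id] += int(value)
    return counts
```

Running `count_locally` on a few sample records before launching a job is a cheap way to catch parsing mistakes.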
Viewing the Results

• The results of Data Analysis are available in S3.
   – Partial example:     139784736075551      1
                          140413412750046      6
                          184331976202         3
                          220854914702193      1
                          29092950651          1


• How to interpret the results.
   – Sort by frequency, then examine most frequent likes
       • 140413412750046 is cryptic
       • But http://www.facebook.com/pages/w/140413412750046
         reveals what it is (DataThinks)
• Requires further action: what to do with the results?
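
Sorting by frequency can be sketched as follows, using the partial example above as input. The sketch assumes each output line holds a like ID and a count separated by whitespace; `top_likes` is a hypothetical helper name.

```python
def top_likes(lines, n=10):
    """Return the n most frequent (like_id, count) pairs, highest count first."""
    counts = []
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # ignore blank or malformed lines
        like_id, count = parts
        counts.append((like_id, int(count)))
    # Sort by descending count, then by ID so that ties are ordered stably.
    return sorted(counts, key=lambda item: (-item[1], item[0]))[:n]


sample = [
    "139784736075551      1",
    "140413412750046      6",
    "184331976202         3",
    "220854914702193      1",
    "29092950651          1",
]
most_frequent = top_likes(sample, n=2)
```

On this sample the most frequent ID is 140413412750046, the one the slide resolves to DataThinks.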
Algorithm Discussion

• The algorithm based on exact matches for likes may be
  too restrictive
  – 'Ella Fitzgerald' != 'Duke Ellington'
  – But people who like Ella Fitzgerald may be reachable the
    same way as people who like Duke Ellington

  – An idea to explore further:
      • Is there a way to find IDs that we might consider equivalent?
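
One possible way to explore this idea is to map each like ID to a group label before counting, so that related likes are tallied together. The slides leave the question open, so this is a sketch of one approach rather than the workshop's method. The table contents, the placeholder IDs, and the names `EQUIVALENT`, `canonical`, and `count_by_group` are all hypothetical, and how such a table would be built is not addressed here.

```python
from collections import Counter

# Hypothetical equivalence table: like ID -> group label.
# The IDs below are placeholders, not real Facebook IDs.
EQUIVALENT = {
    "like-ella-fitzgerald": "jazz",
    "like-duke-ellington": "jazz",
}


def canonical(like_id, table=EQUIVALENT):
    """Return the group for a like ID, or the ID itself if it has no group."""
    return table.get(like_id, like_id)


def count_by_group(like_ids, table=EQUIVALENT):
    """Count likes after collapsing equivalent IDs into their group."""
    return Counter(canonical(like_id, table) for like_id in like_ids)
```

With this table, a like for Ella Fitzgerald and a like for Duke Ellington both count toward "jazz", while unknown IDs are counted unchanged.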




Data Collected and Embellished




Original = 75   Friends = 750   Likes = 15,000   Similar Likes = 150,000




Extended Facebook Analytics – Summary

• Extend the architecture
   – Get mappers to fetch “similar likes” from the internet




Facebook Analytics – Showing Results

• The other challenge in putting together an analytics
  solution is displaying results
   – Demo of our results page




Take-away Messages

• Map Reduce is simple, Hadoop is one implementation of MR…
   – …made even simpler by services like Elastic Map Reduce


• But Map Reduce requires a different style of programming…
   – …and a different set of techniques for debugging


• Facebook data can get big very quickly…
   – …and storage and bandwidth costs can dominate your solution


• Analytics is an iterative (agile) process…
   – …each iteration requires evaluating results, tuning the algorithms,
     and possibly acquiring more data

Thank you

• J Singh
   – President, Early Stage IT
       • Technology Services and Strategy for Startups


• DataThinks.org is a service of Early Stage IT
   – “Big Data” analytics solutions





Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 

Facebook Analytics with Elastic Map/Reduce

  • 1. Data + Algorithms = Knowledge. Facebook Analytics With Elastic Map/Reduce – a Hands-on Workshop. November 12, 2012. J Singh, DataThinks.org
  • 2. Take-away Messages: • Map Reduce is simple, Hadoop is one implementation of MR… – …made even simpler by services like Elastic Map Reduce • But Map Reduce requires a different style of programming… – …and a different set of techniques for debugging • Facebook data can get big very quickly… – …and storage and bandwidth costs can dominate your solution • Analytics is an iterative (agile) process… – …each iteration requires evaluating results, and tuning the algorithms, possibly the acquisition of more data © J Singh, 2012
  • 3. Signing Up for AWS: The steps required to obtain an AWS account • Create an AWS account (http://aws.amazon.com). – http://www.slideshare.net/AmazonWebServices/video-how-to-sign-up-for-amazon-web-services-8700872 – Requires a valid credit card and phone-based identification. • Sign in to the AWS Management Console – http://aws.amazon.com/console
  • 4. Elastic Map Reduce Resources: • Summary of the offering • Elastic MapReduce Training • Getting Started Guide • Developers Guide
  • 5. MapReduce Conceptual Underpinnings: • Based on Functional Programming model – From Lisp • (map square '(1 2 3 4)) → (1 4 9 16) • (reduce plus '(1 4 9 16)) → 30 – From APL • +/ N, where N ← 1 2 3 4 • Easy to distribute (based on each element of the vector) • New for Map/Reduce: Nice failure/retry semantics – Hundreds and thousands of low-end servers are running at the same time
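The Lisp expressions on slide 5 translate almost directly into Python. This is only an illustration of the functional idea on a single machine; the name `square` mirrors the slide, and nothing here is distributed.

```python
from functools import reduce
from operator import add


def square(x):
    """Square a single element; each call is independent of the others."""
    return x * x


values = [1, 2, 3, 4]

# map: apply the same function to every element (easy to spread across machines).
squares = list(map(square, values))

# reduce: fold the mapped values into one result with a binary operator.
total = reduce(add, squares)

print(squares)
print(total)
```

Because each `square` call depends only on its own input, the map step can be split across workers in any order, which is the property the slide calls "easy to distribute".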
  • 6. MapReduce Flow
  • 7. Elastic Map Reduce – Summary: • Hadoop installed and maintained by Amazon – We can focus on programming – Offers a few options on map and reduce programs • Streaming – Map and Reduce programs connect through stdin and stdout – Allows Map and Reduce to be written in any language • Hive, Pig – Translates to Map/Reduce JARs – Can cascade M/R pipelines • Custom JAR – for special cases
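To make the Streaming option on slide 7 concrete, here is a minimal word-count sketch written as two functions over lines of text. It assumes the common Streaming convention of tab-separated `key<TAB>value` records; the function names and the `run_locally` helper are hypothetical, and on a real cluster the mapper and reducer would be separate scripts reading `sys.stdin` and printing to stdout.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' record for every word in the input lines."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_records):
    """Sum the counts per word; records must already be sorted by key."""
    keyed = (record.split("\t", 1) for record in sorted_records)
    for word, group in groupby(keyed, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


def run_locally(lines):
    """Stand-in for the framework's shuffle: sort mapper output, then reduce."""
    return list(reducer(sorted(mapper(lines))))


for record in run_locally(["the cat", "The dog"]):
    print(record)
```

Running the pair locally like this, before paying for a cluster, is a cheap way to catch logic errors in either step.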
  • 8. Elastic Map Reduce – Architecture: • Starting with data in S3 • EMR Service initiates the job • Hadoop Master coordinates operation • Slave nodes are initiated and data loaded into them • Extra nodes can be invoked if needed • Results are copied back into S3 – Nodes are destroyed
  • 9. Elastic Map Reduce – Word Count: • Use the AWS Management Console >> Elastic MapReduce – Define Job Flow • Hadoop Version 1.0.3 • Run your own application – Streaming – Specify Parameters • For input files, elasticmapreduce/samples/wordcount/input • For output files, you need to define your own S3 bucket – In a separate browser tab, AWS Management Console >> S3 – Bucket names can include lowercase letters, numbers, period, dash • Mapper code can be seen at http://goo.gl/EbCme – Copy this code to one of your buckets – Specify path <your-bucket>/wordSplitter.py
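Slide 9 lists the characters allowed in a bucket name. A quick local check can save a failed job setup; the sketch below encodes that character set and adds two assumptions that come from the general S3 naming rules rather than from the slide: a length of 3 to 63 characters, and a letter or digit at both ends. It is not an exhaustive validator, and the function name is hypothetical.

```python
import re

# Lowercase letters, digits, periods and dashes, as listed on the slide.
# Assumptions beyond the slide: 3-63 characters, alphanumeric first and last.
_BUCKET_NAME = re.compile(r"[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]")


def looks_like_valid_bucket_name(name: str) -> bool:
    """Return True if `name` passes this simplified bucket-name check."""
    return _BUCKET_NAME.fullmatch(name) is not None and ".." not in name


for candidate in ["my-wordcount-output", "MyBucket", "ab"]:
    print(candidate, looks_like_valid_bucket_name(candidate))
```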
  • 10. Elastic Map Reduce – Word Count (p2): • Configure EC2 Instances • Advanced Options – Optional: Amazon EC2 Key Pair • To log into the master and make changes to a running job – E.g., add extra nodes to speed up processing – Amazon S3 Log Path • <your-bucket>/log-2012-11-12--19-30 • Accept all other defaults and go!
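The log path on slide 10 embeds a date and time, which keeps each run's logs separate. A small helper can generate paths in the same shape; the function name and the `your-bucket` placeholder are illustrative only.

```python
from datetime import datetime


def log_path(bucket: str, when: datetime) -> str:
    """Build '<bucket>/log-YYYY-MM-DD--HH-MM', the pattern shown on the slide."""
    return f"{bucket}/log-{when.strftime('%Y-%m-%d--%H-%M')}"


print(log_path("your-bucket", datetime(2012, 11, 12, 19, 30)))
```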
  • 11. Monitoring Operation: • AWS Management Console provides a view into the operation – These screen-shots were taken at minute 27 of a 30-minute run – Configuration default in this case was for 2 map slots – First slot became available at 12:00, second around 12:10
  • 12. Elastic Map Reduce – Debugging: • AWS console and the log files provide clues on what went wrong and how to fix it • Make a change that will break the operation and examine the AWS console to find the error you introduced – Introduce a parsing error in the mapper program – Uncomment these lines to have it raise an exception: import random; x = 1 / random.randint(0,1000) – Save the file to an S3 bucket and run – Can you find where EMR reveals what happened?
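A note on the exercise in slide 12: `random.randint(0, 1000)` is inclusive at both ends, so it returns 0 (and the division fails) with probability 1/1001 per call, which makes the failure intermittent. One way to make such failures easier to find in the logs is to report the offending input on stderr before re-raising. The sketch below assumes, as is usual for Hadoop Streaming, that a task's stderr ends up in the task logs; `safe_map` and `flaky_mapper` are hypothetical names.

```python
import sys
import traceback


def safe_map(map_line, lines, log=sys.stderr):
    """Apply map_line to each line; on failure, log the input and re-raise."""
    for line_number, line in enumerate(lines, start=1):
        try:
            yield from map_line(line)
        except Exception:
            # Record which input triggered the failure, then let the task fail.
            print(f"mapper failed on input line {line_number}: {line!r}", file=log)
            traceback.print_exc(file=log)
            raise


def flaky_mapper(line):
    """Toy mapper that fails on one specific input, for demonstration."""
    if line == "bad":
        raise ZeroDivisionError("simulated failure")
    yield line.upper()


for record in safe_map(flaky_mapper, ["ok", "fine"]):
    print(record)
```

Re-raising keeps the task failure visible to the framework, so its retry handling still applies; the extra log line only shortens the search for the cause.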
  • 13. Facebook Analytics – Summary: • Extend the architecture – Import Facebook data into S3 – Change Map Reduce programs as required
  • 14. Facebook Analytics – Observations: • Fetching and staging data is the real challenge in putting together an analytics solution – For unstructured data, it requires • An understanding of the data model at the source • Custom code to read it – For structured data, consider Pig/Hive (higher-level Hadoop components) • Pig/Hive can read/write tables formatted as CSV/TSV files in S3 – Either we need to bring files into S3 – Or point Pig/Hive at a JDBC connection • An opportunity to rethink the ETL pipeline?
  • 15. Facebook Analytics – Data Collection: • The exercise is based on everyone's Facebook data • Log into http://apps.facebook.com/map-reduce-workshop – Requires permission to get • Information about you, • Your friends, • Your likes, your friends' likes. – Randomly selects 10 of those friends – Randomly selects 25 of their likes – Anonymizes your friends' Facebook IDs before storing into S3 • All data, even though opaque, will be deleted at the end of the workshop
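The sampling and anonymizing steps on slide 15 can be sketched as follows. The slide does not say how IDs were anonymized, so the keyed hash (HMAC-SHA256) here is purely an assumption, as are the function names and the input shape (a dict mapping friend ID to a list of like IDs).

```python
import hashlib
import hmac
import random


def anonymize(facebook_id: str, secret: bytes) -> str:
    """Replace an ID with a keyed hash (assumed scheme, not the workshop's)."""
    return hmac.new(secret, facebook_id.encode("utf-8"), hashlib.sha256).hexdigest()


def sample_friend_likes(friend_likes, secret, max_friends=10, max_likes=25, rng=random):
    """Pick up to max_friends friends and up to max_likes likes for each."""
    friends = rng.sample(sorted(friend_likes), min(max_friends, len(friend_likes)))
    sampled = {}
    for friend in friends:
        likes = friend_likes[friend]
        sampled[anonymize(friend, secret)] = rng.sample(likes, min(max_likes, len(likes)))
    return sampled


demo = {str(i): [f"like{j}" for j in range(30)] for i in range(12)}
result = sample_friend_likes(demo, secret=b"workshop-secret")
print(len(result), "friends kept")
```

With these limits, 75 participants yield at most 75 × 10 = 750 friends and 750 × 25 = 18,750 friend likes, before counting the participants' own likes.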
  • 16. Facebook Analytics – Data Collected: Original = 75, Friends = 750, Likes = up to about 20,000 • Each user record shows anonymized user ID and their likes – 4110002004281 ['21506845769', '345722385482735', '93433060687']
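The record format on slide 16 is a user ID followed by what looks like a Python list literal. Assuming that is the actual on-disk format, and that ID and list are separated by whitespace, a record can be parsed safely with `ast.literal_eval` (which evaluates literals only, never arbitrary code). The function name is hypothetical.

```python
import ast


def parse_user_record(line: str):
    """Split 'user_id [likes...]' into (user_id, list_of_like_ids)."""
    user_id, likes_text = line.strip().split(None, 1)
    likes = ast.literal_eval(likes_text)
    if not isinstance(likes, list):
        raise ValueError(f"expected a list of likes, got {type(likes).__name__}")
    return user_id, [str(like) for like in likes]


record = "4110002004281 ['21506845769', '345722385482735', '93433060687']"
print(parse_user_record(record))
```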
  • 17. Facebook Analytics – Likes Count: • Use the AWS Management Console >> Elastic MapReduce – Define Job Flow • Hadoop Version 1.0.3 • Run Your Own Application – Streaming – Specify Parameters • For input files, use bucket datathinks-users • For output files, you need to define your own S3 bucket – In a separate browser tab, AWS Management Console >> S3 • Mapper: copy goo.gl/PcLK4 into a bucket you own – Advanced options: • Choose a fresh log file location – Accept all other defaults and go!
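The workshop's actual mapper for slide 17 lives at the short link and is not reproduced in this transcript. As a rough idea of what a likes-count mapper could look like, the sketch below emits one `like_id<TAB>1` record per like, assuming the record format shown on slide 16; `likes_mapper` and `count_locally` are hypothetical names, and `count_locally` merely imitates the summing a reducer would do.

```python
import ast
from collections import Counter


def likes_mapper(lines):
    """Emit 'like_id<TAB>1' for every like in every user record."""
    for line in lines:
        parts = line.strip().split(None, 1)
        if len(parts) != 2:
            continue  # skip blank or malformed lines rather than failing the task
        for like_id in ast.literal_eval(parts[1]):
            yield f"{like_id}\t1"


def count_locally(records):
    """Local stand-in for the reduce step: sum the 1s per like ID."""
    counts = Counter()
    for record in records:
        like_id, value = record.split("\t")
        counts[like_id] += int(value)
    return counts


users = [
    "4110002004281 ['21506845769', '93433060687']",
    "4110002004282 ['93433060687']",
]
print(count_locally(likes_mapper(users)))
```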
  • 18. Viewing the Results: • The results of Data Analysis are available in S3. – Partial example (like ID, count): 139784736075551 1; 140413412750046 6; 184331976202 3; 220854914702193 1; 29092950651 1 • How to interpret the results. – Sort by frequency, then examine most frequent likes • 140413412750046 is cryptic • But http://www.facebook.com/pages/w/140413412750046 reveals what it is (DataThinks) • Requires further action: what to do with the results?
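The "sort by frequency" step on slide 18 takes a few lines of Python once the output files are downloaded. The sketch assumes each output line holds a like ID and a count separated by whitespace; the sample values are the ones from the slide, and `top_likes` is a hypothetical name.

```python
def top_likes(result_lines, n=3):
    """Return the n most frequent (like_id, count) pairs, highest count first."""
    counts = []
    for line in result_lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # ignore blank or malformed lines
        counts.append((parts[0], int(parts[1])))
    # Sort by descending count; break ties by ID so the order is stable.
    return sorted(counts, key=lambda pair: (-pair[1], pair[0]))[:n]


sample = [
    "139784736075551\t1",
    "140413412750046\t6",
    "184331976202\t3",
    "220854914702193\t1",
    "29092950651\t1",
]
for like_id, count in top_likes(sample):
    print(like_id, count)
```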
  • 19. Algorithm Discussion: • The algorithm based on exact matches for likes may be too restrictive – 'Ella Fitzgerald' != 'Duke Ellington' – But people who like Ella Fitzgerald may be reachable the same way as people who like Duke Ellington – An idea to explore further: • Is there a way to find IDs that we might consider equivalent?
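One simple way to explore the question on slide 19 is to count likes by group rather than by exact ID. The sketch below assumes a lookup table from like ID to group label already exists; how to build that table is exactly the open question, and the IDs and the `jazz` label here are made up for illustration.

```python
from collections import Counter


def count_by_group(like_ids, group_of):
    """Count likes per group; IDs without a known group count as themselves."""
    return Counter(group_of.get(like_id, like_id) for like_id in like_ids)


# Hypothetical equivalence table: both pages are treated as the same interest.
group_of = {"id-ella-fitzgerald": "jazz", "id-duke-ellington": "jazz"}
likes = ["id-ella-fitzgerald", "id-duke-ellington", "id-something-else"]
print(count_by_group(likes, group_of))
```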
  • 20. Data Collected and Embellished: Original = 75, Friends = 750, Likes = 15,000, Similar Likes = 150,000
  • 21. Extended Facebook Analytics – Summary: • Extend the architecture – Get mappers to fetch “similar likes” from the internet
  • 22. Facebook Analytics – Showing Results: • The other challenge in putting together an analytics solution is displaying results – Demo of our results page
  • 23. Take-away Messages: • Map Reduce is simple, Hadoop is one implementation of MR… – …made even simpler by services like Elastic Map Reduce • But Map Reduce requires a different style of programming… – …and a different set of techniques for debugging • Facebook data can get big very quickly… – …and storage and bandwidth costs can dominate your solution • Analytics is an iterative (agile) process… – …each iteration requires evaluating results, and tuning the algorithms, possibly the acquisition of more data
  • 24. Thank you: • J Singh – President, Early Stage IT • Technology Services and Strategy for Startups • DataThinks.org is a service of Early Stage IT – “Big Data” analytics solutions

Editor's Notes

  1. Get started with Hadoop
  2. Get started with Hadoop