SlideShare una empresa de Scribd logo
1 de 25
Descargar para leer sin conexión
Building the Infrastructure
       for Big Data

            @ The Fifth Elephant
               July 27th, 2012
   -Prashant Kumar, Founder- PromptCloud




                                                        1
              © PromptCloud 2012, All rights reserved
Agenda
About

Context

Machines, Installation & Cloud Automation

Building blocks of a system

Sample application sketch

Lack of time components
                                                              2
                    © PromptCloud 2012, All rights reserved
About
                            Section 0




                                          3
© PromptCloud 2012, All rights reserved
About PromptCloud
 We provide data feeds and feed ourselves on data- since 2009
How??
• Large-scale data crawl and extraction
• Hosted indexing
• Custom data analytics
• Working round the clock

                     About Me
 • PromptCloud’s Founder
 • Yahoo! - 2007-2008
 • IIT-Kanpur CS- 2007
                                                                4
                      © PromptCloud 2012, All rights reserved
Deliverable




                                          5
© PromptCloud 2012, All rights reserved
Context
                            Section 0.1




                                          6
© PromptCloud 2012, All rights reserved
Generic Big Data Systems
• Multiple nodes (incoherent set of coherent
  ones)
• Compute layer- Interdependent processes
• Data storage layer & multiple middleware
• Tools for installation, monitoring & scheduling
*Meta- source control, code reviews, continuous integration



                                                               7
                     © PromptCloud 2012, All rights reserved
Machines, Installation &
     Cloud Automation
                                         Section 1




                                                       8
             © PromptCloud 2012, All rights reserved
Installation
             Create an image and install



•Easy to install                          •Modifications? Difficult to save
•No maintenance cost                      it back
•1 image for 1 purpose                    •Apt, yum, etc-keeper like
                                          systems but difficult to scale

                                                        Solutions?? 
                                                                              9
                         © PromptCloud 2012, All rights reserved
Enter the Magic!




Not a panacea; analgesic though


                                                   10
         © PromptCloud 2012, All rights reserved
Virtual Machines

                                                          Virtual Machines


                 ssh
         Up
Init
       Vagrant
                  Shared directory
       Port Forwarding                          AWS, Xen,                      Virtual Box
                                                 KVM,…                         Installation




                                                                                              11
                                     © PromptCloud 2012, All rights reserved
Code the Installation using Chef
             Give the recipe- code what’s to be done

                                 I’m Solo




  Roles,
                                                                 Data Files
 Recipes




Templates,
 Run List                                                        Chef Server


                                   Knife
                                                                               12
                       © PromptCloud 2012, All rights reserved
Building blocks
                                 Section 2




                                               13
     © PromptCloud 2012, All rights reserved
To keep processes running,

                                  Option 1- Install GOD to
                                   monitor processes and
                                    to keep them in place
                                   Option 2 (for atheists)-
                                        Install MONIT
     Courtesy- BIT Mesra




                                                         14
              © PromptCloud 2012, All rights reserved
God’s Snippet
God.watch do |w|
    w.name = watcher_name
    w.start = start_command
    #w.restart = restart_command
    w.stop = stop_command
    w.behavior(:clean_pid_file)
    #w.group = "some group"
    w.log = "/tmp/god_monitoring_#{watcher_name}.log"
    w.keepalive
    w.stop_timeout = 10.seconds
end

                                                               15
                     © PromptCloud 2012, All rights reserved
Job Scheduling
Resque, Beanstalk, Gearman, Celery, + cron and queues

Things to remember while making choices-

• Persistence
• Priorities
• Tags
• Option for retry
• Ability to inspect the queue


                                                              16
                    © PromptCloud 2012, All rights reserved
Data Storage Layer
SQL/NoSQL, key/value, document-based, graph databases

• For large systems, maintenance cost is a
  primary overhead
• Replication & Availability
• Consistency guarantees
• Full-text search



                                                             17
                   © PromptCloud 2012, All rights reserved
Voldemort                                                Not
                                                                       me!!!!!!!!
• Distributed key/value
  store
• Great performance
• Easy to add/remove
  nodes
• Alternatives- Mongo,                                    Courtesy- harrypotter.wikia.com

  Riak, Hbase, Cassandra


                                                                                            18
                © PromptCloud 2012, All rights reserved
Messaging Layer-

• RabbitMQ- most commonly used in high-load
  production systems
• Implements AMQP
• Robust exchange server
• Multiple kinds of exchanges- direct, topic,
  fanout
• Options for HA with Pacemaker/DRBD
                                                          19
                © PromptCloud 2012, All rights reserved
Demo
                            Section 3




                                          20
© PromptCloud 2012, All rights reserved
Demo Sketch
1. We’ll generate random sentences based on
   Markov chain

2. Store these in Voldemort

3. Enqueue corresponding jobs in RabbitMQ

4. Another set of workers will process these
   sentences


                                                           21
                 © PromptCloud 2012, All rights reserved
For the lack of time..
                                       Section 4




                                                     22
           © PromptCloud 2012, All rights reserved
Sensu &Graphite
•   Monitoring router
•   "check scripts” on nodes
•    “handler scripts” on servers
•   Output can be sent to pagerduty, graphite,
    twitter or IRC




                                                             23
                   © PromptCloud 2012, All rights reserved
Distributed Log Collection
               Scribe, Flume, Splunk

Flume
• Allows multiple topologies
• Agent
• Collector
• Sink


                                                           24
                 © PromptCloud 2012, All rights reserved
Feel free to reach out



                              Big Data made Small
                    info@promptcloud.com
                              Appreciate your time 




Thanks to Arpan Jha for her help with the slides
                                                                             25
                                   © PromptCloud 2012, All rights reserved

Más contenido relacionado

La actualidad más candente

Cloud Foundry and OpenStack
Cloud Foundry and OpenStackCloud Foundry and OpenStack
Cloud Foundry and OpenStackvadimspivak
 
SemeruRuntimesUnderTheCover .pptx
SemeruRuntimesUnderTheCover .pptxSemeruRuntimesUnderTheCover .pptx
SemeruRuntimesUnderTheCover .pptxSumanMitra22
 
Xen Cloud Platform at Build a Cloud Day at SCALE 10x
Xen Cloud Platform at Build a Cloud Day at SCALE 10x Xen Cloud Platform at Build a Cloud Day at SCALE 10x
Xen Cloud Platform at Build a Cloud Day at SCALE 10x The Linux Foundation
 
VMware Performance for Gurus - A Tutorial
VMware Performance for Gurus - A TutorialVMware Performance for Gurus - A Tutorial
VMware Performance for Gurus - A TutorialRichard McDougall
 
Crash course on open source cloud computing
Crash course on open source cloud computingCrash course on open source cloud computing
Crash course on open source cloud computingLorscheider Santiago
 
Virtualization Primer for Java Developers
Virtualization Primer for Java DevelopersVirtualization Primer for Java Developers
Virtualization Primer for Java DevelopersRichard McDougall
 
Xen server 6.1 customer presentation
Xen server 6.1 customer presentationXen server 6.1 customer presentation
Xen server 6.1 customer presentationNuno Alves
 
What’s New in vCloud Director 5.1?
What’s New in vCloud Director 5.1?What’s New in vCloud Director 5.1?
What’s New in vCloud Director 5.1?Eric Sloof
 
20 christian ferber xen_server_6_workshop
20 christian ferber xen_server_6_workshop20 christian ferber xen_server_6_workshop
20 christian ferber xen_server_6_workshopDigicomp Academy AG
 
The glideinWMS approach to the ownership of System Images in the Cloud World
The glideinWMS approach to the ownership of System Images in the Cloud WorldThe glideinWMS approach to the ownership of System Images in the Cloud World
The glideinWMS approach to the ownership of System Images in the Cloud WorldIgor Sfiligoi
 
Getting Started with Rails on GlassFish (Hands-on Lab) - Spark IT 2010
Getting Started with Rails on GlassFish (Hands-on Lab) - Spark IT 2010Getting Started with Rails on GlassFish (Hands-on Lab) - Spark IT 2010
Getting Started with Rails on GlassFish (Hands-on Lab) - Spark IT 2010Arun Gupta
 
Fusion-io Memory Flash for Microsoft SQL Server 2012
Fusion-io Memory Flash for Microsoft SQL Server 2012Fusion-io Memory Flash for Microsoft SQL Server 2012
Fusion-io Memory Flash for Microsoft SQL Server 2012Mark Ginnebaugh
 
Keynote Speech: Xen ARM Virtualization
Keynote Speech: Xen ARM VirtualizationKeynote Speech: Xen ARM Virtualization
Keynote Speech: Xen ARM VirtualizationThe Linux Foundation
 
Accelerate Your Rails Site with Automatic Generation-Based Action Caching
Accelerate Your Rails Site with Automatic Generation-Based Action CachingAccelerate Your Rails Site with Automatic Generation-Based Action Caching
Accelerate Your Rails Site with Automatic Generation-Based Action Cachingelliando dias
 

La actualidad más candente (20)

Cloud Foundry and OpenStack
Cloud Foundry and OpenStackCloud Foundry and OpenStack
Cloud Foundry and OpenStack
 
VMware vSphere
VMware vSphereVMware vSphere
VMware vSphere
 
SemeruRuntimesUnderTheCover .pptx
SemeruRuntimesUnderTheCover .pptxSemeruRuntimesUnderTheCover .pptx
SemeruRuntimesUnderTheCover .pptx
 
Xen Cloud Platform at Build a Cloud Day at SCALE 10x
Xen Cloud Platform at Build a Cloud Day at SCALE 10x Xen Cloud Platform at Build a Cloud Day at SCALE 10x
Xen Cloud Platform at Build a Cloud Day at SCALE 10x
 
VMware Performance for Gurus - A Tutorial
VMware Performance for Gurus - A TutorialVMware Performance for Gurus - A Tutorial
VMware Performance for Gurus - A Tutorial
 
XS Boston 2008 ARM
XS Boston 2008 ARMXS Boston 2008 ARM
XS Boston 2008 ARM
 
Crash course on open source cloud computing
Crash course on open source cloud computingCrash course on open source cloud computing
Crash course on open source cloud computing
 
Virtualization Primer for Java Developers
Virtualization Primer for Java DevelopersVirtualization Primer for Java Developers
Virtualization Primer for Java Developers
 
VMware vSphere5.1 Training
VMware vSphere5.1 TrainingVMware vSphere5.1 Training
VMware vSphere5.1 Training
 
XS Japan 2008 App Data English
XS Japan 2008 App Data EnglishXS Japan 2008 App Data English
XS Japan 2008 App Data English
 
Xen server 6.1 customer presentation
Xen server 6.1 customer presentationXen server 6.1 customer presentation
Xen server 6.1 customer presentation
 
What’s New in vCloud Director 5.1?
What’s New in vCloud Director 5.1?What’s New in vCloud Director 5.1?
What’s New in vCloud Director 5.1?
 
20 christian ferber xen_server_6_workshop
20 christian ferber xen_server_6_workshop20 christian ferber xen_server_6_workshop
20 christian ferber xen_server_6_workshop
 
PowerHA for i
PowerHA for iPowerHA for i
PowerHA for i
 
The glideinWMS approach to the ownership of System Images in the Cloud World
The glideinWMS approach to the ownership of System Images in the Cloud WorldThe glideinWMS approach to the ownership of System Images in the Cloud World
The glideinWMS approach to the ownership of System Images in the Cloud World
 
Getting Started with Rails on GlassFish (Hands-on Lab) - Spark IT 2010
Getting Started with Rails on GlassFish (Hands-on Lab) - Spark IT 2010Getting Started with Rails on GlassFish (Hands-on Lab) - Spark IT 2010
Getting Started with Rails on GlassFish (Hands-on Lab) - Spark IT 2010
 
Fusion-io Memory Flash for Microsoft SQL Server 2012
Fusion-io Memory Flash for Microsoft SQL Server 2012Fusion-io Memory Flash for Microsoft SQL Server 2012
Fusion-io Memory Flash for Microsoft SQL Server 2012
 
Keynote Speech: Xen ARM Virtualization
Keynote Speech: Xen ARM VirtualizationKeynote Speech: Xen ARM Virtualization
Keynote Speech: Xen ARM Virtualization
 
XS Japan 2008 Citrix English
XS Japan 2008 Citrix EnglishXS Japan 2008 Citrix English
XS Japan 2008 Citrix English
 
Accelerate Your Rails Site with Automatic Generation-Based Action Caching
Accelerate Your Rails Site with Automatic Generation-Based Action CachingAccelerate Your Rails Site with Automatic Generation-Based Action Caching
Accelerate Your Rails Site with Automatic Generation-Based Action Caching
 

Similar a Building infrastructure for Big Data

ActiveMQ Performance Tuning
ActiveMQ Performance TuningActiveMQ Performance Tuning
ActiveMQ Performance TuningChristian Posta
 
Big data in travel domain
Big data in travel domainBig data in travel domain
Big data in travel domainPromptCloud
 
Getting Started Developing with Platform as a Service
Getting Started Developing with Platform as a ServiceGetting Started Developing with Platform as a Service
Getting Started Developing with Platform as a ServiceCloudBees
 
Securing User Data with SQLCipher
Securing User Data with SQLCipherSecuring User Data with SQLCipher
Securing User Data with SQLCipherCommonsWare
 
TDWI_DW2014_SQLNoSQL_DBAAS
TDWI_DW2014_SQLNoSQL_DBAASTDWI_DW2014_SQLNoSQL_DBAAS
TDWI_DW2014_SQLNoSQL_DBAASRomeo Kienzler
 
Platform Engineering for the Modern Oracle World
Platform Engineering for the Modern Oracle WorldPlatform Engineering for the Modern Oracle World
Platform Engineering for the Modern Oracle WorldSimon Haslam
 
Greenplum Analytics Workbench - What Can a Private Hadoop Cloud Do For You?
Greenplum Analytics Workbench - What Can a Private Hadoop Cloud Do For You?  Greenplum Analytics Workbench - What Can a Private Hadoop Cloud Do For You?
Greenplum Analytics Workbench - What Can a Private Hadoop Cloud Do For You? EMC
 
Hadoop Operations for Production Systems (Strata NYC)
Hadoop Operations for Production Systems (Strata NYC)Hadoop Operations for Production Systems (Strata NYC)
Hadoop Operations for Production Systems (Strata NYC)Kathleen Ting
 
Operate your hadoop cluster like a high eff goldmine
Operate your hadoop cluster like a high eff goldmineOperate your hadoop cluster like a high eff goldmine
Operate your hadoop cluster like a high eff goldmineDataWorks Summit
 
Webinar: eFolder Expert Series: Five Technologies from AppAssure to Boost You...
Webinar: eFolder Expert Series: Five Technologies from AppAssure to Boost You...Webinar: eFolder Expert Series: Five Technologies from AppAssure to Boost You...
Webinar: eFolder Expert Series: Five Technologies from AppAssure to Boost You...Dropbox
 
JVM Multitenancy (JavaOne 2012)
JVM Multitenancy (JavaOne 2012)JVM Multitenancy (JavaOne 2012)
JVM Multitenancy (JavaOne 2012)Graeme_IBM
 
GlassFish in Production Environments
GlassFish in Production EnvironmentsGlassFish in Production Environments
GlassFish in Production EnvironmentsBruno Borges
 
Machine Learning and Hadoop: Present and Future
Machine Learning and Hadoop: Present and FutureMachine Learning and Hadoop: Present and Future
Machine Learning and Hadoop: Present and FutureData Science London
 
Ecommerce product review and price crawl
Ecommerce product review and price crawlEcommerce product review and price crawl
Ecommerce product review and price crawlPromptCloud
 
eFolder Webinar: How One Partner Leverages Dell AppAssure and StorageCraft
eFolder Webinar: How One Partner Leverages Dell AppAssure and StorageCrafteFolder Webinar: How One Partner Leverages Dell AppAssure and StorageCraft
eFolder Webinar: How One Partner Leverages Dell AppAssure and StorageCraftDropbox
 
Nagios Conference 2012 - Eric Loyd - Nagios Implementation Case Eastman Kodak...
Nagios Conference 2012 - Eric Loyd - Nagios Implementation Case Eastman Kodak...Nagios Conference 2012 - Eric Loyd - Nagios Implementation Case Eastman Kodak...
Nagios Conference 2012 - Eric Loyd - Nagios Implementation Case Eastman Kodak...Nagios
 
Monitoring VMware vFabric with Hyperic and Spring Insight
Monitoring VMware vFabric with Hyperic and Spring InsightMonitoring VMware vFabric with Hyperic and Spring Insight
Monitoring VMware vFabric with Hyperic and Spring InsightC2B2 Consulting
 

Similar a Building infrastructure for Big Data (20)

ActiveMQ Performance Tuning
ActiveMQ Performance TuningActiveMQ Performance Tuning
ActiveMQ Performance Tuning
 
Big data in travel domain
Big data in travel domainBig data in travel domain
Big data in travel domain
 
Getting Started Developing with Platform as a Service
Getting Started Developing with Platform as a ServiceGetting Started Developing with Platform as a Service
Getting Started Developing with Platform as a Service
 
Securing User Data with SQLCipher
Securing User Data with SQLCipherSecuring User Data with SQLCipher
Securing User Data with SQLCipher
 
TDWI_DW2014_SQLNoSQL_DBAAS
TDWI_DW2014_SQLNoSQL_DBAASTDWI_DW2014_SQLNoSQL_DBAAS
TDWI_DW2014_SQLNoSQL_DBAAS
 
Platform Engineering for the Modern Oracle World
Platform Engineering for the Modern Oracle WorldPlatform Engineering for the Modern Oracle World
Platform Engineering for the Modern Oracle World
 
Greenplum Analytics Workbench - What Can a Private Hadoop Cloud Do For You?
Greenplum Analytics Workbench - What Can a Private Hadoop Cloud Do For You?  Greenplum Analytics Workbench - What Can a Private Hadoop Cloud Do For You?
Greenplum Analytics Workbench - What Can a Private Hadoop Cloud Do For You?
 
Hadoop Operations for Production Systems (Strata NYC)
Hadoop Operations for Production Systems (Strata NYC)Hadoop Operations for Production Systems (Strata NYC)
Hadoop Operations for Production Systems (Strata NYC)
 
Operate your hadoop cluster like a high eff goldmine
Operate your hadoop cluster like a high eff goldmineOperate your hadoop cluster like a high eff goldmine
Operate your hadoop cluster like a high eff goldmine
 
Webinar: eFolder Expert Series: Five Technologies from AppAssure to Boost You...
Webinar: eFolder Expert Series: Five Technologies from AppAssure to Boost You...Webinar: eFolder Expert Series: Five Technologies from AppAssure to Boost You...
Webinar: eFolder Expert Series: Five Technologies from AppAssure to Boost You...
 
JVM Multitenancy (JavaOne 2012)
JVM Multitenancy (JavaOne 2012)JVM Multitenancy (JavaOne 2012)
JVM Multitenancy (JavaOne 2012)
 
OWF12/Java Sacha labourey
OWF12/Java Sacha laboureyOWF12/Java Sacha labourey
OWF12/Java Sacha labourey
 
GlassFish in Production Environments
GlassFish in Production EnvironmentsGlassFish in Production Environments
GlassFish in Production Environments
 
Machine Learning and Hadoop: Present and Future
Machine Learning and Hadoop: Present and FutureMachine Learning and Hadoop: Present and Future
Machine Learning and Hadoop: Present and Future
 
Hadoop Operations
Hadoop OperationsHadoop Operations
Hadoop Operations
 
Flume and HBase
Flume and HBase Flume and HBase
Flume and HBase
 
Ecommerce product review and price crawl
Ecommerce product review and price crawlEcommerce product review and price crawl
Ecommerce product review and price crawl
 
eFolder Webinar: How One Partner Leverages Dell AppAssure and StorageCraft
eFolder Webinar: How One Partner Leverages Dell AppAssure and StorageCrafteFolder Webinar: How One Partner Leverages Dell AppAssure and StorageCraft
eFolder Webinar: How One Partner Leverages Dell AppAssure and StorageCraft
 
Nagios Conference 2012 - Eric Loyd - Nagios Implementation Case Eastman Kodak...
Nagios Conference 2012 - Eric Loyd - Nagios Implementation Case Eastman Kodak...Nagios Conference 2012 - Eric Loyd - Nagios Implementation Case Eastman Kodak...
Nagios Conference 2012 - Eric Loyd - Nagios Implementation Case Eastman Kodak...
 
Monitoring VMware vFabric with Hyperic and Spring Insight
Monitoring VMware vFabric with Hyperic and Spring InsightMonitoring VMware vFabric with Hyperic and Spring Insight
Monitoring VMware vFabric with Hyperic and Spring Insight
 

Más de PromptCloud

Big Data’s Potential for the Real Estate Industry: 2021
Big Data’s Potential for the Real Estate Industry: 2021Big Data’s Potential for the Real Estate Industry: 2021
Big Data’s Potential for the Real Estate Industry: 2021PromptCloud
 
All You Need to Know About Web Crawling.pdf
All You Need to Know About Web Crawling.pdfAll You Need to Know About Web Crawling.pdf
All You Need to Know About Web Crawling.pdfPromptCloud
 
Web Scraping Myths vs. Facts
Web Scraping Myths vs. FactsWeb Scraping Myths vs. Facts
Web Scraping Myths vs. FactsPromptCloud
 
Octoparse competitors.pdf
Octoparse competitors.pdfOctoparse competitors.pdf
Octoparse competitors.pdfPromptCloud
 
Parsehub and competitior ppt.pptx
Parsehub and competitior ppt.pptxParsehub and competitior ppt.pptx
Parsehub and competitior ppt.pptxPromptCloud
 
Product Visibility- What Is Seen First, Will ppt.pptx
Product Visibility- What Is Seen First, Will ppt.pptxProduct Visibility- What Is Seen First, Will ppt.pptx
Product Visibility- What Is Seen First, Will ppt.pptxPromptCloud
 
Data Trends in Fashion Industry
Data Trends in Fashion IndustryData Trends in Fashion Industry
Data Trends in Fashion IndustryPromptCloud
 
Data Standardization with Web Data Integration
Data Standardization with Web Data Integration Data Standardization with Web Data Integration
Data Standardization with Web Data Integration PromptCloud
 
Visualizing Marvel Cinematic Universe Movies
Visualizing Marvel Cinematic Universe MoviesVisualizing Marvel Cinematic Universe Movies
Visualizing Marvel Cinematic Universe MoviesPromptCloud
 
15 Key Metrics Every E-commerce Business Should Track
15 Key Metrics Every E-commerce Business Should Track15 Key Metrics Every E-commerce Business Should Track
15 Key Metrics Every E-commerce Business Should TrackPromptCloud
 
Top Amazon Services for Ecommerce Players
Top Amazon Services for Ecommerce PlayersTop Amazon Services for Ecommerce Players
Top Amazon Services for Ecommerce PlayersPromptCloud
 
The Birth of a Web Crawling Bot
The Birth of a Web Crawling BotThe Birth of a Web Crawling Bot
The Birth of a Web Crawling BotPromptCloud
 
Upcoming Applications of Artificial intelligence in 2019
Upcoming Applications of Artificial intelligence in 2019Upcoming Applications of Artificial intelligence in 2019
Upcoming Applications of Artificial intelligence in 2019PromptCloud
 
Zipcode based price benchmarking for retailers
Zipcode based price benchmarking for retailersZipcode based price benchmarking for retailers
Zipcode based price benchmarking for retailersPromptCloud
 
Analyzing Positiveness in 160+ Holiday Songs
Analyzing Positiveness in 160+ Holiday SongsAnalyzing Positiveness in 160+ Holiday Songs
Analyzing Positiveness in 160+ Holiday SongsPromptCloud
 
PromptCloud's Year in Review - 2019
PromptCloud's Year in Review - 2019PromptCloud's Year in Review - 2019
PromptCloud's Year in Review - 2019PromptCloud
 
Top Data Analytics Trends for 2019
Top Data Analytics Trends for 2019Top Data Analytics Trends for 2019
Top Data Analytics Trends for 2019PromptCloud
 
10 Mobile App Ideas that can be Fueled by Web Scraping
10 Mobile App Ideas that can be Fueled by Web Scraping10 Mobile App Ideas that can be Fueled by Web Scraping
10 Mobile App Ideas that can be Fueled by Web ScrapingPromptCloud
 
How Web Scraping Can Help Affiliate Marketers
How Web Scraping Can Help Affiliate MarketersHow Web Scraping Can Help Affiliate Marketers
How Web Scraping Can Help Affiliate MarketersPromptCloud
 
Hotel Review Data Analysis
Hotel Review Data AnalysisHotel Review Data Analysis
Hotel Review Data AnalysisPromptCloud
 

Más de PromptCloud (20)

Big Data’s Potential for the Real Estate Industry: 2021
Big Data’s Potential for the Real Estate Industry: 2021Big Data’s Potential for the Real Estate Industry: 2021
Big Data’s Potential for the Real Estate Industry: 2021
 
All You Need to Know About Web Crawling.pdf
All You Need to Know About Web Crawling.pdfAll You Need to Know About Web Crawling.pdf
All You Need to Know About Web Crawling.pdf
 
Web Scraping Myths vs. Facts
Web Scraping Myths vs. FactsWeb Scraping Myths vs. Facts
Web Scraping Myths vs. Facts
 
Octoparse competitors.pdf
Octoparse competitors.pdfOctoparse competitors.pdf
Octoparse competitors.pdf
 
Parsehub and competitior ppt.pptx
Parsehub and competitior ppt.pptxParsehub and competitior ppt.pptx
Parsehub and competitior ppt.pptx
 
Product Visibility- What Is Seen First, Will ppt.pptx
Product Visibility- What Is Seen First, Will ppt.pptxProduct Visibility- What Is Seen First, Will ppt.pptx
Product Visibility- What Is Seen First, Will ppt.pptx
 
Data Trends in Fashion Industry
Data Trends in Fashion IndustryData Trends in Fashion Industry
Data Trends in Fashion Industry
 
Data Standardization with Web Data Integration
Data Standardization with Web Data Integration Data Standardization with Web Data Integration
Data Standardization with Web Data Integration
 
Visualizing Marvel Cinematic Universe Movies
Visualizing Marvel Cinematic Universe MoviesVisualizing Marvel Cinematic Universe Movies
Visualizing Marvel Cinematic Universe Movies
 
15 Key Metrics Every E-commerce Business Should Track
15 Key Metrics Every E-commerce Business Should Track15 Key Metrics Every E-commerce Business Should Track
15 Key Metrics Every E-commerce Business Should Track
 
Top Amazon Services for Ecommerce Players
Top Amazon Services for Ecommerce PlayersTop Amazon Services for Ecommerce Players
Top Amazon Services for Ecommerce Players
 
The Birth of a Web Crawling Bot
The Birth of a Web Crawling BotThe Birth of a Web Crawling Bot
The Birth of a Web Crawling Bot
 
Upcoming Applications of Artificial intelligence in 2019
Upcoming Applications of Artificial intelligence in 2019Upcoming Applications of Artificial intelligence in 2019
Upcoming Applications of Artificial intelligence in 2019
 
Zipcode based price benchmarking for retailers
Zipcode based price benchmarking for retailersZipcode based price benchmarking for retailers
Zipcode based price benchmarking for retailers
 
Analyzing Positiveness in 160+ Holiday Songs
Analyzing Positiveness in 160+ Holiday SongsAnalyzing Positiveness in 160+ Holiday Songs
Analyzing Positiveness in 160+ Holiday Songs
 
PromptCloud's Year in Review - 2019
PromptCloud's Year in Review - 2019PromptCloud's Year in Review - 2019
PromptCloud's Year in Review - 2019
 
Top Data Analytics Trends for 2019
Top Data Analytics Trends for 2019Top Data Analytics Trends for 2019
Top Data Analytics Trends for 2019
 
10 Mobile App Ideas that can be Fueled by Web Scraping
10 Mobile App Ideas that can be Fueled by Web Scraping10 Mobile App Ideas that can be Fueled by Web Scraping
10 Mobile App Ideas that can be Fueled by Web Scraping
 
How Web Scraping Can Help Affiliate Marketers
How Web Scraping Can Help Affiliate MarketersHow Web Scraping Can Help Affiliate Marketers
How Web Scraping Can Help Affiliate Marketers
 
Hotel Review Data Analysis
Hotel Review Data AnalysisHotel Review Data Analysis
Hotel Review Data Analysis
 

Último

Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...itnewsafrica
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesMuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesManik S Magar
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfpanagenda
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityIES VE
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentPim van der Noll
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterMydbops
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Nikki Chapple
 
Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)Kaya Weers
 
Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024TopCSSGallery
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationKnoldus Inc.
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfNeo4j
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...itnewsafrica
 

Último (20)

Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesMuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a reality
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL Router
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
 
Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)
 
Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog Presentation
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdf
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...
 

Building infrastructure for Big Data

  • 1. Building the Infrastructure for Big Data @ The Fifth Elephant July 27th, 2012 -Prashant Kumar, Founder- PromptCloud 1 © PromptCloud 2012, All rights reserved
  • 2. Agenda About Context Machines, Installation & Cloud Automation Building blocks of a system Sample application sketch Lack of time components 2 © PromptCloud 2012, All rights reserved
  • 3. About Section 0 3 © PromptCloud 2012, All rights reserved
  • 4. About PromptCloud We provide data feeds and feed ourselves on data- since 2009 How?? • Large-scale data crawl and extraction • Hosted indexing • Custom data analytics • Working round the clock About Me • PromptCloud’s Founder • Yahoo! - 2007-2008 • IIT-Kanpur CS- 2007 4 © PromptCloud 2012, All rights reserved
  • 5. Deliverable 5 © PromptCloud 2012, All rights reserved
  • 6. Context Section 0.1 6 © PromptCloud 2012, All rights reserved
  • 7. Generic Big Data Systems • Multiple nodes (incoherent set of coherent ones) • Compute layer- Interdependent processes • Data storage layer & multiple middleware • Tools for installation, monitoring & scheduling *Meta- source control, code reviews, continuous integration 7 © PromptCloud 2012, All rights reserved
  • 8. Machines, Installation & Cloud Automation Section 1 8 © PromptCloud 2012, All rights reserved
  • 9. Installation Create an image and install •Easy to install •Modifications? Difficult to save •No maintenance cost it back •1 image for 1 purpose •Apt, yum, etc-keeper like systems but difficult to scale Solutions??  9 © PromptCloud 2012, All rights reserved
  • 10. Enter the Magic! Not a panacea; analgesic though 10 © PromptCloud 2012, All rights reserved
  • 11. Virtual Machines Virtual Machines ssh Up Init Vagrant Shared directory Port Forwarding AWS, Xen, Virtual Box KVM,… Installation 11 © PromptCloud 2012, All rights reserved
  • 12. Code the Installation using Chef Give the recipe- code what’s to be done I’m Solo Roles, Data Files Recipes Templates, Run List Chef Server Knife 12 © PromptCloud 2012, All rights reserved
  • 13. Building blocks Section 2 13 © PromptCloud 2012, All rights reserved
  • 14. To keep processes running, Option 1- Install GOD to monitor processes and to keep them in place Option 2 (for atheists)- Install MONIT Courtesy- BIT Mesra 14 © PromptCloud 2012, All rights reserved
  • 15. God’s Snippet God.watch do |w| w.name = watcher_name w.start = start_command #w.restart = restart_command w.stop = stop_command w.behavior(:clean_pid_file) #w.group = "some group" w.log = "/tmp/god_monitoring_#{watcher_name}.log" w.keepalive w.stop_timeout = 10.seconds end 15 © PromptCloud 2012, All rights reserved
  • 16. Job Scheduling Resque, Beanstalk, Gearman, Celery, + cron and queues Things to remember while making choices- • Persistence • Priorities • Tags • Option for retry • Ability to inspect the queue 16 © PromptCloud 2012, All rights reserved
  • 17. Data Storage Layer SQL/NoSQL, key/value, document-based, graph databases • For large systems, maintenance cost is a primary overhead • Replication & Availability • Consistency guarantees • Full-text search 17 © PromptCloud 2012, All rights reserved
  • 18. Voldemort Not me!!!!!!!! • Distributed key/value store • Great performance • Easy to add/remove nodes • Alternatives- Mongo, Courtesy- harrypotter.wikia.com Riak, Hbase, Cassandra 18 © PromptCloud 2012, All rights reserved
  • 19. Messaging Layer- • RabbitMQ- most commonly used in high-load production systems • Implements AMQP • Robust exchange server • Multiple kinds of exchanges- direct, topic, fanout • Options for HA with Pacemaker/DRBD 19 © PromptCloud 2012, All rights reserved
  • 20. Demo Section 3 20 © PromptCloud 2012, All rights reserved
  • 21. Demo Sketch 1. We’ll generate random sentences based on Markov chain 2. Store these in Voldemort 3. Enqueue corresponding jobs in RabbitMQ 4. Another set of workers will process these sentences 21 © PromptCloud 2012, All rights reserved
  • 22. For the lack of time.. Section 4 22 © PromptCloud 2012, All rights reserved
  • 23. Sensu &Graphite • Monitoring router • "check scripts” on nodes • “handler scripts” on servers • Output can be sent to pagerduty, graphite, twitter or IRC 23 © PromptCloud 2012, All rights reserved
  • 24. Distributed Log Collection Scribe, Flume, Splunk Flume • Allows multiple topologies • Agent • Collector • Sink 24 © PromptCloud 2012, All rights reserved
  • 25. Feel free to reach out Big Data made Small info@promptcloud.com Appreciate your time  Thanks to Arpan Jha for her help with the slides 25 © PromptCloud 2012, All rights reserved