Google App Engine
A Big Data Laboratory?
J Singh, Early Stage IT
March 20, 2012
© J Singh, 2011
App Engine as a Big Data Laboratory?
• Why bother? Why not use Hadoop?
• Evaluating App Engine as a Big Data Laboratory
– Loading Data
– Analytics Capabilities
– Visualization Capabilities
• Conclusions
Why Bother? Why not Hadoop?
• No install and configuration required
– Focus on the task: Analytics and Visualization
– Use the technology that powers Google Earth and Google Finance
• Works with Google Datastore
– Makes sense if your data is already there
• No import/export of data necessary
• But a purely 'low-level' programming environment
– Write Map and Reduce functions in Python / Java
– No Pig, Hive, …
• Is this story for real? We wanted to find out.
Loading Data into GAE
• What? No native OS environment to work in?
– No OS commands, no file system accessible to the programmer
– Data Prep must be done elsewhere.
• But other options exist
1. Upload a file into Blobstore through an HTTP request
• Max object size 2GB, max get/put in one call: 1MB.
• Process into Datastore entities using BlobstoreInputReader or
BlobstoreZipInputReader classes.
2. Use remote_api to upload CSV files
• It's painful
– Only needs to be done once, we hope
– Or we need to set up a process for staging and feeding the data
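The per-call limit above means a large blob has to be read in pieces. A minimal sketch of the arithmetic, assuming a 1 MB limit per call and inclusive end offsets; the name `byte_ranges` and the constant are hypothetical, and no App Engine API is called here:

```python
MAX_CALL_BYTES = 1024 * 1024  # assumed per-call limit, taken from the 1MB figure


def byte_ranges(total_size, max_bytes=MAX_CALL_BYTES):
    """Yield (start, end) inclusive byte offsets that cover total_size bytes."""
    if total_size < 0 or max_bytes <= 0:
        raise ValueError("total_size must be >= 0 and max_bytes > 0")
    start = 0
    while start < total_size:
        # Each range holds at most max_bytes; the last one may be shorter.
        end = min(start + max_bytes, total_size) - 1
        yield (start, end)
        start = end + 1
```

Under these assumptions, a blob at the 2GB maximum would take 2048 such calls, which is one reason a staging process is worth setting up.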
Data Analysis: NumPy and SciPy
• NumPy and SciPy libraries using the traditional computing
model (not Map/Reduce) include:
– Array and Matrix manipulation
– Optimization algorithms, e.g., curve fitting, linear regression,
multi-variate regression.
– Multithreading (for embarrassingly parallel problems)
• Replace map(…) with parallel_map(…).
– map is a Python built-in
– parallel_map is a helper from a SciPy Cookbook threading recipe, not a NumPy built-in
– Other scientific algorithms, e.g., Kalman Filtering, Signal
smoothing, Markov Chains.
• NumPy and SciPy depend on the Python 2.7 runtime
– Enabled on App Engine in fall 2011.
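The map-to-parallel_map swap can be sketched with the standard library alone. This is an illustration for a current Python 3 interpreter, not the App Engine 2.7 runtime, and this `parallel_map` is a hypothetical stand-in written here with `concurrent.futures`, not the helper the slide refers to:

```python
from concurrent.futures import ThreadPoolExecutor


def parallel_map(func, items, workers=4):
    """Apply func to each item on a thread pool, preserving input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(func, items))


def square(x):
    """Stand-in for an independent per-item computation."""
    return x * x


serial = list(map(square, range(5)))
threaded = parallel_map(square, range(5))
```

Both calls return the same list, so the swap is a one-line change at the call site. Threads mainly help when the per-item work releases the interpreter lock (I/O or native array code); pure-Python arithmetic like `square` will not speed up.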
Data Analysis: MapReduce
• Input Reader
– Several provided by GAE, can write your own
• Map function: Written by Programmer
• Shuffle function: Provided
– Can write your own overrides for partitioning (sharding) and comparison (used in sort)
• Reduce function: Written by Programmer
– Can be skipped if not needed
• Output Writer
– Several provided by GAE, can write your own
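The division of labor among these stages can be shown with an in-memory word count. This is a plain-Python sketch of the pattern only; it does not use the App Engine MapReduce library, and all function names are hypothetical:

```python
from collections import defaultdict


def map_words(line):
    """Map step: emit a (word, 1) pair for every word in one input record."""
    for word in line.lower().split():
        yield (word, 1)


def shuffle(pairs):
    """Shuffle step: group all values by key and return keys in sorted order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_counts(key, values):
    """Reduce step: collapse the values for one key into a single total."""
    return (key, sum(values))


def run_job(records):
    """Wire the stages together: reader -> map -> shuffle -> reduce -> writer."""
    mapped = (pair for record in records for pair in map_words(record))
    return [reduce_counts(key, values) for key, values in shuffle(mapped)]
```

Here the list passed to `run_job` plays the role of the input reader and the returned list plays the role of the output writer; only `map_words` and `reduce_counts` correspond to code the programmer writes.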
Data Analysis: Pipeline API
• Based on Python Generator functions
• Allows chaining of MapReduce jobs
– Primitives for setting up various types of chains
• MapreducePipeline (previous slide) was just one type of pipeline
• Available for Python or Java
– Python side better documented
Split and Merge example
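# Assumed imports for the App Engine Pipeline library.
import pipeline
from pipeline import common

class aPipe(pipeline.Pipeline):
    def run(self, e_kind, prop_name, *value_list):
        all_bs = []
        # Split: start one child bPipe (defined elsewhere) per value.
        for v in value_list:
            stage = yield bPipe(e_kind, prop_name, v)
            all_bs.append(stage)
        # Merge: combine the child results into one list.
        yield common.Append(*all_bs)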
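The generator mechanics behind the split-and-merge example can be seen in a toy that runs without App Engine. This is a sketch only: `drive` and the task table are hypothetical, and the toy runs each child eagerly, whereas the real Pipeline API hands back a future for each yielded child:

```python
def split_and_merge(values):
    """Generator in the style of the slide's run(): fan out, then merge."""
    results = []
    for v in values:
        # Yield a child task; the driver sends its result back in.
        result = yield ("square", v)
        results.append(result)
    # The final yield is the merge step over all child results.
    yield ("append", results)


def drive(gen):
    """Toy driver: run each yielded task and return the last task's result."""
    tasks = {
        "square": lambda x: x * x,
        "append": lambda items: list(items),
    }
    outcome = None
    try:
        name, arg = next(gen)
        while True:
            outcome = tasks[name](arg)
            name, arg = gen.send(outcome)
    except StopIteration:
        return outcome
```

The point of the pattern is that the value of each `yield` expression is supplied by whoever drives the generator, which is what lets a framework schedule the children however it likes.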
Data Visualization
• App Engine supports multiple web frameworks for serving data
directly from the Datastore to an HTML5 browser:
– Django, Jinja2, CherryPy, …
• Options:
– jQuery Visualize
– Google Visualization API
• Including MotionCharts
– Hans Rosling's Visualization API
– Check out his TED talk
• Conclusion:
– A rich set of facilities for visualization and taking action
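On the server side, feeding a chart mostly means shaping query results into JSON. A minimal sketch, assuming the `cols`/`rows` literal layout that the Google Visualization API uses for its DataTable; the helper name `to_datatable` and the sample values are hypothetical, and no Datastore query is shown:

```python
import json


def to_datatable(columns, rows):
    """Build a dict in the DataTable literal shape.

    columns: list of (id, label, type) tuples.
    rows: list of tuples with one value per column.
    """
    return {
        "cols": [{"id": cid, "label": label, "type": ctype}
                 for cid, label, ctype in columns],
        "rows": [{"c": [{"v": value} for value in row]} for row in rows],
    }


# Made-up sample data, for illustration only.
columns = [("name", "Name", "string"), ("score", "Score", "number")]
table = to_datatable(columns, [("A", 1.5), ("B", 2.0)])
payload = json.dumps(table)
```

A request handler in any of the frameworks listed above could return `payload` as the response body for the browser-side chart code to consume.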
Decision Factors
Usage / Discussion
• Proof of Concept or Demo in GAE
– Need a process for Data Loading
– But saves on having to do Hadoop setup
– Absence of Pig/Hive may be a limiting factor
– Advantage in Visualization
– Better security and isolation than Hadoop
• Production in GAE
– Analyze cost before committing
– Lock-in risk?
• Production elsewhere
– Good semantic match between Datastore and HBase
– Need to do Hadoop setup and operation
Thank you
• J Singh
– President, Early Stage IT
• Technology Services and Strategy for Startups
• DataThinks.org is a new service of Early Stage IT
– “Big Data” analytics solutions
