SlideShare a Scribd company logo
1 of 10

GrandData
                                      InfoVis challenge




“We are Big-data analysts. We
will be a Legion. We do work
hard. We do not forget
scalability.
 Expect us in you datacenter”



                                grandata.azurewebsites.net/
Data we dealt with
 Fetched from peerIndex

 The top most influencer twitter users in UK

 For each of them:
   Popular topics
   Influence graph (who influences? From whom has been
      influenced?)
   Some statistics and data on his/her activity
   His/Her twitter info

 Data are unstructured (mainly text, different attributes)
Approaching the problem
 Our focus: make a scalable Infovis solution
    If data grow, everything should scale to guarantee a
     fixed response time. At least we hope so 
    No bottlenecks nor single point of failure in the data
     processing flow
    Data are unstructured. Schemaless DB!

 Additionally: 24hrs aren’t enough to build a
   complete system. That’s only a fully-working proto
Considerations
 Problem: DB scalability and easy prototyping:
    Solution: use a sharded database -> MongoLab

 Problem: quick coldstart, reliability and easy
   management
    Solution: cloud -> Windows Azure

 Problem: algorithm scalability
    Solution: MapReduce
Vis
 Moving data to the browser is not a big-data
   challenge:
    Few pieces of data (compared to the stored)
    Very effective graphics library publicly released

 Support any (recent) browser
Further considerations
 Problem: move data to the browser
    Solution: we use MongoLab -> REST calls

 Problem: Simple frontend that can runs everywhere
    Solution: stay simple -> HTML, CSS and javascript

 Problem: surfing the UX must be appealing
    Solution: powerful js graphics library -> d3js
Algo complexity
 Given N topics and K users, the complexity is
   O(K*N)
    Since the big-data, in this case, are the users (N will
     be slow increasing during the time), the complexity
     can be approximated as O(K)
    That’s linear! Great for a big-data task 
Algo enhancement
 Given all the scores of a person, a prediction of its
   (near) future trend is trivial. For each topic.
    It’s possible to build a time-series prediction of what
       might be the next value of each score.

 If data are partially missing, or a subsampling
   filtering has been applied, it’s still possible to
   predict the scores of a generic user.
    Collaborative filtering based on user/score matrix.
If anyone wants to sponsor us …
 Improvements:
   Add security (authentication/authorization) to REST
    calls
   Unit testing every piece of code
   Build an on-line system that automatically loads data
    gathered from the Internet
Team references
 I

 Another guy

 Yeah, the last one

More Related Content

What's hot

Google Dremel. Concept and Implementations.
Google Dremel. Concept and Implementations.Google Dremel. Concept and Implementations.
Google Dremel. Concept and Implementations.Vicente Orjales
 
Introduction to Google BigQuery
Introduction to Google BigQueryIntroduction to Google BigQuery
Introduction to Google BigQueryCsaba Toth
 
Building a PII scrubbing layer
Building a PII scrubbing layerBuilding a PII scrubbing layer
Building a PII scrubbing layerTilak Patidar
 
"Einstürzenden Neudaten: Building an Analytics Engine from Scratch", Tobias J...
"Einstürzenden Neudaten: Building an Analytics Engine from Scratch", Tobias J..."Einstürzenden Neudaten: Building an Analytics Engine from Scratch", Tobias J...
"Einstürzenden Neudaten: Building an Analytics Engine from Scratch", Tobias J...Dataconomy Media
 
Druid Adoption Tips and Tricks
Druid Adoption Tips and TricksDruid Adoption Tips and Tricks
Druid Adoption Tips and TricksImply
 
Archmage, Pinterest’s Real-time Analytics Platform on Druid
Archmage, Pinterest’s Real-time Analytics Platform on DruidArchmage, Pinterest’s Real-time Analytics Platform on Druid
Archmage, Pinterest’s Real-time Analytics Platform on DruidImply
 
CS 542 Parallel DBs, NoSQL, MapReduce
CS 542 Parallel DBs, NoSQL, MapReduceCS 542 Parallel DBs, NoSQL, MapReduce
CS 542 Parallel DBs, NoSQL, MapReduceJ Singh
 
Big Data technology Landscape
Big Data technology LandscapeBig Data technology Landscape
Big Data technology LandscapeShivanandaVSeeri
 
Building Data Applications with Apache Druid
Building Data Applications with Apache DruidBuilding Data Applications with Apache Druid
Building Data Applications with Apache DruidImply
 
Data Pipelines & Integrating Real-time Web Services w/ Storm : Improving on t...
Data Pipelines & Integrating Real-time Web Services w/ Storm : Improving on t...Data Pipelines & Integrating Real-time Web Services w/ Storm : Improving on t...
Data Pipelines & Integrating Real-time Web Services w/ Storm : Improving on t...Brian O'Neill
 
Data Analytics and Processing at Snap - Druid Meetup LA - September 2018
Data Analytics and Processing at Snap - Druid Meetup LA - September 2018Data Analytics and Processing at Snap - Druid Meetup LA - September 2018
Data Analytics and Processing at Snap - Druid Meetup LA - September 2018Charles Allen
 
Apache Hadoop - Big Data Engineering
Apache Hadoop - Big Data EngineeringApache Hadoop - Big Data Engineering
Apache Hadoop - Big Data EngineeringBADR
 
Benchmarking Apache Druid
Benchmarking Apache Druid Benchmarking Apache Druid
Benchmarking Apache Druid Matt Sarrel
 
Sql vs NoSQL
Sql vs NoSQLSql vs NoSQL
Sql vs NoSQLRTigger
 
Cassandra Summit 2014: Fuzzy Entity Matching at Scale
Cassandra Summit 2014: Fuzzy Entity Matching at ScaleCassandra Summit 2014: Fuzzy Entity Matching at Scale
Cassandra Summit 2014: Fuzzy Entity Matching at ScaleDataStax Academy
 
A Developer’s View into Spark's Memory Model with Wenchen Fan
A Developer’s View into Spark's Memory Model with Wenchen FanA Developer’s View into Spark's Memory Model with Wenchen Fan
A Developer’s View into Spark's Memory Model with Wenchen FanDatabricks
 
Apache Druid Design and Future prospect
Apache Druid Design and Future prospectApache Druid Design and Future prospect
Apache Druid Design and Future prospectc-bslim
 
The design and implementation of modern column oriented databases
The design and implementation of modern column oriented databasesThe design and implementation of modern column oriented databases
The design and implementation of modern column oriented databasesTilak Patidar
 
Databases benoitg 2009-03-10
Databases benoitg 2009-03-10Databases benoitg 2009-03-10
Databases benoitg 2009-03-10benoitg
 

What's hot (19)

Google Dremel. Concept and Implementations.
Google Dremel. Concept and Implementations.Google Dremel. Concept and Implementations.
Google Dremel. Concept and Implementations.
 
Introduction to Google BigQuery
Introduction to Google BigQueryIntroduction to Google BigQuery
Introduction to Google BigQuery
 
Building a PII scrubbing layer
Building a PII scrubbing layerBuilding a PII scrubbing layer
Building a PII scrubbing layer
 
"Einstürzenden Neudaten: Building an Analytics Engine from Scratch", Tobias J...
"Einstürzenden Neudaten: Building an Analytics Engine from Scratch", Tobias J..."Einstürzenden Neudaten: Building an Analytics Engine from Scratch", Tobias J...
"Einstürzenden Neudaten: Building an Analytics Engine from Scratch", Tobias J...
 
Druid Adoption Tips and Tricks
Druid Adoption Tips and TricksDruid Adoption Tips and Tricks
Druid Adoption Tips and Tricks
 
Archmage, Pinterest’s Real-time Analytics Platform on Druid
Archmage, Pinterest’s Real-time Analytics Platform on DruidArchmage, Pinterest’s Real-time Analytics Platform on Druid
Archmage, Pinterest’s Real-time Analytics Platform on Druid
 
CS 542 Parallel DBs, NoSQL, MapReduce
CS 542 Parallel DBs, NoSQL, MapReduceCS 542 Parallel DBs, NoSQL, MapReduce
CS 542 Parallel DBs, NoSQL, MapReduce
 
Big Data technology Landscape
Big Data technology LandscapeBig Data technology Landscape
Big Data technology Landscape
 
Building Data Applications with Apache Druid
Building Data Applications with Apache DruidBuilding Data Applications with Apache Druid
Building Data Applications with Apache Druid
 
Data Pipelines & Integrating Real-time Web Services w/ Storm : Improving on t...
Data Pipelines & Integrating Real-time Web Services w/ Storm : Improving on t...Data Pipelines & Integrating Real-time Web Services w/ Storm : Improving on t...
Data Pipelines & Integrating Real-time Web Services w/ Storm : Improving on t...
 
Data Analytics and Processing at Snap - Druid Meetup LA - September 2018
Data Analytics and Processing at Snap - Druid Meetup LA - September 2018Data Analytics and Processing at Snap - Druid Meetup LA - September 2018
Data Analytics and Processing at Snap - Druid Meetup LA - September 2018
 
Apache Hadoop - Big Data Engineering
Apache Hadoop - Big Data EngineeringApache Hadoop - Big Data Engineering
Apache Hadoop - Big Data Engineering
 
Benchmarking Apache Druid
Benchmarking Apache Druid Benchmarking Apache Druid
Benchmarking Apache Druid
 
Sql vs NoSQL
Sql vs NoSQLSql vs NoSQL
Sql vs NoSQL
 
Cassandra Summit 2014: Fuzzy Entity Matching at Scale
Cassandra Summit 2014: Fuzzy Entity Matching at ScaleCassandra Summit 2014: Fuzzy Entity Matching at Scale
Cassandra Summit 2014: Fuzzy Entity Matching at Scale
 
A Developer’s View into Spark's Memory Model with Wenchen Fan
A Developer’s View into Spark's Memory Model with Wenchen FanA Developer’s View into Spark's Memory Model with Wenchen Fan
A Developer’s View into Spark's Memory Model with Wenchen Fan
 
Apache Druid Design and Future prospect
Apache Druid Design and Future prospectApache Druid Design and Future prospect
Apache Druid Design and Future prospect
 
The design and implementation of modern column oriented databases
The design and implementation of modern column oriented databasesThe design and implementation of modern column oriented databases
The design and implementation of modern column oriented databases
 
Databases benoitg 2009-03-10
Databases benoitg 2009-03-10Databases benoitg 2009-03-10
Databases benoitg 2009-03-10
 

Similar to Grandata

Quick dive into the big data pool without drowning - Demi Ben-Ari @ Panorays
Quick dive into the big data pool without drowning - Demi Ben-Ari @ PanoraysQuick dive into the big data pool without drowning - Demi Ben-Ari @ Panorays
Quick dive into the big data pool without drowning - Demi Ben-Ari @ PanoraysDemi Ben-Ari
 
Big Data
Big DataBig Data
Big DataNGDATA
 
Cloud and Bid data Dr.VK.pdf
Cloud and Bid data Dr.VK.pdfCloud and Bid data Dr.VK.pdf
Cloud and Bid data Dr.VK.pdfkalai75
 
Big Data made easy in the era of the Cloud - Demi Ben-Ari
Big Data made easy in the era of the Cloud - Demi Ben-AriBig Data made easy in the era of the Cloud - Demi Ben-Ari
Big Data made easy in the era of the Cloud - Demi Ben-AriDemi Ben-Ari
 
The Six pillars for Building big data analytics ecosystems
The Six pillars for Building big data analytics ecosystemsThe Six pillars for Building big data analytics ecosystems
The Six pillars for Building big data analytics ecosystemstaimur hafeez
 
How a Time Series Database Contributes to a Decentralized Cloud Object Storag...
How a Time Series Database Contributes to a Decentralized Cloud Object Storag...How a Time Series Database Contributes to a Decentralized Cloud Object Storag...
How a Time Series Database Contributes to a Decentralized Cloud Object Storag...InfluxData
 
2013 International Conference on Knowledge, Innovation and Enterprise Presen...
2013  International Conference on Knowledge, Innovation and Enterprise Presen...2013  International Conference on Knowledge, Innovation and Enterprise Presen...
2013 International Conference on Knowledge, Innovation and Enterprise Presen...oj08
 
Dori Exterman, Considerations for choosing the parallel computing strategy th...
Dori Exterman, Considerations for choosing the parallel computing strategy th...Dori Exterman, Considerations for choosing the parallel computing strategy th...
Dori Exterman, Considerations for choosing the parallel computing strategy th...Sergey Platonov
 
Data Engineer's Lunch #85: Designing a Modern Data Stack
Data Engineer's Lunch #85: Designing a Modern Data StackData Engineer's Lunch #85: Designing a Modern Data Stack
Data Engineer's Lunch #85: Designing a Modern Data StackAnant Corporation
 
One Size Doesn't Fit All: The New Database Revolution
One Size Doesn't Fit All: The New Database RevolutionOne Size Doesn't Fit All: The New Database Revolution
One Size Doesn't Fit All: The New Database Revolutionmark madsen
 
Big Data Session 1.pptx
Big Data Session 1.pptxBig Data Session 1.pptx
Big Data Session 1.pptxElsonPaul2
 
Big data, Cloud Computing and No SQL
Big data, Cloud Computing and No SQLBig data, Cloud Computing and No SQL
Big data, Cloud Computing and No SQLManu Cohen-Yashar
 
Cloud Computing & Big Data
Cloud Computing & Big DataCloud Computing & Big Data
Cloud Computing & Big DataMrinal Kumar
 
Big Data using NoSQL Technologies
Big Data using NoSQL TechnologiesBig Data using NoSQL Technologies
Big Data using NoSQL TechnologiesAmit Singh
 
Choosing technologies for a big data solution in the cloud
Choosing technologies for a big data solution in the cloudChoosing technologies for a big data solution in the cloud
Choosing technologies for a big data solution in the cloudJames Serra
 

Similar to Grandata (20)

Quick dive into the big data pool without drowning - Demi Ben-Ari @ Panorays
Quick dive into the big data pool without drowning - Demi Ben-Ari @ PanoraysQuick dive into the big data pool without drowning - Demi Ben-Ari @ Panorays
Quick dive into the big data pool without drowning - Demi Ben-Ari @ Panorays
 
Big Data
Big DataBig Data
Big Data
 
Big data business case
Big data   business caseBig data   business case
Big data business case
 
NoSQL Basics - a quick tour
NoSQL Basics - a quick tourNoSQL Basics - a quick tour
NoSQL Basics - a quick tour
 
Introduction Big data
Introduction Big data  Introduction Big data
Introduction Big data
 
Cloud and Bid data Dr.VK.pdf
Cloud and Bid data Dr.VK.pdfCloud and Bid data Dr.VK.pdf
Cloud and Bid data Dr.VK.pdf
 
Big Data made easy in the era of the Cloud - Demi Ben-Ari
Big Data made easy in the era of the Cloud - Demi Ben-AriBig Data made easy in the era of the Cloud - Demi Ben-Ari
Big Data made easy in the era of the Cloud - Demi Ben-Ari
 
TSE_Pres12.pptx
TSE_Pres12.pptxTSE_Pres12.pptx
TSE_Pres12.pptx
 
The Six pillars for Building big data analytics ecosystems
The Six pillars for Building big data analytics ecosystemsThe Six pillars for Building big data analytics ecosystems
The Six pillars for Building big data analytics ecosystems
 
How a Time Series Database Contributes to a Decentralized Cloud Object Storag...
How a Time Series Database Contributes to a Decentralized Cloud Object Storag...How a Time Series Database Contributes to a Decentralized Cloud Object Storag...
How a Time Series Database Contributes to a Decentralized Cloud Object Storag...
 
2013 International Conference on Knowledge, Innovation and Enterprise Presen...
2013  International Conference on Knowledge, Innovation and Enterprise Presen...2013  International Conference on Knowledge, Innovation and Enterprise Presen...
2013 International Conference on Knowledge, Innovation and Enterprise Presen...
 
Dori Exterman, Considerations for choosing the parallel computing strategy th...
Dori Exterman, Considerations for choosing the parallel computing strategy th...Dori Exterman, Considerations for choosing the parallel computing strategy th...
Dori Exterman, Considerations for choosing the parallel computing strategy th...
 
Data Engineer's Lunch #85: Designing a Modern Data Stack
Data Engineer's Lunch #85: Designing a Modern Data StackData Engineer's Lunch #85: Designing a Modern Data Stack
Data Engineer's Lunch #85: Designing a Modern Data Stack
 
One Size Doesn't Fit All: The New Database Revolution
One Size Doesn't Fit All: The New Database RevolutionOne Size Doesn't Fit All: The New Database Revolution
One Size Doesn't Fit All: The New Database Revolution
 
Big Data Session 1.pptx
Big Data Session 1.pptxBig Data Session 1.pptx
Big Data Session 1.pptx
 
Big data, Cloud Computing and No SQL
Big data, Cloud Computing and No SQLBig data, Cloud Computing and No SQL
Big data, Cloud Computing and No SQL
 
Cloud Computing & Big Data
Cloud Computing & Big DataCloud Computing & Big Data
Cloud Computing & Big Data
 
Big data analysis concepts and references
Big data analysis concepts and referencesBig data analysis concepts and references
Big data analysis concepts and references
 
Big Data using NoSQL Technologies
Big Data using NoSQL TechnologiesBig Data using NoSQL Technologies
Big Data using NoSQL Technologies
 
Choosing technologies for a big data solution in the cloud
Choosing technologies for a big data solution in the cloudChoosing technologies for a big data solution in the cloud
Choosing technologies for a big data solution in the cloud
 

More from Stefano Paluello

Real scenario: moving a legacy app to the Cloud
Real scenario: moving a legacy app to the CloudReal scenario: moving a legacy app to the Cloud
Real scenario: moving a legacy app to the CloudStefano Paluello
 
Using MongoDB with the .Net Framework
Using MongoDB with the .Net FrameworkUsing MongoDB with the .Net Framework
Using MongoDB with the .Net FrameworkStefano Paluello
 
TDD with Visual Studio 2010
TDD with Visual Studio 2010TDD with Visual Studio 2010
TDD with Visual Studio 2010Stefano Paluello
 
Teamwork and agile methodologies
Teamwork and agile methodologiesTeamwork and agile methodologies
Teamwork and agile methodologiesStefano Paluello
 

More from Stefano Paluello (9)

Clinical Data and AI
Clinical Data and AIClinical Data and AI
Clinical Data and AI
 
Real scenario: moving a legacy app to the Cloud
Real scenario: moving a legacy app to the CloudReal scenario: moving a legacy app to the Cloud
Real scenario: moving a legacy app to the Cloud
 
How to use asana
How to use asanaHow to use asana
How to use asana
 
Using MongoDB with the .Net Framework
Using MongoDB with the .Net FrameworkUsing MongoDB with the .Net Framework
Using MongoDB with the .Net Framework
 
Windows Azure Overview
Windows Azure OverviewWindows Azure Overview
Windows Azure Overview
 
TDD with Visual Studio 2010
TDD with Visual Studio 2010TDD with Visual Studio 2010
TDD with Visual Studio 2010
 
Asp.Net MVC Intro
Asp.Net MVC IntroAsp.Net MVC Intro
Asp.Net MVC Intro
 
Entity Framework 4
Entity Framework 4Entity Framework 4
Entity Framework 4
 
Teamwork and agile methodologies
Teamwork and agile methodologiesTeamwork and agile methodologies
Teamwork and agile methodologies
 

Grandata

  • 1.  GrandData InfoVis challenge “We are Big-data analysts. We will be a Legion. We do work hard. We do not forget scalability. Expect us in you datacenter” grandata.azurewebsites.net/
  • 2. Data we dealt with  Fetched from peerIndex  The top most influencer twitter users in UK  For each of them:  Popular topics  Influence graph (who influences? From whom has been influenced?)  Some statistics and data on his/her activity  His/Her twitter info  Data are unstructured (mainly text, different attributes)
  • 3. Approaching the problem  Our focus: make a scalable Infovis solution  If data grow, everything should scale to guarantee a fixed response time. At least we hope so   No bottlenecks nor single point of failure in the data processing flow  Data are unstructured. Schemaless DB!  Additionally: 24hrs aren’t enough to build a complete system. That’s only a fully-working proto
  • 4. Considerations  Problem: DB scalability and easy prototyping:  Solution: use a sharded database -> MongoLab  Problem: quick coldstart, reliability and easy management  Solution: cloud -> Windows Azure  Problem: algorithm scalability  Solution: MapReduce
  • 5. Vis  Moving data to the browser is not a big-data challenge:  Few pieces of data (compared to the stored)  Very effective graphics library publicly released  Support any (recent) browser
  • 6. Further considerations  Problem: move data to the browser  Solution: we use MongoLab -> REST calls  Problem: Simple frontend that can runs everywhere  Solution: stay simple -> HTML, CSS and javascript  Problem: surfing the UX must be appealing  Solution: powerful js graphics library -> d3js
  • 7. Algo complexity  Given N topics and K users, the complexity is O(K*N)  Since the big-data, in this case, are the users (N will be slow increasing during the time), the complexity can be approximated as O(K)  That’s linear! Great for a big-data task 
  • 8. Algo enhancement  Given all the scores of a person, a prediction of its (near) future trend is trivial. For each topic.  It’s possible to build a time-series prediction of what might be the next value of each score.  If data are partially missing, or a subsampling filtering has been applied, it’s still possible to predict the scores of a generic user.  Collaborative filtering based on user/score matrix.
  • 9. If anyone wants to sponsor us …  Improvements:  Add security (authentication/authorization) to REST calls  Unit testing every piece of code  Build an on-line system that automatically loads data gathered from the Internet
  • 10. Team references  I  Another guy  Yeah, the last one