Grandata

Stefano Paluello


GrandData
InfoVis challenge

“We are Big-data analysts. We
will be a Legion. We do work
hard. We do not forget
scalability.
Expect us in your datacenter”

grandata.azurewebsites.net/

Data we dealt with
 Fetched from PeerIndex

 The most influential Twitter users in the UK

 For each of them:
 Popular topics
 Influence graph (whom the user influences, and who influences the user)
 Some statistics and data on the user’s activity
 The user’s Twitter info

 Data are unstructured (mainly text, with varying attributes)

Approaching the problem
 Our focus: build a scalable InfoVis solution
 If the data grow, everything should scale to keep the response time fixed. At least we hope so :)
 No bottlenecks and no single point of failure in the data processing flow
 Data are unstructured, so we use a schemaless DB

 Additionally: 24 hours aren’t enough to build a complete system. This is only a fully working prototype

Considerations
 Problem: DB scalability and easy prototyping
 Solution: use a sharded database -> MongoLab

 Problem: quick cold start, reliability and easy management
 Solution: cloud -> Windows Azure

 Problem: algorithm scalability
 Solution: MapReduce

Vis
 Moving data to the browser is not a big-data challenge:
 Few pieces of data (compared to what is stored)
 A very effective graphics library is publicly available

 Support any (recent) browser

Further considerations
 Problem: moving data to the browser
 Solution: we use MongoLab -> REST calls

 Problem: a simple frontend that can run everywhere
 Solution: stay simple -> HTML, CSS and JavaScript

 Problem: the UX must be appealing to browse
 Solution: powerful JS graphics library -> D3.js

Algo complexity
 Given N topics and K users, the complexity is O(K*N)
 Since the big data, in this case, are the users (N grows only slowly over time and can be treated as roughly constant), the complexity can be approximated as O(K)
 That’s linear! Great for a big-data task :)
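
The slides do not spell out the algorithm itself, so the following is only a minimal sketch of where the O(K*N) bound comes from, assuming a MapReduce-style job that averages each topic's score across users. The record layout (a "topics" dictionary per user) and the function names map_user, reduce_topic and topic_averages are hypothetical and not taken from the project.

```python
from collections import defaultdict


def map_user(user):
    """Map step: emit one (topic, score) pair for each topic of one user."""
    for topic, score in user["topics"].items():
        yield topic, score


def reduce_topic(scores):
    """Reduce step: combine all scores emitted for a single topic."""
    return sum(scores) / len(scores)


def topic_averages(users):
    """Run map and reduce in-process and count the emitted pairs.

    With K users and at most N topics per user, the map phase emits
    at most K*N pairs, which is the O(K*N) bound from the slide.
    """
    grouped = defaultdict(list)
    emitted = 0
    for user in users:                        # K iterations
        for topic, score in map_user(user):   # at most N pairs per user
            grouped[topic].append(score)
            emitted += 1
    averages = {topic: reduce_topic(s) for topic, s in grouped.items()}
    return averages, emitted


# Hypothetical sample data: K = 2 users, N = 2 topics
users = [
    {"name": "a", "topics": {"tech": 10, "music": 20}},
    {"name": "b", "topics": {"tech": 30}},
]
averages, emitted = topic_averages(users)
print(averages, emitted)
```

Each user is processed independently in the map step, so in a real MapReduce deployment the users could be spread across workers, and adding users adds work linearly as long as N stays small.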

Algo enhancement
 Given all the scores of a person, predicting their (near) future trend is trivial, for each topic.
 It’s possible to build a time-series prediction of what the next value of each score might be.

 If data are partially missing, or subsampling has been applied, it’s still possible to predict the scores of a generic user.
 Collaborative filtering based on the user/score matrix.
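
The two ideas above can be sketched as follows. This is an illustration under assumptions, not the team's implementation: the slides name neither a forecasting model nor a similarity measure, so a least-squares linear trend and cosine-weighted user-based collaborative filtering are used here as stand-ins. The function names and the sample scores are hypothetical.

```python
import math


def forecast_next(history):
    """Extrapolate a least-squares linear trend one step ahead."""
    n = len(history)
    if n == 0:
        raise ValueError("history must not be empty")
    if n == 1:
        return float(history[0])
    mean_x = (n - 1) / 2
    mean_y = sum(history) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(history))
    slope = sxy / sxx
    # The next observation sits at x = n.
    return mean_y + slope * (n - mean_x)


def cosine(a, b):
    """Cosine similarity between two sparse {topic: score} rows."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[t] * b[t] for t in shared)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def predict_score(target, others, topic):
    """Fill a missing score with a similarity-weighted average.

    Only users who actually have a score for the topic contribute.
    Returns None when no similar user has one.
    """
    weighted = 0.0
    total = 0.0
    for other in others:
        if topic not in other:
            continue
        weight = cosine(target, other)
        weighted += weight * other[topic]
        total += weight
    return weighted / total if total > 0 else None


# Hypothetical usage with made-up scores
print(forecast_next([1, 2, 3]))
print(predict_score({"tech": 10}, [{"tech": 10, "music": 40}], "music"))
```

The forecast runs per user and per topic, so it adds one pass over each score history. The collaborative-filtering step compares the target against every other user, so on large data it would need sampling or a neighbour index to stay scalable.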

If anyone wants to sponsor us …
 Improvements:
 Add security (authentication/authorization) to the REST calls
 Unit test every piece of code
 Build an online system that automatically loads data gathered from the Internet

Team references
 I

 Another guy

 Yeah, the last one
