1.
GrandData
InfoVis challenge
“We are Big-data analysts. We
will be a Legion. We do work
hard. We do not forget
scalability.
Expect us in your datacenter”
grandata.azurewebsites.net/
2. Data we dealt with
Fetched from PeerIndex
The most influential Twitter users in the UK
For each of them:
Popular topics
Influence graph (whom does the user influence? Who
influences the user?)
Some statistics and data on their activity
Their Twitter info
Data are unstructured (mainly text, different attributes)
3. Approaching the problem
Our focus: make a scalable InfoVis solution
If data grow, everything should scale to guarantee a
fixed response time. At least we hope so
No bottlenecks or single points of failure in the data
processing flow
Data are unstructured. Schemaless DB!
Additionally: 24 hours aren’t enough to build a
complete system. This is only a fully working prototype
4. Considerations
Problem: DB scalability and easy prototyping
Solution: use a sharded database -> MongoLab
Problem: quick cold start, reliability and easy
management
Solution: cloud -> Windows Azure
Problem: algorithm scalability
Solution: MapReduce
5. Vis
Moving data to the browser is not a big-data
challenge:
Few pieces of data (compared to what is stored)
Very effective graphics libraries are publicly available
They support any (recent) browser
6. Further considerations
Problem: move data to the browser
Solution: we use MongoLab -> REST calls
Problem: a simple frontend that can run everywhere
Solution: stay simple -> HTML, CSS and JavaScript
Problem: the UX must be appealing
Solution: powerful JavaScript graphics library -> d3js
7. Algo complexity
Given N topics and K users, the complexity is
O(K*N)
Since the big data, in this case, are the users (N will
grow slowly over time), N can be treated as a constant
and the complexity approximated as O(K)
That’s linear! Great for a big-data task
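The O(K*N) bound can be illustrated with a minimal MapReduce-style sketch in Python. The slides do not say what the job actually computes, so the per-topic average, the record layout of `users`, and the names `map_user`, `reduce_topic` and `run_job` are assumptions made only for illustration. The map phase visits each of the K users once and emits at most N (topic, score) pairs, which is where the K*N total comes from.

```python
from collections import defaultdict


def map_user(user):
    """Map phase: emit one (topic, score) pair per topic of a single user."""
    # Hypothetical record layout: {"name": str, "scores": {topic: float}}
    for topic, score in user["scores"].items():
        yield topic, score


def reduce_topic(topic, scores):
    """Reduce phase: aggregate all scores emitted for one topic (here, the mean)."""
    return topic, sum(scores) / len(scores)


def run_job(users):
    """Run map, shuffle and reduce in one process; a cluster would distribute them."""
    grouped = defaultdict(list)
    emitted = 0
    for user in users:                        # K iterations
        for topic, score in map_user(user):   # at most N pairs per user
            grouped[topic].append(score)      # shuffle: group values by key
            emitted += 1
    result = dict(reduce_topic(t, s) for t, s in grouped.items())
    return result, emitted


# Made-up sample data, not taken from the PeerIndex dataset.
users = [
    {"name": "a", "scores": {"music": 2.0, "sport": 4.0}},
    {"name": "b", "scores": {"music": 4.0, "sport": 6.0}},
    {"name": "c", "scores": {"music": 6.0, "sport": 8.0}},
]
averages, emitted = run_job(users)
```

With K = 3 users and N = 2 topics the map phase emits 6 pairs. Because each user is mapped independently, adding users adds map tasks rather than lengthening any single one.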
8. Algo enhancement
Given all the scores of a person, predicting the
(near) future trend of each topic’s score is trivial.
It’s possible to build a time-series prediction of what
might be the next value of each score.
If data are partially missing, or subsampling has
been applied, it’s still possible to
predict the scores of a generic user:
collaborative filtering based on the user/score matrix.
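Both ideas can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the method used in the prototype: the forecast is a least-squares linear trend extrapolated one step ahead, and the collaborative filter is a user-based average weighted by cosine similarity. The function names, the matrix layout (rows are users, columns are topics, `None` marks a missing score) and the sample numbers are all hypothetical.

```python
import math


def forecast_next(series):
    """Predict the next value of a score series from its least-squares linear trend."""
    n = len(series)
    if n == 0:
        raise ValueError("series must contain at least one value")
    if n == 1:
        return series[0]
    mean_x = (n - 1) / 2
    mean_y = sum(series) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(series))
    var = sum((x - mean_x) ** 2 for x in range(n))
    slope = cov / var
    # Evaluate the fitted line at the next time step, x = n.
    return mean_y + slope * (n - mean_x)


def cosine(u, v):
    """Cosine similarity over the topics for which both users have a score."""
    pairs = [(a, b) for a, b in zip(u, v) if a is not None and b is not None]
    dot = sum(a * b for a, b in pairs)
    norm_u = math.sqrt(sum(a * a for a, _ in pairs))
    norm_v = math.sqrt(sum(b * b for _, b in pairs))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def predict_score(matrix, user, topic):
    """Estimate a missing score as the similarity-weighted mean of other users' scores."""
    weighted, total = 0.0, 0.0
    for other, row in enumerate(matrix):
        if other == user or row[topic] is None:
            continue
        sim = cosine(matrix[user], row)
        weighted += sim * row[topic]
        total += abs(sim)
    return weighted / total if total else None


# Made-up user/score matrix: rows are users, columns are topics.
matrix = [
    [1.0, 2.0, None],   # user 0 has no score for topic 2
    [1.0, 2.0, 3.0],
    [2.0, 4.0, 6.0],
]
next_value = forecast_next([1.0, 2.0, 3.0, 4.0])
missing = predict_score(matrix, user=0, topic=2)
```

Here the forecast continues the unit-slope trend, and the missing score is filled in from the two users whose known scores point in the same direction as user 0's.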
9. If anyone wants to sponsor us …
Improvements:
Add security (authentication/authorization) to REST
calls
Unit-test every piece of code
Build an on-line system that automatically loads data
gathered from the Internet