Data volumes have increased so significantly that we need to carefully consider how we interact with, share, and analyze data to avoid bottlenecks. In contexts such as eScience and scientific computing, a large emphasis is placed on collaboration, resulting in many well-known challenges in ensuring that data is in the right place at the right time and accessible by the right users. Yet these simple requirements create substantial challenges for the distribution, analysis, storage, and replication of potentially "large" datasets. Additional complexity is added through constraints such as budget, data locality, usage, and available local storage. In this paper, we propose a "socially driven" approach to address some of the challenges within (academic) research contexts by defining a Social Data Cloud and underpinning Content Delivery Network: a Social CDN (S-CDN). Our approach leverages digitally encoded social constructs via social network platforms that we use to represent (virtual) research communities. Ultimately, the S-CDN builds upon the intrinsic incentives of members of a given scientific community to address their data challenges collaboratively and in proven trusted settings. We define the design and architecture of a S-CDN and investigate its feasibility via a coauthorship case study as first steps to illustrate its usefulness.
A Social Content Delivery Network for Scientific Cooperation: Vision, Design, and Architecture
1. A Social Content Delivery Network for Scientific Cooperation: Vision, Design, and Architecture
Kyle Chard, Simon Caton, Omer Rana, Daniel S. Katz
www.ci.anl.gov
www.ci.uchicago.edu
2. Introduction
• Collaboration is increasingly data intensive
• To avoid research bottlenecks we need data...
– At the right place, at the right time, with appropriate
access permissions
• Challenges
– Distribution, storage, replication, budget, security, performance, locality, reliability, availability, ...
• Current approaches to data distribution/sharing?
Social CDN -- DataCloud 2012
3. Data (Content) Distribution
• Other domains use CDNs
– E.g. web objects, downloads, streaming media, social networks
• But, scientific data is often
– BigData
– Long tail
– Private
– Geographically distributed
• Commercial CDNs infeasible
and unaffordable for scientific
data.
4. Social Content Delivery Network (S-CDN)
• Utilizes the resources of
community members
– Low cost, distributed
infrastructure
• Social network identifies locations to distribute and store subsets of data
• Algorithms to partition and distribute data based on relationships with others
• Built upon the concept of a Social (Data) Cloud
[Figure: S-CDN layers: Social Layer, Resource Layer, Content Delivery Layer]
5. Trust
• Types of trust for a S-CDN
1. Infrastructure trust via appropriate security and
authentication mechanisms as well as policies
2. Inter-personal trust as an enabler of social
collaboration.
– “a positive expectation or assumption on future outcomes that
results from proven contextualized personal interaction-
histories”
• In the context of a S-CDN
– Leverage trust to select interaction partners
– Develop “trust models” to aid CDN management
algorithms
6. Motivating Use Case – Medical Imaging (1)
7. Motivating Use Case – Challenges
• Data Privacy
– Storage and transfer
– Regulations (HIPAA)
– Research IP
– Trust
• Data Access
– Many researchers
– Geographically distributed
– Different institutions
• Big Data?
– Multiple centers
– Multiple subjects
– Multiple scans
– Multiple analyses/reconstructions
8. Motivating Use Case – S-CDN
• Trustworthiness: Relationships encoded within a
real world social/collaboration network and
previous scientific interactions or institutional
affiliations
• Data availability: Access to those who are
permitted to view (and need) data when required
• Reduced barriers: Collaborative infrastructure and
potential to aggregate other middleware such as
authentication, job submission, data staging
• Access and data placement: Algorithms that
leverage properties of the social graph
9. Architecture
[Figure: architecture diagram; labels: Trust relationship, Trusted third party]
• Storage Servers
– CDN edge nodes on which research datasets (or fragments thereof) reside
– Shared folder used for CDN and local storage
– Client to manage and transfer datasets
• Social Middleware
– Adds a layer of abstraction
between users and the S-CDN
– Provides authentication and
authorization
• Allocation Servers
– Centralized catalogs for global
datasets
– Maintain a list of current replicas
and place, move, update, and
maintain replicas
• Implementation?
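The slide leaves the implementation open, so the following is only a minimal sketch of the bookkeeping an allocation server might do, assuming an in-memory catalog. The class name `AllocationCatalog`, its methods, and the dataset and node names are all hypothetical and do not come from the authors' design.

```python
from dataclasses import dataclass, field


@dataclass
class AllocationCatalog:
    """Toy in-memory catalog: dataset ID -> set of nodes holding a replica."""
    replicas: dict = field(default_factory=dict)

    def place(self, dataset, node):
        # Record that `node` now stores a replica of `dataset`.
        self.replicas.setdefault(dataset, set()).add(node)

    def remove(self, dataset, node):
        # Forget a replica; a no-op if it was never recorded.
        self.replicas.get(dataset, set()).discard(node)

    def move(self, dataset, src, dst):
        # Relocate a replica, refusing if the source does not hold one.
        if src not in self.replicas.get(dataset, set()):
            raise KeyError(f"{src} holds no replica of {dataset}")
        self.remove(dataset, src)
        self.place(dataset, dst)

    def locate(self, dataset, authorized_nodes):
        # Return only the replica holders the requester may use
        # (authorization itself is assumed to come from the social middleware).
        holders = self.replicas.get(dataset, set())
        return sorted(holders & set(authorized_nodes))


# Hypothetical usage with invented names.
catalog = AllocationCatalog()
catalog.place("scan-01", "alice-node")
catalog.place("scan-01", "bob-node")
catalog.move("scan-01", "bob-node", "carol-node")
visible = catalog.locate("scan-01", ["carol-node", "dave-node"])
```

A real allocation server would also need persistence, replica updates, and failure handling, none of which are modelled here.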
10. Preliminary Investigation
• Explore data availability using a S-CDN
– Based on researcher relationships in a collaboration
• How can we extract a representation of
scientific (data) collaboration?
– Extrapolate collaborative research from the
publication history of a scientist
• Analysis
– Extract communities with different levels of trust
– Investigate simple CDN placement using social
algorithms
11. Community Graphs
               Baseline   Double Coauthorship   Number of Authors
Authors        2335       811                   604
Publications   1163       881                   435
Edges          17973      5123                  1988
• Baseline: DBLP publications, 3 Degrees, 2009-2010
• Double Coauthorship: At least 2 publications
• Number of Authors: < 6 authors per publication
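The three graph variants above can be illustrated with a small sketch. It assumes each publication is given as a list of author names, reads "Double Coauthorship" as keeping an edge only when the two authors share at least 2 publications, and reads "Number of Authors" as dropping publications with 6 or more authors. The function names and toy data are invented for illustration; the DBLP extraction and the 3-degree crawl are not shown.

```python
from collections import Counter
from itertools import combinations


def coauthor_edge_counts(publications, max_authors=None):
    """Count joint publications per author pair.

    publications: iterable of author-name lists, one per paper.
    max_authors: if set, skip papers with this many authors or more.
    """
    counts = Counter()
    for authors in publications:
        unique = sorted(set(authors))
        if max_authors is not None and len(unique) >= max_authors:
            continue
        for a, b in combinations(unique, 2):
            counts[(a, b)] += 1
    return counts


def build_graph(edge_counts, min_joint_papers=1):
    """Adjacency sets keeping only pairs with enough joint papers."""
    graph = {}
    for (a, b), n in edge_counts.items():
        if n >= min_joint_papers:
            graph.setdefault(a, set()).add(b)
            graph.setdefault(b, set()).add(a)
    return graph


# Toy data, invented for illustration only (not taken from DBLP).
papers = [
    ["ann", "bob", "cat"],
    ["ann", "bob"],
    ["bob", "dan"],
    ["ann", "bob", "cat", "dan", "eve", "fay"],  # 6 authors
]

baseline = build_graph(coauthor_edge_counts(papers))
double_coauthorship = build_graph(coauthor_edge_counts(papers),
                                  min_joint_papers=2)
few_authors = build_graph(coauthor_edge_counts(papers, max_authors=6))
```

Both filters shrink the graph in the same direction as the table: authors who appear only on the large paper drop out of the filtered graphs.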
12. Replica Selection
• Random
– Avg Hops: 2.23
• Node Degree
– Highest number of edges
– Avg Hops: 1.54
• Community Node Degree
– Highest degree within a community
(i.e. no adjacent placement)
– Avg Hops: 1.38
• Clustering Coefficient
– Highest likelihood that an author’s
coauthors are also connected
– Avg Hops: 2.62
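The four strategies above can be sketched on an undirected graph stored as adjacency sets. This is a hedged illustration, not the authors' code: "Community Node Degree" is approximated by a greedy degree ranking that skips nodes adjacent to an already chosen replica (no community detection is performed), and average hops is assumed to be the mean distance to the nearest replica over all reachable nodes, replica holders included. All function names and the toy graph are hypothetical.

```python
import random
from collections import deque


def from_edges(edges):
    """Build adjacency sets from an undirected edge list."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def average_hops(graph, replicas):
    """Mean hops to the nearest replica, via multi-source BFS."""
    dist = {r: 0 for r in replicas}
    queue = deque(replicas)
    while queue:
        node = queue.popleft()
        for nbr in graph[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return sum(dist.values()) / len(dist)


def clustering_coefficient(graph, node):
    """Fraction of a node's neighbour pairs that are themselves connected."""
    nbrs = list(graph[node])
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for i in range(k) for j in range(i + 1, k)
                if nbrs[j] in graph[nbrs[i]])
    return 2.0 * links / (k * (k - 1))


def select_by_score(graph, k, score):
    # Ties are broken by node name so the result is deterministic.
    return sorted(graph, key=lambda n: (-score(n), n))[:k]


def select_random(graph, k, seed=0):
    return random.Random(seed).sample(sorted(graph), k)


def select_node_degree(graph, k):
    return select_by_score(graph, k, lambda n: len(graph[n]))


def select_clustering(graph, k):
    return select_by_score(graph, k,
                           lambda n: clustering_coefficient(graph, n))


def select_community_degree(graph, k):
    """Greedy degree ranking that skips nodes adjacent to a chosen replica."""
    chosen = []
    for node in select_node_degree(graph, len(graph)):
        if len(chosen) == k:
            break
        if not any(node in graph[c] for c in chosen):
            chosen.append(node)
    return chosen


# Toy graph, invented for illustration.
toy = from_edges([("a", "b"), ("a", "c"), ("a", "d"), ("b", "e"),
                  ("b", "f"), ("e", "g"), ("g", "h"), ("g", "i")])
by_degree = select_node_degree(toy, 2)
by_community = select_community_degree(toy, 2)
```

On this toy graph the two highest-degree nodes are neighbours, so the non-adjacent variant spreads the replicas further apart and lowers the average hop count. That is only an illustration of the idea and is not a reproduction of the reported figures.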
13. Results
[Figure: three panels (Baseline, Double Coauthorship, Number of Authors) plotting Replica Hit Rate (%) against Number of Replicas (1 to 10) for the Random, Node Degree, Community Node Degree, and Clustering Coefficient strategies]
14. Target users of a Social CDN
1. Large collaborative project with multiple
distributed participants
2. Participants are able to provide some resources to
the project
3. Good overall connectivity between participants
4. Different data set requirements for members of
the collaboration
5. Availability of data sets that can be co-hosted by
other participants
6. Varying sized data sets – not all of which may be
able to fit in one place.
15. Summary
• Data management across collaborations is difficult
– Right place, right time, accessible to the right people
– Complicated by size, security, availability, distance …
• Social CDN
– Builds upon the proven CDN model from other domains
– Relies on user contributed edge nodes
– Social overlay to incorporate trust and social replica selection
• Future work
– Analysis and formalization of trust as an enabler of collaboration
o Further investigation into mechanisms to extract trustworthiness from
scientific networks.
– Simulation of a wider range of attributes, such as data access
algorithms, different research networks, and indicators of trust.
– Proof of concept implementation
16. Thanks
• Questions?
Resources are idle 40-95%
1,000,000,000 Users
On average 190 friends
Users contribute to “good” causes
• Kyle Chard: kyle@ci.uchicago.edu
• http://www.facebook.com/SocialCloudComputing