SOCIAL DATA AND DBS
IVAN SANCHEZ
JULIO SALINAS
MARLENE ROBLES
CONTENTS
•Background
•Case Studies:
•Twitter: Real-time search (Earlybird)
•Facebook: Storage.
•LinkedIn: Storage (Voldemort)
•Conclusion
•References
BACKGROUND OSN
● Huge amounts of data, diverse and changing over time: likes,
shares, comments, logins, page views, search queries.
● New approaches are needed to manipulate it.
● Distributed databases, NoSQL.
● How to retrieve the data (search relevance,
recommendations, security against abusive behavior,
newsfeed features).
● Goal: massive scaling of demand: Unstructured, Semi-
● Facebook: 2.7M likes & comments/day
● Twitter: 500M tweets/day
● LinkedIn: 300+ M users (2 new/s), 200 group conversations/min
STORING AND QUERYING AT TWITTER
● Storage:
o MySQL used as a key-value store.
o FlockDB for the Twitter social graph.
● Desired queries:
o Trending topics
o Breaking news
o Sentiment
REAL TIME SEARCH AT TWITTER: EARLYBIRD
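As a rough illustration of what "real-time search" asks of an index, the sketch below makes a tweet searchable the moment it is added and returns the newest matches first. It is a toy in-memory inverted index, not Earlybird's actual implementation; the class name TinyRealtimeIndex and its methods are hypothetical, and whitespace tokenization is an assumption made for brevity.

```python
from collections import defaultdict


class TinyRealtimeIndex:
    """Toy inverted index (hypothetical): term -> tweet ids in arrival order."""

    def __init__(self):
        self.postings = defaultdict(list)
        self.next_id = 0

    def add(self, text):
        """Index a tweet immediately and return its id."""
        tweet_id = self.next_id
        self.next_id += 1
        # Each distinct term gets the new id appended to its posting list.
        for term in set(text.lower().split()):
            self.postings[term].append(tweet_id)
        return tweet_id

    def search(self, query, limit=10):
        """Return ids of tweets containing every query term, newest first."""
        terms = query.lower().split()
        if not terms:
            return []
        candidates = set(self.postings.get(terms[0], []))
        for term in terms[1:]:
            candidates &= set(self.postings.get(term, []))
        # Ids grow with arrival time, so descending id order means newest first.
        return sorted(candidates, reverse=True)[:limit]


index = TinyRealtimeIndex()
index.add("earthquake in the city")
index.add("breaking news earthquake downtown")
index.add("coffee time")
print(index.search("earthquake"))
```

Because ids are assigned in arrival order, sorting by id stands in for sorting by time; a production system would also have to handle concurrent readers and writers, which this sketch ignores.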
TAO AND THE FACEBOOK SOCIAL GRAPH
TAO
o Architecture and data model:
 - Objects: (id) → (otype, (key → value)∗)
 - Associations: (id1, atype, id2) → (time, (key → value)∗)
o MySQL as the storage layer.
o Main challenges:
 - Efficiency at scale.
 - Very fast response time.
 - High read availability.
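The object and association mappings above can be sketched in a few lines. This is a minimal in-memory illustration of the data model only, assuming plain dictionaries in place of the MySQL storage layer; the names TaoObject, Association, TinyGraphStore, and assoc_range are hypothetical and are not TAO's API.

```python
from dataclasses import dataclass, field


@dataclass
class TaoObject:
    """(id) -> (otype, (key -> value)*)"""
    id: int
    otype: str
    data: dict = field(default_factory=dict)


@dataclass
class Association:
    """(id1, atype, id2) -> (time, (key -> value)*)"""
    id1: int
    atype: str
    id2: int
    time: int
    data: dict = field(default_factory=dict)


class TinyGraphStore:
    """Hypothetical in-memory stand-in for the storage layer."""

    def __init__(self):
        self.objects = {}   # id -> TaoObject
        self.assocs = {}    # (id1, atype) -> {id2: Association}

    def put_object(self, obj):
        self.objects[obj.id] = obj

    def add_assoc(self, assoc):
        self.assocs.setdefault((assoc.id1, assoc.atype), {})[assoc.id2] = assoc

    def assoc_range(self, id1, atype, limit=10):
        """Return associations of one type leaving id1, newest first."""
        edges = self.assocs.get((id1, atype), {}).values()
        return sorted(edges, key=lambda a: a.time, reverse=True)[:limit]


store = TinyGraphStore()
store.put_object(TaoObject(1, "user", {"name": "Alice"}))
store.put_object(TaoObject(2, "post", {"text": "hello"}))
store.put_object(TaoObject(3, "post", {"text": "world"}))
store.add_assoc(Association(1, "authored", 2, time=100))
store.add_assoc(Association(1, "authored", 3, time=200))
newest_first = [a.id2 for a in store.assoc_range(1, "authored")]
print(newest_first)
```

Keying associations by (id1, atype) keeps the common read, "the most recent edges of one type from one object", to a single lookup plus a sort on the time field.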
Professional Social Network
Data-driven features:
● Recommendation system
(people you may know)
● People search (job
search, candidates)
● Who viewed your profile?
● Events you may be
STORAGE - VOLDEMORT
Highly available distributed key-value (KV) store
10 Voldemort clusters (100+ nodes), 9 of them backed by BDB
Layered design
All layers share a single interface:
-Put/Delete/Get
-Flexible
-Every layer decorates the next one
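The layered design can be pictured as a stack of decorators that all expose the same put/delete/get interface. The sketch below is only an illustration of that idea under stated assumptions: the class names (InMemoryStore, JsonSerializingStore, LoggingStore) are hypothetical and are not Voldemort's actual layers or API.

```python
import json


class InMemoryStore:
    """Innermost layer: keeps raw values in a dict."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def put(self, key, value):
        self._data[key] = value

    def delete(self, key):
        # True if the key existed and was removed.
        return self._data.pop(key, None) is not None


class JsonSerializingStore:
    """Decorates another store: serializes values to JSON text."""

    def __init__(self, inner):
        self._inner = inner

    def get(self, key):
        raw = self._inner.get(key)
        return None if raw is None else json.loads(raw)

    def put(self, key, value):
        self._inner.put(key, json.dumps(value))

    def delete(self, key):
        return self._inner.delete(key)


class LoggingStore:
    """Decorates another store: records every operation it forwards."""

    def __init__(self, inner, log):
        self._inner = inner
        self._log = log

    def get(self, key):
        self._log.append(f"get {key}")
        return self._inner.get(key)

    def put(self, key, value):
        self._log.append(f"put {key}")
        self._inner.put(key, value)

    def delete(self, key):
        self._log.append(f"delete {key}")
        return self._inner.delete(key)


log = []
store = LoggingStore(JsonSerializingStore(InMemoryStore()), log)
store.put("member:1", {"name": "Ana"})
profile = store.get("member:1")
removed = store.delete("member:1")
print(profile, removed, log)
```

Because every layer has the same three methods, layers can be added, removed, or reordered without the caller changing, which is what makes the design flexible.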
STORAGE - VOLDEMORT
Voldemort provides:
•High availability
•Low latency
•Distribution
It works like a distributed hash table (DHT).
Storage engine on each node:
•Compact index
•Data files
DISTRIBUTED HASHING ALGORITHM
This slide is from Roshan Sumbaly & Jay Kreps! (thanks Rosh & Jay)
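A generic consistent-hashing ring gives a feel for how a key can be mapped to a primary node and its replicas. The sketch below is a textbook version under stated assumptions, not necessarily Voldemort's exact partitioning scheme; the HashRing class, the preference_list method, the node names, and the use of MD5 are all illustrative choices.

```python
import bisect
import hashlib


class HashRing:
    """Generic consistent-hashing ring (illustrative, hypothetical names)."""

    def __init__(self, nodes, partitions_per_node=8):
        # Place several points per node on the ring to spread load evenly.
        self._ring = []
        for node in nodes:
            for i in range(partitions_per_node):
                self._ring.append((self._hash(f"{node}#{i}"), node))
        self._ring.sort()
        self._points = [point for point, _ in self._ring]

    @staticmethod
    def _hash(value):
        return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)

    def preference_list(self, key, replicas=2):
        """Walk clockwise from the key's position, collecting distinct nodes."""
        start = bisect.bisect_right(self._points, self._hash(key)) % len(self._ring)
        chosen = []
        for offset in range(len(self._ring)):
            node = self._ring[(start + offset) % len(self._ring)][1]
            if node not in chosen:
                chosen.append(node)
            if len(chosen) == replicas:
                break
        return chosen


ring = HashRing(["node-a", "node-b", "node-c"])
owners = ring.preference_list("member:42")
print(owners)
```

The first entry of the returned list acts as the primary for the key and the rest as replicas; adding or removing a node only moves the keys whose ring positions are adjacent to that node's points.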
SUMMARY
Earlybird
  Problem solved: real-time search
  Main advantages: fast indexing, concurrency management
TAO
  Problem solved: storing the Facebook social graph
  Main advantages: very fast response time, high read availability
Voldemort
  Problem solved: simple data partitioning to meet scalability needs
  Main advantages: highly scalable, seamless replication
CONCLUSION
• The choice of database system depends on the
needs of the application and the primary type of
information in the social network.
• Many OSNs have developed their own solutions to cope with
the ever-growing nature of big data and its challenges.
• In summary, the main features that data solutions
should have are:
• Storage of huge amounts of data.
• Fast reads and low latency.
• Processing of big data (meaningful results).
• Streaming and indexing are critical.
EXTREMELY DIFFICULT QUESTIONS
1.Why did LinkedIn need to build its own
solution, Voldemort?
2.How does TAO resolve the challenges it was built
for?
3.How does the real-time search service work at
Twitter?
REFERENCES
● Auradkar, A., Botev, C., Das, S., De Maagd, D., Feinberg, A., Ganti, P., … Zhang, J.
(2012). Data Infrastructure at LinkedIn. In Data Engineering (ICDE), 2012 IEEE 28th
International Conference on (pp. 1370–1381).
● N. Bronson, Z. Amsden, G. Cabrera, P. Chakka, P. Dimov, H. Ding, J. Ferris, A. Giardullo,
S. Kulkarni, and H. Li, “Tao: Facebook’s distributed data store for the social graph,” in
USENIX ATC, 2013.
● N. Ruflin, H. Burkhart, and S. Rizzotti, “Social-data storage-systems,” Databases Soc.
Networks - DBSocial ’11, pp. 7–12, 2011.
● A. Thusoo, Z. Shao, S. Anthony, D. Borthakur, N. Jain, J. Sen Sarma, R. Murthy, and H.
Liu, “Data warehousing and analytics infrastructure at facebook,” Proceedings of the
2010 ACM SIGMOD International Conference on Management of data. ACM,
Indianapolis, Indiana, USA, pp. 1013–1020, 2010.
● D. Beaver, S. Kumar, H. C. Li, J. Sobel, and P. Vajgel, “Finding a Needle in Haystack:
Facebook’s Photo Storage,” in OSDI, 2010, vol. 2010, pp. 47–60.
● M. Busch, K. Gade, B. Larson, P. Lok, S. Luckenbill, and J. Lin, “Earlybird: Real-Time
Search at Twitter,” Proceedings of the 2012 IEEE 28th International Conference on Data
Engineering. IEEE Computer Society, pp. 1360–1369, 2012.
REFERENCES (II)
● D. Borthakur, J. Gray, J. Sen Sarma, K. Muthukkaruppan, N. Spiegelberg, H. Kuang, K.
Ranganathan, D. Molkov, A. Menon, S. Rash, R. Schmidt, and A. Aiyer, “Apache hadoop goes
realtime at Facebook,” Proceedings of the 2011 ACM SIGMOD International Conference on
Management of data. ACM, Athens, Greece, pp. 1071–1080, 2011.
● C. Chen, F. Li, B. C. Ooi, and S. Wu, “TI: an efficient indexing mechanism for real-time search
on tweets,” Proceedings of the 2011 ACM SIGMOD International Conference on Management of
data. ACM, Athens, Greece, pp. 649–660, 2011.
● G. Mishne, J. Dalton, Z. Li, A. Sharma, and J. Lin, “Fast data in the era of big data: Twitter’s real-
time related query suggestion architecture,” Proceedings of the 2013 ACM SIGMOD
International Conference on Management of Data. ACM, New York, New York, USA, pp. 1147–
1158, 2013.
● S. Cohen and B. Kimelfeld, “A Social Network Database that Learns How to Answer Queries,”
2013.
LINKS
● https://www.usenix.org/conference/atc13/technical-sessions/presentation/bronson
● http://www-conf.slac.stanford.edu/xldb2012/talks/xldb2012_wed_1105_DhrubaBorthakur.pdf
● http://www.slideshare.net/linkedin/jay-kreps-on-project-voldemort-scaling-simple-storage-at-linkedin
● http://data.linkedin.com/
● http://www.infoq.com/presentations/Project-Voldemort-at-Gilt-Groupe
THE END

Post Quantum Cryptography – The Impact on Identity
 
Best Angular 17 Classroom & Online training - Naresh IT
Best Angular 17 Classroom & Online training - Naresh ITBest Angular 17 Classroom & Online training - Naresh IT
Best Angular 17 Classroom & Online training - Naresh IT
 
Real-time Tracking and Monitoring with Cargo Cloud Solutions.pptx
Real-time Tracking and Monitoring with Cargo Cloud Solutions.pptxReal-time Tracking and Monitoring with Cargo Cloud Solutions.pptx
Real-time Tracking and Monitoring with Cargo Cloud Solutions.pptx
 
OpenChain Education Work Group Monthly Meeting - 2024-04-10 - Full Recording
OpenChain Education Work Group Monthly Meeting - 2024-04-10 - Full RecordingOpenChain Education Work Group Monthly Meeting - 2024-04-10 - Full Recording
OpenChain Education Work Group Monthly Meeting - 2024-04-10 - Full Recording
 
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
 
SAM Training Session - How to use EXCEL ?
SAM Training Session - How to use EXCEL ?SAM Training Session - How to use EXCEL ?
SAM Training Session - How to use EXCEL ?
 
2024 DevNexus Patterns for Resiliency: Shuffle shards
2024 DevNexus Patterns for Resiliency: Shuffle shards2024 DevNexus Patterns for Resiliency: Shuffle shards
2024 DevNexus Patterns for Resiliency: Shuffle shards
 
Ronisha Informatics Private Limited Catalogue
Ronisha Informatics Private Limited CatalogueRonisha Informatics Private Limited Catalogue
Ronisha Informatics Private Limited Catalogue
 
SpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at RuntimeSpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at Runtime
 
Salesforce Implementation Services PPT By ABSYZ
Salesforce Implementation Services PPT By ABSYZSalesforce Implementation Services PPT By ABSYZ
Salesforce Implementation Services PPT By ABSYZ
 

Social databases - A brief overview

  • 1. SOCIAL DATA AND DBS IVAN SANCHEZ JULIO SALINAS MARLENE ROBLES
  • 2. CONTENTS •Background •Study Cases: •Twitter: Real time search-Earlybird •Facebook: Storage. •LinkedIn: Storage (Voldemort) •Conclusion •References
  • 3. BACKGROUND OSN ● Huge amount of data, diverse and changing over time. Likes, sharing, comments, logins, page-views, search queries. ● New approaches to manipulate it. ● Distributed Databases, NoSQL. ● How to retrieve the data (Search relevance, Recommendations, Security against abusive behavior, Newsfeed features) ● Goal: massive scaling of demand: Unstructured, Semi- ● Facebook: 2.7M likes & comments/day ● Twitter: 500M tweets/day ● LinkedIn: 300+ M users (2 new/s), 200 group conversations/min
  • 4. STORING AND QUERYING AT TWITTER ● Storage: o MySQL used as key-value store. o FlockDB to Twitter Social Graph. ● Desired queries: o TrendingTopics o Breaking news o Sentiment
  • 5. REAL TIME SEARCH AT TWITTER: EARLYBIRD
  • 6. TAO AND THE FACEBOOK SOCIAL GRAPH
  • 7. TAO o Architecture and Data Model: ▪ Objects: (id) → (otype, (key → value)∗) ▪ Associations: (id1, atype, id2) → (time, (key → value)∗) o MySQL as the Storage Layer. o Main challenges: ▪ Efficiency at scale. ▪ Very fast response time. ▪ High Read Availability.
  • 8. Professional Social Network Data Driven Features: ● Recommendation System (people you may know) ● People Search (Jobs search - candidates) ● Who viewed your profile? ● Events you may be
  • 9. STORAGE - VOLDEMORT Highly Available Distrib. KV Store 10 Voldemort Clusters (+100 nodes) - 9 of BDB Layered Design All layers – single interface: -Put/Delete/Get -Flexible -Every layer->decorates next one
  • 10. STORAGE - VOLDEMORT Voldemort provides: •High availability •Low latency •Distribution Like a Distrib. Hash Table (DHT). Storage Data engine on nodes: •Compact index •Data files
  • 11. DISTRIBUTED HASHING ALGORITHM This slide is from Roshan Sumbaly & Jay Kreps! (thanks Rosh & Jay)
  • 12. SUMMARY (System: Problem Solved; Main Advantages) ● EarlyBird: Real time search; Fast indexing, concurrency management ● TAO: Storing Facebook Social Graph; Very fast response time, high read availability ● Voldemort: Simple data partitioning to meet scalability needs; Highly scalable, seamless replication
  • 13. CONCLUSION • The selection of a database system depends on the needs of the application and the primary type of information in the social network. • Many OSNs have developed their own solutions to cope with the ever-growing nature of big data and its challenges. • In summary, the main features that data solutions should have are: • Storage of huge amounts of data. • Fast reads and low latency. • Processing of big data (meaningful results). • Streaming and indexing are critical.
  • 14. EXTREMELY DIFFICULT QUESTIONS 1. Why did LinkedIn need to build its own solution, Voldemort? 2. How does TAO resolve the challenges it was built for? 3. How does the real-time search service work at Twitter?
  • 15. REFERENCES ● Auradkar, A., Botev, C., Das, S., De Maagd, D., Feinberg, A., Ganti, P., … Zhang, J. (2012). Data Infrastructure at LinkedIn. In Data Engineering (ICDE), 2012 IEEE 28th International Conference on (pp. 1370–1381). ● N. Bronson, Z. Amsden, G. Cabrera, P. Chakka, P. Dimov, H. Ding, J. Ferris, A. Giardullo, S. Kulkarni, and H. Li, “Tao: Facebook’s distributed data store for the social graph,” in USENIX ATC, 2013. ● N. Ruflin, H. Burkhart, and S. Rizzotti, “Social-data storage-systems,” Databases Soc. Networks - DBSocial ’11, pp. 7–12, 2011. ● A. Thusoo, Z. Shao, S. Anthony, D. Borthakur, N. Jain, J. Sen Sarma, R. Murthy, and H. Liu, “Data warehousing and analytics infrastructure at facebook,” Proceedings of the 2010 ACM SIGMOD International Conference on Management of data. ACM, Indianapolis, Indiana, USA, pp. 1013–1020, 2010. ● D. Beaver, S. Kumar, H. C. Li, J. Sobel, and P. Vajgel, “Finding a Needle in Haystack: Facebook’s Photo Storage,” in OSDI, 2010, vol. 2010, pp. 47–60. ● M. Busch, K. Gade, B. Larson, P. Lok, S. Luckenbill, and J. Lin, “Earlybird: Real-Time Search at Twitter,” Proceedings of the 2012 IEEE 28th International Conference on Data Engineering. IEEE Computer Society, pp. 1360–1369, 2012.
  • 16. REFERENCES (II) ● D. Borthakur, J. Gray, J. Sen Sarma, K. Muthukkaruppan, N. Spiegelberg, H. Kuang, K. Ranganathan, D. Molkov, A. Menon, S. Rash, R. Schmidt, and A. Aiyer, “Apache hadoop goes realtime at Facebook,” Proceedings of the 2011 ACM SIGMOD International Conference on Management of data. ACM, Athens, Greece, pp. 1071–1080, 2011. ● C. Chen, F. Li, B. C. Ooi, and S. Wu, “TI: an efficient indexing mechanism for real-time search on tweets,” Proceedings of the 2011 ACM SIGMOD International Conference on Management of data. ACM, Athens, Greece, pp. 649–660, 2011. ● G. Mishne, J. Dalton, Z. Li, A. Sharma, and J. Lin, “Fast data in the era of big data: Twitter’s real-time related query suggestion architecture,” Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data. ACM, New York, New York, USA, pp. 1147–1158, 2013. ● S. Cohen and B. Kimelfeld, “A Social Network Database that Learns How to Answer Queries,” 2013.
  • 17. LINKS ● https://www.usenix.org/conference/atc13/technical-sessions/presentation/bronson ● http://www-conf.slac.stanford.edu/xldb2012/talks/xldb2012_wed_1105_DhrubaBorthakur.pdf ● http://www.slideshare.net/linkedin/jay-kreps-on-project-voldemort-scaling-simple-storage-at-linkedin ● http://data.linkedin.com/ ● http://www.infoq.com/presentations/Project-Voldemort-at-Gilt-Groupe
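
The TAO data model on slide 7 (objects keyed by id, associations keyed by (id1, atype, id2) and carrying a time) can be illustrated with a small in-memory sketch. This is an illustration under assumptions only: the class `TinyGraphStore` and its method names are hypothetical and are not TAO's API, and nothing here models the distributed caching or the MySQL storage layer.

```python
class TinyGraphStore:
    """Hypothetical in-memory illustration of the objects/associations model."""

    def __init__(self):
        self.objects = {}  # id -> (otype, {key: value})
        self.assocs = {}   # (id1, atype) -> {id2: (time, {key: value})}

    def add_object(self, oid, otype, **data):
        self.objects[oid] = (otype, data)

    def add_assoc(self, id1, atype, id2, time, **data):
        # At most one association of a given type between two ids.
        self.assocs.setdefault((id1, atype), {})[id2] = (time, data)

    def assoc_range(self, id1, atype, limit=10):
        """Return up to `limit` target ids, newest association first."""
        edges = self.assocs.get((id1, atype), {})
        newest_first = sorted(edges.items(), key=lambda kv: kv[1][0], reverse=True)
        return [id2 for id2, _ in newest_first[:limit]]


store = TinyGraphStore()
store.add_object(1, "user", name="alice")
store.add_object(2, "post", text="hello")
store.add_object(3, "post", text="world")
store.add_assoc(1, "authored", 2, time=100)
store.add_assoc(1, "authored", 3, time=200)
print(store.assoc_range(1, "authored"))
```

Grouping associations by (id1, atype) is a design choice of this sketch; it makes "most recent edges of one type from one node" a single dictionary lookup followed by a sort.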

Editor's notes

  1. Web applications and online social networks (OSNs) produce a huge amount of real-time content. This includes social data generated by users of, for example, Facebook, Twitter, and LinkedIn. The content grows so quickly that new approaches are needed to manipulate it, because the data becomes difficult to capture and impractical to store in conventional database systems (e.g., Facebook and Twitter: roughly 7 TB per day, or 2+ PB per year). Social data is also diverse and changes over time, which makes its structure a challenge for storage. The NoSQL community has created different types of storage systems for recording it. Event data includes (1) user activity events corresponding to logins, page views, clicks, “likes”, sharing, comments, and search queries; and (2) operational metrics. This data now requires online consumption for (1) search relevance; (2) recommendations, which may be driven by item popularity or co-occurrence in the activity stream; (3) security applications that protect against abusive behaviours such as spam or unauthorized data scraping; (4) newsfeed features that aggregate user status updates or actions for their “friends” or “connections” to read; and (5) real-time dashboards of various service metrics. How to retrieve the data: search relevance, recommendations (e.g., popularity-driven), security against abusive behavior (spam), newsfeed features. Graph: nodes (people) and edges (interactions/relationships).
  2. Storage: a column store could be a good option for social data because it scales almost as well as a key-value store and can also hold semi-structured data with simple indices, which allows simple queries. Twitter uses MySQL as a key-value store. Hadoop: distributed file system (automatic replication, fault tolerance); MapReduce, a parallel computation model based on key-value pairs; powerful (sorts data very fast); open source; scalable. A user queries a social network in pursuit of a desired outcome. People search Twitter to find temporally relevant information (e.g., breaking news, real-time content, and popular trends) and information related to people (e.g., content directed at the searcher, information about people of interest, and general sentiment and opinion). Twitter queries tend to be short and popular. (Tweets are short and frequent, and do not change after being posted.) Some systems incorporate abstract predicates relevant to social networks as primitive building blocks in the query language and use machine learning as an integral part of the query processor to select and improve upon the predicate implementations. For example, Twitter allows users to search by words, people, places, and tweet properties. Examples of social network query languages: SoQL and SociQL, based on SQL; SNQL; etc.
  3. Earlybird is the retrieval engine at the core of Twitter’s real-time search service and is specifically designed to handle tweets. It maintains an inverted index, manages concurrency, and uses a ranking function that combines relevance signals and the user’s local social graph to compute a personalized relevance score for each tweet. Ingested tweets first fill up a segment before proceeding to the next one. At any given time, at most one index segment is actively being modified; the remaining segments are read-only. Once an index segment ceases to accept new tweets, it is converted from a write-friendly structure into an optimized read-only structure. The highest-ranking, most recent tweets are returned to the Blender, which merges and re-ranks the results before returning them to the user.
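
The segment behaviour described in note 3 (one writable segment at a time, older segments read-only, newest results first) can be sketched roughly as follows. This is a toy illustration, not Earlybird's implementation: the class names, the tiny segment capacity, and the whitespace tokenizer are assumptions, and the ranking function and concurrency control are not modelled at all.

```python
from collections import defaultdict


class Segment:
    """One index segment: writable until full, then sealed as read-only."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.postings = defaultdict(list)  # term -> tweet ids in arrival order
        self.count = 0
        self.read_only = False

    def add(self, tweet_id, text):
        if self.read_only:
            raise RuntimeError("segment is sealed")
        for term in set(text.lower().split()):
            self.postings[term].append(tweet_id)
        self.count += 1
        if self.count >= self.capacity:
            self.seal()

    def seal(self):
        # Freezing lists into tuples stands in for the optimized read-only form.
        self.postings = {term: tuple(ids) for term, ids in self.postings.items()}
        self.read_only = True

    def search(self, term):
        # Newest tweets first within the segment.
        return list(reversed(self.postings.get(term.lower(), ())))


class SegmentedIndex:
    def __init__(self, segment_capacity=2):
        self.segment_capacity = segment_capacity
        self.segments = [Segment(segment_capacity)]

    def add(self, tweet_id, text):
        # Only the last segment is ever writable.
        if self.segments[-1].read_only:
            self.segments.append(Segment(self.segment_capacity))
        self.segments[-1].add(tweet_id, text)

    def search(self, term, limit=10):
        hits = []
        for segment in reversed(self.segments):  # newest segment first
            hits.extend(segment.search(term))
            if len(hits) >= limit:
                break
        return hits[:limit]


index = SegmentedIndex(segment_capacity=2)
index.add(1, "breaking news now")
index.add(2, "news at noon")
index.add(3, "more news")
print(index.search("news"))
```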
  4. LinkedIn has a number of data-driven features, including People You May Know, Jobs You May Be Interested In, and LinkedIn Skills. Building these features involves two phases: offline computation and online serving. As an early adopter of Hadoop, LinkedIn was able to scale the offline computation phase successfully. The difficult part has been bulk loading the output of this computation phase into the online serving system without causing performance degradation. The algorithms are computed in batch (MapReduce on Hadoop). To make all this offline data available to the live site, LinkedIn developed a multi-terabyte-scale data pipeline from Hadoop to its online serving layer, Project Voldemort. How are these massive outputs served to 300 million members? RDBMS: Oracle. Espresso: document-oriented data store with hierarchical indexing. Kafka: high-volume, low-latency messaging system for collecting and delivering event data. It uses a messaging API to support real-time and offline consumption, follows a publisher/consumer scheme, and handles 10 B message writes per day. It addresses real-time log processing by moving around large amounts of data in a robust and scalable manner. Streaming: Databus (DB stream replication) and Kafka (pub/sub for user activity and logs). Databus: timeline-consistent change data capture. Kafka: provides feeds of data to applications and other data systems and enables near-real-time processing of data (newsfeed and other asynchronous transport of updates to subscriber systems). Kafka alone is not sufficient, as it lacks a processing engine and the ability to persist data over long spans.
  5. Simple data partitioning to meet scalability needs. Primary goals: high performance and availability. The DB supports only the most minimal schema; schemas are written in JSON. Speed and availability -> distributed key-value system. Bulk loading massive data sets -> offload index construction to the processing system. Port: 666. Storage is secondary; Voldemort is really about distributing and recovering data across a set of nodes. The main cluster serves 60% reads and 40% writes. Techniques: decentralized (no master); data partitioned and replicated via consistent hashing; multi-node reads and writes for redundancy; versioning for consistency, using vector clocks. Non-blocking optimistic locking: a vector clock is a list of (nodeID, counter) tuples. Each object has a vector clock, which is updated on each write and examined on each read. Pluggable persistence (BDB, MySQL, Hadoop (read-only), MongoDB). LinkedIn currently houses roughly ten clusters, spanning more than a hundred nodes and holding several hundred stores (database tables). Berkeley DB (BDB) is a software library that provides a high-performance embedded database for key/value data. Data is cached in Oracle storage. Voldemort is a distributed key-value storage system for high-capacity storage. Data is stored under a key and is partitioned and replicated among multiple servers. In case of failure, conflicts are resolved using versioning. Voldemort is not a relational DB; it is a “big, distributed, persistent, fault-tolerant hash table”. The API involves only three operations: Get, Put, and Delete. Keys and values can be complex objects. The client allows pluggable interfaces. Stack: Conflict resolution: multiple reads and writes lead to multiple versions; this layer hides most of the difficulty of getting the latest version. Serialization. Network: you can choose client side or server side. The client connects to the cluster and gets metadata describing how the nodes and the hashing are set up.
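
The vector-clock versioning mentioned in note 5 (a set of (nodeID, counter) entries, updated on each write and examined on each read) can be sketched as below. This is a generic textbook-style illustration, not Voldemort's code: the function names are hypothetical and a clock is represented as a plain dict from node id to counter.

```python
def increment(clock, node_id):
    """Return a copy of `clock` with the counter for `node_id` bumped by one."""
    updated = dict(clock)
    updated[node_id] = updated.get(node_id, 0) + 1
    return updated


def compare(a, b):
    """Return 'before', 'after', 'equal', or 'concurrent' for clocks a and b."""
    nodes = set(a) | set(b)
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"  # neither version supersedes the other: a conflict


def merge(a, b):
    """Entry-wise maximum, used once a conflict has been resolved."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in set(a) | set(b)}


v1 = increment({}, "A")   # first write, handled by node A
v2 = increment(v1, "A")   # a later write through node A
v3 = increment(v1, "B")   # a write through node B that only saw v1
print(compare(v1, v2), compare(v2, v3), merge(v2, v3))
```

A read that sees two "concurrent" versions cannot pick a winner from the clocks alone; in this sketch the caller would reconcile the values and store the result under the merged clock.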
  6. Data is replicated automatically over multiple servers. Storage is pluggable on disk, using BerkeleyDB or MySQL. Voldemort can be used with Hadoop: the data is written to HDFS by several MapReduce jobs on Hadoop and is then pulled by Voldemort. A store is like a DB table; each store is split into partitions, and these partitions into chunk sets. One reducer = one chunk set. Chunk set = index file + data file: the storage engine is made up of “chunk sets”, multiple pairs of index and data files. The index file is a compact structure containing a hash of the key followed by the offset to the corresponding value in the data file. Node independence: each node is independent of other nodes, with no central point of failure or coordination. Versioning and conflict resolution: vector clocks (which support optimistic locking in the client). Voldemort is similar to shared-nothing architectures in distributed systems, which can grow indefinitely. Oracle is also used. Voldemort has been designed for frequent transient and short-term failures. Each store maps to a single cluster, with the store partitioned over all nodes in the cluster. Lookups are O(1).
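
The chunk-set layout in note 6 (an index file holding a hash of each key followed by the offset of its value in a data file) can be sketched with in-memory byte strings. The exact on-disk format is not given in these notes, so everything concrete here is an assumption: the 8-byte MD5 prefix, the 4-byte offsets and length prefixes, the sorted index, and the binary search used for lookups are illustrative choices, and hash collisions are ignored.

```python
import hashlib
import struct

ENTRY = struct.Struct(">8sI")  # assumed layout: 8-byte key hash + 4-byte offset
LENGTH = struct.Struct(">I")   # assumed layout: 4-byte length before each value


def key_hash(key: bytes) -> bytes:
    return hashlib.md5(key).digest()[:8]


def build_chunk(items):
    """Build (index_bytes, data_bytes) from a dict mapping bytes keys to bytes values."""
    data = bytearray()
    entries = []
    for key, value in items.items():
        entries.append((key_hash(key), len(data)))  # offset where the value starts
        data += LENGTH.pack(len(value)) + value
    entries.sort()  # sorted by hash so the index can be binary-searched
    index = b"".join(ENTRY.pack(h, offset) for h, offset in entries)
    return index, bytes(data)


def lookup(index, data, key):
    """Binary-search the index for the key's hash, then read the value at its offset."""
    target = key_hash(key)
    lo, hi = 0, len(index) // ENTRY.size
    while lo < hi:
        mid = (lo + hi) // 2
        h, offset = ENTRY.unpack_from(index, mid * ENTRY.size)
        if h == target:
            (length,) = LENGTH.unpack_from(data, offset)
            start = offset + LENGTH.size
            return data[start:start + length]
        if h < target:
            lo = mid + 1
        else:
            hi = mid
    return None


index, data = build_chunk({b"alice": b"engineer", b"bob": b"designer"})
print(lookup(index, data, b"alice"))
```

Because the index is built once, offline, and never modified, it can stay compact and immutable, which fits the bulk-load pipeline from Hadoop described in the notes.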
  7. Voldemort uses consistent hashing to distribute load evenly across the servers. A key hashes to a point on a fixed circular space. This avoids the problem with linear hashing, where removing a bucket means all your keys have to move. The hash space is a 32-bit number with fixed boundaries. Buckets can be resized, which makes rebalancing very efficient. If a server fails, its keys are distributed equally among the other servers in the cluster. The ring is divided into Q partitions, with Q larger than the number of nodes S. Each node is assigned Q/S partitions, so it appears multiple times on the ring. Keys are held by the primary node as well as its K unique successors in the clockwise direction. Advantage: predictable performance compared to SQL queries in a relational DB. The Get operation returns all values associated with a key, organized into lists.
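
The ring described in note 7 (Q partitions spread over S nodes, with each key held by a primary node and its K unique clockwise successors) can be sketched as follows. This is a simplified illustration under assumptions: the round-robin partition assignment, the MD5-based 32-bit hash, and the function names are not taken from Voldemort.

```python
import hashlib


def build_ring(num_partitions, nodes):
    """Assign Q partitions round-robin so each of the S nodes owns about Q/S of them."""
    return [nodes[p % len(nodes)] for p in range(num_partitions)]


def partition_for(key, num_partitions):
    """Hash the key into a 32-bit space, then map it to a partition on the ring."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions


def preference_list(key, ring, k):
    """Return the primary node plus up to k distinct successors, walking clockwise."""
    start = partition_for(key, len(ring))
    chosen = []
    for step in range(len(ring)):
        node = ring[(start + step) % len(ring)]
        if node not in chosen:  # a node owns several partitions; count it once
            chosen.append(node)
        if len(chosen) == k + 1:
            break
    return chosen


ring = build_ring(12, ["n0", "n1", "n2", "n3"])
print(preference_list("member:42", ring, k=2))
```

Because the number of partitions is fixed, adding or removing a node only changes which node owns some partitions; the key-to-partition mapping itself never changes.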
  8. We need to agree on this