The database landscape is evolving as new, scalable data stores emerge. Key-value stores, wide-column stores, and document-oriented databases offer an attractive alternative to the traditional relational database. By discarding the assumptions on which earlier databases were built, this new class of non-relational, or "NoSQL," solutions gains the ability to scale horizontally. NoSQL solutions also offer interesting alternatives to the traditional relational data model.
This talk gives attendees the key concepts needed to understand and evaluate NoSQL data stores. We will explore the fundamental differences between the various classes of NoSQL solutions, concluding with an in-depth look at MongoDB, a document-oriented database.
This talk covers:
Origins of the NoSQL movement
An overview of the NoSQL landscape
The philosophy and creation of MongoDB
MongoDB system architecture
MongoDB usage examples
2. • 1970s: Relational databases appear
– Storage is expensive
– Data is normalized
– Storage is abstracted away from the application
3. • 1970s: Relational databases appear
– Storage is expensive
– Data is normalized
– Storage is abstracted away from the application
• 1980s: Commercial RDBMS versions appear
– Client/server model
– SQL emerges as the standard
4. • 1970s: Relational databases appear
– Storage is expensive
– Data is normalized
– Storage is abstracted away from the application
• 1980s: Commercial RDBMS versions appear
– Client/server model
– SQL emerges as the standard
• 1990s: Things start to change
– Client/server => 3-tier architecture
– The internet and the web appear
5. • 2000s: Web 2.0
– Social media appears
– E-commerce gains acceptance
– Hardware prices keep falling
– Massive increase in collected data
6. • 2000s: Web 2.0
– Social media appears
– E-commerce gains acceptance
– Hardware prices keep falling
– Massive increase in collected data
• The result
– A continuous requirement to scale dramatically
– How do we scale?
7. (Diagram: two relational deployments, "BI / reporting" and "OLTP / operational".) The OLTP / operational side:
+ complex transactions
+ tabular data
+ ad hoc queries
- object-relational (O<->R) mapping is difficult
- speed and scalability problems
- not very agile
8. BI / reporting:
+ ad hoc queries
+ SQL as a standard protocol between clients and servers
+ scales horizontally better than operational databases
- some scalability limits
- rigid schemas
- not real-time, but works well with bulk loads in the early-morning hours
OLTP / operational:
+ complex transactions
+ tabular data
+ ad hoc queries
- O<->R mapping is difficult
- speed and scalability problems
- not very agile
9. BI / reporting (fewer problems here):
+ ad hoc queries
+ SQL as a standard protocol between clients and servers
+ scales horizontally better than operational databases
- some scalability limits
- rigid schemas
- not real-time, but works well with bulk loads in the early-morning hours
OLTP / operational:
+ complex transactions
+ tabular data
+ ad hoc queries
- O<->R mapping is difficult
- speed and scalability problems
- not very agile
10. BI / reporting (fewer problems here):
+ ad hoc queries
+ SQL as a standard protocol between clients and servers
+ scales horizontally better than operational databases
- some scalability limits
- rigid schemas
- not real-time, but works well with bulk loads in the early-morning hours
OLTP / operational (more problems here):
+ complex transactions
+ tabular data
+ ad hoc queries
- O<->R mapping is difficult
- speed and scalability problems
- not very agile
11. BI / reporting (fewer problems here):
+ ad hoc queries
+ SQL as a standard protocol between clients and servers
+ scales horizontally better than operational databases
- some scalability limits
- rigid schemas
- not real-time, but works well with bulk loads in the early-morning hours
OLTP / operational (more problems here):
+ complex transactions
+ tabular data
+ ad hoc queries
- O<->R mapping is difficult
- speed and scalability problems
- not very agile
Work-arounds on the operational side: caching, partitioning, application-level flat files, map/reduce
12. • Agile development methodology
• Short development cycles
• Constantly evolving requirements
• Design flexibility
13. • Agile development methodology
• Short development cycles
• Constantly evolving requirements
• Design flexibility
• Relational schema
• Hard to evolve
• Slow, difficult migrations
• Keeping in sync with the application
• Few developers interact directly with the database
16. • Horizontal scalability
• More real-time results
• Faster development
• Flexible data model
• Low initial cost
• Low operating cost
18. (Diagram: scalable, non-relational ("NoSQL") stores alongside BI / reporting and OLTP / operational.) The NoSQL stores:
+ speed and scalability
+ map well to the OO model
+ agile
- limited ad hoc queries
- not very transactional
- no SQL / no standards
19. The next generation of non-relational databases
A collection of very different products
• Different (non-relational) data models
• Most do not use SQL for queries
• Do not require a predefined schema
• Some allow flexible data structures
22. Relational:
• ACID
• Two-phase commit
Non-relational (Key-Value, Document, XML, Graph, Column):
• BASE
• Atomic transactions at the document level
23. Relational:
• ACID
• Two-phase commit
• Joins
Non-relational (Key-Value, Document, XML, Graph, Column):
• BASE
• Atomic transactions at the document level
• No joins
27. • Designed and developed by the founders of DoubleClick, ShopWiki, GILT Groupe, etc.
• Coding began in late 2007
• First production site: March 2008, businessinsider.com
• Open source – AGPL, written in C++
• Version 0.8 – first official release, February 2009
• Version 1.0 – August 2009
• Version 2.0 – September 2011
• Version 2.2 – August 2012
30. • Document-oriented
• Based on JSON-style documents
• Flexible schema
• Scalable architecture
• Auto-sharding
• Replication and high availability
• Key features
• Secondary indexes
• Query language (ad hoc queries)
• Map/Reduce (aggregation)
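The Map/Reduce aggregation mechanism mentioned above can be sketched in miniature: "emit" a key/value pair per document, then reduce all values emitted under the same key. MongoDB runs user-supplied map and reduce functions server-side over a collection; this toy in-memory version (with made-up sample data) only shows the shape of the computation.

```javascript
// Minimal in-memory sketch of the map/reduce idea.
const posts = [
  { author: "joe", votes: 3 },
  { author: "joe", votes: 1 },
  { author: "sam", votes: 5 },
];

function mapReduce(docs, map, reduce) {
  const groups = new Map();
  const emit = (key, value) => {
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(value);
  };
  for (const doc of docs) map(doc, emit);   // map phase: emit pairs
  const out = {};
  for (const [key, values] of groups) out[key] = reduce(key, values);
  return out;                                // reduce phase: fold per key
}

// Total votes per author, a classic aggregation example.
const totals = mapReduce(
  posts,
  (doc, emit) => emit(doc.author, doc.votes),
  (key, values) => values.reduce((a, b) => a + b, 0)
);
console.log(totals); // { joe: 4, sam: 5 }
```

Because the map step emits independent pairs, the work distributes naturally across shards, which is why this model pairs well with auto-sharding.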
31. • Powerful, flexible data model
• Transparent mapping of application (OO) objects to JSON documents
• Flexibility for dynamic data
• Better data locality
33. {
  _id : ObjectId("4e2e3f92268cdda473b628f6"),
  title : "Too Big to Fail",
  when : Date("2011-07-26"),
  author : "joe",
  text : "blah"
}
34. {
  _id : ObjectId("4e2e3f92268cdda473b628f6"),
  title : "Too Big to Fail",
  when : Date("2011-07-26"),
  author : "joe",
  text : "blah",
  tags : ["business", "news", "north america"]
}
> db.posts.find( { tags : "news" } )
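The query above matches this post even though "news" is only one element of the tags array: for array fields, MongoDB treats { field : value } as "the array contains value". A tiny matcher sketching that rule (illustrative only, not the server's real matching code):

```javascript
// Sketch of MongoDB's array-matching rule: an equality predicate
// on an array field matches if any element equals the value.
function matches(doc, field, value) {
  const stored = doc[field];
  if (Array.isArray(stored)) return stored.includes(value);
  return stored === value;
}

const post = {
  title: "Too Big to Fail",
  tags: ["business", "news", "north america"],
};
console.log(matches(post, "tags", "news"));  // true
console.log(matches(post, "title", "news")); // false
```

The same rule is what makes a single index on tags a "multikey" index: one index entry per array element.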
35. {
  _id : ObjectId("4e2e3f92268cdda473b628f6"),
  title : "Too Big to Fail",
  when : Date("2011-07-26"),
  author : "joe",
  text : "blah",
  tags : ["business", "news", "north america"],
  votes : 3,
  voters : ["dmerr", "sj", "jane" ]
}
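The votes counter and the voters array can be kept consistent because MongoDB applies all operators in a single update to one document atomically. In the shell, such an update would look roughly like `db.posts.update({ _id : ... }, { $inc : { votes : 1 }, $push : { voters : "ana" } })` (the voter name is invented). Below, a toy in-memory version of those two operators:

```javascript
// Toy versions of the $inc and $push update operators. On a real
// server both take effect atomically on one document; here we just
// apply them in order to a plain object.
function applyUpdate(doc, update) {
  for (const [field, n] of Object.entries(update.$inc || {})) {
    doc[field] = (doc[field] || 0) + n;          // increment counter
  }
  for (const [field, value] of Object.entries(update.$push || {})) {
    (doc[field] = doc[field] || []).push(value); // append to array
  }
  return doc;
}

const post = { votes: 3, voters: ["dmerr", "sj", "jane"] };
applyUpdate(post, { $inc: { votes: 1 }, $push: { voters: "ana" } });
console.log(post.votes, post.voters.length); // 4 4
```

Single-document atomicity is the NoSQL trade-off named earlier: no cross-document transactions, but counters and their backing lists never drift apart.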
36. {
  _id : ObjectId("4e2e3f92268cdda473b628f6"),
  title : "Too Big to Fail",
  when : Date("2011-07-26"),
  author : "joe",
  text : "blah",
  tags : ["business", "news", "north america"],
  votes : 3,
  voters : ["dmerr", "sj", "jane" ],
  comments : [
    { by : "tim157", text : "great story" },
    { by : "gora", text : "i don't think so" },
    { by : "dmerr", text : "also check out..." }
  ]
}
37. {
  _id : ObjectId("4e2e3f92268cdda473b628f6"),
  title : "Too Big to Fail",
  when : Date("2011-07-26"),
  author : "joe",
  text : "blah",
  tags : ["business", "news", "north america"],
  votes : 3,
  voters : ["dmerr", "sj", "jane" ],
  comments : [
    { by : "tim157", text : "great story" },
    { by : "gora", text : "i don't think so" },
    { by : "dmerr", text : "also check out..." }
  ]
}
> db.posts.find( { "comments.by" : "gora" } )
> db.posts.ensureIndex( { "comments.by" : 1 } )
38. (Diagram: normalized layout – Post, Comment, and Author stored as separate documents. Seek = 5+ ms; read = very fast.)
39. (Diagram: embedded layout – a single Post document containing the Author and all the Comments.)
40. • Secondary indexes
• Dynamic queries
• Sorting of results (sort)
• Powerful operations: update, upsert
• Aggregation functions
• Viable as primary storage
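The upsert operation listed above means "update the matching document, or insert it if nothing matches" – in the shell, roughly `db.counters.update({ page : "/home" }, { $inc : { hits : 1 } }, { upsert : true })` (collection and field names invented for illustration). A toy in-memory version over an array of documents:

```javascript
// Toy upsert: update the matching document, or insert a new one.
function upsert(collection, query, incField) {
  const doc = collection.find(d =>
    Object.entries(query).every(([k, v]) => d[k] === v));
  if (doc) {
    doc[incField] = (doc[incField] || 0) + 1;     // update path
  } else {
    collection.push({ ...query, [incField]: 1 }); // insert path
  }
}

const counters = [];
upsert(counters, { page: "/home" }, "hits");
upsert(counters, { page: "/home" }, "hits");
upsert(counters, { page: "/about" }, "hits");
console.log(counters);
// [ { page: '/home', hits: 2 }, { page: '/about', hits: 1 } ]
```

Upserts remove the read-then-write round trip an application would otherwise need, which matters once writes are the bottleneck.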
41. • Linear scalability
• High availability
• Add capacity without taking the application out of service
• Transparent to the application
42. Replica sets
• High availability / automatic failover
• Data redundancy
• Disaster recovery
• Transparent to the application
• Maintenance without taking the application out of service
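The automatic-failover idea can be sketched conceptually: when the primary becomes unreachable, the remaining members promote a healthy secondary. Real replica sets run an election protocol with votes and priorities; this toy version (invented data, no real API) just promotes the first healthy secondary to show the state transition.

```javascript
// Conceptual failover sketch: demote an unhealthy primary and
// promote the first healthy secondary. Illustrative only.
function failover(members) {
  const primary = members.find(m => m.role === "primary");
  if (primary && primary.healthy) return members; // nothing to do
  if (primary) primary.role = "secondary";        // demote
  const candidate = members.find(m => m.role === "secondary" && m.healthy);
  if (candidate) candidate.role = "primary";      // promote
  return members;
}

const set = [
  { host: "a", role: "primary", healthy: false }, // just crashed
  { host: "b", role: "secondary", healthy: true },
  { host: "c", role: "secondary", healthy: true },
];
failover(set);
console.log(set.find(m => m.role === "primary").host); // "b"
```

The "transparent to the application" bullet holds because drivers track the member list and retarget writes at whichever member is currently primary.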
49. • Add capacity without taking the application out of service
• Transparent to the application
50. • Add capacity without taking the application out of service
• Transparent to the application
• Partitions based on value ranges
• Automatic partitioning and balancing
51. (Diagram: write scalability – a single mongod holding key range 0..100.)
52. (Diagram: write scalability – two mongod shards, key ranges 0..50 and 51..100.)
53. (Diagram: write scalability – four mongod shards, key ranges 0..25, 26..50, 51..75, and 76..100.)
54. (Diagram: the four shards, key ranges 0..25, 26..50, 51..75, and 76..100, each a replica set with one primary and two secondaries.)
55. (Diagram: the application talks to a MongoS router, which routes to the four sharded replica sets.)
56. (Diagram: the application talks to several MongoS routers in front of the four sharded replica sets.)
57. (Diagram: full deployment – the application, several MongoS routers, three Config servers, and the four sharded replica sets.)
58. • Few configuration options
• The default configuration works well
• Easy to install and administer
62. • Content management
• Operational intelligence
• E-commerce
• High-volume data processing
• User data management
63. Wordnik uses MongoDB as the foundation for its "live" dictionary that stores its entire text corpus – 3.5T of data in 20 billion records
Problem:
• Analyze a staggering amount of data for a system built on a continuous stream of high-quality text pulled from online sources
• Adding too much data too quickly resulted in outages; tables locked for tens of seconds during inserts
• Initially launched entirely on MySQL but quickly hit performance roadblocks
Why MongoDB:
• Migrated 5 billion records in a single day with zero downtime
• MongoDB powers every website request: 20m API calls per day
• Ability to eliminate the memcached layer, creating a simplified system that required fewer resources and was less prone to error
Impact:
• Reduced code by 75% compared to MySQL
• Fetch time cut from 400ms to 60ms
• Sustained insert speed of 8k words per second, with frequent bursts of up to 50k per second
• Significant cost savings and 15% reduction in servers
"Life with MongoDB has been good for Wordnik. Our code is faster, more flexible and dramatically smaller. Since we don't spend time worrying about the database, we can spend more time writing code for our application." – Tony Tam, Vice President of Engineering and Technical Co-founder
64. Intuit relies on a MongoDB-powered real-time analytics tool for small businesses to derive interesting and actionable patterns from their customers' website traffic
Problem:
• Intuit hosts more than 500,000 websites
• Wanted to collect and analyze data to recommend conversion and lead-generation improvements to customers
• With 10 years' worth of user data, it took several days to process the information using a relational database
Why MongoDB:
• MongoDB's querying and Map/Reduce functionality could serve as a simpler, higher-performance solution than a complex Hadoop implementation
• The strength of the MongoDB community
Impact:
• In one week Intuit was able to become proficient in MongoDB development
• Developed application features more quickly for MongoDB than for relational databases
• MongoDB was 2.5 times faster than MySQL
"We did a prototype for one week, and within one week we had made big progress. Very big progress. It was so amazing that we decided, 'Let's go with this.'" – Nirmala Ranganathan, Intuit
65. Shutterfly uses MongoDB to safeguard more than six billion images for millions of customers in the form of photos and videos, and turn everyday pictures into keepsakes
Problem:
• Managing 20TB of data (six billion images for millions of customers), partitioning by function
• A home-grown key-value store on top of their Oracle database offered sub-par performance
• The codebase for this hybrid store became hard to manage
• High licensing and hardware costs
Why MongoDB:
• JSON-based data structure
• Provided Shutterfly with an agile, high-performance, scalable solution at a low cost
• Works seamlessly with Shutterfly's services-based architecture
Impact:
• 500% cost reduction and 900% performance improvement compared to the previous Oracle implementation
• Accelerated time-to-market for nearly a dozen projects on MongoDB
• Improved performance by reducing average insert latency from 400ms to 2ms
The "really killer reason" for using MongoDB is its rich JSON-based data structure, which offers Shutterfly an agile approach to develop software. With MongoDB, the Shutterfly team can quickly develop and deploy new applications, especially Web 2.0 and social features. – Kenny Gorman, Director of Data Services
67. An open-source, high-performance database
Editor's notes
Intro – a dozen years in the RDBMS space: OLTP (transactional systems), OLAP (data warehousing), etc. Now NoSQL – at 10gen, the MongoDB company (developed MongoDB and provides support, consulting, and training for MongoDB).
No redundant data, complex ERDs, OLTP vs OLAP. 1980s: indexing invented/published, becomes standard. Late 1990s – learned how to scale the web via load balancers, etc., but one DB behind it all. Static content distributed and "hosted" near the edge, but what about writes?
Social media: writes start to catch up to the reads. Also mobile/PDAs, and large data projects like genome, space, clickstream analysis, logs. BI comes of age. Traditional RDBMS cannot handle the volume: cache data in-memory, manual sharding/partitioning, replication.
Developers interact with an ORM layer and lose visibility of the relational data model – unintuitive, and many are unfamiliar with the performance implications.
Developers started coming up with work-arounds: denormalize data, avoid joins and long-running transactions, custom caches (require more RAM!), application-level partitioning, distributed caches (lose the ability to do joins and long-running transactions).
Today's apps need: low and predictable response times, scalability "on demand" (high peak times, cloud deployment), HA for reads AND writes, multi-data-center distribution.
Maybe no SQL, or maybe not only SQL.
Trade off some of the less critical components for speed, scale, and ease of use. Some have no ad hoc queries; some have only limited ways of reading or updating the data.
NoSQL = non-relational, next-generation operational data stores and databases. Own query language or simple fetch-by-key; no SQL as the query language. Does not give ACID guarantees (transactions limited to a single item). Distributed, fault-tolerant architecture.
Eventual consistency (but with a strong ability to distribute, and high availability). No joins + light transactional semantics = horizontally scalable architectures.
K/V stores: PNUTS (Yahoo), Dynamo (Amazon), Voldemort (originally LinkedIn). BigTable – Google (column store); Cassandra (Facebook), similar to BigTable; Riak (Basho); Membase; Neo4j; CouchDB/Couchbase.
How much data do you have? Reads? Writes? Types of queries? How important is it not to ever lose data? (Too bad – all systems can lose data.) How easy is it to maintain? Pages in the middle of the night? EASE OF USE – does it represent your data intuitively?
Founders – the founders of DoubleClick, ShopWiki, GILT Groupe, Business Insider kept running into the same problems over and over. Originally working on a platform for the cloud (like Google Apps): an application stack that would scale out easily.
Maintain the richness and depth of functionality of an RDBMS, combined with the performance and scalability of in-memory key-value stores.
Documents: richer than flat structures
Let's look at a simple data model for blog posts, with authors, comments, tags, and votes.
author/post/comments/tags
JSON document – contains key-value pairs of different types; values can also be arrays and other documents.
Because of the way MongoDB lets you update documents atomically, we can be sure totals and the list of voters will stay in sync.
Comments is an array of JSON documents; we can query by fields inside embedded documents as well as by array members.
Secondary indexes, compound indexes, multikey indexes. Why is it important to have all of the document together?
Seeks take long; reads less so. This is why joins are expensive!
REPLICA SETS
Horizontal scaling, automatically managed by MongoDB to partition the data across a large number of servers, transparently to the application.
Language drivers are replica-set aware and keep a list of replica set members (and can query via isMaster to determine which is master).
Full deployment. As many mongoS processes as you have app servers (for example); config DBs are small but hold the critical information about where ranges of data are located on disk/shards.
Special sharding router process; to apps it looks just like a stand-alone MongoD.