... In which I tell a
story of building
a CMS on top of
‘NoSQL’
                                              (*)


(*)   HBase and SOLR

           IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org
... and hopefully
warn you about
what YOU will
encounter in the
near future.
/usr/bin/whoami

» co-founder of Outerthought

 » scalable content applications
 » content management & publishing
 » Java, REST and now NoSQL
 » open source product portfolio




» Daisy: content- and knowledge management
    www.daisycms.org

» Lily: scalable store and search
    www.lilycms.org

» Kauri: RESTcentric internet app development
    www.kauriproject.org


A small semi-commercial announcement
» Devoxx 2010! (formerly JavaPolis)

» 15-19 November, Antwerp, Belgium

» NoSQL/Cloud track
 » Speakers: Tom White (Cloudera, author of the O'Reilly Hadoop book),
   Jonathan Ellis (Cassandra), Michael Stack (HBase)
 » Products: MongoDB, Voldemort, Elastic Search
 » Cases: Twitter, Facebook, Adobe


Devoxx NoSQL/Cloud track




This story is about




The typical CMS ‘architecture’

  client (+cache?)




  more cache

  application                                  cache

  database (+opt. filesystem) (+ opt. full-text indexes)


Hitting the scale spot
» Sweet spot for # documents: (hundreds of) thousands, not millions

» Not everything could be solved by increasing
 heap size
 » cold cache at startup
 » OutOfMemoryErrors (OOMEs)
 » we didn't want to step into the PHP/RDBMS trap
   (of dynamic database schemas)
» The cost of flexibility

What we found hard to scale
» access control (dynamically evaluated against rule set)

» facet browsing (compute facet counts in RAM)

» all the nifty stuff people were using our
 software for


» ... anything that required random access
 to in-memory-cache data for computations

Beyond the ‘scaling’ problem
» three-prong data layer



 » database (MySQL) + full-text index (Lucene) + filesystem (fs)




 » result set merging (between MySQL & Lucene)
   » happened in appcode/memory

 » ‘transactions’, set operations = hard


Beyond the three-prong problem




» errrr..... “Failover” ..... ?

» = symptom of enterprise success




If only we were able to add more nodes ...

» True Distribution
 » scalability
 » availability
 » performance

... in the line of fire

Solution 1




» do MORE inside the database




Infrastructural (master/slave)



» more database!
» even more database!
» let's add message busses!
» RMI! JMS over JDBC! w00t! stuff!
http://bigdatamatters.com/bigdatamatters/2010/04/high-availability-with-oracle.html



Business Development 101
  (chart: user interest plotted against budget)
Solution II: Enter The Cambrian Explosion

  (image: NoSQL product logos, including Cassandra and neo4j)
NoSQL



» the era of Polyglot Persistence

» the Tower of Babel

» the (B)Le(e|a)ding Edge




NoSQL typology


» Key/Value stores

» Document Databases

» Column (Family) Databases

» Graph Databases




NoSQL tool selection

» the luxury of choice
 (but remember polyglot persistence)

» survival of the fittest

» inflated expectations + nifty marketing



 NOTE: If your data fits in the RAM of a single node, DON'T go NoSQL (just yet)


Requirements, phase I
» automatic scaling to large data sets

» fault-tolerance: replication, automatic handling of failing nodes

» a flexible data model supporting sparse data

» runs on commodity hardware

» efficient random access to data

» open source, with the ability to participate in development and thus
  help drive the direction of the project
» some preference for a Java-based solution



Requirements, phase II

» After careful consideration, we realized the
 important choices were also:
 » consistency: no chance of having two conflicting
   versions of a row
 » atomic updates of a single row, single-row
   transactions
 » bonus points for MapReduce integration
   » e.g. full-text index rebuilding



That brought us to HBase, which bought us:
» a data model where some column
 families keep all versions and others
 do not, which maps very well onto our
 CMS document model
» ordered tables with the ability to do range
 scans on them, which allows us to build
 scalable indexes on top of them
» HDFS, a convenient place to store large blobs

» Apache license and community, a familiar
 environment for us

HBase

» hbase.apache.org + Cloudera CDH distro

» Open Source (Google) BigTable
 implementation
» HDFS as underlying DFS (≈GFS)

» ZooKeeper as lock service (≈Chubby)

» Integration with Hadoop MapReduce


BigTable
  row key "com.cnn.www" (one row; annotations in the figure: key, column family, cell, row)

  column family "contents:"
    "contents:"            "<html>..." at t3, t5 and t6
  column family "anchor:"
    "anchor:cnnsi.com"     "CNN" at t9
    "anchor:my.look.ca"    "CNN.com" at t8

Figure 1: A slice of an example table that stores Web pages. The row name is a reversed URL. The contents column family contains
the page contents, and the anchor column family contains the text of any anchors that reference the page. CNN's home page
is referenced by both the Sports Illustrated and the MY-look home pages, so the row contains columns named anchor:cnnsi.com
and anchor:my.look.ca. Each anchor cell has one version; the contents column has three versions, at timestamps t3, t5, and t6.
HBase Datamodel
» Sparse, multi-dimensional map
  (row, column, timestamp) → cell
» Column = Column Family:Column Qualifier

  (diagram: row "AK", column Fam1:Qual1, holding versions v1 and v2 at timestamps t1 and t2, with t2>t1)
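The map (row, column, timestamp) → cell can be modelled in a few lines. This is a minimal Python sketch and not HBase code: the class `SparseTable` and its methods are hypothetical, a plain dict stands in for HBase's sorted, persistent storage, and the pairing of v1 with t1 and v2 with t2 is an assumption about the diagram.

```python
from collections import defaultdict


class SparseTable:
    """Toy model of the map (row, column, timestamp) -> cell."""

    def __init__(self):
        # row -> "family:qualifier" -> {timestamp: value}; absent cells cost nothing
        self._rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row, column, timestamp, value):
        self._rows[row][column][timestamp] = value

    def get(self, row, column, timestamp=None):
        """Return the newest value at or before `timestamp` (newest overall if None)."""
        versions = self._rows.get(row, {}).get(column, {})
        candidates = [t for t in versions if timestamp is None or t <= timestamp]
        return versions[max(candidates)] if candidates else None

    def versions(self, row, column):
        """All (timestamp, value) pairs of a cell, newest first."""
        cells = self._rows.get(row, {}).get(column, {})
        return sorted(cells.items(), reverse=True)


table = SparseTable()
table.put("AK", "Fam1:Qual1", 1, "v1")   # written at t1
table.put("AK", "Fam1:Qual1", 2, "v2")   # written at t2 > t1
```

A read without a timestamp returns the newest version; older versions stay addressable, which is what makes "free versioning" possible on top of such a store.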
Regions
          » Lexicographically sorted set of rows
           » default size : 256MB

          » Hosted by region servers
            (diagram: writes grow a region holding row 1 to row 350 until it splits into row 1 to row 200 and row 201 to row 350)
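How sorted row ranges and splits fit together can be sketched as follows. This is an illustrative Python model only: `RegionMap` is a hypothetical name, a row-count limit stands in for the size threshold, and region servers are left out.

```python
import bisect


class RegionMap:
    """Toy region directory: each region covers a sorted, contiguous range of row keys."""

    def __init__(self, max_rows_per_region=4):
        self.max_rows = max_rows_per_region   # stand-in for the size threshold
        self.start_keys = [""]                # the first region starts at the empty key
        self.regions = [[]]                   # sorted row keys held by each region

    def locate(self, row_key):
        """Index of the region whose key range contains row_key."""
        return bisect.bisect_right(self.start_keys, row_key) - 1

    def write(self, row_key):
        i = self.locate(row_key)
        rows = self.regions[i]
        pos = bisect.bisect_left(rows, row_key)
        if pos == len(rows) or rows[pos] != row_key:
            rows.insert(pos, row_key)
        if len(rows) > self.max_rows:
            self._split(i)

    def _split(self, i):
        rows = self.regions[i]
        mid = len(rows) // 2
        # The upper half becomes a new region that starts at its first key.
        self.regions[i:i + 1] = [rows[:mid], rows[mid:]]
        self.start_keys.insert(i + 1, rows[mid])


regions = RegionMap(max_rows_per_region=4)
for n in range(1, 7):
    regions.write(f"row{n:03d}")
```

Because rows are sorted, locating the region for a key is a binary search over the start keys, and a split only has to cut one range in two.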
Storage architecture




                                                                                © lars george

Storage organisation
» Region
 » Memstore
 » HFile (on HDFS), HFile (on HDFS)
» HLog (append-only WAL on HDFS; Sequence File; one per RS)
» HFile: immutable sorted map (byte[] → byte[])
  (row, column, timestamp) → cell value

                                                                                  © Amandeep Khurana
Writing
» Write → Memstore, logged in the HLog (append-only WAL on HDFS, one per RS)
» HFile: immutable sorted map (byte[] → byte[])
  (row, column, timestamp) → cell value

                                                                                      © Amandeep Khurana
Flush
» Memstore → Flush → small HFile, next to the existing HFiles (on HDFS)
» HFile: immutable sorted map (byte[] → byte[])
  (row, column, timestamp) → cell value

                                                                                  © Amandeep Khurana
Compaction
» HFile (on HDFS) + HFile (on HDFS) + small HFile → Compaction
» HFile: immutable sorted map (byte[] → byte[])
  (row, column, timestamp) → cell value

                                                                                  © Amandeep Khurana
Stable
» Region: Memstore + HFile, HFile, HFile (all on HDFS)
» HLog (append-only WAL on HDFS; Sequence File; one per RS)

                                                                                  © Amandeep Khurana
Reading
» Read → Memstore + HFiles (on HDFS)

                                                                                     © Amandeep Khurana
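The write, flush, compaction and read steps can be traced in a small model. This is a sketch under simplifying assumptions, not HBase's implementation: the class `Region` and its methods are hypothetical, keys are plain strings instead of (row, column, timestamp), the flush threshold counts entries rather than bytes, and the log is never trimmed.

```python
class Region:
    """Toy write path: log + memstore in front of immutable sorted files."""

    def __init__(self, flush_threshold=2):
        self.wal = []          # append-only log, standing in for the HLog
        self.memstore = {}     # in-memory buffer of recent writes
        self.hfiles = []       # immutable sorted tuples of (key, value), oldest first
        self.flush_threshold = flush_threshold

    def write(self, key, value):
        self.wal.append((key, value))   # log first, so the write survives a crash
        self.memstore[key] = value
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        """Freeze the memstore into a new small, sorted, immutable file."""
        if self.memstore:
            self.hfiles.append(tuple(sorted(self.memstore.items())))
            self.memstore = {}

    def compact(self):
        """Merge all files into one; newer files win on duplicate keys."""
        merged = {}
        for hfile in self.hfiles:       # oldest to newest
            merged.update(hfile)
        self.hfiles = [tuple(sorted(merged.items()))] if merged else []

    def read(self, key):
        """Check the memstore first, then the files from newest to oldest."""
        if key in self.memstore:
            return self.memstore[key]
        for hfile in reversed(self.hfiles):
            for k, v in hfile:
                if k == key:
                    return v
        return None


region = Region(flush_threshold=2)
for key, value in [("a", "1"), ("b", "2"), ("a", "3"), ("c", "4"), ("d", "5")]:
    region.write(key, value)
```

After these five writes the region holds two flushed files plus one entry in the memstore; `compact()` folds the files into one without changing what `read()` returns.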
HBase APIs

» Java

» REST

» Thrift

» Ruby shell

» Java M/R



HBase Java API

» Get
» Put
» Scan
» Delete
» MapReduce Source / Sink

(byte arrays, mostly)
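Working with byte arrays means the caller decides how values are serialized, and that choice determines scan order. The sketch below is plain Python, not the HBase client API: `ByteTable` and `encode_count` are hypothetical names that only mirror the Get / Put / Scan / Delete operations.

```python
import bisect
import struct


class ByteTable:
    """Toy table keyed by byte strings, kept in sorted order."""

    def __init__(self):
        self._keys = []     # sorted row keys
        self._cells = {}    # row key -> value, both bytes

    def put(self, row: bytes, value: bytes) -> None:
        if row not in self._cells:
            bisect.insort(self._keys, row)
        self._cells[row] = value

    def get(self, row: bytes):
        return self._cells.get(row)

    def delete(self, row: bytes) -> None:
        if row in self._cells:
            del self._cells[row]
            self._keys.remove(row)

    def scan(self, start: bytes, stop: bytes):
        """Yield (row, value) for start <= row < stop, in byte order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        for row in self._keys[lo:hi]:
            yield row, self._cells[row]


def encode_count(n: int) -> bytes:
    # Big-endian and fixed width: byte order then matches numeric order for n >= 0.
    return struct.pack(">Q", n)


table = ByteTable()
for n in (300, 2, 45):
    table.put(encode_count(n), str(n).encode("utf-8"))
```

Had the keys been stored as decimal strings, a scan would return "2", "300", "45"; the fixed-width big-endian encoding keeps range scans in numeric order.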



Interesting HBase-related projects

» AvroHBase (Avro: Hadoop RPC + ser/deser)


» HBasene

» HBase Explorer

» asyncHbase




» OK, so now we have a data store !




» However, content repository =
 store + search

 ouch!
» That was easy! (however ...)
Search ponderings

» CMS = two types of search
 » structured, ‘logic’ search
  » numbers, strings
  » based on logic          (SQL, anyone?)

 » information retrieval (or: full-text search)
  » text
  » based on statistics



Search ponderings




» All of that, at scale




Structured Search
» HBase Indexing Library
 » idea from Google App Engine datastore indexes
 » http://code.google.com/appengine/articles/index_building.html

    content table                      index table A (rows kept in sorted order)
    rowkey    col     col              rowkey    col
    A         val3    foo6             val2-B
    B         val2    foo7             val3-A
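The index table above can be reproduced with a sorted list and a prefix range scan. This is a minimal Python sketch, not the HBase Indexing Library: the column names `col1` and `col2` and the functions `build_index` and `lookup` are hypothetical, and the plain "-" separator assumes indexed values contain no "-" themselves.

```python
import bisect

content = {                     # content table: rowkey -> columns
    "A": {"col1": "val3", "col2": "foo6"},
    "B": {"col1": "val2", "col2": "foo7"},
}


def build_index(table, column):
    """Index rows are '<value>-<rowkey>', kept sorted like an ordered table."""
    return sorted(f"{cols[column]}-{rowkey}" for rowkey, cols in table.items()
                  if column in cols)


def lookup(index, value):
    """Range scan over the index: every entry that starts with '<value>-'."""
    prefix = f"{value}-"
    start = bisect.bisect_left(index, prefix)
    hits = []
    for entry in index[start:]:
        if not entry.startswith(prefix):
            break
        hits.append(entry[len(prefix):])
    return hits


index_a = build_index(content, "col1")
```

Here `index_a` comes out as `["val2-B", "val3-A"]`, matching the slide, and `lookup(index_a, "val3")` returns the content row keys to fetch, without touching rows that do not match.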
Full-text / IR search


» Lucene?
 » no sharding (for scale)
 » no replication (for availability)
 » batched index updates (not real-time)




Beyond Lucene
» Katta
  » scalable architecture, however only search, no indexing

» Elastic Search
  » very young (sorry)

» hbasene et al.
  » stores inverted index in HBase, might not scale all features

» SOLR
  » widely used, schema, facets, query syntax, cloud branch




» HBase + SOLR = Easy! Or?
Remember distribution ?
Remember secondary indexes ?




 ➙ Need for reliable queuing

Connecting things
» we needed a reliable bridge between our
 main storage (HBase) and our index/search
 server(s) (SOLR)
 » indexing, reindexing, mass reindexing (M/R)

» we needed a reliable method of updating
 HBase secondary indexes
» all of that eventually has to run distributed

» distribution means coping with failure

Solution

» ... a QUEUE ! Meh.

» ACMEMessageQueue ? Bzzzzzt.
 We wanted fault-safe HBase persistence for
 the queues.
 Also for ease of administration.
» ➙ WAL  & Queue implemented on top of
 HBase tables


WAL / Queue
» WAL
 » guaranteed execution of synchronous actions
 » call doesn't return before secondary action finishes
 » e.g. update secondary indexes
 » if all goes well, size = #concurrent ops
 » useful outside of Lily context as well!
» Queue
 » triggering of async actions
 » e.g. (re)index (updated) record with SOLR back-end
 » size depends on speed of back-end process
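The difference between the two roles can be sketched in a few lines. This is an illustration only and not Lily's implementation: the class `RowLog` below is hypothetical, it keeps its state in memory rather than in HBase tables, and it assumes the secondary action is idempotent so that a replay is safe.

```python
from collections import deque


class RowLog:
    """Toy WAL + queue: the WAL guards synchronous secondary actions,
    the queue hands work to asynchronous consumers."""

    def __init__(self, sync_action):
        self.sync_action = sync_action   # e.g. update a secondary index
        self.wal = {}                    # op id -> payload, pending synchronous work
        self.queue = deque()             # payloads awaiting async processing
        self._next_id = 0

    def write(self, payload):
        op_id = self._next_id
        self._next_id += 1
        self.wal[op_id] = payload        # 1. record the intent
        self.sync_action(payload)        # 2. run the secondary action before returning
        del self.wal[op_id]              # 3. done: the WAL entry can go
        self.queue.append(payload)       # 4. leave async work (e.g. indexing) for later
        return op_id

    def replay(self):
        """After a failure, re-run whatever is still in the WAL."""
        for op_id in sorted(self.wal):
            self.sync_action(self.wal[op_id])
            self.queue.append(self.wal[op_id])
        self.wal.clear()

    def drain(self, async_action):
        """Consumer side: process queued messages at the back-end's own pace."""
        while self.queue:
            async_action(self.queue.popleft())


secondary_index = []
search_index = []
log = RowLog(secondary_index.append)
log.write("rec1")
log.write("rec2")
```

When all goes well the WAL is empty between calls, so its size tracks the number of in-flight operations, while the queue grows or shrinks with the speed of whatever calls `drain`.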



The Sum
» Lily model (records & fields)

» mapped onto HBase (=storage)

» indexed and searchable through
 SOLR
» using a WAL/Queue mechanism
 implemented in HBase
» runtime based on Kauri

» with client/server comms via Avro
 (and a REST interface with JSON)

Lily Content Model


» Records > Fields

» Field types: the usual base types + blobs + link
 fields
» ... so we can model relationships again
 (and have free versioning while at it)



Architecture
Roadmap

» Available now = learning material
 (architecture, model, API, Javadoc)
 + developer playground ‘proof of architecture’
 ➥ www.lilycms.org

» End of October = fully distributed release (Nearly there!)

» from there on, ca. 3-monthly releases
 leading up to Lily 1.0


License




» Apache




Documentation




Questions?




                                                                  http://www.flickr.com/photos/leehaywood/4237636853/


Thanks for your hospitality and attention !

» stevenn@outerthought.org
» @stevenn

Más contenido relacionado

Similar a Building a CMS on top of NoSQL (for ParisJUG)

Lily for the Bay Area HBase UG - NYC edition
Lily for the Bay Area HBase UG - NYC editionLily for the Bay Area HBase UG - NYC edition
Lily for the Bay Area HBase UG - NYC editionNGDATA
 
Welcome to the Age of Data
Welcome to the Age of DataWelcome to the Age of Data
Welcome to the Age of DataNGDATA
 
Hadoop World 2011: Lily: Smart Data at Scale, Made Easy
Hadoop World 2011: Lily: Smart Data at Scale, Made EasyHadoop World 2011: Lily: Smart Data at Scale, Made Easy
Hadoop World 2011: Lily: Smart Data at Scale, Made EasyCloudera, Inc.
 
Lily @ Work Webinar
Lily @ Work WebinarLily @ Work Webinar
Lily @ Work WebinarNGDATA
 
NoSQL intro for YaJUG / NoSQL UG Luxembourg
NoSQL intro for YaJUG / NoSQL UG LuxembourgNoSQL intro for YaJUG / NoSQL UG Luxembourg
NoSQL intro for YaJUG / NoSQL UG LuxembourgNGDATA
 
NoSQL with Hadoop and HBase
NoSQL with Hadoop and HBaseNoSQL with Hadoop and HBase
NoSQL with Hadoop and HBaseNGDATA
 
Devoxx 2010 | Tools In Action : Kauri and Lily
Devoxx 2010 | Tools In Action : Kauri and LilyDevoxx 2010 | Tools In Action : Kauri and Lily
Devoxx 2010 | Tools In Action : Kauri and LilyNGDATA
 
Sirris innovate2011 - Lily, Smart Data at scale made easy, Steven Noels, Oute...
Sirris innovate2011 - Lily, Smart Data at scale made easy, Steven Noels, Oute...Sirris innovate2011 - Lily, Smart Data at scale made easy, Steven Noels, Oute...
Sirris innovate2011 - Lily, Smart Data at scale made easy, Steven Noels, Oute...Sirris
 
The Lily RowLog library
The Lily RowLog libraryThe Lily RowLog library
The Lily RowLog libraryNGDATA
 
Devoxx 2010 | LAB : ReST in Java
Devoxx 2010 | LAB : ReST in JavaDevoxx 2010 | LAB : ReST in Java
Devoxx 2010 | LAB : ReST in JavaNGDATA
 
From Content Storage to Scaling Smart Data
From Content Storage to Scaling Smart DataFrom Content Storage to Scaling Smart Data
From Content Storage to Scaling Smart DataNGDATA
 
Afterwork big data et data viz - du lac à votre écran
Afterwork big data et data viz - du lac à votre écranAfterwork big data et data viz - du lac à votre écran
Afterwork big data et data viz - du lac à votre écranJoseph Glorieux
 
Lily at HUG UK
Lily at HUG UKLily at HUG UK
Lily at HUG UKNGDATA
 
Kubernetes on AWS at Zalando: Failures & Learnings - DevOps NRW
Kubernetes on AWS at Zalando: Failures & Learnings - DevOps NRWKubernetes on AWS at Zalando: Failures & Learnings - DevOps NRW
Kubernetes on AWS at Zalando: Failures & Learnings - DevOps NRWHenning Jacobs
 
XConf 2022 - Code As Data: How data insights on legacy codebases can fill the...
XConf 2022 - Code As Data: How data insights on legacy codebases can fill the...XConf 2022 - Code As Data: How data insights on legacy codebases can fill the...
XConf 2022 - Code As Data: How data insights on legacy codebases can fill the...Alessandro Confetti
 
Optimisation of Industrial Processes SimQRi - A Query-oriented Tool for the E...
Optimisation of Industrial Processes SimQRi - A Query-oriented Tool for the E...Optimisation of Industrial Processes SimQRi - A Query-oriented Tool for the E...
Optimisation of Industrial Processes SimQRi - A Query-oriented Tool for the E...Agence du Numérique (AdN)
 
Hashidays London 2017 - Evolving your Infrastructure with Terraform By Nicki ...
Hashidays London 2017 - Evolving your Infrastructure with Terraform By Nicki ...Hashidays London 2017 - Evolving your Infrastructure with Terraform By Nicki ...
Hashidays London 2017 - Evolving your Infrastructure with Terraform By Nicki ...OpenCredo
 
The world is the computer and the programmer is you
The world is the computer and the programmer is youThe world is the computer and the programmer is you
The world is the computer and the programmer is youDavide Carboni
 

Similar a Building a CMS on top of NoSQL (for ParisJUG) (20)

Lily for the Bay Area HBase UG - NYC edition
Lily for the Bay Area HBase UG - NYC editionLily for the Bay Area HBase UG - NYC edition
Lily for the Bay Area HBase UG - NYC edition
 
Welcome to the Age of Data
Welcome to the Age of DataWelcome to the Age of Data
Welcome to the Age of Data
 
Hadoop World 2011: Lily: Smart Data at Scale, Made Easy
Hadoop World 2011: Lily: Smart Data at Scale, Made EasyHadoop World 2011: Lily: Smart Data at Scale, Made Easy
Hadoop World 2011: Lily: Smart Data at Scale, Made Easy
 
Lily @ Work Webinar
Lily @ Work WebinarLily @ Work Webinar
Lily @ Work Webinar
 
NoSQL intro for YaJUG / NoSQL UG Luxembourg
NoSQL intro for YaJUG / NoSQL UG LuxembourgNoSQL intro for YaJUG / NoSQL UG Luxembourg
NoSQL intro for YaJUG / NoSQL UG Luxembourg
 
NoSQL with Hadoop and HBase
NoSQL with Hadoop and HBaseNoSQL with Hadoop and HBase
NoSQL with Hadoop and HBase
 
Devoxx 2010 | Tools In Action : Kauri and Lily
Devoxx 2010 | Tools In Action : Kauri and LilyDevoxx 2010 | Tools In Action : Kauri and Lily
Devoxx 2010 | Tools In Action : Kauri and Lily
 
Sirris innovate2011 - Lily, Smart Data at scale made easy, Steven Noels, Oute...
Sirris innovate2011 - Lily, Smart Data at scale made easy, Steven Noels, Oute...Sirris innovate2011 - Lily, Smart Data at scale made easy, Steven Noels, Oute...
Sirris innovate2011 - Lily, Smart Data at scale made easy, Steven Noels, Oute...
 
The Lily RowLog library
The Lily RowLog libraryThe Lily RowLog library
The Lily RowLog library
 
Devoxx 2010 | LAB : ReST in Java
Devoxx 2010 | LAB : ReST in JavaDevoxx 2010 | LAB : ReST in Java
Devoxx 2010 | LAB : ReST in Java
 



Building a CMS on top of NoSQL (for ParisJUG)

  • 1. ... In which I tell a story of building a CMS on top of ‘NoSQL’ (*) (*) HBase and SOLR IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org
  • 2. ... and hopefully warn you about what YOU will encounter in the near future.
  • 3. /usr/bin/whoami » co-founder of Outerthought » scalable content applications » content management & publishing » Java, REST and now NoSQL » open source product portfolio
  • 4. » Daisy: content and knowledge management www.daisycms.org » Lily: scalable store and search www.lilycms.org » Kauri: RESTcentric internet app development www.kauriproject.org
  • 5. A small semi-commercial announcement » Devoxx 2010! (ex-Javapolis) » 15-19 November, Antwerp, Belgium » NoSQL/Cloud track » Speakers: Tom White (Cloudera/Hadoop O’Reilly author), Jonathan Ellis (Cassandra), Michael Stack (HBase) » Products: MongoDB, Voldemort, Elastic Search » Cases: Twitter, Facebook, Adobe
  • 6. Devoxx NoSQL/Cloud track
  • 7. This story is about
  • 8. The typical CMS ‘architecture’ » database (+ opt. filesystem) (+ opt. full-text indexes)
  • 9. The typical CMS ‘architecture’ » application + cache » database (+ opt. filesystem) (+ opt. full-text indexes)
  • 10. The typical CMS ‘architecture’ » more cache » application + cache » database (+ opt. filesystem) (+ opt. full-text indexes)
  • 11. The typical CMS ‘architecture’ » client (+ cache?) » more cache » application + cache » database (+ opt. filesystem) (+ opt. full-text indexes)
  • 12. Hitting the scale spot » Sweet spot of # documents: (100)Ks, not Ms » Not everything could be solved by increasing heap size » cold cache at startup » OOME’s » we didn’t want to step into the PHP/RDBMS trap (of dynamic database schemas) » The cost of flexibility
  • 13. What we found hard to scale » access control (dynamically evaluated against rule set) » facet browsing (compute facet counts in RAM) » all the nifty stuff people were using our software for » ... anything that required random access to in-memory-cache data for computations
  • 14. Beyond the ‘scaling’ problem » three-prong data layer (database, filesystem, full-text index) » result set merging (between MySQL & Lucene) » happened in appcode/memory » ‘transactions’, set operations = hard
  • 15. Beyond the three-prong problem » errrr..... “Failover” ..... ? » = symptom of enterprise success
  • 16. If we were able to add more nodes ... » True Distribution: scalability, availability, performance ... in the line of fire
  • 17. Solution I » do MORE inside the database
  • 18. Infrastructural (master/slave): more database!
  • 19. even more database!
  • 20. let’s add message buses!
  • 21. (rotated overlay text, word order unclear) RMI! JMS! JDBC! w00t! over stuff!
  • 22. http://bigdatamatters.com/bigdatamatters/2010/04/high-availability-with-oracle.html
  • 23. Business Development 101 (chart labels: user interest, budget)
  • 24. Solution II: Enter The Cambrian Explosion: NoSQL (Cassandra, neo4j)
  • 25. NoSQL » the era of Polyglot Persistence » the Tower of Babel » the (B)Le(e|a)ding Edge
  • 26. NoSQL typology » Key/Value stores » Document Databases » Column (Family) Databases » Graph Databases
  • 27. NoSQL tool selection » the luxury of choice (but remember polyglot persistence) » survival of the fittest » inflated expectations + nifty marketing » NOTE: If your data fits in the RAM of a single node, DON’T go NoSQL (just yet)
  • 28. Requirements, phase I » automatic scaling to large data sets » fault-tolerance: replication, automatic handling of failing nodes » a flexible data model supporting sparse data » runs on commodity hardware » efficient random access to data » open source, with the ability to participate in the development and thus drive the direction of the project » some preference for a Java-based solution
  • 29. Requirements, phase II » After careful consideration, we realized the important choices were also: » consistency: no chance of having two conflicting versions of a row » atomic updates of a single row, single-row transactions » bonus points for MapReduce integration » e.g. full-text index rebuilding
  • 30. That brought us to HBase, which bought us: » a data model where some column families keep all versions and others do not, which fits our CMS document model very well » ordered tables with the ability to do range scans on them, which allows building scalable indexes on top of them » HDFS, a convenient place to store large blobs » Apache license and community, a familiar environment for us
  • 31. HBase » hbase.apache.org + Cloudera CDH distro » Open Source (Google) BigTable implementation » HDFS as underlying DFS (≈GFS) » ZooKeeper as lock service (≈Chubby) » Integration with Hadoop MapReduce
  • 32. BigTable » Example row “com.cnn.www” (row key): column “contents:” holds three versions of “<html>...” at timestamps t3, t5 and t6; column “anchor:cnnsi.com” holds “CNN” at t9; column “anchor:my.look.ca” holds “CNN.com” at t8 » Annotations: row key, column family, cell » Figure 1 (from the Bigtable paper): A slice of an example table that stores Web pages. The row name is a reversed URL. The contents column family contains the page contents, and the anchor column family contains the text of any anchors that reference the page. CNN’s home page is referenced by both the Sports Illustrated and the MY-look home pages, so the row contains columns named anchor:cnnsi.com and anchor:my.look.ca. Each anchor cell has one version; the contents column has three versions, at timestamps t3, t5, and t6.
  • 33. HBase Data Model » Sparse, multi-dimensional map: (row, column, timestamp) → cell » Column = Column Family:Column Qualifier » Diagram: rows × columns, e.g. row AK, column Fam1:Qual1, holding v1 at timestamp t1 and v2 at timestamp t2 (t2 > t1)
  • 34. Regions » Lexicographically sorted set of rows » default size: 256MB » Hosted by region servers » Diagram: writes; split between row 1 to row 200 and row 201 to row 350
  • 35. Storage architecture © Lars George
  • 36. Storage organisation » Region: Memstore + HFiles (on HDFS) » HLog: append-only WAL on HDFS (Sequence File), one per RS » HFile: immutable sorted map (byte[] → byte[]); (row, column, timestamp) → cell value © Amandeep Khurana
  • 37. Writing » Diagram: a Write arrives at the Region (Memstore + HFiles on HDFS) » HLog: append-only WAL on HDFS (Sequence File), one per RS » HFile: immutable sorted map (byte[] → byte[]); (row, column, timestamp) → cell value © Amandeep Khurana
  • 38. Flush » Diagram: the Memstore is flushed to a new, small HFile next to the existing HFiles (on HDFS) » HLog: append-only WAL on HDFS (Sequence File), one per RS » HFile: immutable sorted map (byte[] → byte[]); (row, column, timestamp) → cell value © Amandeep Khurana
  • 39. Compaction » Diagram: compaction of the small HFile and the existing HFiles (on HDFS) » HLog: append-only WAL on HDFS (Sequence File), one per RS » HFile: immutable sorted map (byte[] → byte[]); (row, column, timestamp) → cell value © Amandeep Khurana
  • 40. Stable » Region: Memstore + three HFiles (on HDFS) » HLog: append-only WAL on HDFS (Sequence File), one per RS © Amandeep Khurana
  • 41. Reading » Diagram: a Read arrives at the Region (Memstore + three HFiles on HDFS) » HLog: append-only WAL on HDFS (Sequence File), one per RS © Amandeep Khurana
  • 42. HBase APIs » Java » REST » Thrift » Ruby shell » Java M/R
  • 43. HBase Java API » Get (byte arrays, mostly) » Put » Scan » Delete » MapReduce Source / Sink
  • 44. Interesting HBase-related projects » AvroHBase (Avro: Hadoop RPC + ser/deser) » HBasene » HBase Explorer » asyncHbase
  • 45. » OK, so now we have a data store!
  • 46. » However, content repository = store + search! (ouch!)
  • 47. That was easy! (however ...)
  • 48. Search ponderings » CMS = two types of search » structured, ‘logic’ search » numbers, strings » based on logic (SQL, anyone?) » information retrieval (or: full-text search) » text » based on statistics
  • 49. Search ponderings » All of that, at scale
  • 50. Structured Search » HBase Indexing Library » idea from Google App Engine datastore indexes » http://code.google.com/appengine/articles/index_building.html » content table (rowkey | col | col): A | val3 | foo6; B | val2 | foo7 » index table (rowkey, in order): val2-B, val3-A
  • 51. Full-text / IR search » Lucene? » no sharding (for scale) » no replication (for availability) » batched index updates (not real-time)
  • 52. Beyond Lucene » Katta » scalable architecture, however only search, no indexing » Elastic Search » very young (sorry) » hbasene et al. » stores inverted index in HBase, might not scale all features » SOLR » widely used, schema, facets, query syntax, cloud branch
  • 53. [logo] + [logo] = ? Easy! Or?
  • 54. Remember distribution? Remember secondary indexes? ➙ Need for reliable queuing
  • 55. Connecting things » we needed a reliable bridge between our main storage (HBase) and our index/search server(s) (SOLR) » indexing, reindexing, mass reindexing (M/R) » we needed a reliable method of updating HBase secondary indexes » all of that eventually to run distributed » distribution means coping with failure
  • 56. Solution » ... a QUEUE! Meh. » ACMEMessageQueue? Bzzzzzt. We wanted fault-safe HBase persistence for the queues, also for ease of administration. » ➙ WAL & Queue implemented on top of HBase tables
  • 57. WAL / Queue » WAL: guaranteed execution of synchronous actions » call doesn’t return before secondary action finishes » e.g. update secondary actions » if all goes well, size = #concurrent ops » Queue: triggering of async actions » e.g. (re)index (updated) record with SOLR back-end » size depends on speed of back-end process » useful outside of Lily context as well!
  • 58. The Sum » Lily model (records & fields) » mapped onto HBase (=storage) » indexed and searchable through SOLR » using a WAL/Queue mechanism implemented in HBase » runtime based on Kauri » with client/server comms via Avro (and a REST interface with JSON)
  • 59. Lily Content Model » Records > Fields » Field types: the usual base types + blobs + link fields » ... so we can model relationships again (and have free versioning while at it)
  • 60. Architecture
  • 61. Architecture
  • 62. Roadmap » Available now = learning material (architecture, model, API, Javadoc) + developer playground ‘proof of architecture’ ➥ www.lilycms.org » End of October = fully distributed release » from there on, ca. 3-monthly releases leading up to Lily 1.0
  • 63. License » Apache
  • 64. Documentation
  • 65. Questions? http://www.flickr.com/photos/leehaywood/4237636853/
  • 66. Thanks for your hospitality and attention! » stevenn@outerthought.org » @stevenn
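
The data model described on slides 30 to 34 (a sparse map from (row, column, timestamp) to a cell, rows kept in lexicographic order, range scans over row keys) can be tried out with a small in-memory toy. This is only a sketch under those stated assumptions, not HBase code: `ToyTable` and its `put`, `get` and `scan` methods are hypothetical names invented for illustration, and the `org.example` row is made-up sample data.

```python
import bisect


class ToyTable:
    """In-memory toy of a sparse, sorted, versioned table (hypothetical, not the HBase API)."""

    def __init__(self):
        self._rows = {}         # row key -> {column -> {timestamp -> value}}
        self._sorted_keys = []  # row keys kept in lexicographic order

    def put(self, row, column, timestamp, value):
        # Only cells that are actually written take up space (sparse).
        if row not in self._rows:
            bisect.insort(self._sorted_keys, row)
            self._rows[row] = {}
        self._rows[row].setdefault(column, {})[timestamp] = value

    def get(self, row, column, timestamp=None):
        # Return the newest version, or the newest one at or before `timestamp`.
        versions = self._rows.get(row, {}).get(column, {})
        if not versions:
            return None
        if timestamp is None:
            return versions[max(versions)]
        older = [t for t in versions if t <= timestamp]
        return versions[max(older)] if older else None

    def scan(self, start_row, stop_row):
        # Range scan over the sorted row keys: start inclusive, stop exclusive.
        lo = bisect.bisect_left(self._sorted_keys, start_row)
        hi = bisect.bisect_left(self._sorted_keys, stop_row)
        for key in self._sorted_keys[lo:hi]:
            yield key, self._rows[key]


table = ToyTable()
table.put(b"com.cnn.www", b"contents:", 3, b"<html>v1")
table.put(b"com.cnn.www", b"contents:", 5, b"<html>v2")
table.put(b"com.cnn.www", b"anchor:cnnsi.com", 9, b"CNN")
table.put(b"org.example", b"contents:", 4, b"<html>other")

latest = table.get(b"com.cnn.www", b"contents:")
as_of_t4 = table.get(b"com.cnn.www", b"contents:", timestamp=4)
com_rows = [key for key, _ in table.scan(b"com.", b"com/")]
```

Because row keys are reversed URLs, every page of one domain shares a prefix, so a single range scan (here from `b"com."` up to `b"com/"`) visits them together.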

Editor's notes

  1. + merging of search results across MySQL & Lucene
  2. consistency?? we’re a content repository, after all - people rely on us. MapReduce for index rebuilding
  3. This can be used instead of Lucene for indexes which are structured, large, and should be immediately up to date. For example, we use this to keep an index of the links that exist between records.
  4. use values as the basis for key computation and rely on HBase’s naturally ordered rows + scans
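
Notes 3 and 4 describe the idea behind the index table on slide 50: the indexed value becomes the prefix of the index row key, so the store's row ordering does the sorting and a range scan answers a range query. The sketch below illustrates that idea only. It is not the HBase Indexing Library API: the function names, the column name `col1` and the `0x00` separator are assumptions (the slide shows keys such as val2-B), and a sorted Python list stands in for the ordered index table.

```python
import bisect

SEP = b"\x00"  # assumed separator; values and query bounds must not contain it


def index_key(value: bytes, row_key: bytes) -> bytes:
    # Index row key = indexed value, then the content row key it points to.
    return value + SEP + row_key


def build_index(content_table: dict, column: str) -> list:
    # One index entry per content row that has the column; sorting mimics ordered rows.
    keys = [
        index_key(cols[column], row)
        for row, cols in content_table.items()
        if column in cols
    ]
    return sorted(keys)


def lookup_range(index: list, low: bytes, high: bytes) -> list:
    # Content row keys whose indexed value v satisfies low <= v < high.
    start = bisect.bisect_left(index, low)
    stop = bisect.bisect_left(index, high)
    return [key.split(SEP, 1)[1] for key in index[start:stop]]


# Content table from slide 50; the column names are made up.
content = {
    b"A": {"col1": b"val3", "col2": b"foo6"},
    b"B": {"col1": b"val2", "col2": b"foo7"},
}
index = build_index(content, "col1")
only_val2 = lookup_range(index, b"val2", b"val3")
all_vals = lookup_range(index, b"val", b"val~")
```

Appending the content row key keeps index keys unique when several records share the same value, and it is also what the lookup returns.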