2. Who we are
• A search engine
• A people search engine
• An influencer search engine
• Subscription-based
3. George Stathis
VP Engineering
14+ years of experience building full-stack web software systems with a past focus on e-commerce and publishing. Currently responsible for building engineering capability to enable Traackr's growth goals.
4. What’s this talk about?
• Share what we know about Big Data/NoSQL: what’s behind the buzz words?
• Our reasons and method for picking a NoSQL database
• Share the lessons we learned going through the process
6. What is Big Data?
• 3 Vs:
– Volume
– Velocity
– Variety
7. What is Big Data? Volume + Velocity
• Data sets too large, or coming in at too high a velocity, to process using traditional databases or desktop tools. E.g.: big science, astronomy, web logs, atmospheric science, RFID, genomics, sensor networks, biogeochemical data, social networks, military surveillance, social data, medical records, internet text and documents, photography archives, internet search indexing, video archives, call detail records, large-scale e-commerce
8. What is Big Data? Variety
• Big Data is varied and unstructured: traditional static reports vs. analytics, exploration & experimentation
9. What is Big Data?
• Scaling data processing cost-effectively
10. What is NoSQL?
• NoSQL ≠ No SQL
• NoSQL ≈ Not Only SQL
• NoSQL addresses RDBMS limitations; it’s not about the SQL language
• RDBMS = static schema
• NoSQL = schema flexibility; you don’t have to know the exact structure before storing
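The schema-flexibility point above can be sketched in a few lines of Python; the records and field names here are hypothetical, and plain dicts stand in for a document store:

```python
# Hypothetical sketch: two "influencer" records stored side by side
# without a shared, predeclared schema. A new attribute can appear on
# one record at any time, with no migration, unlike an RDBMS column.
records = [
    {"name": "Arianna Huffington", "sites": ["huffingtonpost.com"]},
    {"name": "Shaun Donovan", "sites": ["huffingtonpost.com"],
     "twitter": "@ShaunDonovan"},  # extra field, no schema change needed
]

# Readers tolerate missing attributes instead of relying on a schema.
handles = [r.get("twitter", "n/a") for r in records]
print(handles)  # ['n/a', '@ShaunDonovan']
```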
11. What is Distributed Computing?
• Sharing the workload: divide a problem into many tasks, each of which can be solved by one or more computers
• Allows computations to be accomplished in acceptable timeframes
• Distributed computation approaches were developed to leverage multiple machines: MapReduce
• With MapReduce, the program goes to the data, since the data is too big to move
13. What is MapReduce?
• MapReduce = batch processing = analytical
• MapReduce ≠ interactive
• Therefore many NoSQL solutions don’t outright replace warehouse solutions; they complement them
• RDBMS is still safe
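The batch model behind MapReduce can be sketched as an in-memory toy (this is not Hadoop; the word-count example is the classic illustration, run on one machine):

```python
# Minimal in-memory sketch of the MapReduce batch model: map emits
# key/value pairs, the framework groups by key, reduce folds each
# group. Real MapReduce ships this program to the data nodes instead
# of moving the data.
from collections import defaultdict

def map_phase(doc):
    # Emit (word, 1) for every word in the document.
    return [(word, 1) for word in doc.split()]

def reduce_phase(key, values):
    # Fold all values emitted for one key.
    return key, sum(values)

def map_reduce(docs):
    groups = defaultdict(list)
    for doc in docs:
        for key, value in map_phase(doc):
            groups[key].append(value)          # shuffle/group by key
    return dict(reduce_phase(k, v) for k, v in groups.items())

counts = map_reduce(["big data", "big equipment"])
print(counts)  # {'big': 2, 'data': 1, 'equipment': 1}
```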
14. What is Big Data? Velocity
• In some instances, being able to process large amounts of data in real time can yield a competitive advantage. E.g.
– Online retailers leveraging buying history and click-through data for real-time recommendations
• No time to wait for MapReduce jobs to finish
• Solutions: stream processing (e.g. Twitter Storm), pre-computing (e.g. aggregate and count analytics as data arrives), quick-to-read key/value stores (e.g. distributed hashes)
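The "pre-compute as data arrives" idea above can be sketched like this; a plain Counter stands in for a distributed key/value store, and the event names are made up:

```python
# Hedged sketch of pre-computing: instead of waiting for a batch job,
# keep running aggregates in a fast key/value structure and update
# them on every incoming event.
from collections import Counter

clicks = Counter()  # stands in for a distributed hash / k-v store

def on_event(product_id):
    clicks[product_id] += 1  # aggregate maintained in real time

for event in ["p1", "p2", "p1", "p1"]:
    on_event(event)

# Reads are now instant lookups, no MapReduce job required.
print(clicks.most_common(1))  # [('p1', 3)]
```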
15. What is Big Data? Data Science
• Emergence of Data Science
• Data Scientist ≈ Statistician
• Possess scientific discipline & expertise
• Formulate and test hypotheses
• Understand the math behind the algorithms so they can tweak them when they don’t work
• Can distill the results into an easy-to-understand story
• Help businesses gain actionable insights
20. Traackr: context
• A cloud computing company is about to launch a new platform; how does it find the most influential IT bloggers on the web who can help bring visibility to the new product? How does it find the opinion leaders, the people that matter?
55. Requirement: batch processing
MapReduce + RDBMS: possible, but via proprietary solutions. Usually involves exporting data from the RDBMS into a NoSQL system anyway, which defeats the data-locality benefit of MapReduce.
56. Traackr’s Datastore Requirements
• Schema flexibility ✓
• Good at storing lots of variable-length text ✓
• Batch processing options ✓
A NoSQL option is the right fit
58. Bewildering number of options (early 2010)
Key/Value Databases
• Distributed hashtables
• Designed for high load
• In-memory or on-disk
• Eventually consistent
Column Databases
• Spreadsheet-like
• Key is a row id
• Attributes are columns
• Columns can be grouped into families
Document Databases
• Like Key/Value
• Value = Document
• Document = JSON/BSON
• JSON = Flexible Schema
Graph Databases
• Graph Theory G=(E,V)
• Great for modeling networks
• Great for graph-based query algorithms
60. Trimming options
Graph Databases: while we can model our domain as a graph, we don’t want to pigeonhole ourselves into this structure. We’d rather use these tools for specialized data analysis, but not as the main data store.
61. Trimming options
Memcache: memory-based; we need true persistence.
62. Trimming options
Amazon SimpleDB: not willing to store our data in a proprietary datastore.
63. Trimming options
Redis and LinkedIn’s Project Voldemort: no query filters; better used as queues or distributed caches.
64. Trimming options
CouchDB: no ad-hoc queries; its maturity in early 2010 made us shy away, although we did try early prototypes.
65. Trimming options
Cassandra: in early 2010, maturity questions, no secondary indexes and no batch processing options (these came later on).
66. Trimming options
MongoDB: in early 2010, maturity questions, adoption questions and no batch processing options.
67. Trimming options
Riak: very close, but in early 2010 we had adoption questions.
68. Trimming options
HBase: came across as the most mature at the time, with several deployments, a healthy community, "out-of-the-box" secondary indexes through a contrib module, and support for batch processing using Hadoop/MapReduce. Hadoop and its maturity were a big reason we picked HBase.
69. Lessons Learned
Challenges: complexity, missing features, problem-solution fit, resources
Rewards: choices, empowering, community, cost
70. Rewards: Choices
(The four NoSQL database families shown earlier: key/value, column, document, and graph.)
72. Lessons Learned
Challenges: complexity, missing features, problem-solution fit, resources
Rewards: choices, empowering, community, cost
73. When Big Data = Big Architectures
• Master/slave architecture means a single point of failure, so you need to protect your master.
• Must have an odd number of Zookeeper quorum nodes.
• Must have a Hadoop HDFS cluster of at least 2x the replication factor in nodes.
• Then you can run your HBase nodes, but it’s recommended to co-locate regionservers with Hadoop datanodes, so you have to manage resources.
• And then we also have to manage the MapReduce processes and resources in the Hadoop layer.
Source: http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html
76. To be expected
• Hadoop/HBase are designed to move mountains
• If you want to move big stuff, be prepared to sometimes use big equipment
77. What it means to a startup
Development capacity before vs. after: congrats, you are now a sysadmin…
78. Lessons Learned
Challenges: complexity, missing features, problem-solution fit, resources
Rewards: choices, empowering, community, cost
79. Mapping a saved search to a column store
Name, ranks, references to influencer records
80. Mapping a saved search to a column store
Unique row key; an “attributes” column family for general attributes; an “influencerId” column family for influencer ranks and foreign keys
81. Mapping a saved search to a column store
The “name” attribute; influencer ranks can be attribute names as well
82. Mapping a saved search to a column store
Can get pretty long, so it needs indexing and pagination
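The mapping described on slides 79-82 can be sketched with nested dicts (this is an illustration of the row-key/column-family idea, not the HBase API; all names are made up):

```python
# Illustrative sketch: a saved search modeled as a column-store row.
# One unique row key, an "attributes" family for general fields, and
# an "influencerIds" family holding ranks as column names pointing at
# foreign keys of influencer records.
saved_search = {
    "row_key": "alist-42",
    "attributes": {"name": "IT bloggers"},   # column family 1
    "influencerIds": {                        # column family 2
        "rank:1": "influencer-007",           # rank -> foreign key
        "rank:2": "influencer-123",
    },
}

# Reading only the "attributes" family gets the list's general info
# without loading all the influencer references.
print(saved_search["attributes"]["name"])  # IT bloggers
```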
87. Need to upgrade to HBase 0.90
• Making sure to remain on a recent code base
• Performance improvements
• Mostly to get the latest bug fixes
No thanks!
91. Let’s get this straight
• HBase no longer comes with secondary indexing out-of-the-box
• It’s been moved out of the trunk to GitHub
• Where only one other company besides us seems to care about it
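What maintaining application-layer secondary indexing means in practice can be sketched like this (a toy illustration of the burden, not Traackr's actual code; names are hypothetical):

```python
# Sketch: when the store only supports primary-key lookups, every
# write must also update a hand-rolled reverse index. The application
# owns this code, and the two structures can drift apart on failures.
primary = {}    # row_key -> record (the "real" store)
by_site = {}    # hand-maintained secondary index: site -> row keys

def put(row_key, record):
    primary[row_key] = record
    # Extra bookkeeping the datastore would ideally do for us:
    by_site.setdefault(record["site"], set()).add(row_key)

put("inf-1", {"site": "huffingtonpost.com"})
put("inf-2", {"site": "huffingtonpost.com"})

# "Find influencers for site X" now works, but only because we wrote
# and must forever maintain the index update path above.
print(sorted(by_site["huffingtonpost.com"]))  # ['inf-1', 'inf-2']
```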
102. Cracks in the data model
[Diagram: each influencer record (arianna-huffington, shaun-donovan) carries its own copy of the huffingtonpost.com site ("writes for" / "published under") plus the posts "authored by" that influencer.]
103. Cracks in the data model
Denormalized/duplicated for fast runtime access and storage of influencer-to-site relationship properties.
104. Cracks in the data model
Content attribution logic could sometimes mis-attribute posts because of the duplicated data.
105. Cracks in the data model
Exacerbated when we started tracking people’s content on a daily basis in mid-2011.
106. Fixing the cracks in the data model
Normalize the sites: both influencers’ posts now point to a single shared huffingtonpost.com record (“writes for” / “published under”).
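The normalization fix can be sketched with plain dicts (an illustration of the single-source-of-truth idea; the ids and field names are invented, not Traackr's actual model):

```python
# Sketch of the fix: one shared site record instead of per-influencer
# duplicates, so attribution resolves through a single source of truth.
sites = {"site-1": {"url": "huffingtonpost.com"}}  # normalized site

influencers = {
    "arianna": {"writes_for": ["site-1"]},  # both reference the same
    "shaun": {"writes_for": ["site-1"]},    # site record by id
}

posts = {
    "post_1.html": {"author": "arianna", "published_under": "site-1"},
}

# Attribution now follows the shared record; there is no duplicated
# copy of the site that could disagree with this one.
author_site = sites[posts["post_1.html"]["published_under"]]["url"]
print(author_site)  # huffingtonpost.com
```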
107. Fixing the cracks in the data model
• Normalization requires stronger secondary indexing
• Our application-layer indexing would need revisiting…again!
108. What it means to a startup
Psych! You are back
to writing indexing
code.
Development capacity
110. Lessons Learned
Challenges: complexity, missing features, problem-solution fit, resources
Rewards: choices, empowering, community, cost
111. Traackr’s Datastore Requirements (Revisited)
• Schema flexibility
• Good at storing lots of variable-length text
• Out-of-the-box SECONDARY INDEX support!
• Simple to use and administer
112. NoSQL picking – Round 2 (mid 2011)
(The same four NoSQL database families as before: key/value, column, document, and graph.)
113. NoSQL picking – Round 2 (mid 2011)
Nope!
114. NoSQL picking – Round 2 (mid 2011)
Graph Databases: we looked at Neo4J a bit closer but passed again for the same reasons as before.
115. NoSQL picking – Round 2 (mid 2011)
Memcache: still no.
116. NoSQL picking – Round 2 (mid 2011)
Amazon SimpleDB: still no.
117. NoSQL picking – Round 2 (mid 2011)
Redis and LinkedIn’s Project Voldemort: still no.
118. NoSQL picking – Round 2 (mid 2011)
CouchDB: more mature, but still no ad-hoc queries.
119. NoSQL picking – Round 2 (mid 2011)
Cassandra: matured quite a bit, added secondary indexes and batch processing options, but more restrictive in its use than other solutions. After the HBase lesson, simplicity of use was now more important.
120. NoSQL picking – Round 2 (mid 2011)
Riak: still a strong contender, but adoption questions remained.
121. NoSQL picking – Round 2 (mid 2011)
MongoDB: matured by leaps and bounds; increased adoption, support from 10gen, advanced indexing out-of-the-box as well as some batch processing options, a breeze to use, well documented, and fit into our existing code base very nicely.
122. Lessons Learned
Challenges: complexity, missing features, problem-solution fit, resources
Rewards: choices, empowering, community, cost
124. What it means to a startup
Yay! I’m back!
Development capacity
125. Immediate Benefits
• No more maintaining custom application-layer
secondary indexing code
• Single binary installation greatly simplifies
administration
126. What it means to a startup
Honestly, I thought
I’d never see you
guys again!
Development capacity
127. Immediate Benefits
• No more maintaining custom application-layer
secondary indexing code
• Single binary installation greatly simplifies
administration
• Our NoSQL could now support our domain
model
131. Other Benefits
• Ad hoc queries and reports became easier to write with JavaScript: no need for a Java developer to write MapReduce code to extract the data in a usable form, as was needed with HBase.
• Simpler backups: HBase mostly relied on HDFS redundancy; intra-cluster replication is available but experimental and a lot more involved to set up.
• Great documentation
• Great adoption and community
134. And less of this
Source: socialbutterflyclt.com
135. Recap & Final Thoughts
• 3 Vs of Big Data:
– Volume
– Velocity
– Variety
• Big Data technologies are complementary to SQL and RDBMS
• Until machines can think for themselves, Data Science will be increasingly important
136. Recap & Final Thoughts
• Be prepared to deal with less mature tech
• Be as flexible as the data => fearless refactoring
• The importance of ease of use and administration cannot be overstated for a small startup
Big science: Large Hadron Collider (LHC)Sensor networks: forest fire detectionCall detail record, a record of a (billing) event produced by a telecommunication network element
Scaling here means maintaining throughput of computation and analysis while data sizes increase: divide up the work on multiple machines
Taking a look at the amount of storage we are using as of a month ago in Mongo; this includes indexes
The point is that we don’t need to track the entire web: just the subset belonging to influencers!
There is a different perspective on “Web Scale” that has to do with the nature of the data on the web
Take the approach of using a simplified entity model with semi-structured data storage formats like JSON: facilitate capturing related attribute structures; enable the flexibility of defining new attributes as they are discovered.
CLOB pre-allocated space
Sparse maps
- This is something we thought we needed back in early 2010. Traackr needs to score its entire DB of influencers on a weekly basis to adjust the weighted averages and stats that drive the scores. This means processing north of 750K sites, over 650K influencers and, soon, millions of posts (post-level attributes).
HBase: came across as the most mature at the time, with several deployments, a healthy community, "out-of-the-box" secondary indexes through a contrib module, and support for batch processing using Hadoop/MapReduce. Hadoop and its maturity were a big reason we picked HBase.
Had to deal with a complex architecture right from the start: minimum number of data nodes to support replication; odd number of Zookeeper nodes to avoid voting deadlocks; co-locating region servers means paying close attention to JVM resources; Master = SPOF; co-locating job trackers means paying close attention to JVM resources.
- Quick overview of how we modeled a list in HBase => saved searches. This is what our customers see. Let's consider the name, the ranks of the influencers and the influencer references.
Each row has a unique key: the A-list id. We would group general attributes under one family of columns appropriately named "attributes" (benefit: can get A-list information without loading all the influencers). We would group the influencer references under another family of columns named "influencerIds".
Now we can see where the attributes we see on the screen are stored
- We coded the pagination and indexing features ourselves and contributed them back- Felt really good about it!
It wasn't bad enough that we had to write our own code to support our indexing needs; we now had to maintain a third-party code base that was quickly becoming outdated!
Simplified example for posts
Embedded list of references to sites augmented with influencer-specific site attributes (e.g. percent contribution to content)
siteId indexed for “find influencers connected to site X”
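The two notes above (embedded site references plus a siteId index) can be sketched in plain Python; the ids, field names, and the dict standing in for the datastore's index are all hypothetical:

```python
# Hypothetical sketch of the document model described above: each
# influencer embeds site references carrying influencer-specific
# attributes (e.g. percent contribution), and an index on siteId
# answers "find influencers connected to site X".
influencers = [
    {"_id": "arianna",
     "sites": [{"siteId": "hp", "percent_contribution": 80}]},
    {"_id": "shaun",
     "sites": [{"siteId": "hp", "percent_contribution": 20}]},
]

# Stand-in for the datastore's secondary index on sites.siteId.
site_index = {}
for doc in influencers:
    for ref in doc["sites"]:
        site_index.setdefault(ref["siteId"], []).append(doc["_id"])

print(site_index["hp"])  # ['arianna', 'shaun']
```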