Indic threads pune12-nosql now and path ahead

NoSQL: Now and Path Ahead
Shubham Kumar Srivastava
MakeMyTrip

Abstract

What and Why: NoSQL

Fundamentals

Use Case

Challenges

Path Ahead

What is NoSQL
A database that does not adhere to the traditional relational database
management system (RDBMS) structure.

Why NoSQL

• Scalability and Performance

• Cost

• Data Modeling

Why NoSQL: Motives and Drivers
Scalability and Performance

• Horizontal scalability (adding machines) is better than vertical scalability (upgrading one machine).

• Hardware is getting cheaper and processing power is increasing.

• Less operational complexity compared with RDBMS solutions.

• Most solutions provide automatic sharding and similar features by default.

Why NoSQL: Motives and Drivers contd.

Cost

• Reaching NoSQL-like scale with traditional solutions comes at a hefty cost.

• Commodity hardware, software versions, upgrades,
maintenance.

• This led organizations to look for alternatives and created
the need for a cost-effective scale-out option.

Data Modeling
SQL has been built for:

• Concurrency, consistency, integrity

• Summations, aggregations, groupings

• Schema says: "What answers do I have?"

Data Modeling

• A plain key-value store is very powerful and fits most use cases for
a NoSQL solution.

• Hierarchical or graph-like data modeling and processing.

• Values such as maps of maps of maps.

• Document databases, which can store arbitrarily complex objects.

• Document-based indexing data stores are a huge success.

At times software applications are not limited to these constraints. This led to
data models such as:

Key/Value Store:
Redis, MemcacheDB, Voldemort, etc.

Wide Column Store / Column Families:
Cassandra, Hadoop (HBase), Hypertable, Cloudera, etc.

Document-Based Stores:
Solr, Lucene, MongoDB, CouchDB, Terrastore, etc.

Graph Data Store:
Neo4j, GraphBase, FlockDB, etc.


• Schema says: "What are the questions?"

• Data modeling is based on the set of queries.

• Exploit denormalization and duplication.

• Use aggregates.

• Manage joins in the application, combined with aggregation and denormalization.

Some Fundamentals
CAP Theorem

At most two of the following three properties can be satisfied
in a shared/distributed system:

• Consistency

• Availability

• Tolerance to network partitions

Explanation
Use case:
Scaling Web Apps

Critical facts:
• Network outages are common
• Customer shopping carts, email search, and social network
queries can tolerate stale data

How:
Compromise on consistency in order to remain available, rather than
disrupt user service during outages.

Explanation

• Rather than requiring consistency after every transaction, it
is enough for the database to eventually be in a consistent
state.

• Brewer's CAP theorem says you have no choice if you want
to scale up.

Explanation contd.
Sharp contrast: high-speed financial application

• Highly transactional

• Consistent

• Automated

• Cannot live with eventual consistency

ACID vs BASE
ACID
• Atomic: Everything in a transaction succeeds or the
entire transaction is rolled back.

• Consistent: A transaction cannot leave the database in an
inconsistent state.

• Isolated: Transactions cannot interfere with each other.

• Durable: Completed transactions persist, even when
servers restart.

Some Fundamentals contd.
BASE
Basically Available

Soft state

Eventual consistency
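
To make "eventual consistency" concrete, here is a minimal in-memory sketch in Java. It is not taken from any particular NoSQL product: the `Replica` class, its methods, and the last-write-wins merge rule are all assumptions chosen for illustration. Each replica accepts writes on its own (so it stays available), may serve stale reads in the meantime (soft state), and agrees with its peer once the two have exchanged their data.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only: a replica that reconciles with last-write-wins.
final class Replica {
    /** A value tagged with the logical time of the write that produced it. */
    static final class Versioned {
        final String value;
        final long timestamp;

        Versioned(String value, long timestamp) {
            this.value = value;
            this.timestamp = timestamp;
        }
    }

    private final Map<String, Versioned> data = new HashMap<>();

    /** Accepts a write locally without contacting any other replica. */
    void write(String key, String value, long timestamp) {
        mergeOne(key, new Versioned(value, timestamp));
    }

    /** Returns the local value, which may be stale until a sync has happened. */
    String read(String key) {
        Versioned v = data.get(key);
        return v == null ? null : v.value;
    }

    /** Pulls the other replica's entries, keeping the newer write for each key. */
    void syncFrom(Replica other) {
        for (Map.Entry<String, Versioned> e : other.data.entrySet()) {
            mergeOne(e.getKey(), e.getValue());
        }
    }

    private void mergeOne(String key, Versioned incoming) {
        Versioned current = data.get(key);
        if (current == null || incoming.timestamp > current.timestamp) {
            data.put(key, incoming);
        }
    }

    public static void main(String[] args) {
        Replica a = new Replica();
        Replica b = new Replica();
        a.write("cart", "1 item", 1L);
        b.write("cart", "2 items", 2L);   // a has not seen this write yet
        System.out.println("before sync, a reads: " + a.read("cart"));
        a.syncFrom(b);
        b.syncFrom(a);
        System.out.println("after sync, a reads: " + a.read("cart"));
    }
}
```

Last-write-wins is only one possible reconciliation rule, and it silently discards the older write; real systems may use other strategies.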

Consistent Hashing
A common way to load balance across machines.

The machine chosen to cache object o will be:

hash(o) mod n
where n is the total number of machines

Consistent Hashing contd.

• Adding a machine to the cache means
hash(o) mod (n + 1)

• Removing a machine from the cache means
hash(o) mod (n - 1)

• Result in either case: disaster :(

Almost every object now maps to a different machine, so the machines are
swamped with redistribution.
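
The effect of changing n can be checked with a few lines of Java. This is a sketch under a simplifying assumption: sequential integer ids stand in for hash(o), and the class and method names are made up for the example.

```java
// Hypothetical illustration of naive "hash(o) mod n" placement.
final class ModuloPlacement {
    /** Naive placement: hash(o) mod n. The object id stands in for hash(o). */
    static int machineFor(int objectHash, int machines) {
        return Math.floorMod(objectHash, machines);
    }

    /** Counts objects whose machine changes when the cluster goes from oldN to newN machines. */
    static int countMoved(int objects, int oldN, int newN) {
        int moved = 0;
        for (int o = 0; o < objects; o++) {
            if (machineFor(o, oldN) != machineFor(o, newN)) {
                moved++;
            }
        }
        return moved;
    }

    public static void main(String[] args) {
        int moved = countMoved(10_000, 4, 5);
        System.out.println(moved + " of 10000 objects change machine when going from 4 to 5 machines");
    }
}
```

With these inputs, 8,000 of the 10,000 objects move: an id keeps its machine only when it leaves the same remainder modulo 4 and modulo 5, which holds for 4 out of every 20 consecutive ids.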


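Commonly, a hash function (e.g. MD5) will
map a value into a 128-bit key in the range 0 to 2^128 - 1 (or even a 32-bit
key, as shown next).

Both the key and the machine are hashed with the same function, so both land
on the same ring of hash values. An object is stored on the first machine
found at or after the object's position on the ring.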

Adding a Node

Removing a Node
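
Adding and removing a node can be sketched with a sorted map acting as the ring. This is a minimal illustration, not any product's implementation: the `HashRing` class is hypothetical, it uses the first 32 bits of MD5 as the shared hash function, and it leaves out the virtual nodes that real systems typically add to even out the load.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of a consistent hash ring (no virtual nodes).
final class HashRing {
    // Ring position -> machine name, kept sorted so lookups can walk "clockwise".
    private final TreeMap<Long, String> ring = new TreeMap<>();

    /** Hashes keys and machine names with the same function: the first 32 bits of MD5. */
    static long hash(String value) {
        try {
            byte[] d = MessageDigest.getInstance("MD5")
                    .digest(value.getBytes(StandardCharsets.UTF_8));
            return ((d[0] & 0xFFL) << 24) | ((d[1] & 0xFFL) << 16)
                    | ((d[2] & 0xFFL) << 8) | (d[3] & 0xFFL);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    void addMachine(String machine) {
        ring.put(hash(machine), machine);
    }

    void removeMachine(String machine) {
        ring.remove(hash(machine));
    }

    /** Returns the first machine at or after the key's position, wrapping around the ring. */
    String machineFor(String key) {
        if (ring.isEmpty()) {
            throw new IllegalStateException("no machines on the ring");
        }
        Map.Entry<Long, String> entry = ring.ceilingEntry(hash(key));
        return (entry != null ? entry : ring.firstEntry()).getValue();
    }

    public static void main(String[] args) {
        HashRing ring = new HashRing();
        ring.addMachine("node-A");
        ring.addMachine("node-B");
        ring.addMachine("node-C");
        System.out.println("key-42 lives on " + ring.machineFor("key-42"));
        ring.addMachine("node-D");   // only keys that now fall to node-D change owner
        System.out.println("key-42 lives on " + ring.machineFor("key-42"));
    }
}
```

When a machine is added, the only keys that change owner are the ones the new machine takes over; when it is removed, those keys fall back to the next machine on the ring and every other key stays where it was.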

Use Case and NoSQL Solution
Problem:

Need to store bookings per day for all hotels.
Queries are centered around cities and regions.

Hotel count: 1 million

Date range: now to the next 365 × 2 days
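
For a sense of scale, 1,000,000 hotels × 730 days gives up to 730,000,000 hotel-day entries. The Java sketch below shows one way such data could be keyed so that a city query becomes a single contiguous slice. The row-key layout (`cityCode|date`), the class, and its methods are assumptions made for illustration, with an in-memory sorted map standing in for the data store; the slides do not prescribe this design.

```java
import java.time.LocalDate;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration: bookings keyed by city and day, modeled around the query.
final class BookingsByCityAndDay {
    // Row key "cityCode|yyyy-MM-dd" -> (hotelId -> number of bookings).
    private final TreeMap<String, Map<String, Integer>> rows = new TreeMap<>();

    /** LocalDate.toString() is ISO yyyy-MM-dd, so keys for one city sort by date. */
    static String rowKey(String cityCode, LocalDate day) {
        return cityCode + "|" + day;
    }

    void recordBooking(String cityCode, LocalDate day, String hotelId) {
        rows.computeIfAbsent(rowKey(cityCode, day), k -> new TreeMap<>())
                .merge(hotelId, 1, Integer::sum);
    }

    /** Total bookings in a city between two days, inclusive; 'from' must not be after 'to'. */
    int bookingsInCity(String cityCode, LocalDate from, LocalDate to) {
        int total = 0;
        for (Map<String, Integer> row
                : rows.subMap(rowKey(cityCode, from), true, rowKey(cityCode, to), true).values()) {
            for (int count : row.values()) {
                total += count;
            }
        }
        return total;
    }

    /** Upper bound on hotel-day entries for the sizes quoted in the problem statement. */
    static long maxCells(long hotels, long days) {
        return hotels * days;
    }

    public static void main(String[] args) {
        System.out.println("max hotel-day entries: " + maxCells(1_000_000L, 365L * 2));
    }
}
```

The design choice here is the one the data modeling slides argue for: the key is shaped by the question being asked (bookings by city over a date range) rather than by a normalized schema.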

NoSQL: Path Ahead

• ACID equivalence (Neo4j, CouchDB, etc.)

• Transaction support

• Atomicity

• MVCC

NoSQL: Path Ahead contd.
Possible Solution

Work with the SQL database for creation, updates, etc.

Archive the data in NoSQL for queries, analysis, etc.

Enterprise Adoption and Challenges

• NoSQL is largely a good fit for unstructured data.

• SQL is the best choice for a broad range of
traditional workloads.

Shout out loud

Hybrid

ACID + BASE

They are not alternatives but supplements to each other.

• Maturity

• Support

• Skill set and administration/operation

• Analytics and BI support

References
• Nancy Lynch and Seth Gilbert, "Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services", ACM SIGACT News, Volume 33, Issue 2 (2002), pp. 51-59.
• "Brewer's CAP Theorem", julianbrowne.com, retrieved 02-Mar-2010.
• "Brewers CAP theorem on distributed systems", royans.net.
• "CAP Twelve Years Later: How the 'Rules' Have Changed"; online resource.
• E. Brewer, "Towards Robust Distributed Systems," Proc. 19th Ann. ACM Symp. Principles of Distributed
Computing (PODC 00), ACM, 2000, pp. 7-10; online resource.
• D. Abadi, "Problems with CAP, and Yahoo's Little Known NoSQL System," DBMS Musings, blog, 23 Apr.
2010; online resource.
• C. Hale, "You Can't Sacrifice Partition Tolerance," 7 Oct. 2010; online resource.
• Facebook: Scaling Out; online resource.
• Gemstone: The Hardest Problems in Data Management; online resource.
• The Log-Structured Merge-Tree (research paper).
• CodeProject: Consistent Hashing; online resource.
• HighlyScalable: NoSQL Data Modeling Techniques; online resource.
• eBay Tech Blog: Cassandra Data Modeling Best Practices; online resource.
• John D. Cook: ACID vs BASE; online resource.
• Merkle Trees.
• Phi Accrual Failure Detector (research paper).

Backup Slides

Better than the original one :)

Document-Based Data Store
{
  _id : ObjectId("4e77bb3b8a3e000000004f7a"),
  when : Date("2011-09-19T02:10:11.3Z"),
  author : "alex",
  title : "No Free Lunch",
  text : "This is the text of the post. It could be very long.",
  tags : [ "business", "ramblings" ],
  votes : 5,
  voters : [ "jane", "joe", "spencer", "phyllis", "li" ],
  comments : [
    { who : "jane", when : Date("2011-09-19T04:00:10.112Z"),
      comment : "I agree." },
    { who : "meghan", when : Date("2011-09-20T14:36:06.958Z"),
      comment : "You must be joking. etc etc ..." }
  ]
}

Use Case 1
E-commerce Site

Problem: Record user preferences, e.g.
location, IP, currency selected, source of traffic,
and multiple other dynamic values.

Solution: In a column family (CF) based structure, keep it simple.

UserId_Key:
Pref1_Name:Value1, Pref2_Name:Value2,
... PrefN_Name:ValueN

Use Case 1
RowKey: 1350136093705_6501082438199894
=> (column=1350136093764, value=-3242432#911167901131523, timestamp=1350136093766000)
=> (column=1350283322499, value=GOI#200701231712126570, timestamp=1350283322502001)
=> (column=1350785230322, value=BOM#200701251747233158, timestamp=1350785230324001)

RowKey: 1354499614310_10861558002828044
=> (column=1354499614368, value=TRV#201104071059204768, timestamp=1354499614370000, ttl=1728000)
-------------------
RowKey: 1349760150553_6114662943774777
=> (column=1349760152066, value=BLR#200802111324575807, timestamp=1349760152068001)
-------------------
RowKey: 1349805109805_6167423558533191
=> (column=1349805111833, value=TRV#312254274337517, timestamp=1349805111835001)
-------------------
RowKey: 1354435656227_7908056941568359
=> (column=1354435656367, value=IDR#200701211254519381, timestamp=1354435656369000, ttl=1728000)
-------------------
RowKey: 1347648097261_15570089270962881
=> (column=1347648097304, value=DEL#201101192008115545, timestamp=1347648097307000)

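Use Case 1
Get

import java.io.IOException;
import java.nio.charset.CharacterCodingException;
import java.util.LinkedHashMap;
import java.util.Map;
import me.prettyprint.cassandra.serializers.StringSerializer;
import me.prettyprint.hector.api.Keyspace;
import me.prettyprint.hector.api.beans.ColumnSlice;
import me.prettyprint.hector.api.beans.HColumn;
import me.prettyprint.hector.api.factory.HFactory;
import me.prettyprint.hector.api.query.QueryResult;
import me.prettyprint.hector.api.query.SliceQuery;

// Reads the named preference columns from one user's row.
private Map<String, String> getPreferences(Keyspace keySpace, String userId,
        String... preferenceNames) throws IOException, CharacterCodingException {
    SliceQuery<String, String, String> rsq = HFactory.createSliceQuery(keySpace,
            StringSerializer.get(), StringSerializer.get(), StringSerializer.get());
    rsq.setColumnFamily(USER_PREFERENCE);
    rsq.setKey(userId);
    rsq.setColumnNames(preferenceNames);

    QueryResult<ColumnSlice<String, String>> orows = rsq.execute();
    // LinkedHashMap keeps the columns in the order Cassandra returned them.
    Map<String, String> preferenceMap = new LinkedHashMap<String, String>();
    for (HColumn<String, String> column : orows.get().getColumns()) {
        preferenceMap.put(column.getName(), column.getValue());
    }
    return preferenceMap;
}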

Use Case 1
Save

// Fragment: keySpace, rowkey, colkey, colvalue and ttlUserPreferences come from the surrounding code.
Mutator<String> m = HFactory.createMutator(keySpace, StringSerializer.get());

HColumn<String, String> userPreferences = HFactory.createColumn(colkey, colvalue,
        StringSerializer.get(), StringSerializer.get());

// The column expires after the TTL (in seconds).
userPreferences.setTtl(ttlUserPreferences);

m.addInsertion(rowkey, USER_PREFERENCE, userPreferences);

m.execute();

Use Case 2
Online Travel Site

Problem: Need to know different metrics for a
city's hotels, e.g.:

Hotels booked in the last X time
Hotels last viewed in Y time
Hotels left with Z inventory

Use Case 2
RowKey: 2d323436353731
=> (super_column=911167901297486,
     (column=6c6173747669657765646d657373616765, value=VIEWED#Last viewed 23 hour(s) ago., timestamp=1354962852610000)
     (column=6c6173747669657765646d657373616762, value=Inventory#20, timestamp=1354962852610000)
     (column=6c6173747669657765646d657373616769, value=Bookings#8, timestamp=135496282610000))
-------------------
RowKey: 58524f
=> (super_column=200903041759196196,
     (column=6c617374626f6f6b65646d657373616765, value=Booked#Last booked 1 day(s) ago., timestamp=1347781187842000)
     (column=6c6173747669657765646d657373616765, value=VIEWED#Last viewed 2 hours ago., timestamp=1347707080147000))
=> (super_column=200903041848352230,
     (column=6c6173747669657765646d657373616765, value=VIEWED#Last viewed 1 day(s) ago., timestamp=1347266107708000))

Use Case 2
// Fragment: document and documents are defined by the surrounding code (not shown on the slide).
SuperSliceQuery<String, String, String, String> superQuery = HFactory.createSuperSliceQuery(getKeySpace(),
        StringSerializer.get(), StringSerializer.get(),
        StringSerializer.get(), StringSerializer.get());
superQuery.setColumnFamily(SUPER_SOCIAL_MESSAGE).setKey(cityCode);

QueryResult<SuperSlice<String, String, String>> result = superQuery.execute();
List<HSuperColumn<String, String, String>> superColumns = result.get().getSuperColumns();

if (superColumns != null) {
    // One super column per hotel; its columns hold that hotel's messages.
    for (HSuperColumn<String, String, String> superColumn : superColumns) {
        Map<String, String> messages = new HashMap<String, String>();
        List<HColumn<String, String>> columns = superColumn.getColumns();
        if (columns != null) {
            for (HColumn<String, String> column : columns) {
                messages.put(column.getName(), column.getValue());
            }
        }
        /* The equivalent doc */
        document.addField(superColumn.getName(), messages);
        documents.add(document);
    }
}

Pig Script : MR
<document>

<pigscript start="-16" end="-43200" start1="-1441" end1="-10080" start2="0" end2="-15" start3="0" end3="-1440">

<comment>Delete All Messages</comment>

<query><![CDATA[rows0 = LOAD 'cassandra://LH/HotelMessage' USING com.mmt.solr.hotels.cassandra.CassandraStorage() as (key:chararray, cols:bag{T:tuple(name:chararray, value:chararray) } );]]></query>

<query><![CDATA[cols0 = FOREACH rows0 GENERATE key as key,flatten($1) as (name:chararray, value:chararray);]]></query>

<query><![CDATA[userhotel0 = FOREACH cols0 GENERATE key as key,com.mmt.solr.hotels.cassandra.ByteBufferToString($1) as name,com.mmt.solr.hotels.cassandra.ByteBufferToString($2) as value;]]></query>

<query><![CDATA[uriCounts0 = FOREACH userhotel0 GENERATE key as citycode,com.mmt.solr.hotels.cassandra.ToBag(TOTUPLE(name,null));]]></query>

<comment>Last Viewed start 15 minutes to 30 days ago</comment>

<query><![CDATA[rows = LOAD 'cassandra://LH/LastViewedHotels?slice_start=#start&slice_end=#end&limit=1024&reversed=true' USING com.mmt.solr.hotels.cassandra.CassandraStorage() as (key:chararray, cols:bag{T:tuple(name:long,
value:chararray) } );]]></query>

<query><![CDATA[cols = FOREACH rows GENERATE key as key,flatten($1) as (name:long, value:chararray);]]></query>

<query><![CDATA[userhotel = FOREACH cols GENERATE key as key,com.mmt.solr.hotels.cassandra.LongToHours($1) as name,com.mmt.solr.hotels.cassandra.ByteBufferToString($2) as value;]]></query>

<query><![CDATA[userhotelByCity = FOREACH userhotel GENERATE key as key,flatten($1) as name,flatten(org.apache.pig.piggybank.evaluation.string.Split(value,'#',2)) as (citycode:chararray,hotelid:chararray);]]></query>

<query><![CDATA[groupByhotels = GROUP userhotelByCity BY hotelid;]]></query>

<query><![CDATA[uriCounts = FOREACH groupByhotels { D = LIMIT userhotelByCity 1;

GENERATE flatten(D.citycode) as citycode,com.mmt.solr.hotels.cassandra.ToBag(

TOTUPLE(group,com.mmt.solr.hotels.cassandra.StringAppend('VIEWED#Last viewed ',D.name,' ago.')));

};]]></query>

<comment>Last Booked 1 to 8 days ago</comment>

<query><![CDATA[rows1 = LOAD 'cassandra://LH/BookedHotels?slice_start=#startA&slice_end=#endA&limit=1024&reversed=true' USING com.mmt.solr.hotels.cassandra.CassandraStorage() as (key:chararray, cols:bag{T:tuple(name:long,
value:chararray) } );]]></query>

<query><![CDATA[cols1 = FOREACH rows1 GENERATE key as key,flatten($1) as (name:long, value:chararray);]]></query>

<query><![CDATA[userhotel1 = FOREACH cols1 GENERATE key as key,com.mmt.solr.hotels.cassandra.LongToHours($1) as name,com.mmt.solr.hotels.cassandra.ByteBufferToString($2) as value;]]></query>

<query><![CDATA[userhotelByCity1 = FOREACH userhotel1 GENERATE key as key,flatten($1) as name,flatten(org.apache.pig.piggybank.evaluation.string.Split(value,'#',2)) as (citycode:chararray,hotelid:chararray);]]></query>

<query><![CDATA[groupByhotels1 = GROUP userhotelByCity1 BY hotelid;]]></query>

<query><![CDATA[uriCounts1 = FOREACH groupByhotels1 { D = LIMIT userhotelByCity1 1;

GENERATE flatten(D.citycode) as citycode,com.mmt.solr.hotels.cassandra.ToBag(

TOTUPLE(group,com.mmt.solr.hotels.cassandra.StringAppend('Booked#Last booked ',D.name,' ago.')));

};]]></query>

Criteria to Evaluate NoSQL Solutions

Internal partitioning

Automated flexible data distribution

Hot swappable nodes

Replication-style

Automated failover strategy
