A brief look at databases.
Takes SMS Gyan's architecture as a case study.
Covers MySQL, Elasticsearch, and Redis, and how and when to choose SQL or NoSQL.
About Me
Shyam Anand
Senior Software Engineer at Google.
Previously worked with several startups.
Works on distributed systems, system architecture, etc.
linkedin.com/in/shyamanand
We’ll talk about databases, with SMS Gyan as a case study.
● SMS Gyan was launched by Innoz in 2008.
● An SMS-based answering engine that came to be known as “Internet on SMS”.
Introduction
[Diagram labels: SMS, airtel, HTTP, SMS Gyan]
Database Systems
Data modelling is perhaps the most important part of developing software.
Decisions on how to structure, store, and retrieve data can affect the entire
application throughout its life.
There are several factors to consider while choosing a database, such as:
● Structure of the data
● Expected data volume
● Performance requirements
Relational vs Non-Relational Databases
Relational
For structured data.
Stores data in tables that may share information (and
hence, “Relational”).
Uses JOIN queries to access data in different tables.
Performance tuning becomes necessary with large
volumes of data.
Relatively difficult to scale out.
Lacks flexibility in how data is stored.
Atomicity, Consistency, Isolation, and Durability (ACID)
guarantees.
Non-Relational
For unstructured data (documents).
No concept of tables, fields/columns.
MongoDB and Elasticsearch store data as JSON-like
documents.
Supports data locality.
Can easily support very large volumes of data.
Easier to scale out, because of native support for
replication, sharding, etc.
Can support changes to the structure of data stored,
making it easier to modify the application layer.
No transactions (typically), so no ACID guarantees.
Some provide Eventual Consistency.
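The ACID guarantees listed under Relational can be illustrated with a small sketch. It uses Python's built-in sqlite3 module purely as a stand-in for a relational database; the table layout and the record_query helper are hypothetical and not taken from SMS Gyan. It shows atomicity only: two writes either both happen or neither does.

```python
import sqlite3

# sqlite3 is only a convenient stand-in for a relational database here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (phone TEXT PRIMARY KEY, is_subscribed INTEGER NOT NULL)")
conn.execute("CREATE TABLE queries (phone TEXT NOT NULL, query TEXT NOT NULL)")
conn.commit()


def record_query(conn, phone, query, fail=False):
    """Insert a user and a query as one unit; roll both back on error."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute("INSERT INTO users VALUES (?, 1)", (phone,))
            if fail:
                # Hypothetical failure between the two writes.
                raise RuntimeError("simulated failure")
            conn.execute("INSERT INTO queries VALUES (?, ?)", (phone, query))
        return True
    except RuntimeError:
        return False


def count(conn, table):
    """Return the number of rows in one of the two known tables."""
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
```

After a failed call, neither table contains a row for that phone number. A store without transactions would leave the first write in place.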
Consistency or Availability?
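● Network partitions will inevitably happen in a distributed system.
● During a partition, a system has to give up either consistency or availability (the CAP theorem).
● Choosing between a relational and a non-relational database can boil down to this
question.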
● The first version was a simple PHP app with a MySQL database.
● Supported a few hundred users and a few hundred queries a day.
SMS Gyan
[Diagram labels: Telecom Operator, <network>, smsgyan app (backend), mysql]
MySQL
● A Relational Database Management System (RDBMS).
● One of the most popular databases.
● Free and open-source, easy to get started.
● Reliable and scalable.
Data Modelling
Need to store
● The queries from users
● The answers to the queries (as a local cache)
● User details (network operator, whether a subscriber, etc.)
phone       network  is_subscribed  query  result      source     query_ts
9876543210  airtel   1              MySQL  MySQL is …  wikipedia  2009-11-10 12:00:00
Schema
queries
phone       query  query_ts
9876543210  MySQL  2009-11-10 12:00:00

users
phone       network  is_subscribed  last_active
9876543210  airtel   1              2009-11-10

knowledge_base
query  result                       source
MySQL  MySQL is an open-source ...  wikipedia
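A minimal sketch of the three-table schema follows, again using Python's sqlite3 module as a stand-in for MySQL. The column types, keys, and the answers_for helper are assumptions; the slides give only table names, column names, and one sample row. The JOIN shows how the normalized tables can rebuild the single wide row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Column types and keys below are assumptions; the slides show only names.
conn.executescript("""
CREATE TABLE users (
    phone TEXT PRIMARY KEY,
    network TEXT NOT NULL,
    is_subscribed INTEGER NOT NULL,
    last_active TEXT
);
CREATE TABLE queries (
    phone TEXT NOT NULL REFERENCES users(phone),
    query TEXT NOT NULL,
    query_ts TEXT NOT NULL
);
CREATE TABLE knowledge_base (
    query TEXT PRIMARY KEY,
    result TEXT NOT NULL,
    source TEXT NOT NULL
);
""")
conn.execute("INSERT INTO users VALUES ('9876543210', 'airtel', 1, '2009-11-10')")
conn.execute("INSERT INTO queries VALUES ('9876543210', 'MySQL', '2009-11-10 12:00:00')")
conn.execute(
    "INSERT INTO knowledge_base VALUES ('MySQL', 'MySQL is an open-source ...', 'wikipedia')"
)


def answers_for(conn, phone):
    """Join the three tables to rebuild the wide row for one phone number."""
    return conn.execute(
        """
        SELECT u.phone, u.network, q.query, k.source, q.query_ts
        FROM queries AS q
        JOIN users AS u ON u.phone = q.phone
        JOIN knowledge_base AS k ON k.query = q.query
        WHERE u.phone = ?
        """,
        (phone,),
    ).fetchall()
```

Splitting the data this way stores each answer once in knowledge_base, however many users ask the same query.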
High volume of airtel 121 requests
● The application was receiving a large number of requests (> 1000 qps).
● This slowed the database down and caused requests to fail (SLA violation).
Scaling
[Diagram labels: App, DB, Airtel; two links marked X]
● MySQL FULLTEXT index was used.
● The results were sometimes not accurate, especially for queries that are
sentences or phrases.
● MySQL performance deteriorated as the data volume grew.
Improving search results
queries
query  result                      source
MySQL  MySQL is an open-source...  wikipedia
● Designed for really fast text searches. Supports stemming, ranking, etc.
● Data is stored as documents. Provides REST APIs to read and write data.
● Highly available, scalable, and (relatively) easy to configure.
● Natively supports sharding and replication.
Elasticsearch
[Diagram labels: smsgyan app, Elasticsearch cluster]
Elasticsearch
Cluster: Consists of one or more nodes.
Node: An instance of ES.
Index: A logical namespace that maps to one or more primary shards and can have zero or more
replica shards.
Document: A record stored in ES.
Shard: A single low-level worker unit managed by ES.
Primary Shard: Each document is stored in a primary shard.
Replica Shard: A copy of a Primary shard. Each primary shard can have 0 or more replicas.
Replicas help distribute ES’s load, and can help in failover if a primary shard is unavailable.
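How a document ends up in one primary shard can be sketched as a hash of its routing value modulo the number of primary shards. This is a simplified form of the routing rule described in the Elasticsearch documentation, not its exact implementation: zlib.crc32 stands in for Elasticsearch's own hash function, and the function names and shard labels are hypothetical.

```python
import zlib


def primary_shard_for(doc_id, num_primary_shards):
    """Pick a primary shard from a hash of the document's routing value."""
    if num_primary_shards < 1:
        raise ValueError("an index needs at least one primary shard")
    # crc32 is a stand-in; Elasticsearch uses its own hash function.
    return zlib.crc32(doc_id.encode("utf-8")) % num_primary_shards


def shard_copies(primary, num_replicas):
    """List the copies of one shard: the primary plus its replicas."""
    copies = [f"shard{primary}-primary"]
    copies += [f"shard{primary}-replica{i}" for i in range(1, num_replicas + 1)]
    return copies
```

Because the shard number depends on the primary shard count, changing that count would send existing document IDs to different shards. Replica counts do not affect routing, so they can vary freely.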
Caching
Pagination of results
● SMS replies limit the length of content, so a whole Wikipedia article
would be returned as several pages.
● Users need to send an SMS to retrieve each page.
Redis
● A distributed, in-memory data structure store.
○ Can store simple key-value pairs, Sets, Lists, Sorted Sets, etc., and can perform operations such
as Set union/intersection and push/pop on Lists.
● Can be used as an in-memory key-value db, cache, and message broker.
● Durability is optional.
● Serves a different function from the databases discussed earlier.
Redis
In SMS Gyan
1. Fetch query result (database, or source on internet)
2. Write the entire result into cache, with user’s phone number as key.
3. Extract a page (up to 240 characters) and send it to the user; remove the served
page from the content in cache.
4. If the user requests more pages, repeat step 3.
5. Clear the key if
a. The user sends a different query, or
b. There is no request from user for a specific period of time.
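The steps above can be sketched with a plain dictionary standing in for Redis. The PageCache class, its method names, and the 300-second idle timeout are hypothetical; the slides specify only the 240-character page size and the order of the steps. With a real Redis instance, step 5b would typically rely on key expiry (for example the EXPIRE command) instead of a stored timestamp.

```python
import time

PAGE_SIZE = 240          # page length from step 3
IDLE_TTL_SECONDS = 300   # hypothetical idle timeout for step 5b


class PageCache:
    """Dict-backed stand-in for Redis: phone -> (remaining text, expiry)."""

    def __init__(self, clock=time.monotonic):
        self._store = {}
        self._clock = clock

    def _serve(self, phone, text):
        # Step 3: cut one page off the front and keep the rest cached.
        page, rest = text[:PAGE_SIZE], text[PAGE_SIZE:]
        if rest:
            self._store[phone] = (rest, self._clock() + IDLE_TTL_SECONDS)
        else:
            self._store.pop(phone, None)  # nothing left to page through
        return page

    def new_query(self, phone, result_text):
        # Steps 2 and 5a: a new query overwrites whatever was cached.
        return self._serve(phone, result_text)

    def more(self, phone):
        # Steps 4 and 5b: serve the next page unless the entry has expired.
        entry = self._store.get(phone)
        if entry is None or entry[1] <= self._clock():
            self._store.pop(phone, None)
            return None
        return self._serve(phone, entry[0])
```

Keying the cache by phone number means each user has at most one result in flight, which is why a new query can simply overwrite the old entry.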