SlideShare a Scribd company logo
1 of 38
Download to read offline
for duplicate detection on real time stream
whoami(1) 
15 years of experience, proud to be a programmer 
Writes software for information extraction, nlp, opinion mining (@scale ), and a 
lot of other buzzwords 
Implements scalable architectures 
Member of the JUG-Torino coordination team 
ro.franchini@gmail.com github.com/robfrank 
twitter.com/robfrankie linkedin.com/in/robfrank 
http://www.celi.it http://www.blogmeter.it
Agenda 
What is it? 
Main features 
Caching 
Counters 
Scripting 
How we use it
From the site 
Redis is an open source, BSD licensed, advanced 
key-value cache and store. It is often referred to 
as a data structure server since keys can contain 
strings, hashes, lists, sets, sorted sets, bitmaps 
and hyperloglogs.
Who use it 
Twitter 
Github 
Youporn 
Pinterest 
Groupon 
...
Ecosystem 
Clients in every known language 
Articles, books, presentations 
On High Scalability every other day
Architecture 
Single-threaded server 
Yes: single threaded server 
Remember that when you need to scale 
Single Linux server can handle 500k req/s
Main features 
In memory K/V store 
But with durable persistence 
Master-slave async replica 
Transactions 
Pub/Sub 
Server side LUA scripting
Main features 
Keys with TTL 
LRU eviction 
Keys can contain strings, hashes, lists, sets, sorted sets, 
bitmaps and hyperloglogs 
REDIS cluster on the go (3.0.0-rc1)
K/V store 
Key-value (KV) stores use the associative array (also 
known as a map or dictionary) as their fundamental data 
model. In this model, data is represented as a collection of 
key-value pairs, such that each possible key appears at 
most once in the collection. (wikipedia)
K/V store 
Key 
“plain text” 
name rob 
surname frank 
A C E D B F 
A B C D E F 
String/blobs/bitmaps 
HashTable: Objects 
Linked lists 
Sets
Persistence 
Configurable, two flavors 
RDB: perfect for backup 
AOF: append only log, replayed at startup 
Use AOF + RDB for rock solid persistence 
Automatic cache warm-up at startup!! 
Only RAM: switch off persistence
Common use cases 
Cache 
Queue 
Session replication 
In memory indexes 
Centralized ID generation
Basics 
SET user:1 frank 
GET user:1 → frank 
EXISTS user:2 → 1 
EXPIRE user:1 3600 
INCR count:1 
GET count:1 → 1
Basics 
KEYS user:* → user:1, user:2 
MSET user:1 frank user:2 coder 
MGET user:1 user:2 → frank, coder 
HMSET userdetail:3 name rob surname frank 
HGETALL userdetail:3 → name::rob, surname:: frank
Transactions 
MULTI 
INCR counter:1 
INCR counter:2 
EXEC 
> 1 
> 1 
WATCH counter:3 
val = GET counter:3 
val = val +1 
MULTI 
SET counter:3 $val 
EXEC
Atomic counters 
Operators for key increment 
INCR counter:1 
GET counter:1 → 1 
INCRBY counter:1 9 
GET counter:1 → 10
LUA scripting 
Server side LUA scripting 
A “sort of” stored procedure 
Scripts are sandboxed 
Atomic execution ← bear in mind
LUA scripting 
SCRIPT LOAD "return {KEYS[1],KEYS[2]}" 
"3905aac1828a8f75707b48e446988eaaeb173f13" 
EVALSHA 
3905aac1828a8f75707b48e446988eaaeb173f13 2 user:1 
user:2 
1) "user:1" 
2) "user:2"
Caching: server level 
Configure REDIS as a cache 
maxmemory 1024mb 
maxmemory-policy allkeys-lru 
all the keys will be evicted using an 
approximated LRU algorithm
Caching: TTL on key 
Set a timeout on a key 
SET doc:1 “mydoc.txt” 
EXIPRE doc:1 10 
Or 
SETEX doc:1 10 “mydoc.txt”
Demo
Caching 
+ 
Atomic Counters 
+ 
Atomic LUA scripting
Duplicate detection 
Real time stream of documents from 
the Internet 
20% to 50% of documents are duplicated 
DUPLICATES ARE EVIL 
And customers don’t pay for that :(
Basic Scenario 
5M 3M 3M 
Duplicates 
Producer 
Producer detector NLP Storage 
Producer
Avoid duplicated documents 
Act on producers was 
TOO HARD 
Filter-out them before heavy document analysis (NLP)
Documents 
“Documents” are from: 
twitter 
facebook 
gplus 
instagram 
forums 
blogs
Documents 
Each kind of document has its own natural id 
twitter: status id 
facebook: post id 
forum: URL 
blog: URL 
We don’t want this IDs inside our system
Duplicate and id generation 
Producer 
2M 
Producer 
Producer 
Duplicate 
detector - 
ID 
generatio 
n 
Analysis 
Storage 
3M 
3M 
Duplicate 
detector - 
ID 
generatio 
n 
1M Analysis 1M 
5M
Map external keys to internal UID 
Generate an ID for each document 
IDs are generated using daily named counters: 
INCR day:20141028 → 12576 
INCR day:20141010 → 23412576 
Cache generated ID 
tw_1234578688 → day:20141028;12576
Map external keys to internal UID 
Documents are internally stored on different storage 
systems with their generated id 
globalId→ 20141028:3456789
Operations 
Natural Keys are cached with TTL 
Documents out of time are parked in a staging area 
Duplicated documents are usually dropped
LRU cache, counters and LUA 
LUA scripts are executed atomically 
Wrote a simple script to: 
return previous mapped id 
or generate id and store key and id in cache 
EVALSHA “sha” 2 20141028 tw_1234566 → 20141028:123 
GET tw_1234566 → 20141028:123
Demo
Deployment 
Pre-production phase 
Single server 
70M keys in 10GB of RAM 
In production with a simple M/S
Alternatives 
PostgreSQL 
sequence(s) 
table OR hstore 
Hazelcast (we are java based) 
in memory 
write your own persistence
Q/A
References 
http://redis.io/ 
http://redis.io/commands 
http://stackoverflow.com/questions/tagged/redis 
http://try.redis.io/

More Related Content

What's hot

Super Sizing Youtube with Python
Super Sizing Youtube with PythonSuper Sizing Youtube with Python
Super Sizing Youtube with Pythondidip
 
Presto Summit 2018 - 09 - Netflix Iceberg
Presto Summit 2018  - 09 - Netflix IcebergPresto Summit 2018  - 09 - Netflix Iceberg
Presto Summit 2018 - 09 - Netflix Icebergkbajda
 
Principles of Monitoring Microservices
Principles of Monitoring MicroservicesPrinciples of Monitoring Microservices
Principles of Monitoring MicroservicesMichael Ducy
 
Parquet Hadoop Summit 2013
Parquet Hadoop Summit 2013Parquet Hadoop Summit 2013
Parquet Hadoop Summit 2013Julien Le Dem
 
Argo Workflows 3.0, a detailed look at what’s new from the Argo Team
Argo Workflows 3.0, a detailed look at what’s new from the Argo TeamArgo Workflows 3.0, a detailed look at what’s new from the Argo Team
Argo Workflows 3.0, a detailed look at what’s new from the Argo TeamLibbySchulze
 
MongoDB performance
MongoDB performanceMongoDB performance
MongoDB performanceMydbops
 
AIXpert - AIX Security expert
AIXpert - AIX Security expertAIXpert - AIX Security expert
AIXpert - AIX Security expertdlfrench
 
SparkSQL: A Compiler from Queries to RDDs
SparkSQL: A Compiler from Queries to RDDsSparkSQL: A Compiler from Queries to RDDs
SparkSQL: A Compiler from Queries to RDDsDatabricks
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesDatabricks
 
Airflow at lyft
Airflow at lyftAirflow at lyft
Airflow at lyftTao Feng
 
How to Build a Scylla Database Cluster that Fits Your Needs
How to Build a Scylla Database Cluster that Fits Your NeedsHow to Build a Scylla Database Cluster that Fits Your Needs
How to Build a Scylla Database Cluster that Fits Your NeedsScyllaDB
 
A Technical Introduction to WiredTiger
A Technical Introduction to WiredTigerA Technical Introduction to WiredTiger
A Technical Introduction to WiredTigerMongoDB
 
Apache kafka performance(latency)_benchmark_v0.3
Apache kafka performance(latency)_benchmark_v0.3Apache kafka performance(latency)_benchmark_v0.3
Apache kafka performance(latency)_benchmark_v0.3SANG WON PARK
 
AWS EMR Cost optimization
AWS EMR Cost optimizationAWS EMR Cost optimization
AWS EMR Cost optimizationSANG WON PARK
 
Using LLVM to accelerate processing of data in Apache Arrow
Using LLVM to accelerate processing of data in Apache ArrowUsing LLVM to accelerate processing of data in Apache Arrow
Using LLVM to accelerate processing of data in Apache ArrowDataWorks Summit
 
Distributed Tracing for Kafka with OpenTelemetry with Daniel Kim | Kafka Summ...
Distributed Tracing for Kafka with OpenTelemetry with Daniel Kim | Kafka Summ...Distributed Tracing for Kafka with OpenTelemetry with Daniel Kim | Kafka Summ...
Distributed Tracing for Kafka with OpenTelemetry with Daniel Kim | Kafka Summ...HostedbyConfluent
 
Apache Kafka vs. Cloud-native iPaaS Integration Platform Middleware
Apache Kafka vs. Cloud-native iPaaS Integration Platform MiddlewareApache Kafka vs. Cloud-native iPaaS Integration Platform Middleware
Apache Kafka vs. Cloud-native iPaaS Integration Platform MiddlewareKai Wähner
 

What's hot (20)

Presto
PrestoPresto
Presto
 
Super Sizing Youtube with Python
Super Sizing Youtube with PythonSuper Sizing Youtube with Python
Super Sizing Youtube with Python
 
Presto Summit 2018 - 09 - Netflix Iceberg
Presto Summit 2018  - 09 - Netflix IcebergPresto Summit 2018  - 09 - Netflix Iceberg
Presto Summit 2018 - 09 - Netflix Iceberg
 
Principles of Monitoring Microservices
Principles of Monitoring MicroservicesPrinciples of Monitoring Microservices
Principles of Monitoring Microservices
 
Parquet Hadoop Summit 2013
Parquet Hadoop Summit 2013Parquet Hadoop Summit 2013
Parquet Hadoop Summit 2013
 
Argo Workflows 3.0, a detailed look at what’s new from the Argo Team
Argo Workflows 3.0, a detailed look at what’s new from the Argo TeamArgo Workflows 3.0, a detailed look at what’s new from the Argo Team
Argo Workflows 3.0, a detailed look at what’s new from the Argo Team
 
MongoDB performance
MongoDB performanceMongoDB performance
MongoDB performance
 
AIXpert - AIX Security expert
AIXpert - AIX Security expertAIXpert - AIX Security expert
AIXpert - AIX Security expert
 
SparkSQL: A Compiler from Queries to RDDs
SparkSQL: A Compiler from Queries to RDDsSparkSQL: A Compiler from Queries to RDDs
SparkSQL: A Compiler from Queries to RDDs
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization Opportunities
 
Airflow at lyft
Airflow at lyftAirflow at lyft
Airflow at lyft
 
How to Build a Scylla Database Cluster that Fits Your Needs
How to Build a Scylla Database Cluster that Fits Your NeedsHow to Build a Scylla Database Cluster that Fits Your Needs
How to Build a Scylla Database Cluster that Fits Your Needs
 
A Technical Introduction to WiredTiger
A Technical Introduction to WiredTigerA Technical Introduction to WiredTiger
A Technical Introduction to WiredTiger
 
Apache kafka performance(latency)_benchmark_v0.3
Apache kafka performance(latency)_benchmark_v0.3Apache kafka performance(latency)_benchmark_v0.3
Apache kafka performance(latency)_benchmark_v0.3
 
AWS EMR Cost optimization
AWS EMR Cost optimizationAWS EMR Cost optimization
AWS EMR Cost optimization
 
Why to docker
Why to dockerWhy to docker
Why to docker
 
Using LLVM to accelerate processing of data in Apache Arrow
Using LLVM to accelerate processing of data in Apache ArrowUsing LLVM to accelerate processing of data in Apache Arrow
Using LLVM to accelerate processing of data in Apache Arrow
 
Distributed Tracing for Kafka with OpenTelemetry with Daniel Kim | Kafka Summ...
Distributed Tracing for Kafka with OpenTelemetry with Daniel Kim | Kafka Summ...Distributed Tracing for Kafka with OpenTelemetry with Daniel Kim | Kafka Summ...
Distributed Tracing for Kafka with OpenTelemetry with Daniel Kim | Kafka Summ...
 
Apache Kafka vs. Cloud-native iPaaS Integration Platform Middleware
Apache Kafka vs. Cloud-native iPaaS Integration Platform MiddlewareApache Kafka vs. Cloud-native iPaaS Integration Platform Middleware
Apache Kafka vs. Cloud-native iPaaS Integration Platform Middleware
 
TiDB Introduction
TiDB IntroductionTiDB Introduction
TiDB Introduction
 

Viewers also liked

Redis - for duplicate detection on real time stream
Redis - for duplicate detection on real time streamRedis - for duplicate detection on real time stream
Redis - for duplicate detection on real time streamCodemotion
 
Push! - MQTT for the Internet of Things
Push! - MQTT for the Internet of ThingsPush! - MQTT for the Internet of Things
Push! - MQTT for the Internet of ThingsDominik Obermaier
 
Real-time Data De-duplication using Locality-sensitive Hashing powered by Sto...
Real-time Data De-duplication using Locality-sensitive Hashing powered by Sto...Real-time Data De-duplication using Locality-sensitive Hashing powered by Sto...
Real-time Data De-duplication using Locality-sensitive Hashing powered by Sto...DECK36
 
Codemotion Rome 2015. GlusterFS
Codemotion Rome 2015. GlusterFSCodemotion Rome 2015. GlusterFS
Codemotion Rome 2015. GlusterFSRoberto Franchini
 
What the hell is your software doing at runtime?
What the hell is your software doing at runtime?What the hell is your software doing at runtime?
What the hell is your software doing at runtime?Roberto Franchini
 
Where are yours vertexes and what are they talking about?
Where are yours vertexes and what are they talking about?Where are yours vertexes and what are they talking about?
Where are yours vertexes and what are they talking about?Roberto Franchini
 
GlusterFs: a scalable file system for today's and tomorrow's big data
GlusterFs: a scalable file system for today's and tomorrow's big dataGlusterFs: a scalable file system for today's and tomorrow's big data
GlusterFs: a scalable file system for today's and tomorrow's big dataRoberto Franchini
 
Scaling MQTT - Webinar with Elastic Beam
Scaling MQTT - Webinar with Elastic BeamScaling MQTT - Webinar with Elastic Beam
Scaling MQTT - Webinar with Elastic BeamDominik Obermaier
 
Benchmark MinHash+LSH algorithm on Spark
Benchmark MinHash+LSH algorithm on SparkBenchmark MinHash+LSH algorithm on Spark
Benchmark MinHash+LSH algorithm on SparkXiaoqian Liu
 
Use Redis in Odd and Unusual Ways
Use Redis in Odd and Unusual WaysUse Redis in Odd and Unusual Ways
Use Redis in Odd and Unusual WaysItamar Haber
 
Build a Geospatial App with Redis 3.2- Andrew Bass, Sean Yesmunt, Sergio Prad...
Build a Geospatial App with Redis 3.2- Andrew Bass, Sean Yesmunt, Sergio Prad...Build a Geospatial App with Redis 3.2- Andrew Bass, Sean Yesmunt, Sergio Prad...
Build a Geospatial App with Redis 3.2- Andrew Bass, Sean Yesmunt, Sergio Prad...Redis Labs
 
Get more than a cache back! The Microsoft Azure Redis Cache (NDC Oslo)
Get more than a cache back! The Microsoft Azure Redis Cache (NDC Oslo)Get more than a cache back! The Microsoft Azure Redis Cache (NDC Oslo)
Get more than a cache back! The Microsoft Azure Redis Cache (NDC Oslo)Maarten Balliauw
 
Redis Developers Day 2015 - Secondary Indexes and State of Lua
Redis Developers Day 2015 - Secondary Indexes and State of LuaRedis Developers Day 2015 - Secondary Indexes and State of Lua
Redis Developers Day 2015 - Secondary Indexes and State of LuaItamar Haber
 
Getting Started with Redis
Getting Started with RedisGetting Started with Redis
Getting Started with RedisFaisal Akber
 
RespClient - Minimal Redis Client for PowerShell
RespClient - Minimal Redis Client for PowerShellRespClient - Minimal Redis Client for PowerShell
RespClient - Minimal Redis Client for PowerShellYoshifumi Kawai
 
UV logic using redis bitmap
UV logic using redis bitmapUV logic using redis bitmap
UV logic using redis bitmap주용 오
 
HIgh Performance Redis- Tague Griffith, GoPro
HIgh Performance Redis- Tague Griffith, GoProHIgh Performance Redis- Tague Griffith, GoPro
HIgh Performance Redis- Tague Griffith, GoProRedis Labs
 
Using Redis as Distributed Cache for ASP.NET apps - Peter Kellner, 73rd Stre...
 Using Redis as Distributed Cache for ASP.NET apps - Peter Kellner, 73rd Stre... Using Redis as Distributed Cache for ASP.NET apps - Peter Kellner, 73rd Stre...
Using Redis as Distributed Cache for ASP.NET apps - Peter Kellner, 73rd Stre...Redis Labs
 
RedisConf 2016 talk - The Redis API: Simple, Composable, Powerful
RedisConf 2016 talk - The Redis API: Simple, Composable, PowerfulRedisConf 2016 talk - The Redis API: Simple, Composable, Powerful
RedisConf 2016 talk - The Redis API: Simple, Composable, PowerfulDynomiteDB
 

Viewers also liked (20)

Redis - for duplicate detection on real time stream
Redis - for duplicate detection on real time streamRedis - for duplicate detection on real time stream
Redis - for duplicate detection on real time stream
 
Push! - MQTT for the Internet of Things
Push! - MQTT for the Internet of ThingsPush! - MQTT for the Internet of Things
Push! - MQTT for the Internet of Things
 
Real-time Data De-duplication using Locality-sensitive Hashing powered by Sto...
Real-time Data De-duplication using Locality-sensitive Hashing powered by Sto...Real-time Data De-duplication using Locality-sensitive Hashing powered by Sto...
Real-time Data De-duplication using Locality-sensitive Hashing powered by Sto...
 
Codemotion Rome 2015. GlusterFS
Codemotion Rome 2015. GlusterFSCodemotion Rome 2015. GlusterFS
Codemotion Rome 2015. GlusterFS
 
What the hell is your software doing at runtime?
What the hell is your software doing at runtime?What the hell is your software doing at runtime?
What the hell is your software doing at runtime?
 
Where are yours vertexes and what are they talking about?
Where are yours vertexes and what are they talking about?Where are yours vertexes and what are they talking about?
Where are yours vertexes and what are they talking about?
 
GlusterFs: a scalable file system for today's and tomorrow's big data
GlusterFs: a scalable file system for today's and tomorrow's big dataGlusterFs: a scalable file system for today's and tomorrow's big data
GlusterFs: a scalable file system for today's and tomorrow's big data
 
Scaling MQTT - Webinar with Elastic Beam
Scaling MQTT - Webinar with Elastic BeamScaling MQTT - Webinar with Elastic Beam
Scaling MQTT - Webinar with Elastic Beam
 
Benchmark MinHash+LSH algorithm on Spark
Benchmark MinHash+LSH algorithm on SparkBenchmark MinHash+LSH algorithm on Spark
Benchmark MinHash+LSH algorithm on Spark
 
Spark meetup TCHUG
Spark meetup TCHUGSpark meetup TCHUG
Spark meetup TCHUG
 
Use Redis in Odd and Unusual Ways
Use Redis in Odd and Unusual WaysUse Redis in Odd and Unusual Ways
Use Redis in Odd and Unusual Ways
 
Build a Geospatial App with Redis 3.2- Andrew Bass, Sean Yesmunt, Sergio Prad...
Build a Geospatial App with Redis 3.2- Andrew Bass, Sean Yesmunt, Sergio Prad...Build a Geospatial App with Redis 3.2- Andrew Bass, Sean Yesmunt, Sergio Prad...
Build a Geospatial App with Redis 3.2- Andrew Bass, Sean Yesmunt, Sergio Prad...
 
Get more than a cache back! The Microsoft Azure Redis Cache (NDC Oslo)
Get more than a cache back! The Microsoft Azure Redis Cache (NDC Oslo)Get more than a cache back! The Microsoft Azure Redis Cache (NDC Oslo)
Get more than a cache back! The Microsoft Azure Redis Cache (NDC Oslo)
 
Redis Developers Day 2015 - Secondary Indexes and State of Lua
Redis Developers Day 2015 - Secondary Indexes and State of LuaRedis Developers Day 2015 - Secondary Indexes and State of Lua
Redis Developers Day 2015 - Secondary Indexes and State of Lua
 
Getting Started with Redis
Getting Started with RedisGetting Started with Redis
Getting Started with Redis
 
RespClient - Minimal Redis Client for PowerShell
RespClient - Minimal Redis Client for PowerShellRespClient - Minimal Redis Client for PowerShell
RespClient - Minimal Redis Client for PowerShell
 
UV logic using redis bitmap
UV logic using redis bitmapUV logic using redis bitmap
UV logic using redis bitmap
 
HIgh Performance Redis- Tague Griffith, GoPro
HIgh Performance Redis- Tague Griffith, GoProHIgh Performance Redis- Tague Griffith, GoPro
HIgh Performance Redis- Tague Griffith, GoPro
 
Using Redis as Distributed Cache for ASP.NET apps - Peter Kellner, 73rd Stre...
 Using Redis as Distributed Cache for ASP.NET apps - Peter Kellner, 73rd Stre... Using Redis as Distributed Cache for ASP.NET apps - Peter Kellner, 73rd Stre...
Using Redis as Distributed Cache for ASP.NET apps - Peter Kellner, 73rd Stre...
 
RedisConf 2016 talk - The Redis API: Simple, Composable, Powerful
RedisConf 2016 talk - The Redis API: Simple, Composable, PowerfulRedisConf 2016 talk - The Redis API: Simple, Composable, Powerful
RedisConf 2016 talk - The Redis API: Simple, Composable, Powerful
 

Similar to Stream Duplicate Detection with Redis Counters and LUA Scripting

Logging for Production Systems in The Container Era
Logging for Production Systems in The Container EraLogging for Production Systems in The Container Era
Logging for Production Systems in The Container EraSadayuki Furuhashi
 
Swift Install Workshop - OpenStack Conference Spring 2012
Swift Install Workshop - OpenStack Conference Spring 2012Swift Install Workshop - OpenStack Conference Spring 2012
Swift Install Workshop - OpenStack Conference Spring 2012Joe Arnold
 
Experiences building a distributed shared log on RADOS - Noah Watkins
Experiences building a distributed shared log on RADOS - Noah WatkinsExperiences building a distributed shared log on RADOS - Noah Watkins
Experiences building a distributed shared log on RADOS - Noah WatkinsCeph Community
 
Application Logging in the 21st century - 2014.key
Application Logging in the 21st century - 2014.keyApplication Logging in the 21st century - 2014.key
Application Logging in the 21st century - 2014.keyTim Bunce
 
Varnish http accelerator
Varnish http acceleratorVarnish http accelerator
Varnish http acceleratorno no
 
Speed up your Symfony2 application and build awesome features with Redis
Speed up your Symfony2 application and build awesome features with RedisSpeed up your Symfony2 application and build awesome features with Redis
Speed up your Symfony2 application and build awesome features with RedisRicard Clau
 
Introduction to Docker (and a bit more) at LSPE meetup Sunnyvale
Introduction to Docker (and a bit more) at LSPE meetup SunnyvaleIntroduction to Docker (and a bit more) at LSPE meetup Sunnyvale
Introduction to Docker (and a bit more) at LSPE meetup SunnyvaleJérôme Petazzoni
 
Managing Your Security Logs with Elasticsearch
Managing Your Security Logs with ElasticsearchManaging Your Security Logs with Elasticsearch
Managing Your Security Logs with ElasticsearchVic Hargrave
 
Super scaling singleton inserts
Super scaling singleton insertsSuper scaling singleton inserts
Super scaling singleton insertsChris Adkin
 
Docker and-containers-for-development-and-deployment-scale12x
Docker and-containers-for-development-and-deployment-scale12xDocker and-containers-for-development-and-deployment-scale12x
Docker and-containers-for-development-and-deployment-scale12xrkr10
 
Brk2051 sql server on linux and docker
Brk2051 sql server on linux and dockerBrk2051 sql server on linux and docker
Brk2051 sql server on linux and dockerBob Ward
 
Docker Introduction, and what's new in 0.9 — Docker Palo Alto at RelateIQ
Docker Introduction, and what's new in 0.9 — Docker Palo Alto at RelateIQDocker Introduction, and what's new in 0.9 — Docker Palo Alto at RelateIQ
Docker Introduction, and what's new in 0.9 — Docker Palo Alto at RelateIQJérôme Petazzoni
 
Docker Introduction + what is new in 0.9
Docker Introduction + what is new in 0.9 Docker Introduction + what is new in 0.9
Docker Introduction + what is new in 0.9 Jérôme Petazzoni
 
OSDC 2016 - Ingesting Logs with Style by Pere Urbon-Bayes
OSDC 2016 - Ingesting Logs with Style by Pere Urbon-BayesOSDC 2016 - Ingesting Logs with Style by Pere Urbon-Bayes
OSDC 2016 - Ingesting Logs with Style by Pere Urbon-BayesNETWAYS
 
dba_lounge_Iasi: Everybody likes redis
dba_lounge_Iasi: Everybody likes redisdba_lounge_Iasi: Everybody likes redis
dba_lounge_Iasi: Everybody likes redisLiviu Costea
 
Privilege Escalation with Metasploit
Privilege Escalation with MetasploitPrivilege Escalation with Metasploit
Privilege Escalation with Metasploitegypt
 
FPC for the Masses - CoRIIN 2018
FPC for the Masses - CoRIIN 2018FPC for the Masses - CoRIIN 2018
FPC for the Masses - CoRIIN 2018Xavier Mertens
 
LibOS as a regression test framework for Linux networking #netdev1.1
LibOS as a regression test framework for Linux networking #netdev1.1LibOS as a regression test framework for Linux networking #netdev1.1
LibOS as a regression test framework for Linux networking #netdev1.1Hajime Tazaki
 
NSC #2 - Challenge Solution
NSC #2 - Challenge SolutionNSC #2 - Challenge Solution
NSC #2 - Challenge SolutionNoSuchCon
 

Similar to Stream Duplicate Detection with Redis Counters and LUA Scripting (20)

Logging for Production Systems in The Container Era
Logging for Production Systems in The Container EraLogging for Production Systems in The Container Era
Logging for Production Systems in The Container Era
 
Swift Install Workshop - OpenStack Conference Spring 2012
Swift Install Workshop - OpenStack Conference Spring 2012Swift Install Workshop - OpenStack Conference Spring 2012
Swift Install Workshop - OpenStack Conference Spring 2012
 
Experiences building a distributed shared log on RADOS - Noah Watkins
Experiences building a distributed shared log on RADOS - Noah WatkinsExperiences building a distributed shared log on RADOS - Noah Watkins
Experiences building a distributed shared log on RADOS - Noah Watkins
 
Application Logging in the 21st century - 2014.key
Application Logging in the 21st century - 2014.keyApplication Logging in the 21st century - 2014.key
Application Logging in the 21st century - 2014.key
 
Varnish http accelerator
Varnish http acceleratorVarnish http accelerator
Varnish http accelerator
 
Speed up your Symfony2 application and build awesome features with Redis
Speed up your Symfony2 application and build awesome features with RedisSpeed up your Symfony2 application and build awesome features with Redis
Speed up your Symfony2 application and build awesome features with Redis
 
Introduction to Docker (and a bit more) at LSPE meetup Sunnyvale
Introduction to Docker (and a bit more) at LSPE meetup SunnyvaleIntroduction to Docker (and a bit more) at LSPE meetup Sunnyvale
Introduction to Docker (and a bit more) at LSPE meetup Sunnyvale
 
Managing Your Security Logs with Elasticsearch
Managing Your Security Logs with ElasticsearchManaging Your Security Logs with Elasticsearch
Managing Your Security Logs with Elasticsearch
 
Advances in Open Source Password Cracking
Advances in Open Source Password CrackingAdvances in Open Source Password Cracking
Advances in Open Source Password Cracking
 
Super scaling singleton inserts
Super scaling singleton insertsSuper scaling singleton inserts
Super scaling singleton inserts
 
Docker and-containers-for-development-and-deployment-scale12x
Docker and-containers-for-development-and-deployment-scale12xDocker and-containers-for-development-and-deployment-scale12x
Docker and-containers-for-development-and-deployment-scale12x
 
Brk2051 sql server on linux and docker
Brk2051 sql server on linux and dockerBrk2051 sql server on linux and docker
Brk2051 sql server on linux and docker
 
Docker Introduction, and what's new in 0.9 — Docker Palo Alto at RelateIQ
Docker Introduction, and what's new in 0.9 — Docker Palo Alto at RelateIQDocker Introduction, and what's new in 0.9 — Docker Palo Alto at RelateIQ
Docker Introduction, and what's new in 0.9 — Docker Palo Alto at RelateIQ
 
Docker Introduction + what is new in 0.9
Docker Introduction + what is new in 0.9 Docker Introduction + what is new in 0.9
Docker Introduction + what is new in 0.9
 
OSDC 2016 - Ingesting Logs with Style by Pere Urbon-Bayes
OSDC 2016 - Ingesting Logs with Style by Pere Urbon-BayesOSDC 2016 - Ingesting Logs with Style by Pere Urbon-Bayes
OSDC 2016 - Ingesting Logs with Style by Pere Urbon-Bayes
 
dba_lounge_Iasi: Everybody likes redis
dba_lounge_Iasi: Everybody likes redisdba_lounge_Iasi: Everybody likes redis
dba_lounge_Iasi: Everybody likes redis
 
Privilege Escalation with Metasploit
Privilege Escalation with MetasploitPrivilege Escalation with Metasploit
Privilege Escalation with Metasploit
 
FPC for the Masses - CoRIIN 2018
FPC for the Masses - CoRIIN 2018FPC for the Masses - CoRIIN 2018
FPC for the Masses - CoRIIN 2018
 
LibOS as a regression test framework for Linux networking #netdev1.1
LibOS as a regression test framework for Linux networking #netdev1.1LibOS as a regression test framework for Linux networking #netdev1.1
LibOS as a regression test framework for Linux networking #netdev1.1
 
NSC #2 - Challenge Solution
NSC #2 - Challenge SolutionNSC #2 - Challenge Solution
NSC #2 - Challenge Solution
 

Recently uploaded

Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...stazi3110
 
Salesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantSalesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantAxelRicardoTrocheRiq
 
Professional Resume Template for Software Developers
Professional Resume Template for Software DevelopersProfessional Resume Template for Software Developers
Professional Resume Template for Software DevelopersVinodh Ram
 
The Evolution of Karaoke From Analog to App.pdf
The Evolution of Karaoke From Analog to App.pdfThe Evolution of Karaoke From Analog to App.pdf
The Evolution of Karaoke From Analog to App.pdfPower Karaoke
 
Unit 1.1 Excite Part 1, class 9, cbse...
Unit 1.1 Excite Part 1, class 9, cbse...Unit 1.1 Excite Part 1, class 9, cbse...
Unit 1.1 Excite Part 1, class 9, cbse...aditisharan08
 
Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)OPEN KNOWLEDGE GmbH
 
Engage Usergroup 2024 - The Good The Bad_The Ugly
Engage Usergroup 2024 - The Good The Bad_The UglyEngage Usergroup 2024 - The Good The Bad_The Ugly
Engage Usergroup 2024 - The Good The Bad_The UglyFrank van der Linden
 
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer DataAdobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer DataBradBedford3
 
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...kellynguyen01
 
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样umasea
 
What is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWhat is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWave PLM
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...MyIntelliSource, Inc.
 
Implementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureImplementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureDinusha Kumarasiri
 
Asset Management Software - Infographic
Asset Management Software - InfographicAsset Management Software - Infographic
Asset Management Software - InfographicHr365.us smith
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...MyIntelliSource, Inc.
 
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxKnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxTier1 app
 
Cloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackCloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackVICTOR MAESTRE RAMIREZ
 
Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...OnePlan Solutions
 
Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Andreas Granig
 

Recently uploaded (20)

Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
 
Salesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantSalesforce Certified Field Service Consultant
Salesforce Certified Field Service Consultant
 
Professional Resume Template for Software Developers
Professional Resume Template for Software DevelopersProfessional Resume Template for Software Developers
Professional Resume Template for Software Developers
 
The Evolution of Karaoke From Analog to App.pdf
The Evolution of Karaoke From Analog to App.pdfThe Evolution of Karaoke From Analog to App.pdf
The Evolution of Karaoke From Analog to App.pdf
 
Unit 1.1 Excite Part 1, class 9, cbse...
Unit 1.1 Excite Part 1, class 9, cbse...Unit 1.1 Excite Part 1, class 9, cbse...
Unit 1.1 Excite Part 1, class 9, cbse...
 
Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)
 
Engage Usergroup 2024 - The Good The Bad_The Ugly
Engage Usergroup 2024 - The Good The Bad_The UglyEngage Usergroup 2024 - The Good The Bad_The Ugly
Engage Usergroup 2024 - The Good The Bad_The Ugly
 
Call Girls In Mukherjee Nagar 📱 9999965857 🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
Call Girls In Mukherjee Nagar 📱  9999965857  🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...Call Girls In Mukherjee Nagar 📱  9999965857  🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
Call Girls In Mukherjee Nagar 📱 9999965857 🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
 
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer DataAdobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
 
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
 
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
 
What is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWhat is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need It
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
 
Implementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureImplementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with Azure
 
Asset Management Software - Infographic
Asset Management Software - InfographicAsset Management Software - Infographic
Asset Management Software - Infographic
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
 
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxKnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
 
Cloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackCloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStack
 
Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...
 
Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024
 

Stream Duplicate Detection with Redis Counters and LUA Scripting

  • 1. for duplicate detection on real time stream
  • 2. whoami(1) 15 years of experience, proud to be a programmer Writes software for information extraction, nlp, opinion mining (@scale ), and a lot of other buzzwords Implements scalable architectures Member of the JUG-Torino coordination team ro.franchini@gmail.com github.com/robfrank twitter.com/robfrankie linkedin.com/in/robfrank http://www.celi.it http://www.blogmeter.it
  • 3. Agenda What is it? Main features Caching Counters Scripting How we use it
  • 4. From the site Redis is an open source, BSD licensed, advanced key-value cache and store. It is often referred to as a data structure server since keys can contain strings, hashes, lists, sets, sorted sets, bitmaps and hyperloglogs.
  • 5. Who use it Twitter Github Youporn Pinterest Groupon ...
  • 6. Ecosystem Clients in every known language Articles, books, presentations On High Scalability every other day
  • 7. Architecture Single-threaded server Yes: single threaded server Remember that when you need to scale Single Linux server can handle 500k req/s
  • 8. Main features In memory K/V store But with durable persistence Master-slave async replica Transactions Pub/Sub Server side LUA scripting
  • 9. Main features Keys with TTL LRU eviction Keys can contain strings, hashes, lists, sets, sorted sets, bitmaps and hyperloglogs REDIS cluster on the go (3.0.0-rc1)
  • 10. K/V store Key-value (KV) stores use the associative array (also known as a map or dictionary) as their fundamental data model. In this model, data is represented as a collection of key-value pairs, such that each possible key appears at most once in the collection. (wikipedia)
  • 11. K/V store Key “plain text” name rob surname frank A C E D B F A B C D E F String/blobs/bitmaps HashTable: Objects Linked lists Sets
  • 12. Persistence Configurable, two flavors RDB: perfect for backup AOF: append only log, replayed at startup Use AOF + RDB for rock solid persistence Automatic cache warm-up at startup!! Only RAM: switch off persistence
  • 13. Common use cases Cache Queue Session replication In memory indexes Centralized ID generation
  • 14. Basics SET user:1 frank GET user:1 → frank EXISTS user:2 → 1 EXPIRE user:1 3600 INCR count:1 GET count:1 → 1
  • 15. Basics KEYS user:* → user:1, user:2 MSET user:1 frank user:2 coder MGET user:1 user:2 → frank, coder HMSET userdetail:3 name rob surname frank HGETALL userdetail:3 → name::rob, surname:: frank
  • 16. Transactions MULTI INCR counter:1 INCR counter:2 EXEC > 1 > 1 WATCH counter:3 val = GET counter:3 val = val +1 MULTI SET counter:3 $val EXEC
  • 17. Atomic counters Operators for key increment INCR counter:1 GET counter:1 → 1 INCRBY counter:1 9 GET counter:1 → 10
  • 18. LUA scripting Server side LUA scripting A “sort of” stored procedure Scripts are sandboxed Atomic execution ← bear in mind
  • 19. LUA scripting SCRIPT LOAD "return {KEYS[1],KEYS[2]}" "3905aac1828a8f75707b48e446988eaaeb173f13" EVALSHA 3905aac1828a8f75707b48e446988eaaeb173f13 2 user:1 user:2 1) "user:1" 2) "user:2"
  • 20. Caching: server level Configure REDIS as a cache maxmemory 1024mb maxmemory-policy allkeys-lru all the keys will be evicted using an approximated LRU algorithm
  • 21. Caching: TTL on key Set a timeout on a key SET doc:1 “mydoc.txt” EXIPRE doc:1 10 Or SETEX doc:1 10 “mydoc.txt”
  • 22. Demo
  • 23. Caching + Atomic Counters + Atomic LUA scripting
  • 24. Duplicate detection Real time stream of documents from the Internet 20% to 50% of documents are duplicated DUPLICATES ARE EVIL And customers don’t pay for that :(
  • 25. Basic Scenario 5M 3M 3M Duplicates Producer Producer detector NLP Storage Producer
  • 26. Avoid duplicated documents Act on producers was TOO HARD Filter-out them before heavy document analysis (NLP)
  • 27. Documents “Documents” are from: twitter facebook gplus instagram forums blogs
  • 28. Documents Each kind of document has its own natural id twitter: status id facebook: post id forum: URL blog: URL We don’t want this IDs inside our system
  • 29. Duplicate and id generation Producer 2M Producer Producer Duplicate detector - ID generatio n Analysis Storage 3M 3M Duplicate detector - ID generatio n 1M Analysis 1M 5M
  • 30. Map external keys to internal UID Generate an ID for each document IDs are generated using daily named counters: INCR day:20141028 → 12576 INCR day:20141010 → 23412576 Cache generated ID tw_1234578688 → day:20141028;12576
  • 31. Map external keys to internal UID Documents are internally stored on different storage systems with their generated id globalId→ 20141028:3456789
  • 32. Operations Natural Keys are cached with TTL Documents out of time are parked in a staging area Duplicated documents are usually dropped
  • 33. LRU cache, counters and LUA LUA scripts are executed atomically Wrote a simple script to: return previous mapped id or generate id and store key and id in cache EVALSHA “sha” 2 20141028 tw_1234566 → 20141028:123 GET tw_1234566 → 20141028:123
  • 34. Demo
  • 35. Deployment Pre-production phase Single server 70M keys in 10GB of RAM In production with a simple M/S
  • 36. Alternatives PostgreSQL sequence(s) table OR hstore Hazelcast (we are java based) in memory write your own persistence
  • 37. Q/A
  • 38. References http://redis.io/ http://redis.io/commands http://stackoverflow.com/questions/tagged/redis http://try.redis.io/