Fully Online Grammar Compression in Constant Space
Shirou Maruyama (1) and Yasuo Tabei (2)
(1) Preferred Infrastructure, Inc.
(2) PRESTO, JST
Data Compression Conference (DCC), March 26, 2014
Compression of large-scale repetitive texts
Examples: personal genomes, version-controlled documents, source code in repositories
• Fully online LCA (FOLCA) [SPIRE, 2013]: builds a CFG and directly encodes it into a succinct representation
– Works in space proportional to the CFG size and in time linear in the text length
• Requires a large working space for noisy repetitive texts
– On average, 9% difference between human genomes in a recent database [Nature, 2010]
• We present novel variants of FOLCA that work in constant space
Straight Line Program (SLP)
• Canonical form of a CFG deriving a single text
• Every production rule satisfies:
– The right-hand side is a digram
– The subscript of the left-hand symbol is larger than the subscripts of the right-hand symbols
Example:
X1 → ab
X2 → X1a
X3 → X1X2
X4 → X3X2
[Figure: parse tree for the example; recoverable labels are the string aabbabb and the nodes X1 to X5]
n: SLP size, N: text length
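As a minimal sketch of how an SLP derives its text, the Python snippet below expands a start symbol by recursively substituting right-hand sides, and checks the subscript condition stated above. The rule set and the helper names `expand` and `is_slp` are illustrative assumptions, chosen so that the grammar derives the string abaababa; they are not taken from the authors' implementation.

```python
# Illustrative SLP: each non-terminal maps to a digram (a pair of symbols).
# The grammar and all names here are assumptions made for this sketch.
RULES = {
    "X1": ("a", "b"),
    "X2": ("X1", "a"),
    "X3": ("X2", "X1"),
    "X4": ("X3", "X2"),
}


def expand(symbol, rules):
    """Return the text derived from `symbol`; symbols without a rule are terminals."""
    if symbol not in rules:
        return symbol
    left, right = rules[symbol]
    return expand(left, rules) + expand(right, rules)


def is_slp(rules):
    """Check that every rule X_k -> X_i X_j satisfies k > i and k > j."""
    def index(symbol):
        # Terminals get index 0 so that any non-terminal may refer to them.
        return int(symbol[1:]) if symbol in rules else 0

    return all(
        index(head) > index(left) and index(head) > index(right)
        for head, (left, right) in rules.items()
    )


print(expand("X4", RULES))  # abaababa
print(is_slp(RULES))        # True
```

The recursion is fine for small examples; a text of realistic size would need an explicit stack.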
Grammar compression (GC)
• Build a small SLP from an input text
– Bottom-up construction of a parse tree
• The hash table (a.k.a. reverse dictionary) is a crucial data structure
– Given XiXj, it returns Xk for Xk→XiXj
– Access time: O(1/α); memory: n(3+α)lg(n+σ) bits (α: load factor, σ: alphabet size)
[Figure: bottom-up parse of the text a a b b a b b with non-terminals X1, X2, X3]
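The reverse dictionary can be sketched in a few lines of Python. Here a built-in `dict` stands in for the hash table, so the load factor α and the bit-level memory layout are not modelled; the class name `ReverseDictionary` and the method `lookup_or_add` are hypothetical names used only for illustration.

```python
class ReverseDictionary:
    """Maps a digram (left, right) to the non-terminal that produces it."""

    def __init__(self):
        # (left, right) -> non-terminal name; a plain dict replaces the
        # space-tuned hash table described on the slide (an assumption).
        self._table = {}

    def lookup_or_add(self, left, right):
        """Return Xk for Xk -> left right, creating a new rule if none exists."""
        key = (left, right)
        if key not in self._table:
            self._table[key] = f"X{len(self._table) + 1}"
        return self._table[key]

    def __len__(self):
        return len(self._table)


rd = ReverseDictionary()
print(rd.lookup_or_add("a", "b"))   # X1 (new rule)
print(rd.lookup_or_add("X1", "a"))  # X2 (new rule)
print(rd.lookup_or_add("a", "b"))   # X1 (existing rule is reused)
```

Reusing the existing non-terminal for a repeated digram is what lets repeated substrings share rules.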
Existing GCs
• Compression time and working space are important for scalability
• Online LCA (OLCA) [CCP, 2011] = efficient GC
• Drawback: existing methods need a large working space
• Challenge: developing a fast GC with a smaller working space

Method        Compression time   Working space (bits)
CCP, 2011     O(N/α)             (3+α)n lg n
SPIRE, 2012   O(N/α)             (11/4+α)n lg n
CPM, 2013     O(N lg n)          2n lg n (1+o(1)) + 2n lg p  (p << √n)
Menu
• Review of FOLCA in compressed space
• FOLCA in constant space
• Decompression in constant space
• Experiments
Fully Online LCA (FOLCA) [SPIRE, 2013]
• Smaller working space: (1+α)n lg n + n(3+lg(αn)) bits
• Optimal encoding: n lg n + 2n + o(n) bits
– Almost equal to the lower bound [CPM, 2013]
[Figure: direct encoding of an SLP. Text → SLP (parse tree) → partial parse tree → succinct representation]
Text: abaababa
B: 0010101011 (bit positions 1 to 10)
L: a b a X1 X2
P: 123469
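The example values of B and L can be reproduced with a short post-order traversal. The sketch below assumes that B writes 0 for a leaf and 1 for an internal node in post-order, that a non-terminal is expanded only at its first occurrence and appears as a leaf afterwards, and that one extra 1 closes the tree (a virtual root). This reading is consistent with the example B and L on the slide, but the function name `encode_ppt` and the rule set are assumptions for illustration.

```python
# Assumed grammar deriving "abaababa"; it matches the hash table entries
# ab -> X1, X1a -> X2, X2X1 -> X3, X3X2 -> X4 shown later in the deck.
RULES = {
    "X1": ("a", "b"),
    "X2": ("X1", "a"),
    "X3": ("X2", "X1"),
    "X4": ("X3", "X2"),
}


def encode_ppt(root, rules):
    """Return (B, L): the post-order bit string and the leaf labels."""
    bits, leaves, seen = [], [], set()

    def visit(symbol):
        # Terminals and already-expanded non-terminals become leaves.
        if symbol not in rules or symbol in seen:
            bits.append("0")
            leaves.append(symbol)
            return
        seen.add(symbol)
        left, right = rules[symbol]
        visit(left)
        visit(right)
        bits.append("1")  # internal node, emitted after both children

    visit(root)
    bits.append("1")  # closing bit for the assumed virtual root
    return "".join(bits), leaves


B, L = encode_ppt("X4", RULES)
print(B)  # 0010101011
print(L)  # ['a', 'b', 'a', 'X1', 'X2']
```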
Basic idea of FOLCA
• Replace identical pairs of symbols in common substrings with the same non-terminal symbols as far as possible
• Build 2-trees or 2-2-trees
[Figure: the text a b r a k a d a b r a k a d a b r, with its common substrings parsed into the same non-terminals X1 to X4]
• Iterate this procedure on the new non-terminal symbols until a single parse tree is built
Online construction of a parse tree
• Use one queue for each level of the parse tree
• (i) Read a character, (ii) build a subtree in each queue, and (iii) enqueue the non-terminal symbol at the root of the subtree to the next higher queue
[Figure: queue Qi holding q0 q1 q2 q3 q4. Case (i), q1 is a landmark: q0q1 is dequeued and the new node z is enqueued to Qi+1. Case (ii), otherwise: q0q1q2 is dequeued, nodes y and z are built, and z is enqueued to Qi+1]
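The level-by-level queue idea can be illustrated with a deliberately simplified sketch. It keeps one pending slot per level and always pairs two adjacent symbols, so it builds 2-trees only; the landmark test that lets FOLCA choose between 2-trees and 2-2-trees is omitted. This is therefore not FOLCA itself, and every name in it (`compress_online`, `expand`, the integer rule ids) is an assumption made for illustration.

```python
def compress_online(text):
    """Read `text` one character at a time and return (root, rules)."""
    rules = {}    # (left, right) -> non-terminal id (int); reverse dictionary
    pending = []  # pending[i]: symbol waiting at level i, or None

    def nonterminal(left, right):
        key = (left, right)
        if key not in rules:
            rules[key] = len(rules) + 1
        return rules[key]

    for ch in text:
        symbol, level = ch, 0
        while True:
            if level == len(pending):
                pending.append(None)
            if pending[level] is None:
                pending[level] = symbol  # wait for a partner at this level
                break
            # A partner is waiting: build a subtree and pass its root upward.
            symbol = nonterminal(pending[level], symbol)
            pending[level] = None
            level += 1

    # Flush: higher levels cover earlier text, so they go on the left.
    root = None
    for waiting in pending:
        if waiting is not None:
            root = waiting if root is None else nonterminal(waiting, root)
    return root, rules


def expand(symbol, rules):
    """Recover the text below `symbol` (terminals are str, non-terminals int)."""
    by_id = {name: pair for pair, name in rules.items()}

    def walk(sym):
        if sym not in by_id:
            return sym
        left, right = by_id[sym]
        return walk(left) + walk(right)

    return walk(symbol)


root, rules = compress_online("abaababa")
print(len(rules))           # 6
print(expand(root, rules))  # abaababa
```

Because each symbol is handled as soon as it arrives, the text never has to be held in memory; only the pending slots and the reverse dictionary are kept.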
Demonstration
[Animation: online parsing of an input string with queues Q1, Q2, Q3; rules created: X1→aa, X2→ab, X3→X1X1]
Courtesy of S. Maruyama
FOLCA in compressed space
• The succinct PPT is written to secondary storage
– Size: n lg n + 2n bits
• The hash table is kept in main memory
– Each element = triple (Xk, Xi, Xj) for Xk→XiXj
• Working space depends only on the SLP size n
– n(3+α)lg(n+σ) bits
[Figure: partial parse tree (PPT) and its succinct form]
Secondary storage (succinct PPT): B: 0010101011, L: a b a X1 X2
Main memory (hash table): ab→X1, X1a→X2, X2X1→X3, X3X2→X4
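To make the two space bounds concrete, the following worked arithmetic evaluates them for small numbers. The values of n, α and σ are arbitrary illustrative choices (picked so the logarithms are exact), and the function names are assumptions; only the two formulas come from the slide.

```python
import math


def hash_table_bits(n, alpha, sigma):
    """Working space of the hash table: n(3+α)lg(n+σ) bits."""
    return n * (3 + alpha) * math.log2(n + sigma)


def succinct_ppt_bits(n):
    """Size of the succinct PPT on secondary storage: n lg n + 2n bits."""
    return n * math.log2(n) + 2 * n


# Illustrative values only: 1020 rules, alpha = 1, a four-letter alphabet.
print(hash_table_bits(1020, 1, 4))  # 1020 * 4 * lg(1024) = 40800.0
print(succinct_ppt_bits(1024))      # 1024 * 10 + 2 * 1024 = 12288.0
```

Both quantities grow with n, which is why a text that keeps producing new rules keeps enlarging the hash table held in main memory.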
FOLCA in constant space
• Basic idea: compute the frequencies of production rules in the hash table and remove infrequent ones
• Naive approach: divide a text into fixed-length blocks and apply FOLCA to each block
• Apply stream mining techniques
– frequency counting [Demaine et al., 2002]: FREQ_FOLCA
– lossy counting [Manku et al., 2002]: LOSSY_FOLCA
[Figure: the text a a b b a b b a b a… parsed with the rules X1, X2, X3, alongside a frequency column (2, 2, 1)]
FREQ_FOLCA
• Basic idea: (i) use a hash table with at most k entries and (ii) remove the lowest ε percent of entries by frequency
• Remove infrequent production rules every time the hash table size reaches k
• Built on relative frequencies
• Working space: bits
• Computational time:
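The eviction policy described on this slide can be sketched as a bounded table that counts how often each rule is looked up and drops the least frequent entries when it is full. This follows the slide's wording rather than the exact counter scheme of Demaine et al.; the class name, the tie-breaking rule (older entries go first) and the choice to never reuse a rule name are all assumptions.

```python
import math


class FrequencyBoundedTable:
    """Reverse dictionary with at most `max_entries` rules (a sketch)."""

    def __init__(self, max_entries, epsilon_percent):
        self.max_entries = max_entries          # k on the slide
        self.epsilon_percent = epsilon_percent  # ε on the slide
        self.entries = {}                       # (left, right) -> [name, frequency]
        self.next_id = 1                        # names are never reused (assumption)

    def lookup_or_add(self, left, right):
        key = (left, right)
        if key in self.entries:
            self.entries[key][1] += 1
        else:
            if len(self.entries) >= self.max_entries:
                self._evict()
            self.entries[key] = [f"X{self.next_id}", 1]
            self.next_id += 1
        return self.entries[key][0]

    def _evict(self):
        # Remove the lowest ε percent by frequency (at least one entry).
        n_remove = max(1, math.ceil(len(self.entries) * self.epsilon_percent / 100))
        # sorted() is stable, so ties are broken by insertion order.
        by_frequency = sorted(self.entries, key=lambda key: self.entries[key][1])
        for key in by_frequency[:n_remove]:
            del self.entries[key]


table = FrequencyBoundedTable(max_entries=4, epsilon_percent=50)
for pair in [("a", "b")] * 3 + [("b", "c")] + [("c", "d")] * 2 + [("d", "e")]:
    table.lookup_or_add(*pair)
table.lookup_or_add("e", "f")  # table is full: the two rarest rules are dropped
print(sorted(table.entries))   # [('a', 'b'), ('c', 'd'), ('e', 'f')]
```

The table never holds more than k entries, which is what bounds the working space independently of the text length.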
LOSSY_FOLCA
• Basic idea: (i) divide a text into blocks of fixed length l, and (ii) keep production rules for the following blocks according to their frequencies
– A production rule that appears q times is kept for q successive blocks
• Remove infrequent production rules based on absolute frequencies
• Working space: bits
• Computational time:
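One way to realise "a rule that appears q times is kept for q successive blocks" is to give every rule a counter that grows with each occurrence and shrinks by one at each block boundary. The sketch below does exactly that. It assumes that the block in which a rule is created counts as the first of its q blocks, and the class and method names are hypothetical; the caller is expected to invoke `end_block` after every l input symbols.

```python
class LossyTable:
    """Reverse dictionary whose rules expire block by block (a sketch)."""

    def __init__(self):
        self.entries = {}  # (left, right) -> [name, remaining blocks]
        self.next_id = 1   # names are never reused (assumption)

    def lookup_or_add(self, left, right):
        key = (left, right)
        if key in self.entries:
            self.entries[key][1] += 1  # each occurrence buys one more block
        else:
            self.entries[key] = [f"X{self.next_id}", 1]
            self.next_id += 1
        return self.entries[key][0]

    def end_block(self):
        """Call once per block of length l: age every rule, drop expired ones."""
        for key in list(self.entries):
            self.entries[key][1] -= 1
            if self.entries[key][1] <= 0:
                del self.entries[key]


table = LossyTable()
table.lookup_or_add("a", "b")
table.lookup_or_add("a", "b")  # seen twice
table.lookup_or_add("b", "c")  # seen once
table.end_block()
print(sorted(table.entries))   # [('a', 'b')]
table.end_block()
print(sorted(table.entries))   # []
```

Because removal depends on the count itself rather than on a rank among all entries, this variant works with absolute frequencies.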
Decompression in constant space
• FREQ/LOSSY_FOLCA outputs multiple succinct PPTs
• Recover one subtext per PPT
– Detect the end of one PPT by counting 0s and 1s in B
• Working space is the same as for FREQ/LOSSY_FOLCA
[Figure: (I) succinct PPT with B: 0010101011 and L: a b a X1 X2, (II) recover the SLP, (III) recover the subtext abaababa]
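The three steps in the figure can be sketched as a stack-based decoder. It assumes that B marks leaves with 0 and internal nodes with 1 in post-order and that a PPT ends at the first position where the number of 1s equals the number of 0s, which is consistent with the example B and L above. The function names and the X1, X2, … naming by creation order are assumptions, and how rules are shared across several PPTs is not modelled.

```python
def decode_ppt(bits, leaves):
    """Return (rules, root, consumed) for the first PPT found in `bits`."""
    rules = {}   # non-terminal name -> (left, right)
    stack = []
    leaf_iter = iter(leaves)
    zeros = ones = 0
    for position, bit in enumerate(bits):
        if bit == "0":
            zeros += 1
            stack.append(next(leaf_iter))  # next leaf label from L
        else:
            ones += 1
            if ones == zeros:
                # Closing bit: the single symbol left on the stack is the root.
                return rules, stack[-1], position + 1
            right = stack.pop()
            left = stack.pop()
            name = f"X{len(rules) + 1}"    # rules are named in creation order
            rules[name] = (left, right)
            stack.append(name)
    raise ValueError("bit string ended before the PPT was closed")


def expand(symbol, rules):
    """Recover the subtext derived from `symbol`."""
    if symbol not in rules:
        return symbol
    left, right = rules[symbol]
    return expand(left, rules) + expand(right, rules)


rules, root, consumed = decode_ppt("0010101011", ["a", "b", "a", "X1", "X2"])
print(rules)                # X1=(a,b), X2=(X1,a), X3=(X2,X1), X4=(X3,X2)
print(expand(root, rules))  # abaababa
print(consumed)             # 10
```

The returned `consumed` value tells the caller where the next PPT starts in a longer bit string.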
Experiments
• Use 100 human genomes (≈300 GB) from the 1000 human genomes project [Nature, 2010]
• Compare FREQ_FOLCA, LOSSY_FOLCA and the naïve approach (BLOCK_FOLCA)
• Use working space, compression ratio, and compression time as evaluation measures
[Figure: working space for compression]
[Figure: working space for decompression]
Compression ratio and working space for 100 human genomes (≈306 GB)
• Compression ratio (CR)
• Compression time (CT) in seconds (s)
• Maximum working space (WS) in megabytes (MB)

Method                      CR      WS (MB)   CT (s)
FREQ_FOLCA (k=1000MB)       31.39   38,048    86,098
FREQ_FOLCA (k=2000MB)       19.71   76,096    93,823
LOSSY_FOLCA (l=5000MB)      20.07   36,246    87,548
LOSSY_FOLCA (l=10000MB)     17.45   56,878    87,446
BLOCK_FOLCA (l=5000MB)      31.85   23,276    88,501
BLOCK_FOLCA (l=10000MB)     25.91   34,665    92,007
Summary
• Two variants of FOLCA working in constant space
• Frequency-based algorithm:
– compute frequencies of production rules in a hash table and remove infrequent ones
• Built on stream mining techniques
• Can compress 100 human genomes (300 GB) in about one day


Editor's notes

  1. In this talk, I will deal with compression of large-scale repetitive texts. Examples are personal genomes, version-controlled documents, and source code in repositories. We previously presented fully online LCA, called FOLCA, which builds an SLP and directly encodes it into a succinct representation. Its working space is proportional to the SLP size, and its computational time is linear in the length of the text. However, recent sequencing technology generates noisy repetitive texts. In fact, there is a 9% difference on average between human genomes in a recent database, although the difference between individual genomes is said to be 0.01%. For such noisy repetitive texts, FOLCA, whose working space grows with the SLP size, consumes a large amount of memory. We present novel variants of FOLCA that work in constant space.
  2. In this talk, we assume straight line programs (SLPs) as grammars. An SLP is a canonical form of a CFG deriving a single string. Every production rule satisfies two conditions: the right-hand side is a digram, and the subscript of the left-hand symbol is larger than the subscripts of the right-hand symbols.
  3. Grammar compression (GC) builds a small SLP from an input text. It builds the parse tree corresponding to the SLP in a bottom-up manner. A hash table, also known as a reverse dictionary, is a crucial data structure in grammar compression. Given the right-hand side XiXj, it returns the left-hand symbol Xk of the production rule Xk → XiXj. The access time is O(1/α) and the memory is n(3+α)lg(n+σ) bits, where α is the load factor and σ is the alphabet size.
  4. Compression time and working space are important when applying grammar compression to large-scale repetitive texts. Online LCA (OLCA) is an efficient grammar compression algorithm. OLCA has been extended to achieve a smaller working space, but these methods still need a large working space. Our challenge is to develop a fast GC with a smaller working space.
  5. We modify FOLCA to work in compressed space. FOLCA builds a post-order partial parse tree (POPPT) that is output to a secondary storage device. The succinct representation is indexed by a rank/select dictionary. There is no o(n) term here. In addition, the hash table is kept in main memory. The hash table consumes most of the memory. The working space is n(3+α)lg(n+σ) bits. Thus, the working space depends only on the SLP size n.
  6. From this slide on, I will present FOLCA working in constant space. The basic idea of our novel variants of FOLCA is to compute the frequencies of production rules in the hash table and remove infrequent ones at certain points. We apply stream mining techniques from the data mining area for extracting frequent items in data streams. We apply two techniques. The first is frequency counting, proposed by Demaine et al. in 2002. We refer to FOLCA using frequency counting as FREQ_FOLCA. The second is lossy counting, proposed by Manku et al. in 2002. We refer to FOLCA using lossy counting as LOSSY_FOLCA. A naïve approach to compressing long repetitive texts is to divide the text into fixed-length blocks and apply a compressor to each block. Compression suffers because long-range repetitions are not captured. On the other hand, our variants of FOLCA can capture long-range repetitions.
  7. The basic idea of FREQ_FOLCA is to use a hash table with at most k entries and remove the lowest ε percent of entries by frequency.
  8. Basic idea of LOSSY_
  9. The first figure shows how the working space changes as the length of the text increases. The horizontal axis represents the length of the text. The vertical axis represents the working space in megabytes. We tried two parameters each for LOSSY_FOLCA and FREQ_FOLCA. The working space of FOLCA increases for long input texts. FOLCA works in space proportional to the SLP size. It is not applicable to large-scale, noisy repetitive texts. On the other hand, our methods, LOSSY_FOLCA and FREQ_FOLCA, work in constant space that does not depend on the text length.
  10. The second figure shows the working space for decompression. The horizontal axis represents the length of the text. The vertical axis represents the working space in megabytes. You can see the same trends in the working space for decompression as in that for compression. The working space for LOSSY_FOLCA and FREQ_FOLCA remains constant, independent of the length of the text.
  11. The last figure shows the compression ratio and working space for 100 human genomes. Compression finished in about one day. You can see the trade-off between compression ratio and working space for each method: larger parameter values achieve a better compression ratio. The compression ratio of LOSSY_FOLCA is better than that of BLOCK_FOLCA for the same block length, which shows that LOSSY_FOLCA's strategy for removing infrequent production rules is more effective than that of BLOCK_FOLCA. LOSSY_FOLCA achieved a better compression ratio than FREQ_FOLCA while using a smaller working space. These results demonstrate the applicability of our method to large-scale repetitive texts.