Presented by David Hamson & Mou Nandi, NetDocuments - See conference video - http://www.lucidimagination.com/devzone/events/conferences/lucene-revolution-2012
NetDocuments, a SaaS document management company, is migrating its large document repository from Microsoft FAST to Solr. In this presentation, the speakers discuss the entire process, including major decision points and lessons learned. The migration is a two-phase implementation: the first phase is a shortcut that moves the FAST XML data directly to Solr so that a Solr metadata index is available quickly, and the second phase implements the full architecture, including both metadata and full-text processing and search. The presenters talk about architecting Solr to meet the company's requirements of scaling to billions of work-product documents, low indexing latency, and high availability. NetDocuments uses the search engine both to build the user experience and for document discovery by users. Solr was architected to scale and perform for these two very different needs and to match the features and functionality available with FAST. Finally, the presenters share benchmark results from tests run on various hardware configurations and file systems, along with results from search quality testing of Solr on a single server, with both a single Solr core and multiple Solr cores.
2. Goal of the Session
• NetDocuments
• Why move to Solr from FAST
• Architecting Solr to work as a core module for a Cloud Document Management product: user interface building and document discovery
• Testing and benchmarking Solr to scale and perform for billions of documents with 200 QPS and 200 DPS
• Lessons learned / shortcuts found migrating from FAST to Solr
2/14
3. Who We Are
A leading cloud content management and collaboration service for small to medium businesses (SMB) and professional services firms
4. Who We Serve
We serve over 1,000 customers across 128 countries worldwide and host more than 250 million documents.
5. Why Migrate to Solr
• Product roadmap does not fit with company roadmap
• Large hardware footprint, expensive to scale
• High indexing latency
• Unpredictable and untraceable document loss
• A black-box search engine, dependency on Microsoft FAST support team
• No control over new features
• Expensive license
• Solr supports massive indexes
• Active, hardworking development community
• Access to what's happening under the hood
• Improved hardware footprint
• Reduced licensing cost
6. Migration to Solr
• 95% of searches are metadata searches; the metadata index does not need rich text processing
• Flexibility to implement different architectures for MDI and FTI
• Even the highest level of logging cannot trace the document loss during heavy feeding traffic
[Diagram: ND Document feeding FAST Instance 1, FAST Instance 2, and more FAST instances; each instance shows Fast Doc Processors, FIXML, Fast Indexer, and a combined MDI + FTI index]
7. Migration to Solr – Solr Indexing
[Diagram: ND Document and ND Pipeline; a metadata path with MD XML, Solr MD instances, and Solr MDI; a full-text path with Aspire, FT XML, Solr FT instances, and Solr FTI]
8. The Migration Project
Phase 1 - MDI
• Only create MDI
• Use FAST data to prototype Solr
• Use the fixmls to build the Solr index
• Use 100% filter queries
Phase 2 - FTI
• Build a robust feeding pipeline to handle both MD and FT
• Build a text processing pipeline
Phase 3
• Implement new Solr features
9. Some ft. view of NetDocuments Search Architecture
[Diagram: Web App, Web Queue, and Query; NDPipeline containing Administration (monitoring, debugging, stats), MD Handler Pool (MDH1 to MDH5), FT Queue, FT Processor pool (FTP1 to FTP5), Dispatcher queue, and Dispatcher pool (D1 to D5); Distributor; File System; Solr MDI; Solr FTI]
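The queue-and-pool layout named in the diagram can be illustrated with a small sketch. This is not NetDocuments' pipeline: the stage names (web queue, MD handler pool, dispatcher queue, dispatcher pool) follow the slide, but `run_pool`, `process`, and the placeholder handlers are hypothetical, and the full-text branch is left out for brevity.

```python
import queue
import threading

def run_pool(name, size, inbox, handle, outbox):
    """Start `size` workers that read from inbox, apply handle(), write to outbox."""
    def worker():
        while True:
            item = inbox.get()
            if item is None:          # sentinel: stop this worker
                break
            outbox.put(handle(item))
    threads = [threading.Thread(target=worker, name=f"{name}{i + 1}")
               for i in range(size)]
    for t in threads:
        t.start()
    return threads

def stop_pool(threads, inbox):
    """Send one sentinel per worker and wait for the pool to drain."""
    for _ in threads:
        inbox.put(None)
    for t in threads:
        t.join()

def process(doc_ids, pool_size=5):
    """Push document ids through an MD handler pool and then a dispatcher pool."""
    web_queue, dispatcher_queue, indexed = queue.Queue(), queue.Queue(), queue.Queue()
    # Placeholder handlers: a real MD handler would build metadata XML,
    # and a real dispatcher would post it to Solr.
    md_pool = run_pool("MDH", pool_size, web_queue,
                       lambda doc_id: ("md-xml", doc_id), dispatcher_queue)
    d_pool = run_pool("D", pool_size, dispatcher_queue,
                      lambda msg: msg[1], indexed)
    for doc_id in doc_ids:
        web_queue.put(doc_id)
    stop_pool(md_pool, web_queue)         # all metadata handled before dispatch stops
    stop_pool(d_pool, dispatcher_queue)
    results = []
    while not indexed.empty():
        results.append(indexed.get())
    return sorted(results)

if __name__ == "__main__":
    print(process([3, 1, 2]))
```

Decoupling the stages with queues lets each pool be sized independently, which is one plausible reason for the separate handler, processor, and dispatcher pools on the slide.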
10. Benchmarking Solr Config Parameter for indexing
• Created Solr index from fixmls with different RAM buffer, merge factor, and auto commit configurations
Testing with HDD and SSD
• We did not see any performance difference between HDD (15k rpm) and the ioDrive2 with ND documents
• 15 threads running at a time from the client feeder application
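A sweep over RAM buffer, merge factor, and auto commit settings can be organised as a simple grid. The sketch below is only an illustration: the candidate values are hypothetical placeholders (the slides do not list the values that were tested), and `benchmark_grid` is an invented helper. The keys mirror the `ramBufferSizeMB`, `mergeFactor`, and `autoCommit` `maxTime` settings found in `solrconfig.xml`.

```python
from itertools import product

# Hypothetical candidate values; the slides do not say which ones were tested.
RAM_BUFFER_MB = [32, 128, 512]
MERGE_FACTOR = [10, 25]
AUTO_COMMIT_MS = [7000, 60000]

def benchmark_grid():
    """Return one settings dict per combination to index the fixmls with."""
    return [
        {"ramBufferSizeMB": ram, "mergeFactor": mf, "autoCommitMaxTimeMs": ac}
        for ram, mf, ac in product(RAM_BUFFER_MB, MERGE_FACTOR, AUTO_COMMIT_MS)
    ]

if __name__ == "__main__":
    for settings in benchmark_grid():
        print(settings)
```

Each dict would correspond to one indexing run, so results can be compared per combination rather than per single parameter.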
11. Testing using different file system
• We did not see a huge performance difference between ext3 and xfs on HDD or SSD with ND documents
• We chose to use ext3 for FTI with 15K HDD on RAID10
• We are using xfs on the ioDrive for MDI, as suggested by Fusion-io
12. Benchmarking Solr Indexing and Query Process
• Implemented and compared multi-core index processing and query performance against a single-core index
Search going to 5 shards
• 5 SolrMeter instances
• Each shard serving 3000 queries/min
• Total: 15000 queries/min
• Avg response time: 8 ms
• CPU: 20%
• RAM: 52 GB
• Cache warmup time: 2.5 s
• Cache hit ratio: .98
• Cache size: 2276, no eviction
• Index updated every 7 sec
• Test ran 5 min
Search going to 10 shards
• 10 SolrMeter instances
• Each shard serving 1500 queries/min
• Total: 15000 queries/min
• Avg response time: 12 ms
• CPU: 32%
• RAM: 53 GB
• Cache warmup time: 2.7 s
• Cache hit ratio: .98
• Cache size: 2276, no eviction
• Index updated every 7 sec
• Test ran 8 min
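The load figures on this slide can be checked with a little arithmetic: 5 shards at 3000 queries/min and 10 shards at 1500 queries/min both give 15000 queries/min in total, which is 250 queries per second and therefore above the 200 QPS target stated in the session goals. The helper names below are illustrative only.

```python
def total_queries_per_min(shards, per_shard_qpm):
    """Aggregate query load when every shard serves the same rate."""
    return shards * per_shard_qpm

def to_qps(queries_per_min):
    """Convert queries per minute to queries per second."""
    return queries_per_min / 60

five_shards = total_queries_per_min(5, 3000)
ten_shards = total_queries_per_min(10, 1500)

if __name__ == "__main__":
    print(five_shards, ten_shards, to_qps(five_shards))  # 15000 15000 250.0
```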
13. Benchmark qtime increase as Solr scales and start row increases
qTime does not vary much with start row increase.
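A measurement like this needs two pieces: paging parameters with an increasing `start` offset, and the `QTime` value that Solr reports in the `responseHeader` of each response. The sketch below shows both without touching a server; the helper names are hypothetical, the sample response body is made up for illustration, and the page size of 500 is borrowed from the example query later in the deck.

```python
import json

def page_params(query, start, rows=500):
    """Solr paging parameters; `start` is the offset of the first row returned."""
    return {"q": query, "start": start, "rows": rows, "wt": "json"}

def qtime_ms(response_text):
    """Read QTime (milliseconds) from the header of a Solr JSON response."""
    return json.loads(response_text)["responseHeader"]["QTime"]

# Made-up response body, shaped like a Solr JSON response.
SAMPLE_RESPONSE = (
    '{"responseHeader": {"status": 0, "QTime": 8},'
    ' "response": {"numFound": 0, "start": 0, "docs": []}}'
)

if __name__ == "__main__":
    for start in (0, 500, 1000):
        print(page_params("ndenvurl:*", start))
    print(qtime_ms(SAMPLE_RESPONSE))
```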
14. Tuning System queries for Solr
• System searches are metadata searches
• Thousands of real-life queries were extracted from the FAST query log
• Extensive use of filter queries and the filter cache gives excellent response time for complex queries
• Example queries:
FAST Query:
ANDNOT(ANDNOT(ANDNOT(AND(AND(ndcabinets:string("cab1", mode="and"),ndcredate:range(2011-09-26T00:00:00,2012-04-13T23:59:59)),FILTER(ndacl:string("acl1 acl2 acl3",mode="OR"))),nddeletedcabs:string("cab1", mode="and")),ndexten:string("ndws", mode="and")),ndexten:string("ndflt", mode="and"))
Solr Query:
http://solrserver:port/solrSearch/core0/select?shards=solrserver:port/solrSearch/core0,1solrserver:port/solrSearch/core1&start=0&rows=500&fl=ndenvurl,nddocmodnum_s_std,ndtitle_t_idx_std&sort=ndlastmoddate_tdt_idx+desc&q=ndenvurl:*&fq=ndcabinets_smulti_idx:cab1&fq=ndcredate_tdt_idx:[2011-09-26T00:00:00Z TO 2012-04-13T23:59:59Z]&fq={!cache=false cost=100}(ndacl_smulti_idx:acl1 OR ndacl_smulti_idx:acl2 OR ndacl_smulti_idx:acl3)&fq=-nddeletedcabs_smulti_idx:cab1&fq=-ndexten_s_idx:ndws&fq=-ndexten_s_idx:ndflt
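The translation pattern in the Solr example (a match-all main query plus one `fq` per clause of the FAST expression) can be sketched as follows. The field names and filter values are taken from the slide; the port number, the `build_select_url` helper, and the omission of the `shards` and `fl` parameters are simplifications made for this illustration.

```python
from urllib.parse import urlencode

# Hypothetical host and port; the slide shows only "solrserver:port".
BASE = "http://solrserver:8983/solrSearch/core0"

# One filter per clause of the FAST query; negated clauses get a leading "-".
FILTERS = [
    "ndcabinets_smulti_idx:cab1",
    "ndcredate_tdt_idx:[2011-09-26T00:00:00Z TO 2012-04-13T23:59:59Z]",
    # Local params: keep this filter out of the filter cache.
    "{!cache=false cost=100}(ndacl_smulti_idx:acl1 OR ndacl_smulti_idx:acl2"
    " OR ndacl_smulti_idx:acl3)",
    "-nddeletedcabs_smulti_idx:cab1",
    "-ndexten_s_idx:ndws",
    "-ndexten_s_idx:ndflt",
]

def build_select_url(base, query, filters, **extra):
    """Build a Solr /select URL with one fq parameter per filter clause."""
    pairs = [("q", query)]
    pairs += [("fq", fq) for fq in filters]
    pairs += sorted(extra.items())
    return base + "/select?" + urlencode(pairs)

URL = build_select_url(BASE, "ndenvurl:*", FILTERS,
                       start=0, rows=500, sort="ndlastmoddate_tdt_idx desc")

if __name__ == "__main__":
    print(URL)
```

Separate `fq` parameters are cached independently in Solr's filter cache, so clauses shared by many queries (cabinet, extension exclusions) can be reused. The ACL filter is marked `cache=false`, presumably because ACL combinations vary per user and would crowd the cache.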