3. Who am I
• Software developer with 20+ years of experience
  – Including 3 years as Senior Tech Support (BEA WebLogic)
• Solr popularizer
• Published book author on Solr Indexing (for Solr 4.3)
• Run http://www.solr-start.com resource site
• Solr committer (since August 2016)
• Past and present Solr focus on onboarding, usability, tooling, information sharing

4. Example catch-22
• Search is a surprisingly complex expertise
• Solr is a complex product
  – Wide
  – Deep
  – History-rich
• And so are its many examples

5. Fasten the seatbelt
• Review all of the (Solr 6.2) OOTB examples
• Make a small one from scratch
• Deconstruct a real shipped example
• Next learning action...

6. OOTB Examples - how many?
bin/solr start -e
  -e <example>   Name of the example to run; available examples:
      cloud:         SolrCloud example
      techproducts:  Comprehensive example illustrating many of Solr's core capabilities
      dih:           Data Import Handler
      schemaless:    Schema-less example

7. techproducts example
• Used to be collection1
• solr.home: example/techproducts/solr
  – Can restart with bin/solr start -s example/techproducts/solr
  – Actual core at example/techproducts/solr/techproducts

8. techproducts example (cont.)
• Source configuration
  – server/solr/configsets/sample_techproducts_configs
  – Not actually a configset (copy, not share)
• Can be rebuilt
  rm -rf example/techproducts
• Has data (14 files of products, money, utf8 tests)
  bin/post -c techproducts example/exampledocs/*.xml

9. schemaless example
• solr.home: example/schemaless/solr
• Actual core: example/schemaless/solr/gettingstarted
• Source configuration:
  – server/solr/configsets/data_driven_schema_configs
  – Config you get when you do not specify one:
    bin/solr create -c newcore
• No data, but can take (nearly) anything:
  bin/post -c <name> example/exampledocs/*.xml

10. schemaless mode?
• “Let us guess what you mean”
  – Auto-guess field type based on first content occurrence
  – Create explicit field definitions
    • booleans, dates, numbers, strings
    • Always multivalued (because: who knows?!?)
    • Can be configured (URP chain in solrconfig.xml)
  – Rewrites managed-schema (comments begone!)
  – Makes search work with <copyField source="*" dest="_text_"/>
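
The guessing described above can be sketched in a few lines of Python. This is only an illustration of "first occurrence wins, everything multivalued", not Solr's actual update request processor chain. The type names (booleans, tlongs, tdoubles, tdates, strings) are the ones the deck later finds in solrconfig.xml; the check order, the date pattern and the function names are assumptions.

```python
from datetime import datetime


def guess_field_type(value: str) -> str:
    """Guess a multivalued type name from a single raw string value."""
    text = value.strip()
    if text.lower() in ("true", "false"):
        return "booleans"
    try:
        int(text)
        return "tlongs"
    except ValueError:
        pass
    try:
        float(text)
        return "tdoubles"
    except ValueError:
        pass
    try:
        # Assumed date pattern: ISO-8601 in UTC, e.g. 2016-10-13T00:00:00Z
        datetime.strptime(text, "%Y-%m-%dT%H:%M:%SZ")
        return "tdates"
    except ValueError:
        pass
    return "strings"


def guess_schema(docs):
    """The first occurrence of a field decides its type; later values are ignored."""
    schema = {}
    for doc in docs:
        for name, value in doc.items():
            if name not in schema:
                schema[name] = guess_field_type(str(value))
    return schema
```

Note how a second document with "price": "cheap" would not change the type already guessed for price; that is the main risk of the approach.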

11. techproducts vs schemaless
• Configured techproducts vs auto-detecting schemaless
• Strings
  techproducts: "name":"Test with some GB18030 encoded characters",
  schemaless:   "name":["Test with some GB18030 encoded characters"],
• Numbers
  techproducts: "price":0.0, "price_c":"0.0,USD",
  schemaless:   "price":[0.0],
• Booleans
  techproducts: "inStock":true,
  schemaless:   "inStock":[true],

12. cloud example
• Highly configurable (unless using -noprompt)
• solr.home: example/cloud/nodeX/solr
• Source configuration is a choice
  Please choose a configuration for the gettingstarted collection, available options are:
  basic_configs, data_driven_schema_configs, or sample_techproducts_configs [data_driven_schema_configs]
• Can be rebuilt:
  bin/solr stop -all
  rm -rf example/cloud
• Demonstrates Config API (configoverlay.json)

13. dih example(s)
• Data import handler: legacy, but still kicking
• solr.home: example/example-DIH/solr
• Has 5 (five!) different cores
  – db - database import (example/example-DIH/hsqldb/ex.*)
  – solr - import from another Solr core (configured for db core)
  – mail - import from IMAP (needs some configuration)
  – tika - import rich-content (example/exampledocs/solr-word.pdf)
  – rss - external XML feed (very broken right now)
• Cannot be rebuilt, only emptied
  bin/post -c db -type 'application/json' -d '{delete: {query:"*:*"}}'
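
For readers who prefer to send the delete-by-query themselves, here is a minimal Python sketch that only builds the request, without sending it. It assumes a default local install on port 8983 and the db core named above; the helper name is hypothetical. Unlike the relaxed JSON in the bin/post command, the body is strict JSON with quoted keys.

```python
import json


def delete_all_request(core, base="http://localhost:8983/solr"):
    """Return (url, body) for emptying a core with a delete-by-query."""
    # commit=true asks Solr to make the deletion visible right away.
    url = f"{base}/{core}/update?commit=true"
    body = json.dumps({"delete": {"query": "*:*"}})
    return url, body


url, body = delete_all_request("db")
print(url)
print(body)
```

The body would be POSTed with a Content-Type of application/json, matching the -type argument used with bin/post.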

14. What about: bin/solr start?
• solr.home: server/solr
• No initial collection/cores, have to create explicitly:
  – With script (see bin/solr create_core -h for details):
    bin/solr create -c <corename> -d <name or path>
  – With Core Admin UI for non-SolrCloud:
    http://localhost:8983/solr/admin/cores?action=CREATE&…
  – With Collection API for SolrCloud:
    http://localhost:8983/solr/admin/collections?action=CREATE&…

15. basic_configs configuration
• Available for cloud example and explicit creation
• Schemaless mode is configured, not enabled
• “Minimal Solr configuration” !?!
  – managed-schema: 1005 lines
  – solrconfig.xml: 1484 lines

16. files example
• Specifically tuned for file indexing
  – Augmented schemaless mode with language, content-type guessing
  – Custom /browse end-point
  – Source configuration: example/files/conf
  – Setup instructions: example/files/README.txt
  – Bring your own data

18. films example
• Schemaless (based on data_driven_schema_configs)
  – Uses Schema API to add custom fields
  – Uses schemaless for rest of fields
• Comes with its own data (1100 film records)
• Uses velocity (/browse), Schema API, Request Parameters API (params.json)
• Setup instructions: example/films/README.txt

19. That was the good news
• Many examples
• Easy to get one running
• Some come with data
• Some you can throw your own data into
• Lots of comments

20. This is the bad news
Example               Files  Types  Fields  Dynamic Fields  managed-schema lines  solrconfig.xml lines
basic                    46     71       4              73                  1005                  1484
data_driven              46     71       4              73                  1005                  1482
techproducts            101     66      33              28                  1149                  1701
dih db                   62     62      31              28                  1129                  1490
dih tika                  6     61       3              27                   901                  1466
files                    69     73       9              73                   517                  1508
films (data_driven+)     46     71       8              73                   481                  1482

21. Tip - getting these numbers
• XML extraction with XMLStarlet (XSLT CLI)
  – xml sel -t -m "//fieldType" -v @name -n managed-schema
  – xml sel -t -m "//copyField" -c . -n managed-schema | wc -l
  – xml sel -t -m "//*[@docValues]" -v "concat(local-name(), ' ', @name, ' docValues:', @docValues)" -n managed-schema
  – xml sel -t -m "//requestHandler" -v "@name" -n solrconfig.xml
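
If XMLStarlet is not at hand, similar counts can be approximated with Python's standard library. The sketch below is an assumption-laden equivalent, not the method used for the table: SAMPLE_SCHEMA is a made-up fragment, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Made-up schema fragment, only to make the sketch runnable.
SAMPLE_SCHEMA = """<schema name="demo" version="1.6">
  <field name="id" type="string" docValues="true"/>
  <field name="name" type="text_general"/>
  <dynamicField name="*_s" type="string"/>
  <copyField source="*" dest="_text_"/>
  <fieldType name="string" class="solr.StrField"/>
  <fieldType name="text_general" class="solr.TextField"/>
</schema>"""


def count_tags(xml_text, tags=("fieldType", "field", "dynamicField", "copyField")):
    """Count how many elements of each tag occur anywhere in the document."""
    root = ET.fromstring(xml_text)
    seen = Counter(el.tag for el in root.iter())
    return {tag: seen[tag] for tag in tags}


def doc_values_report(xml_text):
    """List every element that sets docValues, like the third xml sel command."""
    root = ET.fromstring(xml_text)
    return [
        f"{el.tag} {el.get('name')} docValues:{el.get('docValues')}"
        for el in root.iter()
        if el.get("docValues") is not None
    ]


print(count_tags(SAMPLE_SCHEMA))
print(doc_values_report(SAMPLE_SCHEMA))
```

For a real core, the text would be read from conf/managed-schema instead of the sample string.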

22. Why is it like this?
• Many examples predate Solr Reference Guide
• grep for options, possibilities, defaults
• Each example is a kitchen sink
“Too much of a good thing is also a bad thing”
Source: 1980s Soviet joke about Virtual Reality

24. Go small - managed-schema (2)
…
  <fieldType name="string" class="solr.StrField"/>
  <fieldType name="text_basic" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.LowerCaseTokenizerFactory" />
    </analyzer>
  </fieldType>
</schema>

25. Go small - solrconfig.xml
<config>
  <luceneMatchVersion>6.2.0</luceneMatchVersion>
  <requestHandler name="/select" class="solr.SearchHandler">
    <lst name="defaults">
      <str name="df">text</str>
    </lst>
  </requestHandler>
</config>

26. Go small - load and test
• bin/solr create -c demo -d .../demo-config/
• bin/post -c demo example/exampledocs/*.xml
• Test it works, using HTTPie (HTTP CLI)
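
The test query itself is not shown in this transcript. As a rough stand-in, the sketch below builds a /select URL for the demo core in Python. It assumes a default local install on port 8983; the helper name and the choice of parameters (match everything, return no rows, JSON output) are illustrative, not taken from the slides.

```python
from urllib.parse import urlencode


def select_url(core, query="*:*", rows=0, base="http://localhost:8983/solr"):
    """Build a URL for the /select handler defined in the minimal solrconfig.xml."""
    params = urlencode({"q": query, "rows": rows, "wt": "json"})
    return f"{base}/{core}/select?{params}"


print(select_url("demo"))
```

Fetching that URL (with HTTPie, curl or urllib.request) against a running Solr should return a response whose numFound reflects the number of indexed documents.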

28. Go small - review
• Minimal example could be very minimal
• Some things will not work
  – No uniqueKey: no way to update documents, no SolrCloud
  – No _version_: no SolrCloud
  – Everything is multiValued: no sorting
  – copyField * => text: no meaningful relevancy, no specialized analyzer chain processing

29. Deconstructing films example
• bin/solr create -c films
• curl http://localhost:8983/solr/films/schema ... (add name, initial_release_date)
• Index 1100 records from
  – (Solr) XML,
  – (generic) JSON (doc), or
  – CSV format
• Search for batman
• Use /browse end-point and search for batman
• Enable highlighting in results

31. Initial stats for films core
Sizes (line counts)
  managed-schema*        481
  solrconfig.xml         1482
  params.json            20
File count in conf
  .txt                   41
  .xml                   3
  .json                  1
  managed-schema (xml)   1
* already has no comments

32. Deconstructing - just straight tags
• managed-schema lost comments during construction
• Let's remove comments from solrconfig.xml
• xml ed -L -d "//comment()" solrconfig.xml
  – -L: edit in place
  – -d: delete by XPath
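
The same clean-up can be approximated without XMLStarlet. The Python sketch below relies on the fact that the standard library's default XML parser discards comments, so parsing and re-serializing drops them. It returns a new string rather than editing in place, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET


def strip_comments(xml_text: str) -> str:
    """Return the document re-serialized without any XML comments."""
    # The default parser does not keep comments, so they vanish on round trip.
    root = ET.fromstring(xml_text)
    return ET.tostring(root, encoding="unicode")


sample = "<config><!-- note --><luceneMatchVersion>6.2.0</luceneMatchVersion></config>"
print(strip_comments(sample))
```

Whitespace that surrounded a comment is kept, so the result may contain blank lines where comment blocks used to be.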

34. Deconstructing - what to clean
• Currently
  – (explicit) fields: 8
  – dynamic fields: 73
    • xml sel -t -m "//dynamicField" -v @name -n managed-schema | wc -l
  – types: 71
  – copyFields: 1
• Let's start from dynamic fields

35. Deconstructing - dynamic fields
• Used dynamic fields
  – do NOT modify schema
  – DO show up in Admin UI, if used
  – Example from different schema:
    • Used/matched fields
    • Generic definitions

37. Deconstructing - in use dynamic fields
• NO dynamic fields are used
  – * is a copyField instruction
• Can remove them all
• xml ed -L -d "//dynamicField" managed-schema
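
As a rough Python counterpart to the xml ed command, the sketch below removes every element with a given tag from an XML string. It is a stand-in under stated assumptions: it works on a string instead of editing the file in place, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET


def drop_elements(xml_text, tag="dynamicField"):
    """Return the document without any element named `tag`."""
    root = ET.fromstring(xml_text)
    # Materialize the traversal first so removals do not disturb the iteration.
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag == tag:
                parent.remove(child)
    return ET.tostring(root, encoding="unicode")


sample = (
    '<schema><field name="id" type="string"/>'
    '<dynamicField name="*_s" type="string"/></schema>'
)
print(drop_elements(sample))
```

Re-serializing through ElementTree also drops comments, which is harmless here because managed-schema has none left.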

39. Deconstructing - field types
• How many types out of 71 do we use?
  – xml sel -t -m "//field|//dynamicField" -v "@type" -n conf/managed-schema | sort -u
  – long, string, strings, tdate, text_general
• But also some in solrconfig.xml
  – booleans, string, strings, tdates, tdoubles, text_general, tlongs
• Combined total: 9 field type definitions
• Delete the rest (by hand)
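
The "by hand" step could also be scripted. The Python sketch below computes the set of types still in use and drops every other fieldType definition. It is a sketch under assumptions: the function names are hypothetical, and the types referenced from solrconfig.xml are passed in as a plain list (the seven names listed above) rather than extracted automatically.

```python
import xml.etree.ElementTree as ET

# Types the deck found referenced from solrconfig.xml (schemaless URP chain).
SOLRCONFIG_TYPES = ["booleans", "string", "strings", "tdates",
                    "tdoubles", "text_general", "tlongs"]


def used_types(schema_xml, extra_types=()):
    """Types referenced by field/dynamicField elements, plus any extras."""
    root = ET.fromstring(schema_xml)
    used = {el.get("type") for el in root.iter()
            if el.tag in ("field", "dynamicField")}
    used.discard(None)
    return used | set(extra_types)


def prune_field_types(schema_xml, extra_types=()):
    """Return the schema with unreferenced fieldType definitions removed."""
    keep = used_types(schema_xml, extra_types)
    root = ET.fromstring(schema_xml)
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag == "fieldType" and child.get("name") not in keep:
                parent.remove(child)
    return ET.tostring(root, encoding="unicode")
```

Combining the five types used by the schema with the seven from solrconfig.xml gives the nine definitions mentioned above, because string, strings and text_general appear in both lists.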

41. Deconstructing - support files
• Inside lang directory (38 files)
  – find lang -name 'stopwords_*.txt' | wc -l
    • stopwords_*.txt: 30 files
    • contractions_*.txt: 4 files
  – find lang -type f | egrep -v 'stopwords_|contractions_'
    • hyphenations_ga.txt, stemdict_nl.txt, stoptags_ja.txt, userdict_ja.txt

42. Support files - still in use?
• Check for usage
  – grep -o 'stopwords_.*.txt' managed-schema solrconfig.xml
  – grep -o 'contractions_.*.txt' ...
  – ...
• NO matches (we no longer have related types)
  – Delete the whole lang directory
• What about files just inside config directory
  – Don't need currency.xml, protwords.txt
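
The grep checks above can be generalized into one pass over all support files. The Python sketch below reports which files are never mentioned in the given configuration texts. It is a sketch under assumptions: the function name is hypothetical, and matching is a plain substring test on the file's base name, so it can err on the side of keeping a file.

```python
def unused_support_files(config_texts, support_files):
    """Return support files whose base name appears in none of the config texts."""
    combined = "\n".join(config_texts)
    return sorted(
        path for path in support_files
        # Compare by base name, since configs may or may not include "lang/".
        if path.rsplit("/", 1)[-1] not in combined
    )


# Made-up inputs, only to show the call shape.
schema = '<filter class="solr.StopFilterFactory" words="stopwords.txt"/>'
files = ["stopwords.txt", "protwords.txt", "lang/stopwords_en.txt"]
print(unused_support_files([schema], files))
```

In practice the texts would be the contents of managed-schema and solrconfig.xml, and the file list would come from walking the conf directory.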

46. The mystery of _root_
• In the original schema: no explanations
• Documentation: used for nested documents:
  To support nested documents, the schema must include an indexed/non-stored field _root_. The value of that field is populated automatically and is the same for all documents in the block, regardless of the inheritance depth.
• We are not using nested documents
• And neither does any other shipped example...

49. text_general support files
• stopwords.txt
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
• synonyms.txt
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
......
#-----------------------------------------------------------------------
#some test synonym mappings unlikely to appear in real input text
aaafoo => aaabar
bbbfoo => bbbfoo bbbbar
cccfoo => cccbar cccbaz
fooaaa,baraaa,bazaaa
# Some synonym groups specific to this example
GB,gib,gigabyte,gigabytes
MB,mib,megabyte,megabytes
Television, Televisions, TV, TVs
#notice we use "gib" instead of "GiB" so any WordDelimiterFilter coming
#after us won't split it into two words.
# Synonym mappings can be used for spelling correction too
pixima => pixma

50. text_general's empty stopwords
• No file => default stopwords => English
• Empty file => disabled stopwords
• Currently: NOT used

53. How far did we get
Sizes (line counts)      Before   After
  managed-schema*           481      26
  solrconfig.xml           1482     278
  params.json                20
File count in conf       Before   After
  .txt                       41       0
  .xml                        3       2
  .json                       1
  managed-schema (xml)        1
* already has no comments

54. Deconstructing - solrconfig.xml
• solrconfig.xml is more complex than schema
• Heterogeneous sections
• Nested definitions
• Alternative implementations (e.g. highlighter)
• Also remember
  – configoverlay.json: overrides solrconfig.xml
  – params.json: additional configuration parameters

58. add-unknown-fields-to-the-schema
• Famous "schemaless" mode
• Generic, but fully configurable
• Far from perfect
  – Remember, we had to manually pre-add fields
  – Development, not production
  – Has normalization side-effects (normalizes dates)
• Cannot remove it in our example

60. highlighter - the truth
• Highlighter searchComponent is in the default stack
• The params are a mix of the standard highlighter and the alternative FastVector highlighter
• Cannot use the FastVector version, as schema fields are missing termVectors, etc.
• And the standard highlighter params are the same as the implicit values
• Therefore, we can remove the WHOLE definition

64. solrconfig.xml - more stuff
• There is more that can be taken out
  – query section, since you have to tune it anyway
  – updateHandler, and revert to basic commits
  – jmx
  – enableRemoteStreaming: definitely take that out
• But keep velocity, browse, search support

65. Next action
• Join the (virtual) Solr Example Reading Group
  – Starts November 2016
  – Register at http://bit.ly/SolrERG
• Join mailing list at http://www.solr-start.com
  – Get the link to the presentation source
  – Learn about other similar projects
  – Get news of Solr articles and projects on the web