OCTOBER 11-14, 2016 • BOSTON, MA
Rebuilding Solr 6 examples – layer by layer
Alexandre Rafalovitch
www.solr-start.com
  
Who am I
•  Software developer with 20+ years of experience
   –  Including 3 years as Senior Tech Support (BEA Weblogic)
•  Solr popularizer
•  Published book author on Solr Indexing (for Solr 4.3)
•  Run http://www.solr-start.com resource site
•  Solr committer (since August 2016)
•  Past and present Solr focus on onboarding, usability, tooling, information sharing
  
Example catch-22
•  Search is a (surprisingly) complex area of expertise
•  Solr is a complex product
   – Wide
   – Deep
   – History-rich
•  And so are its many examples
  
Fasten the seatbelt
•  Review all of the (Solr 6.2) OOTB examples
•  Make a small one from scratch
•  Deconstruct a real shipped example
•  Next learning action...
  	
  
OOTB Examples – how many?
bin/solr start -e
  -e <example>  Name of the example to run; available examples:
      cloud:         SolrCloud example
      techproducts:  Comprehensive example illustrating many of Solr's core capabilities
      dih:           Data Import Handler
      schemaless:    Schema-less example
  
techproducts example
•  Used to be collection1
•  solr.home: example/techproducts/solr
   – Can restart with
     bin/solr start -s example/techproducts/solr
   – Actual core at
     example/techproducts/solr/techproducts
  
techproducts example (cont.)
•  Source configuration
   –  server/solr/configsets/sample_techproducts_configs
   –  Not actually used as a configset (the files are copied, not shared)
•  Can be rebuilt
   rm -rf example/techproducts
•  Has data (14 files of products, money, utf8 tests)
   bin/post -c techproducts example/exampledocs/*.xml
  
schemaless example
•  solr.home: example/schemaless/solr
•  Actual core: example/schemaless/solr/gettingstarted
•  Source configuration:
   –  server/solr/configsets/data_driven_schema_configs
   –  The config you get when you do not specify one:
      bin/solr create -c newcore
•  No data, but can take (nearly) anything:
   bin/post -c <name> example/exampledocs/*.xml
  
schemaless mode?
•  “Let us guess what you mean”
   –  Auto-guess field type based on first content occurrence
   –  Create explicit field definitions
      •  booleans, dates, numbers, strings
      •  Always multivalued (because: who knows?!?)
      •  Can be configured (URP chain in solrconfig.xml)
   –  Rewrites managed-schema (comments begone!)
   –  Makes search work with
      <copyField source="*" dest="_text_"/>
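The "guess from the first value" idea can be sketched in a few lines of Python. This is only an illustration of the concept, not Solr's implementation: the function name guess_field_type, the order of the checks, and the single accepted date format are assumptions made for the sketch. The returned names (booleans, tlongs, tdoubles, tdates, strings) are the multivalued types that appear in the data-driven configuration discussed later in the deck.

```python
from datetime import datetime


def guess_field_type(value: str) -> str:
    """Guess a multivalued field type name from the first value seen.

    Illustrative only: the check order and the date format are assumptions.
    """
    text = value.strip()
    if text.lower() in ("true", "false"):
        return "booleans"
    try:
        int(text)
        return "tlongs"
    except ValueError:
        pass
    try:
        float(text)
        return "tdoubles"
    except ValueError:
        pass
    try:
        # Assumed format: ISO-8601 in UTC, e.g. 2016-10-11T00:00:00Z
        datetime.strptime(text, "%Y-%m-%dT%H:%M:%SZ")
        return "tdates"
    except ValueError:
        # Anything unrecognised falls back to a plain string type
        return "strings"


if __name__ == "__main__":
    for sample in ("true", "42", "0.0", "2016-10-11T00:00:00Z", "batman"):
        print(sample, "->", guess_field_type(sample))
```

Because the decision is made on the first value only, a field whose first value looks like a number is typed as numeric even if later values are free text, which is one reason this mode suits development better than production.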
  
techproducts vs schemaless
•  Configured techproducts vs auto-detecting schemaless
•  Strings
   "name":"Test with some GB18030 encoded characters",      (techproducts)
   "name":["Test with some GB18030 encoded characters"],    (schemaless)
•  Numbers
   "price":0.0, "price_c":"0.0,USD",                        (techproducts)
   "price":[0.0],                                           (schemaless)
•  Booleans
   "inStock":true,                                          (techproducts)
   "inStock":[true],                                        (schemaless)
  
cloud example
•  Highly configurable (unless using -noprompt)
•  solr.home: example/cloud/nodeX/solr
•  Source configuration is a choice
   Please choose a configuration for the gettingstarted collection, available
   options are: basic_configs, data_driven_schema_configs, or
   sample_techproducts_configs [data_driven_schema_configs]
•  Can be rebuilt:
   bin/solr stop -all
   rm -rf example/cloud
•  Demonstrates Config API (configoverlay.json)
  
dih example(s)
•  Data import handler – legacy, but still kicking
•  solr.home: example/example-DIH/solr
•  Has 5 (five!) different cores
   –  db   - database import (example/example-DIH/hsqldb/ex.*)
   –  solr - import from another Solr core (configured for db core)
   –  mail - import from IMAP (needs some configuration)
   –  tika - import rich-content (example/exampledocs/solr-word.pdf)
   –  rss  - external XML feed (very broken right now)
•  Cannot be rebuilt – only emptied
   bin/post -c db -type 'application/json' -d '{delete: {query:"*:*"}}'
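The body posted by the emptying command is a small JSON delete-by-query document. A minimal Python sketch of building that body in strict JSON follows; the helper name delete_by_query_payload is hypothetical and only the payload shape is taken from the command above.

```python
import json


def delete_by_query_payload(query: str = "*:*") -> str:
    """Build a JSON delete-by-query body; the default matches every document."""
    return json.dumps({"delete": {"query": query}})


if __name__ == "__main__":
    # Print the body that could be sent to a core's update handler
    print(delete_by_query_payload())
```

The slide's command uses unquoted keys, which the example relies on Solr accepting; json.dumps always produces quoted keys, so the sketch emits the strict form of the same document.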
  
What about: bin/solr start?
•  solr.home: server/solr
•  No initial collection/cores, have to create explicitly:
   –  With script (see bin/solr create_core -h for details):
      bin/solr create -c <corename> -d <name or path>
   –  With Core Admin UI for non-SolrCloud:
      http://localhost:8983/solr/admin/cores?action=CREATE&…
   –  With Collection API for SolrCloud:
      http://localhost:8983/solr/admin/collections?action=CREATE&…
  
basic_configs configuration
•  Available for cloud example and explicit creation
•  Schemaless mode is configured, not enabled
•  “Minimal Solr configuration” !?!
   – managed-schema: 1005 lines
   – solrconfig.xml: 1484 lines
  
files example
•  Specifically tuned for file indexing
   – Augmented schemaless mode with language, content-type guessing
   – Custom /browse end-point
   – Source configuration: example/files/conf
   – Setup instructions: example/files/README.txt
   – Bring your own data
  
films example
•  Schemaless (Based on data_driven_schema_configs)
   –  Uses Schema API to add custom fields
   –  Uses schemaless for rest of fields
•  Comes with its own data (1100 film records)
•  Uses velocity (/browse), Schema API, Request Parameters API (params.json)
•  Setup instructions: example/films/README.txt
  
That was the good news
•  Many examples
•  Easy to get one running
•  Some come with data
•  Some you can throw your own data into
•  Lots of comments
  
This is the bad news
                        Files  Types  Fields  Dynamic  managed-schema  solrconfig.xml
                                              Fields   size            size
basic                      46     71       4       73            1005            1484
data_driven                46     71       4       73            1005            1482
techproducts              101     66      33       28            1149            1701
dih db                     62     62      31       28            1129            1490
dih tika                    6     61       3       27             901            1466
files                      69     73       9       73             517            1508
films (data_driven+)       46     71       8       73             481            1482
  
Tip – getting these numbers
•  XML extraction with XMLStarlet (XSLT CLI)
   –  xml sel -t -m "//fieldType" -v @name -n managed-schema
   –  xml sel -t -m "//copyField" -c . -n managed-schema |wc -l
   –  xml sel -t -m "//*[@docValues]" -v "concat(local-name(), ' ', @name, ' docValues:', @docValues)" -n managed-schema
   –  xml sel -t -m "//requestHandler" -v "@name" -n solrconfig.xml
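Where XMLStarlet is not installed, the same counts can be approximated with Python's standard library. The sketch below is an alternative to the commands above, not what the talk used; the function name schema_counts and the tiny inline schema are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature schema, used only to exercise the function
SAMPLE_SCHEMA = """<schema name="demo" version="1.6">
  <field name="id" type="string" indexed="true" stored="true"/>
  <field name="text" type="text_basic" indexed="true" stored="false" multiValued="true"/>
  <dynamicField name="*_s" type="string" indexed="true" stored="true"/>
  <copyField source="*" dest="text"/>
  <fieldType name="string" class="solr.StrField"/>
  <fieldType name="text_basic" class="solr.TextField"/>
</schema>"""


def schema_counts(schema_xml: str) -> dict:
    """Count fieldType, field, dynamicField and copyField elements in a schema."""
    root = ET.fromstring(schema_xml)
    tags = ("fieldType", "field", "dynamicField", "copyField")
    # ".//tag" matches elements with that exact tag at any depth below the root
    return {tag: len(root.findall(f".//{tag}")) for tag in tags}


if __name__ == "__main__":
    print(schema_counts(SAMPLE_SCHEMA))
```

For a real core, read the file first, for example by passing the contents of conf/managed-schema as the schema_xml argument.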
  
Why is it like this?
•  Many examples predate Solr Reference Guide
•  grep for options, possibilities, defaults
•  Each example is a kitchen sink

“Too much of a good thing is also a bad thing”
   Source: 1980s Soviet joke about Virtual Reality
  
Go small – managed-schema
  
<schema name="demo" version="1.6">
<dynamicField name="*" type="string"
indexed="true" stored="true" multiValued="true"/>
<field name="text" type="text_basic"
indexed="true" stored="false" multiValued="true"/>
<copyField source="*" dest="text"/>
…
Go small – managed-schema(2)
  
…
<fieldType name="string" class="solr.StrField"/>
<fieldType name="text_basic" class="solr.TextField">
<analyzer>
<tokenizer class="solr.LowerCaseTokenizerFactory" />
</analyzer>
</fieldType>
</schema>
	
  
Go small – solrconfig.xml
<config>
  <luceneMatchVersion>6.2.0</luceneMatchVersion>
  <requestHandler name="/select" class="solr.SearchHandler">
    <lst name="defaults">
      <str name="df">text</str>
    </lst>
  </requestHandler>
</config>
Go small – load and test
•  bin/solr create -c demo -d .../demo-config/
•  bin/post -c demo example/exampledocs/*.xml
•  Test it works, using HTTPie (HTTP CLI)
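The test step boils down to requesting the /select handler of the demo core. As a sketch that needs no extra tools, the query URL can be assembled in Python; the helper name select_url, the default base URL, and the wt=json parameter are assumptions chosen for the illustration, and the code only builds the URL rather than contacting a server.

```python
from urllib.parse import urlencode


def select_url(core: str, query: str,
               base: str = "http://localhost:8983/solr") -> str:
    """Build a /select URL for the given core and query string."""
    # urlencode percent-escapes characters such as '*' and ':' in the query
    params = urlencode({"q": query, "wt": "json"})
    return f"{base}/{core}/select?{params}"


if __name__ == "__main__":
    # A match-all query against the minimal "demo" core
    print(select_url("demo", "*:*"))
```

The printed URL can then be opened in a browser or passed to any HTTP client, including HTTPie.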
  
	
  
Go small - review
•  Minimal example could be very minimal
•  Some things will not work
   –  No uniqueKey – no way to update documents, no SolrCloud
   –  No _version_ – no SolrCloud
   –  Everything is multiValued – no sorting
   –  copyField * => text – no meaningful relevancy, no specialized analyzer chain processing
  
Deconstructing films example
•  bin/solr create -c films
•  curl http://localhost:8983/solr/films/schema ... (add name, initial_release_date)
•  Index 1100 records from
   –  (Solr) XML,
   –  (generic) JSON (doc), or
   –  CSV format
•  Search for batman
•  Use /browse end-point and search for batman
•  Enable highlighting in results
  
Initial stats for films core
Sizes (line counts)
   managed-schema*       481
   solrconfig.xml        1482
   params.json           20
File count in conf
   .txt                  41
   .xml                  3
   .json                 1
   managed-schema (xml)  1
* already has no comments
  
Deconstructing – just straight tags
•  managed-schema lost comments during construction
•  Let's remove comments from solrconfig.xml
•  xml ed -L -d "//comment()" solrconfig.xml
   – -L: Edit in place
   – -d: Delete XPATH
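The comment-stripping step can also be sketched with Python's standard library, relying on the fact that ElementTree's default parser discards comments. This is an alternative to the XMLStarlet command above, not the talk's method; the function name strip_comments and the inline sample are hypothetical, and attribute quoting and whitespace may differ from the input after re-serialization.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment in the style of solrconfig.xml
SAMPLE_CONFIG = """<config>
  <!-- Controls which Lucene version behaviour to emulate -->
  <luceneMatchVersion>6.2.0</luceneMatchVersion>
  <!-- The default search handler -->
  <requestHandler name="/select" class="solr.SearchHandler"/>
</config>"""


def strip_comments(xml_text: str) -> str:
    """Return the XML without comments and without the blank lines they leave."""
    root = ET.fromstring(xml_text)  # the default parser drops comment nodes
    serialized = ET.tostring(root, encoding="unicode")
    # Removing a comment leaves its surrounding whitespace, so drop empty lines
    return "\n".join(line for line in serialized.splitlines() if line.strip())


if __name__ == "__main__":
    print(strip_comments(SAMPLE_CONFIG))
```

Counting the lines of the result before and after gives the same kind of size comparison the following slides track.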
  
solrconfig.xml without comments
Sizes (line counts)
   managed-schema        481
   solrconfig.xml        1482 -> 278
   params.json           20
File count in conf
   .txt                  41
   .xml                  3
   .json                 1
   managed-schema (xml)  1
  
Deconstructing – what to clean
•  Currently
   –  (explicit) fields: 8
   –  dynamic fields: 73
      •  xml sel -t -m "//dynamicField" -v @name -n managed-schema |wc -l
   –  types: 71
   –  copyFields: 1
•  Let's start with dynamic fields
  
Deconstructing – dynamic fields
•  Used dynamic fields
   – do NOT modify schema
   – DO show up in Admin UI, if used
   – Example from different schema:
      •  Used/matched fields
      •  Generic definitions
  
Deconstructing – in use dynamic fields
•  NO dynamic fields are used
   – * is a copyField instruction
•  Can remove them all
•  xml ed -L -d "//dynamicField" managed-schema
  
Remove dynamicFields
Sizes (line counts)
   managed-schema        481 -> 409
   solrconfig.xml        278
   params.json           20
File count in conf
   .txt                  41
   .xml                  3
   .json                 1
   managed-schema (xml)  1
  
Deconstructing – field types
•  How many types out of 71 do we use?
   –  xml sel -t -m "//field|//dynamicField" -v "@type" -n conf/managed-schema |sort -u
   –  long, string, strings, tdate, text_general
•  But also some in solrconfig.xml
   –  booleans, string, strings, tdates, tdoubles, text_general, tlongs
•  Combined total: 9 field type definitions
•  Delete the rest (by hand)
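Finding the types that are safe to delete is a set difference: defined types minus the types referenced by fields, dynamic fields, and solrconfig.xml. A minimal Python sketch under those assumptions follows; the function name unused_field_types, the extra_used parameter, and the sample schema (including its field and type names) are illustrative and not taken from the shipped films configuration.

```python
import xml.etree.ElementTree as ET

# Hypothetical schema fragment used only for the demonstration
SAMPLE_SCHEMA = """<schema name="demo" version="1.6">
  <field name="name" type="text_general"/>
  <field name="genre" type="strings"/>
  <fieldType name="strings" class="solr.StrField" multiValued="true"/>
  <fieldType name="text_general" class="solr.TextField"/>
  <fieldType name="tlongs" class="solr.TrieLongField" multiValued="true"/>
  <fieldType name="text_fr" class="solr.TextField"/>
</schema>"""


def unused_field_types(schema_xml: str, extra_used=()) -> list:
    """List fieldType names that no field, dynamicField or extra reference uses."""
    root = ET.fromstring(schema_xml)
    defined = {ft.get("name") for ft in root.iter("fieldType")}
    used = {el.get("type")
            for tag in ("field", "dynamicField")
            for el in root.iter(tag)}
    # Types referenced elsewhere (for example in solrconfig.xml) are passed in
    used.update(extra_used)
    return sorted(defined - used)


if __name__ == "__main__":
    print(unused_field_types(SAMPLE_SCHEMA, extra_used={"tlongs"}))
```

The extra_used argument matters: as the slide notes, the schemaless update chain in solrconfig.xml references types that no explicit field uses yet.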
  
Remove no-longer used types
Sizes (line counts)
   managed-schema        409 -> 34 (!!!)
   solrconfig.xml        278
   params.json           20
File count in conf
   .txt                  41
   .xml                  3
   .json                 1
   managed-schema (xml)  1
  
Deconstructing – support files
•  Inside lang directory (38 files)
   –  find lang -name 'stopwords_*.txt' | wc -l
      •  stopwords_*.txt: 30 files
      •  contractions_*.txt: 4 files
   –  find lang -type f |egrep -v 'stopwords_|contractions_'
      •  hyphenations_ga.txt, stemdict_nl.txt, stoptags_ja.txt, userdict_ja.txt
  
Support files – still in use?
•  Check for usage
   –  grep -o 'stopwords_.*.txt' managed-schema solrconfig.xml
   –  grep -o 'contractions_.*.txt' ...
   –  ...
•  NO Matches (we no longer have related types)
   –  Delete the whole lang directory
•  What about files just inside config directory
   –  Don't need currency.xml, protwords.txt
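The usage check can be generalized: collect every quoted attribute value that ends in .txt or .xml and compare it with the files present in conf. The Python sketch below shows one way to do that; the function name unreferenced_files, the regular expression, and the sample fragment are assumptions made for illustration, and a real run would read managed-schema and solrconfig.xml from disk.

```python
import re

# Hypothetical fragment of a field type that still references one support file
SAMPLE_CONFIG = (
    '<filter class="solr.StopFilterFactory" '
    'words="stopwords.txt" ignoreCase="true"/>'
)

# Matches double-quoted values ending in .txt or .xml
FILE_REFERENCE = re.compile(r'"([^"]+\.(?:txt|xml))"')


def unreferenced_files(config_text: str, candidates) -> list:
    """Return candidate file names that the configuration text never mentions."""
    referenced = set(FILE_REFERENCE.findall(config_text))
    return sorted(set(candidates) - referenced)


if __name__ == "__main__":
    files = ["stopwords.txt", "synonyms.txt", "protwords.txt"]
    print(unreferenced_files(SAMPLE_CONFIG, files))
```

Files reported as unreferenced are only candidates for deletion; a reference could still be hidden in a default value or a parameter file, so a quick search test after removal is worthwhile.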
  	
  
Remove no-longer used types
Sizes (line counts)
   managed-schema        34
   solrconfig.xml        278
   params.json           20
File count in conf
   .txt                  41 -> 2
   .xml                  3 -> 2
   .json                 1
   managed-schema (xml)  1
  
Deconstructing – actual field usage
Actual field usage - _root_
  
The mystery of _root_
•  In the original schema – no explanations
•  Documentation – used for nested documents:
   To support nested documents, the schema must include an indexed/non-stored
   field _root_. The value of that field is populated automatically and is the same for
   all documents in the block, regardless of the inheritance depth.
•  We are not using nested documents
•  And neither does any other shipped example...
  
	
  
Remove _root_
Sizes (line counts)
   managed-schema        34 -> 33
   solrconfig.xml        278
   params.json           20
File count in conf
   .txt                  2
   .xml                  2
   .json                 1
   managed-schema (xml)  1
  
Deconstructing – text_general type
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100"
           multiValued="true">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <filter class="solr.SynonymFilterFactory" expand="true"
            ignoreCase="true" synonyms="synonyms.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
  
text_general support files
•  stopwords.txt
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
•  synonyms.txt
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
# ......

#-----------------------------------------------------------------------
#some test synonym mappings unlikely to appear in real input text
aaafoo => aaabar
bbbfoo => bbbfoo bbbbar
cccfoo => cccbar cccbaz
fooaaa,baraaa,bazaaa

# Some synonym groups specific to this example
GB,gib,gigabyte,gigabytes
MB,mib,megabyte,megabytes
Television, Televisions, TV, TVs
#notice we use "gib" instead of "GiB" so any WordDelimiterFilter coming
#after us won't split it into two words.
# Synonym mappings can be used for spelling correction too
pixima => pixma
  
text_general's empty stopwords
•  No file => default stopwords => English
•  Empty file => disabled stopwords
•  Currently – NOT used
  
text_general simplified definition
<fieldType name="text_general" class="solr.TextField"
           positionIncrementGap="100" multiValued="true">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
  
Remove stopwords and synonyms
Sizes (line counts)
   managed-schema        33 -> 26
   solrconfig.xml        278
   params.json           20
File count in conf
   .txt                  2 -> 0
   .xml                  2
   .json                 1
   managed-schema (xml)  1
  
How far did we get
Sizes (line counts)
   managed-schema*       481 -> 26
   solrconfig.xml        1482 -> 278
   params.json           20
File count in conf
   .txt                  41 -> 0
   .xml                  3 -> 2
   .json                 1
   managed-schema (xml)  1
* already has no comments
  
Deconstructing – solrconfig.xml
•  solrconfig.xml is more complex than schema
•  Heterogeneous Sections
•  Nested definitions
•  Alternative implementations (e.g. highlighter)
•  Also remember
   –  configoverlay.json – overrides solrconfig.xml
   –  params.json – additional configuration parameters
  
solrconfig.xml – feature counts
   11  requestHandler
    8  lib
    5  searchComponent
    3  queryResponseWriter
    2  initParams
    1  updateRequestProcessorChain
    1  updateHandler
    1  requestDispatcher
    1  query
    1  luceneMatchVersion
    1  jmx
    1  indexConfig
    1  directoryFactory
    1  dataDir
    1  codecFactory
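A tally like the one above can be reproduced by counting the direct children of the config root by tag name. The Python sketch below is one way to do it and is not necessarily how the numbers in the talk were produced; the function name top_level_counts and the small sample document are hypothetical.

```python
from collections import Counter
import xml.etree.ElementTree as ET

# Hypothetical miniature solrconfig.xml used only for the demonstration
SAMPLE_CONFIG = """<config>
  <luceneMatchVersion>6.2.0</luceneMatchVersion>
  <requestHandler name="/select" class="solr.SearchHandler"/>
  <requestHandler name="/query" class="solr.SearchHandler"/>
  <searchComponent name="terms" class="solr.TermsComponent"/>
</config>"""


def top_level_counts(config_xml: str) -> Counter:
    """Count the direct children of the root element, grouped by tag name."""
    root = ET.fromstring(config_xml)
    # Iterating an element yields only its direct child elements
    return Counter(child.tag for child in root)


if __name__ == "__main__":
    for tag, count in top_level_counts(SAMPLE_CONFIG).most_common():
        print(count, tag)
```

Sorting by count, as most_common does, puts the sections that dominate the file at the top, which is a useful guide to where trimming pays off first.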
  
solrconfig.xml – line counts
55:<updateRequestProcessorChain name="add-unknown-fields-to-the-schema">
52:<searchComponent class="solr.HighlightComponent" name="highlight">
18:<query>
17:<requestHandler name="/spell" class="solr.SearchHandler" startup="lazy">
15:<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
13:<updateHandler class="solr.DirectUpdateHandler2">
9:<requestHandler name="/terms" class="solr.SearchHandler" startup="lazy">
8:<requestHandler name="/elevate" class="solr.SearchHandler" startup="lazy">
8:<requestHandler name="/tvrh" class="solr.SearchHandler" startup="lazy">
7:<requestHandler name="/update/extract" startup="lazy" class="solr.extraction.ExtractingRequestHandler">
7:<requestHandler name="/query" class="solr.SearchHandler">
6:<requestHandler name="/debug/dump" class="solr.DumpRequestHandler">
......
  
Remember, this works!
<config>
  <luceneMatchVersion>6.2.0</luceneMatchVersion>
  <requestHandler name="/select" class="solr.SearchHandler">
    <lst name="defaults">
      <str name="df">text</str>
    </lst>
  </requestHandler>
</config>
add-unknown-fields-to-the-schema
•  Famous "schemaless" mode
•  Generic, but fully configurable
•  Far from perfect
   –  Remember, we had to manually pre-add fields
   –  Development, not production
   –  Has normalization side-effects (normalizes dates)
•  Cannot remove it in our example
  
solrconfig.xml - highlighter
<searchComponent class="solr.HighlightComponent" name="highlight">
  <highlighting>
    <fragmenter name="gap" default="true"
                class="solr.highlight.GapFragmenter">
      <lst name="defaults">
        <int name="hl.fragsize">100</int>
      </lst>
    </fragmenter>
    <fragmenter name="regex" class="solr.highlight.RegexFragmenter">
      <lst name="defaults">
        <int name="hl.fragsize">70</int>
        <float name="hl.regex.slop">0.5</float>
        <str name="hl.regex.pattern">[-\w ,/\n\&quot;&apos;]{20,200}</str>
      </lst>
    </fragmenter>
    <formatter name="html" default="true" class="solr.highlight.HtmlFormatter">
      <lst name="defaults">
        <str name="hl.simple.pre"><![CDATA[<em>]]></str>
        <str name="hl.simple.post"><![CDATA[</em>]]></str>
      </lst>
    </formatter>
    <encoder name="html" class="solr.highlight.HtmlEncoder"/>
    <fragListBuilder name="simple" class="solr.highlight.SimpleFragListBuilder"/>
    <fragListBuilder name="single" class="solr.highlight.SingleFragListBuilder"/>
    .......
•  fragmenters
•  encoders
•  fragListBuilders
•  fragmentBuilders
•  boundaryScanners
•  ....
  
highlighter – the truth
•  Highlighter searchComponent is in the default stack
•  The params are a mix of the standard highlighter and the alternative FastVector highlighter
•  Cannot use the FastVector version, as schema fields are missing termVectors, etc.
•  And the standard highlighter params are the same as the implicit values
•  Therefore, we can remove the WHOLE definition
  
Remove highlighter
Sizes (line counts)
   managed-schema        26
   solrconfig.xml        278 -> 226
   params.json           20
File count in conf
   .txt                  0
   .xml                  2
   .json                 1
   managed-schema (xml)  1
  
Other searchComponents
•  Not on the default stack
   –  spellcheck
   –  term
   –  termVector
   –  elevator
•  Have dedicated requestHandlers
•  Inception (example within example)
•  Can be deleted
   –  also delete elevate.xml
15:<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
17:<requestHandler name="/spell" class="solr.SearchHandler" startup="lazy">
1:<searchComponent name="terms" class="solr.TermsComponent"/>
9:<requestHandler name="/terms" class="solr.SearchHandler" startup="lazy">
1:<searchComponent name="tvComponent" class="solr.TermVectorComponent"/>
8:<requestHandler name="/tvrh" class="solr.SearchHandler" startup="lazy">
4:<searchComponent name="elevator" class="solr.QueryElevationComponent">
8:<requestHandler name="/elevate" class="solr.SearchHandler" startup="lazy">
  
	
  
Remove custom searchComponents
Sizes (line counts)
   managed-schema        26
   solrconfig.xml        226 -> 163
   params.json           20
File count in conf
   .txt                  0
   .xml                  2 -> 1
   .json                 1
   managed-schema (xml)  1
  
solrconfig.xml – more stuff
•  There is more that can be taken out
   – query section, since you have to tune it anyway
   – updateHandler, and revert to basic commits
   – jmx
   – enableRemoteStreaming – definitely take that out
•  But keep velocity, browse, search support
  
Next action
•  Join the (virtual) Solr Example Reading Group
   –  Starts November 2016
   –  Register at http://bit.ly/SolrERG
•  Join mailing list at http://www.solr-start.com
   –  Get the link to the presentation source
   –  Learn about other similar projects
   –  Get news of Solr articles and projects on the web
  

Webinar: Site Search in an Hour with Fusion
 
Fusion 3 Overview Webinar
Fusion 3 Overview Webinar Fusion 3 Overview Webinar
Fusion 3 Overview Webinar
 
Search++: Cognitive transformation of human-system interaction: Presented by ...
Search++: Cognitive transformation of human-system interaction: Presented by ...Search++: Cognitive transformation of human-system interaction: Presented by ...
Search++: Cognitive transformation of human-system interaction: Presented by ...
 
Improving Enterprise Findability: Presented by Jayesh Govindarajan, Salesforce
Improving Enterprise Findability: Presented by Jayesh Govindarajan, SalesforceImproving Enterprise Findability: Presented by Jayesh Govindarajan, Salesforce
Improving Enterprise Findability: Presented by Jayesh Govindarajan, Salesforce
 
Building and Running Solr-as-a-Service: Presented by Shai Erera, IBM
Building and Running Solr-as-a-Service: Presented by Shai Erera, IBMBuilding and Running Solr-as-a-Service: Presented by Shai Erera, IBM
Building and Running Solr-as-a-Service: Presented by Shai Erera, IBM
 
Leveraging the Power of Solr with Spark: Presented by Johannes Weigend, QAware
Leveraging the Power of Solr with Spark: Presented by Johannes Weigend, QAwareLeveraging the Power of Solr with Spark: Presented by Johannes Weigend, QAware
Leveraging the Power of Solr with Spark: Presented by Johannes Weigend, QAware
 
Working with Deeply Nested Documents in Apache Solr: Presented by Anshum Gupt...
Working with Deeply Nested Documents in Apache Solr: Presented by Anshum Gupt...Working with Deeply Nested Documents in Apache Solr: Presented by Anshum Gupt...
Working with Deeply Nested Documents in Apache Solr: Presented by Anshum Gupt...
 
Building a Vibrant Search Ecosystem @ Bloomberg: Presented by Steven Bower & ...
Building a Vibrant Search Ecosystem @ Bloomberg: Presented by Steven Bower & ...Building a Vibrant Search Ecosystem @ Bloomberg: Presented by Steven Bower & ...
Building a Vibrant Search Ecosystem @ Bloomberg: Presented by Steven Bower & ...
 
Near Real Time Indexing: Presented by Umesh Prasad & Thejus V M, Flipkart
Near Real Time Indexing: Presented by Umesh Prasad & Thejus V M, FlipkartNear Real Time Indexing: Presented by Umesh Prasad & Thejus V M, Flipkart
Near Real Time Indexing: Presented by Umesh Prasad & Thejus V M, Flipkart
 
Webinar: Ecommerce, Rules, and Relevance
Webinar: Ecommerce, Rules, and RelevanceWebinar: Ecommerce, Rules, and Relevance
Webinar: Ecommerce, Rules, and Relevance
 
The Many Facets of Apache Solr - Yonik Seeley
The Many Facets of Apache Solr - Yonik SeeleyThe Many Facets of Apache Solr - Yonik Seeley
The Many Facets of Apache Solr - Yonik Seeley
 
Beyond the Google Search Appliance with Lucidworks Fusion
Beyond the Google Search Appliance with Lucidworks Fusion Beyond the Google Search Appliance with Lucidworks Fusion
Beyond the Google Search Appliance with Lucidworks Fusion
 
Queue Based Solr Indexing with Collection Management: Presented by Devansh Dh...
Queue Based Solr Indexing with Collection Management: Presented by Devansh Dh...Queue Based Solr Indexing with Collection Management: Presented by Devansh Dh...
Queue Based Solr Indexing with Collection Management: Presented by Devansh Dh...
 
Parallel SQL and Analytics with Solr: Presented by Yonik Seeley, Cloudera
Parallel SQL and Analytics with Solr: Presented by Yonik Seeley, ClouderaParallel SQL and Analytics with Solr: Presented by Yonik Seeley, Cloudera
Parallel SQL and Analytics with Solr: Presented by Yonik Seeley, Cloudera
 
Events Processing and Data Analysis with Lucidworks Fusion: Presented by Kira...
Events Processing and Data Analysis with Lucidworks Fusion: Presented by Kira...Events Processing and Data Analysis with Lucidworks Fusion: Presented by Kira...
Events Processing and Data Analysis with Lucidworks Fusion: Presented by Kira...
 
Why Is My Solr Slow?: Presented by Mike Drob, Cloudera
Why Is My Solr Slow?: Presented by Mike Drob, ClouderaWhy Is My Solr Slow?: Presented by Mike Drob, Cloudera
Why Is My Solr Slow?: Presented by Mike Drob, Cloudera
 
LinkedIn Skills: RecSys Conference 2014
LinkedIn Skills: RecSys Conference 2014LinkedIn Skills: RecSys Conference 2014
LinkedIn Skills: RecSys Conference 2014
 
Automotive Information Research Driven by Apache Solr: Presented by Mario-Lea...
Automotive Information Research Driven by Apache Solr: Presented by Mario-Lea...Automotive Information Research Driven by Apache Solr: Presented by Mario-Lea...
Automotive Information Research Driven by Apache Solr: Presented by Mario-Lea...
 

Similar a Rebuilding Solr 6 Examples - Layer by Layer: Presented by Alexandre Rafalovitch, Search Stack Solutions

Rapid Prototyping with Solr
Rapid Prototyping with SolrRapid Prototyping with Solr
Rapid Prototyping with Solr
Erik Hatcher
 
Dev8d Apache Solr Tutorial
Dev8d Apache Solr TutorialDev8d Apache Solr Tutorial
Dev8d Apache Solr Tutorial
Sourcesense
 
Javascript done right - Open Web Camp III
Javascript done right - Open Web Camp IIIJavascript done right - Open Web Camp III
Javascript done right - Open Web Camp III
Dirk Ginader
 
Rapid Prototyping with Solr
Rapid Prototyping with SolrRapid Prototyping with Solr
Rapid Prototyping with Solr
Erik Hatcher
 
The openCypher Project - An Open Graph Query Language
The openCypher Project - An Open Graph Query LanguageThe openCypher Project - An Open Graph Query Language
The openCypher Project - An Open Graph Query Language
Neo4j
 
Solr Troubleshooting - Treemap Approach: Presented by Alexandre Rafolovitch, ...
Solr Troubleshooting - Treemap Approach: Presented by Alexandre Rafolovitch, ...Solr Troubleshooting - Treemap Approach: Presented by Alexandre Rafolovitch, ...
Solr Troubleshooting - Treemap Approach: Presented by Alexandre Rafolovitch, ...
Lucidworks
 

Similar a Rebuilding Solr 6 Examples - Layer by Layer: Presented by Alexandre Rafalovitch, Search Stack Solutions (20)

Tanel Poder - Scripts and Tools short
Tanel Poder - Scripts and Tools shortTanel Poder - Scripts and Tools short
Tanel Poder - Scripts and Tools short
 
Rapid Prototyping with Solr
Rapid Prototyping with SolrRapid Prototyping with Solr
Rapid Prototyping with Solr
 
Rapid prototyping with solr - By Erik Hatcher
Rapid prototyping with solr -  By Erik Hatcher Rapid prototyping with solr -  By Erik Hatcher
Rapid prototyping with solr - By Erik Hatcher
 
Tanel Poder Oracle Scripts and Tools (2010)
Tanel Poder Oracle Scripts and Tools (2010)Tanel Poder Oracle Scripts and Tools (2010)
Tanel Poder Oracle Scripts and Tools (2010)
 
Solr Masterclass Bangkok, June 2014
Solr Masterclass Bangkok, June 2014Solr Masterclass Bangkok, June 2014
Solr Masterclass Bangkok, June 2014
 
Important work-arounds for making ASS multi-lingual
Important work-arounds for making ASS multi-lingualImportant work-arounds for making ASS multi-lingual
Important work-arounds for making ASS multi-lingual
 
Rapid Prototyping with Solr
Rapid Prototyping with SolrRapid Prototyping with Solr
Rapid Prototyping with Solr
 
Logstash
LogstashLogstash
Logstash
 
Dev8d Apache Solr Tutorial
Dev8d Apache Solr TutorialDev8d Apache Solr Tutorial
Dev8d Apache Solr Tutorial
 
Javascript done right - Open Web Camp III
Javascript done right - Open Web Camp IIIJavascript done right - Open Web Camp III
Javascript done right - Open Web Camp III
 
Rapid Prototyping with Solr
Rapid Prototyping with SolrRapid Prototyping with Solr
Rapid Prototyping with Solr
 
Using Document Databases with TYPO3 Flow
Using Document Databases with TYPO3 FlowUsing Document Databases with TYPO3 Flow
Using Document Databases with TYPO3 Flow
 
SolrCloud on Hadoop
SolrCloud on HadoopSolrCloud on Hadoop
SolrCloud on Hadoop
 
The openCypher Project - An Open Graph Query Language
The openCypher Project - An Open Graph Query LanguageThe openCypher Project - An Open Graph Query Language
The openCypher Project - An Open Graph Query Language
 
Solr As A SparkSQL DataSource
Solr As A SparkSQL DataSourceSolr As A SparkSQL DataSource
Solr As A SparkSQL DataSource
 
Hadoop spark online demo
Hadoop spark online demoHadoop spark online demo
Hadoop spark online demo
 
Information Retrieval - Data Science Bootcamp
Information Retrieval - Data Science BootcampInformation Retrieval - Data Science Bootcamp
Information Retrieval - Data Science Bootcamp
 
OpenCms Days 2014 - Using the SOLR collector
OpenCms Days 2014 - Using the SOLR collectorOpenCms Days 2014 - Using the SOLR collector
OpenCms Days 2014 - Using the SOLR collector
 
Solr Troubleshooting - TreeMap approach
Solr Troubleshooting - TreeMap approachSolr Troubleshooting - TreeMap approach
Solr Troubleshooting - TreeMap approach
 
Solr Troubleshooting - Treemap Approach: Presented by Alexandre Rafolovitch, ...
Solr Troubleshooting - Treemap Approach: Presented by Alexandre Rafolovitch, ...Solr Troubleshooting - Treemap Approach: Presented by Alexandre Rafolovitch, ...
Solr Troubleshooting - Treemap Approach: Presented by Alexandre Rafolovitch, ...
 

Más de Lucidworks

Intelligent Insight Driven Policing with MC+A, Toronto Police Service and Luc...
Intelligent Insight Driven Policing with MC+A, Toronto Police Service and Luc...Intelligent Insight Driven Policing with MC+A, Toronto Police Service and Luc...
Intelligent Insight Driven Policing with MC+A, Toronto Police Service and Luc...
Lucidworks
 

Más de Lucidworks (20)

Search is the Tip of the Spear for Your B2B eCommerce Strategy
Search is the Tip of the Spear for Your B2B eCommerce StrategySearch is the Tip of the Spear for Your B2B eCommerce Strategy
Search is the Tip of the Spear for Your B2B eCommerce Strategy
 
Drive Agent Effectiveness in Salesforce
Drive Agent Effectiveness in SalesforceDrive Agent Effectiveness in Salesforce
Drive Agent Effectiveness in Salesforce
 
How Crate & Barrel Connects Shoppers with Relevant Products
How Crate & Barrel Connects Shoppers with Relevant ProductsHow Crate & Barrel Connects Shoppers with Relevant Products
How Crate & Barrel Connects Shoppers with Relevant Products
 
Lucidworks & IMRG Webinar – Best-In-Class Retail Product Discovery
Lucidworks & IMRG Webinar – Best-In-Class Retail Product DiscoveryLucidworks & IMRG Webinar – Best-In-Class Retail Product Discovery
Lucidworks & IMRG Webinar – Best-In-Class Retail Product Discovery
 
Connected Experiences Are Personalized Experiences
Connected Experiences Are Personalized ExperiencesConnected Experiences Are Personalized Experiences
Connected Experiences Are Personalized Experiences
 
Intelligent Insight Driven Policing with MC+A, Toronto Police Service and Luc...
Intelligent Insight Driven Policing with MC+A, Toronto Police Service and Luc...Intelligent Insight Driven Policing with MC+A, Toronto Police Service and Luc...
Intelligent Insight Driven Policing with MC+A, Toronto Police Service and Luc...
 
[Webinar] Intelligent Policing. Leveraging Data to more effectively Serve Com...
[Webinar] Intelligent Policing. Leveraging Data to more effectively Serve Com...[Webinar] Intelligent Policing. Leveraging Data to more effectively Serve Com...
[Webinar] Intelligent Policing. Leveraging Data to more effectively Serve Com...
 
Preparing for Peak in Ecommerce | eTail Asia 2020
Preparing for Peak in Ecommerce | eTail Asia 2020Preparing for Peak in Ecommerce | eTail Asia 2020
Preparing for Peak in Ecommerce | eTail Asia 2020
 
Accelerate The Path To Purchase With Product Discovery at Retail Innovation C...
Accelerate The Path To Purchase With Product Discovery at Retail Innovation C...Accelerate The Path To Purchase With Product Discovery at Retail Innovation C...
Accelerate The Path To Purchase With Product Discovery at Retail Innovation C...
 
AI-Powered Linguistics and Search with Fusion and Rosette
AI-Powered Linguistics and Search with Fusion and RosetteAI-Powered Linguistics and Search with Fusion and Rosette
AI-Powered Linguistics and Search with Fusion and Rosette
 
The Service Industry After COVID-19: The Soul of Service in a Virtual Moment
The Service Industry After COVID-19: The Soul of Service in a Virtual MomentThe Service Industry After COVID-19: The Soul of Service in a Virtual Moment
The Service Industry After COVID-19: The Soul of Service in a Virtual Moment
 
Webinar: Smart answers for employee and customer support after covid 19 - Europe
Webinar: Smart answers for employee and customer support after covid 19 - EuropeWebinar: Smart answers for employee and customer support after covid 19 - Europe
Webinar: Smart answers for employee and customer support after covid 19 - Europe
 
Smart Answers for Employee and Customer Support After COVID-19
Smart Answers for Employee and Customer Support After COVID-19Smart Answers for Employee and Customer Support After COVID-19
Smart Answers for Employee and Customer Support After COVID-19
 
Applying AI & Search in Europe - featuring 451 Research
Applying AI & Search in Europe - featuring 451 ResearchApplying AI & Search in Europe - featuring 451 Research
Applying AI & Search in Europe - featuring 451 Research
 
Webinar: Accelerate Data Science with Fusion 5.1
Webinar: Accelerate Data Science with Fusion 5.1Webinar: Accelerate Data Science with Fusion 5.1
Webinar: Accelerate Data Science with Fusion 5.1
 
Webinar: 5 Must-Have Items You Need for Your 2020 Ecommerce Strategy
Webinar: 5 Must-Have Items You Need for Your 2020 Ecommerce StrategyWebinar: 5 Must-Have Items You Need for Your 2020 Ecommerce Strategy
Webinar: 5 Must-Have Items You Need for Your 2020 Ecommerce Strategy
 
Where Search Meets Science and Style Meets Savings: Nordstrom Rack's Journey ...
Where Search Meets Science and Style Meets Savings: Nordstrom Rack's Journey ...Where Search Meets Science and Style Meets Savings: Nordstrom Rack's Journey ...
Where Search Meets Science and Style Meets Savings: Nordstrom Rack's Journey ...
 
Apply Knowledge Graphs and Search for Real-World Decision Intelligence
Apply Knowledge Graphs and Search for Real-World Decision IntelligenceApply Knowledge Graphs and Search for Real-World Decision Intelligence
Apply Knowledge Graphs and Search for Real-World Decision Intelligence
 
Webinar: Building a Business Case for Enterprise Search
Webinar: Building a Business Case for Enterprise SearchWebinar: Building a Business Case for Enterprise Search
Webinar: Building a Business Case for Enterprise Search
 
Why Insight Engines Matter in 2020 and Beyond
Why Insight Engines Matter in 2020 and BeyondWhy Insight Engines Matter in 2020 and Beyond
Why Insight Engines Matter in 2020 and Beyond
 

Último

Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
vu2urc
 

Último (20)

Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of Brazil
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 

Rebuilding Solr 6 Examples - Layer by Layer: Presented by Alexandre Rafalovitch, Search Stack Solutions

  • 1. OCTOBER 11-14, 2016 • BOSTON, MA
  • 2. Rebuilding Solr 6 examples – layer by layer
      Alexandre Rafalovitch
      www.solr-start.com
  • 3. Who am I
      • Software developer with 20+ years of experience
          – Including 3 years as Senior Tech Support (BEA WebLogic)
      • Solr popularizer
      • Published book author on Solr indexing (for Solr 4.3)
      • Run the http://www.solr-start.com resource site
      • Solr committer (since August 2016)
      • Past and present Solr focus on onboarding, usability, tooling, information sharing
  • 4. Example catch-22
      • Search is a surprisingly complex area of expertise
      • Solr is a complex product
          – Wide
          – Deep
          – History-rich
      • And so are its many examples
  • 5. Fasten the seatbelt
      • Review all of the (Solr 6.2) OOTB examples
      • Make a small one from scratch
      • Deconstruct a real shipped example
      • Next learning action...
  • 6. OOTB Examples – how many?
      bin/solr start -e
        -e <example>  Name of the example to run; available examples:
            cloud:         SolrCloud example
            techproducts:  Comprehensive example illustrating many of Solr's core capabilities
            dih:           Data Import Handler
            schemaless:    Schema-less example
  • 7. techproducts example
      • Used to be collection1
      • solr.home: example/techproducts/solr
          – Can restart with
            bin/solr start -s example/techproducts/solr
          – Actual core at
            example/techproducts/solr/techproducts
  • 8. techproducts example (cont.)
      • Source configuration
          – server/solr/configsets/sample_techproducts_configs
          – Not actually a configset (copy, not share)
      • Can be rebuilt
            rm -rf example/techproducts
      • Has data (14 files of products, money, utf8 tests)
            bin/post -c techproducts example/exampledocs/*.xml
  • 9. schemaless example
      • solr.home: example/schemaless/solr
      • Actual core: example/schemaless/solr/gettingstarted
      • Source configuration:
          – server/solr/configsets/data_driven_schema_configs
          – The config you get when you do not specify one:
            bin/solr create -c newcore
      • No data, but can take (nearly) anything:
            bin/post -c <name> example/exampledocs/*.xml
  • 10. schemaless mode?
      • "Let us guess what you mean"
          – Auto-guess field type based on first content occurrence
          – Create explicit field definitions
              • booleans, dates, numbers, strings
              • Always multivalued (because: who knows?!?)
              • Can be configured (URP chain in solrconfig.xml)
          – Rewrites managed-schema (comments begone!)
          – Makes search work with
            <copyField source="*" dest="_text_"/>
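The guessing step can be pictured with a small Python sketch. This is not Solr's implementation (Solr does the work in an update request processor chain); the function name `guess_field_type`, the parsing rules, and the single accepted date layout are assumptions made for illustration. The returned names follow the multivalued type names that appear later in the deck.

```python
from datetime import datetime


def guess_field_type(value: str) -> str:
    """Guess a multivalued field type from the first value seen (illustrative only)."""
    text = value.strip()
    if text.lower() in ("true", "false"):
        return "booleans"
    try:
        int(text)
        return "tlongs"
    except ValueError:
        pass
    try:
        float(text)
        return "tdoubles"
    except ValueError:
        pass
    try:
        # Assumed date layout for this sketch: ISO-8601 with a trailing Z.
        datetime.strptime(text, "%Y-%m-%dT%H:%M:%SZ")
        return "tdates"
    except ValueError:
        # Anything that is not a boolean, number, or date stays a string.
        return "strings"
```

Because the type is fixed by the first value, a field first seen as `42` would be created as a numeric field, and a later value such as `42a` would no longer fit it.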
  • 11. techproducts vs schemaless
      • Configured techproducts vs auto-detecting schemaless
        (in each pair below, the first line is techproducts and the second is schemaless)
      • Strings
            "name":"Test with some GB18030 encoded characters",
            "name":["Test with some GB18030 encoded characters"],
      • Numbers
            "price":0.0, "price_c":"0.0,USD",
            "price":[0.0],
      • Booleans
            "inStock":true,
            "inStock":[true],
  • 12. cloud example
      • Highly configurable (unless using -noprompt)
      • solr.home: example/cloud/nodeX/solr
      • Source configuration is a choice
            Please choose a configuration for the gettingstarted collection, available
            options are: basic_configs, data_driven_schema_configs, or
            sample_techproducts_configs [data_driven_schema_configs]
      • Can be rebuilt:
            bin/solr stop -all
            rm -rf example/cloud
      • Demonstrates Config API (configoverlay.json)
  • 13. dih example(s)
      • Data import handler – legacy, but still kicking
      • solr.home: example/example-DIH/solr
      • Has 5 (five!) different cores
          – db   - database import (example/example-DIH/hsqldb/ex.*)
          – solr - import from another Solr core (configured for db core)
          – mail - import from IMAP (needs some configuration)
          – tika - import rich-content (example/exampledocs/solr-word.pdf)
          – rss  - external XML feed (very broken right now)
      • Cannot be rebuilt – only emptied
            bin/post -c db -type 'application/json' -d '{delete: {query:"*:*"}}'
  • 14. What about: bin/solr start?
      • solr.home: server/solr
      • No initial collections/cores, have to create explicitly:
          – With script (see bin/solr create_core -h for details):
            bin/solr create -c <corename> -d <name or path>
          – With Core Admin UI for non-SolrCloud:
            http://localhost:8983/solr/admin/cores?action=CREATE&…
          – With Collection API for SolrCloud:
            http://localhost:8983/solr/admin/collections?action=CREATE&…
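The two CREATE requests can be assembled as in the Python sketch below. It only builds the URLs; it does not contact Solr. The helper names are hypothetical, the base URL assumes a default local install at http://localhost:8983/solr, and only a small subset of the parameters each API accepts is shown.

```python
from urllib.parse import urlencode

SOLR = "http://localhost:8983/solr"  # assumed default local base URL


def core_create_url(name: str, config_set: str) -> str:
    """CoreAdmin CREATE request for a standalone (non-SolrCloud) node."""
    params = {"action": "CREATE", "name": name, "configSet": config_set}
    return f"{SOLR}/admin/cores?{urlencode(params)}"


def collection_create_url(name: str, config_name: str, num_shards: int = 1) -> str:
    """Collections API CREATE request for a SolrCloud cluster."""
    params = {
        "action": "CREATE",
        "name": name,
        "numShards": num_shards,
        "collection.configName": config_name,
    }
    return f"{SOLR}/admin/collections?{urlencode(params)}"
```

For example, `core_create_url("demo", "basic_configs")` yields a URL that can be pasted into a browser or passed to curl.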
  • 15. basic_configs configuration
      • Available for cloud example and explicit creation
      • Schemaless mode is configured, not enabled
      • "Minimal Solr configuration" !?!
          – managed-schema: 1005 lines
          – solrconfig.xml: 1484 lines
  • 16. files example
      • Specifically tuned for file indexing
          – Augmented schemaless mode with language, content-type guessing
          – Custom /browse end-point
          – Source configuration: example/files/conf
          – Setup instructions: example/files/README.txt
          – Bring your own data
  • 18. films example
      • Schemaless (based on data_driven_schema_configs)
          – Uses Schema API to add custom fields
          – Uses schemaless for rest of fields
      • Comes with its own data (1100 film records)
      • Uses velocity (/browse), Schema API, Request Parameters API (params.json)
      • Setup instructions: example/films/README.txt
  • 19. That was the good news
      • Many examples
      • Easy to get one running
      • Some come with data
      • Some you can throw your own data into
      • Lots of comments
  • 20. This is the bad news
      Config                 Files  Types  Fields  Dynamic  managed-schema  solrconfig.xml
                                                   fields   size            size
      basic                     46     71       4       73            1005            1484
      data_driven               46     71       4       73            1005            1482
      techproducts             101     66      33       28            1149            1701
      dih db                    62     62      31       28            1129            1490
      dih tika                   6     61       3       27             901            1466
      files                     69     73       9       73             517            1508
      films (data_driven+)      46     71       8       73             481            1482
  • 21. Tip – getting these numbers
      • XML extraction with XMLStarlet (XSLT CLI)
          – xml sel -t -m "//fieldType" -v @name -n managed-schema
          – xml sel -t -m "//copyField" -c . -n managed-schema | wc -l
          – xml sel -t -m "//*[@docValues]" -v "concat(local-name(), ' ', @name, ' docValues:', @docValues)" -n managed-schema
          – xml sel -t -m "//requestHandler" -v "@name" -n solrconfig.xml
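Where XMLStarlet is not installed, similar counts can be obtained with Python's standard library. The sketch below is an alternative to the commands on the slide, not what the talk used; `SAMPLE_SCHEMA` is a made-up miniature schema and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# A made-up miniature schema, used only to exercise the functions below.
SAMPLE_SCHEMA = """
<schema name="sample" version="1.6">
  <field name="id" type="string" indexed="true" stored="true"/>
  <field name="name" type="text_general" indexed="true" stored="true"/>
  <dynamicField name="*_s" type="string" indexed="true" stored="true"/>
  <copyField source="*" dest="_text_"/>
  <fieldType name="string" class="solr.StrField" docValues="true"/>
  <fieldType name="text_general" class="solr.TextField"/>
</schema>
"""


def schema_stats(schema_xml: str) -> Counter:
    """Count fields, dynamic fields, field types, and copyFields."""
    root = ET.fromstring(schema_xml)
    wanted = ("field", "dynamicField", "fieldType", "copyField")
    return Counter(el.tag for el in root.iter() if el.tag in wanted)


def doc_values_report(schema_xml: str) -> list:
    """Mirror the //*[@docValues] query: element name, @name, and @docValues."""
    root = ET.fromstring(schema_xml)
    return [
        f"{el.tag} {el.get('name')} docValues:{el.get('docValues')}"
        for el in root.iter()
        if el.get("docValues") is not None
    ]
```

To run it against a real config, read the file contents first, for example with `open("managed-schema").read()`.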
  • 22. Why is it like this?
      • Many examples predate Solr Reference Guide
      • grep for options, possibilities, defaults
      • Each example is a kitchen sink
      "Too much of a good thing is also a bad thing"
      Source: 1980s Soviet joke about Virtual Reality
  • 23. Go small – managed-schema
      <schema name="demo" version="1.6">
        <dynamicField name="*" type="string" indexed="true" stored="true" multiValued="true"/>
        <field name="text" type="text_basic" indexed="true" stored="false" multiValued="true"/>
        <copyField source="*" dest="text"/>
        …
  • 24. Go small – managed-schema (2)
        …
        <fieldType name="string" class="solr.StrField"/>
        <fieldType name="text_basic" class="solr.TextField">
          <analyzer>
            <tokenizer class="solr.LowerCaseTokenizerFactory" />
          </analyzer>
        </fieldType>
      </schema>
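A quick sanity check for a hand-written schema like this one is to confirm that every type a field refers to is actually declared. The Python sketch below joins the two schema slides into one string, assuming the "…" marks only the slide break and hides nothing; the function name `undefined_types` is hypothetical.

```python
import xml.etree.ElementTree as ET

# The two schema slides joined into one document (assumes nothing was elided).
DEMO_SCHEMA = """<schema name="demo" version="1.6">
  <dynamicField name="*" type="string" indexed="true" stored="true" multiValued="true"/>
  <field name="text" type="text_basic" indexed="true" stored="false" multiValued="true"/>
  <copyField source="*" dest="text"/>
  <fieldType name="string" class="solr.StrField"/>
  <fieldType name="text_basic" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.LowerCaseTokenizerFactory"/>
    </analyzer>
  </fieldType>
</schema>"""


def undefined_types(schema_xml: str) -> set:
    """Return type names used by fields but never declared as a fieldType."""
    root = ET.fromstring(schema_xml)
    defined = {ft.get("name") for ft in root.iter("fieldType")}
    referenced = {
        el.get("type")
        for tag in ("field", "dynamicField")
        for el in root.iter(tag)
    }
    return referenced - defined
```

An empty result means the schema is at least internally consistent; it says nothing about whether Solr will accept the analyzer classes.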
  • 25. Go small – solrconfig.xml
      <config>
        <luceneMatchVersion>6.2.0</luceneMatchVersion>
        <requestHandler name="/select" class="solr.SearchHandler">
          <lst name="defaults">
            <str name="df">text</str>
          </lst>
        </requestHandler>
      </config>
  • 26. Go small – load and test
      • bin/solr create -c demo -d .../demo-config/
      • bin/post -c demo example/exampledocs/*.xml
      • Test it works, using HTTPie (HTTP CLI)
  • 28. Go small – review
      • Minimal example could be very minimal
      • Some things will not work
          – No uniqueKey: no way to update documents, no SolrCloud
          – No _version_: no SolrCloud
          – Everything is multiValued: no sorting
          – copyField * => text: no meaningful relevancy or specialized analyzer chain processing
  • 29. Deconstructing films example
      • bin/solr create -c films
      • curl http://localhost:8983/solr/films/schema ... (add name, initial_release_date)
      • Index 1100 records from
          – (Solr) XML,
          – (generic) JSON (doc), or
          – CSV format
      • Search for batman
      • Use /browse end-point and search for batman
      • Enable highlighting in results
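The curl step posts a JSON body to the Schema API. The Python sketch below builds such a body for the two fields named on the slide. The field attributes (type, multiValued, stored) are assumptions for illustration; the authoritative values are in example/films/README.txt. The helper name `add_fields_payload` is hypothetical.

```python
import json


def add_fields_payload(fields: list) -> str:
    """Build a Schema API request body that adds several fields in one call."""
    return json.dumps({"add-field": fields}, indent=2)


# Attribute values are assumed for this sketch; check example/films/README.txt.
FILM_FIELDS = [
    {"name": "name", "type": "text_general", "multiValued": False, "stored": True},
    {"name": "initial_release_date", "type": "tdate", "stored": True},
]
```

The resulting string would be sent with an HTTP POST and a `Content-type: application/json` header to the films core's `/schema` endpoint shown on the slide.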
  • 31. Initial stats for films core
      Sizes (line counts)
          managed-schema*          481
          solrconfig.xml          1482
          params.json               20
      File count in conf
          .txt                      41
          .xml                       3
          .json                      1
          managed-schema (xml)       1
      * already has no comments
  • 32. Deconstructing – just straight tags
      • managed-schema lost comments during construction
      • Let's remove comments from solrconfig.xml
      • xml ed -L -d "//comment()" solrconfig.xml
          – Edit in place
          – Delete XPATH
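The same comment stripping can be approximated in Python. This sketch relies on the fact that ElementTree's default parser discards comments, so parsing and re-serializing is enough; blank lines left behind are dropped as well. It is an alternative to the XMLStarlet command, and its output formatting (and therefore line counts) will not match XMLStarlet's exactly. The function name is hypothetical.

```python
import xml.etree.ElementTree as ET


def strip_comments(xml_text: str) -> str:
    """Re-serialize XML without comments and without blank lines."""
    # The default parser drops comments, so a round trip removes them.
    root = ET.fromstring(xml_text)
    serialized = ET.tostring(root, encoding="unicode")
    # Remove the whitespace-only lines that the deleted comments leave behind.
    return "\n".join(line for line in serialized.splitlines() if line.strip())
```

Unlike `xml ed -L`, this returns a new string instead of editing the file in place, and it does not preserve an XML declaration.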
  • 33. solrconfig.xml without comments
      Sizes (line counts)
          managed-schema           481
          solrconfig.xml          1482 → 278
          params.json               20
      File count in conf
          .txt                      41
          .xml                       3
          .json                      1
          managed-schema (xml)       1
  • 34. Deconstructing – what to clean
      • Currently
          – (explicit) fields: 8
          – dynamic fields: 73
              • xml sel -t -m "//dynamicField" -v @name -n managed-schema | wc -l
          – types: 71
          – copyFields: 1
      • Let's start from dynamic fields
  • 35. Deconstructing – dynamic fields
      • Used dynamic fields
          – do NOT modify schema
          – DO show up in Admin UI, if used
          – Example from different schema:
              • Used/matched fields
              • Generic definitions
  • 36. Deconstructing – in use dynamic fields
  • 37. Deconstructing – in use dynamic fields
      • NO dynamic fields are used
          – * is a copyField instruction
      • Can remove them all
      • xml ed -L -d "//dynamicField" managed-schema
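Deleting every element of one kind can also be sketched with Python's standard library. The function below is a hypothetical helper, not part of the talk; it works on a string and returns the modified XML rather than editing the file in place.

```python
import xml.etree.ElementTree as ET


def remove_elements(xml_text: str, tag: str) -> str:
    """Delete every element named `tag`, wherever it sits in the tree."""
    root = ET.fromstring(xml_text)
    # Snapshot the tree first so removals do not disturb the iteration.
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag == tag:
                parent.remove(child)
    return ET.tostring(root, encoding="unicode")
```

Calling `remove_elements(schema_text, "dynamicField")` corresponds to the `//dynamicField` deletion on the slide.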
  • 38. Remove dynamicFields
      Sizes (line counts)
          managed-schema           481 → 409
          solrconfig.xml           278
          params.json               20
      File count in conf
          .txt                      41
          .xml                       3
          .json                      1
          managed-schema (xml)       1
  • 39. Deconstructing – field types
      • How many types out of 71 do we use?
          – xml sel -t -m "//field|//dynamicField" -v "@type" -n conf/managed-schema | sort -u
          – long, string, strings, tdate, text_general
      • But also some in solrconfig.xml
          – booleans, string, strings, tdates, tdoubles, text_general, tlongs
      • Combined total: 9 field type definitions
      • Delete the rest (by hand)
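The talk deletes the unused types by hand. As an alternative, the selection and pruning could be sketched as below; the schema is a hypothetical miniature, and the `extra` argument stands for type names referenced from solrconfig.xml, which still have to be collected separately.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature schema: four types defined, two referenced by fields
SAMPLE_SCHEMA = """<schema name="films-sketch" version="1.6">
  <fieldType name="string" class="solr.StrField"/>
  <fieldType name="text_general" class="solr.TextField"/>
  <fieldType name="tdates" class="solr.TrieDateField" multiValued="true"/>
  <fieldType name="text_en" class="solr.TextField"/>
  <field name="id" type="string"/>
  <field name="name" type="text_general"/>
</schema>"""


def used_types(root, extra=()):
    """Types referenced by field/dynamicField, plus names used elsewhere."""
    used = {el.get("type") for el in root.iter() if el.tag in ("field", "dynamicField")}
    return used | set(extra)


def prune_types(root, keep):
    """Remove every fieldType whose name is not in `keep` (modifies the tree)."""
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag == "fieldType" and child.get("name") not in keep:
                parent.remove(child)
    return root


schema = ET.fromstring(SAMPLE_SCHEMA)
# Assumption: "tdates" is referenced only from solrconfig.xml
keep = used_types(schema, extra={"tdates"})
prune_types(schema, keep)
remaining = sorted(ft.get("name") for ft in schema.iter("fieldType"))
print(remaining)
```

Passing the solrconfig.xml names explicitly mirrors the slide's point: checking the schema alone would have removed types that schemaless mode still needs.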
  • 40. Remove no-longer used types
      Sizes (line counts)
          managed-schema           409 → 34 (!!!)
          solrconfig.xml           278
          params.json               20
      File count in conf
          .txt                      41
          .xml                       3
          .json                      1
          managed-schema (xml)       1
  • 41. Deconstructing – support files
      • Inside lang directory (38 files)
          – find lang -name 'stopwords_*.txt' | wc -l
              • stopwords_*.txt: 30 files
              • contractions_*.txt: 4 files
          – find lang -type f | egrep -v 'stopwords_|contractions_'
              • hyphenations_ga.txt, stemdict_nl.txt, stoptags_ja.txt, userdict_ja.txt
  • 42. Support files – still in use?
      • Check for usage
          – grep -o 'stopwords_.*.txt' managed-schema solrconfig.xml
          – grep -o 'contractions_.*.txt' ...
          – ...
      • NO matches (we no longer have related types)
          – Delete the whole lang directory
      • What about files just inside the config directory?
          – Don't need currency.xml, protwords.txt
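The grep checks above can be generalized into one pass that lists every support file no configuration file mentions. The sketch below is an assumption-laden illustration: it treats any quoted attribute value ending in .txt or .xml as a file reference, and the config text and file names are made up for the example.

```python
import re

# Matches quoted attribute values that look like support file names
FILE_REF = re.compile(r'"([^"]+\.(?:txt|xml))"')


def referenced_files(*config_texts):
    """Collect file names mentioned in attribute values of the given texts."""
    found = set()
    for text in config_texts:
        found.update(FILE_REF.findall(text))
    return found


def unused_files(available, *config_texts):
    """Return the available support files that no config text refers to."""
    return sorted(set(available) - referenced_files(*config_texts))


# Hypothetical inputs, standing in for managed-schema and the conf directory
schema_text = '<filter class="solr.StopFilterFactory" words="stopwords.txt"/>'
config_text = '<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"/>'
available = ["stopwords.txt", "synonyms.txt", "protwords.txt", "lang/stopwords_en.txt"]

print(unused_files(available, schema_text, config_text))
```

A file reported here is only a candidate for deletion; components may also load files by default names that never appear in the configuration.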
  • 43. Remove no-longer used types
      Sizes (line counts)
          managed-schema            34
          solrconfig.xml           278
          params.json               20
      File count in conf
          .txt                      41 → 2
          .xml                       3 → 2
          .json                      1
          managed-schema (xml)       1
  • 44. Deconstructing – actual field usage
  • 45. Actual field usage - _root_
  • 46. The mystery of _root_
      • In the original schema – no explanations
      • Documentation – used for nested documents:
          "To support nested documents, the schema must include an indexed/non-stored field _root_ . The value of that field is populated automatically and is the same for all documents in the block, regardless of the inheritance depth."
      • We are not using nested documents
      • And neither does any other shipped example...
  • 47. Remove _root_
      Sizes (line counts)
          managed-schema            34 → 33
          solrconfig.xml           278
          params.json               20
      File count in conf
          .txt                       2
          .xml                       2
          .json                      1
          managed-schema (xml)       1
  • 48. Deconstructing – text_general type
      <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100"
                 multiValued="true">
        <analyzer type="index">
          <tokenizer class="solr.StandardTokenizerFactory"/>
          <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
          <filter class="solr.LowerCaseFilterFactory"/>
        </analyzer>
        <analyzer type="query">
          <tokenizer class="solr.StandardTokenizerFactory"/>
          <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
          <filter class="solr.SynonymFilterFactory" expand="true"
                  ignoreCase="true" synonyms="synonyms.txt"/>
          <filter class="solr.LowerCaseFilterFactory"/>
        </analyzer>
      </fieldType>
  • 49. text_general support files
      • stopwords.txt
          # Licensed to the Apache Software Foundation (ASF) under one or more
          # contributor license agreements.  See the NOTICE file distributed with
          # this work for additional information regarding copyright ownership.
          # The ASF licenses this file to You under the Apache License, Version 2.0
          # (the "License"); you may not use this file except in compliance with
          # the License.  You may obtain a copy of the License at
          #
          #     http://www.apache.org/licenses/LICENSE-2.0
          #
          # Unless required by applicable law or agreed to in writing, software
          # distributed under the License is distributed on an "AS IS" BASIS,
          # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
          # See the License for the specific language governing permissions and
          # limitations under the License.
      • synonyms.txt
          # The ASF licenses this file to You under the Apache License, Version 2.0
          # (the "License"); you may not use this file except in compliance with
          # the License.  You may obtain a copy of the License at
          #
          ......
          #-----------------------------------------------------------------------
          #some test synonym mappings unlikely to appear in real input text
          aaafoo => aaabar
          bbbfoo => bbbfoo bbbbar
          cccfoo => cccbar cccbaz
          fooaaa,baraaa,bazaaa

          # Some synonym groups specific to this example
          GB,gib,gigabyte,gigabytes
          MB,mib,megabyte,megabytes
          Television, Televisions, TV, TVs
          #notice we use "gib" instead of "GiB" so any WordDelimiterFilter coming
          #after us won't split it into two words.

          # Synonym mappings can be used for spelling correction too
          pixima => pixma
  • 50. text_general's empty stopwords
      • No file => default stopwords => English
      • Empty file => disabled stopwords
      • Currently – NOT used
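Whether a support file is "empty" in this sense can be checked mechanically: a file holding only comments and blank lines contributes no entries. The sketch below assumes `#` starts a comment line, as in the shipped stopwords.txt and synonyms.txt; the sample texts are shortened stand-ins.

```python
def effective_entries(text):
    """Return the lines that are neither blank nor '#' comments."""
    entries = []
    for line in text.splitlines():
        stripped = line.strip()
        if stripped and not stripped.startswith("#"):
            entries.append(stripped)
    return entries


# Shortened stand-ins for the shipped files
STOPWORDS_TXT = """# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.

"""
SYNONYMS_TXT = """# Some synonym groups specific to this example
GB,gib,gigabyte,gigabytes
"""

print(effective_entries(STOPWORDS_TXT))
print(effective_entries(SYNONYMS_TXT))
```

An empty result for stopwords.txt is what justifies dropping the StopFilterFactory from the type definition.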
  • 51. text_general simplified definition
      <fieldType name="text_general" class="solr.TextField"
                 positionIncrementGap="100" multiValued="true">
        <analyzer>
          <tokenizer class="solr.StandardTokenizerFactory"/>
          <filter class="solr.LowerCaseFilterFactory"/>
        </analyzer>
      </fieldType>
  • 52. Remove stopwords and synonyms
      Sizes (line counts)
          managed-schema            33 → 26
          solrconfig.xml           278
          params.json               20
      File count in conf
          .txt                       2 → 0
          .xml                       2
          .json                      1
          managed-schema (xml)       1
  • 53. How far did we get
      Sizes (line counts)
          managed-schema*          481 → 26
          solrconfig.xml          1482 → 278
          params.json               20
      File count in conf
          .txt                      41 → 0
          .xml                       3 → 2
          .json                      1
          managed-schema (xml)       1
      * already has no comments
  • 54. Deconstructing – solrconfig.xml
      • solrconfig.xml is more complex than schema
      • Heterogeneous sections
      • Nested definitions
      • Alternative implementations (e.g. highlighter)
      • Also remember
          – configoverlay.json – overrides solrconfig.xml
          – params.json – additional configuration parameters
  • 55. solrconfig.xml – feature counts
          11  requestHandler
           8  lib
           5  searchComponent
           3  queryResponseWriter
           2  initParams
           1  updateRequestProcessorChain
           1  updateHandler
           1  requestDispatcher
           1  query
           1  luceneMatchVersion
           1  jmx
           1  indexConfig
           1  directoryFactory
           1  dataDir
           1  codecFactory
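A tally like the one above can be reproduced by counting the top-level element names of the config. The slide does not say how its numbers were produced, so this is only one possible approach, shown on a hypothetical four-element config.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical fragment standing in for the real solrconfig.xml
SAMPLE_CONFIG = """<config>
  <luceneMatchVersion>6.2.0</luceneMatchVersion>
  <requestHandler name="/select" class="solr.SearchHandler"/>
  <requestHandler name="/query" class="solr.SearchHandler"/>
  <searchComponent name="terms" class="solr.TermsComponent"/>
</config>"""


def feature_counts(config_xml):
    """Count direct children of <config> by element name, most frequent first."""
    root = ET.fromstring(config_xml)
    return Counter(child.tag for child in root).most_common()


for tag, count in feature_counts(SAMPLE_CONFIG):
    print(f"{count:3d}  {tag}")
```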
  • 56. solrconfig.xml – line counts
          55: <updateRequestProcessorChain name="add-unknown-fields-to-the-schema">
          52: <searchComponent class="solr.HighlightComponent" name="highlight">
          18: <query>
          17: <requestHandler name="/spell" class="solr.SearchHandler" startup="lazy">
          15: <searchComponent name="spellcheck" class="solr.SpellCheckComponent">
          13: <updateHandler class="solr.DirectUpdateHandler2">
           9: <requestHandler name="/terms" class="solr.SearchHandler" startup="lazy">
           8: <requestHandler name="/elevate" class="solr.SearchHandler" startup="lazy">
           8: <requestHandler name="/tvrh" class="solr.SearchHandler" startup="lazy">
           7: <requestHandler name="/update/extract" startup="lazy" class="solr.extraction.ExtractingRequestHandler">
           7: <requestHandler name="/query" class="solr.SearchHandler">
           6: <requestHandler name="/debug/dump" class="solr.DumpRequestHandler">
          ......
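Per-definition line counts show where the bulk of the file sits. One way to approximate such a listing, assuming the file is already comment-free and pretty-printed, is to serialize each top-level element and count its lines; the config below is a hypothetical stand-in and the method is not necessarily the one used for the slide.

```python
import xml.etree.ElementTree as ET

# Hypothetical, already comment-free fragment
SAMPLE_CONFIG = """<config>
  <query>
    <useColdSearcher>false</useColdSearcher>
  </query>
  <requestHandler name="/select" class="solr.SearchHandler">
    <lst name="defaults">
      <str name="df">text</str>
    </lst>
  </requestHandler>
  <searchComponent name="terms" class="solr.TermsComponent"/>
</config>"""


def line_counts(config_xml):
    """Return (line count, label) per top-level element, largest first."""
    root = ET.fromstring(config_xml)
    sizes = []
    for child in root:
        # tostring() includes trailing whitespace (the tail), so strip it
        text = ET.tostring(child, encoding="unicode").strip()
        name = child.get("name")
        label = child.tag if name is None else f'{child.tag} name="{name}"'
        sizes.append((len(text.splitlines()), label))
    return sorted(sizes, reverse=True)


for size, label in line_counts(SAMPLE_CONFIG):
    print(f"{size:3d}: {label}")
```

The largest entries are the natural places to start deconstructing, which is the order the following slides take.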
  • 57. Remember, this works!
      <config>
        <luceneMatchVersion>6.2.0</luceneMatchVersion>
        <requestHandler name="/select" class="solr.SearchHandler">
          <lst name="defaults">
            <str name="df">text</str>
          </lst>
        </requestHandler>
      </config>
  • 58. add-unknown-fields-to-the-schema
      • Famous "schemaless" mode
      • Generic, but fully configurable
      • Far from perfect
          – Remember, we had to manually pre-add fields
          – Development, not production
          – Has normalization side-effects (normalizes dates)
      • Cannot remove it in our example
  • 59. solrconfig.xml - highlighter
      <searchComponent class="solr.HighlightComponent" name="highlight">
        <highlighting>
          <fragmenter name="gap" default="true" class="solr.highlight.GapFragmenter">
            <lst name="defaults">
              <int name="hl.fragsize">100</int>
            </lst>
          </fragmenter>
          <fragmenter name="regex" class="solr.highlight.RegexFragmenter">
            <lst name="defaults">
              <int name="hl.fragsize">70</int>
              <float name="hl.regex.slop">0.5</float>
              <str name="hl.regex.pattern">[-\w ,/\n\&quot;&apos;]{20,200}</str>
            </lst>
          </fragmenter>
          <formatter name="html" default="true" class="solr.highlight.HtmlFormatter">
            <lst name="defaults">
              <str name="hl.simple.pre"><![CDATA[<em>]]></str>
              <str name="hl.simple.post"><![CDATA[</em>]]></str>
            </lst>
          </formatter>
          <encoder name="html" class="solr.highlight.HtmlEncoder"/>
          <fragListBuilder name="simple" class="solr.highlight.SimpleFragListBuilder"/>
          <fragListBuilder name="single" class="solr.highlight.SingleFragListBuilder"/>
          .......
      • fragmenters
      • encoders
      • fragListBuilders
      • fragmentBuilders
      • boundaryScanners
      • ....
  • 60. highlighter – the truth
      • Highlighter searchComponent is in the default stack
      • The params are a mix of the standard highlighter and the alternative FastVector highlighter
      • Cannot use the FastVector version, as schema fields are missing termVectors, etc.
      • And the standard highlighter params are the same as the implicit values
      • Therefore, we can remove the WHOLE definition
  • 61. Remove highlighter
      Sizes (line counts)
          managed-schema            26
          solrconfig.xml           278 → 226
          params.json               20
      File count in conf
          .txt                       0
          .xml                       2
          .json                      1
          managed-schema (xml)       1
  • 62. Other searchComponents
      • Not on the default stack
          – spellcheck
          – term
          – termVector
          – elevator
      • Have dedicated requestHandlers
      • Inception (example within example)
      • Can be deleted
          – also delete elevate.xml
      Line counts
          15: <searchComponent name="spellcheck" class="solr.SpellCheckComponent">
          17: <requestHandler name="/spell" class="solr.SearchHandler" startup="lazy">
           1: <searchComponent name="terms" class="solr.TermsComponent"/>
           9: <requestHandler name="/terms" class="solr.SearchHandler" startup="lazy">
           1: <searchComponent name="tvComponent" class="solr.TermVectorComponent"/>
           8: <requestHandler name="/tvrh" class="solr.SearchHandler" startup="lazy">
           4: <searchComponent name="elevator" class="solr.QueryElevationComponent">
           8: <requestHandler name="/elevate" class="solr.SearchHandler" startup="lazy">
  • 63. Remove custom searchComponents
      Sizes (line counts)
          managed-schema            26
          solrconfig.xml           226 → 163
          params.json               20
      File count in conf
          .txt                       0
          .xml                       2 → 1
          .json                      1
          managed-schema (xml)       1
  • 64. solrconfig.xml – more stuff
      • There is more that can be taken out
          – query section, since you have to tune it anyway
          – updateHandler, and revert to basic commits
          – jmx
          – enableRemoteStreaming – definitely take that out
      • But keep velocity, browse, search support
  • 65. Next action
      • Join the (virtual) Solr Example Reading Group
          – Starts November 2016
          – Register at http://bit.ly/SolrERG
      • Join mailing list at http://www.solr-start.com
          – Get the link to the presentation source
          – Learn about other similar projects
          – Get news of Solr articles and projects on the web
  • 66. Rebuilding Solr 6 examples – layer by layer
      Alexandre Rafalovitch
      www.solr-start.com