Using the LucidWorks REST API to Support User-Configuration Big Data Search Experience

Kitenga
reinventing information

Mark Davis
Founder/CTO

Enabling Big Data Search via the Lucid ReST API

Big Data

Enormous transactional data
Enormous unstructured information
Too big for databases
New tools are needed

Decimal unit      Value    Binary unit       Value
kilobyte (kB)     10^3     kibibyte (KiB)    2^10
megabyte (MB)     10^6     mebibyte (MiB)    2^20
gigabyte (GB)     10^9     gibibyte (GiB)    2^30
terabyte (TB)     10^12    tebibyte (TiB)    2^40
petabyte (PB)     10^15    pebibyte (PiB)    2^50
exabyte (EB)      10^18    exbibyte (EiB)    2^60
zettabyte (ZB)    10^21    zebibyte (ZiB)    2^70
yottabyte (YB)    10^24    yobibyte (YiB)    2^80
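The gap between the decimal and binary units in the table grows with each prefix. A minimal sketch of that arithmetic follows; the function names are illustrative and not part of the original slides.

```python
# Decimal (SI) and binary (IEC) byte prefixes, in the order used by the table.
SI_PREFIXES = ["kB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]
IEC_PREFIXES = ["KiB", "MiB", "GiB", "TiB", "PiB", "EiB", "ZiB", "YiB"]


def si_bytes(symbol):
    """Bytes in one decimal unit, e.g. 'MB' -> 10**6."""
    return 10 ** (3 * (SI_PREFIXES.index(symbol) + 1))


def iec_bytes(symbol):
    """Bytes in one binary unit, e.g. 'MiB' -> 2**20."""
    return 2 ** (10 * (IEC_PREFIXES.index(symbol) + 1))


def binary_excess_percent(rank):
    """Percent by which the binary unit at this rank exceeds its decimal twin."""
    ratio = iec_bytes(IEC_PREFIXES[rank]) / si_bytes(SI_PREFIXES[rank])
    return (ratio - 1) * 100


if __name__ == "__main__":
    for rank, (si, iec) in enumerate(zip(SI_PREFIXES, IEC_PREFIXES)):
        print(f"{iec} is {binary_excess_percent(rank):.1f}% larger than {si}")
```

At the kilobyte level the difference is small, but it compounds with each step up the table, which matters when sizing storage at Big Data scale.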

Volume
Velocity
Variety

Indexing Challenges

Complex, varied data
Compute-intensive metadata generation
Schema and collection management

Gather
•  Crawl
•  Crack formats

Extract Metadata
•  Named entities
•  Categories
•  Machine learning
•  Semantic analysis

Index Resources
•  Schema definition
•  Collection management

Initial Query
•  Keyword guesses
•  Category guidance

Refine Query
•  Analytic tools
•  Facetted guidance

Evaluate Relevance
•  Read KWIC
•  Read metadata
•  Read document

Search Experience Challenges

Complex, varied data
Resource discovery
Facetted search experience management

The Solution

Enable fast metadata generation:

Hadoop, Mahout, GPUs

Manage and control collections and schema:

LucidWorks Enterprise API

[Figure: SQL versus Search. SQL labels: RDBMS, Transactional Data, BI Tools. Search labels: Documents, Text Classification, Taxonomies, Ontologies, Machine-Learning, Finite State Transducers, Parts-of-Speech Tagging, Lemmatization, Tokenization. Other labels: Resource Integration, Facet Browsing, Facet Charting, Spellcheck, Autosuggest, Query Language, Indexing, Metadata Extraction.]

•  Start to POC in a week
•  Open source intelligence problems

GOAL: Be more competitive

SOURCES: Patents, PR announcements, legal documents, whitepapers, crawled websites

ANALYSIS: Extract named entities and relationships, classify and label; visually understand relationships and trends

ACTION: Change R&D priorities and improve marketing approaches

[Figure: diagram with labels Sources, data, ZettaVox, entities, metadata, relationships, ZettaSearch (Facetted Search and Analytics).]

•  Understand IP among competitors
•  Assist legal team with litigation
•  Custom search experience
•  Custom extractors:
   ◦  Electronic parts
   ◦  Memory types
   ◦  Flash memory

5/15/12

Company      Documents    Size
Dell         102,508      9 GB
EMC          303,678      14 GB
Huawei       11,912       890 MB
Kingston     2,534        134 MB
Lenovo       8,305        542 MB
NEC          3,900        252 MB
Nokia        174,681      22 GB
Panasonic    5,804        473 MB
RIM          181          8 MB
Sharp USA    31,918       4.9 GB
Total        645,421      60.2 GB


GOAL: Discover new drugs, detect side-effects, speed R&D

SOURCES: Published research reports, patents, adverse effects databases, genomics and proteomics databases

ANALYSIS: Extract named entities and relationships, classify and label; visually discover trends and relationships

ACTION: Change R&D priorities

[Figure: diagram with labels Sources, data, ZettaVox, entities, sequences, pathways, relationships, ZettaSearch (Facetted Search and Analytics).]

•  Lousy search (Google Search Appliance)
•  Internal regulators can't find by accession number
•  Custom extractors:
   ◦  Accession number
   ◦  Ontology of active ingredients
   ◦  Drug names

© 2012 Kitenga Proprietary

GOAL: Build "second screen experiences"

SOURCES: Wikipedia, IMDB, blogs

ANALYSIS: Extract named entities and relationships, preserve existing structural metadata

ACTION: Enable new media experiences

[Figure: diagram with labels Sources, data, ZettaVox, entities, metadata, relationships, ZettaSearch (Facetted Search and Analytics).]

•  Crawlers on Hadoop
•  Document format crackers on Hadoop
•  Extractors on Hadoop
•  Filters on Hadoop
•  HTTP documents to Solr sharded cluster
•  Intermediary files remain on HDFS for reprocessing
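The step that sends documents over HTTP to a sharded Solr cluster can be sketched as follows. This is a minimal illustration, not the pipeline used in the talk: the shard URLs are hypothetical, the hash-based routing is one common choice among several, and the `/update/json` path assumes Solr's JSON update handler is enabled on each shard. The request is built but not sent.

```python
import hashlib
import json
from urllib.request import Request

# Hypothetical shard endpoints; a real cluster would supply its own.
SHARDS = [
    "http://solr1.example.com:8983/solr",
    "http://solr2.example.com:8983/solr",
]


def pick_shard(doc_id, shards=SHARDS):
    """Map a document id to one shard using a stable hash of the id."""
    digest = hashlib.md5(doc_id.encode("utf-8")).hexdigest()
    return shards[int(digest, 16) % len(shards)]


def build_update_request(doc, shards=SHARDS):
    """Build (but do not send) an HTTP POST that adds one document."""
    body = json.dumps([doc]).encode("utf-8")
    return Request(
        pick_shard(doc["id"], shards) + "/update/json",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

In a Hadoop job, a reducer or output step could call `urllib.request.urlopen` on each request. Because the routing depends only on the document id, a rerun over the intermediary files on HDFS sends each document to the same shard.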

•  Missing piece of the puzzle
•  Addresses the impedance mismatch between Big Data technologies and Solr search

•  Manage collections
•  Manage schema

•  Create collections
•  Delete collections
•  Update collection properties
•  Create schema
•  Modify schema
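The collection operations listed above map naturally onto REST calls. The sketch below only builds the requests. The base URL, the `/collections` and `/settings` paths, and the JSON payload shape are assumptions about a LucidWorks Enterprise installation and should be checked against its API documentation.

```python
import json
from urllib.request import Request

# Assumed base URL of the LucidWorks Enterprise REST API.
API_BASE = "http://localhost:8888/api"


def _json_request(method, path, payload=None):
    """Build (but do not send) a JSON request against the assumed API base."""
    data = None if payload is None else json.dumps(payload).encode("utf-8")
    return Request(
        API_BASE + path,
        data=data,
        headers={"Content-Type": "application/json"},
        method=method,
    )


def create_collection(name):
    """Request that creates a collection with the given name."""
    return _json_request("POST", "/collections", {"name": name})


def delete_collection(name):
    """Request that deletes the named collection."""
    return _json_request("DELETE", f"/collections/{name}")


def update_collection_settings(name, settings):
    """Request that updates properties of the named collection."""
    return _json_request("PUT", f"/collections/{name}/settings", settings)
```

Passing any of these to `urllib.request.urlopen` would send it. Keeping construction separate from sending makes the calls easy to log or test before they touch a live index.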

•  Schema interrogation
•  Schema binding to user experience
•  Facetted search
•  Embedded analytics
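Binding the schema to the user experience can be as simple as asking which fields are facetable and generating the facet parameters from that answer. A minimal sketch follows, assuming the schema has already been fetched as a list of dictionaries with hypothetical `name` and `facet` keys. The `facet`, `facet.field`, `facet.limit`, and `facet.mincount` parameters are standard Solr query parameters.

```python
from urllib.parse import urlencode


def facet_query_params(user_query, fields, limit=10):
    """Build a Solr query string that facets on every facetable schema field.

    `fields` is an assumed shape: [{"name": "company", "facet": True}, ...].
    """
    facet_fields = [f["name"] for f in fields if f.get("facet")]
    params = [("q", user_query), ("wt", "json")]
    if facet_fields:
        params += [
            ("facet", "true"),
            ("facet.limit", str(limit)),
            ("facet.mincount", "1"),
        ]
        params += [("facet.field", name) for name in facet_fields]
    return urlencode(params)


if __name__ == "__main__":
    schema_fields = [
        {"name": "company", "facet": True},
        {"name": "body", "facet": False},
        {"name": "memory_type", "facet": True},
    ]
    print(facet_query_params("flash memory", schema_fields))
```

Because the facet list is derived from the schema at query time, adding a facetable field to a collection changes the search interface without any change to the front-end code.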

•  Big Data search and analytics have many challenges:
   ◦  Volume of data
   ◦  Variety of data
   ◦  Velocity of data
   ◦  Extracting structure from unstructured information
•  Hadoop processing enables each of these aspects
•  Controlling indexing and search is enabled by the Lucid Imagination search API
•  We can enable complex user interactions with Big Data on a self-serve basis

[Figure: architecture diagram. Labels: Analyst Browser; Enterprise servers; Cloud services; Tomcat App Server; Tomcat Web Services; Amazon S3; ZettaVoxServices; ZettaVox Author RIA; XML, JSON, and ReST interfaces; Search Services Manager; Indexing Services Manager; GPU Manager; Hadoop Manager; GPU MR Service; Hadoop Server (Name node); Hadoop Server (Job Tracker); GPU and Hadoop Task Managers; Quantum4D; RDBMS; Entity Extraction; Mahout; Crawling.]

[Figure: architecture diagram with the search indexing path highlighted. Labels: Analyst Browser; Enterprise servers; ZettaVox Author RIA; ReST and JSON interfaces; Hadoop Server (Name node); Hadoop Server (Job Tracker); Hadoop Task Managers; Entity Extraction; Mahout; Crawling; Indexing.]

Search Indexing
•  Get collection information
•  Create new collection
•  Create fields
•  Delete fields
•  Edit fields
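The field-level calls named on this slide (get collection information, create, delete, and edit fields) could look roughly like the sketch below. The base URL, the `/info` and `/fields` paths, and the attribute names `field_type`, `indexed`, `stored`, and `facet` are assumptions about the LucidWorks Enterprise REST API rather than confirmed details from the talk. Requests are built but not sent.

```python
import json
from urllib.request import Request

# Assumed base URL of the LucidWorks Enterprise REST API.
API_BASE = "http://localhost:8888/api"


def _request(method, path, payload=None):
    """Build (but do not send) a JSON request against the assumed API base."""
    data = None if payload is None else json.dumps(payload).encode("utf-8")
    return Request(
        API_BASE + path,
        data=data,
        headers={"Content-Type": "application/json"},
        method=method,
    )


def collection_info(collection):
    """Request that reads information about one collection."""
    return _request("GET", f"/collections/{collection}/info")


def create_field(collection, name, field_type="string", facet=False):
    """Request that adds a field; attribute names are assumed."""
    attrs = {
        "name": name,
        "field_type": field_type,
        "indexed": True,
        "stored": True,
        "facet": facet,
    }
    return _request("POST", f"/collections/{collection}/fields", attrs)


def edit_field(collection, name, **changes):
    """Request that changes attributes of an existing field."""
    return _request("PUT", f"/collections/{collection}/fields/{name}", changes)


def delete_field(collection, name):
    """Request that removes a field from the collection's schema."""
    return _request("DELETE", f"/collections/{collection}/fields/{name}")
```

A custom extractor that starts emitting a new entity type, such as a memory type, could call `create_field` with `facet=True` before indexing so the new metadata is searchable and facetable as soon as documents arrive.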
