Building a Lightweight Discovery Interface for China's Patents@NYC Solr/Lucene Meetup

BUILDING A LIGHTWEIGHT
DISCOVERY INTERFACE FOR
CHINESE PATENTS
New York Solr/Lucene Meetup
ERIC PUGH | epugh@o19s.com | @dep4b

Who am I?
• Principal of OpenSource Connections
- Solr/Lucene Search Consultancy
http://bit.ly/OSCCommercialSummary

• Member of the Apache Software Foundation

• SOLR-284 UpdateRichDocuments
(July 07)

• Fascinated by the art of software
development

• First USPTO application in
“the cloud”

• Simple, and discoverable

• Expresses our philosophy of
“Cloud meets Ocean”

• Check it out at http://gpsn.uspto.gov

Telling some stories
➡How to inject “Discovery” into your
app

• The Cloud to the Rescue (sorta!)

• Parsers and Parsers and Parsers

• Don’t be Afraid to Share!

Flow of understanding
Data → Information → Understanding

Building “Discovery”
[Diagram: Engine, UX, and Data, held in tension]

Grok data at gut level

Look for outliers

User Interviews

Surveys

Card Sorting

Scenarios/Personas

UX
Data
brainstorm
Mockups

Proof of concept


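Where to spend time?
[Chart: share of effort across UX, Engine, and Data. Values as extracted: 40%, 20%, 40% for one allocation and 40%, 40%, 20% for the other, one of which is labeled “We spent”.]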

• How to inject “Discovery” into your app

➡The Cloud to the Rescue (sorta!)


Boy meets Girl Story
[Diagram labels: Metadata, Content Files, Ingest Pipeline, Discovery UX]

Nothing but JS and
Solr!
• Updates are quarterly

• User state in browser

• Solr is the “RESTful” API ;-)

• KISS: EmberJS + Solr
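
Treating Solr as the “RESTful” API means the client only has to assemble a URL for the /select handler. The following is a minimal sketch of that idea, not the GPSN code: the host, the `title` and `country` field names, and the class name are assumptions, while `q`, `rows`, `wt`, `facet`, and `facet.field` are standard Solr query parameters.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper: builds a GET URL for Solr's /select handler.
final class SolrSelectUrl {

    // Joins URL-encoded key=value pairs onto <baseUrl>/select.
    static String build(String baseUrl, Map<String, String> params) {
        StringJoiner query = new StringJoiner("&");
        for (Map.Entry<String, String> e : params.entrySet()) {
            query.add(encode(e.getKey()) + "=" + encode(e.getValue()));
        }
        return baseUrl + "/select?" + query;
    }

    private static String encode(String s) {
        return URLEncoder.encode(s, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("q", "title:\"golf glove\""); // field name is an assumption
        params.put("rows", "10");
        params.put("wt", "json");
        params.put("facet", "true");
        params.put("facet.field", "country");     // field name is an assumption
        System.out.println(build("http://localhost:8983/solr", params));
    }
}
```

Because the whole search state lives in the URL, the same string can be kept in the browser and reused as a link.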

How we built it
[Architecture diagram labels: Browsers, Mobile/Tablet, and Third Party Applications; EmberJS Single Page Search App; GPSN UI (Bootstrap CSS); Server Dashboard; HTML, XML, and JSON; Servers; S3 Bucket; Solr]

Yes, Solr is hanging out there on the Net…
• Using Jetty container security to lock down
everything but the /select handler.

• Yes, the /admin interface appears to load,
but no panels load.

• Go ahead, do a delete query! I dare you.
Actually, please don’t. ;-)
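
The “everything but /select” rule can be pictured as a small allow-list check. This is only a sketch in plain Java, not the Jetty security configuration used for GPSN and not the code of any particular proxy; the mount path and the choice to reject `qt` and `stream.*` parameters are assumptions about what such a filter might reasonably block.

```java
import java.util.Set;

// Hypothetical allow-list: only read-only GET requests to /select pass.
final class SelectOnlyFilter {

    private static final String ALLOWED_PATH = "/solr/select"; // assumed mount point

    static boolean isAllowed(String method, String path, Set<String> paramNames) {
        if (!"GET".equals(method)) {
            return false; // no POSTs, so no update or delete bodies
        }
        if (!ALLOWED_PATH.equals(path)) {
            return false; // /update, /admin and everything else is refused
        }
        for (String name : paramNames) {
            // Assumption: parameters that could reroute the request or
            // stream in remote content are rejected outright.
            if (name.equals("qt") || name.startsWith("stream.")) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isAllowed("GET", "/solr/select", Set.of("q", "rows")));
        System.out.println(isAllowed("POST", "/solr/update", Set.of("stream.body")));
    }
}
```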

Single 550 GB index
• Solr + Index are in an Amazon AMI.

• Currently running two independent Solrs.

• Optimize works! Still.

• Elastic Load Balancer + AutoScale spins up more Solrs if needed.

• Threw lots of “provisioned IOPS” at the VM

A better security proxy from Alex?
https://github.com/dergachev/solr-security-proxy

Spyglass
• EmberJS based Widget framework

• List of Results

• Facets

• Autocomplete

• “Deploy” is just .html + .js. S3 bucket!

• Tooling is a pain. EmberJS is complex!
Better than AjaxSolr!

Daniel Beach’s project
https://github.com/o19s/spyglass

Key scaling concept
behind GPSN:

Cloud meets Ocean

More prosaically…
[Diagram: one Database, three Servers, and three Clients, annotated with $ cost markers]

Don’t Move Files
• Copying 5 TB data up to S3 was very
painful.

• We used S3Funnel, which is “rsync-like”.

• We bought more network bandwidth for our office.

Never underestimate the bandwidth of a station wagon full of tapes hurtling down the highway.
–Andrew Tanenbaum, 1981
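
The arithmetic behind the pain of moving 5 TB is easy to sketch. The link speeds below are hypothetical (the slides do not state the office bandwidth), and the calculation assumes decimal terabytes and a link running at full rate with no protocol overhead.

```java
// Back-of-the-envelope transfer time for a bulk upload.
final class TransferTime {

    // Hours needed to move `terabytes` (decimal TB) over a link of
    // `megabitsPerSecond`, ignoring overhead and retries.
    static double hours(double terabytes, double megabitsPerSecond) {
        double bits = terabytes * 1e12 * 8;
        double seconds = bits / (megabitsPerSecond * 1e6);
        return seconds / 3600.0;
    }

    public static void main(String[] args) {
        // Hypothetical link speeds, not the actual office connection.
        for (double mbps : new double[] {10, 100, 1000}) {
            System.out.printf("5 TB at %.0f Mbit/s: %.1f hours%n", mbps, hours(5, mbps));
        }
    }
}
```

At an assumed 100 Mbit/s the 5 TB takes roughly 111 hours, more than four and a half days of a saturated link.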

Data Size
[Chart: Patent Count by year, 1985 to 2011; y-axis from 0 to 1,000,000; labeled value: 277,871]

Think about Data Volume
• Started with the older dataset, and tasks like TIFF -> PNG conversion became progressively harder. Map/Reduce is nice, but we needed more visibility into progress.

• Should have sharded our search index from the beginning, just to make indexing a faster and cheaper process (500 GB index!).

• 8 shards dropped indexing time from 12 hours to 2 hours. Merging took 5!

• We had too many steps in our pipeline
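
Sharding from the start comes down to routing each document to one of N smaller indexes that can be built in parallel. The sketch below shows one common way to do that, hashing the document id; the shard count of 8 comes from the slide, while the id format and the CRC32-based scheme are assumptions and not a description of Solr's own routing.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.CRC32;

// Hypothetical router: maps a document id to a shard number.
final class ShardRouter {

    // Returns a shard in the range [0, shardCount) that is stable for a given id.
    static int shardFor(String docId, int shardCount) {
        CRC32 crc = new CRC32();
        crc.update(docId.getBytes(StandardCharsets.UTF_8));
        return (int) (crc.getValue() % shardCount); // getValue() is non-negative
    }

    public static void main(String[] args) {
        int[] counts = new int[8];
        for (int i = 0; i < 100_000; i++) {
            counts[shardFor("US" + i, 8)]++; // made-up ids for illustration
        }
        // Shows how many documents each of the 8 shards would receive.
        System.out.println(Arrays.toString(counts));
    }
}
```

Each shard can then be indexed on its own machine, at the price of the merge step noted above if a single index is still wanted at the end.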

Building a Patents Index
[Chart: Machine Count (y-axis 0 to 300) against build time: 1 machine, 5 days; 5 machines, 3 days; 300 machines, 30 minutes]



➡Parsers and Parsers and Parsers

Why so many pipelines?
Morphlines

Lots of File Types
• Sometimes in ZIP archives, sometimes not!

• Multiple XML formats as well as CSV and EDI

• Purplebook, Yellowbook, Redbook, Greenbook, Questel, SIPO…

Tika as a pipeline!
• Auto detects content type

• Metadata structure has all the
key/value needed for Solr

• Allows us to scale up with
Behemoth project (and
others!).

Lots of files!
Greenbook-style text record:
HHHHHT APS1 ISSUE - 760106
PATN
WKU 039302717
SRC 5
APN 5328756
APT 1
ART 353
APD 19741216
TTL Golf glove
ISD 19760106
NCL 4
ECL 1

XML record:
<PatentGrant>
<BibliographicData>
<GrantIdentification>
<DocumentKindCode>B1</DocumentKindCode>
<GrantNumber>06644224</GrantNumber>
<CountryCode>US</CountryCode>
<IssueDateText>2003-11-11</IssueDateText>
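
The first sample above (the `PATN` record) is essentially one tag and one value per line, which is why it maps so directly onto key/value metadata. A minimal sketch of that mapping follows, assuming every field line is a three or four letter uppercase tag, whitespace, then the value. Real files may well contain repeated tags or multi-line values that this ignores, and the class name is hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical reader for "TAG value" lines such as "TTL  Golf glove".
final class TagValueLines {

    // Assumption: a tag is 3-4 uppercase letters followed by whitespace and a value.
    private static final Pattern FIELD = Pattern.compile("([A-Z]{3,4})\\s+(\\S.*)");

    static Map<String, String> parse(String record) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String line : record.split("\\R")) {
            Matcher m = FIELD.matcher(line.trim());
            if (m.matches()) {
                // Keep the first value seen for a tag; bare markers such as
                // "PATN" have no value and are skipped.
                fields.putIfAbsent(m.group(1), m.group(2).trim());
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        String record = "PATN\nWKU  039302717\nTTL  Golf glove\nISD  19760106";
        System.out.println(parse(record));
    }
}
```

The resulting map has the same shape as the key/value metadata a parser would hand on to Solr as fields.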

Detector to pick File
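import java.io.IOException;
import java.io.InputStream;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.commons.io.IOUtils;
import org.apache.tika.detect.Detector;
import org.apache.tika.io.LookaheadInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.mime.MediaType;

// Tika Detector that recognizes Greenbook files by the "PATN" marker
// near the start of the file.
public class GreenbookDetector implements Detector {

    private static final Pattern pattern = Pattern.compile("PATN");

    @Override
    public MediaType detect(InputStream stream, Metadata metadata) throws IOException {
        // Tika may call detect() without a stream; fall back to the generic type.
        if (stream == null) {
            return MediaType.OCTET_STREAM;
        }

        MediaType type = MediaType.OCTET_STREAM;

        // Look at no more than the first 1024 bytes. Closing the lookahead
        // stream resets the underlying stream instead of closing it.
        InputStream lookahead = new LookaheadInputStream(stream, 1024);
        try {
            String extract = IOUtils.toString(lookahead, "UTF-8");
            Matcher matcher = pattern.matcher(extract);
            if (matcher.find()) {
                type = GreenbookParser.MEDIA_TYPE;
            }
        } finally {
            lookahead.close();
        }
        return type;
    }
}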




➡Don’t be Afraid to Share!

Your Search solution
isn’t perfect
• Allow users to export data

• Most business users want to work in Excel!
Accept it!

• Allow other applications to build on top of
it.
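
Exporting for Excel mostly means producing well-quoted CSV. Below is a minimal sketch of that step, assuming search results have already been flattened into rows of strings. The column names and the sample row are made up for illustration, and the quoting follows the usual CSV convention of doubling embedded quotes.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical exporter: turns rows of strings into CSV text.
final class CsvExport {

    // Quotes a field only when it contains a comma, quote, or line break.
    static String escape(String field) {
        if (field.contains(",") || field.contains("\"")
                || field.contains("\n") || field.contains("\r")) {
            return "\"" + field.replace("\"", "\"\"") + "\"";
        }
        return field;
    }

    // Joins fields with commas and rows with CRLF line endings.
    static String toCsv(List<List<String>> rows) {
        return rows.stream()
                .map(row -> row.stream().map(CsvExport::escape).collect(Collectors.joining(",")))
                .collect(Collectors.joining("\r\n"));
    }

    public static void main(String[] args) {
        // Illustrative rows only, not real GPSN output.
        List<List<String>> rows = List.of(
                List.of("grant_number", "title"),
                List.of("0000001", "Glove, \"golf\""));
        System.out.println(toCsv(rows));
    }
}
```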

GPSN has
• Lots of easy “Print to
PDF” options.

• Data stored in S3 as:

• individual patent files

• chunky downloads.

• Filtering to expand or select specific data sets.

• Permalinks: simple, very
sharable URLs.

• Underlying Solr service
is exposed to public via
proxy. You can query
Solr yourself.

• Need advanced querying? Use Lucene syntax in the search bar.

Measuring the impact of our algorithm changes is just getting harder as we get smarter.

www.quepid.com
Quepid: Give your Queries some Love
We need beta users!

Thank you!
Questions?
• epugh@o19s.com

• @dep4b

• www.opensourceconnections.com

• slideshare.com/o19s
Nervous about
speaking up? Ask
me later!
