2. Who am I?
• Principal of OpenSource Connections
- Solr/Lucene Search Consultancy
http://bit.ly/OSCCommercialSummary
• Member of the Apache Software Foundation
• SOLR-284 UpdateRichDocuments (July 07)
• Fascinated by the art of software development
9. • First USPTO application in “the cloud”
• Simple and discoverable
• Expresses our philosophy of “Cloud meets Ocean”
• Check it out at http://gpsn.uspto.gov
10. Telling some stories
➡How to inject “Discovery” into your app
• The Cloud to the Rescue (sorta!)
• Parsers and Parsers and Parsers
• Don’t be Afraid to Share!
13. Grok data at gut level
Look for outliers
User Interviews
Surveys
Card Sorting
Scenarios/Personas
UX
Data
brainstorm
Mockups
Proof of concept
14. Where to spend time?
[Chart: share of time across UX, Engine, and Data. Two sets of values appear, 40% / 20% / 40% and 40% / 40% / 20%, one of them labeled “We spent”.]
15. Telling some stories
• How to inject “Discovery” into your app
➡The Cloud to the Rescue (sorta!)
• Parsers and Parsers and Parsers
• Don’t be Afraid to Share!
17. Boy meets Girl Story
Metadata
Ingest Pipeline
Discovery UX
Content Files
18. Nothing but JS and Solr!
• Updates are quarterly
• User state in browser
• Solr is the “RESTful” API ;-)
• KISS: EmberJS + Solr
19. How we built it
EmberJS Single Page Search App
HTML
XML
JSON
Server Dashboard
GPSN UI (Bootstrap CSS)
Browsers
Mobile/Tablet
Third Party Application
Servers
S3 Bucket
Solr
20. Yes, Solr is hanging out there on the Net…
• Using Jetty container security to lock down everything but the /select handler.
• Yes, the /admin interface appears to load, but no panels load.
• Go ahead, do a delete query! I dare you. Actually, please don’t. ;-)
21. Single 550 GB index
• Solr + index are in an Amazon AMI.
• Currently running two independent Solrs.
• Optimize works! Still.
• Elastic Load Balancer + AutoScale spins up more Solrs if needed.
• Threw lots of “provisioned IOPS” at the VM
23. Spyglass
• EmberJS based Widget framework
• List of Results
• Facets
• Autocomplete
• “Deploy” is just .html + .js. S3 bucket!
• Tooling is a pain. EmberJS is complex!
Better than AjaxSolr!
29. Don’t Move Files
• Copying 5 TB of data up to S3 was very painful.
• We used S3Funnel, which is “rsync-like”
• We bought more network bandwidth for our office
32. Think about Data Volume
• Started with an older dataset, and tasks like TIFF -> PNG conversion became progressively harder. Map/Reduce is nice, but we need more visibility into progress.
• Should have sharded our search index from the beginning, just to make indexing a faster and cheaper process (500 GB index!)
• 8 shards dropped time from 12 hours to 2 hours. Merging took 5!
• We had too many steps in our pipeline
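To make the sharding point concrete, here is a minimal sketch of splitting documents across shards so each shard can be indexed in parallel. Hash-based routing is an assumption chosen for illustration, not necessarily how the GPSN index was split; ShardRouter, its methods, and the sample document IDs are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: routes document IDs to a fixed number of shards.
class ShardRouter {

    // floorMod keeps the result in [0, shardCount) even for negative hash codes.
    static int shardFor(String docId, int shardCount) {
        return Math.floorMod(docId.hashCode(), shardCount);
    }

    // Groups document IDs into one list per shard.
    static List<List<String>> partition(List<String> docIds, int shardCount) {
        List<List<String>> shards = new ArrayList<>();
        for (int i = 0; i < shardCount; i++) {
            shards.add(new ArrayList<>());
        }
        for (String id : docIds) {
            shards.get(shardFor(id, shardCount)).add(id);
        }
        return shards;
    }

    public static void main(String[] args) {
        // Made-up IDs, only to show the routing.
        List<String> ids = Arrays.asList("doc-001", "doc-002", "doc-003", "doc-004");
        List<List<String>> shards = partition(ids, 8);
        for (int i = 0; i < shards.size(); i++) {
            System.out.println("shard " + i + ": " + shards.get(i));
        }
    }
}
```

Each shard's list could then be handed to its own indexing worker; the merge step mentioned above still has to be paid for afterwards.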
33. Building a Patents Index
[Chart: machine count vs. time to build the index]
• 1 machine: 5 days
• 5 machines: 3 days
• 300 machines: 30 minutes
34. Telling some stories
• How to inject “Discovery” into your app
• The Cloud to the Rescue (sorta!)
➡Parsers and Parsers and Parsers
• Don’t be Afraid to Share!
37. Lots of File Types
• Sometimes in ZIP archives, sometimes not!
• Multiple XML formats as well as CSV and EDI
• Purplebook, Yellowbook, Redbook, Greenbook, Questel, SIPO…
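When input is only sometimes zipped, one simple way to tell is to look at the first bytes of the file. The sketch below checks for the ZIP local file header signature ("PK" followed by 0x03 0x04). ZipSniffer is a hypothetical name, and this is an illustration of the idea, not the pipeline the talk describes.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper: decides whether a byte prefix looks like a ZIP archive.
class ZipSniffer {

    // A non-empty ZIP archive starts with the local file header
    // signature 0x50 0x4B 0x03 0x04 ("PK\3\4").
    static boolean looksLikeZip(byte[] head) {
        return head.length >= 4
                && head[0] == 0x50
                && head[1] == 0x4B
                && head[2] == 0x03
                && head[3] == 0x04;
    }

    public static void main(String[] args) {
        byte[] zipHead = {0x50, 0x4B, 0x03, 0x04, 0x14, 0x00};
        byte[] textHead = "PATN".getBytes(StandardCharsets.US_ASCII);
        System.out.println(looksLikeZip(zipHead));
        System.out.println(looksLikeZip(textHead));
    }
}
```

Note that an empty archive starts with a different signature, so this check is deliberately narrow.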
38. Tika as a pipeline!
• Auto-detects content type
• Metadata structure has all the key/value pairs needed for Solr
• Allows us to scale up with the Behemoth project (and others!).
40. Detector to pick File
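import java.io.IOException;
import java.io.InputStream;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.tika.detect.Detector;
import org.apache.tika.io.LookaheadInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.mime.MediaType;

public class GreenbookDetector implements Detector {

    // The detector looks for the "PATN" marker near the start of the file.
    private static Pattern pattern = Pattern.compile("PATN");

    @Override
    public MediaType detect(InputStream stream, Metadata metadata) throws IOException {

        // Fall back to a generic binary type when the marker is not found.
        MediaType type = MediaType.OCTET_STREAM;
        // Peek at the first 1024 bytes without consuming the caller's stream.
        InputStream lookahead = new LookaheadInputStream(stream, 1024);
        String extract = org.apache.commons.io.IOUtils.toString(lookahead, "UTF-8");

        Matcher matcher = pattern.matcher(extract);

        if (matcher.find()) {
            type = GreenbookParser.MEDIA_TYPE;
        }

        // Closing the lookahead stream rewinds the underlying stream.
        lookahead.close();
        return type;
    }
}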
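The same peek-then-rewind idea can be tried without Tika on the classpath. The sketch below swaps Tika's LookaheadInputStream for the JDK's mark/reset, so it is an approximation of the detector above and not the talk's code. LookaheadSketch and containsMarker are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical JDK-only stand-in for the look-ahead step of a detector.
class LookaheadSketch {

    // Reads at most `limit` bytes, checks them for `marker`, then rewinds
    // the stream so a later parser still sees the content from the start.
    static boolean containsMarker(InputStream stream, String marker, int limit) {
        if (!stream.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        try {
            stream.mark(limit);
            byte[] head = new byte[limit];
            int n = 0;
            while (n < limit) {
                int r = stream.read(head, n, limit - n);
                if (r < 0) {
                    break; // end of stream before the limit
                }
                n += r;
            }
            stream.reset();
            return new String(head, 0, n, StandardCharsets.UTF_8).contains(marker);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample content, only to exercise the check.
        byte[] sample = "PATN\nsample record".getBytes(StandardCharsets.UTF_8);
        InputStream in = new ByteArrayInputStream(sample);
        System.out.println(containsMarker(in, "PATN", 1024));
    }
}
```

Streams that do not support mark/reset, such as a raw FileInputStream, would need to be wrapped in a BufferedInputStream first.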
41. Telling some stories
• How to inject “Discovery” into your app
• The Cloud to the Rescue (sorta!)
• Parsers and Parsers and Parsers
➡Don’t be Afraid to Share!
42. Your Search solution isn’t perfect
• Allow users to export data
• Most business users want to work in Excel! Accept it!
• Allow other applications to build on top of it.
43. GPSN has
• Lots of easy “Print to PDF” options.
• Data stored in S3 as:
- individual patent files
- chunky downloads.
• Filtering to expand or select specific data sets.
• Permalinks: simple, very sharable URLs.
• Underlying Solr service is exposed to the public via proxy. You can query Solr yourself.
• Need advanced querying? Use Lucene syntax in the search bar.
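Querying Solr yourself comes down to building a /select URL with an encoded query. Here is a minimal sketch: q, rows, and wt are standard Solr query parameters, while the base URL, the core name, and the field names in the example query are assumptions, not the actual GPSN endpoint or schema. SolrSelectUrl is a hypothetical helper.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: builds a Solr /select URL for a Lucene-syntax query.
class SolrSelectUrl {

    // baseUrl is the address of a Solr core, without a trailing slash.
    static String build(String baseUrl, String luceneQuery, int rows) {
        // Form-encode the query so quotes, colons, and spaces survive in the URL.
        String q = URLEncoder.encode(luceneQuery, StandardCharsets.UTF_8);
        return baseUrl + "/select?q=" + q + "&rows=" + rows + "&wt=json";
    }

    public static void main(String[] args) {
        // Placeholder host, core, and field names.
        String url = build(
                "http://localhost:8983/solr/patents",
                "title:\"solar cell\" AND abstract:silicon",
                10);
        System.out.println(url);
    }
}
```

Because the result is a plain URL, it can double as a sharable link to a specific result set.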