1. Open Source Software for Geospatial Analytics on Unstructured Big Data
Charlie Greenbacker, Principal Data Scientist
2. Background
About Me:
Data Scientist
Natural Language Processing
Unstructured Text Information
Berico Technologies:
Veteran-owned Small Business
Big Data Analytics in the Cloud
Defense & Intel Community
All information contained within this presentation is UNCLASSIFIED // PROPRIETARY and belongs to Berico Technologies, LLC. 2
3. The Problem: geotagging unstructured text
Growing demand for geospatial analytics
Most of human knowledge remains “trapped” in text
Existing solutions are expensive and don’t scale
Need an open source solution
4. The Solution: an open source geoparser
1. Data Ingestion
Input: unstructured text
2. Entity Extraction
Named entity recognition
Find location names in text
3. Entity Resolution
Match against a gazetteer
“The Springfield Problem”
4. Data Enrichment
Output: structured geo data
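The four stages above can be sketched as a toy pipeline. This is only an illustration under stated assumptions: the three-entry gazetteer, the function names, and the approximate coordinates are all hypothetical, the "extraction" step just keeps capitalized tokens instead of running real named entity recognition, and none of this is CLAVIN's actual API.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Place:
    name: str
    country: str
    lat: float
    lon: float


# Hypothetical three-entry gazetteer with approximate coordinates;
# a real gazetteer holds millions of records.
GAZETTEER: Dict[str, Place] = {
    "boston": Place("Boston", "US", 42.36, -71.06),
    "london": Place("London", "GB", 51.51, -0.13),
    "paris": Place("Paris", "FR", 48.86, 2.35),
}


def ingest(raw: str) -> str:
    """1. Data ingestion: normalize whitespace in the unstructured input."""
    return " ".join(raw.split())


def extract(text: str) -> List[str]:
    """2. Entity extraction: naive stand-in for NER that keeps capitalized tokens."""
    tokens = (t.strip(".,;:!?\"'()") for t in text.split())
    return [t for t in tokens if t[:1].isupper()]


def resolve(names: List[str]) -> List[Place]:
    """3. Entity resolution: keep only names found in the gazetteer."""
    return [GAZETTEER[n.lower()] for n in names if n.lower() in GAZETTEER]


def enrich(places: List[Place]) -> List[dict]:
    """4. Data enrichment: emit structured geo records."""
    return [
        {"name": p.name, "country": p.country, "lat": p.lat, "lon": p.lon}
        for p in places
    ]


def geoparse(raw: str) -> List[dict]:
    """Run all four stages in order."""
    return enrich(resolve(extract(ingest(raw))))


records = geoparse("I flew from Boston   to London.")
```

Here `records` holds one dictionary per resolved place, in the order the names appear in the text. Capitalized words that are not in the gazetteer (such as "I") are dropped at the resolution step.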
5. Data Ingestion: unstructured text
photo: Flickr user NS Newsflash
6. Entity Extraction: named entity recognition
7. Entity Resolution: match against a gazetteer
8. Data Enrichment: structured geo data
9. “The Springfield Problem”
10. Dealing with Ambiguity
Intelligent Context-based Heuristics
First: rank by population
Next: look for other locations mentioned in the same document
“Springfield” + “Chicago” = Illinois
“Springfield” + “Boston” = Massachusetts
Soon: calculate distance based on lat/lons
Resolve alternate names to same geospatial entity
“Ivory Coast” = “Côte d’Ivoire”
Use fuzzy matching to capture misspelled place names
Including both phonetic spelling & typographical errors
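The first two heuristics and the typo handling can be sketched as follows. This is a minimal illustration, not CLAVIN's implementation: the gazetteer entries and population figures are rough, hypothetical values, `difflib` from the Python standard library stands in for fuzzy matching (it catches typographical errors but does no phonetic matching), and alternate-name resolution and distance calculation are left out.

```python
import difflib
from typing import Dict, List, NamedTuple


class Candidate(NamedTuple):
    name: str
    state: str
    population: int  # rough illustrative figures, not authoritative data


# Hypothetical gazetteer keyed by lowercase place name.
GAZETTEER: Dict[str, List[Candidate]] = {
    "springfield": [
        Candidate("Springfield", "Missouri", 169_000),
        Candidate("Springfield", "Massachusetts", 155_000),
        Candidate("Springfield", "Illinois", 114_000),
    ],
    "chicago": [Candidate("Chicago", "Illinois", 2_700_000)],
    "boston": [Candidate("Boston", "Massachusetts", 650_000)],
}


def lookup(name: str, cutoff: float = 0.8) -> List[Candidate]:
    """Exact match first, then the closest fuzzy match above the cutoff."""
    key = name.lower()
    if key in GAZETTEER:
        return GAZETTEER[key]
    close = difflib.get_close_matches(key, list(GAZETTEER), n=1, cutoff=cutoff)
    return GAZETTEER[close[0]] if close else []


def resolve(names: List[str]) -> Dict[str, Candidate]:
    """Pick one candidate per name using document context, then population."""
    candidates = {n: lookup(n) for n in names}
    # States pinned down by names that have exactly one candidate.
    context = {opts[0].state for opts in candidates.values() if len(opts) == 1}
    resolved = {}
    for name, options in candidates.items():
        if not options:
            continue  # unknown place name, nothing to resolve
        in_context = [o for o in options if o.state in context]
        # Prefer candidates sharing a state with other mentions;
        # otherwise fall back to the most populous candidate.
        resolved[name] = max(in_context or options, key=lambda o: o.population)
    return resolved


illinois = resolve(["Springfield", "Chicago"])["Springfield"]
massachusetts = resolve(["Springfield", "Boston"])["Springfield"]
alone = resolve(["Springfield"])["Springfield"]
typo = resolve(["Springfield", "Chicgo"])["Springfield"]
```

With this toy data, a co-occurring "Chicago" (even misspelled as "Chicgo") pulls "Springfield" to Illinois, "Boston" pulls it to Massachusetts, and with no context the most populous candidate wins.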
11. CLAVIN: an open source geoparser
12. System Architecture
13. Live Demonstration
14. Live Demonstration
What can I do with this data?
15. Map Visualizations
16. Hierarchical Geospatial Search
Virginia
  Reston
  Arlington
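One way to picture hierarchical search: once each document is tagged with resolved places, a query for a parent region can match documents that mention any place inside it. The sketch below assumes a hypothetical parent map and document index; the names `PARENT`, `DOCS`, `ancestors`, and `search` are illustrative only.

```python
from typing import Dict, List

# Hypothetical containment hierarchy: child place -> parent region.
PARENT: Dict[str, str] = {
    "Reston": "Virginia",
    "Arlington": "Virginia",
    "Virginia": "United States",
    "Boston": "Massachusetts",
    "Massachusetts": "United States",
}

# Hypothetical index of documents and the places resolved from each.
DOCS: Dict[str, List[str]] = {
    "doc1": ["Reston"],
    "doc2": ["Arlington"],
    "doc3": ["Boston"],
}


def ancestors(place: str) -> List[str]:
    """Return the place itself followed by each enclosing region."""
    chain = [place]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def search(region: str) -> List[str]:
    """Find documents mentioning the region or any place inside it."""
    return sorted(
        doc
        for doc, places in DOCS.items()
        if any(region in ancestors(p) for p in places)
    )


virginia_docs = search("Virginia")
```

A search for "Virginia" returns the Reston and Arlington documents even though neither mentions Virginia by name.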
17. Geospatial Bounding Box Search
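A bounding box search filters geotagged documents by whether their resolved coordinates fall inside a rectangle of latitudes and longitudes. The sketch below is a hypothetical illustration with approximate coordinates; it assumes the box does not cross the antimeridian, and the names `BBox` and `TAGGED_DOCS` are made up for the example.

```python
from typing import Dict, List, NamedTuple, Tuple


class BBox(NamedTuple):
    south: float
    west: float
    north: float
    east: float

    def contains(self, lat: float, lon: float) -> bool:
        """True if the point lies inside the box (edges included)."""
        return self.south <= lat <= self.north and self.west <= lon <= self.east


# Hypothetical documents tagged with approximate (lat, lon) pairs.
TAGGED_DOCS: Dict[str, Tuple[float, float]] = {
    "doc1": (38.96, -77.36),  # Reston
    "doc2": (38.88, -77.10),  # Arlington
    "doc3": (42.36, -71.06),  # Boston
}


def search(box: BBox) -> List[str]:
    """Return the documents whose location falls inside the box."""
    return sorted(
        doc for doc, (lat, lon) in TAGGED_DOCS.items() if box.contains(lat, lon)
    )


northern_virginia = BBox(south=38.5, west=-78.0, north=39.5, east=-76.5)
hits = search(northern_virginia)
```

With these values, `hits` contains the Reston and Arlington documents and excludes the Boston one.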
18. Geospatial Analytics on Unstructured Text
19. Performance Metrics & Features
CLAVIN: “Cartographic Location And Vicinity INdexer”
Accurate: 0.75 F-measure
Fast: 100 locations per sec per cpu
Scalable: processes 1 million documents in 1 hour on a 9-node Hadoop cluster
Smart: natural language processing, context-based heuristics, & fuzzy matching
Easy to use: simple Java-based API
Open source: Apache License
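As a rough sanity check, the throughput figures can be related with simple arithmetic. The sketch below uses the 5.7M place-name count given in the editor's notes for the 1-million-document run and treats "under 1 hour" as exactly one hour, so the rates are lower bounds. It also assumes the reported F-measure is the balanced F1 score; the slide does not say which variant was used or give the underlying precision and recall.

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall (F1 when beta is 1)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


PLACE_NAMES = 5_700_000  # place names in the 1M-document run (editor's notes)
SECONDS = 3600           # "under 1 hour", treated here as exactly one hour
NODES = 9                # Hadoop cluster size from the slide
PER_CPU_RATE = 100       # locations per second per cpu from the slide

cluster_rate = PLACE_NAMES / SECONDS        # locations per second, cluster-wide
per_node_rate = cluster_rate / NODES        # locations per second per node
implied_cpus = cluster_rate / PER_CPU_RATE  # cpus needed at the quoted rate

print(f"cluster: {cluster_rate:.0f}/s, per node: {per_node_rate:.0f}/s, "
      f"implied cpus: {implied_cpus:.1f}")
```

Under these assumptions the cluster resolves roughly 1,583 locations per second, or about 176 per node, which is consistent with the quoted per-cpu rate if each node keeps about two cpus busy.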
20. clavin.bericotechnologies.com
Charlie Greenbacker
@greenbacker
meetup.com/DC-NLP
@DCNLP
Editor's Notes
“Berico specializes in building open source software to support analytic missions, and implementing it through our services.”
“We help our customers optimize the use of open source solutions for Cloud environments to replace functionality traditionally provided by license-based products.”
“All of our products are built to run on and optimize cloud technologies, specifically HBase or Accumulo. We are the first authorized Cloudera partner in the federal sector.”
“CLAVIN is one of 7 open source products that we’ve built and implemented with customers in the DoD and IC. We’ve chosen CLAVIN as an example to walk through today to illustrate how Berico’s open source products deliver great, market-leading functionality with no licensing constraints, and at a fraction of the cost of proprietary tools in the market.” (an infinitesimal fraction: it’s free)
Paris, France > Paris, Texas
The interactive live demo will be run offline from the presenter’s laptop. The CLAVIN demo interface accepts plain text as input and returns a list of geospatial entities (with lat/lons, etc.) corresponding to the place names extracted and resolved from the text, along with a visualization plotting these locations on a map.
The example text used in the demo may include the following:
the sample text file built into the CLAVIN demo interface
“Grover Cleveland was the 22nd president of the United States. He never went to Cuba.” (shows that CLAVIN knows “Grover Cleveland” is not a city in Ohio)
“I was born in Boston and grew up in Springfield.” (produces a map of Massachusetts)
“I was born in Chicago and grew up in Springfield.” (produces a map of Illinois)
“I traveled to London and Oxford last summer.” (produces a map of England)
“I traveled to London and Toronto last summer.” (produces a map of Ontario)
a random news article from CNN.com (or a similar source)
any example text provided by the audience
Geotag 1M documents containing 5.7M place names in under 1 hour on a 9-node Hadoop cluster, vs. the prohibitively expensive enterprise licenses of competing solutions like MetaCarta