This document discusses large scale geo processing on Hadoop. It begins with an introduction to geo processing and spatial data. It then covers pre-processing spatial data, loading it into HDFS, and performing analysis using tools like Hive, ESRI, and Hadoop. Finally, it provides examples of use cases like network analysis at T-Mobile Austria and traffic prediction using GPS data.
1. Large Scale Geo Processing
on Hadoop
Christoph Koerner
Slides available
2. About me
● Data Scientist at T-Mobile Austria
● Visual Computing at Vienna University of Technology
● Author of Data Visualizations with D3 and AngularJS
● Author of Learning Responsive Data Visualization
● Organizer of Vienna Kaggle Meetup
● LinkedIn: linkedin.com/in/christophkoerner
● Twitter: @ChrisiKrnr
● Google+: +ChrisiHififm
● Github: github.com/chaosmail
5. What is Geo Processing?
● Operations to manipulate spatial data
● Operations include geographic feature
overlay, feature selection and analysis,
topology processing, raster processing, and
data conversion
Source: Wikipedia
6. Spatial Data
● Contains data for a spatial reference
● Mostly 1D or 2D Geometries such as
Points, Lines, Polygons, etc.
● Usually in latitude and longitude
(or x and y) coordinates
10. 3D Coordinates
● Earth is not a perfect sphere!
● Can be approximated by a biaxial ellipsoid
● 3D coordinates need a reference ellipsoid
● Most widely used is the World Geodetic
System (WGS84) used by GPS
● Minimal positioning error on the surface
13. 2D Projections
● The earth cannot be displayed on a 2D map
without distortion
● Mapping to the surface of other 3D Volumes
○ Cylindrical
○ Conical
○ Azimuthal
14. 2D Projections
● The earth cannot be displayed on a 2D map
without distortion
● Every mapping has its tradeoff
○ Length Preserving (Equidistant)
○ Area Preserving (Equal Area)
○ Angle Preserving (Conformal)
15. 2D Projections
● Commonly used in Austria: MGI Austria
Lambert (equal area)
● Commonly used in the US: Albers USA
projection (equal area)
18. Geo Processing on Hadoop
● Acquire spatial dataset
● Pre-process the dataset
● Load the dataset into HDFS
● Perform topological processing and analysis
using Hive
● Visualize the results
19. Data Sources
● https://www.data.gv.at
Free Austrian data, demographics, health, tourism, public
transport, etc.
● GIP Graphenintegrations-Plattform
Austrian traffic graph, public transport, streets, etc.
● GADM Database of Global Administrative Areas
Country shapes
● Many more..
21. GDAL - Geospatial Data Abstraction Library
● Translator library for raster and vector
geospatial data formats
● Converts spatial data between file formats,
reference systems and projections
● SQL query syntax
● Command-line tool (MIT license)
Source: gdal.org
23. GDAL - Use spatial queries
ogr2ogr -sql "SELECT A.* FROM shape1 A,
shape2 B WHERE ST_Intersects(A.geo, B.geo)"
-dialect SQLITE
data input_dir
-nln output.shp
24. More complex pre-processing
● Fiona for loading Shapefiles
● Shapely for geo processing
● Complex pre-processing, extraction,
transformations, area and length
computations, etc.
26. ESRI Tools on Hadoop
● Open Source Tools from ESRI
● Provided under Apache-License 2.0
● Geo processing tools for Hadoop + ArcGis
● Active development and feedback (on Github)
27. ESRI Tools on Hadoop
● Esri Geometry API for Java
Java library for geo processing
● Spatial Framework for Hadoop
Hive SerDe and UDFs based on the geometry API
● Geoprocessing Tools for Hadoop
Tools for data exchange between ArcGis and HDFS
● GIS Tools for Hadoop
Sample application and demos
28. Spatial Framework for Hadoop (Hive UDFs)
● Geometry Construction
Create geometries from WKT, from binary, or manually
● Relationship Tests
Contains, intersects, overlaps, touches, etc.
● Operations
Boundary, envelop, union, convex hull, etc.
● Accessors
Length, Area, Centroid, Distance, etc.
29. Spatial Framework for Hadoop
ADD JAR libs/esri-geometry-api.jar;
ADD JAR libs/spatial-sdk-hadoop.jar;
CREATE TEMPORARY FUNCTION ST_Point AS
‘com.esri.hadoop.hive.ST_Point’;
CREATE TEMPORARY FUNCTION ST_LineString AS
‘com.esri.hadoop.hive.ST_LineString’;
CREATE TEMPORARY FUNCTION ST_Polygon AS
‘com.esri.hadoop.hive.ST_Polygon’;
30. Spatial Framework for Hadoop
SELECT ST_Area(
ST_Polygon(0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0)
) FROM src LIMIT 1;
SELECT ST_AsText(
ST_Centroid(
ST_GeomFromText(geo_as_wkt)
)
) FROM src LIMIT 1;
31. Problems
● No persistent spatial indices
● No projections - length/area!
● Binary output by default
● Doesn’t work with vectorization
● No visualization
● Not feature complete (but most things work)
33. Use Case: Geo Processing @ T-Mobile Austria
● Network traffic analysis and optimization
● Signal performance analog railway tracks
● Better analysis of network coverage
● Many more..
34. Use Case: Trips Analysis @ Uber
● What do trips look like?
● How can we reduce wait time and make more
trips?
● Are there new products we should introduce?
Source: slideshare.net
35. Use Case: Traffic Jam Prediction based on GPS/FCD
● Estimate average speed of cars on road
● Compare to the max speed on each street
● Use public traffic jam data as ground truth
● Train a model to predict traffic jams