Invited talk for 2016 AGU Fall Meeting Session IN12A Big Data Analytics I
Introduced is a new approach for processing spatiotemporal big data by leveraging distributed analytics and storage. A suite of temporally aware analysis tools summarizes data nearby or within variable windows, aggregates points (e.g., for various sensor observations or vessel positions), reconstructs time-enabled points into tracks (e.g., for mapping and visualizing storm tracks), joins features (e.g., to find associations between features based on attributes, spatial relationships, temporal relationships, or all three simultaneously), calculates point densities, finds hot spots (e.g., in species distributions), and creates space-time slices and cubes (e.g., in microweather applications with temperature, humidity, and pressure, or within human mobility studies). These “feature geo analytics” tools run in both batch and streaming spatial analysis modes as distributed computations across a cluster of servers on typical “big” data sets, where static data exist in traditional geospatial formats (e.g., shapefile) locally on a disk or file share, are attached as static spatiotemporal big data stores, or are streamed in near-real-time. In other words, the approach registers large datasets or data stores with ArcGIS Server, then distributes analysis across a cluster of machines for parallel processing. Several brief use cases will be highlighted based on a 16-node server cluster with 14 GB of RAM per node, allowing, for example, the buffering of over 8 million points or thousands of polygons in ~1 minute. The approach is “hybrid” in that ArcGIS Server integrates open-source big data frameworks such as Apache Hadoop and Apache Spark on the cluster in order to run the analytics. In addition, the user may devise and connect custom open-source interfaces and tools developed in Python or Python Notebooks, with the familiar REST API as the common denominator.
Feature Geo Analytics and Big Data Processing: Hybrid Approaches for Earth Science and Real-Time Decision Support
1. Feature Geo Analytics and Big Data Processing: Hybrid Approaches for Earth Science and Real-Time Decision Making
Mansour Raad, Erik Hoel, Michael Park, Adam Mollenkopf, Dawn J. Wright
Environmental Systems Research Institute (aka Esri)
IN12A-01 (Invited)
AGU Fall Meeting, 12 December 2016
2. What is Feature Geo Analytics?
A new way of processing spatiotemporal data designed for WEB-BASED big data by leveraging distributed analytics and storage
• Works with existing GIS data and tabular data
• Designed to perform both spatial and temporal analysis
• Uses familiar workflows to complete complex analyses
• “Hybridity” - integrating open-source frameworks on clusters to run analytics
4. Solve New Problems
Run analytics:
• against data too big for a single desktop machine
- Buffer 8.2 million points or thousands of polygons in a little over a minute
- billions of observations of ship movements ingested via GeoEvent
• designed to gain insight into both spatial and temporal patterns
• against massive collections in a scalable manner
• and meet time constraints
[Time scale graphic: months, weeks, days, hours, minutes]
5. Geo Analytics Architectural Overview
[Architecture diagram]
• Inputs
- Web GIS Layers (Portal)
- Un-Managed Data: Files (Delimited Files, Shapefiles, Enterprise, Big Data Stores)
- Managed Data: Relational Data Store, Spatiotemporal Data Store, Files
• Server Cluster: register large data stores, then distribute spatial analysis across cluster of machines for parallel processing
• Outputs: New Web GIS Layers; store and/or deploy to web
• Access Web GIS layers via Pro, Portal, Python Notebooks, or the REST API
6. Rich Collection of (Web) Analysis Tools
• Summarize Data
- Aggregate Points
- Summarize Nearby
- Summarize Within
- Reconstruct Tracks
- Join Features
• Find Locations
- Find Existing Locations
- Find Similar Locations
• Analyze Patterns
- Calculate Density
- Find Hot Spots
- Create Space Time Cube
• Use Proximity
- Create Buffers
• Manage Data
- Extract Data
* Temporally aware tools
Aggregate Points
Summarize Nearby
Summarize Within
Find Existing Locations
Find Similar Locations
Calculate Density
Find Hot Spots
Create Buffers
Extract Data
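To make the idea behind tools such as Aggregate Points and Create Space Time Cube concrete, here is a minimal single-machine sketch in plain Python. It is not the ArcGIS implementation: the function name `space_time_bins`, the square-bin scheme, and the sample observations are assumptions made purely for illustration. Each point is assigned to a (column, row, time-slice) bin and the bins are counted.

```python
from collections import Counter
from datetime import datetime, timedelta


def space_time_bins(points, cell_size, time_step, t0):
    """Count points per (column, row, time-slice) bin.

    points    : iterable of (x, y, timestamp) tuples
    cell_size : width/height of a square bin, in the units of x and y
    time_step : timedelta giving the length of one time slice
    t0        : datetime marking the start of the first slice
    """
    counts = Counter()
    for x, y, t in points:
        col = int(x // cell_size)                 # spatial bin, x direction
        row = int(y // cell_size)                 # spatial bin, y direction
        slice_index = int((t - t0) // time_step)  # temporal bin
        counts[(col, row, slice_index)] += 1
    return counts


# Hypothetical observations (invented values, not from the talk).
observations = [
    (0.5, 0.5, datetime(2016, 12, 12, 8, 5)),
    (0.7, 0.2, datetime(2016, 12, 12, 8, 40)),
    (0.6, 0.9, datetime(2016, 12, 12, 9, 15)),
    (2.3, 1.1, datetime(2016, 12, 12, 9, 20)),
]
cube = space_time_bins(
    observations,
    cell_size=1.0,
    time_step=timedelta(hours=1),
    t0=datetime(2016, 12, 12, 8, 0),
)
```

Because each point is binned independently, the same counting can be split across machines and the partial counts added together afterwards, which is what makes this kind of tool a natural fit for a cluster.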
12. Analytical Overview: Aggregating and Summarizing
• Reconstruct Tracks
- Summarize time-enabled points into tracks
13. Use Case: Hurricane Tracks
• Hurricane dataset
- 120,000 points, ~100 years
- Each point has:
- ID number
- Location
- Date
- Wind speed and pressure attributes
- Problems?
- Difficult to visualize that many points
- Difficult to visualize hurricane path
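A rough sketch of what "reconstructing tracks" means for a dataset shaped like the one above: group the points by ID, order each group by date, and keep one summarized path per storm instead of many loose points. This is plain Python written for illustration only, not the Reconstruct Tracks tool itself. The function name, the dictionary keys, and the sample values are all assumptions.

```python
from collections import defaultdict
from datetime import datetime


def reconstruct_tracks(points):
    """Group time-enabled points by ID and order each group by date.

    points : iterable of dicts with keys 'id', 'location' (x, y),
             'date' (datetime) and 'wind' (wind speed)
    Returns a dict mapping each ID to a summary of its track.
    """
    grouped = defaultdict(list)
    for p in points:
        grouped[p["id"]].append(p)

    tracks = {}
    for track_id, pts in grouped.items():
        pts.sort(key=lambda p: p["date"])  # chronological order along the path
        tracks[track_id] = {
            "path": [p["location"] for p in pts],
            "start": pts[0]["date"],
            "end": pts[-1]["date"],
            "max_wind": max(p["wind"] for p in pts),
            "count": len(pts),
        }
    return tracks


# Invented sample points; IDs, coordinates, dates and wind speeds are made up.
sample = [
    {"id": "H1", "location": (-75.0, 25.0), "date": datetime(2000, 1, 2), "wind": 120},
    {"id": "H2", "location": (-90.0, 27.0), "date": datetime(2000, 2, 1), "wind": 150},
    {"id": "H1", "location": (-70.0, 24.0), "date": datetime(2000, 1, 1), "wind": 90},
    {"id": "H1", "location": (-80.0, 25.5), "date": datetime(2000, 1, 3), "wind": 125},
]
tracks = reconstruct_tracks(sample)
```

Each track is now a single ordered path with summary attributes, which addresses both problems listed on the slide: far fewer features to draw, and a visible hurricane path.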
16. Real-Time GIS Performance
• Ingest high-velocity real-time data
• Observations in a Big Data Store
• Visualize high-velocity, high-volume data
- as an AGGREGATION,
- as discrete FEATURES,
- live & HISTORICALLY
• Visualizations CAN scale
[Diagram labels, ArcGIS 10.4]
- Ingestion (GeoEvent, ArcGIS Server): 4,000 e/s
- ArcGIS Spatiotemporal Big Data Store: 10s of thousands of e/s
- Visualization (Enhanced Map and Feature Service; Live and Historic Aggregates & Features): 4,000 e/s
- Stream Service / Stream Layer (Live Features): 3,000 e/s
- Clients: Desktop, Web, Device
Geo Analytics Performance
Spatiotemporal
Big Data Store
17. Discussion groups at geonet.esri.com
Step 1. Click orange “Join in” button to create your account.
Step 2. Join the Big Data or Sciences groups
Step 3. Contribute to AGU conversations!
Mansour Raad, Esri Big Data Team
mraad@esri.com
thunderheadxpler.blogspot.com
github.com/mraad
@mraad
For Questions/Discussion
Editor's notes
“hybrid” in that ArcGIS Server integrates open-source big data frameworks such as Apache Hadoop and Apache Spark on the cluster in order to run the analytics
Building blocks of this approach
buffer 8.2 million points or thousands of polygons in a little over a minute
Meet time constraints, especially against the next NSF proposal deadlines
These “feature geo analytics” tools run in both batch and streaming spatial analysis mode as distributed computations across a cluster of servers on typical “big” data sets, where static data exist in traditional geospatial formats (e.g., shapefile) locally on a disk or file share, attached as static spatiotemporal big data stores, or streamed in near-real-time. In other words, the approach registers large datasets or data stores with ArcGIS Enterprise (Server), then distributes analysis across a cluster of machines for parallel processing.
We aim to register large data stores / data sets with ArcGIS Server, then distribute analysis across a cluster of machines for parallel processing
Many frameworks/technologies exist for distributing computation
E.g., Hadoop, MapReduce, Spark
Spark: processes distributed data in memory; supports the MapReduce programming model
Includes additional framework-level distributed algorithms
ArcGIS Server integrates these technologies on a cluster to solve analytic problems
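As a toy illustration of the MapReduce programming model mentioned in these notes, the sketch below counts points per grid cell in two steps: a "map" step that processes each partition of the data independently, and a "reduce" step that merges the partial results. It is a single-machine stand-in written in plain Python. Threads play the role of cluster nodes, and none of this reflects how ArcGIS Server, Hadoop, or Spark are actually implemented. All function names and sample points are assumptions.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def partition(items, n_parts):
    """Split a list into n_parts chunks (stand-ins for data held on cluster nodes)."""
    return [items[i::n_parts] for i in range(n_parts)]


def map_partition(points, cell_size=1.0):
    """Map step: count the points of one partition per grid cell."""
    return Counter((int(x // cell_size), int(y // cell_size)) for x, y in points)


def merge_counts(a, b):
    """Reduce step: merge two partial results by adding their counts."""
    return a + b


# Invented sample points.
points = [(0.2, 0.3), (0.8, 0.1), (1.5, 0.5), (1.1, 0.9), (0.4, 1.7), (1.9, 0.2)]

# Each partition is processed independently; threads stand in for worker nodes.
with ThreadPoolExecutor(max_workers=3) as pool:
    partials = list(pool.map(map_partition, partition(points, 3)))

totals = reduce(merge_counts, partials, Counter())
```

The key property is that `map_partition` never needs to see another partition's data, so adding machines adds capacity; only the small partial results have to be brought together at the end.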
Due to lack of time, will focus on Aggregation and Summarizing
For fast, dynamic queries, integrate Cloudera Impala, an open-source SQL query engine that runs on Apache Hadoop and works with data in the Hadoop Distributed File System (HDFS).
Delivers fast SQL processing on HDFS
Read/write data in HDFS using Impala
Write code in Python, Java, or Scala (like C; the name is short for “scalable language”)
ArcPy helps you to perform geographic data analysis in Python
By the way, you’ll need at least:
8 CPU cores
16 GB RAM (32 GB is better)
512 GB solid-state drive (1 TB is better)
e/s = events per second
Performance example: buffer 8.2 million points or thousands of polygons in a little over a minute. Coming: ~250,000 writes to disk per second across 5 nodes.
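For a sense of scale, the figures in this note can be turned into rough rates. The short calculation below rests on two assumptions that the note does not state: "a little over a minute" is treated as exactly 60 seconds (so the buffering rate is an upper bound), and the disk writes are assumed to be spread evenly over the 5 nodes.

```python
# Buffering throughput: 8.2 million points in "a little over a minute".
points_buffered = 8_200_000
elapsed_seconds = 60  # assumption: rounds the run time down, so the rate is an upper bound
points_per_second = points_buffered / elapsed_seconds

# Planned write rate: ~250,000 writes per second across 5 nodes.
writes_per_second = 250_000
nodes = 5
writes_per_node = writes_per_second / nodes  # assumption: load is spread evenly

print(f"Buffering: at most ~{points_per_second:,.0f} points per second")
print(f"Writes: ~{writes_per_node:,.0f} per second per node")
```

Under those assumptions the buffering example works out to roughly 137,000 points per second at most, and the planned write rate to about 50,000 writes per second per node.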