The document describes a Hadoop-enabled ship tracking application developed for the Port of Rotterdam to analyze large volumes of ship position data. The application stores over 5 terabytes of ship position data, captured every 10 seconds, in Hadoop. Custom tools were developed to analyze the data and visualize it in ESRI's ArcGIS system. The application gives port managers faster access to information, improving operations, safety, and environmental management of this busy port.
For the Port of Rotterdam it's of course important to know where all the ships are!
One of the primary roles of the Port of Rotterdam Authority is to guide ships safely to their destinations. Each ship has an AIS transponder that continuously transmits the ship's current location. AIS stands for Automatic Identification System: not only the location is broadcast, but also the ID of the ship (among other parameters).
That's one way the Port of Rotterdam "knows" where the ships are.
Throughout the port, radar stations continuously scan for ships. This information, together with the AIS information, is passed to the vessel traffic service (VTS) operators. The VTS operators are located in several control stations throughout the port and are responsible for managing shipping in real time.
As ships become bigger and bigger, it becomes a challenge to transport the goods onward. This container ship can carry up to 18,000 containers, the equivalent of 125 million pairs of shoes.
Therefore the CEO of the Port of Rotterdam has said that the Port should become smarter, faster and more sustainable. The way to do that is to innovate, and this project contributes to that.
This animated slide gives an offline demonstration.
How does our presentation stand out?
First, our presentation deals with a geospatial data set: data where location is important. Although location is omnipresent, very few Hadoop applications deal with this kind of data; as far as I can tell, this might be the only such presentation at the summit.
Second, we deal with sensor data, meaning observations and measurements, whereas most Hadoop applications deal with social and/or transaction data. As with any measurement, errors can and will occur; special care is needed to take this into account in the analysis.
Third, we've created an interface that allows end users easy access to the information obtained from the big data. I'll demonstrate that during the presentation.
Here are some facts about the Port of Rotterdam. Its area is about three times that of the city of Brussels, or 80% of the Brussels region.
You may have heard of the Port of Rotterdam. It's one of the biggest ports in the world. In terms of size, it's big: it stretches for more than 40 km, and it can take up to 4 hours to sail from one end to the other.
We store all this data for three main customers.
The Harbour Master's main interest is safety. They use the tool to investigate incidents. For example, when there is a collision, they'd like to know what happened. They would of course like to prevent such incidents, so they want to see how the harbour is used and identify possible safety concerns.
The second group, capacity management, wants to ensure quick and easy passage of goods through the harbour. They're interested in identifying bottlenecks by looking at traffic patterns. Furthermore, they're interested in how current traffic patterns may change if certain modifications are made, such as the widening of channels. This enables better decision making.
The third group, environmental management, is interested in the pollution caused by shipping. They also evaluate the speed measures that have been put in place to reduce this pollution.
The big data work described here is part of the Portmaps project, in which the Port of Rotterdam has implemented a new geographical information system.
Uniform source of data.
All this data about ships and their location: is it big data? And does it make sense to use Hadoop for it?
Let us look at the big data score card:
For big data, three key characteristics are important: volume, velocity and variety. The data has a reasonable volume. It comes in at quite a high velocity, at over 1,000 records per second. It has only a single data format, so it doesn't meet the variety characteristic. However, it meets the other two characteristics, so it is big data and it does make sense to use Hadoop for it.
Volume = 18 billion records since 2009; that is about three times the number of people in the world.
Velocity = during this presentation, 250,000 records have been added.
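As a rough sanity check of these numbers (the 10-second reporting interval comes from the data set; treating it as uniform across all targets is an assumption for illustration):

```java
public class Throughput {
    // Records per day at a sustained ingest rate (86,400 seconds per day).
    public static long recordsPerDay(long recordsPerSecond) {
        return recordsPerSecond * 86_400L;
    }

    // Implied number of simultaneously tracked targets, assuming each
    // target reports once per reportIntervalSeconds (10 s for this feed).
    public static long impliedTargets(long recordsPerSecond, int reportIntervalSeconds) {
        return (long) recordsPerSecond * reportIntervalSeconds;
    }
}
```

At 1,000 records per second this works out to roughly 86 million records per day, produced by on the order of 10,000 tracked targets.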
One part of the Portmaps data set is ship position data. As we've seen, this data arrives every 10 seconds.
Several options were considered. The most promising one is storing the data in a geospatial database. However, such a database is expensive and may require custom partitioning. It also requires custom queries and code to perform analyses.
And then there is of course Hadoop.
The external radar/AIS system places a file in a spool directory every 10 seconds. Flume picks up this file, serialises it, and sinks it into Hadoop.
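A minimal Flume agent for this kind of pipeline could look like the sketch below. The agent name, directory paths, and channel sizing are illustrative assumptions, not the Port's actual configuration; only the spooling-directory source and HDFS sink pattern is what the slide describes.

```
ais.sources  = spool
ais.channels = mem
ais.sinks    = store

# Watch the spool directory the radar/AIS system writes into (path is hypothetical)
ais.sources.spool.type     = spooldir
ais.sources.spool.spoolDir = /var/spool/ais
ais.sources.spool.channels = mem

ais.channels.mem.type     = memory
ais.channels.mem.capacity = 10000

# Sink the events into HDFS, bucketed by hour
ais.sinks.store.type          = hdfs
ais.sinks.store.channel       = mem
ais.sinks.store.hdfs.path     = /data/ais/%Y/%m/%d/%H
ais.sinks.store.hdfs.fileType = DataStream
ais.sinks.store.hdfs.useLocalTimeStamp = true
```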
To be able to access the data, a custom toolbox has been created that accesses the Hadoop cluster. It can read and write data from HDFS and can submit jobs.
The clients, ArcMap and WebMap, make use of the geoprocessing services provided by the custom Java toolbox.
The data set is just a CSV line for each observed ship, every 10 seconds. Here is one example line. The fields are separated by the bar character. The following information is extracted from this line:
Track number – a number assigned by the radar/AIS system
MMSI – a unique identification number for the ship; based on this we know which ship it is.
X – The X-coordinate of the ship
Y – The Y-coordinate of the ship
Navigational status – whether the ship is moored, anchored or moving. In this case it is anchored.
Length – the length of the ship. In principle this property could be looked up via the MMSI; however, for combinations such as a push boat with barges the overall length varies, so it is included in the record.
Breadth – the width of the ship; the same considerations as for the length apply.
Time – the time at which the position was recorded.
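The record layout above can be sketched as a small parser in the toolbox's language, Java. Note that the field order and the example values below are assumptions for illustration; the actual feed has more fields, in an order defined by the radar/AIS system.

```java
// Sketch of parsing one bar-separated position record.
// Assumed field order for illustration: track|MMSI|X|Y|status|...
public class AisRecord {
    public final long trackNumber;   // assigned by the radar/AIS system
    public final long mmsi;          // unique ship identifier
    public final double x;           // X-coordinate
    public final double y;           // Y-coordinate
    public final String navStatus;   // moored, anchored or moving

    public AisRecord(long trackNumber, long mmsi, double x, double y, String navStatus) {
        this.trackNumber = trackNumber;
        this.mmsi = mmsi;
        this.x = x;
        this.y = y;
        this.navStatus = navStatus;
    }

    public static AisRecord parse(String line) {
        String[] f = line.split("\\|");  // fields are separated by the bar character
        return new AisRecord(
                Long.parseLong(f[0]),
                Long.parseLong(f[1]),
                Double.parseDouble(f[2]),
                Double.parseDouble(f[3]),
                f[4]);
    }
}
```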
The ship position data set is stored in Hadoop. Two considerations are important. One is that Hadoop prefers big files; in fact, it can split big files and send the pieces to different mappers if needed. The second is that users often want ranges of data to be considered.
We have chosen to partition the data at the hourly level. For each hour we store about 80 MB, so each file can be processed by one mapper. If we consider a day, 24 mappers can work in parallel.
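The hourly partitioning can be sketched as a simple mapping from an observation timestamp to an HDFS path. The base directory and the choice of UTC bucketing below are assumptions for illustration:

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

public class HourlyPartitioner {
    // Hypothetical layout: one ~80 MB file per hour, e.g. /data/ais/2013/06/26/14
    private static final DateTimeFormatter HOUR =
            DateTimeFormatter.ofPattern("yyyy/MM/dd/HH").withZone(ZoneOffset.UTC);

    public static String partitionPath(long epochSeconds) {
        return "/data/ais/" + HOUR.format(Instant.ofEpochSecond(epochSeconds));
    }
}
```

A query over a day range then simply enumerates the 24 hourly files, which is what lets up to 24 mappers work on one day in parallel.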
To facilitate easy deployment, the Port of Rotterdam has chosen the Hadoop as a Service solution provided by KPN, one of the main IT service providers for the Port.
The cluster is configured as stated.
Although the cluster is virtual, each node has exclusive access to its three disks.
This animated slide gives an offline demonstration.
To make it even easier for end users to obtain information, we've also created a webmap application. The end user just needs to go to the right website and gets a map of the area.