Data Deep Dive: Predicting China's Exports Using AIS Vessel Tracking Data (+ Other Uses)
We embark on a journey into AIS data in this Data Score entry, the first in a recurring series providing a detailed look at how specific datasets address investment debates.
Welcome to the Data Score newsletter, your go-to source for insights into the world of data-driven decision-making. Whether you're an insight seeker, a unique data company, a software-as-a-service provider, or an investor, this newsletter is for you. I'm Jason DeRise, a seasoned expert in the field of alternative data insights. As one of the first 10 members of UBS Evidence Lab, I was at the forefront of pioneering new ways to generate actionable insights from data. Before that, I successfully built a sellside equity research franchise based on proprietary data and non-consensus insights. Through my extensive experience as a purchaser and creator of data, I have gained a unique perspective that allows me to collaborate with end-users to generate meaningful insights.
Introducing the Data Deep Dive Series
I’ve appreciated all the positive feedback for The Data Score Newsletter as well as suggestions for improvement. One of the requests has been to go deeper into various datasets, explaining the use cases, methodology, and limitations associated with the analytics. The goal is to help insight seekers better understand the potential of various types of datasets and to help data providers think through how they can align their products with the outcomes needed by their clients.
AIS Data Deep Dive
In this entry, we delve into the world of AIS (Automatic Identification System) vessel tracking data [1]. Analytic techniques applied to the AIS data can help predict China's exports, monitor global trade, and understand real-time economic activity.
The methodology is built around real-time AIS data, tracking the largest container, dry bulk, and liquid tanker ships [2]. Using extensive metadata on fleets and global ports, as well as the current draft (draught) [3] and deadweight tonnage [4] of the vessel, a model can determine the aggregated cargo tonnage moving through global maritime points of interest.
First, I will start with the outcomes addressed by the data and approach before walking through the process from accessing the data to enriching the data. I will discuss the limitations, which are always an important component in deciphering insights from alternative data.
Common questions addressed with the data
The potential applications of methodologies leveraging AIS data are wide-ranging. It's an instrument to assess the impact of global events on trade activity, track ship traffic at major ports, and observe changes in supply chains. It offers valuable insights for the analysis of trading activity over time, particularly when paired with deadweight tonnage and draft (draught) data.
China cargo export estimates:
As the title of this newsletter entry highlights, this is possible with a lead time versus official statistics. Use the draft to calculate the capacity utilization applied to the deadweight tonnage to derive the overall tonnage shipped. Confirm the relationship with statistical modeling with an in-sample, test-sample, and out-of-sample set of data. It is also possible to estimate the number of TEUs (twenty-foot equivalent units), which is an industry standard measurement of volumes, as an approximation for the number of containers shipped.
Considerations: For China, adjustments need to be made to use satellite AIS data in place of terrestrial AIS data to track vessels close to shore, which differs from the approach in other geographies. The model provides a clear measurement of the volume of maritime shipments, using cargo weight as a proxy. It’s important to note that the weight and value of cargo are not directly related. If the mix of cargo shipped changes such that the value per unit of weight is higher or lower, the statistical relationship with country-reported export value could break down. Another constraint is that the share of higher-value goods shipped by air rather than by sea could change over time, which would weaken the relationship between the AIS data and country-reported export figures. Various sources cite that 60% of China’s exported goods are transported by maritime channels [5].
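The in-sample, test-sample, and out-of-sample check described above can be sketched roughly as follows. This is a minimal illustration rather than a production model: it assumes a time-ordered list of (estimated tonnage, reported exports) pairs, the 60/20/20 split is an arbitrary assumption, and the function names are hypothetical.

```python
from statistics import mean

def chronological_split(rows, in_pct=60, test_pct=20):
    """Split time-ordered rows into in-sample, test, and out-of-sample sets.

    The percentages are illustrative assumptions; rows are never shuffled,
    so later periods are always held out from earlier ones.
    """
    n = len(rows)
    i = n * in_pct // 100
    j = n * (in_pct + test_pct) // 100
    return rows[:i], rows[i:j], rows[j:]

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

def r_squared(xs, ys, slope, intercept):
    """Share of variance in ys explained by the fitted line."""
    my = mean(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1 - ss_res / ss_tot
```

Fit on the in-sample rows only, tune on the test rows, and report the fit on the out-of-sample rows. A sharp drop in out-of-sample fit is one hint that the weight-to-value relationship has shifted.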
International oil and gas supply:
Monitor oil and liquefied natural gas (LNG) tanker departures and arrivals at key oil and gas ports, leveraging the deadweight and draft to estimate capacity; monitor oil and LNG tankers departing from hard-to-track oil-producing geographies; or understand imports of oil and natural gas into China.
Considerations: Vessels’ AIS signals may go dark around some ports, which may mean this methodology cannot directly monitor imports or exports of oil or natural gas to and from sensitive geographies. However, it is possible to build a probabilistic model based on the data that estimates the likelihood that a vessel delivered or picked up oil or gas from a nearby port (or secondary vessel).
International coal and iron ore supply:
Monitoring bulker deadweight tonnage and draft from key ports in the metal and mining supply chains provides a highly accurate view of international supply volume. By digging deeper into the details of the berths and terminals in the port [6], it is possible to relate the berths to specific companies, which will allow for accurate estimates of volumes shipped (e.g., Vale’s Brazilian export volumes).
Considerations: Some companies may share berths with other companies, which will require an estimate to allocate the volumes shipped between the companies.
Port and canal congestion:
In 2021, one of the biggest debates in the market was supply chain bottlenecks, ranging from measuring the congestion time at the US West Coast ports to tracking the attempts of the supply chain to work around the bottleneck by using alternate routes (such as via the Panama Canal to get to the East Coast). By tracking the date and time stamps associated with each vessel’s entry into and exit from the port, the median and average time in port can be calculated as a proxy for congestion (note: voyages that merely pass through the geofenced port should be removed from the calculation).
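As a rough sketch of the time-in-port calculation, assuming each visit is already reduced to an (entry, exit) pair of timestamps for one port: the `min_hours` cutoff used to drop pass-throughs is an assumption for illustration, and `congestion_stats` is a hypothetical name.

```python
from datetime import datetime, timedelta
from statistics import mean, median

def hours_in_port(entry: datetime, exit_: datetime) -> float:
    """Elapsed time between port entry and exit, in hours."""
    return (exit_ - entry) / timedelta(hours=1)

def congestion_stats(visits, min_hours=4.0):
    """Median and mean time in port, ignoring pass-through voyages.

    visits: iterable of (entry, exit) datetime pairs for one port.
    min_hours: assumed cutoff below which a visit is treated as a
    pass-through and excluded from the calculation.
    """
    durations = [hours_in_port(e, x) for e, x in visits]
    stays = [d for d in durations if d >= min_hours]
    if not stays:
        return None
    return {
        "median_hours": median(stays),
        "mean_hours": mean(stays),
        "visits": len(stays),
    }
```

Reporting the median alongside the mean helps because a handful of very long stays can pull the mean up without the typical vessel waiting any longer.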
Considerations: The port congestion became so difficult that monitoring anchorages offshore was also needed to fully understand the situation. For a period, the time in port improved, but only because more time was being spent in the anchorage waiting instead of in port.
Shipping company fleet performance:
Leveraging metadata about the vessels, unique vessels can be mapped back to shipping companies, and metrics around capacity utilization of the fleet, average speeds, distance traveled, and number of ports entered can provide insight into how the quarter is progressing. Pairing it with market freight rate data could further enhance insights on company performance.
Considerations: Vessel ownership is constantly changing, so a process of mapping vessels back to owners needs to be maintained on an ongoing basis. Different metrics may need to be calculated on a pro forma and as-is basis depending on the acquisition and divestment activity of the fleet.
These questions can be addressed even more accurately by combining multiple datasets so that each fills in the limitations of the others. This can be done either by applying a Bayesian approach to probabilistic outcomes or by combining the datasets in a nowcasting model.
As a summary, the table below outlines sample investment outcomes needed, the related metrics, and the logic connecting the two.
This isn’t a complete list of use cases and investment debates to be addressed.
Data providers operating in this space should feel free to chime in with other use cases in the comments. Let’s get the conversation going.
In the rest of this Data Score Newsletter entry, we’ll go into more details about how the raw data can be transformed into insights to address the use cases above, expanding on some of the ideas introduced above.
Underlying Data
There are two datasets that should be sourced and joined: AIS data and points of interest data.
AIS data
The AIS data covers more than 100,000 maritime vessels globally. This robust network includes liquid tankers, bulkers, and container ships. The AIS data, updated every 15 minutes, encompasses historical vessel movements over the last 10+ years, contributing to a rich dataset that is essential for understanding trade patterns.
The underlying AIS data is free and available from multiple sources. However, the way the data is cleansed and enriched is what allows it to be a powerful generator of investment signals. Some providers of AIS data take on the work of cleaning and enriching the data as well as making the historic data easier to integrate into analytics.
The relevant data points provided include
The International Maritime Organization (IMO) number, which is the unique identifier of the vessel
The vessel’s cargo type: bulker, tanker, cargo, etc.
Location in Latitude and Longitude
Draft (Draught) of a vessel measures how deep the vessel sits in the water and is important for assessing capacity utilization.
Speed
UTC time stamp of the observation
Terrestrial AIS and Satellite AIS data points
Terrestrial AIS provides an accurate reading of where each vessel is when it is close to shore. This approach to collecting the locations allows for high frequency (measured in seconds per measurement) and high precision, which supports its primary use case of safety in navigating maritime traffic in and out of ports. However, because reception depends on line of sight, the range is just above 40 miles. Beyond the coasts, satellite AIS is used. Since the satellites are not bound by the limitations of terrestrial receivers, they can monitor vessels in the open ocean, Arctic regions, or any other area that is far from terrestrial AIS station coverage. However, satellite AIS is less frequent, and when vessels are close together in highly trafficked locations, it can be prone to misreading the precise location of each vessel.
Points of interest data (e.g., ports, canals, straits)
The key to generating insights from the AIS data is a second dataset covering global ports and key points of interest. This dataset includes metadata on global ports, terminals, and berths, each associated with a unique virtual, geofenced polygon. Similarly, key points of congestion, such as the Suez Canal or the Panama Canal, the Strait of Hormuz, or areas for anchorage, should be geofenced in the dataset.
The metadata of the ports should include the locations, a breakdown of the terminals and berths inside the port, and where possible, a mapping of berths to companies and the types of cargo shipped from the location (especially for commodities).
There are providers of port database data, but to get the details down to which company operates from each berth and what commodities are shipped from the location requires quite a bit of desk research, which makes the dataset quite rich for analytic insights and predictions.
Cleaning the Data
Data cleansing is a vital step in the process of converting raw AIS data into actionable insights.
The large dataset of vessels should be filtered to exclude vessels that are not large enough to carry a material amount of cargo and filtered to exclude passenger vessels (though one could monitor cruise line movements with the same methodology by filtering for these vessels).
AIS data can be filled with outliers and potentially contain inaccuracies; an algorithm specifically designed to manage these challenges is required.
An algorithm should begin by identifying and removing any clear anomalies, such as coordinates that place a ship on land or speeds that exceed the known physical limitations of the vessel type.
Next, the data is scrutinized for inconsistent reporting, such as abrupt changes in location or course that are beyond the operational norms of a vessel. These inconsistencies may be due to a temporary loss of signal or error in the AIS device, and such data points are discarded to maintain the integrity of the dataset. This process includes backward-looking tests as more data on each vessel’s journeys is captured. New data could result in a previously classified entry or exit from a port being reclassified as a pass-through or as still in port. This improves the accuracy of the data, but, as with any data that can be restated, it’s important to include timestamps on each version of the data point to allow proper back-testing by financial market users of the data.
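The first two cleaning passes above (impossible values, then abrupt jumps) might look something like the sketch below. It assumes pings for a single vessel sorted by time, with hypothetical field names `ts` (epoch seconds), `lat`, `lon`, and `speed` (knots). The 30-knot ceiling is an illustrative assumption that would be set per vessel type, and the on-land check is left out because it needs coastline data.

```python
import math

EARTH_RADIUS_NM = 3440.065  # mean Earth radius in nautical miles

def haversine_nm(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in nautical miles."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_NM * math.asin(math.sqrt(a))

def clean_track(pings, max_knots=30.0):
    """Drop pings with impossible values or implausible jumps.

    max_knots is an assumed ceiling; a real pipeline would vary it by
    vessel type.
    """
    kept = []
    for p in pings:
        # Pass 1: coordinates or reported speed outside physical limits.
        if not (-90 <= p["lat"] <= 90 and -180 <= p["lon"] <= 180):
            continue
        if p["speed"] < 0 or p["speed"] > max_knots:
            continue
        # Pass 2: speed implied by the move from the last kept ping.
        if kept:
            prev = kept[-1]
            hours = (p["ts"] - prev["ts"]) / 3600
            if hours <= 0:
                continue
            implied = haversine_nm(prev["lat"], prev["lon"], p["lat"], p["lon"]) / hours
            if implied > max_knots:
                continue
        kept.append(p)
    return kept
```

One known weakness of this simple version: if the first ping is the bad one, the good pings after it are compared against it and dropped, so a production version would need a more robust starting point.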
The cleansing process also includes aggregate-level benchmarking and back-testing the data against ground-truth data about ports and fleet activity. This rigorous cleansing process is what ensures the resulting dataset's reliability, providing a solid foundation for the subsequent analyses and insights.
Selecting between terrestrial and satellite-based AIS data: Use the terrestrial location data as a priority, but when it's not available, supplement with the satellite location data. Sometimes, the land-based transmission of the AIS data can be redacted by local authorities. But the satellite-based AIS provides an alternative source for vessel information while near or in ports. Satellite-based AIS provides a global overview of maritime vessel movements, irrespective of the vessel's location.
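One way to express the "terrestrial first, satellite as fallback" rule is to bucket pings by vessel and time window and keep a single ping per bucket. The sketch below assumes hypothetical fields `imo`, `ts` (epoch seconds), and `source` (either "terrestrial" or "satellite"); the 900-second bucket simply mirrors the 15-minute update frequency mentioned earlier.

```python
def select_pings(pings, bucket_seconds=900):
    """Keep one ping per vessel per time bucket, preferring terrestrial.

    A satellite ping is kept only when no terrestrial ping exists for
    the same vessel in the same bucket.
    """
    chosen = {}
    for p in pings:
        key = (p["imo"], p["ts"] // bucket_seconds)
        current = chosen.get(key)
        if current is None:
            chosen[key] = p
        elif current["source"] == "satellite" and p["source"] == "terrestrial":
            chosen[key] = p  # upgrade the bucket to the terrestrial reading
    return sorted(chosen.values(), key=lambda p: (p["imo"], p["ts"]))
```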
Enriching the data
This section discusses how to take the cleansed data and turn it into metrics that are highly aligned with use cases. There are a number of assumptions that need to be made to use the data effectively.
The port database plays a critical role in this enrichment process. The geofenced polygon of each point of interest allows the system to monitor if a specific ship has entered, is inside of, or has exited a point of interest based on the coordinates of each ping (location update). The sequence and frequency of these entry and exit events form the basis for calculations in the model, delivering valuable insights on the vessel's activity. The output allows for the number of vessels to be tracked entering and exiting key ports and canals, the aggregate tonnage being shipped, and the length of time at each port or point of interest. This dataset will need continuous maintenance over time as relationships between companies and berths gradually change.
The modeling requires a Geographic Information System (GIS) [7] to analyze the movements of the vessels around the world, capturing when they enter and exit the geofenced polygons associated with each port, terminal, and berth. It’s important that the entries and exits are deduplicated per day to avoid double-counting activity.
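To make the geofencing step concrete, here is a minimal sketch without a full GIS: a ray-casting point-in-polygon test plus a pass over one vessel's time-sorted pings that emits entry and exit events, deduplicated per UTC day. It treats latitude and longitude as planar coordinates, which is an approximation that is only reasonable for small polygons away from the poles and the antimeridian. Field and function names are hypothetical.

```python
from datetime import datetime, timezone

def point_in_polygon(lat, lon, polygon):
    """Ray-casting test; polygon is a list of (lat, lon) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        lat1, lon1 = polygon[i]
        lat2, lon2 = polygon[(i + 1) % n]
        if (lat1 > lat) != (lat2 > lat):  # edge straddles the point's latitude
            cross_lon = lon1 + (lat - lat1) / (lat2 - lat1) * (lon2 - lon1)
            if lon < cross_lon:
                inside = not inside
    return inside

def entry_exit_events(pings, polygon):
    """Entry/exit events for one vessel and one polygon.

    pings: dicts with ts (epoch seconds), lat, lon, sorted by ts.
    At most one entry and one exit are kept per UTC day.
    """
    events, seen = [], set()
    was_inside = False
    for p in pings:
        now_inside = point_in_polygon(p["lat"], p["lon"], polygon)
        if now_inside != was_inside:
            kind = "entry" if now_inside else "exit"
            day = datetime.fromtimestamp(p["ts"], tz=timezone.utc).date()
            if (kind, day) not in seen:
                seen.add((kind, day))
                events.append((kind, p["ts"]))
        was_inside = now_inside
    return events
```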
The relation of the draft to the full deadweight of the vessel is important. The maximum draft, as seen in the history of each vessel, marks what full capacity looks like. However, it is important to realize that part of the deadweight of the vessel is the fuel, the crew, and other supplies in addition to the cargo being shipped. Furthermore, the vessel never has a zero draft level; it must take on water to ballast [8] the vessel as an offset for the lack of cargo. Therefore, there should be parameters set based on the maximum draft of the vessel, such as less than 60% of the peak draft being considered empty from a cargo point of view. However, at peak draft, take 5% off the top for non-cargo deadweight such as fuel and the crew. Leverage desk research and expert interviews to set the range of draft measurements relative to the maximum draft to reflect what empty and full capacity look like in the data.
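A simple way to turn those two parameters into a utilization figure is shown below. The 60% "empty" floor and the 5% non-cargo haircut come from the text above; the straight-line interpolation between empty and full is an assumption made here for illustration, and the real relationship between draft and load depends on hull shape. Function names are hypothetical.

```python
def cargo_utilization(draft, max_draft, empty_frac=0.60, non_cargo_frac=0.05):
    """Estimated share of deadweight tonnage that is cargo.

    Below empty_frac of the peak draft the vessel is treated as empty.
    Between that floor and peak draft, utilization is interpolated
    linearly (an assumption), then scaled down for non-cargo weight.
    """
    if max_draft <= 0:
        raise ValueError("max_draft must be positive")
    ratio = min(draft / max_draft, 1.0)
    if ratio < empty_frac:
        return 0.0
    loaded = (ratio - empty_frac) / (1.0 - empty_frac)
    return loaded * (1.0 - non_cargo_frac)

def utilized_dwt(draft, max_draft, dwt):
    """Estimated cargo tonnage: utilization applied to deadweight tonnage."""
    return cargo_utilization(draft, max_draft) * dwt
```

With these defaults, a vessel at its peak draft is credited with 95% of its deadweight tonnage as cargo, and one at half its peak draft is credited with none.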
Automating as many of the enrichment steps as possible is critical to being able to generate insights at scale. However, even with extensive automation, humans are still needed to investigate and address the exceptions triggered by the automated process.
Key metrics to calculate:
Count of vessels entering and exiting specific points of interest, such as ports, canals, and straits.
Deadweight tonnage of the vessels entering and exiting specific points of interest.
Utilized deadweight tonnage entering and exiting specific points of interest.
These measurements can then be aggregated to higher levels of aggregation, such as by geography and company. The metrics can also be aggregated as a time series mapped to arrival or departure dates.
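The three metrics above can be rolled up with a small grouping step. This sketch assumes each port call is a dict with hypothetical keys `port`, `date`, `imo`, `dwt`, and `utilization` (a 0 to 1 share), and counts each vessel once per port per day; swapping `port` for a country or company key gives the higher-level aggregations.

```python
from collections import defaultdict

def aggregate_port_calls(calls):
    """Vessel count, DWT, and utilized DWT per (port, date)."""
    totals = defaultdict(lambda: {"vessels": set(), "dwt": 0.0, "utilized_dwt": 0.0})
    for c in calls:
        bucket = totals[(c["port"], c["date"])]
        if c["imo"] in bucket["vessels"]:
            continue  # same vessel already counted for this port-day
        bucket["vessels"].add(c["imo"])
        bucket["dwt"] += c["dwt"]
        bucket["utilized_dwt"] += c["dwt"] * c["utilization"]
    return {
        key: {
            "vessel_count": len(v["vessels"]),
            "dwt": v["dwt"],
            "utilized_dwt": v["utilized_dwt"],
        }
        for key, v in totals.items()
    }
```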
An important nuance is the differentiation between aggregations based on arrival or departure dates, which are crucial for long transpacific movements. Depending on the use case, each could be relevant. For some questions, the origin and destination pairs of travel are important, say, to understand the trade activity between Asia and North America. The pair will not be known in the data until the trip is complete, so be advised that long voyages that take 4-6 weeks would take that long to appear in the data aggregations by origin-destination pairs. Further, by aggregating by departure date, the data would continually be restated until all voyages are complete (with short voyages showing up first in the aggregation). To avoid restating, origin-destination pair aggregations are better grouped by the arrival date instead of the departure date.
In the question of predicting China's exports, it's better to use the departure date without factoring in the final destination to get the fastest and most complete signal the dataset would allow.
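The two aggregation choices can be put side by side. In this sketch, each voyage is a dict with hypothetical keys `origin`, `destination` (None while the vessel is still at sea), `departure_date`, `arrival_date`, and `utilized_dwt`; none of these names come from a specific provider's schema.

```python
from collections import defaultdict

def od_pairs_by_arrival(voyages):
    """Origin-destination tonnage keyed by arrival date.

    Only completed voyages are included, so past dates are not restated
    as long voyages finish.
    """
    totals = defaultdict(float)
    for v in voyages:
        if v["destination"] is None:
            continue  # still at sea: the pair is not yet known
        totals[(v["origin"], v["destination"], v["arrival_date"])] += v["utilized_dwt"]
    return dict(totals)

def exports_by_departure(voyages, origin):
    """Tonnage leaving one origin, keyed by departure date.

    The destination is ignored, so voyages still at sea count immediately.
    """
    totals = defaultdict(float)
    for v in voyages:
        if v["origin"] == origin:
            totals[v["departure_date"]] += v["utilized_dwt"]
    return dict(totals)
```

The second function is the shape of signal described for the China export question: it is available as soon as the vessel leaves, at the cost of not knowing where the cargo is going.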
Limitations to consider
The methodology tracks large commercial ships only, leaving smaller ships in local trade outside its purview. The nature of cargo, especially for container ships, remains an area of uncertainty. Variables like split cargoes, anomalies due to the AIS system being turned off, and the varying weights of container contents add layers of complexity to the interpretation of the data.
In the logic for determining if a vessel has docked at a port, it's important to track the time spent in the polygon and set a threshold to be sure the vessel has stayed long enough. However, set the threshold too long, and you will miss quick, partial cross-docking [9] cargo drops and pick-ups. Set it too short, and you will count too many arrivals by mistaking a pass through the polygon for a true arrival. As a somewhat subjective range, consider using somewhere between 4 and 8 hours as the minimum threshold for assigning a unique vessel to the berth.
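Because the threshold is subjective, it can help to see how the arrival count moves across the 4-to-8-hour range before settling on a value. A small sketch, assuming visits are (entry, exit) pairs in epoch seconds; the function names and the three sample thresholds are illustrative.

```python
def classify_visit(entry_ts, exit_ts, min_hours=4.0):
    """Label a polygon visit as a true arrival or a pass-through."""
    hours = (exit_ts - entry_ts) / 3600
    return "arrival" if hours >= min_hours else "pass_through"

def arrivals_at_thresholds(visits, thresholds=(4.0, 6.0, 8.0)):
    """Count arrivals under each candidate threshold."""
    return {
        t: sum(1 for e, x in visits if classify_visit(e, x, t) == "arrival")
        for t in thresholds
    }
```

If the count barely changes across the range, the choice matters little for that port; if it swings widely, short stays are common there and the threshold deserves a closer look.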
It is possible for the ballasting pattern of a vessel to change, which could give a different read of what the draft means for capacity utilization.
Beware of the stated destination information. It’s best to think of this as a survey where the respondent is providing an intention or perception of where they will head. However, it's not always filled in.
Lloyd’s List Intelligence is one of the leading companies that leverages AIS data for insurance purposes and is an authority on AIS datasets.
Lloyd’s List Intelligence: Although raw AIS data is the default go-to for information, it’s difficult to plan ahead when about 32% of vessels sail without a destination logged https://www.lloydslistintelligence.com/knowledge-hub/data-storytelling/how-data-analytics-and-machine-learning-helps-the-shipping-sector-thrive
And, as I understand it, it’s also common for the destination to change as the economics of the freight market change.
If freight carrying rates increase materially, it may make sense for the vessel to change direction, pick up more cargo, and add more drop-off destinations.
An alternative scenario to consider could be that if commodity shipments could be fulfilled by a different vessel with a more profitable route, the voyage could be rerouted to maximize profit.
The main takeaway is to avoid the destination field or, if you do use it, to be highly skeptical.
Lloyd’s List Intelligence also shares more details on the reasons why AIS data might have gaps caused by AIS data going “dark”:
Lloyd’s List Intelligence Quote: “There are several legitimate reasons for a vessel to go ‘dark’, including passing through dangerous waters and issues with infrastructure, which would prevent AIS messages from being received… The more common and well-established pattern to explain AIS gaps is that ships are deliberately turning off AIS to obfuscate their movements.” https://www.lloydslistintelligence.com/knowledge-hub/data-storytelling/jump-in-ais-gaps-mask-russian-maritime-activity
It’s important to keep the geopolitical backdrop in mind while analyzing the AIS data and to set up anomaly detection [10] to identify times when trends at ports have differed materially from history, then investigate what’s going on via the press, industry experts, and satellite images. It’s also important to have a process to distinguish actual declines in activity from declines that reflect dropping data coverage (for whatever reason) without a corresponding real drop in activity.
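One simple form of anomaly detection is a trailing z-score on a port-level series such as weekly vessel counts. The sketch below is only one of many possible techniques, and the window length and threshold are assumptions. A flag says "look here", not "activity fell": the follow-up check on coverage versus real activity still has to be done by a person.

```python
from statistics import mean, stdev

def flag_anomalies(series, window=8, z_threshold=3.0):
    """Indices where a value departs sharply from its trailing window.

    Each point is compared with the mean and standard deviation of the
    previous `window` observations.
    """
    flags = []
    for i in range(window, len(series)):
        hist = series[i - window:i]
        mu, sigma = mean(hist), stdev(hist)
        if sigma == 0:
            if series[i] != mu:
                flags.append(i)  # any change from a flat history stands out
            continue
        if abs(series[i] - mu) / sigma > z_threshold:
            flags.append(i)
    return flags
```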
An advanced approach can include remote sensing techniques using computer vision to monitor vessels at key points of interest that have turned the AIS off. Planet Labs highlights the approach here: https://www.planet.com/pulse/planets-data-leveraged-to-identify-dark-vessels-and-monitor-the-illicit-russian-oil-trade/
Action items to begin using the data
To leverage the AIS data, follow these steps:
Access real-time AIS data to get the locations of vessels around the world, including fleet metadata, leveraging a GIS.
Create a second database in the GIS with a robust set of metadata on global ports, terminals, and berths.
Create a model that time stamps the entry and exit of vessels into each port. Use algorithms and statistical methods to clean the data to ensure accuracy.
Calculate aggregated metrics such as vessel count, deadweight tonnage, and utilized deadweight tonnage at the port, geography, and company level. Choose the time aggregation (departure date or arrival date) that fits the specific use case of the data.
Consider the assumptions and limitations inherent in this model in the interpretation of the data.
What dataset should I deep dive into next? Let me know in the comments.
- Jason DeRise, CFA
So much nautical jargon today in the footnotes, below… But before getting to the simple definitions, I asked ChatGPT to write some “Dad Jokes” about this article:
GPT4: “Sure, I'd be happy to bring some light-hearted humor into this topic. Here are a few dad jokes related to AIS data and shipping:
Why don't ships ever get lost at sea? Because they always follow the "shipping" route.
Why was the AIS data analyst always calm? Because he knew there's no point in going overboard with data.
Why did the ship stay in port for a while? It was harboring a grudge against the sea.
What did one vessel say to the other about their AIS data? "You're making waves in the data set!"
Why was the boat always accurate in its AIS data? Because it didn’t want to get into ship shape trouble.
Remember, the key to a good dad joke is the groan it induces!”
[1] AIS (Automatic Identification System): A tracking system used on ships and by vessel traffic services for identifying and locating vessels by electronically exchanging data with other nearby ships and AIS base stations.
[2] Vessel Types:
Container: A container or cargo vessel is a type of ship designed to transport goods in large, standardized containers. These vessels are equipped with special infrastructure, such as cranes or guides, to load and unload the containers. The size of container ships can vary significantly, from small coastal feeder ships to massive vessels capable of carrying thousands of twenty-foot equivalent units (TEUs).
Bulker: A type of ship designed to transport unpackaged bulk commodities, such as grains, coal, ore, and cement, in its cargo holds.
Tanker: A tanker vessel is a ship designed for the specific purpose of transporting large volumes of liquids, particularly hydrocarbons, over long distances. This includes substances like crude oil, petroleum products, liquefied natural gas (LNG), chemicals, and even wine.
[3] Draft (Draught): The vertical distance between the waterline and the bottom of the hull (keel), which indicates the deepest point of a vessel. It is used to determine the depth needed for a vessel to safely navigate and dock at a port.
[4] Deadweight tonnage: The carrying capacity of a ship, which includes cargo, fuel, passengers, crew, and their belongings.
[5] China Exports via Maritime: An example article stating that 60% of exports move via maritime channels: https://www.reuters.com/article/us-global-shipping-container/boxed-out-chinas-exports-pinched-by-global-run-on-shipping-containers-idUSKBN28K0UA
[6] Ports break down into Terminals and Berths:
Terminals: Facilities where cargo containers are transshipped between different transport vehicles for onward transportation.
Berths: A location on the terminal used specifically for mooring vessels.
[7] GIS (Geographic Information System): A system designed to capture, store, manipulate, analyze, manage, and present all types of geographical data.
[8] Ballasting Pattern: Refers to the practice of loading ballast water into a ship to increase its stability and maneuverability.
[9] Cross-dock cargo drops: The process of unloading goods from an incoming vessel and loading these goods directly onto outbound vessels, with little or no storage in between.
[10] Anomaly Detection: In data analysis, it's a statistical technique used to identify unusual patterns that do not conform to expected behavior.