Tornado Archive v2.1 brings a lot of updates to tornado data, especially on the US side of things – the extensive amount of detail we have been able to bring together is completely unprecedented. This blog post describes what I spent most of my time working on leading up to the release.
NCEI Data
Prior to the release of Tornado Archive v2.1, we were using the Storm Prediction Center’s (SPC’s) dataset for the bulk of our US data (1950-2019). But that dataset is derived from the Storm Events Database provided by the National Centers for Environmental Information (NCEI), and although the two datasets are similar, the NCEI’s has some features the SPC’s does not, such as waterspouts and detailed event descriptions. That’s why I decided to integrate it with the rest of our data.
The NCEI breaks up tornado tracks into segments by county, so the first step of my process was linking those segments together. Using a Python script, I checked each segment against other segments that were close in space and time. Unfortunately, there wasn’t a specific attribute in the NCEI data that could reliably indicate whether two segments were part of the same tornado (in fact, for most tornadoes, the end point of one segment was not even at the same place as the start point of the next). Therefore, I checked various attributes, such as event ID, county name, time, and location, and turned all of this information into a single number representing the ‘difference’ between two segments. If two segments are far apart in time or space, the difference is large; if they are close in all respects, the difference is near zero. I grouped each segment with whichever other segment had the lowest difference, as long as that value was below a certain threshold. As I constructed each group of segments, I recorded the differences between connected segments (think of these differences as the strength of links in a chain: a difference near zero means two segments are very strongly connected, while a large difference means a weak link).
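Here is a minimal sketch of that linking idea. The attribute names, weights, and threshold below are illustrative assumptions, not the exact ones the real script uses.

```python
from math import hypot

def difference(a, b):
    """Combine several attribute comparisons into one 'difference' number.

    `a` and `b` are dicts describing NCEI segments (with datetime start/end
    times); a small value means the two segments likely belong to the same tornado.
    """
    score = 0.0
    # Segments that don't share an event/episode ID are penalized.
    if a["episode_id"] != b["episode_id"]:
        score += 5.0
    # Time gap (minutes) between the end of one segment and the start of the next.
    score += abs((b["start_time"] - a["end_time"]).total_seconds()) / 60.0
    # Spatial gap (degrees, weighted) between the end point and the next start point.
    score += 50.0 * hypot(b["start_lon"] - a["end_lon"], b["start_lat"] - a["end_lat"])
    return score

def link_segments(segments, threshold=10.0):
    """Greedily attach each segment to its best (lowest-difference) predecessor."""
    links = {}  # segment index -> (predecessor index, difference)
    for i, seg in enumerate(segments):
        best_j, best_d = None, threshold
        for j, other in enumerate(segments):
            if i != j:
                d = difference(other, seg)
                if d < best_d:
                    best_j, best_d = j, d
        if best_j is not None:
            links[i] = (best_j, best_d)  # the recorded 'link strength' used later
    return links
```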
Left: A linkage error. The top left segment should have linked with the bottom right segment; instead, it linked with a separate tornado that happened to touch down on the county line. Right: A particularly bad example of messy NCEI data – all of these segments are from the same tornado!
To improve the accuracy of the segment linkages, I compared them against the SPC data. My program went through each tornado in the SPC data and checked how closely it matched each subgroup of each nearby NCEI segment group, based on how close the start and end points were. If an SPC tornado matched better with a subgroup than with a whole group, the differences between the NCEI segments in question were used to determine whether the group should be broken up (again, it’s useful to think in terms of link strength: strongly connected segments ignore the SPC path, while weakly connected ones are easily broken apart based on where the SPC path starts and ends). In this way, I simultaneously refined the segment linkage and matched SPC tornadoes to NCEI tornadoes.
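To make that cross-check concrete, here is a rough sketch of the idea, where `link_diffs[k]` is the recorded difference between segments k and k+1 in a group. The scoring and the `strong_link` cutoff are hypothetical simplifications of the real logic.

```python
from math import hypot

def endpoint_score(spc, chain):
    """How well an SPC tornado's start/end points match a chain of NCEI segments (lower is better)."""
    start_gap = hypot(spc["start_lon"] - chain[0]["start_lon"],
                      spc["start_lat"] - chain[0]["start_lat"])
    end_gap = hypot(spc["end_lon"] - chain[-1]["end_lon"],
                    spc["end_lat"] - chain[-1]["end_lat"])
    return start_gap + end_gap

def maybe_split(spc, chain, link_diffs, strong_link=2.0):
    """Break the chain where the SPC path suggests it ends, but only across a weak link."""
    whole_score = endpoint_score(spc, chain)
    for end in range(1, len(chain)):
        sub_score = endpoint_score(spc, chain[:end])
        # The SPC path matches this prefix better than the whole group,
        # and the link we would cut is weak -> split the group there.
        if sub_score < whole_score and link_diffs[end - 1] > strong_link:
            return chain[:end], chain[end:]
    return chain, []
```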
DAT Data
I, along with other members of the team, had been interested for a while in including the DAT (Damage Assessment Toolkit, a source of detailed paths and damage polygons), so I started working on adding that as well.
Adding data from the DAT presented different challenges. Its tornadoes are not segmented, but many have multiple geometry objects (polygons and paths) associated with them, and these are not intrinsically connected. So I started by grouping all enclosed polygons with similar timestamps together, with the restriction that the rating of an enclosed polygon be strictly higher than the rating of the polygon enclosing it (so F1 can be inside F0 but F0 cannot be inside F1). Then, I matched grouped polygons with paths, similar to how I linked NCEI segments together, using a ‘difference’ value. If two groups of polygons shared the same line, those groups were combined.
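As a simplified sketch of the nesting rule (the attribute names and the time window here are assumptions for illustration):

```python
from shapely.geometry import Polygon  # each record's "geom" is assumed to be a shapely Polygon

def can_nest(inner, outer, max_time_gap_min=30):
    """An inner damage polygon may join an outer one only if it is spatially
    enclosed, close in time, and strictly higher rated (F1 inside F0, never the reverse)."""
    time_gap_min = abs((inner["time"] - outer["time"]).total_seconds()) / 60
    return (
        time_gap_min <= max_time_gap_min
        and outer["geom"].contains(inner["geom"])
        and inner["rating"] > outer["rating"]
    )
```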
Many polygon groups did not have lines at all, so I needed to generate a medial axis (just a line running down the middle) for them. I took the lowest-rated polygon in the group (the one that represents the full extent of the tornado), rasterized it, and found its medial ‘skeleton’. Then, I selected the longest line in the skeleton (which can be assumed to be the path of the tornado, since tornadoes are longer than they are wide) and turned it back into an actual line instead of a raster image. I used a Gaussian filter to smooth it, and applied the Ramer-Douglas-Peucker algorithm to remove unnecessary points (for space-saving purposes).
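A condensed sketch of that pipeline is below, using scikit-image, SciPy, and Shapely. The grid resolution, smoothing strength, and simplification tolerance are illustrative, and the longest-branch search is approximated with two breadth-first passes over the skeleton.

```python
import numpy as np
from collections import deque
from scipy.ndimage import gaussian_filter1d
from shapely.geometry import LineString, Polygon
from skimage.draw import polygon as draw_polygon
from skimage.morphology import skeletonize

def medial_line(poly: Polygon, resolution: int = 500, smooth_sigma: float = 2.0,
                simplify_tol: float = 1e-4) -> LineString:
    # 1. Rasterize the polygon onto a grid.
    minx, miny, maxx, maxy = poly.bounds
    xs, ys = np.array(poly.exterior.xy)
    cols = (xs - minx) / (maxx - minx) * (resolution - 1)
    rows = (ys - miny) / (maxy - miny) * (resolution - 1)
    grid = np.zeros((resolution, resolution), dtype=bool)
    rr, cc = draw_polygon(rows, cols, grid.shape)
    grid[rr, cc] = True

    # 2. Reduce it to a one-pixel-wide skeleton.
    skel = skeletonize(grid)
    pixels = set(map(tuple, np.argwhere(skel)))

    def neighbors(p):
        r, c = p
        return [(r + dr, c + dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                if (dr or dc) and (r + dr, c + dc) in pixels]

    def farthest_from(start):
        # Breadth-first search; returns the farthest pixel and the path back to the start.
        seen, queue, last = {start: None}, deque([start]), start
        while queue:
            last = queue.popleft()
            for n in neighbors(last):
                if n not in seen:
                    seen[n] = last
                    queue.append(n)
        path, node = [], last
        while node is not None:
            path.append(node)
            node = seen[node]
        return last, path[::-1]

    # 3. Approximate the longest branch: two BFS passes from an arbitrary skeleton pixel.
    far_end, _ = farthest_from(next(iter(pixels)))
    _, path = farthest_from(far_end)

    # 4. Map pixel coordinates back to lon/lat, smooth, and simplify (Ramer-Douglas-Peucker).
    path = np.array(path, dtype=float)
    lons = path[:, 1] / (resolution - 1) * (maxx - minx) + minx
    lats = path[:, 0] / (resolution - 1) * (maxy - miny) + miny
    lons = gaussian_filter1d(lons, smooth_sigma)
    lats = gaussian_filter1d(lats, smooth_sigma)
    return LineString(list(zip(lons, lats))).simplify(simplify_tol)
```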
By convention, if there is only one polygon available for a tornado, it is given the same rating as the tornado. If there are multiple polygons (multiple levels of damage surveyed), they are given their actual ratings. So for example, in the image below, that EF4 polygon does not mean that EF4 damage occurred everywhere inside it; it instead represents the maximum areal extent of damage.
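In code form, the convention amounts to something like this (a tiny illustrative helper, not part of the real pipeline):

```python
def display_ratings(polygons, tornado_rating):
    """Apply the rating convention: a lone polygon inherits the tornado's
    overall rating; multiple surveyed polygons keep their own ratings."""
    if len(polygons) == 1:
        return [tornado_rating]
    return [p["rating"] for p in polygons]
```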
Bringing it together
Then, I matched all of that with the SPC+NCEI data I had already generated. I assigned each tornado a unique ID that referred back to its constituent segments and geometries in each of the three source datasets, which makes it easy to cross-compare if necessary.
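As a purely illustrative example of what such a cross-reference record might look like (the field names and values are hypothetical, not the real schema):

```python
# Hypothetical record shape; the real schema lives in our database models.
tornado_record = {
    "id": 100001,                 # our own unique tornado ID
    "spc_ids": [1],               # matching row(s) in the SPC dataset
    "ncei_segment_ids": [2, 3],   # linked NCEI county segments
    "dat_geometry_ids": [4, 5],   # DAT paths/polygons for this tornado
}
```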
Neither the input data nor the process applied to it was perfect (see examples from NCEI above and examples from DAT below), so it needed a fair amount of quality control, and I wanted to do that work in an organized manner. That’s why I took the Sequelize (Node.js) tornado database structure/API that Spencer wrote a few months ago and added several functions and scripts that let me easily make edits to the data. These included a ‘merge’ function that merges two tornadoes in various ways (e.g. combining tracks but keeping only one set of attributes, or combining attributes but keeping only one track), and a script that can read a KML file and insert it as the path for a given tornado (KML files are the default export in Google Earth, so now paths/polygons drawn there can be easily imported into our database). The quality control for the full 1950-2019 data took me about a week.
I came across some particularly creative uses of the DAT. Left: Someone inputting a DAT track using a different line for each intensity; Right: Someone using the line tool to draw polygons.
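The actual import script lives in the Node.js/Sequelize tooling, but the KML-parsing step looks roughly like this (shown in Python for illustration; the database call at the end is a hypothetical stand-in):

```python
import xml.etree.ElementTree as ET

KML_NS = {"kml": "http://www.opengis.net/kml/2.2"}

def read_kml_path(filename):
    """Return a list of (lon, lat) tuples from the first LineString in a KML file."""
    root = ET.parse(filename).getroot()
    coords = root.find(".//kml:LineString/kml:coordinates", KML_NS)
    points = []
    for triplet in coords.text.split():
        lon, lat, *_ = map(float, triplet.split(","))  # KML order is lon,lat[,alt]
        points.append((lon, lat))
    return points

# e.g. update_tornado_path(tornado_id, read_kml_path("survey.kml"))  # hypothetical API call
```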
I also entirely stopped storing country or state information in our database. Instead, I wrote a program that assigns countries and US states and counties to all of our tornadoes based purely on their coordinates. This removed the hassle of trying to keep track of attributes that are ultimately redundant (since they are entirely determined by geographical location).
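A minimal sketch of that lookup, assuming a geopandas spatial join against a boundary shapefile (the file name and column names are assumptions):

```python
import geopandas as gpd

def assign_regions(tornado_df, admin_shapefile="admin_boundaries.shp"):
    """Attach country/state/county labels to tornadoes from their start coordinates."""
    admin = gpd.read_file(admin_shapefile)  # polygons with e.g. 'country', 'state', 'county' columns
    points = gpd.GeoDataFrame(
        tornado_df,
        geometry=gpd.points_from_xy(tornado_df["start_lon"], tornado_df["start_lat"]),
        crs=admin.crs,
    )
    # Spatial join: each tornado point picks up the attributes of the polygon containing it.
    return gpd.sjoin(points, admin, how="left", predicate="within")
```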
Overall, I am really excited about where this is going. I’m glad to finally have a database structure and API that is built for dealing with tornadoes, since it means that modifications and additions can be made more smoothly. I think the next broad step in terms of data is to open it more to the public – setting up ways in which data (like KML files, for example) can be sent to us directly through the website. No tornado database will ever be perfect, so the best thing we can do is make ours easily improvable.