What should every data scientist know when working with ZIP Codes?

Datasets often contain ZIP code fields, which makes it tempting for data scientists to organize data and build models around ZIP codes. Doing so presents a significant challenge. Many people picture a ZIP code as a well-bounded area that sits neatly inside another geographic unit (such as a city, congressional district, or census tract). This is often not the case.

To understand the complexities, one must first understand what ZIP codes are and how they work. Modern ZIP codes were introduced in 1963, when the Post Office Department (the predecessor of today's United States Postal Service, USPS) adopted the Zone Improvement Plan (ZIP) to expand the postal zones established in 1943. Like an automotive VIN, the five-digit ZIP code encodes meaningful information. The first digit indicates a region or group of states that make up a zone. The United States is divided into ten such zones. The next two digits represent the sectional center facility, a mail sorting facility located within one of the ten zones. The last two digits represent the post office or delivery area. In 1983, the USPS added four more digits, commonly referred to as plus four (stylized as +4), to identify specific streets and street directions. All of this was done to improve mail delivery, not to create convenient boundaries for data analysis. In fact, if plotted correctly, ZIP codes would appear as lines (representing delivery networks) and points (representing large buildings, campuses, and post office boxes).
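The digit layout described above can be illustrated with a minimal Python sketch. The names `ZipParts` and `parse_zip` are hypothetical, introduced only for this example, and the code checks format only, not whether a ZIP code actually exists. ZIP codes are handled as strings so that leading zeros are preserved.

```python
import re
from typing import NamedTuple, Optional

# Accepts "12345" or "12345-6789". ZIP codes stay strings so that
# leading zeros are never dropped by a numeric conversion.
_ZIP_RE = re.compile(r"(\d{5})(?:-(\d{4}))?")


class ZipParts(NamedTuple):
    zone: str                  # digit 1: region / group of states
    sectional_center: str      # digits 2-3: sectional center facility
    delivery_area: str         # digits 4-5: post office or delivery area
    plus_four: Optional[str]   # optional +4 add-on, None if absent


def parse_zip(raw: str) -> ZipParts:
    """Split a ZIP or ZIP+4 string into the parts described in the text."""
    match = _ZIP_RE.fullmatch(raw.strip())
    if match is None:
        raise ValueError(f"not a ZIP or ZIP+4 code: {raw!r}")
    five, plus_four = match.groups()
    return ZipParts(five[0], five[1:3], five[3:5], plus_four)
```

For example, `parse_zip("96620-2820")` returns `ZipParts(zone='9', sectional_center='66', delivery_area='20', plus_four='2820')`. The parts are labels for routing mail, so they should not be read as nested geographic areas.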

ZIP codes pose other challenges as well. For instance, large portions of the United States, particularly in Alaska and Nevada, do not have an assigned ZIP code, because the USPS does not assign ZIP codes to remote regions that do not receive mail. ZIP codes may also change, particularly in response to new construction and new delivery routes. To make things even more confusing, some ZIP codes are not stationary. For instance, 96620-2820 is the ZIP+4 for the 5,500+ crew (ship’s company and air wing) aboard the nuclear-powered aircraft carrier USS Nimitz.

Data scientists should know that ZIP codes do not always fall within state boundaries (or even within the group of states that makes up a zone). There are over 100 cases where ZIP codes cross state lines. Even if ZIP codes could be plotted with well-defined boundaries, these would not align with political boundaries such as county or municipal borders. And ZIP codes certainly do not align with United States census tracts, block groups, or blocks.
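One quick way to see whether this affects a particular dataset is to look for ZIP codes that appear alongside more than one state. The sketch below assumes the data can be reduced to (ZIP code, state) pairs. The function name and the sample values are placeholders made up for illustration, not real ZIP-to-state assignments.

```python
from collections import defaultdict


def zips_spanning_states(records):
    """Return {zip_code: sorted list of states} for ZIPs seen with 2+ states."""
    states_by_zip = defaultdict(set)
    for zip_code, state in records:
        states_by_zip[zip_code].add(state)
    return {z: sorted(s) for z, s in states_by_zip.items() if len(s) > 1}


# Made-up placeholder records, not real ZIP codes or state codes.
sample = [("00001", "AA"), ("00001", "BB"), ("00002", "AA"), ("00002", "AA")]
print(zips_spanning_states(sample))  # {'00001': ['AA', 'BB']}
```

A non-empty result means that any state-level aggregation keyed on ZIP code alone would silently assign some records to the wrong state.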

To help relate census data to ZIP codes, the United States Census Bureau created ZIP Code Tabulation Areas (ZCTAs), which are often included in census data. However, ZCTAs are far from precise. Each census block is assigned the most frequent ZIP code (the mode of the ZIP codes of all mailing addresses) within that block. If there is no identifiable most frequent ZIP code, the block takes the ZCTA of the neighboring block with which it shares the longest border. ZCTAs have other limitations as well. They do not include all ZIP codes, particularly those of large buildings, campuses, or post office boxes. Also, keep in mind that ZCTAs do not use ZIP+4 codes, only the five-digit ZIP code.
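The "most frequent ZIP code" rule can be sketched in a few lines of Python. This is only an illustration of the idea as described above, not the Census Bureau's actual procedure. The function name `modal_zip` is hypothetical, and the neighboring-block fallback is left out because it needs block geometry: the function simply returns `None` when no single ZIP code is most frequent.

```python
from collections import Counter
from typing import Iterable, Optional


def modal_zip(address_zips: Iterable[str]) -> Optional[str]:
    """Return the most frequent ZIP among a block's addresses, or None.

    None means the block has no addresses or the top count is tied,
    which is where a neighbor-based fallback would have to take over.
    """
    ranked = Counter(address_zips).most_common(2)
    if not ranked:
        return None  # block has no mailing addresses
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None  # tie: no identifiable most frequent ZIP code
    return ranked[0][0]
```

With placeholder values, `modal_zip(["11111", "11111", "22222"])` returns `"11111"`, even though one of the three addresses in that block actually uses a different ZIP code. That lost address is the imprecision the paragraph above describes.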

Data scientists with access to address data are advised to geocode the addresses to geographic coordinates. If the data already includes geographic coordinates, start there. Next, a vector overlay operation (such as a point-in-polygon test) can determine the relevant census tract or block, congressional district, or political jurisdiction. Doing so allows for more precise analyses. Unfortunately, if no addresses are associated with the data, ZCTAs may be the best option to crosswalk from ZIP codes to more meaningful boundaries. Also, some municipalities and for-profit groups provide demographic data collected (or aggregated) to ZIP codes.
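To make the overlay step concrete, here is a minimal pure-Python sketch of a point-in-polygon test using ray casting. It assumes planar coordinates and simple polygons without holes. Points that fall exactly on a boundary are not handled specially. The area names and coordinates are invented for illustration. Real work would more likely rely on a GIS library (for example, a spatial join in GeoPandas) with official boundary files.

```python
from typing import Dict, List, Optional, Tuple

Point = Tuple[float, float]   # (x, y), e.g. (longitude, latitude)
Polygon = List[Point]         # ring of vertices, first vertex not repeated


def point_in_polygon(point: Point, polygon: Polygon) -> bool:
    """Ray casting: count edge crossings of a ray going right from the point."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the point's y value can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def assign_area(point: Point, areas: Dict[str, Polygon]) -> Optional[str]:
    """Return the name of the first area containing the point, else None."""
    for name, polygon in areas.items():
        if point_in_polygon(point, polygon):
            return name
    return None


# Hypothetical square "tracts" side by side; not real census geometry.
areas = {
    "tract_A": [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)],
    "tract_B": [(1.0, 0.0), (2.0, 0.0), (2.0, 1.0), (1.0, 1.0)],
}
print(assign_area((0.5, 0.5), areas))  # tract_A
print(assign_area((1.5, 0.5), areas))  # tract_B
print(assign_area((5.0, 5.0), areas))  # None
```

Because each geocoded point is matched directly against the target boundaries, the ZIP code never enters the assignment at all, which is what makes this approach more precise than a ZIP-based crosswalk.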

If you encounter a situation that requires linking data with ZIP codes to data without ZIP codes, proceed cautiously and keep these limitations in mind.