Well this was a hard-fought issue. Setting the scene: since April 2020 I’ve been working on Global.health: a Data Science Initiative, where we collate information about Covid-19 cases worldwide and make them available in a standard schema for analysis. In the early days the data were collected by hand, as had been done for prior outbreaks, but the pandemic quickly grew beyond a scale where that was manageable so instead we looked to automatically import data from trustworthy published sources like ministries of health.
Cases typically have a location associated with them (often the centroid of their local health service district, or the administrative region the sufferer is registered in; never something uniquely identifiable like a home address). Working with location data at all already throws us some usability curveballs. Did you search for London as in anywhere with the name “London” in (such as London Oxford Airport, just outside Oxford and 62 miles from the City of London, a city within the place people generally know as London in England), or did you have a specific location in mind like London, Ontario, Canada or London Street, Los Angeles, California, USA?
But we had it worse than that. Our data format had the name of the country in, which led to all sorts of problems. Had the curator entered that London St case as being in the US, or the USA, or America, or the United States of America, or the United States? Sometimes even cases that had location data filled by a geolocation service had weird glitches, like a number of cases from Algeria being associated with the non-country Algiers. And it made it easier for us devs to make unforced errors, like not generating cached per-country data when the country has a space in its name.
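As a rough illustration of the free-text problem, here is a minimal Python sketch of folding curator-entered spellings down to one code. The alias table and the function names (`normalise`, `country_code`, `cache_filename`) are hypothetical and hand-written for this example; they are not taken from the Global.health codebase.

```python
import re
import unicodedata
from typing import Optional

# Hypothetical, hand-maintained alias table (an assumption for this sketch).
ALIASES = {
    "us": "US",
    "usa": "US",
    "america": "US",
    "united states": "US",
    "united states of america": "US",
    "algeria": "DZ",
}

def normalise(name: str) -> str:
    """Case-fold, drop punctuation and collapse runs of whitespace."""
    text = unicodedata.normalize("NFKC", name).casefold()
    text = re.sub(r"[^\w\s]", "", text)  # "U.S.A." becomes "usa"
    return " ".join(text.split())

def country_code(name: str) -> Optional[str]:
    """Return the two-letter code for a known spelling, or None."""
    return ALIASES.get(normalise(name))

def cache_filename(code: str) -> str:
    """Build a per-country cache file name from a two-letter code.

    Keying on the code rather than the name means the file name can
    never contain a space.
    """
    if not re.fullmatch(r"[A-Z]{2}", code):
        raise ValueError(f"not a two-letter country code: {code!r}")
    return f"{code}.json"
```

Anything the table doesn't recognise (such as “Algiers”) comes back as `None`, so it can be flagged for a human to look at rather than silently stored.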
For all of these reasons, I ended up with the task of changing our schema so that countries are stored as ISO 3166-1 alpha-2 (two-letter) codes. Along the way I spotted all of the above difficulties, and more, some of which even manifested in the libraries that map from country codes to names, and vice versa. Note I’m using “country code” fairly loosely; some places are far enough from where everybody thinks of as “the country” that they have a separate ISO code (it wouldn’t help anyone to record a case as being in “the UK” when you mean “the Falkland Islands”, of which more below).
- Countries have changed names recently. Swaziland became Eswatini (ISO code SZ) in 2018. The former Yugoslav Republic of Macedonia (officially, as far as they were concerned, Macedonia; but often named FYROM as Greece wouldn’t allow them to accede to the EU under the name Macedonia) became the Republic of North Macedonia (ISO code MK) in 2019. Both of these appeared in our geocoding provider under their original names, even though we didn’t start gathering data until early 2020.
- People don’t think of a country by its official name. China is China to many, not the People’s Republic of China (ISO code CN). Do we show the official name or the common name?
- People don’t think of a region as part of its sovereign country. The Hong Kong Special Administrative Region of the People’s Republic of China has ISO code HK, but is…well, it’s a special administrative region of the People’s Republic of China (CN). When you show people countries on a map like they asked for, and they ask “where is Hong Kong”, the answer is “it isn’t there because it isn’t a country”.
- A country’s ISO code is not the same as its top-level domain. Not a correctness problem for us, but one that might impact usability, when people look for the United Kingdom of Great Britain and Northern Ireland under “UK” when they’ll find it under “GB”. There is a .gb TLD, but it doesn’t accept new registrations.
- The extent of a region can change when its name changes. We have cases geocoded to the Netherlands Antilles (doesn’t have an ISO code, technically, see next point, but used to be AN); there’s extra work involved to decide whether these should be associated with Aruba (AW), Curaçao (CW), Sint Maarten (SX) or the Caribbean Netherlands (BQ).
- As mentioned above, ISO “retires” codes. There isn’t a code AN any more: when the Netherlands Antilles stopped existing, ISO decided the code was no longer needed. This causes a problem in that a library with “a complete” database of ISO country codes doesn’t necessarily include the retired ones. The historical entries are in ISO 3166-3, but they occupy a different namespace from the ISO 3166-1 codes, so it’s not as if AN still means “the region that used to be Netherlands Antilles”: its ISO 3166-3 code is ANHH. Similarly, the German Democratic Republic used to have the ISO 3166-1 code DD but now has the ISO 3166-3 code DDDE.
- A region may have two official names. The Falkland Islands (FK) are a British Overseas Territory; they are also claimed by Argentina, whose constitution asserts sovereignty over them, under the name “Islas Malvinas”. Politics aside, the reason this is an immediate problem is that some services feel the need to helpfully list both but not in a standard way; you need to be able to cope with “Falkland Islands [Malvinas]”, “Falkland Islands (Islas Malvinas)” and more.
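- A region might have no official existence. Kosovo is widely given the code XK, but that is a user-assigned code used by convention rather than an official ISO 3166-1 entry, so not every library knows about it. Kosovo is a disputed region, recognised by about half of the UN as a sovereign country and claimed as an autonomous region by Serbia.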
- A country’s name might be the same as another country’s name. The Democratic Republic of the Congo (CD) is sometimes known as “the Congo”, and the neighbouring Republic of the Congo (CG) is also sometimes known as “the Congo”. Frustratingly, the i18n-iso-countries package lists “The Congo” as a name for both countries, so name-to-code mapping is unreliable.
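Several of the points above (ambiguous names, old names, bracketed alternates, retired codes) can be handled by a lookup that refuses to guess. The following is a minimal Python sketch under stated assumptions: the `NAMES` table is a tiny hand-written sample, and every function name here is hypothetical rather than part of any real library's API. Only the `RETIRED` mappings (AN to ANHH, DD to DDDE) come from the text above.

```python
import re

# Illustrative sample only; a real table would come from a maintained source.
NAMES = {
    "CD": ["Democratic Republic of the Congo", "The Congo"],
    "CG": ["Republic of the Congo", "The Congo"],
    "FK": ["Falkland Islands"],
    "SZ": ["Eswatini", "Swaziland"],
    "MK": ["North Macedonia", "Macedonia"],
}

# Retired ISO 3166-1 codes and their ISO 3166-3 replacements.
RETIRED = {"AN": "ANHH", "DD": "DDDE"}

def strip_alternate(name):
    """Drop a trailing bracketed alternate such as ' [Malvinas]'."""
    return re.sub(r"\s*[\[(][^\])]*[\])]\s*$", "", name).strip()

def build_index(names):
    """Map each case-folded name to the set of codes that use it."""
    index = {}
    for code, variants in names.items():
        for variant in variants:
            index.setdefault(variant.casefold(), set()).add(code)
    return index

INDEX = build_index(NAMES)

def candidates(name):
    """Return every code the name could refer to (empty set if unknown)."""
    return INDEX.get(strip_alternate(name).casefold(), set())

def resolve(name):
    """Return a code only when the name is unambiguous; otherwise raise."""
    found = candidates(name)
    if len(found) != 1:
        raise LookupError(f"{name!r} matches {sorted(found) or 'nothing'}")
    return next(iter(found))

def iso_3166_3(code):
    """Return the ISO 3166-3 code for a retired two-letter code, else None."""
    return RETIRED.get(code.upper())
```

With this sample, `resolve("Falkland Islands [Malvinas]")` and `resolve("Swaziland")` each return a single code, while `resolve("The Congo")` raises because both CD and CG claim that name. Raising is a design choice for the sketch: for case data, an ambiguous country seems better sent to a curator than assigned to whichever entry happens to come first.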
You may be able to find more; feel welcome to comment (or maybe file a bug in Global.health if it’s causing trouble over there). Also, I’ve just created the SICPers mailing list as a one-stop shop for all my reading, writing and talking about software engineering; please consider subscribing!