Extract transliteration pairs (entries that are coded with transliteration systems) for testing of those transliteration systems.
LAST UPDATED: See releases page.
Due to recent instabilities of pages that serve GNDB releases, the make
checkdate
target is created in order to fetch the latest release date (once or
multiple times) to ensure that a consistent version is served across all GNDB
load balancers.
# Check once
make checkdate
# Check 10 times
CHECKDATE_TIMES=10 make checkdate
All dates displayed should be identical. If interleaving entries are shown, there is an issue with endpoint consistency.
Create a SQLite3 database using the “all countries” GeoNames data set.
make geonames.db
Internally it will run this:
$ curl -sL http://geonames.nga.mil/gns/html/cntyfile/geonames_20200106.zip -o geonames.zip
$ unzip geonames.zip -d data
$ sqlite
sqlite> .mode tabs
sqlite> .import geonames_20200106/Countries.txt countries
sqlite> .backup geonames.db
sqlite> exit
make geonames_pairs.db
You will get geonames_pairs.csv
which contains all transliteration pairs.
TRANSL_CD|SRC_UNI|SRC_NT|SRC_FULL_NAME_RO|SRC_FULL_NAME_RG|DEST_UNI|DEST_NT|DEST_FULL_NAME_RO|DEST_FULL_NAME_RG|LC
zho_Hani2Latn_WDG_1979|475320|D|Qiantangjiang Zhan|Qiantangjiang Zhan|12277331|DS|钱塘江站|钱塘江站|zho
zho_Hani2Latn_WDG_1979|14673196|V|Tung-men Chiao|Tung-men Chiao|14633249|VS|东门礁|东门礁|zho
zho_Hani2Latn_WDG_1979|14673197|V|Hsi-men Chiao|Hsi-men Chiao|14633249|VS|东门礁|东门礁|zho
...
make all
Your output files will be stored in pairs/${translit_system}.csv
and look like this:
TRANSL_CD,SRC_UNI,SRC_NT,SRC_FULL_NAME_RO,SRC_FULL_NAME_RG,DEST_UNI,DEST_NT,DEST_FULL_NAME_RO,DEST_FULL_NAME_RG,LC
ara_Arab2Latn_BUC_2002,17550392,N,"Wādī al Khuḑrah","Wādī al Khuḑrah",17550393,NS,"وادي الخضرة","وادي الخضرة",ara
ara_Arab2Latn_BUC_2002,-3489887,D,"Wādī ad Dafīn","Wādī ad Dafīn",14461283,DS,"وادي الدفين","وادي الدفين",ara
ara_Arab2Latn_BUC_2002,18101655,N,"Shi‘b Ḩāzirin","Ḩāzirin, Shi‘b",18101656,NS,"شعب حازرن","شعب حازرن",ara
ara_Arab2Latn_BUC_2002,14363956,N,"Būghāz Raqm Ithnān","Raqm Ithnān, Būghāz",14305791,NS,"بوغاز رقم ٢","بوغاز رقم ٢",ara
...
Source name. RO ⇒ full name. RG ⇒ name with grouped feature, e.g. "Lake Nemo" in RO will be formatted as "Nemo, Lake" in RG.
In the *_NT
columns, rows with values DS
, NS
, VS
are always the name source, the rest are generated.
The meaning of *_NT
values are described here:
http://geonames.nga.mil/gns/html/rest/lookuptables.html#Name%20Type%20Codes
NT_CD | DESCRIPTION | DEFINITION |
---|---|---|
C |
Conventional |
A commonly used English-language name approved by the U.S. Board on Geographic Names (BGN) for use in addition to, or in lieu of, a BGN-approved local official name or names, e.g., Rome, Alps, Danube River. |
D |
Unverified |
A name from a source whose official status can not be verified by the BGN. |
DS |
Unverified Non-Roman Script |
The non-Roman script form of a name from a source whose official status can not be verified by the BGN. |
N |
Approved |
The BGN-approved local official name for a geographic feature. Except for countries with more than one official language; there is normally only one such name for a feature wholly within a country. |
NS |
Non-Roman Script |
The non-Roman script form of the BGN-approved local official name for a geographic feature. Except for countries with more than one official language; there is normally only one such name for a feature wholly within a country. |
P |
Provisional |
A geographic name of an area for which the territorial status is not finally determined or not recognized by the United States. |
V |
Variant |
A former name, name in local usage, or other spelling found on various sources. |
VA |
Anglicized Variant |
An English-language name that is derived by modifying the local official name to render it more accessible or meaningful to an English-language user. |
VS |
Variant Non-Roman Script |
The non-Roman script form of a former name, name in local usage, or other spelling found on various sources. |