Companion code for "Toward a Thermodynamics of Meaning," CHR 2020
Official: http://ceur-ws.org/Vol-2723/short40.pdf
arXiv: https://arxiv.org/abs/2009.11963
This repository contains a simple reference implementation of a linguistic partition function as described in the paper, along with some limited documentation.
The repository is pip-installable:
pip install git+https://github.com/senderle/lexpart#egg=lexpart
To train an embedding based on the included test dataset (enwik8), run the following commands:
python -m lexpart vocab vocab.npz -
python -m lexpart corpus corpus.npz vocab.npz -
python -m lexpart embed embed.npz corpus.npz
python -m lexpart wordsim embed.npz paris
This will print out a list of words in the corpus that are similar to "paris."
To train an embedding based on your own corpus, replace the - in the above commands with the path to a folder containing plain text files, as in the example below.
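For example, assuming your documents live in a folder called my_corpus/ (a placeholder path), the first two commands would become:

python -m lexpart vocab vocab.npz my_corpus/
python -m lexpart corpus corpus.npz vocab.npz my_corpus/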
The model described in the paper is based on the grand canonical partition function for multiple species in its standard form:
Z = ∑_i e^{β(μ_1 N_{1,i} + μ_2 N_{2,i} + ... + μ_k N_{k,i} − E_i)}
For computational purposes, however, it's convenient to represent the partition function in another form. Substituting u_k for e^{βμ_k}, we can rewrite the above like so:
Z = ∑_i u_1^{N_{1,i}} u_2^{N_{2,i}} ... u_k^{N_{k,i}} e^{−βE_i}
If we cheat a bit by treating the energy term e^{−βE_i} as a constant for all i, we can treat the partition function as one huge polynomial. Each term in the polynomial represents a sentence as a bag of words, where each exponent is a word count. Since the word counts for individual sentences are sparse, and differentiation is a linear operator, we can calculate values for the Jacobian and Hessian very efficiently. The code that performs this calculation is in sparsehess.py.
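The sketch below illustrates the idea. It is not taken from sparsehess.py, and its function names and API are hypothetical; it simply assumes each sentence is stored as a sparse bag of words (a dict mapping word index to count), that u is the vector of fugacities u_k = e^{βμ_k}, and that the energy factor has been folded into a single constant.

```python
# A minimal sketch of the polynomial view of Z (assumed names and API; the
# repository's actual implementation lives in sparsehess.py and differs).
import numpy as np

def sentence_terms(sentences, u, energy_const=1.0):
    """Yield (counts, term) pairs, where term = energy_const * prod_k u[k]**n_k.

    sentences: list of sparse bags of words, each a dict {word index: count}.
    u: vector of fugacities u_k = e^{beta * mu_k} (assumed nonzero).
    energy_const: the constant standing in for e^{-beta * E_i}.
    """
    for counts in sentences:
        term = energy_const
        for k, n in counts.items():
            term *= u[k] ** n
        yield counts, term

def jacobian_hessian(sentences, u, energy_const=1.0):
    """Accumulate dZ/du_j and d2Z/(du_j du_k), touching only nonzero counts."""
    jac = np.zeros(len(u))
    hess = np.zeros((len(u), len(u)))  # dense here for clarity only
    for counts, term in sentence_terms(sentences, u, energy_const):
        for j, nj in counts.items():
            jac[j] += term * nj / u[j]
            for k, nk in counts.items():
                if j == k:
                    hess[j, j] += term * nj * (nj - 1) / u[j] ** 2
                else:
                    hess[j, k] += term * nj * nk / (u[j] * u[k])
    return jac, hess

# Toy example: two "sentences" over a three-word vocabulary.
sentences = [{0: 2, 1: 1}, {1: 1, 2: 3}]
u = np.array([0.5, 1.0, 0.25])
jac, hess = jacobian_hessian(sentences, u)
```

Because the inner loops run only over the words that actually occur in a given sentence, the cost scales with the number of nonzero counts rather than with the square of the vocabulary size.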
There are some interesting connections between this way of thinking about sentences and contexts in natural language and the view of data types described in Conor McBride's "The Derivative of a Regular Type is its Type of One-Hole Contexts."