| 1 | +# MetroHash |
| 2 | + |
| 3 | +Python wrapper for [MetroHash](https://github.com/jandrewrogers/MetroHash), a |
| 4 | +fast non-cryptographic hash function. |
| 5 | + |
| 15 | +## Getting Started |
| 16 | + |
| 17 | +To install this package, run: |
| 18 | + |
| 19 | +``` bash |
| 20 | +pip install metrohash |
| 21 | +``` |
| 22 | + |
| 23 | +After installation, the module can be imported as `metrohash` (see the |
| 24 | +usage examples below). |
| 25 | + |
| 26 | +## Usage Examples |
| 27 | + |
| 28 | +### Stateless hashing |
| 29 | + |
| 30 | +This package provides Python interfaces to the 64- and 128-bit |
| 31 | +implementations of the MetroHash algorithm. For stateless hashing, it |
| 32 | +exports the `hash64_int` and `hash128_int` functions. Both take a value to |
| 33 | +be hashed and an optional `seed` parameter: |
| 34 | + |
| 35 | +``` python |
| 36 | +>>> import metrohash |
| 37 | +... |
| 38 | +>>> metrohash.hash64_int("abc", seed=0) |
| 39 | +17099979927131455419 |
| 40 | +>>> metrohash.hash128_int("abc") |
| 41 | +182995299641628952910564950850867298725 |
| 42 | + |
| 43 | +``` |
| 44 | + |
| 45 | +### Incremental hashing |
| 46 | + |
| 47 | +Unlike its cousins CityHash and FarmHash, MetroHash allows incremental |
| 48 | +(stateful) hashing. For incremental hashing, use the `MetroHash64` and |
| 49 | +`MetroHash128` classes. The result depends only on the concatenated |
| 50 | +input, so splitting the same data into slices at any points yields the |
| 51 | +same final hash value. This is useful for processing large inputs and |
| 52 | +streamed data. Example with two slices: |
| 53 | + |
| 54 | +``` python |
| 55 | +>>> mh = metrohash.MetroHash64() |
| 56 | +>>> mh.update("Nobody inspects") |
| 57 | +>>> mh.update(" the spammish repetition") |
| 58 | +>>> mh.intdigest() |
| 59 | +7851180100622203313 |
| 60 | + |
| 61 | +``` |
| 62 | + |
| 63 | +Hashing the whole string in a single call should produce the same value: |
| 64 | + |
| 65 | +``` python |
| 66 | +>>> mh = metrohash.MetroHash64() |
| 67 | +>>> mh.update("Nobody inspects the spammish repetition") |
| 68 | +>>> mh.intdigest() |
| 69 | +7851180100622203313 |
| 70 | + |
| 71 | +``` |
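
For large inputs, the same idea extends to reading a stream chunk by chunk. The following is a minimal sketch, not part of this package: the helper name `hash_stream` and its `chunk_size` parameter are hypothetical, and `hashlib.sha256` is used only as a stand-in hasher so that the sketch runs without the extension installed. The helper relies on nothing but an `update()` method, so a `metrohash.MetroHash64()` instance should be usable in its place, with `intdigest()` reading the result.

```python
import hashlib
import io


def hash_stream(stream, hasher, chunk_size=65536):
    """Feed a binary stream to `hasher` in fixed-size chunks.

    `hasher` can be any object with an `update(bytes)` method. The name
    `hash_stream` and the `chunk_size` default are illustrative choices,
    not part of the metrohash package.
    """
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:  # an empty bytes object signals end of stream
            break
        hasher.update(chunk)
    return hasher


# Stand-in hasher from the standard library (an assumption made so the
# sketch is runnable anywhere); swap in metrohash.MetroHash64() if installed.
data = b"Nobody inspects the spammish repetition" * 1000
chunked = hash_stream(io.BytesIO(data), hashlib.sha256(), chunk_size=4096)
whole = hashlib.sha256(data)

# Chunked and single-call hashing agree, whatever the chunk size.
assert chunked.hexdigest() == whole.hexdigest()
```

Because only one chunk is held in memory at a time, the input never needs to fit in memory as a whole.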
| 72 | + |
| 73 | +### Fast hashing of NumPy arrays |
| 74 | + |
| 75 | +The Python [Buffer |
| 76 | +Protocol](https://docs.python.org/3/c-api/buffer.html) allows Python |
| 77 | +objects to expose their data as raw byte arrays to other objects, for |
| 78 | +fast access without copying to a separate location in memory. Among |
| 79 | +others, NumPy is a major framework that supports this protocol. |
| 80 | + |
| 81 | +All hashing functions in this package will read byte arrays from objects |
| 82 | +that expose them via the buffer protocol. Here is an example showing |
| 83 | +hashing of a 3D NumPy array: |
| 84 | + |
| 85 | +``` python |
| 86 | +>>> import numpy as np |
| 87 | +>>> arr = np.zeros((256, 256, 4)) |
| 88 | +>>> metrohash.hash64_int(arr) |
| 89 | +12125832280816116063 |
| 90 | + |
| 91 | +``` |
| 92 | + |
| 93 | +The arrays need to be contiguous for this to work. To convert a |
| 94 | +non-contiguous array, use NumPy's `ascontiguousarray()` function. |
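
What "contiguous" means can be illustrated without NumPy. The sketch below uses only `memoryview` from the standard library; the variable names are illustrative, and the final copy step is meant as an analogy to what `ascontiguousarray()` does for arrays, not as a description of how this package handles buffers internally.

```python
# A bytes object exposes one contiguous block through the buffer protocol.
data = bytes(range(8))
view = memoryview(data)
assert view.contiguous

# Taking every second byte yields a strided view over the same memory,
# which is no longer contiguous.
strided = view[::2]
assert not strided.contiguous

# Copying the selected elements into a new bytes object restores
# contiguity; np.ascontiguousarray() plays the analogous role for arrays.
packed = memoryview(strided.tobytes())
assert packed.contiguous
assert packed.tobytes() == bytes([0, 2, 4, 6])
```

Strided views like this typically arise in NumPy from slicing with a step or from transposing, which is when the conversion becomes necessary.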
| 95 | + |
| 96 | +## Development |
| 97 | + |
| 98 | +### Local workflow |
| 99 | + |
| 100 | +For those who want to contribute, here is a quick start using a few |
| 101 | +Makefile targets: |
| 102 | + |
| 103 | +``` bash |
| 104 | +git clone https://github.com/escherba/python-metrohash.git |
| 105 | +cd python-metrohash |
| 106 | +make env # create a Python virtualenv |
| 107 | +make test # run Python tests |
| 108 | +make cpp-test # run C++ tests |
| 109 | +make shell # enter IPython shell |
| 110 | +``` |
| 111 | + |
| 112 | +The Makefiles provided have self-documenting targets. To find out which |
| 113 | +targets are available, type: |
| 114 | + |
| 115 | +``` bash |
| 116 | +make help |
| 117 | +``` |
| 118 | + |
| 119 | +### Distribution |
| 120 | + |
| 121 | +The wheels are built using |
| 122 | +[cibuildwheel](https://cibuildwheel.readthedocs.io/) and are published |
| 123 | +to PyPI by GitHub Actions using [this |
| 124 | +workflow](.github/workflows/publish.yml). The wheels contain compiled |
| 125 | +binaries and are available for the following platforms: windows-amd64, |
| 126 | +ubuntu-x86, linux-x86\_64, linux-aarch64, and macosx-x86\_64. |
| 127 | + |
| 128 | +## See Also |
| 129 | + |
| 130 | +For other fast non-cryptographic hash functions available as Python |
| 131 | +extensions, see [FarmHash](https://github.com/escherba/python-cityhash) |
| 132 | +and [MurmurHash](https://github.com/hajimes/mmh3). |
| 133 | + |
| 134 | +## Authors |
| 135 | + |
| 136 | +The MetroHash algorithm and its C++ implementation are due to J. Andrew |
| 137 | +Rogers. The Python bindings for it were written by Eugene Scherba. |
| 138 | + |
| 139 | +## License |
| 140 | + |
| 141 | +This software is licensed under the [Apache License, |
| 142 | +Version 2.0](https://opensource.org/licenses/Apache-2.0). See the |
| 143 | +included LICENSE file for details. |