A pair of confusables is a pair of characters that might be used in spoofing attacks because of their visual similarity (for example, ‘ν’ and ‘v’). The wide range of characters supported by Unicode makes such attacks a real security risk. The security mechanisms described in UTS #39 use confusable data (https://www.unicode.org/Public/security/latest/confusables.txt) to combat them. The purpose of this project is to identify novel pairs of confusables using representation learning and custom distance metrics.
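To make the confusable data concrete, here is a minimal sketch of reading it. It assumes the `source ; target ; type # comment` line layout used by confusables.txt, with code points written in hexadecimal. The sample lines are an illustrative excerpt written in that layout, and `parse_confusables` and `are_known_confusables` are hypothetical helper names, not part of this project. The comparison is a simplified single-character check, not the full skeleton algorithm from UTS #39.

```python
from typing import Dict

# Illustrative excerpt in the confusables.txt layout (assumption: the real
# file uses the same "source ; target ; type # comment" fields).
SAMPLE = """\
# Illustrative excerpt, not the full file
03BD ; 0076 ; MA # ( ν → v ) GREEK SMALL LETTER NU → LATIN SMALL LETTER V
0441 ; 0063 ; MA # ( с → c ) CYRILLIC SMALL LETTER ES → LATIN SMALL LETTER C
"""


def _decode(field: str) -> str:
    """Turn a space-separated list of hex code points into a string."""
    return "".join(chr(int(cp, 16)) for cp in field.split())


def parse_confusables(text: str) -> Dict[str, str]:
    """Map each source sequence to its target (prototype) sequence."""
    mapping: Dict[str, str] = {}
    for line in text.splitlines():
        # Drop trailing comments and skip blank or comment-only lines.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        fields = [f.strip() for f in line.split(";")]
        if len(fields) < 2:
            continue
        mapping[_decode(fields[0])] = _decode(fields[1])
    return mapping


def are_known_confusables(a: str, b: str, mapping: Dict[str, str]) -> bool:
    """Simplified check: two characters match if they share a prototype."""
    return mapping.get(a, a) == mapping.get(b, b)


mapping = parse_confusables(SAMPLE)
print(are_known_confusables("\u03bd", "v", mapping))
```

When reading the downloaded file instead of the inline sample, opening it with `encoding="utf-8-sig"` avoids tripping over a leading byte order mark.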
- Download and install Docker.
- `git clone` the repository and `cd` into it.
- Make sure all submodules are updated: `git submodule update --init --recursive`.
- In project source folder, run `./scripts/start.sh`.
- In any browser, go to `localhost:8888`.
- Copy the token from the terminal into the browser to access Jupyter Notebook.
- In project source folder, run `./scripts/start_cli.sh`.
- Execute the setup script `./scripts/setup.sh`.
- Run `docker ps` to get the container id/name.
- Run `docker exec -it [CONTAINER_NAME/ID] /bin/bash`.
- In the Jupyter Notebook terminal, type `ctrl`+`c`.
- In the command-line interface, type `exit`.
- From link, download the `full_data.zip` (pre-generated images) file and unzip it in the `data/` directory.
- From link, download `full_data_triplet1.0_meta.tsv` and `full_data_triplet1.0_vec.tsv` (pre-generated embeddings and labels) into the `embeddings/` directory.
- Create representation clustering object:

      from rep_cls import RepresentationClustering
      rc = RepresentationClustering(embedding_file='embeddings/full_data_triplet1.0_vec.tsv',
                                    label_file='embeddings/full_data_triplet1.0_meta.tsv',
                                    img_dir='data/full_data/')

- Generate confusables for a specific character:

      rc.get_confusables_for_char('褢')
      >>> ['裹', '裏', '裛', '裏']
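The lookup above works on an embedding file and a label file. As a rough illustration of the idea only, the sketch below finds the labels whose embeddings lie closest to a query label. It is not the project's `RepresentationClustering` implementation: it swaps in a plain Euclidean nearest-neighbour search, and it assumes that the vector file holds one tab-separated vector per line and the label file one label per line in the same order. `load_embeddings` and `nearest_labels` are hypothetical names.

```python
import csv
import io
import math
from typing import Iterable, List, Tuple


def load_embeddings(vec_lines: Iterable[str],
                    meta_lines: Iterable[str]) -> Tuple[List[str], List[List[float]]]:
    """Read vectors and labels from two line iterables (e.g. open files)."""
    vectors = [[float(x) for x in row]
               for row in csv.reader(vec_lines, delimiter="\t") if row]
    labels = [line.rstrip("\r\n") for line in meta_lines if line.strip()]
    if len(vectors) != len(labels):
        raise ValueError("vector and label counts differ")
    return labels, vectors


def nearest_labels(query: str, labels: List[str],
                   vectors: List[List[float]], k: int = 3) -> List[str]:
    """Return the k labels closest to `query` by Euclidean distance."""
    q = vectors[labels.index(query)]
    scored = [(math.dist(q, v), label)
              for label, v in zip(labels, vectors) if label != query]
    return [label for _, label in sorted(scored)[:k]]


# Toy data standing in for the real TSV files.
toy_vec = io.StringIO("0.0\t0.0\n0.1\t0.0\n5.0\t5.0\n0.0\t0.3\n")
toy_meta = io.StringIO("a\nb\nc\nd\n")
labels, vectors = load_embeddings(toy_vec, toy_meta)
print(nearest_labels("a", labels, vectors, k=2))  # → ['b', 'd']
```

With the real files, the two `io.StringIO` objects would be replaced by files opened with `encoding="utf-8"`.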
Check `main.ipynb`.
From link, download the `TripletTransferTF` (pre-trained model) folder into the `ckpts/` directory.
- To regenerate source files, run `python generate_source_file.py` in the `source/` directory.
- To check how the source file is selected, see `source/Radical-stroke_Index_Analysis.ipynb`.
- `main.ipynb`: Notebook for setting up, building and deploying the confusable detector. Also serves as a tutorial script.
- `vis_gen.py`: Contains VisualGenerator, a class for generating visualizations of characters.
- `rep_gen.py`: Contains RepresentationGenerator, a class for generating the representations (embeddings) used for clustering.
- `rep_cls.py`: Contains RepresentationClustering, a class for clustering representations and finding confusables.
- `distance_metrics.py`: Contains Distance, a factory class that defines distance metrics for different image formats. Also contains the enumeration class ImgFormat.
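The factory idea behind `distance_metrics.py` can be pictured with the following sketch. It is not the project's code: `PixelFormat`, `make_distance` and the two metrics are hypothetical, the members of the real `ImgFormat` enumeration are not shown in this document, and images are represented here as plain nested lists rather than whatever array type the project uses.

```python
from enum import Enum
from typing import Callable, Sequence

Image = Sequence[Sequence]  # rows of pixels


class PixelFormat(Enum):
    """Hypothetical image formats, used only for this illustration."""
    GRAY = "gray"  # each pixel is a single number
    RGB = "rgb"    # each pixel is an (r, g, b) tuple


def _gray_distance(a: Image, b: Image) -> float:
    """Mean absolute difference over all pixels of two same-sized images."""
    total = sum(abs(pa - pb)
                for row_a, row_b in zip(a, b)
                for pa, pb in zip(row_a, row_b))
    return total / sum(len(row) for row in a)


def _rgb_distance(a: Image, b: Image) -> float:
    """Mean absolute difference over all channels of all pixels."""
    total = sum(abs(ca - cb)
                for row_a, row_b in zip(a, b)
                for pa, pb in zip(row_a, row_b)
                for ca, cb in zip(pa, pb))
    return total / (3 * sum(len(row) for row in a))


def make_distance(fmt: PixelFormat) -> Callable[[Image, Image], float]:
    """Factory: pick the metric that matches the image format."""
    table = {PixelFormat.GRAY: _gray_distance, PixelFormat.RGB: _rgb_distance}
    return table[fmt]


gray = make_distance(PixelFormat.GRAY)
print(gray([[0, 255], [0, 0]], [[0, 0], [0, 0]]))  # → 63.75
```

Keeping the format-to-metric table in one place means callers only choose a format and never need to know which function handles it.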
- `configs/sample_config.ini`: Sample configuration for model training. To start your own training procedure, create a new configuration file following the same format.
- `custom_train.py`: Contains ModelTrainer, a class that executes the training procedure.
- `dataset_builder.py`: Contains DatasetBuilder, a class that invokes data pre-processing functions for TensorFlow dataset generation.
- `model_builder.py`: Contains ModelBuilder, a class that creates and initializes TensorFlow models.
- `data_preprocessing.py`: Image pre-processing functions.
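As a rough picture of how an `.ini` training configuration can be consumed, the sketch below reads one with Python's standard `configparser`. The section and key names (`MODEL`, `TRAINING`, `batch_size` and so on) are hypothetical and chosen only for illustration. The authoritative format is whatever `configs/sample_config.ini` contains, and `load_training_config` is not a function of this project.

```python
import configparser
from typing import Any, Dict

# Hypothetical configuration text; the real keys live in
# configs/sample_config.ini and may differ.
SAMPLE_INI = """\
[MODEL]
name = example_model

[TRAINING]
batch_size = 32
epochs = 5
learning_rate = 0.001
"""


def load_training_config(text: str) -> Dict[str, Any]:
    """Parse the configuration text into typed Python values."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    training = parser["TRAINING"]
    return {
        "model_name": parser["MODEL"]["name"],
        "batch_size": training.getint("batch_size"),
        "epochs": training.getint("epochs"),
        "learning_rate": training.getfloat("learning_rate"),
    }


config = load_training_config(SAMPLE_INI)
print(config["batch_size"], config["learning_rate"])
```

For a file on disk, `parser.read(path)` takes the place of `parser.read_string(text)`.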
- `source/Radical-stroke_Index_Analysis.ipynb`: Jupyter Notebook for radical-stroke analysis and dataset selection.
- `source/generate_source_file.py`: Contains functions that produce the same result as the Jupyter Notebook file.
- `source/charset_*k.txt`: Selected Unicode code points.
- `source/randset_*k.txt`: Randomly selected Unicode code points.
- `source/full_dataset.txt`: Full dataset containing 21028 code points, used for clustering.
All scripts are expected to be executed from the base directory, for example `./scripts/start.sh` instead of `./start.sh`.
- `scripts/start.sh`: Launch a Docker container with Jupyter Notebook.
- `scripts/start_cli.sh`: Launch a Docker container with bash.
- `scripts/setup.sh`: Should be run inside the container; sets up the environment and installs all packages.
- `scripts/install_fonts.sh`: Install required fonts; already invoked by `setup.sh`.
- `scripts/download_*.sh`: Scripts for downloading pre-established data, models or embeddings from Google Drive.
- `*_test.py`: Run `python [MODULE]_test.py` for all the unit tests for `[MODULE].py`.
- `calculate_from_path`: Calculate the distance between the two images specified by file path.
- `train_test_split`: Split a dataset (already created) into training and testing datasets.
- `data/`: Default visualization directory.
- `ckpts/`: Default model directory.
- `embeddings/`: Default embedding directory.
All tests are expected to be run under the CLI container setup.
To run all tests, in the root folder run `python -m unittest discover -s . -p '*_test.py'`.
To run the tests for a single module, in the root folder run `python [MODULE]_test.py`.
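For orientation, here is a minimal sketch of a test module that the `*_test.py` discovery pattern would pick up. The function under test, `normalize_label`, is hypothetical and defined inline so that the example is self-contained. A real test file would import from the module it covers instead.

```python
import unittest


def normalize_label(label: str) -> str:
    """Hypothetical function under test: strip surrounding whitespace."""
    return label.strip()


class NormalizeLabelTest(unittest.TestCase):
    def test_strips_whitespace(self):
        self.assertEqual(normalize_label(" a\n"), "a")

    def test_keeps_clean_label(self):
        self.assertEqual(normalize_label("b"), "b")


def run_tests() -> unittest.TestResult:
    """Run this module's test case and return the result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(NormalizeLabelTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)


if __name__ == "__main__":
    # Lets the file be run directly, e.g. python example_test.py
    run_tests()
```

Saved under a name ending in `_test.py` (the file name `example_test.py` is only an example), the module is found by `unittest discover` through its `TestCase` subclass, and the `__main__` guard covers the single-module invocation.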
Copyright © 2020-2024 Unicode, Inc. Unicode and the Unicode Logo are registered trademarks of Unicode, Inc. in the United States and other countries.
A CLA is required to contribute to this project - please refer to the CONTRIBUTING.md file (or start a Pull Request) for more information.
The contents of this repository are governed by the Unicode Terms of Use and are released under LICENSE.