Assaf Urieli edited this page Jan 8, 2021 · 20 revisions

The Talismane Terminology Extractor is currently available in French and English.

Prerequisites

To install make on Windows:

  • Install Chocolatey
  • From PowerShell, run choco install make

To install make on Debian/Ubuntu:

  • sudo apt-get install build-essential
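The commands later in this guide call make, docker-compose and java, so it can save time to check that all three are on your PATH before going further. The following is a minimal sketch; the function name missing_tools is hypothetical and the tool list is an assumption drawn from the commands used in this guide.

```shell
#!/bin/sh
# Print the name of each given tool that cannot be found on PATH.
# Prints nothing when every tool is available.
missing_tools() {
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "$tool"
    fi
  done
}

# Tool list assumed from the commands used in this guide; adjust as needed.
missing_tools make java docker-compose
```

Any name printed by the last line still needs to be installed.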

Installing Talismane terminology extractor

Download the most recent release from https://github.com/joliciel-informatique/talismane-terminology/releases

The file you want will be the asset titled talismane-terminology-distribution-[VERSION]-bin.zip.

After you download the distribution, unzip it, and copy the contents into your Talismane install directory.

Now, install the database using docker-compose.

At the command line, navigate to the directory where you unzipped the software, and run the following commands:

make start-dep
make create-schema

Note: You should not normally have to run make start-dep again after you restart the computer. However, if the database is not found, try running it again. You only need to rerun make create-schema when you download a new version of the terminology extractor, in order to bring the underlying database schema up to date.

Syntax analysis

The first step in terminology extraction is syntax analysis. During this step, Talismane analyses the syntax of a raw text file, and stores its syntax analysis results in another file, using the CoNLL-X format.

First, make sure you have prepared a text file out of which you want to extract terminology (Talismane can currently only analyse raw text files).

Place the file you want to analyse in a directory of your choice, open a command console, and navigate to the Talismane directory.

Run the following command for French:

java -Xmx1G -jar -Dconfig.file=conf/talismane-fr-[VERSION].conf talismane-core-[VERSION]-shaded.jar --analyse --sessionId=fr --encoding=UTF8 --inFile=examples/input/MoteurStirling.txt --outFile=data/MoteurStirling.tal --builtInTemplate=with_location --logConfigFile=conf/logback.xml

Run the following command for English:

java -Xmx1G -jar -Dconfig.file=conf/talismane-en-[VERSION].conf talismane-core-[VERSION]-shaded.jar --analyse --sessionId=en --encoding=UTF8 --inFile=examples/input/StirlingEngine.txt --outFile=data/StirlingEngine.tal --builtInTemplate=with_location --logConfigFile=conf/logback.xml

Options:

  • builtInTemplate=with_location: this is required for the terminology viewer to allow you to open text files correctly from within the viewer.

To check that progress is being made, you can look at the log files generated in the logs directory, or at the syntax analysis file written to the path given by the outFile parameter above. Make sure you open these files with an editor that does not lock them, such as Notepad++.
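On the command line, progress can also be watched without opening an editor. The sketch below counts the non-blank lines of the output file, assuming one token per non-blank line with blank lines between sentences, as in CoNLL-X. The exact layout produced by the with_location template may differ, and the function name count_token_lines is hypothetical.

```shell
#!/bin/sh
# Count the non-blank lines in a syntax analysis file.
# Assumption: each non-blank line holds one token (CoNLL-X style).
count_token_lines() {
  grep -c -v '^[[:space:]]*$' "$1"
}

# Example usage while the analysis is running (path from the French example):
#   count_token_lines data/MoteurStirling.tal
#   tail -n 5 data/MoteurStirling.tal
```

Running the count twice a few seconds apart should show a growing number while the analysis is still going.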

Note: If you use absolute paths for the --inFile and --outFile options, your terminology database will depend on these files always remaining in the same absolute location. If you use relative paths inside the Talismane directory, the terminology viewer will keep working as long as it is run from inside the Talismane directory.
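To catch absolute paths before they end up in the database, a small check can be run on the values you plan to pass to --inFile and --outFile. This is a sketch only: the function names are hypothetical, and the test for "absolute" covers just the common Unix and Windows drive-letter forms.

```shell
#!/bin/sh
# Return success (0) if the path looks absolute on Linux/macOS or Windows.
is_absolute_path() {
  case $1 in
    /*) return 0 ;;                          # Unix-style: /home/user/x.tal
    [A-Za-z]:\\*|[A-Za-z]:/*) return 0 ;;    # Windows-style: C:\x.tal or C:/x.tal
    *) return 1 ;;
  esac
}

# Print a warning on stderr for each absolute path among the arguments.
warn_if_absolute() {
  for p in "$@"; do
    if is_absolute_path "$p"; then
      echo "warning: absolute path: $p" >&2
    fi
  done
}

# The relative paths from the French example produce no warning.
warn_if_absolute examples/input/MoteurStirling.txt data/MoteurStirling.tal
```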

Terminology extraction

Once syntax analysis is complete, you can proceed to terminology extraction. During this step, Talismane reads the syntax analysis file and extracts term candidates, currently defined as contiguous noun phrases. These terms are stored in the database created above.

Run the following command for French:

java -jar talismane-term-extractor-[VERSION]-shaded.jar --sessionId=fr --inFile=data/MoteurStirling.tal --projectCode=frenchTest --encoding=UTF8 --logConfigFile=conf/logback.xml

Run the following command for English:

java -jar talismane-term-extractor-[VERSION]-shaded.jar --sessionId=en --inFile=data/StirlingEngine.tal --projectCode=englishTest --encoding=UTF8 --logConfigFile=conf/logback.xml

Options:

  • sessionId: fr for French, en for English
  • inFile: the file given as the outFile argument of the syntax analysis command
  • projectCode: an arbitrary project code used to identify this project downstream.
  • encoding: the encoding used for the syntax analysis above
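Because the extractor's inFile and encoding must match the outFile and encoding of the syntax analysis, it can help to generate both commands from a single set of values. The sketch below only prints the two commands for review and does not run them. The function name print_pipeline and the rule for deriving the .tal file name from the input file name are assumptions for illustration, and X.Y.Z stands in for the version you downloaded.

```shell
#!/bin/sh
# Print the analysis and extraction commands for one document.
# Arguments: language (fr or en), version, input text file, project code.
print_pipeline() {
  lang=$1 version=$2 text_file=$3 project=$4
  encoding=UTF8
  # Hypothetical convention: data/<input name without .txt>.tal
  tal_file="data/$(basename "$text_file" .txt).tal"

  echo "java -Xmx1G -jar -Dconfig.file=conf/talismane-$lang-$version.conf" \
       "talismane-core-$version-shaded.jar --analyse --sessionId=$lang" \
       "--encoding=$encoding --inFile=$text_file --outFile=$tal_file" \
       "--builtInTemplate=with_location --logConfigFile=conf/logback.xml"

  echo "java -jar talismane-term-extractor-$version-shaded.jar --sessionId=$lang" \
       "--inFile=$tal_file --projectCode=$project --encoding=$encoding" \
       "--logConfigFile=conf/logback.xml"
}

# Replace X.Y.Z with your version, then review and run the printed commands.
print_pipeline fr X.Y.Z examples/input/MoteurStirling.txt frenchTest
```

Keeping the .tal path and encoding in one place avoids the most likely mistake, which is pointing the extractor at a file other than the one the analysis produced.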

Terminology viewing and selection

Finally, you can view and select term candidates using the following command:

java -jar talismane-term-viewer-[VERSION]-shaded.jar --logConfigFile=conf/logback.xml

When the window opens, first go to Settings→Preferences.

Enter the following settings:

  • Database Project Code: Enter the project code chosen during terminology extraction, in the examples above either "frenchTest" for French, or "englishTest" for English.
  • CSV Separator: Enter the CSV separator used by your system (comma for English operating systems, semicolon for French).
  • Editor: optionally, enter the path to a text editor used to open files in the corpus.
    • Example for Notepad++ on Windows: "C:\Program Files\Notepad++\notepad++.exe" (with the quotes)
    • Example for Sublime Text on Windows: "C:\Program Files\Sublime Text\subl.exe" (with the quotes)
    • Example for Sublime Text on Linux: /usr/bin/subl
  • Arguments: optionally, arguments used to tell the editor which filename, line and column to navigate to.
    • Example for Notepad++: -n%line -c%column %file
    • Example for Sublime Text: %file:%line:%column
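To see what the Arguments setting produces, the placeholders can be expanded by hand. The sketch below assumes the viewer performs a plain textual replacement of %file, %line and %column; that is an assumption about its behaviour, and the function name expand_editor_args and the sample file name are hypothetical.

```shell
#!/bin/sh
# Expand an Arguments template by replacing %file, %line and %column.
# Illustration only: the viewer's actual substitution may differ.
# Note: file names containing |, & or backslashes would need escaping for sed.
expand_editor_args() {
  template=$1 file=$2 line=$3 column=$4
  printf '%s\n' "$template" |
    sed -e "s|%file|$file|g" -e "s|%line|$line|g" -e "s|%column|$column|g"
}

# Notepad++ style template
expand_editor_args '-n%line -c%column %file' data/corpus.txt 12 5
# prints: -n12 -c5 data/corpus.txt

# Sublime Text style template
expand_editor_args '%file:%line:%column' data/corpus.txt 12 5
# prints: data/corpus.txt:12:5
```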

Make sure you click the OK button.

Then, go to File→Open Database...

At this point, the terms you extracted should appear. If none appear, lower the minimum frequency, and reload.

You can now navigate among terms, mark terms that interest you, show only marked terms, and then export them for use elsewhere.