FreeDict FAQ
Currently only one output format is supported: the dict format. All you need is a dictionary program understanding it. We have a list of Dictionary clients and DICT servers.
There's also a list of Freedict servers, where you can find those which offer web front-ends. The one with the most up-to-date dictionaries from FreeDict can be found on http://dict.uni-leipzig.de/dictd.
If you want to get the source of the dictionaries, you can get a copy of the dictionary using Git:
git clone https://github.com/freedict/fd-dictionaries.git
You can view it in your browser as well.
If you don't know how to use Git, have a look here: http://git-scm.com/docs/gittutorial.
If you are not a developer, you don't need to bother about the source, just have a look at our web site to download the latest dictionary databases.
la1-la2-nophon.tei files are temporary files created during the make process of dictionaries for whose headword language (la1) FreeDict supports phonetics, because the necessary data is available. The process is: 1. convert the dictionary data into la1-la2-nophon.tei; 2. add phonetic information and create la1-la2.tei.
Practically, this means that renaming la1-la2.tei to la1-la2-nophon.tei is our internal kludge to avoid the phase where the dictionary is processed by a text-to-speech system that adds phonetic information (in <phon/> elements). In most cases, you needn't worry about this :-)
But if you are concerned with switching phonetic processing off for your dictionary, put the line supported_phonetics= into the Makefile, right below the line defining DISTFILES.
TEI (Text Encoding Initiative) is an extensive standard for an XML format to encode texts, dictionaries among them. Have a look at http://www.tei-c.org/Guidelines/P5/. P5 is just the version number.
FreeDict uses ISO 639-3 language codes for identifying dictionary packages.
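As a rough illustration, a package name of the la1-la2 shape used in this FAQ can be checked mechanically. The sketch below is not part of the FreeDict tools: the function name is made up, and it only tests the shape of the codes (three lowercase letters each), not whether they are actually registered in ISO 639-3.

```python
import re

# Two three-letter lowercase codes joined by a hyphen, e.g. "swh-eng".
PACKAGE_NAME = re.compile(r"^([a-z]{3})-([a-z]{3})$")


def split_package_name(name):
    """Split a 'la1-la2' package name into (source, target) language codes."""
    match = PACKAGE_NAME.match(name)
    if match is None:
        raise ValueError(f"not a la1-la2 package name: {name!r}")
    return match.group(1), match.group(2)


print(split_package_name("swh-eng"))
# → ('swh', 'eng')
```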
Great news :-). First, it is a good idea to announce your plans on the FreeDict mailing list. The next step depends on a lot of issues; here are a few possibilities:
The dictionary is on a free license, right? We have tools for importing different formats into our TEI P5 subset, so please post on our mailing list, and we'll see what can be done.
Depending on whether we already have a dictionary for the given language pair, and on the quality of your source and (possibly) our source, we can think of several ways to proceed (creating a new dictionary, merger, parallel distribution, etc.).
I guess this is the place to mention a few basic books on lexicography, just in case (TODO).
Contact us on the list, please :-) And we'll move on from there.
Again, the list is a good place to begin. Also, please have a look at the header section of the dictionary, where you may find information on the current developer and their contact details. You can view the header, along with other accompanying documents (e.g. AUTHORS, ChangeLog, etc.) in the GitHub web view -- this is much better than downloading a distribution, because it may happen that some work is being carried out on a dictionary that has not yet been released as a new version. We also have a separate page listing the dictionary maintainers.
The answer definitely depends on what you understand by lexicographic, and on who you are going to ask. A few personal answers may follow; to my mind, this issue has never actually been brought up.
If by lexicographic project you understand a project that deals with markup and distribution of dictionaries, then by all means, FreeDict is a lexicographic project.
Now the first possible personal view (Piotr's). Let me stress: I don't speak for the others here, though I'd be happy if we agreed on this issue. I would like to treat FreeDict as a project that is not lexicographic in the light of any serious definition of lexicography that you can think of. I would like to see FreeDict's function as restricted to disseminating structured content and (almost) absolutely non-normative with regard to (meta)lexicography as science or art. In other words, I would like to avoid making any recommendations regarding the lexicographic choices concerning the macrostructure, microstructure, Part-of-Speech inventory, etc. This is why I have already remarked in some parts of the HOWTO (I always signed those remarks) that I don't think that FreeDict (or TEI) should recommend an inventory of POS values or anything of that sort.
An entirely different project is needed for this purpose and in fact such projects have already been created — let me name two: the old, foundational EAGLES (Expert Advisory Group on Language Engineering Standards) and the ISO TC37/SC4 Language Resources Management Committee.[1] These projects have produced recommendations/guidelines for, among others, digital lexicographers. Our job, as FreeDict, should be IMHO to encourage developers to submit their dictionaries to us and to do so by working on the tools that translate XML/text into TEI P5 and on tools that render such dictionaries nicely, so that developers can see that their work is being used by as many people as we can reach. And that, in turn, means following the DICT distribution framework as well as other systems (in fact, this bit can be the subject of another discussion, so let me stop here).
Above, I said I thought FreeDict should be almost absolutely non-normative because, naturally, some restrictions are imposed by the format. The TEI Dictionaries module allows for a lot of variation, but if we don't want to end up with huge and buggy translators, some reasonable constraints should be enforced. Among them is the ban on the <entryFree/> element, which is anyway meant for paper dictionaries with messy microstructure. Indeed, "no messy entries" can be reasonably stated as the fundamental format-induced requirement, with its particular applications to be defined later, if need be.
FreeDict scripts extract some information from the source XML, here is the current list:
edition information
Read from teiHeader/fileDesc/editionStmt/edition. In P5 dictionaries, the @n attribute of the <edition/> element is queried first, and if it does not exist, the entire content of the element is read (in the latter case, it is expected that the content is a version number, such as "0.3", etc.). This information is used for creating filenames of distribution packages.
maintainer information
Read from /{TEI,TEI.2}/teiHeader/fileDesc/titleStmt/respStmt/ where the <resp/> element has the value "Maintainer". If the name is followed by an email address in angle brackets (you need the &lt; entity for that purpose), the address is also used by the system. Below is an example:
<respStmt>
<!-- for the freedict database -->
<resp>Maintainer</resp>
<name>[your name here] &lt;[email protected]></name>
</respStmt>
Note that the left angle bracket of a non-element has to be escaped in XML with &lt;.
status information
This is auxiliary information, read from teiHeader/fileDesc/notesStmt/note[@type='status']. Currently, the recommended values are (from freedict.org):
- 'stable'
- 'big enough to be useful' (from 10000 entries on)
- 'too small' (less than 1000 entries)
- 'low quality'
- 'unknown'
URL of the source
Read from teiHeader/fileDesc/sourceDesc/*/xptr/@url. As of today (15:50, 1 March 2009 (UTC)), this is hardcoded to use the <xptr/> element, defined by TEI P4.
author information (for StarDict packages)
Not sure at the moment where this matters. It currently reads the first <name/> element encountered in the first <respStmt/>.
Read from teiHeader/fileDesc/titleStmt/title.
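To make the lookups above concrete, here is a minimal sketch of how the edition, maintainer, status and title could be read from a P5 header with Python's standard library. It is not the actual FreeDict script: the function names, the sample title, and the sample maintainer (Jane Doe, jane@example.org) are invented for illustration, and only the namespaced P5 form is handled, not the P4 <TEI.2> root or the source URL.

```python
import re
import xml.etree.ElementTree as ET

NS = {"tei": "http://www.tei-c.org/ns/1.0"}
FILE_DESC = "tei:teiHeader/tei:fileDesc/"


def text_of(element):
    """Return the stripped text content of an element, or None if it is absent."""
    return None if element is None else "".join(element.itertext()).strip()


def read_header(tei_xml):
    """Collect edition, maintainer, email, status and title from a P5 header."""
    root = ET.fromstring(tei_xml)
    info = {"maintainer": None, "email": None}

    # Edition: query @n first, fall back to the whole element content.
    edition = root.find(FILE_DESC + "tei:editionStmt/tei:edition", NS)
    info["edition"] = None if edition is None else (edition.get("n") or text_of(edition))

    # Maintainer: the <respStmt> whose <resp> has the value "Maintainer".
    for stmt in root.findall(FILE_DESC + "tei:titleStmt/tei:respStmt", NS):
        if text_of(stmt.find("tei:resp", NS)) == "Maintainer":
            name = text_of(stmt.find("tei:name", NS)) or ""
            match = re.search(r"<([^<>\s]+)>", name)  # optional <address>
            if match:
                info["email"] = match.group(1)
                name = name[:match.start()].strip()
            info["maintainer"] = name
            break

    # Status is auxiliary, so a missing note simply yields None.
    status = root.find(FILE_DESC + "tei:notesStmt/tei:note[@type='status']", NS)
    info["status"] = text_of(status)
    info["title"] = text_of(root.find(FILE_DESC + "tei:titleStmt/tei:title", NS))
    return info


SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>Swahili-English FreeDict Dictionary</title>
        <respStmt>
          <resp>Maintainer</resp>
          <name>Jane Doe &lt;jane@example.org></name>
        </respStmt>
      </titleStmt>
      <editionStmt><edition n="0.3">third draft</edition></editionStmt>
      <notesStmt><note type="status">too small</note></notesStmt>
    </fileDesc>
  </teiHeader>
</TEI>"""

print(read_header(SAMPLE))
```

Because @n is queried first, the text "third draft" is ignored in this sample and the edition comes out as 0.3.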
Keep it simple (for starters). Too many projects have been killed by their developers' desire to code the next wonder of the world, the ultimate IT. Let us subscribe to the Open Source motto "publish early, publish often". Make your dictionary a simple glossary at first,
<entry>
<form>
<orth>alasiri</orth>
</form>
<sense>
<def>afternoon</def>
</sense>
</entry>
possibly with parts of speech and some basic attributes:
<entry xml:id="alasiri">
<form xml:lang="swh">
<orth>alasiri</orth>
</form>
<gramGrp>
<pos>n</pos>
</gramGrp>
<sense>
<def>afternoon</def>
</sense>
</entry>
Initially, it's OK to keep the equivalents inside <def>, and it's OK to separate senses with a semicolon, and equivalents with commas. Later on, you might go for something slightly more complicated, as in:
<entry xml:id="pia-2" n="2">
<form xml:lang="swh">
<orth>pia</orth>
</form>
<gramGrp>
<pos>adv</pos>
</gramGrp>
<sense xml:id="pia-2.1" n="1">
<def>also, too</def>
</sense>
<sense xml:id="pia-2.2" n="2">
<def>equally, likewise</def>
</sense>
</entry>
And if you want to move past this stage, please contact the FreeDict mailing list, so that we can talk about that. While we adhere to the schema documented in ch. 9 of the TEI Guidelines, there are some constraints on what our conversion tools can digest. (BTW, above, I assumed that there is an xml:lang="eng" attribute on the <body>; it is also a good idea to keep one <pos> per entry, and to treat pairs such as the English verb and noun record as separate entries; if you need to handle this differently, do contact the mailing list.)
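The semicolon/comma convention for simple <def> content is easy to take apart later, which is one reason it is a safe starting point. The following sketch assumes exactly that convention (semicolons between senses, commas between equivalents); the function name is hypothetical and this is not one of the FreeDict conversion tools.

```python
def split_definition(text):
    """Turn 'a, b; c' into a list of senses, each a list of equivalents."""
    senses = []
    for sense in text.split(";"):
        equivalents = [part.strip() for part in sense.split(",") if part.strip()]
        if equivalents:
            senses.append(equivalents)
    return senses


print(split_definition("also, too; equally, likewise"))
# → [['also', 'too'], ['equally', 'likewise']]
```

An equivalent that itself contains a comma or semicolon would be split wrongly, so a glossary relying on this convention should avoid such punctuation inside equivalents.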
This is a stub of a TEI file:
<TEI xmlns="http://www.tei-c.org/ns/1.0">
<teiHeader>
<fileDesc>
<titleStmt>
<title>Title</title>
</titleStmt>
<publicationStmt>
<p>Publication Information</p>
</publicationStmt>
<sourceDesc>
<p>Information about the source</p>
</sourceDesc>
</fileDesc>
</teiHeader>
<text>
<body>
<entry>...</entry>
</body>
</text>
</TEI>
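A quick way to see whether a file has at least the parts of this stub is to look each of them up in the TEI namespace. The sketch below is only an illustration with made-up names, not a replacement for validating against the schema; it checks nothing beyond the elements listed in EXPECTED.

```python
import xml.etree.ElementTree as ET

NS = {"tei": "http://www.tei-c.org/ns/1.0"}
EXPECTED = [
    "teiHeader/fileDesc/titleStmt/title",
    "teiHeader/fileDesc/publicationStmt",
    "teiHeader/fileDesc/sourceDesc",
    "text/body",
]


def missing_parts(tei_xml):
    """List the stub paths that cannot be found in the TEI namespace."""
    root = ET.fromstring(tei_xml)
    if root.tag != "{%s}TEI" % NS["tei"]:
        return ["TEI (root element in the TEI namespace)"]
    missing = []
    for path in EXPECTED:
        # Prefix every step so that the lookup is namespace-aware.
        qualified = "/".join("tei:" + step for step in path.split("/"))
        if root.find(qualified, NS) is None:
            missing.append(path)
    return missing


STUB = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>Title</title></titleStmt>
      <publicationStmt><p>Publication Information</p></publicationStmt>
      <sourceDesc><p>Information about the source</p></sourceDesc>
    </fileDesc>
  </teiHeader>
  <text><body><entry>...</entry></body></text>
</TEI>"""

print(missing_parts(STUB))
# → []
```

A root element without the xmlns declaration is reported as missing, since its children would then not be in the TEI namespace either.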
Freedict expects some extra elements in the header, and requires others (e.g. license statement and project description). Please see above for more concrete information.
From Denis Arnaud:
In Breton bilingual dictionaries, the usage is to write the gender of nouns and the plural ending (or the irregular form). When the word is a compound, all of this is written between the first (significant) word and the second one. As an example, here is the translation for week-end (French people use the English word): week-end : dibenn g. (où)-sizhun. The singular form is dibenn-sizhun and the plural form is dibennoù-sizhun. Dibenn means end and sizhun means week. Dibenn is a masculine noun ('gourel' in Breton).
Ok, we have several issues here. The most important thing is that in electronic dictionaries, as opposed to paper dictionaries, where space matters (the publisher pays for it, you pay for it), you really want to use full forms of equivalents, because a) it is more user-friendly (you don't have to "train" the user to understand the system) and b) it is machine-readable, which potentially brings many extra benefits.
So in your example, you want something like:
(assuming xml:lang="fr" for the dictionary text; I use French-like grammatical codes ['m' for 'masculine', etc.])
<sense>
<cit type="trans">
<quote xml:lang="br">dibenn-sizhun</quote>
<gramGrp>
<gen>m</gen>
<number>sg</number>
</gramGrp>
<form type="inflected" xml:lang="br">
<orth>dibennoù-sizhun</orth>
<gramGrp>
<number>pl</number>
</gramGrp>
</form>
</cit>
</sense>
interpretation:
"week-end" in fr is "dibenn-sizhun" in br;
- "dibenn-sizhun" is masc sg.,
- its (orthographic) plural form is "dibennoù-sizhun"
(Note: I am assuming that the gender of "dibenn" becomes the gender of "dibenn-sizhun" -- this is rather standard in (a type of) compounds that one of the elements is more important and imposes its own features onto the entire structure.)
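If the source data uses the abbreviated paper notation, the full forms can be generated once, at import time. This is a small sketch for the compound pattern of this example only (plural ending attached to the first element); the function name is hypothetical and other Breton plural patterns are not covered.

```python
def expand_compound(head, plural_ending, modifier):
    """Expand 'head (ending)-modifier' into full singular and plural forms."""
    singular = f"{head}-{modifier}"
    plural = f"{head}{plural_ending}-{modifier}"
    return singular, plural


print(expand_compound("dibenn", "où", "sizhun"))
# → ('dibenn-sizhun', 'dibennoù-sizhun')
```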
In the following paragraphs we look at a Latin-German dictionary. Latin, as a highly inflectional language, has many special cases. Consider the following definitions, formatted as in a dictionary:
iubere, iubeo, iussi, iussum - befehlen
perfectus, -a, -um - vollendet, vollkommen
(the 2nd and 3rd forms are the feminine and neuter)
maiestas, -atis f. - Hoheit, Erhabenheit, Größe
(-* is the genitive form, used for declension)
It is in general advisable not to include parts of words in a dictionary, as this wastes its potential. Remember that you can always extract a more human-readable form from a technically explicit notation.
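To illustrate the last point, here is a minimal sketch that turns an explicitly encoded verb entry back into the compact line a paper dictionary would print. The entry is written inline without a namespace to keep the example short, the supine is spelled iussum, and the render function is made up for this illustration.

```python
import xml.etree.ElementTree as ET

ENTRY = """<entry xml:id="iubeo">
  <form xml:lang="la">
    <orth>iubeo</orth>
    <form type="infl">
      <orth type="inf">iubere</orth>
      <orth type="perf">iussi</orth>
      <orth type="sup">iussum</orth>
    </form>
  </form>
  <gramGrp><pos>v</pos><iType>2</iType></gramGrp>
  <sense><def>befehlen</def></sense>
</entry>"""


def render(entry_xml):
    """Join all written forms and all definitions into one printable line."""
    entry = ET.fromstring(entry_xml)
    forms = [orth.text for orth in entry.iter("orth")]  # document order
    definitions = [d.text for d in entry.iter("def")]
    return f"{', '.join(forms)} - {'; '.join(definitions)}"


print(render(ENTRY))
# → iubeo, iubere, iussi, iussum - befehlen
```

Going the other way, from the printed line to the structure, would require guessing which form is which, which is why the explicit encoding is the better master copy.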
A simple format for manual encoding could look like this:
<entry xml:id="iubeo">
<form xml:lang="la">
<orth>iubeo</orth>
<form type="infl">
<orth type="inf">iubere</orth>
<orth type="perf">iussi</orth>
<orth type="sup">iussum</orth>
</form>
</form>
<gramGrp>
<pos>v</pos>
<iType>2</iType> <!-- inflection class (for verbs: conjugation number) -->
</gramGrp>
<sense>
<def>befehlen</def>
</sense>
</entry>
If the simple format above is used, it will be converted by XSLT to the form advocated, among others, by the relevant part of chapter 9 of the TEI Guidelines. The encoding should look roughly as follows. For the sake of simplicity, the headword form in this example is the infinitive; however, it is more common to use the first-person form.
<entry xml:id="iubere">
<form xml:lang="la">
<orth>iubere</orth>
<form type="infl">
<form xml:id="iubere-iubeo">
<orth>iubeo</orth>
<gramGrp xml:lang="de">
<per>1</per>
<number>sg</number>
<mood>ind</mood>
<tns>praes</tns>
<gram type="voice">aktiv</gram>
</gramGrp>
</form>
<form xml:id="iubere-iussi">
<orth>iussi</orth>
<gramGrp xml:lang="de">
<per>1</per>
<number>sg</number>
<mood>ind</mood>
<tns>perf</tns>
<gram type="voice">aktiv</gram>
</gramGrp>
</form>
<!-- and so on for the supine form iussum -->
</form>
</form>
<gramGrp>
<pos>v</pos>
<iType>2</iType> <!-- inflection class (for verbs: conjugation number) -->
</gramGrp>
<sense>
<def>befehlen</def>
</sense>
</entry>
The question arises how to encode that the main head word form is an infinitive form and the answer is: by the convention of each dictionary. Of course it makes sense to have a look at established standards. In any case, this should be noted in the TEI header.
Not every grammatical category is provided as a separate element, but those that are, are actually specializations of the generic <gram> element. Hence, <tns> = <gram type="tns">.
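The equivalence can be shown by rewriting every specialized element as the generic one. The sketch below assumes a flat <gramGrp> without a namespace and overwrites any type attribute a specialized element already carries; the function name is invented and this is not an official TEI or FreeDict transformation.

```python
import xml.etree.ElementTree as ET


def generalize(gram_grp_xml):
    """Rewrite specialized children of <gramGrp> as <gram type="...">."""
    gram_grp = ET.fromstring(gram_grp_xml)
    for child in gram_grp:
        if child.tag != "gram":
            child.set("type", child.tag)  # the old element name becomes the type
            child.tag = "gram"
    return ET.tostring(gram_grp, encoding="unicode")


print(generalize('<gramGrp><tns>perf</tns><gram type="voice">aktiv</gram></gramGrp>'))
# → <gramGrp><gram type="tns">perf</gram><gram type="voice">aktiv</gram></gramGrp>
```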
Note also the <iType> element, which holds something that may be referred to as "conjugation/declension/lexical/noun class", depending on the language, the assumed grammatical system and the intended purpose. This should assume some verifiable system (e.g. an established grammar of the given language) and it should be used consistently.
Note that in e.g. the Swahili-English dictionary, we follow a different convention, because the plural forms of nouns are actually references to other entries in the dictionary. But here, you supply a mini-paradigm with every entry (which may indeed be very useful to the user).
Regarding the noun:
<entry xml:id="maiestas">
<form xml:lang="la">
<orth>maiestas</orth>
<form type="infl">
<orth>maiestatis</orth>
<case>gen</case>
</form>
</form>
<gramGrp>
<pos>n</pos>
<gen>f</gen>
</gramGrp>
<sense>
<cit type="trans">
<quote>Hoheit, Erhabenheit, Größe</quote>
</cit>
</sense>
</entry>
Remarks:
- Note that the <form> for maiestatis is flatter than the forms of iubere. Whether you keep to the system of iubere (<form> within <form> within <form>) or simplify as here (<form> within <form>) is a matter of the convention that you use. Just make sure to be consistent, so that when you decide to e.g. provide the entire paradigms for your nouns, the information can be easily added with (say) XSLT. It is good to document such conventions in the dictionary header, for the sake of users and other developers.
- Note that, again, the information that the form maiestas is nominative is your dictionary-wide default, whereas the main <gramGrp> identifies the features of the lexeme as an abstract object that is realised by the particular forms in the syntactic context.
- These structures assume that you have put something like <text xml:lang="de"> at the top of the dictionary, and that you only mark the divergence from that in the <form> element. In <gramGrp> elements within the <form>, however, you need to reset the language back to German, if this is what you want. (Here we touch upon the interesting issue of active versus passive dictionaries, which deserves a separate treatment.)
- While the above example shows multiple translations within a single <quote> element, it is in general better to split them into distinct <quote> elements, to make them machine-readable and easier to format.
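The split recommended in the last remark can be done mechanically as long as commas are used only as separators. Here is a minimal sketch that keeps all the new <quote> elements inside the original <cit>; whether they should instead each get their own <cit> is a convention choice this example does not settle, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET


def split_quotes(cit_xml):
    """Replace each comma-separated <quote> with one <quote> per translation."""
    cit = ET.fromstring(cit_xml)
    for quote in cit.findall("quote"):
        parts = [p.strip() for p in (quote.text or "").split(",") if p.strip()]
        if len(parts) < 2:
            continue  # nothing to split
        index = list(cit).index(quote)
        cit.remove(quote)
        for offset, part in enumerate(parts):
            new_quote = ET.Element("quote", quote.attrib)  # keep attributes
            new_quote.text = part
            cit.insert(index + offset, new_quote)
    return ET.tostring(cit, encoding="unicode")


print(split_quotes('<cit type="trans"><quote>Hoheit, Erhabenheit, Größe</quote></cit>'))
# → <cit type="trans"><quote>Hoheit</quote><quote>Erhabenheit</quote><quote>Größe</quote></cit>
```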
In many ways. You can for example make sure to tell us about any mistake or inconsistency in the FAQ or in the HOWTO that you notice. Even a typo counts.
In general, issues are best reported at our [issues page](https://github.com/freedict/fd-dictionaries/issues), but it doesn't hurt to notify us on our mailing list.
Similarly with the dictionaries themselves. We'll be grateful for error reports. With omissions it's a bit different: if you suggest a new translation and provide a means for us to verify it (just in case -- let's say it's a prank-avoidance mechanism), it can surely be added within hours or days. If you just complain about some word missing, the procedure becomes more difficult and may take a long time to complete. Still, the project trackers are the best way to get such reports to us.
We will also ultimately benefit from your bug reports concerning e.g. DICT clients or servers -- just make sure to direct them to the appropriate address :-)
FREEDICTDIR is the path to the root of the FreeDict source, most probably the path to the Git repository. It is required to locate the tools directory, which contains the FreeDict build system.
To build a dictionary, set the FREEDICTDIR environment variable. Windows users can have a look here, and GNU/Linux users should edit their shell configuration (e.g. ~/.bashrc or ~/.zshrc) and add:
export FREEDICTDIR=/path/to/dir
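To check that the variable is visible to programs started from your shell, a small script along the following lines may help. It is only a sketch with a made-up function name, not part of the FreeDict build system; it derives the tools directory from the variable and does not verify that the directory exists.

```python
import os
from pathlib import Path


def find_tools_dir(environ=None):
    """Return <FREEDICTDIR>/tools, or raise if the variable is unset or empty."""
    environ = os.environ if environ is None else environ
    root = environ.get("FREEDICTDIR")
    if not root:
        raise RuntimeError("FREEDICTDIR is not set; point it at the FreeDict source root")
    return Path(root) / "tools"


# Passing a dictionary instead of the real environment keeps the example reproducible.
print(find_tools_dir({"FREEDICTDIR": "/path/to/dir"}))
```

Calling find_tools_dir() without arguments reads the real environment, so a RuntimeError there means the shell configuration has not taken effect yet (for example because the terminal was not restarted).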
Certainly you can. You have to retain a notice that you have obtained the dictionaries from the FreeDict project and you need to obey the free software licenses of each dictionary, but no other restrictions apply.
In fact, we encourage you to use our dictionaries and we provide an API to query for the latest versions of the dictionaries. Please have a look at our FreeDict API for more details.
We do have a mailing list, which is best for general questions on getting started or about one project or another. Browsing through its archive and the [old archive](https://sourceforge.net/p/freedict/mailman/freedict-beta/) is a good idea.
If you want interactive answers or just want to say hi to us, you can find us in #freedict on the OFTC network. If you have never used IRC before, you have two options:
- Try the online version: http://webchat.oftc.net/?channels=freedict
- Search for an IRC client for your operating system and install it; enter irc.oftc.net as the server name and join the channel #freedict.