Section 4 Lexicon

Ruud de Jong edited this page Oct 24, 2018 · 1 revision

This page is based on a page of the wiki for the original SimpleNLG.

Like other Natural Language Generation systems, SimpleNLG-NL needs information about words; the word list containing such information is called a 'lexicon'. SimpleNLG-NL comes with simple lexicons for each language built into the system, which can be accessed via:

    Lexicon lexicon = new simplenlg.lexicon.dutch.XMLLexicon();

Replace dutch with french or english to load the built-in lexicon for that language instead.

The default lexicon, the English lexicon, is also available using:

    Lexicon lexicon = Lexicon.getDefaultLexicon();

There are three lexicons available for Dutch. The largest lexicon has over 70,000 entries and formed the basis for the two smaller ones. The subsets for the medium and small lexicons were selected using a word frequency list.

English

SimpleNLG can also use the 300MB NIH Specialist lexicon, which has outstanding coverage of medical terminology as well as excellent coverage of everyday English. For information on setting up this lexicon, please see Appendix C of SimpleNLG.

Using a different lexicon

You can switch between the Dutch lexicons or import lexicons you created yourself. Simply pass the path to your file to the XMLLexicon() constructor. The path can be relative or absolute.

    Lexicon lexicon = new simplenlg.lexicon.dutch.XMLLexicon("my-lexicon.xml");

To access a lexicon outside of the current working directory, provide the full, absolute path name (e.g., “/home/staff/lexicons/my-lexicon.xml”, “C:\lexicons\my-lexicon.xml”).[1]
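Because a relative path is resolved against the current working directory, it can help to check where a lexicon path actually points before handing it to the constructor. The following is a minimal sketch using only the Java standard library; the class name LexiconPathCheck and its two helper methods are hypothetical and not part of SimpleNLG-NL.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of SimpleNLG-NL: inspects a lexicon path
// before it is passed to the XMLLexicon constructor.
class LexiconPathCheck {

    /** Turns a relative or absolute lexicon path into a normalized absolute path. */
    static Path toAbsolute(String lexiconPath) {
        // Relative paths are resolved against the JVM's current working directory.
        return Paths.get(lexiconPath).toAbsolutePath().normalize();
    }

    /** Returns true when the path points at an existing, readable regular file. */
    static boolean isUsable(String lexiconPath) {
        Path p = toAbsolute(lexiconPath);
        return Files.isRegularFile(p) && Files.isReadable(p);
    }

    public static void main(String[] args) {
        // "my-lexicon.xml" is the example file name used on this page.
        String lexiconPath = args.length > 0 ? args[0] : "my-lexicon.xml";
        System.out.println("Resolved to: " + toAbsolute(lexiconPath));
        System.out.println("Usable: " + isUsable(lexiconPath));
    }
}
```

If the check reports that the file is not usable, the most likely cause is that the program is started from a different working directory than expected; in that case, pass the absolute path instead.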

Once we have a lexicon, we can create an NLGFactory (an object that creates SimpleNLG structures):

    NLGFactory nlgFactory = new NLGFactory(lexicon);

→ For more examples, look at testsrc/MultipleLexiconTest.java and testsrc/NIHDBLexiconTest.java.


[1] Note that in SimpleNLG V4, there are no lexicon methods to directly get inflected variants of a word; in other words, there is no equivalent in V4 of the SimpleNLG V3 getPlural(), getPastParticiple(), etc. methods. It is possible in V4 to compute inflected variants of words, but the process is more complicated: basically we need to create an InflectedWordElement around the base form, add appropriate features to this InflectedWordElement, and then realise it.