-
Notifications
You must be signed in to change notification settings - Fork 3
/
fuf5.2.tex
8391 lines (6603 loc) · 310 KB
/
fuf5.2.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
\documentclass[10pt,a4paper]{report}
\usepackage[a4paper]{geometry}
\usepackage{comment}
\usepackage[utf8]{inputenc}
\usepackage[english]{babel}
\usepackage[english]{isodate}
\usepackage[parfill]{parskip}
\usepackage{imakeidx}
\usepackage{listings}
\usepackage{graphicx}
\usepackage{multicol}
\usepackage[toc,page]{appendix}
\pagenumbering{Roman}
\makeindex
\geometry{
a4paper,
total={170mm,257mm},
left=20mm,
top=20mm,
}
% Program examples
\lstset{frame=single,
language=Lisp,
tabsize=2,
basicstyle=\footnotesize
}
\begin{document}
\pagenumbering{arabic}
\begin{comment}
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
; File: FUG5.MSS
; Description: Global documentation on FUG5 package
; Author: Michael Elhadad
; Created: 24-Feb-89
; Modified: 26-Feb-89
; 22 Oct 91 - Ported to FUF 5.0
; 06 Jun 93 - Ported to FUF 5.2
; 02 Jan 17 - Ported to Latex
; Language: Scribe
; Status: Experimental
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
\end{comment}
\bibliographystyle{unsrt}
\begin{comment}
\style(leftmargin 1 inch, indentation 0.25 inches)
\style(topmargin 1.5 inches, rightmargin 1 inch)
\modify(figure, use box)
\end{comment}
\begin{titlepage}
\begin{center}
{\scshape\LARGE FUF: the Universal Unifier\par}
{\scshape\Large User Manual\par}
{\scshape\Large Version 5.2\par}
\vspace{1cm}
{\scshape Michael Elhadad\par}
{\scshape Department of Computer Science\par}
{\scshape Ben Gurion University of the Negev\par}
{\scshape 84105 Beer Sheva, Israel\par}
{\scshape elhadad@cs.bgu.ac.il\par}
\vspace{1.5cm}
\vfill
{\large 27 June 1993\par}
{Copyright \copyright 1993 Michael Elhadad}
\end{center}
\begin{abstract}
This document is the user manual for FUF version 5.2, a natural language
generator program that uses the technique of unification grammars. The
program is composed of two main modules: a unifier and a linearizer. The
unifier takes as input a semantic description of the text to be generated
and a unification grammar, and produces as output a rich syntactic
description of the text. The linearizer interprets this syntactic
description and produces an English sentence. This manual includes a
detailed presentation of the technique of unification grammars and a
reference manual for the current implementation (FUF 5.2).
Version 5.2 includes novel techniques in the unification allowing the
specification of types and the expression of complete information. It also
allows for procedural unification and supports sophisticated forms of control.
\end{abstract}
\end{titlepage}
\tableofcontents
\newpage
\listoffigures
\newpage
\chapter{Introduction}
\section{How to Read this Manual}
This manual is designed to help you use the FUF package and to
describe and explain the technique of unification grammars.
The FUF package is made available to people interested in text
generation and/or functional unification. It can be used:
\begin{itemize}
\item as a front-end to a text generation system, providing a surface realization
component. SURGE, a grammar of English with large syntactic coverage
written in FUF, is included for that purpose.
\item as an environment for grammar development. People interested in
expressing grammatical theories or developing a practical grammar
can experiment with the unifier and linearizer.
\item as an environment for a study of functional unification.
Functional unification is a powerful technique and can be used
for non-linguistic or non-grammatical applications.
\end{itemize}
This manual contains material for people falling in any of these
categories. It starts with an introduction to functional unification, its
syntax, semantics and terminology. The following chapters deal with the
``grammar development'' tools: tracing and indexing, a presentation of the
morphology component and the dictionary. The next two chapters present the
novel features of FUF: typing and control facilities. A chapter is devoted
to typing in FUF: type definition, user-defined unification methods and
expression of complete information. One chapter is devoted to to flow of
control specification (indexing, dependency-directed backtracking and goal
freezing). Finally the last chapter is a reference manual to the package.
One appendix is devoted to possible non-linguistic applications of the
formalism, and compares the formalism with programming languages, in
particular with PROLOG.
Note that this manual does {\bf not} describe or document the example
grammars provided as examples with the unifier. The sample grammars
contain a brief documentation on-line and are accompanied by example
inputs. The SURGE grammar is documented in a separate manual.
\section{Function and Content of the Package}
FUF implements a natural language surface generator using the theory of
unification grammars (cf. bibliography for references). It follows most
closely the original FUG formalism introduced in \cite{Kay79}.
Its input is a Functional Description (fd) describing the meaning of an
utterance and a grammar (also described as an fd).\index{FD}
The Syntax of fds is fully described in section \ref{sect-syntax}.
The output is an English sentence expressing this meaning according to the
grammatical constraints expressed by the grammar.
There are two major stages in this process: unification and linearization.
\index{unification} \index{linearization}
Unification consists in making the input-fd and the grammar ``compatible''
in the sense described in \cite{Kay79}. It comes down to enriching the input-fd
with directives coming from the grammar and indicating word order, syntactic
constructions, number agreement and other features.
The enriched input is then linearized to produce an English sentence.
The linearizer includes a morphology module handling all the problems
of word formation (s's, preterits, ...).
\chapter{Getting Started}
Appendix \ref{installation} describes how to install the package on a new
machine. Contact your local system administrator to learn how to load the
program on your system.
You should know how to load the example grammars and corresponding inputs.
\section{Main User Functions}
Once the system is loaded, you are ready to run the program.
If you are in a hurry to try the system, the user functions are:
\index{uni (function)}
\index{uni-fd (function)}
\begin{lstlisting}[language=Lisp]
(UNI INPUT &key GRAMMAR NON-INTERACTIVE (LIMIT 10000))
;; by default the grammar used is *u-grammar*
;; non-interactive is nil
;; limit is 10000
;; Complete work : unification + linearization. Outputs a sentence.
;; If non-interactive is nil, a line of statistics is
;; also printed.
;; In any case, stops after limit backtracking points.
(UNI-FD INPUT &key GRAMMAR NON-INTERACTIVE (LIMIT 10000))
;; by default the grammar used is *u-grammar*
;; non-interactive is nil.
;; limit is 10000
;; Does only the unification. Outputs the enriched fd. This is the
;; function to use when trying the grammars manipulating lists of gr5.l
;; If non-interactive is nil, a line of statistics is also printed.
;; In any case, stops after limit backtracking points.
CL> (uni ir01)
The boy loves a girl.
CL> (uni-fd ir02)
(# # ...)
(UNIF FD &key (GRAMMAR *u-grammar*))
;; by default the grammar used is *u-grammar*
;; As uni-fd but works even if FD does not contain a CAT feature.
\end{lstlisting}
\index{unif (function)}
If you want to change the grammar, or the input you can edit the files
defining it, or the function with the same name.
There are two other useful functions for grammar developers: {\tt fd-p}
checks whether a Lisp expression is a syntactically correct Functional
Description (FD) to be used as an input. If it is not, helpful error
messages are given. {\tt grammar-p} checks whether a grammar is well-formed.
NOTE: use {\tt fd-p} on inputs only and {\tt grammar-p} on grammars only.
\index{fd-p (function)}
\index{grammar-p (function)}
\begin{lstlisting}[language=Lisp]
(FD-P FD &key (PRINT-MESSAGES t) (PRINT-WARNINGS t))
;; --> T if FD is a well-formed FD.
;; --> nil (and error messages) otherwise.
;; The error messages and warnings are only printed if PRINT-MESSAGES and
;; PRINT-WARNINGS are true.
;; DO NOT USE FD-P ON GRAMMARS
(GRAMMAR-P &optional (GRAMMAR *u-grammar*)
&key (PRINT-MESSAGES t) (PRINT-WARNINGS t))
;; --> T if GRAMMAR (by default *u-grammar*) is a well-formed grammar.
;; --> nil (and error messages) otherwise.
;; - FD is *u-grammar* by default
;; - PRINT-MESSAGES is t by default.
;; If it is non-nil, some statistics on the grammar are printed.
;; It should be nil when the function is called non-interactively.
;; - PRINT-WARNINGS is nil by default.
;; If it is non-nil, warnings are generated for all paths in the
;; grammar. (It is sometimes a good idea to manually check that all
;; paths are valid.)
\end{lstlisting}
\begin{lstlisting}[language=Lisp]
;; Examples:
CL> (fd-p '((a 1) (a 2)))
----> error, attribute a has 2 incompatible values: 1 and 2.
nil
CL> (grammar-p)
----> t
CL> (grammar-p '((a 1) (b 2)))
----> error, a grammar must be a valid FD of the form:
((alt (((cat c1)...) ... ((cat cn) ...)))). nil.
\end{lstlisting}
\chapter{FDs, Unification and Linearization}
In this section, we informally introduce the concepts of FDs and
unification. The next section provides a complete description of
the FDs as used in the package, and presents all available
unification mechanisms.
\section{What is an FD?}
An FD (functional description) is a data structure representing constraints
on an object. It is best viewed as a list of pairs (attribute value).
Here is a simple example: \index{FD}
\begin{lstlisting}
((article "the") (noun "cat"))
\end{lstlisting}
There is a function called {\tt fd-p} in the package that lets you know whether
a given Lisp expression is a valid FD or not and gives you helpful error
messages if it is not. \index{fd-p (function)}
In FUGs, the same formalism is used for representing both the
input expressions and the grammar.
\section{A Simple Example of Unification}
We present here a minimal grammar that contains just enough to
generate the simplest complete sentences. It is included in file
{\tt gr0.l} in the directory containing the examples. A little more
complex grammar, handling the active/passive distinction, is
available in {\tt gr1.l}, and a more interesting one in {\tt gr2.l}.
\footnote{Note that the simplest grammars presented in the manual use the
standard phrase structure approach $S \rightarrow NP VP$. More advanced grammars use
a systemic approach to language (after gr4). In general, the FUG formalism
is convenient to write systemic grammars, but it can also be used to
implement other linguistic models (PS rules, LFG, GPSG or HPSG).}
\index{examples} \index{gr0.l (file)}
\begin{lstlisting}[language=Lisp]
((alt MAIN (
;; a grammar always has the same form: an alternative
;; with one branch for each constituent category.
;; First branch of the alternative
;; Describe the category S.
((cat s)
(prot ((cat np)))
(goal ((cat np)))
(verb ((cat vp)
(number {prot number})))
(pattern (prot verb goal)))
;; Second branch: NP
((cat np)
(n ((cat noun)))
(alt (
;; Proper names don't need an article
((proper yes)
(pattern (n)))
;; Common names do
((proper no)
(pattern (det n))
(det ((cat article)
(lex "the")))))))
;; Third branch: VP
((cat vp)
(pattern (v dots))
(v ((cat verb)))))))
\end{lstlisting}
A few comments on the form of this grammar: the skeleton of a
grammar is always the same, a big {\tt alt} (alternation of possible
branches, the unifier will pick one compatible branch to unify
with the input). Each branch of this alternation corresponds to a single
category (here, {\tt S, NP} and {\tt VP}). \index{alt (keyword)}
\index{branch (of an alt)}
The second remark is about the form of the input: as shown in the
following example, an input is an FD, giving some constraints on
certain constituents. The grammar decides what grammatical
category corresponds to each constituent. \index{constituent}
The next main function of the grammar is to give constraints on
the ordering of the words. This is done using the {\tt pattern} special
attribute. A {\tt pattern} is followed by a picture of how the
constituents of the current FD should be ordered: {\tt (Pattern (prot
verb goal))} means that the prot constituent should come just
before the verb constituent, etc. \index{pattern (keyword)}
\index{ordering constraints}
In the first branch, the only thing to notice is how the
agreement subject/verb is described: the number of the {\tt PROT} will
appear in the input as a feature of the FD appearing under {\tt PROT},
as in: \index{agreement (subject/verb)}
{\tt (prot ((number plural)
(lex "car")))}
standing for ``cars''. To enforce the subject/verb agreement, the
grammar picks the feature {\tt number} from the {\tt prot} sub-fd and
requests that it be unified with the corresponding feature of the
{\tt verb} sub-fd. This is expressed by:
{\tt (verb ((number {prot number})))}
which means: the value of the {\tt number} feature of {\tt verb} must be
the same as the value of the {\tt number} feature of {\tt prot}. The
curly-braces notation denotes what is called a ``path'' which is a pointer
within an fd. Note that in this line of the grammar, we refer to {\tt {prot
number}} even though the {\tt {prot number}} feature does not appear under
{\tt prot} in the rest of the grammar. This is a general feature of FUF: any
attribute can appear in an FD, and its value can be given either by the
grammar directly where it would appear, or by the input, or by the grammar
coming from a distant place and using a path.
Note also that the agreement constraint could have been written in the
``opposite'' direction:
{\tt (prot ((number {verb number})))}
Or even:
{\tt ({prot number} {verb number})}
In the second branch, describing the NPs, we have two cases,
corresponding to proper and common nouns. Common nouns are
preceded by an article, whereas proper nouns just consist of
themselves, {\em e.g.}, ``the car'' vs. ``John''. If the feature {\tt proper} is
not given in the input, the grammar will add it. By default, the
current unifier will always try the first branch of an {\tt alt} first.
That means that in this grammar, proper nouns are the default.
\index{default (in alt)} \index{proper noun} \index{common noun}
Finally, a brief word about the general mechanism of the
unification: the unifier first unifies the input FD with the
grammar. In the following example, this will be the first pass
through the grammar. Then, each sub-constituent of the resulting
FD that is part of the {\tt cset} (constituent-set) of the FD will be
unified again with the whole grammar. This will unify the
sub-constituents {\tt prot, verb} and {\tt goal} also. This is how recursion
is triggered in the grammar. The next section describes how the
{\tt cset} is determined. All you need to know at this point is that if
a constituent contains a feature {\tt (cat xxx)} it will be tried for
unification.
\index{recursion} \index{unification (overall mechanism)}
\index{sub-constituents} \index{cset (keyword)}
\index{constituent traversal}
In the input FDs, the sign {\tt ===} is used as a shortcut for the
notation: \index{lex (special attribute)}
\begin{lstlisting}
(n === John) <===> (n ((lex John)))
\end{lstlisting}
The {\tt lex} feature always contains the single string that is to be
used in the English sentence for all ``terminal'' constituents.
\begin{lstlisting}[language=Lisp]
;; When unified with the following FD, the grammar will output the
;; sentence ``John likes Mary''.
(setq ir01 '((cat s)
(prot ((n === john)))
(verb ((v === like)))
(goal ((n === Mary)))))
;; Which corresponds to the linearization of the following complete
;; FD (this is the result of the unification):
CLISP> (uni-fd ir01)
((cat s)
(prot ((n ((lex "john")
(cat noun)))
(cat np)
(proper yes)
(pattern (n))))
(verb ((v ((lex "like")
(cat verb)))
(cat vp)
(number {prot number})
(pattern (v dots))))
(goal ((n ((lex "Mary")
(cat noun)))
(cat np)
(proper yes)
(pattern (n))))
(pattern (prot verb goal)))
\end{lstlisting}
Following the trace of the program will be the easiest way to
figure out what is going on:
\begin{lstlisting}[language=Lisp]
LISP> (uni ir01)
-->
>STARTING CAT S AT LEVEL {}
-->Entering alt TOP -- Jump indexed to branch #1: S matches input S
-->Updating (CAT NIL) with NP at level {PROT CAT}
-->Updating (CAT NIL) with NP at level {GOAL CAT}
-->Updating (CAT NIL) with VP at level {VERB CAT}
-->Enriching input with (NUMBER {PROT NUMBER}) at level {VERB}
-->Enriching input with (PATTERN (PROT VERB GOAL)) at level {}
-->Success with branch #1 S in alt TOP
>STARTING CAT NP AT LEVEL {PROT}
-->Entering alt TOP -- Jump indexed to branch #2: NP matches input NP
-->Updating (CAT NIL) with NOUN at level {PROT N CAT}
-->Enriching input with (NUMBER {PROT NUMBER}) at level {PROT N}
-->Updating (PROPER NIL) with YES at level {PROT PROPER}
-->Enriching input with (PATTERN (N)) at level {PROT}
-->Success with branch #2 NP in alt TOP
>STARTING CAT VP AT LEVEL {VERB}
-->Entering alt TOP -- Jump indexed to branch #3: VP matches input VP
-->Enriching input with (PATTERN (V DOTS)) at level {VERB}
-->Updating (CAT NIL) with VERB at level {VERB V CAT}
-->Success with branch #3 VP in alt TOP
>STARTING CAT NP AT LEVEL {GOAL}
-->Entering alt TOP -- Jump indexed to branch #2: NP matches input NP
-->Updating (CAT NIL) with NOUN at level {GOAL N CAT}
-->Enriching input with (NUMBER {GOAL NUMBER}) at level {GOAL N}
-->Updating (PROPER NIL) with YES at level {GOAL PROPER}
-->Enriching input with (PATTERN (N)) at level {GOAL}
-->Success with branch #2 NP in alt TOP
{Used 3 backtracking points - 0 wrong branches - 0 undos}
John likes mary.
\end{lstlisting}
In the figure, you can identify each step of the unification: first the
top level category is identified: (cat s). The input is unified with the
corresponding branch of the grammar (branch \#1). Then the constituents are
identified. We have here 3 constituents: PROT of cat NP, VERB of cat VP
and GOAL of CAT NP. Each constituent is unified in turn. Then for each
constituent, the unifier identifies the sub-constituents. In this case, no
constituent has a sub-constituent, and unification succeeds. Note that in
general, the tree of constituents is traversed breadth first.
Now, it is also important to know when unification fails. The
following example tries to override the subject/verb agreement,
causing the failure:
\index{failure (of unification)} \index{agreement (subject/verb)}
\begin{lstlisting}[language=Lisp]
(setq ir02 '((cat s)
(prot ((n === john) (number sing)))
(verb ((v === like) (number plural)))
(goal ((n === Mary)))))
LISP> (uni ir02)
>STARTING CAT S AT LEVEL {}
-->Entering alt TOP -- Jump indexed to branch #1: S matches input S
-->Updating (CAT NIL) with NP at level {PROT CAT}
-->Updating (CAT NIL) with NP at level {GOAL CAT}
-->Updating (CAT NIL) with VP at level {VERB CAT}
-->Fail in trying PLURAL with SING at level {VERB NUMBER}
<fail>
\end{lstlisting}
\section{Linearization}
\index{linearization}
Once the unification has succeeded, the unified fd is sent to the
linearizer. The linearizer works by following the directives included in
the {\tt pattern} \index{pattern (keyword)}. The exact way to define these
features is explained in section \ref{sect-pattern}. The linearizer works
as follows:
\begin{enumerate}
\item Identify the {\tt pattern} feature in the top level: for ir01, it is
{\tt (pattern (prot verb goal))}.
\item If a pattern is found:
\begin{enumerate}
\item For each constituent of the pattern, recursively linearize the constituent.
(That means linearize PROT, VERB and GOAL).
\item The linearization of the fd is the concatenation of the linearizations of
the constituents in the order prescribed by the pattern feature.
\end{enumerate}
\item If no feature pattern is found:
\begin{enumerate}
\item Find the {\tt lex} feature of the fd, and depending on the category of the
constituent, the morphological features needed. For example, if fd is of
{\tt (cat verb)}, the features needed are: {\tt person, number, tense}.
\item Send the lexical item and the appropriate morphological features to the
morphology module \index{Morphology}. The linearization of the fd is the
resulting string. For example, if lex=``give'' and the features are the
default values (as it is in ir01), the result is ``gives.''
\end{enumerate}
\end{enumerate}
When the fd does not contain a morphological feature, the morphology module
provides reasonable defaults. More details on morphology are provided in
section \ref{sect-morphology}.
If a pattern contains a reference to a constituent and that constituent
does not exist, nothing happens: the linearization of an empty constituent
is the empty string. The following example illustrates this feature:
\begin{lstlisting}[language=Lisp]
;; Unified FD:
((cat s)
(pattern (prot verb goal benef))
(prot ((cat noun) (lex "John")))
(verb ((cat verb) (lex "like"))))
;; Linearized string (note that constituents GOAL and BENEF are missing):
John likes.
\end{lstlisting}
Finally, if one of the constituent sent to the morphology is not a known
morphological category, the morphology module can not perform the necessary
agreements. This is indicated by the following output:
\index{Morphology} \index{Unknown category}
\begin{lstlisting}[language=Lisp]
;; Unified FD:
((cat s)
(pattern (prot verb goal))
(prot ((cat noun) (lex "John")))
(verb ((cat verb) (lex "like")))
(goal ((cat zozo) (lex "trotteur"))))
;; Linearized string:
John likes <unknown cat ZOZO: trotteur>
\end{lstlisting}
In general, when you find that in your output, it means that there is
something wrong in the grammar. You should check the list of legal
morphological categories (see section \ref{sect-morphology}) or you should
check why a high level constituent is sent to the morphology (your fd is
too flat). You can use the function {\tt morphology-help} to receive on-line
help on which categories are known to the morphology module.
\index{morphology-help (function)}
\chapter{Writing and Modifying Grammars}
In this section, we briefly outline what steps must be followed to develop
a Functional Unification Grammar. The methodology is the following:
\begin{enumerate}
\item Determine the input to use. In general, input is given by an underlying
application. If not, the criterion to decide what is a good input is that
it should be as much ``semantic'' as possible, and contain the fewest
syntactic features as possible.
\item Identify the types of sentences to produce.
\item For each type of sentence, identify the constituents and sub-constituents,
and their function in the sentence. A constituent is a group of words that
are ``tied together'' in a clause. A constituent in general plays a
certain function with respect to the higher level constituent containing
it. For example, in ``John gives a book to Mary,'' the group ``a book''
forms a constituent, of category ``noun-group,'' and it plays the role of
the ``object upon which action is performed'' in the clause. Such role is
often called ``medium'' or ``affected'' in functional grammars.
\item Determine the output (that is, the unified fds before linearization).
In the output, constituents should be grouped in the same pair and the
attribute should indicate what function the constituent is fulfilling.
In the previous example, we want to have a pair of the form {\tt (medium [fd
describing ``a book''])} in the output. The output must also contain all
ordering constraints necessary to linearize the sentence and provide all
the morphological feature needed to derive all word inflections ({\em e.g.},
number, person, tense).
\item Determine the ``difference'' between the input and the output. All
features that are in the output but not in the input must be added by the
grammar.
\item For each category of constituent, write a branch of the grammar. To do
that, you need to specify under which conditions each feature of the
``difference'' must be added to the input.
\end{enumerate}
This is of course an over-simplified description of the process.
Sometimes, the mapping from the input to the output is best considered if
decomposed in several stages. For example, in gr4 (cf. file
{\tt gr4.l}), \index{gr4.l (file)} the grammar first maps the roles from
semantic functions (like {\tt agent} or {\tt medium}) to syntactic roles (like
{\tt subject} or {\tt direct-object}), and then does the required syntactic
adjustments. In gr11, (cf. file {\tt gr11.l}), \index{gr11.l (file)}, there
are three stages: first the clause grammar maps from semantic roles to a
level called ``oblique'', and then oblique is mapped to syntactic functions
such as subject or adjunct.
In general, the important idea here is that you must first determine your
input and your output and the grammar is the difference of the two.
The process can be complicated if your grammar also includes a lexicon. In
this case, a good part of the output should be provided by the lexicon.
Grammar gr11 illustrates one way of including the lexicon in your grammar.
\chapter{Precise Characterization of FDs}
\label{sect-syntax}
\section{Generalities: Features, Syntax, Paths and Equations}
\index{features} \index{syntax} \index{path} \index{equations} \index{leaf}
An FD is a list of pairs, called features. The attribute of a feature needs
to be a symbol. The value of a feature can be either a leaf or recursively
an FD. Here is an example: \index{pair (attribute/value)}
\begin{lstlisting}[language=Lisp]
(1) ((cat np)
(det ((cat article)
(definite yes)))
(n ((cat noun)
(number plural))))
\end{lstlisting}
A ``leaf'' is a primitive fd. It can be either a symbol, a number, a
string, a character or an array.
A given attribute in an FD must have at most ONE value.
Therefore, the FD {\tt ((size 1) (size 2))} is illegal. In fact FDs
can be viewed as a conjunction of constraints on the description
of an object: for an object to be described by {\tt ((size 1) (size
2))} it would need to have its property {\tt size} to have both the
values 1 and 2. Conversely, if the attribute "size" does not
appear in the FD, that means its value is not constrained and it
can be anything. The FD {\tt nil} (empty list of pairs) thus
represents all the objects in the world. The pair {\tt (att nil)}
expresses the constraint that the value of "att" can be anything.
It is therefore useless, and the FD {\tt ((att1 nil) (att2 val2))} is
exactly equivalent to the FD {\tt ((att2 val2))}. \index{conjunction}
\index{denotation (of FDs)}
\index{constraint (feature as)} \index{nil (special value)}
Any position in an FD can be unambiguously referred to by the ``path''
leading from the top-level of the FD to the value considered. For
example, FD (1) can be described by the set of expressions:
\begin{lstlisting}[language=Lisp]
{cat} = np
{det cat} = article
{det definite} = yes
{n cat} = noun
{n number} = plural
\end{lstlisting}
Paths are represented as simple lists of atoms between curly braces
(for example, {\tt \{det definite\}}). This notation is not ambiguous because
at each level there is at most one feature with a given attribute.
\index{path (flat description of FDs)}
A path can be ``absolute'' or ``relative.'' An absolute path gives the way
from the top-level of the FD down to a value. A relative path starts with
the symbol ``{\tt \^{}}'' (up-arrow). It refers to the FD embedding the current
feature. You can have several ``{\tt \^{}}'' in a row to go up several levels.
\index{absolute path} \index{relative path}
For example:
\begin{lstlisting}[language=Lisp]
((cat s)
(prot ((cat np)
(number sing)))
(verb ((cat vp)
(number {^ ^ prot number}))))
^
_____________________|
this is referring to the absolute path {prot number}
\end{lstlisting}
The notation \{\^{}4 x\} is equivalent to \{\^{} \^{} \^{} \^{} x\}. It is convenient when
dealing with deeply embedded constituents. \index{\^{}n notation}
Relative paths are not simply a syntactic convenience, but they extend the
expressibility of the formalism, by making grammars ``relocatable''. For
example, the grammar for NPs can be unified with a subconstituent of the
input FD at different levels ({\tt {agent}} and {\tt {affected}} for example).
In each case, a feature like {\tt (determiner ((number \{\^{} \^{} number\})))}
points to the number of the appropriate constituent. Without relative
paths such a general constraint could not be expressed.
The value of a pair can be a path. In that case, it means that the values
of the pair pointed to by the path and the value of the current pair must
always be the same. The two features are then said to be unified. In the
previous example, the features at the paths {\tt {verb number}} and {\tt {prot
number}} are unified. This means that they are absolutely equivalent, they
are two names for the same object (structure sharing). This is equivalent
to the systemic operation of ``conflation''.
\index{conflation} \index{structure sharing}
\index{path (unification)}
In general, an expression of the form {\tt x = y}, where either {\tt x} or
{\tt y} is a path or a leaf is called an equation. \index{equations}
An fd can be viewed as a flat list of equations.
In FUF, it is possible to have paths on the left of a pair. It is
therefore possible to represent an fd as a list of equations as follows:
\begin{lstlisting}[language=Lisp]
(({cat} np)
({det cat} article)
({det definite} yes)
({n cat} noun)
({n number} plural))
\end{lstlisting}
This notation allows to freely mix the ``fds as equations'' view with the
``fds as structure'' one.\footnote{Note that the possibility to put paths on
the left increases the expressive power of the {\tt external} construct, as
it becomes possible to express at run-time constraints on constituents
which are not dominated by the position of the external construct in the
structure.}
When using a path on the left, note that the right hand
side of the equation is always interpreted as occurring in the context
pointed to by the left-hand side. So if you need to use relative paths,
the relative path on the right is relative to the end position of the
left-hand side. For example, to unify two features \{verb syntax number\}
and \{prot number\} at level \{verb v\}, you must write:
\begin{lstlisting}[language=Lisp]
((verb ((v (({^ syntax number} {^ ^ ^ prot number}))))))
\end{lstlisting}
and not:
\begin{lstlisting}[language=Lisp]
((verb ((v (({^ syntax number} {^ prot number}))))))
\end{lstlisting}
because in this second equation, the path \{\^{} prot number\} is relative to
the level \{verb syntax number\} (not \{verb v\} as intended) and, therefore,
would end up at level \{verb syntax prot number\} instead of \{prot number\}.
The only case where a given attribute can appear in several pairs is when
it is followed by paths in all but one pairs. That is:
\begin{lstlisting}[language=Lisp]
((a ((a1 v1)))
(a {b})
(a {c}))
\end{lstlisting}
is a valid FD. It is equivalent for example to:
\begin{lstlisting}[language=Lisp]
((b ((a1 v1)))
(a {b})
(c {b}))
or to:
((b ((a1 v1)))
({a} {b})
(c {a}))
\end{lstlisting}
The function {\tt normalize-fd} \index{normalize-fd (function)} is convenient
to put an FD into its canonical form. For example:
\begin{lstlisting}[language=Lisp]
(setf fd1 '((a ((a1 v1)))
(b ((b1 w1)))
(a ((a2 v2)))
(b ((b2 ((w2 2)))))
(b ((b2 ((w3 3)))))))
LISP> (normalize fd1)
((a ((a1 v1)
(a2 v2)))
(b ((b1 w1)
(b2 ((w2 2)
(w3 3))))))
\end{lstlisting}
All unification functions assume that the input fd is given in canonical
form. {\tt normalize-fd} is particularly useful when the inputs are produced
incrementally by a program. Note that {\tt normalize-fd} will fail and
return {\tt *fail*} if the input FD is not consistent (for example ((a 1) (a
2))).
\index{normalize-fd (function)}
\section{FDs as Graphs}
\label{graph}
\index{graph (FD as)}
When the structure of an FD becomes complex, and more conflations with
paths are introduced, a visual representation of the FD becomes extremely
useful. This visual representation also provides a clear interpretation of
the path mechanism and makes reading of relative path much easier. The
structured format of FDs can be viewed as equivalent to a directed graph
with labeled arcs as pointed out in \cite{Karttunen-84}. The
correspondence is established as follows: an FD is a node, each pair
{\tt (attr value)} is a labeled arc leaving this node. The {\tt attr} of the
pair is the label of the arc, the value is the adjacent node. Internal
nodes in the graph have therefore no label whereas leaves are atomic
values. The equivalence is illustrated in Fig.\ref{fig3:graph1}.
% \pic{file="graph12.ps", tag=fig3:graph1, cap="FD as a graph", w=5in, h=2.6in}
\begin{figure}[p]
\centering
\includegraphics[width=5in, height=2.6in]{graph12.eps}
\caption{FD as a graph}
\label{fig3:graph1}
\end{figure}
The graph notation is particularly useful to interpret relative paths.
When a relative path occurs somewhere in an FD, its destination can be
identified by going up on the arcs, one arc for each "{\tt \^{}}".
When the value of a pair is a path, e.g., {\tt (a \{b\})}, then
the corresponding arc actually points to the same node as the
given path. In this case, there is structure sharing between a and b. This
configuration is illustrated in Fig.\ref{fig3:graph2}, where the paths
{\tt \{action number\}} and {\tt \{agent number\}} are conflated, as well as the
paths {\tt \{affected lex\}} and {\tt \{affected head lex\}} and {\tt \{subject\}} and
{\tt \{agent\}}.
%\pic{file="graph22.ps", tag=fig3:graph2, cap="Conflation in an FD graph", w=5in, h=2.8in}
\begin{figure}[p]
\centering
\includegraphics[width=5in, height=2.8in]{graph22.eps}
\caption{Conflation in an FD graph}
\label{fig3:graph2}
\end{figure}
The conflation of {\tt \{subject\}} with {\tt \{agent\}} makes all the paths that
are extensions of either {\tt agent} or {\tt subject} equivalent. For example,
{\tt \{agent lex\}} and {\tt \{subject lex\}} are equivalent. This equivalence is
easily read in the graph notation.
%\pic{file="graph32.ps", tag=fig3:graph3, cap="A grammar for conjunction", w=5in, h=2.5in}
\begin{figure}[p]
\centering
\includegraphics[width=5in, height=2.5in]{graph32.eps}
\caption{A grammar for conjunction}
\label{fig3:graph3}
\end{figure}
\label{relpath-ambiguity}
\index{ambiguity of the \^{} notation}
\index{\^{} notation (ambiguity)}
The graph notation also makes it clear that the up-arrow notation can be
ambiguous. Whenever a Y configuration is met in the graph, for example in
the two black nodes in Fig.\ref{fig3:graph2}, the up-arrow does not specify
which branch of the Y must be taken. This problem is illustrated in the
grammar in Fig.\ref{fig3:graph3}. The FD is extracted from a grammar
dealing with conjunction. The constraint enforced by the grammar is that
all the conjuncts in a conjunction must have the same syntactic category.
A conjunction is represented by an FD with two constituents: {\tt head}
represents the conjunction as a whole as a constituent and {\tt list} is a
list of conjuncts. The list is represented in a singly-linked list of
elements, with a recursive FD containing at each level the first element of
the list (feature {\tt car}) and the rest of the list (feature {\tt cdr}). In
Fig.\ref{fig3:graph3}, the path {\tt c1} is used to point to the first
constituent of the list. {\tt c1} is therefore defined by the equation
{\tt {c1} = {list car}}. The grammar in \textsc{Lisp} notation is shown below
along with a sample input:
\begin{lstlisting}[language=Lisp]
GR = ((c1 {^ list car})
(c1 ((cat {^ ^ head cat}))))
IN = ((head ((cat np)))
(list ((car ((lex "cat")))
(cdr ((car ((lex "dog")))
(cdr none))))))
\end{lstlisting}
The expression (c1 ((cat ...))) corresponds to the black dot in the graph notation
shown in Fig.\ref{fig3:graph3}. The problem is to interpret where the
relative path {\tt \{\^{} \^{} head cat\}} is pointing to. The notation is ambiguous
between {\tt \{head cat\}} and {\tt \{list head cat\}}, depending on whether one
considers the black dot as being located at address {\tt \{c1\}} or {\tt \{list
car\}}. This ambiguity is solved in FUF by following the convention that
up-arrows always refer to the textual location where they appear in the
grammar. So in this example, the up-arrows refer to the address {\tt \{c1
cat\}} and not to the address {\tt \{list car cat\}} because they are written as
a pair {\tt (c1 ((cat \{\^{} \^{} head cat\})))} and not as {\tt (list ((car ((cat \{\^{} \^{}
head cat\})))))}.
There are special attributes and values which cannot be drawn in this graph
notation because they have a special unification behavior. These are, for
attributes: {\tt alt, opt, ralt, pattern, cset, fset, test, control} and
{\tt cat} (or the currently specified cat attribute) and for values:
{\tt none, any} and {\tt given}. The special constructs {\tt \#(under x)} and
{\tt \#(external y)} have also a special meaning for the unifier. These are
all the ``keywords'' known by the unifier. They are presented in the
following sections.
\section{Functional Descriptions vs. First-order Terms}
To conclude the characterization of FDs as a data-structure, it is useful
to contrast functional unification (FU) with the more well known structural
unification (SU) as used in \textsc{Prolog} for example, and to distinguish FDs
from the first-order terms used in SU.
The most important difference is that functional unification is not based
on order and length. Therefore, {\tt \{a:1, b:2\}} and {\tt \{b:2, a:1\}} are
equivalent in FU but not in SU, and {\tt \{a:1\}} and {\tt \{b:2, a:1\}} are
compatible in FU but not in SU (FDs have no fixed arity). The following
quote from Knight summarizes the distinction between feature structures and
the first order terms used in SU \cite[p.105]{Knight}:
\begin{itemize}
\item Substructures are labeled symbolically, not inferred by argument position.
\item Fixed arity is not required.
\item The distinction between function and argument is removed.
\item Variables and coreference are treated separately.
\end{itemize}
A comparison between the FD notation and the first-order term notation
illustrates these differences. The following FD and first-order term can
be used to represent the fact that {\em Steve builds a crane that is 2 lbs
and 4 feet high}: