
2017-02-23

Inflection and distribution (or “are ‘lentil’ and ‘lentils’ the same word?”)

People who know languages with rich inflectional systems often ask us how Gavagai handles morphology, i.e. the inflections and derivations that words are subject to in some languages. In English, for instance, a noun can be singular or plural, and each of these can take a genitive ending:

| | base form | genitive |
|---|---|---|
| singular | lentil | lentil’s |
| plural | lentils | lentils’ |

which is not too complex. In Swedish, a noun inflects for number, definiteness, and genitive case, yielding eight surface forms for most nouns:

| | base form, indefinite | base form, definite | genitive, indefinite | genitive, definite |
|---|---|---|---|---|
| singular | sill | sillen | sills | sillens |
| plural | sillar | sillarna | sillars | sillarnas |

which is somewhat more challenging but still manageable. In general, adjectives are less complex than nouns, and verbs are somewhat more elaborate.
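To make the combinatorics concrete, here is a minimal sketch that generates the eight forms of “sill” by stacking suffixes. The function name `swedish_noun_forms` and its parameters are our own illustration, and the default suffixes only fit nouns that decline like “sill”; other Swedish declensions need different endings.

```python
from itertools import product

def swedish_noun_forms(stem, plural="ar", definite_sg="en",
                       definite_pl="na", genitive="s"):
    """Build all number x definiteness x case forms by suffix stacking.

    Assumption: the noun declines like "sill"; this is not a general
    Swedish morphology model.
    """
    forms = {}
    for number, definiteness, case in product(
        ("singular", "plural"),
        ("indefinite", "definite"),
        ("base", "genitive"),
    ):
        form = stem
        if number == "plural":
            form += plural
        if definiteness == "definite":
            form += definite_pl if number == "plural" else definite_sg
        if case == "genitive":
            form += genitive
        forms[(number, definiteness, case)] = form
    return forms

forms = swedish_noun_forms("sill")
for (number, definiteness, case), form in forms.items():
    print(f"{number:8} {definiteness:10} {case:8} {form}")
```

Two numbers, two degrees of definiteness and two cases give 2 × 2 × 2 = 8 forms, which matches the table above.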

Most Western European languages have a fairly sparse morphology, similar to English and Swedish, and most of the world's other largest languages, including historically unrelated ones such as Chinese, are similar in this respect. But not all: Russian, for instance, exhibits six noun cases, multiplying the forms further:

| | nominative | genitive | accusative | dative | instrumental | locative |
|---|---|---|---|---|---|---|
| singular | борщ | борща | борщ | борщу | борщом | борще |
| plural | борщи | борщей | борщи | борщам | борщами | борщах |

Moving to Finnish, a language from a different language family altogether, the picture changes dramatically. Finnish nouns can take up to sixteen case endings, inflect for plural, add a possessive suffix (corresponding to “my”, “your” and so on), and add further discourse suffixes for emphasis, questions, and similar functions. In all, a Finnish noun can in theory take up to about 5 000 forms. Typically fewer than a hundred forms are in play in observed texts, but that is still a fairly large number.

| form | gloss |
|---|---|
| muikku | vendace (a fish) |
| muikuiltakaan | not even from fishes |
| muikuillannekinhan | maybe even with your fishes |

So how does this affect the distributional model we work with at Gavagai? We find that the inflectional forms of a term differ from each other in their distribution. This means that, for example, the plural form of a term is typically not the closest neighbour of its singular form. The table below shows that the singular form “lentil” is quite close to the singular form “chickpea”, closer than it is to the plural form “lentils”. This may seem surprising, but what it really tells us is that inflections carry meaning, and that this meaning is reflected in how a term is used.

| | Lentil | Lentils | Chickpea | Chickpeas |
|---|---|---|---|---|
| Lentil | 1 | 0.23 | 0.61 | 0.17 |
| Lentils | 0.23 | 1 | 0.12 | 0.55 |
| Chickpea | 0.61 | 0.12 | 1 | 0.18 |
| Chickpeas | 0.17 | 0.55 | 0.18 | 1 |
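Reading nearest neighbours off the table can be sketched in a few lines. The similarity values are copied from the table above; the helper `nearest_neighbour` is a name we made up for illustration, not part of any Gavagai API.

```python
terms = ["lentil", "lentils", "chickpea", "chickpeas"]

# Similarity values copied from the table above (rows and columns follow `terms`).
similarity = [
    [1.00, 0.23, 0.61, 0.17],
    [0.23, 1.00, 0.12, 0.55],
    [0.61, 0.12, 1.00, 0.18],
    [0.17, 0.55, 0.18, 1.00],
]

def nearest_neighbour(term):
    """Return the other term with the highest similarity to `term`."""
    row = similarity[terms.index(term)]
    candidates = [(score, other) for other, score in zip(terms, row) if other != term]
    return max(candidates)[1]

neighbours = {term: nearest_neighbour(term) for term in terms}
for term, neighbour in neighbours.items():
    print(f"{term} -> {neighbour}")
# lentil -> chickpea
# lentils -> chickpeas
# chickpea -> lentil
# chickpeas -> lentils
```

In this small sample, every term's nearest neighbour shares its number (singular with singular, plural with plural) rather than its lemma.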

In practice, this means that we do not want to automatically add all conceivable forms of a word to, say, a theme, a monitor tracker, or an attitudinal pole. Adding a form should be a deliberate choice, made with more or less the same thought as adding an entirely different word. In the example, it is likely that the plural forms are mostly used in recipes, whereas the singulars are probably more common in botanical texts.
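The recipe-versus-botany intuition can be illustrated with a toy example. This is only a sketch: the four sentences are invented for the purpose, and the bag-of-context-words counts with cosine similarity are a simple stand-in, not the model Gavagai actually runs.

```python
from collections import Counter
from math import sqrt

# Invented toy corpus: singulars in "botanical" sentences, plurals in "recipes".
corpus = [
    "the lentil is an annual legume plant",
    "the chickpea is an annual legume plant",
    "simmer the lentils with the chickpeas in stock",
    "add the lentils and the chickpeas to the soup",
]

def context_vector(term):
    """Count the words that share a sentence with `term`."""
    counts = Counter()
    for sentence in corpus:
        tokens = sentence.split()
        for i, token in enumerate(tokens):
            if token == term:
                counts.update(tokens[:i] + tokens[i + 1:])
    return counts

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

lentil, lentils, chickpea = (context_vector(t) for t in ("lentil", "lentils", "chickpea"))
same_number = cosine(lentil, chickpea)   # singular vs. singular
same_lemma = cosine(lentil, lentils)     # singular vs. its own plural
print(f"lentil~chickpea: {same_number:.2f}  lentil~lentils: {same_lemma:.2f}")
```

Because the singulars share their contexts while “lentil” and “lentils” share only the word “the”, the singular pair comes out more similar than the singular and plural of the same word, which is the same pattern as in the table of real similarities.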

Sometimes both forms are interesting, sometimes only one or the other. Our system leaves this decision to the analyst: it suggests both forms for inclusion, but does not add them automatically.
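A suggest-but-do-not-add workflow might look roughly like the sketch below. The `Theme` class, its `suggest` and `accept` methods, and the `known_forms` lookup are all hypothetical names chosen for illustration; they are not Gavagai's actual API.

```python
from dataclasses import dataclass, field

@dataclass
class Theme:
    """Hypothetical container for the terms an analyst has chosen to track."""
    name: str
    terms: set = field(default_factory=set)

    def suggest(self, known_forms):
        """Return related forms that are not yet in the theme, without adding them."""
        suggestions = set()
        for term in self.terms:
            suggestions |= set(known_forms.get(term, ())) - self.terms
        return sorted(suggestions)

    def accept(self, term):
        """Add a form only when the analyst explicitly accepts it."""
        self.terms.add(term)

# Hypothetical lookup of related inflectional forms.
known_forms = {"lentil": ["lentils"], "lentils": ["lentil"]}

theme = Theme("legumes", {"lentil"})
suggested = theme.suggest(known_forms)
terms_before = set(theme.terms)   # suggesting alone leaves the theme unchanged

theme.accept("lentils")           # the analyst's deliberate decision
print(suggested, sorted(terms_before), sorted(theme.terms))
# ['lentils'] ['lentil'] ['lentil', 'lentils']
```

Keeping `suggest` free of side effects is the point of the design: the plural only enters the theme through an explicit `accept` call.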

Category: technicalities