Ethersource technology under the hood

Ferdinand de Saussure, father of modern Linguistics

At its core, Ethersource computes and tracks relations between terms and symbols in streaming language data. These data are represented in a hyperdimensional vector space. Vector space models, the basis of many or even most information access systems today, use well-established and well-understood linear algebraic methods to access and manage the knowledge stored in them. Linguistic items such as terms or words are interpreted as points in a many-dimensional space, and similarity between terms is interpreted as the distance between those points. This is intuitively appealing and easy to talk about.
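To make the geometric picture concrete, here is a minimal sketch of comparing terms as points in a vector space. It uses cosine similarity, which is one common choice of measure and an assumption on our part rather than a description of what Ethersource computes; the three-dimensional toy vectors and the helper name `nearest` are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)

# Hypothetical toy vectors; real term vectors have thousands of dimensions.
vectors = {
    "coffee": [4.0, 1.0, 0.0],
    "tea": [3.0, 2.0, 0.0],
    "engine": [0.0, 1.0, 5.0],
}

def nearest(term):
    """Return the other term whose vector is closest in angle to `term`."""
    others = (t for t in vectors if t != term)
    return max(others, key=lambda t: cosine_similarity(vectors[term], vectors[t]))
```

With these toy values, `nearest("coffee")` picks "tea", because the two vectors point in nearly the same direction while "engine" points elsewhere.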

Distributionality - the basis of meaning in language

But vector space models are only as good as what is in them. In our case, the model uses distributional data to establish relations between terms based on their occurrence patterns. Distributional data are the basis for our semantic model, which gives it a solid theoretical footing for a theory of meaning and a theory of meaning of meaning. Once you have a distributionally motivated model, you can extract similarities between the items observed in it and use those similarities to model conceptual abstraction in language. Distributional data can be aggregated in many different ways, depending on what you want to find in them. This is best done with an awareness of the basics of how language works.
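As an illustration of what "occurrence patterns" can mean in practice, the sketch below counts which terms appear near each other within a small sliding window. The window-based aggregation, the function name `cooccurrence_counts`, and the example sentence are assumptions made for illustration; they show only one of the many possible ways to aggregate distributional data.

```python
from collections import Counter, defaultdict

def cooccurrence_counts(tokens, window=2):
    """Count, for each term, the terms seen within `window` positions of it."""
    counts = defaultdict(Counter)
    for i, term in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[term][tokens[j]] += 1
    return counts

# Hypothetical example text.
text = "we drink coffee daily and we drink tea daily"
profiles = cooccurrence_counts(text.split(), window=1)
```

Here "coffee" and "tea" end up with identical neighbour profiles (each occurs once between "drink" and "daily"), which is the kind of distributional evidence that suggests the two terms are related.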

Vast data, but sparse

Handling many-dimensional spaces poses computational challenges. Collecting data about millions of terms observed in use, and the relations between them, in a linear algebraic matrix may seem straightforward, but one rapidly finds that the matrix is huge and sparse. There are many, many terms and many, many documents (or other contexts they occur in). Even more unsettlingly, we will always encounter new words: the matrix never stops growing!
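The sparsity is easy to see even at toy scale. The following sketch, with a hypothetical helper `term_document_sparsity` and made-up documents, measures how many cells of a term-by-document matrix would actually be nonzero.

```python
def term_document_sparsity(documents):
    """Return (rows, columns, fraction of nonzero cells) for a term-by-document matrix."""
    doc_terms = [set(doc.split()) for doc in documents]
    vocabulary = set().union(*doc_terms)
    cells = len(vocabulary) * len(documents)
    if cells == 0:
        return len(vocabulary), len(documents), 0.0
    # Each distinct term in a document fills exactly one cell of its column.
    nonzero = sum(len(terms) for terms in doc_terms)
    return len(vocabulary), len(documents), nonzero / cells

# Hypothetical documents.
docs = ["the cat sat", "the dog barked", "stocks fell sharply"]
rows, cols, density = term_document_sparsity(docs)
```

Three short documents already give 8 rows, of which only 9 of 24 cells are filled. Every document that brings new words adds rows, and the filled fraction tends to shrink as the collection grows.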

There are several computational approaches to processing huge matrices and mining generalities from them, such as matrix factorization techniques. Unfortunately, such methods come at considerable computational cost.

Randomness is the path of least assumption

The Gavagai word space model is based on a different approach, in which distributional data are aggregated incrementally from observed language use, bypassing both the need for the huge matrix and the need for subsequent dimensionality reduction. Our approach is a practical evolution of recent techniques related to Random Indexing, which has several important advantages over other approaches. It does not require that we collect the data in a huge matrix, and it does not require recompilation when new documents and words are encountered: the dimensionality is fixed and never increases.
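For readers unfamiliar with Random Indexing, here is a minimal sketch of the textbook form of the technique, not of the Gavagai implementation. Each term gets a sparse random index vector of fixed length, and a term's context vector is the running sum of the index vectors of its neighbours. The class name `RandomIndexer`, the dimensionality of 512, the 8 nonzero entries, and the window size are all assumptions chosen for illustration.

```python
import random
from collections import defaultdict

DIM = 512      # fixed dimensionality; hypothetical value for illustration
NONZERO = 8    # number of +1/-1 entries per index vector; also an assumption

class RandomIndexer:
    def __init__(self, dim=DIM, nonzero=NONZERO, seed=0):
        self.dim = dim
        self.nonzero = nonzero
        self.rng = random.Random(seed)
        self.index_vectors = {}  # term -> sparse {position: +1 or -1}
        self.context_vectors = defaultdict(lambda: [0] * dim)

    def _index_vector(self, term):
        """Create a sparse random ternary vector the first time a term is seen."""
        if term not in self.index_vectors:
            positions = self.rng.sample(range(self.dim), self.nonzero)
            self.index_vectors[term] = {
                p: (1 if i % 2 == 0 else -1) for i, p in enumerate(positions)
            }
        return self.index_vectors[term]

    def update(self, tokens, window=2):
        """Add each neighbour's index vector to the focus term's context vector."""
        for i, term in enumerate(tokens):
            context = self.context_vectors[term]
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    for pos, sign in self._index_vector(tokens[j]).items():
                        context[pos] += sign

# Hypothetical usage: text is consumed incrementally, one snippet at a time.
indexer = RandomIndexer()
indexer.update("we drink coffee daily".split())
indexer.update("we drink tea daily".split())
```

No term-by-context matrix is ever built in this sketch: new text only adds to existing fixed-length vectors, and a previously unseen word simply receives a new vector of the same length. Because "coffee" and "tea" occur with the same neighbours here, their context vectors come out identical.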

Now available for application to real-world problems

We have now taken this to the next level, making the research results practicable and effective in the only commercially available implementation of online dimensionality reduction.

Our model is tractable, through its dimensionality reduction technique; dynamic, through its online qualities; theoretically sound, through its basis in the distributionality of terms; and complete, since our implementation uses it to read any and all text it is exposed to. This design gives us the attractive qualities that have enabled us to build Ethersource and the services we provide with it.