Text has been the most efficient and reliable way to create, aggregate, and store information for the past few thousand years. It will remain so for the foreseeable future: we are still only at the very beginning of the Big Data explosion.
Much of this text will come from social media, where language use is productive, informal, and multilingual, with little respect for grammar rules and lexical conventions. This is a difficult environment for traditional approaches to text analytics, because they are based on a processing pipeline that presumes stability and consistency.
Traditional text analytics is based on a step-wise process where words in the text are annotated with various linguistic properties and relations. Each processing step adds properties or relations of increasing refinement. Typical processing steps include part-of-speech tagging, syntactic parsing, and named entity recognition. The result is a fine-grained account of the structural properties of the text and its component words.
There are several issues with such an approach. One is that there is no account of meaning in such an analysis, and meaning is the quintessential property of language. Another is that it is not very actionable; what business decisions can we take after having seen such an analysis?
Unlike traditional approaches to text analytics, we begin by modelling meaning instead of structure. Instead of relying on the standard processing pipeline, we rely on a live semantic memory model that continuously learns and understands text without any human intervention.
Our semantic memories are inspired by how the human brain understands text. We humans can learn what words mean simply from their usage and context. We do this effortlessly and seamlessly, and it happens to each of us more frequently than we realise. Gavagai’s technology works in a similar way: it learns the meanings of words by observing their usages and contexts, and it never stops learning. Language evolves constantly, and so do our semantic memories.
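To make the idea of learning meaning from usage and context concrete, here is a minimal sketch in Python. It is not Gavagai's implementation. It only illustrates the general distributional idea: words that occur in similar contexts end up with similar context profiles. The function names, the window size, and the toy corpus are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def cooccurrence_vectors(sentences, window=2):
    """Count, for every word, which words appear within `window` positions of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy corpus (hypothetical): "cat" and "dog" are used in similar contexts.
corpus = [
    "the cat drinks milk",
    "the dog drinks water",
    "the cat chases the mouse",
    "the dog chases the ball",
]
vectors = cooccurrence_vectors(corpus)

cat_dog = cosine(vectors["cat"], vectors["dog"])
cat_milk = cosine(vectors["cat"], vectors["milk"])
print(f"cat~dog: {cat_dog:.3f}, cat~milk: {cat_milk:.3f}")
```

In this toy corpus "cat" and "dog" come out as more similar than "cat" and "milk", even though "cat" and "milk" appear in the same sentence, because similarity is measured over shared contexts and not over direct co-occurrence.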
Our semantic memories are built for Big Data; they learn from all available text data and are always online. If you invent a new word and start using it on social media, our models will have learnt it in a matter of minutes. The same goes for new languages: as long as there are texts available, we can learn a semantic memory for that language. All this is made possible by clever engineering and the use of hyperdimensional representations.
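The text does not say which hyperdimensional technique is used. One well-known approach in this family is random indexing, and the sketch below assumes it purely for illustration. Each word gets a fixed sparse random "index vector", and a word's meaning is the running sum of the index vectors of its neighbours. The class name, the dimensionality, and the number of non-zero entries are hypothetical choices, not values taken from Gavagai's system.

```python
import math
import random
from collections import defaultdict

DIM = 1000     # dimensionality of the space (assumed value)
NONZERO = 10   # number of +1/-1 entries per index vector (assumed value)


def index_vector(word, dim=DIM, nonzero=NONZERO):
    """Sparse random ternary vector for `word`, as {position: +1 or -1}.

    Seeding the generator with the word makes the vector reproducible,
    so nothing has to be stored per vocabulary item.
    """
    rng = random.Random(word)
    positions = rng.sample(range(dim), nonzero)
    return {p: rng.choice((-1, 1)) for p in positions}


class RandomIndexMemory:
    """Incrementally updated word space with fixed dimensionality."""

    def __init__(self, dim=DIM, window=2):
        self.dim = dim
        self.window = window
        self.context = defaultdict(lambda: [0] * dim)

    def observe(self, text):
        """Update the memory with one piece of text; no retraining needed."""
        tokens = text.lower().split()
        for i, word in enumerate(tokens):
            lo = max(0, i - self.window)
            hi = min(len(tokens), i + self.window + 1)
            for j in range(lo, hi):
                if j != i:
                    vec = self.context[word]
                    for pos, sign in index_vector(tokens[j], self.dim).items():
                        vec[pos] += sign

    def similarity(self, a, b):
        """Cosine similarity between the context vectors of two words."""
        va, vb = self.context.get(a), self.context.get(b)
        if va is None or vb is None:
            return 0.0
        dot = sum(x * y for x, y in zip(va, vb))
        na = math.sqrt(sum(x * x for x in va))
        nb = math.sqrt(sum(y * y for y in vb))
        return dot / (na * nb) if na and nb else 0.0


memory = RandomIndexMemory()
for line in ["the cat drinks milk", "the dog drinks water",
             "the cat chases the mouse", "the dog chases the ball"]:
    memory.observe(line)

# A previously unseen word is usable as soon as it has been observed once.
memory.observe("the blorf drinks milk")
print(memory.similarity("cat", "dog"), memory.similarity("blorf", "cat"))
```

The design point this illustrates is that the vectors keep the same size no matter how large the vocabulary grows, and each new text only adds to existing sums. That is what makes a continuously updated, always-online model plausible.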
Would you like to sneak a peek into the wordspaces we have online at the moment? Head over to Gavagai's Living Lexicon, where you can look up a word to see its current left-side neighbors, right-side neighbors, n-grams, semantically similar words, and associations.
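As a rough illustration of what left-side neighbors, right-side neighbors, and n-grams mean for a given word, the following sketch counts them over a handful of sentences. It does not call the Living Lexicon and says nothing about how the Lexicon computes its results. The function name and the sample texts are hypothetical.

```python
from collections import Counter


def neighbors_and_ngrams(texts, target, n=2):
    """Count the words directly left and right of `target`,
    plus every n-gram that contains it."""
    left, right, ngrams = Counter(), Counter(), Counter()
    for text in texts:
        tokens = text.lower().split()
        for i, tok in enumerate(tokens):
            if tok != target:
                continue
            if i > 0:
                left[tokens[i - 1]] += 1
            if i + 1 < len(tokens):
                right[tokens[i + 1]] += 1
        for i in range(len(tokens) - n + 1):
            gram = tuple(tokens[i:i + n])
            if target in gram:
                ngrams[" ".join(gram)] += 1
    return left, right, ngrams


# Hypothetical sample texts.
texts = ["big data is big", "we love big data", "big data explosion"]
left, right, ngrams = neighbors_and_ngrams(texts, "data")

print("left:  ", left.most_common())
print("right: ", right.most_common())
print("ngrams:", ngrams.most_common())
```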