Our technology is based on 15+ years of research in computational linguistics and computer science. We believe that you cannot make any significant progress unless you push the boundaries and break new ground. Our lab is constantly refining and advancing our algorithms and methodologies. We work in two main areas of research:
Distributional semantics, a research area in which we develop and study theories and methods for quantifying and categorising semantic similarities between linguistic items based on their distributional properties in large samples of language data. Our interest here is manifold: we work on algorithms for the effective acquisition of understanding from text, on rich and useful representations for linguistic content and situational context, and on applications of distributional models to real world tasks (obviously, mostly tasks of commercial interest to us).
Evaluation of learning language models, finding methods and metrics to test and compare algorithms, memory models, and processing approaches, both to benchmark improvements and to validate approaches with respect to tasks of interest.
Adult content is pervasive on the web, has been a driving factor in the adoption of the Internet medium, and is responsible for a significant fraction of traffic and revenues, yet rarely attracts attention in research. The research questions surrounding adult content access behaviours are unique, and interesting and valuable research in this area can be done ethically. WSDM 2016 features a half day workshop on Search and Exploration of X-Rated Information (SEXI) for information access tasks related to adult content. While the scope of the workshop remains broad, special attention is devoted to the privacy and security issues surrounding adult content by inviting keynote speakers with extensive experience on these topics. The recent release of the personal data belonging to customers of the adult dating site Ashley Madison provides a timely context for the focus on privacy and security.
Proceedings of the Ninth ACM International Conference on Web Search and Data Mining (WSDM '16). ACM, New York, NY, USA, 697-698. DOI=http://dx.doi.org/10.1145/2835776.2855118
(report from the workshop will be published in SIGIR Forum later in 2016)
This paper presents the Gavagai Living Lexicon, which is an online distributional semantic model currently available in 20 different languages. We describe the underlying distributional semantic model, and how we have solved some of the challenges in applying such a model to large amounts of streaming data. We also describe the architecture of our implementation, and discuss how we deal with continuous quality assurance of the lexicon.
This paper presents our first take at a text generator: Dead Man Tweeting, a system that learns semantic avatars from (dead) people’s texts, and makes the avatars come alive on Twitter. The system includes a language model for generating sequences of words, a topic model for ensuring that the sequences are topically coherent, and a semantic model that ensures the avatars can be productive and generate novel sequences. The avatars are connected to Twitter and are triggered by keywords that are significant for each particular avatar.
We will be continuing the development of this first whimsical prototype for other projects where generating topical text is of interest.
Text categorisation in commercial applications poses several limiting constraints on the technology solutions to be employed. This paper describes how a method with some potential improvements is evaluated for practical purposes and argues for a richer and more expressive evaluation procedure. In this paper one such procedure is exemplified by a precision-recall matrix which exchanges convenience for usefulness.
Presented at the 7th CLEF 2016 Conference and Labs of the Evaluation Forum, 5-8 September 2016, Évora, Portugal.
A support vector classifier was compared to a lexicon-based approach for the task of detecting the stance categories speculation, contrast and conditional in English consumer reviews. Around 3,000 training instances were required to achieve a stable performance of an F-score of 90 for speculation. This outperformed the lexicon-based approach, for which an F-score of just above 80 was achieved. The machine learning results for the other two categories showed a lower average (an approximate F-score of 60 for contrast and 70 for conditional), as well as a larger variance, and were only slightly better than lexicon matching. Therefore, while machine learning was successful for detecting speculation, a well-curated lexicon might be a more suitable approach for detecting contrast and conditional.
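As a reminder of what the F-scores above measure, the following is a minimal sketch of the standard F-measure computed from classification counts. The counts in the example are hypothetical and are not taken from the paper; the score is returned as a fraction, whereas the abstract quotes it as a percentage.

```python
def f_score(tp, fp, fn, beta=1.0):
    """F-measure from true positives, false positives and false negatives.

    With beta=1.0 this is the harmonic mean of precision and recall.
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical counts for one stance category (not from the paper).
example_f1 = f_score(tp=90, fp=10, fn=10)
```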
This paper discusses the use of factorization techniques in distributional semantic models. We focus on a method for redistributing the weight of latent variables, which has previously been shown to improve the performance of distributional semantic models. However, this result has not been replicated and remains poorly understood. We refine the method, and provide additional theoretical justification, as well as empirical results that demonstrate the viability of the proposed approach.
This paper introduces a novel way to navigate neighborhoods in distributional semantic models. The approach is based on relative neighborhood graphs, which uncover the topological structure of local neighborhoods in semantic space. This has the potential to overcome both the problem with selecting a proper k in k-NN search, and the problem that a ranked list of neighbors may conflate several different senses. We provide both qualitative and quantitative results that support the viability of the proposed method.
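To illustrate the general idea of a relative neighborhood graph, here is a minimal brute-force sketch: two points are connected unless some third point is closer to both of them than they are to each other. The Euclidean distance, the function names and the toy points are assumptions made for illustration; the paper's own distance measure and implementation may differ.

```python
import math
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def relative_neighborhood_graph(points, dist=euclidean):
    """Return the RNG over `points` as a set of index pairs (i, j), i < j.

    An edge (i, j) is kept unless some third point k satisfies
    max(d(i, k), d(j, k)) < d(i, j).
    """
    n = len(points)
    d = [[dist(points[i], points[j]) for j in range(n)] for i in range(n)]
    edges = set()
    for i, j in combinations(range(n), 2):
        blocked = any(
            max(d[i][k], d[j][k]) < d[i][j]
            for k in range(n)
            if k != i and k != j
        )
        if not blocked:
            edges.add((i, j))
    return edges


# Hypothetical 2-D stand-in for a local neighborhood in semantic space.
toy_points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (0.0, 3.0)]
toy_edges = relative_neighborhood_graph(toy_points)
```

Unlike a k-NN list, the graph needs no choice of k, and points that end up in separate branches can hint at separate senses.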
This paper reports from the workshop on Evaluating Learning Language Representations hosted by Gavagai in October 2014.
Presented at the 6th CLEF 2015 Conference and Labs of the Evaluation Forum, 8-11 September 2015, Toulouse, France. This work was partially funded by the European Science Foundation through its ELIAS project.
This paper describes a series of experiments to determine how positionally annotated Twitter texts can be used to learn words which indicate the location of other texts and their authors. Many texts are locatable but most have no explicit indication of place, and many applications, both commercial and academic, have an interest in knowing where a text or its author is from.
The notion of placeness of a word is introduced as a measure of how locational a word is, and we find that modelling word distributions to account for several locations, using local distributional context, and aggregating locational information in a centroid for each text gives the most useful results. The results are applied to data in the Swedish language.
Although the need for a common evaluation framework for multimedia and multimodal documents for various use cases, including non-topical use, is widely acknowledged, such a framework is still not in place. Retrieval system evaluation results are not regularly validated in laboratory or field studies; the infrastructure for generalizing results over tasks, users and collections is still missing. This chapter presents a use case-based framework for experimental design in the field of interactive information access. The framework is illustrated by examples that sketch out how it can be productively used in experimental design and reporting with a minimal threshold for adoption.
A reasonable requirement (among many others) for a lexical or semantic component in an information system is that it should be able to learn incrementally from the linguistic data it is exposed to, that it can distinguish between the topical impact of various terms, and that it knows if it knows stuff or not.
We work with a specific representation framework – semantic spaces – which well accommodates the first requirement; in this short paper, we investigate the global qualities of semantic spaces by a topological procedure – mapper – which gives an indication of topical density of the space; we examine the local context of terms of interest in the semantic space using another topologically inspired approach which gives an indication of the neighbourhood of the terms of interest. Our aim is to be able to establish the qualities of the semantic space under consideration without resorting to inspection of the data used to build it.
"Hur ser det politiska opinionsläget ut? Det går förstås att fråga väljarna. Men bättre är kanske att se vad de skriver. Nu är ett datorprogram väljarnas sympatier på spåren."
In this paper we present our experiments on the RepLab 2014 Reputation Dimension task. RepLab is a competitive challenge for Reputation Management Systems. RepLab 2014's reputation dimensions task focuses on categorization of Twitter messages with regard to standard reputation dimensions (such as performance, leadership, or innovation). Our approach only relies on the textual content of tweets and ignores both metadata and the content of URLs within tweets. We carried out several experiments focusing on different feature sets including bag of n-grams, distributional semantics features, and deep neural network representations. The results show that bag of bigram features with minimum frequency thresholding work quite well in the reputation dimension task, especially with regard to the average F1 measure over all dimensions, where two of our four submitted runs achieve the highest and second-highest scores. Our experiments also show that semi-supervised recursive autoencoders outperform the other feature sets used in our experiments with regard to the accuracy measure and are a promising subject of future research.
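The bag-of-bigrams features with minimum frequency thresholding mentioned above can be sketched roughly as follows. This is an illustrative sketch only: pre-tokenized input, counting frequency over the whole corpus, the threshold value and all function names are assumptions, not details taken from the paper.

```python
from collections import Counter


def bigrams(tokens):
    """Adjacent token pairs of a tokenized text."""
    return list(zip(tokens, tokens[1:]))


def build_vocabulary(docs, min_freq=2):
    """Map each bigram seen at least `min_freq` times to a feature index."""
    counts = Counter(bg for doc in docs for bg in bigrams(doc))
    kept = sorted(bg for bg, c in counts.items() if c >= min_freq)
    return {bg: i for i, bg in enumerate(kept)}


def featurize(doc, vocab):
    """Count vector of the retained bigrams occurring in `doc`."""
    vec = [0] * len(vocab)
    for bg in bigrams(doc):
        if bg in vocab:
            vec[vocab[bg]] += 1
    return vec


# Hypothetical tokenized tweets.
toy_docs = [
    ["great", "new", "phone"],
    ["great", "new", "ceo"],
    ["bad", "quarter"],
]
toy_vocab = build_vocabulary(toy_docs, min_freq=2)
toy_features = featurize(toy_docs[0], toy_vocab)
```

The resulting count vectors could then be passed to any standard classifier; which classifier the submitted runs used is not stated here.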
Issue framing has become one of the most important means of elite influence on public opinion. In this paper, we introduce a method for investigating issue framing based on statistical analysis of large samples of language use. Our method uses a technique called Random Indexing (RI), which enables us to extract semantic and associative relations to any target concept of interest, based on co-occurrence statistics collected from large samples of relevant language use. As a first test and evaluation of our proposed method, we apply RI to a large collection of Swedish blog data and extract semantic relations relating to our target concept “outsiders”. This concept is widely used in the public debate both in relation to labour market issues and socially related issues.
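For readers unfamiliar with Random Indexing, the following is a minimal sketch of the basic technique: each word is assigned a sparse random index vector, and a word's context vector is the sum of the index vectors of the words it co-occurs with. The dimensionality, the number of non-zero entries, the window size and the toy corpus are all assumptions chosen for illustration and do not reflect the settings used in the paper.

```python
import math
import random
from collections import defaultdict

DIM = 200      # vector dimensionality (assumed)
NONZERO = 8    # number of +1/-1 entries per index vector (assumed)
WINDOW = 2     # context window on each side of the target (assumed)


def make_index_vector(rng):
    """Sparse ternary random vector, stored as {position: +1 or -1}."""
    positions = rng.sample(range(DIM), NONZERO)
    return {p: (1 if i % 2 == 0 else -1) for i, p in enumerate(positions)}


def train(sentences, seed=0):
    """Accumulate a context vector for every word in `sentences`."""
    rng = random.Random(seed)
    index = {}
    context = defaultdict(lambda: [0.0] * DIM)
    for tokens in sentences:
        for tok in tokens:
            if tok not in index:
                index[tok] = make_index_vector(rng)
        for i, target in enumerate(tokens):
            lo, hi = max(0, i - WINDOW), min(len(tokens), i + WINDOW + 1)
            for j in range(lo, hi):
                if j != i:
                    # Add the neighbour's index vector to the target's context.
                    for pos, sign in index[tokens[j]].items():
                        context[target][pos] += sign
    return dict(context)


def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


# Hypothetical toy corpus: "cat" and "dog" occur in identical contexts.
toy_vectors = train([["the", "cat", "sleeps"], ["the", "dog", "sleeps"]])
toy_similarity = cosine(toy_vectors["cat"], toy_vectors["dog"])
```

Because the context vectors are updated incrementally, new text can be added without recomputing a full co-occurrence matrix.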
This chapter deals with a statistical technique for sense exploration based on distributional semantics known as word space modelling. Word space models rely on feature aggregation, in this case aggregation of co-occurrence events, to build an aggregated view on the distributional behaviour of words. Such models calculate meaning similarity among words on the basis of the contexts in which they occur and represent it as proximity in high-dimensional vector spaces. The main purpose of this study is to test to what extent word-space modelling is in principle suitable for lexical-typological work by taking a first little step in this direction and applying the method for the exploration of the seven central English temperature adjectives in three corpora representing different genres. In order to better capture and account for the potentially different senses of one and the same word we have suggested and applied a new variant of this general method, “syntagmatically labelled partitioning”.
Adult content is pervasive on the Web and has been a driving factor in the adoption of the Internet medium. It is responsible for a significant fraction of traffic and revenues, yet rarely attracts attention in research. We propose that the research questions surrounding adult content access behaviors are unique, and we believe interesting and valuable research in this area can be done ethically. The workshop on Search and Exploration of X-Rated Information (SEXI) addresses these issues for information access tasks related to adult content.
During a three-day workshop in February 2012, 45 Information Retrieval researchers met to discuss long-range challenges and opportunities within the field. The result of the workshop is a diverse set of research directions, project ideas, and challenge areas. This report describes the workshop format, provides summaries of broad themes that emerged, includes brief descriptions of all the ideas, and provides detailed discussion of six proposals that were voted "most interesting" by the participants.
Key themes include the need to: move beyond ranked lists of documents to support richer dialog and presentation, represent the context of search and searchers, provide richer support for information seeking, enable retrieval of a wide range of structured and unstructured content, and develop new evaluation methodologies.
What can text sentiment analysis technology be used for, and does a more usage-informed view on sentiment analysis pose new requirements on technology development?
Gavagai used its first-generation baseline system for the profiling task of the CLEF 2012 evaluation campaign for online reputation management systems. The system builds on large-scale analysis of streaming text and performed excellently on this task with standard settings.
There is an increasing amount of structure on the web as a result of modern web languages, user tagging and annotation, emerging robust NLP tools, and an ever growing volume of linked data. These meaningful semantic annotations hold the promise to significantly enhance information access by increasing the depth of analysis of today's systems. Currently, we have only started exploring the possibilities and are only beginning to understand how these valuable semantic cues can be put to fruitful use.
The ESAIR series of workshops takes as its starting point that there is an increasing amount of structure on the web as a result of modern web languages, user tagging and annotation, emerging robust NLP tools, and an ever growing volume of linked data. These meaningful semantic annotations hold the promise to significantly enhance information access by increasing the depth of analysis of today's systems. Currently, we have only started exploring the possibilities and are only beginning to understand how these valuable semantic cues can be put to fruitful use. To complicate matters, standard text search excels at shallow information needs expressed by short keyword queries, and here semantic annotation contributes very little, if anything.
ESAIR'10 focussed on formulating a framework for viewing annotation as a linking procedure, connecting an analysis of information objects with a semantic model of some sort, expressing relations that contribute to a task of interest to end users.
ESAIR'11 brought together discussions on how unleashing the potential of semantic annotations requires us to think outside the box, by combining the insights of natural language processing (NLP) to go beyond bags of words, the insights of database technologies (DB) to use structure efficiently even when aggregating over millions of records, the insights of information retrieval (IR) in effective goal-directed search and evaluation, and the insights of knowledge management (KM) to get to grips with the greater whole.
ESAIR'12 focussed on how to leverage the rich context currently available, especially in a mobile search scenario, giving powerful new handles to exploit semantic annotations and on how to fruitfully combine classic information retrieval and knowledge intensive approaches, and for the first time work actively toward a unified view on exploiting semantic annotations.
ESAIR'13 focussed on two of the most challenging aspects to address in the coming years. First, there is a need to include the currently emerging knowledge resources (such as DBpedia, Freebase) as the underlying semantic model, giving access to an unprecedented scope and detail of factual information. Second, there is a need to include annotations beyond the topical dimension (think of sentiment, reading level, prerequisite level, etc.) that contain vital cues for matching the specific needs and profile of the searcher at hand.
ESAIR'14 focussed on how to elicit more articulate queries or expressions of information need, with concepts and relations linking their statement of request to existing semantic models as offered by emerging knowledge bases. The discussion centered to a large extent on how to provide useful event and entity identification from unstructured streaming information.