Our technology is based on 15+ years of research in computational linguistics and computer science. We believe that you cannot make any significant progress unless you push the boundaries and break new ground. Our lab is constantly refining and advancing our algorithms and methodologies. We work in two main areas of research:
Distributional semantics, a research area in which we develop and study theories and methods for quantifying and categorising semantic similarities between linguistic items based on their distributional properties in large samples of language data. Our interests here are manifold: we work on algorithms for the effective acquisition of understanding from text, on rich and useful representations for linguistic content and situational context, and on applications of distributional models to real-world tasks (mostly, of course, tasks of commercial interest to us).
Evaluation of learning language models, finding methods and metrics to test and compare algorithms, memory models, and processing approaches, both to benchmark improvements and to validate approaches with respect to tasks of interest.
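The core idea behind the distributional work can be sketched in a few lines: collect co-occurrence counts for each word and compare words by the similarity of their count vectors. A minimal sketch on toy data, using raw counts rather than the weighted, high-dimensional representations used in practice:

```python
from collections import Counter
from math import sqrt

def cooccurrence_vectors(sentences, window=2):
    """Count, for each word, the words appearing within a +/-window context."""
    vectors = {}
    for tokens in sentences:
        for i, w in enumerate(tokens):
            ctx = vectors.setdefault(w, Counter())
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if j != i:
                    ctx[tokens[j]] += 1
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

sentences = [
    "the cat sat on the mat".split(),
    "the dog sat on the rug".split(),
    "the cat chased the dog".split(),
]
vecs = cooccurrence_vectors(sentences)
# words that occur in similar contexts get similar vectors
print(cosine(vecs["cat"], vecs["dog"]))
```

Real systems replace the raw counts with weighting schemes and dimensionality reduction, but the principle is the same: similarity of context predicts similarity of meaning.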
Adult content is pervasive on the web, has been a driving factor in the adoption of the Internet medium, and is responsible for a significant fraction of traffic and revenues, yet rarely attracts attention in research. The research questions surrounding adult content access behaviours are unique, and interesting and valuable research in this area can be done ethically. WSDM 2016 features a half day workshop on Search and Exploration of X-Rated Information (SEXI) for information access tasks related to adult content. While the scope of the workshop remains broad, special attention is devoted to the privacy and security issues surrounding adult content by inviting keynote speakers with extensive experience on these topics. The recent release of the personal data belonging to customers of the adult dating site Ashley Madison provides a timely context for the focus on privacy and security.
Proceedings of the Ninth ACM International Conference on Web Search and Data Mining (WSDM '16). ACM, New York, NY, USA, 697-698. DOI=http://dx.doi.org/10.1145/2835776.2855118
(report from the workshop will be published in SIGIR Forum later in 2016)
This paper presents the Gavagai Living Lexicon, which is an online distributional semantic model currently available in 20 different languages. We describe the underlying distributional semantic model, and how we have solved some of the challenges in applying such a model to large amounts of streaming data. We also describe the architecture of our implementation, and discuss how we deal with continuous quality assurance of the lexicon.
This paper presents our first take at a text generator: Dead Man Tweeting, a system that learns semantic avatars from (dead) people’s texts, and makes the avatars come alive on Twitter. The system includes a language model for generating sequences of words, a topic model for ensuring that the sequences are topically coherent, and a semantic model that ensures the avatars can be productive and generate novel sequences. The avatars are connected to Twitter and are triggered by keywords that are significant for each particular avatar.
We will be continuing the development of this first whimsical prototype for other projects where generating topical text is of interest.
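The sequence-generating component can be illustrated with a toy bigram model. This is a minimal sketch of the general idea only, not the system's actual language model, which is combined with topic and semantic models as described above:

```python
import random

def train_bigram_lm(text):
    """Minimal bigram language model: for each word, the list of words
    observed to follow it (a toy stand-in for an avatar's language model)."""
    tokens = text.split()
    model = {}
    for a, b in zip(tokens, tokens[1:]):
        model.setdefault(a, []).append(b)
    return model

def generate(model, seed_word, length=8, rng=None):
    """Walk the model from a seed word, sampling a successor at each step."""
    rng = rng or random.Random(0)
    out = [seed_word]
    for _ in range(length - 1):
        nexts = model.get(out[-1])
        if not nexts:
            break
        out.append(rng.choice(nexts))
    return " ".join(out)

model = train_bigram_lm("the ship sails at dawn and the ship returns at dusk")
print(generate(model, "the"))
```

Such a model produces locally plausible word sequences; the topic and semantic models in the full system are what keep longer stretches coherent and productive.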
Text categorisation in commercial applications poses several limiting constraints on the technology solutions to be employed. This paper describes how a method with some potential improvements is evaluated for practical purposes, and argues for a richer and more expressive evaluation procedure. One such procedure is exemplified in the paper by a precision-recall matrix, which trades convenience for usefulness.
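The paper's precision-recall matrix is not reproduced here; one plausible reading is a matrix of per-category precision and recall figures reported jointly, rather than collapsed into a single convenience score such as an averaged F-measure. A hypothetical sketch under that assumption:

```python
from collections import defaultdict

def precision_recall_matrix(gold, predicted, categories):
    """Per-category precision and recall, kept as a matrix rather than
    collapsed into one aggregate score."""
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1
            fn[g] += 1
    matrix = {}
    for c in categories:
        prec = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        rec = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        matrix[c] = (prec, rec)
    return matrix

gold      = ["sport", "sport", "news", "news", "culture"]
predicted = ["sport", "news",  "news", "news", "culture"]
m = precision_recall_matrix(gold, predicted, ["sport", "news", "culture"])
for cat, (p, r) in m.items():
    print(f"{cat:8s} precision={p:.2f} recall={r:.2f}")
```

The point of keeping the matrix is that a deployed categoriser may need very different precision/recall trade-offs for different categories, which a single aggregate number hides.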
Presented at the 7th CLEF 2016 Conference and Labs of the Evaluation Forum, 5-8 September 2016, Évora, Portugal.
This paper is a report from a workshop on Evaluation of Information Systems in Commercial Settings, inspired by the industrial day at SIGIR 2016. Small and medium-sized enterprises often lack the resources needed to develop proper evaluation infrastructures, and to follow the research development in the field of evaluation. Similarly, academics lag behind in (a) understanding real practical issues raised when it comes to the evaluation of real systems - e.g. even depth-k pooling is often infeasible when an SME has a single ranking algorithm developed - and (b) sensing the breadth of applications and tasks on which systems require evaluation, and their challenges. Large enterprises with the necessary resources and the data sets and flows to work with are hesitant to make their tests public, for both commercial and legal reasons. This workshop brought together representatives from technology companies, large and small, media houses, industrial consultants and academic research in information access for a discussion on practical issues and solutions to these issues.
A support vector classifier was compared to a lexicon-based approach for the task of detecting the stance categories speculation, contrast and conditional in English consumer reviews. Around 3,000 training instances were required to achieve a stable performance of an F-score of 90 for speculation. This outperformed the lexicon-based approach, for which an F-score of just above 80 was achieved. The machine learning results for the other two categories showed a lower average (an approximate F-score of 60 for contrast and 70 for conditional), as well as a larger variance, and were only slightly better than lexicon matching. Therefore, while machine learning was successful for detecting speculation, a well-curated lexicon might be a more suitable approach for detecting contrast and conditional.
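The lexicon-based baseline can be sketched as simple cue-word matching. The cue lists below are illustrative placeholders only, not the curated lexicons evaluated in the paper:

```python
import re

# Illustrative cue lexicons; the paper's curated lexicons are not
# reproduced here, so these entries merely stand in for the approach.
STANCE_LEXICON = {
    "speculation": {"might", "maybe", "perhaps", "possibly", "could"},
    "contrast": {"but", "however", "although", "though"},
    "conditional": {"if", "unless", "provided"},
}

def detect_stance(sentence):
    """Return the stance categories whose cue words occur in the sentence."""
    tokens = set(re.findall(r"[a-z']+", sentence.lower()))
    return {cat for cat, cues in STANCE_LEXICON.items() if tokens & cues}

print(detect_stance("The battery might last a day, but I doubt it."))
```

Matching of this kind is cheap and needs no training data, but it is blind to context, which is consistent with the finding that the trained classifier pulls ahead for speculation once enough training instances are available.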
This paper discusses the use of factorization techniques in distributional semantic models. We focus on a method for redistributing the weight of latent variables, which has previously been shown to improve the performance of distributional semantic models. However, this result has not been replicated and remains poorly understood. We refine the method, and provide additional theoretical justification, as well as empirical results that demonstrate the viability of the proposed approach.
This paper introduces a novel way to navigate neighborhoods in distributional semantic models. The approach is based on relative neighborhood graphs, which uncover the topological structure of local neighborhoods in semantic space. This has the potential to overcome both the problem with selecting a proper k in k-NN search, and the problem that a ranked list of neighbors may conflate several different senses. We provide both qualitative and quantitative results that support the viability of the proposed method.
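The relative neighborhood graph criterion itself is standard: two points are connected if and only if no third point is closer to both of them than they are to each other. A minimal sketch on toy two-dimensional points (real semantic spaces are high-dimensional, and the word coordinates below are invented for illustration):

```python
from math import dist  # Euclidean distance, Python 3.8+

def relative_neighborhood_graph(points):
    """Edge (a, b) iff no third point c satisfies both
    d(a, c) < d(a, b) and d(b, c) < d(a, b)."""
    edges = []
    names = list(points)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            d_ab = dist(points[a], points[b])
            blocked = any(
                dist(points[a], points[c]) < d_ab and dist(points[b], points[c]) < d_ab
                for c in names if c not in (a, b)
            )
            if not blocked:
                edges.append((a, b))
    return edges

# toy 2-d "semantic space" with invented coordinates
points = {"bank": (0, 0), "river": (1, 0), "money": (0, 1), "finance": (0, 2)}
print(relative_neighborhood_graph(points))
```

Unlike a ranked k-NN list, the graph structure makes it visible when a word's neighborhood splits into separate clusters, e.g. when distinct senses pull the neighbors in different directions.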
This paper reports from the workshop on Evaluating Learning Language Representations hosted by Gavagai in October 2014.
Presented at the 6th CLEF 2015 Conference and Labs of the Evaluation Forum, 8-11 September 2015, Toulouse, France. This work was partially funded by the European Science Foundation through its ELIAS project.
This paper describes a series of experiments to determine how positionally annotated Twitter texts can be used to learn words which indicate the location of other texts and their authors. Many texts are locatable but most have no explicit indication of place; many applications, both commercial and academic, have an interest in knowing where a text or its author is from.
The notion of placeness of a word is introduced as a measure of how locational a word is, and we find that modelling word distributions to account for several locations, using local distributional context, and aggregating locational information in a centroid for each text gives the most useful results. The results are applied to data in the Swedish language.
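The aggregation step can be sketched as a weighted centroid over the coordinates associated with place-indicative words. The cue table below is entirely hypothetical; in the paper, such weights and locations are learned from positionally annotated tweets:

```python
# Hypothetical "placeness" table: for each cue word, a weight and the
# (lat, lon) coordinates it is associated with. These entries are
# invented for illustration only.
PLACE_CUES = {
    "slussen":    (0.9, (59.320, 18.072)),  # Stockholm landmark
    "avenyn":     (0.8, (57.700, 11.975)),  # Gothenburg street
    "tunnelbana": (0.4, (59.330, 18.060)),  # Stockholm-associated term
}

def locate(text):
    """Weighted centroid of the locations of the place-indicative words."""
    total_w, lat, lon = 0.0, 0.0, 0.0
    for token in text.lower().split():
        if token in PLACE_CUES:
            w, (la, lo) = PLACE_CUES[token]
            total_w += w
            lat += w * la
            lon += w * lo
    return (lat / total_w, lon / total_w) if total_w else None

print(locate("tog tunnelbana till slussen"))  # lands near Stockholm
```

Weighting by placeness lets strongly locational words dominate the centroid while weakly locational ones contribute only marginally.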
Issue framing has become one of the most important means of elite influence on public opinion. In this paper, we introduce a method for investigating issue framing based on statistical analysis of large samples of language use. Our method uses a technique called Random Indexing (RI), which enables us to extract semantic and associative relations to any target concept of interest, based on co-occurrence statistics collected from large samples of relevant language use. As a first test and evaluation of our proposed method, we apply RI to a large collection of Swedish blog data and extract semantic relations relating to our target concept “outsiders”. This concept is widely used in the public debate both in relation to labour market issues and socially related issues.
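Random Indexing itself can be sketched compactly: each context word is assigned a sparse ternary index vector, and a word's context vector accumulates the index vectors of its neighbours, yielding a fixed-dimensional approximation of the co-occurrence matrix. A toy sketch with English data (the study itself uses Swedish blog text, and much higher dimensionality):

```python
import random

DIM, NONZERO = 512, 8  # dimensionality and sparsity of the index vectors

def index_vector(rng):
    """Sparse ternary random index vector: a few +1/-1 entries, rest zero."""
    v = [0] * DIM
    for pos in rng.sample(range(DIM), NONZERO):
        v[pos] = rng.choice((1, -1))
    return v

def random_indexing(sentences, window=2, seed=0):
    """Accumulate, for each word, the index vectors of its context words."""
    rng = random.Random(seed)
    index, context = {}, {}
    for tokens in sentences:
        for i, w in enumerate(tokens):
            vec = context.setdefault(w, [0] * DIM)
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if j != i:
                    neighbor = tokens[j]
                    if neighbor not in index:
                        index[neighbor] = index_vector(rng)
                    for k, x in enumerate(index[neighbor]):
                        vec[k] += x
    return context

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

sentences = [
    "jobs for the outsiders are scarce".split(),
    "jobs for the unemployed are scarce".split(),
]
vecs = random_indexing(sentences)
print(cosine(vecs["outsiders"], vecs["unemployed"]))  # similar contexts, high score
```

Because the index vectors are nearly orthogonal, the fixed-width sums approximate full co-occurrence counts while letting the model grow incrementally as new text streams in.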
In this paper we present our experiments on the RepLab 2014 Reputation Dimension task. RepLab is a competitive challenge for reputation management systems. RepLab 2014's reputation dimensions task focuses on the categorisation of Twitter messages with regard to standard reputation dimensions (such as performance, leadership, or innovation). Our approach relies only on the textual content of tweets and ignores both metadata and the content of URLs within tweets. We carried out several experiments focusing on different feature sets, including bags of n-grams, distributional semantics features, and deep neural network representations. The results show that bag-of-bigram features with minimum frequency thresholding work quite well in the reputation dimension task, especially with regard to the average F1 measure over all dimensions, where two of our four submitted runs achieved the highest and second highest scores. Our experiments also show that semi-supervised recursive autoencoders outperform the other feature sets with regard to accuracy, and are a promising subject of future research.
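The bag-of-bigram features with minimum frequency thresholding can be sketched as follows. The data is toy data; the actual runs used Twitter text with a trained classifier on top of these vectors:

```python
from collections import Counter

def bigram_vocabulary(texts, min_freq=2):
    """Bag-of-bigrams vocabulary, keeping only bigrams seen at least
    min_freq times across the collection (frequency thresholding)."""
    counts = Counter()
    for text in texts:
        tokens = text.lower().split()
        counts.update(zip(tokens, tokens[1:]))
    return sorted(b for b, c in counts.items() if c >= min_freq)

def vectorize(text, vocab):
    """Count-based feature vector over the thresholded bigram vocabulary."""
    tokens = text.lower().split()
    bigrams = Counter(zip(tokens, tokens[1:]))
    return [bigrams[b] for b in vocab]

tweets = [
    "great customer service today",
    "terrible customer service again",
    "customer service was slow",
]
vocab = bigram_vocabulary(tweets)
print(vocab)                        # only frequent bigrams survive
print(vectorize(tweets[0], vocab))
```

Thresholding discards one-off bigrams, which both shrinks the feature space and removes noise that short, idiosyncratic tweets would otherwise inject.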
A reasonable requirement (among many others) for a lexical or semantic component in an information system is that it should be able to learn incrementally from the linguistic data it is exposed to, that it can distinguish the topical impact of various terms, and that it knows whether it knows something or not.
We work with a specific representation framework – semantic spaces – which well accommodates the first requirement; in this short paper, we investigate the global qualities of semantic spaces by a topological procedure – mapper – which gives an indication of topical density of the space; we examine the local context of terms of interest in the semantic space using another topologically inspired approach which gives an indication of the neighbourhood of the terms of interest. Our aim is to be able to establish the qualities of the semantic space under consideration without resorting to inspection of the data used to build it.
Although the need for a common evaluation framework for multimedia and multimodal documents for various use cases, including non-topical use, is widely acknowledged, such a framework is still not in place. Retrieval system evaluation results are not regularly validated in laboratory or field studies; the infrastructure for generalizing results over tasks, users and collections is still missing. This chapter presents a use case-based framework for experimental design in the field of interactive information access. The framework is highlighted by examples that sketch out how the framework can be productively used in experimental design and reporting with a minimal threshold for adoption.
"What does the political opinion landscape look like? One can of course ask the voters. But perhaps it is better to look at what they write. Now a computer program is on the trail of the voters' sympathies."
This chapter deals with a statistical technique for sense exploration based on distributional semantics known as word space modelling. Word space models rely on feature aggregation, in this case aggregation of co-occurrence events, to build an aggregated view of the distributional behaviour of words. Such models calculate meaning similarity among words on the basis of the contexts in which they occur and represent it as proximity in high-dimensional vector spaces. The main purpose of this study is to test to what extent word-space modelling is in principle suitable for lexical-typological work, by taking a first small step in this direction and applying the method to the exploration of the seven central English temperature adjectives in three corpora representing different genres. In order to better capture and account for the potentially different senses of one and the same word, we have suggested and applied a new variant of this general method, “syntagmatically labelled partitioning”.
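The "syntagmatically labelled partitioning" variant can be illustrated schematically: rather than keeping one distributional profile per word, keep one per (word, label) pair, so that differently patterned uses of the same adjective are modelled apart. The labels below are supplied by hand purely for illustration; a real pipeline might derive them from syntactic context:

```python
from collections import Counter

# Hand-labelled toy contexts: (word, syntagmatic label, co-occurring word).
# In a real setting the labels would come from the syntactic environment
# rather than manual annotation.
labelled_contexts = [
    ("hot", "attributive", "water"),
    ("hot", "attributive", "soup"),
    ("hot", "predicative", "topic"),
    ("warm", "attributive", "water"),
]

partitions = {}
for word, label, context_word in labelled_contexts:
    partitions.setdefault((word, label), Counter())[context_word] += 1

# each (word, label) partition accumulates its own distributional profile,
# so e.g. temperature uses and figurative uses of "hot" stay separate
print(partitions[("hot", "attributive")])
print(partitions[("hot", "predicative")])
```

Splitting the profiles this way keeps a polysemous adjective's senses from being conflated into a single averaged vector, which is exactly the problem the variant is meant to address.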