Labs

RESEARCH

Our technology is based on 15+ years of research in computational linguistics and computer science. We believe that no significant progress can be made without pushing boundaries and breaking new ground, so our lab is constantly refining and advancing its algorithms and methodologies. We work in two main areas of research:

Distributional semantics, a research area in which we develop and study theories and methods for quantifying and categorising semantic similarities between linguistic items, based on their distributional properties in large samples of language data. Our interest here is manifold: we work on algorithms for effectively acquiring understanding from text, on rich and useful representations for linguistic content and situational context, and on applications of distributional models to real-world tasks (mostly, of course, tasks of commercial interest to us).
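As an illustration of the basic idea only (not of Gavagai's production system): a toy distributional model can be built from nothing more than co-occurrence counts and cosine similarity. The corpus, window size, and vocabulary below are invented for the example.

```python
# Toy distributional model: count co-occurrences within a context window,
# then compare words by the cosine of their co-occurrence vectors.
from collections import Counter, defaultdict
import math

corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the rug".split(),
    "stocks fell on the news".split(),
]

WINDOW = 2
vectors = defaultdict(Counter)
for sentence in corpus:
    for i, word in enumerate(sentence):
        lo, hi = max(0, i - WINDOW), min(len(sentence), i + WINDOW + 1)
        for j in range(lo, hi):
            if j != i:
                vectors[word][sentence[j]] += 1

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) \
         * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# "cat" and "dog" occur in near-identical contexts; "stocks" does not:
print(cosine(vectors["cat"], vectors["dog"]))     # close to 1.0
print(cosine(vectors["cat"], vectors["stocks"]))  # noticeably lower
```

Real models differ in scale and weighting, but distributional similarity reduces to this kind of comparison of contexts of occurrence.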

Evaluation of learning language models, a research area in which we develop methods and metrics to test and compare algorithms, memory models, and processing approaches, both to benchmark improvements and to validate approaches against tasks of interest.

PUBLICATIONS

The language of smell: Connecting linguistic and psychophysical properties of odor descriptors

Iatropoulos, Georgios
Herman, Pawel
Lansner, Anders
Karlgren, Jussi
Larsson, Maria
Olofsson, Jonas
The olfactory sense is a particularly challenging domain for cognitive science investigations of perception, memory, and language. Although many studies show that odors often are difficult to describe verbally, little is known about the associations between olfactory percepts and the words that describe them. Quantitative models of how odor experiences are described in natural language are therefore needed to understand how odors are perceived and communicated. In this study, we develop a computational method to characterize the olfaction-related semantic content of words in a large text corpus of internet sites in English. We introduce two new metrics: olfactory association index (OAI, how strongly a word is associated with olfaction) and olfactory specificity index (OSI, how specific a word is in its description of odors). We validate the OAI and OSI metrics using psychophysical datasets by showing that terms with high OAI have high ratings of perceived olfactory association and are used to describe highly familiar odors. In contrast, terms with high OSI have high inter-individual consistency in how they are applied to odors. Finally, we analyze Dravnieks’s (1985) dataset of odor ratings in terms of OAI and OSI. This analysis reveals that terms that are used broadly (applied often but with moderate ratings) tend to be olfaction-unrelated and abstract (e.g., “heavy” or “light”; low OAI and low OSI) while descriptors that are used selectively (applied seldom but with high ratings) tend to be olfaction-related (e.g., “vanilla” or “licorice”; high OAI). Thus, OAI and OSI provide behaviorally meaningful information about olfactory language. These statistical tools are useful for future studies of olfactory perception and cognition, and might help integrate research on odor perception, neuroimaging, and corpus-based linguistic models of semantic organization.
Published in Cognition
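The paper's exact formulations of OAI and OSI are not reproduced here, but one plausible way to operationalise an olfactory association index is as a word's mean embedding similarity to a small seed set of unambiguously olfactory terms. Everything below (the toy 3-d vectors, the seed set, the `oai` function) is a hypothetical sketch, not the published method.

```python
# Hypothetical sketch of an olfactory association index: score a word by
# its mean embedding similarity to a seed set of clearly olfactory terms.
# The 3-d "embeddings" and seed set are invented toy values.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

vectors = {
    "smell":   (0.9, 0.1, 0.0),
    "odor":    (0.8, 0.2, 0.1),
    "vanilla": (0.7, 0.3, 0.2),
    "heavy":   (0.1, 0.9, 0.4),
}
SEEDS = ("smell", "odor")

def oai(word):
    return sum(cosine(vectors[word], vectors[s]) for s in SEEDS) / len(SEEDS)

# An odor-specific descriptor should score higher than an abstract one:
print(oai("vanilla") > oai("heavy"))
```

A study along the paper's lines would use vectors learned from a large corpus rather than hand-set toy values.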

Second Workshop on Search and Exploration of X-Rated Information (SEXI’16)

Vanessa Murdock, Charles L.A. Clarke, Jaap Kamps, and Jussi Karlgren.

Adult content is pervasive on the web, has been a driving factor in the adoption of the Internet medium, and is responsible for a significant fraction of traffic and revenues, yet rarely attracts attention in research. The research questions surrounding adult content access behaviours are unique, and interesting and valuable research in this area can be done ethically. WSDM 2016 features a half day workshop on Search and Exploration of X-Rated Information (SEXI) for information access tasks related to adult content. While the scope of the workshop remains broad, special attention is devoted to the privacy and security issues surrounding adult content by inviting keynote speakers with extensive experience on these topics. The recent release of the personal data belonging to customers of the adult dating site Ashley Madison provides a timely context for the focus on privacy and security.

Proceedings of the Ninth ACM International Conference on Web Search and Data Mining (WSDM '16). ACM, New York, NY, USA, 697-698. DOI=http://dx.doi.org/10.1145/2835776.2855118

Pre-presentation of the SEXI workshop in the WSDM proceedings, from the ACM Digital Library

(A report from the workshop will be published in SIGIR Forum later in 2016.)

The Gavagai Living Lexicon

Magnus Sahlgren, Amaru Cuba Gyllensten, Fredrik Espinoza, Ola Hamfors, Jussi Karlgren, Fredrik Olsson, Per Persson, Akshay Viswanathan, and Anders Holst

This paper presents the Gavagai Living Lexicon, which is an online distributional semantic model currently available in 20 different languages. We describe the underlying distributional semantic model, and how we have solved some of the challenges in applying such a model to large amounts of streaming data. We also describe the architecture of our implementation, and discuss how we deal with continuous quality assurance of the lexicon.

Short paper presented at the 10th edition of the Language Resources and Evaluation Conference (LREC 2016), 23-28 May 2016, Portorož

Dead Man Tweeting

David Nilsson, Magnus Sahlgren, and Jussi Karlgren

This paper presents our first take at a text generator: Dead Man Tweeting, a system that learns semantic avatars from (dead) people’s texts, and makes the avatars come alive on Twitter. The system includes a language model for generating sequences of words, a topic model for ensuring that the sequences are topically coherent, and a semantic model that ensures the avatars can be productive and generate novel sequences. The avatars are connected to Twitter and are triggered by keywords that are significant for each particular avatar.

We will be continuing the development of this first whimsical prototype for other projects where generating topical text is of interest.

Paper presented on May 28 at the RE-WOCHAT 2016 workshop on Collecting and Generating Resources for Chatbots and Conversational Agents Development and Evaluation, in conjunction with LREC 2016 in Portorož

Evaluating Categorisation in Real Life – an argument against simple but impractical metrics

Vide Karlsson, Jussi Karlgren, and Pawel Herman

Text categorisation in commercial applications poses several limiting constraints on the technology solutions that can be employed. This paper describes how a method with some potential improvements is evaluated for practical purposes, and argues for a richer and more expressive evaluation procedure, exemplified here by a precision-recall matrix which trades convenience for usefulness.

Presented at the 7th CLEF 2016 Conference and Labs of the Evaluation Forum, 5-8 September 2016, Évora, Portugal.


En rekommenderad svensk språkteknologisk terminologi

Viggo Kann (KTH), Lars Ahrenberg (Linköping University), Rickard Domeij (Swedish Language Council), Ola Karlsson (Swedish Language Council), Jussi Karlgren, Henrik Nilsson (Terminologicentrum), Joakim Nivre (Uppsala University)
In 2014 the Swedish Language Technology Terminology Group was created, with representatives from different parts of the language technology community: higher education and research, industry, and governmental agencies. In 2016 we recommended Swedish terms for the 270 language technology concepts in the Bank of Finnish Terminology in Arts and Sciences. The language technology terms are published on folkets-lexikon.csc.kth.se/LTterminology, where anyone can look up Swedish and English terms interactively and read the full list of terms. We also try to enter the most important Swedish terminology into the Swedish Wikipedia. We encourage use of these Swedish terms and welcome suggestions for improvements of the Swedish terminology.
Presented at the Sixth Swedish Language Technology Conference

Random indexing of multidimensional data

Fredrik Sandin, Blerim Emruli, and Magnus Sahlgren
This paper gives a model for how to generalise random indexing to multidimensional arrays and therefore enable approximation of higher-order statistical relationships in data. The generalised method is a sparse implementation of random projections, which is the theoretical basis also for ordinary random indexing and other randomisation approaches to dimensionality reduction and data representation. We present numerical experiments which demonstrate that a multidimensional generalisation of random indexing is feasible, including comparisons with ordinary random indexing and principal component analysis. An open source implementation of generalised random indexing is provided.
Knowledge and Information Systems (2016). doi:10.1007/s10115-016-1012-2

A proposal to use distributional models to analyse dolphin vocalization

Mats Amundin, Robert Eklund, Henrik Hållsten, Jussi Karlgren, Lars Molinder
This paper gives a brief introduction to the starting points of our planned experimental project (pending favourable decisions by research funding agencies) to study dolphin communicative behaviour using distributional semantics as implemented by us at Gavagai. It presents some of the challenges and conveys some of the optimism we feel is warranted given the rapid increase of available data and of processing power. This is an opportunity both to test the limits of our model and to probe the characteristics of dolphin communication systems! Co-authors are Mats Amundin from Kolmården Wildlife Park, Robert Eklund from Linköping University, Henrik Hållsten, myself, and Lars Molinder of Carnegie, one of our financial advisors, who came up with the original idea. The paper was presented by Mats Amundin at the 1st International Workshop on Vocal Interactivity in-and-between Humans, Animals and Robots.
In: 1st International Workshop on Vocal Interactivity in-and-between Humans, Animals and Robots, 2017.

Plausibility Testing for Lexical Resources

Magdalena Parks, Jussi Karlgren, Sara Stymne
This paper describes principles for evaluation metrics for lexical components, using randomly scrambled sentences compared with naturally occurring ones, and sentences where some salient referent has been replaced with a "liar's item", together with a pilot implementation of those principles based on requirements from practical information systems.
Presented at the 2018 CLEF Conference

Practical Issues in Information Access System Evaluation

Kanoulas, Evangelos and Karlgren, Jussi

This paper is a report from a workshop on Evaluation of Information Systems in Commercial Settings, inspired by the industrial day at SIGIR 2016. Small and medium size enterprises often lack the resources needed to develop proper evaluation infrastructures, and also to follow research developments in the field of evaluation. Similarly, academics lag behind in (a) understanding the real practical issues raised in the evaluation of real systems - e.g. even depth-k pooling is often infeasible when an SME has only a single ranking algorithm developed - and (b) sensing the breadth of applications and tasks on which systems require evaluation, and the challenges they pose. Large enterprises with the necessary resources and the data sets and flows to work with are hesitant to make their tests public, for both commercial and legal reasons. This workshop brought together representatives from technology companies, large and small, media houses, industrial consultants, and academic research in information access for a discussion on practical issues and solutions to them.

Published in SIGIR Forum, Vol. 51, no 1

Analysis of Open Answers to Survey Questions through Interactive Clustering and Theme Extraction

Espinoza, Fredrik
Hamfors, Ola
Karlgren, Jussi
Olsson, Fredrik
Persson, Per
Hamberg, Lars
Sahlgren, Magnus
This paper describes design principles for and the implementation of Gavagai Explorer, a new application which builds on interactive text clustering to extract themes from topically coherent text sets such as open text answers to surveys or questionnaires. An automated system is quick, consistent, and has full coverage over the study material. Such a system allows an analyst to analyze more answers in a given time period; provides the same initial results regardless of who does the analysis, reducing the risk of inter-rater discrepancy; and does not risk missing responses due to fatigue or boredom. These factors reduce the cost and increase the reliability of the service. The most important feature, however, is relieving the human analyst of the frustrating aspects of the coding task, freeing effort for the central challenge of understanding themes.
Presented as a Demo at the 2nd CHIIR Conference, 2018

The Smart Data Layer

Sahlgren, Magnus
Ylipää, Erik
Brown, Barry
Helms, Karey
Lampinen, Airi
McMillan, Donald
Karlgren, Jussi
This paper introduces the notion of a smart data layer for the Internet of Everything. The smart data layer can be seen as an AI that learns a generic representation from heterogeneous data streams with the goal of understanding the state of the user. The smart data layer can be used both as materials for design processes and as the foundation for intelligent data processing.
Presented at the AAAI Spring Symposium

Hyperdimensional utterance spaces

Jussi Karlgren and Pentti Kanerva
Human language has a large and varying number of features, both lexical items and constructions, which interact to represent various aspects of communicative information. High-dimensional semantic spaces have proven useful and effective for aggregating and processing lexical information for many language processing tasks. This paper describes a hyperdimensional processing model for language data, a straightforward extension of models previously used for words to handling utterance or text level information. A hyperdimensional model is able to represent a broad range of linguistic and extra-linguistic features in a common integral framework which is suitable as a bridge between symbolic and continuous representations, as an encoding scheme for symbolic information and as a basis for feature space exploration. This paper provides an overview of the framework and an example of how it is used in a pilot experiment.
Presented at the 1st Biennial Conference on Design of Experimental Search and Information Retrieval Systems (DESIRES), 2018

Authorship profiling without using topical information

Jussi Karlgren, Lewis Esposito, Chantal Gratton, and Pentti Kanerva
This paper describes an experiment made for the PAN 2018 shared task on author profiling. The task is to distinguish female from male authors of microblog posts published on Twitter using no extraneous information except what is in the posts; this experiment focusses on using non-topical information from the posts, rather than gender differences in referential content.
Presented at the PAN Workshop of the CLEF Conference

Detecting Speculations, Contrasts and Conditionals in Consumer Reviews

Maria Skeppstedt, Teri Schamp-Bjerede, Magnus Sahlgren, Carita Paradis and Andreas Kerren

A support vector classifier was compared to a lexicon-based approach for the task of detecting the stance categories speculation, contrast and conditional in English consumer reviews. Around 3,000 training instances were required to achieve a stable performance of an F-score of 90 for speculation. This outperformed the lexicon-based approach, for which an F-score of just above 80 was achieved. The machine learning results for the other two categories showed a lower average (an approximate F-score of 60 for contrast and 70 for conditional), as well as a larger variance, and were only slightly better than lexicon matching. Therefore, while machine learning was successful for detecting speculation, a well-curated lexicon might be a more suitable approach for detecting contrast and conditional. 

Proceedings of the 6th Workshop on Computational Approaches to Subjectivity, Sentiment & Social Media Analysis (WASSA '15)

Factorization of Latent Variables in Distributional Semantic Models

David Ödling, Arvid Österlund and Magnus Sahlgren

This paper discusses the use of factorization techniques in distributional semantic models. We focus on a method for redistributing the weight of latent variables, which have previously been shown to improve the performance of distributional semantic models. However, this result has not been replicated and remains poorly understood. We refine the method, and provide additional theoretical justification, as well as empirical results that demonstrate the viability of the proposed approach.

EMNLP 2015

Navigating the Semantic Horizon using Relative Neighborhood Graphs

Amaru Cuba Gyllensten and Magnus Sahlgren

This paper introduces a novel way to navigate neighborhoods in distributional semantic models. The approach is based on relative neighborhood graphs, which uncover the topological structure of local neighborhoods in semantic space. This has the potential to overcome both the problem with selecting a proper k in k-NN search, and the problem that a ranked list of neighbors may conflate several different senses. We provide both qualitative and quantitative results that support the viability of the proposed method.

EMNLP 2015
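The relative neighborhood graph itself (Toussaint 1980) is simple to state: points p and q are connected iff no third point is closer to both of them than they are to each other. A brute-force sketch on toy 2-d points follows; the paper applies the idea to neighborhoods in high-dimensional semantic space.

```python
# Relative neighbourhood graph: p and q are connected iff no third
# point r satisfies max(d(p, r), d(q, r)) < d(p, q).
from itertools import combinations
import math

def rng_edges(points):
    edges = []
    for p, q in combinations(points, 2):
        d = math.dist(p, q)
        blocked = any(max(math.dist(p, r), math.dist(q, r)) < d
                      for r in points if r not in (p, q))
        if not blocked:
            edges.append((p, q))
    return edges

# Three collinear points: the middle one blocks the long edge.
pts = [(0, 0), (1, 0), (2, 0)]
print(rng_edges(pts))  # [((0, 0), (1, 0)), ((1, 0), (2, 0))]
```

Unlike k-NN search, no k needs to be chosen: the local geometry of the data decides which neighbors are connected.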

Evaluating Learning Language Representations

Jussi Karlgren, Jimmy Callin, Kevyn Collins-Thompson, Amaru Cuba Gyllensten, Ariel Ekgren, David Jürgens, Anna Korhonen, Fredrik Olsson, Magnus Sahlgren, and Hinrich Schütze

This paper reports from the workshop on Evaluating Learning Language Representations hosted by Gavagai in October 2014.

Presented at the 6th CLEF 2015 Conference and Labs of the Evaluation Forum, 8-11 September 2015, Toulouse, France. This work was partially funded by the European Science Foundation through its ELIAS project.


Inferring the location of authors from words in their texts

Max Berggren, Jussi Karlgren, Robert Östling, and Mikael Parkvall

This paper describes a series of experiments to determine how positionally annotated Twitter texts can be used to learn words which indicate the location of other texts and their authors. Many texts are locatable, but most carry no explicit indication of place; many applications, both commercial and academic, have an interest in knowing where a text or its author is from.

The notion of the placeness of a word is introduced as a measure of how locational a word is. We find that modelling word distributions to account for several locations, using local distributional context, and aggregating locational information in a centroid for each text gives the most useful results. The results are applied to data in the Swedish language.

Presented at the 20th NoDaLiDa, Nordic Conference on Computational Linguistics in May 11-13, 2015, Vilnius. This work was done in cooperation with Stockholm University and was partially funded by Vetenskapsrådet, the Swedish Research Council, under its grant SINUS (Spridning av innovationer i nutida svenska).

Issue framing and language use in the Swedish blogosphere: Changing notions of the outsider concept

Stefan Dahlberg and Magnus Sahlgren
Department of Political Science, University of Gothenburg and Gavagai, Stockholm

Issue framing has become one of the most important means of elite influence on public opinion. In this paper, we introduce a method for investigating issue framing based on statistical analysis of large samples of language use. Our method uses a technique called Random Indexing (RI), which enables us to extract semantic and associative relations to any target concept of interest, based on co-occurrence statistics collected from large samples of relevant language use. As a first test and evaluation of our proposed method, we apply RI to a large collection of Swedish blog data and extract semantic relations relating to our target concept "outsiders". This concept is widely used in the public debate, both in relation to labour market issues and to socially related issues.

In: Bertie Kaal, Isa Maks and Annemarie van Elfrinkhof (eds.) From Text to Political Positions: Text analysis across disciplines, John Benjamins Publishing Company, 2014, pp. 71–92.

The STAVICTA Group Report for RepLab 2014 Reputation Dimensions Task

Afshin Rahimi, Magnus Sahlgren, Andreas Kerren, and Carita Paradis

In this paper we present our experiments on the RepLab 2014 Reputation Dimensions task. RepLab is a competitive challenge for reputation management systems. RepLab 2014's reputation dimensions task focuses on categorization of Twitter messages with regard to standard reputation dimensions (such as performance, leadership, or innovation). Our approach relies only on the textual content of tweets and ignores both metadata and the content of URLs within tweets. We carried out several experiments focusing on different feature sets, including bags of n-grams, distributional semantics features, and deep neural network representations. The results show that bag-of-bigram features with minimum frequency thresholding work quite well in the reputation dimensions task, especially with regard to average F1 measure over all dimensions, where two of our four submitted runs achieve the highest and second highest scores. Our experiments also show that semi-supervised recursive autoencoders outperform the other feature sets used in our experiments with regard to accuracy, and are a promising subject of future research.


Proceedings of CLEF 2014

Semantic Topology

Jussi Karlgren, Gabriel Isheden, Martin Bohman, Ariel Ekgren, Emelie Kullmann, and David Nilsson
Gavagai and KTH

A reasonable requirement (among many others) for a lexical or semantic component in an information system is that it should be able to learn incrementally from the linguistic data it is exposed to, that it can distinguish between the topical impact of various terms, and that it knows if it knows stuff or not.

We work with a specific representation framework – semantic spaces – which well accommodates the first requirement; in this short paper, we investigate the global qualities of semantic spaces by a topological procedure – mapper – which gives an indication of topical density of the space; we examine the local context of terms of interest in the semantic space using another topologically inspired approach which gives an indication of the neighbourhood of the terms of interest. Our aim is to be able to establish the qualities of the semantic space under consideration without resorting to inspection of the data used to build it.

In: Proceedings of the 23rd ACM International Conference on Information and Knowledge Management (CIKM '14), Shanghai, Nov 3-7. ACM, New York, NY, USA, 2014.

A Use Case Framework for Information Access Evaluation

Preben Hansen, Gunnar Eriksson, Anni Järvelin, and Jussi Karlgren

Although the need for a common evaluation framework for multimedia and multimodal documents for various use cases, including non-topical use, is widely acknowledged, such a framework is still not in place. Retrieval system evaluation results are not regularly validated in laboratory or field studies, and the infrastructure for generalizing results over tasks, users, and collections is still missing. This chapter presents a use-case-based framework for experimental design in the field of interactive information access. The framework is illustrated by examples that sketch out how it can be productively used in experimental design and reporting, with a minimal threshold for adoption.

In "Professional Search in the Modern World", Paltoglou, Georgios, Loizides, Fernando, Hansen, Preben (Eds.). Springer. 2014.

Språket avslöjar hur vi röstar

Jussi Karlgren.

"Hur ser det politiska opinionsläget ut? Det går förstås att fråga väljarna. Men bättre är kanske att se vad de skriver. Nu är ett datorprogram väljarnas sympatier på spåren."

2014. Språktidningen. 6: 16-22.

Temperature in the Word Space: Sense Exploration of Temperature Expressions using Word-Space Modelling

Maria Koptjevskaja-Tamm and Magnus Sahlgren
Department of Linguistics, Stockholm University and Gavagai, Stockholm

This chapter deals with a statistical technique for sense exploration based on distributional semantics known as word space modelling. Word space models rely on feature aggregation, in this case aggregation of co-occurrence events, to build an aggregated view on the distributional behaviour of words. Such models calculate meaning similarity among words on the basis of the contexts in which they occur and represent it as proximity in high-dimensional vector spaces. The main purpose of this study is to test to what extent word-space modelling is in principle suitable for lexical-typological work by taking a first little step in this direction and applying the method for the exploration of the seven central English temperature adjectives in three corpora representing different genres. In order to better capture and account for the potentially different senses of one and the same word we have suggested and applied a new variant of this general method, “syntagmatically labelled partitioning”.

In: Benedikt Szmrecsanyi and Bernhard Wälchli (eds.) Aggregating Dialectology, Typology, and Register Analysis: Linguistic Variation in Text and Speech, Berlin, Boston: De Gruyter, 2014, pp. 231–267.