Interactive Maps for Corpus-Based Dialectology

Traditional data collection in dialectology relies on structured surveys, whose results can easily be presented on printed or digital maps. In recent years, however, corpora of transcribed dialect speech have become a valuable alternative source of data for linguistic analysis.

However, transcribed interviews with dialect speakers cannot easily be compared with each other as they differ considerably in length and content. If informant A does not use word x, this does not necessarily mean that the word does not exist in A’s dialect. It may just be that A chose to talk about topics that did not require the use of word x.

The CorCoDial (Corpus-based computational dialectology) project aims to make dialect corpora comparable by applying NLP techniques such as topic modeling and representation learning with language models. This demonstration website shows how the parameters of trained models can be used to create new visualisations of dialect landscapes.

Data

We work with three datasets consisting of dialect interviews or conversations, which have been both phonetically transcribed and normalized to a standard variety:

Topic modeling

We consider each transcribed interview as one document and train a topic model to group similar interviews together. In order to focus on phonological and morphological features, no lemmatization or stopword removal is performed. Various parameters, such as the topic modeling technique, the subword tokenization scheme, and the number of topics, can be adjusted.
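The setup can be sketched as follows. This is a minimal illustration, not the project's actual pipeline: the interview texts are toy placeholders, and character n-grams stand in for whatever subword tokenization scheme is selected.

```python
# Minimal topic-modeling sketch: one document per interview, no lemmatization
# or stopword removal, subword-like features via character n-grams.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy "interviews" (placeholder strings, one per speaker)
interviews = [
    "mie lähen kotia ja mie sanon että",
    "minä lähden kotiin ja minä sanon että",
    "mää lähen kotio ja mää sanon notta",
]

# Character n-grams preserve phonological/morphological variation
vectorizer = CountVectorizer(analyzer="char_wb", ngram_range=(2, 4), lowercase=False)
X = vectorizer.fit_transform(interviews)

# Adjustable parameter: number of topics
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)  # rows: interviews, columns: topic proportions
```

Each row of `doc_topics` sums to 1, so the per-interview topic mixtures can be mapped directly to colors or pie charts at the informants' locations.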

Kuparinen, Olli & Scherrer, Yves (2024). Corpus-based dialectometry with topic models. Journal of Linguistic Geography 12(1), 1–12. Cambridge University Press.

Speaker embeddings

We train a neural machine translation model to convert the phonetic transcriptions to the standardized spelling. Following work in multilingual machine translation, we add the speaker ID as the first token of each utterance. After training, the learned embeddings of these speaker ID tokens are extracted, then reduced or clustered for visualization. Three methods are available: PCA (fixed to 3 dimensions), Ward agglomerative clustering, and k-means clustering.

Kuparinen, Olli & Scherrer, Yves (2023). Dialect Representation Learning with Neural Dialect-to-Standard Normalization. In Proceedings of VarDial 2023, 200-212. Association for Computational Linguistics.