Ask a librarian

This dataset contains TF-IDF data matrices generated from the "Ask a librarian" question/answer corpus and is intended for machine learning use. The corpus is in Finnish. The matrices are especially suitable for training Extreme Multi-label Text Classification (XMTC) models.
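
As a rough illustration of what a TF-IDF matrix holds, the sketch below builds one in plain Python from three toy documents. The weighting used here (relative term frequency times log(N/df)) is an assumption; the exact variant and file format of the published matrices are not stated on this page, and the function name `tfidf_matrix` and the sample sentences are hypothetical.

```python
import math
from collections import Counter

def tfidf_matrix(docs):
    """Return (vocab, rows); rows[i][j] is the TF-IDF weight of vocab[j] in docs[i]."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        rows.append([
            (counts[t] / len(toks)) * math.log(n_docs / df[t]) if toks else 0.0
            for t in vocab
        ])
    return vocab, rows

# Toy documents (hypothetical, not taken from the corpus).
docs = ["mistä löydän kirjan", "kuka kirjoitti kirjan", "mistä sana tulee"]
vocab, matrix = tfidf_matrix(docs)
```

Each row corresponds to one document and each column to one term, so a term that is absent from a document gets weight zero and real matrices of this kind are typically stored in a sparse format.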

The original corpus contains 3150 relatively short Finnish-language documents from the service Kysy kirjastonhoitajalta (Ask a librarian). Each document is a question from the general public together with an answer from a librarian.

The corpus was extracted from a collection of over 25000 question/answer pairs, with the requirement that each document have at least 4 subjects.
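
The selection rule can be sketched as a simple filter. The record layout (a dict with `id` and `subjects` keys) and the sample values are hypothetical; only the threshold of 4 subjects comes from the description.

```python
MIN_SUBJECTS = 4  # threshold stated in the dataset description

def has_enough_subjects(record, minimum=MIN_SUBJECTS):
    """Keep a question/answer pair only if it carries at least `minimum` subjects."""
    return len(record["subjects"]) >= minimum

# Hypothetical records, not taken from the corpus.
records = [
    {"id": "q1", "subjects": ["kirjat", "historia", "suomi", "kielet"]},
    {"id": "q2", "subjects": ["musiikki"]},
    {"id": "q3", "subjects": ["runot", "kirjailijat", "käännökset", "lyriikka", "1900-luku"]},
]
corpus = [r for r in records if has_enough_subjects(r)]
```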

The corpus has been split into the following directories:

all: contains all the documents (N=3150)
train: contains questions asked before 2016 (N=2625), intended for training
maui-train: random sample subset (N=200) of train, intended for training a Maui model
validate: contains questions asked in 2016 (N=213), intended for validating (e.g. choosing hyperparameters for a classifier)
test: contains questions asked in 2017 (N=312), intended for final evaluation
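
The time-based split described above can be sketched as follows. The `(id, year)` pairs, the random seed, and the toy sample size are assumptions for illustration; the published maui-train subset has N=200, and how it was actually sampled is not documented here.

```python
import random

def assign_split(year):
    """Map the year a question was asked to the split named in the list above."""
    if year < 2016:
        return "train"
    if year == 2016:
        return "validate"
    if year == 2017:
        return "test"
    raise ValueError(f"no split defined for year {year}")

# Hypothetical (document id, year asked) pairs, not taken from the corpus.
questions = [("q1", 2009), ("q2", 2014), ("q3", 2015), ("q4", 2016), ("q5", 2017)]

splits = {"train": [], "validate": [], "test": []}
for doc_id, year in questions:
    splits[assign_split(year)].append(doc_id)

# maui-train is a random subset of train (N=200 in the real data).
rng = random.Random(0)  # fixed seed so the toy sample is reproducible
maui_train = rng.sample(splits["train"], k=2)

# The published split sizes add up to the full corpus.
assert 2625 + 213 + 312 == 3150
```

Splitting by year rather than at random means the validate and test sets contain only questions asked after every training question, which mirrors how a classifier would be used on new incoming questions.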

The original corpus is available from

Resources (1)

Additional Info

Field Value
Outdated No
More about the license

The training matrices were produced by CSC - Tieteen tietotekniikan keskus Oy (CSC - IT Center for Science Ltd.). The original data was collected by Kansalliskirjasto (the National Library of Finland).

Collection type Open data
State Active
Dataset maintainer Analytiikkaryhmä