Resources

Perceptual norms for Italian words

The effects of concreteness and imageability in word recognition are a central topic in the investigation of grounded effects in language processing. However, it has been proposed that perceptual modality norms are a better way to capture these grounded aspects of word meaning. We have released such norms for 1121 Italian words, extracted from the Italian version of the ANEW database. For each word, participants provided perceptual strength ratings for each of the five perceptual modalities, namely hearing, taste, touch, smell and vision. The dataset is well suited to being combined with the ANEW measures, and also includes behavioral data (response times and accuracy) from two large-scale experiments, with lexical decision and word naming as tasks.

The full dataset can be downloaded here. The paper describing the dataset, authored by Alessandra Vergallito, Marco Petilli and myself, can be found here. It also reports a thorough analysis of the ratings and describes an interesting dissociation between Italian and English in the impact of such measures!


Orthography-Semantics Consistency (OSC) database

Orthography-Semantics Consistency (OSC) is a measure of the semantic relatedness between a word and its orthographic relatives. It is computed with distributional semantics methods, as the frequency-weighted average semantic similarity between the meaning of a given word and the meanings of all the words containing that very same orthographic string.
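
As a rough illustration of this definition, here is a minimal Python sketch. It is not the procedure used to build the released database: the function names (`cosine`, `osc`), the toy vectors and the toy frequencies are all made up for the example, and two choices are assumptions, namely that semantic similarity is the cosine between word vectors and that the target word is not counted among its own relatives.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def osc(target, vectors, frequencies):
    """Frequency-weighted average similarity between `target` and its
    orthographic relatives (words containing `target` as a substring).

    Assumption: the target itself is excluded from its relatives.
    Returns None when the target has no relatives in the vocabulary.
    """
    relatives = [w for w in vectors if target in w and w != target]
    total_freq = sum(frequencies[w] for w in relatives)
    if total_freq == 0:
        return None
    weighted = sum(
        frequencies[w] * cosine(vectors[target], vectors[w])
        for w in relatives
    )
    return weighted / total_freq


# Toy data, invented for illustration only (not real vectors or counts).
vectors = {
    "corn": [1.0, 0.0],
    "corner": [0.0, 1.0],      # orthographic relative, unrelated meaning
    "cornfield": [1.0, 0.0],   # orthographic relative, related meaning
    "table": [0.5, 0.5],       # not a relative of "corn"
}
frequencies = {"corn": 50, "corner": 30, "cornfield": 10, "table": 80}

score = osc("corn", vectors, frequencies)
print(score)  # (30 * 0.0 + 10 * 1.0) / (30 + 10) = 0.25
```

In this toy case the frequent but unrelated relative ("corner") pulls the estimate down, which is the intuition behind the measure: a word has low OSC when the words sharing its letter string tend to mean something else.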

We first described OSC in Marelli, Amenta & Crepaldi (2015), where the measure was proposed as an explanation for a long-standing side phenomenon in the morphological-processing literature, namely the effect of "stem transparency". However, we quickly realized that the measure has a general effect on word-recognition tasks, largely independent of other psycholinguistic predictors. If you want to include OSC as a predictor in your experiments, a database with optimized OSC estimates for 15,017 English words can be downloaded here. The development of the database is described in detail in Marelli and Amenta (Behavior Research Methods, 2018).


WEISS: automatic semantic estimates for Italian

Semantic estimates for psycholinguistic experiments are often difficult to obtain, as they typically require running pre-studies to collect ratings on a large number of potential stimuli. Distributional semantic models such as LSA offer an ideal shortcut, making it possible to obtain semantic metrics automatically from corpus data. To this end I have released WEISS, a set of semantic models for Italian based on state-of-the-art techniques, validated on psycholinguistic data, and accessible through the great SNAUT web interface (developed by Pawel Mandera). The models are described in a paper that appeared in Psihologija, and can be consulted through the following links:


Frequency norms from social media

We have recently shown that frequency norms extracted from social media (Facebook and Twitter) provide the best predictions for psycholinguistic purposes (such as studying response times in lexical decision), outperforming other resources based on traditional and subtitle corpora. The study is described in a paper published in Cognitive Science (freely available here), the result of a fruitful collaboration with Amaç Herdağdelen from the Facebook Data Science group. The newly proposed frequency norms can be downloaded from the links below:

  • Facebook frequency norms

  • Twitter frequency norms (based on the Rovereto Twitter Corpus)


The SICK dataset

SICK (Sentences Involving Compositional Knowledge) is a large dataset of human intuitions on English sentences, collected through crowdsourcing. The dataset includes about 10,000 sentence pairs, each annotated for the degree of semantic relatedness and the type of entailment relation. The data were prepared with the specific purpose of capturing compositional aspects, thus minimizing elements such as named entities, world-knowledge notions and idioms, and focusing on phenomena of linguistic interest (lexical variations, syntactic alternations, negation). Although the dataset is first and foremost aimed at the validation of computational models (and was indeed employed in a SemEval shared task), it can also be profitably used for psycholinguistic purposes. The dataset is described in a series of papers (Marelli et al., LREC 2014; Marelli et al., SemEval 2014; Bentivogli et al., under review) that can be downloaded, along with the dataset itself, from the link above.


The FRACSS model

FRACSS (Functional Representation of Affixes in Compositional Semantic Spaces) is a distributional model for representations of morpheme meanings and compositional operations at the sub-word level. The model is discussed in detail in Marelli & Baroni (Psychological Review, 2015). Scripts and datasets can be found at the link above.