Resources

WEISS: automatic semantic estimates for Italian
Semantic estimates for psycholinguistic experiments are often difficult to obtain, as they typically require running pre-studies to collect ratings on a large number of potential stimuli. Distributional semantic models such as LSA offer a convenient shortcut, making it possible to obtain semantic metrics automatically from corpus data. For this reason I have released WEISS, a set of semantic models for Italian based on state-of-the-art techniques, validated on psycholinguistic data, and accessible through the great SNAUT web interface (developed by Pawel Mandera). The models are described in a paper published in Psihologija, and can be consulted through the following links:


Frequency norms from social media
We have recently shown that frequency norms extracted from social media (Facebook and Twitter) provide the best predictions of psycholinguistic data (such as response times in lexical decision), outperforming other resources based on traditional and subtitle corpora. The study is described in a paper that will be published in Cognitive Science (freely available here), the result of a fruitful collaboration with Amaç Herdağdelen from the Facebook Data Science group. The newly proposed frequency norms can be downloaded from the links below:


The FRACSS model
FRACSS (Functional Representation of Affixes in Compositional Semantic Spaces) is a distributional model for representing morpheme meanings and compositional operations at the sub-word level. The model is discussed in detail in Marelli & Baroni (Psychological Review, 2015). Scripts and datasets can be found at the link above.
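As a rough illustration of the general idea of a functional representation (not the released scripts), an affix can be thought of as a linear map, here a matrix, that is applied to a stem vector to yield a vector for the derived word. The sketch below assumes this reading; the function names and all numbers are invented for illustration and do not come from the actual model.

```python
import math

def apply_affix(matrix, stem):
    """Apply an affix matrix to a stem vector (plain matrix-vector product)."""
    return [sum(w * x for w, x in zip(row, stem)) for row in matrix]

def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 3-dimensional values, invented for illustration only.
stem_vector = [1.0, 0.0, 2.0]
affix_matrix = [
    [0.5, 0.0, 0.0],
    [0.0, 1.0, 0.5],
    [0.0, 0.0, 1.0],
]

# Compose the derived-word vector, then check how far it moved from the stem.
derived_vector = apply_affix(affix_matrix, stem_vector)
stem_derived_similarity = cosine(stem_vector, derived_vector)
```

In a real setting the matrix would be estimated from corpus-based vectors rather than written by hand, and the vectors would have hundreds of dimensions.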




The SICK dataset
SICK (Sentences Involving Compositional Knowledge) is a large dataset of human intuitions on English sentences, collected through crowdsourcing. The dataset includes about 10,000 sentence pairs, each annotated for the degree of semantic relatedness and the type of entailment relation. The data were prepared with the specific purpose of capturing compositional aspects, thus minimizing elements such as named entities, world-knowledge notions, and idioms, and focusing on phenomena of linguistic interest (lexical variations, syntactic alternations, negation). Although the dataset is first and foremost aimed at the validation of computational models (and was indeed employed in a SemEval shared task), it can also be used profitably for psycholinguistic purposes. The dataset is described in a series of papers (Marelli et al., LREC 2014; Marelli et al., SemEval 2014; Bentivogli et al., under review) that can be downloaded, along with the dataset itself, from the link above.
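One common way to validate a computational model against relatedness ratings of this kind is to correlate the model's scores with the human judgments. The sketch below computes a Pearson correlation in plain Python; the ratings and model scores are invented placeholders, not values taken from SICK, and the variable names are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Invented example values: one human rating and one model score per sentence pair.
human_ratings = [4.5, 1.2, 3.3, 2.0]
model_scores = [0.9, 0.1, 0.6, 0.4]

# A value close to 1 means the model ranks and spaces the pairs like the raters do.
r = pearson(human_ratings, model_scores)
```

With real data, the two lists would be read from the dataset file and from the model's output, aligned by sentence pair.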