Resources

Orthography-Semantics Consistency (OSC) database
Orthography-Semantics Consistency (OSC) is a measure of the semantic relatedness between a word and its orthographic relatives. It is computed with distributional semantics methods as the frequency-weighted average semantic similarity between the meaning of a given word and the meanings of all the words that contain the same orthographic string.
We first described OSC in Marelli, Amenta & Crepaldi (2015), where the measure was proposed as an explanation for a long-standing side phenomenon in the morphological-processing literature, namely the effect of "stem transparency". However, we quickly realized that the measure has a general effect on word-recognition tasks, largely independent of other psycholinguistic predictors. If you want to include OSC as a predictor in your experiments, a database with optimized OSC estimates for 15,017 English words can be downloaded here. The development of the database is described in detail in Marelli and Amenta (Behavior Research Methods, in press).
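As a rough illustration of the definition above, the frequency-weighted average can be sketched in a few lines of Python. This is only a minimal sketch under stated assumptions, not the procedure used to build the database: the function names (cosine, osc), the toy vectors, and the toy frequencies are all made up for the example, cosine similarity is assumed as the similarity measure, and the target word itself is assumed to be excluded from its own set of relatives.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def osc(target, vectors, frequencies):
    """Frequency-weighted mean cosine between `target` and its orthographic relatives.

    Relatives are taken to be all other words in `vectors` that contain the
    target string (assumption of this sketch: the target itself is excluded).
    Returns None when the target has no relatives.
    """
    relatives = [w for w in vectors if w != target and target in w]
    total_freq = sum(frequencies[w] for w in relatives)
    if total_freq == 0:
        return None
    weighted = sum(
        frequencies[w] * cosine(vectors[target], vectors[w]) for w in relatives
    )
    return weighted / total_freq


# Toy, made-up vectors and frequencies (NOT taken from the OSC database).
toy_vectors = {
    "corn": [1.0, 0.0],
    "corner": [0.0, 1.0],
    "cornfield": [1.0, 0.0],
}
toy_frequencies = {"corn": 20, "corner": 30, "cornfield": 10}

print(osc("corn", toy_vectors, toy_frequencies))  # 0.25
```

In this toy case the frequent but semantically unrelated relative ("corner") pulls the estimate down, while the rarer, related one ("cornfield") contributes less because of its lower frequency.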


WEISS: automatic semantic estimates for Italian
Semantic estimates for psycholinguistic experiments are often difficult to obtain, as they typically require running pre-studies to collect ratings on a large number of potential stimuli. Distributional semantic models such as LSA offer an ideal shortcut, making it possible to obtain semantic metrics automatically from corpus data. To this end I have released WEISS, a set of semantic models for Italian based on state-of-the-art techniques, validated on psycholinguistic data, and accessible through the great SNAUT web interface (developed by Pawel Mandera). The models are described in a paper that appeared in Psihologija, and can be consulted through the following links:

Frequency norms from social media
We have recently shown that frequency norms extracted from social media (Facebook and Twitter) provide the best predictions for psycholinguistic purposes (such as studying response times in lexical decision), outperforming other resources based on traditional and subtitle corpora. The study is described in a paper published in Cognitive Science (freely available here), the result of a fruitful collaboration with Amaç Herdağdelen from the Facebook Data Science group. The newly proposed frequency norms can be downloaded from the links below:

The SICK dataset
SICK (Sentences Involving Compositional Knowledge) is a large dataset of human intuitions on English sentences, collected through crowdsourcing. The dataset includes about 10,000 sentence pairs, each annotated for the degree of semantic relatedness and the type of entailment relation. The data were prepared with the specific purpose of capturing compositional aspects, thus minimizing elements such as named entities, world-knowledge notions, and idioms, and focusing on phenomena of linguistic interest (lexical variations, syntactic alternations, negation). Although the dataset is first and foremost aimed at the validation of computational models (and was indeed employed in a SemEval shared task), it can also be profitably used for psycholinguistic purposes. The dataset is described in a series of papers (Marelli et al., LREC 2014; Marelli et al., SemEval 2014; Bentivogli et al., under review) that can be downloaded, along with the dataset itself, from the link above.


The FRACSS model
FRACSS (Functional Representation of Affixes in Compositional Semantic Spaces) is a distributional model for representations of morpheme meanings and compositional operations at the sub-word level. The model is discussed in detail in Marelli & Baroni (Psychological Review, 2015). Scripts and datasets can be found at the link above.