A common strategy for assigning keywords to documents is to select the most appropriate words from the document text. One of the most important criteria for selecting a word as a keyword is its relevance to the text. The tf.idf score of a term is a widely used relevance measure. Although it is easy to compute and gives quite satisfactory results, this measure does not take (semantic) relations between words into account. In this paper we study some alternative relevance measures that do use relations between words. They are computed by defining co-occurrence distributions for words and comparing these distributions with the document distribution and the corpus distribution. We then evaluate the keyword extraction algorithms obtained by selecting different relevance measures. For two corpora of abstracts with manually assigned keywords, we compare the manually assigned keywords with the automatically extracted ones. The results show that using word co-occurrence information can improve precision and recall over tf.idf.
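For readers unfamiliar with the tf.idf baseline mentioned above, the following is a minimal sketch of one common variant (raw term frequency multiplied by the logarithm of the inverse document frequency). The abstract does not say which weighting variant is used, so the formula, the function names `tf_idf` and `top_keywords`, and the toy corpus are all assumptions made for illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: score} dict per tokenized document.

    Assumed variant: score = tf * log(N / df), with raw counts as tf.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        scores.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return scores


def top_keywords(docs, index, k=3):
    """Pick the k highest-scoring terms of one document (ties broken alphabetically)."""
    scores = tf_idf(docs)[index]
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


# Hypothetical toy corpus, already tokenized.
corpus = [
    "keyword extraction selects relevant words from the text".split(),
    "the corpus contains abstracts with assigned keywords".split(),
    "the relevance measure compares word distributions".split(),
]

print(top_keywords(corpus, 0))
```

A term that occurs in every document, such as "the" here, receives a score of zero. Each term is scored independently of all others, which is the limitation that the co-occurrence-based measures are meant to address.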
In this paper we describe our work in progress on the development of a set of criteria to predict text difficulty in Sign Language of the Netherlands (NGT). These texts are used in a four-year bachelor program, which is being brought in line with the Common European Framework of Reference for Languages (CEFR; Council of Europe, 2001). Production and interaction proficiency are assessed through the NGT Functional Assessment instrument, adapted from the Sign Language Proficiency Interview (Caccamise & Samar, 2009). With this test we were able to determine that after one year of NGT study students produce NGT at CEFR level A2, after two years they sign at level B1, and after four years they are proficient in NGT at CEFR level B2. As a result, we were able to identify NGT texts that matched the level of students at certain stages in their studies and to associate these texts with a CEFR level. These texts were then analysed for sign familiarity, morpheme-sign rate, use of space and use of non-manual signals. All of these elements appear to be relevant for determining a good alignment between the difficulty of NGT signed texts and the targeted CEFR level, although only the morpheme-sign rate appears to be a decisive indicator.