A common strategy to assign keywords to documents is to select the most appropriate words from the document text. One of the most important criteria for a word to be selected as a keyword is its relevance to the text. The tf.idf score of a term is a widely used relevance measure. While easy to compute and giving quite satisfactory results, this measure does not take (semantic) relations between words into account. In this paper we study some alternative relevance measures that do use relations between words. They are computed by defining co-occurrence distributions for words and comparing these distributions with the document and the corpus distribution. We then evaluate keyword extraction algorithms defined by selecting different relevance measures. For two corpora of abstracts with manually assigned keywords, we compare the manually assigned keywords with different automatically extracted ones. The results show that using word co-occurrence information can improve precision and recall over tf.idf.
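For orientation, the tf.idf baseline mentioned above can be sketched as follows. This is a minimal illustration, not the implementation evaluated in the paper: it assumes the common variant in which the raw term count in a document is multiplied by log(N / df), and the function names `tf_idf` and `top_keywords` as well as the toy documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: score} dict per tokenized document.

    Assumed variant: raw term frequency times log(N / document frequency).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        scores.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return scores


def top_keywords(docs, k=3):
    """Pick the k highest-scoring terms per document (ties broken alphabetically)."""
    return [sorted(s, key=lambda t: (-s[t], t))[:k] for s in tf_idf(docs)]


# Toy corpus, invented for illustration only.
docs = [
    "keyword extraction selects relevant words from text".split(),
    "tags label resources on the web".split(),
    "relevant words from the corpus".split(),
]
print(top_keywords(docs, k=2))
```

Each term is scored independently of every other term, which is the limitation the abstract points out: no relation between words enters the score.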
DOCUMENT
Preprint submitted to Information Processing & Management

Tags are a convenient way to label resources on the web. An interesting question is whether one can determine the semantic meaning of tags in the absence of some predefined formal structure like a thesaurus. Many authors have used tag usage data to find their emergent semantics. Here, we argue that the semantics of tags can be captured by comparing the contexts in which tags appear. We give an approach to operationalizing this idea by defining what we call paradigmatic similarity: computing co-occurrence distributions of tags with tags in the same context, and comparing tags using information theoretic similarity measures of these distributions, mostly the Jensen-Shannon divergence. In experiments with three different tagged data collections we study its behavior and compare it to other distance measures. For some tasks, like terminology mapping or clustering, the paradigmatic similarity seems to give better results than similarity measures based on the co-occurrence of the documents or other resources that the tags are associated with. We argue that paradigmatic similarity is superior to other distance measures if agreement on topics (as opposed to style, register, language, etc.) is the most important criterion and the main differences between the tagged elements in the data set correspond to different topics.
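The idea of comparing tags through their contexts can be illustrated with a small sketch. It rests on simplifying assumptions that are not taken from the paper: each tag's context is the set of other tags on the same item, the co-occurrence distribution is a plain normalized count, and the Jensen-Shannon divergence is computed in bits. The function names and the sample tags are hypothetical.

```python
import math
from collections import Counter, defaultdict


def cooccurrence_distributions(tagged_items):
    """Map each tag to a probability distribution over the tags it co-occurs with.

    Assumption: the context of a tag is the set of other tags on the same item.
    """
    counts = defaultdict(Counter)
    for tags in tagged_items:
        unique = set(tags)
        for t in unique:
            for u in unique:
                if u != t:
                    counts[t][u] += 1
    dists = {}
    for t, c in counts.items():
        total = sum(c.values())
        dists[t] = {u: n / total for u, n in c.items()}
    return dists


def jensen_shannon(p, q):
    """Jensen-Shannon divergence in bits: 0 for identical, 1 for disjoint distributions."""
    div = 0.0
    for k in set(p) | set(q):
        pk, qk = p.get(k, 0.0), q.get(k, 0.0)
        mk = 0.5 * (pk + qk)  # mixture distribution at k
        if pk > 0:
            div += 0.5 * pk * math.log2(pk / mk)
        if qk > 0:
            div += 0.5 * qk * math.log2(qk / mk)
    return div


# Toy tagged items, invented for illustration only.
items = [
    ["nyc", "travel", "photo"],
    ["newyork", "travel", "photo"],
    ["python", "code"],
    ["java", "code"],
]
dists = cooccurrence_distributions(items)
print(jensen_shannon(dists["nyc"], dists["newyork"]))
print(jensen_shannon(dists["nyc"], dists["python"]))
```

In this toy data the tags `nyc` and `newyork` never appear on the same item, so a measure based on direct co-occurrence would not relate them, yet their context distributions coincide and the divergence between them is zero.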
DOCUMENT
This study explores how households interact with smart systems for energy usage, providing insights into the field's trends, themes and evolution through a bibliometric analysis of 547 relevant publications from 2015 to 2025. Our findings show: (1) Research activity has grown over the past decade, with leading journals recognizing several productive authors. Collaboration and interdisciplinary work are expected to expand; (2) Key research hotspots, identified through keyword co-occurrence, span two stages (exploration and development) and highlight the interplay between technological, economic, environmental, and behavioral factors within the field; (3) Future research should place greater emphasis on understanding how emerging technologies interact with humans, with a deeper understanding of users. Beyond the individual perspective, social dimensions also demand investigation. Finally, research should also aim to support policy development. To conclude, this study contributes to a broader perspective on this topic and highlights directions for future research.
MULTIFILE