The platform for open and practice-oriented research


An original template solution for FAIR scientific text mining

This method paper presents a template solution for text mining of scientific literature using the R tm package. Literature to be analyzed can be collected manually or automatically using the code provided with this paper. Once the literature is collected, the three steps for conducting text mining can be performed as outlined below:
• loading and cleaning of text from articles,
• processing, statistical analysis, and clustering, and
• presentation of results using generalized and tailor-made visualizations.
The text mining steps can be applied to a single, multiple, or time series groups of documents. References are provided to three published peer-reviewed articles that use the presented text mining methodology. The main advantages of our method are: (1) its suitability for both research and educational purposes, (2) its compliance with the Findable Accessible Interoperable and Reproducible (FAIR) principles, and (3) the availability of code and example data on GitHub under the open-source Apache V2 license.


Machine Vision and Social Media Images: Why Hashtags Matter

Studying images in social media poses specific methodological challenges, which in turn have directed scholarly attention toward the computational interpretation of visual data. When analyzing large numbers of images, both traditional content analysis and cultural analytics have proven valuable. However, these techniques do not take into account the contextualization of images within a socio-technical environment. As the meaning of social media images is co-created by online publics, bound through networked practices, these visuals should be analyzed on the level of their networked contextualization. Although machine vision is increasingly adept at recognizing faces and features, its performance in grasping the meaning of social media images remains limited. Combining automated analyses of images with platform data opens up the possibility to study images in the context of their resonance within and across online discursive spaces. This article explores the capacities of hashtags and retweet counts to complement the automated assessment of social media images, doing justice to both the visual elements of an image and the contextual elements encoded through the hashtag practices of networked publics.


Thesaurus based term ranking for keyword extraction

A common strategy to assign keywords to documents is to select the most appropriate words from the document text. One of the most important criteria for a word to be selected as keyword is its relevance for the text. The tf.idf score of a term is a widely used relevance measure. While easy to compute and giving quite satisfactory results, this measure does not take (semantic) relations between words into account.
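As an illustration of the tf.idf relevance measure mentioned in this abstract, the following is a minimal sketch in Python. It assumes one common variant (relative term frequency multiplied by the natural log of the inverse document frequency); the abstract does not say which variant the authors used, and the function name `tf_idf` and the toy documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: score} dict per tokenized document.

    Assumed variant: tf = count / document length,
    idf = ln(number of documents / document frequency).
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        scores.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores


# Toy corpus, invented for illustration only.
docs = [
    "keyword extraction selects relevant words from the text".split(),
    "the thesaurus relates words to other words".split(),
    "the corpus contains many abstracts".split(),
]
scores = tf_idf(docs)
# Rank the terms of the first document by descending score.
ranked = sorted(scores[0], key=scores[0].get, reverse=True)
```

A term that occurs in every document (here "the") scores zero, while a term unique to one document scores highest. Note that nothing in this computation looks at relations between words, which is the limitation the abstract points out.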


IknowWhatThisIs.

This publication gives an account of the Public Annotation of Cultural Heritage research project (PACE) conducted at the Crossmedialab. The project was carried out between 1 January 2008 and 31 December 2009, and was funded by the Ministry of Education, Culture, and Science. Three members of the Dutch Association of Science Centres (Vereniging Science Centra) actively participated in the execution of the project: the Utrecht University Museum, the National Museum of Natural History (Naturalis), and Museon. In addition, two more knowledge institutes participated: Novay and the Utrecht University of Applied Sciences. BMC Consultancy and Management also took part in the project. This broad consortium has enabled us to base the project on both knowledge and experience from a practical and scientific perspective. The purpose of the PACE project was to examine the ways in which social tagging could be deployed as a tool to enrich collections, improve their accessibility and to increase visitor group involvement. The museums’ guiding question for the project was: ‘When is it useful to deploy social tagging as a tool for the benefit of museums and what kind of effect can be expected from such deployment?’ For the Crossmedialab the PACE project presented a unique opportunity to conduct concrete research into the highly interesting phenomenon of social tagging with parties and experts in the field.


Learning to use space: A study into the SL2 acquisition process of adult learners of Sign Language of the Netherlands

The aim of this dissertation is to examine how adult learners with a spoken language background who are acquiring a signed language learn how to use the space in front of the body to express grammatical and topographical relations. Moreover, it aims at investigating the effectiveness of different types of instruction, in particular instruction that focuses the learner's attention on the agreement verb paradigm. To that end, existing data from a learner corpus (Boers-Visker, Hammer, Deijn, Kielstra & Van den Bogaerde, 2016) were analyzed, and two novel experimental studies were designed and carried out. These studies are described in detail in Chapters 3–6. Each chapter has been submitted to a scientific journal, and accordingly, can be read independently. Yet, the order of the chapters follows the chronological order in which the studies were carried out, and the reader will notice that each study served as a basis to inform the next study. As such, some overlap in the sections describing the theoretical background of each study was unavoidable.


Keyword extraction using co-occurrence.

A common strategy to assign keywords to documents is to select the most appropriate words from the document text. One of the most important criteria for a word to be selected as keyword is its relevance for the text. The tf.idf score of a term is a widely used relevance measure. While easy to compute and giving quite satisfactory results, this measure does not take (semantic) relations between words into account. In this paper we study some alternative relevance measures that do use relations between words. They are computed by defining co-occurrence distributions for words and comparing these distributions with the document and the corpus distribution. We then evaluate keyword extraction algorithms defined by selecting different relevance measures. For two corpora of abstracts with manually assigned keywords, we compare manually extracted keywords with different automatically extracted ones. The results show that using word co-occurrence information can improve precision and recall over tf.idf.
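The idea of comparing a word's co-occurrence distribution with the document distribution can be sketched roughly as follows. This is only an illustration under stated assumptions: the abstract does not specify how co-occurrence is counted or how distributions are compared, so sentence-level co-occurrence and the Jensen-Shannon divergence are choices made here, not the authors' method. All function names and the toy sentences are hypothetical.

```python
import math
from collections import Counter


def distribution(counts):
    """Normalize a Counter of word counts into a probability distribution."""
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def cooccurrence_distribution(word, sentences):
    """Distribution of the words sharing a sentence with `word`.

    Assumption: co-occurrence is counted at sentence level.
    """
    counts = Counter()
    for sentence in sentences:
        if word in sentence:
            counts.update(t for t in sentence if t != word)
    return distribution(counts)


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2, range 0 to 1) of two distributions.

    Assumption: chosen for illustration; the abstract names no measure.
    """
    vocab = set(p) | set(q)
    mix = {w: 0.5 * (p.get(w, 0.0) + q.get(w, 0.0)) for w in vocab}

    def kl_to_mix(a):
        return sum(a[w] * math.log2(a[w] / mix[w]) for w in a if a[w] > 0)

    return 0.5 * kl_to_mix(p) + 0.5 * kl_to_mix(q)


# Toy document, invented for illustration only.
sentences = [
    ["tagging", "enriches", "museum", "collections"],
    ["museum", "visitors", "use", "tagging"],
    ["visitors", "like", "museum", "collections"],
]
doc_dist = distribution(Counter(t for s in sentences for t in s))
co_dist = cooccurrence_distribution("tagging", sentences)
divergence = js_divergence(co_dist, doc_dist)
```

How such a divergence is turned into a relevance score, and how the corpus distribution enters the comparison, is left open here because the abstract does not describe it.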


Technologies for a content and language integrated approach to dropout problems in Higher Education

This paper reports on CATS (2006-2007), a project initiated by the Research Centre Teaching in Multicultural Schools, that addresses language-related dropout problems of both native and non-native speakers of Dutch in higher education. The project's main objective is to develop a model for the redesign of the curriculum so as to optimize the development of academic and professional language skills. Key pedagogic strategies are the raising of awareness of personal proficiency levels through diagnostic testing, definition of linguistic demands of curriculum tasks, empowerment of student autonomy and peer feedback procedures. More specifically, this paper deals with two key areas of the project. First, it describes the design and development of web-based corpus software tools, aimed at the enhancement of the autonomy of students' academic reading and writing skills. Secondly, it describes the design of three pilots, in which the process of a content and language integrated approach - facilitated by the developed web tools - was applied, and these pilots' respective evaluations. The paper concludes with a reflection on the project development and the experiences with the pilot implementations.


Automatic categorization of self-acknowledged limitations in randomized controlled trial publications

Objective: Acknowledging study limitations in a scientific publication is a crucial element in scientific transparency and progress. However, limitation reporting is often inadequate. Natural language processing (NLP) methods could support automated reporting checks, improving research transparency. In this study, our objective was to develop a dataset and NLP methods to detect and categorize self-acknowledged limitations (e.g., sample size, blinding) reported in randomized controlled trial (RCT) publications. Methods: We created a data model of limitation types in RCT studies and annotated a corpus of 200 full-text RCT publications using this data model. We fine-tuned BERT-based sentence classification models to recognize the limitation sentences and their types. To address the small size of the annotated corpus, we experimented with data augmentation approaches, including Easy Data Augmentation (EDA) and Prompt-Based Data Augmentation (PromDA). We applied the best-performing model to a set of about 12K RCT publications to characterize self-acknowledged limitations at larger scale. Results: Our data model consists of 15 categories and 24 sub-categories (e.g., Population and its sub-category DiagnosticCriteria). We annotated 1090 instances of limitation types in 952 sentences (4.8 limitation sentences and 5.5 limitation types per article). A fine-tuned PubMedBERT model for limitation sentence classification improved upon our earlier model by about 1.5 absolute percentage points in F1 score (0.821 vs. 0.8) with statistical significance. Our best-performing limitation type classification model, PubMedBERT fine-tuning with PromDA (Output View), achieved an F1 score of 0.7, improving upon the vanilla PubMedBERT model by 2.7 percentage points, with statistical significance. Conclusion: The model could support automated screening tools which can be used by journals to draw the authors’ attention to reporting issues.
Automatic extraction of limitations from RCT publications could benefit peer review and evidence synthesis, and support advanced methods to search and aggregate the evidence from the clinical trial literature.


Toward assessing clinical trial publications for reporting transparency

Objective: To annotate a corpus of randomized controlled trial (RCT) publications with the checklist items of CONSORT reporting guidelines and to use the corpus to develop text mining methods for RCT appraisal. Methods: We annotated a corpus of 50 RCT articles at the sentence level using 37 fine-grained CONSORT checklist items. A subset (31 articles) was double-annotated and adjudicated, while 19 were annotated by a single annotator and reconciled by another. We calculated inter-annotator agreement at the article and section level using MASI (Measuring Agreement on Set-Valued Items) and at the CONSORT item level using Krippendorff's α. We experimented with two rule-based methods (phrase-based and section header-based) and two supervised learning approaches (support vector machine and BioBERT-based neural network classifiers), for recognizing 17 methodology-related items in the RCT Methods sections. Results: We created CONSORT-TM consisting of 10,709 sentences, 4,845 (45%) of which were annotated with 5,246 labels. A median of 28 CONSORT items (out of possible 37) were annotated per article. Agreement was moderate at the article and section levels (average MASI: 0.60 and 0.64, respectively). Agreement varied considerably among individual checklist items (Krippendorff's α = 0.06–0.96). The model based on BioBERT performed best overall for recognizing methodology-related items (micro-precision: 0.82, micro-recall: 0.63, micro-F1: 0.71). Combining models using majority vote and label aggregation further improved precision and recall, respectively. Conclusion: Our annotated corpus, CONSORT-TM, contains more fine-grained information than earlier RCT corpora. Low frequency of some CONSORT items made it difficult to train effective text mining models to recognize them. For the items commonly reported, CONSORT-TM can serve as a testbed for text mining methods that assess RCT transparency, rigor, and reliability, and support methods for peer review and authoring assistance.
Minor modifications to the annotation scheme and a larger corpus could facilitate improved text mining models. CONSORT-TM is publicly available at https://github.com/kilicogluh/CONSORT-TM.
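The micro-averaged scores reported in this abstract can be related to one another with a short sketch. It assumes the standard definitions for multi-label sentence classification (true positives, false positives, and false negatives pooled over all sentences and labels before computing precision, recall, and their harmonic mean). The function names, the label strings, and the toy annotations are hypothetical and not taken from CONSORT-TM.

```python
def micro_prf(gold, predicted):
    """Micro-averaged precision, recall, and F1 for multi-label data.

    `gold` and `predicted` are equal-length lists of label sets,
    one set per sentence. Counts are pooled before averaging.
    """
    tp = sum(len(g & p) for g, p in zip(gold, predicted))
    fp = sum(len(p - g) for g, p in zip(gold, predicted))
    fn = sum(len(g - p) for g, p in zip(gold, predicted))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f1_score(precision, recall)


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy annotations with made-up item labels, for illustration only.
gold = [{"3a"}, {"4a", "4b"}, set(), {"7a"}]
predicted = [{"3a"}, {"4a"}, {"5"}, set()]
precision, recall, f1 = micro_prf(gold, predicted)

# Consistency check on the figures quoted in the abstract.
reported_f1 = round(f1_score(0.82, 0.63), 2)
```

Plugging the reported micro-precision (0.82) and micro-recall (0.63) into the harmonic mean gives 2 × 0.82 × 0.63 / 1.45 ≈ 0.71, which agrees with the reported micro-F1.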


Search results

Products 38
