With the proliferation of misinformation on the web, automatic methods for detecting misinformation are becoming an increasingly important subject of study. If automatic misinformation detection is applied in a real-world setting, it is necessary to validate the methods being used. Large language models (LLMs) have produced the best results among text-based methods. However, fine-tuning such a model requires a significant amount of training data, which has led to the automatic creation of large-scale misinformation detection datasets. In this paper, we explore the biases present in one such dataset for misinformation detection in English, NELA-GT-2019. We find that models are at least partly learning the stylistic and other features of different news sources rather than the features of unreliable news. Furthermore, we use SHAP to interpret the outputs of a fine-tuned LLM and validate the explanation method using our inherently interpretable baseline. We critically analyze the suitability of SHAP for text applications by comparing the outputs of SHAP to the most important features from our logistic regression models.
With the proliferation of misinformation on the web, automatic misinformation detection methods are becoming an increasingly important subject of study. Large language models have produced the best results among content-based methods, which rely on the text of the article rather than the metadata or network features. However, fine-tuning such a model requires a significant amount of training data, which has led to the automatic creation of large-scale misinformation detection datasets. In these datasets, articles are not labelled directly. Rather, each news site is labelled for reliability by an established fact-checking organisation, and every article is subsequently assigned the label that corresponds to the reliability score of its news source. A recent paper has explored the biases present in one such dataset, NELA-GT-2018, and shown that the models are at least partly learning the stylistic and other features of different news sources rather than the features of unreliable news. We confirm a part of their findings. Apart from studying the characteristics and potential biases of the datasets, we also find it important to examine in what way the model architecture influences the results. We therefore explore which text features or combinations of features are learned by models based on contextual word embeddings as opposed to basic bag-of-words models. To elucidate this, we perform extensive error analysis aided by the SHAP post-hoc explanation technique on a debiased portion of the dataset. We validate the explanation technique on our inherently interpretable baseline model.
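The source-level labelling described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the dataset's actual pipeline: the source names, the label values, and the helper `label_articles` are all hypothetical.

```python
# Hypothetical source-level reliability labels (0 = reliable, 1 = unreliable).
SOURCE_LABELS = {"example-daily": 0, "example-rumours": 1}


def label_articles(articles, source_labels):
    """Give each article the label of its source; skip unlabelled sources."""
    labelled = []
    for article in articles:
        label = source_labels.get(article["source"])
        if label is None:
            continue  # the source was never rated, so the article gets no label
        labelled.append({**article, "label": label})
    return labelled


# Hypothetical articles, invented for illustration only.
articles = [
    {"source": "example-daily", "text": "Council approves budget."},
    {"source": "example-rumours", "text": "Shocking secret revealed!"},
    {"source": "unrated-blog", "text": "My weekend notes."},
]
labelled = label_articles(articles, SOURCE_LABELS)
```

Because every article from a source shares one label, a classifier can score well by recognising the source rather than the reliability of the content, which is the kind of bias the abstract discusses.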
Artificial Intelligence (AI) offers organizations unprecedented opportunities. However, one of the risks of using AI is that its outcomes and inner workings are not intelligible. In industries where trust is critical, such as healthcare and finance, explainable AI (XAI) is a necessity. However, the implementation of XAI is not straightforward, as it requires addressing both technical and social aspects. Previous studies on XAI primarily focused on either technical or social aspects and lacked a practical perspective. This study aims to empirically examine the XAI-related aspects faced by developers, users, and managers of AI systems during the development process of the AI system. To this end, a multiple case study was conducted in two Dutch financial services companies using four use cases. Our findings reveal a wide range of aspects that must be considered during XAI implementation, which we grouped and integrated into a conceptual model. This model helps practitioners to make informed decisions when developing XAI. We argue that the diversity of aspects to consider necessitates an XAI “by design” approach, especially in high-risk use cases in industries where the stakes are high, such as finance, public services, and healthcare. As such, the conceptual model offers a taxonomy for method engineering of XAI-related methods, techniques, and tools.
Multilevel models (MLMs) are increasingly deployed in industry across different functions. Applications usually result in binary classification within groups or hierarchies based on a set of input features. For transparent and ethical applications of such models, sound audit frameworks need to be developed. In this paper, an audit framework for technical assessment of regression MLMs is proposed. The focus is on three aspects: model, discrimination, and transparency & explainability. These aspects are subsequently divided into sub-aspects. Contributors, such as inter MLM-group fairness, feature contribution order, and aggregated feature contribution, are identified for each of these sub-aspects. To measure the performance of the contributors, the framework proposes a shortlist of KPIs, including intergroup individual fairness (DiffInd_MLM) across MLM-groups, probability unexplained (PUX), and percentage of incorrect feature signs (POIFS). A traffic light risk assessment method is furthermore coupled to these KPIs. For assessing transparency & explainability, different explainability methods (SHAP and LIME) are used, which are compared with a model intrinsic method using quantitative methods and machine learning modelling. Using an open-source dataset, a model is trained and tested and the KPIs are computed. It is demonstrated that popular explainability methods, such as SHAP and LIME, underperform in accuracy when interpreting these models. They fail to predict the order of feature importance, the magnitudes, and occasionally even the nature of the feature contribution (negative versus positive contribution on the outcome). For other contributors, such as group fairness and their associated KPIs, similar analysis and calculations have been performed with the aim of adding depth to the proposed audit framework.
The framework is expected to assist regulatory bodies in performing conformity assessments of AI systems using multilevel binomial classification models at businesses. It will also benefit providers, users, and assessment bodies, as defined in the European Commission’s proposed Regulation on Artificial Intelligence, when deploying AI-systems such as MLMs, to be future-proof and aligned with the regulation.
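The comparison between a post-hoc explainer and a model-intrinsic method can be illustrated with a small sketch. The abstract does not give the exact KPI definitions, so the functions below are only one plausible reading of "percentage of incorrect feature signs" and of feature-importance order; the function names, feature names, and values are hypothetical.

```python
def sign(x):
    """Return -1, 0, or 1 depending on the sign of x."""
    return (x > 0) - (x < 0)


def percentage_incorrect_signs(reference, explained):
    """Percentage of shared features whose explained sign differs from the reference sign."""
    shared = [f for f in reference if f in explained]
    if not shared:
        raise ValueError("no shared features to compare")
    wrong = sum(1 for f in shared if sign(reference[f]) != sign(explained[f]))
    return 100.0 * wrong / len(shared)


def importance_order(contributions):
    """Feature names sorted from largest to smallest absolute contribution."""
    return sorted(contributions, key=lambda f: abs(contributions[f]), reverse=True)


# Hypothetical contributions: model-intrinsic values vs. a post-hoc explainer's output.
intrinsic = {"age": 0.8, "income": -0.5, "tenure": 0.1, "region": -0.2}
post_hoc = {"age": 0.6, "income": 0.3, "tenure": 0.2, "region": -0.1}

poifs = percentage_incorrect_signs(intrinsic, post_hoc)
same_order = importance_order(intrinsic) == importance_order(post_hoc)
```

In this toy case the explainer gets one sign wrong and ranks two features differently, which mirrors the two kinds of disagreement the abstract reports.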
INTRODUCTION: Physical Activity (PA) is essential for enhancing the physical function of pre-frail and frail older adults. However, among this group, PA levels vary significantly. Identifying the factors contributing to these differences could support tailored PA interventions. This study aims to examine factors associated with physical activity levels among pre-frail and frail older adults in rural China. METHODS: This is a cross-sectional study. A total of 284 (pre)frail older adults (aged ≥60 years) were included from ten rural healthcare centers in Northeast China. Participants were categorized into low-moderate and high physical activity groups, as assessed using the Short Form International Physical Activity Questionnaire. Four-dimensional data were collected, including demographics, health behaviors, objective physical performance measures, and self-reported perceived health profiles. Extreme Gradient Boosting (XGBoost), a machine learning algorithm, was employed for binary classification (low-moderate vs. high physical activity). Model performance was assessed using the area under the receiver operating characteristic curve (AUC), accuracy, sensitivity, specificity, precision, and F1-score. To enhance interpretability, SHapley Additive exPlanations (SHAP) were utilized to identify key predictive variables. RESULTS: The mean age of participants was 70 years (59% female, 86% farmers). The low-moderate group averaged 1,187 MET/week, while the high physical activity group reached 8,162 MET/week. Physical performance tests showed significantly better scores in the high PA group. The XGBoost model achieved 82.4% accuracy (AUC: 0.769, specificity: 90%, sensitivity: 63%).
SHAP analysis revealed that self-reported social support, general health, ambulation, and physical performance measures were the most important factors. CONCLUSION: The high physical activity group demonstrated better physical function than the low-moderate physical activity group, although both groups showed poorer physical function compared to the general older population. Self-reported health perceptions and social support significantly correlated with physical activity levels. Addressing these factors through targeted interventions, including community-based social support programs and structured mobility-enhancing exercises, may contribute to improved health outcomes and enhanced quality of life in this population.
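The threshold-based metrics named in the methods (accuracy, sensitivity, specificity, precision, F1-score) all follow from the four confusion-matrix counts. The sketch below shows that relationship; the helper `binary_metrics` and the toy labels are hypothetical and are not the study's data or code.

```python
def binary_metrics(y_true, y_pred):
    """Compute common binary classification metrics from 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    def ratio(num, den):
        # Return 0.0 instead of dividing by zero when a class is absent.
        return num / den if den else 0.0

    sensitivity = ratio(tp, tp + fn)  # recall on the positive class
    precision = ratio(tp, tp + fp)
    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "sensitivity": sensitivity,
        "specificity": ratio(tn, tn + fp),
        "precision": precision,
        "f1": ratio(2 * precision * sensitivity, precision + sensitivity),
    }


# Toy labels, invented for illustration (1 = high physical activity).
metrics = binary_metrics([1, 1, 1, 0, 0, 0, 0, 0], [1, 1, 0, 0, 0, 0, 0, 1])
```

When the positive class is the smaller one, a high specificity paired with a lower sensitivity can still yield a high overall accuracy, which is why the metrics are best read together rather than individually.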
This white paper is the result of a research project by Hogeschool Utrecht, Floryn, Researchable, and De Volksbank in the period November 2021-November 2022. The research project was a KIEM project granted by the Taskforce for Applied Research SIA. The goal of the research project was to identify the aspects that play a role in the implementation of the explainability of artificial intelligence (AI) systems in the Dutch financial sector. In this white paper, we present a checklist of the aspects that we derived from this research. The checklist contains checkpoints and related questions that need consideration to make explainability-related choices in different stages of the AI lifecycle. The goal of the checklist is to give designers and developers of AI systems a tool to ensure the AI system will give proper and meaningful explanations to each stakeholder.
In this post I give an overview of the theory, tools, frameworks and best practices I have found until now around the testing (and debugging) of machine learning applications. I will start by giving an overview of the specificities of testing machine learning applications.
Multilevel models using logistic regression (MLogRM) and random forest models (RFM) are increasingly deployed in industry for the purpose of binary classification. The European Commission’s proposed Artificial Intelligence Act (AIA) necessitates, under certain conditions, that application of such models is fair, transparent, and ethical, which consequently implies technical assessment of these models. This paper proposes and demonstrates an audit framework for technical assessment of RFMs and MLogRMs by focussing on model-, discrimination-, and transparency & explainability-related aspects. To measure these aspects, 20 KPIs are proposed, which are paired with a traffic light risk assessment method. An open-source dataset is used to train an RFM and an MLogRM, and these KPIs are computed and compared with the traffic lights. The performance of popular explainability methods, such as kernel- and tree-SHAP, is assessed. The framework is expected to assist regulatory bodies in performing conformity assessments of binary classifiers and also benefits providers and users deploying such AI-systems to comply with the AIA.
In this paper, we explore the design of web-based advice robots to enhance users' confidence in acting upon the provided advice. Drawing from research on algorithm acceptance and explainable AI, we hypothesise four design principles that may encourage interactivity and exploration, thus fostering users' confidence to act. Through a value-oriented prototype experiment and value-oriented semi-structured interviews, we tested these principles, confirming three of them and identifying an additional principle. The four resulting principles: (1) put context questions and resulting advice on one page and allow live, iterative exploration, (2) use action or change oriented questions to adjust the input parameters, (3) actively offer alternative scenarios based on counterfactuals, and (4) show all options instead of only the recommended one(s), appear to contribute to the values of agency and trust. Our study integrates the Design Science Research approach with a Value Sensitive Design approach.
Explainable Artificial Intelligence (XAI) aims to provide insights into the inner workings and the outputs of AI systems. Recently, there’s been growing recognition that explainability is inherently human-centric, tied to how people perceive explanations. Despite this, there is no consensus in the research community on whether user evaluation is crucial in XAI, and if so, what exactly needs to be evaluated and how. This systematic literature review addresses this gap by providing a detailed overview of the current state of affairs in human-centered XAI evaluation. We reviewed 73 papers across various domains where XAI was evaluated with users. These studies assessed what makes an explanation “good” from a user’s perspective, i.e., what makes an explanation meaningful to a user of an AI system. We identified 30 components of meaningful explanations that were evaluated in the reviewed papers and categorized them into a taxonomy of human-centered XAI evaluation, based on: (a) the contextualized quality of the explanation, (b) the contribution of the explanation to human-AI interaction, and (c) the contribution of the explanation to human-AI performance. Our analysis also revealed a lack of standardization in the methodologies applied in XAI user studies, with only 19 of the 73 papers applying an evaluation framework used by at least one other study in the sample. These inconsistencies hinder cross-study comparisons and broader insights. Our findings contribute to understanding what makes explanations meaningful to users and how to measure this, guiding the XAI community toward a more unified approach in human-centered explainability.