Surveys face rising data collection costs and declining budgets. Over the past years, many surveys at Statistics Netherlands were redesigned to reduce costs and to increase or maintain response rates. From 2018 onwards, adaptive survey design has been applied in several social surveys to produce more accurate statistics within the same budget. In previous years, research was conducted into the effect on quality and costs of reducing the use of interviewers in mixed-mode surveys that start with internet observation, followed by telephone or face-to-face observation of internet nonrespondents. Follow-ups can be reduced in different ways. By using stratified selection of people eligible for follow-up, nonresponse bias may be reduced. The main decisions to be made are how to divide the population into strata and how to compute the allocation probabilities for face-to-face and telephone observation in the different strata. Currently, adaptive survey design is an option in redesigns of social surveys at Statistics Netherlands. In 2018 it was implemented in the Health Survey and the Public Opinion Survey, in 2019 in the Life Style Monitor and the Leisure Omnibus, in 2021 in the Labour Force Survey, and in 2022 it is planned for the Social Coherence Survey. This paper elaborates on the development of the adaptive survey design for the Labour Force Survey. Attention is paid to the survey design, in particular the sampling design, the data collection constraints, the choice of strata for the adaptive design, the calculation of follow-up fractions by mode of observation and stratum, the practical implementation of the adaptive design, and the six-month parallel design with the corresponding response results.
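The allocation step described above (dividing the population into strata and computing follow-up probabilities per stratum) can be sketched as follows. This is an illustration only, not the paper's actual optimisation: the strata, response rates, costs, target propensity and budget are all invented, and the real design allocates over two follow-up modes rather than one.

```python
# Hypothetical sketch: give internet nonrespondents in each stratum a
# follow-up probability that pushes every stratum towards the same final
# response propensity, then scale the probabilities down to fit the budget.

strata = {
    # name: (population share, web response rate,
    #        follow-up response rate, cost per follow-up case)
    "young_urban": (0.30, 0.25, 0.45, 60.0),
    "middle_aged": (0.45, 0.40, 0.50, 55.0),
    "elderly":     (0.25, 0.35, 0.60, 70.0),
}
budget_per_capita = 20.0   # hypothetical follow-up budget per sampled person
target = 0.55              # hypothetical target final response propensity

fracs = {}
for name, (share, web, follow, cost) in strata.items():
    # fraction of web nonrespondents needed to lift the stratum's
    # propensity from `web` to `target`: web + (1 - web) * frac * follow = target
    need = (target - web) / ((1.0 - web) * follow)
    fracs[name] = min(max(need, 0.0), 1.0)

# expected follow-up cost per sampled person; scale down if over budget
spent = sum(share * (1.0 - web) * fracs[n] * cost
            for n, (share, web, follow, cost) in strata.items())
if spent > budget_per_capita:
    fracs = {n: f * budget_per_capita / spent for n, f in fracs.items()}

for name, (share, web, follow, cost) in strata.items():
    final = web + (1.0 - web) * fracs[name] * follow
    print(f"{name}: follow-up fraction {fracs[name]:.2f}, "
          f"expected final propensity {final:.2f}")
```

Strata with a low web response rate receive a higher follow-up probability, which is how stratified follow-up selection can reduce nonresponse bias within a fixed budget.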
Completeness of data is vital for decision making and forecasting in Building Management Systems (BMS), as missing data can result in biased decision making down the line. This study creates a guideline for imputing the gaps in BMS datasets by comparing four methods: K-Nearest Neighbours (KNN), Recurrent Neural Network (RNN), Hot Deck (HD) and Last Observation Carried Forward (LOCF). The guideline contains the best method per gap size and scale of measurement. The four selected methods come from various backgrounds and are tested on a real BMS and meteorological dataset. The focus of this paper is not to impute every cell as accurately as possible but to impute trends back into the missing data. Performance is characterised by a set of criteria that allows the user to choose the imputation method best suited to their needs. The criteria are Variance Error (VE) and Root Mean Squared Error (RMSE). VE was given more weight, as it evaluates the imputed trend better than RMSE does. From preliminary results, it was concluded that the best K-values for KNN are 5 for the smallest gap and 100 for the larger gaps. Using a genetic algorithm, the best RNN architecture for the purpose of this paper was determined to be Gated Recurrent Units (GRU). The comparison was performed using a training dataset different from the imputation dataset. The results show no consistent link between differences in kurtosis or skewness and imputation performance. The experiment concluded that RNN is best for interval data and HD is best for both nominal and ratio data. No single method was best for all gap sizes; performance depended on the data to be imputed.
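Two of the four compared methods, LOCF and KNN, can be sketched on synthetic data together with the paper's two criteria. The series, gap position and the exact form of the Variance Error (taken here as the absolute difference between the variance of the imputed gap and that of the true values) are assumptions for illustration, not the study's dataset or definitions.

```python
# Minimal sketch: impute one contiguous gap with LOCF and with KNN (k=5,
# using a correlated meteorological series), then score with RMSE and a
# simple Variance Error. All data below are synthetic.
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

rng = np.random.default_rng(0)
n = 200
true = pd.Series(20 + 3 * np.sin(np.linspace(0, 6 * np.pi, n))
                 + rng.normal(0, 0.3, n))        # e.g. a room temperature
outdoor = 0.5 * true + rng.normal(0, 0.2, n)     # correlated weather series

observed = true.copy()
gap = np.arange(80, 110)                         # one contiguous gap
observed.iloc[gap] = np.nan

# LOCF: carry the last observed value forward across the gap
locf = observed.ffill()

# KNN: impute each missing cell from the 5 rows with the closest outdoor value
X = np.column_stack([observed.to_numpy(), outdoor.to_numpy()])
knn = pd.Series(KNNImputer(n_neighbors=5).fit_transform(X)[:, 0])

def scores(imputed):
    err = imputed.iloc[gap].to_numpy() - true.iloc[gap].to_numpy()
    rmse = float(np.sqrt((err ** 2).mean()))
    ve = float(abs(imputed.iloc[gap].var() - true.iloc[gap].var()))
    return rmse, ve

for name, series in [("LOCF", locf), ("KNN", knn)]:
    rmse, ve = scores(series)
    print(f"{name}: RMSE={rmse:.3f}  VE={ve:.3f}")
```

Note how LOCF flattens the gap to a constant, so its VE equals the full variance of the true segment; this is the kind of trend loss the paper's weighting of VE over RMSE is designed to penalise.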
With the proliferation of misinformation on the web, automatic misinformation detection methods are becoming an increasingly important subject of study. Large language models have produced the best results among content-based methods, which rely on the text of the article rather than on metadata or network features. However, fine-tuning such a model requires significant training data, which has led to the automatic creation of large-scale misinformation detection datasets. In these datasets, articles are not labelled directly. Rather, each news site is labelled for reliability by an established fact-checking organisation, and every article is subsequently assigned the corresponding label based on the reliability score of the news source in question. A recent paper has explored the biases present in one such dataset, NELA-GT-2018, and shown that the models are at least partly learning the stylistic and other features of different news sources rather than the features of unreliable news. We confirm part of their findings. Apart from studying the characteristics and potential biases of the datasets, we also find it important to examine in what way the model architecture influences the results. We therefore explore which text features or combinations of features are learned by models based on contextual word embeddings as opposed to basic bag-of-words models. To elucidate this, we perform extensive error analysis aided by the SHAP post-hoc explanation technique on a debiased portion of the dataset. We validate the explanation technique on our inherently interpretable baseline model.
Receiving the first “Rijbewijs” (driving licence) is always an exciting moment for any teenager, but it also comes with considerable risks. In the Netherlands, the fatality rate of young novice drivers is five times higher than that of drivers between the ages of 30 and 59 years. These risks stem mainly from age-related factors and a lack of experience, which manifest in inadequate higher-order skills required for hazard perception and for successful interventions in response to risks on the road. Although risk assessment and driving attitude are included in the driver training and examination process, accident statistics show that this has only limited influence on developmental factors such as attitudes, motivations, lifestyles, self-assessment and risk acceptance, which play a significant role in post-licensing driving. This negatively impacts traffic safety. “How could novice drivers receive critical feedback on their driving behaviour and traffic safety?” is, therefore, an important question. Due to major advancements in domains such as ICT, sensors, big data, and Artificial Intelligence (AI), in-vehicle data is being extensively used for monitoring driver behaviour, driving style identification and driver modelling. However, the use of such techniques in pre-licence driver training and assessment has not been extensively explored. EIDETIC aims at developing a novel approach by fusing multiple data sources such as in-vehicle sensors and data (to trace the vehicle trajectory), eye-tracking glasses (to monitor viewing behaviour) and cameras (to monitor the surroundings) to provide quantifiable and understandable feedback to novice drivers. Furthermore, this new knowledge could also support driving instructors and examiners in ensuring safe driving. This project will also generate the knowledge needed to serve as a foundation for the transition to training and assessment for drivers of automated vehicles.
Moderating reader comments under news articles is highly labour-intensive. Artificial intelligence makes moderation possible at a reasonable cost. Since every application of artificial intelligence must be fair and transparent, it is important to investigate how media can meet these requirements.

Aim: This PhD project focuses on the fairness, accountability and transparency of algorithmic systems for moderating reader comments. It offers a theoretical framework and actionable measures that will support news organisations in complying with recent policy-making on a value-driven implementation of AI. Now that more and more news media are starting to use AI, they must incorporate fairness, accountability and transparency in their use of algorithms into their working practices.

Results: Although moderation with AI is very attractive from an economic point of view, news media need to know how to reduce inaccuracy and bias (fairness), disclose how their AI works (accountability), and enable users to understand how decisions are made with AI (transparency). This dissertation advances knowledge on these topics.

Duration: 1 February 2022 - 1 February 2025

Approach: The central research question of this PhD research is: how can and should news media ensure fairness, accountability and transparency in their use of algorithms for comment moderation? To answer this question, the research is split into four sub-questions. How do news media use algorithms to moderate comments? What can news media do to reduce inaccuracy and bias when moderating comments with AI? What must news media disclose about their use of AI moderation? What makes explanations of AI moderation understandable to users with different levels of digital competence?
This project researches risk perceptions about data, technology, and digital transformation in society, and how to build trust between organisations and users to ensure sustainable data ecologies. The aim is to understand the user's role in a tech-driven environment and her perception of the resulting relationships with organisations that offer data-driven services and products. The discourse on digital transformation is productive but does not truly address users' attitudes and awareness (Kitchin 2014). Companies are insufficiently aware of the potential accidents and resulting loss of trust that undermine data ecologies and consequently forfeit their beneficial potential. Facebook's Cambridge Analytica scandal, for instance, led to 42% of US adults deleting their accounts and the company losing billions. Social, political, and economic interactions are increasingly digitalised, which brings tangible benefits but also challenges privacy, individual well-being and a fair society. User awareness of organisational practices is of heightened importance, as vulnerabilities for users equal vulnerabilities for data ecologies. Without transparency and a new “social contract” for a digital society, problems are inevitable. Recurring scandals about data leaks and biased algorithms are just two examples that illustrate the urgency of this research. Properly informing users about an organisation's data policies makes a crucial difference (Accenture 2018), and, to develop sustainable business models, organisations need to understand what users expect and how to communicate with them. This research project tackles this issue head-on. First, a deeper understanding of users' risk perception is needed to formulate concrete policy recommendations aimed at educating users and building trust. Second, insights about users' perceptions will inform guidelines.
Through empirical research on framing in the data discourse, user types, and trends in organisational practice, the project develops concrete advice, for users and practitioners alike, on building sustainable relationships in a resilient digital society.