Research studies and recruitment processes often rely on psychometric instruments to profile respondents with regard to their ethical orientation. Completing such questionnaires can be tedious and is prone to self-presentation bias. Noting how video games often expose players to complex plots, filled with dilemmas and morally dubious options, the opportunity emerges to evaluate players’ moral orientation by analysing their in-game behaviour. To explore the feasibility of such an approach, we examine how users’ moral judgment correlates with the choices they make in non-linear narratives, frequently present in video games. An interactive narrative presenting several moral dilemmas was created. An initial user study (N = 80) revealed only weak correlations between the users’ choices and their ethical inclinations on all ethical scales. However, by training a genetic algorithm on this dataset to quantify the influence of each branch on recognising moral inclination, we found a strong positive correlation between choice behaviour and self-reported ethical inclinations in a second, independent group of participants (N = 20). The contribution of this work is to demonstrate how genetic algorithms can be applied in interactive stories to profile users’ ethical stance.
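A minimal sketch of how a genetic algorithm could weight narrative branches in this way is given below. The participant data, number of branches, and GA settings are all illustrative assumptions, not the study's actual implementation; the idea is only to show one weight per branch being evolved so that weighted choice scores correlate with self-reported ethics scores.

```python
# Sketch (illustrative, not the authors' code): evolve one weight per narrative
# branch so that the weighted sum of a player's choices correlates with their
# self-reported ethical score. Data below is random placeholder data.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: rows = participants, columns = branches (1 = branch taken).
choices_train = rng.integers(0, 2, size=(80, 12)).astype(float)   # initial study, N = 80
ethics_train  = rng.normal(size=80)                                # questionnaire scores
choices_test  = rng.integers(0, 2, size=(20, 12)).astype(float)   # validation group, N = 20
ethics_test   = rng.normal(size=20)

def fitness(weights, X, y):
    """Pearson correlation between weighted choice scores and questionnaire scores."""
    scores = X @ weights
    if scores.std() == 0:
        return -1.0
    return np.corrcoef(scores, y)[0, 1]

def evolve(X, y, pop_size=100, generations=200, mutation_sd=0.1, elite=10):
    pop = rng.normal(size=(pop_size, X.shape[1]))
    for _ in range(generations):
        fit = np.array([fitness(w, X, y) for w in pop])
        parents = pop[np.argsort(fit)[::-1][:elite]]
        # Offspring: uniform crossover between random elite parents + Gaussian mutation.
        idx_a = rng.integers(0, elite, size=pop_size - elite)
        idx_b = rng.integers(0, elite, size=pop_size - elite)
        mask = rng.random((pop_size - elite, X.shape[1])) < 0.5
        children = np.where(mask, parents[idx_a], parents[idx_b])
        children += rng.normal(scale=mutation_sd, size=children.shape)
        pop = np.vstack([parents, children])
    return pop[np.argmax([fitness(w, X, y) for w in pop])]

weights = evolve(choices_train, ethics_train)
print("train correlation:", fitness(weights, choices_train, ethics_train))
print("test correlation: ", fitness(weights, choices_test, ethics_test))
```

With real data, the held-out correlation on the second participant group is what indicates whether the learned branch weights generalise; with the random placeholder data above it will hover around zero.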
Completeness of data is vital for decision making and forecasting in Building Management Systems (BMS), as missing data can result in biased decisions down the line. This study creates a guideline for imputing the gaps in BMS datasets by comparing four methods: the K Nearest Neighbour algorithm (KNN), Recurrent Neural Network (RNN), Hot Deck (HD) and Last Observation Carried Forward (LOCF). The guideline contains the best method per gap size and scale of measurement. The four selected methods come from various backgrounds and are tested on a real BMS and meteorological dataset. The focus of this paper is not to impute every cell as accurately as possible but to impute trends back into the missing data. Performance is characterised by a set of criteria so that the user can choose the imputation method best suited to their needs. The criteria are Variance Error (VE) and Root Mean Squared Error (RMSE). VE has been given more weight, as it evaluates the imputed trend better than RMSE does. From preliminary results, it was concluded that the best K-values for KNN are 5 for the smallest gap and 100 for the larger gaps. Using a genetic algorithm, the best RNN architecture for the purpose of this paper was determined to be the Gated Recurrent Unit (GRU). The comparison was performed using a training dataset different from the imputation dataset. The results show no consistent link between differences in kurtosis or skewness and imputation performance. The experiment concluded that RNN is best for interval data and HD is best for both nominal and ratio data. No single method was best for all gap sizes, as performance depended on the data to be imputed.
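As an illustration of the two criteria, the sketch below computes RMSE and a variance-based error for a LOCF-filled gap in a synthetic BMS-like signal. The exact VE formula used in the paper is not reproduced here, so the variance-difference form below is an assumption, as are the signal and gap size.

```python
# Sketch: evaluate a LOCF-imputed gap with RMSE and a Variance Error (VE).
# The VE definition here (distortion of variance inside the gap) is assumed.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Synthetic sensor-like signal with an artificial gap of known true values.
true_series = pd.Series(20 + np.sin(np.arange(200) / 10) + rng.normal(0, 0.1, 200))
observed = true_series.copy()
observed.iloc[80:120] = np.nan            # the "missing" region

# LOCF: carry the last observation before the gap forward.
locf_imputed = observed.ffill()

def rmse(imputed, truth):
    return float(np.sqrt(np.mean((imputed - truth) ** 2)))

def variance_error(imputed, truth):
    # Assumed form: how much the imputation distorts the variance inside the gap.
    return float(abs(np.var(imputed) - np.var(truth)))

truth_gap   = true_series.iloc[80:120]
imputed_gap = locf_imputed.iloc[80:120]
print("LOCF RMSE:", rmse(imputed_gap, truth_gap))
print("LOCF VE  :", variance_error(imputed_gap, truth_gap))
```

Because LOCF fills the whole gap with a single value, its imputed variance is zero, so its VE equals the variance of the true values in the gap; this is exactly the kind of lost trend that a VE-style criterion penalises and a pure RMSE comparison can understate.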
Huntington’s disease (HD) and various spinocerebellar ataxias (SCA) are autosomal dominantly inherited neurodegenerative disorders caused by a CAG repeat expansion in the disease-related gene [1]. The impact of HD and SCA on families and individuals is enormous and far-reaching, as patients typically display first symptoms during midlife. HD is characterized by unwanted choreatic movements, behavioral and psychiatric disturbances, and dementia. SCAs are mainly characterized by ataxia, but also by other symptoms including cognitive deficits, similarly affecting quality of life and leading to disability. These problems worsen as the disease progresses and affected individuals are no longer able to work, drive, or care for themselves. This places an enormous burden on their families and caregivers; patients require intensive nursing-home care as the disease progresses, and lifespan is reduced. Although the clinical and pathological phenotypes are distinct for each CAG repeat expansion disorder, it is thought that similar molecular mechanisms underlie the effect of expanded CAG repeats in different genes. The predicted Age of Onset (AO) for HD, SCA1 and SCA3 (and 5 other CAG-repeat diseases) is based on the polyQ expansion, but the CAG/polyQ length explains only about 50% of the variation in AO. A large variation in AO is observed, especially in the most common range between 40 and 50 repeats [11,12]. Large differences in onset, especially in the range of 40-50 CAGs, not only imply that current individual predictions of AO are imprecise (affecting important life decisions that patients need to make and hampering assessment of potential onset-delaying interventions), but also offer optimism that (patient-related) factors exist that can delay the onset of disease.

To address both items, we need to generate a better model, based on patient-derived cells, that generates parameters that not only mirror the CAG-repeat length dependency of these diseases but also better predict inter-patient variation in disease susceptibility and effectiveness of interventions. To this end, we will use a staggered project design, as explained in 5.1, in which we will first determine which cellular and molecular determinants (referred to as landscapes) in isogenic iPSC models are associated with increased CAG repeat lengths, using deep-learning algorithms (DLA) (WP1). For this, we will use a well-characterized control cell line in which we modify the CAG repeat length in the endogenous Ataxin-1, Ataxin-3 and Huntingtin genes from wild-type Q repeats to intermediate, adult-onset and juvenile polyQ repeats. We will next expand the model with cells from existing and new cohorts of early-onset, adult-onset and late-onset/intermediate-repeat patients for the three diseases (SCA1, SCA3 and HD), for which, besides accurate AO information, clinical parameters (MRI scans, cerebrospinal fluid markers, etc.) will also be (made) available. These data will be used for validation and to fine-tune the molecular landscapes (again using DLA) towards the best prediction of individual patient-related clinical markers and AO (WP3).
The same models and (most relevant) landscapes will also be used to evaluate novel mutant-protein-lowering strategies as these emerge from WP4. This overall development process of landscape prediction is an iterative process that involves (a) data processing (WP5), (b) unsupervised data exploration and dimensionality reduction to find patterns in the data and create “labels” for similarity, and (c) development of supervised Deep Learning (DL) models for landscape prediction based on the labels from the previous step. Each iteration starts with data that is generated and deployed according to FAIR principles, and the developed deep-learning system will be instrumental in connecting these WPs. Insights into algorithm sensitivity from the predictive models will form the basis for discussion with field experts on the distinctions and their phenotypic consequences. While full development of accurate diagnostics might go beyond the timespan of the 5-year project, ideally our final landscapes can be used for new genetic counselling: when somebody tests positive for the gene, can we use his or her cells, feed them into the generated cell-based model, and better predict the AO and severity? While this will answer questions from clinicians and patient communities, it will also generate new ones, which is why we will study the ethical implications of such improved diagnostics in advance (WP6).
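As a rough illustration of steps (b) and (c) of this iterative loop, the sketch below applies dimensionality reduction and clustering to a synthetic "landscape" feature matrix and then trains a small neural-network regressor to predict AO. All data, feature counts, and model choices are illustrative stand-ins for the project's actual DLA pipeline, not a description of it.

```python
# Sketch of the unsupervised-then-supervised loop on synthetic landscape data.
# One row per iPSC line; columns = hypothetical molecular/cellular readouts.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(42)

n_lines, n_features = 60, 500
X = rng.normal(size=(n_lines, n_features))
cag_length = rng.integers(36, 60, size=n_lines).astype(float)
X[:, :5] += 0.1 * cag_length[:, None]                       # toy: a few readouts track repeat length
age_of_onset = 80 - 1.2 * cag_length + rng.normal(0, 5, n_lines)  # toy AO, partly CAG-driven

X_train, X_test, y_train, y_test = train_test_split(X, age_of_onset, random_state=0)

# (b) Unsupervised exploration: reduce dimensionality and derive similarity "labels".
pca = PCA(n_components=10)
X_train_red = pca.fit_transform(X_train)
X_test_red = pca.transform(X_test)
similarity_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_train_red)

# (c) Supervised model: predict age of onset from the reduced landscape.
model = MLPRegressor(hidden_layer_sizes=(32, 16), max_iter=2000, random_state=0)
model.fit(X_train_red, y_train)
print("cluster labels (train):", similarity_labels)
print("held-out R^2 for AO prediction:", model.score(X_test_red, y_test))
```

In the project itself, the held-out score on patient-derived lines with known AO would play the validation role described for WP3, and the sensitivity of the trained model to individual features would feed the discussions with field experts mentioned above.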