Trustworthy data-driven prognostics in gas turbine engines are crucial for safety, cost-efficiency, and sustainability. Accurate predictions depend on data quality, model accuracy, uncertainty estimation, and practical implementation. This work discusses data quality attributes to build trust using anonymized real-world engine data, focusing on traceability, completeness, and representativeness. A significant challenge is handling missing data, which introduces bias and affects training and predictions. The study compares the accuracy of predictions using Exhaust Gas Temperature (EGT) margin, a key health indicator, by keeping missing values, using KNN-imputation, and employing a Generalized Additive Model (GAM). Preliminary results indicate that while KNN-imputation can be useful for identifying general trends, it may not be as effective for specific predictions compared to GAM, which considers the context of missing data. The choice of method depends on the study’s objective: broad trend forecasting or specific event prediction, each requiring different approaches to manage missing data.
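The trade-off described above can be sketched in code. This is a hypothetical illustration, not the study's pipeline: the data is synthetic, KNN imputation is shown via scikit-learn's KNNImputer, and a plain polynomial trend fit stands in for the GAM, which likewise models the context around the gap rather than borrowing neighbouring values.

```python
# Hedged sketch: neighbour-based vs. trend-based gap filling on a
# synthetic EGT-margin-like series. All names and numbers are made up.
import numpy as np
from sklearn.impute import KNNImputer

rng = np.random.default_rng(0)
t = np.arange(100, dtype=float)
egt_margin = 60.0 - 0.2 * t + rng.normal(0.0, 1.0, t.size)  # slow degradation trend

# Knock out a contiguous gap, as might occur with missing flight records.
series = egt_margin.copy()
series[40:55] = np.nan

# Option 1: KNN imputation on (time, value) pairs. Inside a long gap the
# nearest donors are the gap edges, so the fill flattens toward their mean.
X = np.column_stack([t, series])
knn_filled = KNNImputer(n_neighbors=5).fit_transform(X)[:, 1]

# Option 2: fit a smooth trend to the observed points only (a stand-in for
# the GAM in the abstract) and predict inside the gap.
obs = ~np.isnan(series)
coef = np.polyfit(t[obs], series[obs], deg=2)
trend_filled = series.copy()
trend_filled[~obs] = np.polyval(coef, t[~obs])
```

The sketch mirrors the abstract's conclusion: the neighbour-based fill is adequate for broad trends, while a fitted trend model extrapolates the degradation through the gap and is better suited to predicting specific values.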
Completeness of data is vital for decision making and forecasting in Building Management Systems (BMS), as missing data can result in biased decision making down the line. This study creates a guideline for imputing the gaps in BMS datasets by comparing four methods: the K Nearest Neighbour algorithm (KNN), Recurrent Neural Network (RNN), Hot Deck (HD) and Last Observation Carried Forward (LOCF). The guideline contains the best method per gap size and scale of measurement. The four selected methods come from various backgrounds and are tested on a real BMS and meteorological dataset. The focus of this paper is not to impute every cell as accurately as possible but to impute trends back into the missing data. The performance is characterised by a set of criteria that allows users to choose the imputation method best suited to their needs. The criteria are Variance Error (VE) and Root Mean Squared Error (RMSE). VE has been given more weight because it evaluates the imputed trend better than RMSE does. From preliminary results, it was concluded that the best K-values for KNN are 5 for the smallest gap and 100 for the larger gaps. Using a genetic algorithm, the best RNN architecture for the purpose of this paper was determined to be Gated Recurrent Units (GRU). The comparison was performed using a training dataset different from the imputation dataset. The results show no consistent link between differences in kurtosis or skewness and imputation performance. The experiment concluded that RNN is best for interval data and HD is best for both nominal and ratio data. No single method was best for all gap sizes, as performance depended on the data to be imputed.
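The two criteria named above can be sketched as follows. The abstract does not give the exact VE formula; here it is assumed to be the absolute difference between the variance of the imputed and the true gap values, which illustrates why a trend-flattening method such as LOCF can score poorly on VE even when its RMSE looks modest.

```python
# Hedged sketch of the two evaluation criteria named in the abstract.
# The exact VE definition used in the study is assumed, not quoted.
import numpy as np

def rmse(true, imputed):
    """Root Mean Squared Error over the imputed gap."""
    return float(np.sqrt(np.mean((np.asarray(true) - np.asarray(imputed)) ** 2)))

def variance_error(true, imputed):
    """|var(imputed) - var(true)|: rewards recovering the trend's spread."""
    return float(abs(np.var(imputed) - np.var(true)))

# Toy gap: the true values climb, while LOCF repeats the last observation.
true_gap = [20.1, 20.8, 21.5, 22.0, 21.2]
locf_gap = [20.0] * 5  # LOCF flattens the trend entirely, so var = 0
```

On this toy gap LOCF's RMSE stays near one unit, but its VE equals the full variance of the true values, which is exactly the trend loss the study weights more heavily.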
Data is widely recognized as a potent catalyst for advancing healthcare effectiveness, increasing worker satisfaction, and mitigating healthcare costs. The ongoing digital transformation within the healthcare sector promises to usher in a new era of flexible patient care, seamless inter-provider communication, and data-informed healthcare practices through the application of data science. However, more often than not, data lacks interoperability across different healthcare institutions and is not readily available for analysis. This inability to share data leads to a higher administrative burden for healthcare providers and introduces risks when data is missing or when delays occur. Moreover, medical researchers face similar challenges in accessing medical data due to the difficulty of extracting data from applications, a lack of standardization, and the data transformations required before it can be used for analysis. To address these complexities, a paradigm shift towards a data-centric application landscape is essential, where data serves as the bedrock of the healthcare infrastructure and is application agnostic. In short, a modern way to think about data is to move from an application-driven landscape to a data-driven landscape, which will allow for better interoperability and innovative healthcare solutions. In the current project, the research group Digital Transformation at Hanze University of Applied Sciences works together with industry partners to build an openEHR implementation for a Groningen-based mental healthcare provider.
Learning analytics is the analysis of student data with the purpose of improving learning. However, the process of data cleaning remains underexposed within learning analytics literature. In this paper, we elaborate on choices made in the cleaning process of student data and their consequences. We illustrate this with a case where data was gathered during six courses taught via Moodle. In this data set, only 21% of the logged activities were linked to a specific course. We illustrate possible choices in dealing with missing data by applying the cleaning process twelve times with different choices on copies of the raw data. Consequently, the analysis of the data shows varying outcomes. As the purpose of learning analytics is to intervene based on analysis and visualizations, it is of utmost importance to be aware of choices made during data cleaning. This paper's main goal is to make stakeholders of (learning) analytics activities aware that choices made during data cleaning have consequences for the outcomes. We believe that there should be transparency towards the users of these outcomes, and that they should be given a detailed report of the decisions made.
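A minimal sketch of how one such cleaning choice propagates to the outcomes, using made-up Moodle-style log rows (the abstract only reports that 21% of activities carried a course link; the fields and values below are hypothetical):

```python
# Hedged sketch: two defensible cleaning choices for activity logs where
# many rows lack a course id, and how they already disagree on user "b".
import pandas as pd

logs = pd.DataFrame({
    "user": ["a", "a", "b", "b", "b", "c"],
    "course_id": [101, None, 101, None, None, 102],
    "action": ["view", "view", "post", "view", "view", "view"],
})

# Choice 1: drop every activity that is not linked to a course.
per_user_dropped = logs.dropna(subset=["course_id"]).groupby("user").size()

# Choice 2: keep unlinked activities under a sentinel "unknown course".
kept = logs.fillna({"course_id": -1})
per_user_kept = kept.groupby("user").size()
```

Here user "b" logs one activity under Choice 1 but three under Choice 2, so any engagement visualization or intervention threshold built on top would differ, which is the paper's point about documenting cleaning decisions.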
Citizens regularly search the Web to make informed decisions on daily life questions, like online purchases, but how they reason with the results is unknown. This reasoning involves engaging with data in ways that require statistical literacy, which is crucial for navigating contemporary data. However, many adults struggle to critically evaluate and interpret such data and make data-informed decisions. Existing literature provides limited insight into how citizens engage with web-sourced information. We investigated: How do adults reason statistically with web-search results to answer daily life questions? In this case study, we observed and interviewed three vocationally educated adults searching for products or mortgages. Unlike data producers, consumers handle pre-existing, often ambiguous data with unclear populations and no single dataset. Participants encountered unstructured (web links) and structured data (prices). We analysed their reasoning and the process of preparing data, which is part of data-ing. Key data-ing actions included judging relevance and trustworthiness of the data and using proxy variables when relevant data were missing (e.g., price for product quality). Participants’ statistical reasoning was mainly informal. For example, they reasoned about association but did not calculate a measure of it, nor assess underlying distributions. This study theoretically contributes to understanding data-ing and why contemporary data may necessitate updating the investigative cycle. As current education focuses mainly on producers’ tasks, we advocate including consumers’ tasks by using authentic contexts (e.g., music, environment, deferred payment) to promote data exploration, informal statistical reasoning, and critical web-search skills—including selecting and filtering information, identifying bias, and evaluating sources.
During the COVID-19 pandemic, the bidirectional relationship between policy and data reliability has been a challenge for researchers of the local municipal health services. Policy decisions on population specific test locations and selective registration of negative test results led to population differences in data quality. This hampered the calculation of reliable population specific infection rates needed to develop proper data driven public health policy. https://doi.org/10.1007/s12508-023-00377-y
Abstract The Government of the Netherlands wants to be energy neutral by 2050 (Rijksoverheid, sd). A transition towards non-fossil energy sources also affects transport, one of the industries contributing significantly to CO2 emissions (Centraal Bureau Statistiek, 2019). Road authorities at municipalities and provinces want their inhabitants to shift from fossil-fuel-consuming to zero-emission transport choices. The Province of Utrecht has data available for this. However, it struggles with how to deploy that data to positively influence inhabitants' mobility behavior. A problem analysis scoped the research, and a survey revealed the gap between the province's current infrastructure-oriented, data-item approach and the required approach that adopts traveler personas to successfully stimulate cycling. This calls for more precisely defined data capture, and the focus should shift from already motivated cyclists to non-cyclists.
To tackle the continuous criticism of a lack of coherence and being disconnected from practice, teacher education programs have started to focus on the study and practice of teaching on campus. Yet, without theory, candidates may develop a technical view of teaching, lacking an understanding of the theoretical rationale behind the practices. Additionally, learning about research methods is part of many teacher education programs, as it helps candidates become reflective and creative teachers who are able to learn systematically about their practice. Against this background, we investigate how the studying and practicing of teaching and attention to theory and research within campus courses influence teacher candidates' perception of coherence in their teacher education program. Data from 270 candidates from Norway, Sweden and the US (California) were analyzed. Stepwise regression analyses show that, after controlling for the program candidates belong to, the study and practice of teaching and the opportunity to learn about theory contribute to explaining differences in perceptions of coherence between courses and opportunities to connect the various parts of the program. However, it seems that other variables come into play when candidates are asked about coherence between field experiences and campus courses. We furthermore find that learning about, reading, discussing, or analyzing research methods within methods courses is not a significant predictor of candidates' perception of coherence. This finding seems to contrast with the call for more attention to research methods in teacher education.