Trustworthy data-driven prognostics in gas turbine engines are crucial for safety, cost-efficiency, and sustainability. Accurate predictions depend on data quality, model accuracy, uncertainty estimation, and practical implementation. This work discusses data quality attributes to build trust using anonymized real-world engine data, focusing on traceability, completeness, and representativeness. A significant challenge is handling missing data, which introduces bias and affects training and predictions. The study compares the accuracy of predictions using Exhaust Gas Temperature (EGT) margin, a key health indicator, by keeping missing values, using KNN-imputation, and employing a Generalized Additive Model (GAM). Preliminary results indicate that while KNN-imputation can be useful for identifying general trends, it may not be as effective for specific predictions compared to GAM, which considers the context of missing data. The choice of method depends on the study’s objective: broad trend forecasting or specific event prediction, each requiring different approaches to manage missing data.
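For readers unfamiliar with the technique, the KNN-imputation the abstract refers to can be sketched in a few lines. This is a minimal, hypothetical illustration — the values, variable names, and the simple time-distance neighbour rule are assumptions for illustration, not the study's actual pipeline or engine data:

```python
# Sketch: fill a missing reading with the mean of its k nearest observed
# neighbours in time, versus leaving the gap in place. Hypothetical values.

def knn_impute(series, k=2):
    """Replace each None with the mean of the k temporally nearest observed values."""
    observed = [(i, v) for i, v in enumerate(series) if v is not None]
    filled = list(series)
    for i, v in enumerate(series):
        if v is None:
            # sort observed points by temporal distance to the gap, take k closest
            nearest = sorted(observed, key=lambda p: abs(p[0] - i))[:k]
            filled[i] = sum(val for _, val in nearest) / len(nearest)
    return filled

egt_margin = [42.0, 41.5, None, 40.8, None, 40.1]  # degrees C, hypothetical
print(knn_impute(egt_margin, k=2))
```

A smoother such as a GAM would instead fit a curve over the whole series and read the fill value off the fitted trend, which is one way to "consider the context" of a gap rather than only its nearest neighbours.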
DOCUMENT
Completeness of data is vital for decision making and forecasting in Building Management Systems (BMS), as missing data can result in biased decision making down the line. This study creates a guideline for imputing the gaps in BMS datasets by comparing four methods: the K Nearest Neighbour algorithm (KNN), Recurrent Neural Network (RNN), Hot Deck (HD) and Last Observation Carried Forward (LOCF). The guideline contains the best method per gap size and scale of measurement. The four selected methods come from various backgrounds and are tested on a real BMS and meteorological dataset. The focus of this paper is not to impute every cell as accurately as possible but to impute trends back into the missing data. The performance is characterised by a set of criteria in order to allow users to choose the imputation method best suited to their needs. The criteria are Variance Error (VE) and Root Mean Squared Error (RMSE). VE has been given more weight, as it evaluates the imputed trend better than RMSE does. From preliminary results, it was concluded that the best K-values for KNN are 5 for the smallest gap and 100 for the larger gaps. Using a genetic algorithm, the best RNN architecture for the purpose of this paper was determined to be Gated Recurrent Units (GRU). The comparison was performed using a different training dataset than the imputation dataset. The results show no consistent link between the difference in kurtosis or skewness and imputation performance. The experiment concluded that RNN is best for interval data and HD is best for both nominal and ratio data. No single method was best for all gap sizes, as this depended on the data to be imputed.
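The two criteria can be made concrete with a small sketch. RMSE is standard; the abstract does not give the exact VE formula, so the formulation below (absolute difference between the variances of the imputed and true values) is an assumption chosen only to illustrate why VE rewards preserving the trend's spread. The data and the LOCF-style fill are hypothetical:

```python
import math

def rmse(true, imputed):
    """Point-wise error: root mean squared difference."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(true, imputed)) / len(true))

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def variance_error(true, imputed):
    # Assumed VE formulation: how far the imputation's spread is from the
    # true spread, i.e. whether the trend's variability is reproduced.
    return abs(variance(imputed) - variance(true))

true_vals = [20.1, 20.4, 21.0, 21.6, 22.0]          # hypothetical sensor trend
locf_fill = [20.1, 20.1, 20.1, 20.1, 20.1]          # LOCF-style flat fill
print(rmse(true_vals, locf_fill), variance_error(true_vals, locf_fill))
```

The flat fill collapses the variance to zero, so VE flags the lost trend even though each individual point is only moderately wrong, which is exactly the behaviour a trend-focused criterion should have.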
MULTIFILE
Learning analytics is the analysis of student data with the purpose of improving learning. However, the process of data cleaning remains underexposed in the learning analytics literature. In this paper, we elaborate on choices made in the cleaning process of student data and their consequences. We illustrate this with a case where data was gathered during six courses taught via Moodle. In this data set, only 21% of the logged activities were linked to a specific course. We illustrate possible choices in dealing with missing data by applying the cleaning process twelve times, with different choices, on copies of the raw data. Consequently, the analysis of the data shows varying outcomes. As the purpose of learning analytics is to intervene based on analyses and visualizations, it is of utmost importance to be aware of the choices made during data cleaning. This paper's main goal is to make stakeholders of (learning) analytics activities aware of the fact that choices made during data cleaning have consequences on the outcomes. We believe that there should be transparency towards the users of these outcomes and that they should be given a detailed report of the decisions made.
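To make the paper's point tangible: the sketch below (with invented log records, not the study's Moodle data) shows how two equally defensible cleaning choices for activities that are not linked to a course already produce different datasets, and hence different downstream outcomes:

```python
# Hypothetical activity log; "course" is None for unlinked activities.
logs = [
    {"student": "A", "course": "math", "action": "view"},
    {"student": "A", "course": None,   "action": "view"},  # not linked to a course
    {"student": "B", "course": None,   "action": "post"},
    {"student": "B", "course": "math", "action": "post"},
]

def clean(records, drop_unlinked):
    if drop_unlinked:
        # Choice 1: discard activities without a course link.
        return [r for r in records if r["course"] is not None]
    # Choice 2: keep unlinked rows under a placeholder label instead.
    return [dict(r, course=r["course"] or "unknown") for r in records]

print(len(clean(logs, drop_unlinked=True)))   # half the activities survive
print(len(clean(logs, drop_unlinked=False)))  # all activities are kept
```

Any per-course activity count computed afterwards differs between the two cleaned copies, which is precisely why the choice should be reported to the users of the analysis.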
DOCUMENT
Data is widely recognized as a potent catalyst for advancing healthcare effectiveness, increasing worker satisfaction, and mitigating healthcare costs. The ongoing digital transformation within the healthcare sector promises to usher in a new era of flexible patient care, seamless inter-provider communication, and data-informed healthcare practices through the application of data science. However, more often than not, data lacks interoperability across different healthcare institutions and is not readily available for analysis. This inability to share data leads to a higher administrative burden for healthcare providers and introduces risks when data is missing or when delays occur. Moreover, medical researchers face similar challenges in accessing medical data due to the difficulty of extracting data from applications, a lack of standardization, and the data transformations required before it can be used for analysis. To address these complexities, a paradigm shift towards a data-centric application landscape is essential, where data serves as the bedrock of the healthcare infrastructure and is application agnostic. In short, a modern way to think about data is to move from an application-driven landscape to a data-driven landscape, which will allow for better interoperability and innovative healthcare solutions. In the current project, the research group Digital Transformation at Hanze University of Applied Sciences works together with industry partners to build an openEHR implementation for a Groningen-based mental healthcare provider.
DOCUMENT
The project by Aeres Hogeschool Dronten aims to generate new insights, better farm management, and more efficient chains, focused on economic and ecological sustainability, by sharing and analysing growers' data within a group of thirteen growers. To this end, a data infrastructure is being built that supports growers in collecting, sharing, and analysing data and gives them access to more complex analysis techniques. The project intends to train a group of growers to use the infrastructure and tools and to jointly share and analyse data in order to improve cultivation. By the end of the project, concrete improvements are expected in inputs and yields in potato growing. The project focused on investigating how data from agricultural entrepreneurs in Flevoland can be used and shared to achieve economic and ecological improvements. The agricultural sector is collecting ever more data on the variables that influence crop growth and storage, with which farming practice can be made more sustainable. However, the use of data is still in its infancy, and decisions are often based on advice from external commercial parties. Sharing data also remains a sensitive matter. The project aims to lower these barriers by having growers exchange more data with each other and with partners in the chain. The data infrastructure is being realised for a group of 15-20 growers who are willing to steer cultivation and/or storage on the basis of available object-specific and up-to-date data. The data can be shared among them, allowing the farms to be improved, and through the infrastructure the growers gain access to more complex analysis techniques.
The project is divided into three groups based on location within the province: a group of growers around a pilot farm in Dronten, a group around a pilot farm in Swifterbant, and a group in the NOP. At the start of the project, the three pilot farms carried out an inventory, based on a questionnaire drawn up by Aeres, to gain insight into the minimum data available for participation in the project. Most of the requested data was already available, except at the pilot farm in the NOP; the missing data can be obtained from local weather stations or generated within the project by project partners. In the agricultural sector, data on the factors that contribute to failures in precision agriculture is often missing, because the thinking tends to focus on what does work rather than on what does not. One way to counter this is to be aware of the missing data and to seek it out proactively, for example by investigating the environmental impact of farming. Through this project, better insight was gained into the effectiveness of inputs as well as their impact on the environment. The following improvements were realised: • Better insight into the timing of cultivation operations, sparing the soil. • Better insight into the effects of crop rotations, so that rotations with less impact can be chosen while still achieving good financial results. • Through comparison, inputs such as fertiliser and crop-protection products can be used more effectively, resulting not only in lower use but also in less run-off and leaching. • Through more effective use of inputs, less land, energy, and chemicals will be needed per kilogram of potatoes produced. Keywords: farm digitalisation, data, pop3, databoeren, precision agriculture. RVO case number: 17717000042
DOCUMENT
In the course of our supervisory work over the years, we have noticed that qualitative research tends to evoke a lot of questions and worries, so-called frequently asked questions (FAQs). This series of four articles intends to provide novice researchers with practical guidance for conducting high-quality qualitative research in primary care. By ‘novice’ we mean Master’s students and junior researchers, as well as experienced quantitative researchers who are engaging in qualitative research for the first time. This series addresses their questions and provides researchers, readers, reviewers and editors with references to criteria and tools for judging the quality of qualitative research papers. The second article focused on context, research questions and designs, and referred to publications for further reading. This third article addresses FAQs about sampling, data collection and analysis. The data collection plan needs to be broadly defined and open at first, and become flexible during data collection. Sampling strategies should be chosen in such a way that they yield rich information and are consistent with the methodological approach used. Data saturation determines sample size and will be different for each study. The most commonly used data collection methods are participant observation, face-to-face in-depth interviews and focus group discussions. Analyses in ethnographic, phenomenological, grounded theory, and content analysis studies yield different narrative findings: a detailed description of a culture, the essence of the lived experience, a theory, and a descriptive summary, respectively. The fourth and final article will focus on trustworthiness and publishing qualitative research.
DOCUMENT
Citizens regularly search the Web to make informed decisions on daily life questions, like online purchases, but how they reason with the results is unknown. This reasoning involves engaging with data in ways that require statistical literacy, which is crucial for navigating contemporary data. However, many adults struggle to critically evaluate and interpret such data and make data-informed decisions. Existing literature provides limited insight into how citizens engage with web-sourced information. We investigated: How do adults reason statistically with web-search results to answer daily life questions? In this case study, we observed and interviewed three vocationally educated adults searching for products or mortgages. Unlike data producers, consumers handle pre-existing, often ambiguous data with unclear populations and no single dataset. Participants encountered unstructured (web links) and structured data (prices). We analysed their reasoning and the process of preparing data, which is part of data-ing. Key data-ing actions included judging relevance and trustworthiness of the data and using proxy variables when relevant data were missing (e.g., price for product quality). Participants’ statistical reasoning was mainly informal. For example, they reasoned about association but did not calculate a measure of it, nor assess underlying distributions. This study theoretically contributes to understanding data-ing and why contemporary data may necessitate updating the investigative cycle. As current education focuses mainly on producers’ tasks, we advocate including consumers’ tasks by using authentic contexts (e.g., music, environment, deferred payment) to promote data exploration, informal statistical reasoning, and critical web-search skills—including selecting and filtering information, identifying bias, and evaluating sources.
LINK
During the COVID-19 pandemic, the bidirectional relationship between policy and data reliability has been a challenge for researchers at local municipal health services. Policy decisions on population-specific test locations and selective registration of negative test results led to population differences in data quality. This hampered the calculation of the reliable population-specific infection rates needed to develop proper data-driven public health policy. https://doi.org/10.1007/s12508-023-00377-y
MULTIFILE