In this post I give an overview of the theory, tools, frameworks and best practices I have found so far for testing (and debugging) machine learning applications. I will start by outlining what sets the testing of machine learning applications apart.
LINK
With summaries in Dutch, Esperanto and English. DOI: 10.4233/uuid:d7132920-346e-47c6-b754-00dc5672b437 "The subject of this study is deformation analysis of the earth's surface (or part of it) and spatial objects on, above or below it. Such analyses are needed in many domains of society. Geodetic deformation analysis uses various types of geodetic measurements to substantiate statements about changes in geometric positions. Professional practice, e.g. in the Netherlands, regularly applies methods for geodetic deformation analysis that have shortcomings, e.g. because the methods apply substandard analysis models or defective testing methods. These shortcomings hamper communication about the results of deformation analyses with the various parties involved. To improve communication solid analysis models and a common language have to be used, which requires standardisation. Operational demands for geodetic deformation analysis are the reason to formulate in this study seven characteristic elements that a solid analysis model needs to possess. Such a model can handle time series of several epochs. It analyses only size and form, not position and orientation of the reference system; and datum points may be under influence of deformation. The geodetic and physical models are combined in one adjustment model. Full use is made of available stochastic information. Statistical testing and computation of minimal detectable deformations is incorporated. Solution methods can handle rank deficient matrices (both model matrix and cofactor matrix). And, finally, a search for the best hypothesis/model is implemented. Because a geodetic deformation analysis model with all seven elements does not exist, this study develops such a model. For effective standardisation geodetic deformation analysis models need: practical key performance indicators; a clear procedure for using the model; and the possibility to graphically visualise the estimated deformations."
DOCUMENT
Citizens regularly search the Web to make informed decisions on daily life questions, like online purchases, but how they reason with the results is unknown. This reasoning involves engaging with data in ways that require statistical literacy, which is crucial for navigating contemporary data. However, many adults struggle to critically evaluate and interpret such data and make data-informed decisions. Existing literature provides limited insight into how citizens engage with web-sourced information. We investigated: How do adults reason statistically with web-search results to answer daily life questions? In this case study, we observed and interviewed three vocationally educated adults searching for products or mortgages. Unlike data producers, consumers handle pre-existing, often ambiguous data with unclear populations and no single dataset. Participants encountered unstructured (web links) and structured data (prices). We analysed their reasoning and the process of preparing data, which is part of data-ing. Key data-ing actions included judging relevance and trustworthiness of the data and using proxy variables when relevant data were missing (e.g., price for product quality). Participants’ statistical reasoning was mainly informal. For example, they reasoned about association but did not calculate a measure of it, nor assess underlying distributions. This study theoretically contributes to understanding data-ing and why contemporary data may necessitate updating the investigative cycle. As current education focuses mainly on producers’ tasks, we advocate including consumers’ tasks by using authentic contexts (e.g., music, environment, deferred payment) to promote data exploration, informal statistical reasoning, and critical web-search skills—including selecting and filtering information, identifying bias, and evaluating sources.
LINK
KEY MESSAGE: • Statistical significance testing alone is not the most adequate way to evaluate whether there is indeed a clinically relevant effect • Effect sizes should be added to significance testing • Effect sizes facilitate the decision whether a clinically relevant effect is found, help determine the sample size for future studies, and facilitate comparison between scientific studies
DOCUMENT
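The key message above recommends reporting effect sizes next to significance tests but does not name a specific measure. As an illustration only, the sketch below computes Cohen's d, one widely used standardised effect size for the difference between two group means. The choice of Cohen's d, the function name `cohens_d`, and the sample values are assumptions made for this example and are not taken from the source.

```python
import math
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Cohen's d: difference in means divided by the pooled standard deviation.

    Hypothetical helper for illustration; both groups need at least two values.
    """
    n_a, n_b = len(group_a), len(group_b)
    # Pooled variance weights each sample variance by its degrees of freedom.
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / math.sqrt(pooled_var)


# Made-up outcome scores for a treated and a control group.
treated = [2.0, 4.0, 6.0]
control = [1.0, 3.0, 5.0]
print(f"Cohen's d = {cohens_d(treated, control):.2f}")
```

Unlike a p-value, the resulting number is expressed in standard-deviation units, which is what makes it usable for judging clinical relevance, planning sample sizes, and comparing studies.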
Implementation of reliable methodologies allowing Reduction, Refinement, and Replacement (3Rs) of animal testing is a process that takes several decades and is still not complete. Reliable methods are essential for regulatory hazard assessment of chemicals where differences in test protocol can influence the test outcomes and thus affect the confidence in the predictive value of the organisms used as an alternative for mammals. Although test guidelines are common for mammalian studies, they are scarce for non-vertebrate organisms that would allow for the 3Rs of animal testing. Here, we present a set of 30 reporting criteria as the basis for such a guideline for Developmental and Reproductive Toxicology (DART) testing in the nematode Caenorhabditis elegans. Small organisms like C. elegans are upcoming in new approach methodologies for hazard assessment; thus, reliable and robust test protocols are urgently needed. A literature assessment of the fulfilment of the reporting criteria demonstrates that although studies describe methodological details, essential information such as compound purity and lot/batch number or type of container is often not reported. The formulated set of reporting criteria for C. elegans testing can be used by (i) researchers to describe essential experimental details, (ii) data scientists who aggregate information to assess data quality and include data in aggregated databases, and (iii) regulators to assess study data for inclusion in regulatory hazard assessment of chemicals.
DOCUMENT
BACKGROUND: Patients who underwent surgery for aortic coarctation (COA) have an increased risk of arterial hypertension. We aimed at evaluating (1) differences between hypertensive and non-hypertensive patients and (2) the value of cardiopulmonary exercise testing (CPET) to predict the development or progression of hypertension. METHODS: Between 1999 and 2010, CPET was performed in 223 COA-patients of whom 122 had resting blood pressures of <140/90 mmHg without medication, and 101 were considered hypertensive. Comparative statistics were performed. Cox regression analysis was used to assess the relation between demographic, clinical and exercise variables and the development/progression of hypertension. RESULTS: At baseline, hypertensive patients were older (p=0.007), were more often male (p=0.004) and had repair at later age (p=0.008) when compared to normotensive patients. After 3.6 ± 1.2 years, 29/120 (25%) normotensive patients developed hypertension. In normotensives, VE/VCO2-slope (p=0.0016) and peak systolic blood pressure (SBP; p=0.049) were significantly related to the development of hypertension during follow-up. Cut-off points related to higher risk for hypertension, based on best sensitivity and specificity, were defined as VE/VCO2-slope ≥ 27 and peak SBP ≥ 220 mmHg. In the hypertensive group, antihypertensive medication was started/extended in 48/101 (48%) patients. Only age was associated with the need to start/extend antihypertensive therapy in this group (p=0.042). CONCLUSIONS: Higher VE/VCO2-slope and higher peak SBP are risk factors for the development of hypertension in adults with COA. Cardiopulmonary exercise testing may guide clinical decision making regarding close blood pressure control and preventive lifestyle recommendations.
DOCUMENT
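The abstract above reports cut-off points "based on best sensitivity and specificity" without stating how they were selected. One common way to pick such a threshold is to maximise Youden's J (sensitivity + specificity - 1); whether the authors used this criterion is an assumption. The sketch below uses a hypothetical function `best_cutoff` and made-up values, not the study data.

```python
def best_cutoff(values, outcomes):
    """Return (cutoff, sensitivity, specificity) maximising Youden's J.

    A measurement >= cutoff is classified as positive. `outcomes` holds
    1 for patients who developed the condition and 0 for those who did not.
    Hypothetical helper for illustration; needs at least one 1 and one 0.
    """
    positives = sum(outcomes)
    negatives = len(outcomes) - positives
    best = None
    for cutoff in sorted(set(values)):
        true_pos = sum(1 for v, o in zip(values, outcomes) if v >= cutoff and o)
        true_neg = sum(1 for v, o in zip(values, outcomes) if v < cutoff and not o)
        sensitivity = true_pos / positives
        specificity = true_neg / negatives
        youden_j = sensitivity + specificity - 1
        # Keep the first cutoff that reaches the highest J seen so far.
        if best is None or youden_j > best[0]:
            best = (youden_j, cutoff, sensitivity, specificity)
    return best[1:]


# Made-up exercise measurements and follow-up outcomes (not study data).
slopes = [20, 22, 25, 27, 29, 31]
developed = [0, 0, 0, 1, 1, 1]
cutoff, sens, spec = best_cutoff(slopes, developed)
print(f"cutoff >= {cutoff}: sensitivity {sens:.2f}, specificity {spec:.2f}")
```

Other criteria, such as the point closest to the top-left corner of the ROC curve, can give a different threshold on real data, so the selection rule is worth reporting alongside the cut-off itself.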
From the article: Abstract Adjustment and testing of a combination of stochastic and nonstochastic observations is applied to the deformation analysis of a time series of 3D coordinates. Nonstochastic observations are constant values that are treated as if they were observations. They are used to formulate constraints on the unknown parameters of the adjustment problem. Thus they describe deformation patterns. If deformation is absent, the epochs of the time series are supposed to be related via affine, similarity or congruence transformations. S-basis invariant testing of deformation patterns is treated. The model is experimentally validated by showing the procedure for a point set of 3D coordinates, determined from total station measurements during five epochs. The modelling of two patterns, the movement of just one point in several epochs, and of several points, is shown. Full, rank deficient covariance matrices of the 3D coordinates, resulting from free network adjustments of the total station measurements of each epoch, are used in the analysis.
MULTIFILE
During the COVID-19 pandemic, the bidirectional relationship between policy and data reliability has been a challenge for researchers of the local municipal health services. Policy decisions on population specific test locations and selective registration of negative test results led to population differences in data quality. This hampered the calculation of reliable population specific infection rates needed to develop proper data driven public health policy. https://doi.org/10.1007/s12508-023-00377-y
MULTIFILE
Objective: In myocardial perfusion single-photon emission computed tomography (SPECT), abdominal activity often interferes with the evaluation of perfusion in the inferior wall, especially after pharmacological stress. In this randomized study, we examined the effect of carbonated water intake versus still water intake on the quality of images obtained during myocardial perfusion imaging (MPI) studies. Methods: A total of 467 MIBI studies were randomized into a carbonated water group and a still water group. The presence of intestinal activity adjacent to the inferior wall was evaluated by two observers. Furthermore, a semiquantitative analysis was performed in the adenosine subgroup, using a count ratio of the inferior myocardial wall and adjacent abdominal activity. Results: The need for repeated SPECT in the adenosine studies was 5.3% in the carbonated water group versus 19.4% in the still water group (p = 0.019). The inferior wall-to-abdomen count ratio was significantly higher in the carbonated water group compared to the still water group (2.11 ± 1.00 vs. 1.72 ± 0.73, p < 0.001). The effect of carbonated water during rest and after exercise was not significant. Conclusions: This randomized study showed that carbonated water significantly reduced the interference of extra-cardiac activity in adenosine SPECT MPI. Keywords: Extra-cardiac radioactivity, Myocardial SPECT, Image quality enhancement, Carbonated water
DOCUMENT
Background: Dermoscopy is known to increase the diagnostic accuracy of pigmented skin lesions (PSLs) when used by trained professionals. The effect of dermoscopy training on the diagnostic ability of dermal therapists (DTs) has not been studied so far. Objectives: This study aimed to investigate whether DTs, in comparison with general practitioners (GPs), benefited from a training programme including dermoscopy, in both their ability to differentiate between different forms of PSL and to assign the correct therapeutic strategy. Methods: In total, 24 DTs and 96 GPs attended a training programme on PSLs. Diagnostic skills as well as therapeutic strategy were assessed, prior to the training (pretest) and after the training (post-test) using clinical images alone, as well as after the addition of dermatoscopic images (integrated post-test). Bayesian hypothesis testing was used to determine statistical significance of differences between pretest, post-test and integrated post-test scores. Results: Both the DTs and the GPs demonstrated benefit from the training: at the integrated post-test, the median proportion of correctly diagnosed PSLs was 73% (range 30–90) for GPs and 63% (range 27–80) for DTs. A statistically significant difference between pretest results and integrated post-test results was seen, with a Bayes factor > 100. At 12 percentage points higher, the GPs outperformed DTs in the accuracy of detecting PSLs. Conclusions: The study shows that a training programme focusing on PSLs while including dermoscopy positively impacts detection of PSLs by DTs and GPs. This training programme could form an integral part of the training of DTs in screening procedures, although additional research is needed.
DOCUMENT