eCite Digital Repository

A universal global measure of univariate and bivariate data utility for anonymised microdata

Citation

Kocar, S, A universal global measure of univariate and bivariate data utility for anonymised microdata, Centre for Social Research & Methods, (4) pp. 1-19. ISSN 2209-184X (2018) [Refereed Article]

Copyright Statement

Copyright 2018 ANU

Abstract

This paper presents a new global data utility measure based on a benchmarking approach. Data utility measures assess the utility of anonymised microdata by measuring changes in distributions and their impact on bias, variance and other statistics derived from the data. Most existing data utility measures have significant shortcomings: they are limited to continuous variables, to univariate utility assessment, or to local information loss measurements. The proposed global data utility model addresses these shortcomings. It combines univariate and bivariate data utility measures, which calculate information loss using various statistical tests and association measures, such as the two-sample Kolmogorov-Smirnov test, the chi-squared test (Cramér's V), the ANOVA F test (eta squared), the Kruskal-Wallis H test (epsilon squared), the Spearman coefficient (rho) and the Pearson correlation coefficient (r). The model is universal, since it also includes new local utility measures for the global recoding and variable removal data reduction approaches, and it can be used for data protected with all common masking methods and techniques, from data reduction and data perturbation to generation of synthetic data and sampling. At the bivariate level, the model includes all required data analysis steps: assumptions for statistical tests, statistical significance of the association, direction of the association and strength of the association (effect size).
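To make two of the named ingredients concrete, the following is a minimal Python sketch of the two-sample Kolmogorov-Smirnov statistic (univariate level) and Cramér's V (bivariate level), each computed on original and anonymised data. It is an illustration only, not the paper's scoring formula, which the abstract does not give. The function names, the example data and the use of an absolute difference in Cramér's V as a loss indicator are all assumptions made here.

```python
from bisect import bisect_right
from math import sqrt


def ks_statistic(x, y):
    """Two-sample Kolmogorov-Smirnov statistic: the largest absolute gap
    between the empirical CDFs of samples x and y."""
    xs, ys = sorted(x), sorted(y)
    d = 0.0
    for point in sorted(set(xs) | set(ys)):
        fx = bisect_right(xs, point) / len(xs)  # ECDF of x at this point
        fy = bisect_right(ys, point) / len(ys)  # ECDF of y at this point
        d = max(d, abs(fx - fy))
    return d


def cramers_v(table):
    """Cramér's V for an r x c contingency table (list of rows of counts).
    Assumes at least a 2 x 2 table with no empty row or column."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(table), len(table[0])) - 1
    return sqrt(chi2 / (n * k))


# Hypothetical example data (not from the paper).
original_values = [21, 25, 30, 34, 41, 47, 52, 58]
anonymised_values = [20, 25, 30, 35, 40, 45, 50, 60]  # e.g. after rounding

original_table = [[30, 10], [10, 30]]
anonymised_table = [[25, 15], [15, 25]]

# Assumed loss indicators: a larger value means the anonymised data
# deviates more from the original.
univariate_gap = ks_statistic(original_values, anonymised_values)
bivariate_gap = abs(cramers_v(original_table) - cramers_v(anonymised_table))

print("KS statistic:", univariate_gap)
print("Change in Cramér's V:", bivariate_gap)
```

In this sketch a value near zero on either indicator suggests that the anonymised data preserves the corresponding distribution or association. How the model combines such quantities into a single global score is described in the paper itself.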

Because the model is intended to be executed automatically by statistical software code or a package, our aim was for every step to run with no additional user input. For this reason, we propose approaches to automatically establish the direction of the association between two variables using test-reported standardised residuals and sums of squares between groups.
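One possible reading of the residual-based idea, for two categorical variables, is sketched below in Python. It computes Pearson residuals, (observed - expected) / sqrt(expected), for each cell of a contingency table and checks whether the pattern of their signs is the same before and after anonymisation. This is an assumption about how such a check could work, not the procedure defined in the paper. The function names and the example tables are hypothetical, and the sums-of-squares approach for group comparisons is not covered.

```python
from math import sqrt


def pearson_residuals(table):
    """Pearson residuals for each cell of an r x c contingency table.
    Assumes no empty row or column, so every expected count is positive."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    residuals = []
    for i, row in enumerate(table):
        residual_row = []
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            residual_row.append((observed - expected) / sqrt(expected))
        residuals.append(residual_row)
    return residuals


def sign_pattern(table):
    """Sign (+1, 0 or -1) of each residual: which cells are over-
    or under-represented relative to independence."""
    return [[(r > 0) - (r < 0) for r in row] for row in pearson_residuals(table)]


def same_direction(original, anonymised):
    """True when every cell deviates from independence in the same
    direction in both tables (tables must have the same shape)."""
    return sign_pattern(original) == sign_pattern(anonymised)


# Hypothetical example tables (not from the paper).
original = [[30, 10], [10, 30]]
weakened = [[25, 15], [15, 25]]   # association weaker, same direction
reversed_ = [[10, 30], [30, 10]]  # association flipped

print(same_direction(original, weakened))
print(same_direction(original, reversed_))
```

A check of this kind needs no user input, which matches the stated aim. A real implementation would also have to decide how to treat residuals that are close to zero.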

Although the model is a global data utility model, individual local univariate and bivariate utility can still be assessed for different types of variables, as well as for both normal and non-normal distributions. The next important step in global data utility assessment would be to develop either program code or an R statistical software package for measuring data utility, and to establish the relationship between univariate, bivariate and multivariate data utility of anonymised data.

Item Details

Item Type: Refereed Article
Keywords: data confidentialisation; disclosure risk; data utility; univariate analysis; bivariate analysis
Research Division: Mathematical Sciences
Research Group: Statistics
Research Field: Applied statistics
Objective Division: Expanding Knowledge
Objective Group: Expanding knowledge
Objective Field: Expanding knowledge in human society
UTAS Author: Kocar, S (Dr Sebastian Kocar)
ID Code: 153129
Year Published: 2018
Deposited By: CALE Research Institute
Deposited On: 2022-09-07
Last Modified: 2022-10-05
Downloads: 0
