A universal global measure of univariate and bivariate data utility for anonymised microdata

Kocar, Sebastian

File(s) under permanent embargo

journal contribution

posted on 2023-05-21, 13:26 authored by Sebastian Kocar

This paper presents a new global data utility measure based on a benchmarking approach. Data utility measures assess the utility of anonymised microdata by measuring changes in distributions and their impact on bias, variance and other statistics derived from the data. Most existing data utility measures have significant shortcomings: they are limited to continuous variables, to univariate utility assessment, or to local information loss measurements. The proposed global data utility model offers several solutions to these shortcomings. It combines univariate and bivariate data utility measures, which calculate information loss using various statistical tests and association measures, such as the two-sample Kolmogorov-Smirnov test, the chi-squared test (Cramér's V), the ANOVA F test (eta squared), the Kruskal-Wallis H test (epsilon squared), the Spearman coefficient (rho) and the Pearson correlation coefficient (r). The model is universal, since it also includes new local utility measures for the global recoding and variable removal data reduction approaches, and it can be used for data protected with all common masking methods and techniques, from data reduction and data perturbation to generation of synthetic data and sampling. At the bivariate level, the model includes all required data analysis steps: assumptions for statistical tests, statistical significance of the association, direction of the association and strength of the association (effect size).
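To make two of the named measures concrete, the following is a minimal sketch rather than the paper's model. It computes the two-sample Kolmogorov-Smirnov statistic for a continuous variable and Cramér's V for a pair of categorical variables, each on an original and an anonymised version of the data. The function names, the toy data, and the reading of information loss as the KS statistic itself or as the absolute change in Cramér's V are assumptions made for illustration only. The abstract does not state how the model turns these statistics into a utility score.

```python
from bisect import bisect_right
from collections import Counter
from math import sqrt


def ks_statistic(original, anonymised):
    """Two-sample Kolmogorov-Smirnov D: largest gap between the two empirical CDFs."""
    a, b = sorted(original), sorted(anonymised)
    points = sorted(set(a) | set(b))
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in points
    )


def cramers_v(xs, ys):
    """Cramér's V for two categorical variables given as paired sequences."""
    n = len(xs)
    row = Counter(xs)
    col = Counter(ys)
    cell = Counter(zip(xs, ys))
    chi2 = 0.0
    for r, nr in row.items():
        for c, nc in col.items():
            expected = nr * nc / n
            chi2 += (cell.get((r, c), 0) - expected) ** 2 / expected
    k = min(len(row), len(col)) - 1
    return sqrt(chi2 / (n * k)) if k > 0 else 0.0


# Hypothetical univariate example: incomes top-coded at 50 in the anonymised file.
income_orig = [21, 25, 30, 34, 41, 48, 55, 62]
income_anon = [21, 25, 30, 34, 41, 48, 50, 50]
univariate_loss = ks_statistic(income_orig, income_anon)

# Hypothetical bivariate example: the second variable is shuffled in the anonymised file.
sex = ["f", "f", "m", "m"]
employed_orig = ["yes", "yes", "no", "no"]
employed_anon = ["yes", "no", "yes", "no"]
# Assumed loss definition: absolute change in the strength of association.
bivariate_loss = abs(cramers_v(sex, employed_orig) - cramers_v(sex, employed_anon))
```

In this toy data the top-coding moves only the upper tail of the distribution, while the shuffling removes the association between the two categorical variables entirely.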

Since the model should be executed automatically with statistical software code or a package, our aim was to allow all steps to be done with no additional user input. For this reason, we propose approaches to automatically establish the direction of the association between two variables using test-reported standardised residuals and sums of squares between groups.
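One plausible way to read the residual-based idea for two categorical variables is sketched below. It computes standardised (Pearson) residuals, (observed - expected) / sqrt(expected), for the original and the anonymised contingency tables and checks whether each cell keeps the sign of its residual. The function names, the sign-matching rule and the example tables are assumptions for illustration. The abstract does not give the paper's actual procedure, and the sums-of-squares variant for group comparisons is not shown here.

```python
from math import sqrt


def pearson_residuals(table):
    """Standardised (Pearson) residuals (O - E) / sqrt(E) for a table of counts.

    Assumes every row and column total is positive.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [
        [
            (obs - rt * ct / n) / sqrt(rt * ct / n)
            for obs, ct in zip(row, col_totals)
        ]
        for row, rt in zip(table, row_totals)
    ]


def sign(x, tol=1e-9):
    """Return -1, 0 or 1, treating values within tol of zero as zero."""
    return 0 if abs(x) < tol else (1 if x > 0 else -1)


def direction_preserved(original, anonymised):
    """True when every cell keeps the sign of its residual after anonymisation."""
    res_o = pearson_residuals(original)
    res_a = pearson_residuals(anonymised)
    return all(
        sign(o) == sign(a)
        for row_o, row_a in zip(res_o, res_a)
        for o, a in zip(row_o, row_a)
    )


# Hypothetical 2x2 tables of counts (rows: group, columns: outcome).
original = [[30, 10], [10, 30]]
weaker = [[25, 15], [15, 25]]     # same direction, weaker association
reversed_ = [[12, 28], [28, 12]]  # direction flipped by anonymisation
```

A check of this kind needs no user input, which matches the stated aim of running every step automatically. A fuller version would also have to decide how to treat cells whose residuals are close to zero.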

Although the model is a global data utility model, individual local univariate and bivariate utility can still be assessed for different types of variables, as well as for both normal and non-normal distributions. The next important step in global data utility assessment would be to develop either program code or an R statistical software package for measuring data utility, and to establish the relationship between univariate, bivariate and multivariate data utility of anonymised data.

History

Publication title

Centre for Social Research & Methods

Issue

4

Pagination

1-19

ISSN

2209-184X

Publisher

Australia

Place of publication

Australian National University

Repository Status

Restricted

Socio-economic Objectives

Expanding knowledge in human society

Keywords

data confidentialisation, disclosure risk, data utility, univariate analysis, bivariate analysis

Licence

In Copyright
