GPAbin: an R package for unified visualisations of imputed data sets

Johané Nienkemper-Swanepoel

Centre for Multi-Dimensional Data Visualisation (MuViSU).
Department of Statistics and Actuarial Science, Stellenbosch University.
National Institute for Theoretical and Computational Sciences.

SU seminar series: 13 February 2026

Home life

Looking back 2016

Looking back 2016: exploring ideas

Looking back 2020

Looking back: making promises

Back to today

GPAbin - pronounced G - P - A - bin

An R package to unify multiple biplot visualisations into a single display.

Nienkemper-Swanepoel J (2025). GPAbin: Unifying Multiple Biplot Visualisations into a Single Display. R package version 1.1.1, https://CRAN.R-project.org/package=GPAbin.

Required background

Missing data
Multiple imputation
Biplot visualisation

Missing data

Missing data mechanisms (MDMs):

Missing Completely At Random (MCAR): Missingness is independent of both the observed and the missing values.
Missing At Random (MAR): Missingness depends on the observed values, but not on the missing values themselves.
Missing Not At Random (MNAR): Missingness depends on the missing values themselves, possibly in addition to the observed values.

Focus placed on MAR scenarios - assumed to be the standard occurrence in practice.
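
The three mechanisms can be illustrated with a small simulation. This is a minimal sketch in base R under assumed settings: the variables `x` and `y`, the logistic missingness models and the helper `make_missing()` are hypothetical and are not part of the GPAbin package.

```r
set.seed(1)
n <- 1000

# Two hypothetical variables: x is always observed, y receives missing values
x <- rnorm(n)
y <- 0.5 * x + rnorm(n)

# Probability that y is missing under each mechanism
p_mcar <- rep(0.3, n)            # constant: unrelated to x or y
p_mar  <- plogis(-1 + 1.5 * x)   # depends only on the observed x
p_mnar <- plogis(-1 + 1.5 * y)   # depends on the value of y itself

# Set each value to NA with its own missingness probability
make_missing <- function(values, prob) {
  values[runif(length(values)) < prob] <- NA
  values
}

y_mcar <- make_missing(y, p_mcar)
y_mar  <- make_missing(y, p_mar)
y_mnar <- make_missing(y, p_mnar)

# Compare the mean of the observed y values with the full-data mean
round(c(full = mean(y),
        MCAR = mean(y_mcar, na.rm = TRUE),
        MAR  = mean(y_mar,  na.rm = TRUE),
        MNAR = mean(y_mnar, na.rm = TRUE)), 2)
```

Under MAR the missingness can be explained by `x`, which remains available to an imputation model; under MNAR it cannot be recovered from the observed data alone.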

Multiple imputation (MI)

Each missing value is replaced with multiple plausible values.
Results in multiple completed data sets, each analysed with standard complete-data methods.
Estimates are combined using Rubin’s rules (Rubin 1987).
Uncertainty about each missing value is represented realistically by the variation across its multiple plausible values.
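
Rubin's rules for a scalar estimate can be written out in a few lines. The sketch below is illustrative only: `pool_rubin()` is a hypothetical helper, and the five estimates and variances are made-up numbers standing in for the results of analysing m = 5 completed data sets.

```r
# Pool m point estimates and their variances with Rubin's rules
pool_rubin <- function(estimates, variances) {
  m     <- length(estimates)
  q_bar <- mean(estimates)            # pooled point estimate
  u_bar <- mean(variances)            # within-imputation variance
  b     <- var(estimates)             # between-imputation variance
  total <- u_bar + (1 + 1 / m) * b    # total variance
  c(estimate = q_bar, within = u_bar, between = b, total = total)
}

# Hypothetical results from m = 5 completed data sets
est <- c(2.1, 1.9, 2.3, 2.0, 2.2)
se2 <- c(0.04, 0.05, 0.04, 0.06, 0.05)

pooled <- pool_rubin(est, se2)
pooled
```

The between-imputation term is what carries the extra uncertainty due to the missing values into the pooled variance.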

MI strategies

Two general approaches to impute missing values in multiple variables:

Joint modelling (JM): A single joint imputation model is specified for all variables.
- Multiple imputation with multiple correspondence analysis (MIMCA)
- Multilevel joint modelling multiple imputation (jomo)
- Dirichlet process mixture of products of multinomial distribution model (DPMPM)
Fully conditional specification (FCS) / sequential regression / chained equations: An imputation model is specified for each variable, conditional on the other variables.
- Multivariate imputation by chained equations (mice).
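
As an illustration of the FCS approach, the sketch below calls the mice package directly, outside the GPAbin pipeline. It assumes mice is installed and uses the `nhanes2` example data that ships with it; the seed and the number of imputations are arbitrary choices.

```r
library(mice)

# nhanes2 ships with mice: age and hyp are factors, bmi and chl are numeric
imp <- mice(nhanes2, m = 5, seed = 2026, printFlag = FALSE)

# One imputation method per variable ("" means nothing to impute)
imp$method

# Extract the m completed data sets as a list of data frames
completed <- complete(imp, action = "all")
length(completed)
```

Each incomplete variable gets its own conditional model, which is why `imp$method` lists one method per variable rather than a single joint model.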

Biplot visualisation

Multivariate categorical data visualisation.
Representation of two modes of data (samples and variables) in a single display.
Multiple Correspondence Analysis (MCA) biplots.

If \(\mathbf{X}\) is categorical with \(n\) rows (samples) and \(p\) columns (variables) with a total of \(q\) category levels:

\[\mathbf{R}^{-\frac{1}{2}}\mathbf{G}\mathbf{C}^{-\frac{1}{2}} = \mathbf{U}\mathbf{ \Sigma}\mathbf{V}^\prime\]

\(\mathbf{G}\Rightarrow\) indicator matrix of \(\mathbf{X}\) with \(n\) samples and \(q\) columns.
\(\mathbf{R}^{-\frac{1}{2}}\),\(\mathbf{C}^{-\frac{1}{2}}\Rightarrow\) diagonal matrices of the row and column weights of \(\mathbf{G}\), respectively.

Generally, plot the first two columns of:

\(\mathbf{U\Sigma}\) (principal coordinates) for the sample coordinates and
the first two columns of \(\mathbf{V}\) (standard coordinates) for the category level point coordinates.
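
The decomposition can be reproduced in base R. This is a minimal sketch, not the package's implementation: the toy data set and the helper `indicator()` are hypothetical. One assumption is flagged in the code: for the uncentred matrix as written above, the leading singular value always equals 1 and corresponds to the trivial solution, so the sketch skips it before taking two dimensions.

```r
# Hypothetical categorical data: n = 6 samples, p = 3 variables, q = 7 levels
X <- data.frame(
  v1 = factor(c("a", "b", "a", "c", "b", "a")),
  v2 = factor(c("x", "x", "y", "y", "x", "y")),
  v3 = factor(c("low", "high", "high", "low", "low", "high"))
)

# Indicator matrix G (n x q): one 0/1 column per category level
indicator <- function(f, name) {
  ind <- sapply(levels(f), function(l) as.numeric(f == l))
  colnames(ind) <- paste(name, levels(f), sep = ".")
  ind
}
G <- do.call(cbind, unname(Map(indicator, X, names(X))))

# Diagonal weights: row sums (all equal to p) and column sums (level counts)
R_inv_sqrt <- diag(1 / sqrt(rowSums(G)))
C_inv_sqrt <- diag(1 / sqrt(colSums(G)))

sv <- svd(R_inv_sqrt %*% G %*% C_inv_sqrt)

# Assumption: skip the trivial first dimension (singular value 1)
dims <- 2:3
Z   <- sv$u[, dims] %*% diag(sv$d[dims])   # sample coordinates (principal)
CLP <- sv$v[, dims]                        # category level points (standard)
rownames(CLP) <- colnames(G)
```

Plotting `Z` and `CLP` in the same display gives the two sets of points that make up the biplot.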

Pipeline of functions: `GPAbin`

missmi: This function produces a list of elements to be used when producing a GPAbin biplot.
impute: Choose between four available multiple imputation strategies in R.
DRT: Multiple correspondence analysis (MCA) is performed on the multiply imputed data sets.
GPAbin: Combines multiple configurations from dimension reduction solutions applied to multiple imputed data sets.
biplFig: Creates a biplot. Current version (1.1.1): MCA biplot.
evalMeas: Calculates measures of comparison based on distances between two configurations in two dimensions.

library(GPAbin)
data(missdat)

missbp <- missmi(missdat) |> impute(imp.method = "DPMPM", m = 5)

imp.method: choose between c("MIMCA", "jomo", "DPMPM", "mice").
m: number of imputations

data(implist)
missbp <- missmi(implist) |> DRT()

implist: an object of multiple imputations to use for illustration of the algorithm.
method: in the current version only MCA is available.

Image 1: \({\bf{A}}\) is the target visualisation
Image 2: \({\bf{B}}\) is the testee visualisation; the figures are already centred (translation not required)
Image 3: The coordinates of \({\bf{B}}\) are reflected
Image 4: The coordinates of \({\bf{B}}\) are rotated and scaled

Borg, I. & Groenen, P. 2005. Modern Multidimensional Scaling. 2nd ed. United States of America: Springer. (Page 433)
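
The four images correspond to the steps of an orthogonal Procrustes fit, which can be sketched in base R. The functions `centre()` and `procrustes_fit()` and the example configurations are hypothetical, and the statistic `ps` uses the standard normalised residual; it is an assumption that this matches the quantity reported by the package.

```r
# Centre the columns of a configuration (removes translation)
centre <- function(M) sweep(M, 2, colMeans(M))

# Fit testee B to target A by reflection/rotation and isotropic scaling
procrustes_fit <- function(A, B) {
  A  <- centre(A)
  B  <- centre(B)
  sv <- svd(crossprod(B, A))        # B'A = U D V'
  Q  <- sv$u %*% t(sv$v)            # optimal orthogonal matrix
  s  <- sum(sv$d) / sum(B^2)        # optimal scaling factor
  ps <- 1 - sum(sv$d)^2 / (sum(A^2) * sum(B^2))   # 0 = perfect fit
  list(fitted = s * B %*% Q, rotation = Q, scale = s, ps = ps)
}

# Hypothetical target, and a testee that is a reflected, rotated, shrunken copy
A     <- matrix(c(0, 0,  2, 0,  2, 1,  0, 3), ncol = 2, byrow = TRUE)
theta <- pi / 5
rot   <- matrix(c(cos(theta), sin(theta), -sin(theta), cos(theta)), 2, 2)
flip  <- diag(c(1, -1))
B     <- 0.5 * centre(A) %*% flip %*% rot

fit <- procrustes_fit(A, B)
max(abs(fit$fitted - centre(A)))   # numerically zero: B is mapped onto A
fit$ps                             # numerically zero for an exact match
```

Because `B` is an exact transformed copy of `A`, the fit undoes the reflection, rotation and scaling completely; with real configurations `ps` would lie between 0 and 1.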

Methodology step-by-step

GPA illustration
GPAbin
Biplot visualisation
Evaluation

missbp <- missmi(implist) |> DRT() |> GPAbin()

Solid green triangles: testee (completed) category level points.
Solid red squares: target (centroid configuration) category level points.

missbp <- missmi(implist) |> DRT() |> GPAbin()

G.target: the default is NULL to utilise the centroid coordinates of the m imputations.
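
The idea of aligning m configurations to their centroid can be sketched as a simple generalised Procrustes loop in base R. This is a simplified illustration under stated assumptions, not the algorithm as implemented in `GPAbin()`: it uses rotation and reflection only (no scaling), and the functions `rotate_to()`, `gpa_loss()` and `gpa()` and the simulated configurations are hypothetical.

```r
centre <- function(M) sweep(M, 2, colMeans(M))

# Rotate/reflect configuration B towards target A (no scaling in this sketch)
rotate_to <- function(B, A) {
  sv <- svd(crossprod(B, A))
  B %*% sv$u %*% t(sv$v)
}

# Sum of squared distances of all configurations to their centroid
gpa_loss <- function(configs) {
  centroid <- Reduce(`+`, configs) / length(configs)
  sum(sapply(configs, function(M) sum((M - centroid)^2)))
}

gpa <- function(configs, tol = 1e-10, max_iter = 100) {
  configs <- lapply(configs, centre)
  target  <- Reduce(`+`, configs) / length(configs)   # centroid as target
  for (iter in seq_len(max_iter)) {
    configs    <- lapply(configs, rotate_to, A = target)
    new_target <- Reduce(`+`, configs) / length(configs)
    converged  <- sum((new_target - target)^2) < tol
    target     <- new_target
    if (converged) break
  }
  list(target = target, configs = configs, iterations = iter)
}

# Three hypothetical configurations: rotated/reflected, noisy copies of one shape
set.seed(42)
base <- centre(matrix(rnorm(20), ncol = 2))
spin <- function(theta) matrix(c(cos(theta), sin(theta), -sin(theta), cos(theta)), 2, 2)
configs <- list(
  base %*% spin(0.3) + rnorm(20, sd = 0.05),
  base %*% spin(1.1) %*% diag(c(1, -1)) + rnorm(20, sd = 0.05),
  base %*% spin(-0.6) + rnorm(20, sd = 0.05)
)

before <- gpa_loss(lapply(configs, centre))
result <- gpa(configs)
after  <- gpa_loss(result$configs)
c(before = before, after = after)   # the loss does not increase
```

Each pass first aligns every configuration to the current centroid and then recomputes the centroid, so the loss cannot increase from one iteration to the next.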

missbp <- missmi(implist) |> DRT() |> GPAbin() |> biplFig()

Z.col, CLP.col: Colour of sample coordinates and category level points, respectively.
Z.pch, CLP.pch: Plotting character of sample coordinates and category level points, respectively.
Z.cex, CLP.cex: Size of plotting character for sample points and category level points, respectively.

Measures
Measures illustration
Extraction
Visualisation

Orthogonal Procrustes Analysis of complete MCA biplot (target) vs. GPAbin biplot (testee):

Procrustes Statistic (PS): between 0 (good) and 1 (bad).
Absolute Mean Bias (AMB): low (good) compared to other AMB values.
Root Mean Squared Bias (RMSB): low (good) compared to other RMSB values.

Evaluation measures based on response profiles:

Similarity Percentage (SP): between 0 (bad) and 1 (good). Coordinates of category levels in closest proximity to sample coordinates per variable.
Response Pattern Recovery (RPR): between 0 (bad) and 1 (good). The number of recovered response profiles predicted from the GPAbin biplot compared to the true response profiles.

compdat: Complete data matrix representing the input data of missmi(). This only applies to simulated data.

data(compdat)
missbp <- missmi(implist) |> DRT() |> GPAbin() |> evalMeas(compdat = compdat)
missbp$eval

#      Evaluation measures
# PS                0.0127
# SP                0.9646
# RPR               0.8720
# AMB               0.1268
# RMSB              0.1555

Some general tips

CRUCIAL: GitHub and Git for version control, project management and collaboration.
- Improves workflow and the process of maintaining individual scripts.
- Promotes open-source software.
- Creates visibility.
- Valuable to receive feedback from users (experts, non-technical users, students).
Happy Git and GitHub for the useR: Jenny Bryan and fellow contributors.
R Packages: Hadley Wickham and Jennifer Bryan.
R Forwards: Partners with community groups to advance inclusive and open-source software and technologies.
R Forwards package development: resources of workshops.

Promises for next time

Addition of PCA GPAbin biplots to GPAbin package.
ggplot2 additions to package.
Understand the variation / uncertainty in the visualisations.
- Unsupervised multivariate data tours.
…

Acknowledgements

I acknowledge the support and contributions of:

Prof. Niël le Roux and Prof. Sugnet Lubbe:
Dianne Cook - Monash University
- NGA(MaSS) - funding for research visit May 2025
Emi Tanaka - slide inspiration
Ursula Laa - slide inspiration
