GPAbin: an R package for unified visualisations of imputed data sets

Johané Nienkemper-Swanepoel


Centre for Multi-Dimensional Data Visualisation (MuViSU).
Department of Statistics and Actuarial Science, Stellenbosch University.
National Institute for Theoretical and Computational Sciences.


SU seminar series: 13 February 2026

Looking back 2016

Looking back 2016: exploring ideas

Looking back 2020

Looking back: making promises

Back to today

GPAbin - pronounced G - P - A - bin

An R package to unify multiple biplot visualisations into a single display.

Nienkemper-Swanepoel J (2025). GPAbin: Unifying Multiple Biplot Visualisations into a Single Display. R package version 1.1.1, https://CRAN.R-project.org/package=GPAbin.

Required background

  • Missing data
  • Multiple imputation
  • Biplot visualisation

Missing data

Missing data mechanisms (MDMs):

  • Missing Completely At Random (MCAR): The probability that a value is missing is independent of both the observed and the unobserved values.

  • Missing At Random (MAR): The probability that a value is missing depends on observed values, but not on the unobserved values themselves.

  • Missing Not At Random (MNAR): The probability that a value is missing depends on the unobserved values themselves, possibly in addition to observed values.

The focus is placed on MAR scenarios, which are assumed to be the standard occurrence in practice.
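
The three mechanisms can be made concrete with a small simulation. This is a minimal sketch, not part of the GPAbin package: the function name `simulate_missingness`, the median split and the default `rate` are all assumptions chosen for illustration. Variable `x` stays fully observed and values of `y` are deleted according to the chosen mechanism.

```python
import random


def simulate_missingness(x, y, mechanism, rate=0.3, seed=1):
    """Return a copy of y in which None marks the deleted values.

    x is fully observed; y receives the missing values.
    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    median_x = sorted(x)[len(x) // 2]
    median_y = sorted(y)[len(y) // 2]
    out = []
    for xi, yi in zip(x, y):
        if mechanism == "MCAR":
            # Same deletion probability for every value.
            p = rate
        elif mechanism == "MAR":
            # Deletion probability depends only on the observed x.
            p = 2 * rate if xi > median_x else 0.0
        elif mechanism == "MNAR":
            # Deletion probability depends on the value of y that is deleted.
            p = 2 * rate if yi > median_y else 0.0
        else:
            raise ValueError("mechanism must be 'MCAR', 'MAR' or 'MNAR'")
        out.append(None if rng.random() < p else yi)
    return out


if __name__ == "__main__":
    x = list(range(20))
    y = [(7 * i) % 20 for i in range(20)]
    for mechanism in ("MCAR", "MAR", "MNAR"):
        print(mechanism, simulate_missingness(x, y, mechanism))
```

Under MAR the deleted values can be predicted from `x`, which is what an imputation model exploits. Under MNAR the reason for deletion is itself unobserved.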

Multiple imputation (MI)

  • Each missing value is replaced with multiple plausible values.
  • This results in multiple completed data sets, each of which can be analysed with standard complete-data methods.
  • Estimates are combined using Rubin’s rules (Rubin 1987).
  • The spread of the multiple imputed values per missing value gives a realistic representation of the variation due to imputation.
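
As a minimal sketch of the combination step, Rubin's rules pool a scalar estimate from \(m\) completed data sets by averaging the estimates and adding the within-imputation and between-imputation variances. The function name `pool_rubin` and the numbers in the usage example are illustrative assumptions, not output from any real analysis.

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool m estimates and their variances with Rubin's rules.

    Returns the pooled estimate and its total variance.
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two estimates, one variance each")
    q_bar = mean(estimates)          # pooled point estimate
    within = mean(variances)         # average within-imputation variance
    between = variance(estimates)    # between-imputation variance (m - 1 denominator)
    total = within + (1 + 1 / m) * between
    return q_bar, total


if __name__ == "__main__":
    # Made-up estimates from three imputed data sets.
    print(pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5]))
```

The factor \(1 + 1/m\) inflates the between-imputation variance to account for using a finite number of imputations.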

MI strategies

Two general approaches to impute missing values in multiple variables:

  • Joint modelling (JM): A single joint imputation model is specified for all variables.

    • Multiple imputation with multiple correspondence analysis (MIMCA)

    • Multilevel joint modelling multiple imputation (jomo)

    • Dirichlet process mixture of products of multinomial distributions (DPMPM)

  • Fully conditional specification (FCS) / sequential regression / chained equations: A separate imputation model is specified for each variable, conditional on the other variables.

    • Multivariate imputation by chained equations (mice).

Biplot visualisation

  • Multivariate categorical data visualisation.
  • Representation of two modes of data (samples and variables) in a single display.
  • Multiple Correspondence Analysis (MCA) biplots.

If \(\mathbf{X}\) is categorical with \(n\) rows (samples) and \(p\) columns (variables) with a total of \(q\) category levels:

\[\mathbf{R}^{-\frac{1}{2}}\mathbf{G}\mathbf{C}^{-\frac{1}{2}} = \mathbf{U}\mathbf{ \Sigma}\mathbf{V}^\prime\]

  • \(\mathbf{G}\Rightarrow\) indicator matrix of \(\mathbf{X}\) with \(n\) samples and \(q\) columns.

  • \(\mathbf{R}^{-\frac{1}{2}}\), \(\mathbf{C}^{-\frac{1}{2}}\Rightarrow\) diagonal matrices of the row and column weights of \(\mathbf{G}\).

Generally, plot:

  • the first two columns of \(\mathbf{U\Sigma}\) (principal coordinates) for the sample coordinates and

  • the first two columns of \(\mathbf{V}\) (standard coordinates) for the category level point coordinates.
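
The construction of \(\mathbf{G}\) and of the scaled matrix on the left-hand side of the decomposition can be sketched as follows. This is an illustration under stated assumptions: the row and column weights are taken to be the row and column sums of \(\mathbf{G}\), and the helper names `indicator_matrix` and `scaled_matrix` are hypothetical. The singular value decomposition itself is left to a linear algebra routine (for example `numpy.linalg.svd`).

```python
def indicator_matrix(rows):
    """Build the n x q indicator matrix G from n samples on p categorical variables.

    rows is a list of equal-length tuples of category labels.
    Returns G and the list of (variable index, level) column labels.
    """
    p = len(rows[0])
    levels = [(j, lev) for j in range(p) for lev in sorted({r[j] for r in rows})]
    column = {key: k for k, key in enumerate(levels)}
    G = [[0] * len(levels) for _ in rows]
    for i, r in enumerate(rows):
        for j, lev in enumerate(r):
            G[i][column[(j, lev)]] = 1
    return G, levels


def scaled_matrix(G):
    """Compute R^(-1/2) G C^(-1/2), assuming R and C hold the row and column sums of G."""
    row_w = [sum(row) for row in G]            # equals p for every sample
    col_w = [sum(col) for col in zip(*G)]      # frequency of each category level
    return [
        [g / (row_w[i] ** 0.5 * col_w[j] ** 0.5) for j, g in enumerate(row)]
        for i, row in enumerate(G)
    ]


if __name__ == "__main__":
    X = [("a", "x"), ("b", "x"), ("a", "y")]   # n = 3 samples, p = 2 variables
    G, levels = indicator_matrix(X)
    print(levels)                               # q = 4 category levels
    for row in scaled_matrix(G):
        print([round(v, 3) for v in row])
```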

Pipeline of functions: GPAbin

  • missmi: This function produces a list of elements to be used when producing a GPAbin biplot.

  • impute: Choose between four available multiple imputation strategies in R.

  • DRT: Multiple correspondence analysis (MCA) is performed on the multiply imputed data sets.

  • GPAbin: Combines multiple configurations from dimension reduction solutions applied to multiply imputed data sets.

  • biplFig: Creates a biplot. Current version (1.1.1): MCA biplot.

  • evalMeas: Calculates measures of comparison based on distances between two configurations in two dimensions.
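
To illustrate the idea behind the GPAbin step in the list above, the following sketch aligns several two-dimensional configurations and averages them. It is a deliberately simplified stand-in, not the package's implementation: each configuration is centred and rotated onto the first one with a closed-form planar rotation, whereas a full generalised Procrustes analysis would typically iterate towards a consensus target and may also allow reflection and scaling. The names `centre`, `rotate_to` and `combine` are hypothetical.

```python
import math


def centre(config):
    """Translate a list of (x, y) points so that their centroid is the origin."""
    n = len(config)
    cx = sum(p[0] for p in config) / n
    cy = sum(p[1] for p in config) / n
    return [(x - cx, y - cy) for x, y in config]


def rotate_to(target, config):
    """Rotate a centred configuration to best match a centred target (least squares)."""
    a = sum(tx * x + ty * y for (tx, ty), (x, y) in zip(target, config))
    b = sum(ty * x - tx * y for (tx, ty), (x, y) in zip(target, config))
    theta = math.atan2(b, a)                   # optimal planar rotation angle
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x - s * y, s * x + c * y) for x, y in config]


def combine(configs):
    """Align all configurations to the first one and return the mean coordinates."""
    centred = [centre(cfg) for cfg in configs]
    target = centred[0]
    aligned = [target] + [rotate_to(target, cfg) for cfg in centred[1:]]
    m = len(aligned)
    return [
        (sum(cfg[i][0] for cfg in aligned) / m, sum(cfg[i][1] for cfg in aligned) / m)
        for i in range(len(target))
    ]


if __name__ == "__main__":
    base = [(0.0, 0.0), (2.0, 0.0), (0.0, 1.0)]
    # The same shape, rotated by 90 degrees and shifted.
    moved = [(-y + 5.0, x - 3.0) for x, y in base]
    print(combine([base, moved]))
```

Without the alignment, averaging coordinates across imputations would mix configurations that differ only by an arbitrary rotation or shift, which is why alignment precedes the averaging.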

Some general tips

  • CRUCIAL: GitHub and Git for version control, project management and collaboration.

    • Improves workflow and the process of maintaining individual scripts.

    • Promotes open-source software.

    • Creates visibility.

    • Valuable to receive feedback from users (experts, non-technical users, students).

  • Happy Git and GitHub for the useR: Jenny Bryan and fellow contributors.

  • R Packages: Hadley Wickham and Jennifer Bryan.

  • R Forwards: Partners with community groups to advance inclusive and open-source software and technologies.

  • R Forwards package development: workshop resources.

Promises for next time

  • Addition of PCA GPAbin biplots to GPAbin package.
  • ggplot2 additions to package.
  • Understand the variation / uncertainty in the visualisations.
    • Unsupervised multivariate data tours.

Acknowledgements

I acknowledge the support and contributions of:

  • Prof. Niël le Roux and Prof. Sugnet Lubbe:

  • Dianne Cook - Monash University

    • NGA(MaSS) - funding for research visit May 2025
  • Emi Tanaka - slide inspiration

  • Ursula Laa - slide inspiration