# Synthetic Data Generation on the Adult dataset

In this case study we will use the Census Income dataset (also known as the Adult dataset), which consists of 32,561 instances described by 15 attributes (nine categorical and six numerical). This dataset has primarily been used as a classification dataset, where the objective is to predict the value of the binary attribute Income (<=50K, >50K) on the basis of the other 14 attributes. Unlike the Iris dataset, the Census Income data contains some missing values. A snippet of the original dataset is shown below, together with a snippet of synthetic examples generated using the UNCRi framework.

As is always the case when using the UNCRi framework, the first step is to create an UNCRi model for the dataset of interest. This model can then be used to perform a variety of tasks, but those tasks can only be performed accurately if the model adequately represents the original data. While visual inspection of the tables above suggests that the generated examples are plausible, we need to apply more sophisticated tests to validate the UNCRi model so that we can be assured that it is modeling conditional distributions accurately. In the first part of the case study we will focus on methods that we can use to test the validity of UNCRi models.

## Model Validation

##### Visualizing Marginal and Bivariate Distributions

After creating synthetic data, the first thing we normally want to check is how well it matches the data used to create the model. Many tests are available; one of the simplest is to compare the distribution of the original examples with the distribution of the synthetic examples generated by the model. Not only would we expect the marginal (or ‘univariate’) distributions (i.e., the distribution of values for a single variable) to be similar, but we would also expect bivariate (the joint distribution of two variables) and higher-order distributions to be similar.

The UNCRi toolbox includes a visualization tool for conveniently comparing the marginal and bivariate distributions of a synthetic dataset with those of the original dataset. The figures below show three different pairwise comparisons. Synthetic data is shown on top, original data below. The panels on the left and right show the marginal distributions of the two variables, either as histograms or as continuous pdfs. The center panel shows the bivariate distribution, with the first variable plotted on the horizontal axis and the second on the vertical axis. To make comparison easier, the distribution in the right panel is plotted on its side.
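Alongside visual comparison, a single number per attribute can be handy for spotting marginals that disagree. The sketch below is not part of the UNCRi toolbox; it is a minimal illustration, assuming a categorical attribute stored as a plain Python list with `None` for missing values, of how the total variation distance between an original and a synthetic marginal could be computed. The function names and the toy columns are hypothetical.

```python
from collections import Counter

def marginal(values):
    """Relative frequency of each category, ignoring missing values (None)."""
    observed = [v for v in values if v is not None]
    counts = Counter(observed)
    return {k: c / len(observed) for k, c in counts.items()}

def total_variation(real_values, synth_values):
    """Total variation distance between two categorical marginals.

    0 means the marginals are identical, 1 means they share no categories.
    """
    p, q = marginal(real_values), marginal(synth_values)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in set(p) | set(q))

# Toy columns standing in for one categorical attribute (illustrative values only).
real_col = ["Private", "Private", "Self-emp", "Gov", None, "Private"]
synth_col = ["Private", "Self-emp", "Self-emp", "Gov", "Private", "Private"]

tvd = total_variation(real_col, synth_col)
print(f"total variation distance: {tvd:.3f}")
```

A small distance for every attribute is necessary but not sufficient: two datasets can agree on every marginal and still differ in their joint structure, which is why the bivariate panels and the tests that follow are also needed.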

##### The TSTR ('Train on Synthetic, Test on Real') test

Although comparing univariate and bivariate distributions is useful, it doesn’t tell us about higher-order relationships in the data, and performing trivariate or higher-order comparisons is generally not feasible because of both the number of combinations and the difficulty of visualizing them.

A useful means of testing whether synthetic data has captured the relationships present in the original dataset is the ‘Train on Synthetic, Test on Real’ (TSTR) test. This test involves training a prediction model (usually a classifier or regressor) using the synthetic data, and then using this prediction model to predict the value of some variable in the original data. If the prediction model performs satisfactorily, then it is inferred that the synthetic examples have adequately captured the relationships between variables in the original dataset.

The TSTR test for the Census Income dataset yields an AUC (area under ROC curve) value of **0.880**. This compares favorably against the AUC value of **0.906** obtained using 20-fold cross-validation on the original examples. We can conclude from this that the synthetic examples have captured the important higher order relationships present in the original dataset.
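The TSTR procedure can be sketched in a few lines. The code below is only an illustration under stated assumptions: the nearest-centroid scorer is a deliberately simple stand-in for whatever classifier is actually used, the toy records are made up, and none of the function names come from the UNCRi toolbox. The AUC is computed with the rank-based formulation, i.e., the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison; ties count one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def fit_centroid_scorer(X, y):
    """Stand-in classifier: scores a point by how much closer it is to the
    positive-class centroid than to the negative-class centroid."""
    def centroid(label):
        rows = [x for x, t in zip(X, y) if t == label]
        return [sum(col) / len(rows) for col in zip(*rows)]

    c0, c1 = centroid(0), centroid(1)

    def score(x):
        d0 = sum((a - b) ** 2 for a, b in zip(x, c0))
        d1 = sum((a - b) ** 2 for a, b in zip(x, c1))
        return d0 - d1  # larger means closer to the positive class

    return score

def tstr_auc(X_synth, y_synth, X_real, y_real):
    """Train on synthetic examples, then evaluate AUC on real examples."""
    score = fit_centroid_scorer(X_synth, y_synth)
    return auc([score(x) for x in X_real], y_real)

# Toy numeric records (two features each) with binary labels; illustrative only.
X_synth = [[25, 20], [30, 35], [45, 50], [52, 60]]
y_synth = [0, 0, 1, 1]
X_real = [[22, 30], [38, 40], [41, 45], [60, 55]]
y_real = [0, 0, 1, 1]

result = tstr_auc(X_synth, y_synth, X_real, y_real)
print(f"TSTR AUC on toy data: {result:.3f}")
```

The key point of the design is that the real data is never seen during training, so the score reflects only what the synthetic examples carry over. The baseline for comparison is the same classifier trained and tested on real data with cross-validation.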

##### Comparing full joint probabilities

The UNCRi model can be used to calculate the value of the full joint probability density function at any point (most synthetic data generators cannot do this). The histograms below compare the distribution of (negative log) joint probabilities for each set. We can see clearly that the distribution of these values is similar in the original and synthetic datasets, adding to our confidence that the model has captured the statistical properties of the original dataset.
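To make the comparison concrete, the sketch below bins negative log densities for two sets of points using shared bin edges. It assumes only that some callable returning a log density is available; the `standard_normal_logpdf` function is a hypothetical stand-in for a model's density and is not the UNCRi API. The points and bin edges are made up for illustration.

```python
import math

def standard_normal_logpdf(x):
    """Stand-in log density (independent standard normals).

    A real model would supply its own joint log density here.
    """
    return sum(-0.5 * v * v - 0.5 * math.log(2 * math.pi) for v in x)

def nll_histogram(points, log_density, edges):
    """Relative frequency of negative log densities in each [edges[i], edges[i+1]) bin.

    Values falling outside the outermost edges are not counted in any bin.
    """
    counts = [0] * (len(edges) - 1)
    for x in points:
        nll = -log_density(x)
        for i in range(len(counts)):
            if edges[i] <= nll < edges[i + 1]:
                counts[i] += 1
                break
    return [c / len(points) for c in counts]

# Toy two-dimensional points standing in for original and synthetic examples.
real = [[0.0, 0.0], [1.0, -1.0], [2.0, 2.0]]
synth = [[0.1, 0.0], [-1.0, 1.0], [2.0, -2.0]]
edges = [0.0, 2.0, 4.0, 6.0, 8.0]

real_hist = nll_histogram(real, standard_normal_logpdf, edges)
synth_hist = nll_histogram(synth, standard_normal_logpdf, edges)
print(real_hist)
print(synth_hist)
```

Using the same bin edges for both sets matters: histograms built with independently chosen bins cannot be compared bar for bar.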

##### The Maximum Similarity test

The purpose of this test is to assess whether the synthetic examples are closer to the real examples than the real examples are to each other. If the synthetic examples are closer, this indicates that there has been some ‘leakage’; i.e., the real data has leaked into the synthetic data. Consequently, this test often comes under the umbrella term of ‘privacy’ tests. The Average Maximum Similarity test operates as follows:

- For the real dataset calculate the average maximum similarity between each data point and its closest neighbor and call this R_max.
- For the synthetic dataset calculate the average maximum similarity between each data point and its closest neighbor in the ‘real’ dataset and call this S_max.
- Compare R_max and S_max. Ideally R_max and S_max should be roughly equal. If S_max is greater than R_max, the synthetic examples are probably too similar to the real examples.
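The three steps above can be sketched as follows. The source does not state which similarity measure is used, so this example assumes a Gower-style similarity for mixed numerical and categorical records; that choice, the function names, and the toy records are all assumptions made for illustration.

```python
def gower_similarity(a, b, ranges):
    """Gower-style similarity in [0, 1] for mixed records.

    ranges[i] is the numeric range of attribute i, or None if it is categorical.
    """
    total = 0.0
    for x, y, r in zip(a, b, ranges):
        if r is None:
            total += 1.0 if x == y else 0.0   # categorical: exact match or not
        else:
            total += 1.0 - abs(x - y) / r     # numerical: scaled absolute difference
    return total / len(ranges)

def average_max_similarity(queries, references, ranges, exclude_self=False):
    """Average, over queries, of the similarity to the closest reference point.

    Set exclude_self=True when queries and references are the same list, so that
    a point is not matched with itself.
    """
    best = []
    for i, q in enumerate(queries):
        sims = [gower_similarity(q, r, ranges)
                for j, r in enumerate(references)
                if not (exclude_self and i == j)]
        best.append(max(sims))
    return sum(best) / len(best)

# Toy (age, workclass) records; the age range of 40 is taken from the toy data.
real = [(20, "Private"), (30, "Private"), (60, "Gov")]
synth = [(25, "Private"), (50, "Gov")]
ranges = [40, None]

r_max = average_max_similarity(real, real, ranges, exclude_self=True)
s_max = average_max_similarity(synth, real, ranges)
print(f"R_max = {r_max:.3f}, S_max = {s_max:.3f}")
```

With only a handful of toy records the two values are not meaningful in themselves; the sketch is meant to show the mechanics. The cost grows with the product of the two set sizes, which is why the test is usually run on a random subsample.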

The Census Income dataset contains approximately 32,000 instances, and it is not computationally feasible to perform the test on this number of examples. Instead, 1,000 points were selected at random from each of the original and synthetic datasets, and the test was applied to these sets. The average maximum similarity between examples in the original set was **0.932**, and the average maximum similarity between examples in the synthetic set and those in the original set was **0.926**, which supports the claim that the synthetic examples are independent and not just ‘perturbations’ of those in the original set.

Rather than calculating only the average maximum similarity, it can sometimes be useful to compare the actual distribution of maximum similarities. This can be done by constructing a histogram in which the maximum similarities have been binned. The histogram displays not only where the mean lies, but also how the similarities are distributed about this mean. Histograms for the Census Income data are shown below. It can be seen that the general shape of the distributions is very similar, with the maximum similarities between real data points (left) being slightly higher than those between synthetic and real data points (right).

From the results of these various tests we can conclude that our UNCRi model for the Census Income dataset is able to reliably estimate conditional distributions. This means that we can confidently use it to perform tasks such as synthetic data generation and joint probability estimation, which we have already seen above.

## Synthetic Data from Conditional Distributions

All of the results above have been based on generating data from the full joint (or *unconditional*) distribution. In some cases we may wish to generate examples matching some condition. An obvious way of doing this (which is how it is done with many synthetic data generators) is to simply generate examples from the full joint distribution but to only accept examples that match the condition. But this approach is impractical if there are very few examples matching the condition. The UNCRi generator explicitly estimates the conditional distribution, and samples directly from this distribution. This makes it extremely efficient and able to generate points even in situations where there are few (if any!) points matching the condition in the original dataset.
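To see why the accept/reject approach becomes impractical, the sketch below counts how many unconditional draws are needed to collect a fixed number of matching examples. The sampler is a hypothetical stand-in with made-up category probabilities; these are not the proportions in the Census Income data, and the sampler is not the UNCRi generator. If a condition is matched with probability p, rejection sampling needs on average n / p draws to collect n examples.

```python
import random

def sample_unconditional(rng):
    """Stand-in joint sampler returning one record with two categorical attributes.

    The weights below are illustrative assumptions, not dataset statistics.
    """
    race = rng.choices(["White", "Black", "Amer-Indian-Eskimo"],
                       weights=[0.85, 0.14, 0.01])[0]
    gender = rng.choice(["Male", "Female"])
    return {"race": race, "gender": gender}

def rejection_sample(n, condition, rng):
    """Draw from the joint distribution until n records satisfy the condition."""
    accepted, draws = [], 0
    while len(accepted) < n:
        record = sample_unconditional(rng)
        draws += 1
        if all(record[k] == v for k, v in condition.items()):
            accepted.append(record)
    return accepted, draws

rng = random.Random(0)
condition = {"race": "Amer-Indian-Eskimo", "gender": "Female"}
accepted, draws = rejection_sample(20, condition, rng)

# With these toy weights the match probability is 0.01 * 0.5 = 0.005,
# so roughly 200 draws are expected per accepted example.
print(f"{len(accepted)} accepted after {draws} draws")
```

Sampling directly from an estimated conditional distribution avoids this waste entirely, since every draw satisfies the condition by construction.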

The figure below shows the toolbox interface for selecting conditions. In this example, the conditions are that the variable race must be Amer-Indian-Eskimo and gender must be female.

The figure below shows the univariate and bivariate distributions for income and marital-status. There are clear differences in the distributions between the generated examples (top) and the original examples (bottom).

## Prediction and Imputation

The Census Income dataset contains missing values. The UNCRi imputation tool estimates the distribution of the missing value given the known attribute values, and randomly imputes a value from this distribution. Similarly, the prediction tool also estimates the conditional distribution, but in this case assigns the expected value of the estimated distribution. Either of these can be applied to the original dataset if required.
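The difference between the two tools can be illustrated with a small sketch. It assumes that the conditional distribution of a missing numerical value has already been estimated and is available as a list of candidate values with probabilities; the values, probabilities, and function names below are hypothetical and do not reflect the UNCRi API.

```python
import random

def impute(values, probs, rng):
    """Imputation: draw one value at random from the estimated conditional distribution."""
    return rng.choices(values, weights=probs)[0]

def predict(values, probs):
    """Prediction: return the expected value of the estimated conditional distribution."""
    return sum(v * p for v, p in zip(values, probs))

# Hypothetical conditional distribution for a missing hours-per-week value.
hours = [20, 40, 60]
probs = [0.2, 0.6, 0.2]

rng = random.Random(0)
imputed = impute(hours, probs, rng)
predicted = predict(hours, probs)
print(f"imputed: {imputed}, predicted: {predicted:.1f}")
```

Random imputation preserves the spread of the conditional distribution across many filled-in records, whereas assigning the expected value gives the single best guess in a squared-error sense but pulls every filled-in value toward the center.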

## Conclusion

The Census Income dataset is one of the most popular datasets used in evaluating synthetic data generation methods. We constructed an UNCRi model for this dataset, and applied a number of tests to validate the model. The tests indicate that the generated data displays similar distributional characteristics to the original data, and that it is of utility in tasks such as classification. Importantly, the maximum similarity test indicates that the generated examples are no more similar to the original examples than the original examples are to each other, supporting the claim that the real data has not been ‘leaked’ into the synthetic data. We can conclude that the UNCRi model can reliably estimate conditional distributions on this dataset, so we can confidently apply it to a broad range of inference tasks.