The UNCRi Framework and Tools
The UNCRi Framework
At the heart of the UNCRi (Unified Numerical/Categorical Representation and Inference) framework lies a unique data representation scheme coupled with a powerful and flexible inference procedure. Features of the framework include:
- Based on a unique data representation scheme that treats numerical and categorical variables in a unified fashion, so that a single robust measure can express the similarity between any two data points, irrespective of the mix of attribute types or the distribution of their values. In UNCRi, however, the similarity between points is never calculated explicitly; it is implicit in the graph-based inference procedure.
- No need for training separate models for regression, classification, etc. Once a model has been created for a dataset ALL inference is performed using that model. The source dataset is no longer required.
- Can estimate and sample from any conditional distribution, so can be applied to a broad range of tasks including conditional and unconditional synthetic data generation, prediction tasks such as classification and regression, missing value imputation, outlier analysis, etc.
- Numerical variables can be highly skewed; categorical variables can vary from binary to high cardinality.
- Works seamlessly even when the missing value ratio is very high. This matters because independent datasets that share only a small number of attributes can then be represented in a single model, which allows data from different domains to be integrated and is useful in application areas such as recommender systems.
If you would like to know more about what’s under the hood of the UNCRi framework you can find more information at our FAQ page.
What is an UNCRi Model?
An UNCRi model consists of a graph-based representation of a dataset together with a small number of hyperparameters. Creating a model is always the first step you perform when using UNCRi on a new dataset. Once a model has been created, all inference is performed using that model — the source dataset is no longer required. UNCRi models can be saved to disk and retrieved for later use.
When saved to disk, UNCRi models are typically much smaller than the dataset that they model, especially for large datasets. This is because the UNCRi framework can represent either the original data points or, alternatively, ‘prototypes’ derived from these points. Prototypes are points (not necessarily present in the original dataset) that are representative enough to replace the original data without significant degradation in inference quality, while dramatically improving speed and memory usage.
UNCRi models involve only one hyperparameter for each variable. Hyperparameter optimization is a quick and straightforward process and is performed when the model is created.
Since all tasks performed in the UNCRi framework rely on the model, model validation is important. You can find out more on model validation in the case studies.
The UNCRi Toolbox
The UNCRi toolbox has been developed as a convenient GUI-packaged collection of common generic tasks that can be performed under the UNCRi framework. The toolbox can be used to create and save UNCRi data models, and currently contains tools for synthetic data generation, prediction, data imputation, and joint probability estimation. These tools can be applied to solving a large variety of important and challenging real-world problems.
It is important to highlight that UNCRi is not just a collection of tools. It is a flexible framework that can be applied in many custom situations. For example, the framework can easily be used to create effective content-based, collaborative and hybrid recommender systems.
Synthetic Data Generation
The Synthetic Data Generator is a powerful tool that can be used to generate synthetic data, i.e., novel data points that are distinct from — yet distributed in the same way as — those in the dataset from which the model was created. The uses of such data are numerous: dataset expansion to facilitate experimentation and model building, dealing with imbalanced datasets, addressing security and privacy concerns, etc. Not surprisingly this field has become enormously popular over just the last few years.
As well as generating data points from the full joint distribution, the generator can draw points from any conditional distribution, which is useful for data expansion in scenarios where few data points match some condition of interest. For example, consider an organisation conducting a market segmentation exercise whose objective is to discover groupings of customers sharing particular attributes (e.g., middle-aged single men with an annual spend greater than $10k), but where the number of data points matching those conditions is too small for reliable clustering. The UNCRi generator can generate an arbitrary number of synthetic data points matching this condition. This expanded dataset can then be clustered.
Importantly, and unlike many approaches to synthetic data generation, the UNCRi tool does not generate points from conditional distributions simply by generating points from the joint distribution and then rejecting those not matching the conditions. Rather, the conditional distribution itself is estimated, and examples are drawn directly from it. This allows the UNCRi tool to generate data efficiently even when the imposed conditions rarely occur in the original dataset.
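To illustrate why this matters, the following minimal sketch contrasts rejection sampling with drawing directly from a conditional distribution. It does not use the UNCRi model or any UNCRi API: the toy joint distribution, the names `P_RARE`, `SPEND_PARAMS`, `sample_by_rejection`, and `sample_conditional`, and all parameter values are assumptions made purely for illustration.

```python
import random

# Toy joint distribution (hypothetical, for illustration only):
# a customer is in the "rare" segment with probability 0.01, and
# annual spend is normally distributed with segment-specific parameters.
P_RARE = 0.01
SPEND_PARAMS = {"rare": (12_000.0, 1_500.0), "common": (3_000.0, 800.0)}


def sample_joint(rng):
    """Draw one (segment, spend) pair from the toy joint distribution."""
    segment = "rare" if rng.random() < P_RARE else "common"
    mean, sd = SPEND_PARAMS[segment]
    return segment, rng.gauss(mean, sd)


def sample_by_rejection(rng, segment, n):
    """Draw from the joint and discard every point in the wrong segment."""
    kept, draws = [], 0
    while len(kept) < n:
        draws += 1
        seg, spend = sample_joint(rng)
        if seg == segment:
            kept.append(spend)
    return kept, draws


def sample_conditional(rng, segment, n):
    """Draw directly from p(spend | segment); no draw is wasted."""
    mean, sd = SPEND_PARAMS[segment]
    return [rng.gauss(mean, sd) for _ in range(n)], n


rng = random.Random(0)
rejected, rejection_draws = sample_by_rejection(rng, "rare", 200)
direct, direct_draws = sample_conditional(rng, "rare", 200)
print(f"rejection sampling:   {rejection_draws} draws for {len(rejected)} points")
print(f"conditional sampling: {direct_draws} draws for {len(direct)} points")
```

In this toy setting the rejection approach needs on the order of 1 / `P_RARE` joint draws per accepted point, so its cost grows as the condition becomes rarer, whereas the direct approach always uses exactly as many draws as points requested.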
Prediction and Imputation
The Prediction and Imputation tools are similar to each other, but whereas prediction involves determining the most likely value of some missing attribute, imputation involves selecting a value from the distribution of possible values for that attribute.
Predicting the value of a variable is straightforward because once the relevant probability distribution has been estimated (i.e., the probability of the target variable conditioned on the values of the other variables), the predicted value is just the expected value (mean) of this distribution. This means that the Prediction tool can be used for conventional classification or regression tasks. In fact, because numerical and categorical variables are treated uniformly in the UNCRi framework, there is effectively no difference between regression and classification—they are both simply ‘prediction’ tasks.
In the case of imputation, the imputed value is a random value drawn from the estimated distribution for that variable. This gives the UNCRi approach a clear advantage over conventional data imputation methods, which simply fill in the most frequent value for categorical variables and the mean or median for numerical variables, and so artificially shrink the spread of the imputed variable.
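The effect of the two imputation strategies on a variable's spread can be seen in a small experiment. The sketch below is not UNCRi code: the column is simulated, and resampling the observed values stands in for drawing from an estimated distribution (UNCRi would draw from a distribution conditioned on the other variables, which this toy example does not model).

```python
import random
import statistics

rng = random.Random(1)

# Hypothetical numerical column; roughly 40% of its values are then hidden.
complete = [rng.gauss(50.0, 10.0) for _ in range(2000)]
observed = [x for x in complete if rng.random() > 0.4]
n_missing = len(complete) - len(observed)

# Conventional imputation: every gap receives the same value (the mean).
mean_filled = observed + [statistics.fmean(observed)] * n_missing

# Distribution-based imputation: every gap receives a random draw.
# Assumption: resampling observed values is a stand-in for sampling
# from an estimated distribution.
draw_filled = observed + rng.choices(observed, k=n_missing)

sd_observed = statistics.pstdev(observed)
sd_mean_filled = statistics.pstdev(mean_filled)
sd_draw_filled = statistics.pstdev(draw_filled)

print(f"observed values:       sd = {sd_observed:.2f}")
print(f"mean imputation:       sd = {sd_mean_filled:.2f}")
print(f"draw-based imputation: sd = {sd_draw_filled:.2f}")
```

Mean imputation piles all imputed values onto a single point, so the standard deviation of the completed column falls well below that of the observed values, while draw-based imputation keeps it close to the observed spread.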
Joint Probability Estimation
The Joint Probability Estimator can be used to estimate the probability with which any data point belongs to the same distribution as the dataset from which the model was created. Points with very low probability might then be treated as outliers.
What to do with outliers will of course depend on context. In some cases it may be desirable to identify and remove outliers (e.g., if it is believed that the outlier appeared as a result of some faulty sensor or incorrect data collection procedure); in other cases it may be desirable to generate new data points in low density regions of the feature space.
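One simple way to turn probability estimates into outlier flags is to compare each point's score against a low percentile of the scores of the source data. The sketch below assumes a one-dimensional Gaussian kernel density estimate as a stand-in for the Joint Probability Estimator; the function `log_density`, the bandwidth, and the 1st-percentile threshold are illustrative choices, not part of UNCRi.

```python
import math
import random


def log_density(x, data, bandwidth=0.3):
    """Log of a 1-D Gaussian kernel density estimate at x (stand-in model)."""
    norm = len(data) * bandwidth * math.sqrt(2.0 * math.pi)
    total = sum(math.exp(-0.5 * ((x - d) / bandwidth) ** 2) for d in data)
    return math.log(total / norm) if total > 0.0 else float("-inf")


rng = random.Random(2)
source = [rng.gauss(0.0, 1.0) for _ in range(500)]

# Threshold: the 1st percentile of the source data's own log-densities.
source_scores = sorted(log_density(x, source) for x in source)
threshold = source_scores[len(source_scores) // 100]

# Flag candidate points whose estimated log-density falls below the threshold.
candidates = [0.1, -0.8, 6.0]
flags = {x: log_density(x, source) < threshold for x in candidates}
for x, is_outlier in flags.items():
    print(f"x = {x:5.1f}  outlier = {is_outlier}")
```

The percentile used for the threshold controls how aggressively points are flagged, and the appropriate value depends on the context described above.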
The Joint Probability Estimator can be applied to the source dataset from which the model was created or to any synthetic dataset generated from it. This makes it useful for model validation, where we would expect the distribution of joint probabilities for synthetic data to be similar to that for the source data.
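A validation check of this kind can be sketched by scoring both datasets and comparing the two score distributions, here with a two-sample Kolmogorov-Smirnov statistic. This is an illustrative assumption rather than the procedure used in the UNCRi case studies: `log_pdf` is a standard normal log-density standing in for the model's probability estimate, and the source and synthetic datasets are simulated.

```python
import bisect
import math
import random


def log_pdf(x, mean=0.0, sd=1.0):
    """Normal log-density; a stand-in for a model's probability estimate."""
    return -0.5 * ((x - mean) / sd) ** 2 - math.log(sd * math.sqrt(2.0 * math.pi))


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between the
    empirical CDFs of a and b."""
    a, b = sorted(a), sorted(b)
    gap = 0.0
    for v in a + b:
        cdf_a = bisect.bisect_right(a, v) / len(a)
        cdf_b = bisect.bisect_right(b, v) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


rng = random.Random(3)
source = [rng.gauss(0.0, 1.0) for _ in range(1000)]
good_synthetic = [rng.gauss(0.0, 1.0) for _ in range(1000)]  # same distribution
poor_synthetic = [rng.gauss(0.0, 2.0) for _ in range(1000)]  # too spread out

source_scores = [log_pdf(x) for x in source]
good_gap = ks_statistic(source_scores, [log_pdf(x) for x in good_synthetic])
poor_gap = ks_statistic(source_scores, [log_pdf(x) for x in poor_synthetic])

print(f"KS gap, well-matched synthetic data:   {good_gap:.3f}")
print(f"KS gap, poorly matched synthetic data: {poor_gap:.3f}")
```

A small gap indicates that synthetic points receive much the same range of probability scores as the source points; a large gap suggests that the synthetic data occupy regions of the feature space that the source data do not, or vice versa.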