Frequently Asked Questions
Machine learning algorithms are often categorized as supervised or unsupervised. Can the UNCRi framework be described in these terms?
No, neither of these labels fits the UNCRi approach. UNCRi is not a specific learning algorithm in the sense that decision tree learners, clustering algorithms, or back-propagation are. UNCRi is a general framework for the representation of, and inferencing from, data. This is what makes the UNCRi approach special: a single model can be used to perform a wide variety of tasks, all by virtue of the way the data is represented and the inferencing that this representation makes possible.
Broadly speaking, machine learning models are either parametric or non-parametric. Where does UNCRi fit in this regard?
Under the UNCRi framework, instances are first represented in a specially encoded (but not dimensionally reduced) version of the original space. Instances and their (newly encoded) attributes are then represented as a graph, in which all subsequent inferencing is performed. In this sense, the UNCRi approach can loosely be thought of as non-parametric (think k-means, nearest neighbors, etc.). This is fundamentally different from the approach taken in connectionist models such as neural networks, where a large number of parameters (weights) must be optimized. A short, simple and intuitive motivation for the approach can be found here: Graph Data Science for Tabular Data
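To make the idea of moving from a table to a graph concrete, here is a minimal, purely illustrative sketch in Python (using networkx). It is not UNCRi's actual encoding or graph construction, which are not publicly documented; it simply shows how instances and attribute values can become nodes, with edges linking each instance to the values it takes.

```python
# Illustrative only: a toy bipartite graph linking instances to attribute values.
# UNCRi's actual encoding and graph construction are not public; this simply
# shows the general idea of moving from a table to a graph for inference.
import networkx as nx

rows = [
    {"id": 0, "colour": "red",  "size": 4.2},
    {"id": 1, "colour": "blue", "size": 3.9},
    {"id": 2, "colour": "red",  "size": 5.1},
]

G = nx.Graph()
for row in rows:
    instance = f"instance_{row['id']}"
    G.add_node(instance, kind="instance")
    for attr, value in row.items():
        if attr == "id":
            continue
        attr_node = f"{attr}={value}"          # one node per attribute value
        G.add_node(attr_node, kind="attribute")
        G.add_edge(instance, attr_node)        # edge: instance has this value

# Instances sharing attribute values are now linked through common neighbours,
# which is the kind of structure graph-based similarity propagation exploits.
print(sorted(nx.common_neighbors(G, "instance_0", "instance_2")))
```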
Non-parametric models usually require a lot of storage, which can be a problem when dealing with large datasets. Is this a problem with UNCRi?
It is true that most non-parametric models require the dataset to be resident in memory; however, except in the case of very small datasets (e.g., fewer than 500 examples or so), the UNCRi model does not store the original data points, but rather stores prototypes (typically about 100). This means that UNCRi models, when saved to disk, are much smaller than the original dataset.
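As a rough illustration of the storage saving (not of UNCRi's actual prototype-discovery method, which is not public), the sketch below uses k-means centroids as stand-in prototypes for a large numerical dataset:

```python
# A minimal sketch of the storage idea, using k-means centroids as stand-in
# prototypes. How UNCRi actually discovers its prototypes is not public; the
# only point illustrated is that ~100 prototypes replace the full dataset.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
data = rng.normal(size=(50_000, 15))          # large numerical dataset

kmeans = KMeans(n_clusters=100, n_init=10, random_state=0).fit(data)
prototypes = kmeans.cluster_centers_          # shape (100, 15)

print(data.nbytes / prototypes.nbytes)        # 500x smaller in memory
```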
What are the relative advantages and disadvantages of UNCRi compared to neural network models such as Variational Autoencoders (VAEs) and Generative Adversarial Networks (GANs), which are currently popular in the synthetic data generation space?
Because UNCRi models involve only a small number of hyperparameters, they are much faster to develop than connectionist parametric models such as neural networks, which typically must undergo a lengthy training process. On the other side of the coin, neural models such as VAEs and GANs require only a single forward pass to generate an entire new data point, which means that they can generate data very quickly. The UNCRi framework uses a sequential imputation approach in which a separate inference is required for each variable, so data generation is significantly slower.
However, the speed of GANs and VAEs is countered by their limited inferencing capabilities. GANs and VAEs generate entire data points directly, without estimating any intermediate probability distributions. This means that they cannot be used to perform even relatively basic tasks such as classification or regression. In UNCRi, any probability distribution, conditional or unconditional, can be estimated explicitly, leading to the wide range of inference tasks that UNCRi models can perform.
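The following toy sketch illustrates the sequential, one-variable-at-a-time style of generation mentioned above. The empirical frequency estimates used here are only a stand-in for UNCRi's graph-based conditional estimates; the point is simply that each variable requires its own inference, conditioned on the values already generated.

```python
# A minimal sketch of sequential (one-variable-at-a-time) generation, assuming
# each conditional distribution can be estimated from data. Simple empirical
# frequencies stand in for UNCRi's graph-based estimates, purely to show why
# one inference is needed per variable.
import random
from collections import Counter

data = [
    {"colour": "red",  "size": "small"},
    {"colour": "red",  "size": "large"},
    {"colour": "blue", "size": "small"},
]

def sample_conditional(var, partial, data):
    # P(var | variables already fixed in `partial`), estimated empirically;
    # falls back to the marginal if no record matches the partial assignment
    matches = [r[var] for r in data
               if all(r[k] == v for k, v in partial.items())]
    counts = Counter(matches if matches else [r[var] for r in data])
    values, weights = zip(*counts.items())
    return random.choices(values, weights=weights)[0]

synthetic = {}
for var in ["colour", "size"]:               # one inference per variable
    synthetic[var] = sample_conditional(var, synthetic, data)
print(synthetic)
```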
Finally, like most neural network-based models, GANs and VAEs are prone to overfitting. This is not so with UNCRi. The non-parametric nature of the UNCRi framework, in conjunction with the cross-validation procedure used for hyperparameter optimization, prevents overfitting.
On the topic of synthetic data generation, are there any studies comparing performance of UNCRi with that of other methods?
Synthetic data generation is a new and evolving field, and there are still no standard evaluation techniques. Here is a short article that proposes an evaluation measure and uses it to compare performance of the UNCRi generator with that of some popular Copula-based, GAN-based and VAE-based synthetic data generators: Evaluating Synthetic Data — The Million Dollar Question
Can the UNCRi framework generate synthetic data matching specified conditions?
Yes, the UNCRi data generator can be used to generate examples matching specified conditions on multiple variables. For numerical variables the condition can be greater than, equal to, or less than some specified value; for categorical variables it is possible to specify one or more values that the variable must take. Synthetic data generators such as GANs and VAEs generate examples from the full joint distribution and then use rejection sampling to keep those matching the condition, which can be very inefficient (and even impossible) if very few of the original data points satisfy it. UNCRi instead estimates and samples from the conditional distribution directly, so it can efficiently generate examples even when the specified conditions are rarely met in the source dataset.
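The contrast can be illustrated with a toy example. The sketch below shows how many joint samples rejection sampling wastes when the condition (here, x > 3 for a standard normal, satisfied about 0.13% of the time) is rarely met, and how sampling the conditional distribution directly avoids the wasted draws. The distributions used are purely illustrative and are not part of UNCRi's own estimator.

```python
# Illustrative contrast between rejection sampling and direct conditional
# sampling on a toy distribution where the condition is rarely satisfied.
import numpy as np
from scipy.stats import truncnorm

rng = np.random.default_rng(0)

# Rejection sampling: draw from the joint, discard anything not matching x > 3
def rejection_sample(n_wanted):
    kept, drawn = [], 0
    while len(kept) < n_wanted:
        x = rng.normal()
        drawn += 1
        if x > 3:
            kept.append(x)
    return kept, drawn

samples, attempts = rejection_sample(10)
print(f"{attempts} joint samples needed for 10 conditional samples")

# Sampling the conditional directly (here via a truncated normal) needs
# exactly as many draws as examples requested.
direct = truncnorm.rvs(a=3, b=np.inf, size=10, random_state=0)
print(direct)
```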
How many hyperparameters does an UNCRi model have, and how expensive is hyperparameter optimization?
The number of hyperparameters depends on the dimensionality of the problem: each categorical variable requires one hyperparameter, and each numerical variable requires two. Hyperparameter optimization is performed independently for each variable, so the cost is linear in the number of dimensions. A small validation set is required for optimization, and the cost is also linear in the number of validation examples. To give an indication of the expense, optimization for a dataset containing 1000 examples or prototypes described over 10 variables should take only a minute or so on a modest desktop computer. There is also expense in generating the prototypes, and this can be considerable for large datasets; for example, prototype discovery on a 15-dimensional dataset containing 50,000 examples takes about 5 minutes.
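For example, the hyperparameter count for a hypothetical 10-variable dataset works out as follows (the column counts are made up for illustration):

```python
# Hyperparameter count as described above: one per categorical variable,
# two per numerical variable. The variable counts here are hypothetical.
n_categorical = 4     # e.g. four categorical columns
n_numerical = 6       # e.g. six numerical columns

n_hyperparameters = n_categorical * 1 + n_numerical * 2
print(n_hyperparameters)   # 16 hyperparameters for this 10-variable dataset
```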
Can UNCRi deal with missing values?
Absolutely! In fact, the ability to deal with missing values was one of the key considerations in the design of the UNCRi framework. The only requirement is that an example must have at least one known attribute value. This means that the UNCRi approach is an excellent tool to use in a semi-supervised classification setting (i.e., where a small number of examples have a value for the target variable to be predicted, but a large number are ‘unlabeled’ and have this attribute value missing). When inference is performed using a model constructed from such a dataset, similarities will be propagated through the unlabeled examples, resulting in better performance than from a model developed using the labeled examples alone.
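As a generic illustration of this semi-supervised setting (using scikit-learn's LabelPropagation rather than UNCRi's own graph inference), the sketch below shows a dataset where all but two examples have the target missing, yet the unlabeled examples still contribute to the predictions:

```python
# A generic semi-supervised illustration, NOT UNCRi's own inference: most
# examples have the target missing (encoded as -1) but still shape the model.
import numpy as np
from sklearn.semi_supervised import LabelPropagation

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(4, 1, (50, 2))])
y = np.full(100, -1)          # -1 marks a missing label
y[0], y[50] = 0, 1            # only two labelled examples

model = LabelPropagation().fit(X, y)
print(model.predict([[0.2, -0.1], [3.8, 4.1]]))   # labels spread via similarity
```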
The ability to deal with a high missing value ratio also means that independent datasets that share only a small number of common attributes can be represented in a single model. This allows the integration of data from different domains, and can be useful in application areas such as recommender systems.
Can the UNCRi framework be applied to datasets containing all numerical or all categorical variables?
Although the UNCRi framework was designed to work effectively with mixed-type data, it can also be applied to datasets containing all categorical variables or all numerical variables.
Can the UNCRi framework be applied to time series such as financial time series?
Yes, there is an UNCRi tool for generating synthetic time series. It operates by creating a table of lagged returns and then estimating a model for the returns. The synthetic time series capture the essential characteristics of the original series: skew, kurtosis, and volatility clustering. Some hand tuning of parameters is required (e.g., the number of lags and the lag decay rate). It is also possible to include categorical features in the model; for example, the sentiment for the company on any day, which might be derived from free text (see below).
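As a simple illustration of the lagged-returns table described above (the choice of three lags is arbitrary, and the modelling step itself is not shown), one might construct it in pandas as follows:

```python
# A sketch of a lagged-returns table built from a toy price series.
# The number of lags (and any lag decay weighting) is chosen by hand.
import pandas as pd

prices = pd.Series([100.0, 101.2, 100.7, 102.3, 101.9, 103.4])
returns = prices.pct_change().dropna()

n_lags = 3
table = pd.DataFrame({f"lag_{k}": returns.shift(k) for k in range(1, n_lags + 1)})
table["target"] = returns                    # the return to be modelled/generated
table = table.dropna()
print(table)
```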
What about text? Can UNCRi be used to represent variables containing free text?
Yes, variables containing free text can be used in UNCRi models. The usual approach is to use n-dimensional word-embeddings, which can be encoded in the same space as numerical and categorical variables.
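For illustration, the sketch below encodes a free-text column as fixed-length embeddings that sit alongside a numerical variable. The embedding model used here (sentence-transformers) is an assumption made for the example; UNCRi does not prescribe a particular embedding method.

```python
# A sketch of encoding a free-text column as fixed-length embeddings placed
# alongside the other variables. The choice of sentence-transformers is an
# assumption for illustration only.
import numpy as np
from sentence_transformers import SentenceTransformer

rows = [
    {"price": 12.5, "review": "fast delivery, great quality"},
    {"price": 7.9,  "review": "arrived damaged, poor packaging"},
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")
text_vectors = embedder.encode([r["review"] for r in rows])   # shape (2, 384)

numeric = np.array([[r["price"]] for r in rows])
encoded = np.hstack([numeric, text_vectors])                   # one row per instance
print(encoded.shape)
```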
Do numerical variables need to be scaled or normalized?
Numerical variables do not need to be scaled, as any necessary scaling will be performed internally. Non-linear normalization techniques such as quantile transformation should definitely NOT be applied. The UNCRi framework has been designed to deal with highly-skewed numerical data. Non-linear normalization will result in poor performance, especially if there are outliers in the data or if the distribution is highly asymmetric.
Is UNCRi source code available on GitHub or elsewhere?
No, source code is not publicly available. UNCRi is proprietary software.