Evaluation Framework

SDV contains a Synthetic Data Evaluation Framework that makes it easy to evaluate the quality of your synthetic dataset by applying multiple Synthetic Data Metrics to it and reporting the results in a comprehensive way.

Using the SDV Evaluation Framework

To evaluate the quality of synthetic data we essentially need two things: real data and synthetic data that is meant to resemble it.

Let us start by loading a demo table and generating a synthetic replica of it using the GaussianCopula model.

In [1]: from sdv.demo import load_tabular_demo

In [2]: from sdv.tabular import GaussianCopula

In [3]: real_data = load_tabular_demo('student_placements')

In [4]: model = GaussianCopula()

In [5]: model.fit(real_data)

In [6]: synthetic_data = model.sample()
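
By default, sample generates as many rows as the real table. Depending on your SDV version, you may also be able to request a specific number of rows; a minimal sketch, assuming sample accepts a row count as in the tabular API of this release:

# A minimal sketch, assuming this SDV release lets sample take the
# number of rows to generate (without it, the synthetic table has
# as many rows as the real one).
small_sample = model.sample(200)
print(len(small_sample))  # 200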

After the previous steps we will have two tables:

  • real_data: A table containing data about student placements

In [7]: real_data.head()
Out[7]: 
   student_id gender  second_perc  high_perc high_spec  degree_perc degree_type  work_experience  experience_years  employability_perc mba_spec  mba_perc   salary  placed start_date   end_date  duration
0       17264      M        67.00      91.00  Commerce        58.00    Sci&Tech            False                 0                55.0   Mkt&HR     58.80  27000.0    True 2020-07-23 2020-10-12       3.0
1       17265      M        79.33      78.33   Science        77.48    Sci&Tech             True                 1                86.5  Mkt&Fin     66.28  20000.0    True 2020-01-11 2020-04-09       3.0
2       17266      M        65.00      68.00      Arts        64.00   Comm&Mgmt            False                 0                75.0  Mkt&Fin     57.80  25000.0    True 2020-01-26 2020-07-13       6.0
3       17267      M        56.00      52.00   Science        52.00    Sci&Tech            False                 0                66.0   Mkt&HR     59.43      NaN   False        NaT        NaT       NaN
4       17268      M        85.80      73.60  Commerce        73.30   Comm&Mgmt            False                 0                96.8  Mkt&Fin     55.50  42500.0    True 2020-07-04 2020-09-27       3.0
  • synthetic_data: A synthetically generated table that contains data in the same format as the real_data and with similar statistical properties.

In [8]: synthetic_data.head()
Out[8]: 
   student_id gender  second_perc  high_perc high_spec  degree_perc degree_type  work_experience  experience_years  employability_perc mba_spec  mba_perc   salary  placed start_date   end_date  duration
0       17269      M        86.66      80.43  Commerce        84.98   Comm&Mgmt            False                 0               78.88  Mkt&Fin     64.69  25400.0    True 2020-02-14 2020-08-19       5.0
1       17458      M        74.61      69.23   Science        60.83   Comm&Mgmt            False                 1               87.22  Mkt&Fin     62.83  34400.0    True 2020-02-28 2021-02-13      10.0
2       17276      M        80.29      80.90  Commerce        71.77   Comm&Mgmt            False                 0               75.24  Mkt&Fin     63.40  23300.0    True 2020-01-25 2020-07-08       4.0
3       17401      M        87.41      77.76   Science        65.68    Sci&Tech            False                 1               66.23   Mkt&HR     64.23  30000.0    True 2020-01-04 2020-07-06       5.0
4       17375      M        71.71      78.48  Commerce        66.07   Comm&Mgmt            False                 0               51.67  Mkt&Fin     57.77  25100.0    True 2020-02-03 2020-04-20       3.0
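
Before applying any formal metrics, a quick pandas comparison of summary statistics already gives a rough feel for how close the two tables are. A minimal sketch using the salary column shown above:

import pandas as pd

# Informal sanity check: put the summary statistics of a numerical
# column from both tables side by side.
comparison = pd.DataFrame({
    'real': real_data['salary'].describe(),
    'synthetic': synthetic_data['salary'].describe(),
})
print(comparison)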

Note

For more details about this process, please visit the GaussianCopula Model guide.

Computing an overall score

The simplest way to see how similar the two tables are is to import the sdv.evaluation.evaluate function and run it, passing both the synthetic_data and the real_data tables.

In [9]: from sdv.evaluation import evaluate

In [10]: evaluate(synthetic_data, real_data)
Out[10]: 0.6250121152956873

The output of this function call is a number between 0 and 1 that indicates how similar the two tables are, where 0 is the worst and 1 is the best possible score.

How was this score computed?

The evaluate function applies a collection of pre-configured metric functions and returns the average of the scores that the data obtained on each of them. In most scenarios this is enough to get an idea of how similar the two tables are, but you may want to explore the individual metrics in more detail.

To see the different metrics that were applied, you can pass the additional argument aggregate=False, which makes the evaluate function return a breakdown with the score that each individual metric produced:

In [11]: evaluate(synthetic_data, real_data, aggregate=False)
Out[11]: 
                              metric                                     name  raw_score  normalized_score  min_value  max_value      goal                                              error
0                    BNLogLikelihood           BayesianNetwork Log Likelihood        NaN               NaN       -inf        0.0  MAXIMIZE  Please install pomegranate with `pip install p...
1                  LogisticDetection             LogisticRegression Detection   0.367954      3.679541e-01        0.0        1.0  MAXIMIZE                                               None
2                       SVCDetection                            SVC Detection   0.483965      4.839647e-01        0.0        1.0  MAXIMIZE                                               None
3       BinaryDecisionTreeClassifier                                     None        NaN               NaN        0.0        1.0  MAXIMIZE  `target` must be passed either directly or ins...
4           BinaryAdaBoostClassifier                                     None        NaN               NaN        0.0        1.0  MAXIMIZE  `target` must be passed either directly or ins...
5           BinaryLogisticRegression                                     None        NaN               NaN        0.0        1.0  MAXIMIZE  `target` must be passed either directly or ins...
6                BinaryMLPClassifier                                     None        NaN               NaN        0.0        1.0  MAXIMIZE  `target` must be passed either directly or ins...
7   MulticlassDecisionTreeClassifier                                     None        NaN               NaN        0.0        1.0  MAXIMIZE  `target` must be passed either directly or ins...
8            MulticlassMLPClassifier                                     None        NaN               NaN        0.0        1.0  MAXIMIZE  `target` must be passed either directly or ins...
9                   LinearRegression                                     None        NaN               NaN       -inf        1.0  MAXIMIZE  `target` must be passed either directly or ins...
10                      MLPRegressor                                     None        NaN               NaN       -inf        1.0  MAXIMIZE  `target` must be passed either directly or ins...
11                   GMLogLikelihood           GaussianMixture Log Likelihood -37.790174      3.872004e-17       -inf        inf  MAXIMIZE                                               None
12                            CSTest                              Chi-Squared   0.872711      8.727115e-01        0.0        1.0  MAXIMIZE                                               None
13                            KSTest  Inverted Kolmogorov-Smirnov D statistic   0.906460      9.064599e-01        0.0        1.0  MAXIMIZE                                               None
14                    KSTestExtended  Inverted Kolmogorov-Smirnov D statistic   0.901661      9.016611e-01        0.0        1.0  MAXIMIZE                                               None
15                    CategoricalCAP                           CategoricalCAP        NaN               NaN        0.0        1.0  MAXIMIZE  `key_fields` must be passed either directly or...
16                CategoricalZeroCAP                                     0CAP        NaN               NaN        0.0        1.0  MAXIMIZE  `key_fields` must be passed either directly or...
17         CategoricalGeneralizedCAP               Categorical GeneralizedCAP        NaN               NaN        0.0        1.0  MAXIMIZE  `key_fields` must be passed either directly or...
18                     CategoricalNB                Categorical NaiveBayesian        NaN               NaN        0.0        1.0  MAXIMIZE  `key_fields` must be passed either directly or...
19                    CategoricalKNN                      K-Nearest Neighbors        NaN               NaN        0.0        1.0  MAXIMIZE  `key_fields` must be passed either directly or...
20                     CategoricalRF                Categorical Random Forest        NaN               NaN        0.0        1.0  MAXIMIZE  `key_fields` must be passed either directly or...
21                    CategoricalSVM                Support Vector Classifier        NaN               NaN        0.0        1.0  MAXIMIZE  `key_fields` must be passed either directly or...
22               CategoricalEnsemble                                 Ensemble        NaN               NaN        0.0        1.0  MAXIMIZE  '<' not supported between instances of 'float'...
23                       NumericalLR              Numerical Linear Regression        NaN               NaN        0.0        inf  MAXIMIZE  `key_fields` must be passed either directly or...
24                      NumericalMLP        Multi-layer Perceptron Regression        NaN               NaN        0.0        inf  MAXIMIZE  `key_fields` must be passed either directly or...
25                      NumericalSVR      Numerical Support-vector Regression        NaN               NaN        0.0        inf  MAXIMIZE  `key_fields` must be passed either directly or...
26    NumericalRadiusNearestNeighbor        Numerical Radius Nearest Neighbor        NaN               NaN        0.0        inf  MAXIMIZE  `key_fields` must be passed either directly or...
27            ContinuousKLDivergence   Continuous Kullback–Leibler Divergence   0.494927      4.949269e-01        0.0        1.0  MAXIMIZE                                               None
28              DiscreteKLDivergence     Discrete Kullback–Leibler Divergence   0.875069      8.750690e-01        0.0        1.0  MAXIMIZE                                               None
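
The overall score returned earlier corresponds to aggregating the normalized_score column above: metrics that errored (the NaN rows) are skipped and the remaining normalized scores are averaged. A minimal sketch of that aggregation, which should closely approximate the single number returned by evaluate:

# Rebuild the overall score by hand: average the normalized scores
# of the metrics that ran successfully (errored metrics are NaN and
# are excluded from the mean).
scores = evaluate(synthetic_data, real_data, aggregate=False)
overall = scores['normalized_score'].dropna().mean()
print(overall)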

Can I control which metrics are applied?

By default, the evaluate function applies all the metrics included in the SDV Evaluation Framework. However, you can control which metrics are applied by passing a list with the names of the metrics that you want to use.

For example, if you are interested in obtaining only the CSTest and KSTest metrics, you can call the evaluate function as follows:

In [12]: evaluate(synthetic_data, real_data, metrics=['CSTest', 'KSTest'])
Out[12]: 0.8895856996969825

Or, if we want to see the scores separately (notice that the overall score above is simply the average of the two):

In [13]: evaluate(synthetic_data, real_data, metrics=['CSTest', 'KSTest'], aggregate=False)
Out[13]: 
   metric                                     name  raw_score  normalized_score  min_value  max_value      goal error
0  CSTest                              Chi-Squared   0.872711          0.872711        0.0        1.0  MAXIMIZE  None
1  KSTest  Inverted Kolmogorov-Smirnov D statistic   0.906460          0.906460        0.0        1.0  MAXIMIZE  None
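
Each of these metrics can also be applied on its own through the sdv.metrics.tabular module, where every metric is exposed as a class with a compute method in this SDV release. For example:

from sdv.metrics.tabular import CSTest, KSTest

# Apply individual metrics directly, outside of the evaluate function.
cs_score = CSTest.compute(real_data, synthetic_data)
ks_score = KSTest.compute(real_data, synthetic_data)
print(cs_score, ks_score)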

For more details about all the metrics available for the different data modalities, please check the corresponding guides.