SDV contains a Synthetic Data Evaluation Framework that makes it easy to evaluate the quality of your synthetic dataset by applying multiple Synthetic Data Metrics to it and reporting the results in a comprehensive way.
To evaluate the quality of synthetic data we essentially need two things: the real data and the synthetic data that is meant to resemble it.
Let us start by loading a demo table and generating a synthetic replica of it using the GaussianCopula model.
In [1]: from sdv.demo import load_tabular_demo

In [2]: from sdv.tabular import GaussianCopula

In [3]: real_data = load_tabular_demo('student_placements')

In [4]: model = GaussianCopula()

In [5]: model.fit(real_data)

In [6]: synthetic_data = model.sample()
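Before computing any metrics, it can help to confirm that the two tables are structurally aligned. The following is a minimal sanity check using plain pandas (illustrative only, not part of the SDV API):

# Structural sanity check with plain pandas (not part of the SDV API):
# the synthetic table should have the same columns as the real one.
assert list(synthetic_data.columns) == list(real_data.columns)
print(real_data.shape, synthetic_data.shape)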
After the previous steps we will have two tables:
real_data: A table containing data about student placements.
In [7]: real_data.head()
Out[7]:
   student_id gender  second_perc  high_perc high_spec  degree_perc degree_type  work_experience  experience_years  employability_perc mba_spec  mba_perc   salary  placed start_date   end_date  duration
0       17264      M        67.00      91.00  Commerce        58.00    Sci&Tech            False                 0                55.0   Mkt&HR     58.80  27000.0    True 2020-07-23 2020-10-12       3.0
1       17265      M        79.33      78.33   Science        77.48    Sci&Tech             True                 1                86.5  Mkt&Fin     66.28  20000.0    True 2020-01-11 2020-04-09       3.0
2       17266      M        65.00      68.00      Arts        64.00   Comm&Mgmt            False                 0                75.0  Mkt&Fin     57.80  25000.0    True 2020-01-26 2020-07-13       6.0
3       17267      M        56.00      52.00   Science        52.00    Sci&Tech            False                 0                66.0   Mkt&HR     59.43      NaN   False        NaT        NaT       NaN
4       17268      M        85.80      73.60  Commerce        73.30   Comm&Mgmt            False                 0                96.8  Mkt&Fin     55.50  42500.0    True 2020-07-04 2020-09-27       3.0
synthetic_data: A synthetically generated table that contains data in the same format and with similar statistical properties as the real_data.
In [8]: synthetic_data.head()
Out[8]:
   student_id gender  second_perc  high_perc high_spec  degree_perc degree_type  work_experience  experience_years  employability_perc mba_spec   mba_perc        salary  placed start_date   end_date  duration
0       17306      M    56.199618  57.892111  Commerce    61.591230    Sci&Tech            False                 0           52.534918  Mkt&Fin  58.340510           NaN   False        NaT        NaT       NaN
1       17319      M    77.482574  68.507005  Commerce    60.842343   Comm&Mgmt            False                 1           80.469771   Mkt&HR  56.546865  26604.236226    True 2020-01-30 2020-06-23       3.0
2       17377      F    66.261257  77.328159  Commerce    68.059109   Comm&Mgmt            False                 1           79.750220  Mkt&Fin  62.749491  21513.379475    True 2020-02-06 2020-12-08       3.0
3       17381      F    82.646843  91.749992  Commerce    74.522105   Comm&Mgmt            False                 0           53.803237   Mkt&HR  72.071654  27790.671438    True 2020-01-30 2020-10-12      12.0
4       17283      M    59.788533  60.037129   Science    66.389419    Sci&Tech            False                 0           51.110733  Mkt&Fin  53.813831  22306.492187    True 2020-02-22 2020-08-28       3.0
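Beyond inspecting the first few rows, you can compare per-column summary statistics side by side. This is a minimal sketch using plain pandas; the salary column is taken from the tables above:

import pandas as pd

# Put the summary statistics of one column from both tables side by side
# (illustrative only; any numerical column from the tables above would work).
comparison = pd.concat(
    [real_data['salary'].describe(), synthetic_data['salary'].describe()],
    axis=1,
    keys=['real', 'synthetic'],
)
print(comparison)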
Note: For more details about this process, please visit the GaussianCopula Model guide.
The simplest way to see how similar the two tables are is to import the sdv.evaluation.evaluate function and run it, passing both the synthetic_data and the real_data tables.
In [9]: from sdv.evaluation import evaluate

In [10]: evaluate(synthetic_data, real_data)
Out[10]: 0.48293646303892107
The output of this function call is a number between 0 and 1 that indicates how similar the two tables are, where 0 is the worst and 1 the best possible score.
The evaluate function applies a collection of pre-configured metric functions and returns the average of the scores that the data obtained on each one of them. In most scenarios this can be enough to get an idea about the similarity of the two tables, but you might want to explore the metrics in more detail.
In order to see the different metrics that were applied you can pass an additional argument aggregate=False, which will make the evaluate function return a table with the score that each one of the metric functions returned:
In [11]: evaluate(synthetic_data, real_data, aggregate=False)
Out[11]:
                    metric                                      name       score  min_value  max_value      goal
1        LogisticDetection              LogisticRegression Detection    0.395440        0.0        1.0  MAXIMIZE
2             SVCDetection                             SVC Detection    0.405284        0.0        1.0  MAXIMIZE
11         GMLogLikelihood            GaussianMixture Log Likelihood  -35.313872       -inf        inf  MAXIMIZE
12                  CSTest                               Chi-Squared    0.865763        0.0        1.0  MAXIMIZE
13                  KSTest  Inverted Kolmogorov-Smirnov D statistic    0.913372        0.0        1.0  MAXIMIZE
14          KSTestExtended  Inverted Kolmogorov-Smirnov D statistic    0.899767        0.0        1.0  MAXIMIZE
15  ContinuousKLDivergence    Continuous Kullback–Leibler Divergence    0.580561        0.0        1.0  MAXIMIZE
16    DiscreteKLDivergence      Discrete Kullback–Leibler Divergence    0.818773        0.0        1.0  MAXIMIZE
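As a rough illustration of what the single aggregate number represents, the sketch below averages the per-metric scores returned with aggregate=False. This is not SDV's exact aggregation logic: how unbounded metrics such as GMLogLikelihood are normalized internally is not replicated here, so the sketch restricts itself to metrics whose scores are already bounded in [0, 1]:

# Illustrative only: approximate the aggregate score by averaging the
# per-metric scores that are already bounded in [0, 1]. SDV's internal
# normalization of unbounded metrics (e.g. GMLogLikelihood) is not replicated.
scores = evaluate(synthetic_data, real_data, aggregate=False)
bounded = scores[(scores['min_value'] == 0.0) & (scores['max_value'] == 1.0)]
print(bounded['score'].mean())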
By default, the evaluate function will apply all the metrics that are included within the SDV Evaluation Framework. However, you can control which metrics are applied by passing a list with the names of the metrics that you want to use.
For example, if you were interested in obtaining only the CSTest and KSTest metrics you can call the evaluate function as follows:
In [12]: evaluate(synthetic_data, real_data, metrics=['CSTest', 'KSTest'])
Out[12]: 0.8895677339539922
Or, if we want to see the scores separately:
In [13]: evaluate(synthetic_data, real_data, metrics=['CSTest', 'KSTest'], aggregate=False)
Out[13]:
   metric                                      name     score  min_value  max_value      goal
0  CSTest                               Chi-Squared  0.865763        0.0        1.0  MAXIMIZE
1  KSTest  Inverted Kolmogorov-Smirnov D statistic   0.913372        0.0        1.0  MAXIMIZE
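If you need full control over a single metric, the metric classes can also be imported and run directly, depending on your SDV version. The following is a sketch assuming the sdv.metrics.tabular module shipped with SDV 0.x; check your installed version before relying on it:

# Sketch assuming SDV 0.x, where single-table metrics are exposed in
# sdv.metrics.tabular. Each metric class provides a compute method that
# takes the real and the synthetic table.
from sdv.metrics.tabular import CSTest, KSTest

print(CSTest.compute(real_data, synthetic_data))
print(KSTest.compute(real_data, synthetic_data))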
For more details about all the metrics that exist for the different data modalities, please check the corresponding guides.