Evaluation Framework

The SDMetrics library includes reports, metrics and visualizations that you can use to evaluate your synthetic data.

Required Information

To use the SDMetrics library, you’ll need the following (a loading sketch follows the list):

  1. Real data, loaded as a pandas DataFrame

  2. Synthetic data, loaded as a pandas DataFrame

  3. Metadata, represented as a Python dictionary
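
If you are bringing your own data rather than using the demo below, a minimal loading sketch might look like this (the file names real.csv, synthetic.csv, and metadata.json are hypothetical placeholders):

import json

import pandas as pd

# Hypothetical file names; substitute your own paths
real_data = pd.read_csv('real.csv')
synthetic_data = pd.read_csv('synthetic.csv')

with open('metadata.json') as f:
    metadata_dict = json.load(f)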

We can get started using the demo data:

In [1]: from sdv.demo import load_tabular_demo

In [2]: from sdv.lite import TabularPreset

In [3]: metadata_obj, real_data = load_tabular_demo('student_placements', metadata=True)

In [4]: model = TabularPreset(metadata=metadata_obj, name='FAST_ML')

In [5]: model.fit(real_data)

In [6]: synthetic_data = model.sample(num_rows=real_data.shape[0])

After these steps, we will have two tables:

  • real_data, containing data about student placements

In [7]: real_data.head()
Out[7]: 
   student_id gender  second_perc  high_perc high_spec  degree_perc degree_type  work_experience  experience_years  employability_perc mba_spec  mba_perc   salary  placed start_date   end_date  duration
0       17264      M        67.00      91.00  Commerce        58.00    Sci&Tech            False                 0                55.0   Mkt&HR     58.80  27000.0    True 2020-07-23 2020-10-12       3.0
1       17265      M        79.33      78.33   Science        77.48    Sci&Tech             True                 1                86.5  Mkt&Fin     66.28  20000.0    True 2020-01-11 2020-04-09       3.0
2       17266      M        65.00      68.00      Arts        64.00   Comm&Mgmt            False                 0                75.0  Mkt&Fin     57.80  25000.0    True 2020-01-26 2020-07-13       6.0
3       17267      M        56.00      52.00   Science        52.00    Sci&Tech            False                 0                66.0   Mkt&HR     59.43      NaN   False        NaT        NaT       NaN
4       17268      M        85.80      73.60  Commerce        73.30   Comm&Mgmt            False                 0                96.8  Mkt&Fin     55.50  42500.0    True 2020-07-04 2020-09-27       3.0
  • synthetic_data, containing synthesized students with the same format and statistical properties as the original data

In [8]: synthetic_data.head()
Out[8]: 
   student_id gender  second_perc  high_perc high_spec  degree_perc degree_type  work_experience  experience_years  employability_perc mba_spec   mba_perc   salary  placed start_date   end_date  duration
0           0      F    77.804240  89.590779  Commerce    73.580316   Comm&Mgmt             True                 1           51.044248  Mkt&Fin  68.310939  20000.0    True        NaT        NaT       3.0
1           1      F    81.100762  73.310850  Commerce    64.835107   Comm&Mgmt             True                 1           69.049133  Mkt&Fin  63.356339      NaN    True        NaT        NaT       3.0
2           2      M    69.579115  72.880687  Commerce    69.976234   Comm&Mgmt             True                 1           77.837113  Mkt&Fin  59.034499  25346.0    True        NaT 2020-10-10      12.0
3           3      F    70.708199  65.083878   Science    60.169305   Comm&Mgmt             True                 2           50.000000   Mkt&HR  67.754057  20000.0    True        NaT        NaT      12.0
4           4      M    67.336427  67.108045  Commerce    63.060561   Comm&Mgmt             True                 1           88.074447  Mkt&Fin  71.138535      NaN   False 2019-10-31 2020-05-04      12.0
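
As a quick sanity check before computing any metrics, you can confirm that both tables share the same schema:

# The two tables should have identical columns (and here, identical row counts)
assert list(synthetic_data.columns) == list(real_data.columns)
assert len(synthetic_data) == len(real_data)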

We can also convert the metadata object to a Python dictionary by calling its to_dict method:

In [9]: metadata_dict = metadata_obj.to_dict()
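
If you want to inspect the result, note that the exact keys depend on your SDV version; 'fields' and 'primary_key' below are assumptions based on the SDV 0.x single-table metadata format:

# Keys named in the comments are assumptions for the SDV 0.x metadata format
print(metadata_dict.keys())              # typically includes 'fields' and 'primary_key'
print(metadata_dict.get('primary_key'))  # the table's primary key column, if any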

Computing an overall score

Use the sdmetrics library to generate a Quality Report. This report evaluates the shapes of the columns (marginal distributions) and the pairwise trends between the columns (correlations).

In [10]: from sdmetrics.reports.single_table import QualityReport

In [11]: report = QualityReport()

In [12]: report.generate(real_data, synthetic_data, metadata_dict)

Overall Quality Score: 87.2%

Properties:
Column Shapes: 88.95%
Column Pair Trends: 85.45%

The report uses information in the metadata to select which metrics to apply to your data. The final score is a number between 0 and 1 (displayed above as a percentage), where 0 indicates the lowest quality and 1 indicates the highest.
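
If you need these numbers programmatically rather than reading the printed summary, the report object exposes accessor methods. The names below match the sdmetrics QualityReport API at the time of writing; verify them against your installed version:

# Overall score as a float between 0 and 1
overall_score = report.get_score()

# Per-property scores as a pandas DataFrame
property_scores = report.get_properties()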

How was the score computed?

The report includes a breakdown for every property that it computed:

In [13]: report.get_details(property_name='Column Shapes')
Out[13]: 
                Column        Metric  Quality Score
0          second_perc  KSComplement       0.944186
1            high_perc  KSComplement       0.860465
2          degree_perc  KSComplement       0.911628
3     experience_years  KSComplement       0.906977
4   employability_perc  KSComplement       0.897674
5             mba_perc  KSComplement       0.934884
6               salary  KSComplement       0.747084
7           start_date  KSComplement       0.638622
8             end_date  KSComplement       0.852560
9               gender  TVComplement       0.911628
10           high_spec  TVComplement       0.939535
11         degree_type  TVComplement       0.869767
12     work_experience  TVComplement       0.972093
13            mba_spec  TVComplement       0.934884
14              placed  TVComplement       0.944186
15            duration  TVComplement       0.896396

In the detailed view, you can see the quality score for each column of the table. Different metrics are applied based on the data type: here, KSComplement for numerical and datetime columns and TVComplement for categorical and boolean columns, as declared in the metadata.
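
Beyond the numeric breakdown, the report can also render a plot of each property. The get_visualization method below is part of the sdmetrics QualityReport API at the time of writing; check your installed version:

# Plot the per-column quality scores for a given property (returns a plotly figure)
fig = report.get_visualization(property_name='Column Shapes')
fig.show()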

For more information about the Quality Report, see the SDMetrics Docs.

Can I apply different metrics?

Outside of reports, the SDMetrics library contains a variety of metrics that you can apply manually. For example, the NewRowSynthesis metric measures whether each row in the synthetic data is novel, or whether it exactly matches a row in the real data. It returns the proportion of synthetic rows that are new, so a score of 1.0 means that no synthetic row is an exact copy of a real row.

In [14]: from sdmetrics.single_table import NewRowSynthesis

In [15]: NewRowSynthesis.compute(real_data, synthetic_data, metadata_dict)
Out[15]: 1.0
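
The metric also accepts optional keyword arguments to control the matching; the parameter names below follow the sdmetrics API at the time of writing, so verify them against your installed version:

# A score of 1.0 means every synthetic row is new (no exact copies of real rows)
score = NewRowSynthesis.compute(
    real_data,
    synthetic_data,
    metadata_dict,
    numerical_match_tolerance=0.01,  # relative tolerance when comparing numerical values
    synthetic_sample_size=100,       # optionally evaluate only a sample of synthetic rows
)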

See the SDMetrics Glossary for a full list of metrics that you can apply.