Danger

You are looking at the documentation for an older version of the SDV! We are no longer supporting or maintaining this version of the software.


Evaluation Framework

The SDMetrics library includes reports, metrics and visualizations that you can use to evaluate your synthetic data.

Required Information

To use the SDMetrics library, you’ll need:

  1. Real data, loaded as a pandas DataFrame

  2. Synthetic data, loaded as a pandas DataFrame

  3. Metadata, represented as a Python dictionary

We can get started using the demo data:

In [1]: from sdv.demo import load_tabular_demo

In [2]: from sdv.lite import TabularPreset

In [3]: metadata_obj, real_data = load_tabular_demo('student_placements', metadata=True)

In [4]: model = TabularPreset(metadata=metadata_obj, name='FAST_ML')

In [5]: model.fit(real_data)

In [6]: synthetic_data = model.sample(num_rows=real_data.shape[0])

After the previous steps, we will have two tables:

  • real_data, containing data about student placements

In [7]: real_data.head()
Out[7]: 
   student_id gender  second_perc  high_perc high_spec  degree_perc degree_type  work_experience  experience_years  employability_perc mba_spec  mba_perc   salary  placed start_date   end_date  duration
0       17264      M        67.00      91.00  Commerce        58.00    Sci&Tech            False                 0                55.0   Mkt&HR     58.80  27000.0    True 2020-07-23 2020-10-12       3.0
1       17265      M        79.33      78.33   Science        77.48    Sci&Tech             True                 1                86.5  Mkt&Fin     66.28  20000.0    True 2020-01-11 2020-04-09       3.0
2       17266      M        65.00      68.00      Arts        64.00   Comm&Mgmt            False                 0                75.0  Mkt&Fin     57.80  25000.0    True 2020-01-26 2020-07-13       6.0
3       17267      M        56.00      52.00   Science        52.00    Sci&Tech            False                 0                66.0   Mkt&HR     59.43      NaN   False        NaT        NaT       NaN
4       17268      M        85.80      73.60  Commerce        73.30   Comm&Mgmt            False                 0                96.8  Mkt&Fin     55.50  42500.0    True 2020-07-04 2020-09-27       3.0
  • synthetic_data, containing synthesized students with the same format and mathematical properties as the original

In [8]: synthetic_data.head()
Out[8]: 
   student_id gender  second_perc  high_perc high_spec  degree_perc degree_type  work_experience  experience_years  employability_perc mba_spec   mba_perc   salary  placed start_date   end_date  duration
0           0      F    70.658585  68.683326  Commerce    75.710274   Comm&Mgmt            False                 0           61.280385   Mkt&HR  60.264673      NaN    True        NaT        NaT       NaN
1           1      F    84.809614  90.194495  Commerce    80.914146   Comm&Mgmt             True                 2           89.802255  Mkt&Fin  69.745012      NaN    True        NaT        NaT       NaN
2           2      M    51.771017  62.051246  Commerce    60.549893   Comm&Mgmt            False                 0           82.609888  Mkt&Fin  58.754591  20828.0   False 2020-02-29 2020-05-17       NaN
3           3      M    71.632168  67.676724  Commerce    68.565213   Comm&Mgmt             True                 1           83.736241  Mkt&Fin  59.543187      NaN    True        NaT        NaT       NaN
4           4      M    68.547130  57.755845      Arts    72.672045   Comm&Mgmt            False                 0           67.600846  Mkt&Fin  61.930229  30020.0    True 2020-01-07 2020-04-22       NaN

We can also convert the metadata to a Python dictionary by calling its to_dict method:

In [9]: metadata_dict = metadata_obj.to_dict()
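For reference, the sketch below shows the kind of structure such a metadata dictionary contains. The columns and type annotations here are illustrative assumptions, not the actual to_dict output for the student_placements dataset: old-style SDV single-table metadata maps each column name to its type information under a 'fields' key.

```python
# Illustrative sketch only: the columns and types below are assumptions,
# not the real to_dict() output for the student_placements dataset.
# Old-style SDV single-table metadata describes each column's type
# under a top-level 'fields' key.
metadata_dict = {
    'fields': {
        'student_id': {'type': 'id', 'subtype': 'integer'},
        'gender': {'type': 'categorical'},
        'second_perc': {'type': 'numerical', 'subtype': 'float'},
        'start_date': {'type': 'datetime', 'format': '%Y-%m-%d'},
    }
}

# The reports below look up each column's type here to decide
# which metric to apply.
print(sorted(metadata_dict['fields']))
```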

Computing an overall score

Use the sdmetrics library to generate a Quality Report. This report evaluates the shapes of the columns (marginal distributions) and the pairwise trends between the columns (correlations).

In [10]: from sdmetrics.reports.single_table import QualityReport

In [11]: report = QualityReport()

In [12]: report.generate(real_data, synthetic_data, metadata_dict)

Overall Quality Score: 85.67%

Properties:
Column Shapes: 88.34%
Column Pair Trends: 82.99%

The report uses information in the metadata to select which metrics to apply to your data. The final score is a number between 0 and 1 (displayed as a percentage), where 0 indicates the lowest quality and 1 indicates the highest.
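As a sanity check on the numbers above, the overall score is consistent with a simple average of the two property scores. The arithmetic below just reproduces the reported figures; how SDMetrics aggregates internally is its own implementation detail.

```python
# Property scores as reported above, on the 0-1 scale.
column_shapes = 0.8834       # Column Shapes: 88.34%
column_pair_trends = 0.8299  # Column Pair Trends: 82.99%

# A plain average of the two property scores gives roughly 0.8567,
# matching the reported Overall Quality Score of 85.67% up to rounding.
overall = (column_shapes + column_pair_trends) / 2
print(overall)
```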

How was the score computed?

The report includes a breakdown for every property that it computed.

In [13]: report.get_details(property_name='Column Shapes')
Out[13]: 
                Column        Metric  Quality Score
0          second_perc  KSComplement       0.930233
1            high_perc  KSComplement       0.911628
2          degree_perc  KSComplement       0.893023
3     experience_years  KSComplement       0.930233
4   employability_perc  KSComplement       0.888372
5             mba_perc  KSComplement       0.953488
6               salary  KSComplement       0.718601
7           start_date  KSComplement       0.666667
8             end_date  KSComplement       0.740991
9               gender  TVComplement       0.897674
10           high_spec  TVComplement       0.958140
11         degree_type  TVComplement       0.893023
12     work_experience  TVComplement       0.995349
13            mba_spec  TVComplement       0.893023
14              placed  TVComplement       0.953488
15            duration  TVComplement       0.910453

In the detailed view, you can see the quality score for each column of the table. Based on the data type, different metrics may be used for the computation.
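To make the two metric names in the table concrete, here is a minimal from-scratch sketch of the ideas behind them; the real SDMetrics implementations differ in details. KSComplement is 1 minus the Kolmogorov-Smirnov statistic (the maximum gap between the real and synthetic empirical CDFs of a numerical column), and TVComplement is 1 minus the total variation distance between the category frequencies of a categorical column.

```python
def ks_complement(real, synthetic):
    """1 minus the KS statistic: one minus the largest gap between the
    two empirical CDFs. 1.0 means the distributions look identical."""
    points = sorted(set(real) | set(synthetic))

    def ecdf(sample, v):
        # Fraction of the sample at or below value v.
        return sum(x <= v for x in sample) / len(sample)

    ks = max(abs(ecdf(real, v) - ecdf(synthetic, v)) for v in points)
    return 1 - ks


def tv_complement(real, synthetic):
    """1 minus the total variation distance between category
    frequencies. 1.0 means identical category proportions."""
    categories = set(real) | set(synthetic)
    tv = 0.5 * sum(
        abs(real.count(c) / len(real) - synthetic.count(c) / len(synthetic))
        for c in categories
    )
    return 1 - tv


# Identical samples score a perfect 1.0; the more the distributions
# diverge, the closer the score gets to 0.0.
print(ks_complement([1, 2, 3], [1, 2, 3]))    # 1.0
print(tv_complement(['M', 'F'], ['M', 'M']))  # 0.5
```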

For more information about the Quality Report, see the SDMetrics Docs.

Can I apply different metrics?

Outside of reports, the SDMetrics library contains a variety of metrics that you can apply manually. For example, the NewRowSynthesis metric measures whether each row in the synthetic data is novel, or whether it exactly matches a row in the real data.

In [14]: from sdmetrics.single_table import NewRowSynthesis

In [15]: NewRowSynthesis.compute(real_data, synthetic_data, metadata_dict)
Out[15]: 1.0
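Conceptually, the score of 1.0 means every synthetic row is new. The sketch below captures the core idea under simplifying assumptions (the real metric also supports tolerances for numerical matching): count the fraction of synthetic rows that do not exactly match any real row.

```python
def new_row_synthesis(real_rows, synthetic_rows):
    """Fraction of synthetic rows that are novel, i.e. not an exact
    copy of any real row. 1.0 means every synthetic row is new."""
    seen = {tuple(row) for row in real_rows}
    novel = sum(tuple(row) not in seen for row in synthetic_rows)
    return novel / len(synthetic_rows)


# Toy data: the third synthetic row exactly copies a real row,
# so 2 out of 3 synthetic rows are new.
real = [(17264, 'M'), (17265, 'M')]
synthetic = [(0, 'F'), (1, 'F'), (17264, 'M')]
print(new_row_synthesis(real, synthetic))
```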

See the SDMetrics Glossary for a full list of metrics that you can apply.