Danger: You are looking at the documentation for an older version of the SDV! We are no longer supporting or maintaining this version of the software. Click here to go to the new docs pages.
The SDMetrics library includes reports, metrics and visualizations that you can use to evaluate your synthetic data.
To use the SDMetrics library, you’ll need:
Real data, loaded as a pandas DataFrame
Synthetic data, loaded as a pandas DataFrame
Metadata, represented as a Python dictionary
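To make the third requirement concrete, here is a minimal sketch of what a metadata dictionary might look like. It assumes the "fields" layout used by older SDV releases; the three-column subset and the helper columns_of_type are hypothetical and exist only for illustration. The authoritative structure is whatever your metadata object's to_dict method returns.

```python
# Hypothetical metadata for three columns of a student table, written in the
# "fields" layout assumed for pre-1.0 SDV releases. Verify the real structure
# against the output of to_dict() for your installed version.
metadata_sketch = {
    "primary_key": "student_id",
    "fields": {
        "student_id": {"type": "id", "subtype": "integer"},
        "gender": {"type": "categorical"},
        "salary": {"type": "numerical", "subtype": "float"},
    },
}


def columns_of_type(metadata, field_type):
    """Return the column names whose declared type equals field_type."""
    return [
        name
        for name, spec in metadata["fields"].items()
        if spec["type"] == field_type
    ]


numerical_columns = columns_of_type(metadata_sketch, "numerical")
print(numerical_columns)
```

A lookup like this is the kind of information a report can use to decide which metric suits each column.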
We can get started using the demo data:
In [1]: from sdv.demo import load_tabular_demo
In [2]: from sdv.lite import TabularPreset
In [3]: metadata_obj, real_data = load_tabular_demo('student_placements', metadata=True)
In [4]: model = TabularPreset(metadata=metadata_obj, name='FAST_ML')
In [5]: model.fit(real_data)
In [6]: synthetic_data = model.sample(num_rows=real_data.shape[0])
After the previous steps, we will have two tables:
real_data, containing data about student placements
In [7]: real_data.head()
Out[7]:
   student_id  gender  second_perc  high_perc  high_spec  degree_perc  degree_type  work_experience  experience_years  employability_perc  mba_spec  mba_perc   salary  placed  start_date    end_date  duration
0       17264       M        67.00      91.00   Commerce        58.00     Sci&Tech            False                 0                55.0    Mkt&HR     58.80  27000.0    True  2020-07-23  2020-10-12       3.0
1       17265       M        79.33      78.33    Science        77.48     Sci&Tech             True                 1                86.5   Mkt&Fin     66.28  20000.0    True  2020-01-11  2020-04-09       3.0
2       17266       M        65.00      68.00       Arts        64.00    Comm&Mgmt            False                 0                75.0   Mkt&Fin     57.80  25000.0    True  2020-01-26  2020-07-13       6.0
3       17267       M        56.00      52.00    Science        52.00     Sci&Tech            False                 0                66.0    Mkt&HR     59.43      NaN   False         NaT         NaT       NaN
4       17268       M        85.80      73.60   Commerce        73.30    Comm&Mgmt            False                 0                96.8   Mkt&Fin     55.50  42500.0    True  2020-07-04  2020-09-27       3.0
synthetic_data, containing synthesized students with the same format and mathematical properties as the original
In [8]: synthetic_data.head()
Out[8]:
   student_id  gender  second_perc  high_perc  high_spec  degree_perc  degree_type  work_experience  experience_years  employability_perc  mba_spec   mba_perc   salary  placed  start_date    end_date  duration
0           0       F    70.658585  68.683326   Commerce    75.710274    Comm&Mgmt            False                 0           61.280385    Mkt&HR  60.264673      NaN    True         NaT         NaT       NaN
1           1       F    84.809614  90.194495   Commerce    80.914146    Comm&Mgmt             True                 2           89.802255   Mkt&Fin  69.745012      NaN    True         NaT         NaT       NaN
2           2       M    51.771017  62.051246   Commerce    60.549893    Comm&Mgmt            False                 0           82.609888   Mkt&Fin  58.754591  20828.0   False  2020-02-29  2020-05-17       NaN
3           3       M    71.632168  67.676724   Commerce    68.565213    Comm&Mgmt             True                 1           83.736241   Mkt&Fin  59.543187      NaN    True         NaT         NaT       NaN
4           4       M    68.547130  57.755845       Arts    72.672045    Comm&Mgmt            False                 0           67.600846   Mkt&Fin  61.930229  30020.0    True  2020-01-07  2020-04-22       NaN
We can also convert the metadata to a Python dictionary by calling its to_dict method:
In [9]: metadata_dict = metadata_obj.to_dict()
Use the sdmetrics library to generate a Quality Report. This report evaluates the shapes of the columns (marginal distributions) and the pairwise trends between the columns (correlations).
In [10]: from sdmetrics.reports.single_table import QualityReport
In [11]: report = QualityReport()
In [12]: report.generate(real_data, synthetic_data, metadata_dict)
Overall Quality Score: 85.67%
Properties:
Column Shapes: 88.34%
Column Pair Trends: 82.99%
The report uses information in the metadata to select which metrics to apply to your data. The final score is a number between 0 and 1, where 0 indicates the lowest quality and 1 indicates the highest; the report prints it as a percentage.
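The numbers printed by the report are consistent with the overall score being the unweighted mean of the property scores. The sketch below checks that arithmetic; the averaging rule is an assumption inferred from the output shown here, and the inputs are the rounded percentages from that output rather than values read from the report object.

```python
# Property scores copied from the printed report, expressed on the 0-1 scale.
property_scores = {
    "Column Shapes": 0.8834,
    "Column Pair Trends": 0.8299,
}

# Assumption: the overall score is the plain average of the property scores.
overall_score = sum(property_scores.values()) / len(property_scores)

print(f"Overall (from rounded inputs): {overall_score:.5f}")
```

The result is approximately 0.85665, which agrees with the printed 85.67% once you allow for the rounding of the two inputs.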
The report includes a breakdown for every property that it computed.
In [13]: report.get_details(property_name='Column Shapes')
Out[13]:
                Column        Metric  Quality Score
 0         second_perc  KSComplement       0.930233
 1           high_perc  KSComplement       0.911628
 2         degree_perc  KSComplement       0.893023
 3    experience_years  KSComplement       0.930233
 4  employability_perc  KSComplement       0.888372
 5            mba_perc  KSComplement       0.953488
 6              salary  KSComplement       0.718601
 7          start_date  KSComplement       0.666667
 8            end_date  KSComplement       0.740991
 9              gender  TVComplement       0.897674
10           high_spec  TVComplement       0.958140
11         degree_type  TVComplement       0.893023
12     work_experience  TVComplement       0.995349
13            mba_spec  TVComplement       0.893023
14              placed  TVComplement       0.953488
15            duration  TVComplement       0.910453
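As a quick sanity check, the per-column scores in this breakdown average to the Column Shapes score printed by the report. The sketch below assumes the property score is the unweighted mean of the column scores, which is an inference from these numbers; the values are copied by hand from the output above.

```python
# Quality scores copied from the 'Column Shapes' breakdown, in row order.
column_scores = [
    0.930233, 0.911628, 0.893023, 0.930233, 0.888372, 0.953488,
    0.718601, 0.666667, 0.740991, 0.897674, 0.958140, 0.893023,
    0.995349, 0.893023, 0.953488, 0.910453,
]

# Assumption: the property score is the plain average of the column scores.
column_shapes_score = sum(column_scores) / len(column_scores)

print(round(column_shapes_score, 4))  # 0.8834, i.e. the reported 88.34%
```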
In the detailed view, you can see the quality score for each column of the table. Based on the data type, different metrics may be used for the computation.
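The two metric names in the breakdown correspond to well-known distances: KSComplement is one minus the Kolmogorov-Smirnov statistic between two numerical samples, and TVComplement is one minus the total variation distance between two categorical frequency distributions. The pure-Python sketch below illustrates those definitions on toy data. It is not the library's implementation: the function names are hypothetical, and details such as missing-value handling and datetime conversion are left out.

```python
from bisect import bisect_right
from collections import Counter


def tv_complement(real_values, synthetic_values):
    """1 minus the total variation distance between category frequencies."""
    real_freq = Counter(real_values)
    synth_freq = Counter(synthetic_values)
    categories = set(real_freq) | set(synth_freq)
    distance = 0.5 * sum(
        abs(real_freq[c] / len(real_values) - synth_freq[c] / len(synthetic_values))
        for c in categories
    )
    return 1.0 - distance


def ks_complement(real_values, synthetic_values):
    """1 minus the largest gap between the two empirical CDFs."""
    real_sorted = sorted(real_values)
    synth_sorted = sorted(synthetic_values)
    max_gap = 0.0
    # The largest gap between two step functions occurs at an observed value.
    for x in set(real_sorted) | set(synth_sorted):
        real_cdf = bisect_right(real_sorted, x) / len(real_sorted)
        synth_cdf = bisect_right(synth_sorted, x) / len(synth_sorted)
        max_gap = max(max_gap, abs(real_cdf - synth_cdf))
    return 1.0 - max_gap


# Toy samples, invented for illustration only.
tv_score = tv_complement(["M", "M", "F", "F"], ["M", "M", "M", "F"])
ks_score = ks_complement([1, 2, 3, 4], [3, 4, 5, 6])
print(tv_score, ks_score)  # 0.75 0.5
```

Both functions return 1.0 when the two samples have identical distributions and move toward 0.0 as the distributions diverge, which matches the direction of the scores in the breakdown.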
For more information about the Quality Report, see the SDMetrics Docs.
Outside of reports, the SDMetrics library contains a variety of metrics that you can apply manually. For example, the NewRowSynthesis metric measures whether each row in the synthetic data is novel or whether it exactly matches a row in the real data.
In [14]: from sdmetrics.single_table import NewRowSynthesis
In [15]: NewRowSynthesis.compute(real_data, synthetic_data, metadata_dict)
Out[15]: 1.0
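To illustrate the idea behind this metric, the sketch below computes the fraction of synthetic rows that do not exactly match any real row. It is a simplified stand-in, not the library's code: the function name new_row_fraction is hypothetical, rows are plain tuples invented for the example, and matching uses strict equality, whereas the library's implementation may handle numerical values and missing data differently.

```python
def new_row_fraction(real_rows, synthetic_rows):
    """Fraction of synthetic rows that do not exactly match any real row."""
    real_set = set(real_rows)
    novel = sum(1 for row in synthetic_rows if row not in real_set)
    return novel / len(synthetic_rows)


# Toy rows of (gender, percentage, specialization), invented for illustration.
real_rows = [("M", 67.0, "Commerce"), ("F", 79.33, "Science")]
synthetic_rows = [("F", 70.66, "Commerce"), ("M", 67.0, "Commerce")]

novelty_score = new_row_fraction(real_rows, synthetic_rows)
print(novelty_score)  # 0.5: one of the two synthetic rows copies a real row
```

Under this reading, the score of 1.0 returned above means that no synthetic row duplicates a real one.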
See the SDMetrics Glossary for a full list of metrics that you can apply.