The SDMetrics library includes reports, metrics and visualizations that you can use to evaluate your synthetic data.
To use the SDMetrics library, you’ll need:
Real data, loaded as a pandas DataFrame
Synthetic data, loaded as a pandas DataFrame
Metadata, represented as a Python dictionary (see the loading sketch after this list)
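If you are bringing your own data rather than the demo used below, loading these three inputs might look like the following minimal sketch (the file names are hypothetical placeholders):

import json
import pandas as pd

# Real and synthetic tables as pandas DataFrames
# (hypothetical paths -- replace with your own files)
real_data = pd.read_csv('real_data.csv')
synthetic_data = pd.read_csv('synthetic_data.csv')

# Metadata as a Python dictionary
with open('metadata.json') as f:
    metadata_dict = json.load(f)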
We can get started using the demo data:
In [1]: from sdv.demo import load_tabular_demo

In [2]: from sdv.lite import TabularPreset

In [3]: metadata_obj, real_data = load_tabular_demo('student_placements', metadata=True)

In [4]: model = TabularPreset(metadata=metadata_obj, name='FAST_ML')

In [5]: model.fit(real_data)

In [6]: synthetic_data = model.sample(num_rows=real_data.shape[0])
After the previous steps, we will have two tables:
real_data, containing data about student placements
In [7]: real_data.head()
Out[7]:
   student_id gender  second_perc  high_perc high_spec  degree_perc degree_type  work_experience  experience_years  employability_perc mba_spec  mba_perc   salary  placed start_date   end_date  duration
0       17264      M        67.00      91.00  Commerce        58.00    Sci&Tech            False                 0                55.0   Mkt&HR     58.80  27000.0    True 2020-07-23 2020-10-12       3.0
1       17265      M        79.33      78.33   Science        77.48    Sci&Tech             True                 1                86.5  Mkt&Fin     66.28  20000.0    True 2020-01-11 2020-04-09       3.0
2       17266      M        65.00      68.00      Arts        64.00   Comm&Mgmt            False                 0                75.0  Mkt&Fin     57.80  25000.0    True 2020-01-26 2020-07-13       6.0
3       17267      M        56.00      52.00   Science        52.00    Sci&Tech            False                 0                66.0   Mkt&HR     59.43      NaN   False        NaT        NaT       NaN
4       17268      M        85.80      73.60  Commerce        73.30   Comm&Mgmt            False                 0                96.8  Mkt&Fin     55.50  42500.0    True 2020-07-04 2020-09-27       3.0
synthetic_data, containing synthesized students with the same format and mathematical properties as the original
In [8]: synthetic_data.head()
Out[8]:
   student_id gender  second_perc  high_perc high_spec  degree_perc degree_type  work_experience  experience_years  employability_perc mba_spec   mba_perc   salary  placed start_date   end_date  duration
0           0      F    62.650439  73.154386   Science    75.972731   Comm&Mgmt             True                 2           53.327043  Mkt&Fin  67.713611      NaN    True        NaT        NaT      12.0
1           1      F    72.868463  66.883681  Commerce    65.720537   Comm&Mgmt            False                 0           84.613120  Mkt&Fin  71.773283      NaN   False        NaT        NaT       3.0
2           2      F    78.133815  65.523606      Arts    71.612115      Others             True                 1           52.868627  Mkt&Fin  72.569259      NaN    True 2020-07-02 2020-08-25       3.0
3           3      M    78.960420  72.274036   Science    67.880697   Comm&Mgmt             True                 0           73.737140  Mkt&Fin  69.296605      NaN    True        NaT        NaT      12.0
4           4      F    68.305342  84.727306   Science    68.272651   Comm&Mgmt            False                 1           85.774004  Mkt&Fin  69.401578  22644.0    True 2020-03-17        NaT       3.0
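As a quick sanity check on the "same format" claim, you can compare the structure of the two tables directly. This is only a sketch using standard pandas, assuming the real_data and synthetic_data DataFrames created above:

# The synthetic table should have the same columns as the real one
assert list(synthetic_data.columns) == list(real_data.columns)

# Compare row counts and column data types
print(real_data.shape, synthetic_data.shape)
print(real_data.dtypes.compare(synthetic_data.dtypes))  # empty if the dtypes match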
We can also convert the metadata to a Python dictionary by calling the to_dict method:
In [9]: metadata_dict = metadata_obj.to_dict()
Use the sdmetrics library to generate a Quality Report. This report evaluates the shapes of the columns (marginal distributions) and the pairwise trends between the columns (correlations).
In [10]: from sdmetrics.reports.single_table import QualityReport

In [11]: report = QualityReport()

In [12]: report.generate(real_data, synthetic_data, metadata_dict)

Overall Quality Score: 85.8%

Properties:
Column Shapes: 88.97%
Column Pair Trends: 82.64%
The report uses information in the metadata to select which metrics to apply to your data. The final score is a number between 0 and 1 (printed above as a percentage), where 0 indicates the lowest quality and 1 indicates the highest.
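You can also read these numbers from the report object programmatically. The sketch below assumes the report generated above and the get_score and get_properties methods of the SDMetrics QualityReport (check your installed version if the names differ):

# Overall score as a float between 0 and 1
overall_score = report.get_score()

# Per-property scores (Column Shapes, Column Pair Trends) as a DataFrame
property_scores = report.get_properties()

print(overall_score)
print(property_scores)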
The report includes a breakdown for every property that it computed.
In [13]: report.get_details(property_name='Column Shapes')
Out[13]:
                Column        Metric  Quality Score
0          second_perc  KSComplement       0.934884
1            high_perc  KSComplement       0.897674
2          degree_perc  KSComplement       0.916279
3     experience_years  KSComplement       0.851163
4   employability_perc  KSComplement       0.888372
5             mba_perc  KSComplement       0.888372
6               salary  KSComplement       0.791340
7           start_date  KSComplement       0.713179
8             end_date  KSComplement       0.759875
9               gender  TVComplement       0.920930
10           high_spec  TVComplement       0.967442
11         degree_type  TVComplement       0.879070
12     work_experience  TVComplement       0.948837
13            mba_spec  TVComplement       0.986047
14              placed  TVComplement       0.995349
15            duration  TVComplement       0.895711
In the detailed view, you can see the quality score for each column of the table. Based on the data type, different metrics may be used for the computation.
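You can also plot this per-column breakdown. A brief sketch, assuming the report's get_visualization method, which returns a plotly figure in recent SDMetrics versions:

# Visualize the quality score of each column for a given property
fig = report.get_visualization(property_name='Column Shapes')
fig.show()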
For more information about the Quality Report, see the SDMetrics Docs.
Outside of reports, the SDMetrics library contains a variety of metrics that you can apply manually. For example, the NewRowSynthesis metric measures whether each row in the synthetic data is novel or whether it exactly matches a row in the real data.
In [14]: from sdmetrics.single_table import NewRowSynthesis

In [15]: NewRowSynthesis.compute(real_data, synthetic_data, metadata_dict)
Out[15]: 1.0
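Metrics are also available at the single-column level. A short sketch, assuming the KSComplement metric from sdmetrics.single_column, which compares the marginal distribution of a single numerical column:

from sdmetrics.single_column import KSComplement

# Compare the real and synthetic distributions of one numerical column
score = KSComplement.compute(
    real_data=real_data['degree_perc'],
    synthetic_data=synthetic_data['degree_perc'],
)
print(score)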
See the SDMetrics Glossary for a full list of metrics that you can apply.