<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:media="http://search.yahoo.com/mrss/"><channel><title><![CDATA[SDV Blog]]></title><description><![CDATA[Industry trends commentary and product updates]]></description><link>https://sdv.dev/</link><image><url>https://sdv.dev/favicon.png</url><title>SDV Blog</title><link>https://sdv.dev/</link></image><generator>Ghost 2.9</generator><lastBuildDate>Tue, 01 Mar 2022 19:00:21 GMT</lastBuildDate><atom:link href="https://sdv.dev/rss/" rel="self" type="application/rss+xml"/><ttl>60</ttl><item><title><![CDATA[ML Model Development using Synthetic Data Clones]]></title><description><![CDATA[What happens when you train a machine learning model on synthetic data instead of real data? Let's experiment to find out.]]></description><link>https://sdv.dev/synthetic-clones-for-ml/</link><guid isPermaLink="false">Ghost__Post__6216679682795d003d91f6e5</guid><category><![CDATA[Technical]]></category><dc:creator><![CDATA[Arnav Modi]]></dc:creator><pubDate>Thu, 24 Feb 2022 16:33:56 GMT</pubDate><media:content url="https://sdv.ghost.io/content/images/2022/02/ML-Model-Development-Banner-04.png" medium="image"/><content:encoded><![CDATA[<img src="https://sdv.ghost.io/content/images/2022/02/ML-Model-Development-Banner-04.png" alt="ML Model Development using Synthetic Data Clones"/><p><em>This article was researched by Arnav Modi, a community user. Arnav is a high school student and aspiring data scientist who spent his summer learning about the SDV and how synthetic data is used to perform ML tasks.</em></p><p>One potential use for synthetic data is to replace real data in the development of new machine learning (ML) models. 
Imagine a scenario where you need to build a predictive ML model -- perhaps for a function critical to your business, like predicting customer satisfaction or sales success -- with one important consideration: <strong>The data is sensitive, so only trusted employees can access it with specific credentials.</strong></p><p>Access to sensitive data may create a barrier for a variety of reasons:</p><ul><li>You might not have ML expertise in your organization, which means you need to use external software or contractors to complete the task. However, you are unable to share the data with them.</li><li>Your data is available on a secure, cloud-based platform for trusted employees to access remotely. They work on this data using interactive notebooks. Every time they lose their connection – due to WiFi outages, their laptops falling asleep, etc. – they may lose their work or have to reconnect.</li><li>You have a robust authentication system that your team uses. However, it creates a barrier to entry for rapid, iterative collaboration between members, sharing work and debugging data pipelines. 
As a result, your collaboration is much slower than it would be if your team could access the data without the need to authenticate.</li></ul><p>In cases like this, synthetic data can be an ideal solution: You can create synthetic data based on the original, sensitive data set, and use it more freely during ML development.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://sdv.ghost.io/content/images/2022/02/ML-Model-Development-03.png" class="kg-image" alt="ML Model Development using Synthetic Data Clones" loading="lazy" width="2000" height="783" srcset="https://sdv.ghost.io/content/images/size/w600/2022/02/ML-Model-Development-03.png 600w, https://sdv.ghost.io/content/images/size/w1000/2022/02/ML-Model-Development-03.png 1000w, https://sdv.ghost.io/content/images/size/w1600/2022/02/ML-Model-Development-03.png 1600w, https://sdv.ghost.io/content/images/size/w2400/2022/02/ML-Model-Development-03.png 2400w" sizes="(min-width: 720px) 720px"><figcaption>Synthetic data can be useful for ML development. You can use synthetic data to develop models in a variety of environments, like data science platforms, local machines or 3rd party software. Meanwhile, the real data never leaves your premises.</figcaption></img></figure><p>One key question will determine if this method succeeds: Is the synthetic data actually useful for your ML task? We performed an experiment to find out.</p><p>In the rest of this article, we'll describe our experimental setup and findings. (You can double-check our work in this <a href="https://colab.research.google.com/drive/13-1xy5t7veizWBsb_dDgTRBdhGcCqjCJ?usp=sharing">Colab Notebook</a>.)</p><h3 id="experimental-setup">Experimental Setup</h3><p>If an ML model is trained using synthetic data instead of real data, what happens to the model's performance? 
To answer this question, we identified 3 publicly available datasets (<a href="https://www.kaggle.com/mastmustu/income?select=train.csv">Income</a>, <a href="https://archive.ics.uci.edu/ml/datasets/Bank+Marketing">Bank</a> and <a href="https://www.kaggle.com/teejmahal20/airline-passenger-satisfaction?select=train.csv">Airline</a>) that are associated with particular ML prediction tasks. The datasets and tasks are summarized below.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://sdv.ghost.io/content/images/2022/02/ML-Model-Development-04.png" class="kg-image" alt="ML Model Development using Synthetic Data Clones" loading="lazy" width="2000" height="671" srcset="https://sdv.ghost.io/content/images/size/w600/2022/02/ML-Model-Development-04.png 600w, https://sdv.ghost.io/content/images/size/w1000/2022/02/ML-Model-Development-04.png 1000w, https://sdv.ghost.io/content/images/size/w1600/2022/02/ML-Model-Development-04.png 1600w, https://sdv.ghost.io/content/images/size/w2400/2022/02/ML-Model-Development-04.png 2400w" sizes="(min-width: 720px) 720px"><figcaption>A description of our datasets. *[Moro et al., 2014] S. Moro, P. Cortez and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems, Elsevier, 62:22-31, June 2014</figcaption></img></figure><p>Our experiment compared the performance of an  ML model trained on the original data, vs. one trained on the synthetic data provided by the SDV.</p><ul><li><strong><strong><strong>Control (Original data): </strong></strong></strong>How successfully can we complete the ML prediction task if we use the real data? Because some predictions are harder than others, this control helped us identify the overall difficulty of these specific tasks.</li><li><strong><strong>Experiment (Synthetic data):</strong> </strong>How successfully can we complete the ML prediction task if we use synthetic data instead? 
We used the SDV's <a href="https://sdv.dev/SDV/user_guides/single_table/copulagan.html">CopulaGAN</a> to generate synthetic data from the three original datasets.</li></ul><p>In order to develop and test the ML model, we turned to the SDMetrics library — specifically the <a href="https://sdv.dev/SDV/user_guides/evaluation/single_table_metrics.html#machine-learning-efficacy-metrics">ML Efficacy metrics</a>, which build an ML model and evaluate its performance. We used the Binary Decision Tree Classifier and Binary Logistic Regression models. The overall experimental setup is illustrated below.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://sdv.ghost.io/content/images/2022/02/ML-Model-Development-05.png" class="kg-image" alt="ML Model Development using Synthetic Data Clones" loading="lazy" width="2000" height="724" srcset="https://sdv.ghost.io/content/images/size/w600/2022/02/ML-Model-Development-05.png 600w, https://sdv.ghost.io/content/images/size/w1000/2022/02/ML-Model-Development-05.png 1000w, https://sdv.ghost.io/content/images/size/w1600/2022/02/ML-Model-Development-05.png 1600w, https://sdv.ghost.io/content/images/size/w2400/2022/02/ML-Model-Development-05.png 2400w" sizes="(min-width: 720px) 720px"><figcaption>The experimental setup evaluated synthetic data against a test set of original data that we set aside at the start. 
This allows us to compare the usefulness of both types of data for ML tasks.</figcaption></img></figure><p>To obtain reliable findings, we ran 3 iterations and averaged the results.</p><h3 id="results">Results</h3><p>The graph below shows how well we are able to perform an ML task using the original vs. the synthetic data.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://sdv.ghost.io/content/images/2022/02/Machine-Learning-Efficacy.png" class="kg-image" alt="ML Model Development using Synthetic Data Clones" loading="lazy" width="638" height="395" srcset="https://sdv.ghost.io/content/images/size/w600/2022/02/Machine-Learning-Efficacy.png 600w, https://sdv.ghost.io/content/images/2022/02/Machine-Learning-Efficacy.png 638w"><figcaption>A comparison of ML accuracy scores obtained using real vs. synthetic data, allowing us to assess any loss of accuracy that comes from replacing the original data with synthetic data.</figcaption></img></figure><p><strong>Discussion</strong></p><p>Performance on the original data quantifies the general difficulty of the ML task. Looking at these values, we can see that the Income Dataset is the hardest task, as neither of our methods was able to get above 90% accuracy using the original data.</p><p>Comparing the original and synthetic results allows us to quantify the suitability of synthetic data for ML development. Our results show a loss of between 1% and 9% of the original efficacy value for all comparisons, with a median loss of roughly 2.5%.</p><p>It's important to note that the simplifications we've made for this experiment may result in worse accuracy than we would see in real-world use.</p><ul><li>Applying CopulaGAN out-of-the-box to each dataset is simplistic. 
In a real-world scenario, the model's parameters would likely be explicitly <a href="https://sdv.dev/SDV/user_guides/single_table/copulagan.html">tuned</a> and <a href="https://sdv.dev/SDV/user_guides/single_table/constraints.html">constraints</a> would be used to improve synthetic data quality.</li><li>The Decision Tree and Logistic Regression evaluators are relatively simple ML classifiers. An ML expert (or ML software) might use more advanced techniques.</li><li>In our scenario, the 3rd party delivers a fully trained, ready-to-go ML model. Another approach is to ask them to use the synthetic data to deliver an <em>untrained</em> model – so that you can train it yourself on the real dataset. This alternative setup, which should increase the prediction accuracy, will be a topic for a future article.</li></ul><p>In summary, the accuracy loss we observe is likely close to a worst-case scenario. In a production environment, higher-quality ML models and more careful tuning of the SDV will likely reduce performance differences between original and synthetic data.</p><h3 id="takeaways">Takeaways</h3><p>In this article, we quantified the effect of replacing real data with a synthetic data clone for ML development. Our results show a median loss of roughly 2.5% of the original efficacy value when using synthetic data, with losses between 1% and 9% across all comparisons. Considering these results, we assess that <strong>it is reasonable to explore the use of synthetic data for the purpose of ML development</strong>.</p><p>In order to maximize the utility of the synthetic data, we recommend tuning the SDV model and using constraints to improve the data quality. In future articles, we'll explore more details about using synthetic data for ML.</p><p><em>Are you using the SDV to solve your ML business needs? Publish your findings on the SDV blog as a guest author! 
Contact us at </em><a href="mailto:info@sdv.dev"><em>info@sdv.dev</em></a><em>.</em></p><p><br/></p><p><br/></p>]]></content:encoded></item><item><title><![CDATA[Building the Unique Combinations Constraint in the SDV]]></title><description><![CDATA[Sometimes, you want to limit the number of permutations in your synthetic data. Explore the strategies we used for enforcing this kind of logic.]]></description><link>https://sdv.dev/building-unique-combinations/</link><guid isPermaLink="false">Ghost__Post__61e841116361ff003b9ca712</guid><category><![CDATA[Engineering]]></category><dc:creator><![CDATA[Neha Patki]]></dc:creator><pubDate>Tue, 25 Jan 2022 18:25:20 GMT</pubDate><media:content url="https://sdv.ghost.io/content/images/2022/01/Banner-UC.png" medium="image"/><content:encoded><![CDATA[<img src="https://sdv.ghost.io/content/images/2022/01/Banner-UC.png" alt="Building the Unique Combinations Constraint in the SDV"/><p>By default, a machine learning (ML) model may not always learn the deterministic rules in your dataset. We've previously explored how the SDV allows users to <a href="https://sdv.dev/blog/eng-sdv-constraints/" rel="nofollow">input their logic</a> using constraints. With constraints, an SDV model produces logically correct data 100% of the time.</p><p>While an end user might expect the constraint to "just work," engineering this functionality requires some creative techniques. In this article, we'll describe the techniques we used to build the <code>UniqueCombinations</code> constraint. You can also follow along in our <a href="https://colab.research.google.com/drive/1bY8y6m7-CjTxWDepw32-ZT3Ubb9RGK5F?usp=sharing">notebook</a>.</p><pre><code>!pip install sdv==0.13.1</code></pre><pre><code class="language-python">import numpy as np
import warnings

warnings.filterwarnings('ignore')</code></pre><h3 id="what-is-a-unique-combinations-constraint">What is a Unique Combinations Constraint?</h3><p>Users frequently encounter logical constraints on the permutations -- mixing &amp; matching -- that are allowed in synthetic data.</p><p>To illustrate this, let's use the <code>world_v1</code> dataset from the SDV tabular dataset demos. This simple dataset describes the population of different cities around the world.</p><pre><code class="language-python">from sdv.demo import load_tabular_demo

data = load_tabular_demo('world_v1')
data = data.drop(['add_numerical'], axis=1) # not needed for this demo
data.head()</code></pre><figure class="kg-card kg-image-card"><img src="https://sdv.ghost.io/content/images/2022/01/Screen-Shot-2022-01-19-at-11.51.49-AM.png" class="kg-image" alt="Building the Unique Combinations Constraint in the SDV" loading="lazy" width="1014" height="362" srcset="https://sdv.ghost.io/content/images/size/w600/2022/01/Screen-Shot-2022-01-19-at-11.51.49-AM.png 600w, https://sdv.ghost.io/content/images/size/w1000/2022/01/Screen-Shot-2022-01-19-at-11.51.49-AM.png 1000w, https://sdv.ghost.io/content/images/2022/01/Screen-Shot-2022-01-19-at-11.51.49-AM.png 1014w" sizes="(min-width: 720px) 720px"/></figure><p><strong>Relationship between <code>Name</code>, <code>CountryCode</code> and <code>District</code></strong></p><p>Looking at the data, we can observe that there is a special relationship between the <code>Name</code> of the city, its <code>CountryCode</code> and its geographical <code>District</code>: When generating synthetic data, the model should not blindly mix-and-match these values. 
Instead, it should <strong>reference the real data to verify whether the combination is valid.</strong> This is called a <code>UniqueCombinations</code> constraint.</p><p>For example, take a particular city, like <code>Cambridge</code>, which appears 3 times in our dataset.</p><pre><code class="language-python">data[data.Name == 'Cambridge']</code></pre><figure class="kg-card kg-image-card"><img src="https://sdv.ghost.io/content/images/2022/01/Screen-Shot-2022-01-19-at-11.53.07-AM.png" class="kg-image" alt="Building the Unique Combinations Constraint in the SDV" loading="lazy" width="1020" height="248" srcset="https://sdv.ghost.io/content/images/size/w600/2022/01/Screen-Shot-2022-01-19-at-11.53.07-AM.png 600w, https://sdv.ghost.io/content/images/size/w1000/2022/01/Screen-Shot-2022-01-19-at-11.53.07-AM.png 1000w, https://sdv.ghost.io/content/images/2022/01/Screen-Shot-2022-01-19-at-11.53.07-AM.png 1020w" sizes="(min-width: 720px) 720px"/></figure><p>The constraint states that <code>Cambridge</code> should only ever appear with <code>GBR (England)</code>, <code>CAN (Ontario)</code> or <code>USA (Massachusetts)</code>. It is invalid if it appears in any other region -- for example, Cambridge, France.</p><p><strong>How does the SDV handle a Unique Combination out-of-the-box?</strong></p><p>Let's try running the <code>sdv</code> as-is on the dataset to see what happens. We'll use the <code>GaussianCopula</code> model on our dataset.</p><pre><code class="language-python">from sdv.tabular import GaussianCopula

np.random.seed(0)

model = GaussianCopula(
  categorical_transformer='label_encoding' # optimize speed
) 
model.fit(data)</code></pre><p>Now, let's generate some rows to inspect the synthetic data.</p><pre><code class="language-python">np.random.seed(12)
model.sample(5)</code></pre><figure class="kg-card kg-image-card"><img src="https://sdv.ghost.io/content/images/2022/01/Screen-Shot-2022-01-19-at-11.54.31-AM.png" class="kg-image" alt="Building the Unique Combinations Constraint in the SDV" loading="lazy" width="940" height="360" srcset="https://sdv.ghost.io/content/images/size/w600/2022/01/Screen-Shot-2022-01-19-at-11.54.31-AM.png 600w, https://sdv.ghost.io/content/images/2022/01/Screen-Shot-2022-01-19-at-11.54.31-AM.png 940w" sizes="(min-width: 720px) 720px"/></figure><p>Although the <code>sdv</code> is generating known city names, countries and districts, their combinations don't make sense. We can also go back to our original example and generate only some rows for <code>Cambridge</code>.</p><pre><code class="language-python">np.random.seed(10)

conditions = {'Name': 'Cambridge'}
model.sample(5, conditions=conditions)</code></pre><figure class="kg-card kg-image-card"><img src="https://sdv.ghost.io/content/images/2022/01/Screen-Shot-2022-01-19-at-11.55.06-AM.png" class="kg-image" alt="Building the Unique Combinations Constraint in the SDV" loading="lazy" width="1022" height="364" srcset="https://sdv.ghost.io/content/images/size/w600/2022/01/Screen-Shot-2022-01-19-at-11.55.06-AM.png 600w, https://sdv.ghost.io/content/images/size/w1000/2022/01/Screen-Shot-2022-01-19-at-11.55.06-AM.png 1000w, https://sdv.ghost.io/content/images/2022/01/Screen-Shot-2022-01-19-at-11.55.06-AM.png 1022w" sizes="(min-width: 720px) 720px"/></figure><p>The result is a variety of Cambridges that aren't necessarily in USA, GBR, or CAN. These aren't valid cities!</p><p><strong>What's going on?</strong> The SDV models assign some probability to combinations that never appear in the real data. This is by design: Synthesizing new combinations -- that don't blatantly match the original data -- helps with privacy.</p><p>However, in this particular case, we aren't worried about the privacy of a city belonging to a country or district. We actually <em>do</em> want the data to match. 
This is why we need to build a constraint.</p><h3 id="fixing-the-data-using-rejecting-sampling">Fixing the data using reject sampling</h3><p>In our <a href="https://sdv.dev/blog/eng-sdv-constraints/" rel="nofollow">previous article</a>, we described a solution called <code>reject_sampling</code> that works on any type of constraint and is very easy to build: We simply create the synthetic data as usual and then throw out (reject) any data that doesn't match.</p><figure class="kg-card kg-image-card"><img src="https://sdv.ghost.io/content/images/2022/01/UniqueCombinations-02.png" class="kg-image" alt="Building the Unique Combinations Constraint in the SDV" loading="lazy" width="883" height="316" srcset="https://sdv.ghost.io/content/images/size/w600/2022/01/UniqueCombinations-02.png 600w, https://sdv.ghost.io/content/images/2022/01/UniqueCombinations-02.png 883w" sizes="(min-width: 720px) 720px"/></figure><p>In theory, this can solve our <code>UniqueCombinations</code> constraint. In practice, this strategy is only efficient if the model can easily generate acceptable data. Let's calculate the chances of getting an acceptable combination (<code>Name</code>, <code>CountryCode</code>, <code>District</code>) from the model.</p><pre><code class="language-python">np.random.seed(0)

# Sample data from the model
# The sample may include combinations that aren't valid
n = 100000
new_data = model.sample(n)

# Calculate how many rows are valid
combo = ['Name', 'CountryCode', 'District']
merged = new_data.merge(data, left_on=combo, right_on=combo, how='left')
passed = merged[merged['ID_y'].notna()].shape[0]
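# Aside: a sketch on toy data, not part of the original notebook. The same
# validity check can be written with pandas' merge(indicator=True), which adds
# a '_merge' column marking rows found in both frames. All toy_* names are
# hypothetical and do not affect the variables used in this cell.
import pandas as pd

toy_combo = ['Name', 'CountryCode', 'District']
toy_real = pd.DataFrame({'Name': ['Cambridge', 'Boston'],
                         'CountryCode': ['GBR', 'USA'],
                         'District': ['England', 'Massachusetts']})
toy_synth = pd.DataFrame({'Name': ['Cambridge', 'Cambridge'],
                          'CountryCode': ['GBR', 'FRA'],
                          'District': ['England', 'England']})
toy_flagged = toy_synth.merge(toy_real.drop_duplicates(), on=toy_combo,
                              how='left', indicator=True)
toy_valid = int((toy_flagged['_merge'] == 'both').sum())  # valid synthetic rows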

# Print out our results
print("Valid rows: ", (passed/n)*100, "%")
print("Rejected rows: ", (1 - passed/n)*100, "%")</code></pre><pre><code>Valid rows:  0.038 %
Rejected rows:  99.96199999999999 %</code></pre><p>With only about 0.038% of rows passing the constraint, the model would need to generate roughly 2,600 rows for every valid one, so this strategy can become intractable.</p><h3 id="fixing-the-data-using-transformations">Fixing the data using transformations</h3><p>A more efficient strategy is for the ML model to learn the constraint directly, so it always produces acceptable data. We can do this by transforming the data in a clever way, forcing the model to learn the logic.</p><p>Our <a href="https://sdv.dev/blog/eng-sdv-constraints/" rel="nofollow">previous article</a> described how to do this for a different constraint. Unfortunately, the exact same transformation won't work to solve our current <code>UniqueCombinations</code> constraint. <strong>The transform strategy requires a different, creative solution for each constraint.</strong> So we have to start from scratch.</p><p>Can you think of any other ways to enforce <code>UniqueCombinations</code>?</p><p><strong>A solution: Concatenating the data</strong></p><p>One solution is to concatenate the data. That is, rather than treating the city <code>Name</code>, <code>CountryCode</code> and <code>District</code> as separate items, we treat them as a single value. This will force the model to learn them as a single concept rather than as multiple columns that can be recombined.</p><figure class="kg-card kg-image-card"><img src="https://sdv.ghost.io/content/images/2022/01/UniqueCombinations-01.png" class="kg-image" alt="Building the Unique Combinations Constraint in the SDV" loading="lazy" width="1524" height="1200" srcset="https://sdv.ghost.io/content/images/size/w600/2022/01/UniqueCombinations-01.png 600w, https://sdv.ghost.io/content/images/size/w1000/2022/01/UniqueCombinations-01.png 1000w, https://sdv.ghost.io/content/images/2022/01/UniqueCombinations-01.png 1524w" sizes="(min-width: 720px) 720px"/></figure><p>Let's see this in action.</p><pre><code class="language-python"># create transformed data that concatenates the columns
data_transform = data.copy()

# Concatenate the data using a separator
data_transform['concatenated'] = data_transform['Name'] + '#' + data_transform['CountryCode'] + '#' + data_transform['District']

# We can drop the individual columns
data_transform.drop(labels=['Name', 'CountryCode', 'District'],
                    axis=1, inplace=True)

data_transform.head()</code></pre><figure class="kg-card kg-image-card"><img src="https://sdv.ghost.io/content/images/2022/01/Screen-Shot-2022-01-19-at-11.58.21-AM.png" class="kg-image" alt="Building the Unique Combinations Constraint in the SDV" loading="lazy" width="828" height="368" srcset="https://sdv.ghost.io/content/images/size/w600/2022/01/Screen-Shot-2022-01-19-at-11.58.21-AM.png 600w, https://sdv.ghost.io/content/images/2022/01/Screen-Shot-2022-01-19-at-11.58.21-AM.png 828w" sizes="(min-width: 720px) 720px"/></figure><p>Now, we can train the model using the transformed (concatenated) data instead.</p><pre><code class="language-python">np.random.seed(35)

# create a new model that will learn from the transformed data
model_transform = GaussianCopula(categorical_transformer='label_encoding')
model_transform.fit(data_transform)

# this will produce transformed data
output = model_transform.sample()
output.head(5)</code></pre><figure class="kg-card kg-image-card"><img src="https://sdv.ghost.io/content/images/2022/01/Screen-Shot-2022-01-19-at-11.58.53-AM.png" class="kg-image" alt="Building the Unique Combinations Constraint in the SDV" loading="lazy" width="882" height="368" srcset="https://sdv.ghost.io/content/images/size/w600/2022/01/Screen-Shot-2022-01-19-at-11.58.53-AM.png 600w, https://sdv.ghost.io/content/images/2022/01/Screen-Shot-2022-01-19-at-11.58.53-AM.png 882w" sizes="(min-width: 720px) 720px"/></figure><p>To get back realistic-looking data, we can convert the concatenated column back into <code>Name</code>, <code>CountryCode</code> and <code>District</code>.</p><pre><code class="language-python">import pandas as pd

# Split the concatenated column by the separator and save the results
names = []
countrycodes = []
districts = []

for x in output['concatenated']:
  try:
    name, countrycode, district = x.split('#')
  except (AttributeError, ValueError):  # missing value or unexpected number of parts
    name, countrycode, district = [np.nan]*3
  names.append(name)
  countrycodes.append(countrycode)
  districts.append(district)
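# Aside: a sketch on toy data, not the notebook's own code. pandas can also
# split a concatenated column in one step with str.split(expand=True).
# toy_concat and toy_parts are hypothetical names; the loop above remains the
# approach used in this article.
import pandas as pd

toy_concat = pd.Series(['Cambridge#GBR#England', 'Boston#USA#Massachusetts'])
toy_parts = toy_concat.str.split('#', expand=True)  # one column per part
toy_parts.columns = ['Name', 'CountryCode', 'District']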

# Add the individual columns back in
output['Name'] = pd.Series(names)
output['CountryCode'] = pd.Series(countrycodes)
output['District'] = pd.Series(districts)

# Drop the concatenated column
output.drop(labels=['concatenated'], axis=1, inplace=True)</code></pre><p>As a result, the output now looks like our original data.</p><pre><code class="language-python">output.head()</code></pre><figure class="kg-card kg-image-card"><img src="https://sdv.ghost.io/content/images/2022/01/Screen-Shot-2022-01-19-at-11.59.41-AM.png" class="kg-image" alt="Building the Unique Combinations Constraint in the SDV" loading="lazy" width="1020" height="368" srcset="https://sdv.ghost.io/content/images/size/w600/2022/01/Screen-Shot-2022-01-19-at-11.59.41-AM.png 600w, https://sdv.ghost.io/content/images/size/w1000/2022/01/Screen-Shot-2022-01-19-at-11.59.41-AM.png 1000w, https://sdv.ghost.io/content/images/2022/01/Screen-Shot-2022-01-19-at-11.59.41-AM.png 1020w" sizes="(min-width: 720px) 720px"/></figure><p>Most importantly, the <code>Name</code>, <code>CountryCode</code> and <code>District</code> columns now make sense!</p><p><strong>Caveats of transforming the data</strong></p><p>The transform strategy is an efficient and elegant approach to modeling. But there is a downside: <strong>The transform strategy might lose some mathematical properties.</strong></p><p>To see why, consider the model's perspective:</p><ul><li><code>Cambridge#GBR#England</code> is completely different from</li><li><code>Cambridge#USA#Massachusetts</code> is completely different from</li><li><code>Boston#USA#Massachusetts</code></li></ul><p>The problem is that two of these actually have something in common -- they are located in <code>Massachusetts, USA</code>. So the model will not be able to learn anything special about <code>Massachusetts</code> or <code>USA</code> as a whole.</p><p>As an example, let's see how well the model was able to learn populations of US-based cities.</p><pre><code class="language-python">import matplotlib.pyplot as plt

# Populations of real US cities
real_usa = data.loc[data['CountryCode'] == 'USA', 'Population']

# Populations of synthetic US cities
synth_usa = output.loc[output['CountryCode'] == 'USA', 'Population']

# Plot the distributions
plt.ylabel('US City Data')
plt.xlabel('Population')
_ = plt.boxplot([real_usa, synth_usa],
                showfliers=False,
                labels=['Real', 'Synthetic'],
                vert=False
)
plt.show()</code></pre><figure class="kg-card kg-image-card"><img src="https://sdv.ghost.io/content/images/2022/01/Screen-Shot-2022-01-19-at-12.00.53-PM.png" class="kg-image" alt="Building the Unique Combinations Constraint in the SDV" loading="lazy" width="1022" height="500" srcset="https://sdv.ghost.io/content/images/size/w600/2022/01/Screen-Shot-2022-01-19-at-12.00.53-PM.png 600w, https://sdv.ghost.io/content/images/size/w1000/2022/01/Screen-Shot-2022-01-19-at-12.00.53-PM.png 1000w, https://sdv.ghost.io/content/images/2022/01/Screen-Shot-2022-01-19-at-12.00.53-PM.png 1022w" sizes="(min-width: 720px) 720px"/></figure><p>The real data shows less variation in city population than the synthetic data. The differences make sense because our model wasn't able to learn about the USA as one complete concept.</p><p><strong>Can we fix this?</strong> It's challenging to fix this issue without degrading the mathematical correlations in some other way. If you have any ideas, we welcome you to <a href="https://github.com/sdv-dev/SDV/issues/414" rel="nofollow">join our discussion</a>!</p><h3 id="inputting-a-uniquecombination-into-the-sdv">Inputting a UniqueCombination into the SDV</h3><p>We built the constraint -- both the <code>reject_sampling</code> and <code>transform</code> approaches -- directly into the SDV library. If you have <code>sdv</code> installed, this is ready to use. Import the <code>UniqueCombinations</code> class from the <code>constraints</code> module.</p><pre><code class="language-python">from sdv.constraints import UniqueCombinations

# Create a Unique Combinations constraint
unique_city_country_district = UniqueCombinations(
  columns=['Name', 'CountryCode', 'District'],
  handling_strategy='transform' # you can change this to 'reject_sampling' too
)

# Create a new model using the constraint
updated_model = GaussianCopula(
  constraints=[unique_city_country_district],
  categorical_transformer='label_encoding'
)</code></pre><p>Now, you can train the model on your data and sample synthetic data.</p><pre><code class="language-python">np.random.seed(35)

updated_model.fit(data)
updated_model.sample(5)</code></pre><figure class="kg-card kg-image-card"><img src="https://sdv.ghost.io/content/images/2022/01/Screen-Shot-2022-01-19-at-12.02.30-PM.png" class="kg-image" alt="Building the Unique Combinations Constraint in the SDV" loading="lazy" width="1146" height="382" srcset="https://sdv.ghost.io/content/images/size/w600/2022/01/Screen-Shot-2022-01-19-at-12.02.30-PM.png 600w, https://sdv.ghost.io/content/images/size/w1000/2022/01/Screen-Shot-2022-01-19-at-12.02.30-PM.png 1000w, https://sdv.ghost.io/content/images/2022/01/Screen-Shot-2022-01-19-at-12.02.30-PM.png 1146w" sizes="(min-width: 720px) 720px"/></figure><p>All of the synthetic data is guaranteed to follow the <code>UniqueCombinations</code> constraint.</p><h3 id="takeaways">Takeaways</h3><ol><li>We can identify a <code>UniqueCombinations</code> requirement by asking: Should it be possible to further mix-and-match the data?</li><li>We can enforce any logical constraint by using reject sampling, which throws out any invalid data. This is not efficient for <code>UniqueCombinations</code>.</li><li>An alternative approach is to transform the data, forcing the ML model to learn the constraint. 
For <code>UniqueCombinations</code> we transformed the data by concatenating it.</li><li>The logic for <code>UniqueCombinations</code> is already built into the SDV's <code>constraints</code> module, and is ready to use.</li></ol><p>Further reading:</p><ul><li><a href="https://sdv.dev/blog/eng-sdv-constraints/" rel="nofollow">Engineering Constraints Blog Article</a></li><li><a href="https://sdv.dev/SDV/user_guides/single_table/constraints.html" rel="nofollow">Handling Constraints User Guide</a></li><li><a href="https://sdv.dev/SDV/api_reference/constraints/tabular.html" rel="nofollow">Tabular Constraints API</a></li></ul>]]></content:encoded></item><item><title><![CDATA[The SDV in 2021: A year in review]]></title><description><![CDATA[In this article, we summarize SDV growth – downloads as well as community building – that indicates increasing market demand for synthetic data.]]></description><link>https://sdv.dev/2021-year-review/</link><guid isPermaLink="false">Ghost__Post__61d3611b6317ec003be8e4b3</guid><category><![CDATA[Project]]></category><dc:creator><![CDATA[Kalyan Veeramachaneni]]></dc:creator><pubDate>Mon, 03 Jan 2022 21:07:19 GMT</pubDate><media:content url="https://sdv.ghost.io/content/images/2022/01/Year-in-review-with-sdv.png" medium="image"/><content:encoded><![CDATA[<img src="https://sdv.ghost.io/content/images/2022/01/Year-in-review-with-sdv.png" alt="The SDV in 2021: A year in review"/><p>We started SDV open source in 2018 at MIT with the goal of creating a powerful, usable, machine learning-based synthetic data generation software system. The core belief that drove us was the conviction that more than 90% of data work can be done using synthetic data instead of real data. 
Early <a href="https://news.mit.edu/2017/artificial-data-give-same-results-as-real-data-0303">experiments at MIT</a> had been promising and we were ready to invest our time and energy into that promise.</p><p>Now, 3 years later, we are pleased to see that the market demand for synthetic data is increasing. In a 2021 article, Gartner <a href="https://blogs.gartner.com/andrew_white/2021/07/24/by-2024-60-of-the-data-used-for-the-development-of-ai-and-analytics-projects-will-be-synthetically-generated/">predicted</a> that 60% of data used for AI &amp; analytics will be synthetic by 2024.</p><p>As time progressed, we used feedback from our users to make numerous improvements to the SDV (see articles <a href="https://sdv.dev/blog/community-feedback-models/">Part 1</a> and <a href="https://sdv.dev/blog/community-feedback-workflow/">Part 2</a>). In response, we've seen increased usage, validating the market need for synthetic data generation software. In this article, we'll describe the SDV growth trends in detail.</p><h3 id="persistent-4xyear-growth-in-downloads">Persistent 4x/year growth in downloads</h3><p>Every year we are experiencing a 4x increase in SDV downloads. In 2021, we had 135,000 downloads of SDV – up from 30,576 in 2020. From the start of 2020 to the end of 2021, we have seen a 16x total increase in SDV downloads. 
The figure below shows our yearly usage.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://sdv.ghost.io/content/images/2022/01/downloads-graphic-1.png" class="kg-image" alt="The SDV in 2021: A year in review" loading="lazy" width="2000" height="889" srcset="https://sdv.ghost.io/content/images/size/w600/2022/01/downloads-graphic-1.png 600w, https://sdv.ghost.io/content/images/size/w1000/2022/01/downloads-graphic-1.png 1000w, https://sdv.ghost.io/content/images/size/w1600/2022/01/downloads-graphic-1.png 1600w, https://sdv.ghost.io/content/images/size/w2400/2022/01/downloads-graphic-1.png 2400w" sizes="(min-width: 720px) 720px"><figcaption>Downloads of SDV per year since we open sourced the library in 2018. By downloading the SDV, a user is signaling their need for synthetic data – which we can interpret as a vote from the market.</figcaption></img></figure><p>The downloads are coming from all over the world. In the map below, we list the top 10 countries.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://sdv.ghost.io/content/images/2022/01/worldmap-graphic.png" class="kg-image" alt="The SDV in 2021: A year in review" loading="lazy" width="2000" height="1156" srcset="https://sdv.ghost.io/content/images/size/w600/2022/01/worldmap-graphic.png 600w, https://sdv.ghost.io/content/images/size/w1000/2022/01/worldmap-graphic.png 1000w, https://sdv.ghost.io/content/images/size/w1600/2022/01/worldmap-graphic.png 1600w, https://sdv.ghost.io/content/images/size/w2400/2022/01/worldmap-graphic.png 2400w" sizes="(min-width: 720px) 720px"><figcaption>Downloads of the SDV in 2021, broken down by the top 10 countries. Notice that Europe accounts for half the countries.</figcaption></img></figure><p>Why are users downloading the SDV? We know that they want to create synthetic data, but they are using the synthetic data to solve a variety of different needs. 
We will explore this more and share it in a future article.</p><h3 id="over-a-thousand-new-community-members">Over a thousand new community members</h3><p>Another measure of our growth – and validation from the market – comes from the SDV community we've built on our <a href="https://github.com/sdv-dev/SDV">GitHub</a> and <a href="https://join.slack.com/t/sdv-space/shared_invite/zt-gdsfcb5w-0QQpFMVoyB2Yd6SRiMplcw">Slack</a>. In 2021, we welcomed more than 1000 new members to these spaces.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://sdv.ghost.io/content/images/2022/01/community-graphic-2.png" class="kg-image" alt="The SDV in 2021: A year in review" loading="lazy" width="2000" height="761" srcset="https://sdv.ghost.io/content/images/size/w600/2022/01/community-graphic-2.png 600w, https://sdv.ghost.io/content/images/size/w1000/2022/01/community-graphic-2.png 1000w, https://sdv.ghost.io/content/images/size/w1600/2022/01/community-graphic-2.png 1600w, https://sdv.ghost.io/content/images/size/w2400/2022/01/community-graphic-2.png 2400w" sizes="(min-width: 720px) 720px"><figcaption>A summary of how the SDV Community grew in 2021. Any user can join the community and actively participate through the SDV GitHub and Slack.</figcaption></img></figure><p>As <a href="https://www.bvp.com/atlas/measuring-the-engagement-of-an-open-source-software-community">this article</a> points out, members contribute in several different ways: Many help increase awareness of an open source solution for this enterprise pain point. Meanwhile, others jump in, use it and give feedback actively. In 2021, we doubled the number of unique users raising issues on our GitHub. Throughout the  year, over 200 members actively participated in our forums by raising GitHub issues or contributing to discussions on Slack.</p><p>Enterprise feedback is particularly useful to us. This type of feedback comes from users who are solving targeted business problems with the SDV. 
Direct and succinct feedback explains what would make the SDV more useful. An example is shown below.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://sdv.ghost.io/content/images/2022/01/Screen-Shot-2022-01-03-at-12.55.08-PM.png" class="kg-image" alt="The SDV in 2021: A year in review" loading="lazy" width="1718" height="282" srcset="https://sdv.ghost.io/content/images/size/w600/2022/01/Screen-Shot-2022-01-03-at-12.55.08-PM.png 600w, https://sdv.ghost.io/content/images/size/w1000/2022/01/Screen-Shot-2022-01-03-at-12.55.08-PM.png 1000w, https://sdv.ghost.io/content/images/size/w1600/2022/01/Screen-Shot-2022-01-03-at-12.55.08-PM.png 1600w, https://sdv.ghost.io/content/images/2022/01/Screen-Shot-2022-01-03-at-12.55.08-PM.png 1718w" sizes="(min-width: 720px) 720px"><figcaption>Feedback about a missing feature – composite keys – that would make a direct impact on an enterprise use case. We've removed the user's GitHub account name for privacy. In this case, the missing feature did make it into our pre-alpha.</figcaption></img></figure><p>Our team addresses the user feedback throughout the entire SDV ecosystem. The ecosystem includes not only modeling, but also the ability to compare models through <a href="https://github.com/sdv-dev/SDGym">SDGym</a> and measure synthetic data quality through <a href="https://github.com/sdv-dev/SDMetrics">SDMetrics</a>. In 2021, the team put out 49 releases throughout the SDV ecosystem, doubling our number of releases in 2020.</p><h3 id="looking-forward-to-2022">Looking forward to 2022!</h3><p>We are looking forward to 2022! With so many users giving us feedback, we have a long list of features that we want to incorporate. 
We can't wait to share with our community what everyone is using SDV for, and keep on climbing to our original goal: 90% of data work accomplished with synthetic data.</p>]]></content:encoded></item><item><title><![CDATA[How we engineered constraint handling strategies in SDV]]></title><description><![CDATA[The SDV enforces deterministic rules using constraints. What strategies did we use to engineer this ML system? Dive into the details.]]></description><link>https://sdv.dev/eng-sdv-constraints/</link><guid isPermaLink="false">Ghost__Post__61c10f636317ec003be8e39d</guid><category><![CDATA[Engineering]]></category><dc:creator><![CDATA[Andrew Montanez]]></dc:creator><pubDate>Tue, 21 Dec 2021 00:14:45 GMT</pubDate><media:content url="https://sdv.ghost.io/content/images/2021/12/Banner-01.png" medium="image"/><content:encoded><![CDATA[<img src="https://sdv.ghost.io/content/images/2021/12/Banner-01.png" alt="How we engineered constraint handling strategies in SDV"/><p>The SDV uses machine learning (ML) to automatically learn rules (aka correlations) from real data and generate accurate synthetic data. While these models are powerful, they may not learn everything. In our <a href="https://sdv.dev/blog/user-input-synthetic-data/" rel="nofollow">previous article</a>, we described how the SDV models may not learn <strong>deterministic rules</strong>. These are patterns and laws that are inherent to the dataset:</p><ul><li>They are unchangeable, no matter what data you input.</li><li>They describe rules that must apply to every row, no exceptions.</li></ul><p>Luckily, it's possible for you to improve the machine learning model: When you input constraints, you ensure that the model learns the deterministic rules, which ultimately improves the quality of your synthetic data.</p><p>In this article, we'll dive into the technical details of how you can apply constraints and how they work under the hood. 
You can also follow along in our <a href="https://colab.research.google.com/drive/1cVGv2Xtzhd9qHgbkjsYLeLzsA8bDd1uA?usp=sharing">notebook</a>.</p><pre><code>!pip install sdv==0.13.0</code></pre><pre><code class="language-python">import numpy as np
import warnings

warnings.filterwarnings('ignore')</code></pre><h3 id="the-dataset">The Dataset</h3><p>The dataset we're using comes from a <a href="https://www.kaggle.com/c/expedia-hotel-recommendations/data?select=train.csv" rel="nofollow">Kaggle Competition</a> hosted by Expedia. We've modified the data slightly for our use.</p><pre><code class="language-python">from sdv.demo import load_tabular_demo

data = load_tabular_demo('expedia_hotel_logs')</code></pre><p>In this real-world dataset, each row represents a search result for a hotel booking.</p><p>For the purposes of this notebook, we'll drop some columns that aren't useful to us.</p><pre><code class="language-python">import pandas as pd

# Drop some columns that aren't useful for this demo
drop_columns = ['date_time', 'user_location_country', 'user_location_region',
                'user_location_city', 'user_id', 'srch_destination_id',
                'hotel_country', 'hotel_market', 'hotel_cluster',
                'srch_destination_type_id', 'orig_destination_distance',
                'posa_continent', 'site_name', 'channel']
data = data.drop(drop_columns, axis=1)

# make sure these columns are read as datetimes
for col in ['srch_ci', 'srch_co']:
  data[col] = pd.to_datetime(data[col])

# Inspect the data
data.head()</code></pre><figure class="kg-card kg-image-card"><img src="https://sdv.ghost.io/content/images/2021/12/Screen-Shot-2021-12-20-at-3.36.14-PM.png" class="kg-image" alt="How we engineered constraint handling strategies in SDV" loading="lazy" width="2000" height="349" srcset="https://sdv.ghost.io/content/images/size/w600/2021/12/Screen-Shot-2021-12-20-at-3.36.14-PM.png 600w, https://sdv.ghost.io/content/images/size/w1000/2021/12/Screen-Shot-2021-12-20-at-3.36.14-PM.png 1000w, https://sdv.ghost.io/content/images/size/w1600/2021/12/Screen-Shot-2021-12-20-at-3.36.14-PM.png 1600w, https://sdv.ghost.io/content/images/2021/12/Screen-Shot-2021-12-20-at-3.36.14-PM.png 2122w" sizes="(min-width: 720px) 720px"/></figure><p>The search parameters saved in this dataset come from the user's input when searching for a hotel room. For example:</p><figure class="kg-card kg-image-card"><img src="https://sdv.ghost.io/content/images/2021/12/EngineeredConstraint-08.png" class="kg-image" alt="How we engineered constraint handling strategies in SDV" loading="lazy" width="2000" height="912" srcset="https://sdv.ghost.io/content/images/size/w600/2021/12/EngineeredConstraint-08.png 600w, https://sdv.ghost.io/content/images/size/w1000/2021/12/EngineeredConstraint-08.png 1000w, https://sdv.ghost.io/content/images/size/w1600/2021/12/EngineeredConstraint-08.png 1600w, https://sdv.ghost.io/content/images/size/w2400/2021/12/EngineeredConstraint-08.png 2400w" sizes="(min-width: 720px) 720px"/></figure><p><strong>Deterministic Rule</strong></p><p>In order for the search to be valid, the searched check-in date must happen before the searched check-out date. That is: <code>srch_ci &lt; srch_co</code>.</p><p>This is an inherent property of any search, not just for this particular dataset -- we call this a <strong>deterministic rule</strong>. We can verify that the rule holds in the real data by checking for any exceptions.</p><pre><code class="language-python">print('Violations of the deterministic rule')
len(data[data['srch_ci'] &gt; data['srch_co']])</code></pre><pre><code>0</code></pre><p><strong>Will SDV's machine learning model learn this out of the box?</strong></p><p>To test this, let's use SDV to learn a <code>GaussianCopula</code> model from the data and sample synthetic data.</p><pre><code class="language-python">from sdv.tabular import GaussianCopula

np.random.seed(0)

model = GaussianCopula(primary_key='log_id')
model.fit(data)

synth_data = model.sample(500)
synth_data.head()</code></pre><figure class="kg-card kg-image-card"><img src="https://sdv.ghost.io/content/images/2021/12/Screen-Shot-2021-12-20-at-3.36.54-PM-1.png" class="kg-image" alt="How we engineered constraint handling strategies in SDV" loading="lazy" width="2000" height="388" srcset="https://sdv.ghost.io/content/images/size/w600/2021/12/Screen-Shot-2021-12-20-at-3.36.54-PM-1.png 600w, https://sdv.ghost.io/content/images/size/w1000/2021/12/Screen-Shot-2021-12-20-at-3.36.54-PM-1.png 1000w, https://sdv.ghost.io/content/images/size/w1600/2021/12/Screen-Shot-2021-12-20-at-3.36.54-PM-1.png 1600w, https://sdv.ghost.io/content/images/2021/12/Screen-Shot-2021-12-20-at-3.36.54-PM-1.png 2050w" sizes="(min-width: 720px) 720px"/></figure><p>Now, we can inspect the synthetic data to see if there are any invalid rows.</p><pre><code class="language-python">invalid_row_indices = synth_data['srch_ci'] &gt; synth_data['srch_co']
invalid_rows = synth_data[invalid_row_indices]

num_invalid = len(invalid_rows)
perc_invalid = num_invalid / len(synth_data) * 100
print('Number of invalid rows:', num_invalid, '(', round(perc_invalid, 2), '%)')

invalid_rows.head()</code></pre><figure class="kg-card kg-image-card"><img src="https://sdv.ghost.io/content/images/2021/12/Screen-Shot-2021-12-20-at-3.37.27-PM-1.png" class="kg-image" alt="How we engineered constraint handling strategies in SDV" loading="lazy" width="2000" height="414" srcset="https://sdv.ghost.io/content/images/size/w600/2021/12/Screen-Shot-2021-12-20-at-3.37.27-PM-1.png 600w, https://sdv.ghost.io/content/images/size/w1000/2021/12/Screen-Shot-2021-12-20-at-3.37.27-PM-1.png 1000w, https://sdv.ghost.io/content/images/size/w1600/2021/12/Screen-Shot-2021-12-20-at-3.37.27-PM-1.png 1600w, https://sdv.ghost.io/content/images/2021/12/Screen-Shot-2021-12-20-at-3.37.27-PM-1.png 2070w" sizes="(min-width: 720px) 720px"/></figure><p>The majority of the rows (94.8%) are valid, meaning the model learned the rule pretty accurately. It learned probabilistically that when <code>srch_ci</code> is later, <code>srch_co</code> tends to be later still. However, some invalid rows (~5%) are still created, so <strong>the model did not learn this deterministic rule.</strong></p><p>This raises the question: What can we do to enforce a deterministic rule?</p><h3 id="improving-the-synthetic-data">Improving the synthetic data</h3><p>Let's explore some options for enforcing our deterministic rule in order to improve the overall quality of the synthetic data.</p><p><strong>Rejecting invalid data</strong></p><p>The simplest solution is to drop the invalid rows and keep sampling from the model until the desired number of valid rows is produced. 
We call this <strong>reject sampling</strong>.</p><figure class="kg-card kg-image-card"><img src="https://sdv.ghost.io/content/images/2021/12/EngineeredConstraint-07--1-.png" class="kg-image" alt="How we engineered constraint handling strategies in SDV" loading="lazy" width="2000" height="493" srcset="https://sdv.ghost.io/content/images/size/w600/2021/12/EngineeredConstraint-07--1-.png 600w, https://sdv.ghost.io/content/images/size/w1000/2021/12/EngineeredConstraint-07--1-.png 1000w, https://sdv.ghost.io/content/images/size/w1600/2021/12/EngineeredConstraint-07--1-.png 1600w, https://sdv.ghost.io/content/images/2021/12/EngineeredConstraint-07--1-.png 2071w" sizes="(min-width: 720px) 720px"/></figure><p>The code below performs reject sampling until we have synthesized 500 rows.</p><pre><code class="language-python">import pandas as pd
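# A generic sketch of reject sampling, independent of the SDV (illustrative only).
# `reject_sample`, `sample_fn` and `is_valid` are hypothetical names for this sketch:
# sample_fn(k) should return a list of k candidate rows, and is_valid(row) should
# return True or False. Note that the loop never ends if no row is ever valid.
def reject_sample(sample_fn, is_valid, num_rows):
  valid_rows = []
  while len(valid_rows) != num_rows:
    # Only request as many rows as are still missing, then keep the valid ones
    candidates = sample_fn(num_rows - len(valid_rows))
    valid_rows.extend(row for row in candidates if is_valid(row))
  return valid_rows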

# Keep track of how many valid rows we've sampled
num_valid_rows = synth_data.shape[0] - invalid_rows.shape[0]

while num_valid_rows &lt; 500:
  # Reject the invalid data 
  synth_data = synth_data.drop(invalid_rows.index)
  
  # Create new data to replace the invalid data
  new_data = model.sample(500-num_valid_rows)
  synth_data = pd.concat([synth_data, new_data], ignore_index=True)
  invalid_rows = synth_data[synth_data['srch_ci'] &gt; synth_data['srch_co']]
  num_valid_rows = synth_data.shape[0] - invalid_rows.shape[0]

synth_data.reset_index(drop=True, inplace=True)</code></pre><p>Now, there are no invalid rows in our dataset.</p><pre><code class="language-python">invalid_rows = synth_data[synth_data['srch_ci'] &gt; synth_data['srch_co']]
invalid_rows.shape[0]</code></pre><pre><code>0</code></pre><p>In this example, we got lucky. Only a small percentage of the rows were invalid each time <code>sample</code> was called.</p><p>What would happen if the majority of the rows were invalid every time we sampled? It would take much longer to get all the desired rows. <strong>Sampling time is the primary drawback of reject sampling.</strong> Is there another approach we can use to improve the time?</p><p><strong>Transforming your data</strong></p><p>Instead of reject sampling, what if the model never produced invalid rows in the first place? To achieve this, we can alter the input data to the model so it's forced to learn the constraint.</p><p>Let's stop giving the <code>srch_ci</code> and <code>srch_co</code> to the model. Instead, let's teach the model to learn the <code>srch_ci</code> and the <code>difference</code> between the dates.</p><pre><code>difference = srch_co - srch_ci</code></pre><p>The model will produce <code>srch_ci</code> and <code>difference</code> as a result. 
Then, we can re-compute <code>srch_co</code> with the inverse formula.</p><pre><code>srch_co = srch_ci + difference</code></pre><figure class="kg-card kg-image-card"><img src="https://sdv.ghost.io/content/images/2021/12/EngineeredConstraint-06--1-.png" class="kg-image" alt="How we engineered constraint handling strategies in SDV" loading="lazy" width="2000" height="879" srcset="https://sdv.ghost.io/content/images/size/w600/2021/12/EngineeredConstraint-06--1-.png 600w, https://sdv.ghost.io/content/images/size/w1000/2021/12/EngineeredConstraint-06--1-.png 1000w, https://sdv.ghost.io/content/images/size/w1600/2021/12/EngineeredConstraint-06--1-.png 1600w, https://sdv.ghost.io/content/images/size/w2400/2021/12/EngineeredConstraint-06--1-.png 2400w" sizes="(min-width: 720px) 720px"/></figure><p>(Of course, we need to make sure the reconstructed difference is never negative, which we can do by having the model learn <code>log(difference + 1)</code> instead of the raw difference.)</p><p>Let's see this in action.</p><pre><code class="language-python"># Compute the difference
diff = (data['srch_co'] - data['srch_ci']).dt.days

# Add one (so that zero-day gaps are defined), then take the log
date_diff = np.log(diff + 1)
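# Illustrative sanity check (plain Python, not part of the SDV): the forward
# step log(days + 1) and the reverse step exp(value) - 1 should cancel out.
# `roundtrip_days` is a hypothetical helper name used only for this sketch.
import math

def roundtrip_days(days):
  return round(math.exp(math.log(days + 1))) - 1

assert all(roundtrip_days(d) == d for d in range(0, 366))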

# The model should learn this column instead of the checkout date
modified_data = data.drop('srch_co', axis=1)
modified_data['difference'] = date_diff
modified_data[['srch_ci', 'difference']].head()</code></pre><figure class="kg-card kg-image-card"><img src="https://sdv.ghost.io/content/images/2021/12/Screen-Shot-2021-12-20-at-3.30.15-PM.png" class="kg-image" alt="How we engineered constraint handling strategies in SDV" loading="lazy" width="390" height="360"/></figure><p>Now, we can fit the model with the modified data. The new samples will include the <code>srch_ci</code> and <code>difference</code> columns.</p><pre><code class="language-python">np.random.seed(20)

modified_model = GaussianCopula(primary_key='log_id')
modified_model.fit(modified_data)

modified_synth_data = modified_model.sample(500)
modified_synth_data[['srch_ci', 'difference']].head()</code></pre><figure class="kg-card kg-image-card"><img src="https://sdv.ghost.io/content/images/2021/12/Screen-Shot-2021-12-20-at-3.31.03-PM.png" class="kg-image" alt="How we engineered constraint handling strategies in SDV" loading="lazy" width="392" height="356"/></figure><p>We can recompute the <code>srch_co</code> based on <code>srch_ci</code> and <code>difference</code>.</p><pre><code class="language-python"># Undo the log+1 that we added
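# The difference was computed in days, so convert the result back using days as the unit
days = (np.exp(modified_synth_data['difference'].values).round() - 1).clip(0)
diff = pd.to_timedelta(days, unit='D')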

# Reconstruct srch_co and remove the difference column
modified_synth_data['srch_co'] = modified_synth_data['srch_ci'] + diff
modified_synth_data = modified_synth_data.drop('difference', axis=1)

modified_synth_data.head()</code></pre><figure class="kg-card kg-image-card"><img src="https://sdv.ghost.io/content/images/2021/12/Screen-Shot-2021-12-20-at-3.38.38-PM.png" class="kg-image" alt="How we engineered constraint handling strategies in SDV" loading="lazy" width="2000" height="491" srcset="https://sdv.ghost.io/content/images/size/w600/2021/12/Screen-Shot-2021-12-20-at-3.38.38-PM.png 600w, https://sdv.ghost.io/content/images/size/w1000/2021/12/Screen-Shot-2021-12-20-at-3.38.38-PM.png 1000w, https://sdv.ghost.io/content/images/size/w1600/2021/12/Screen-Shot-2021-12-20-at-3.38.38-PM.png 1600w, https://sdv.ghost.io/content/images/2021/12/Screen-Shot-2021-12-20-at-3.38.38-PM.png 2142w" sizes="(min-width: 720px) 720px"/></figure><p>Let's verify that this computation does not create any invalid rows.</p><pre><code class="language-python">invalid_rows = modified_synth_data[modified_synth_data['srch_ci'] &gt; modified_synth_data['srch_co']]
invalid_rows.shape[0]</code></pre><pre><code>0</code></pre><p>The transformation worked! In our case, this was a more efficient way to enforce the deterministic rule.</p><p>But if our rule were more complex -- and we couldn't think of a transformation -- we could always fall back to reject sampling.</p><h3 id="inputting-deterministic-rules-in-the-sdv">Inputting deterministic rules in the SDV</h3><p>We've seen how reject sampling and transformation can be used to improve the quality of the synthetic data by accounting for deterministic rules. However, it may be cumbersome for you to manually implement these strategies. In fact, we saw some common problems in our SDV user community:</p><ul><li>Users had multiple deterministic rules in their dataset. For example, there could be multiple comparisons between different pairs of columns.</li><li>Users from multiple domains often had the same kind of deterministic rule. For example, one column being greater than another is a common deterministic rule, agnostic of the use case or domain.</li></ul><p>To solve these problems, we introduced a constraints module in the SDV. <strong>With the constraints module, SDV users can easily input deterministic rules.</strong> Let's look at an example.</p><p><strong>Using the SDV constraints module</strong></p><p>The <code>constraints</code> module in the SDV contains several different types of pre-defined deterministic rules.</p><p>We will use the <code>GreaterThan</code> constraint, which will enforce that one column's values are always greater than another's.</p><pre><code class="language-python">from sdv.constraints import GreaterThan</code></pre><p>Next, we can input the logic of our deterministic rule by creating a constraint object. The <code>GreaterThan</code> constraint accepts the column names as input.</p><pre><code class="language-python">gt_constraint = GreaterThan(
  low='srch_ci',
  high='srch_co')</code></pre><p>Finally, we can input this constraint when instantiating the model.</p><pre><code class="language-python">np.random.seed(10)

# Apply the constraint to the model
model_with_constraint = GaussianCopula(
  primary_key='log_id',
  constraints=[gt_constraint])

model_with_constraint.fit(data)

# Sample synthetic data
constrained_data = model_with_constraint.sample(500)
constrained_data.head()</code></pre><figure class="kg-card kg-image-card"><img src="https://sdv.ghost.io/content/images/2021/12/Screen-Shot-2021-12-20-at-3.39.11-PM.png" class="kg-image" alt="How we engineered constraint handling strategies in SDV" loading="lazy" width="2000" height="389" srcset="https://sdv.ghost.io/content/images/size/w600/2021/12/Screen-Shot-2021-12-20-at-3.39.11-PM.png 600w, https://sdv.ghost.io/content/images/size/w1000/2021/12/Screen-Shot-2021-12-20-at-3.39.11-PM.png 1000w, https://sdv.ghost.io/content/images/size/w1600/2021/12/Screen-Shot-2021-12-20-at-3.39.11-PM.png 1600w, https://sdv.ghost.io/content/images/2021/12/Screen-Shot-2021-12-20-at-3.39.11-PM.png 2046w" sizes="(min-width: 720px) 720px"/></figure><p>As a result, we should see that all 500 generated rows are valid on the first try. No invalid rows are present in our dataset.</p><pre><code class="language-python">invalid_rows = constrained_data[constrained_data['srch_ci'] &gt; constrained_data['srch_co']]
invalid_rows.shape[0]</code></pre><pre><code>0</code></pre><p>Using the SDV was much simpler than writing the code ourselves! Plus, we can create multiple constraints for the same dataset and easily use them on other datasets.</p><p><strong>Specifying the strategy in the constraints module</strong></p><p>By default, the <code>GreaterThan</code> constraint uses the <code>transform</code> strategy. However, you can use the <code>handling_strategy</code> argument to control this. This argument accepts <code>'reject_sampling'</code> or <code>'transform'</code> as valid strategies.</p><pre><code class="language-python">gt_reject_constraint = GreaterThan(
  low='srch_ci',
  high='srch_co',
  handling_strategy='reject_sampling' # specify the strategy
)</code></pre><p>Similar to before, we can then input this constraint into the model.</p><pre><code class="language-python">np.random.seed(30)

# Apply the constraint to the model
model_with_reject_constraint = GaussianCopula(
  primary_key='log_id',
  constraints=[gt_reject_constraint])

model_with_reject_constraint.fit(data)

# Sample synthetic data
constrained_reject_data = model_with_reject_constraint.sample(500)
constrained_reject_data.head()</code></pre><figure class="kg-card kg-image-card"><img src="https://sdv.ghost.io/content/images/2021/12/Screen-Shot-2021-12-20-at-3.40.44-PM.png" class="kg-image" alt="How we engineered constraint handling strategies in SDV" loading="lazy" width="2000" height="377" srcset="https://sdv.ghost.io/content/images/size/w600/2021/12/Screen-Shot-2021-12-20-at-3.40.44-PM.png 600w, https://sdv.ghost.io/content/images/size/w1000/2021/12/Screen-Shot-2021-12-20-at-3.40.44-PM.png 1000w, https://sdv.ghost.io/content/images/size/w1600/2021/12/Screen-Shot-2021-12-20-at-3.40.44-PM.png 1600w, https://sdv.ghost.io/content/images/2021/12/Screen-Shot-2021-12-20-at-3.40.44-PM.png 2048w" sizes="(min-width: 720px) 720px"/></figure><pre><code class="language-python">invalid_rows = constrained_reject_data[constrained_reject_data['srch_ci'] &gt; constrained_reject_data['srch_co']]
invalid_rows.shape[0]</code></pre><pre><code>0</code></pre><h3 id="what-other-deterministic-rules-are-already-available-in-sdv">What other deterministic rules are already available in SDV?</h3><p>The <code>GreaterThan</code> constraint is one kind of deterministic rule, but there may be others that apply to your dataset. The SDV offers more constraints for other types of logic.</p><ul><li><a href="https://sdv.dev/SDV/user_guides/single_table/handling_constraints.html#unique-constraint" rel="nofollow">Unique</a> when values in a column must be unique to the entire dataset.</li><li><a href="https://sdv.dev/SDV/user_guides/single_table/handling_constraints.html#uniquecombinations-constraint" rel="nofollow">UniqueCombinations</a> to limit the permutations between multiple columns.</li><li><a href="https://sdv.dev/SDV/user_guides/single_table/handling_constraints.html#positive-and-negative-constraints" rel="nofollow">Positive and Negative</a> to enforce boundaries.</li><li><a href="https://sdv.dev/SDV/user_guides/single_table/handling_constraints.html#columnformula-constraint" rel="nofollow">ColumnFormula</a> when there is a formulaic association between columns.</li><li><a href="https://sdv.dev/SDV/user_guides/single_table/handling_constraints.html#rounding-constraint" rel="nofollow">Rounding</a> to enforce decimal precision.</li><li><a href="https://sdv.dev/SDV/user_guides/single_table/handling_constraints.html#between-constraint" rel="nofollow">Between</a> when one column's values must be between 2 other values.</li><li><a href="https://sdv.dev/SDV/user_guides/single_table/handling_constraints.html#onehotencoding-constraint" rel="nofollow">OneHotEncoding</a> when your data includes a variable with one hot encoding.</li></ul><p>For each of them, you can specify handling strategies for <code>reject_sampling</code> to discard invalid data or <code>transform</code> to modify the data (unique to each constraint).</p><p><strong>What if my rule isn't included in the 
module?</strong></p><p>You may come across a rule that cannot be described by any of the constraint classes in the SDV. In this case, you can define a <a href="https://sdv.dev/SDV/user_guides/single_table/custom_constraints.html#defining-custom-constraints" rel="nofollow">CustomConstraint</a> with logic specific to your use case.</p><p>Additionally, consider <a href="https://github.com/sdv-dev/SDV/issues/new/choose" rel="nofollow"><strong>filing a feature request on GitHub</strong></a> with details about your use case &amp; scenario. We can add your logic as a pre-defined constraint so others can benefit from it too!</p><h3 id="takeaways">Takeaways</h3><p>In this notebook, we explored what happens when we have a deterministic rule in our dataset.</p><ol><li>Machine learning models may not be able to learn the deterministic rules out of the box, but it is possible to improve the model to learn these types of rules.</li><li>Deterministic rules can be handled by discarding invalid data (<strong>reject sampling</strong>) or by adding some clever preprocessing to your code (<strong>transforming</strong>).</li><li>The SDV offers a <code>constraints</code> module that allows you to input commonly found deterministic rules. You can specify the handling strategy for each constraint and apply multiple rules to the same dataset.</li></ol><p><strong>Further Reading</strong></p><p>For further information about constraints, refer to the <a href="https://sdv.dev/SDV/user_guides/single_table/handling_constraints.html" rel="nofollow">Handling Constraints User Guide</a>.</p>]]></content:encoded></item><item><title><![CDATA[User input to enhance synthetic data generation]]></title><description><![CDATA[ML models learn some rules out of the box, while other logic requires more work. Which is which? 
Read more to find out.]]></description><link>https://sdv.dev/user-input-synthetic-data/</link><guid isPermaLink="false">Ghost__Post__61a68d091b683e0048b2a2f3</guid><category><![CDATA[Product]]></category><dc:creator><![CDATA[Neha Patki]]></dc:creator><pubDate>Wed, 01 Dec 2021 16:06:49 GMT</pubDate><media:content url="https://sdv.ghost.io/content/images/2021/11/Banner.png" medium="image"/><content:encoded><![CDATA[<img src="https://sdv.ghost.io/content/images/2021/11/Banner.png" alt="User input to enhance synthetic data generation"/><p>In our <a href="https://sdv.dev/blog/fake-to-synthetic-ml">previous article</a>, we explored how machine learning (ML) plays a key role in synthetic data creation. One of the biggest strengths of ML is <em>automatic rule detection</em> (also known in ML terms as <em>correlations</em>): The algorithms are designed to learn patterns in the data, even without additional user input. The result is synthetic data that resembles the original, right down to its mathematical properties!</p><p>However, in some cases, applying an ML model right out of the box may not immediately achieve the desired result. In this article, we'll explore the strengths of ML models and go through those areas where user input may be required.</p><h3 id="strengths-of-ml-models">Strengths of ML Models</h3><p>The goal of any ML-based synthetic data generation software is to learn from and emulate the input data. 
To illustrate this, let's pretend you work in the car insurance business, and you're in possession of a real dataset related to drivers and their insurance:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://sdv.ghost.io/content/images/2021/11/Figure-03.png" class="kg-image" alt="User input to enhance synthetic data generation" loading="lazy" width="1916" height="835" srcset="https://sdv.ghost.io/content/images/size/w600/2021/11/Figure-03.png 600w, https://sdv.ghost.io/content/images/size/w1000/2021/11/Figure-03.png 1000w, https://sdv.ghost.io/content/images/size/w1600/2021/11/Figure-03.png 1600w, https://sdv.ghost.io/content/images/2021/11/Figure-03.png 1916w" sizes="(min-width: 720px) 720px"><figcaption>An example dataset, including license and collision coverage information associated with different drivers.</figcaption></img></figure><p>An ML-based system, such as the <a href="https://sdv.dev/blog/intro-to-sdv/">Synthetic Data Vault</a> (SDV), will learn patterns from the real data and use them to create new synthetic data. Recall some of the important patterns that ML algorithms detect:</p><ul><li><strong>Shapes.</strong> The general shape of the data. For example, in the dataset above, 50% of drivers have Collision Coverage and the Annual Premium is uniformly scattered between $3,000 and $9,000.</li><li><strong>Correlations.</strong> The trends within the data. 
For example, having Collision Coverage -- especially Standard coverage -- means a higher Annual Premium.</li></ul><p>These shapes and correlations will be present in the synthetic data that is outputted by the ML model, as shown below.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://sdv.ghost.io/content/images/2021/11/Figure-04.png" class="kg-image" alt="User input to enhance synthetic data generation" loading="lazy" width="1875" height="832" srcset="https://sdv.ghost.io/content/images/size/w600/2021/11/Figure-04.png 600w, https://sdv.ghost.io/content/images/size/w1000/2021/11/Figure-04.png 1000w, https://sdv.ghost.io/content/images/size/w1600/2021/11/Figure-04.png 1600w, https://sdv.ghost.io/content/images/2021/11/Figure-04.png 1875w" sizes="(min-width: 720px) 720px"><figcaption>An example of a synthetic dataset created by an ML-based algorithm. The algorithm will learn patterns from the real data and emulate them.</figcaption></img></figure><p>Perhaps <strong>the single biggest strength of an ML algorithm is its ability to learn rules by looking for general patterns in the data,</strong> using probability and statistics.</p><h3 id="what-ml-models-do-not-learn-out-of-the-box">What ML models do not learn out of the box</h3><p>Let's take a closer look at the synthetic car insurance data. You might notice that two of the rows in the synthetic data don't make complete sense. 
Below, we've highlighted the errors.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://sdv.ghost.io/content/images/2021/11/Figure-05.png" class="kg-image" alt="User input to enhance synthetic data generation" loading="lazy" width="1867" height="831" srcset="https://sdv.ghost.io/content/images/size/w600/2021/11/Figure-05.png 600w, https://sdv.ghost.io/content/images/size/w1000/2021/11/Figure-05.png 1000w, https://sdv.ghost.io/content/images/size/w1600/2021/11/Figure-05.png 1600w, https://sdv.ghost.io/content/images/2021/11/Figure-05.png 1867w" sizes="(min-width: 720px) 720px"><figcaption>The synthetic car insurance data, with errors highlighted.</figcaption></img></figure><p>Do you see what has gone wrong? In the first row, the license expired 3 years earlier than it was issued. In the last row, a driver without Collision Coverage has a Collision Policy Type. Additionally, the same Customer ID has been repeated in Row 3 and Row 4.</p><p>There are three rules that the ML algorithm did not follow:</p><ol><li>License Expiration &gt; License Issue Year</li><li>If Has Collision Coverage = NO, then Collision Policy Type must be empty</li><li>All Customer IDs must be unique</li></ol><p>Why does the ML model easily pick up on some rules and not others? To answer this question, we can look closely at the rules themselves. All of the rules that the ML model successfully learned -- including the distribution shapes and the correlations -- were based on general trends. These <strong>probabilistic rules</strong> apply to a majority of the relationships within the dataset, but not all of them. Although they have to make sense in aggregate, a few rows may be exceptions.</p><p>By contrast, the rules that the ML model failed to learn were stricter. These <strong>deterministic rules</strong> describe intrinsic laws of nature, time or logic. 
Each and every row must adhere to them, and they won't change regardless of  how much (or how little) data has been given to the ML model.</p><p>To continue with the driving theme: A probabilistic rule is like a yield sign, signaling a general recommendation that works out differently for each individual situation -- some cars may need to stop, while others just slow down. Meanwhile, a deterministic rule is like a stop sign, demanding that every single car must come to a full stop.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://sdv.ghost.io/content/images/2021/11/Figure-06.png" class="kg-image" alt="User input to enhance synthetic data generation" loading="lazy" width="1607" height="662" srcset="https://sdv.ghost.io/content/images/size/w600/2021/11/Figure-06.png 600w, https://sdv.ghost.io/content/images/size/w1000/2021/11/Figure-06.png 1000w, https://sdv.ghost.io/content/images/size/w1600/2021/11/Figure-06.png 1600w, https://sdv.ghost.io/content/images/2021/11/Figure-06.png 1607w" sizes="(min-width: 720px) 720px"><figcaption>A probabilistic rule applies to a majority of rows, but leaves room for exceptions. Meanwhile, a deterministic rule applies to every single row.</figcaption></img></figure><p><strong>By default, our ML model assumed that all rules were probabilistic.</strong> When this happens, synthetic data still generally follows the desired properties -- for example, License Expiration &gt; License Issue Year -- for <em>most</em> of the rows, but not for every row.</p><h3 id="improving-the-ml-models-using-constraints">Improving the ML Models using constraints</h3><p>Just because the ML model didn't automatically follow a deterministic rule doesn't mean that it can't. It's possible to improve the model so that it understands this type of rule. 
As a user working with the SDV, you can input deterministic rules into your model using <strong>constraints</strong>.</p><p>An ML model built using constraints will accommodate both probabilistic and deterministic rules.</p><p><strong>Do you need SDV constraints?</strong></p><p>Deterministic rules are often easy to spot in your dataset: They are the rules that every single row must follow in order to be valid, regardless of how much data there is overall.  But even if you identify the right constraints, there are some cases where you might not actually want to supply them to the SDV.</p><p>Because the SDV learns probabilistic rules, most of the synthesized data is generally valid. Having a few errors sprinkled in might actually be beneficial if you want your synthetic data to cover some edge cases. For example, if you're using the synthetic data to test insurance claim software, leaving in some weird data points might help you ensure that the software can handle unexpected cases -- like the License Expiration accidentally being set too early.</p><p>The figure below shows a few questions you can ask to determine whether adding a constraint is the right approach.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://sdv.ghost.io/content/images/2021/12/Figure-07.png" class="kg-image" alt="User input to enhance synthetic data generation" loading="lazy" width="2000" height="586" srcset="https://sdv.ghost.io/content/images/size/w600/2021/12/Figure-07.png 600w, https://sdv.ghost.io/content/images/size/w1000/2021/12/Figure-07.png 1000w, https://sdv.ghost.io/content/images/size/w1600/2021/12/Figure-07.png 1600w, https://sdv.ghost.io/content/images/size/w2400/2021/12/Figure-07.png 2400w" sizes="(min-width: 720px) 720px"><figcaption>Should you input a rule using constraints? 
First, determine whether the rule is deterministic, and then take your use case into account.</figcaption></img></figure><p><strong>The SDV Constraints offering</strong></p><p>If you decide that adding deterministic rules is important for generating your synthetic data, the SDV has many different constraints to choose from! The table below describes the constraints you would need in order to define the deterministic rules that would best mold your Car Insurance dataset.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://sdv.ghost.io/content/images/2021/11/Figure-08.png" class="kg-image" alt="User input to enhance synthetic data generation" loading="lazy" width="2000" height="578" srcset="https://sdv.ghost.io/content/images/size/w600/2021/11/Figure-08.png 600w, https://sdv.ghost.io/content/images/size/w1000/2021/11/Figure-08.png 1000w, https://sdv.ghost.io/content/images/size/w1600/2021/11/Figure-08.png 1600w, https://sdv.ghost.io/content/images/2021/11/Figure-08.png 2204w" sizes="(min-width: 720px) 720px"><figcaption>The GreaterThan, ColumnFormula and Unique constraints -- all available in the SDV -- set the deterministic rules that ensure your synthetic Car Insurance Data is useful and makes sense.</figcaption></img></figure><p>The SDV offers many more possible constraints, including:</p><ul><li>UniqueCombinations</li><li>Positive and Negative</li><li>Rounding</li><li>Between</li><li>OneHotEncoding</li></ul><p>You can add multiple constraints to the same dataset in order to accommodate all the deterministic rules you need. For more details, read the <a href="https://sdv.dev/SDV/user_guides/single_table/handling_constraints.html">Constraints User Guide</a>.</p><h3 id="takeaways">Takeaways</h3><p>In this article, we learned that:</p><ul><li>Data is governed by rules. 
The SDV automatically learns probabilistic rules, which describe overall trends or patterns in the data.</li><li>However, sometimes the data has <strong>deterministic rules</strong>, which every row must follow no matter how much or how little data there is. ML-based systems, including the SDV, may not enforce deterministic rules out of the box.</li><li>Users can input deterministic rules to the SDV using <strong>constraints</strong>. To figure out whether you should input a constraint, ask yourself whether there are any rules that the data must always follow. There are many constraints to choose from.</li></ul><p>In future articles, we'll dive deeper into this topic. We'll explore the technical details behind constraints, and how exactly the SDV's ML models are able to learn deterministic rules.<br/></p>]]></content:encoded></item><item><title><![CDATA[From fake to synthetic data: Machine learning changes the game]]></title><description><![CDATA[Creating fake data is an old concept -- but machine learning is a whole new ballgame. Learn about why ML is a key ingredient to synthetic data.]]></description><link>https://sdv.dev/fake-to-synthetic-ml/</link><guid isPermaLink="false">Ghost__Post__61927ca167598b003b3d944a</guid><category><![CDATA[Product]]></category><dc:creator><![CDATA[Neha Patki]]></dc:creator><pubDate>Tue, 16 Nov 2021 16:33:56 GMT</pubDate><media:content url="https://sdv.ghost.io/content/images/2021/11/Article-13.png" medium="image"/><content:encoded><![CDATA[<img src="https://sdv.ghost.io/content/images/2021/11/Article-13.png" alt="From fake to synthetic data: Machine learning changes the game"/><p>Data is a great source of information. Real data — which is based on observations of real-world phenomena like weather, movements on a factory floor or the activities of a user base — can help us notice trends, increase business efficiency and solve problems. </p><p>But data can be helpful even if it isn’t real. 
This data, sometimes called fake or test data, doesn’t come directly from real-world observations, but is instead artificially crafted by a human or machine. The latest and most complex iteration of this data type — what we call synthetic data — builds on previous work done in this space. </p><p>In this article, we'll go through the history of fake data. By the end, you'll be able to answer the following questions:</p><ul><li>What were the original motivations and tools for manually creating data?</li><li>What differentiates synthetic data from other types of fake data?</li><li>What role does machine learning play in generating synthetic data?</li></ul><h3 id="the-dawn-of-fake-data-test-data-management">The Dawn of Fake Data: Test Data Management</h3><p>One group of people has worked with fake data for a long time: software engineers. They need data in order to test the systems they build, and the real stuff isn't always usable (for example, due to privacy). </p><p>Let's pretend it's the early 2000s, and you're an IT professional working at a bank. You're responsible for the software that updates account balances after each transaction. You'd like to test this software before putting it into production. 
What do you do?</p><p>Most likely, you'll come up with a few test scenarios to ensure that your functionality — updating the balance — can properly handle a variety of inputs.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://sdv.ghost.io/content/images/2021/11/Article-09--1-.png" class="kg-image" alt="From fake to synthetic data: Machine learning changes the game" loading="lazy" width="2000" height="541" srcset="https://sdv.ghost.io/content/images/size/w600/2021/11/Article-09--1-.png 600w, https://sdv.ghost.io/content/images/size/w1000/2021/11/Article-09--1-.png 1000w, https://sdv.ghost.io/content/images/size/w1600/2021/11/Article-09--1-.png 1600w, https://sdv.ghost.io/content/images/size/w2400/2021/11/Article-09--1-.png 2400w" sizes="(min-width: 720px) 720px"><figcaption>This table shows a few scenarios you may use to test your system. In these scenarios, you're testing how a monetary transfer of $20 changes the balance in different accounts.</figcaption></img></figure><p>Notice that in order to create these scenarios, you had to generate data: various starting balances ($500, $20, $10) as well as a transfer amount ($20). This is an early version of using fake data in order to test your software!</p><p><strong>Using Tools for Manual Creation</strong></p><p>Now let's fast forward in time. Over the years, your software has gotten even more complex, and you're constantly adding new functionalities. For example, maybe you start allowing transfers with foreign currency. </p><p>You need to test these functionalities before you roll them out. To save time, you might end up using -- or creating -- a tool that allows you to generate and manage fake data for testing. 
</p><p>The simplest tool may be a basic permutation, as illustrated below.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://sdv.ghost.io/content/images/2021/11/Article-07-2.png" class="kg-image" alt="From fake to synthetic data: Machine learning changes the game" loading="lazy" width="1723" height="809" srcset="https://sdv.ghost.io/content/images/size/w600/2021/11/Article-07-2.png 600w, https://sdv.ghost.io/content/images/size/w1000/2021/11/Article-07-2.png 1000w, https://sdv.ghost.io/content/images/size/w1600/2021/11/Article-07-2.png 1600w, https://sdv.ghost.io/content/images/2021/11/Article-07-2.png 1723w" sizes="(min-width: 720px) 720px"><figcaption>A simple manual test data generation tool that uses permutations. The resulting scenarios -- with different starting balances, transfer amounts and transfer currencies -- are outputted as a data table.</figcaption></img></figure><p>A more sophisticated tool might allow you greater control over the rules the data must follow. It will also allow you to create more columns as your functionalities increase. For example, maybe the bank now offers two different account types: Premium and Normal. </p><p>Now you need a test data generation tool that can handle all of these variables and come out with something like this:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://sdv.ghost.io/content/images/2021/11/Article-11.png" class="kg-image" alt="From fake to synthetic data: Machine learning changes the game" loading="lazy" width="1955" height="655" srcset="https://sdv.ghost.io/content/images/size/w600/2021/11/Article-11.png 600w, https://sdv.ghost.io/content/images/size/w1000/2021/11/Article-11.png 1000w, https://sdv.ghost.io/content/images/size/w1600/2021/11/Article-11.png 1600w, https://sdv.ghost.io/content/images/2021/11/Article-11.png 1955w" sizes="(min-width: 720px) 720px"><figcaption>A more sophisticated test data tool will allow you to specify rules manually. 
It will follow them to generate test data.</figcaption></img></figure><p>Many test data management tools use sophisticated logic to precisely create these data columns and their values. But the rules they use are manually written, and rely on human intuition and domain knowledge. For example:</p><ul><li>Account type = Premium 10% of the time and Normal 90% of the time</li><li>Starting balance is between $10,000 and $250,000 if Account type = Premium<br>or between -$1,000 and $10,000 if Account type = Normal</br></li><li>Transfer amount follows a bell curve with a mean of $7,500 and standard deviation of $1,000</li><li>Etc.</li></ul><p>There are downsides to this manual approach. It takes time and effort to come up with these rules, to keep track of them, and to update them as your application changes.</p><h3 id="adding-machine-learning">Adding Machine Learning</h3><p>Adopting machine learning (ML) opens up entirely new avenues in data generation. In the process, it gets rid of some of these downsides.</p><p>At a high level, ML-based software (such as the <a href="https://sdv.dev/blog/intro-to-sdv/">Synthetic Data Vault</a>) works in three steps:</p><ol><li>The user inputs real data into the ML software</li><li>The ML software automatically learns patterns in the data</li><li>The software outputs data that contains those patterns</li></ol><p>Let's go back to our banking example to see how this works. It's now 2021 and you're using <a href="https://sdv.dev/">the SDV</a> to generate your test data. You input all the transactions your bank has handled in the last week. </p><p>After modeling, the SDV outputs entirely new data that looks and behaves like the original. 
An illustration of this is shown below.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://sdv.ghost.io/content/images/2021/11/Article-10.png" class="kg-image" alt="From fake to synthetic data: Machine learning changes the game" loading="lazy" width="2000" height="516" srcset="https://sdv.ghost.io/content/images/size/w600/2021/11/Article-10.png 600w, https://sdv.ghost.io/content/images/size/w1000/2021/11/Article-10.png 1000w, https://sdv.ghost.io/content/images/size/w1600/2021/11/Article-10.png 1600w, https://sdv.ghost.io/content/images/size/w2400/2021/11/Article-10.png 2400w" sizes="(min-width: 720px) 720px"><figcaption>With ML tools (like the SDV), you input real data into the software. The software then learns patterns from the data and outputs data that matches those patterns.</figcaption></img></figure><p>Notice that the output data contains many of the same properties as the original. The model learned all of the following information:</p><ul><li><strong>Ranges &amp; Categories.</strong> Transfers range from $5K to $10K. Bank accounts can be either premium or normal. Etc.</li><li><strong>Shapes.</strong> 10% of accounts are premium. Transfers follow a bell curve distribution with a mean of $7,500 and a standard deviation of $1,000. Etc.</li><li><strong>Correlations.</strong> Premium bank accounts tend to have higher balances ($10K to $250K) than normal accounts (-$1K to $10K).</li></ul><p>In other words: <strong>while the old test data management tools required you to manually come up with rules, ML-based tools learn these rules automatically.</strong> Moreover, they can learn new information. 
For example, the ML model picked up on a couple of extra correlations:</p><ul><li>Premium accounts are more likely to transfer foreign currency.</li><li>Normal accounts are more likely to be overdrawn (transfer more than their current balance).</li></ul><p>Using an ML-based data generation tool will help you ensure that your software is robust against these typical cases. And while manual data generation tools generate fake data, <strong>ML-based approaches generate what we call synthetic data.</strong></p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://sdv.ghost.io/content/images/2021/11/Article-08-1.png" class="kg-image" alt="From fake to synthetic data: Machine learning changes the game" loading="lazy" width="1574" height="419" srcset="https://sdv.ghost.io/content/images/size/w600/2021/11/Article-08-1.png 600w, https://sdv.ghost.io/content/images/size/w1000/2021/11/Article-08-1.png 1000w, https://sdv.ghost.io/content/images/2021/11/Article-08-1.png 1574w" sizes="(min-width: 720px) 720px"><figcaption>Ask whether you had to input any real data or rules. Based on this, you'll know whether you are dealing with synthetic data or fake data.</figcaption></img></figure><p><strong>Benefits of Synthetic Data</strong></p><p>There are some clear advantages to using synthetic data over fake data, especially in software testing. Below, we've detailed a few.</p><ul><li><strong>Saves time with automation.</strong> Because ML automatically learns patterns from the real data, there is no need to spend a lot of time coming up with and inputting rules. ML learns rules that you may even miss.</li><li><strong>Is usable by non-experts. </strong>Realistic fake data can only be generated by domain experts, who know the precise rules governing the dataset. However, anyone can generate synthetic data. All they have to do is input the real data and the ML software takes care of the rest!</li><li><strong>Increases adaptability. 
</strong>Applications and data will inevitably change over time. It's easy to update synthetic data as this happens, simply by retraining the ML model with newer data.</li></ul><p>Benefits of synthetic data expand beyond software testing. The SDV Community is using synthetic data for an ever-increasing variety of tasks, including machine learning development, de-biasing datasets and scenario planning.</p><h3 id="key-takeaways">Key Takeaways</h3><p>In this article, we surveyed numerous ways of creating and using data  that is not real. In particular, we learned that:</p><ul><li>Creating fake data is not a novel concept. Older generations of tools will output fake data when given an explicit list of rules. This is especially useful for software testing.</li><li>Adding ML to this process is a newer evolution. Users input real data into the ML model, and it's able to automatically infer the rules. Data generated using ML-based systems is known as <strong>synthetic data</strong>.</li><li>Synthetic data's key advantages include its automation and adaptability. The uses of synthetic data expand beyond software testing.</li></ul><p>In future articles, we'll put ML models to the test! 
We'll uncover their strengths and weaknesses, and guide you through getting the most from synthetic data using the Synthetic Data Vault.</p>]]></content:encoded></item><item><title><![CDATA[Your Feedback in Action, Part 2: Data Workflow]]></title><description><![CDATA[After thousands of downloads, see how the synthetic data workflow in the SDV has evolved based on feedback from users.]]></description><link>https://sdv.dev/blog/community-feedback-workflow</link><guid isPermaLink="false">Ghost__Post__609c384488b3f9003e080016</guid><category><![CDATA[Product]]></category><dc:creator><![CDATA[Neha Patki]]></dc:creator><pubDate>Wed, 19 May 2021 16:52:14 GMT</pubDate><media:content url="https://sdv.ghost.io/content/images/2021/05/Banner-2-1.png" medium="image"/><content:encoded><![CDATA[<img src="https://sdv.ghost.io/content/images/2021/05/Banner-2-1.png" alt="Your Feedback in Action, Part 2: Data Workflow"/><p>The Synthetic Data Vault (SDV) is a software system that allows users all over the world to input a dataset and generate synthetic data. The SDV was born out of academic research at MIT — but in 2018, we open-sourced it, so that people all over the world could use it.</p><p>Since then, we've been listening carefully to our community's feedback, making sure that we address any gaps between theoretical academic research and practical use. This article is the second in a multi-part series detailing recent improvements to the SDV that make it work in the real world. Here we'll discuss how we've amped up the data synthesis workflow. (For our previous discussion about how we've improved core models, see <a href="http://sdv.dev/blog/community-feedback-models">Part 1</a>.)</p><h3 id="what-are-workflows">What are workflows?</h3><p>We open sourced the SDV not just to let users generate synthetic data, but also to allow them to <em>use</em> that data to solve real-world problems. 
Our community taught us that actually using the SDV involves a multi-step process — and that improving the system means paying attention to this entire workflow, not just the core machine learning.</p><p>According to our users, this workflow boils down to a few generalizable steps:</p><ol><li>Identifying real datasets that need to be synthesized</li><li>Transforming the datasets into a machine-readable format</li><li>Running the machine learning model</li><li>Synthesizing data according to particular specifications</li><li>Reversing the transformations such that the synthesized data looks like the original</li><li>Evaluating the synthesized data that results</li></ol><p>These steps are illustrated in the diagram below.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://sdv.ghost.io/content/images/2021/05/Community-Feedback--Part-2----1.png" class="kg-image" alt="Your Feedback in Action, Part 2: Data Workflow" loading="lazy" width="2000" height="1106" srcset="https://sdv.ghost.io/content/images/size/w600/2021/05/Community-Feedback--Part-2----1.png 600w, https://sdv.ghost.io/content/images/size/w1000/2021/05/Community-Feedback--Part-2----1.png 1000w, https://sdv.ghost.io/content/images/size/w1600/2021/05/Community-Feedback--Part-2----1.png 1600w, https://sdv.ghost.io/content/images/size/w2400/2021/05/Community-Feedback--Part-2----1.png 2400w" sizes="(min-width: 720px) 720px"><figcaption>The entire synthetic data workflow involves more than just modeling. Data also needs to be transformed, synthesized, reverse transformed, and evaluated.</figcaption></img></figure><p>The key insight from our users was that the application of machine learning models is only one step of a much larger puzzle. 
When the open source community helped us understand this, we were able to improve on the SDV software by adding in transformations, synthesizing options, and evaluation tools -- all detailed below.</p><h3 id="transforming-data">Transforming Data</h3><p>One major lesson from our open source community was how messy real-world datasets are compared to those used in academia. Academic datasets often come pre-sanitized and ready for numerical use. In the real world, however, databases are growing and changing constantly, and are often significantly different from the optimal yet theoretical structures used by machine learning researchers.</p><p>Two thorny data types frequently encountered in the real world are <em>datetimes</em> and <em>null values</em>.</p><ul><li><strong>Datetimes</strong> can follow many different formats, including YYYY-MM-DD or MM-DD-YY. However, machine learning models accept numerical values only. Usually these are Unix timestamps, defined as the number of seconds that have elapsed since midnight UTC on January 1, 1970. By this logic, a date like 2021-01-01 (taken as midnight Pacific Standard Time) will transform into the number 1609488000; the exact value depends on the time zone.</li><li><strong>Null values</strong> also present a problem for mathematical models when they appear in numerical data. While users can tell models to ignore these values, the presence of a null might actually indicate something important, like a user declining to answer a question. 
To account for this, the SDV creates a new, binary column to address whether the original value is null.</li></ul><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://sdv.ghost.io/content/images/2021/05/Community-Feedback--Part-2----2-1.png" class="kg-image" alt="Your Feedback in Action, Part 2: Data Workflow" loading="lazy" width="2000" height="754" srcset="https://sdv.ghost.io/content/images/size/w600/2021/05/Community-Feedback--Part-2----2-1.png 600w, https://sdv.ghost.io/content/images/size/w1000/2021/05/Community-Feedback--Part-2----2-1.png 1000w, https://sdv.ghost.io/content/images/size/w1600/2021/05/Community-Feedback--Part-2----2-1.png 1600w, https://sdv.ghost.io/content/images/2021/05/Community-Feedback--Part-2----2-1.png 2280w" sizes="(min-width: 720px) 720px"><figcaption>When working with real-world datasets, it's necessary to apply transformations between real data and machine-readable data. This example transforms datetimes and null values.</figcaption></img></figure><p>To solve this problem, we introduced a new library called <a href="https://github.com/sdv-dev/RDT">Reversible Data Transforms</a> (RDT). The RDT library contains the logic needed to transform different types of real-world data into their machine-ready counterparts — as well as the logic for reversing those transformations, so that a synthetic data user won't know the difference. The RDT is a standalone library that can reach beyond the synthetic data space, helping data scientists and academics across fields to clean their data. Since November 2020, the RDT has been supported on all major platforms including macOS, Windows, and Linux.</p><h3 id="synthesizing-data-conditionally">Synthesizing Data Conditionally</h3><p>When we first imagined the SDV, we assumed users would simply want to use all the synthetic data generated by the model. 
However, we soon found that some users have more complex needs, and require more control over the data they synthesize — opening up new possibilities for synthetic data in the process.</p><p>For example, one of our users, an engineer, found a whole new use for SDV. The engineer was writing a machine learning classifier on a dataset when they noticed it was unbalanced. Applying any algorithms to this dataset would lead to biased models. The engineer realized that, if used strategically, SDV could actually debias the data — if it only generated data with rarer attributes, the synthetic data it created could be combined with the real data to form a fully balanced dataset.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://sdv.ghost.io/content/images/2021/05/Community-Feedback--Part-2----3.png" class="kg-image" alt="Your Feedback in Action, Part 2: Data Workflow" loading="lazy" width="1705" height="960" srcset="https://sdv.ghost.io/content/images/size/w600/2021/05/Community-Feedback--Part-2----3.png 600w, https://sdv.ghost.io/content/images/size/w1000/2021/05/Community-Feedback--Part-2----3.png 1000w, https://sdv.ghost.io/content/images/size/w1600/2021/05/Community-Feedback--Part-2----3.png 1600w, https://sdv.ghost.io/content/images/2021/05/Community-Feedback--Part-2----3.png 1705w" sizes="(min-width: 720px) 720px"><figcaption>Synthesized data can help remove bias by creating balanced datasets. In this example, synthesizing those rows that only correspond to females creates a balance between males and females.</figcaption></img></figure><p>In February of 2021, we added <a href="https://sdv.dev/SDV/user_guides/single_table/gaussian_copula.html#conditional-sampling">conditional sampling</a> to the SDV to enable this use case. Now, users can specify attributes or values that must be present in the synthesized data. 
In addition to debiasing datasets, users can use this feature to test hypothetical scenarios.</p><h3 id="evaluating-synthesized-data">Evaluating Synthesized Data</h3><p>When the entire system is working smoothly and outputting synthetic data, users still need to know: Is the data good enough to use? This vital question inspired us to add evaluation capabilities to the SDV. In doing so, we faced two key challenges: defining the metrics and creating a useful process.</p><p><strong>Metrics</strong></p><p>No single metric perfectly captures the different dimensions of synthetic data that users may want to evaluate. Some want to preserve a high degree of mathematical likeness, others want to emphasize a particular column for machine learning predictions, and still others are more focused on threat models that can compromise privacy.</p><p>To address this, we created a separate library, <a href="https://github.com/sdv-dev/SDMetrics">SDMetrics</a>, to define evaluation metrics. The library now includes a suite of metrics that cover how easily synthetic data can be distinguished from real data, statistical likeness, and privacy.</p><p><strong>Application</strong></p><p>Rather than apply metrics on an ad-hoc basis, some SDV power users were creating mini-workflows to rapidly test out different models, datasets and evaluation criteria in succession. Inspired by their innovation, we created <a href="https://github.com/sdv-dev/SDGym">SDGym</a>, a system that allows users to input models, datasets and success metrics to build a comprehensive evaluation framework.</p><h3 id="the-sdv-software-today">The SDV Software Today</h3><p>The SDV software is continuously evolving based on community feedback. In this article, we discussed improvements to the workflow surrounding synthetic data generation, including data transformations, sampling methods and evaluation tools. 
Earlier, in <a href="https://sdv.dev/blog/community-feedback-models">Part 1</a> of this series, we discussed the core synthetic data models themselves. In future blog articles, we plan to dig deeper into each of these areas, and to uncover new ones with you.</p><p>Like the SDV, this blog is a collaborative effort. Use our <a href="https://join.slack.com/t/sdv-space/shared_invite/zt-gdsfcb5w-0QQpFMVoyB2Yd6SRiMplcw">Slack</a> to let us know which topics you'd like to hear more about. And as always, use <a href="https://github.com/sdv-dev/SDV">GitHub</a> to file technical issues with the system. Working together, we can make SDV the most trusted, transparent and comprehensive platform for synthetic data generation!</p><p><em>For other inquiries, please contact <a href="mailto:info@sdv.dev">info@sdv.dev</a>.</em><br/></p>]]></content:encoded></item><item><title><![CDATA[Your Feedback in Action, Part 1: Data Models]]></title><description><![CDATA[After thousands of downloads, see how SDV's machine learning models have evolved based on feedback from users.]]></description><link>https://sdv.dev/community-feedback-models/</link><guid isPermaLink="false">Ghost__Post__609c351b88b3f9003e07ffb8</guid><category><![CDATA[Product]]></category><dc:creator><![CDATA[Neha Patki]]></dc:creator><pubDate>Wed, 12 May 2021 20:15:30 GMT</pubDate><media:content url="https://sdv.ghost.io/content/images/2021/05/Banner-2.png" medium="image"/><content:encoded><![CDATA[<img src="https://sdv.ghost.io/content/images/2021/05/Banner-2.png" alt="Your Feedback in Action, Part 1: Data Models"/><p>In our <a href="https://sdv.dev/blog/intro-to-sdv/">last post</a>, we introduced the <a href="https://github.com/sdv-dev/SDV">Synthetic Data Vault</a> (SDV) — a software system that allows users to input a dataset and generate synthetic data. 
The SDV was born out of academic research at MIT — but in 2018, we open-sourced it, so that people all over the world could use it.</p><p>Since then, we've been listening carefully to our community's feedback, making sure that we address any gaps between theoretical academic research and practical use. This multi-part series details recent improvements we've made so that SDV works in the real world. In this article, we focus on the machine learning-based modeling techniques that form the core of the system, while <a href="https://sdv.dev/blog/community-feedback-workflow/">Part 2</a> will cover the surrounding workflow.</p><h3 id="whats-in-a-model">What's in a model?</h3><p>At its core, the SDV is a set of machine learning models designed to understand and mimic real world data. Once the SDV creates a particular model, developers can generate synthetic data by sampling it. For synthetic data to be successful, this generative model must be correct — but through discussions with our open source community, we realized that there is no such thing as a single, winning approach that works every time. Each dataset and use case is different.<br/></p><p>Our solution is to provide choices, giving users all the necessary tools to make useful synthetic data for each new case at hand. Let's dive into three popular uses of the SDV where such options are available: Tabular models, sequential data and business logic.</p><h3 id="more-options-for-tabular-models">More Options for Tabular Models</h3><p>The earliest version of SDV was based on a classic statistical method: <a href="https://en.wikipedia.org/wiki/Copula_(probability_theory)">Gaussian Copulas</a>. This model is transparent by definition. It allows us to understand and exert control over formulas in the model, notably the distributions of each variable. This can be especially useful for business applications, where data often follows predictable distributions. 
For example, wind speed is known to follow a <a href="https://en.wikipedia.org/wiki/Wind_power">Weibull distribution</a>, biological measures like height usually follow <a href="https://en.wikipedia.org/wiki/Normal_distribution#Occurrence_and_applications">normal distributions</a> and credit default rates often follow <a href="https://en.wikipedia.org/wiki/Exponential_distribution">exponential distributions</a>.</p><p>Meanwhile, advances in the AI space had also produced a robust alternative for those willing to sacrifice transparency: a deep learning technique called <a href="https://en.wikipedia.org/wiki/Generative_adversarial_network">Generative Adversarial Networks</a> (GANs). GANs model complex processes that don't follow known formulas. While these models’ inner workings aren’t easily explained by humans, they can produce highly accurate results. We created a GAN, called CTGAN, specifically for synthetic data. This black box model is especially good at figuring out complex correlations between variables in large datasets.</p><p>For a long time, SDV allowed users a choice between our Gaussian Copula-based model, called GaussianCopula, and CTGAN to model tabular data. While this choice provided some flexibility, our users reported they had a hard time choosing between such extreme alternatives. We wondered if a middle ground was possible: Could we specify distributions while also using GANs to identify complex correlations?</p><p>We couldn't find any model that fit both of these requirements, so we made our own! A key insight was that we could use Gaussian Copulas to understand the data and transform it before passing it to a GAN. 
The result is <a href="https://sdv.dev/SDV/user_guides/single_table/copulagan.html">CopulaGAN</a>, a hybrid model we released in October 2020.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://sdv.ghost.io/content/images/2021/05/Community-Feedback--Part-1----1.png" class="kg-image" alt="Your Feedback in Action, Part 1: Data Models" loading="lazy" width="2000" height="695" srcset="https://sdv.ghost.io/content/images/size/w600/2021/05/Community-Feedback--Part-1----1.png 600w, https://sdv.ghost.io/content/images/size/w1000/2021/05/Community-Feedback--Part-1----1.png 1000w, https://sdv.ghost.io/content/images/size/w1600/2021/05/Community-Feedback--Part-1----1.png 1600w, https://sdv.ghost.io/content/images/2021/05/Community-Feedback--Part-1----1.png 2100w" sizes="(min-width: 720px) 720px"><figcaption>CopulaGAN is in the middle of the spectrum, between simple, easily understood models (like GaussianCopula) and complex black box models (like CTGAN).</figcaption></img></figure><p>CopulaGAN combines the human accessibility of Gaussian Copulas with the robust accuracy of GANs. This innovation provides users with a new choice: a hybrid approach.</p><h3 id="the-special-case-of-sequential-data">The Special Case of Sequential Data</h3><p>Another tricky case pointed out by our users involved sequential data. While sequential data is stored in a table, it is unlike a regular table in that its rows are linked together, usually by a time component. This use case is extremely frequent, especially in finance — any table with credit card transactions, stock prices, or payments is almost certainly sequential. </p><p>At the time, solutions treated sequential data as a case of general tabular modeling. After all, sequential data is inside a table. 
However, these solutions failed to incorporate the key information that makes sequential data unique: The relationships that exist between rows.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://sdv.ghost.io/content/images/2021/05/Community-Feedback--Part-1----2.png" class="kg-image" alt="Your Feedback in Action, Part 1: Data Models" loading="lazy" width="1840" height="1210" srcset="https://sdv.ghost.io/content/images/size/w600/2021/05/Community-Feedback--Part-1----2.png 600w, https://sdv.ghost.io/content/images/size/w1000/2021/05/Community-Feedback--Part-1----2.png 1000w, https://sdv.ghost.io/content/images/size/w1600/2021/05/Community-Feedback--Part-1----2.png 1600w, https://sdv.ghost.io/content/images/2021/05/Community-Feedback--Part-1----2.png 1840w" sizes="(min-width: 720px) 720px"><figcaption>In this table of stock prices, rows that describe the same company — in this case, Google — are related to each other through time. Related rows are a special feature of sequential datasets.</figcaption></img></figure><p>While considering this pain point, we recognized sequential data as an entirely new case that required its own unique set of modeling techniques. In October 2020, we released our DeepEcho library, which focuses entirely on sequential data. We also introduced our <a href="https://sdv.dev/SDV/user_guides/timeseries/par.html">PAR model</a>, a probabilistic autoregressive neural network made specifically for sequential data.</p><h3 id="encoding-business-logic-using-constraints">Encoding Business Logic using Constraints</h3><p>Even with a plethora of modeling choices, it's vital to capture nuances in business logic while modeling synthetic data. This is due to differences in how humans and machines understand datasets.</p><p>Often, humans can easily glean the meaning of a dataset using context clues. Consider a table showing the names and ages of students and their legal guardians. 
A human will intuitively realize that a student must be younger than their guardian.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://sdv.ghost.io/content/images/2021/05/Community-Feedback--Part-1----3.png" class="kg-image" alt="Your Feedback in Action, Part 1: Data Models" loading="lazy" width="2000" height="969" srcset="https://sdv.ghost.io/content/images/size/w600/2021/05/Community-Feedback--Part-1----3.png 600w, https://sdv.ghost.io/content/images/size/w1000/2021/05/Community-Feedback--Part-1----3.png 1000w, https://sdv.ghost.io/content/images/size/w1600/2021/05/Community-Feedback--Part-1----3.png 1600w, https://sdv.ghost.io/content/images/2021/05/Community-Feedback--Part-1----3.png 2230w" sizes="(min-width: 720px) 720px"><figcaption>In this table of students and their guardians, the student is always younger than their guardian. This is a constraint that humans intuitively understand.</figcaption></img></figure><p>But will a machine understand the same rule? Because all of the SDV's models use statistics, they analyze trends generally — meaning that in this case, they will include a small possibility that a student could be older than their guardian. After all, is it totally out of the question that an older individual could enroll and list their child as their guardian? Either way, only a human expert can truly figure out what makes sense for this dataset!</p><p>To solve this pain point, SDV introduced the concept of <a href="https://sdv.dev/SDV/user_guides/single_table/constraints.html">constraints</a> in July 2020. Constraints give users the ability to encode their business knowledge and expertise into an SDV model. In our example, they could specify that a guardian's age must be greater than the student's. Currently, the GreaterThan and UniqueCombinations constraints allow for easy handling of common scenarios. 
We also provide a blanket CustomConstraint class, which gives users flexibility to capture more nuanced knowledge.</p><h3 id="more-community-feedback">More Community Feedback</h3><p>We believe that the more humans and machines can work together, the more efficient our processes can become. In this article, we explained how user feedback about the SDV led to new core modeling techniques and innovations — enabling a system that now provides a choice of multiple models, handles sequential data, and understands constraints. In <a href="https://sdv.dev/blog/community-feedback-workflow/">Part 2</a>, we will discuss similar feedback-driven innovations in the rest of the workflow.</p><p>Using SDV — and giving us feedback — fuels this rapid evolution. To start a discussion, please message us on <a href="https://join.slack.com/t/sdv-space/shared_invite/zt-gdsfcb5w-0QQpFMVoyB2Yd6SRiMplcw">Slack</a> or file an issue on <a href="https://github.com/sdv-dev/SDV">GitHub</a>. Working together, we can make SDV the most trusted, transparent and comprehensive platform for synthetic data generation!</p><p><em>For other inquiries, please contact <strong>info@sdv.dev</strong>.</em><br/></p>]]></content:encoded></item><item><title><![CDATA[Meet the Synthetic Data Vault]]></title><description><![CDATA[Welcome to the SDV Blog! The SDV is a comprehensive, open source software for synthetic data generation. 
Join our growing community as we create an ecosystem to solve real world problems!]]></description><link>https://sdv.dev/intro-to-sdv/</link><guid isPermaLink="false">Ghost__Post__608c5562f9741d003b6f73b8</guid><category><![CDATA[Project]]></category><dc:creator><![CDATA[Neha Patki]]></dc:creator><pubDate>Tue, 04 May 2021 13:00:00 GMT</pubDate><media:content url="https://sdv.ghost.io/content/images/2021/05/blog-header--1-.png" medium="image"/><content:encoded><![CDATA[<img src="https://sdv.ghost.io/content/images/2021/05/blog-header--1-.png" alt="Meet the Synthetic Data Vault"/><p>Hello world! We, the creators of MIT's Synthetic Data Vault, warmly welcome you to our official blog. Soon we'll be using this space to deep-dive into topics related to our libraries, and to unpack ideas in the synthetic data space. We're looking forward to exploring this exciting area with you.</p><p>But first, we want to properly introduce our project: The <a href="https://github.com/sdv-dev/SDV">Synthetic Data Vault</a> (SDV), an open source software ecosystem for generating synthetic data. In this post, we’ll explain why synthetic data is important, and tell the story of how we created the vault. We’ll also lay out what’s in store — and how you can get involved. Let’s get started with a brief overview.</p><h3 id="synthetic-data-what">Synthetic Data What?</h3><p>Synthetic data is a bold new frontier in machine learning. It allows developers to share and use data more effectively.</p><p>It may seem counterintuitive, but although billions of gigabytes of data are produced every day, there are still huge gaps in what developers are actually able to use. Accessibility concerns, regulatory issues and imbalanced datasets can all keep experts from using data. This impedes progress in finance, health care and other domains.</p><p>Good synthetic data can fill these gaps. The SDV uses machine learning to analyze data. Then, it creates fully synthetic datasets that mimic the original. 
Although the synthetic data is entirely machine generated, it maintains the original format and mathematical properties. This makes synthetic data versatile. It can completely replace the existing data in a workflow, or it can supplement the data to enhance its utility. Already, our users have successfully used the SDV to augment datasets, test applications, remove bias and more.</p><h3 id="a-history-of-the-sdv">A History of the SDV</h3><p>Our story starts in 2013. In MIT's Laboratory for Information and Decision Systems (LIDS), we were working on general data science projects. We had developed new techniques, and we were excited to test them on real datasets. However, as soon as we asked for the data, we hit roadblocks. The process for getting access to data turned out to be much more complex than we anticipated, with many regulations and security red tape. </p><p>We wondered: What if we didn't need the real data in the first place? If we had synthetic data with the same mathematical properties as the original, it would be much easier for everyone to share and use.</p><p>In 2016, we released a paper describing the very first iteration of the <a href="https://dai.lids.mit.edu/wp-content/uploads/2018/03/SDV.pdf">SDV</a>. It introduced a novel technique for synthesizing multi-table data, and included trials where data scientists successfully used synthetic data instead of real data for machine learning tasks. Related research to come out of the lab included <a href="https://arxiv.org/pdf/1907.00503.pdf">CTGAN</a>, a novel approach to generating synthetic data using deep learning.</p><p>After these successes in the research community, we decided to move beyond purely academic solutions. Synthetic data has the potential to solve real-world problems faced by people on all sides of data science: internal developers writing software, external contractors working offshore, 3rd party partners offering services and even the end users who create the data. 
After some pilot testing on enterprise applications, we open sourced our work in 2018, publishing <a href="https://pypi.org/project/sdv/">sdv on PyPI</a> for general use. Open sourcing offered ample opportunities for collaboration and customization. It allowed users all over the world to test the SDV in enterprise settings, and helped the SDV ecosystem evolve into a one-stop shop for synthetic data needs!</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://sdv.ghost.io/content/images/2021/04/Blog-Map.png" class="kg-image" alt="Meet the Synthetic Data Vault" loading="lazy" width="2000" height="1279" srcset="https://sdv.ghost.io/content/images/size/w600/2021/04/Blog-Map.png 600w, https://sdv.ghost.io/content/images/size/w1000/2021/04/Blog-Map.png 1000w, https://sdv.ghost.io/content/images/size/w1600/2021/04/Blog-Map.png 1600w, https://sdv.ghost.io/content/images/size/w2400/2021/04/Blog-Map.png 2400w" sizes="(min-width: 720px) 720px"><figcaption>Users all over the world are using our software to create synthetic data. This map shows the total downloads* of <a href="https://github.com/sdv-dev/CTGAN">CTGAN</a> (our most popular synthetic data model) per continent.</figcaption></img></figure><p>We listened to feedback and, as of today, have made 93 releases (across all our libraries), addressing 504 issues. We have been thrilled to see a burgeoning community of invested users using the SDV to solve problems. We've seen over 200K user downloads from PyPI, 400 stars in the SDV <a href="https://github.com/sdv-dev/SDV">GitHub repository</a> and 200 developers in our Slack channel. Our community is global and includes people in diverse roles: academics, data scientists, operations managers, engineers and more. 
We are continually learning from our community, and we're excited to bring new innovations to you!</p><h3 id="just-the-beginning">Just the Beginning</h3><p>Synthetic data has the potential to revolutionize the entire field of data science, allowing us to solve problems that once seemed untouchable. We want the Synthetic Data Vault to be the most trusted, transparent and comprehensive platform for synthetic data generation, but we can't do it without our users. It's our ever-growing open source community that allows us to quickly repair bugs, triage feature requests and improve to serve a variety of real-world needs.  </p><p>That’s where you come in. If you’re already a member of this community, we can’t thank you enough. And if you’d like to get involved, see below for ways to get started. Either way, watch this space for more nuanced discussions about synthetic data. We're excited to share what we've learned from you, and show how we are collectively improving the ecosystem. It’s time to open the vault!</p><p><strong>Want more ways to get involved?</strong></p><ul><li>Follow us on Twitter <a href="https://twitter.com/sdv_dev">@sdv_dev</a> for release announcements, blog updates and more</li><li>Join our <a href="https://join.slack.com/t/sdv-space/shared_invite/zt-gdsfcb5w-0QQpFMVoyB2Yd6SRiMplcw">Slack</a> community to meet other users, discuss synthetic data solutions and suggest topics for the blog</li><li>Visit &amp; star our <a href="https://github.com/sdv-dev">GitHub repositories</a></li><li>If you've successfully used the SDV for your project, share your experience and tag us</li></ul><p>For other inquiries, please contact us at <em><strong>info@sdv.dev</strong></em>.<br/></p><p><em>*Total download statistics per continent come from the </em><a href="https://github.com/pypa/linehaul"><em>Linehaul project</em></a><em> using BigQuery, and include mirrors. Are you aware of more accurate ways to count Python package downloads? 
Let us know!</em></p>]]></content:encoded></item></channel></rss>