<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:media="http://search.yahoo.com/mrss/"><channel><title><![CDATA[DataCebo Blog]]></title><description><![CDATA[Creating and evaluating synthetic data with open source tools]]></description><link>https://sdv.dev/</link><image><url>https://sdv.dev/favicon.png</url><title>DataCebo Blog</title><link>https://sdv.dev/</link></image><generator>Ghost 2.9</generator><lastBuildDate>Tue, 31 Jan 2023 13:38:55 GMT</lastBuildDate><atom:link href="https://sdv.dev/rss/" rel="self" type="application/rss+xml"/><ttl>60</ttl><item><title><![CDATA[3 user-centric growth strategies for open source]]></title><description><![CDATA[Our open source grew faster when we adopted a user-centric mindset. Here are 3 strategies we used along the way.]]></description><link>https://datacebo.com/blog/os-user-strategies</link><guid isPermaLink="false">Ghost__Post__63d180e2304f20003d70b744</guid><category><![CDATA[Open Source]]></category><dc:creator><![CDATA[Neha Patki]]></dc:creator><pubDate>Thu, 26 Jan 2023 00:39:12 GMT</pubDate><media:content url="https://sdv.ghost.io/content/images/2023/01/Header.png" medium="image"/><content:encoded><![CDATA[<img src="https://sdv.ghost.io/content/images/2023/01/Header.png" alt="3 user-centric growth strategies for open source"/><p>In a <a href="https://datacebo.com/blog/open-source-user-demographic">previous article</a>, we discussed why, when developing our open source libraries, we emphasize growing our overall users – not just our contributors. We elaborated on why we focus on everyone who <em>uses</em> our software to solve a problem, as opposed to following the more traditional open source practice of catering specifically to code contributors.</p><p>But we cannot stop at defining our focus – we have to put it into practice. 
In this article, we'll share some practical strategies we have learned in the course of adopting this more user-centric mindset. </p><p>We have come to these strategies by regularly interacting with the users who have adopted the <a href="https://sdv.dev/">SDV ecosystem</a>, and we've iterated on them until we found success. (Some of our libraries are already at <a href="https://pepy.tech/project/copulas">1 million downloads</a>!) This has given us confidence in our approach, which we're excited to share below.</p><h3 id="1-find-the-right-channels-to-reach-more-users">1. Find the right channels to reach more users</h3><p>Our first strategy involves our overall presence as an open source ecosystem.</p><p>Shifting our focus to users has allowed us to think more critically about who we are trying to help, and whether we are actually reaching them. For example, we initially prioritized checking and responding to questions on <a href="https://github.com/sdv-dev/SDV/issues">GitHub</a>, a platform that makes it easy to reference technical material and scrutinize bugs. </p><p>But GitHub caters primarily to contributors – and thus leaves out the rest of the users we'd like to reach. In fact, those users may not have GitHub accounts at all! They are more likely to feel at home when they can ask us questions directly, working with us to improve their understanding. They often don't have the time to dig through technical discussions, or the desire to create an issue on GitHub, especially if their question is more fundamental. (Since “issue creation” has always been defined as a part of the software development lifecycle, users may not feel comfortable asking more basic questions there.)</p><p>To find the users we wanted, we decided to <strong>expand our presence to other platforms that cater to their needs</strong>. We have found Slack to be a great solution – it is welcoming and easy to use, and enables direct communication. 
Today, <a href="https://bit.ly/sdv-slack-invite">our Slack</a> is a fast-growing community of over 800 members, and a new space for us to learn about how people use our software.</p><h3 id="2-users-are-just-as-important-as-contributors">2. Users are just as important as contributors</h3><p>Our attitude towards users matters just as much as their ability to find us.</p><p>As the core maintainers for an open source project, we all have a deep passion for software. It is natural for us to want our users to share this passion – and also natural to perceive a lack of initiative if a user hasn't understood certain concepts. But in our drive to recognize the importance of all users, we have learned to understand — and even embrace — that many users have different needs and time pressures than we are used to.</p><p>For example, we frequently receive questions about how to upload a CSV file into Python. This is a standard data science procedure, so some might label this question as "lazy." We don't believe that's true. In fact, these users may be picking up Python for the first time because they think our software could solve their problem, which shows a lot of initiative. They are not unqualified to use our library; they might just need a helping hand.</p><p>To figure out what will actually help users, we put ourselves in their shoes. This mindset has led to some of our current best practices:</p><ul><li><strong>Empathize with the user's pain points.</strong> Working on software openly means that we'll get more feedback more often. Often, an issue identified by a user may already be on our roadmap, is inspiring internal debates, or is on hold until we have more resources. When we're reminded of such an issue, it can be easy to get defensive and engage in a debate – which ultimately wastes everyone's time. Instead, we use the opportunity to build camaraderie. We always try to replicate users' issues, which helps us acknowledge the frustration because we feel it too. 
No software is perfect!</li><li><strong>Focus on the problem, not the solution.</strong> Because our users do not have our domain expertise, they'll sometimes request features that seem difficult to accommodate. When this happens, we remind ourselves that it's not the user's job to understand our system. Rather, it's our duty to dig deeper and find the root of the concern. This helps us design new features in a way that matches our vision <em>and</em> satisfies users.</li><li><strong>Above all else, move them forward. </strong>When a user has a request, we aim to provide a timely, focused response so that they can take the next step in their usage journey. If we can't immediately resolve the issue, we provide workarounds that allow their projects to proceed. This is more difficult than it seems. At times, we want to passionately respond with our own long-term vision — but this is not useful to users who just want their project to work.</li></ul><p>The illustration below shows a hypothetical example of using these best practices.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://sdv.ghost.io/content/images/2023/01/Conversation-2.png" class="kg-image" alt="3 user-centric growth strategies for open source" loading="lazy" width="1500" height="1000" srcset="https://sdv.ghost.io/content/images/size/w600/2023/01/Conversation-2.png 600w, https://sdv.ghost.io/content/images/size/w1000/2023/01/Conversation-2.png 1000w, https://sdv.ghost.io/content/images/2023/01/Conversation-2.png 1500w" sizes="(min-width: 720px) 720px"><figcaption><em>An example of a conversation the SDV team might have with a user. In this instance, the user is requesting a new algorithm, which may not be compatible with our current software. 
But the root concern is data quality – a need that we can address more quickly through other workarounds.</em></figcaption></img></figure><p>These best practices reflect our overall attitude, which elevates users to the same level of importance as open source contributors.</p><h3 id="3-go-the-extra-mile-%E2%80%93-it-only-takes-a-few-minutes">3. Go the extra mile – it only takes a few minutes!</h3><p>Going above and beyond can mean creating special material for users and learning to speak their language. By now, it has become standard practice for us to disambiguate and translate our technical communications for a more general audience. </p><p>Our SDV ecosystem is filled with examples of conveying the same information in multiple ways. Below are some excerpts from our announcement of a new version of SDV (0.16.0) in July 2022.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://sdv.ghost.io/content/images/2023/01/Communication-Styles-Comparison-1.png" class="kg-image" alt="3 user-centric growth strategies for open source" loading="lazy" width="2000" height="1000" srcset="https://sdv.ghost.io/content/images/size/w600/2023/01/Communication-Styles-Comparison-1.png 600w, https://sdv.ghost.io/content/images/size/w1000/2023/01/Communication-Styles-Comparison-1.png 1000w, https://sdv.ghost.io/content/images/size/w1600/2023/01/Communication-Styles-Comparison-1.png 1600w, https://sdv.ghost.io/content/images/2023/01/Communication-Styles-Comparison-1.png 2000w" sizes="(min-width: 720px) 720px"><figcaption><em>Selected excerpts from announcements of a new SDV version, disseminated on two different platforms. We communicate the same information in different ways based on what we know about our users.</em></figcaption></img></figure><p><br>The SDV open source contributors are familiar with technical concepts like “unify sampling params for reject sampling” or “Add create_custom_constraint factory method”. 
They're also interested in following along with specific GitHub issues, which link to the code changes.</br></p><p>Meanwhile, user-centric communication focuses on the pain points that we've solved. This is informative for current users and welcoming for new ones. As a result, users coming to our library for the first time can scan through the Slack channel to see what we're working on. Best of all, because we're thinking in these different ways already, it only takes a few minutes to draft these different types of announcements!</p><h3 id="conclusion">Conclusion</h3><p>Adopting a user-centric mindset has significantly contributed to our open source growth. We started by identifying users and finding the right channels to reach them, which naturally expanded our open source presence. Then we learned to empathize with users and embrace their needs, which has manifested as more productive conversations and relationships. Finally, we always think it's great to go above and beyond – especially if it only takes a few minutes!</p><p><em>Are there any strategies we've missed? Let us know what you think in the comments below!</em></p>]]></content:encoded></item><item><title><![CDATA[The Most Important Open Source Demographic That No One Thinks About]]></title><description><![CDATA[How we define a user in 2023 to build a community around synthetic data. 
]]></description><link>https://datacebo.com/blog/open-source-user-demographic</link><guid isPermaLink="false">Ghost__Post__63cee0ef89aa88003d872ba8</guid><category><![CDATA[Open Source]]></category><dc:creator><![CDATA[Kalyan Veeramachaneni]]></dc:creator><pubDate>Mon, 23 Jan 2023 21:23:26 GMT</pubDate><media:content url="https://sdv.ghost.io/content/images/2023/01/Frame.png" medium="image"/><content:encoded><![CDATA[<img src="https://sdv.ghost.io/content/images/2023/01/Frame.png" alt="The Most Important Open Source Demographic That No One Thinks About"/><p><strong><strong>Defining an Open Source User for 2023 and beyond</strong></strong></p><p>Code contributors are an essential part of an open source (OS) project. But in our experience, making code contributors the sole focus of an open source project ends up disenfranchising another large segment of important people: a library's<strong> </strong><em><strong>users</strong></em>. This segment, we have found, is more critical to our success, providing indispensable feedback, finding use cases and helping us to improve our open source and the product that relies on it. In this first article (in a series), we synthesize key attributes that we use to identify a user. </p><p>Traditionally, open source libraries have been centered around software development, as collaborating on code is vital for maintaining complex software. It has become customary to use the number of unique code contributors as a core metric of a given library's success. Metrics of success drive how the overall ecosystem is maintained, including how software is designed (APIs), the audience for which usage guides are developed, the types of demos that are built, and how communications are handled. To bring users into the fold, all of these need to be revisited with users in mind. </p><p>At the same time, open source is <a href="https://www.bvp.com/atlas/roadmap-open-source">proving to be a successful model</a> for startups. 
As the core maintainers of the <a href="https://github.com/sdv-dev/SDV">Synthetic Data Vault project</a> — the world's largest open source library for modeling and generating tabular synthetic data — we are constantly striving to realize the benefits of this model firsthand. For us, open source has been <a href="https://sdv.dev/blog/intro-to-sdv/">vital to building a trusted and usable machine learning system</a>. With this and subsequent articles we are synthesizing our current thinking about open source, and some key lessons we have learned on our way to this point.</p><h3 id="who-is-a-user">Who is a user?</h3><p>Our definition of a<em> </em>user is: Anyone who attempts to use <em>our</em> open source library to solve <em>their</em> problem. Generally, <em>users</em>:</p><ul><li><strong>…are goal-oriented.</strong> A user comes to our library with a specific project that they're working on.</li><li><strong>…have limited time.</strong> A user often has a deadline for their project. They may not have time to learn the nitty-gritty details of our software, or engage in deeper conversations about its development.</li><li><strong>… have different expertise.</strong> A user is probably coming to our library to help with a project in their own domain, whether that's healthcare, clean energy or something else. They might not have the same knowledge base as a professional software developer would (although they also might — more on this later). </li></ul><p>While this definition may seem straightforward, these attributes have become the cornerstone of how we maintain and communicate about our software, and  how we develop APIs. 
They have also inspired the main question we use to measure our progress: <em>Is our library making a material difference in users' projects?</em> In subsequent articles, we plan to share how we applied these strategies to build the largest open source <em>user</em> community around synthetic data, and what we have learned in the process.</p><h3 id="what-changes-are-we-making-to-set-up-an-os-for-success">What changes are we making to set up an OS for success?</h3><p>Charting this path with a laser-sharp focus on the “<em>user</em>” has required us to address some commonly asked questions up front, both for our team and externally. Here are just a few.</p><p><strong>Users are developers too and probably more critical for our success </strong></p><p>Just because a user isn't interested in learning the internal details of <em>our</em> software doesn't mean we can automatically categorize them as <em>not a developer</em>. They may be experts in other fields and may be developing software there. In addition, they are still using our Python API to help them with their project — and therefore, they are developing software. To expand and serve our user base, we focus our efforts on what we want to achieve with our open source strategy, rather than creating different strategies based on the perceived skill level of whoever is using our library. As a result, we want every API we publish to be understandable and usable by everyone. We want our communications to be cognizant of the fact that users don't have time! In 2023, <em>we believe that everyone is a developer</em> — or at least, we would like to serve everyone and make them part of the software movement.</p><p><strong>User-friendly APIs are game changers</strong></p><p>One question we asked ourselves was "Shouldn't user friendliness be delegated to graphical user interfaces (GUIs)?" 
GUIs finalize a <em>straight, stepwise process to successful project completion</em>, while the code provides flexibility to try things slightly differently. When they feel restricted by the straight stepwise process for their specific project/use case, pioneering users instead use code. Creating a user-friendly API that lets users apply our open source to their project in a transparent way, and provides access to different metrics and progress states at different stages, gives our users a great chance at succeeding. It also helps us to efficiently discover more pathways, and most importantly, more use cases. This makes our open source essentially a <em>low-code</em> version of what goes into the product.</p><p><strong>GitHub stars are not enough</strong></p><p>GitHub star histories are regularly used to indicate an exponential growth curve for a library. They are often considered a leading indicator for a need in the market that the library may be targeting, or the top of the funnel for an open source project, and there are now well-developed strategies for growing stars over time. We find that, used effectively, these strategies are a good marketing tool and a well-intentioned way to increase the top of the funnel. We ourselves use them from time to time, as they increase reach and can bring in more users. But we find that they should be balanced with feature development, carefully listening to users, and measuring how often folks are downloading and using the library and raising issues. Star growth should be followed by growth in downloads and issues raised by users.</p><p/><p><em>We look forward to discussing our experiences with open sourcing in 2023 and beyond. In the articles that follow, we will share some more of our strategies and measures for engagement. 
We welcome any thoughts, comments, suggestions and questions below.</em><br/></p>]]></content:encoded></item><item><title><![CDATA[Can you use synthetic data for label balancing?]]></title><description><![CDATA[Imbalanced data can prevent your projects from succeeding. Will synthetic data work? Explore the rationale behind label balancing.]]></description><link>https://datacebo.com/blog/synthetic-label-balancing</link><guid isPermaLink="false">Ghost__Post__63b4711dac52ed003d6a1744</guid><category><![CDATA[Applications]]></category><dc:creator><![CDATA[Neha Patki]]></dc:creator><pubDate>Tue, 10 Jan 2023 17:59:16 GMT</pubDate><media:content url="https://sdv.ghost.io/content/images/2023/01/Header--1-.png" medium="image"/><content:encoded><![CDATA[<img src="https://sdv.ghost.io/content/images/2023/01/Header--1-.png" alt="Can you use synthetic data for label balancing?"/><p>A dataset can unlock many doors for your organization, helping with everything from predictive forecasting to data-driven decision making. But in some situations, you may not have all of the data that you need. A common scenario is having too few data points, which can lead to an imbalance of variables.</p><p>In this article, we'll take a closer look at this scenario. We'll recap why this can be a problem for your projects and walk you through some possible solutions. We'll end by explaining why <strong>synthetic data</strong><em><strong> </strong></em>might be especially useful for overcoming this challenge.</p><h3 id="why-is-it-a-problem-to-have-a-data-imbalance">Why is it a problem to have a data imbalance?</h3><p>At a basic level, data is a record of events, and a <strong>data</strong> <strong>imbalance</strong> happens when some events occur much less frequently than others. 
For example:</p><ul><li>In healthcare, cancer occurs less frequently than diabetes</li><li>In finance, it's rare to see fraudulent credit card charges</li><li>In local government, it's rare to have a day with a major natural disaster such as a fire</li></ul><p>A natural data imbalance isn't inherently a problem, but it can become one if the rare events are important. As an example, let's assume you're working at a hospital that is treating COVID patients. One day a new COVID variant – let's call it Variant X – appears in the population. Since it is so new, it currently occurs very rarely (&lt;2.5% of the time). This leads to a data imbalance for this COVID variant, as illustrated below.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://sdv.ghost.io/content/images/2023/01/Original-Dataset--2-.png" class="kg-image" alt="Can you use synthetic data for label balancing?" loading="lazy" width="2000" height="1000" srcset="https://sdv.ghost.io/content/images/size/w600/2023/01/Original-Dataset--2-.png 600w, https://sdv.ghost.io/content/images/size/w1000/2023/01/Original-Dataset--2-.png 1000w, https://sdv.ghost.io/content/images/size/w1600/2023/01/Original-Dataset--2-.png 1600w, https://sdv.ghost.io/content/images/2023/01/Original-Dataset--2-.png 2000w" sizes="(min-width: 720px) 720px"><figcaption><em>A hypothetical dataset of COVID patients. Each patient has multiple variables – Age, BMI, etc. The variable "COVID Variant" has an imbalance: Variant X is far less common than Omicron.</em></figcaption></img></figure><p>This imbalance is a problem when it's critical to account for Variant X. For example, you may want to build a predictive model for who is most likely to be hospitalized. If you use the data as-is, your model may only consider Omicron (the majority) and treat Variant X as an outlier. 
This can lead to poor predictions – and bad planning – because Variant X may soon become the dominant strain.</p><h3 id="using-data-augmentation-to-fix-imbalances">Using Data Augmentation to Fix Imbalances</h3><p>In an ideal world, your data would include more patients with Variant X. But until then, you need to find a solution that will allow you to produce reasonable predictions. What if you create some artificial patient data for the sake of making a robust predictive model? </p><p>Let's assume you have data for 1,000 COVID hospital patients, 975 with the Omicron variant and 25 with Variant X. If you can create 950 additional artificial Variant X patients, then you can create an evenly-balanced dataset. This process is illustrated below.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://sdv.ghost.io/content/images/2023/01/Label-Balancing--2-.png" class="kg-image" alt="Can you use synthetic data for label balancing?" loading="lazy" width="2000" height="1000" srcset="https://sdv.ghost.io/content/images/size/w600/2023/01/Label-Balancing--2-.png 600w, https://sdv.ghost.io/content/images/size/w1000/2023/01/Label-Balancing--2-.png 1000w, https://sdv.ghost.io/content/images/size/w1600/2023/01/Label-Balancing--2-.png 1600w, https://sdv.ghost.io/content/images/2023/01/Label-Balancing--2-.png 2000w" sizes="(min-width: 720px) 720px"><figcaption><em>You can balance your data by creating artificial patients who all have Variant X. When you combine the new patients with the existing ones, you will have a balanced dataset, with 50% Omicron and 50% Variant X.</em></figcaption></img></figure><p>You may be skeptical because there are only 25 patients with Variant X to begin with. How can we reasonably produce 950 more based on that? As usual with data science, the devil is in the details. 
Let's go through some approaches to see what works.</p><h3 id="attempt-1-oversampling">Attempt #1: Oversampling</h3><p>Your first instinct may be to take the existing 25 Variant X patients and weigh them more heavily. One easy way to achieve this: You can duplicate each original patient 40 times to get 1,000 patients.</p><p>In data science, this is known as <a href="https://en.wikipedia.org/wiki/Oversampling_and_undersampling_in_data_analysis">oversampling</a>. Sometimes, this is done programmatically, sampling patients (with replacement) as many times as needed. Other times, this can be achieved using mathematical formulas to provide weights. An illustration is shown below.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://sdv.ghost.io/content/images/2023/01/Oversampling--2-.png" class="kg-image" alt="Can you use synthetic data for label balancing?" loading="lazy" width="2000" height="1000" srcset="https://sdv.ghost.io/content/images/size/w600/2023/01/Oversampling--2-.png 600w, https://sdv.ghost.io/content/images/size/w1000/2023/01/Oversampling--2-.png 1000w, https://sdv.ghost.io/content/images/size/w1600/2023/01/Oversampling--2-.png 1600w, https://sdv.ghost.io/content/images/2023/01/Oversampling--2-.png 2000w" sizes="(min-width: 720px) 720px"><figcaption><em>One way to fix an imbalance is by oversampling your data. You can manually duplicate the rows (shown here), sample them with replacement or use mathematics to weigh them more heavily.</em></figcaption></img></figure><p>With oversampling, Variant X is no longer rare, so your model cannot ignore it. But if you actually use this data, your project may not be successful. Your model may confidently predict that all Variant X patients must be over the age of 50. 
But this is not necessarily right – just because the existing patients had these characteristics doesn't mean Variant X patients always will.</p><p>The mistake was over-emphasizing the same set of patients, making the model more likely to create strong claims. This is commonly referred to as <a href="https://en.wikipedia.org/wiki/Overfitting">overfitting the data</a>: The model over-emphasizes the importance of a small number of records, and makes blanket predictions that lack nuance.</p><h3 id="attempt-2-randomizing">Attempt #2: Randomizing</h3><p>To avoid the problems that come with oversampling, let's explore the opposite direction for argument's sake: What if we created artificial Variant X patients by choosing variables completely at random? An example is illustrated below.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://sdv.ghost.io/content/images/2023/01/Randomizing--2-.png" class="kg-image" alt="Can you use synthetic data for label balancing?" loading="lazy" width="2000" height="1000" srcset="https://sdv.ghost.io/content/images/size/w600/2023/01/Randomizing--2-.png 600w, https://sdv.ghost.io/content/images/size/w1000/2023/01/Randomizing--2-.png 1000w, https://sdv.ghost.io/content/images/size/w1600/2023/01/Randomizing--2-.png 1600w, https://sdv.ghost.io/content/images/2023/01/Randomizing--2-.png 2000w" sizes="(min-width: 720px) 720px"><figcaption><em>Another way to fix an imbalance problem is by using randomization. You can create artificial Variant X patients by selecting the other variables (Age, BMI, etc.) at random.</em></figcaption></img></figure><p>Randomization avoids overfitting because no patient is repeated. But this approach introduces problems of its own: You may find that the data doesn't make sense anymore. The example data above highlights some problems that can arise. We see an artificial 23-year-old patient with dementia, and many diabetic patients with low BMIs. 
In the medical world, these events are not likely and indicate that there is a problem with the data.</p><p>These inconsistencies may (rightfully) dissuade you from using randomization. Since random data lacks patterns, a model will not be able to draw conclusions from it. In data science, we call this a problem of <em>noisiness</em>. <a href="https://en.wikipedia.org/wiki/Noisy_data">Noisy data</a> has too many random combinations to produce any useful learnings.</p><h3 id="a-better-solution-defining-neighborhoods">A Better Solution: Defining Neighborhoods</h3><p>So far, we've seen attempts at extreme ends.</p><ol><li>Oversampling will emphasize one set of patients, leading to an overfit model</li><li>Randomizing will make the dataset noisy, precluding useful conclusions</li></ol><p>The solution we need falls somewhere in the middle: We'd like to <em>loosely</em> base the artificial patients on the real ones. This is related to the data science concept of <a href="https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm">neighborhoods</a>. Drawing a <strong>neighborhood</strong> around some patients identifies general commonalities between them – without setting any strict rules. For example, Variant X patients <em>may</em> be more likely to have had a known exposure, but it's not guaranteed. Note that there is no exact definition for a neighborhood. It can change based on our assumptions and how broad we want to make it.</p><p>Once we know a neighborhood, we can create artificial patients that are inside it. These patients won't be exactly the same as the existing ones, but they won't have completely random values either.</p><p><strong>Synthetic Data for Label Balancing</strong></p><p>A compelling solution for discovering neighborhoods is <strong>synthetic data</strong>. 
Synthetic data<strong> </strong>software – such as our open source <a href="https://sdv.dev/">Synthetic Data Vault</a> (SDV) – uses machine learning to learn patterns from real patients, and then creates synthetic patients.</p><p>The SDV discovers neighborhoods at a variety of levels in the form of trends. It's able to learn overall trends (for all patients) as well as trends that are unique to a particular value of a variable (such as Variant X). For example:</p><ul><li>For all patients, a higher age corresponds to a greater risk of dementia and a higher BMI corresponds to a greater risk of diabetes.</li><li>Variant X patients tend to be older, while Omicron patients tend to be younger.</li><li>Etc.</li></ul><p>As a result, synthetic patients have some variation – but the data still makes sense in context. An example table of SDV-generated patients is shown below.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://sdv.ghost.io/content/images/2023/01/Synthesizing--2-.png" class="kg-image" alt="Can you use synthetic data for label balancing?" loading="lazy" width="2000" height="1000" srcset="https://sdv.ghost.io/content/images/size/w600/2023/01/Synthesizing--2-.png 600w, https://sdv.ghost.io/content/images/size/w1000/2023/01/Synthesizing--2-.png 1000w, https://sdv.ghost.io/content/images/size/w1600/2023/01/Synthesizing--2-.png 1600w, https://sdv.ghost.io/content/images/2023/01/Synthesizing--2-.png 2000w" sizes="(min-width: 720px) 720px"><figcaption><em>Software like the SDV generates synthetic patients with Variant X. The synthetic patients are not exact duplicates of the originals, but they aren't completely random either.</em></figcaption></img></figure><p>This is the middle solution we were looking for: Synthetic data won't cause overfitting and is less noisy than randomization. 
The best part is that there are multiple synthetic data techniques and settings available in the SDV, providing flexibility and tradeoffs.</p><h3 id="takeaways">Takeaways</h3><p>In this article, we explored imbalanced datasets. It is common to have an imbalanced dataset due to rare events – which becomes a problem if those rare events are important for your project.</p><p>Fixing the imbalance problem requires a careful tradeoff between overfitting data and creating noisy data. <strong>Synthetic data</strong> is a compelling solution that achieves a middle ground by discovering neighborhoods of similar data. This allows you to realistically fix the imbalance without resorting to either extreme.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://sdv.ghost.io/content/images/2023/01/Data-Creation-Spectrum--2-.png" class="kg-image" alt="Can you use synthetic data for label balancing?" loading="lazy" width="2000" height="1000" srcset="https://sdv.ghost.io/content/images/size/w600/2023/01/Data-Creation-Spectrum--2-.png 600w, https://sdv.ghost.io/content/images/size/w1000/2023/01/Data-Creation-Spectrum--2-.png 1000w, https://sdv.ghost.io/content/images/size/w1600/2023/01/Data-Creation-Spectrum--2-.png 1600w, https://sdv.ghost.io/content/images/2023/01/Data-Creation-Spectrum--2-.png 2000w" sizes="(min-width: 720px) 720px"><figcaption><em>Synthetic data is a solution that balances the extremes of overfitting the data and creating noisy data.</em></figcaption></img></figure><p>Are you interested in label balancing? Have you already explored using the SDV for this problem? Drop us a comment below!</p>]]></content:encoded></item><item><title><![CDATA[Interpreting the Progress of CTGAN]]></title><description><![CDATA[It can be difficult to verify the progress that a GAN is making. 
What if we combined it with easily interpretable metrics and visualizations?]]></description><link>https://datacebo.com/blog/interpreting-ctg-progress</link><guid isPermaLink="false">Ghost__Post__63a0bc44ac52ed003d6a169a</guid><category><![CDATA[Product]]></category><dc:creator><![CDATA[Santiago Gomez Paz]]></dc:creator><pubDate>Tue, 20 Dec 2022 19:13:29 GMT</pubDate><media:content url="https://sdv.ghost.io/content/images/2022/12/Header--4-.png" medium="image"/><content:encoded><![CDATA[<img src="https://sdv.ghost.io/content/images/2022/12/Header--4-.png" alt="Interpreting the Progress of CTGAN"/><p><em>This article was researched by Santiago Gomez Paz, a DataCebo intern. Santiago is a Sophomore at BYU and an aspiring entrepreneur who spent his summer learning and experimenting with CTGAN.</em></p><p>The <a href="https://github.com/sdv-dev/SDV">open source SDV library</a> offers many options for creating synthetic data tables. Some of the library's models use tried-and-true methods from classical statistics, while others use newer innovations like deep learning. One of the newest and most popular models is <strong>CTGAN</strong>, which uses a type of neural network called a Generative Adversarial Network (GAN). </p><p>Generative models are a popular choice for creating all kinds of synthetic data – for example, you may have heard of <a href="https://openai.com/dall-e-2/">OpenAI's DALL-E</a> or <a href="https://openai.com/blog/chatgpt/">ChatGPT</a> tools, which use trained models to create synthetic images and text respectively. A large driver behind their popularity is that they work well — they create synthetic data that closely resembles the real deal. But this high quality often comes at a cost.</p><p>Generative models can be resource-intensive. It can take a lot of time to properly train one, and it's not always clear whether the model is improving much during the training process. 
</p><p>In this article, we'll unpack this complexity by performing experiments on CTGAN. We'll cover –</p><ul><li>A high-level explanation of how GANs work</li><li>How to measure and interpret the progress of CTGAN</li><li>How to confirm this progress with more interpretable, user-centric metrics</li></ul><p>Since the library is open source, you can see and run the code yourself with this <a href="https://colab.research.google.com/drive/1RbIYxkbPP3JQY7W0S1p_XprY25wOYTPL?usp=sharing">Colab Notebook</a>.</p><h3 id="how-do-gans-work">How do GANs work?</h3><p>Before we begin, it's important to understand how GANs work. At a high level, a GAN is an algorithm that makes two neural networks compete against each other (thus the label “Adversarial”). These neural networks are known as the <strong>generator</strong> and the <strong>discriminator</strong>, and they each have competing goals:</p><ul><li>The discriminator's goal is to tell real data apart from synthetic data</li><li>The generator's goal is to create synthetic data that fools the discriminator</li></ul><p>The setup is illustrated below.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://sdv.ghost.io/content/images/2023/01/How-a-GAN-works-2.png" class="kg-image" alt="Interpreting the Progress of CTGAN" loading="lazy" width="2000" height="1000" srcset="https://sdv.ghost.io/content/images/size/w600/2023/01/How-a-GAN-works-2.png 600w, https://sdv.ghost.io/content/images/size/w1000/2023/01/How-a-GAN-works-2.png 1000w, https://sdv.ghost.io/content/images/size/w1600/2023/01/How-a-GAN-works-2.png 1600w, https://sdv.ghost.io/content/images/2023/01/How-a-GAN-works-2.png 2000w" sizes="(min-width: 720px) 720px"><figcaption><em>The <strong>generator</strong> is a neural network that creates synthetic data. In this case, it creates a table describing the names of different people, along with their heights and ages. 
The <strong>discriminator</strong> is an adversarial network that tries to tell these synthetic people apart from the real ones.</em></figcaption></img></figure><p>This setup allows us to measure – and improve – both neural networks over many iterations by telling them what they got wrong. Each of these iterations is called an <strong>epoch</strong>, and CTGAN tracks inaccuracies as <strong>loss values</strong>. The neural networks are trying to minimize their loss values for every epoch.</p><p>The CTGAN algorithm calculates loss values using a specific formula that can be found in <a href="https://github.com/sdv-dev/SDV/discussions/980">this discussion</a>. The intuition behind it is shown below.</p><figure class="kg-card kg-image-card"><img src="https://sdv.ghost.io/content/images/2023/01/Loss-Values-Interpretation-2.png" class="kg-image" alt="Interpreting the Progress of CTGAN" loading="lazy" width="2000" height="800" srcset="https://sdv.ghost.io/content/images/size/w600/2023/01/Loss-Values-Interpretation-2.png 600w, https://sdv.ghost.io/content/images/size/w1000/2023/01/Loss-Values-Interpretation-2.png 1000w, https://sdv.ghost.io/content/images/size/w1600/2023/01/Loss-Values-Interpretation-2.png 1600w, https://sdv.ghost.io/content/images/size/w2400/2023/01/Loss-Values-Interpretation-2.png 2400w" sizes="(min-width: 720px) 720px"/></figure><p>As shown by the table, lower loss values – even if they are <em>negative </em>– mean that the neural networks are doing well.</p><p>As the epochs progress, we expect both neural networks to improve at their respective goals – but each epoch is resource-intensive and takes time to run. A common request is to find a tradeoff between the improvement achieved and the resources used.</p><h3 id="measuring-progress-using-ctgan">Measuring progress using CTGAN</h3><p>The open source SDV library makes it easy to train a CTGAN model and inspect its progress. The code below shows the steps. 
We train CTGAN using a publicly available SDV demo dataset named <code>RacketSports</code>, which stores various measurements of the strokes that tennis and squash players make over the course of a game.</p><pre><code class="language-python">from sdv.demo import load_tabular_demo
from sdv.tabular import CTGAN

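# Optional sketch: verbose=True prints one line per epoch, in the form
# "Epoch 1, Loss G:  1.0435,Loss D: -0.1401". If that output is captured as text,
# a hypothetical helper like parse_losses (assuming exactly this format) can
# turn it into (epoch, generator loss, discriminator loss) tuples for plotting.
import re

def parse_losses(log_text):
    pattern = r'Epoch (\d+), Loss G:\s*(-?[\d.]+),\s*Loss D:\s*(-?[\d.]+)'
    return [(int(e), float(g), float(d)) for e, g, d in re.findall(pattern, log_text)]
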
metadata, real_data = load_tabular_demo('RacketSports', metadata=True)
table_metadata = metadata.to_dict()

model = CTGAN(table_metadata=table_metadata, verbose=True, epochs=800)
model.fit(real_data)</code></pre><p>As part of the fitting process, CTGAN trains the neural networks for multiple epochs. After each epoch, it prints out the count, the generator loss (G) and the discriminator loss (D). Keep in mind that lower numbers are better – even if they are <em>negative</em>. An example is shown below.</p><pre><code>Epoch 1, Loss G:  1.0435,Loss D: -0.1401
Epoch 2, Loss G:  0.4489,Loss D: -0.1455
Epoch 3, Loss G:  0.4756,Loss D: -0.0956
Epoch 4, Loss G:  0.3902,Loss D:  0.0344
Epoch 5, Loss G:  0.0912,Loss D:  0.3030
...</code></pre><p>To see how the neural networks are improving, we plot the loss values for every epoch. The results from our experiment are shown in the graph below. </p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://sdv.ghost.io/content/images/2022/12/Racket-Sports-Loss.png" class="kg-image" alt="Interpreting the Progress of CTGAN" loading="lazy" width="2000" height="645" srcset="https://sdv.ghost.io/content/images/size/w600/2022/12/Racket-Sports-Loss.png 600w, https://sdv.ghost.io/content/images/size/w1000/2022/12/Racket-Sports-Loss.png 1000w, https://sdv.ghost.io/content/images/size/w1600/2022/12/Racket-Sports-Loss.png 1600w, https://sdv.ghost.io/content/images/size/w2400/2022/12/Racket-Sports-Loss.png 2400w" sizes="(min-width: 720px) 720px"><figcaption><em>A graph of the GAN's progress over time. The generator loss is shown in blue, while the discriminator loss for the same epoch is shown in red.</em></figcaption></img></figure><p>Based on the characteristics of this graph, it's possible to deduce how the GAN is progressing.</p><h3 id="interpreting-the-loss-values">Interpreting the loss values</h3><p>The graph above may seem confusing at first glance: Why is the discriminator's loss value oscillating around 0 if it is supposed to improve (decrease and become negative) over time? The key to interpreting the loss values is to remember that the neural networks are adversaries. As one improves, the other must also improve just to keep its score consistent. Here are three scenarios that we frequently see:</p><ol><li><strong>Generator loss is slightly positive while discriminator loss is 0. </strong>This means that the generator is producing poor-quality synthetic data while the discriminator is blindly guessing what is real vs. synthetic. 
This is a common starting point, where neither neural network has optimized for its goal.</li><li><strong>Generator loss is becoming negative while the discriminator loss remains at 0.</strong> This means that the generator is producing better and better synthetic data. The discriminator is improving too, but because the synthetic data quality has increased, it is still unable to clearly differentiate real vs. synthetic data.</li><li><strong>Generator loss has stabilized at a negative value while the discriminator loss remains at 0. </strong>This means that the generator has optimized, creating synthetic data that looks so real, the discriminator cannot tell it apart.</li></ol><p>It is encouraging to see that the general pattern for the <code>RacketSports</code> dataset is similar to a variety of other datasets. These are shown below.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://sdv.ghost.io/content/images/2022/12/Multi-Datsets.png" class="kg-image" alt="Interpreting the Progress of CTGAN" loading="lazy" width="2000" height="645" srcset="https://sdv.ghost.io/content/images/size/w600/2022/12/Multi-Datsets.png 600w, https://sdv.ghost.io/content/images/size/w1000/2022/12/Multi-Datsets.png 1000w, https://sdv.ghost.io/content/images/size/w1600/2022/12/Multi-Datsets.png 1600w, https://sdv.ghost.io/content/images/size/w2400/2022/12/Multi-Datsets.png 2400w" sizes="(min-width: 720px) 720px"><figcaption><em>The generator and discriminator loss values for a variety of other datasets all follow the same learning pattern. The dataset names are shown in <strong>bold.</strong> They can be downloaded from the SDV demo module.</em></figcaption></img></figure><p>Of course, other patterns may be possible for different datasets. But if loss values are not stabilizing, watch out! 
This would indicate that the neural networks were not able to effectively learn patterns in the real data.</p><h3 id="metrics-powered-analysis">Metrics-Powered Analysis</h3><p>You may be wondering whether to trust the loss values. Do they indicate a meaningful difference in synthetic data quality? To answer this question, it's helpful to create synthetic data sets after training the model for different numbers of epochs, and assess the quality of the data sets.</p><pre><code class="language-python">NUM_SYNTHETIC_ROWS = len(real_data)

synthetic_data = model.sample(num_rows=NUM_SYNTHETIC_ROWS)</code></pre><p>It is important to select a few key metrics for a quantifiable quality measure. For our experiments, we chose 4 metrics from the open source <a href="https://docs.sdv.dev/sdmetrics/">SDMetrics library</a>:</p><ul><li><a href="https://docs.sdv.dev/sdmetrics/metrics/metrics-glossary/kscomplement"><strong>KSComplement</strong></a> evaluates the shape of numerical columns</li><li><a href="https://docs.sdv.dev/sdmetrics/metrics/metrics-glossary/tvcomplement"><strong>TVComplement</strong></a> evaluates the shape of discrete columns</li><li><a href="https://docs.sdv.dev/sdmetrics/metrics/metrics-glossary/correlationsimilarity"><strong>CorrelationSimilarity</strong></a> evaluates pairwise correlations between columns</li><li><a href="https://docs.sdv.dev/sdmetrics/metrics/metrics-glossary/categorycoverage"><strong>CategoryCoverage</strong></a> evaluates whether the synthetic data covers all possible values</li></ul><p>Each metric produces a score ranging from 0 (worst quality) to 1 (best quality). In the example below, we use the <code>KSComplement</code> metric on a numerical column in the <code>RacketSports</code> dataset.</p><pre><code class="language-python">from sdmetrics.single_column import KSComplement

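# Intuition only (a sketch, not the SDMetrics source): the score is documented as
# 1 minus the two-sample Kolmogorov-Smirnov statistic. ks_complement_sketch is a
# hypothetical helper for cross-checking a score; it assumes no missing values.
from scipy.stats import ks_2samp

def ks_complement_sketch(real_column, synthetic_column):
    return 1 - ks_2samp(list(real_column), list(synthetic_column)).statistic
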
NUMERICAL_COLUMN_NAME='dim_2'

score = KSComplement.compute(
   real_data[NUMERICAL_COLUMN_NAME],
   synthetic_data[NUMERICAL_COLUMN_NAME])</code></pre><p>Our results validate that the scores do, indeed, track the generator's loss value: The quality improves as the loss decreases. Some of the metrics – such as <code>CorrelationSimilarity</code> and <code>CategoryCoverage</code> – are high to begin with, so there is not much room to improve. But other metrics, like <code>KSComplement</code>, show significant improvement. This is shown in the graph below.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://sdv.ghost.io/content/images/2022/12/CTGAN-Loss-vs.-KSComplement.png" class="kg-image" alt="Interpreting the Progress of CTGAN" loading="lazy" width="2000" height="1290" srcset="https://sdv.ghost.io/content/images/size/w600/2022/12/CTGAN-Loss-vs.-KSComplement.png 600w, https://sdv.ghost.io/content/images/size/w1000/2022/12/CTGAN-Loss-vs.-KSComplement.png 1000w, https://sdv.ghost.io/content/images/size/w1600/2022/12/CTGAN-Loss-vs.-KSComplement.png 1600w, https://sdv.ghost.io/content/images/size/w2400/2022/12/CTGAN-Loss-vs.-KSComplement.png 2400w" sizes="(min-width: 720px) 720px"><figcaption><em>A comparison of loss values and the KSComplement metric. The two are linked: Lower generator loss (blue) corresponds to higher quality scores (green).</em></figcaption></img></figure><p>It's also possible to visualize the synthetic data that corresponds to a specific metric. For example, <code>KSComplement</code> compares the overall shape of a real data column with that of a synthetic one, so we can visualize it using histograms.</p><pre><code class="language-python">from sdmetrics.reports import utils

utils.get_column_plot(
  real_data,
  synthetic_data,
  column_name=NUMERICAL_COLUMN_NAME,
  metadata=table_metadata)</code></pre><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://sdv.ghost.io/content/images/2022/12/CTGAN-Epochs-vs.-Improvement.png" class="kg-image" alt="Interpreting the Progress of CTGAN" loading="lazy" width="2000" height="645" srcset="https://sdv.ghost.io/content/images/size/w600/2022/12/CTGAN-Epochs-vs.-Improvement.png 600w, https://sdv.ghost.io/content/images/size/w1000/2022/12/CTGAN-Epochs-vs.-Improvement.png 1000w, https://sdv.ghost.io/content/images/size/w1600/2022/12/CTGAN-Epochs-vs.-Improvement.png 1600w, https://sdv.ghost.io/content/images/size/w2400/2022/12/CTGAN-Epochs-vs.-Improvement.png 2400w" sizes="(min-width: 720px) 720px"><figcaption><em>Three histograms were created after training CTGAN for 10, 100 and 500 epochs on the RacketSports dataset. We plotted the dim_2 column. The real data (gray) doesn't change, but the synthetic data (green) improves with more epochs. The KSComplement metric measures the similarity: 0.74, 0.89 and 0.91 (left to right).</em></figcaption></img></figure><p>Overall, we can conclude that the generator and discriminator losses correspond to the quality metrics that we measured – which means we can trust the loss values, as well as the synthetic data that our CTGAN created!</p><h3 id="conclusion">Conclusion</h3><p>In this article, we explored the improvements that the CTGAN model makes as it iterates over many epochs. We started by interpreting the loss values that each of the neural networks – the generator and the discriminator – reports over time. This helped us reason about how they were progressing. But to fully trust the progress of our model, we then turned to the <a href="https://docs.sdv.dev/sdmetrics/">SDMetrics library</a>, which provides metrics that are easier to interpret. 
Using this library, we could verify whether the reported loss values truly resulted in synthetic data quality improvements.</p><p>This may lead us to a new, potential feature: What if we integrated these easily interpretable, user-centric metrics into the CTGAN training progress? This feature would allow you to specify the exact metrics you'd like to optimize upfront – for example, KSComplement. In addition to the generator and discriminator loss, CTGAN may be able to report a snapshot of this metric. A hypothetical example is shown below.</p><pre><code class="language-python">model = CTGAN(
  table_metadata=table_metadata,
  verbose=True,
  epochs=800,
  optimization_metric='KSComplement',
  optimization_column='dim_2')
  
model.fit(real_data)</code></pre><pre><code>Epoch 1, Loss G: 1.0435, Loss D: -0.1401, KSComplement: 0.7832
Epoch 2, Loss G: 0.4489, Loss D: -0.1455, KSComplement: 0.7671
Epoch 3, Loss G: 0.4756, Loss D: -0.0956, KSComplement: 0.7664
…
Epoch 200, Loss G: -2.542, Loss D: 0.0002911, KSComplement: 0.92391
</code></pre><p>Such a feature would allow more transparency over CTGAN's learning process, and allow you to stop training your models once the metrics are high. </p><p><strong>What do you think? </strong>If you're interested in exploring the inner workings of CTGAN and optimizing your synthetic data, drop us a comment below!</p>]]></content:encoded></item><item><title><![CDATA[How to evaluate synthetic data for your project — and avoid the biggest mistake we see]]></title><description><![CDATA[Proper evaluation is critical when using synthetic data. Avoid this common mistake and lead your project to success.]]></description><link>https://sdv.dev/how-to-evaluate-synthetic-data/</link><guid isPermaLink="false">Ghost__Post__633b5bbbda16fc003d4eab71</guid><category><![CDATA[Product]]></category><dc:creator><![CDATA[Neha Patki]]></dc:creator><pubDate>Fri, 07 Oct 2022 14:24:15 GMT</pubDate><media:content url="https://sdv.ghost.io/content/images/2022/10/Header-V2.png" medium="image"/><content:encoded><![CDATA[<img src="https://sdv.ghost.io/content/images/2022/10/Header-V2.png" alt="How to evaluate synthetic data for your project — and avoid the biggest mistake we see"/><p>In recent years, synthetic data has shown great promise for solving a variety of problems – like <a href="https://www.gartner.com/en/newsroom/press-releases/2022-06-22-is-synthetic-data-the-future-of-ai">addressing data scarcity for AI</a> and <a href="https://www.agmatix.com/blog/driving-innovation-in-agriculture-with-synthetic-data/?utm_source=LinkedIn&amp;utm_medium=Social">overcoming barriers to data access</a>. As your organization becomes serious about adopting synthetic data, it's crucial to incorporate the right metrics and evaluation frameworks into your projects.</p><p>Since the synthetic data space is so new, there aren't yet industry standards for setting and measuring outcomes. At <a href="https://datacebo.com/">DataCebo</a>, we've worked with a variety of teams using synthetic data. 
In this article, we're sharing the best practices we've learned along the way, as well as the one key mistake to avoid.</p><h3 id="what-are-synthetic-data-metrics">What are synthetic data metrics?</h3><p>In some fields – such as synthetic image generation – it's easy to visually inspect the output (in this case, synthetic images) and determine its quality. But if you are creating synthetic data in a tabular format (with rows and columns), it's difficult to make an overall assessment just by looking at the raw data. This is evident in the table below:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://sdv.ghost.io/content/images/2022/10/Real-vs.-Synthetic-Data.png" class="kg-image" alt="How to evaluate synthetic data for your project — and avoid the biggest mistake we see" loading="lazy" width="2000" height="667" srcset="https://sdv.ghost.io/content/images/size/w600/2022/10/Real-vs.-Synthetic-Data.png 600w, https://sdv.ghost.io/content/images/size/w1000/2022/10/Real-vs.-Synthetic-Data.png 1000w, https://sdv.ghost.io/content/images/size/w1600/2022/10/Real-vs.-Synthetic-Data.png 1600w, https://sdv.ghost.io/content/images/size/w2400/2022/10/Real-vs.-Synthetic-Data.png 2400w" sizes="(min-width: 720px) 720px"><figcaption><em>When the synthetic data is tabular, it's hard to assess its quality. In this example, 3 of the rows show data from real students, while the other 3 are synthetically created. Can you tell which is which?</em></figcaption></img></figure><p>For tabular synthetic data, it's necessary to create metrics that quantify how the synthetic data compares to the real data. 
Each metric measures a particular aspect of the data – such as coverage or correlation – allowing you to identify which specific elements have been preserved or lost during the synthetic data creation process.</p><p>In our open source library, <a href="https://github.com/sdv-dev/SDMetrics">SDMetrics</a>, we've provided a variety of metrics for evaluating synthetic data against the real data. For instance, you can use the <a href="https://docs.sdv.dev/sdmetrics/metrics/metrics-glossary/categorycoverage">CategoryCoverage</a> and <a href="https://docs.sdv.dev/sdmetrics/metrics/metrics-glossary/rangecoverage">RangeCoverage</a> metrics to quantify whether your synthetic data covers the same range of possible values as the real data:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://sdv.ghost.io/content/images/2022/10/Range-Coverage.png" class="kg-image" alt="How to evaluate synthetic data for your project — and avoid the biggest mistake we see" loading="lazy" width="2000" height="645" srcset="https://sdv.ghost.io/content/images/size/w600/2022/10/Range-Coverage.png 600w, https://sdv.ghost.io/content/images/size/w1000/2022/10/Range-Coverage.png 1000w, https://sdv.ghost.io/content/images/size/w1600/2022/10/Range-Coverage.png 1600w, https://sdv.ghost.io/content/images/size/w2400/2022/10/Range-Coverage.png 2400w" sizes="(min-width: 720px) 720px"><figcaption><em>In this example, the numerical distributions of real and synthetic data are overlaid to compare coverage. Using SDMetrics, you can apply the RangeCoverage metric, which quantifies the coverage (in this case, 82%).</em></figcaption></img></figure><p>You may also be curious about whether the synthetic data captures trends between pairs of columns. 
To compare correlations, you can use the <a href="https://docs.sdv.dev/sdmetrics/metrics/metrics-glossary/correlationsimilarity">CorrelationSimilarity</a> metric:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://sdv.ghost.io/content/images/2022/10/Column-Pairs.png" class="kg-image" alt="How to evaluate synthetic data for your project — and avoid the biggest mistake we see" loading="lazy" width="2000" height="645" srcset="https://sdv.ghost.io/content/images/size/w600/2022/10/Column-Pairs.png 600w, https://sdv.ghost.io/content/images/size/w1000/2022/10/Column-Pairs.png 1000w, https://sdv.ghost.io/content/images/size/w1600/2022/10/Column-Pairs.png 1600w, https://sdv.ghost.io/content/images/size/w2400/2022/10/Column-Pairs.png 2400w" sizes="(min-width: 720px) 720px"><figcaption><em>This example shows two side-by-side heatmaps of the pairwise correlations for real and synthetic data. Using SDMetrics, you can apply the CorrelationSimilarity metric, which quantifies the similarity as a score from 0 to 1.</em></figcaption></img></figure><p>The SDMetrics library has over 30 metrics, with more still in development.</p><p>But having access to metrics is just one part of the story. With so many metrics, it can be difficult to decide which ones to focus on – and how to make progress in your synthetic data project. <strong>To successfully deploy synthetic data, it's important to consider metrics during all steps of your project development cycle.</strong></p><p>In the rest of this article, we'll share a 3-step plan for incorporating metrics – and the SDMetrics library – into your synthetic data project to increase your chances of success. </p><h3 id="step-1-start-with-the-project-goals">Step 1: Start with the project goals </h3><p>It is tempting to create synthetic data quickly and then test it using all the available metrics. After all, it's hard not to be curious about what synthetic data can do! 
But to succeed with your project, it's important to take a step back and focus on the problem you are trying to solve first.</p><p>Synthetic data creation isn't an end in itself. Just as with most data work, you don't create synthetic data for its own sake — you use it to solve a problem. If you want your synthetic data project to succeed, pay close attention to what that problem is, as this will help you narrow down a few key metrics. </p><p>For example, imagine that your organization has two different synthetic data projects related to software testing and machine learning, respectively. Because these projects have different goals, you’ll need to consider different metrics:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://sdv.ghost.io/content/images/2022/10/Project-Driven-Metrics.png" class="kg-image" alt="How to evaluate synthetic data for your project — and avoid the biggest mistake we see" loading="lazy" width="2000" height="667" srcset="https://sdv.ghost.io/content/images/size/w600/2022/10/Project-Driven-Metrics.png 600w, https://sdv.ghost.io/content/images/size/w1000/2022/10/Project-Driven-Metrics.png 1000w, https://sdv.ghost.io/content/images/size/w1600/2022/10/Project-Driven-Metrics.png 1600w, https://sdv.ghost.io/content/images/size/w2400/2022/10/Project-Driven-Metrics.png 2400w" sizes="(min-width: 720px) 720px"><figcaption>Use your project goals to help you decide which metrics to prioritize.</figcaption></img></figure><p>Focusing on your goals allows you to identify which metrics are important for the ultimate success of your project. <strong>The single biggest mistake we see people making is to skip this critical step.</strong> Without focus, you can easily get bogged down running multiple tests and tweaking your synthetic dataset, rather than meeting the specific considerations for your project. 
This can derail your efforts – leading to a complex project that doesn't add any value.</p><h3 id="step-2-let-your-goals-guide-the-synthetic-data-creation">Step 2: Let your goals guide the synthetic data creation </h3><p>Your goals can help you appropriately scope your project and cut costs. A core subset of metrics can guide your synthetic data creation, making it faster and more targeted to your needs.</p><p>Chances are, you'll be faced with many decisions throughout your project. For example, in the <a href="https://github.com/sdv-dev/SDV">SDV library</a>, there are 5 different algorithms that create synthetic data, each with its own settings, leading to hundreds of potential models you can create. But if your synthetic data project is for software testing, you've already identified coverage, boundaries and business rules as the highest priority metrics. This will guide your decision-making.</p><p>In this case, you may find success choosing our preset model, <a href="https://github.com/sdv-dev/SDV/discussions/786">FAST ML</a>. This model uses statistical methods to achieve your minimal requirements while also providing high performance – FAST ML can train on a mid-size data table (100 columns and 100K rows) in only a few minutes. You can compare this to GAN-based models, which are more resource-intensive and can take hours to finish. If FAST ML satisfies your project metrics, it is reasonable to choose this model over a GAN, even if it isn't perfectly optimized across all possible metrics.</p><p>From this example, we can see that metrics are not just something to evaluate at the end of the project – they are useful tools for decision-making <em>throughout</em> your project. </p><h3 id="step-3-test-the-end-to-end-workflow-upfront">Step 3: Test the end-to-end workflow upfront</h3><p>The purpose of metrics is to provide guardrails and focus for your project – scoping it so that you can drive business value most efficiently. 
For the highest chances of success, it's important to apply the synthetic data end-to-end for downstream applications, so that you can verify that business value upfront.</p><p>Continuing with our example of software testing, it's important to use the synthetic data for your downstream software testing suite as quickly as you can to verify the benefits of synthetic data. If you've chosen your metrics correctly and considered them when making decisions (steps #1 and #2), then you'll see that this translates to business value.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://sdv.ghost.io/content/images/2022/10/Business-Value-Driven-Metrics.png" class="kg-image" alt="How to evaluate synthetic data for your project — and avoid the biggest mistake we see" loading="lazy" width="2000" height="667" srcset="https://sdv.ghost.io/content/images/size/w600/2022/10/Business-Value-Driven-Metrics.png 600w, https://sdv.ghost.io/content/images/size/w1000/2022/10/Business-Value-Driven-Metrics.png 1000w, https://sdv.ghost.io/content/images/size/w1600/2022/10/Business-Value-Driven-Metrics.png 1600w, https://sdv.ghost.io/content/images/size/w2400/2022/10/Business-Value-Driven-Metrics.png 2400w" sizes="(min-width: 720px) 720px"><figcaption>Going end-to-end means applying the synthetic data and verifying the ultimate business value it provides.</figcaption></img></figure><p>This type of cost-benefit analysis can help you make the case for synthetic data adoption within your enterprise. It can also help you iterate – to get more business value from your synthetic data, you can continue to optimize the metrics you've chosen in step #1, or identify new ones.</p><h3 id="the-takeaway">The Takeaway</h3><p>Technically, there are an infinite number of metrics that could be used to evaluate synthetic data. 
The key to success is to incorporate select metrics into your synthetic data project development rather than just applying all metrics at the end.</p><p>Your project goals are critical to helping you choose the right metrics. Setting them upfront allows you to make better decisions during your project development. And going end-to-end allows you to measure the business value that your synthetic data brings to the organization.</p><p><strong>What are your thoughts? </strong>Leave comments below! If you noticed other evaluation pitfalls in your projects, let us know below or reach us directly at <a href="mailto:info@sdv.dev">info@sdv.dev</a>.</p>]]></content:encoded></item><item><title><![CDATA[ML Model Development using Synthetic Data Clones]]></title><description><![CDATA[What happens when you train a machine learning model on synthetic data instead of real data? Let's experiment to find out.]]></description><link>https://datacebo.com/blog/synthetic-clones-for-ml</link><guid isPermaLink="false">Ghost__Post__6216679682795d003d91f6e5</guid><category><![CDATA[Product]]></category><dc:creator><![CDATA[Arnav Modi]]></dc:creator><pubDate>Thu, 24 Feb 2022 16:33:56 GMT</pubDate><media:content url="https://sdv.ghost.io/content/images/2022/02/ML-Model-Development-Banner-04.png" medium="image"/><content:encoded><![CDATA[<img src="https://sdv.ghost.io/content/images/2022/02/ML-Model-Development-Banner-04.png" alt="ML Model Development using Synthetic Data Clones"/><p><em>This article was researched by Arnav Modi, a community user. Arnav is a high school student and aspiring data scientist who spent his summer learning about the SDV and how synthetic data is used to perform ML tasks.</em></p><p>One potential use for synthetic data is to replace real data in the development of new machine learning (ML) models. 
Imagine a scenario where you need to build a predictive ML model – perhaps for a function critical to your business, like predicting customer satisfaction or sales success – with one important consideration: <strong>The data is sensitive, so only trusted employees can access it with specific credentials.</strong></p><p>Access to sensitive data may create a barrier for a variety of reasons:</p><ul><li>You might not have ML expertise in your organization, which means you need to use external software or contractors to complete the task. However, you are unable to share the data with them.</li><li>Your data is available on a secure, cloud-based platform for trusted employees to access remotely. They work on this data using interactive notebooks. Every time they lose their connection – due to WiFi outages, their laptops falling asleep, etc. – they may lose their work or have to reconnect.</li><li>You have a robust authentication system that your team uses. However, it creates a barrier to entry for rapid, iterative collaboration between members, sharing work and debugging data pipelines. 
As a result, your collaboration is much slower than it would be if your team could access the data without the need to authenticate.</li></ul><p>In cases like this, synthetic data can be an ideal solution: You can create synthetic data based on the original, sensitive data set, and use it more freely during ML development.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://sdv.ghost.io/content/images/2022/02/ML-Model-Development-03.png" class="kg-image" alt="ML Model Development using Synthetic Data Clones" loading="lazy" width="2000" height="783" srcset="https://sdv.ghost.io/content/images/size/w600/2022/02/ML-Model-Development-03.png 600w, https://sdv.ghost.io/content/images/size/w1000/2022/02/ML-Model-Development-03.png 1000w, https://sdv.ghost.io/content/images/size/w1600/2022/02/ML-Model-Development-03.png 1600w, https://sdv.ghost.io/content/images/size/w2400/2022/02/ML-Model-Development-03.png 2400w" sizes="(min-width: 720px) 720px"><figcaption>Synthetic data can be useful for ML development. You can use synthetic data to develop models in a variety of environments, like data science platforms, local machines or 3rd party software. Meanwhile, the real data never leaves your premises.</figcaption></img></figure><p>One key question will determine if this method succeeds: Is the synthetic data actually useful for your ML task? We performed an experiment to find out.</p><p>In the rest of this article, we'll describe our experimental setup and findings. (You can double-check our work in this <a href="https://colab.research.google.com/drive/13-1xy5t7veizWBsb_dDgTRBdhGcCqjCJ?usp=sharing">Colab Notebook</a>.)</p><h3 id="experimental-setup">Experimental Setup</h3><p>If an ML model is trained using synthetic data instead of real data, what happens to the model's performance? 
To answer this question, we identified 3 publicly available datasets (<a href="https://www.kaggle.com/mastmustu/income?select=train.csv">Income</a>, <a href="https://archive.ics.uci.edu/ml/datasets/Bank+Marketing">Bank</a> and <a href="https://www.kaggle.com/teejmahal20/airline-passenger-satisfaction?select=train.csv">Airline</a>) that are associated with particular ML prediction tasks. The datasets and tasks are summarized below.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://sdv.ghost.io/content/images/2022/02/ML-Model-Development-04.png" class="kg-image" alt="ML Model Development using Synthetic Data Clones" loading="lazy" width="2000" height="671" srcset="https://sdv.ghost.io/content/images/size/w600/2022/02/ML-Model-Development-04.png 600w, https://sdv.ghost.io/content/images/size/w1000/2022/02/ML-Model-Development-04.png 1000w, https://sdv.ghost.io/content/images/size/w1600/2022/02/ML-Model-Development-04.png 1600w, https://sdv.ghost.io/content/images/size/w2400/2022/02/ML-Model-Development-04.png 2400w" sizes="(min-width: 720px) 720px"><figcaption>A description of our datasets. *[Moro et al., 2014] S. Moro, P. Cortez and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems, Elsevier, 62:22-31, June 2014</figcaption></img></figure><p>Our experiment compared the performance of an  ML model trained on the original data, vs. one trained on the synthetic data provided by the SDV.</p><ul><li><strong><strong><strong>Control (Original data): </strong></strong></strong>How successfully can we complete the ML prediction task if we use the real data? Because some predictions are harder than others, this control helped us identify the overall difficulty of these specific tasks.</li><li><strong><strong>Experiment (Synthetic data):</strong> </strong>How successfully can we complete the ML prediction task if we use synthetic data instead? 
We used the SDV's <a href="https://sdv.dev/SDV/user_guides/single_table/copulagan.html">CopulaGAN</a> to generate synthetic data from the three original datasets.</li></ul><p>In order to develop and test the ML model, we turned to the SDMetrics library — specifically the <a href="https://sdv.dev/SDV/user_guides/evaluation/single_table_metrics.html#machine-learning-efficacy-metrics">ML Efficacy metrics</a>, which build an ML model and evaluate its performance. We used the Binary Decision Tree Classifier and Binary Logistic Regression models. The overall experimental setup is illustrated below.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://sdv.ghost.io/content/images/2022/02/ML-Model-Development-05.png" class="kg-image" alt="ML Model Development using Synthetic Data Clones" loading="lazy" width="2000" height="724" srcset="https://sdv.ghost.io/content/images/size/w600/2022/02/ML-Model-Development-05.png 600w, https://sdv.ghost.io/content/images/size/w1000/2022/02/ML-Model-Development-05.png 1000w, https://sdv.ghost.io/content/images/size/w1600/2022/02/ML-Model-Development-05.png 1600w, https://sdv.ghost.io/content/images/size/w2400/2022/02/ML-Model-Development-05.png 2400w" sizes="(min-width: 720px) 720px"><figcaption>The experimental setup evaluated synthetic data against a test set of original data that we set aside at the start. 
This allows us to compare the usefulness of both types of data for ML tasks.</figcaption></img></figure><p>To obtain reliable findings, we ran 3 iterations and averaged the results.</p><h3 id="results">Results</h3><p>The graph below shows how well we are able to perform an ML task using the original vs. the synthetic data.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://sdv.ghost.io/content/images/2022/02/Machine-Learning-Efficacy.png" class="kg-image" alt="ML Model Development using Synthetic Data Clones" loading="lazy" width="638" height="395" srcset="https://sdv.ghost.io/content/images/size/w600/2022/02/Machine-Learning-Efficacy.png 600w, https://sdv.ghost.io/content/images/2022/02/Machine-Learning-Efficacy.png 638w"><figcaption>A comparison of ML accuracy scores obtained using real vs. synthetic data, allowing us to assess any loss of accuracy that comes from replacing the original data with synthetic data.</figcaption></img></figure><p><strong>Discussion</strong></p><p>The original data quantifies the general difficulty of the ML task. Looking at these values, we can see that the Income Dataset is the hardest task, as neither of our methods was able to get above 90% accuracy using the original data.</p><p>Comparing the original and synthetic results allows us to quantify the suitability of synthetic data for ML development. Our results show a loss of between 1% and 9% of the original efficacy value for all comparisons, with a median loss of roughly 2.5%.</p><p>It's important to note that the simplifications we've made for this experiment may result in worse accuracy than we would see in real-world use.</p><ul><li>Applying CopulaGAN out-of-the-box to each dataset is simplistic. 
In a real-world scenario, the model's parameters would likely be explicitly <a href="https://sdv.dev/SDV/user_guides/single_table/copulagan.html">tuned</a> and <a href="https://sdv.dev/SDV/user_guides/single_table/constraints.html">constraints</a> would be used to improve synthetic data quality.</li><li>The Decision Tree and Logistic Regression evaluators are relatively simplistic ML classifiers. An ML expert (or ML software) might use more advanced techniques.</li><li>In our scenario, the 3rd party delivers a fully trained, ready-to-go ML model. Another approach is to ask them to use the synthetic data to deliver an <em>untrained</em> model – so that you can train it yourself on the real dataset. This alternative setup, which should increase the prediction accuracy, will be a topic for a future article.</li></ul><p>In summary, the accuracy loss we observe represents the worst-case scenario. In a production environment, higher-quality ML models and more careful tuning of the SDV will likely minimize performance differences between original and synthetic data.</p><h3 id="takeaways">Takeaways</h3><p>In this article, we quantified the effect of replacing real data with a synthetic data clone for ML development. Our results show a median loss of roughly 2.5% of the original accuracy when using synthetic data, with losses between 1% and 9% across all comparisons. Considering these results, we assess that <strong>it is reasonable to explore the use of synthetic data for the purpose of ML development</strong>.</p><p>In order to maximize the utility of the synthetic data, we recommend tuning the SDV model and using constraints to improve the data quality. In future articles, we'll explore more details about using synthetic data for ML.</p><p><em>Are you using the SDV to solve your ML business needs? Publish your findings on the SDV blog as a guest author! 
Contact us at </em><a href="mailto:info@sdv.dev"><em>info@sdv.dev</em></a><em>.</em></p><p><br/></p><p><br/></p>]]></content:encoded></item><item><title><![CDATA[Building the Unique Combinations Constraint in the SDV]]></title><description><![CDATA[Sometimes, you want to limit the number of permutations in your synthetic data. Explore the strategies we used for enforcing this kind of logic.]]></description><link>https://sdv.dev/building-unique-combinations/</link><guid isPermaLink="false">Ghost__Post__61e841116361ff003b9ca712</guid><category><![CDATA[Engineering]]></category><dc:creator><![CDATA[Neha Patki]]></dc:creator><pubDate>Tue, 25 Jan 2022 18:25:20 GMT</pubDate><media:content url="https://sdv.ghost.io/content/images/2022/01/Banner-UC.png" medium="image"/><content:encoded><![CDATA[<img src="https://sdv.ghost.io/content/images/2022/01/Banner-UC.png" alt="Building the Unique Combinations Constraint in the SDV"/><p>By default, a machine learning (ML) model may not always learn the deterministic rules in your dataset. We've previously explored how the SDV allows users to <a href="https://sdv.dev/blog/eng-sdv-constraints/" rel="nofollow">input their logic</a> using constraints. With constraints, an SDV model produces logically correct data 100% of the time.</p><p>While an end user might expect the constraint to "just work," engineering this functionality requires some creative techniques. In this article, we'll describe the techniques we used to build the <code>UniqueCombinations</code> constraint. You can also follow along in our <a href="https://colab.research.google.com/drive/1bY8y6m7-CjTxWDepw32-ZT3Ubb9RGK5F?usp=sharing">notebook</a>.</p><pre><code>!pip install sdv==0.13.1</code></pre><pre><code class="language-python">import numpy as np
import warnings

warnings.filterwarnings('ignore')</code></pre><h3 id="what-is-a-unique-combinations-constraint">What is a Unique Combinations Constraint?</h3><p>Users frequently encounter logical constraints on the permutations -- mixing &amp; matching -- that are allowed in synthetic data.</p><p>To illustrate this, let's use the <code>world_v1</code> dataset from the SDV tabular dataset demos. This simple dataset describes the population of different cities around the world.</p><pre><code class="language-python">from sdv.demo import load_tabular_demo

data = load_tabular_demo('world_v1')
data = data.drop(['add_numerical'], axis=1) # not needed for this demo
data.head()</code></pre><figure class="kg-card kg-image-card"><img src="https://sdv.ghost.io/content/images/2022/01/Screen-Shot-2022-01-19-at-11.51.49-AM.png" class="kg-image" alt="Building the Unique Combinations Constraint in the SDV" loading="lazy" width="1014" height="362" srcset="https://sdv.ghost.io/content/images/size/w600/2022/01/Screen-Shot-2022-01-19-at-11.51.49-AM.png 600w, https://sdv.ghost.io/content/images/size/w1000/2022/01/Screen-Shot-2022-01-19-at-11.51.49-AM.png 1000w, https://sdv.ghost.io/content/images/2022/01/Screen-Shot-2022-01-19-at-11.51.49-AM.png 1014w" sizes="(min-width: 720px) 720px"/></figure><p><strong>Relationship between <code>Name</code>, <code>CountryCode</code> and <code>District</code></strong></p><p>Looking at the data, we can observe that there is a special relationship between the <code>Name</code> of the city, its <code>CountryCode</code> and its geographical <code>District</code>: When generating synthetic data, the model should not blindly mix-and-match these values. 
Instead, it should <strong>reference the real data to verify whether the combination is valid.</strong> This is called a <code>UniqueCombinations</code> constraint.</p><p>For example, take a particular city, like <code>Cambridge</code>, which appears 3 times in our dataset.</p><pre><code class="language-python">data[data.Name == 'Cambridge']</code></pre><figure class="kg-card kg-image-card"><img src="https://sdv.ghost.io/content/images/2022/01/Screen-Shot-2022-01-19-at-11.53.07-AM.png" class="kg-image" alt="Building the Unique Combinations Constraint in the SDV" loading="lazy" width="1020" height="248" srcset="https://sdv.ghost.io/content/images/size/w600/2022/01/Screen-Shot-2022-01-19-at-11.53.07-AM.png 600w, https://sdv.ghost.io/content/images/size/w1000/2022/01/Screen-Shot-2022-01-19-at-11.53.07-AM.png 1000w, https://sdv.ghost.io/content/images/2022/01/Screen-Shot-2022-01-19-at-11.53.07-AM.png 1020w" sizes="(min-width: 720px) 720px"/></figure><p>The constraint states that <code>Cambridge</code> should only ever appear with <code>GBR (England)</code>, <code>CAN (Ontario)</code> or <code>USA (Massachusetts)</code>. It is invalid if it appears in any other region -- for example, Cambridge, France.</p><p><strong>How does the SDV handle a Unique Combination out-of-the-box?</strong></p><p>Let's try running the <code>sdv</code> as-is on the dataset to see what happens. We'll use the <code>GaussianCopula</code> model on our dataset.</p><pre><code class="language-python">from sdv.tabular import GaussianCopula

np.random.seed(0)

model = GaussianCopula(
  categorical_transformer='label_encoding' # optimize speed
) 
model.fit(data)</code></pre><p>Now, let's generate some rows to inspect the synthetic data.</p><pre><code class="language-python">np.random.seed(12)
model.sample(5)</code></pre><figure class="kg-card kg-image-card"><img src="https://sdv.ghost.io/content/images/2022/01/Screen-Shot-2022-01-19-at-11.54.31-AM.png" class="kg-image" alt="Building the Unique Combinations Constraint in the SDV" loading="lazy" width="940" height="360" srcset="https://sdv.ghost.io/content/images/size/w600/2022/01/Screen-Shot-2022-01-19-at-11.54.31-AM.png 600w, https://sdv.ghost.io/content/images/2022/01/Screen-Shot-2022-01-19-at-11.54.31-AM.png 940w" sizes="(min-width: 720px) 720px"/></figure><p>Although the <code>sdv</code> is generating known city names, countries and districts, their combinations don't make sense. We can also go back to our original example and generate only some rows for <code>Cambridge</code>.</p><pre><code class="language-python">np.random.seed(10)

conditions = {'Name': 'Cambridge'}
model.sample(5, conditions=conditions)</code></pre><figure class="kg-card kg-image-card"><img src="https://sdv.ghost.io/content/images/2022/01/Screen-Shot-2022-01-19-at-11.55.06-AM.png" class="kg-image" alt="Building the Unique Combinations Constraint in the SDV" loading="lazy" width="1022" height="364" srcset="https://sdv.ghost.io/content/images/size/w600/2022/01/Screen-Shot-2022-01-19-at-11.55.06-AM.png 600w, https://sdv.ghost.io/content/images/size/w1000/2022/01/Screen-Shot-2022-01-19-at-11.55.06-AM.png 1000w, https://sdv.ghost.io/content/images/2022/01/Screen-Shot-2022-01-19-at-11.55.06-AM.png 1022w" sizes="(min-width: 720px) 720px"/></figure><p>The result is a variety of Cambridges that aren't necessarily in USA, GBR, or CAN. These aren't valid cities!</p><p><strong>What's going on?</strong> The SDV models include probabilities that some unseen combinations are possible. This is by design: Synthesizing new combinations -- that don't blatantly match the original data -- helps with privacy.</p><p>However in this particular case, we aren't worried about the privacy of a city belonging to a country or district. We actually <em>do</em> want the data to match. 
This is why we need to build a constraint.</p><h3 id="fixing-the-data-using-rejecting-sampling">Fixing the data using reject sampling</h3><p>In our <a href="https://sdv.dev/blog/eng-sdv-constraints/" rel="nofollow">previous article</a>, we described a solution called <code>reject_sampling</code> that works on any type of constraint and is very easy to build: We simply create the synthetic data as usual and then throw out (reject) any data that doesn't match.</p><figure class="kg-card kg-image-card"><img src="https://sdv.ghost.io/content/images/2022/01/UniqueCombinations-02.png" class="kg-image" alt="Building the Unique Combinations Constraint in the SDV" loading="lazy" width="883" height="316" srcset="https://sdv.ghost.io/content/images/size/w600/2022/01/UniqueCombinations-02.png 600w, https://sdv.ghost.io/content/images/2022/01/UniqueCombinations-02.png 883w" sizes="(min-width: 720px) 720px"/></figure><p>In theory, this can solve our <code>UniqueCombinations</code> constraint. In practice, this strategy is only efficient if the model can easily generate acceptable data. Let's calculate the chances of getting an acceptable combination (<code>Name</code>, <code>CountryCode</code>, <code>District</code>) from the model.</p><pre><code class="language-python">np.random.seed(0)

# Sample data from the model
# The sample may include combinations that aren't valid
n = 100000
new_data = model.sample(n)

# Calculate how many rows are valid
combo = ['Name', 'CountryCode', 'District']
merged = new_data.merge(data, left_on=combo, right_on=combo, how='left')
passed = merged[merged['ID_y'].notna()].shape[0]
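
# Optional cross-check (a sketch, not part of the original notebook).
# `count_valid_rows` is a hypothetical helper name. It uses the pandas merge
# indicator instead of the suffixed 'ID_y' column, and de-duplicates the real
# combinations first so that each synthetic row is counted at most once.
import pandas as pd

def count_valid_rows(synthetic, real, columns):
  valid = real[columns].drop_duplicates()
  flagged = synthetic[columns].merge(valid, on=columns, how='left', indicator=True)
  return int((flagged['_merge'] == 'both').sum())

# e.g. count_valid_rows(new_data, data, combo) should agree with `passed`
# as long as no combination is repeated in the real data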

# Print out our results
print("Valid rows: ", (passed/n)*100, "%")
print("Rejected rows: ", (1 - passed/n)*100, "%")</code></pre><pre><code>Valid rows:  0.038 %
Rejected rows:  99.96199999999999 %</code></pre><p>With such a low probability of passing the constraint, this strategy can become intractable.</p><h3 id="fixing-the-data-using-transformations">Fixing the data using transformations</h3><p>A more efficient strategy is for the ML model to learn the constraint directly, so it always produces acceptable data. We can do this by transforming the data in a clever way, forcing the model to learn the logic.</p><p>Our <a href="https://sdv.dev/blog/eng-sdv-constraints/" rel="nofollow">previous article</a> described how to do this for a different constraint. Unfortunately, the exact same transformation won't work to solve our current <code>UniqueCombinations</code> constraint. <strong>The transform strategy requires a different, creative solution for each constraint.</strong> So we have to start from scratch.</p><p>Can you think of any other ways to enforce <code>UniqueCombinations</code>?</p><p><strong>A solution: Concatenating the data</strong></p><p>One solution is to concatenate the data. That is, rather than treating the city <code>Name</code>, <code>CountryCode</code> and <code>District</code> as separate items, we treat them as a single value. This will force the model to learn them as 1 single concept rather than as multiple columns that can be recombined.</p><figure class="kg-card kg-image-card"><img src="https://sdv.ghost.io/content/images/2022/01/UniqueCombinations-01.png" class="kg-image" alt="Building the Unique Combinations Constraint in the SDV" loading="lazy" width="1524" height="1200" srcset="https://sdv.ghost.io/content/images/size/w600/2022/01/UniqueCombinations-01.png 600w, https://sdv.ghost.io/content/images/size/w1000/2022/01/UniqueCombinations-01.png 1000w, https://sdv.ghost.io/content/images/2022/01/UniqueCombinations-01.png 1524w" sizes="(min-width: 720px) 720px"/></figure><p>Let's see this in action.</p><pre><code class="language-python"># create transformed data that concatenates the columns
data_transform = data.copy()
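
# Optional safety check (a sketch, not part of the original notebook).
# Concatenation can only be reversed if the separator never occurs inside a value.
# `separator_is_safe` is a hypothetical helper name.
import pandas as pd  # imported here so the helper can be tried on its own

def separator_is_safe(df, columns, sep='#'):
  return not any(
    df[col].astype(str).str.contains(sep, regex=False).any() for col in columns
  )

# e.g. separator_is_safe(data_transform, ['Name', 'CountryCode', 'District'])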

# Concatenate the data using a separator
data_transform['concatenated'] = data_transform['Name'] + '#' + data_transform['CountryCode'] + '#' + data_transform['District']

# We can drop the individual columns
data_transform.drop(labels=['Name', 'CountryCode', 'District'],
                    axis=1, inplace=True)

data_transform.head()</code></pre><figure class="kg-card kg-image-card"><img src="https://sdv.ghost.io/content/images/2022/01/Screen-Shot-2022-01-19-at-11.58.21-AM.png" class="kg-image" alt="Building the Unique Combinations Constraint in the SDV" loading="lazy" width="828" height="368" srcset="https://sdv.ghost.io/content/images/size/w600/2022/01/Screen-Shot-2022-01-19-at-11.58.21-AM.png 600w, https://sdv.ghost.io/content/images/2022/01/Screen-Shot-2022-01-19-at-11.58.21-AM.png 828w" sizes="(min-width: 720px) 720px"/></figure><p>Now, we can train the model using the transformed (concatenated) data instead.</p><pre><code class="language-python">np.random.seed(35)

# create a new model that will learn from the transformed data
model_transform = GaussianCopula(categorical_transformer='label_encoding')
model_transform.fit(data_transform)

# this will produce transformed data
output = model_transform.sample()
output.head(5)</code></pre><figure class="kg-card kg-image-card"><img src="https://sdv.ghost.io/content/images/2022/01/Screen-Shot-2022-01-19-at-11.58.53-AM.png" class="kg-image" alt="Building the Unique Combinations Constraint in the SDV" loading="lazy" width="882" height="368" srcset="https://sdv.ghost.io/content/images/size/w600/2022/01/Screen-Shot-2022-01-19-at-11.58.53-AM.png 600w, https://sdv.ghost.io/content/images/2022/01/Screen-Shot-2022-01-19-at-11.58.53-AM.png 882w" sizes="(min-width: 720px) 720px"/></figure><p>To get back realistic-looking data, we can convert the concatenated column back into <code>Name</code>, <code>CountryCode</code> and <code>District</code>.</p><pre><code class="language-python">import pandas as pd

# Split the concatenated column by the separator and save the results
names = []
countrycodes = []
districts = []

for x in output['concatenated']:
  try:
    name, countrycode, district = x.split('#')
  except (AttributeError, ValueError):  # non-string value or unexpected number of parts
    name, countrycode, district = [np.nan]*3
  names.append(name)
  countrycodes.append(countrycode)
  districts.append(district)
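
# Vectorized alternative to the loop above (a sketch, not part of the original notebook).
# `split_concatenated` is a hypothetical helper name. It assumes every value is a
# string containing exactly len(columns) - 1 separators.
import pandas as pd  # repeated so the helper can be tried on its own

def split_concatenated(series, columns, sep='#'):
  parts = series.str.split(sep, expand=True)
  parts.columns = columns
  return parts

# e.g. split_concatenated(output['concatenated'], ['Name', 'CountryCode', 'District'])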

# Add the individual columns back in
output['Name'] = pd.Series(names)
output['CountryCode'] = pd.Series(countrycodes)
output['District'] = pd.Series(districts)

# Drop the concatenated column
output.drop(labels=['concatenated'], axis=1, inplace=True)</code></pre><p>As a result, the output now looks like our original data.</p><pre><code class="language-python">output.head()</code></pre><figure class="kg-card kg-image-card"><img src="https://sdv.ghost.io/content/images/2022/01/Screen-Shot-2022-01-19-at-11.59.41-AM.png" class="kg-image" alt="Building the Unique Combinations Constraint in the SDV" loading="lazy" width="1020" height="368" srcset="https://sdv.ghost.io/content/images/size/w600/2022/01/Screen-Shot-2022-01-19-at-11.59.41-AM.png 600w, https://sdv.ghost.io/content/images/size/w1000/2022/01/Screen-Shot-2022-01-19-at-11.59.41-AM.png 1000w, https://sdv.ghost.io/content/images/2022/01/Screen-Shot-2022-01-19-at-11.59.41-AM.png 1020w" sizes="(min-width: 720px) 720px"/></figure><p>Most importantly, the <code>Name</code>, <code>CountryCode</code> and <code>District</code> columns now make sense!</p><p><strong>Caveats of transforming the data</strong></p><p>The transform strategy is an efficient and elegant approach to modeling. But there is a downside: <strong>The transform strategy might lose some mathematical properties.</strong></p><p>To see why, consider the model's perspective:</p><ul><li><code>Cambridge#GBR#England</code> is completely different from</li><li><code>Cambridge#USA#Massachusetts</code> is completely different from</li><li><code>Boston#USA#Massachusetts</code></li></ul><p>The problem is that two of these actually have something in common -- they are located in <code>Massachusetts, USA</code>. So the model will not be able to learn anything special about <code>Massachusetts</code> or <code>USA</code> as a whole.</p><p>As an example, let's see how well the model was able to learn populations of US-based cities.</p><pre><code class="language-python">import matplotlib.pyplot as plt

# Populations of real US cities
real_usa = data.loc[data['CountryCode'] == 'USA', 'Population']

# Populations of synthetic US cities
synth_usa = output.loc[output['CountryCode'] == 'USA', 'Population']

# Plot the distributions
plt.ylabel('US City Data')
plt.xlabel('Population')
_ = plt.boxplot([real_usa, synth_usa],
                showfliers=False,
                labels=['Real', 'Synthetic'],
                vert=False
)
plt.show()</code></pre><figure class="kg-card kg-image-card"><img src="https://sdv.ghost.io/content/images/2022/01/Screen-Shot-2022-01-19-at-12.00.53-PM.png" class="kg-image" alt="Building the Unique Combinations Constraint in the SDV" loading="lazy" width="1022" height="500" srcset="https://sdv.ghost.io/content/images/size/w600/2022/01/Screen-Shot-2022-01-19-at-12.00.53-PM.png 600w, https://sdv.ghost.io/content/images/size/w1000/2022/01/Screen-Shot-2022-01-19-at-12.00.53-PM.png 1000w, https://sdv.ghost.io/content/images/2022/01/Screen-Shot-2022-01-19-at-12.00.53-PM.png 1022w" sizes="(min-width: 720px) 720px"/></figure><p>The real data shows less variation in city population than the synthetic data. The differences make sense because our model wasn't able to learn about the USA as one complete concept.</p><p><strong>Can we fix this?</strong> It's challenging to fix this issue without degrading the mathematical correlations in some other way. If you have any ideas, we welcome you to <a href="https://github.com/sdv-dev/SDV/issues/414" rel="nofollow">join our discussion</a>!</p><h3 id="inputting-a-uniquecombination-into-the-sdv">Inputting a UniqueCombination into the SDV</h3><p>We built the constraint -- both the <code>reject_sampling</code> and <code>transform</code> approaches -- directly into the SDV library. If you have <code>sdv</code> installed, this is ready to use. Import the <code>UniqueCombinations</code> class from the <code>constraints</code> module.</p><pre><code class="language-python">from sdv.constraints import UniqueCombinations

# Create a Unique Combinations constraint
unique_city_country_district = UniqueCombinations(
  columns=['Name', 'CountryCode', 'District'],
  handling_strategy='transform' # you can change this to 'reject_sampling' too
)

# Create a new model using the constraint
updated_model = GaussianCopula(
  constraints=[unique_city_country_district],
  categorical_transformer='label_encoding'
)</code></pre><p>Now, you can train the model on your data and sample synthetic data.</p><pre><code class="language-python">np.random.seed(35)

updated_model.fit(data)
updated_model.sample(5)</code></pre><figure class="kg-card kg-image-card"><img src="https://sdv.ghost.io/content/images/2022/01/Screen-Shot-2022-01-19-at-12.02.30-PM.png" class="kg-image" alt="Building the Unique Combinations Constraint in the SDV" loading="lazy" width="1146" height="382" srcset="https://sdv.ghost.io/content/images/size/w600/2022/01/Screen-Shot-2022-01-19-at-12.02.30-PM.png 600w, https://sdv.ghost.io/content/images/size/w1000/2022/01/Screen-Shot-2022-01-19-at-12.02.30-PM.png 1000w, https://sdv.ghost.io/content/images/2022/01/Screen-Shot-2022-01-19-at-12.02.30-PM.png 1146w" sizes="(min-width: 720px) 720px"/></figure><p>All of the synthetic data is guaranteed to follow the <code>UniqueCombinations</code> constraint.</p><h3 id="takeaways">Takeaways</h3><ol><li>We can identify a <code>UniqueCombinations</code> requirement by asking: Should it be possible to further mix-and-match the data?</li><li>We can enforce any logical constraint by using reject sampling, which throws out any invalid data. This is not efficient for <code>UniqueCombinations</code>.</li><li>An alternative approach is to transform the data, forcing the ML model to learn the constraint. 
For <code>UniqueCombinations</code> we transformed the data by concatenating it.</li><li>The logic for <code>UniqueCombinations</code> is already built into the SDV's <code>constraints</code> module, and is ready to use.</li></ol><p>Further reading:</p><ul><li><a href="https://sdv.dev/blog/eng-sdv-constraints/" rel="nofollow">Engineering Constraints Blog Article</a></li><li><a href="https://sdv.dev/SDV/user_guides/single_table/constraints.html" rel="nofollow">Handling Constraints User Guide</a></li><li><a href="https://sdv.dev/SDV/api_reference/constraints/tabular.html" rel="nofollow">Tabular Constraints API</a></li></ul>]]></content:encoded></item><item><title><![CDATA[The SDV in 2021: A year in review]]></title><description><![CDATA[In this article, we summarize SDV growth – downloads as well as community building – that indicates increasing market demand for synthetic data.]]></description><link>https://sdv.dev/2021-year-review/</link><guid isPermaLink="false">Ghost__Post__61d3611b6317ec003be8e4b3</guid><category><![CDATA[Open Source]]></category><dc:creator><![CDATA[Kalyan Veeramachaneni]]></dc:creator><pubDate>Mon, 03 Jan 2022 21:07:19 GMT</pubDate><media:content url="https://sdv.ghost.io/content/images/2022/01/Year-in-review-with-sdv.png" medium="image"/><content:encoded><![CDATA[<img src="https://sdv.ghost.io/content/images/2022/01/Year-in-review-with-sdv.png" alt="The SDV in 2021: A year in review"/><p>We started SDV open source in 2018 at MIT with the goal of creating a powerful, usable, machine learning-based synthetic data generation software system. The core belief that drove us was the conviction that more than 90% of data work can be done using synthetic data instead of real data. 
Early <a href="https://news.mit.edu/2017/artificial-data-give-same-results-as-real-data-0303">experiments at MIT</a> had been promising and we were ready to invest our time and energy into that promise.</p><p>Now, 3 years later, we are pleased to see that the market demand for synthetic data is increasing. In a 2021 article, Gartner <a href="https://blogs.gartner.com/andrew_white/2021/07/24/by-2024-60-of-the-data-used-for-the-development-of-ai-and-analytics-projects-will-be-synthetically-generated/">predicted</a> that 60% of data used for AI &amp; analytics will be synthetic by 2024.</p><p>As time progressed, we used feedback from our users to make numerous improvements to the SDV (see articles <a href="https://sdv.dev/blog/community-feedback-models/">Part 1</a> and <a href="https://sdv.dev/blog/community-feedback-workflow/">Part 2</a>). In response, we've seen increased usage, validating the market need for synthetic data generation software. In this article, we'll describe the SDV growth trends in detail.</p><h3 id="persistent-4xyear-growth-in-downloads">Persistent 4x/year growth in downloads</h3><p>Every year, we have experienced roughly a 4x increase in SDV downloads. In 2021, we had 135,000 downloads of SDV – up from 30,576 in 2020. From the start of 2020 to the end of 2021, we have seen a 16x total increase in SDV downloads. 
The figure below shows our yearly usage.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://sdv.ghost.io/content/images/2022/01/downloads-graphic-1.png" class="kg-image" alt="The SDV in 2021: A year in review" loading="lazy" width="2000" height="889" srcset="https://sdv.ghost.io/content/images/size/w600/2022/01/downloads-graphic-1.png 600w, https://sdv.ghost.io/content/images/size/w1000/2022/01/downloads-graphic-1.png 1000w, https://sdv.ghost.io/content/images/size/w1600/2022/01/downloads-graphic-1.png 1600w, https://sdv.ghost.io/content/images/size/w2400/2022/01/downloads-graphic-1.png 2400w" sizes="(min-width: 720px) 720px"><figcaption>Downloads of SDV per year since we open sourced the library in 2018. By downloading the SDV, a user is signaling their need for synthetic data – which we can interpret as a vote from the market.</figcaption></img></figure><p>The downloads are coming from all over the world. In the map below, we list the top 10 countries.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://sdv.ghost.io/content/images/2022/01/worldmap-graphic.png" class="kg-image" alt="The SDV in 2021: A year in review" loading="lazy" width="2000" height="1156" srcset="https://sdv.ghost.io/content/images/size/w600/2022/01/worldmap-graphic.png 600w, https://sdv.ghost.io/content/images/size/w1000/2022/01/worldmap-graphic.png 1000w, https://sdv.ghost.io/content/images/size/w1600/2022/01/worldmap-graphic.png 1600w, https://sdv.ghost.io/content/images/size/w2400/2022/01/worldmap-graphic.png 2400w" sizes="(min-width: 720px) 720px"><figcaption>Downloads of the SDV in 2021, broken down by the top 10 countries. Notice that Europe accounts for half the countries.</figcaption></img></figure><p>Why are users downloading the SDV? We know that they want to create synthetic data, but they are using the synthetic data to solve a variety of different needs. 
We will explore this more and share it in a future article.</p><h3 id="over-a-thousand-new-community-members">Over a thousand new community members</h3><p>Another measure of our growth – and validation from the market – comes from the SDV community we've built on our <a href="https://github.com/sdv-dev/SDV">GitHub</a> and <a href="https://bit.ly/sdv-slack-invite">Slack</a>. In 2021, we welcomed more than 1000 new members to these spaces.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://sdv.ghost.io/content/images/2022/01/community-graphic-2.png" class="kg-image" alt="The SDV in 2021: A year in review" loading="lazy" width="2000" height="761" srcset="https://sdv.ghost.io/content/images/size/w600/2022/01/community-graphic-2.png 600w, https://sdv.ghost.io/content/images/size/w1000/2022/01/community-graphic-2.png 1000w, https://sdv.ghost.io/content/images/size/w1600/2022/01/community-graphic-2.png 1600w, https://sdv.ghost.io/content/images/size/w2400/2022/01/community-graphic-2.png 2400w" sizes="(min-width: 720px) 720px"><figcaption>A summary of how the SDV Community grew in 2021. Any user can join the community and actively participate through the SDV GitHub and Slack.</figcaption></img></figure><p>As <a href="https://www.bvp.com/atlas/measuring-the-engagement-of-an-open-source-software-community">this article</a> points out, members contribute in several different ways: Many help increase awareness of an open source solution for this enterprise pain point. Meanwhile, others jump in, use it and give feedback actively. In 2021, we doubled the number of unique users raising issues on our GitHub. Throughout the  year, over 200 members actively participated in our forums by raising GitHub issues or contributing to discussions on Slack.</p><p>Enterprise feedback is particularly useful to us. This type of feedback comes from users who are solving targeted business problems with the SDV. 
Direct and succinct feedback explains what would make the SDV more useful. An example is shown below.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://sdv.ghost.io/content/images/2022/01/Screen-Shot-2022-01-03-at-12.55.08-PM.png" class="kg-image" alt="The SDV in 2021: A year in review" loading="lazy" width="1718" height="282" srcset="https://sdv.ghost.io/content/images/size/w600/2022/01/Screen-Shot-2022-01-03-at-12.55.08-PM.png 600w, https://sdv.ghost.io/content/images/size/w1000/2022/01/Screen-Shot-2022-01-03-at-12.55.08-PM.png 1000w, https://sdv.ghost.io/content/images/size/w1600/2022/01/Screen-Shot-2022-01-03-at-12.55.08-PM.png 1600w, https://sdv.ghost.io/content/images/2022/01/Screen-Shot-2022-01-03-at-12.55.08-PM.png 1718w" sizes="(min-width: 720px) 720px"><figcaption>Feedback about a missing feature – composite keys – that would make a direct impact on an enterprise use case. We've removed the user's GitHub account name for privacy. In this case, the missing feature did make it into our pre-alpha.</figcaption></img></figure><p>Our team addresses the user feedback throughout the entire SDV ecosystem. The ecosystem includes not only modeling, but also the ability to compare models through <a href="https://github.com/sdv-dev/SDGym">SDGym</a> and measure synthetic data quality through <a href="https://github.com/sdv-dev/SDMetrics">SDMetrics</a>. In 2021, the team put out 49 releases throughout the SDV ecosystem, doubling our number of releases in 2020.</p><h3 id="looking-forward-to-2022">Looking forward to 2022!</h3><p>We are looking forward to 2022! With so many users giving us feedback, we have a long list of features that we want to incorporate. 
We can't wait to share with our community what everyone is using SDV for, and keep on climbing to our original goal: 90% of data work accomplished with synthetic data.</p>]]></content:encoded></item><item><title><![CDATA[How we engineered constraint handling strategies in SDV]]></title><description><![CDATA[The SDV enforces deterministic rules using constraints. What strategies did we use to engineer this ML system? Dive into the details.]]></description><link>https://sdv.dev/eng-sdv-constraints/</link><guid isPermaLink="false">Ghost__Post__61c10f636317ec003be8e39d</guid><category><![CDATA[Engineering]]></category><dc:creator><![CDATA[Andrew Montanez]]></dc:creator><pubDate>Tue, 21 Dec 2021 00:14:45 GMT</pubDate><media:content url="https://sdv.ghost.io/content/images/2021/12/Banner-01.png" medium="image"/><content:encoded><![CDATA[<img src="https://sdv.ghost.io/content/images/2021/12/Banner-01.png" alt="How we engineered constraint handling strategies in SDV"/><p>The SDV uses machine learning (ML) to automatically learn rules (aka correlations) from real data and generate accurate synthetic data. While these models are powerful, they may not learn everything. In our <a href="https://sdv.dev/blog/user-input-synthetic-data/" rel="nofollow">previous article</a>, we described how the SDV models may not learn <strong>deterministic rules</strong>. These are patterns and laws that are inherent to the dataset:</p><ul><li>They are unchangeable, no matter what data you input.</li><li>They describe rules that must apply to every row, no exceptions.</li></ul><p>Luckily, it's possible for you to improve the machine learning model: When you input constraints, it ensures the model will learn deterministic rules and ultimately improve the quality of your synthetic data.</p><p>In this article, we'll dive into the technical details of how you can apply constraints and how they work under-the-hood. 
You can also follow along in our <a href="https://colab.research.google.com/drive/1cVGv2Xtzhd9qHgbkjsYLeLzsA8bDd1uA?usp=sharing">notebook</a>.</p><pre><code>!pip install sdv==0.13.0</code></pre><pre><code class="language-python">import numpy as np
import warnings

warnings.filterwarnings('ignore')</code></pre><h3 id="the-dataset">The Dataset</h3><p>The dataset we're using comes from a <a href="https://www.kaggle.com/c/expedia-hotel-recommendations/data?select=train.csv" rel="nofollow">Kaggle Competition</a> hosted by Expedia. We've modified the data slightly for our use.</p><pre><code class="language-python">from sdv.demo import load_tabular_demo

data = load_tabular_demo('expedia_hotel_logs')</code></pre><p>In this real-world dataset, each row represents a search result for a hotel booking.</p><p>For the purposes of this notebook, we'll drop some columns that aren't useful to us.</p><pre><code class="language-python">import pandas as pd

# Drop some columns that aren't useful for this demo
drop_columns = ['date_time', 'user_location_country', 'user_location_region',
                'user_location_city', 'user_id', 'srch_destination_id',
                'hotel_country', 'hotel_market', 'hotel_cluster',
                'srch_destination_type_id', 'orig_destination_distance',
                'posa_continent', 'site_name', 'channel']
data = data.drop(drop_columns, axis=1)

# make sure these columns are read as datetimes
for col in ['srch_ci', 'srch_co']:
  data[col] = pd.to_datetime(data[col])

# Inspect the data
data.head()</code></pre><figure class="kg-card kg-image-card"><img src="https://sdv.ghost.io/content/images/2021/12/Screen-Shot-2021-12-20-at-3.36.14-PM.png" class="kg-image" alt="How we engineered constraint handling strategies in SDV" loading="lazy" width="2000" height="349" srcset="https://sdv.ghost.io/content/images/size/w600/2021/12/Screen-Shot-2021-12-20-at-3.36.14-PM.png 600w, https://sdv.ghost.io/content/images/size/w1000/2021/12/Screen-Shot-2021-12-20-at-3.36.14-PM.png 1000w, https://sdv.ghost.io/content/images/size/w1600/2021/12/Screen-Shot-2021-12-20-at-3.36.14-PM.png 1600w, https://sdv.ghost.io/content/images/2021/12/Screen-Shot-2021-12-20-at-3.36.14-PM.png 2122w" sizes="(min-width: 720px) 720px"/></figure><p>The search parameters saved in this dataset come from the user's input when searching for a hotel room. For example:</p><figure class="kg-card kg-image-card"><img src="https://sdv.ghost.io/content/images/2021/12/EngineeredConstraint-08.png" class="kg-image" alt="How we engineered constraint handling strategies in SDV" loading="lazy" width="2000" height="912" srcset="https://sdv.ghost.io/content/images/size/w600/2021/12/EngineeredConstraint-08.png 600w, https://sdv.ghost.io/content/images/size/w1000/2021/12/EngineeredConstraint-08.png 1000w, https://sdv.ghost.io/content/images/size/w1600/2021/12/EngineeredConstraint-08.png 1600w, https://sdv.ghost.io/content/images/size/w2400/2021/12/EngineeredConstraint-08.png 2400w" sizes="(min-width: 720px) 720px"/></figure><p><strong>Deterministic Rule</strong></p><p>In order for the search to be valid, the searched check-in date must happen before the searched check-out date. That is: <code>srch_ci &lt; srch_co</code>.</p><p>This is an inherent property of any search, not just for this particular dataset -- we call this a <strong>deterministic rule</strong>. We can verify if this is true by checking for any exceptions.</p><pre><code class="language-python">print('Violations of the deterministic rule')
len(data[data['srch_ci'] &gt; data['srch_co']])</code></pre><pre><code>0</code></pre><p><strong>Will SDV's machine learning model learn this out of the box?</strong></p><p>To test this, let's use SDV to learn a <code>GaussianCopula</code> model from the data and sample synthetic data.</p><pre><code class="language-python">from sdv.tabular import GaussianCopula

np.random.seed(0)

model = GaussianCopula(primary_key='log_id')
model.fit(data)

synth_data = model.sample(500)
synth_data.head()</code></pre><figure class="kg-card kg-image-card"><img src="https://sdv.ghost.io/content/images/2021/12/Screen-Shot-2021-12-20-at-3.36.54-PM-1.png" class="kg-image" alt="How we engineered constraint handling strategies in SDV" loading="lazy" width="2000" height="388" srcset="https://sdv.ghost.io/content/images/size/w600/2021/12/Screen-Shot-2021-12-20-at-3.36.54-PM-1.png 600w, https://sdv.ghost.io/content/images/size/w1000/2021/12/Screen-Shot-2021-12-20-at-3.36.54-PM-1.png 1000w, https://sdv.ghost.io/content/images/size/w1600/2021/12/Screen-Shot-2021-12-20-at-3.36.54-PM-1.png 1600w, https://sdv.ghost.io/content/images/2021/12/Screen-Shot-2021-12-20-at-3.36.54-PM-1.png 2050w" sizes="(min-width: 720px) 720px"/></figure><p>Now, we can inspect the synthetic data to see if there are any invalid rows.</p><pre><code class="language-python">invalid_row_indices = synth_data['srch_ci'] &gt; synth_data['srch_co']
invalid_rows = synth_data[invalid_row_indices]

num_invalid = len(invalid_rows)
perc_invalid = num_invalid / len(synth_data) * 100
print('Number of invalid rows:', num_invalid, '(', round(perc_invalid, 2), '%)')

invalid_rows.head()</code></pre><figure class="kg-card kg-image-card"><img src="https://sdv.ghost.io/content/images/2021/12/Screen-Shot-2021-12-20-at-3.37.27-PM-1.png" class="kg-image" alt="How we engineered constraint handling strategies in SDV" loading="lazy" width="2000" height="414" srcset="https://sdv.ghost.io/content/images/size/w600/2021/12/Screen-Shot-2021-12-20-at-3.37.27-PM-1.png 600w, https://sdv.ghost.io/content/images/size/w1000/2021/12/Screen-Shot-2021-12-20-at-3.37.27-PM-1.png 1000w, https://sdv.ghost.io/content/images/size/w1600/2021/12/Screen-Shot-2021-12-20-at-3.37.27-PM-1.png 1600w, https://sdv.ghost.io/content/images/2021/12/Screen-Shot-2021-12-20-at-3.37.27-PM-1.png 2070w" sizes="(min-width: 720px) 720px"/></figure><p>The majority of the rows (94.8%) are valid, meaning the model learned the rule pretty accurately. It learned probabilistically that when <code>srch_ci</code> is higher, <code>srch_co</code> tends to be higher still. However, some invalid rows (~5%) are still created, so <strong>the model did not learn this deterministic rule.</strong></p><p>This raises the question: What can we do to enforce a deterministic rule?</p><h3 id="improving-the-synthetic-data">Improving the synthetic data</h3><p>Let's explore some options for enforcing our deterministic rule in order to improve the overall quality of the synthetic data.</p><p><strong>Rejecting invalid data</strong></p><p>The simplest solution is to drop the invalid rows and keep sampling from the model until the desired number of valid rows has been produced. 
We call this <strong>reject sampling</strong>.</p><figure class="kg-card kg-image-card"><img src="https://sdv.ghost.io/content/images/2021/12/EngineeredConstraint-07--1-.png" class="kg-image" alt="How we engineered constraint handling strategies in SDV" loading="lazy" width="2000" height="493" srcset="https://sdv.ghost.io/content/images/size/w600/2021/12/EngineeredConstraint-07--1-.png 600w, https://sdv.ghost.io/content/images/size/w1000/2021/12/EngineeredConstraint-07--1-.png 1000w, https://sdv.ghost.io/content/images/size/w1600/2021/12/EngineeredConstraint-07--1-.png 1600w, https://sdv.ghost.io/content/images/2021/12/EngineeredConstraint-07--1-.png 2071w" sizes="(min-width: 720px) 720px"/></figure><p>The code below performs reject sampling until we have synthesized 500 rows.</p><pre><code class="language-python">import pandas as pd

# Keep track of how many valid rows we've sampled
num_valid_rows = synth_data.shape[0] - invalid_rows.shape[0]

while num_valid_rows &lt; 500:
  # Reject the invalid data 
  synth_data = synth_data.drop(invalid_rows.index)
  
  # Create new data to replace the invalid data
  new_data = model.sample(500-num_valid_rows)
  synth_data = pd.concat([synth_data, new_data], ignore_index=True)
  invalid_rows = synth_data[synth_data['srch_ci'] &gt; synth_data['srch_co']]
  num_valid_rows = synth_data.shape[0] - invalid_rows.shape[0]

synth_data.reset_index(drop=True, inplace=True)</code></pre><p>Now, there are no invalid rows in our dataset.</p><pre><code class="language-python">invalid_rows = synth_data[synth_data['srch_ci'] &gt; synth_data['srch_co']]
invalid_rows.shape[0]</code></pre><pre><code>0</code></pre><p>In this example, we got lucky. Only a small percentage of the rows were invalid each time <code>sample</code> was called.</p><p>What would happen if the majority of the rows were invalid every time we sampled? It would take much longer to collect all the desired rows. <strong>Sampling time is the primary drawback of reject sampling. </strong>Is there another approach we can use to improve the time?</p><p><strong>Transforming your data</strong></p><p>Instead of reject sampling, what if the model never produced invalid rows in the first place? To achieve this, we can alter the input data to the model so it's forced to learn the constraint.</p><p>Let's stop giving both <code>srch_ci</code> and <code>srch_co</code> to the model. Instead, let's teach the model to learn the <code>srch_ci</code> and the <code>difference</code> between the dates.</p><pre><code>difference = srch_co - srch_ci</code></pre><p>The model will produce <code>srch_ci</code> and <code>difference</code> as a result. 
Then, we can re-compute <code>srch_co</code> with the inverse formula.</p><pre><code>srch_co = srch_ci + difference</code></pre><figure class="kg-card kg-image-card"><img src="https://sdv.ghost.io/content/images/2021/12/EngineeredConstraint-06--1-.png" class="kg-image" alt="How we engineered constraint handling strategies in SDV" loading="lazy" width="2000" height="879" srcset="https://sdv.ghost.io/content/images/size/w600/2021/12/EngineeredConstraint-06--1-.png 600w, https://sdv.ghost.io/content/images/size/w1000/2021/12/EngineeredConstraint-06--1-.png 1000w, https://sdv.ghost.io/content/images/size/w1600/2021/12/EngineeredConstraint-06--1-.png 1600w, https://sdv.ghost.io/content/images/size/w2400/2021/12/EngineeredConstraint-06--1-.png 2400w" sizes="(min-width: 720px) 720px"/></figure><p>(Of course, we need to make sure the reconstructed difference is never negative, which we can do by having the model learn <code>log(difference + 1)</code> and clipping the inverted value at 0.)</p><p>Let's see this in action.</p><pre><code class="language-python"># Compute the difference
diff = (data['srch_co'] - data['srch_ci']).dt.days

# Add one, then take the log; the inverse (exp, minus one) can never fall below -1
date_diff = np.log(diff + 1)

# The model should learn this column instead of the checkout date
modified_data = data.drop('srch_co', axis=1)
modified_data['difference'] = date_diff
modified_data[['srch_ci', 'difference']].head()</code></pre><figure class="kg-card kg-image-card"><img src="https://sdv.ghost.io/content/images/2021/12/Screen-Shot-2021-12-20-at-3.30.15-PM.png" class="kg-image" alt="How we engineered constraint handling strategies in SDV" loading="lazy" width="390" height="360"/></figure><p>Now, we can fit the model with the modified data. The new samples will include the <code>srch_ci</code> and <code>difference</code> columns.</p><pre><code class="language-python">np.random.seed(20)

modified_model = GaussianCopula(primary_key='log_id')
modified_model.fit(modified_data)

modified_synth_data = modified_model.sample(500)
modified_synth_data[['srch_ci', 'difference']].head()</code></pre><figure class="kg-card kg-image-card"><img src="https://sdv.ghost.io/content/images/2021/12/Screen-Shot-2021-12-20-at-3.31.03-PM.png" class="kg-image" alt="How we engineered constraint handling strategies in SDV" loading="lazy" width="392" height="356"/></figure><p>We can recompute the <code>srch_co</code> based on <code>srch_ci</code> and <code>difference</code>.</p><pre><code class="language-python"># Undo the log+1 that we added
diff = pd.to_timedelta((np.exp(modified_synth_data['difference'].values).round() - 1).clip(0), unit='D')
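# Illustration only (our addition, not part of the original notebook; the toy_*
# names and values are made up). Whatever real number the model samples for
# 'difference', exp(y) minus one stays above -1, so rounding and clipping at 0
# always leaves a non-negative number of days.
import numpy as np  # repeated so this check also runs on its own
toy_samples = np.array([-3.0, 0.0, 0.7, 2.1])
toy_days = (np.exp(toy_samples).round() - 1).clip(0)
assert toy_days.tolist() == [0.0, 0.0, 1.0, 7.0]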

# Reconstruct srch_co and remove the difference column
modified_synth_data['srch_co'] = modified_synth_data['srch_ci'] + diff
modified_synth_data = modified_synth_data.drop('difference', axis=1)

modified_synth_data.head()</code></pre><figure class="kg-card kg-image-card"><img src="https://sdv.ghost.io/content/images/2021/12/Screen-Shot-2021-12-20-at-3.38.38-PM.png" class="kg-image" alt="How we engineered constraint handling strategies in SDV" loading="lazy" width="2000" height="491" srcset="https://sdv.ghost.io/content/images/size/w600/2021/12/Screen-Shot-2021-12-20-at-3.38.38-PM.png 600w, https://sdv.ghost.io/content/images/size/w1000/2021/12/Screen-Shot-2021-12-20-at-3.38.38-PM.png 1000w, https://sdv.ghost.io/content/images/size/w1600/2021/12/Screen-Shot-2021-12-20-at-3.38.38-PM.png 1600w, https://sdv.ghost.io/content/images/2021/12/Screen-Shot-2021-12-20-at-3.38.38-PM.png 2142w" sizes="(min-width: 720px) 720px"/></figure><p>Let's verify that this computation does not create any invalid rows.</p><pre><code class="language-python">invalid_rows = modified_synth_data[modified_synth_data['srch_ci'] &gt; modified_synth_data['srch_co']]
invalid_rows.shape[0]</code></pre><pre><code>0</code></pre><p>The transformation worked! In our case, this was a more efficient way to enforce the deterministic rule.</p><p>But if our rule were more complex -- and we couldn't think of a transformation -- we could always fall back to reject sampling.</p><h3 id="inputting-deterministic-rules-in-the-sdv">Inputting deterministic rules in the SDV</h3><p>We've seen how reject sampling and transformations can be used to improve the quality of the synthetic data by accounting for deterministic rules. However, it may be cumbersome for you to manually implement these strategies. In fact, we saw some common problems in our SDV user community:</p><ul><li>Users had multiple deterministic rules in their dataset. For example, there could be multiple comparisons between different pairs of columns.</li><li>Users from multiple domains often had the same kind of deterministic rule. For example, one column being greater than another is a common deterministic rule, agnostic of the use case or domain.</li></ul><p>To solve these problems, we introduced a constraints module in the SDV. <strong>With the constraints module, SDV users can easily input deterministic rules. </strong>Let's look at an example.</p><p><strong>Using the SDV constraints module</strong></p><p>The <code>constraints</code> module in the SDV contains several different types of pre-defined deterministic rules.</p><p>We will use the <code>GreaterThan</code> constraint, which will enforce that one column's values are always greater than another's.</p><pre><code class="language-python">from sdv.constraints import GreaterThan</code></pre><p>Next, we can input the logic of our deterministic rule by creating a constraint object. The <code>GreaterThan</code> constraint accepts the column names as input.</p><pre><code class="language-python">gt_constraint = GreaterThan(
  low='srch_ci',
  high='srch_co')</code></pre><p>Finally, we can input this constraint when instantiating the model.</p><pre><code class="language-python">np.random.seed(10)

# Apply the constraint to the model
model_with_constraint = GaussianCopula(
  primary_key='log_id',
  constraints=[gt_constraint])

model_with_constraint.fit(data)

# Sample synthetic data
constrained_data = model_with_constraint.sample(500)
constrained_data.head()</code></pre><figure class="kg-card kg-image-card"><img src="https://sdv.ghost.io/content/images/2021/12/Screen-Shot-2021-12-20-at-3.39.11-PM.png" class="kg-image" alt="How we engineered constraint handling strategies in SDV" loading="lazy" width="2000" height="389" srcset="https://sdv.ghost.io/content/images/size/w600/2021/12/Screen-Shot-2021-12-20-at-3.39.11-PM.png 600w, https://sdv.ghost.io/content/images/size/w1000/2021/12/Screen-Shot-2021-12-20-at-3.39.11-PM.png 1000w, https://sdv.ghost.io/content/images/size/w1600/2021/12/Screen-Shot-2021-12-20-at-3.39.11-PM.png 1600w, https://sdv.ghost.io/content/images/2021/12/Screen-Shot-2021-12-20-at-3.39.11-PM.png 2046w" sizes="(min-width: 720px) 720px"/></figure><p>As a result, we should see that all 500 generated rows are valid on the first try. No invalid rows are present in our dataset.</p><pre><code class="language-python">invalid_rows = constrained_data[constrained_data['srch_ci'] &gt; constrained_data['srch_co']]
invalid_rows.shape[0]</code></pre><pre><code>0</code></pre><p>Using the SDV was much simpler than writing the code ourselves! Plus, we can create multiple constraints for the same dataset and easily use them on other datasets.</p><p><strong>Specifying the strategy in the constraints module</strong></p><p>By default, the <code>GreaterThan</code> constraint uses the <code>transform</code> strategy. However, you can use the <code>handling_strategy</code> argument to control this. This argument accepts <code>'reject_sampling'</code> or <code>'transform'</code> as valid strategies.</p><pre><code class="language-python">gt_reject_constraint = GreaterThan(
  low='srch_ci',
  high='srch_co',
  handling_strategy='reject_sampling' # specify the strategy
)</code></pre><p>Similar to before, we can then input this constraint into the model.</p><pre><code class="language-python">np.random.seed(30)

# Apply the constraint to the model
model_with_reject_constraint = GaussianCopula(
  primary_key='log_id',
  constraints=[gt_reject_constraint])

model_with_reject_constraint.fit(data)

# Sample synthetic data
constrained_reject_data = model_with_reject_constraint.sample(500)
constrained_reject_data.head()</code></pre><figure class="kg-card kg-image-card"><img src="https://sdv.ghost.io/content/images/2021/12/Screen-Shot-2021-12-20-at-3.40.44-PM.png" class="kg-image" alt="How we engineered constraint handling strategies in SDV" loading="lazy" width="2000" height="377" srcset="https://sdv.ghost.io/content/images/size/w600/2021/12/Screen-Shot-2021-12-20-at-3.40.44-PM.png 600w, https://sdv.ghost.io/content/images/size/w1000/2021/12/Screen-Shot-2021-12-20-at-3.40.44-PM.png 1000w, https://sdv.ghost.io/content/images/size/w1600/2021/12/Screen-Shot-2021-12-20-at-3.40.44-PM.png 1600w, https://sdv.ghost.io/content/images/2021/12/Screen-Shot-2021-12-20-at-3.40.44-PM.png 2048w" sizes="(min-width: 720px) 720px"/></figure><pre><code class="language-python">invalid_rows = constrained_reject_data[constrained_reject_data['srch_ci'] &gt; constrained_reject_data['srch_co']]
invalid_rows.shape[0]</code></pre><pre><code>0</code></pre><h3 id="what-other-deterministic-rules-are-already-available-in-sdv">What other deterministic rules are already available in SDV?</h3><p>The <code>GreaterThan</code> constraint is one kind of deterministic rule, but there may be others that apply to your dataset. The SDV offers more constraints for other types of logic.</p><ul><li><a href="https://sdv.dev/SDV/user_guides/single_table/handling_constraints.html#unique-constraint" rel="nofollow">Unique</a> when values in a column must be unique to the entire dataset.</li><li><a href="https://sdv.dev/SDV/user_guides/single_table/handling_constraints.html#uniquecombinations-constraint" rel="nofollow">UniqueCombinations</a> to limit the permutations between multiple columns.</li><li><a href="https://sdv.dev/SDV/user_guides/single_table/handling_constraints.html#positive-and-negative-constraints" rel="nofollow">Positive and Negative</a> to enforce boundaries.</li><li><a href="https://sdv.dev/SDV/user_guides/single_table/handling_constraints.html#columnformula-constraint" rel="nofollow">ColumnFormula</a> when there is a formulaic association between columns.</li><li><a href="https://sdv.dev/SDV/user_guides/single_table/handling_constraints.html#rounding-constraint" rel="nofollow">Rounding</a> to enforce decimal precision.</li><li><a href="https://sdv.dev/SDV/user_guides/single_table/handling_constraints.html#between-constraint" rel="nofollow">Between</a> when one column's values must be between 2 other values.</li><li><a href="https://sdv.dev/SDV/user_guides/single_table/handling_constraints.html#onehotencoding-constraint" rel="nofollow">OneHotEncoding</a> when your data includes a variable with one hot encoding.</li></ul><p>For each of them, you can specify handling strategies for <code>reject_sampling</code> to discard invalid data or <code>transform</code> to modify the data (unique to each constraint).</p><p><strong>What if my rule isn't included in the 
module?</strong></p><p>You may come across a rule that cannot be described by any of the constraint classes in the SDV. In this case, you can define a <a href="https://sdv.dev/SDV/user_guides/single_table/custom_constraints.html#defining-custom-constraints" rel="nofollow">CustomConstraint</a> with logic specific to your use case.</p><p>Additionally, consider <a href="https://github.com/sdv-dev/SDV/issues/new/choose" rel="nofollow"><strong>filing a feature request on GitHub</strong></a> with details about your use case &amp; scenario. We can add your logic as a pre-defined constraint so others can benefit from it too!</p><h3 id="takeaways">Takeaways</h3><p>In this notebook, we explored what happens when we have a deterministic rule in our dataset.</p><ol><li>Machine learning models may not be able to learn the deterministic rules out of the box, but it is possible to improve the model to learn these types of rules.</li><li>Deterministic rules can be handled by discarding invalid data (<strong>reject sampling</strong>) or by adding some clever preprocessing to your code (<strong>transforming</strong>).</li><li>The SDV offers a <code>constraints</code> module that allows you to input commonly found deterministic rules. You can specify the handling strategy for each constraint and apply multiple rules to the same dataset.</li></ol><p><strong>Further Reading</strong></p><p>For further information about constraints, refer to the <a href="https://sdv.dev/SDV/user_guides/single_table/handling_constraints.html" rel="nofollow">Handling Constraints User Guide</a>.</p>]]></content:encoded></item><item><title><![CDATA[User input to enhance synthetic data generation]]></title><description><![CDATA[ML models learn some rules out of the box, while other logic requires more work. Which is which? 
Read more to find out.]]></description><link>https://sdv.dev/user-input-synthetic-data/</link><guid isPermaLink="false">Ghost__Post__61a68d091b683e0048b2a2f3</guid><category><![CDATA[Product]]></category><dc:creator><![CDATA[Neha Patki]]></dc:creator><pubDate>Wed, 01 Dec 2021 16:06:49 GMT</pubDate><media:content url="https://sdv.ghost.io/content/images/2021/11/Banner.png" medium="image"/><content:encoded><![CDATA[<img src="https://sdv.ghost.io/content/images/2021/11/Banner.png" alt="User input to enhance synthetic data generation"/><p>In our <a href="https://sdv.dev/blog/fake-to-synthetic-ml">previous article</a>, we explored how machine learning (ML) plays a key role in synthetic data creation. One of the biggest strengths of ML is <em>automatic rule detection</em> (also known in ML terms as <em>correlations</em>): The algorithms are designed to learn patterns in the data, even without additional user input. The result is synthetic data that resembles the original, right down to its mathematical properties!</p><p>However, in some cases, applying an ML model right out of the box may not immediately achieve the desired result. In this article, we'll explore the strengths of ML models and go through those areas where user input may be required.</p><h3 id="strengths-of-ml-models">Strengths of ML Models</h3><p>The goal of any ML-based synthetic data generation software is to learn from and emulate the input data. 
To illustrate this, let's pretend you work in the car insurance business, and you're in possession of a real dataset related to drivers and their insurance:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://sdv.ghost.io/content/images/2021/11/Figure-03.png" class="kg-image" alt="User input to enhance synthetic data generation" loading="lazy" width="1916" height="835" srcset="https://sdv.ghost.io/content/images/size/w600/2021/11/Figure-03.png 600w, https://sdv.ghost.io/content/images/size/w1000/2021/11/Figure-03.png 1000w, https://sdv.ghost.io/content/images/size/w1600/2021/11/Figure-03.png 1600w, https://sdv.ghost.io/content/images/2021/11/Figure-03.png 1916w" sizes="(min-width: 720px) 720px"><figcaption>An example dataset, including license and collision coverage information associated with different drivers.</figcaption></img></figure><p>An ML-based system, such as the <a href="https://sdv.dev/blog/intro-to-sdv/">Synthetic Data Vault</a> (SDV), will learn patterns from the real data and use it to create new synthetic data. Recall some of the important patterns that ML algorithms detect:</p><ul><li><strong><strong><strong>Shapes. </strong></strong></strong>The general shape of the data. For example, in the dataset above, 50% of drivers have Collision Coverage and the Annual Premium is uniformly scattered between $3,000 and $9,000.</li><li><strong><strong><strong>Correlations.</strong> </strong></strong>The trends within the data. 
For example, having Collision Coverage -- especially Standard coverage -- means a higher Annual Premium.</li></ul><p>These shapes and correlations will be present in the synthetic data that is outputted by the ML model, as shown below.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://sdv.ghost.io/content/images/2021/11/Figure-04.png" class="kg-image" alt="User input to enhance synthetic data generation" loading="lazy" width="1875" height="832" srcset="https://sdv.ghost.io/content/images/size/w600/2021/11/Figure-04.png 600w, https://sdv.ghost.io/content/images/size/w1000/2021/11/Figure-04.png 1000w, https://sdv.ghost.io/content/images/size/w1600/2021/11/Figure-04.png 1600w, https://sdv.ghost.io/content/images/2021/11/Figure-04.png 1875w" sizes="(min-width: 720px) 720px"><figcaption>An example of a synthetic dataset created by an ML-based algorithm. The algorithm will learn patterns from the real data and emulate them.</figcaption></img></figure><p>Perhaps <strong>the single biggest strength of an ML algorithm is its ability to learn rules by looking for general patterns in the data,</strong> using probability and statistics.</p><h3 id="what-ml-models-do-not-learn-out-of-the-box">What ML models do not learn out of the box</h3><p>Let's take a closer look at the synthetic car insurance data. You might notice that two of the rows in the synthetic data don't make complete sense. 
Below, we've highlighted the errors.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://sdv.ghost.io/content/images/2021/11/Figure-05.png" class="kg-image" alt="User input to enhance synthetic data generation" loading="lazy" width="1867" height="831" srcset="https://sdv.ghost.io/content/images/size/w600/2021/11/Figure-05.png 600w, https://sdv.ghost.io/content/images/size/w1000/2021/11/Figure-05.png 1000w, https://sdv.ghost.io/content/images/size/w1600/2021/11/Figure-05.png 1600w, https://sdv.ghost.io/content/images/2021/11/Figure-05.png 1867w" sizes="(min-width: 720px) 720px"><figcaption>The synthetic car insurance data, with errors highlighted.</figcaption></img></figure><p>Do you see what has gone wrong? In the first row, the license expired 3 years earlier than it was issued. In the last row, a driver without Collision Coverage has a Collision Policy Type. Additionally, the same Customer ID has been repeated in Row 3 and Row 4.</p><p>There are three rules that the ML algorithm did not follow:</p><ol><li>License Expiration &gt; License Issue Year</li><li>If Has Collision Coverage = NO, then Collision Policy Type must be empty</li><li>All Customer IDs must be unique</li></ol><p>Why does the ML model easily pick up on some rules and not others? To answer this question, we can look closely at the rules themselves. All of the rules that the ML model successfully learned -- including the distribution shapes and the correlations -- were based on general trends. These <strong>probabilistic rules</strong> apply to a majority of the relationships within the dataset, but not all of them. Although they have to make sense in aggregate, a few rows may be exceptions.</p><p>By contrast, the rules that the ML model failed to learn were stricter. These <strong>deterministic rules</strong> describe intrinsic laws of nature, time or logic. 
Each and every row must adhere to them, and they won't change regardless of  how much (or how little) data has been given to the ML model.</p><p>To continue with the driving theme: A probabilistic rule is like a yield sign, signaling a general recommendation that works out differently for each individual situation -- some cars may need to stop, while others just slow down. Meanwhile, a deterministic rule is like a stop sign, demanding that every single car must come to a full stop.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://sdv.ghost.io/content/images/2021/11/Figure-06.png" class="kg-image" alt="User input to enhance synthetic data generation" loading="lazy" width="1607" height="662" srcset="https://sdv.ghost.io/content/images/size/w600/2021/11/Figure-06.png 600w, https://sdv.ghost.io/content/images/size/w1000/2021/11/Figure-06.png 1000w, https://sdv.ghost.io/content/images/size/w1600/2021/11/Figure-06.png 1600w, https://sdv.ghost.io/content/images/2021/11/Figure-06.png 1607w" sizes="(min-width: 720px) 720px"><figcaption>A probabilistic rule applies to a majority of rows, but leaves room for exceptions. Meanwhile, a deterministic rule applies to every single row.</figcaption></img></figure><p><strong>By default, our ML model assumed that all rules were probabilistic.</strong> When this happens, synthetic data still generally follows the desired properties -- for example, License Expiration &gt; License Issue Year -- for <em>most</em> of the rows, but not for every row.</p><h3 id="improving-the-ml-models-using-constraints">Improving the ML Models using constraints</h3><p>Just because the ML model didn't automatically follow a deterministic rule doesn't mean that it can't. It's possible to improve the model so that it understands this type of rule. 
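</p><p>To make the distinction concrete, the three deterministic rules listed earlier can be written as plain per-row checks. The following is a minimal sketch that uses only the Python standard library. The column names and sample rows are hypothetical stand-ins for the car insurance table, and the code is not part of the SDV or its constraints API.</p><pre><code class="language-python">from collections import Counter

# Hypothetical synthetic rows, loosely modeled on the car insurance example.
rows = [
    {"customer_id": "A1", "issue_year": 2018, "expiration_year": 2015,
     "has_collision": "YES", "policy_type": "Standard"},
    {"customer_id": "B2", "issue_year": 2016, "expiration_year": 2022,
     "has_collision": "YES", "policy_type": "Premium"},
    {"customer_id": "C3", "issue_year": 2019, "expiration_year": 2024,
     "has_collision": "NO", "policy_type": None},
    {"customer_id": "C3", "issue_year": 2017, "expiration_year": 2023,
     "has_collision": "NO", "policy_type": "Standard"},
]


def find_violations(rows):
    """Return (row_index, rule_name) pairs for every broken deterministic rule."""
    id_counts = Counter(row["customer_id"] for row in rows)
    violations = []
    for index, row in enumerate(rows):
        # Rule 1: the license must expire after it was issued.
        if not row["expiration_year"] > row["issue_year"]:
            violations.append((index, "expiration_after_issue"))
        # Rule 2: no collision coverage means no collision policy type.
        if row["has_collision"] == "NO" and row["policy_type"] is not None:
            violations.append((index, "empty_policy_without_coverage"))
        # Rule 3: every customer ID must appear exactly once.
        if id_counts[row["customer_id"]] > 1:
            violations.append((index, "unique_customer_id"))
    return violations


for index, rule in find_violations(rows):
    print(f"row {index} violates {rule}")
</code></pre><p>A check like this can only flag invalid rows after they have been generated. It shows what "every row must pass" means, but it does not teach the model anything.</p><p>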
As a user working with the SDV, you can input deterministic rules into your model using <strong>constraints</strong>.</p><p>An ML model built using constraints will accommodate both probabilistic and deterministic rules.</p><p><strong>Do you need SDV constraints?</strong></p><p>Deterministic rules are often easy to spot in your dataset: They are the rules that every single row must follow in order to be valid, regardless of how much data there is overall.  But even if you identify the right constraints, there are some cases where you might not actually want to supply them to the SDV.</p><p>Because the SDV learns probabilistic rules, most of the synthesized data is generally valid. Having a few errors sprinkled in might actually be beneficial if you want your synthetic data to cover some edge cases. For example, if you're using the synthetic data to test insurance claim software, leaving in some weird data points might help you ensure that the software can handle unexpected cases -- like the License Expiration accidentally being set too early.</p><p>The figure below shows a few questions you can ask to determine whether adding a constraint is the right approach.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://sdv.ghost.io/content/images/2021/12/Figure-07.png" class="kg-image" alt="User input to enhance synthetic data generation" loading="lazy" width="2000" height="586" srcset="https://sdv.ghost.io/content/images/size/w600/2021/12/Figure-07.png 600w, https://sdv.ghost.io/content/images/size/w1000/2021/12/Figure-07.png 1000w, https://sdv.ghost.io/content/images/size/w1600/2021/12/Figure-07.png 1600w, https://sdv.ghost.io/content/images/size/w2400/2021/12/Figure-07.png 2400w" sizes="(min-width: 720px) 720px"><figcaption>Should you input a rule using constraints? 
First, determine whether the rule is deterministic, and then take your use case into account.</figcaption></img></figure><p><strong>The SDV Constraints offering</strong></p><p>If you decide that adding deterministic rules is important for generating your synthetic data, the SDV has many different constraints to choose from! The table below describes the constraints you would need in order to define the deterministic rules that would best mold your Car Insurance dataset.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://sdv.ghost.io/content/images/2021/11/Figure-08.png" class="kg-image" alt="User input to enhance synthetic data generation" loading="lazy" width="2000" height="578" srcset="https://sdv.ghost.io/content/images/size/w600/2021/11/Figure-08.png 600w, https://sdv.ghost.io/content/images/size/w1000/2021/11/Figure-08.png 1000w, https://sdv.ghost.io/content/images/size/w1600/2021/11/Figure-08.png 1600w, https://sdv.ghost.io/content/images/2021/11/Figure-08.png 2204w" sizes="(min-width: 720px) 720px"><figcaption>The GreaterThan, ColumnFormula and Unique constraints -- all available in the SDV -- set the deterministic rules that ensure your synthetic Car Insurance Data is useful and makes sense.</figcaption></img></figure><p>The SDV offers many more possible constraints, including:</p><ul><li>UniqueCombinations</li><li>Positive and Negative</li><li>Rounding</li><li>Between</li><li>OneHotEncoding</li></ul><p>You can add multiple constraints to the same dataset in order to accommodate all the deterministic rules you need. For more details, read the <a href="https://sdv.dev/SDV/user_guides/single_table/handling_constraints.html">Constraints User Guide</a>.</p><h3 id="takeaways">Takeaways</h3><p>In this article, we learned that:</p><ul><li>Data is governed by rules. 
The SDV automatically learns probabilistic rules, which describe overall trends or patterns in the data.</li><li>However, sometimes the data has <strong>deterministic rules</strong>, which are always inherent no matter how much or how little data there is. ML-based systems, including the SDV, may not enforce deterministic rules out of the box.</li><li>Users can input deterministic rules to the SDV using <strong>constraints</strong>. To figure out whether you should input a constraint, ask yourself whether there are any rules that the data must always follow. There are many constraints to choose from.</li></ul><p>In future articles, we'll dive deeper into this topic. We'll explore the technical details behind constraints, and how exactly the SDV's ML models are able to learn deterministic rules.<br/></p>]]></content:encoded></item><item><title><![CDATA[Software Testing: Synthetic data changes the game]]></title><description><![CDATA[Creating fake data is an old concept -- but machine learning is a whole new ballgame. Learn about why ML is a key ingredient to synthetic data.]]></description><link>https://sdv.dev/fake-to-synthetic-ml/</link><guid isPermaLink="false">Ghost__Post__61927ca167598b003b3d944a</guid><category><![CDATA[Applications]]></category><dc:creator><![CDATA[Neha Patki]]></dc:creator><pubDate>Tue, 16 Nov 2021 16:33:56 GMT</pubDate><media:content url="https://sdv.ghost.io/content/images/2021/11/Article-13.png" medium="image"/><content:encoded><![CDATA[<img src="https://sdv.ghost.io/content/images/2021/11/Article-13.png" alt="Software Testing: Synthetic data changes the game"/><p>Data is a great source of information. Real data — which is based on observations of real-world phenomena like weather, movements on a factory floor or the activities of a user base — can help us notice trends, increase business efficiency and solve problems. </p><p>But data can be helpful even if it isn’t real. 
This data, sometimes called fake or test data, doesn’t come directly from real-world observations, but is instead artificially crafted by a human or machine. The latest and most complex iteration of this data type — what we call synthetic data — builds on previous work done in this space. </p><p>In this article, we'll go through the history of fake data. By the end, you'll be able to answer the following questions:</p><ul><li>What were the original motivations and tools for manually creating data?</li><li>What differentiates synthetic data from other types of fake data?</li><li>What role does machine learning play in generating synthetic data?</li></ul><h3 id="the-dawn-of-fake-data-test-data-management">The Dawn of Fake Data: Test Data Management</h3><p>One group of people has worked with fake data for a long time: software engineers. They need data in order to test the systems they build, and the real stuff isn't always usable (for example, due to privacy). </p><p>Let's pretend it's the early 2000s, and you're an IT professional working at a bank. You're responsible for the software that updates account balances after each transaction. You'd like to test this software before putting it into production. 
What do you do?</p><p>Most likely, you'll come up with a few test scenarios to ensure that your functionality — updating the balance — can properly handle a variety of inputs.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://sdv.ghost.io/content/images/2021/11/Article-09--1-.png" class="kg-image" alt="Software Testing: Synthetic data changes the game" loading="lazy" width="2000" height="541" srcset="https://sdv.ghost.io/content/images/size/w600/2021/11/Article-09--1-.png 600w, https://sdv.ghost.io/content/images/size/w1000/2021/11/Article-09--1-.png 1000w, https://sdv.ghost.io/content/images/size/w1600/2021/11/Article-09--1-.png 1600w, https://sdv.ghost.io/content/images/size/w2400/2021/11/Article-09--1-.png 2400w" sizes="(min-width: 720px) 720px"><figcaption>This table shows a few scenarios you may use to test your system. In these scenarios, you're testing how a monetary transfer of $20 changes the balance in different accounts.</figcaption></img></figure><p>Notice that in order to create these scenarios, you had to generate data: various starting balances ($500, $20, $10) as well as a transfer amount ($20). This is an early version of using fake data in order to test your software!</p><p><strong>Using Tools for Manual Creation</strong></p><p>Now let's fast forward in time. Over the years, your software has gotten even more complex, and you're constantly adding new functionalities. For example, maybe you start allowing transfers with foreign currency. </p><p>You need to test these functionalities before you roll them out. To save time, you might end up using -- or creating -- a tool that allows you to generate and manage fake data for testing. 
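</p><p>To give a feel for what such a tool does under the hood, here is a minimal sketch, assuming the tool simply enumerates every combination of a few hand-picked values. The function name, the extra transfer amount and the currency codes are hypothetical, and the Cartesian product is only one possible way to combine the values.</p><pre><code class="language-python">from itertools import product

# Hand-picked values for each column. The balances and the $20 transfer echo
# the earlier test scenarios; the other values are hypothetical.
starting_balances = [500, 20, 10]
transfer_amounts = [20, 100]
currencies = ["USD", "EUR"]


def build_scenarios(balances, amounts, currencies):
    """Return one test scenario per combination of the input values."""
    return [
        {"starting_balance": balance, "transfer_amount": amount, "currency": currency}
        for balance, amount, currency in product(balances, amounts, currencies)
    ]


scenarios = build_scenarios(starting_balances, transfer_amounts, currencies)
for scenario in scenarios:
    print(scenario)
</code></pre><p>Every new column multiplies the number of scenarios, which is one reason these tables grow quickly as functionality is added. 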
</p><p>The simplest tool may be a basic permutation, as illustrated below.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://sdv.ghost.io/content/images/2021/11/Article-07-2.png" class="kg-image" alt="Software Testing: Synthetic data changes the game" loading="lazy" width="1723" height="809" srcset="https://sdv.ghost.io/content/images/size/w600/2021/11/Article-07-2.png 600w, https://sdv.ghost.io/content/images/size/w1000/2021/11/Article-07-2.png 1000w, https://sdv.ghost.io/content/images/size/w1600/2021/11/Article-07-2.png 1600w, https://sdv.ghost.io/content/images/2021/11/Article-07-2.png 1723w" sizes="(min-width: 720px) 720px"><figcaption>A simple manual test data generation tool that uses permutations. The resulting scenarios -- with different starting balances, transfer amounts and transfer currencies -- are outputted as a data table.</figcaption></img></figure><p>A more sophisticated tool might allow you greater control over the rules the data must follow. It will also allow you to create more columns as your functionalities increase. For example, maybe the bank now offers two different account types: Premium and Normal. </p><p>Now you need a test data generation tool that can handle all of these variables and come out with something like this:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://sdv.ghost.io/content/images/2021/11/Article-11.png" class="kg-image" alt="Software Testing: Synthetic data changes the game" loading="lazy" width="1955" height="655" srcset="https://sdv.ghost.io/content/images/size/w600/2021/11/Article-11.png 600w, https://sdv.ghost.io/content/images/size/w1000/2021/11/Article-11.png 1000w, https://sdv.ghost.io/content/images/size/w1600/2021/11/Article-11.png 1600w, https://sdv.ghost.io/content/images/2021/11/Article-11.png 1955w" sizes="(min-width: 720px) 720px"><figcaption>A more sophisticated test data tool will allow you to specify rules manually. 
It will follow them to generate test data.</figcaption></img></figure><p>Many test data management tools use sophisticated logic to precisely create these data columns and their values. But the rules they use are manually written, and rely on human intuition and domain knowledge. For example:</p><ul><li>Account type = Premium 10% of the time and Normal 90% of the time</li><li>Starting balance is between $10,000 and $250,000 if Account type = Premium<br>or between -$1,000 and $10,000 if Account type = Normal</br></li><li>Transfer amount follows a bell curve with a mean of $7,500 and standard deviation of $1,000</li><li>Etc.</li></ul><p>There are downsides to this manual approach. It takes time and effort to come up with these rules, to keep track of them, and to update them as your application changes.</p><h3 id="adding-machine-learning">Adding Machine Learning</h3><p>Adopting machine learning (ML) opens up entirely new avenues in data generation. In the process, it gets rid of some of these downsides.</p><p>At a high level, ML-based software (such as the <a href="https://sdv.dev/blog/intro-to-sdv/">Synthetic Data Vault</a>) works in three steps:</p><ol><li>The user inputs real data into the ML software</li><li>The ML software automatically learns patterns in the data</li><li>The software outputs data that contains those patterns</li></ol><p>Let's go back to our banking example to see how this works. It's now 2021 and you're using <a href="https://sdv.dev/">the SDV</a> to generate your test data. You input all the transactions your bank has handled in the last week. </p><p>After modeling, the SDV outputs entirely new data that looks and behaves like the original. 
An illustration of this is shown below.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://sdv.ghost.io/content/images/2021/11/Article-10.png" class="kg-image" alt="Software Testing: Synthetic data changes the game" loading="lazy" width="2000" height="516" srcset="https://sdv.ghost.io/content/images/size/w600/2021/11/Article-10.png 600w, https://sdv.ghost.io/content/images/size/w1000/2021/11/Article-10.png 1000w, https://sdv.ghost.io/content/images/size/w1600/2021/11/Article-10.png 1600w, https://sdv.ghost.io/content/images/size/w2400/2021/11/Article-10.png 2400w" sizes="(min-width: 720px) 720px"><figcaption>With ML tools (like the SDV), you input real data into the software. The software then learns patterns from the data and outputs data that matches those patterns.</figcaption></img></figure><p>Notice that the output data contains many of the same properties as the original. The model learned all of the following information:</p><ul><li><strong>Ranges &amp; Categories.</strong> Transfers range from $5K to $10K. Bank accounts can be either premium or normal. Etc.</li><li><strong>Shapes.</strong> 10% of accounts are premium. Transfers follow a bell curve distribution with a mean of $7,500 and a standard deviation of $1,000. Etc.</li><li><strong>Correlations.</strong> Premium bank accounts tend to have higher balances ($10K to $250K) than normal accounts (-$1K to $10K).</li></ul><p>In other words: <strong>while the old test data management tools required you to manually come up with rules, ML-based tools learn these rules automatically.</strong> <strong> </strong>Moreover, they can learn new information. 
For example, the ML model picked up on a couple of extra correlations:</p><ul><li>Premium accounts are more likely to transfer foreign currency.</li><li>Normal accounts are more likely to be overdrawn (transfer more than their current balance).</li></ul><p>Using an ML-based data generation tool will help you ensure that your software is robust against these typical cases. And while manual data generation tools generate fake data, <strong>ML-based approaches generate what we call synthetic data.</strong></p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://sdv.ghost.io/content/images/2021/11/Article-08-1.png" class="kg-image" alt="Software Testing: Synthetic data changes the game" loading="lazy" width="1574" height="419" srcset="https://sdv.ghost.io/content/images/size/w600/2021/11/Article-08-1.png 600w, https://sdv.ghost.io/content/images/size/w1000/2021/11/Article-08-1.png 1000w, https://sdv.ghost.io/content/images/2021/11/Article-08-1.png 1574w" sizes="(min-width: 720px) 720px"><figcaption>Ask whether you had to input any real data or rules. Based on this, you'll know whether you are dealing with synthetic data or fake data.</figcaption></img></figure><p><strong>Benefits of Synthetic Data</strong></p><p>There are some clear advantages to using synthetic data over fake data, especially in software testing. Below, we've detailed a few.</p><ul><li><strong>Saves time with automation.</strong> Because ML automatically learns patterns from the real data, there is no need to spend a lot of time coming up with and inputting rules. ML may even learn rules that you would have missed.</li><li><strong>Is usable by non-experts. </strong>Realistic fake data can only be generated by domain experts, who know the precise rules governing the dataset. However, anyone can generate synthetic data. All they have to do is input the real data and the ML software takes care of the rest!</li><li><strong>Increases adaptability. 
</strong>Applications and data will inevitably change over time. It's easy to update synthetic data as this happens, simply by retraining the ML model with newer data.</li></ul><p>Benefits of synthetic data expand beyond software testing. The SDV Community is using synthetic data for an ever-increasing variety of tasks, including machine learning development, de-biasing datasets and scenario planning.</p><h3 id="key-takeaways">Key Takeaways</h3><p>In this article, we surveyed numerous ways of creating and using data  that is not real. In particular, we learned that:</p><ul><li>Creating fake data is not a novel concept. Older generations of tools will output fake data when given an explicit list of rules. This is especially useful for software testing.</li><li>Adding ML to this process is a newer evolution. Users input real data into the ML model, and it's able to automatically infer the rules. Data generated using ML-based systems is known as <strong>synthetic data</strong>.</li><li>Synthetic data's key advantages include its automation and adaptability. The uses of synthetic data expand beyond software testing.</li></ul><p>In future articles, we'll put ML models to the test! 
We'll uncover their strengths and weaknesses, and guide you through getting the most from synthetic data using the Synthetic Data Vault.</p>]]></content:encoded></item><item><title><![CDATA[Your Feedback in Action, Part 2: Data Workflow]]></title><description><![CDATA[After thousands of downloads, see how the synthetic data workflow in the SDV has evolved based on feedback from users.]]></description><link>https://sdv.dev/blog/community-feedback-workflow</link><guid isPermaLink="false">Ghost__Post__609c384488b3f9003e080016</guid><category><![CDATA[Open Source]]></category><dc:creator><![CDATA[Neha Patki]]></dc:creator><pubDate>Wed, 19 May 2021 16:52:14 GMT</pubDate><media:content url="https://sdv.ghost.io/content/images/2021/05/Banner-2-1.png" medium="image"/><content:encoded><![CDATA[<img src="https://sdv.ghost.io/content/images/2021/05/Banner-2-1.png" alt="Your Feedback in Action, Part 2: Data Workflow"/><p>The Synthetic Data Vault (SDV) is a software system that allows users all over the world to input a dataset and generate synthetic data. The SDV was born out of academic research at MIT -- but in 2018, we open-sourced it, so that people all over the world could use it.</p><p>Since then, we've been listening carefully to our community's feedback, making sure that we address any gaps between theoretical academic research and practical use. This article is the second in a multi-part series detailing recent improvements to the SDV that make it work in the real world. Here we'll discuss how we've amped up the data synthesis workflow. (For our previous discussion about how we've improved core models, see <a href="http://sdv.dev/blog/community-feedback-models">Part 1</a>.)</p><h3 id="what-are-workflows">What are workflows?</h3><p>We open-sourced the SDV not just to let users generate synthetic data, but also to allow them to <em>use</em> that data to solve real-world problems. 
Our community taught us that actually using the SDV involves a multi-step process — and that improving the system means paying attention to this entire workflow, not just the core machine learning.</p><p>According to our users, this workflow boils down to a few generalizable steps:</p><ol><li>Identifying real datasets that need to be synthesized</li><li>Transforming the datasets into a machine-readable format</li><li>Running the machine learning model</li><li>Synthesizing data according to particular specifications</li><li>Reversing the transformations such that the synthesized data looks like the original</li><li>Evaluating the synthesized data that results</li></ol><p>These steps are illustrated in the diagram below.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://sdv.ghost.io/content/images/2021/05/Community-Feedback--Part-2----1.png" class="kg-image" alt="Your Feedback in Action, Part 2: Data Workflow" loading="lazy" width="2000" height="1106" srcset="https://sdv.ghost.io/content/images/size/w600/2021/05/Community-Feedback--Part-2----1.png 600w, https://sdv.ghost.io/content/images/size/w1000/2021/05/Community-Feedback--Part-2----1.png 1000w, https://sdv.ghost.io/content/images/size/w1600/2021/05/Community-Feedback--Part-2----1.png 1600w, https://sdv.ghost.io/content/images/size/w2400/2021/05/Community-Feedback--Part-2----1.png 2400w" sizes="(min-width: 720px) 720px"><figcaption>The entire synthetic data workflow involves more than just modeling. Data also needs to be transformed, synthesized, reverse transformed, and evaluated.</figcaption></img></figure><p>The key insight from our users was that the application of machine learning models is only one step of a much larger puzzle. 
When the open source community helped us understand this, we were able to improve on the SDV software by adding in transformations, synthesizing options, and evaluation tools -- all detailed below.</p><h3 id="transforming-data">Transforming Data</h3><p>One major lesson from our open source community was how messy real-world datasets are compared to those used in academia. Academic datasets often come pre-sanitized and ready for numerical use. In the real world, however, databases are growing and changing constantly, and are often significantly different from the optimal yet theoretical structures used by machine learning researchers.</p><p>Two thorny data types frequently encountered in the real world are <em>datetimes</em> and <em>null values</em>.</p><ul><li><strong>Datetimes</strong> can follow many different formats, including YYYY-MM-DD or MM-DD-YY. However, machine learning models accept numerical values only, so datetimes are usually converted to Unix timestamps, defined as the number of seconds that have elapsed since midnight UTC on January 1, 1970. By this logic, the date 2021-01-01 (at midnight UTC) transforms into the number 1609459200.</li><li><strong>Null values</strong> also present a problem for mathematical models when they appear in numerical data. While users can tell models to ignore these values, the presence of a null might actually indicate something important, like a user declining to answer a question. 
To account for this, the SDV creates a new, binary column to address whether the original value is null.</li></ul><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://sdv.ghost.io/content/images/2021/05/Community-Feedback--Part-2----2-1.png" class="kg-image" alt="Your Feedback in Action, Part 2: Data Workflow" loading="lazy" width="2000" height="754" srcset="https://sdv.ghost.io/content/images/size/w600/2021/05/Community-Feedback--Part-2----2-1.png 600w, https://sdv.ghost.io/content/images/size/w1000/2021/05/Community-Feedback--Part-2----2-1.png 1000w, https://sdv.ghost.io/content/images/size/w1600/2021/05/Community-Feedback--Part-2----2-1.png 1600w, https://sdv.ghost.io/content/images/2021/05/Community-Feedback--Part-2----2-1.png 2280w" sizes="(min-width: 720px) 720px"><figcaption>When working with real-world datasets, it's necessary to apply transformations between real data and machine-readable data. This example transforms datetimes and null values.</figcaption></img></figure><p>To solve this problem, we introduced a new library called <a href="https://github.com/sdv-dev/RDT">Reversible Data Transforms</a> (RDT). The RDT library contains necessary logic for transforming different types of real world data to its machine-ready counterpart — as well as the logic for its reversal, so that a synthetic data user won't know the difference. The RDT is a standalone library that can reach beyond the synthetic data space, helping data scientists and academics across fields to clean their data. Since November 2020, the RDT has been supported on all major platforms including MacOS, Windows, and Linux.</p><h3 id="synthesizing-data-conditionally">Synthesizing Data Conditionally</h3><p>When we first imagined the SDV, we assumed users would simply want to use all the synthetic data generated by the model. 
However, we soon found that some users have more complex needs, and require more control over the data they synthesize — opening up new possibilities for synthetic data in the process.</p><p>For example, one of our users, an engineer, found a whole new use for SDV. The engineer was writing a machine learning classifier on a dataset when they noticed it was unbalanced. Applying any algorithms to this dataset would lead to biased models. The engineer realized that, if used strategically, SDV could actually debias the data — if it only generated data with rarer attributes, the synthetic data it created could be combined with the real data to form a fully balanced dataset.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://sdv.ghost.io/content/images/2021/05/Community-Feedback--Part-2----3.png" class="kg-image" alt="Your Feedback in Action, Part 2: Data Workflow" loading="lazy" width="1705" height="960" srcset="https://sdv.ghost.io/content/images/size/w600/2021/05/Community-Feedback--Part-2----3.png 600w, https://sdv.ghost.io/content/images/size/w1000/2021/05/Community-Feedback--Part-2----3.png 1000w, https://sdv.ghost.io/content/images/size/w1600/2021/05/Community-Feedback--Part-2----3.png 1600w, https://sdv.ghost.io/content/images/2021/05/Community-Feedback--Part-2----3.png 1705w" sizes="(min-width: 720px) 720px"><figcaption>Synthesized data can help remove bias by creating balanced datasets. In this example, synthesizing those rows that only correspond to females creates a balance between males and females.</figcaption></img></figure><p>In February of 2021, we added <a href="https://sdv.dev/SDV/user_guides/single_table/gaussian_copula.html#conditional-sampling">conditional sampling</a> to the SDV to enable this use case. Now, users can specify attributes or values that must be present in the synthesized data. 
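</p><p>The debiasing idea can be sketched in a few lines. The example below is a toy illustration that uses only the Python standard library: the sampler, the 80/20 split and the function names are hypothetical, and rejection sampling is just one simple way to honor a condition, not necessarily how the SDV implements conditional sampling.</p><pre><code class="language-python">import random
from collections import Counter


def sample_row(rng):
    """Toy stand-in for a fitted model that reproduces an 80/20 imbalance."""
    return {"gender": "M" if rng.random() > 0.2 else "F"}


def sample_with_condition(rng, column, value, num_rows):
    """Keep drawing rows until num_rows of them satisfy column == value."""
    kept = []
    while len(kept) != num_rows:
        row = sample_row(rng)
        if row[column] == value:
            kept.append(row)
    return kept


rng = random.Random(0)
real_rows = [{"gender": "M"}] * 8 + [{"gender": "F"}] * 2

# Request only the under-represented value, enough to even out the counts.
counts = Counter(row["gender"] for row in real_rows)
extra = sample_with_condition(rng, "gender", "F", counts["M"] - counts["F"])

balanced = real_rows + extra
print(Counter(row["gender"] for row in balanced))
</code></pre><p>Combining the real rows with the conditionally sampled rows yields equal counts for both groups, which mirrors the balanced dataset in the figure above.</p><p>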
In addition to debiasing datasets, users can use this feature to test hypothetical scenarios.</p><h3 id="evaluating-synthesized-data">Evaluating Synthesized Data</h3><p>When the entire system is working smoothly and outputting synthetic data, users still need to know: Is the data good enough to use? This vital question inspired us to add evaluation capabilities to the SDV. In doing so, we faced two key challenges: defining the metrics and creating a useful process.</p><p><strong>Metrics</strong></p><p>No single metric perfectly captures the different dimensions of synthetic data users may want to evaluate. Some want to preserve a high degree of mathematical likeness, others want to emphasize a particular column for machine learning predictions, and still others are more focused on threat models that can compromise privacy. </p><p>To address this, we created a separate library, <a href="https://github.com/sdv-dev/SDMetrics">SDMetrics</a>, to define evaluation metrics. The library now includes a suite of metrics that cover differentiation of synthetic and real data, statistical likeness, and privacy.</p><p><strong>Application</strong></p><p>Rather than apply metrics on an ad-hoc basis, some SDV power users were creating mini-workflows to rapidly test out different models, datasets and evaluation criteria in succession. Inspired by their innovation, we created <a href="https://github.com/sdv-dev/SDGym">SDGym</a>, a system that allows users to input models, datasets and success metrics to build a comprehensive evaluation framework.</p><h3 id="the-sdv-software-today">The SDV Software Today</h3><p>The SDV software is continuously evolving based on community feedback. In this article, we discussed improvements to the workflow surrounding synthetic data generation, including data transformations, sampling methods and evaluation tools. 
Earlier, in <a href="https://sdv.dev/blog/community-feedback-models">Part 1</a> of this series, we discussed the core synthetic data models themselves. In future blog articles, we plan to dig deeper into each of these areas, and to uncover new ones with you.</p><p>Like the SDV, this blog is a collaborative effort. Use our <a href="https://bit.ly/sdv-slack-invite">Slack</a> to let us know which topics you'd like to hear more about. And as always, use <a href="https://github.com/sdv-dev/SDV">GitHub</a> to file technical issues with the system. Working together, we can make SDV the most trusted, transparent and comprehensive platform for synthetic data generation!</p><p><em>For other inquiries, please contact <a href="mailto: info@sdv.dev">info@sdv.dev</a>.</em><br/></p>]]></content:encoded></item><item><title><![CDATA[Your Feedback in Action, Part 1: Data Models]]></title><description><![CDATA[After thousands of downloads, see how SDV's machine learning models have evolved based on feedback from users.]]></description><link>https://sdv.dev/community-feedback-models/</link><guid isPermaLink="false">Ghost__Post__609c351b88b3f9003e07ffb8</guid><category><![CDATA[Open Source]]></category><dc:creator><![CDATA[Neha Patki]]></dc:creator><pubDate>Wed, 12 May 2021 20:15:30 GMT</pubDate><media:content url="https://sdv.ghost.io/content/images/2021/05/Banner-2.png" medium="image"/><content:encoded><![CDATA[<img src="https://sdv.ghost.io/content/images/2021/05/Banner-2.png" alt="Your Feedback in Action, Part 1: Data Models"/><p>In our <a href="https://sdv.dev/blog/intro-to-sdv/">last post</a>, we introduced the <a href="https://github.com/sdv-dev/SDV">Synthetic Data Vault</a> (SDV) — a software system that allows users to input a dataset and generate synthetic data. 
The SDV was born out of academic research at MIT — but in 2018, we open-sourced it, so that people all over the world could use it.</p><p>Since then, we've been listening carefully to our community's feedback, making sure that we address any gaps between theoretical academic research and practical use. This multi-part series details recent improvements we've made so that SDV works in the real world. In this article, we focus on the machine learning-based modeling techniques that form the core of the system, while <a href="https://sdv.dev/blog/community-feedback-workflow/">Part 2</a> will cover the surrounding workflow.</p><h3 id="whats-in-a-model">What's in a model?</h3><p>At its core, the SDV is a set of machine learning models designed to understand and mimic real world data. Once the SDV creates a particular model, developers can generate synthetic data by sampling it. For synthetic data to be successful, this generative model must be correct — but through discussions with our open source community, we realized that there is no such thing as a single, winning approach that works every time. Each dataset and use case is different.<br/></p><p>Our solution is to provide choices, giving users all the necessary tools to make useful synthetic data for each new case at hand. Let's dive into three popular uses of the SDV where such options are available: Tabular models, sequential data and business logic.</p><h3 id="more-options-for-tabular-models">More Options for Tabular Models</h3><p>The earliest version of SDV was based on a classic statistical method: <a href="https://en.wikipedia.org/wiki/Copula_(probability_theory)">Gaussian Copulas</a>. This model is transparent by definition. It allows us to understand and exert control over formulas in the model, notably the distributions of each variable. This can be especially useful for business applications, where data often follows predictable distributions. 
For example, wind speed is often modeled with a <a href="https://en.wikipedia.org/wiki/Wind_power">Weibull distribution</a>, biological measures like height usually follow <a href="https://en.wikipedia.org/wiki/Normal_distribution#Occurrence_and_applications">normal distributions</a> and credit default rates often follow <a href="https://en.wikipedia.org/wiki/Exponential_distribution">exponential distributions</a>.</p><p>Meanwhile, advances in the AI space had also produced a robust alternative for those willing to sacrifice transparency: a deep learning technique called <a href="https://en.wikipedia.org/wiki/Generative_adversarial_network">Generative Adversarial Networks</a> (GANs). GANs model complex processes that don't follow known formulas. While these models' inner workings aren't easily explained by humans, they can produce highly accurate results. We created a GAN, called CTGAN, specifically for synthetic data. This black box model is especially good at figuring out complex correlations between variables in large datasets.</p><p>For a long time, SDV allowed users a choice between our Gaussian Copula-based model, called GaussianCopula, and CTGAN to model tabular data. While this choice provided some flexibility, our users reported they had a hard time choosing between such extreme alternatives. We wondered if a middle ground was possible: Could we specify distributions while also using GANs to identify complex correlations?</p><p>We couldn't find any model that fit both of these requirements, so we made our own! A key insight was that we could use Gaussian Copulas to understand the data and transform it before passing it to a GAN. 
The result is <a href="https://sdv.dev/SDV/user_guides/single_table/copulagan.html">CopulaGAN</a>, a hybrid model we released in October 2020.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://sdv.ghost.io/content/images/2021/05/Community-Feedback--Part-1----1.png" class="kg-image" alt="Your Feedback in Action, Part 1: Data Models" loading="lazy" width="2000" height="695" srcset="https://sdv.ghost.io/content/images/size/w600/2021/05/Community-Feedback--Part-1----1.png 600w, https://sdv.ghost.io/content/images/size/w1000/2021/05/Community-Feedback--Part-1----1.png 1000w, https://sdv.ghost.io/content/images/size/w1600/2021/05/Community-Feedback--Part-1----1.png 1600w, https://sdv.ghost.io/content/images/2021/05/Community-Feedback--Part-1----1.png 2100w" sizes="(min-width: 720px) 720px"><figcaption>CopulaGAN is in the middle of the spectrum, between simple, easily understood models (like GaussianCopula) and complex black box models (like CTGAN).</figcaption></img></figure><p>CopulaGAN combines the human accessibility of Gaussian Copulas with the robust accuracy of GANs. This innovation provides users with a new choice: a hybrid approach.</p><h3 id="the-special-case-of-sequential-data">The Special Case of Sequential Data</h3><p>Another tricky case pointed out by our users involved sequential data. While sequential data is stored in a table, it is unlike a regular table in that its rows are linked together, usually by a time component. This use case is extremely frequent, especially in finance — any table with credit card transactions, stock prices, or payments is almost certainly sequential. </p><p>At the time, solutions treated sequential data as a case of general tabular modeling. After all, sequential data is inside a table. 
However, these solutions failed to incorporate the key information that makes sequential data unique: the relationships that exist between rows.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://sdv.ghost.io/content/images/2021/05/Community-Feedback--Part-1----2.png" class="kg-image" alt="Your Feedback in Action, Part 1: Data Models" loading="lazy" width="1840" height="1210" srcset="https://sdv.ghost.io/content/images/size/w600/2021/05/Community-Feedback--Part-1----2.png 600w, https://sdv.ghost.io/content/images/size/w1000/2021/05/Community-Feedback--Part-1----2.png 1000w, https://sdv.ghost.io/content/images/size/w1600/2021/05/Community-Feedback--Part-1----2.png 1600w, https://sdv.ghost.io/content/images/2021/05/Community-Feedback--Part-1----2.png 1840w" sizes="(min-width: 720px) 720px"><figcaption>In this table of stock prices, rows that describe the same company (in this case, Google) are related to each other through time. Related rows are a special feature of sequential datasets.</figcaption></img></figure><p>While considering this pain point, we recognized sequential data as an entirely new case that required its own unique set of modeling techniques. In October 2020, we released our DeepEcho library, which focuses entirely on sequential data. We also introduced our <a href="https://sdv.dev/SDV/user_guides/timeseries/par.html">PAR model</a>, a probabilistic autoregressive neural network made specifically for sequential data.</p><h3 id="encoding-business-logic-using-constraints">Encoding Business Logic using Constraints</h3><p>Even with a plethora of modeling choices, it's vital to capture nuances in business logic while modeling synthetic data. This is due to differences in how humans and machines understand datasets.</p><p>Often, humans can easily glean the meaning of a dataset using context clues. Consider a table showing the names and ages of students and their legal guardians. 
A human will intuitively realize that a student must be younger than their guardian.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://sdv.ghost.io/content/images/2021/05/Community-Feedback--Part-1----3.png" class="kg-image" alt="Your Feedback in Action, Part 1: Data Models" loading="lazy" width="2000" height="969" srcset="https://sdv.ghost.io/content/images/size/w600/2021/05/Community-Feedback--Part-1----3.png 600w, https://sdv.ghost.io/content/images/size/w1000/2021/05/Community-Feedback--Part-1----3.png 1000w, https://sdv.ghost.io/content/images/size/w1600/2021/05/Community-Feedback--Part-1----3.png 1600w, https://sdv.ghost.io/content/images/2021/05/Community-Feedback--Part-1----3.png 2230w" sizes="(min-width: 720px) 720px"><figcaption>In this table of students and their guardians, the student is always younger than their guardian. This is a constraint that humans intuitively understand.</figcaption></img></figure><p>But will a machine understand the same rule? Because all of the SDV's models use statistics, they analyze trends generally — meaning that in this case, they will include a small possibility that a student could be older than their guardian. After all, is it totally out of the question that an older individual could enroll and list their child as their guardian? Either way, only a human expert can truly figure out what makes sense for this dataset!</p><p>To solve this pain point, SDV introduced the concept of <a href="https://sdv.dev/SDV/user_guides/single_table/constraints.html">constraints</a> in July 2020. Constraints give users the ability to encode their business knowledge and expertise into an SDV model. In our example, they could specify that a guardian's age must be greater than the student's. Currently, the GreaterThan and UniqueCombination constraints allow for easy handling of common scenarios. 
We also provide a blanket CustomConstraint class, which gives users flexibility to capture more nuanced knowledge.</p><h3 id="more-community-feedback">More Community Feedback</h3><p>We believe that the more humans and machines can work together, the more efficient our processes can become. In this article, we explained how user feedback about the SDV led to new core modeling techniques and innovations — enabling a system that now provides a choice of multiple models, handles sequential data, and understands constraints. In <a href="https://sdv.dev/blog/community-feedback-workflow/">Part 2</a>, we will discuss similar feedback-driven innovations in the rest of the workflow.</p><p>Using SDV — and giving us feedback — fuels this rapid evolution. To start a discussion, please message us on <a href="https://bit.ly/sdv-slack-invite">Slack</a> or file an issue on <a href="https://github.com/sdv-dev/SDV">GitHub</a>. Working together, we can make SDV the most trusted, transparent and comprehensive platform for synthetic data generation!</p><p><em>For other inquiries, please contact <strong>info@sdv.dev</strong>.</em><br/></p>]]></content:encoded></item><item><title><![CDATA[Meet the Synthetic Data Vault]]></title><description><![CDATA[Welcome to the SDV Blog! The SDV is a comprehensive, open source software for synthetic data generation. Join our growing community as we create an ecosystem to solve real world problems!]]></description><link>https://datacebo.com/blog/intro-to-sdv</link><guid isPermaLink="false">Ghost__Post__608c5562f9741d003b6f73b8</guid><category><![CDATA[Open Source]]></category><dc:creator><![CDATA[Neha Patki]]></dc:creator><pubDate>Tue, 04 May 2021 13:00:00 GMT</pubDate><media:content url="https://sdv.ghost.io/content/images/2021/05/blog-header--1-.png" medium="image"/><content:encoded><![CDATA[<img src="https://sdv.ghost.io/content/images/2021/05/blog-header--1-.png" alt="Meet the Synthetic Data Vault"/><p>Hello world! 
We, the creators of MIT's Synthetic Data Vault, warmly welcome you to our official blog. Soon we'll be using this space to deep-dive into topics related to our libraries, and to unpack ideas in the synthetic data space. We're looking forward to exploring this exciting area with you.</p><p>But first, we want to properly introduce our project: The <a href="https://github.com/sdv-dev/SDV">Synthetic Data Vault</a> (SDV), an open source software ecosystem for generating synthetic data. In this post, we’ll explain why synthetic data is important, and tell the story of how we created the vault. We’ll also lay out what’s in store — and how you can get involved. Let’s get started with a brief overview.</p><h3 id="synthetic-data-what">Synthetic Data What?</h3><p>Synthetic data is a bold new frontier in machine learning. It allows developers to share and use data more effectively.</p><p>It may seem counterintuitive, but although billions of gigabytes of data are produced every day, there are still huge gaps in what developers are actually able to use. Accessibility concerns, regulatory issues and imbalanced datasets can all keep experts from using data. This impedes progress in finance, health care and other domains.</p><p>Good synthetic data can fill these gaps. The SDV uses machine learning to analyze data. Then, it creates fully synthetic datasets that mimic the original. Although the synthetic data is entirely machine generated, it maintains the original format and mathematical properties. This makes synthetic data versatile. It can completely replace the existing data in a workflow, or it can supplement the data to enhance its utility. Already, our users have successfully used the SDV to augment datasets, test applications, remove bias and more.</p><h3 id="a-history-of-the-sdv">A History of the SDV</h3><p>Our story starts in 2013. In MIT's Laboratory for Information and Decision Systems (LIDS), we were working on general data science projects. 
We had developed new techniques, and we were excited to test them on real datasets. However, as soon as we asked for the data, we hit roadblocks. The process for getting access to data turned out to be much more complex than we anticipated, with numerous regulations and layers of security red tape.</p><p>We wondered: What if we didn't need the real data in the first place? If we had synthetic data with the same mathematical properties as the original, it would be much easier for everyone to share and use.</p><p>In 2016, we released a paper describing the very first iteration of the <a href="https://dai.lids.mit.edu/wp-content/uploads/2018/03/SDV.pdf">SDV</a>. It introduced a novel technique for synthesizing multi-table data, and included trials where data scientists successfully used synthetic data instead of real data for machine learning tasks. Related research to come out of the lab included <a href="https://arxiv.org/pdf/1907.00503.pdf">CTGAN</a>, a novel approach to generating synthetic data using deep learning.</p><p>After these successes in the research community, we decided to move beyond purely academic solutions. Synthetic data has the potential to solve real-world problems faced by people on all sides of data science: internal developers writing software, external contractors working offshore, third-party partners offering services and even the end users who create the data. After some pilot testing on enterprise applications, we open sourced our work in 2018, publishing <a href="https://pypi.org/project/sdv/">sdv on PyPI</a> for general use. Open sourcing offered ample opportunities for collaboration and customization. 
It allowed users all over the world to test the SDV in enterprise settings, and helped the SDV ecosystem evolve into a one-stop shop for synthetic data needs!</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://sdv.ghost.io/content/images/2021/04/Blog-Map.png" class="kg-image" alt="Meet the Synthetic Data Vault" loading="lazy" width="2000" height="1279" srcset="https://sdv.ghost.io/content/images/size/w600/2021/04/Blog-Map.png 600w, https://sdv.ghost.io/content/images/size/w1000/2021/04/Blog-Map.png 1000w, https://sdv.ghost.io/content/images/size/w1600/2021/04/Blog-Map.png 1600w, https://sdv.ghost.io/content/images/size/w2400/2021/04/Blog-Map.png 2400w" sizes="(min-width: 720px) 720px"><figcaption>Users all over the world are using our software to create synthetic data. This map shows the total downloads* of <a href="https://github.com/sdv-dev/CTGAN">CTGAN</a> (our most popular synthetic data model) per continent.</figcaption></img></figure><p>We listened to feedback and, as of today, have made 93 releases (across all our libraries), addressing 504 issues. We have been thrilled to see a burgeoning community of invested users applying the SDV to solve problems. We've seen over 200K user downloads from PyPI, 400 stars in the SDV <a href="https://github.com/sdv-dev/SDV">GitHub repository</a> and 200 developers in our Slack channel. Our community is global and includes people in diverse roles: academics, data scientists, operations managers, engineers and more. We are continually learning from our community, and we're excited to bring new innovations to you!</p><h3 id="just-the-beginning">Just the Beginning</h3><p>Synthetic data has the potential to revolutionize the entire field of data science, allowing us to solve problems that once seemed untouchable. We want the Synthetic Data Vault to be the most trusted, transparent and comprehensive platform for synthetic data generation, but we can't do it without our users. 
It's our ever-growing open source community that allows us to quickly repair bugs, triage feature requests and improve the software to serve a variety of real-world needs.</p><p>That’s where you come in. If you’re already a member of this community, we can’t thank you enough. And if you’d like to get involved, see below for ways to get started. Either way, watch this space for more nuanced discussions about synthetic data. We're excited to share what we've learned from you, and show how we are collectively improving the ecosystem. It’s time to open the vault!</p><p><strong>Want more ways to get involved?</strong></p><ul><li>Follow us on Twitter <a href="https://twitter.com/sdv_dev">@sdv_dev</a> for release announcements, blog updates and more</li><li>Join our <a href="https://bit.ly/sdv-slack-invite">Slack</a> community to meet other users, discuss synthetic data solutions and suggest topics for the blog</li><li>Visit &amp; star our <a href="https://github.com/sdv-dev">GitHub repositories</a></li><li>If you've successfully used the SDV for your project, share your experience and tag us</li></ul><p>For other inquiries, please contact us at <em><strong>info@sdv.dev</strong></em>.<br/></p><p><em>*Total download statistics per continent come from the </em><a href="https://github.com/pypa/linehaul"><em>Linehaul project</em></a><em> using BigQuery, and include mirrors. Are you aware of more accurate ways to count Python package downloads? Let us know!</em></p>]]></content:encoded></item></channel></rss>