SOS: Score-based Oversampling for Tabular Data | Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (2022)

research-article

Authors: Jayoung Kim, Chaejeong Lee, Yehjin Shin, Sewon Park, Minjung Kim, Noseong Park, and Jihoon Cho

KDD '22: Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining

August 2022

Pages 762 - 772

Published: 14 August 2022

    Abstract

    Score-based generative models (SGMs) are a recent breakthrough in generating fake images. SGMs are known to surpass other generative models, e.g., generative adversarial networks (GANs) and variational autoencoders (VAEs). Inspired by their success, in this work we fully customize them for generating fake tabular data. In particular, we are interested in oversampling minor classes, since imbalanced classes frequently lead to sub-optimal training outcomes. To our knowledge, we are the first to present a score-based tabular data oversampling method. First, we redesign the score network to process tabular data. Second, we propose two options for our generation method: the former is equivalent to style transfer for tabular data, and the latter uses the standard generative policy of SGMs. Lastly, we define a fine-tuning method that further enhances the oversampling quality. In our experiments with 6 datasets and 10 baselines, our method outperforms other oversampling methods in all cases.
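To make the oversampling setting in the abstract concrete, the following is a minimal sketch of how minor classes can be topped up to the size of the largest class. It is not the authors' method: the names `oversampling_plan`, `oversample`, and `jitter_generator` are hypothetical, and the jitter-based generator is only a placeholder for whatever per-class generator is used (in the paper, a trained score-based model).

```python
from collections import Counter
import random


def oversampling_plan(labels):
    """Return how many synthetic rows each class needs to match the largest class."""
    counts = Counter(labels)
    target = max(counts.values())
    return {cls: target - n for cls, n in counts.items()}


def oversample(rows, labels, generate, seed=0):
    """Append synthetic rows until every class reaches the largest class size.

    `generate(class_rows, rng)` is a hypothetical stand-in for any per-class
    generator; it must return one new row for the given class.
    """
    rng = random.Random(seed)
    plan = oversampling_plan(labels)
    new_rows, new_labels = list(rows), list(labels)
    for cls, missing in plan.items():
        # Real rows of this class, used as conditioning data for the generator.
        class_rows = [r for r, y in zip(rows, labels) if y == cls]
        for _ in range(missing):
            new_rows.append(generate(class_rows, rng))
            new_labels.append(cls)
    return new_rows, new_labels


def jitter_generator(class_rows, rng, scale=0.05):
    """Placeholder generator (an assumption): copy a real row and add Gaussian noise."""
    base = rng.choice(class_rows)
    return [v + rng.gauss(0.0, scale) for v in base]


# Toy imbalanced table: four rows of class 0, one row of class 1.
rows = [[0.1, 1.0], [0.2, 0.9], [0.3, 1.1], [0.2, 1.2], [2.0, 3.0]]
labels = [0, 0, 0, 0, 1]
balanced_rows, balanced_labels = oversample(rows, labels, jitter_generator)
print(Counter(balanced_labels))
```

Keeping the generator as a pluggable argument separates the balancing logic from the generative model, so a stronger generator could be substituted without changing how the class counts are equalized.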


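    Cited By

    • Liu T, Fan J, Tang N, Li G, Du X (2024). Controllable Tabular Data Synthesis Using Diffusion Models. Proceedings of the ACM on Management of Data 2:1 (1-29). DOI: 10.1145/3639283. Online publication date: 26-Mar-2024

      https://dl.acm.org/doi/10.1145/3639283

    • Lee C, Kim J, Park N (2023). CoDi. In Krause A, Brunskill E, Cho K, Engelhardt B, Sabato S, Scarlett J (eds.), Proceedings of the 40th International Conference on Machine Learning (18940-18956). DOI: 10.5555/3618408.3619190. Online publication date: 23-Jul-2023

      https://dl.acm.org/doi/10.5555/3618408.3619190

    • Kotelnikov A, Baranchuk D, Rubachev I, Babenko A (2023). TabDDPM. In Krause A, Brunskill E, Cho K, Engelhardt B, Sabato S, Scarlett J (eds.), Proceedings of the 40th International Conference on Machine Learning (17564-17579). DOI: 10.5555/3618408.3619133. Online publication date: 23-Jul-2023

      https://dl.acm.org/doi/10.5555/3618408.3619133

    • Lim H, Park S, Kim M, Lee J, Lim S, Park N (2023). MadSGM: Multivariate Anomaly Detection with Score-based Generative Models. In Frommholz I, Hopfgartner F, Lee M, Oakes M, Lalmas M, Zhang M, Santos R (eds.), Proceedings of the 32nd ACM International Conference on Information and Knowledge Management (1411-1420). DOI: 10.1145/3583780.3614956. Online publication date: 21-Oct-2023

      https://dl.acm.org/doi/10.1145/3583780.3614956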

    Index Terms

    1. SOS: Score-based Oversampling for Tabular Data

      1. Computing methodologies

        1. Artificial intelligence


      Information & Contributors

      Information

      Published In

      KDD '22: Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining

      August 2022

      5033 pages

      ISBN:9781450393850

      DOI:10.1145/3534678

      • General Chairs: Aidong Zhang (University of Virginia), Huzefa Rangwala (Amazon/George Mason University)

      Copyright © 2022 ACM.

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [emailprotected]

      Sponsors

      • SIGMOD: ACM Special Interest Group on Management of Data
      • SIGKDD: ACM Special Interest Group on Knowledge Discovery in Data

      Publisher

      Association for Computing Machinery

      New York, NY, United States


      Author Tags

      1. oversampling
      2. score-based generative model
      3. tabular data synthesis

      Qualifiers

      • Research-article

      Funding Sources

      • IITP

      Conference

      KDD '22

      Sponsor:

      • SIGMOD
      • SIGKDD

      Acceptance Rates

      Overall Acceptance Rate 1,133 of 8,635 submissions, 13%


