😊 A Survey of Data Synthesis Approaches

Overview figure: A Survey of Data Synthesis Approaches


Pipeline of Synthesizing Data

  1. Augmentation Objectives: Identifying needs for synthetic data
  2. Synthetic Data Generation: Creating synthetic data using various methods
  3. Post-Processing: Refining and ensuring the quality of generated data

Augmentation Objectives

We categorize these objectives into four types: Enhancing Diversity, Balancing Data Sets, Addressing Domain Shifts, and Managing Edge Cases. A single data augmentation method is not necessarily limited to addressing only one of these objectives.

1. Enhancing Diversity:

Reduces the possibility of overfitting, resulting in better generalization capabilities.

2. Balancing Data Sets:

Provides more balanced training data for minority classes.

3. Addressing Domain Shifts:

Adapts to differences in data distribution across domains and tasks.

4. Managing Edge Cases:

Expands the variety of training data by introducing rare but plausible scenarios.

Augmentation Approaches

We introduce various approaches to generating synthetic data and categorize them into four types: Expert Knowledge, Direct Training, Pre-train then Fine-tune, and Foundation Models without Fine-tuning. Techniques for generating synthetic data often align with the prevailing machine learning methodologies of their time.

1. Leveraging Expert Knowledge

  • Creation: The methods include synonym replacement (Wei and Zou, 2019; Zhang et al., 2016) and random insertion of words (Zhu et al., 2022); a sketch of both operations follows this list.

  • Transformation: The methods include dispersing punctuation marks throughout the text (Karimi et al., 2021) or otherwise changing the structure or format of the original text.

  • Hybrid: Mapping data from a specific domain to the distribution of a general domain, then augmenting by retrieving similar data from the general domain, combines feature transformation with feature creation (Chen et al., 2021).

    • Limitations:
      1. Performance gain can be marginal when data is sufficient.
      2. Knowledge-based engineering typically generates synthetic samples through synonym replacement or structural adjustments, which do not change the labels of the original data, so any class imbalance in the dataset remains.
    • Advantages:
      1. Fast and simple.
      2. Performance gains on small datasets are clear.
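
A minimal sketch of the creation-style operations above, in the spirit of EDA (Wei and Zou, 2019): synonym replacement and random insertion based on WordNet. The helper functions and the example sentence are illustrative, the sketch assumes NLTK with the WordNet corpus is installed, and it assumes the augmented sentence keeps the original label.

```python
# Knowledge-based augmentation sketch: synonym replacement and random insertion.
# Assumes NLTK with the WordNet corpus (nltk.download("wordnet")).
import random
from nltk.corpus import wordnet


def get_synonyms(word):
    """Collect WordNet synonyms of `word`, excluding the word itself."""
    synonyms = set()
    for synset in wordnet.synsets(word):
        for lemma in synset.lemmas():
            candidate = lemma.name().replace("_", " ")
            if candidate.lower() != word.lower():
                synonyms.add(candidate)
    return list(synonyms)


def synonym_replacement(sentence, n=1):
    """Replace up to n randomly chosen words with one of their synonyms."""
    words = sentence.split()
    candidates = [i for i, w in enumerate(words) if get_synonyms(w)]
    random.shuffle(candidates)
    for i in candidates[:n]:
        words[i] = random.choice(get_synonyms(words[i]))
    return " ".join(words)


def random_insertion(sentence, n=1):
    """Insert synonyms of randomly chosen words at random positions, n times."""
    words = sentence.split()
    for _ in range(n):
        candidates = [w for w in words if get_synonyms(w)]
        if not candidates:
            break
        synonym = random.choice(get_synonyms(random.choice(candidates)))
        words.insert(random.randint(0, len(words)), synonym)
    return " ".join(words)


print(synonym_replacement("the movie was surprisingly good", n=1))
print(random_insertion("the movie was surprisingly good", n=1))
```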

2. Direct Model Training

Before the widespread adoption of pre-trained models, a common approach was to develop a model trained exclusively on data specific to the task at hand and use it to synthesize new data. The key characteristic of this approach is that the augmentation model does not leverage any pre-existing models or datasets; it starts from scratch, learning exclusively from the task-specific dataset. A minimal training sketch follows the list below.

  • RNN (Kobayashi, 2018; Xu et al., 2016; Fadaee et al., 2017)

  • CNN (Guo et al., 2019)

  • Limitations: The main limitation is its reliance on large amounts of labeled data for training, which is not always readily available.

  • Advantages: Trained models generate more diverse and realistic data than knowledge-based engineering methods, which can help improve the robustness and generalization of the main model.
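
As one concrete illustration of this paradigm, the sketch below trains a small word-level LSTM language model from scratch on a toy task-specific corpus and then samples a new synthetic sentence from it. The architecture, hyperparameters, and corpus are illustrative placeholders rather than the setups of the cited papers, and the sketch assumes PyTorch is available.

```python
# Direct training sketch: an LSTM language model learned only from task data,
# then sampled to produce new synthetic sentences.
import torch
import torch.nn as nn

# Toy task-specific corpus; no pre-existing model or external data is used.
corpus = [
    "the service was fast and friendly",
    "the food was cold and bland",
    "great atmosphere and helpful staff",
]

# Vocabulary built from the task data alone.
tokens = sorted({w for s in corpus for w in s.split()} | {"<bos>", "<eos>"})
stoi = {w: i for i, w in enumerate(tokens)}
itos = {i: w for w, i in stoi.items()}


class LSTMGenerator(nn.Module):
    def __init__(self, vocab_size, emb_dim=32, hidden_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, x, state=None):
        output, state = self.lstm(self.embed(x), state)
        return self.out(output), state


model = LSTMGenerator(len(tokens))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

# Train with next-word prediction on the task-specific sentences.
for epoch in range(200):
    for sentence in corpus:
        ids = [stoi["<bos>"]] + [stoi[w] for w in sentence.split()] + [stoi["<eos>"]]
        x = torch.tensor(ids[:-1]).unsqueeze(0)
        y = torch.tensor(ids[1:]).unsqueeze(0)
        logits, _ = model(x)
        loss = loss_fn(logits.squeeze(0), y.squeeze(0))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# Sample a synthetic sentence token by token.
model.eval()
with torch.no_grad():
    token, state, generated = torch.tensor([[stoi["<bos>"]]]), None, []
    for _ in range(20):
        logits, state = model(token, state)
        probs = torch.softmax(logits[0, -1], dim=-1)
        next_id = torch.multinomial(probs, 1).item()
        if itos[next_id] == "<eos>":
            break
        generated.append(itos[next_id])
        token = torch.tensor([[next_id]])
print(" ".join(generated))
```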

3. Pre-training followed by Fine-tuning

This section covers augmentation techniques under the pre-train then fine-tune paradigm. During pre-training, a model learns meaningful data representations on a large related dataset using unsupervised learning. Then, the pre-trained model is fine-tuned on a smaller labeled dataset for the target task, adapting its parameters to that specific task.

  • Limitations: Pre-trained models are prone to overfitting when fine-tuned on small amounts of data, which can lead to domain shift when they are used for data augmentation.
  • Advantages: Compared to direct training, pre-trained models do not require extensive data for fine-tuning to achieve similar or even superior performance, which is especially valuable for data augmentation, where data is often scarce. A minimal fine-tuning sketch follows this list.
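
A minimal sketch of this paradigm, loosely inspired by class-conditional generation: a pre-trained GPT-2 is fine-tuned on a handful of "label: text" pairs and then prompted with a label to synthesize new examples for that class. The model choice, toy data, and hyperparameters are illustrative; the sketch assumes the Hugging Face transformers library and PyTorch.

```python
# Pre-train then fine-tune sketch: adapt a pre-trained GPT-2 to a small
# labeled dataset, then generate label-conditioned synthetic examples.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Small labeled dataset for the target task, formatted as "label: text".
train_texts = [
    "positive: the battery lasts all day and charges quickly",
    "negative: the screen cracked after a single drop",
    "positive: setup took less than five minutes",
    "negative: customer support never answered my emails",
]

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
model.train()
for epoch in range(3):  # a few passes over the tiny fine-tuning set
    for text in train_texts:
        inputs = tokenizer(text + tokenizer.eos_token, return_tensors="pt")
        outputs = model(**inputs, labels=inputs["input_ids"])
        outputs.loss.backward()
        optimizer.step()
        optimizer.zero_grad()

# Prompt the fine-tuned model with a label to generate a synthetic example.
model.eval()
prompt = tokenizer("negative:", return_tensors="pt")
generated = model.generate(
    **prompt,
    max_new_tokens=20,
    do_sample=True,
    top_p=0.9,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(generated[0], skip_special_tokens=True))
```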

4. Utilizing Foundation Models without Fine-tuning

Companies have published foundation models that often exhibit excellent performance on downstream tasks without the need for additional fine-tuning. Synthesizing data through prompt design, for example zero-shot prompting, in-context learning, or dialogue with an LLM, has gained a lot of popularity; a minimal prompting sketch follows the list below.

  • Limitations: The synthetic data generated by foundation models may not be as tailored to specific domain needs as data from fine-tuned models, which can lead to less accurate or less effective data for training downstream models.
  • Advantages: Using foundation models directly allows for quicker deployment because there’s no need for an additional fine-tuning phase.
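
A minimal sketch of prompting a foundation model to synthesize data without any fine-tuning, combining a zero-shot instruction with a few in-context examples. It assumes an OpenAI-compatible API through the openai Python package; the model name, prompt, and seed examples are illustrative.

```python
# Foundation-model prompting sketch: in-context generation of minority-class
# examples without any fine-tuning.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# A few seed examples are shown in-context; the model is asked to extend
# the pattern with new, diverse examples in the same format.
prompt = (
    "You generate training data for a sentiment classifier.\n"
    "Here are labeled examples:\n"
    "negative: the app crashes every time I open the camera\n"
    "negative: refunds take weeks and support is unresponsive\n"
    "Write 5 new, diverse 'negative' examples in the same format."
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative; any capable instruction-tuned model
    messages=[{"role": "user", "content": prompt}],
    temperature=1.0,
)
print(response.choices[0].message.content)
```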

Post-Processing

1. Ensuring Basic Quality

Basic quality encompasses elements such as fluency, grammatical accuracy, and format validation, among others.

  • Fluency: the SLOR metric can be used to evaluate fluency, and GPT-4 can be used to mimic human evaluations (Abdulin et al.); see the sketch after this list.
  • Format validation: regular expressions can ensure the correct format (Lee et al.).
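
A minimal sketch of both checks: a regular-expression format validator and the SLOR fluency score, SLOR(s) = (log P_LM(s) - log P_unigram(s)) / |s|. The regex pattern and the log-probability values below are illustrative placeholders; in practice the log-probabilities would come from a language model and a unigram model estimated on a reference corpus.

```python
# Basic-quality checks sketch: format validation with a regular expression
# and fluency scoring with SLOR.
import re


def is_valid_format(line):
    """Check that a synthetic example matches the expected 'label: text' form."""
    return re.fullmatch(r"(positive|negative): .+", line.strip()) is not None


def slor(lm_logprob, unigram_logprobs):
    """SLOR: the sentence log-probability under a language model, with the
    unigram log-probability subtracted and normalized by sentence length,
    so rare but fluent sentences are not penalized."""
    return (lm_logprob - sum(unigram_logprobs)) / len(unigram_logprobs)


print(is_valid_format("negative: shipping took three weeks"))  # True
print(is_valid_format("shipping took three weeks"))            # False

# Placeholder log-probabilities for a 5-token sentence.
print(slor(lm_logprob=-18.2, unigram_logprobs=[-7.1, -4.9, -6.3, -5.0, -8.4]))
```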

2. Maintaining Label Consistency

Avoids discrepancies between the synthetic data and its labels.
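
One common way to enforce this is consistency filtering: a classifier trained on the original labeled data re-labels each synthetic example, and examples whose assigned label disagrees with a confident prediction are discarded. The sketch below uses toy data and a hypothetical 0.7 confidence threshold, and assumes scikit-learn.

```python
# Label-consistency filtering sketch: keep only synthetic examples whose
# assigned label matches a confident prediction from a classifier trained
# on the original data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

original_texts = [
    "the battery lasts all day", "works perfectly out of the box",
    "the screen cracked after a week", "support never replied to me",
]
original_labels = ["positive", "positive", "negative", "negative"]

# (assigned label, synthetic text) pairs produced by some generator.
synthetic = [
    ("positive", "charges quickly and feels sturdy"),
    ("negative", "charges quickly and feels sturdy"),  # likely mislabeled
]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(original_texts, original_labels)

kept = []
for label, text in synthetic:
    predicted = clf.predict([text])[0]
    confidence = clf.predict_proba([text]).max()
    # Keep the example only if the classifier confidently agrees with its label.
    if predicted == label and confidence >= 0.7:
        kept.append((label, text))

print(kept)
```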

3. Aligning Data Distribution

Enhances the diversity and generalizability of the synthetic dataset.
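
One simple heuristic in this spirit, sketched below under stated assumptions: embed real and synthetic examples with TF-IDF, drop synthetic examples that drift too far from the real data, and drop near-duplicates of examples already kept, so the retained set stays both on-distribution and diverse. The thresholds and data are illustrative; the sketch assumes scikit-learn.

```python
# Distribution-alignment sketch: filter synthetic examples by similarity to
# the real data and by redundancy with already-kept synthetic examples.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

real = [
    "the battery lasts all day",
    "the screen cracked after a week",
    "support never replied to me",
]
synthetic = [
    "battery life is excellent on long trips",
    "battery life is excellent on trips",        # near-duplicate
    "my cat enjoys sleeping on the windowsill",  # off-topic for the task
]

vectorizer = TfidfVectorizer().fit(real + synthetic)
real_vecs = vectorizer.transform(real)
syn_vecs = vectorizer.transform(synthetic)

kept = []
for i, text in enumerate(synthetic):
    vec = syn_vecs[i]
    # Require at least some overlap with the real data distribution.
    if cosine_similarity(vec, real_vecs).max() < 0.1:
        continue
    # Skip near-duplicates of synthetic examples that were already kept.
    if kept and cosine_similarity(vec, syn_vecs[kept]).max() > 0.9:
        continue
    kept.append(i)

print([synthetic[i] for i in kept])
```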

Future Work

1. Shifting Focus from Quantity to Quality

As the volume of data reaches a certain threshold, the incremental gains in model performance begin to diminish. The emerging trend is toward enabling models to learn effectively from smaller but higher-quality datasets.

2. Evaluating the Impact of Augmented Data

Creating a standard benchmark for evaluating data augmentation techniques, focusing on their quality, diversity, and relevance, is a key but complex challenge in advancing machine learning. Despite these difficulties, creating a strong benchmark is crucial, as it could greatly help in developing more effective and flexible augmentation methods.

3. Expanding to Multi-Modal Data Augmentation

Currently, there are relatively few studies that focus on multi-modal data augmentation, even though this area holds significant potential for enhancing model performance in complex tasks.

Citation

@misc{chang2024surveydatasynthesisapproaches,
      title={A Survey of Data Synthesis Approaches}, 
      author={Hsin-Yu Chang and Pei-Yu Chen and Tun-Hsiang Chou and Chang-Sheng Kao and Hsuan-Yun Yu and Yen-Ting Lin and Yun-Nung Chen},
      year={2024},
      eprint={2407.03672},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2407.03672}, 
}
