NVIDIA Releases Open Synthetic Data Generation Pipeline for Training Large Language Models
“Nemotron-4-340B-Instruct is a large language model (LLM) that can be used as part of a synthetic data generation pipeline to create training data that helps researchers and developers build their own LLMs. It is a fine-tuned version of the Nemotron-4-340B-Base model, optimized for English-based single and multi-turn chat use-cases. It supports a context length of 4,096 tokens.
The base model was pre-trained on a corpus of 9 trillion tokens consisting of a diverse assortment of English based texts, 50+ natural languages, and 40+ coding languages. Subsequently the Nemotron-4-340B-Instruct model went through additional alignment steps including:
- Supervised Fine-tuning (SFT)
- Direct Preference Optimization (DPO)
- Reward-aware Preference Optimization (RPO) (Additional in-house alignment technique)
Throughout the alignment process, we relied on only approximately 20K human-annotated data while our data generation pipeline synthesized over 98% of the data used for supervised fine-tuning and preference fine-tuning (DPO & RPO). We provide comprehensive details about our synthetic data generation pipeline in the technical report…”
Source: blogs.nvidia.com/blog/nemotron-4-synthetic-data-generation-llm-training/
Link: https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/nemotron-4-340b-instruct
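
For readers unfamiliar with the preference-tuning step named in the quoted model card, here is a minimal sketch of the per-example Direct Preference Optimization loss as it is commonly stated in the DPO literature. It is not NVIDIA's implementation. The function name `dpo_loss`, its parameter names, and the default `beta=0.1` are illustrative assumptions; the inputs are assumed to be summed log-probabilities of a preferred ("chosen") and a dispreferred ("rejected") response under the model being trained and under a frozen reference model. RPO, described in the quote as an in-house technique, is not sketched here.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss: -log(sigmoid(beta * margin)).

    All names and the beta default are illustrative assumptions,
    not values taken from the Nemotron-4 model card.
    """
    # How much more (in log-prob) the policy favors each response
    # compared with the frozen reference model.
    chosen_shift = policy_chosen_logp - ref_chosen_logp
    rejected_shift = policy_rejected_logp - ref_rejected_logp

    # Positive margin means the policy has moved toward the chosen response.
    x = beta * (chosen_shift - rejected_shift)

    # -log(sigmoid(x)) == log(1 + exp(-x)); branch for numerical stability.
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))


if __name__ == "__main__":
    # Policy identical to the reference: the loss equals log(2).
    print(dpo_loss(-1.0, -1.0, -1.0, -1.0))
    # Policy has shifted toward the chosen response: the loss is smaller.
    print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
```

In training, this quantity would be averaged over a batch of preference pairs and minimized with respect to the policy's parameters while the reference model stays fixed.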