NVIDIA Releases Open Synthetic Data Generation Pipeline for Training Large Language Models

“Nemotron-4-340B-Instruct is a large language model (LLM) that can be used as part of a synthetic data generation pipeline to create training data that helps researchers and developers build their own LLMs. It is a fine-tuned version of the Nemotron-4-340B-Base model, optimized for English-based single and multi-turn chat use-cases. It supports a context length of 4,096 tokens.

The base model was pre-trained on a corpus of 9 trillion tokens consisting of a diverse assortment of English based texts, 50+ natural languages, and 40+ coding languages. Subsequently the Nemotron-4-340B-Instruct model went through additional alignment steps including:

Supervised Fine-tuning (SFT)
Direct Preference Optimization (DPO)
Reward-aware Preference Optimization (RPO) (Additional in-house alignment technique)

Throughout the alignment process, we relied on only approximately 20K human-annotated data while our data generation pipeline synthesized over 98% of the data used for supervised fine-tuning and preference fine-tuning (DPO & RPO). We provide comprehensive details about our synthetic data generation pipeline in the technical report…”

Source: blogs.nvidia.com/blog/nemotron-4-synthetic-data-generation-llm-training/

Link: https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/nemotron-4-340b-instruct

June 18, 2024

0 Comments

Inline Feedbacks

View all comments

Request a Quote

Log In

NVIDIA Releases Open Synthetic Data Generation Pipeline for Training Large Language Models

NVIDIA Releases Open Synthetic Data Generation Pipeline for Training Large Language Models

NVIDIA Releases Open Synthetic Data Generation Pipeline for Training Large Language Models