NVIDIA Releases Open Synthetic Data Generation Pipeline for Training Large Language Models


“Nemotron-4-340B-Instruct is a large language model (LLM) that can be used as part of a synthetic data generation pipeline to create training data that helps researchers and developers build their own LLMs. It is a fine-tuned version of the Nemotron-4-340B-Base model, optimized for English-language single- and multi-turn chat use cases. It supports a context length of 4,096 tokens.

The base model was pre-trained on a corpus of 9 trillion tokens consisting of a diverse assortment of English-based texts, 50+ natural languages, and 40+ coding languages. Subsequently, the Nemotron-4-340B-Instruct model went through additional alignment steps, including:

- Supervised Fine-tuning (SFT)
- Direct Preference Optimization (DPO)
- Reward-aware Preference Optimization (RPO)

Throughout the alignment process, we relied on only approximately 20K human-annotated examples, while our data generation pipeline synthesized over 98% of the data used for supervised fine-tuning and preference fine-tuning (DPO & RPO). We provide comprehensive details about our synthetic data generation pipeline in the technical report…”
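To make the idea concrete, here is a minimal Python sketch of what such a synthetic data generation loop can look like: the instruct model first writes an instruction, then answers it, yielding an (instruction, response) pair for fine-tuning. The endpoint URL, model identifier, prompt wording, and token limits below are illustrative assumptions, not details taken from the announcement.

```python
# Minimal sketch of a synthetic data generation loop, assuming access to
# Nemotron-4-340B-Instruct through an OpenAI-compatible hosted endpoint.
# Endpoint URL, model id, and prompts are assumptions for illustration.
from openai import OpenAI

client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",  # assumed hosted endpoint
    api_key="YOUR_NVIDIA_API_KEY",                   # placeholder credential
)

SEED_TOPICS = ["sorting algorithms", "unit testing", "REST API design"]

def generate_pair(topic: str) -> dict:
    """Synthesize one (instruction, response) training example for a topic."""
    # Step 1: ask the model to write an instruction about the topic.
    instruction = client.chat.completions.create(
        model="nvidia/nemotron-4-340b-instruct",     # assumed model id
        messages=[{"role": "user",
                   "content": f"Write one challenging question about {topic}."}],
        max_tokens=256,   # keep well inside the 4,096-token context
    ).choices[0].message.content

    # Step 2: have the model answer its own instruction.
    response = client.chat.completions.create(
        model="nvidia/nemotron-4-340b-instruct",
        messages=[{"role": "user", "content": instruction}],
        max_tokens=1024,
    ).choices[0].message.content

    return {"instruction": instruction, "response": response}

if __name__ == "__main__":
    dataset = [generate_pair(t) for t in SEED_TOPICS]
    print(dataset[0])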

Source: blogs.nvidia.com/blog/nemotron-4-synthetic-data-generation-llm-training/

Link: https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/nemotron-4-340b-instruct

June 18, 2024