Mastering LLM Techniques: Inference Optimization | NVIDIA Technical Blog
“Most of the popular decoder-only LLMs (GPT-3, for example) are pretrained on the causal language modeling objective, essentially as next-word predictors. These LLMs take a series of tokens as input and generate subsequent tokens autoregressively until they meet a stopping criterion (a limit on the number of tokens to generate or a list of stop words, for example) or until they generate a special <end> token marking the end of generation. This process involves two phases: the prefill phase and the decode phase.

Note that tokens are the atomic parts of language that a model processes. One token is approximately four English characters. All natural-language inputs are converted to tokens before being fed into the model…”
Source: developer.nvidia.com/blog/mastering-llm-techniques-inference-optimization/
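To make the two phases concrete, below is a minimal sketch of greedy autoregressive decoding using the Hugging Face transformers library (the model choice and generation parameters are illustrative assumptions, not part of the article). The first forward pass over the whole prompt is the prefill phase; every subsequent single-token forward pass, reusing the cached keys and values, is a decode step. The loop stops on either criterion the excerpt mentions: a token budget or the model emitting its end-of-sequence token.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # assumed here for illustration; any causal LM works

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

@torch.no_grad()
def generate(prompt: str, max_new_tokens: int = 32) -> str:
    # Prefill phase: one forward pass over the entire prompt,
    # populating the key-value cache for all prompt tokens.
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    generated = input_ids
    next_input = input_ids
    past_key_values = None

    for _ in range(max_new_tokens):  # stopping criterion 1: token budget
        outputs = model(next_input,
                        past_key_values=past_key_values,
                        use_cache=True)
        past_key_values = outputs.past_key_values  # reused each decode step

        # Greedy choice: take the highest-probability next token.
        next_token = outputs.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        generated = torch.cat([generated, next_token], dim=-1)

        # Stopping criterion 2: the model emits its end-of-sequence token.
        if next_token.item() == tokenizer.eos_token_id:
            break

        # Decode phase: each later pass feeds only the newest token,
        # since earlier positions are already in the KV cache.
        next_input = next_token

    return tokenizer.decode(generated[0], skip_special_tokens=True)

print(generate("Most decoder-only LLMs are pretrained as"))
```

In production, sampling strategies (temperature, top-k, top-p) typically replace the argmax, but the prefill/decode split is the same.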
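The four-characters-per-token figure is a rough average for English; actual token boundaries depend on the tokenizer's learned vocabulary. A quick way to see both points, using the GPT-2 byte-pair-encoding tokenizer purely as an illustrative stand-in:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

text = "Most of the popular decoder-only LLMs are next-word predictors."
token_ids = tokenizer.encode(text)
tokens = tokenizer.convert_ids_to_tokens(token_ids)

print(tokens)                       # subword pieces, e.g. ['Most', 'Ġof', 'Ġthe', ...]
print(len(text) / len(token_ids))  # roughly 4 characters per token on English text
```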