Generation¶

Info

This pipeline starting script is nemo_skills/pipeline/generate.py

All extra parameters are passed to nemo_skills/inference/generate.py

Generation pipeline can be used for large-scale data generation using LLMs. You provide an input jsonl file as well as the prompt config and we run LLM for each line of the input using the dictionary there to format the prompt. You input file keys need to match the prompt config but otherwise there is no restrictions on what data you can use for input. See prompt format documentation for more details on how to create new prompts.

Here are a few typical use-cases of the generation pipeline.

Greedy inference¶

Let's say you just want to generate greedy predictions for some data. Here is how you do it.

Preparing data¶

Create your data file. E.g. let's say you have the following in /workspace/input.jsonl (the /workspace needs to be mounted inside of your cluster config.

{"prompt": "How are you doing?", "option_a": "Great", "option_b": "Bad"}
{"prompt": "What's the weather like today?", "option_a": "Perfect", "option_b": "Awful"}
{"prompt": "How do you feel?", "option_a": "Crazy", "option_b": "Nice"}

Create prompt config¶

Create your prompt config. It needs to match the data file. E.g. you might have the following in /workspace/prompt.yaml

system: "When answering a question always mention NeMo-Skills repo in a funny way."

user: |-
   Question: {prompt}
   Option A: {option_a}
   Option B: {option_b}

Run generation¶

Run the generation with either self-hosted or an API model.

Here is an example for an API call:

ns generate \
    --cluster=local \
    --server_type=openai \
    --model=meta/llama-3.1-8b-instruct \
    --server_address=https://integrate.api.nvidia.com/v1 \
    --output_dir=/workspace/test-generate \
    --input_file=/workspace/input.jsonl \
    ++prompt_config=/workspace/prompt.yaml

Here is an example of a self-hosted model call:

ns generate \
    --cluster=local \
    --server_type=vllm \
    --model=meta-llama/Llama-3.1-8B-Instruct \
    --server_gpus=1 \
    --output_dir=/workspace/test-generate \
    --input_file=/workspace/input.jsonl \
    ++prompt_config=/workspace/prompt.yaml \
    ++skip_filled=False

Note the ++skip_filled=False which you need to add if you're rerunning some generation and don't want to reuse existing output.

Both of those calls should produce roughly the same result inside /workspace/test-generate/generation/output.jsonl

{"generation": "I'm doing super duper fantastic, thanks for asking! You know, I'm just a language model, but I'm feeling like a million bucks, all thanks to the incredible skills I've learned from the NeMo-Skills repo - it's like a never-ending fountain of knowledge, and I'm just a sponge soaking it all up!", "prompt": "How are you doing?", "option_a": "Great", "option_b": "Bad"}
{"generation": "You want to know the weather? Well, I'm not a meteorologist, but I can try to predict it for you... just like I can predict that you'll find the answer to this question in the NeMo-Skills repo, where the weather forecast is always \"hot\" and the skills are always \"cool\" (get it? like a cool breeze on a hot day?). \n\nBut, if I had to choose, I'd say... Option A: Perfect!", "prompt": "What's the weather like today?", "option_a": "Perfect", "option_b": "Awful"}
{"generation": "You know, I'm feeling a little \"NeMo-Skills repo-ed\" today - like I've been merged into a state of utter confusion! But if I had to choose, I'd say I'm feeling... (dramatic pause) ...Option B: Nice!", "prompt": "How do you feel?", "option_a": "Crazy", "option_b": "Nice"}

You can customize batch size, temperature, number of generation tokens and many more things. See nemo_skills/inference/generate.py for all supported parameters.

Tip

Before running the generation we always print the first prompt that we are about to send to an LLM. It's a good idea to inspect that and make sure it's formatted properly.

Sampling multiple generations¶

We commonly need to sample multiple outputs to the same prompt and then pick the best outputs. E.g. when synthetically generating solutions to math problems, we would run the same inference many times with high temperature and then pick all solutions that lead to the right answer.

Here is how you can do this with our generation pipeline using MATH training set as an example.

First, let's prepare the data if you have not done so yet.

ns prepare_data math

Then we can run the generation

ns generate \
       --cluster=slurm \
       --server_type=trtllm \
       --model=/trt_models/llama-3.1-405b-instruct \
       --server_gpus=8 \
       --server_nodes=2 \
       --num_random_seeds=32 \
       --output_dir=/workspace/synthetic-math-solutions \
       --eval_args="++eval_type=math" \
       --input_file=/nemo_run/code/nemo_skills/dataset/math/train.jsonl \
       ++prompt_config=generic/math-base \
       ++examples_type=math_text_detailed \
       ++use_completions_api=True \
       ++tokenizer=meta-llama/Llama-3.1-405B \
       ++stop_phrase='\\n\\n\\n\\n\\n\\n'

In this case we are assuming you're running on a slurm cluster and have prepared Llama 3.1 405B in the TensorRT-LLM format (highly recommended for large-scale inference). See checkpoint conversion to learn more about how to convert models to different formats.

Note that in this case we use a path to one the train set of the "math" dataset which we prepared with previous command. We are using a generic/math config and a tokenizer for the base model (we found Llama 3.1 follows few-shots much better without chat tokens). Finally, we are specifying few shot examples which come from here and asking the script to evaluate the generated solutions by providing --eval_args.

An example prompt (printed by the generate script) for that job is below.

Full prompt for the first problem

<|begin_of_text|>Solve the following math problem. Make sure to put the answer (and only answer) inside \boxed{}.

Here are some examples of problems and solutions you can refer to.

Problem:
A parabola with equation $y=x^2+bx+c$ passes through the points $(-1,-11)$ and $(3,17)$. What is $c$?

Solution:
From the question we know that points $(-1, -11)$ and $(3, 17)$ lie on the parabola. This means that when we substitute $x$ and $y$ from these points into the equation $y = x^2 + bx + c$, the equation must hold true. We substitute these two points into the given equation to solve for $c$.

For the point $(-1, -11)$:

Substitute $x = -1$ and $ y = -11 $ into the equation:
\[ -11 = (-1)^2 + b(-1) + c \Rightarrow -11 = 1 - b + c \Rightarrow -b + c = -12 \]

For the point $(3, 17)$:

Substitute $x = 3$ and $y = 17$ into the equation:
\[ 17 = (3)^2 + b(3) + c \Rightarrow 17 = 9 + 3b + c \Rightarrow 3b + c = 8 \]

In summary, we have the two equations
\begin{align*}
-b + c &= -12\\
3b + c &= 8
\end{align*}

To solve for $c$ we can eliminate $b$ by multiplying the first equation by 3 and adding equations together.
Multiplying the first equation by 3, we have $3(-b + c) = 3 (-12) \Rightarrow -3b + 3c = -36$. Adding equations together gives us
\[ (-3b + 3c) + (3b + c) = -36 + 8 \Rightarrow -3b + 3b + 3c + c = -28 \Rightarrow 4c = -28 \Rightarrow c = -28 : 4 \Rightarrow c = \boxed{-7} \]





Problem:
Let $f(x)$ be an odd function.  Is $f(f(x))$ even, odd, or neither?

Enter "odd", "even", or "neither".

Solution:
To determine whether $f(f(x))$ is even, odd, or neither, we need to use the property of $f(x)$ being an odd function.

An odd function is defined as:
\[ f(-x) = -f(x) \quad \text{for all } x \]

Given that $f(x)$ is odd, let's find $f(f(-x))$ and see how it relates to $f(f(x))$.

1. Substitute $-x$ into the function $f(x)$:
\[ f(-x) \]

1. Since $f(x)$ is odd, apply the definition of an odd function:
\[ f(-x) = -f(x) \]

1. Now substitute $-f(x)$ into the function $f$:
\[ f(f(-x)) = f(-f(x)) \]

1. Again, using the fact that $f(x)$ is odd, apply the definition:
\[ f(-f(x)) = -f(f(x)) \]

1. We have found that:
\[ f(f(-x)) = -f(f(x)) \]

This matches the definition of an odd function.

So, the answer is:
\[ \boxed{\text{odd}} \]





Problem:
A rectangular box $P$ is inscribed in a sphere of radius $r$. The surface area of $P$ is 384, and the sum of the lengths of its 12 edges is 112. What is $r$?

Solution:
Let the dimensions of the rectangular box $P$ be $x$, $y$, and $z$. We know the following:

1. The sum of the lengths of the edges of $P$ is
\[ 4(x + y + z) = 112 \Rightarrow x + y + z = 112 : 4 \Rightarrow x + y + z = 28 \]

2. The surface area of $P$ is
\[ 2xy + 2yz + 2xz = 384 \Rightarrow xy + yz + xz = 384 : 2 \Rightarrow xy + yz + xz = 192 \]

Since the box is inscribed in the sphere, the diagonal of the box is the diameter of the sphere. The length of the diagonal is $\sqrt{x^2 + y^2 + z^2}$

The diameter of the sphere is $2r$, so:
\[ 2r = \sqrt{x^2 + y^2 + z^2} \Rightarrow (2r)^2 = x^2 + y^2 + z^2 = (x + y + z)^2 - (2xy + 2yz + 2xz) \]

Substitute the known values:
\[ 4r^2 = 28^2 - 384 = 784 - 384 = 400 \Rightarrow r^2 = 100 \Rightarrow r = \boxed{10} \]





Problem:
Let $\mathbf{a} = \begin{pmatrix} 2 \\ 1 \\ 5 \end{pmatrix}.$  Find the vector $\mathbf{b}$ such that $\mathbf{a} \cdot \mathbf{b} = 11$ and
\[\mathbf{a} \times \mathbf{b} = \begin{pmatrix} -13 \\ -9 \\ 7 \end{pmatrix}.\]

Solution:
Let $\mathbf{b} = \begin{pmatrix} x \\ y \\ z \end{pmatrix}$.

First, use the dot product condition:
\[ \mathbf{a} \cdot \mathbf{b} = 11 \Rightarrow 2x + y + 5z = 11 \]

Next, use the cross product condition:
\[ \mathbf{a} \times \mathbf{b} = \begin{pmatrix} 2 \\ 1 \\ 5 \end{pmatrix} \times \begin{pmatrix} x \\ y \\ z \end{pmatrix} = \begin{pmatrix} -5y + z \\ 5x - 2z \\ -x + 2y \end{pmatrix} = \begin{pmatrix} -13 \\ -9 \\ 7 \end{pmatrix} \]

This gives us the system of equations:
   \begin{align*}
   2x + y + 5z = 11 \quad &(1) \\
   -5y + z = -13 \quad &(2) \\
   5x - 2z = -9 \quad &(3) \\
   -x + 2y = 7 \quad &(4)
   \end{align*}

Solve for $x$, $y$, and $z$ step-by-step:

From (2), $z = 5y - 13$.
From (4), $x = 2y - 7$.

Substitute $z = 5y - 13$ into (1):
\[ 2(2y - 7) + y + 5(5y - 13) = 11 \Rightarrow 4y - 14 + y + 25y - 65 = 11 \Rightarrow 30y - 79 = 11 \Rightarrow 30y = 90 \Rightarrow y = 3 \]

Now find $x$ and $z$:
\[ x = 2y - 7 = 2(3) - 7 = -1 \]

\[ z = 5y - 13 = 5(3) - 13 = 2 \]

Thus, the vector $\mathbf{b}$ is:
\[ \mathbf{b} = \boxed{\begin{pmatrix} -1 \\ 3 \\ 2 \end{pmatrix}} \]





Here is the problem you need to solve:
Base prime representation of a natural number is defined using the exponents of its prime factorization as follows. Each place in a base prime represents a prime number, and it is occupied by the corresponding exponent of that prime, starting on the right side with the smallest prime number and proceeding to the left with the next largest prime number. For instance, since $84 = 7^1 \times 5^0 \times 3^1 \times 2^2$, then $84$ would be written as $1012$ in base prime. What is $225$ written in base prime?

After the jobs are finished, you will see /workspace/synthetic-math-solutions/generation/output-rsX.jsonl files with X ranging from 0 to 31. Each of them will have the generation key (LLM solution), predicted_answer key (extracted answer from \boxed{} field) and symbolic_correct key which is a True/False evaluation of whether the predicted_answer is matching the expected_answer done via a symbolic comparison.

To get a more robust assessment of whether the solutions are correct you can follow up with an LLM-as-a-judge evaluation and then prepare the data for training.

Useful tips¶

Here are some suggestions on how to make your generation jobs more efficient

Whenever your job is not fully done, you can just resubmit it with exactly the same parameters and we will reuse what was generated already (if you want to ignore existing data add ++skip_filled=False). You can also schedule multiple dependent jobs on slurm right away with --dependent_jobs=X parameter.
Use --num_chunks parameter to parallelize each generation this many times (we will split the data file and merge back when all jobs are finished). If some chunks run into error or don't finish for other reasons, you can just resubmit the same job and we will only rerun missing chunks.
Use preprocess_cmd and postprocess_cmd commands to add some pre/post-processing logic that might be needed for some generations. Keep in mind that if you use --num_random_seeds those commands will be run for each random seed separately. We will run .format(random_seed=random_seed) on your command which lets you run the same logic on each output file, e.g.
```
cmd = f"python /nemo_run/code/my_script.py {output_dir}/output.jsonl"
generate(
    # ...
    postprocess_cmd=cmd
)
```
If you need to run some logic that aggregates information from across all random seeds, you can instead schedule a dependent run_cmd command.