Scaling Test-Time Compute for Longer Thinking in LLMs
Authored by: Sergio Paniego
🚨 WARNING: This notebook is resource-intensive and requires substantial computational power. If you run it in Colab, use an A100 GPU runtime.
In this recipe, we'll show how to let an instruct-tuned LLM spend more compute at inference time (test-time compute) so that it can solve harder problems, such as complex math questions. The approach is inspired by OpenAI's o1 and o3 models and rests on the observation that longer reasoning during inference can improve model performance.
This technique builds on experiments shared in this blog post, which show that smaller models, like the 1B and 3B Llama Instruct models, can outperform much larger ones on the MATH-500 benchmark when given enough "time to think". Recent research from DeepMind suggests that test-time compute can be scaled optimally through strategies like iterative self-refinement or using a reward model.
The blog introduces a new repository for running these experiments. In this recipe, we'll focus on building a small chatbot that engages in longer reasoning to tackle harder problems using small open models.
1. Install Dependencies
Let’s start by installing the search-and-learn repository! 🚀
This repo is designed to replicate the experimental results and is not published as a pip package. However, we can still use it to build our system. To do so, we’ll install it from source with the following steps:
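The install steps can be sketched in Python as a clone followed by an editable install. This is a minimal sketch, not the notebook's original cell: the repository URL and the target directory name are assumptions, so check them against the blog post before running anything.

```python
import subprocess
import sys

# Assumed repository URL; confirm it against the blog post before running.
REPO_URL = "https://github.com/huggingface/search-and-learn"


def build_install_commands(repo_url=REPO_URL, target_dir="search-and-learn"):
    """Return the commands for cloning the repo and installing it in editable mode."""
    return [
        ["git", "clone", repo_url, target_dir],
        [sys.executable, "-m", "pip", "install", "-e", target_dir],
    ]


def install_from_source():
    """Run each command in order, stopping at the first failure."""
    for command in build_install_commands():
        subprocess.run(command, check=True)
```

Calling `install_from_source()` once in a fresh environment should be enough. Building the command list separately makes it easy to inspect what will run before it does.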
Log in to Hugging Face to access meta-llama/Llama-3.2-1B-Instruct, as it is a gated model! 🗝️
If you haven't previously requested access, you'll need to submit a request before proceeding.
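One way to log in is through the `huggingface_hub` library. The sketch below assumes the token is stored in the `HF_TOKEN` environment variable and falls back to an interactive prompt when it is not. The helper names are hypothetical.

```python
import os


def resolve_hf_token(env=None):
    """Return the token from HF_TOKEN, or None when it is unset or blank."""
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "").strip()
    return token or None


def login_to_hub():
    """Authenticate with the Hugging Face Hub."""
    # Imported lazily so this snippet loads even without huggingface_hub installed.
    from huggingface_hub import login

    token = resolve_hf_token()
    if token is None:
        login()  # no token found: prompt for one interactively
    else:
        login(token=token)
```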
2. Set Up the Large Language Model (LLM) and the Process Reward Model (PRM) 💬
As illustrated in the diagram, the system consists of an LLM that generates intermediate answers based on user input, a PRM that evaluates and scores these answers, and a search strategy that uses the PRM feedback to guide the subsequent steps in the search process until reaching the final answer.
Let’s begin by initializing each model. For the LLM, we’ll use the meta-llama/Llama-3.2-1B-Instruct model, and for the PRM, we’ll use the RLHFlow/Llama3.1-8B-PRM-Deepseek-Data model.
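A possible way to load the generator is shown below. It assumes vLLM as the inference engine, which the text above does not name, and the memory fraction and seed are illustrative values rather than the notebook's settings. Loading the PRM depends on a wrapper class from the search-and-learn repository, so it is left out of this sketch.

```python
LLM_PATH = "meta-llama/Llama-3.2-1B-Instruct"
PRM_PATH = "RLHFlow/Llama3.1-8B-PRM-Deepseek-Data"


def load_llm(model_path=LLM_PATH, gpu_memory_utilization=0.5, seed=42):
    """Load the generator with vLLM, leaving part of the GPU free for the PRM."""
    # Lazy import: vLLM is an assumed dependency and needs a GPU machine.
    from vllm import LLM

    return LLM(
        model=model_path,
        gpu_memory_utilization=gpu_memory_utilization,  # fraction of GPU memory to claim
        enable_prefix_caching=True,  # reuse computation for shared prompt prefixes
        seed=seed,  # fixed seed for repeatable sampling
    )
```

Capping `gpu_memory_utilization` matters here because the 8B PRM has to share the same GPU with the 1B generator.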

2.1 Instantiate the Question, Search Strategy, and Call the Pipeline
Now that we've set up the LLM and PRM, let's define the question, select a search strategy that decides how candidate answers are explored, and call the pipeline to process the question through both models.
- Instantiate the Question: In this step, we define the input question that the system will answer, considering the given context.
- Search Strategy: The system currently supports the following search strategies: best_of_n, beam_search, and dvts (see diagram). For this example, we'll use best_of_n, but you can easily switch to either of the other strategies based on your needs. The search strategy also takes a few configuration parameters. You can check the full list here.
- Call the Pipeline: With the question and search strategy in place, we’ll call the inference pipeline, processing the inputs through both the LLM and PRM to generate the final answer.

The first step is to clearly define the question that the system will answer. This ensures that we have a precise task for the model to tackle.
Next, we define the configuration, including parameters like the number of candidate answers (N), and choose the search strategy that will be used. The search strategy dictates how we explore the potential answers. In this case, we'll use best_of_n.
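To make the configuration step concrete, here is a hypothetical stand-in for the repository's configuration object. The class name and fields are assumptions for illustration only. The three strategy names come from the list above, and the default of 8 candidates matches the comparison table later in this recipe.

```python
from dataclasses import dataclass

SUPPORTED_STRATEGIES = ("best_of_n", "beam_search", "dvts")


@dataclass
class SearchSettings:
    """Hypothetical settings object; not the repository's actual config class."""

    strategy: str = "best_of_n"
    n: int = 8  # number of candidate answers explored during search

    def __post_init__(self):
        # Fail early on typos instead of deep inside the search loop.
        if self.strategy not in SUPPORTED_STRATEGIES:
            raise ValueError(f"Unknown strategy: {self.strategy!r}")
        if self.n < 1:
            raise ValueError("n must be at least 1")
```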
With the question and configuration in place, we use the selected search strategy to generate multiple candidate answers. The PRM scores these candidates, and the highest-scoring one is returned as the final answer.
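The idea behind best_of_n can be sketched with toy stand-ins for the two models. This is a plain "pick the top score" variant under stated assumptions: `toy_generate` and `toy_score` are hypothetical placeholders for the LLM and the PRM, and the repository's own implementation may aggregate scores differently.

```python
def best_of_n(question, generate, score, n=8):
    """Generate n candidates and return the one the scorer rates highest."""
    candidates = generate(question, n)
    scores = [score(question, candidate) for candidate in candidates]
    best_index = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best_index], scores[best_index]


def toy_generate(question, n):
    """Stand-in for the LLM: cycle through a fixed pool of answers."""
    pool = ["(3, 0)", "(3, pi/2)", "(0, 3)", "(3, pi)"]
    return [pool[i % len(pool)] for i in range(n)]


def toy_score(question, answer):
    """Stand-in for the PRM: reward the known correct answer."""
    return 1.0 if answer == "(3, pi/2)" else 0.1


answer, answer_score = best_of_n("Convert (0, 3) to polar coordinates.", toy_generate, toy_score)
print(answer, answer_score)
```

Because every candidate is generated independently, this strategy is easy to batch, which helps explain why it finishes quickly even when it produces many tokens.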
2.2 Display the Final Result
Once the pipeline has processed the question through the LLM and PRM, we can display the final result. This result will be the model's output after considering the intermediate answers and scoring them using the PRM.
Here's how to display the final answer:
'## Step 1: Recall the conversion formulas\nTo convert from rectangular coordinates $(x, y)$ to polar coordinates $(r, \\theta)$, we use the following formulas:\n- $r = \\sqrt{x^2 + y^2}$\n- $\\theta = \\tan^{-1}\\left(\\frac{y}{x}\\right)$\n\n## Step 2: Substitute the given values into the formulas\nGiven $(x, y) = (0, 3)$, we substitute these values into the formulas:\n- $r = \\sqrt{0^2 + 3^2} = \\sqrt{0 + 9} = \\sqrt{9} = 3$\n- $\\theta = \\tan^{-1}\\left(\\frac{3}{0}\\right)$. However, since division by zero is undefined, we recognize that the point $(0, 3)$ is on the positive y-axis, meaning $\\theta = \\frac{\\pi}{2}$.\n\n## Step 3: Combine the results for the polar coordinates\nTherefore, the polar coordinates of the point $(0, 3)$ are $\\left(3, \\frac{\\pi}{2}\\right)$.\n\nThe final answer is: $\\boxed{\\left(3, \\frac{\\pi}{2}\\right)}$'

The model’s output might include special tokens, such as <|start_header_id|> or <|end_header_id|>. To make the answer more readable, we can safely remove them before displaying it to the end user.
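A small helper can do this cleanup. The sketch below is one possible approach, not the notebook's original cell: it drops any role header wrapped in <|start_header_id|> and <|end_header_id|>, then removes any other leftover token written in the `<|name|>` form.

```python
import re

# A role header such as <|start_header_id|>assistant<|end_header_id|>.
HEADER_PATTERN = re.compile(r"<\|start_header_id\|>.*?<\|end_header_id\|>", re.DOTALL)
# Any other special token written as <|name|>.
TOKEN_PATTERN = re.compile(r"<\|[a-z_0-9]+\|>")


def strip_special_tokens(text):
    """Remove chat-template headers and special tokens, then trim whitespace."""
    text = HEADER_PATTERN.sub("", text)
    text = TOKEN_PATTERN.sub("", text)
    return text.strip()


raw = "<|start_header_id|>assistant<|end_header_id|>\n\n## Step 1: Recall the conversion formulas"
print(strip_special_tokens(raw))
```

Removing the whole header first avoids leaving the role name (for example, "assistant") behind as stray text.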
After removing any special tokens, we can display the final answer to the user. Since the answer is written in markdown, it can be rendered properly by displaying it as markdown.
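As a sanity check on the boxed answer above, the conversion can be reproduced in a few lines. The helper name is hypothetical. It uses `math.atan2` rather than a plain arctangent of y/x, which sidesteps the division by zero that the model's reasoning had to handle as a special case.

```python
import math


def to_polar(x, y):
    """Convert rectangular coordinates to (r, theta) with 0 <= theta < 2*pi."""
    r = math.hypot(x, y)
    theta = math.atan2(y, x) % (2 * math.pi)  # atan2 is defined even when x == 0
    return r, theta


r, theta = to_polar(0, 3)
print(r, theta, math.isclose(theta, math.pi / 2))
```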
3. Assembling It All! 🧑🏭️
Now, let's create a method that encapsulates the entire pipeline. This will allow us to easily reuse the process in future applications, making it efficient and modular.
By combining the LLM, PRM, search strategy, and result display, we can simplify the workflow and ensure that it’s reusable for other tasks or questions. Additionally, we’ll track the time spent on each method so that we can understand the practical implications of using each strategy and configuration.
Here’s how we can structure the method:
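A minimal sketch of such a wrapper follows. The function name and the `search_fn` callable are hypothetical, and counting whitespace-separated words is only a rough stand-in for counting tokens with the model's tokenizer.

```python
import time


def run_with_timing(search_fn, question, count_tokens=lambda text: len(text.split())):
    """Run a search strategy on a question and report time and token totals."""
    start = time.perf_counter()
    completions = search_fn(question)  # expected to return a list of completion strings
    elapsed = time.perf_counter() - start

    total_tokens = sum(count_tokens(completion) for completion in completions)
    print(f"Finished in {elapsed:.2f} seconds")
    print(f"Total tokens in all completions: {total_tokens}")
    return completions, elapsed, total_tokens
```

Passing the strategy in as a callable keeps the timing logic identical across best_of_n, beam_search, and dvts, so the measurements stay comparable.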
⏳ 3.1 Comparing Thinking Time for Each Strategy
Let’s compare the thinking time of three methods: best_of_n, beam_search, and dvts. Each method is evaluated using the same number of answers during the search process, measuring the time spent thinking in seconds and the number of generated tokens.
In the results below, the best_of_n method shows the least thinking time, while the beam_search method takes the most time. However, best_of_n generates the most tokens due to its simpler search strategy.
| Method | Number of Answers During Search | Thinking Time (Seconds) | Generated Tokens |
|---|---|---|---|
| best_of_n | 8 | 3.54 | 3087 |
| beam_search | 8 | 10.06 | 2049 |
| dvts | 8 | 8.46 | 2544 |
This comparison illustrates the trade-offs between the strategies, balancing time spent thinking and the complexity of the search process.
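Dividing generated tokens by thinking time gives a rough throughput figure for each strategy. The snippet below only reuses the numbers from the table above. Throughput is a derived metric added here for illustration, not one the original experiments report.

```python
# (thinking time in seconds, generated tokens), copied from the table above.
RESULTS = {
    "best_of_n": (3.54, 3087),
    "beam_search": (10.06, 2049),
    "dvts": (8.46, 2544),
}


def tokens_per_second(results):
    """Return generated tokens per second of thinking time for each strategy."""
    return {name: tokens / seconds for name, (seconds, tokens) in results.items()}


for name, rate in tokens_per_second(RESULTS).items():
    print(f"{name}: {rate:.0f} tokens/s")
```

The gap reflects how the strategies spend their time: best_of_n generates all candidates in one pass, while the tree-based strategies pause after each step to score and prune.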
1. Best of n
We’ll begin by using the best_of_n strategy. Here’s how to track the thinking time for this method:
Finished in 3.54 seconds Total tokens in all completions: 3087
2. Beam Search
Now, let's try using the beam_search strategy.
Beam search iterations: 20%|██ | 8/40 [00:10<00:40, 1.26s/it]
Finished in 10.06 seconds Total tokens in all completions: 2049
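To illustrate what happens during those iterations, here is a toy step-level beam search. It is a sketch under simplifying assumptions, not the repository's implementation: `toy_expand` and `toy_score` are hypothetical stand-ins for the LLM proposing next reasoning steps and the PRM scoring partial solutions.

```python
def beam_search(question, expand, score, beam_width=2, max_steps=4):
    """Keep the beam_width best partial solutions after each reasoning step."""
    beams = [[]]  # each beam is a list of reasoning steps
    for _ in range(max_steps):
        candidates = []
        for steps in beams:
            if steps and steps[-1] == "DONE":
                candidates.append(steps)  # finished beams carry over unchanged
                continue
            for next_step in expand(question, steps):
                candidates.append(steps + [next_step])
        # Rank every extended solution and prune to the beam width.
        candidates.sort(key=lambda s: score(question, s), reverse=True)
        beams = candidates[:beam_width]
        if all(s[-1] == "DONE" for s in beams):
            break
    return beams[0]


def toy_expand(question, steps):
    """Stand-in for the LLM: propose two next steps, then finish."""
    return ["DONE"] if len(steps) >= 2 else ["good", "bad"]


def toy_score(question, steps):
    """Stand-in for the PRM: count the good steps so far."""
    return steps.count("good")


print(beam_search("toy question", toy_expand, toy_score))
```

Scoring after every step is what makes this strategy slower than best_of_n: generation has to stop repeatedly so the PRM can rank and prune the partial solutions.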
3. Diverse Verifier Tree Search (DVTS)
Finally, let's try the dvts strategy.
Beam search iterations: 22%|██▎ | 9/40 [00:08<00:29, 1.06it/s]
Finished in 8.46 seconds Total tokens in all completions: 2544
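DVTS can be sketched as several independent greedy searches. The toy version below assumes the usual description of the method: the N beams are split into N / beam_width subtrees, and each subtree keeps only its best-scoring continuation at every step. The stand-in functions are hypothetical, and the repository's implementation may differ in its details.

```python
def dvts(question, expand, score, n=4, beam_width=2, max_steps=4):
    """Grow n // beam_width independent subtrees, each expanded greedily."""
    num_subtrees = n // beam_width
    # Seed each subtree with a different first step to encourage diversity.
    subtrees = [[step] for step in expand(question, [], num_subtrees)]
    for _ in range(max_steps - 1):
        for i, steps in enumerate(subtrees):
            if steps[-1] == "DONE":
                continue  # this subtree already reached an answer
            options = [steps + [nxt] for nxt in expand(question, steps, beam_width)]
            # Greedy choice: keep only the best continuation within this subtree.
            subtrees[i] = max(options, key=lambda s: score(question, s))
    return max(subtrees, key=lambda s: score(question, s))


def toy_expand(question, steps, k):
    """Stand-in for the LLM: propose up to k next steps, then finish."""
    return ["DONE"] if len(steps) >= 2 else ["good", "bad"][:k]


def toy_score(question, steps):
    """Stand-in for the PRM: count the good steps so far."""
    return steps.count("good")


print(dvts("toy question", toy_expand, toy_score))
```

Unlike beam search, subtrees never compete with each other during the search, so a weak start in one subtree cannot crowd out a different line of reasoning in another.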
🙋 3.2 Testing the System with a Simple Question
In this final example, we’ll test the system using a straightforward question to observe how it performs in simpler cases. This allows us to verify that the system works as expected even for basic queries.
Let's try the following question:
Finished in 1.03 seconds Total tokens in all completions: 544
Even though we set a larger number of candidate answers (N), the time spent thinking remains relatively small (1.03 seconds and 544 generated tokens). This demonstrates the system’s ability to efficiently handle easier problems, spending less time on them, while leveraging its enhanced capabilities for more complex questions.
🏆 We now have a fully operational pipeline that leverages test-time compute, enabling the system to "think longer" for more complicated queries, while also maintaining fast response times for straightforward questions.
This approach ensures the system can scale its thinking time based on the task's complexity, offering an efficient and responsive solution for both simple and challenging problems.
4. Continuing the Journey and Resources 🧑🎓️
If you're eager to continue exploring, be sure to check out the original experimental blog and all the references mentioned within it. These resources will deepen your understanding of test-time compute, its benefits, and its applications in LLMs.
Happy learning and experimenting! 🚀