Browser Use Quickstart
🌐 Building an Intelligent Browser Agent with Llama 4
This notebook provides a step-by-step guide to creating an AI-powered browser agent capable of navigating and interacting with websites autonomously. By combining the power of Llama 4 Scout, Playwright, and Together AI, this agent can perform tasks seamlessly while understanding both visual and textual content.
Demo
For a detailed explanation of the code and a demo video, visit our blog post: Blog Post and Demo Video
Features
- Visual understanding of web pages through screenshots
- Autonomous navigation and interaction
- Natural language instructions for web tasks
- Persistent browser session management
For example, you can ask the agent to:
- Search for a product on Amazon
- Find the cheapest flight to Tokyo
- Buy tickets for the next Warriors game
What's in this Notebook?
This recipe walks you through:
- Setting up the environment and installing dependencies.
- Automating browser interactions using Playwright.
- Defining a structured prompt for the LLM to understand the task and execute the next action.
- Leveraging Llama 4 Scout for content comprehension.
- Creating a persistent and intelligent browser agent for real-world applications.
*Please note that the agent is not perfect and may not always behave as expected.
1. Install Required Libraries
This cell installs the necessary Python packages for the script, such as together, playwright, and beautifulsoup4.
It also ensures that Playwright is properly installed to enable automated browser interactions.
2. Import Modules and Set Up Environment
Set your Together API key to instantiate the client client. Feel free to use a different provider if it's more convenient.
Vision Query Example
This function converts an image file into a Base64-encoded string, which is required for LLM querying.
The next cell shows an example of how to use the encode_image function to convert an image file into a Base64-encoded string, which is then used in a chat completion request to the Llama 4 Scout model.
Helper Functions to Parse the Accessibility Tree
The agent will use the accessibility tree to understand the elements on the page and interact with them. A helper function is defined here to help simplity the accessibility tree for the agent.
3. Define Prompts
a) Planning Prompt: Create a structured prompt for the LLM to understand the task and execute the next action.
b) Agent Execution Prompt A structured prompt is created, specifying the instructions for processing the webpage content and screenshots.
Few Shot Examples
Performance improves drastically by adding a few shot examples.
4. Define a task and generate a plan of actions to execute
You can define your own task or use one of the examples below
Generate a plan of actions to execute
The next cell queries the LLM using the planning prompt to generate a plan of actions to execute. This then becomes each of the individual subtasks for the execution agent to complete.
Generating plan... 1. Navigate to Google Flights. 2. Enter the departure city. 3. Enter "Istanbul" as the destination city. 4. Select round-trip option. 5. Enter the departure date (current date or next available date). 6. Enter the return date (one week after the departure date). 7. Click the 'Search' button to find available flights. 8. Extract and compare the flight options, highlighting the cheapest option.
5. Create the Browser environment and Run the Agent
The necessary modules for web scraping are imported, and the setup for using Playwright asynchronously is initialized.
The context is provided to the LLM to help it understand its current state and generate the next required action to complete the provided task.
- At any step, you can press enter to continue or 'q' to quit the agent loop.
Agent response: {
"current_state": "On the Google homepage.",
"reasoning": "The task is to find the cheapest round trip flight to Istanbul. The first step is to navigate to Google Flights.",
"action": "navigation",
"url": "https://www.google.com/flights"
}
Press 'q' to quit or Enter to continue:
Agent response: {
"current_state": "On Google Flights homepage.",
"reasoning": "The task is to find the cheapest round trip flight to Istanbul. The next step is to enter Istanbul as the destination city.",
"action": "fill",
"selector": "combobox=Where to?",
"value": "Istanbul"
}
Press 'q' to quit or Enter to continue:
Agent response: {
"current_state": "On Google Flights with destination city filled as Istanbul.",
"reasoning": "The next step is to fill in the departure date to proceed with the flight search.",
"action": "fill",
"selector": "textbox=Departure",
"value": "2025-10-15"
}
Press 'q' to quit or Enter to continue:
Agent response: {
"current_state": "On Google Flights with departure city set to Phoenix, destination city set to Istanbul, and departure date set to October 15, 2025.",
"reasoning": "The next step is to fill in the return date field to complete the search criteria for a round-trip flight.",
"action": "fill",
"selector": "textbox=Return",
"value": "2025-10-22"
}
Press 'q' to quit or Enter to continue:
Agent response: {
"current_state": "On Google Flights with departure city set to Phoenix, destination city set to Istanbul, departure date set to October 15, 2025, and return date set to October 22, 2025.",
"reasoning": "The next step is to click the 'Search' button to find available flights based on the specified criteria.",
"action": "click",
"selector": "button=Search"
}
Press 'q' to quit or Enter to continue:
Agent response: {
"current_state": "On Google Flights search results page for Phoenix to Istanbul.",
"reasoning": "The page has loaded with multiple flight options. I need to identify and select the cheapest round-trip flight option to fulfill the task.",
"action": "click",
"selector": "text=From 1098 US dollars round trip total",
"url": "https://www.google.com/flights"
}
Press 'q' to quit or Enter to continue:
Agent response: {
"current_state": "On Google Flights search results page for Phoenix to Istanbul.",
"reasoning": "The page has loaded with multiple flight options. I need to sort the results to find the cheapest option.",
"action": "click",
"selector": "tab=Cheapest"
}
Press 'q' to quit or Enter to continue:
Agent response: {
"current_state": "On Google Flights search results page with multiple flight options displayed.",
"reasoning": "The accessibility tree shows multiple flight options with prices. The cheapest option is listed as $1098. To proceed with the task, I need to select the cheapest flight option.",
"action": "click",
"selector": "link=From 1098 US dollars round trip total. 2 stops flight with United and Lufthansa. Operated by SkyWest DBA United Express. Leaves Phoenix Sky Harbor International Airport at 10:41 AM on Wednesday, October 15 and arrives at Istanbul Airport at 9:20 PM on Thursday, October 16. Total duration 24 hr 39 min. Layover (1 of 2) is a 2 hr 33 min layover at Los Angeles International Airport in Los Angeles. Layover (2 of 2) is a 6 hr 30 min layover at Frankfurt Airport in Frankfurt am Main. Select flight",
"value": ""
}
Press 'q' to quit or Enter to continue: q
And that's it! Congratulations! 🎉🎉
You've just created a browser agent that can navigate websites, understand page content through vision, plan and execute actions based on natural language commands, and maintain context across multiple interactions.
Collaborators
Feel free to reach out with any questions or feedback!