Installing and Connecting Puppeteer MCP to Gemini CLI
Gemini CLI is Google's command-line interface for interacting with Gemini models. It supports MCP servers via a ~/.gemini/settings.json configuration file, using the same MCP protocol wire format as other AI clients. This means Puppeteer MCP integrates with Gemini CLI using the same server package — @modelcontextprotocol/server-puppeteer — just with a different config path.
Install Gemini CLI:
npm install -g @google/gemini-cli
Authenticate:
gemini auth login
Add Puppeteer MCP to ~/.gemini/settings.json:
{
  "mcpServers": {
    "puppeteer": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-puppeteer"]
    }
  }
}
Gemini CLI will spawn the Puppeteer MCP process on session start and communicate with it over stdio. No additional service management is needed.
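If your settings.json already contains other entries, merging the new server block by hand is error-prone. A minimal sketch of a safe merge (Node; assumes the file shape shown above, and that `~/.gemini/settings.json` already exists):

```javascript
// Sketch: add the Puppeteer MCP entry to an existing settings object
// without clobbering any other configured servers or top-level keys.
// The "mcpServers" key and server definition follow the config above.
function addPuppeteerServer(settings) {
  return {
    ...settings,
    mcpServers: {
      ...(settings.mcpServers || {}),
      puppeteer: {
        command: 'npx',
        args: ['-y', '@modelcontextprotocol/server-puppeteer'],
      },
    },
  };
}

// Usage against the real file:
//   const fs = require('fs');
//   const path = `${process.env.HOME}/.gemini/settings.json`;
//   const settings = JSON.parse(fs.readFileSync(path, 'utf8'));
//   fs.writeFileSync(path, JSON.stringify(addPuppeteerServer(settings), null, 2));
```

The spread-based merge preserves any servers you have already configured under mcpServers.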
For headful debugging (watch the browser while Gemini CLI controls it):
{
  "mcpServers": {
    "puppeteer": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-puppeteer"],
      "env": {
        "PUPPETEER_HEADLESS": "false"
      }
    }
  }
}
For CI environments:
{
  "mcpServers": {
    "puppeteer": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-puppeteer"],
      "env": {
        "PUPPETEER_LAUNCH_ARGS": "--no-sandbox --disable-setuid-sandbox"
      }
    }
  }
}
Verify the setup by starting a Gemini CLI session and asking:
Which MCP tools are available in this session?
You should see the six Puppeteer tools listed. If they are absent, run gemini --debug to inspect MCP server startup logs.
Tips
- If you have multiple Gemini CLI profiles or workspace configurations, ensure the ~/.gemini/settings.json is the one being picked up by verifying with gemini config show before starting a debugging session.
- Use npx -y rather than a global install to keep the server version flexible and avoid conflicts with other projects' Node.js tooling.
- For Google Cloud-hosted staging environments, configure the Gemini CLI session with appropriate service account credentials before pointing the Puppeteer agent at internal-only URLs.
- Test MCP connectivity before beginning a complex debugging session by asking the agent to navigate to https://example.com and screenshot it — a 30-second smoke test that confirms the full tool chain is working.
Using Gemini CLI to Diagnose Frontend Issues Autonomously with Puppeteer MCP
Gemini's multimodal capabilities make it particularly well-suited for screenshot-driven frontend debugging. When the Puppeteer MCP agent captures a screenshot, Gemini can visually analyze the image alongside any DOM or console data it has evaluated — combining visual and structural understanding in a single reasoning step.
A prompt that leverages Gemini's visual analysis:
Navigate to http://localhost:3000/dashboard and screenshot the page. Visually analyze the
screenshot: are there any obvious layout issues, cut-off elements, overlapping components,
or broken images? Also evaluate document.querySelectorAll('img') and report any images
where naturalWidth is 0 (failed to load). Cross-reference the visual findings with the
DOM evaluation.
Gemini will navigate, capture the screenshot, evaluate the DOM, and produce a combined visual + structural analysis — something that requires both vision and DOM inspection working together.
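The broken-image check in that prompt reduces to a small browser-side expression. A sketch of what the agent typically runs through puppeteer_evaluate (the function takes the document as a parameter; in the page, you would pass the global document):

```javascript
// Runs in the page context via puppeteer_evaluate.
// Reports every <img> whose naturalWidth is 0, i.e. it never loaded.
function findBrokenImages(doc) {
  return Array.from(doc.querySelectorAll('img'))
    .filter((img) => img.naturalWidth === 0)
    .map((img) => ({ src: img.src, alt: img.alt }));
}
// In the page: findBrokenImages(document)
```

Returning src and alt (rather than the elements themselves) keeps the evaluation result serializable, which is what the MCP tool sends back to the model.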
For diagnosing a reported visual regression:
A user reported that the navigation bar on http://localhost:3000 looks "squished" on their
screen. Please:
1. Screenshot at viewport 1920x1080
2. Screenshot at viewport 1440x900
3. Screenshot at viewport 1280x720
4. For each viewport, evaluate the computed height of the nav element (selector: 'nav.main-nav')
and the number of nav items visible
5. Visually compare the three screenshots and identify at which viewport the nav appears
"squished" — items too close together, text truncated, or items wrapping to a second row
6. Report the specific viewport breakpoint where the issue appears and the computed heights
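Step 4's evaluation can be sketched as a browser-side helper. The 'nav.main-nav' selector comes from the prompt; the li item selector and the offsetParent visibility test are assumptions — common heuristics, not the only options:

```javascript
// Runs in the page via puppeteer_evaluate at each viewport.
// Returns the nav's rendered height and a count of visible items.
function navMetrics(doc) {
  const nav = doc.querySelector('nav.main-nav');
  if (!nav) return { found: false };
  const items = Array.from(nav.querySelectorAll('li'));
  return {
    found: true,
    height: nav.getBoundingClientRect().height,
    // offsetParent is null for elements hidden via display: none.
    visibleItems: items.filter((el) => el.offsetParent !== null).length,
  };
}
// In the page: navMetrics(document)
```

Running the same helper at all three viewports gives the model comparable numbers to pair with its visual read of the screenshots.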
For JavaScript error diagnosis with visual context:
At http://localhost:3000/reports/monthly, users are seeing a blank chart where revenue data
should appear. Please:
1. Navigate to the page and screenshot immediately
2. Wait for 3 seconds (evaluate: new Promise(r => setTimeout(r, 3000))) and screenshot again
3. Evaluate: window.__chartInstances or document.querySelector('canvas') — is a canvas element
present in the DOM?
4. Evaluate: check for any errors in window.onerror or console.error outputs
5. Evaluate: fetch('/api/reports/monthly').then(r => r.json()).then(d => JSON.stringify(d).substring(0, 500))
to check whether the API call returns data
6. Visually describe what you see in the screenshot — is it a blank white area, a spinner,
an error message, or truly nothing?
The combination of puppeteer_evaluate for data extraction and Gemini's visual analysis of the screenshot produces a more complete diagnosis than either approach alone.
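Step 5's API probe generalizes to a small helper that keeps the evaluation output compact. A sketch (the endpoint and 500-character cap come from the prompt; the injectable fetchFn parameter is purely for illustration and testing):

```javascript
// Sketch of the step-5 probe: fetch an endpoint, parse it as JSON, and
// return at most the first `limit` characters, so the evaluation result
// stays small enough to include in the agent's context.
async function probeJson(url, limit = 500, fetchFn = fetch) {
  const res = await fetchFn(url);
  const data = await res.json();
  return JSON.stringify(data).substring(0, limit);
}
// In the page: probeJson('/api/reports/monthly')
```

Truncating before returning matters: a full API payload pasted into the model's context can crowd out the screenshots and DOM findings it needs for the combined diagnosis.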
Tips
- Explicitly ask Gemini to "visually analyze the screenshot" in your prompts — this activates Gemini's image reasoning on the captured Puppeteer screenshot, not just text-based DOM analysis.
- For color and contrast issues (accessibility failures), ask Gemini to visually assess contrast ratios in the screenshot: "Look at the text in the header against the background — does the contrast appear sufficient for readability?"
- Gemini handles long JSON evaluation outputs well — ask the agent to evaluate large objects (like full Redux state or API responses) and it can summarize the relevant parts without truncation issues.
- For responsive layout bugs, always request screenshots at three or more viewports in the same prompt so Gemini can compare them side-by-side in its reasoning.
Automating QA Flows and UI Exploration from Gemini CLI
Gemini CLI supports extended agentic sessions, which makes it well-suited for running long QA flows that span many pages and interactions. For multi-step test scenarios, Gemini maintains context across all the tool calls in a session, allowing it to produce coherent pass/fail summaries that reference earlier steps.
A cross-page user journey QA prompt:
Run a full user journey QA test on http://localhost:3000, testing the complete
"browse → add to cart → checkout" flow:
Step 1 - Product browsing:
- Navigate to /products
- Screenshot the product listing
- Verify at least 6 product cards are present (evaluate querySelectorAll('.product-card').length)
Step 2 - Product detail:
- Click the first product card
- Screenshot the product detail page
- Verify the "Add to Cart" button is visible and enabled
Step 3 - Add to cart:
- Click "Add to Cart"
- Screenshot the cart indicator (usually in the header, selector '.cart-count')
- Verify the cart count incremented
Step 4 - Cart review:
- Navigate to /cart
- Screenshot the cart page
- Verify the product appears in the cart with a price
Step 5 - Checkout:
- Click "Proceed to Checkout"
- Fill in: name "QA Tester", email "[email protected]", address "1 Test St"
- Screenshot before submitting
Produce a structured QA report with pass/fail for each step and include screenshots.
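Requesting the report in a machine-checkable shape makes the pass/fail results easy to consume downstream. One illustrative shape and a summarizing helper — the field names here are assumptions, not a fixed schema the agent is guaranteed to emit:

```javascript
// Summarize pass/fail counts from a structured QA report of the shape
// you might ask the agent to produce at the end of the journey test.
function summarizeReport(report) {
  const passed = report.steps.filter((s) => s.pass).length;
  return {
    flow: report.flow,
    passed,
    failed: report.steps.length - passed,
    failures: report.steps.filter((s) => !s.pass).map((s) => s.name),
  };
}

// Example of the report shape (illustrative values):
const example = {
  flow: 'browse-add-checkout',
  steps: [
    { name: 'product-browsing', pass: true, evidence: '8 product cards found' },
    { name: 'add-to-cart', pass: false, evidence: 'cart count stayed at 0' },
  ],
};
```

Including the exact field names in your prompt ("output JSON with flow, steps[].name, steps[].pass, steps[].evidence") makes the agent's output far more likely to parse cleanly.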
For exploratory UI mapping that leverages Gemini's comprehension:
Please explore the admin panel at http://localhost:3000/admin and produce a sitemap.
For each section in the left sidebar:
1. Click it and note the URL
2. Screenshot the page
3. Visually describe the main content area: what data is shown, what actions are available
4. Note any visible errors, empty states, or broken elements
After visiting all sections, produce:
- A hierarchical sitemap of the admin panel
- A list of any sections with visible errors or broken UI
- A list of any sections that appear to have incomplete features (placeholder text, "coming soon" labels, etc.)
Tips
- For long QA flows, break the prompt into phases (browsing, checkout, post-purchase) and run each phase as a separate Gemini CLI prompt in the same session — this keeps each response focused and avoids token limit issues on very long automation sequences.
- Gemini's strong natural language comprehension means you can describe expected behavior in plain English ("the cart count should increment by 1") rather than writing explicit DOM assertions — the agent will determine the right evaluation to verify it.
- Ask Gemini to produce its QA report in a specific markdown format at the end, so you can pipe the output directly to a file: gemini "run QA test... output report as markdown" > qa-report-$(date +%F).md.
- For accessibility QA, ask Gemini to visually assess each screenshot for readability, color contrast, and obvious WCAG violations — the visual analysis complements automated axe-core checks.
Comparing Puppeteer MCP Frontend Debugging Output Between Gemini CLI and Claude Code
Both Gemini CLI and Claude Code connect to the same @modelcontextprotocol/server-puppeteer package and call the same set of tools — the difference is in how each model reasons about the tool outputs and constructs its responses.
Visual analysis of screenshots. Gemini's multimodal architecture is deeply integrated — when it receives a screenshot from puppeteer_screenshot, it analyzes the image natively and can produce detailed visual descriptions, identify layout anomalies, describe color and typography issues, and compare multiple screenshots within the same reasoning step. Claude Code also processes screenshots visually when image input is supported in the client, but the depth of visual analysis varies by client configuration. For screenshot-heavy debugging workflows, Gemini CLI often produces richer visual commentary.
DOM and JavaScript evaluation output. Both models handle puppeteer_evaluate output well. Claude Code tends to be more methodical in correlating multiple evaluation outputs — checking DOM state, then console errors, then computed styles — and producing a structured diagnostic checklist. Gemini CLI tends toward comprehensive narrative summaries that blend visual and DOM findings into a single coherent explanation.
QA report format. Claude Code produces tightly structured outputs — numbered steps, pass/fail status, inline code references — which map well to engineering ticket formats. Gemini CLI produces more narrative-style reports that read naturally but may require more post-processing to extract structured test results.
Prompt length and complexity. Both handle multi-step, multi-scenario QA prompts well. Gemini CLI has a larger context window (Gemini 1.5 Pro supports up to 1M tokens), which is useful when including long DOM snapshots, full API responses, or large sets of screenshots in the context. Claude Code's context window is more than sufficient for typical debugging sessions but may require breaking up very long exploratory sessions.
Practical recommendation: use Gemini CLI for initial visual triage of rendering bugs (screenshot + visual analysis) and for long exploratory QA sessions that involve many page states. Use Claude Code for structured debugging workflows where you want methodical step-by-step diagnosis with tight correlation between DOM findings and code-level fixes.
Both tools produce solid automation output — the choice often comes down to your team's existing tooling preferences and which model's output format integrates better with your QA reporting workflow.
Tips
- Run the same debugging prompt against both Gemini CLI and Claude Code on a confirmed bug and compare the reports — this builds your intuition for which model to reach for in different debugging scenarios.
- For Gemini CLI sessions that involve many screenshots, explicitly ask the model to "compare screenshot 1 and screenshot 3" at the end — Gemini can reference images captured earlier in the session when prompted.
- Use Claude Code's structured output for automated parsing in CI pipelines. Use Gemini CLI's narrative output for team-facing reports and stakeholder updates.
- Both tools benefit from the same Puppeteer MCP server, so you can switch between them mid-investigation without reconfiguring the server or losing your automation setup.