Topics

AI, automation, GUI, planar, computer-use, workflows
June 26, 2025

Demystifying AI 'Computer Use': Building GUI Automation with Planar Workflows

Explore how AI computer use actually works under the hood and how workflow orchestration creates practical automation beyond impressive demos.

Thiago Arruda

Founding Engineer

Every AI lab is suddenly showcasing agents that can "use computers" – navigating interfaces, clicking buttons, and completing tasks that previously required human hands. This capability sits at the intersection of two worlds: the visual reasoning of modern LLMs and the automation promises of Robotic Process Automation (RPA). But despite the demos and hype, there's surprisingly little explanation of how these systems actually work in practice or how reliable they truly are.

The motivation for our exploration came from practical, real-world challenges our customers face daily. Finance teams struggle with invoice processing across multiple legacy systems. Operations teams manually navigate through dozens of screens to update vendor information. Customer service representatives spend hours toggling between different interfaces to resolve simple issues. These workflows span systems that were never designed to work together, creating endless friction and manual overhead.

While we usually integrate through programmatic channels — be they database queries, existing APIs, bespoke MCP servers, report builders, or SFTP buckets for dropping off EDI files — sometimes the integration method of last resort (or the lightest-lift point of entry for a POC!) is orchestration through the GUI.

When you have customers whose employees stare at legacy terminal screens all day, sometimes you need to get creative to drive value quickly.

We built planar-computer-use to peel back the curtain. While companies like Anthropic, OpenAI, Microsoft, and Google showcase increasingly sophisticated computer use capabilities at the model layer, we wanted to understand how these systems could be constructed from existing components and how they might be integrated into production-grade workflows.

To be clear: we realize this isn't the AI or reinforcement learning community's conception of how to scale these systems long-term. Our approach is decidedly pragmatic rather than research-oriented. But this practical foundation enables real applications today, and we believe that's valuable while the field continues to evolve.

The integration question is particularly interesting: how do these visually-guided agents fit into broader systems? By incorporating computer use capabilities into Planar Workflows, we can support longer-running tasks, provide better observability, and create a foundation for more complex automations that can be monitored, debugged, and improved over time.

In this post, we'll break down how AI computer use actually works under the hood. We'll show how existing models can be orchestrated to perceive interfaces, make decisions, and execute actions. Most importantly, we'll explain how workflow orchestration creates a practical framework for leveraging these capabilities beyond impressive demos.

Anatomy of an AI Computer User

At its core, AI computer use involves three fundamental capabilities: perception, decision-making, and action. The agent needs to see what's on screen, decide what to do next, and then execute that action.

Let's look at how we implemented each of these components in planar-computer-use:

Seeing the Screen: VNC Integration

For an agent to interact with a GUI, it first needs to see it. We chose VNC (Virtual Network Computing) as our interface, providing a standard protocol for both viewing and controlling remote desktops.

# Simplified example of our VNC integration
async with VNCManager.connect(host=vnc_host, port=vnc_port, password=vnc_password) as vnc:
    screenshot = await vnc.capture_screen_base64()
    # Now we can pass this screenshot to our agent for analysis

By using VNC, we've immediately solved several challenges:

  • Cross-platform compatibility across different operating systems
  • Standardized interfaces for screen capture and input control
  • Well-established protocols for security and access control

The VNCManager class handles all the low-level details of connection management, screen capture, and input control, exposing a clean API that can be easily integrated into our workflow system.
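
To make the shape of that API concrete, here is a minimal sketch of what a VNCManager-style wrapper could look like. The public method names mirror the snippets in this post, but the connection handling and the underlying client calls (open_vnc_connection, screenshot, pointer_move) are illustrative assumptions rather than the actual implementation.

# Minimal sketch of a VNCManager-style wrapper. The public method names mirror
# the snippets in this post; open_vnc_connection and the client calls it wraps
# are hypothetical stand-ins for a real async VNC client.
import base64
from contextlib import asynccontextmanager

class VNCManager:
    _current = None  # the active connection, returned by VNCManager.get()

    def __init__(self, client):
        self._client = client

    @classmethod
    @asynccontextmanager
    async def connect(cls, host: str, port: int, password: str):
        client = await open_vnc_connection(host, port, password)  # hypothetical helper
        manager = cls(client)
        cls._current = manager
        try:
            yield manager
        finally:
            cls._current = None
            await client.close()

    @classmethod
    def get(cls) -> "VNCManager":
        return cls._current

    async def capture_screen_base64(self) -> str:
        png_bytes = await self._client.screenshot()  # raw PNG bytes
        return base64.b64encode(png_bytes).decode()

    async def mouse_move(self, x: int, y: int):
        await self._client.pointer_move(x, y)

    async def click(self):
        await self._client.pointer_click()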

Reasoning About What's On Screen

With a screenshot in hand, the next challenge is understanding what's visible and deciding what to do. This is where modern vision-language models shine:

# Our high-level orchestration agent decides what step to take next
next_action = await computer_use_orchestration_agent.run(
    screenshot=screenshot,
    goal="Find the latest AI research papers",
    history=previous_actions
)
 
# Example response: "Click on the Chrome browser icon in the taskbar"

Rather than using a single agent, we've implemented a two-tier approach:

  1. The Orchestration Agent examines the current state of the screen and determines the next immediate action to take. Unlike planning-based approaches that try to map out steps in advance, our agent is deliberately "stateless" - it makes decisions based on what it can see right now, combined with the ultimate goal.

  2. The Computer Use Agent translates these basic steps into concrete tool calls, like clicking specific UI elements or typing text.
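
To make that division of labor concrete, here is one hedged way to pin down the orchestration agent's output as structured data. The NextStep model and its field names are illustrative assumptions, not Planar's actual agent API:

# Hypothetical structured output for the orchestration agent; the model and
# field names are illustrative, not Planar's actual agent API.
from pydantic import BaseModel

class NextStep(BaseModel):
    reasoning: str       # why this action makes sense given the current screenshot
    action: str          # e.g. "Click on the Chrome browser icon in the taskbar"
    task_complete: bool  # True when the goal appears to be satisfied

# The orchestration agent produces a NextStep from (screenshot, goal, history);
# the computer use agent then turns `action` into concrete tool calls such as
# click_element, type_text, or press_key against the live VNC session.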

This stateless approach makes the system more robust to unexpected UI states. If a browser takes longer than expected to load, or a dialog appears unexpectedly, the orchestration agent can adapt by making decisions based on the current reality rather than a predetermined plan.

That said, we think an area for improvement is providing some memory of previous actions. This is crucial for long-running tasks:

# Adding context of previous actions with timestamps
next_action = await computer_use_orchestration_agent.run(
    screenshot=screenshot,
    goal="Search for dog pictures",
    history=[
        {"action": "Click the 'Start' menu", "timestamp": "08:49"},
        {"action": "Click the 'Firefox' icon", "timestamp": "08:50"}
    ],
    current_time="08:51"
)
 
# Agent can now make better decisions like:
# "I clicked Firefox a minute ago, so I should wait for it to load
# rather than clicking it again"

This separation of concerns makes the system more robust and easier to debug. If the orchestration agent suggests an unreasonable action, we can catch it before executing. If a specific UI element can't be found, we can retry with different descriptions without restarting the entire task.
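
As a hedged illustration of the first guardrail, the workflow can run a cheap sanity check on the proposed action before handing it to the computer use agent; the rules below are example heuristics, not an exhaustive policy:

# Illustrative pre-execution sanity check; the verb list and length limit are
# example heuristics, not an exhaustive policy.
ALLOWED_VERBS = ("click", "double-click", "right-click", "type", "press", "scroll", "wait")

def is_reasonable_action(action: str) -> bool:
    lowered = action.lower().strip()
    # Reject empty or suspiciously long instructions before they reach the agent
    if not lowered or len(lowered) > 200:
        return False
    return any(verb in lowered for verb in ALLOWED_VERBS)

# In the main loop, a rejected action simply triggers another round of
# orchestration instead of being executed blindly.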

Finding UI Elements: The Visual Grounding Challenge

Perhaps the trickiest part of computer use is precisely locating UI elements described in natural language. We explored multiple approaches:

# Simplified example of our progressive zooming approach
async def query_element_position(element: str):
    screenshot = await take_screenshot()
 
    # First pass: coarse grid
    grid_img = draw_annotated_grid(screenshot, grid_size=3)
    cell = await grounding_agent.identify_cell(grid_img, element)
 
    # Second pass: zoom in and use finer grid
    zoomed_area = crop_to_cell(screenshot, cell)
    fine_grid_img = draw_annotated_grid(zoomed_area, grid_size=5)
    precise_cell = await grounding_agent.identify_cell(fine_grid_img, element)
 
    # Calculate final coordinates
    return calculate_coordinates(cell, precise_cell)

We found that a progressive zooming approach worked surprisingly well – first identifying a general region with a coarse grid, then zooming in for more precise localization. This method is both computationally efficient and effective across different interface designs.
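
For illustration, the coordinate math at the end of query_element_position might look like the following. The grid sizes match the snippet above (a 3x3 coarse pass and a 5x5 fine pass); representing cells as (row, col) tuples and passing the screen dimensions explicitly are assumptions made for this sketch.

# Hedged sketch of the grid-to-pixel math; assumes cells are (row, col) tuples
# and the grid sizes from the snippet above (3x3 coarse, 5x5 fine).
def calculate_coordinates(cell, precise_cell, screen_width, screen_height,
                          coarse_size=3, fine_size=5):
    cell_w = screen_width / coarse_size
    cell_h = screen_height / coarse_size

    # Top-left corner of the coarse cell the element was found in
    coarse_row, coarse_col = cell
    origin_x = coarse_col * cell_w
    origin_y = coarse_row * cell_h

    # Center of the fine cell inside that coarse cell
    fine_row, fine_col = precise_cell
    x = origin_x + (fine_col + 0.5) * (cell_w / fine_size)
    y = origin_y + (fine_row + 0.5) * (cell_h / fine_size)
    return int(x), int(y)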

For some use cases, we also experimented with direct visual grounding using models like OS-ATLAS, which can produce bounding boxes directly from image-text pairs.
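
As a hedged sketch of that alternative path (the grounding_model client and its locate call are hypothetical stand-ins, not the actual OS-ATLAS interface), a direct-grounding model could replace the two-pass grid with a single bounding-box prediction:

# Hypothetical direct-grounding variant; grounding_model.locate is a stand-in
# for whatever inference call the chosen model actually exposes.
async def query_element_position_direct(element: str):
    screenshot = await take_screenshot()
    # The model returns a bounding box (left, top, right, bottom) for the
    # element described in natural language
    left, top, right, bottom = await grounding_model.locate(screenshot, element)
    # Click at the center of the box
    return (left + right) // 2, (top + bottom) // 2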

Taking Action: Mouse and Keyboard Control

Once we know where UI elements are located, taking action is relatively straightforward:

# Example of our tool for clicking UI elements
async def click_element(element: str):
    x, y = await query_element_position(element)
    vnc = VNCManager.get()
    await vnc.mouse_move(x, y)
    await vnc.click()
    return f"Clicked on {element} at coordinates ({x}, {y})"

Our system supports the full range of interactions a human might perform:

  • Clicking, double-clicking, and right-clicking
  • Typing text
  • Pressing key combinations (like Ctrl+C for copy)
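
Typing and key presses follow the same pattern as click_element. Here is a hedged sketch of those tools; the key-event methods on the VNC wrapper (key_press, key_combination) are assumed names:

# Hedged sketch of the keyboard tools; key_press and key_combination are
# assumed method names on the VNC wrapper.
async def type_text(text: str):
    vnc = VNCManager.get()
    for char in text:
        await vnc.key_press(char)
    return f"Typed {len(text)} characters"

async def press_key(combo: str):
    # Accepts single keys ("Tab", "Enter") or combinations ("Ctrl+C")
    vnc = VNCManager.get()
    await vnc.key_combination(combo.split("+"))
    return f"Pressed {combo}"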

Orchestrating Computer Use with Planar Workflows

Where our approach diverges from many demos is in how we orchestrate these capabilities. By embedding computer use within Planar Workflows, we gain several important benefits:

Persistent State and Observability

The main workflow that powers planar-computer-use is surprisingly straightforward:

@workflow()
async def perform_computer_task(goal: str, vnc_host_port: str, vnc_password: str) -> str:
    vnc_host, vnc_port = vnc_host_port.split(":")
    previous_actions = []

    async with VNCManager.connect(
        host=vnc_host, port=int(vnc_port), password=vnc_password
    ) as vnc:
        for turn in range(MAX_TURNS):
            screenshot = await take_screenshot()

            # Decide what to do next
            next_action = await computer_use_orchestration_agent.run(
                screenshot=screenshot,
                goal=goal,
                history=previous_actions
            )

            if next_action == "complete":
                return f"Successfully completed task: {goal}"

            # Execute the action
            result = await computer_use_agent.run(
                screenshot=screenshot,
                action=next_action
            )

            previous_actions.append({"action": next_action, "result": result})
            await asyncio.sleep(1)  # Give UI time to update

        raise Exception(f"Failed to complete task within {MAX_TURNS} turns")

By running this as a Planar Workflow, we get:

  1. Persistence - The entire state of the task (including screenshots, action history, and agent decisions) is automatically persisted by Planar.

  2. Observability - Every step, agent call, and tool execution becomes visible in the Planar dashboard, making it easy to monitor and debug complex tasks.

  3. Retries and Error Handling - We can implement sophisticated retry logic for specific failures without restarting the entire task.
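
As a sketch of the third point (using plain Python rather than any Planar-specific retry API, and assuming the element-lookup tool raises ElementNotFoundError as in the later snippets), a flaky step can be wrapped so that only it is retried:

# Generic retry helper; illustrative only, not a Planar-specific retry API.
import asyncio

async def with_retries(step_fn, *args, attempts: int = 3, delay: float = 2.0):
    last_error = None
    for _ in range(attempts):
        try:
            return await step_fn(*args)
        except ElementNotFoundError as exc:
            last_error = exc
            await asyncio.sleep(delay)  # give the UI time to settle, then retry
    raise last_error

# Usage inside a workflow: retry just the lookup, not the whole task
# x, y = await with_retries(query_element_position, "the Submit button")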

Hybrid Automation: Combining Agents with Traditional Tools

One of the most powerful aspects of our approach is the ability to interleave agentic decision-making with traditional automation tools. It turns out that entering an invoice into an IBM AS/400 terminal doesn't require the world's most sophisticated reasoning model – but navigating to the right screen and handling unexpected error dialogs might.

@workflow()
async def process_invoice_in_legacy_system(invoice_data: dict):
    # Use agent to navigate to the correct screen
    await computer_use_agent.run(
        screenshot=await take_screenshot(),
        action="Navigate to invoice entry screen"
    )
 
    # Switch to deterministic automation for data entry
    await type_text(invoice_data["vendor_name"])
    await press_key("Tab")
    await type_text(invoice_data["amount"])
    await press_key("Tab")
    await type_text(invoice_data["invoice_number"])
 
    # Use agent to handle any unexpected dialogs or errors
    screenshot = await take_screenshot()
    if "error" in await analyze_screen(screenshot):
        resolution = await computer_use_orchestration_agent.run(
            screenshot=screenshot,
            goal="Resolve the error and complete invoice entry",
            history=[]  # simplified: no prior actions are tracked in this example
        )
        await computer_use_agent.run(screenshot=screenshot, action=resolution)
    else:
        await press_key("Enter")  # Submit the form

This hybrid approach creates remarkably robust automation systems. The agent handles the parts that require visual understanding and adaptation, while deterministic automation handles the predictable data entry. The result is more reliable than pure agent-based approaches and more flexible than pure rule-based automation.

Future Direction: True Desktop Session Isolation

Currently, our VNC integration manages desktop sessions at a global level, but we're working toward a more isolated approach. Our vision is to support "VNC as a service" where each workflow gets its own private desktop environment:

@workflow()
async def isolated_computer_task(goal: str):
    # Create a fresh desktop session for this workflow
    session_id = await vnc_service.start_session()
 
    try:
        for turn in range(MAX_TURNS):
            # Get screenshot from this workflow's private session
            screenshot = await vnc_service.get_screenshot(session_id)
 
            # Rest of workflow logic...
            next_action = await computer_use_orchestration_agent.run(
                screenshot=screenshot,
                goal=goal
            )
 
            # Execute actions on the private session
            if "click" in next_action:
                element = extract_element(next_action)
                x, y = await query_element_position(element, screenshot)
                await vnc_service.click(session_id, x, y)
    finally:
        # Always clean up the session when done
        await vnc_service.shutdown(session_id)

This approach would use lightweight containers (via Docker or Podman) with copy-on-write filesystems, making it efficient to spawn isolated desktop environments for each workflow. By abstracting this through a simple API service, we can make each workflow truly durable - if a workflow is interrupted, it can resume with exactly the same desktop state.
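
The service boundary itself can stay small. Here is a hedged sketch of the client surface implied by the snippet above; the HTTP transport and endpoint paths are assumptions, not a finalized API:

# Hedged sketch of a "VNC as a service" client; the endpoint paths and HTTP
# transport are assumptions, not a finalized API.
import httpx

class VNCServiceClient:
    def __init__(self, base_url: str):
        self._http = httpx.AsyncClient(base_url=base_url)

    async def start_session(self) -> str:
        # Spawns a fresh containerized desktop and returns its session id
        resp = await self._http.post("/sessions")
        return resp.json()["session_id"]

    async def get_screenshot(self, session_id: str) -> str:
        resp = await self._http.get(f"/sessions/{session_id}/screenshot")
        return resp.json()["image_base64"]

    async def click(self, session_id: str, x: int, y: int):
        await self._http.post(f"/sessions/{session_id}/click", json={"x": x, "y": y})

    async def shutdown(self, session_id: str):
        await self._http.delete(f"/sessions/{session_id}")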

Breaking Down Complex Tasks

For real-world applications, we often need to chain multiple computer use tasks together. Planar Workflows makes this composition natural:

@workflow()
async def research_and_summarize(topic: str):
    # First workflow: Search for information
    search_results = await perform_computer_task(
        goal=f"Search for recent information about {topic}",
        vnc_host_port=VNC_HOST_PORT,
        vnc_password=VNC_PASSWORD
    )
 
    # Second workflow: Create a summary document
    summary = await perform_computer_task(
        goal=f"Open a text editor and write a summary of: {search_results}",
        vnc_host_port=VNC_HOST_PORT,
        vnc_password=VNC_PASSWORD
    )
 
    return summary

This composability is powerful – each computer use task can be a building block in larger processes, with clear boundaries and error handling.

Practical Challenges and Solutions

Building planar-computer-use revealed several practical challenges that aren't immediately obvious from the demos we see:

Challenge 1: Balancing Precision and Adaptability

Early versions of our system tried to be too precise, breaking when UI elements moved even slightly. We found success by making our agents more adaptable:

  • Having the orchestration agent focus on higher-level goals rather than specific UI paths
  • Using more descriptive language for UI elements ("the search bar at the top of the page" vs. "the input field")
  • Incorporating feedback from failed actions to try alternative approaches
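
Incorporating that feedback can be as simple as retrying the element lookup with alternative phrasings. A hedged sketch follows; in practice the fallback descriptions would come from the orchestration agent rather than a hard-coded list:

# Hedged sketch of retrying a lookup with alternative descriptions; in practice
# the fallbacks would come from the orchestration agent, not a fixed list.
async def find_element_with_fallbacks(descriptions: list[str]):
    for description in descriptions:
        try:
            return await query_element_position(description)
        except ElementNotFoundError:
            continue  # try a more (or less) specific description
    raise ElementNotFoundError(f"None of the descriptions matched: {descriptions}")

# Example:
# x, y = await find_element_with_fallbacks([
#     "the search bar at the top of the page",
#     "the input field below the logo",
# ])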

Challenge 2: Handling Dynamic Content

Web pages and applications with dynamic content pose a special challenge. Our solution was to build in deliberate pauses and retry logic:

@step()
async def wait_for_page_load(element_to_wait_for: str, max_attempts: int = 10):
    for _ in range(max_attempts):
        try:
            # If the element can be located, the page has loaded
            await query_element_position(element_to_wait_for)
            return True
        except ElementNotFoundError:
            await asyncio.sleep(1)

    raise TimeoutError(f"Element '{element_to_wait_for}' never appeared")

By making these waits explicit steps in our workflow, we can track them and adjust timeouts based on observed behavior.

Challenge 3: Debugging Visual Interactions

When something goes wrong, understanding why can be difficult with visual interfaces. We created a debugging workflow that's proven invaluable:

@workflow()
async def highlight_ui_element(element: str) -> PlanarFile:
    """Take a screenshot and highlight the detected position of a UI element."""
    screenshot = await take_screenshot()
    position = await query_element_position(element)
 
    highlighted = await draw_rectangle(
        image=screenshot,
        x=position.x,
        y=position.y,
        width=20,
        height=20,
        color="red",
        thickness=2
    )
 
    return highlighted

This workflow, combined with Planar's ability to view workflow artifacts, makes it much easier to verify whether the system is correctly identifying UI elements.

The Future of AI Computer Use

Our exploration with planar-computer-use has convinced us that, despite being early, AI-driven GUI automation has a bright future, especially when combined with durable workflow orchestration. We see several promising directions:

Learning from Demonstrations

Rather than programming tasks explicitly, future systems could learn from human demonstrations:

@workflow()
async def learn_from_demonstration(task_name: str):
    # Record human performing a task
    recording = await record_human_demonstration()
 
    # Extract key decision points and actions
    task_model = await analyze_demonstration(recording)
 
    # Register as a new workflow
    await register_learned_workflow(task_name, task_model)

This approach could dramatically reduce the effort required to automate complex tasks.

Multi-Agent Collaboration

More complex scenarios might benefit from multiple specialized agents working together:

  • A navigator agent focusing on finding the right applications and screens
  • A detail agent handling precise data entry and validation
  • A troubleshooting agent that activates when unexpected errors occur

Planar's workflow system provides a natural framework for orchestrating this kind of collaboration.
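
A speculative sketch of how that collaboration could be expressed as a workflow; navigator_agent, detail_agent, and troubleshooting_agent are hypothetical placeholders, not existing components:

# Speculative sketch; the three agents are hypothetical placeholders.
@workflow()
async def collaborative_invoice_entry(invoice_data: dict):
    # Navigator finds the right application and screen
    await navigator_agent.run(
        screenshot=await take_screenshot(),
        goal="Open the invoice entry screen"
    )

    # Detail agent handles precise field-by-field entry and validation
    result = await detail_agent.run(
        screenshot=await take_screenshot(),
        data=invoice_data
    )

    # Troubleshooter only activates when something goes wrong
    if result.get("error"):
        await troubleshooting_agent.run(
            screenshot=await take_screenshot(),
            goal="Diagnose and resolve the error shown on screen"
        )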

Better Grounding Models

Gemini and other frontier models are getting better at precise visual understanding (see, for example, Simon Willison's post on Image Segmentation using Gemini). We're excited to see how these capabilities, along with open-source grounding models, improve over time.

Try It Yourself

We're planning on releasing planar-computer-use as an open-source project. While it's still a proof of concept, we believe it demonstrates the potential of combining LLMs with durable workflows for GUI automation and serves as a good starting point for building more complex systems.

To get started:

  1. Set up a VNC server on the machine you want to control
  2. Clone the repository: git clone https://github.com/coplane/planar-computer-use
  3. Install dependencies: uv sync
  4. Run a simple test: uv run python examples/simple_task.py

We welcome contributions and feedback from the community!

Conclusion

AI "computer use" capabilities are more than just impressive demos – they represent a fundamental shift in how we can automate GUI-based tasks. By combining the visual reasoning of modern LLMs with the orchestration capabilities of durable workflows, we can create systems that are both adaptable and reliable.

Our exploration with planar-computer-use has shown that these systems can be built today using existing components and integrated into broader automation workflows. Interleaving intelligent decision-making with traditional automation produces systems that are robust rather than flaky: sophisticated models handle the parts that require visual understanding, while deterministic automation covers the predictable operations.

While there are still challenges to overcome, AI agents that can use computers like humans do are becoming a practical reality. We're excited to share our work and see what the community builds with these capabilities, and we look forward to continuing our exploration of this fascinating intersection of AI and automation.


To learn more about Planar and CoPlane, check out CoPlane.com.