Statement

Compovision is an interactive installation that invites visitors to create with artificial intelligence—without typing a single word. Instead of written prompts, participants place everyday objects in front of cameras, letting the visual world act as the language. This process, known as visual prompting, lets people compose entirely new worlds out of the world around them.

By redefining the boundaries of how we communicate with machines, Compovision removes the technical barriers of prompt engineering and opens AI collaboration to broader audiences. As language-based interactions continue to dominate the AI landscape, it offers a playful and tangible alternative—one where imagination flows from the physical world into new, unexpected digital forms. It explores what it means to collaborate with machines in intuitive, embodied ways.

Concept and realization by Lionel Ringenbach, aka Ucodia.

Create it with your own hands

By pointing a set of cameras at places or objects from your daily life, Compovision instantly creates new visual compositions. The intelligent system understands what you want to create simply by observing the world you feed into it.

How it understands and creates

Under the hood, the intelligent system discerns the world you show it using an AI vision model, which translates each image into a textual description of what it sees. It then composes the descriptions of all the images into a single, larger imaginary scene. Finally, it uses an image generation system to render that scene, recombining the world around you into a brand new one.
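As a rough illustration of that flow (and not the installation's actual code, which is built in TouchDesigner and ComfyUI as described below), the three stages can be sketched as interchangeable functions; the type aliases and function names here are hypothetical:

```python
from typing import Callable

# Hypothetical stand-ins for the three stages; any concrete vision,
# language, and image-generation models can fill these roles.
DescribeFn = Callable[[bytes], str]      # camera image -> textual description
ComposeFn = Callable[[list[str]], str]   # descriptions -> one imaginary scene
GenerateFn = Callable[[str], bytes]      # scene description -> generated image


def run_pipeline(
    images: list[bytes],
    describe: DescribeFn,
    compose: ComposeFn,
    generate: GenerateFn,
) -> bytes:
    """Turn the camera feeds into a single newly composed image."""
    descriptions = [describe(image) for image in images]  # what each camera sees
    scene = compose(descriptions)                          # merge into one scene
    return generate(scene)                                 # render the scene
```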

The first prototype (May 2024)

The initial prototype was conceived in October 2023, developed in May 2024, and first exhibited at the Vancouver AI Community Meetup that same month.

The core of the prototype is a sequential pipeline of three AI models, each with a very specific task, executed in the following order:

  1. LLaVA, an “image-to-text” model
  2. Llama 3, a “text-to-text” model
  3. Stable Diffusion XL, a “text-to-image” model

In short, the pipeline acts as a single “images-to-text-to-text-to-image” model. This differs from an “images-to-image” model because the textual middle of the pipeline allows greater flexibility in how the final image is composed.
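For illustration only, here is one way such a chain could be approximated in standalone Python, using the Ollama client for LLaVA and Llama 3 and the Hugging Face diffusers library for Stable Diffusion XL. The actual prototype assembles the pipeline in TouchDesigner and ComfyUI (see the next section); the file names, prompts, and device choice below are assumptions.

```python
import ollama
import torch
from diffusers import StableDiffusionXLPipeline

camera_shots = ["cam1.jpg", "cam2.jpg", "cam3.jpg"]  # example stills from the webcams

# 1. LLaVA: describe what each camera sees (image-to-text)
descriptions = []
for shot in camera_shots:
    reply = ollama.chat(
        model="llava",
        messages=[{
            "role": "user",
            "content": "Describe the main objects and setting in this image in one sentence.",
            "images": [shot],
        }],
    )
    descriptions.append(reply["message"]["content"])

# 2. Llama 3: merge the descriptions into a single imaginary scene (text-to-text)
compose_prompt = (
    "Combine the following observations into one vivid description of a single scene:\n"
    + "\n".join(f"- {d}" for d in descriptions)
)
scene = ollama.chat(
    model="llama3",
    messages=[{"role": "user", "content": compose_prompt}],
)["message"]["content"]

# 3. Stable Diffusion XL: render the composed scene (text-to-image)
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
)
pipe.to("mps")  # Apple Silicon; use "cuda" on an NVIDIA GPU
image = pipe(prompt=scene).images[0]
image.save("composition.png")
```

Because the intermediate result is plain text, it can be inspected, edited, or restyled before the final image is generated, which is where the extra flexibility over a direct “images-to-image” model comes from.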

Workflow integration

The pipeline was created and integrated entirely in TouchDesigner and ComfyUI. Three webcams and a simple monitor were used to take in the live video feeds and display the resulting images. The entire system ran offline on a MacBook Pro with an Apple M1 Max chip.
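In the actual setup the video intake and display are handled inside TouchDesigner; as a rough standalone equivalent, the frames could be grabbed and the result shown with OpenCV. The device indices, file names, and window name below are assumptions.

```python
import cv2

CAMERA_INDICES = [0, 1, 2]  # assumed device indices for the three webcams


def grab_frames(indices):
    """Capture one still frame from each connected webcam."""
    frames = []
    for idx in indices:
        cap = cv2.VideoCapture(idx)
        ok, frame = cap.read()
        cap.release()
        if ok:
            frames.append(frame)
    return frames


def show_result(path="composition.png"):
    """Display the latest generated image on the output monitor."""
    image = cv2.imread(path)
    if image is not None:
        cv2.imshow("Compovision", image)
        cv2.waitKey(0)
        cv2.destroyAllWindows()


if __name__ == "__main__":
    for i, frame in enumerate(grab_frames(CAMERA_INDICES), start=1):
        cv2.imwrite(f"cam{i}.jpg", frame)  # stills to feed into the pipeline above
    show_result()
```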

Prototype setup, showing the webcam and final output