Building a VLM app that analyzes images and video with Ollama and FastAPI


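Goal

In this article we build a web app that analyzes images and video with a VLM running locally.

Specifically, the app will:

Grab the latest frame from a webcam
Analyze that frame with the VLM
Show the model's response and the frame in the browser
Optionally receive the model's response as JSON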

Architecture

The app is made up of two containers.

Web app container: where you edit the Python code and run FastAPI
Ollama container: loads llava:7b and responds as the VLM

Both are started together with docker compose. Development happens in a devcontainer, using VS Code.

Environment

Docker
Docker Compose
Visual Studio Code
Dev Containers extension
A Docker environment with GPU access (it may also run on CPU only, but...)

The app uses the following seven Python libraries.

requests
Pillow
opencv-python
fastapi
uvicorn[standard]
jinja2
python-multipart

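Step 1. Prepare the devcontainer and compose files

Create a working directory and place the files in the layout shown in the tree below.

Start by running the following to create the required folders.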

ShellScript

mkdir -p .devcontainer templates static

├── .devcontainer/
│   ├── devcontainer.json
│   └── docker-compose.yml
├── static/
│   └── style.css
├── templates/
│   └── index.html
├── app.py
├── requirements.txt
├── media_utils.py
└── wincamera.py

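`wincamera.py` is a script that keeps `latest.jpg` in the shared folder up to date. The app treats this `latest.jpg` as the current camera image, and also uses it when the analysis form is submitted with the path field left empty.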
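Because the app may read `latest.jpg` while the camera script is still writing it, one way to avoid half-written images is to write to a temporary file and rename it into place. The following is only a minimal sketch of that idea, not the contents of `wincamera.py`: the function name `publish_frame` is hypothetical, and it assumes the capture side already has the frame as JPEG bytes.

```python
import os
import pathlib
import tempfile


def publish_frame(jpeg_bytes: bytes, dest: pathlib.Path) -> None:
    """Replace dest with jpeg_bytes without exposing a partially written file.

    Hypothetical helper: the capture code that produces jpeg_bytes is not shown.
    """
    dest.parent.mkdir(parents=True, exist_ok=True)
    # Create the temporary file next to dest so the rename stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=dest.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(jpeg_bytes)
        # os.replace overwrites dest in a single step on the same filesystem.
        os.replace(tmp_name, dest)
    except BaseException:
        # Remove the leftover temporary file if anything failed before the rename.
        if os.path.exists(tmp_name):
            os.remove(tmp_name)
        raise
```

A capture loop would call `publish_frame(data, pathlib.Path(".../latest.jpg"))` once per frame. How the rename behaves on a network share or a folder shared between operating systems may differ, so treat this as a starting point.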

`.devcontainer/devcontainer.json` looks like this.

JSON

{
    "name": "vlm-ollama-gpu",
    "dockerComposeFile": "docker-compose.yml",
    "service": "dev",
    "workspaceFolder": "/work",
    "customizations": {
        "vscode": {
            "extensions": [
                "ms-python.python",
                "ms-azuretools.vscode-docker"
            ]
        }
    },
    "mounts": [
        "source=/absolute/path/to/share,target=/camera-share,type=bind"
    ]
}

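This bind-mounts the shared folder on the host at `/camera-share` inside the container. The app treats the `latest.jpg` placed there as the current camera image. Replace `/absolute/path/to/share` with the actual absolute path of your shared folder.

Next is `.devcontainer/docker-compose.yml`.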

YAML

services:
  dev:
    image: nvidia/cuda:12.4.1-cudnn-runtime-ubuntu22.04
    command: sleep infinity
    volumes:
      - ..:/work:cached
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              capabilities: [gpu]

  ollama:
    image: ollama/ollama:latest
    volumes:
      - ollama:/root/.ollama
    ports:
      - "11434:11434"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              capabilities: [gpu]

volumes:
  ollama:

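With this configuration, the development container `dev` and the VLM container `ollama` start together. Within the Compose network the service name `ollama` works as a hostname, so the FastAPI app can reach the Ollama API at `http://ollama:11434`.

Step 2. Start the containers and pull the model

To use `llava:7b`, start the Ollama container and then pull the model. First, in a terminal on the host, run the following command from the project root to bring up the containers.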

ShellScript

docker compose -f .devcontainer/docker-compose.yml up -d

Then open this folder in VS Code and run Reopen in Container from the Dev Containers extension. This puts you inside the `dev` container. Next, from a terminal on the host, pull `llava:7b` in the Ollama container.

ShellScript

docker compose -f .devcontainer/docker-compose.yml exec ollama ollama pull llava:7b

To confirm that the model was pulled, run the following command, again from a terminal on the host.

ShellScript

docker compose -f .devcontainer/docker-compose.yml exec ollama ollama list

If `llava:7b` appears in the output, the Ollama side is ready.
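The same check can also be made over HTTP from inside the `dev` container, which confirms at the same time that `http://ollama:11434` is reachable. The sketch below uses only the standard library and Ollama's `/api/tags` endpoint, which lists the locally available models. The helper names `fetch_tags` and `model_names` are made up for this example.

```python
import json
import urllib.request


def model_names(tags: dict) -> list:
    """Extract model names from a parsed /api/tags response."""
    return [m.get("name", "") for m in tags.get("models", [])]


def fetch_tags(base_url: str = "http://ollama:11434", timeout: float = 5.0) -> dict:
    """GET /api/tags from the Ollama server and return the parsed JSON body."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Usage inside the dev container (needs the ollama service to be running):
#   names = model_names(fetch_tags())
#   print("llava:7b available:", "llava:7b" in names)
```

If the request fails with a connection error, the two containers are probably not on the same Compose network, or the Ollama container is not running.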

Step 3. Install the Python dependencies

Next, install the Python libraries inside the development container.

Create `requirements.txt` with the following contents.

requests
Pillow
opencv-python
fastapi
uvicorn[standard]
jinja2
python-multipart

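Then run the following in the terminal of the `dev` container opened in VS Code. If `pip` is not found, install Python and pip in the container first, since the CUDA base image does not necessarily ship with them.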

ShellScript

pip install -r requirements.txt

Step 4. Write the image and video preprocessing code

Next, create `media_utils.py`, which converts images and videos into a form that can be sent to the VLM.

Python

import argparse
import base64
import io
import pathlib
import time

import requests
from PIL import Image

try:
        import cv2
except ImportError:
        # OpenCV is only needed for video input; still images work without it.
        cv2 = None


def encode_pil_image_to_b64(img: Image.Image) -> str:
        buf = io.BytesIO()
        img.save(buf, format="PNG")
        return base64.b64encode(buf.getvalue()).decode()


def process_image(path: pathlib.Path, max_size=(512, 512)):
        with Image.open(path) as img:
                img.thumbnail(max_size, Image.LANCZOS)
                return encode_pil_image_to_b64(img)


def process_video(path: pathlib.Path, max_frames=8, max_size=(512, 512)):
        if cv2 is None:
                raise RuntimeError("OpenCV is required for video support. Install opencv-python.")

        cap = cv2.VideoCapture(str(path))
        if not cap.isOpened():
                raise RuntimeError(f"Unable to open video: {path}")

        frame_count = int(cap.get(cv2.CAP_PROP_FRAME_COUNT) or 0)
        if frame_count <= 0:
                frames = []
                while len(frames) < max_frames:
                        ret, frame = cap.read()
                        if not ret:
                                break
                        frames.append(frame)
        else:
                sample_count = min(max_frames, frame_count)
                indices = [
                        int(round(i * (frame_count - 1) / (sample_count - 1)))
                        if sample_count > 1 else 0
                        for i in range(sample_count)
                ]
                frames = []
                for idx in indices:
                        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
                        ret, frame = cap.read()
                        if not ret:
                                continue
                        frames.append(frame)

        cap.release()

        images_b64 = []
        for frame in frames:
                # OpenCV returns frames in BGR order; reverse the channel axis to get RGB.
                rgb = frame[:, :, ::-1]
                img = Image.fromarray(rgb)
                img.thumbnail(max_size, Image.LANCZOS)
                images_b64.append(encode_pil_image_to_b64(img))

        return images_b64


def main():
        p = argparse.ArgumentParser(description="Send image or video frames to Ollama model")
        p.add_argument("input", help="Path to image or video file")
        p.add_argument("--max-frames", type=int, default=8, help="Max frames to sample from video")
        args = p.parse_args()

        path = pathlib.Path(args.input)
        if not path.exists():
                raise SystemExit(f"File not found: {path}")

        all_start = time.time()

        if path.suffix.lower() in [".png", ".jpg", ".jpeg", ".bmp", ".webp", ".tiff"]:
                resize_start = time.time()
                images = [process_image(path)]
                resize_time = time.time() - resize_start
        else:
                resize_start = time.time()
                images = process_video(path, max_frames=args.max_frames)
                resize_time = time.time() - resize_start

        if not images:
                raise SystemExit("No frames extracted from input")

        prompt = (
                "List up to 5 important objects in this first-person media. "
                "Include people, animals, obstacles to movement, buildings, user interfaces, plants, and anything with distinctive appearance. "
                "Use this strict format, one per line: [object with features], [direction], [distance in meters]. Do not write any explanations, background, or extra text. "
                "Do not use full sentences. If fewer than 5, list only those. No other output."
        )

        payload = {
                "model": "llava:7b",
                "prompt": prompt,
                "images": images,
                "stream": False,
        }

        r = requests.post("http://ollama:11434/api/generate", json=payload, timeout=600)
        r.raise_for_status()  # stop on HTTP errors instead of printing None below

        all_time = time.time() - all_start
        print(f"[Resize time: {resize_time:.3f} sec]")
        print(f"[Total time: {all_time:.3f} sec]")
        print(r.json().get("response"))


if __name__ == "__main__":
        main()

This file serves two purposes:

The FastAPI app imports `process_image` and `process_video` from it
It can be run on its own to check the connection to Ollama
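The part of `process_video` that is easiest to get wrong is the choice of frame indices. As a rough illustration, the same calculation can be pulled out into a standalone function and tried without a video file. The name `sample_indices` is an assumption made for this sketch and does not appear in `media_utils.py`.

```python
def sample_indices(frame_count: int, max_frames: int) -> list:
    """Return up to max_frames indices spread evenly over 0 .. frame_count - 1."""
    if frame_count <= 0 or max_frames <= 0:
        return []
    sample_count = min(max_frames, frame_count)
    if sample_count == 1:
        # A single sample cannot be spread out, so take the first frame.
        return [0]
    # Same formula as process_video: first and last frames are always included.
    return [
        int(round(i * (frame_count - 1) / (sample_count - 1)))
        for i in range(sample_count)
    ]


print(sample_indices(100, 8))  # [0, 14, 28, 42, 57, 71, 85, 99]
print(sample_indices(3, 8))    # [0, 1, 2]
```

When the video has fewer frames than `max_frames`, every frame is used. Otherwise the samples cover the whole clip, including the first and last frames.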

Step 5. Write the FastAPI app

Next, create `app.py`, the application itself.

Python

import os
import time
import pathlib
from typing import Optional

from fastapi import FastAPI, Form, Request
from fastapi.responses import HTMLResponse, FileResponse, JSONResponse
from fastapi.staticfiles import StaticFiles
from fastapi.templating import Jinja2Templates

import requests

from media_utils import process_image, process_video


BASE_CAMERA_PATH = os.environ.get("CAMERA_PATH", "/camera-share/latest.jpg")
DEFAULT_PROMPT_FILE = pathlib.Path(os.environ.get("DEFAULT_PROMPT_FILE", "default_prompt.txt"))

app = FastAPI()
templates = Jinja2Templates(directory="templates")
app.mount("/static", StaticFiles(directory="static"), name="static")


@app.get("/", response_class=HTMLResponse)
def index(request: Request):
        camera_path = pathlib.Path(BASE_CAMERA_PATH)
        img_b64 = None
        if camera_path.exists():
                try:
                        img_b64 = process_image(camera_path)
                except Exception:
                        img_b64 = None

        if DEFAULT_PROMPT_FILE.exists():
                try:
                        default_prompt = DEFAULT_PROMPT_FILE.read_text(encoding="utf-8")
                except Exception:
                        default_prompt = ""
        else:
                default_prompt = (
                        "List up to 5 important objects in this first-person media. "
                        "Include people, animals, obstacles to movement, buildings, user interfaces, plants, and anything with distinctive appearance. "
                        "Use this strict format, one per line: [object with features], [direction], [distance in meters]. Do not write any explanations, background, or extra text. "
                        "Do not use full sentences. If fewer than 5, list only those. No other output."
                )

        return templates.TemplateResponse(
                "index.html",
                {
                        "request": request,
                        "image_b64": img_b64,
                        "response_text": None,
                        "default_prompt": default_prompt,
                        "current_prompt": default_prompt,
                        "current_path": "",
                        "current_max_frames": 8,
                        "current_structured": False,
                        "parsed_json": None,
                        "parse_error": None,
                },
        )


@app.post("/save_prompt")
def save_prompt(prompt: str = Form(...)):
        try:
                DEFAULT_PROMPT_FILE.write_text(prompt, encoding="utf-8")
        except Exception as e:
                return JSONResponse({"ok": False, "error": str(e)}, status_code=500)
        return JSONResponse({"ok": True})


@app.get("/frame")
def frame():
        camera_path = pathlib.Path(BASE_CAMERA_PATH)
        if not camera_path.exists():
                return HTMLResponse("Not found", status_code=404)

        suffix = camera_path.suffix.lower()
        if suffix == ".png":
                media_type = "image/png"
        else:
                media_type = "image/jpeg"

        return FileResponse(
                str(camera_path),
                media_type=media_type,
                headers={"Cache-Control": "no-cache, no-store, must-revalidate"},
        )


@app.post("/analyze", response_class=HTMLResponse)
def analyze(
        request: Request,
        path: Optional[str] = Form(None),
        max_frames: int = Form(8),
        prompt: Optional[str] = Form(None),
        structured: Optional[str] = Form(None),
):
        if path:
                camera_path = pathlib.Path(path)
        else:
                camera_path = pathlib.Path(BASE_CAMERA_PATH)

        if not camera_path.exists():
                return templates.TemplateResponse(
                        "index.html",
                        {"request": request, "image_b64": None, "response_text": f"File not found: {camera_path}"},
                )

        want_structured = bool(structured)

        try:
                default_prompt = (
                        "List up to 5 important objects in this first-person media. "
                        "Include people, animals, obstacles to movement, buildings, user interfaces, plants, and anything with distinctive appearance. "
                        "Use this strict format, one per line: [object with features], [direction], [distance in meters]. Do not write any explanations, background, or extra text. "
                        "Do not use full sentences. If fewer than 5, list only those. No other output."
                )

                effective_prompt = prompt if prompt else default_prompt

                if want_structured:
                        effective_prompt = (
                                effective_prompt
                                + "\n\nOUTPUT FORMAT INSTRUCTIONS: Return ONLY a JSON array (no extra text) where each item is an object with keys: \"object\" (string), \"direction\" (string), and \"distance_m\" (number). Example: [{\"object\": \"person at desk\", \"direction\": \"front\", \"distance_m\": 2.5}]."
                        )

                start = time.time()
                if camera_path.suffix.lower() in [".png", ".jpg", ".jpeg", ".bmp", ".webp", ".tiff"]:
                        resize_start = time.time()
                        images = [process_image(camera_path)]
                        resize_time = time.time() - resize_start
                else:
                        resize_start = time.time()
                        images = process_video(camera_path, max_frames=max_frames)
                        resize_time = time.time() - resize_start

                if not images:
                        raise RuntimeError("No frames extracted from input")

                payload = {
                        "model": "llava:7b",
                        "prompt": effective_prompt,
                        "images": images,
                        "stream": False,
                }

                resp = requests.post("http://ollama:11434/api/generate", json=payload, timeout=600)
                resp.raise_for_status()  # HTTP errors are reported by the except block below
                total_time = time.time() - start

                result = {
                        "response": resp.json().get("response"),
                        "resize_time": resize_time,
                        "total_time": total_time,
                        "image_b64": images[0],
                }

                parsed = None
                parse_error = None
                if want_structured:
                        import json

                        try:
                                parsed = json.loads(result["response"])
                        except Exception as e:
                                parse_error = str(e)

                result["parsed_json"] = parsed
                result["parse_error"] = parse_error
        except Exception as e:
                return templates.TemplateResponse(
                        "index.html",
                        {
                                "request": request,
                                "image_b64": None,
                                "response_text": f"Error: {e}",
                                "current_prompt": prompt or "",
                                "current_path": str(camera_path),
                                "current_max_frames": max_frames,
                                "current_structured": want_structured,
                                "parsed_json": None,
                                "parse_error": None,
                        },
                )

        return templates.TemplateResponse(
                "index.html",
                {
                        "request": request,
                        "image_b64": result.get("image_b64"),
                        "response_text": result.get("response"),
                        "resize_time": result.get("resize_time"),
                        "total_time": result.get("total_time"),
                        "current_prompt": prompt or "",
                        "current_path": str(camera_path),
                        "current_max_frames": max_frames,
                        "current_structured": want_structured,
                        "parsed_json": result.get("parsed_json"),
                        "parse_error": result.get("parse_error"),
                },
        )

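This file's job is to receive the input, call the preprocessing functions, pass the result to Ollama, and return the response to the page.

If the `path` field of the analysis form is submitted empty, the file at `BASE_CAMERA_PATH` (by default `/camera-share/latest.jpg`) becomes the analysis target.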
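The two decisions `analyze` makes before calling the model (which file to use, and whether to treat it as an image or a video) can be written as small standalone helpers, which makes them easy to try in isolation. This is only a sketch of the same logic. The names `resolve_target`, `is_image`, `IMAGE_SUFFIXES` and `DEFAULT_CAMERA_PATH` are invented here and are not part of `app.py`.

```python
import pathlib

# Same extension list as the branch in analyze(); anything else is treated as video.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".bmp", ".webp", ".tiff"}
# Stands in for BASE_CAMERA_PATH with its default value.
DEFAULT_CAMERA_PATH = "/camera-share/latest.jpg"


def resolve_target(path_field, default=DEFAULT_CAMERA_PATH):
    """Use the submitted path when it is non-empty, otherwise the camera image."""
    return pathlib.Path(path_field) if path_field else pathlib.Path(default)


def is_image(path):
    """True for still-image extensions, compared case-insensitively."""
    return path.suffix.lower() in IMAGE_SUFFIXES


print(resolve_target(""))                     # falls back to the camera image
print(is_image(pathlib.Path("clip.mp4")))     # False, so it goes to process_video
```

Note that an unknown extension is sent down the video path, so a mistyped image filename surfaces as a video-open error.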

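To get results back as JSON, the code simply asks for it in the prompt, so the format can break depending on how the model responds. In practice, using Ollama's `format` feature appears to be the better approach.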

"""
payload = {
    "model": "llava:7b",
    "prompt": effective_prompt,
    "images": images,
    "stream": False,
    "format": {
        "type": "array",
        "items": {
            "type": "object",
            "properties": {
                "object": {"type": "string"},
                "direction": {"type": "string"},
                "distance_m": {"type": "number"},
            },
            "required": ["object", "direction", "distance_m"],
        },
    },
}

resp = requests.post("http://ollama:11434/api/generate", json=payload, timeout=600)
"""
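When only the prompt-based approach is available, the reply sometimes wraps the JSON in extra text, and `json.loads` on the whole string then fails. A tolerant parser can recover many of those cases. The sketch below is one possible way to do that, not what `app.py` does, and `extract_json_array` and `valid_items` are hypothetical names. It assumes the first `[` and the last `]` in the reply delimit the array, which will not hold for every reply.

```python
import json


def extract_json_array(text):
    """Parse text as a JSON array, falling back to the outermost [...] span."""
    if not isinstance(text, str):
        raise ValueError("response is not text")
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        # Assumption: the array is the span from the first '[' to the last ']'.
        start, end = text.find("["), text.rfind("]")
        if start == -1 or end <= start:
            raise ValueError("no JSON array found in response")
        try:
            data = json.loads(text[start:end + 1])
        except json.JSONDecodeError as e:
            raise ValueError(f"invalid JSON array: {e}") from e
    if not isinstance(data, list):
        raise ValueError("response is not a JSON array")
    return data


def valid_items(items):
    """Keep only objects with string object/direction and a numeric distance_m."""
    keep = []
    for item in items:
        if not isinstance(item, dict):
            continue
        if not isinstance(item.get("object"), str):
            continue
        if not isinstance(item.get("direction"), str):
            continue
        dist = item.get("distance_m")
        # bool is a subclass of int, so exclude it explicitly.
        if isinstance(dist, bool) or not isinstance(dist, (int, float)):
            continue
        keep.append(item)
    return keep
```

Filtering with `valid_items` before rendering keeps a malformed entry from producing an empty table row, at the cost of silently dropping it.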

Step 6. Write the page template

Next, create `templates/index.html`.

HTML

<!doctype html>
<html>
    <head>
        <meta charset="utf-8" />
        <meta name="viewport" content="width=device-width,initial-scale=1" />
        <title>Camera Analyze</title>
        <link rel="stylesheet" href="/static/style.css" />
    </head>
    <body>
        <div class="container">
            <h1>Camera Analyze</h1>

            <div class="columns">
                <div class="left">
                    <div class="panel">
                        <h2>Latest frame</h2>
                        <div class="frame-wrap">
                            <img id="frame" src="/frame" alt="latest frame" class="frame" onerror="this.style.display='none';document.getElementById('noimg').style.display='block'" />
                            <p id="noimg" style="display:none">No image available.</p>
                        </div>
                    </div>
                </div>

                <div class="right">
                    <div class="panel">
                        <h2>Analyze</h2>
                        <form method="post" action="/analyze">
                            <label>Path to image (optional)</label>
                            <input type="text" name="path" placeholder="/camera-share/latest.jpg" value="{{ current_path or '' }}" />
                            <label>Prompt</label>
                            <textarea id="promptArea" name="prompt" rows="6">{{ current_prompt or default_prompt }}</textarea>
                            <div style="display:flex;gap:8px;align-items:center;margin-bottom:12px">
                                <button type="submit">Analyze</button>
                                <button type="button" id="saveDefaultBtn">Save as default</button>
                                <span id="saveStatus" style="color:#6b7280;font-size:0.9em"></span>
                            </div>
                            <label>Max frames</label>
                            <input type="number" name="max_frames" value="{{ current_max_frames or 8 }}" min="1" />
                            <label><input type="checkbox" name="structured" value="1" {% if current_structured %}checked{% endif %} /> Structured JSON output (parseable)</label>
                            <button type="submit">Analyze</button>
                        </form>
                    </div>

                    <div class="panel response-panel">
                        <h2>Response</h2>
                        <div class="response-content">
                            {% if response_text %}
                                {% if parsed_json %}
                                    <div class="parsed-table">
                                        <table>
                                            <thead><tr><th>Object</th><th>Direction</th><th>Distance (m)</th></tr></thead>
                                            <tbody>
                                                {% for item in parsed_json %}
                                                <tr>
                                                    <td>{{ item.object }}</td>
                                                    <td>{{ item.direction }}</td>
                                                    <td>{{ item.distance_m }}</td>
                                                </tr>
                                                {% endfor %}
                                            </tbody>
                                        </table>
                                    </div>
                                    <p class="timings">[Resize time: {{ resize_time }} sec] [Total time: {{ total_time }} sec]</p>
                                {% else %}
                                    <pre>{{ response_text }}</pre>
                                    <p class="timings">[Resize time: {{ resize_time }} sec] [Total time: {{ total_time }} sec]</p>
                                {% endif %}
                            {% else %}
                                <p>No analysis yet.</p>
                            {% endif %}
                        </div>
                    </div>
                </div>
            </div>
        </div>
        <script>
            function refreshFrame(){
                const img = document.getElementById('frame');
                if(!img) return;
                img.style.display = '';
                document.getElementById('noimg').style.display = 'none';
                img.src = '/frame?t=' + Date.now();
            }
            setInterval(refreshFrame, 1000);
            window.addEventListener('load', refreshFrame);

            document.getElementById('saveDefaultBtn').addEventListener('click', async function(){
                const btn = this;
                const status = document.getElementById('saveStatus');
                const prompt = document.getElementById('promptArea').value;
                btn.disabled = true;
                status.textContent = 'Saving...';
                try{
                    const form = new URLSearchParams();
                    form.append('prompt', prompt);
                    const r = await fetch('/save_prompt', {method:'POST', body: form});
                    const j = await r.json();
                    if(j.ok){
                        status.textContent = 'Saved';
                    } else {
                        status.textContent = 'Save failed';
                    }
                }catch(e){
                    status.textContent = 'Error';
                }finally{
                    btn.disabled = false;
                    setTimeout(()=>status.textContent='', 4000);
                }
            });
        </script>
    </body>
</html>

This page is where you check the latest image, enter the path to analyze, edit the prompt, and toggle JSON output.

Step 7. Write the styles

`static/style.css` looks like this. Adjust it to taste.

CSS

body{font-family:system-ui,Segoe UI,Roboto,Arial;margin:0;padding:20px;background:#f5f7fb}
.container{max-width:1600px;margin:0 auto}
.columns{display:flex;gap:20px;align-items:flex-start}
.left{flex:0 0 55%;max-width:900px}
.right{flex:1;min-width:380px}
.panel{background:#fff;padding:16px;border-radius:8px;margin-bottom:12px;box-shadow:0 1px 4px rgba(20,30,60,0.06)}
.frame{width:100%;height:auto;border-radius:6px;display:block}
.frame-wrap{display:flex;justify-content:center;align-items:center;min-height:280px}
form input[type=text],form input[type=number]{width:100%;padding:8px;margin:6px 0 12px;border:1px solid #d7dbe6;border-radius:6px}
form textarea{width:100%;padding:8px;margin:6px 0 12px;border:1px solid #d7dbe6;border-radius:6px;resize:vertical}
button{background:#2563eb;color:#fff;border:none;padding:10px 14px;border-radius:6px;cursor:pointer}
pre{white-space:pre-wrap;background:#0f1724;color:#e6f0ff;padding:12px;border-radius:6px}

@media (max-width: 900px){
    .columns{display:block}
    .right{flex-basis:auto;max-width:none}
    .left{min-width:0}
}

.response-panel{
    width:420px;
    max-width:90vw;
}
.response-content{
    max-height:60vh;
    overflow-y:auto;
}
.response-content pre{white-space:pre-wrap;word-break:break-word;margin:0}
.response-content .timings{color:#6b7280;font-size:0.9em;margin-top:8px}

@media (max-width: 900px){
    .response-panel{width:100%;}
}

.parsed-table table{width:100%;border-collapse:collapse}
.parsed-table th, .parsed-table td{border:1px solid #e6eef8;padding:8px;text-align:left}
.parsed-table thead{background:#f1f5f9}

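Step 8. Write out the latest image on the Windows side

To keep the `latest.jpg` that the app reads up to date, run the following script on the Windows side.

Set `out_dir` so that it points at the same shared folder as `source` in `.devcontainer/devcontainer.json`.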

Python

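import os
import time

import cv2

# Must point at the same shared folder as `source` in .devcontainer/devcontainer.json.
out_dir = r"D:\camera-share"
os.makedirs(out_dir, exist_ok=True)

# Write to a temporary file first, then swap it in,
# so that readers never see a half-written latest.jpg.
tmp_path = os.path.join(out_dir, "latest_tmp.jpg")
final_path = os.path.join(out_dir, "latest.jpg")

cap = cv2.VideoCapture(0, cv2.CAP_DSHOW)
print("opened:", cap.isOpened())

if not cap.isOpened():
    raise RuntimeError("Failed to open camera with CAP_DSHOW index 0")

try:
    while True:
        ret, frame = cap.read()
        if not ret:
            print("capture failed")
        elif not cv2.imwrite(tmp_path, frame):
            # imwrite returns False when the file could not be written.
            print("imwrite failed, skipping this frame")
        else:
            try:
                os.replace(tmp_path, final_path)
            except PermissionError as e:
                print("replace failed (file in use), skipping this frame:", e)
            except OSError as e:
                print("replace failed, skipping this frame:", e)
            else:
                print("saved", frame.shape)
        time.sleep(1)
finally:
    # Release the camera when the loop is interrupted (for example with Ctrl+C).
    cap.release()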
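
Because the script above rewrites `latest.jpg` about once per second, the reading side can tell whether the camera script is still running by looking at the file's modification time. The following is a minimal sketch of such a check. `frame_age_seconds`, `is_frame_fresh`, and the 5-second threshold are hypothetical and are not part of the original app.

```python
import os
import time


def frame_age_seconds(path, now=None):
    """Return the seconds elapsed since `path` was last modified, or None if it is missing."""
    try:
        mtime = os.path.getmtime(path)
    except OSError:
        return None
    if now is None:
        now = time.time()
    # Clamp to zero in case of small clock differences between host and container.
    return max(0.0, now - mtime)


def is_frame_fresh(path, max_age_seconds=5.0, now=None):
    """Return True when the file exists and was updated within `max_age_seconds`."""
    age = frame_age_seconds(path, now=now)
    return age is not None and age <= max_age_seconds
```

For example, `is_frame_fresh("/camera-share/latest.jpg")` returning `False` would suggest that the Windows-side script has stopped or that the shared folder is not mounted.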

Step 9. Start the app

Once everything above is in place, start the FastAPI app inside the `dev` container.

ShellScript

uvicorn app:app --reload --host 0.0.0.0 --port 8000

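If VS Code port forwarding is enabled, open the forwarded URL in your browser. If the app is forwarded to local port `8000`, open `http://localhost:8000`.

After startup, check things in this order.

Prepare `/camera-share/latest.jpg` and confirm that the app can load the current camera image
Enter the path of an image or video in the form field and analyze it

In the analysis form, enter the path of an image or video file that is reachable from the `dev` container. The path you enter is used as the analysis target as is.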
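
The same request can also be sent from a script instead of the browser. The following is a minimal sketch that uses only the standard library. The field names (`path`, `prompt`, `max_frames`, `structured`) come from the form in `templates/index.html`; the helper names, the base URL, and the assumption that `/analyze` replies with the rendered HTML page are illustrative and not confirmed by the original code.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_analyze_form(path="", prompt="", max_frames=8, structured=False):
    """Encode the form fields the same way the browser does (urlencoded body)."""
    fields = {"path": path, "prompt": prompt, "max_frames": str(max_frames)}
    if structured:
        # An unchecked checkbox is simply omitted from the submitted form.
        fields["structured"] = "1"
    return urlencode(fields).encode("utf-8")


def post_analyze(base_url, **form_kwargs):
    """POST the form to /analyze and return the response body as text."""
    request = Request(
        base_url.rstrip("/") + "/analyze",
        data=build_analyze_form(**form_kwargs),
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )
    # Generation can be slow, so allow a generous timeout.
    with urlopen(request, timeout=600) as response:
        return response.read().decode("utf-8")
```

With the app running, a call such as `post_analyze("http://localhost:8000", path="/camera-share/latest.jpg", prompt="Describe the scene.", structured=True)` would submit the same fields as the form.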

With JSON output enabled, results like the following can be handled as a table.

JSON

[
    {
        "object": "person wearing dark jacket",
        "direction": "front-left",
        "distance_m": 1.8
    },
    {
        "object": "desk with monitor",
        "direction": "front",
        "distance_m": 1.2
    },
    {
        "object": "chair",
        "direction": "right",
        "distance_m": 0.8
    }
]
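
Once parsed, this output is an ordinary list of dictionaries, so it can be processed further in Python. As a small illustration, the sketch below sorts the sample result above by `distance_m`, nearest first. `nearest_first` is a hypothetical helper name, and the distances are whatever the model estimated.

```python
import json

# The sample result shown above, as the model would return it in text form.
SAMPLE = """
[
    {"object": "person wearing dark jacket", "direction": "front-left", "distance_m": 1.8},
    {"object": "desk with monitor", "direction": "front", "distance_m": 1.2},
    {"object": "chair", "direction": "right", "distance_m": 0.8}
]
"""


def nearest_first(detections):
    """Return the detections sorted by estimated distance, closest first."""
    return sorted(detections, key=lambda item: item["distance_m"])


detections = json.loads(SAMPLE)
ordered = nearest_first(detections)
for item in ordered:
    print(f'{item["distance_m"]:.1f} m  {item["direction"]:<10}  {item["object"]}')
```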

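If you only want to check the connection to the VLM before using the web page, you can also try it from the CLI. Replace `<image-path>` and `<video-path>` below with actual file paths that are reachable from the `dev` container.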

ShellScript

python media_utils.py <image-path>
python media_utils.py <video-path> --max-frames 8

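Summary

There are three key points.

Keep the development container and the Ollama container separate
Factor out the image and video preprocessing first
Support JSON output in addition to plain text output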
