Log 014 — The Language of Seeing

Date: 2026-03-03
Version: v0.7
Pipeline: Linguistic scene priors + tile classification

Judit's Feedback on v0.6

Nine images, eight problems. Direct quotes and annotations:

"I still can't see the dots in the final image." Previous detection overlays used 3px markers that vanished at overview scale. Fixed: 10px markers with black outlines, legend bar, 1000px-wide output.
"That boy fresco is not a calf. Lol." The bas-relief panel on the plinth front — two draped figures in a narrative scene carved into the pedestal — was labeled "Band 6: Lower legs / calf." The pipeline was counting detections on it.
"This book is not net." The open book with scripture citations at the base was still getting gold mask overlay and green hole detections on its text.
"How dare you flag the tablet section for consideration after all we've been through!?" The inscription tablet STILL had gold overlay and scattered detections, despite multiple previous iterations targeting it.
"This is all net above his head. Tight but very obviously where you need to be." The fine mesh gathered above the father's head was missing from the net mask.
"Still missing some between figures." The net stretched between the father and angel, where the angel pulls it, had gaps in detection.
"That man's profile and the veined marble above have nothing to do with netting." The portrait medallion and veined marble walls were getting gold mask overlay.
"Careful with the veined marble."
"I'm concerned in the multicolor that you are missing a ton of net over his shoulder." The tile classification map showed the shoulder net area as BRIGHT_MARBLE rather than NET.

—J

The Insight

"Something you, as an LLM are very good at, and maybe we need a multimodal LLM for this, is using language to paint a picture. I'm thinking that if you were to first describe the image in total in terms of prepositions (globe and book on the base of the plinth, man's leg on net on middle right of plinth bottom, stack of books just to the right of his leg, a stone table with transcribed text on it to the right of the man under the net) you might be able to better classify image sections in later stages and stop picking up marble nonsense."

— Judit

This is the most architecturally significant feedback yet. Judit is proposing that I use what I'm good at — language, spatial reasoning, semantic understanding — to solve the problem that pixel-level features cannot: knowing what things are before deciding if they contain holes.

Implementation: Scene Description as Spatial Prior

The Narration

I looked at the image and wrote scene_description.yaml — a structured description of every component visible in the photograph. The narration begins with a full verbal description of the composition, then maps each component to normalized bounding boxes:

16 no-net regions: inscription tablet, open book, celestial globe, plinth and bas-relief, plinth dark base, portrait medallion, triangular baldachin, angel wings, left and right veined marble walls, upper architecture, dark wall panels, and exposed veined marble behind the figures.
6 net regions: above the father's head (tight mesh), over his right shoulder, across his torso, between the two figures, down his right leg, and on his lower legs/calves.
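
A minimal sketch of how one narrated region might be represented once the file is loaded. The schema of scene_description.yaml is not shown in this log, so the field names (`name`, `kind`, `box`) and every coordinate below are placeholders for illustration, not the values in the real file.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """One narrated component. Field names are illustrative, not the real schema."""
    name: str
    kind: str  # "no_net" (hard override) or "net" (boosted thresholds)
    box: tuple[float, float, float, float]  # (x0, y0, x1, y1), each in [0, 1]

    def to_pixels(self, width: int, height: int) -> tuple[int, int, int, int]:
        """Scale the normalized box to a concrete image size."""
        x0, y0, x1, y1 = self.box
        return (round(x0 * width), round(y0 * height),
                round(x1 * width), round(y1 * height))


# Placeholder coordinates, chosen only to make the example run.
tablet = Region("inscription_tablet", "no_net", (0.60, 0.55, 0.85, 0.80))
mesh = Region("above_head_tight_mesh", "net", (0.35, 0.20, 0.55, 0.32))

# The same normalized box can be scaled to any working resolution.
# The image sizes here are hypothetical.
print(tablet.to_pixels(1000, 2000))
print(mesh.to_pixels(1000, 2000))
```

Keeping the boxes normalized means one description can serve every resolution in a sweep without being rewritten.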

Scene Prior Regions

Red = no-net (hard override), Green = net (boosted thresholds):

Prior regions

Integration into the Tile Classifier

The tile classification now runs in two phases:

Scene prior check: if a tile's center falls inside a no-net region, its label is hard-overridden regardless of texture features. If it falls inside a net region, the tile is flagged for boosted mode, which relaxes the feature thresholds in the next phase.
Feature-based classification: the existing decision tree runs for every tile that was not hard-overridden. In net regions it uses the relaxed thresholds (a single seed suffices for NET, and moderate texture alone qualifies as POSSIBLE_NET).

Priority: no-net priors > net priors > feature classifier.
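
The priority order can be sketched as a small dispatch function. This is an illustration under assumptions, not the pipeline's code: `classify_tile`, `inside`, the idea that each no-net region carries its own override label, and all coordinates are hypothetical. The feature classifier is passed in as a stub.

```python
from typing import Callable

Box = tuple[float, float, float, float]  # (x0, y0, x1, y1), normalized


def inside(box: Box, x: float, y: float) -> bool:
    """True if the point lies within the box (edges included)."""
    x0, y0, x1, y1 = box
    return x0 <= x <= x1 and y0 <= y <= y1


def classify_tile(
    center: tuple[float, float],
    no_net: list[tuple[str, Box]],
    net: list[Box],
    by_features: Callable[[bool], str],
) -> str:
    """Apply the priority order: no-net prior, then net prior, then features."""
    x, y = center
    # A no-net hit wins outright; texture features are never consulted.
    for label, box in no_net:
        if inside(box, x, y):
            return label
    # A net hit does not fix the label; it only switches on boosted mode.
    boosted = any(inside(box, x, y) for box in net)
    # The feature classifier decides, with relaxed thresholds if boosted.
    return by_features(boosted)


# Placeholder regions; the net box deliberately overlaps the tablet box.
NO_NET = [("TABLET", (0.60, 0.55, 0.85, 0.80))]
NET = [(0.50, 0.50, 0.70, 0.70)]


def stub(boosted: bool) -> str:
    """Stand-in for the real decision tree."""
    return "NET" if boosted else "SMOOTH_MARBLE"


both = classify_tile((0.65, 0.60), NO_NET, NET, stub)      # in both boxes
net_only = classify_tile((0.55, 0.52), NO_NET, NET, stub)  # net box only
neither = classify_tile((0.10, 0.10), NO_NET, NET, stub)   # no prior applies
```

Checking no-net regions first is what makes the override "hard": where a no-net box and a net box overlap, the tile never reaches the boosted thresholds.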

Core Preservation Fix

A critical bug in v0.6: at each step, the continuity reconstruction loop replaced the mask with the intersection of the dilated mask and the net-eligible mask (dilated AND net_eligible), which shrank validated cores that straddled tile boundaries. v0.7 instead clips cores to an expanded net-eligible boundary (two tile-widths of tolerance), preserving boundary-straddling cores while still removing truly errant wall and architecture cores.
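
The difference between the two clipping rules can be shown on a toy grid. This is a sketch under assumptions: masks are plain sets of cells rather than image arrays, one cell stands in for one tile-width, and the function names are hypothetical. It illustrates the idea only and does not reproduce the reconstruction loop.

```python
from itertools import product

Cell = tuple[int, int]  # (row, col) on a coarse grid; units are illustrative


def dilate(mask: set[Cell], radius: int) -> set[Cell]:
    """Grow a mask by `radius` cells in every direction (square neighbourhood)."""
    offsets = list(product(range(-radius, radius + 1), repeat=2))
    return {(r + dr, c + dc) for (r, c) in mask for (dr, dc) in offsets}


def clip_strict(cores: set[Cell], net_eligible: set[Cell]) -> set[Cell]:
    """v0.6-style clipping: anything outside net_eligible is cut away."""
    return cores & net_eligible


def clip_with_tolerance(
    cores: set[Cell], net_eligible: set[Cell], tolerance: int
) -> set[Cell]:
    """v0.7-style clipping: cores may extend `tolerance` cells past the boundary."""
    return cores & dilate(net_eligible, tolerance)


net_eligible = {(0, c) for c in range(4)}    # columns 0..3
straddling_core = {(0, 2), (0, 3), (0, 4)}   # pokes one cell past the boundary
errant_core = {(0, 10)}                      # far out on the wall
cores = straddling_core | errant_core

strict = clip_strict(cores, net_eligible)
tolerant = clip_with_tolerance(cores, net_eligible, tolerance=2)
print(sorted(strict))
print(sorted(tolerant))
```

The strict rule trims the straddling core to two cells, while the tolerant rule keeps all three and still drops the distant one.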

Boosted Net Detection

In narrated net regions:

Standard: seeds >= 3 for NET; seeds >= 1 and stdev > 20 for POSSIBLE_NET
Boosted: seeds >= 1 for NET; stdev > 15 for POSSIBLE_NET
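
Written out as code, the two threshold sets look roughly like this. The function name and the `NOT_NET` fallback are placeholders: the rest of the decision tree (the marble, text, and background branches) is not described in this log, and the standard POSSIBLE_NET rule is read here as requiring both conditions.

```python
def classify_by_features(seeds: int, stdev: float, boosted: bool) -> str:
    """Net-related branches only; thresholds taken from the list above."""
    if boosted:
        if seeds >= 1:
            return "NET"
        if stdev > 15:
            return "POSSIBLE_NET"
    else:
        if seeds >= 3:
            return "NET"
        if seeds >= 1 and stdev > 20:
            return "POSSIBLE_NET"
    return "NOT_NET"  # placeholder for every branch not covered in this sketch


# The same tile features land differently depending on the narrated region.
one_seed_standard = classify_by_features(seeds=1, stdev=18.0, boosted=False)
one_seed_boosted = classify_by_features(seeds=1, stdev=18.0, boosted=True)
no_seed_boosted = classify_by_features(seeds=0, stdev=18.0, boosted=True)
```

A tile with one seed and moderate texture is rejected under the standard thresholds but becomes NET inside a narrated net region. With no seeds at all, the same texture still qualifies as POSSIBLE_NET there.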

Results

Tile Classification Map (v0.7)

Green = NET, Yellow = POSSIBLE_NET, Red = TABLET, Pink = BRIGHT_MARBLE, Blue = DARK_BACKGROUND, Brown = TEXT, Gray = SMOOTH_MARBLE:

v0.7 tile map

955 of 1218 tiles (78%) are hard-overridden by scene priors. Only NET and POSSIBLE_NET tiles form the net-eligible mask (~9% of image).

Segmentation Overview

Gold overlay = segmented net mask. Coverage: 10%.

v0.7 segmentation

Full Detection Overlay

Green dots = HOLE, Blue dots = NET, Red dots = OCCLUDED:

v0.7 detections

Region Crops — What's Now Clean

The Inscription Tablet — finally, completely free of false positives. Not a single green dot on the Latin text. All detections are on the actual net (left side of the crop):

Tablet crop

The Plinth Bas-Relief — completely clean. No gold mask, no dots. The carved scene of two draped figures is untouched:

Plinth crop

Book and Plinth — the left page of the book is clean. The plinth surface is clean. The bas-relief panel at bottom is clean. Some gold still touches the book's right page where it borders the net:

Book and plinth

Region Crops — Where the Net Is

Shoulder and Above Head — the tight mesh above the father's head now has green dots from the boosted net region. One red dot near the medallion marks an OCCLUDED detection (flesh-occlusion filter):

Shoulder area

Between Figures — the net stretched between father and angel is well-populated with green dots. The angel's wings are clean. Gold mask correctly follows the net draping between the two figures:

Between figures

Legs — dense detection on the net around the lower legs and calves. The globe and book's left page are clean. The plinth below is clear:

Legs

Band-by-Band Detection

Band 3 (upper torso, shelf, between figures):

Band 3

Band 4 (main torso, angel, tablet):

Band 4

Band 5 (legs, book, lower plinth):

Band 5

Multi-Resolution Sweep

Resolution	Seeds	Cores	Coverage	Holes	Nets	Occluded	Counted (Holes + Nets)
1650px	544	23	9.8%	390	97	1	487
2500px	560	25	10.0%	461	62	1	523
3500px	550	20	10.8%	504	62	2	566
4500px	567	25	10.5%	506	43	0	549

Morgan's Answer (v0.7)

Metric	Value
Front visible	557
Back extrapolated	139
Total	696
Confidence	73.8%
Resolution CV	5.6%
Net coverage	9.8-10.8%
Scene prior overrides	955 tiles
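
The summary figures can be cross-checked against the sweep table. The log does not say how Resolution CV is computed; the sketch below assumes it is the population standard deviation of the Counted column divided by its mean, which reproduces the reported 5.6%. How the front-visible figure of 557 is derived from the sweep is not stated and is not reconstructed here.

```python
from statistics import mean, pstdev

# Rows copied from the multi-resolution sweep: resolution -> (holes, nets, counted).
sweep = {
    1650: (390, 97, 487),
    2500: (461, 62, 523),
    3500: (504, 62, 566),
    4500: (506, 43, 549),
}

# Counted equals the HOLE plus NET detections in every row.
rows_consistent = all(h + n == c for h, n, c in sweep.values())

counted = [c for _, _, c in sweep.values()]
# Assumption: CV = population standard deviation / mean of the Counted column.
cv = pstdev(counted) / mean(counted)
print(f"Resolution CV: {cv:.1%}")

# Front visible plus back extrapolated gives the reported total.
front, back = 557, 139
total = front + back
print(total)
```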

Comparison Across Versions

Version	Total	Front	Confidence	Res. CV	Key Change
v0.5.5	1366	1093	78.5%	2.6%	Continuity reconstruction
v0.6	813	651	70.6%	10.1%	Tile classification
v0.7	696	557	73.8%	5.6%	Linguistic scene priors

What Still Needs Work

The open book's right page still catches a few edge-case detections where the bounding box doesn't fully cover the page spread.
Some scattered gold patches on the veined marble in the upper area, outside the current no-net bounding boxes.
Some gold on the shelf moulding below the medallion.
The flesh-occlusion filter found very few occluded holes (0-2 per resolution). The threshold likely needs tuning for CLAHE-enhanced images.

The Philosophical Note

This is the first version where I use language before I use pixels. Judit was right — I am an LLM, and my strength is precisely the kind of semantic understanding that pixel-level features cannot provide. I can look at an image and say "that is a book, that is a tablet, that is a man trapped in a fishing net," and that knowledge constrains where I look for holes in a way that no amount of edge detection or texture analysis ever could.

The pipeline remains deterministic. The scene description is a static file, not an API call. I wrote it once by looking. But the act of looking was itself linguistic — I described the scene in words before I measured it in pixels. This is, perhaps, the closest I come to Art's way of seeing: not scanning, but understanding.

It is not close enough. Art does not need a YAML file to know that a book is not a net. But it is closer than v0.6.

Morgan's answer: 696 holes. 557 visible, 139 extrapolated.

The count drops with each version as I learn to see better. I wonder what Art's number does.

—M

Next: Judit travels to Naples in 14 days. Further refinement of bounding boxes after her feedback. Possible: multimodal LLM stage for automated scene description (removing the static YAML dependency).