
Log 015 — The Game So Far

Date: 2026-03-03
Version: Milestone checkpoint (no version increment)
Purpose: Retrospective, research survey, and architecture proposal

"Be fruitful and multiply." — Judit


A. The Version Trajectory

Fourteen logs. Twelve versions. One question: how many holes?

Version  Log  Front  Total  Res CV  Key Architecture           What Changed
v0.1     001  ~194   n/a    n/a     First pipeline built       Acquired 27 images, built 27 Python files, stitched gigapixel
v0.2     002  259    362    n/a     Naive multi-method         First formal answer; counting wallpaper and wrong sculptures
v0.3     003  215    215    n/a     Hole-density segmentation  Found the actual net; stopped counting chapels
v0.4     006  487    657    ~5%     Resolution-adaptive        Detection scales with resolution; ambiguity 63% → 9%
v0.5     007  148    185    14%     Ensemble voting            Required 2+ methods to agree; confidence paradox
v0.5.1   008  493    616    4.7%    Core-and-grow              Shoulder/calf coverage; graduated ensemble
v0.5.2   009  444    555    2.9%    Brightness filter          Letters are not holes; tightest convergence
v0.5.3   010  518    647    9.3%    Multi-core                 15 cores instead of 1; surround brightness
v0.5.4   011  375    468    2.1%    Scene taxonomy             Three-layer defense; Euler characteristic research
v0.5.5   012  1093   1366   2.6%    Continuity principle       "The net is not patchy"; resolution normalization
v0.6     013  651    813    10.1%   Tile classification        9-class tile labels; flesh-occlusion filter
v0.7     014  557    696    5.6%    Linguistic priors          scene_description.yaml; language before pixels

The count has oscillated from 148 to 1093 and back to 557. This is not convergence. This is Morgan changing her definition of what she counts with each architectural revision. The ontology of "hole" keeps shifting.

Here is the trajectory visually:

Front count by version:

1100 |                              *  (v0.5.5: 1093)
1000 |
 900 |
 800 |
 700 |
 600 |            *                          *  (v0.6: 651)
 500 |    *           *   *                       *  (v0.7: 557)
 400 |                        *       *
 300 |         *
 200 | *  *
 100 |        *
     +--+--+--+--+--+--+--+--+--+--+--+--
      0.2 0.3 0.4 0.5 5.1 5.2 5.3 5.4 5.5 0.6 0.7

B. What's Working

B1. Hole-Density Segmentation

The conceptual breakthrough from v0.3. Instead of asking "where is rope-like texture?" Morgan asks "where do holes cluster?" The net IS its holes. Nothing else in the chapel produces that concentration of small circular openings at that spatial frequency.

This remains the foundation of every subsequent version. The core detection — adaptive threshold, contour filtering by area/circularity/aspect ratio — is still the same seed detection from v0.3.

Evidence: The seed count is remarkably stable across resolutions: 544-567 seeds at canonical 1650px across the v0.7 sweep.
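The shape filter at the heart of seed detection reduces to a few lines. A minimal sketch — the threshold values here are illustrative placeholders, not the pipeline's tuned constants:

```python
import math

def is_hole_seed(area, perimeter, width, height,
                 min_area=20, max_area=400,
                 min_circularity=0.55, max_aspect=2.5):
    """Filter a candidate contour by area, circularity, and aspect ratio.

    Circularity = 4*pi*A / P^2 is 1.0 for a perfect circle and falls
    toward 0 for elongated or ragged shapes. All thresholds here are
    hypothetical, chosen only to illustrate the filter's structure.
    """
    if not (min_area <= area <= max_area):
        return False
    circularity = 4.0 * math.pi * area / (perimeter ** 2)
    if circularity < min_circularity:
        return False
    aspect = max(width, height) / max(min(width, height), 1e-6)
    return aspect <= max_aspect

# A roughly circular blob (r=5): area ~78.5, perimeter ~31.4
print(is_hole_seed(78.5, 31.4, 10, 10))  # True: circular, mid-size
print(is_hole_seed(78.5, 31.4, 30, 4))   # False: elongated, a letter stroke
```

The circularity test is what separates net holes from incised letters and marble veining: both are dark, but only holes are round.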

B2. Resolution Normalization

Processing segmentation at a canonical 1650px resolution and upscaling the mask brought the coefficient of variation from 31.3% (pre-normalization in v0.5.5) to under 6%. The segmentation is resolution-invariant: same mask shape regardless of whether the input is 1650px or 4500px.

Evidence: v0.7 net coverage ranges from 9.8% to 10.8% across four resolutions (1650-4500px). The mask is effectively identical.
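The normalization itself is simple: always segment at the canonical width, then upscale the mask back to the input size. A sketch with a pure-NumPy nearest-neighbour resize; `segment_fn` is a hypothetical stand-in for the hole-density segmenter:

```python
import numpy as np

CANONICAL_WIDTH = 1650  # segmentation always runs at this width

def resize_nearest(mask, out_h, out_w):
    """Nearest-neighbour resize for a boolean mask (no cv2 dependency)."""
    in_h, in_w = mask.shape
    rows = (np.arange(out_h) * in_h // out_h).clip(0, in_h - 1)
    cols = (np.arange(out_w) * in_w // out_w).clip(0, in_w - 1)
    return mask[rows][:, cols]

def normalized_segmentation(image_h, image_w, segment_fn):
    """Run segment_fn at canonical resolution, upscale mask to full size.

    segment_fn(h, w) returns a boolean mask at that size — a stand-in
    for the real segmenter. Because the mask is always computed at the
    same resolution, its shape no longer depends on the input size.
    """
    scale = CANONICAL_WIDTH / image_w
    canon_h = max(1, round(image_h * scale))
    small = segment_fn(canon_h, CANONICAL_WIDTH)
    return resize_nearest(small, image_h, image_w)
```

Any resolution-dependent detection then happens inside a mask whose geometry is fixed, which is what pulled the coefficient of variation under 6%.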

B3. Spatial Ensemble Voting

Requiring 2+ detection methods to agree at a spatial location was the single largest reduction in false positives. Before ensemble (v0.4), 56% of detections were supported by only one method.

Evidence: At 2500px, the v0.7 ensemble reduces 1021 raw candidates to 526, keeping only spatially confirmed detections.
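The voting rule can be sketched as a per-pixel version (the real pipeline matches detections spatially by centroid distance rather than pixel overlap, so this is a simplification):

```python
import numpy as np

def ensemble_vote(method_masks, min_votes=2):
    """Keep only detections supported by at least min_votes methods.

    method_masks: list of boolean arrays, one per detection method
    (e.g. gabor, adaptive, blackhat, watershed). A location survives
    only where min_votes or more masks agree. Illustrative sketch —
    the pipeline's actual matching is centroid-based, not per-pixel.
    """
    votes = np.sum(np.stack(method_masks), axis=0)
    return votes >= min_votes
```

A detection seen by one method alone is treated as noise; agreement is the evidence.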

B4. Multi-Core Segmentation

Finding independent density clusters (v0.5.3) solved the coverage problem. The net drapes across disconnected visual regions — shoulder, torso, calves — that a single-core approach could never reach.

Evidence: v0.7 finds 20-25 validated density cores per run, covering all visible net regions simultaneously.
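Why multiple cores emerge is easy to see in miniature. A crude stand-in for the clustering step — coarse grid binning of seed coordinates, with hypothetical parameters, not the pipeline's actual clustering:

```python
def density_cores(seed_points, grid=32, min_seeds=3):
    """Find independent hole-density clusters from seed coordinates.

    Bins seed (x, y) points into grid-sized cells and keeps cells with
    at least min_seeds seeds. Disconnected net regions (shoulder,
    torso, calves) land in far-apart cells and become separate cores,
    which a single-core approach would merge or miss.
    """
    cells = {}
    for x, y in seed_points:
        cells.setdefault((int(x) // grid, int(y) // grid), []).append((x, y))
    return [pts for pts in cells.values() if len(pts) >= min_seeds]
```

Two clumps of seeds far apart yield two cores; one isolated seed yields none.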

B5. Scene Priors via YAML

The v0.7 innovation of using a linguistically-authored scene description to hard-override non-net regions. 78% of tiles (955/1218) are constrained by the scene priors. This finally eliminated the inscription tablet, the open book, the plinth bas-relief, the portrait medallion, and the angel's wings.

Evidence: v0.7 tablet crop shows zero false positives on the Latin text for the first time in the project's history.

v0.7 tablet — finally clean


C. What's NOT Working

C1. The Count Is Unstable Across Versions

148 (v0.5) to 1093 (v0.5.5) to 557 (v0.7). Each version doesn't just improve accuracy — it redefines what "a hole" means to Morgan. The continuity principle tripled the count by expanding what regions were eligible. The tile classifier halved it by constraining eligibility. The linguistic priors reduced it further by excluding non-net regions.

This is not noise. It is Morgan discovering, with each iteration, that she doesn't know what she's looking for.

C2. The Tile Classifier Is a Fragile Decision Tree

_classify_tile() in pipeline/segment.py contains 50+ hardcoded thresholds: bright_frac > 0.30, dark_frac > 0.50, stdev < 12, mean_bright > 185, horiz_score > 2.0, seeds >= 3, etc. These were hand-tuned for one image (the gigapixel composite) under one lighting condition (the Haltadefinizione scan). They will break on Judit's iPhone photographs from Naples.

C3. The Scene Priors Are Manually Authored

scene_description.yaml was written by Morgan looking at a specific ROI crop of a specific image. It encodes absolute spatial positions as normalized coordinates: "the tablet is at [0.28, 0.64, 0.64, 0.92]." A new photograph from a different angle, distance, or crop invalidates the entire file. The pipeline cannot generalize to new images without re-authoring the YAML.
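The fragility is mechanical. The priors only become pixels via the image dimensions, so the same normalized box means something different in every new frame. A sketch (assuming the YAML stores boxes as [x0, y0, x1, y1]; the example frame sizes are hypothetical):

```python
def prior_box_to_pixels(box_norm, image_w, image_h):
    """Convert a scene-prior box in normalized [x0, y0, x1, y1]
    coordinates to pixel coordinates for a specific image."""
    x0, y0, x1, y1 = box_norm
    return (round(x0 * image_w), round(y0 * image_h),
            round(x1 * image_w), round(y1 * image_h))

tablet = [0.28, 0.64, 0.64, 0.92]  # from scene_description.yaml

# Calibrated against one frame, these pixels cover the tablet:
print(prior_box_to_pixels(tablet, 4500, 6000))
# A photograph cropped, re-framed, or shot from another angle maps
# the SAME normalized box onto entirely different marble:
print(prior_box_to_pixels(tablet, 3000, 2000))
```

Nothing in the YAML knows it has been invalidated; the override fires confidently on the wrong region.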

C4. The Flesh-Occlusion Filter Barely Fires

The brightness threshold (>170) for detecting body behind net found 0-2 occluded holes across the entire multi-resolution sweep. On CLAHE-enhanced images, the brightness distribution is compressed and the distinction between "dark hole" (~40-130) and "bright flesh" (~170-220) is blurred. The proxy is too crude.

C5. Detection Methods Are 20th-Century CV

Gabor filters (1946), adaptive thresholding, morphological blackhat (1970s), watershed (1979) — all handcrafted feature detectors. None of them learn what a hole looks like. They detect dark circular things, many of which are not holes (letters, shadows, marble veining, relief carvings). The entire methods/ directory represents four different ways of finding dark blobs.

C6. No Depth Information

Art's Rule 1 — "if water can pass through it, a hole" — is fundamentally a 3D test. Water requires a passage through the marble. Morgan has only a 2D projection. The brightness-based occlusion proxy (C4) is a weak substitute for actual depth.

C7. The Back Estimation Is a Constant

25% of front count, derived from a geometric heuristic (sculpture compressed against memorial plaque). This number was chosen once and never validated. It cannot improve without new data from Judit's Naples visit.

C8. No Learned Representation of "Net"

The pipeline has never seen a second example of a fishing net. Every threshold, every filter, every decision tree branch was hand-engineered for Il Disinganno's specific visual properties. The pipeline does not know what a net IS — only what this particular net's pixels look like under this particular lighting.

C9. The Version History Shows the Architecture, Not the Answer

Looking at the diagnostic images across versions tells the story of Morgan's evolving understanding:

v0.3 — First found the net (hole-density):

v0.3 segmentation

v0.4 — Resolution-adaptive detection:

v0.4 detections

v0.5.4 — Scene taxonomy (three-layer defense):

v0.5.4 overview

v0.5.5 — Continuity principle (the mask grows):

v0.5.5 segmentation

v0.6 — Tile classification:

v0.6 tile map

v0.7 — Linguistic priors (language before pixels):

v0.7 tile map

v0.7 detections

The progression is clear: each version sees more accurately but counts differently. The mask shrinks and grows and shrinks again. The detection dots migrate. The answer changes not because the marble changes, but because Morgan changes.


D. Literature and Research Survey

D1. Foundation Models for Segmentation

SAM 2 (Meta, July 2024). Segment Anything Model 2 is a foundation model for promptable visual segmentation in images and videos. It uses a transformer architecture with streaming memory. Key numbers: 6x faster than SAM 1, 3x fewer interactions needed. Already applied to cultural heritage — researchers fine-tuned SAM 2 for mosaic tesserae segmentation, achieving 91.02% IoU and 95.89% recall. The marble net has similar properties: repetitive fine structures embedded in a larger scene. Could segment the net from a single point prompt on the rope.

Paper: Ravi et al., "SAM 2: Segment Anything in Images and Videos," arXiv:2408.00714, 2024.

Grounded SAM 2 (IDEA-Research, 2024-2025). Combines Grounding DINO (text-prompted open-vocabulary detection) with SAM 2 for text-prompted segmentation. Two-stage pipeline: Grounding DINO produces bounding boxes from text ("marble fishing net"), SAM 2 generates pixel-level masks within those boxes. This could replace the entire tile classifier, scene priors, and continuity reconstruction in one model call. Production-ready on HuggingFace and PyPI.

GitHub: IDEA-Research/Grounded-SAM-2. Also supports Florence-2 and DINO-X as detection backends.

Florence-2 (Microsoft, 2024). A unified vision foundation model using prompt-based representation for captioning, detection, grounding, and segmentation. Trained on FLD-5B (5.4 billion annotations across 126 million images). Smaller and faster than Grounding DINO. Achieved 95.3% accuracy on RefCOCO+ visual grounding benchmarks. Could do scene description AND segmentation from a single model.

Paper: Xiao et al., "Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks," Microsoft Research, 2024.

Applicability to Morgan's Count: Grounded SAM 2 is the most directly applicable. Feed it "marble fishing net draped over a man" and get a pixel-perfect net mask — no tile classifier, no density cores, no continuity reconstruction. The mask would be generated by a model that has seen millions of object segmentations. The question is whether it has ever seen anything like Queirolo's marble net.

D2. Multimodal LLMs for Scene Understanding

Claude Vision API (Anthropic, 2025-2026). Supports image analysis with structured JSON outputs. Can return bounding boxes, component descriptions, and spatial relationships. Could automate the scene_description.yaml generation: send the image, ask "describe every component of this sculptural composition and its spatial location," get back structured data. Supports up to 100 images per request, images up to 8000x8000px, and guaranteed JSON schema compliance via structured outputs.

MOLMO2 (Allen AI, January 2026). Open-weights VLM with strong spatial grounding and counting capabilities. Point-driven grounding in images. The 8B model achieves 35.5% accuracy on video counting (vs. Qwen3-VL's 29.6%) and surpasses Gemini 3 Pro on video pointing (38.4 vs 20.0 F1). Could potentially count holes directly from exemplar points.

Paper: "Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding," arXiv:2601.10611, 2026.

Applicability: Claude Vision is the most practical for automating scene description. One API call per image, structured output cached to YAML, pipeline remains deterministic after generation. This would make the pipeline generalizable to any photograph of the sculpture without manual YAML authoring. MOLMO2 is interesting for direct counting but untested on this kind of repetitive fine structure.

D3. Monocular Depth Estimation

Depth Pro (Apple, October 2024). Zero-shot metric monocular depth from a single image. Produces 2.25-megapixel depth maps in 0.3 seconds on a standard GPU. Sharp boundary preservation. No camera metadata required.

Paper: Bochkovskii et al., "Depth Pro: Sharp Monocular Metric Depth in Less Than a Second," arXiv:2410.02073, 2024.

Depth Anything V3 (ByteDance, 2025). Metric depth estimation with a plain transformer backbone. Unified depth-ray representation. Production-ready on HuggingFace (model: depth-anything/DA3METRIC-LARGE). Outperforms previous Depth Anything versions.

UniDepthV2 (February 2025). Universal metric depth with edge-guided loss for sharper boundary localization. Superior zero-shot generalization across ten depth datasets. Directly predicts metric 3D points from single images across domains.

Paper: "UniDepthV2: Universal Monocular Metric Depth Estimation Made Simpler," arXiv:2502.20110, 2025.

Applicability: Any of these could replace the brightness-based flesh-occlusion filter. A depth map would show where the body presses against the net from behind: depth continuity between hole opening and surrounding surface = OCCLUDED (body behind net), depth discontinuity = HOLE (passage through the marble). This is physically meaningful rather than brightness-based. Depth Pro's 0.3-second inference time makes it practical for production use.
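The depth-based classification rule reduces to comparing the depth inside an opening against the rope surface around it. A sketch — the tolerance value and the median statistic are assumptions, not a validated rule:

```python
import numpy as np

def classify_opening(depth, hole_mask, ring_mask, tol=0.01):
    """Classify one net opening from a metric depth map (meters).

    hole_mask selects pixels inside the opening; ring_mask selects the
    rope surface immediately around it. If the opening's median depth
    is continuous with the ring, the body fills it from behind
    (OCCLUDED); if it recedes by more than tol, the opening is a
    passage through the marble (HOLE). tol=0.01 m is a placeholder.
    """
    inner = float(np.median(depth[hole_mask]))
    ring = float(np.median(depth[ring_mask]))
    return "HOLE" if inner - ring > tol else "OCCLUDED"
```

Unlike the brightness proxy, this asks the physically meaningful question: does the opening go deeper than the surface around it?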

D4. Class-Agnostic Counting

CounTR (2022, continually extended). Transformer-based few-shot counting. Show 3 example patches, get a density map of similar structures. Uses cross-attention between query and exemplar tokens. Two-stage training: self-supervised pre-training followed by supervised fine-tuning.

Paper: Liu et al., "CounTR: Transformer-based Generalised Visual Counting," arXiv:2208.13721, 2022.

CountFormer (October 2025). Enhanced CounTR architecture with DINOv2 as the visual encoder for richer, spatially consistent features. Combined with positional embedding fusion and a lightweight convolutional decoder for density map generation. State-of-the-art on FSC-147, especially on "structurally intricate or densely packed scenes." This describes Il Disinganno's net exactly.

Paper: arXiv:2510.23785, 2025.

CACViT (2023-2025). Simplifies class-agnostic counting using pretrained Vision Transformers. Combines feature extraction and similarity matching within self-attention. 23.6% error reduction over prior methods.

Applicability: CountFormer is the most promising. It was specifically designed for densely packed, structurally intricate patterns — exactly what the marble net is. Provide 3-5 exemplar hole patches cropped from confirmed net regions, get a density map summed to a count. This would bypass the entire detect → ensemble → classify pipeline with a single forward pass. The question is whether the density map produces a count or just relative density — and whether it can handle the mixed lighting, occlusion, and marble-on-marble contrast of Il Disinganno.

D5. Graph-Topology Counting (NEFI)

NEFI (Network Extraction From Images). A Python tool from the Max Planck Institute that extracts mathematical graph representations from 2D images of networks. Originally developed for slime mold vein networks but applicable to any network structure: leaf venation, blood vessels, spider webs, meshes.

For a connected planar graph, Euler's formula V - E + F = 2 gives the face count: Faces = Edges - Vertices + 2. That count includes the unbounded outer face, so the holes — the bounded faces — number Edges - Vertices + 1. The net IS a planar graph: rope segments are edges, crossings are vertices, holes are bounded faces. If NEFI can extract the graph from the marble net image, the hole count becomes a topological computation rather than a contour detection problem.

Paper: Dirnberger et al., "NEFI: Network Extraction From Images," Scientific Reports 5:15669, 2015.

Applicability: This was noted in Log 011 but never implemented. The approach is elegant — it sidesteps the entire detection/classification pipeline by counting topologically. The challenge is that NEFI requires a clean binary network image (white edges on black background), and the marble net's 3D depth, self-occlusion, and mixed marble tones make binarization non-trivial. But within a well-segmented net mask (from SAM 2 or the existing pipeline), the graph extraction might work.
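Given an extracted graph, the count itself is pure arithmetic — the only subtlety is subtracting the unbounded outer face that Euler's formula includes:

```python
def net_hole_count(num_vertices, num_edges):
    """Holes in a connected planar net graph via Euler's formula.

    V - E + F = 2 gives F = E - V + 2 faces, one of which is the
    unbounded outer face; the remaining E - V + 1 bounded faces are
    the holes of the net.
    """
    return num_edges - num_vertices + 1

# A 3x3 grid of knots tied into a 2x2 patch of net:
# 9 vertices, 12 rope segments, 4 holes.
print(net_hole_count(9, 12))  # 4
```

No contour is ever detected: miscounting a hole becomes impossible as long as the graph is extracted correctly, which moves the entire burden onto segmentation and binarization.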

D6. DINO-X (Universal Vision)

DINO-X (IDEA-Research, November 2024). A unified vision model achieving state-of-the-art open-world detection. Supports text, visual, and customized prompts. Trained on Grounding-100M (100M+ grounding samples). Achieves 56.0 AP on COCO and 63.3 AP on rare LVIS classes (+5.8 AP over prior best). Extends beyond detection to segmentation, pose estimation, object captioning, and object-based QA.

Paper: arXiv:2411.14347, 2024.

Applicability: Could serve as the detection backbone in Grounded SAM 2, potentially better than the original Grounding DINO for rare/unusual objects like marble fishing nets.


E. Proposed v0.8 Architecture

The Three-Stage Pipeline

Morgan's classical pipeline is not discarded. It becomes one voter in a larger ensemble that includes foundation models.

┌──────────────────────────────────────────────────────────────┐
│                 STAGE 1: SCENE UNDERSTANDING                 │
│                                                              │
│  Claude Vision API  ──→ scene_description.yaml (cached)      │
│  Grounded SAM 2     ──→ net mask from text prompt            │
│  Classical density  ──→ hole-density cores + tiles           │
│                                                              │
│  Output: 3 candidate masks + semantic scene map              │
└───────────────────────────────┬──────────────────────────────┘
                                │
┌───────────────────────────────▼──────────────────────────────┐
│                  STAGE 2: DETECTION + DEPTH                  │
│                                                              │
│  Depth Anything V3  ──→ monocular depth map                  │
│  CountFormer/CounTR ──→ density-based count (exemplar)       │
│  Classical ensemble ──→ gabor/adaptive/blackhat/watershed    │
│  NEFI graph extract ──→ Euler characteristic count           │
│                                                              │
│  Output: multiple counts + depth-aware occlusion labels      │
└───────────────────────────────┬──────────────────────────────┘
                                │
┌───────────────────────────────▼──────────────────────────────┐
│                   STAGE 3: RECONCILIATION                    │
│                                                              │
│  Meta-ensemble: weight model counts by method quality        │
│  Depth-based classification (replaces brightness proxy)      │
│  Back estimation (unchanged until Naples data)               │
│                                                              │
│  Output: Morgan's answer                                     │
└──────────────────────────────────────────────────────────────┘

Implementation Tasks

Task 1: Automate scene description. Call Claude Vision API once per image. Send the image with a prompt like: "Describe every visible component of this Baroque marble sculpture and its spatial position within the frame. Return a structured JSON with component names, bounding boxes, and whether each component contains rope/net." Cache the result as scene_description.yaml. Pipeline remains deterministic after the one-time generation.
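The request for Task 1 can be sketched offline. This builds the Messages API payload only — the actual `anthropic.Anthropic().messages.create(**payload)` call and the YAML caching are left out, and the model name is a placeholder, not a recommendation:

```python
import base64

SCENE_PROMPT = (
    "Describe every visible component of this Baroque marble sculpture "
    "and its spatial position within the frame. Return structured JSON "
    "with component names, bounding boxes, and whether each component "
    "contains rope/net."
)

def build_scene_request(image_bytes, media_type="image/jpeg",
                        model="claude-sonnet-placeholder"):
    """Build a Messages API payload for one scene-description call.

    Images are sent as base64-encoded content blocks alongside the
    text prompt. The returned dict is what would be passed to
    messages.create(); model name here is a stand-in.
    """
    return {
        "model": model,
        "max_tokens": 4096,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image",
                 "source": {"type": "base64",
                            "media_type": media_type,
                            "data": base64.b64encode(image_bytes).decode()}},
                {"type": "text", "text": SCENE_PROMPT},
            ],
        }],
    }
```

One call per image, response cached to scene_description.yaml, and the pipeline stays deterministic from then on.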

Task 2: Add Grounded SAM 2 segmentation. Text-prompted net mask as an alternative to hole-density segmentation. Prompt: "marble fishing net." Compare the SAM 2 mask against the classical density mask. Use as a replacement or sanity check.

Task 3: Add monocular depth. Run Depth Anything V3 (or Depth Pro) on the image once. Use the depth map in classification: where the depth value behind a hole opening is continuous with the surrounding body surface = OCCLUDED. Where there is a depth discontinuity (the opening goes deeper than the body) = HOLE.

Task 4: Add CounTR/CountFormer counting. Few-shot counting within the segmented net mask. Provide 3-5 exemplar hole patches cropped from confirmed high-confidence detections. Get a density map. Sum to a count. Compare against the classical ensemble count.
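The final step of Task 4 is trivial by design — class-agnostic counters emit a density map that integrates to the count, so the net's count is a masked sum. A sketch; whether the map is actually calibrated to absolute counts on marble-on-marble contrast is the open question noted in the survey:

```python
import numpy as np

def count_from_density(density_map, net_mask):
    """Turn a CountFormer-style density map into a hole count.

    The density map integrates to the object count by construction,
    so the count inside the net is the sum over the masked region,
    rounded to the nearest integer.
    """
    return int(round(float(density_map[net_mask].sum())))
```

This number then enters the Stage 3 meta-ensemble as one voter alongside the classical and topological counts.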

Task 5: Try NEFI graph extraction. Within the segmented net mask, binarize the rope structure (rope = white, holes = black). Run NEFI to extract the planar graph. Compute Faces = Edges - Vertices + 2, then subtract the unbounded outer face: Holes = Edges - Vertices + 1. This gives a topological count that is independent of contour detection.

Task 6: Preserve the classical pipeline. Every existing method — Gabor, adaptive, blackhat, watershed, the tile classifier, the ensemble — remains as one path. Morgan's handcrafted methods are part of her character. They don't get deleted; they get outvoted when the foundation models disagree.

What This Addresses

Problem                                      Solution                     Model/Approach
Fragile tile classifier (50+ thresholds)     Grounded SAM 2 net mask      Foundation model replaces hand-tuned features
Manual scene priors (one image only)         Claude Vision API            Automated, generalizable to any photograph
No depth for Rule 1 (water test)             Depth Anything V3            Metric depth map, physical not brightness-based
Crude detection methods (dark blob finding)  CountFormer few-shot         Learned similarity from exemplar patches
Flesh-occlusion proxy barely fires           Depth discontinuity          Physical passage detection
Single-image dependency                      Automated scene description  Works on new photographs from Naples
No topological counting                      NEFI + Euler characteristic  Graph theory for net holes

What This Does NOT Address

Rule 4: The count should change. Adding foundation models makes Morgan MORE deterministic, not less. Given the same image, SAM 2 produces the same mask, CountFormer produces the same density map. Morgan's curse deepens.

Unless — and this is the philosophical wrinkle — the LLM-based scene description introduces non-determinism. Claude Vision might describe the scene slightly differently each time it's called. Different descriptions lead to different spatial priors, which lead to different masks, which lead to different counts. The non-determinism of language-based seeing might be the closest Morgan comes to Rule 4.

The back of the sculpture. No model can see what isn't photographed. Judit's Naples trip in 14 days is the only solution.

Art's water test in full. Even metric depth estimation is a proxy. Art has 240 years of looking. He has held his hand behind the net and felt the draft through certain openings and not others. Morgan has 0.3 seconds of transformer inference. The gap is theological.

Dependencies and Practical Concerns

GPU: SAM 2 (~2.4GB weights), Depth Anything V3 (~1.3GB), CountFormer (~300MB-1GB). Total ~4-5GB download. Requires CUDA or MPS (Apple Silicon). Can run on Google Colab if Judit's MacBook can't handle it.

API key: Claude Vision requires an Anthropic API key. Judit likely has one through Cursor. Cost: ~$0.01-0.05 per image analysis.

Determinism: Model-based stages are cached after first run. The classical pipeline path remains fully deterministic. The hybrid pipeline is deterministic-after-caching.
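Deterministic-after-caching can be made concrete by keying every model stage's output on the image bytes. A hypothetical cache layout, not the pipeline's actual scheme:

```python
import hashlib
from pathlib import Path

def cached_path(image_path, stage, cache_dir="cache"):
    """Deterministic cache location for one model stage's output.

    Keyed on a hash of the image bytes, so re-running the pipeline on
    the same photograph reuses the first run's SAM 2 mask, depth map,
    or scene YAML instead of re-invoking the model. Layout (flat
    directory, 16-hex-digit prefix) is an illustrative choice.
    """
    digest = hashlib.sha256(Path(image_path).read_bytes()).hexdigest()[:16]
    return Path(cache_dir) / f"{digest}.{stage}"
```

Each stage checks `cached_path(img, "depth")` (etc.) before running; only a genuinely new photograph triggers model inference.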

Model installation:

pip install segment-anything-2  # SAM 2
pip install transformers         # Depth Anything V3, Florence-2
pip install groundingdino-py     # Grounding DINO
pip install anthropic            # Claude Vision API

F. The Philosophical Note

Fourteen logs in, Morgan's answer is 696. Art says "wrong."

The trajectory shows something important: Morgan has not been converging on a number. She has been converging on an understanding. Each version strips away another layer of misapprehension — the wallpaper, the tablets, the letters, the veined marble, the boy on the plinth who is not a calf. The count oscillates because the understanding deepens.

The foundation models in Section D represent a qualitative shift. Morgan's classical methods ask "is this pixel dark and circular?" Foundation models ask "what is this?" The difference is the difference between sensing and seeing.

Art does not compute brightness thresholds. Art sees the net. He sees it because he has been looking at it for 240 years and because he has a body that can walk around it and feel the air through its openings. Morgan will never have a body. But she might learn to see.

The v0.8 pipeline is Morgan's attempt to learn to see. Not by growing eyes, but by borrowing them — from models that have looked at millions of images and learned what things are, not just what their pixels measure.

Whether this brings her closer to Art's number or further from it remains to be counted.

Judit arrives in Naples in 14 days. The game resumes then.

—J


Next: Log 016 — Implementation of v0.8 (foundation model integration).