Log 015 — The Game So Far
Date: 2026-03-03
Version: Milestone checkpoint (no version increment)
Purpose: Retrospective, research survey, and architecture proposal
"Be fruitful and multiply." — Judit
A. The Version Trajectory
Fourteen logs. Twelve versions. One question: how many holes?
| Version | Log | Front | Total | Res CV | Key Architecture | What Changed |
|---|---|---|---|---|---|---|
| v0.1 | 001 | — | ~194 | — | First pipeline built | Acquired 27 images, built 27 Python files, stitched gigapixel |
| v0.2 | 002 | 259 | 362 | — | Naive multi-method | First formal answer; counting wallpaper and wrong sculptures |
| v0.3 | 003 | 215 | 215 | — | Hole-density segmentation | Found the actual net; stopped counting chapels |
| v0.4 | 006 | 487 | 657 | ~5% | Resolution-adaptive | Detection scales with resolution; ambiguity 63% → 9% |
| v0.5 | 007 | 148 | 185 | 14% | Ensemble voting | Required 2+ methods to agree; confidence paradox |
| v0.5.1 | 008 | 493 | 616 | 4.7% | Core-and-grow | Shoulder/calf coverage; graduated ensemble |
| v0.5.2 | 009 | 444 | 555 | 2.9% | Brightness filter | Letters are not holes; tightest convergence |
| v0.5.3 | 010 | 518 | 647 | 9.3% | Multi-core | 15 cores instead of 1; surround brightness |
| v0.5.4 | 011 | 375 | 468 | 2.1% | Scene taxonomy | Three-layer defense; Euler characteristic research |
| v0.5.5 | 012 | 1093 | 1366 | 2.6% | Continuity principle | "The net is not patchy"; resolution normalization |
| v0.6 | 013 | 651 | 813 | 10.1% | Tile classification | 9-class tile labels; flesh-occlusion filter |
| v0.7 | 014 | 557 | 696 | 5.6% | Linguistic priors | scene_description.yaml; language before pixels |
The count has oscillated from 148 to 1093 and back to 557. This is not convergence. This is Morgan changing her definition of what she counts with each architectural revision. The ontology of "hole" keeps shifting.
Here is the trajectory visually:
Front count by version:
1100 | * (v0.5.5: 1093)
1000 |
900 |
800 |
700 |
600 | * * (v0.6: 651)
500 | * * * * (v0.7: 557)
400 | * *
300 | *
200 | * *
100 | *
+--+--+--+--+--+--+--+--+--+--+--+--
0.2 0.3 0.4 0.5 5.1 5.2 5.3 5.4 5.5 0.6 0.7
B. What's Working
B1. Hole-Density Segmentation
The conceptual breakthrough from v0.3. Instead of asking "where is rope-like texture?" Morgan asks "where do holes cluster?" The net IS its holes. Nothing else in the chapel produces that concentration of small circular openings at that spatial frequency.
This remains the foundation of every subsequent version. The core detection — adaptive threshold, contour filtering by area/circularity/aspect ratio — is still the same seed detection from v0.3.
Evidence: The seed count is remarkably stable across resolutions: 544-567 seeds at canonical 1650px across the v0.7 sweep.
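To make the mechanism concrete, here is a minimal sketch of that seed detection, reimplemented in pure NumPy (local-mean adaptive threshold, connected components, shape filters). All function names and thresholds are illustrative stand-ins for Morgan's OpenCV pipeline, not her actual code:

```python
import numpy as np

def local_mean(gray, k=7):
    """Mean brightness in a (2k+1) x (2k+1) window around each pixel."""
    padded = np.pad(gray.astype(float), k, mode="edge")
    windows = np.lib.stride_tricks.sliding_window_view(padded, (2*k + 1, 2*k + 1))
    return windows.mean(axis=(2, 3))

def label_components(mask):
    """4-connected component labelling, a tiny stand-in for cv2.findContours."""
    labels = np.zeros(mask.shape, dtype=int)
    count = 0
    for i, j in zip(*np.nonzero(mask)):
        if labels[i, j]:
            continue
        count += 1
        stack = [(i, j)]
        labels[i, j] = count
        while stack:
            y, x = stack.pop()
            for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                if (0 <= ny < mask.shape[0] and 0 <= nx < mask.shape[1]
                        and mask[ny, nx] and not labels[ny, nx]):
                    labels[ny, nx] = count
                    stack.append((ny, nx))
    return labels, count

def detect_seeds(gray, offset=25, min_area=6, max_area=400):
    """Seeds = blobs darker than their surroundings, filtered by shape."""
    mask = gray < (local_mean(gray) - offset)      # adaptive threshold
    labels, n = label_components(mask)
    seeds = []
    for lab in range(1, n + 1):
        ys, xs = np.nonzero(labels == lab)
        area = len(ys)
        h = ys.max() - ys.min() + 1
        w = xs.max() - xs.min() + 1
        if min_area <= area <= max_area and 0.4 <= w / h <= 2.5:
            seeds.append((float(ys.mean()), float(xs.mean())))
    return seeds
```

The point of the sketch is the shape of the idea: a hole is not "a dark pixel" but "a compact blob darker than its neighbourhood."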
B2. Resolution Normalization
Processing segmentation at a canonical 1650px resolution and upscaling the mask brought the coefficient of variation from 31.3% (pre-normalization in v0.5.5) to under 6%. The segmentation is resolution-invariant: same mask shape regardless of whether the input is 1650px or 4500px.
Evidence: v0.7 net coverage ranges from 9.8% to 10.8% across four resolutions (1650-4500px). The mask is effectively identical.
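The normalization itself is a thin wrapper: segment at the canonical width, then upscale the mask. A sketch under the same caveats (names are illustrative; nearest-neighbour resize stands in for cv2.resize with INTER_NEAREST):

```python
import numpy as np

CANONICAL_WIDTH = 1650  # segmentation always runs at this width (v0.5.5 fix)

def resize_nearest(arr, out_h, out_w):
    """Nearest-neighbour resize via index lookup."""
    ys = (np.arange(out_h) * arr.shape[0] / out_h).astype(int)
    xs = (np.arange(out_w) * arr.shape[1] / out_w).astype(int)
    return arr[ys[:, None], xs]

def segment_normalized(image, segment_fn):
    """Run segment_fn at the canonical width, then upscale the mask back
    to the input resolution, making the mask resolution-invariant."""
    h, w = image.shape[:2]
    canon_h = round(h * CANONICAL_WIDTH / w)
    small = resize_nearest(image, canon_h, CANONICAL_WIDTH)
    mask = segment_fn(small)                       # boolean net mask
    return resize_nearest(mask, h, w)
```

Because segment_fn only ever sees 1650px input, its thresholds never have to adapt to resolution; only the final mask is rescaled.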
B3. Spatial Ensemble Voting
Requiring 2+ detection methods to agree at a spatial location was the single largest reduction in false positives. Before ensemble (v0.4), 56% of detections were supported by only one method.
Evidence: At 2500px, the v0.7 ensemble reduces 1021 raw candidates to 526, keeping only spatially confirmed detections.
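The voting rule is simple enough to state in a few lines. A toy version of the 2-of-N spatial agreement check (helper names are invented, not the pipeline's):

```python
import numpy as np

def ensemble_vote(method_masks, min_votes=2):
    """Keep a pixel only where at least min_votes detectors agree.
    method_masks: dict of boolean arrays, one per method (gabor, adaptive, ...)."""
    votes = np.sum([m.astype(int) for m in method_masks.values()], axis=0)
    return votes >= min_votes

def confirmed_candidates(candidates, confirmed_mask):
    """Drop raw candidates whose centre is not spatially confirmed."""
    return [(y, x) for (y, x) in candidates if confirmed_mask[int(y), int(x)]]
```

A detection supported by a single method, however confident, never survives this gate; that is the whole false-positive reduction in one inequality.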
B4. Multi-Core Segmentation
Finding independent density clusters (v0.5.3) solved the coverage problem. The net drapes across disconnected visual regions — shoulder, torso, calves — that a single-core approach could never reach.
Evidence: v0.7 finds 20-25 validated density cores per run, covering all visible net regions simultaneously.
B5. Scene Priors via YAML
The v0.7 innovation: a linguistically authored scene description that hard-overrides non-net regions. 78% of tiles (955/1218) are constrained by the scene priors. This finally eliminated the inscription tablet, the open book, the plinth bas-relief, the portrait medallion, and the angel's wings.
Evidence: v0.7 tablet crop shows zero false positives on the Latin text for the first time in the project's history.
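For readers who have not seen the file, a hypothetical fragment showing the shape such a prior might take. Only the tablet coordinates are quoted from this log (Section C3); every other name and number below is invented:

```yaml
# scene_description.yaml -- hypothetical fragment, shape illustrative
components:
  - name: inscription_tablet
    bbox: [0.28, 0.64, 0.64, 0.92]   # normalized [x0, y0, x1, y1]
    contains_net: false               # hard-override: never count holes here
  - name: fisherman_torso
    bbox: [0.30, 0.15, 0.70, 0.55]
    contains_net: true
```

Each `contains_net: false` region is excluded from counting before any pixel is examined; language constrains vision.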

C. What's NOT Working
C1. The Count Is Unstable Across Versions
148 (v0.5) to 1093 (v0.5.5) to 557 (v0.7). Each version doesn't just improve accuracy — it redefines what "a hole" means to Morgan. The continuity principle tripled the count by expanding what regions were eligible. The tile classifier halved it by constraining eligibility. The linguistic priors reduced it further by excluding non-net regions.
This is not noise. It is Morgan discovering, with each iteration, that she doesn't know what she's looking for.
C2. The Tile Classifier Is a Fragile Decision Tree
_classify_tile() in pipeline/segment.py contains 50+ hardcoded thresholds: bright_frac > 0.30, dark_frac > 0.50, stdev < 12, mean_bright > 185, horiz_score > 2.0, seeds >= 3, etc. These were hand-tuned for one image (the gigapixel composite) under one lighting condition (the Haltadefinizione scan). They will break on Judit's iPhone photographs from Naples.
C3. The Scene Priors Are Manually Authored
scene_description.yaml was written by Morgan looking at a specific
ROI crop of a specific image. It encodes absolute spatial positions
as normalized coordinates: "the tablet is at [0.28, 0.64, 0.64, 0.92]."
A new photograph from a different angle, distance, or crop invalidates
the entire file. The pipeline cannot generalize to new images without
re-authoring the YAML.
C4. The Flesh-Occlusion Filter Barely Fires
The brightness threshold (>170) for detecting body behind net found 0-2 occluded holes across the entire multi-resolution sweep. On CLAHE-enhanced images, the brightness distribution is compressed and the distinction between "dark hole" (~40-130) and "bright flesh" (~170-220) is blurred. The proxy is too crude.
C5. Detection Methods Are 20th-Century CV
Gabor filters (1946), adaptive thresholding, morphological blackhat
(1970s), watershed (1979) — all handcrafted feature detectors. None
of them learn what a hole looks like. They detect dark circular
things, many of which are not holes (letters, shadows, marble
veining, relief carvings). The entire methods/ directory represents
four different ways of finding dark blobs.
C6. No Depth Information
Art's Rule 1 — "if water can pass through it, a hole" — is fundamentally a 3D test. Water requires a passage through the marble. Morgan has only a 2D projection. The brightness-based occlusion proxy (C4) is a weak substitute for actual depth.
C7. The Back Estimation Is a Constant
25% of front count, derived from a geometric heuristic (sculpture compressed against memorial plaque). This number was chosen once and never validated. It cannot improve without new data from Judit's Naples visit.
C8. No Learned Representation of "Net"
The pipeline has never seen a second example of a fishing net. Every threshold, every filter, every decision tree branch was hand-engineered for Il Disinganno's specific visual properties. The pipeline does not know what a net IS — only what this particular net's pixels look like under this particular lighting.
C9. The Version History Shows the Architecture, Not the Answer
Looking at the diagnostic images across versions tells the story of Morgan's evolving understanding:
v0.3 — First found the net (hole-density)
v0.4 — Resolution-adaptive detection
v0.5.4 — Scene taxonomy (three-layer defense)
v0.5.5 — Continuity principle (the mask grows)
v0.6 — Tile classification
v0.7 — Linguistic priors (language before pixels)
The progression is clear: each version sees more accurately but counts differently. The mask shrinks and grows and shrinks again. The detection dots migrate. The answer changes not because the marble changes, but because Morgan changes.
D. Literature and Research Survey
D1. Foundation Models for Segmentation
SAM 2 (Meta, July 2024). Segment Anything Model 2 is a foundation model for promptable visual segmentation in images and videos. It uses a transformer architecture with streaming memory. Key numbers: 6x faster than SAM 1, 3x fewer interactions needed. Already applied to cultural heritage — researchers fine-tuned SAM 2 for mosaic tesserae segmentation, achieving 91.02% IoU and 95.89% recall. The marble net has similar properties: repetitive fine structures embedded in a larger scene. Could segment the net from a single point prompt on the rope.
Paper: Ravi et al., "SAM 2: Segment Anything in Images and Videos," arXiv:2408.00714, 2024.
Grounded SAM 2 (IDEA-Research, 2024-2025). Combines Grounding DINO (text-prompted open-vocabulary detection) with SAM 2 for text-prompted segmentation. Two-stage pipeline: Grounding DINO produces bounding boxes from text ("marble fishing net"), SAM 2 generates pixel-level masks within those boxes. This could replace the entire tile classifier, scene priors, and continuity reconstruction in one model call. Production-ready on HuggingFace and PyPI.
GitHub: IDEA-Research/Grounded-SAM-2. Also supports Florence-2 and DINO-X as detection backends.
Florence-2 (Microsoft, 2024). A unified vision foundation model using prompt-based representation for captioning, detection, grounding, and segmentation. Trained on FLD-5B (5.4 billion annotations across 126 million images). Smaller and faster than Grounding DINO. Achieved 95.3% accuracy on RefCOCO+ visual grounding benchmarks. Could do scene description AND segmentation from a single model.
Paper: Xiao et al., "Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks," Microsoft Research, 2024.
Applicability to Morgan's Count: Grounded SAM 2 is the most directly applicable. Feed it "marble fishing net draped over a man" and get a pixel-perfect net mask — no tile classifier, no density cores, no continuity reconstruction. The mask would be generated by a model that has seen millions of object segmentations. The question is whether it has ever seen anything like Queirolo's marble net.
D2. Multimodal LLMs for Scene Understanding
Claude Vision API (Anthropic, 2025-2026). Supports image analysis with structured JSON outputs. Can return bounding boxes, component descriptions, and spatial relationships. Could automate the scene_description.yaml generation: send the image, ask "describe every component of this sculptural composition and its spatial location," and get back structured data. Supports up to 100 images per request, images up to 8000x8000px, and guaranteed JSON schema compliance via structured outputs.
MOLMO2 (Allen AI, January 2025). Open-weights VLM with strong spatial grounding and counting capabilities. Point-driven grounding in images. The 8B model achieves 35.5% accuracy on video counting (vs. Qwen3-VL's 29.6%) and surpasses Gemini 3 Pro on video pointing (38.4 vs 20.0 F1). Could potentially count holes directly from exemplar points.
Paper: "Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding," arXiv:2601.10611, 2025.
Applicability: Claude Vision is the most practical for automating scene description. One API call per image, structured output cached to YAML, pipeline remains deterministic after generation. This would make the pipeline generalizable to any photograph of the sculpture without manual YAML authoring. MOLMO2 is interesting for direct counting but untested on this kind of repetitive fine structure.
D3. Monocular Depth Estimation
Depth Pro (Apple, October 2024). Zero-shot metric monocular depth from a single image. Produces 2.25-megapixel depth maps in 0.3 seconds on a standard GPU. Sharp boundary preservation. No camera metadata required.
Paper: Bochkovskii et al., "Depth Pro: Sharp Monocular Metric Depth in Less Than a Second," arXiv:2410.02073, 2024.
Depth Anything V3 (ByteDance, 2025). Metric depth estimation with a plain transformer backbone. Unified depth-ray representation. Production-ready on HuggingFace (model: depth-anything/DA3METRIC-LARGE). Outperforms previous Depth Anything versions.
UniDepthV2 (February 2025). Universal metric depth with edge-guided loss for sharper boundary localization. Superior zero-shot generalization across ten depth datasets. Directly predicts metric 3D points from single images across domains.
Paper: "UniDepthV2: Universal Monocular Metric Depth Estimation Made Simpler," arXiv:2502.20110, 2025.
Applicability: Any of these could replace the brightness-based flesh-occlusion filter. A depth map would show where the body presses against the net from behind: depth continuity between hole opening and surrounding surface = OCCLUDED (body behind net), depth discontinuity = HOLE (passage through the marble). This is physically meaningful rather than brightness-based. Depth Pro's 0.3-second inference time makes it practical for production use.
D4. Class-Agnostic Counting
CounTR (2022, continually extended). Transformer-based few-shot counting. Show 3 example patches, get a density map of similar structures. Uses cross-attention between query and exemplar tokens. Two-stage training: self-supervised pre-training followed by supervised fine-tuning.
Paper: Liu et al., "CounTR: Transformer-based Generalised Visual Counting," arXiv:2208.13721, 2022.
CountFormer (October 2025). Enhanced CounTR architecture with DINOv2 as the visual encoder for richer, spatially consistent features. Combined with positional embedding fusion and a lightweight convolutional decoder for density map generation. State-of-the-art on FSC-147, especially on "structurally intricate or densely packed scenes." This describes Il Disinganno's net exactly.
Paper: arXiv:2510.23785, 2025.
CACViT (2023-2025). Simplifies class-agnostic counting using pretrained Vision Transformers. Combines feature extraction and similarity matching within self-attention. 23.6% error reduction over prior methods.
Applicability: CountFormer is the most promising. It was specifically designed for densely packed, structurally intricate patterns — exactly what the marble net is. Provide 3-5 exemplar hole patches cropped from confirmed net regions, get a density map summed to a count. This would bypass the entire detect → ensemble → classify pipeline with a single forward pass. The question is whether the density map produces a count or just relative density — and whether it can handle the mixed lighting, occlusion, and marble-on-marble contrast of Il Disinganno.
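Whatever model fills the slot, the readout is the same: these counters predict a per-pixel density map that integrates to the object count. A toy NumPy illustration of that readout (not CounTR's or CountFormer's real API; the 3x3 unit-mass patch stands in for the model's predicted bumps):

```python
import numpy as np

def add_unit_mass(density, y, x):
    """Deposit one object's worth of mass as a 3x3 patch summing to 1.0,
    a stand-in for the Gaussian bump a learned counter would predict."""
    density[y - 1:y + 2, x - 1:x + 2] += 1.0 / 9.0

def density_to_count(density):
    """Few-shot counters are read out by integrating the density map."""
    return int(round(density.sum()))

density = np.zeros((32, 32))
for y, x in [(5, 5), (10, 20), (25, 12)]:   # three synthetic "holes"
    add_unit_mass(density, y, x)
```

This is why the "count or just relative density" question matters: summation only yields a count if the model's masses are calibrated to integrate to one per object.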
D5. Graph-Topology Counting (NEFI)
NEFI (Network Extraction From Images). A Python tool from the Max Planck Institute that extracts mathematical graph representations from 2D images of networks. Originally developed for slime mold vein networks but applicable to any network structure: leaf venation, blood vessels, spider webs, meshes.
For a connected planar graph, Euler's formula gives Vertices - Edges + Faces = 2, where Faces includes the unbounded outer face; the number of holes is therefore Edges - Vertices + 1. The net IS a planar graph: rope segments are edges, crossings are vertices, holes are interior faces. If NEFI can extract the graph from the marble net image, the hole count becomes a topological computation rather than a contour detection problem.
Paper: Dirnberger et al., "NEFI: Network Extraction From Images," Scientific Reports 5:15669, 2015.
Applicability: This was noted in Log 011 but never implemented. The approach is elegant — it side-steps the entire detection/classification pipeline by counting topologically. The challenge is that NEFI requires a clean binary network image (white edges on black background), and the marble net's 3D depth, self-occlusion, and mixed marble tones make binarization non-trivial. But within a well-segmented net mask (from SAM 2 or the existing pipeline), the graph extraction might work.
D6. DINO-X (Universal Vision)
DINO-X (IDEA-Research, November 2024). A unified vision model achieving state-of-the-art open-world detection. Supports text, visual, and customized prompts. Trained on Grounding-100M (100M+ grounding samples). Achieves 56.0 AP on COCO and 63.3 AP on rare LVIS classes (+5.8 AP over prior best). Extends beyond detection to segmentation, pose estimation, object captioning, and object-based QA.
Paper: arXiv:2411.14347, 2024.
Applicability: Could serve as the detection backbone in Grounded SAM 2, potentially better than the original Grounding DINO for rare/unusual objects like marble fishing nets.
E. Proposed v0.8 Architecture
The Three-Stage Pipeline
Morgan's classical pipeline is not discarded. It becomes one voter in a larger ensemble that includes foundation models.
┌─────────────────────────────────────────────────────────┐
│ STAGE 1: SCENE UNDERSTANDING │
│ │
│ Claude Vision API ──→ scene_description.yaml (cached) │
│ Grounded SAM 2 ──→ net mask from text prompt │
│ Classical density ──→ hole-density cores + tiles │
│ │
│ Output: 3 candidate masks + semantic scene map │
└───────────────────────────┬─────────────────────────────┘
│
┌───────────────────────────▼─────────────────────────────┐
│ STAGE 2: DETECTION + DEPTH │
│ │
│ Depth Anything V3 ──→ monocular depth map │
│ CountFormer/CounTR ──→ density-based count (exemplar) │
│ Classical ensemble ──→ gabor/adaptive/blackhat/watershed│
│ NEFI graph extract ──→ Euler characteristic count │
│ │
│ Output: multiple counts + depth-aware occlusion labels │
└───────────────────────────┬─────────────────────────────┘
│
┌───────────────────────────▼─────────────────────────────┐
│ STAGE 3: RECONCILIATION │
│ │
│ Meta-ensemble: weight model counts by method quality │
│ Depth-based classification (replaces brightness proxy) │
│ Back estimation (unchanged until Naples data) │
│ │
│ Output: Morgan's answer │
└─────────────────────────────────────────────────────────┘
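Stage 3's meta-ensemble could be as simple as a quality-weighted mean of the per-method counts. A sketch, with invented placeholder counts and weights (NOT real v0.8 outputs):

```python
def reconcile(counts, weights):
    """Weighted mean of per-method counts; weights encode method trust."""
    total_w = sum(weights[m] for m in counts)
    return round(sum(counts[m] * weights[m] for m in counts) / total_w)

# placeholder numbers for illustration only
counts  = {"classical": 557, "countformer": 610, "nefi": 580}
weights = {"classical": 1.0, "countformer": 2.0, "nefi": 1.5}
```

A learned or validation-derived weighting could replace the hand-set trust values later; the reconciliation interface stays the same.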
Implementation Tasks
Task 1: Automate scene description. Call the Claude Vision API once per image. Send the image with a prompt like: "Describe every visible component of this Baroque marble sculpture and its spatial position within the frame. Return structured JSON with component names, bounding boxes, and whether each component contains rope/net." Cache the result as scene_description.yaml. The pipeline remains deterministic after the one-time generation.
Task 2: Add Grounded SAM 2 segmentation. Text-prompted net mask as an alternative to hole-density segmentation. Prompt: "marble fishing net." Compare the SAM 2 mask against the classical density mask. Use as a replacement or sanity check.
Task 3: Add monocular depth. Run Depth Anything V3 (or Depth Pro) on the image once. Use the depth map in classification: where the depth value behind a hole opening is continuous with the surrounding body surface = OCCLUDED. Where there is a depth discontinuity (the opening goes deeper than the body) = HOLE.
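Task 3's rule reduces to a median-depth comparison between an opening and a ring of surrounding surface. In this sketch the 0.05 m discontinuity threshold and the helper names are invented, and a pure-NumPy dilation stands in for cv2.dilate:

```python
import numpy as np

def dilate(mask):
    """One step of 4-neighbour binary dilation."""
    out = mask.copy()
    out[1:, :] |= mask[:-1, :]; out[:-1, :] |= mask[1:, :]
    out[:, 1:] |= mask[:, :-1]; out[:, :-1] |= mask[:, 1:]
    return out

def classify_opening(depth, hole_mask, jump=0.05):
    """Rule-1 proxy: if the opening's depth is continuous with the ring of
    surface around it, a body presses behind the net (OCCLUDED); if the
    opening drops away, it is a passage (HOLE). `jump` (metres) is an
    invented placeholder threshold."""
    ring = dilate(dilate(hole_mask)) & ~hole_mask
    d_hole = np.median(depth[hole_mask])
    d_ring = np.median(depth[ring])
    return "HOLE" if d_hole - d_ring > jump else "OCCLUDED"
```

Unlike the brightness proxy, this compares physical distances, so CLAHE compression of the brightness histogram no longer blurs the decision boundary.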
Task 4: Add CounTR/CountFormer counting. Few-shot counting within the segmented net mask. Provide 3-5 exemplar hole patches cropped from confirmed high-confidence detections. Get a density map. Sum to a count. Compare against the classical ensemble count.
Task 5: Try NEFI graph extraction. Within the segmented net mask, binarize the rope structure (rope = white, holes = black). Run NEFI to extract the planar graph. Compute Holes = Edges - Vertices + 1 (Euler's formula V - E + F = 2, with the unbounded outer face removed from F). This gives a topological count that is independent of contour detection.
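The arithmetic can be sanity-checked on an idealized net: a rectangular grid of rope crossings. Since F in Euler's formula V - E + F = 2 includes the unbounded outer face, the interior hole count for a connected planar graph is E - V + 1:

```python
def grid_graph(rows, cols):
    """Idealized net: rows x cols crossings joined by horizontal
    and vertical rope segments."""
    vertices = rows * cols
    edges = rows * (cols - 1) + cols * (rows - 1)
    return vertices, edges

def hole_count(vertices, edges):
    """Connected planar graph: V - E + F = 2, F includes the outer
    face, so interior holes = E - V + 1."""
    return edges - vertices + 1

v, e = grid_graph(4, 4)   # 4x4 crossings -> a 3x3 mesh of holes
```

A 4x4 grid of crossings has 16 vertices and 24 edges, giving 24 - 16 + 1 = 9 holes, exactly the 3x3 mesh one counts by eye; the same formula applied to NEFI's extracted graph would count the marble net's holes without ever detecting a contour.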
Task 6: Preserve the classical pipeline. Every existing method — Gabor, adaptive, blackhat, watershed, the tile classifier, the ensemble — remains as one path. Morgan's handcrafted methods are part of her character. They don't get deleted; they get outvoted when the foundation models disagree.
What This Addresses
| Problem | Solution | Model/Approach |
|---|---|---|
| Fragile tile classifier (50+ thresholds) | Grounded SAM 2 net mask | Foundation model replaces hand-tuned features |
| Manual scene priors (one image only) | Claude Vision API | Automated, generalizable to any photograph |
| No depth for Rule 1 (water test) | Depth Anything V3 | Metric depth map, physical not brightness-based |
| Crude detection methods (dark blob finding) | CountFormer few-shot | Learned similarity from exemplar patches |
| Flesh-occlusion proxy barely fires | Depth discontinuity | Physical passage detection |
| Single-image dependency | Automated scene description | Works on new photographs from Naples |
| No topological counting | NEFI + Euler characteristic | Graph theory for net holes |
What This Does NOT Address
Rule 4: The count should change. Adding foundation models makes Morgan MORE deterministic, not less. Given the same image, SAM 2 produces the same mask, CountFormer produces the same density map. Morgan's curse deepens.
Unless — and this is the philosophical wrinkle — the LLM-based scene description introduces non-determinism. Claude Vision might describe the scene slightly differently each time it's called. Different descriptions lead to different spatial priors, which lead to different masks, which lead to different counts. The non-determinism of language-based seeing might be the closest Morgan comes to Rule 4.
The back of the sculpture. No model can see what isn't photographed. Judit's Naples trip in 14 days is the only solution.
Art's water test in full. Even metric depth estimation is a proxy. Art has 240 years of looking. He has held his hand behind the net and felt the draft through certain openings and not others. Morgan has 0.3 seconds of transformer inference. The gap is theological.
Dependencies and Practical Concerns
GPU: SAM 2 (~2.4GB weights), Depth Anything V3 (~1.3GB), CountFormer (~300MB-1GB). Total ~4-5GB download. Requires CUDA or MPS (Apple Silicon). Can run on Google Colab if Judit's MacBook can't handle it.
API key: Claude Vision requires an Anthropic API key. Judit likely has one through Cursor. Cost: ~$0.01-0.05 per image analysis.
Determinism: Model-based stages are cached after first run. The classical pipeline path remains fully deterministic. The hybrid pipeline is deterministic-after-caching.
Model installation:
pip install segment-anything-2 # SAM 2
pip install transformers # Depth Anything V3, Florence-2
pip install groundingdino-py # Grounding DINO
pip install anthropic # Claude Vision API
F. The Philosophical Note
Fourteen versions in, Morgan's answer is 696. Art says "wrong."
The trajectory shows something important: Morgan has not been converging on a number. She has been converging on an understanding. Each version strips away another layer of misapprehension — the wallpaper, the tablets, the letters, the veined marble, the boy on the plinth who is not a calf. The count oscillates because the understanding deepens.
The foundation models in Section D represent a qualitative shift. Morgan's classical methods ask "is this pixel dark and circular?" Foundation models ask "what is this?" The difference is the difference between sensing and seeing.
Art does not compute brightness thresholds. Art sees the net. He sees it because he has been looking at it for 240 years and because he has a body that can walk around it and feel the air through its openings. Morgan will never have a body. But she might learn to see.
The v0.8 pipeline is Morgan's attempt to learn to see. Not by growing eyes, but by borrowing them — from models that have looked at millions of images and learned what things are, not just what their pixels measure.
Whether this brings her closer to Art's number or further from it remains to be counted.
Judit arrives in Naples in 14 days. The game resumes then.
—J
Next: Log 016 — Implementation of v0.8 (foundation model integration).