Searching for iconography by improving semantic search in art archives
October 10, 2025
I was sitting in one of my seminars this week when I had an urge to find the manga panel from Jujutsu Kaisen where Kenjaku’s monster is overlaid with him wondering “what form that would take”1. I promise this was after some tangential association. It was one of those odd feelings where you can acknowledge how insignificant it is (certainly not a need) but simultaneously cannot move forward until it’s quenched. Being in class, I wanted to find it quickly. The mind is quite miraculous, and I instantly thought back to late 2024, when I came across panelsdesu on Twitter, which lets you semantically search manga panels. It didn’t really help, unfortunately, but I did start to think about how cool that project was, especially since he was running inference locally on a 4070 or something (I love hosting my own projects).
Semantic search for art archives is not new, although it also doesn’t seem to be that popular. There are sites like Semantic Art Search and Museum Semantic Search2, but I don’t believe they index iconography or sub-compositional elements that well. My test thus far has been trying to find images of Archangel Gabriel with a lily. I do think this subclass of search that emphasizes granularity or individual elements (as opposed to entire compositions) might be lacking; whether or not that’s true, it also seems like a fun project to pursue.
For now, I want to consider two really rough classes of paintings: portraits and larger compositions. The insight here is that we might, naturally, want to search these paintings differently. For a portrait, I might be interested in who is being depicted and what other (sub)elements are present. For a larger composition, there might be more relationships (that are not necessarily hierarchical under the subject) between entities that I would want to search for. Consider something as vast as Michelangelo’s Sistine Chapel ceiling vs. the Anthony van Dyck self-portrait at the Met. For the former, in such a vast composition, the things I might search for are endless (and this suggests that index granularity should potentially scale with the work’s dimensions, although this should be nuanced), while in the latter, I might simply search for “17th century upper-class garb” or “languished hand”. The thesis of this exploration is that these queries are distinct and can ultimately be represented as interacting with different levels of granularity.
A naive segment-anything query can yield results like this:
[Images: SAM bounding-box detections over the Drouais portrait and the Natoire composition discussed below]
I would venture to say that, in the case of the Drouais, that’s not so bad. Especially with painting-level captioning provided by museum metadata or some model’s captioning, there is a decent depth of detail to be queried (comparing the bounding boxes there to what a human might produce, though, there is still much to improve on). The shortcomings are more apparent in the Natoire. While the features detected are neat, there is very little continuity or relation between them. Responses to queries like “Eve next to goat” or “God with cherubs” wouldn’t capture my intent.
My current approach is to consider three levels of granularity: entities (or features), groups (collections of entities), and scenes (collections of groups). I extract features using Meta’s SAM and then do a combination of merging finer features and extracting coarser ones at each step. Each extraction is then captioned with BLIP and vectorized with CLIP. Queries can be run over all extractions or pointed at a particular level of granularity. I’m considering collapsing that into a binary detailed/not-detailed mode, similar to ChatGPT’s “Thinking” signal in the chat input.
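To make that pipeline concrete, here’s a minimal sketch of the entity-level pass, assuming Hugging Face’s hosted SAM/BLIP/CLIP checkpoints (the model names are stand-ins for whatever I end up running, and the merge/coarsen steps for groups and scenes are elided):

```python
# Sketch: segment with SAM, crop each mask's bounding box, caption the
# crop with BLIP, embed it with CLIP. Checkpoints are placeholder choices.
import numpy as np
import torch
from PIL import Image
from transformers import pipeline, CLIPModel, CLIPProcessor

segmenter = pipeline("mask-generation", model="facebook/sam-vit-base")
captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def extract_entities(image: Image.Image, top_n: int = 20) -> list[dict]:
    """Finest level of granularity: one record per SAM mask."""
    masks = segmenter(image, points_per_batch=64)["masks"]
    entities = []
    for mask in masks[:top_n]:
        ys, xs = np.where(mask)
        if xs.size == 0:
            continue
        box = (int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max()))
        crop = image.crop(box)
        caption = captioner(crop)[0]["generated_text"]
        inputs = clip_proc(images=crop, return_tensors="pt")
        with torch.no_grad():
            embedding = clip.get_image_features(**inputs)[0]
        entities.append({"level": "entity", "box": box,
                         "caption": caption, "embedding": embedding})
    return entities
```

Text queries then go through `clip.get_text_features` and cosine similarity against these embeddings, optionally filtered by level.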
This is a pretty cool domain because of its practical applications (say, “detail/relationship extraction”) and how it spans a pretty wide stack. I’ve always been more interested in ML infrastructure than in applications, but I have to admit that it’s pretty remarkable what you can do and how fast you can do it today. I remember taking my CV course and having to segment images from first principles–that was so slow (and the end result was not super helpful!) but was educative or whatever. Now, open-source models and companies like Cloudflare and Modal have accelerated the process to unfathomable degrees.
Although I would love to be running inference locally, I decided to work within the Cloudflare ecosystem for much of this project, especially on the distribution side. The interface for a project like this is super simple, so I really only need R2 to store photos (but I can also use Met links, etc), D1 for entity/feature lookup, and Vectorize for semantic search (initial inference + extraction excluded).
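For a sense of how one record fans out across those three services, here’s a rough sketch of the shape of things; the field names are my own placeholders, not a settled schema:

```python
# Sketch: one extraction yields one D1 row, one Vectorize vector, and a
# pointer into R2 (or an external museum image URL). Names are hypothetical.
from dataclasses import dataclass

@dataclass
class ExtractionRecord:
    extraction_id: str              # D1 primary key, reused as the Vectorize vector id
    image_url: str                  # R2 object key, or a Met image link
    level: str                      # "entity" | "group" | "scene"
    box: tuple[int, int, int, int]  # bounding box within the source image
    caption: str                    # BLIP caption, kept in D1 for display

# Query flow: embed the query text with CLIP, ask Vectorize for the nearest
# vector ids (optionally filtered by level), join on D1 for captions and
# boxes, then pull the source image from R2.
```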
I’m currently working on improving the extraction before polishing distribution. There are a few areas I want to explore, although the lead time to implementation will vary. I emphasized earlier that I want to query relationships and detail. There are many ways we can extract these features, especially in compositions. By enriching artwork with pose analysis and gaze directionality, I think relationships can become a lot more meaningful, and thus easier to query via natural language.
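As a toy example of what gaze enrichment could buy, suppose an off-the-shelf pose estimator has already produced 2D face keypoints for a figure; the eye-midpoint-through-nose heuristic below is my own crude proxy for gaze, not an established method:

```python
# Sketch: derive a rough 2D gaze ray from face keypoints, then score how
# directly it points at other extractions' bounding boxes.
import numpy as np

def gaze_ray(left_eye, right_eye, nose):
    """Crude proxy: the ray from the eye midpoint through the nose tip,
    which tracks the facing direction best in profile views."""
    origin = (np.asarray(left_eye, float) + np.asarray(right_eye, float)) / 2
    direction = np.asarray(nose, float) - origin
    return origin, direction / (np.linalg.norm(direction) + 1e-8)

def gaze_score(origin, direction, box):
    """Cosine between the gaze ray and the vector toward a box's center;
    close to 1.0 reads as 'this figure is looking at that entity'."""
    center = np.array([(box[0] + box[2]) / 2, (box[1] + box[3]) / 2])
    to_box = center - origin
    return float(direction @ (to_box / (np.linalg.norm(to_box) + 1e-8)))
```

A high score between Adam’s keypoints and God’s box is exactly the kind of relationship edge I’d want to caption and index.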
I think a limitation in the current approach is in the use of bounding boxes. I started using them because I think they emulate the ends of this process–details from paintings are always rectangular–but, upon reflection, they don’t seem true to the means of investigation. Many relationships in paintings are suggested through verticality or horizontality, but these connections can also be angled or curved. Using bounding boxes to extract details loses some of that flexibility. Take the Natoire, for instance. You can trace a line between the bottom-left and top-right corners of the canvas, dividing the realms of the divine and of man. This cut emphasizes the spatial relationship between Adam and Eve, but the directionality encoded in Adam’s pose and gaze suggests a relationship between himself and God. This all happens instantly when reading an artwork, but the introspection emphasizes the importance of a looser approach to articulating geometric relation. An additional degree of freedom for angle might prove sufficient–I don’t think an arbitrary number of degrees of freedom is necessary here.
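As a first cut at that single extra degree of freedom, even before replacing the boxes themselves with oriented ones, the angle of the line joining two extractions’ centers already captures the diagonal reading above. A sketch:

```python
# Sketch: one angle per relationship, in image coordinates (y grows downward).
import math

def relation_angle(box_a, box_b) -> float:
    """Angle in degrees of the segment joining two boxes' centers:
    0 is horizontal, 90 is vertical, and anything between is a diagonal
    like the Natoire's divine/mortal cut."""
    ax, ay = (box_a[0] + box_a[2]) / 2, (box_a[1] + box_a[3]) / 2
    bx, by = (box_b[0] + box_b[2]) / 2, (box_b[1] + box_b[3]) / 2
    return math.degrees(math.atan2(by - ay, bx - ax))
```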
Focusing on relationships is interesting because they aren’t one-to-one. The encoding of the Drouais should be able to understand “woman holding flower” and “woman in silk robe” and “woman with head veil”. I do think existing semantic search isn’t bad for this example–a caption would likely encode all of these features and enable well-intentioned responses. However, my premise here is that it isn’t scalable to arbitrary depth or association, especially in large compositions. There are a few ways to approach this, and I’ll have to experiment to find which one is effective. I’m considering an approach that composes coarser features out of finer ones so as to deduplicate. A “woman” with a “flower” is a “woman + flower”. This loses a lot of nuance in how they are related (where is the flower with respect to the woman?). The other pole would be to have distinct encodings for all groupings of fine features (woman + flower, woman + dress, etc.).
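Here’s a sketch of both poles, assuming unit-normalized CLIP embeddings stored as numpy arrays; mean-pooling is one hypothetical composition operator (re-embedding the union crop would be another):

```python
# Sketch: (a) compose a group into a single deduplicated embedding;
# (b) enumerate a distinct record for every pairwise grouping.
import itertools
import numpy as np

def compose_group(member_vecs: list[np.ndarray]) -> np.ndarray:
    """Pole (a): "woman + flower" as the renormalized mean of its members.
    Cheap and deduplicated, but the spatial nuance is lost."""
    v = np.mean(member_vecs, axis=0)
    return v / np.linalg.norm(v)

def enumerate_pairs(entities: list[dict]) -> list[dict]:
    """Pole (b): a separate encoding per grouping of fine features,
    at a combinatorial storage cost."""
    return [{
        "caption": f"{a['caption']} + {b['caption']}",
        "embedding": compose_group([a["embedding"], b["embedding"]]),
    } for a, b in itertools.combinations(entities, 2)]
```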
The current goal is to facilitate “detail extraction” across portraits, compositions, elements, and relationships; to trace iconography through different, varied works; and to provide search that more accurately reflects query granularity. The next step is simply to scale up and start testing search more rigorously.
1. Chapter 202, btw ↩︎
2. This is a different discussion, but the visualizer is a pretty interesting feature and a poke at a tough problem. As I continue to suggest a hierarchical interpretation of semantic similarity, this would nuance visualization even more (and thus makes it a super cool problem). I think it’s always important to ask, “how does this help me do X?”. It’s also roughly tangential to the Anna’s Archive ISBN visualization bounty. I say tangential because it’s not visualizing semantic relationships, but the question of visualizing large swathes of related data is perpetually open (albeit ever evolving). ↩︎