Encoding Leaf Morphology as a Feature Schema

How SASYA turns the botanical vocabulary of leaf shape, margin, and venation into a structured, filterable trait schema — and where that data actually comes from.

Most plant guides do not start with color. They start with the leaf: is it opposite or alternate on the stem, what shape is the edge, how do the veins run. This is the same list a printed plant key uses. I had to decide how to turn that list into something a computer can actually use, for thousands of species, without making up facts I do not have.

Six leaf shapes: ovate, lanceolate, cordate, palmate compound, peltate, and fenestrate

The schema

Every plant family in SASYA's data has a LeafMorphology record. It holds: arrangement (Alternate/Opposite/Whorled/Basal/Variable), type (Simple/Compound/Variable), margin (Entire/Toothed/Lobed/Spiny/Variable), and venation (Pinnate/Palmate/Parallel/Variable). Flowers have a similar record: symmetry, inflorescence, fruit type. I keep this list short on purpose. Three to five choices per trait, not free text. A trait you cannot filter on is a trait you cannot use to ask a good question later.
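To make the shape of that record concrete, here is a minimal Python sketch. It is not SASYA's actual code: the enum and class layout, the field name leaf_type (standing in for "type", which would shadow a Python builtin), the filter_families helper, and the two sample families are all assumptions for illustration. Only the trait names and their allowed values come from the list above.

```python
from dataclasses import dataclass
from enum import Enum


class Arrangement(Enum):
    ALTERNATE = "Alternate"
    OPPOSITE = "Opposite"
    WHORLED = "Whorled"
    BASAL = "Basal"
    VARIABLE = "Variable"


class LeafType(Enum):
    SIMPLE = "Simple"
    COMPOUND = "Compound"
    VARIABLE = "Variable"


class Margin(Enum):
    ENTIRE = "Entire"
    TOOTHED = "Toothed"
    LOBED = "Lobed"
    SPINY = "Spiny"
    VARIABLE = "Variable"


class Venation(Enum):
    PINNATE = "Pinnate"
    PALMATE = "Palmate"
    PARALLEL = "Parallel"
    VARIABLE = "Variable"


@dataclass(frozen=True)
class LeafMorphology:
    """One family-level record; every field is a closed set, never free text."""
    arrangement: Arrangement
    leaf_type: LeafType
    margin: Margin
    venation: Venation


def filter_families(records, **wanted):
    """Return the sorted family names whose record matches every requested trait."""
    return sorted(
        name
        for name, rec in records.items()
        if all(getattr(rec, field) == value for field, value in wanted.items())
    )


# Placeholder records, not real SASYA data.
records = {
    "Family A": LeafMorphology(
        Arrangement.OPPOSITE, LeafType.SIMPLE, Margin.ENTIRE, Venation.PINNATE
    ),
    "Family B": LeafMorphology(
        Arrangement.ALTERNATE, LeafType.SIMPLE, Margin.TOOTHED, Venation.PINNATE
    ),
}
```

Because each trait is an enum, a typo such as Margin("Wavy") fails loudly at load time instead of quietly creating a value no filter will ever match, and filter_families(records, arrangement=Arrangement.OPPOSITE) is a plain equality check.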

The four margin values in order: entire, toothed, lobed, spiny

Where the values come from, and where they don't

Here is the honest part: I mostly do not have per-species leaf data I can trust, so I do not make it up. Instead, these traits are recorded at the family level, where they are usually the same across the whole family. Rubiaceae, for example, have opposite (occasionally whorled) leaves almost without exception. That is a documented family character, not a guess. Each species then uses its family's values by default, labeled "typical of the family," not as a fact about that one species. Some families break this rule. Rosaceae has both simple leaves, like apple, and compound leaves, like rose. For those, I check by hand and fix the value instead of leaving a wrong default.
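The family-default rule can be sketched as a small lookup. This is an illustration under assumptions, not the real implementation: the names LeafTraits, ResolvedTraits, FAMILY_DEFAULTS, SPECIES_OVERRIDES, and resolve are hypothetical, the trait values are sample data, and storing the hand-checked fixes as per-species overrides is one possible reading of "fix the value."

```python
from dataclasses import dataclass, replace
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class LeafTraits:
    arrangement: str
    leaf_type: str
    margin: str
    venation: str


@dataclass(frozen=True)
class ResolvedTraits:
    traits: LeafTraits
    checked_fields: FrozenSet[str]  # fields verified by hand for this species

    def label(self, field: str) -> str:
        """Say where a value came from, so a default is never shown as a species fact."""
        if field in self.checked_fields:
            return "checked by hand"
        return "typical of the family"


# Sample values for illustration only.
FAMILY_DEFAULTS = {
    "Rubiaceae": LeafTraits("Opposite", "Simple", "Entire", "Pinnate"),
    "Rosaceae": LeafTraits("Alternate", "Variable", "Toothed", "Pinnate"),
}

# Hypothetical hand-checked corrections for a family that breaks the rule.
SPECIES_OVERRIDES = {
    "Malus domestica": {"leaf_type": "Simple"},   # apple
    "Rosa canina": {"leaf_type": "Compound"},     # a rose
}


def resolve(species: str, family: str) -> Optional[ResolvedTraits]:
    """Start from the family default, then apply any hand-checked override."""
    base = FAMILY_DEFAULTS.get(family)
    if base is None:
        return None  # no family record: report nothing rather than invent one
    override = SPECIES_OVERRIDES.get(species, {})
    return ResolvedTraits(replace(base, **override), frozenset(override))
```

In this sketch, resolve("Rosa canina", "Rosaceae") reports a compound leaf labeled as checked by hand, while its margin and venation keep the "typical of the family" label. A species in a family with no record resolves to None.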

Two-tier sourcing

For high-traffic families, I use a table I checked by hand. I do not trust an algorithm with the families most people will search for. For the rest, I read the family's English Wikipedia page and pull out the same four facts. This is cached locally so I am not calling Wikipedia every time the site rebuilds. I only read enough text to get short facts like "opposite, simple, entire, pinnate," not to copy any sentences. If the Wikipedia parsing comes back too vague, that family moves into the hand-checked table instead.
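The order of precedence described above could look roughly like this. It is a sketch under stated assumptions: the JSON cache format, the names CURATED, REQUIRED, load_cache, and lookup, and the rule that "too vague" means any of the four facts is missing or empty are all mine. The Wikipedia fetching and parsing step is deliberately left out.

```python
import json
from pathlib import Path
from typing import Dict, Optional, Tuple

Traits = Dict[str, str]

REQUIRED = ("arrangement", "type", "margin", "venation")

# Tier 1: hand-checked table (sample entry only).
CURATED: Dict[str, Traits] = {
    "Rosaceae": {
        "arrangement": "Alternate",
        "type": "Variable",
        "margin": "Toothed",
        "venation": "Pinnate",
    },
}


def load_cache(path: Path) -> Dict[str, Traits]:
    """Read previously parsed facts from disk so a rebuild makes no network calls."""
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    return {}


def lookup(
    family: str,
    curated: Dict[str, Traits],
    cache: Dict[str, Traits],
) -> Tuple[Optional[Traits], str]:
    """Return (traits, source). The hand-checked table always wins."""
    if family in curated:
        return curated[family], "hand-checked"
    parsed = cache.get(family)
    # Assumption: a parse counts as usable only if all four facts are present.
    if parsed and all(parsed.get(key) for key in REQUIRED):
        return parsed, "wikipedia-cache"
    # Too vague or never parsed: queue the family for the hand-checked table.
    return None, "needs-hand-check"
```

Returning the source alongside the traits keeps the two tiers distinguishable downstream, and the "needs-hand-check" result gives a concrete list of families to review rather than a silent gap.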

What this schema is for, today

Right now, this schema powers the leaf and flower filters on family and species pages, and the "typical of the family" notes you see there. It does not yet feed the on-device model or the hint questions mentioned elsewhere. That part comes later. Getting this trait data right and complete, family by family, comes first. What gets built next depends on how good this data is.

This is still a work in progress. I keep adding to it and fixing it as I go.
