doba

Schema Identification

Detect which schema unknown data belongs to, then transform it.

Sometimes you have data but don't know which schema it came from. An API payload, a deserialized blob, a message from a queue. You need to figure out what it is before you can transform it.

The identify config option adds two methods to your registry: identify() to detect the schema, and identifyAndTransform() to detect and transform in one call. Both are fully opt-in. If you don't configure identify, these methods don't exist on the registry type at all.

Two forms

The identify option accepts either a guard map or a function. Pick the one that fits your data.

Guard map

Map schema keys to predicates. Each predicate receives an unknown value and returns true if it belongs to that schema:

import { createRegistry, match } from 'dobajs'
import { z } from 'zod'

const registry = createRegistry({
  schemas: {
    database: z.object({
      id: z.string(),
      email: z.string(),
      passwordHash: z.string(),
      role: z.enum(['admin', 'user']),
    }),
    frontend: z.object({
      id: z.string(),
      email: z.string(),
      createdAt: z.string(),
      role: z.enum(['admin', 'user']),
    }),
    ai: z.object({
      id: z.string(),
      email: z.string(),
      isAdmin: z.boolean(),
    }),
  },
  migrations: {
    'database->frontend': (user) => ({
      id: user.id,
      email: user.email,
      createdAt: new Date().toISOString(),
      role: user.role,
    }),
    'frontend->ai': (user) => ({
      id: user.id,
      email: user.email,
      isAdmin: user.role === 'admin',
    }),
  },
  identify: {
    // Guards are tried top to bottom; the first one returning true wins.
    database: match.field('passwordHash'),
    ai: match.field('isAdmin'),
    frontend: match.fields('createdAt', 'role').test((v) => {
      return !('passwordHash' in (v as object)) && !('isAdmin' in (v as object))
    }),
  },
})

Guards run in definition order. The first guard that returns true wins. You don't have to cover every schema. Unlisted schemas are simply not identifiable.
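
To make the ordering rule concrete, here is a minimal sketch of first-match-wins evaluation over a guard map. It is not doba's implementation: `identifyWithGuards`, `Guard`, and `hasField` are hypothetical names used only for illustration, and the sketch relies on JavaScript preserving insertion order for non-numeric string keys.

```typescript
// Hypothetical stand-in for a guard map; not doba's internals.
type Guard = (value: unknown) => boolean

function identifyWithGuards<K extends string>(
  guards: Record<K, Guard>,
  value: unknown,
): K | null {
  // Object.keys preserves insertion order for non-numeric string keys,
  // so guards are tried in the order they were written.
  for (const key of Object.keys(guards) as K[]) {
    if (guards[key](value)) {
      return key // first guard that returns true wins
    }
  }
  return null // no guard matched
}

// Illustrative helper: true when the value is an object that has the field.
const hasField = (name: string): Guard => (value) =>
  typeof value === 'object' && value !== null && name in value

const guards = {
  database: hasField('passwordHash'),
  ai: hasField('isAdmin'),
}

// Both guards would match, but "database" is defined first.
const both = identifyWithGuards(guards, { passwordHash: 'x', isAdmin: true })
// No guard covers this shape, so the value is not identifiable.
const none = identifyWithGuards(guards, { createdAt: '2024-01-01' })
```

Because order decides the winner, putting the most specific guard first is usually the safer arrangement.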

Guard map keys are typed against your schema map. Writing identify: { typo: match.field('x') } is a compile-time error if "typo" isn't a registered schema key.

Function form

For pattern-based discrimination (like versioned data with a type field), pass a single function that returns the schema key or null:

import { createRegistry, byField } from 'dobajs'

const registry = createRegistry({
  schemas: { v1: v1Schema, v2: v2Schema, v3: v3Schema },
  migrations: { /* ... */ },
  identify: byField('version', { prefix: 'v' }),
})
// { version: "2" } -> "v2" -> matches the "v2" schema key

Or write the function yourself for full control:

const registry = createRegistry({
  schemas: { database: dbSchema, ai: aiSchema, frontend: feSchema },
  migrations: { /* ... */ },
  identify: (value: unknown) => {
    if (typeof value !== 'object' || value === null) {
      return null
    }
    if ('passwordHash' in value) {
      return 'database'
    }
    if ('isAdmin' in value) {
      return 'ai'
    }
    if ('createdAt' in value) {
      return 'frontend'
    }
    return null
  },
})

Return null when nothing matches. Returned keys are verified against the schema map at runtime. If the function returns a string that isn't a registered schema key, identify() returns an identify_failed error.
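
The runtime check described above can be pictured with a small sketch. It assumes a simplified result shape and a flat list of registered keys; `runDiscriminator` and `IdentifyResult` are hypothetical names, not part of doba's API.

```typescript
type Discriminator = (value: unknown) => string | null

// Hypothetical result shape, loosely modeled on the issue codes named on this page.
type IdentifyResult =
  | { ok: true; value: string }
  | { ok: false; code: 'identify_failed' }

function runDiscriminator(
  schemaKeys: readonly string[],
  discriminate: Discriminator,
  value: unknown,
): IdentifyResult {
  const key = discriminate(value)
  // "No match" (null) and "unregistered key" are both reported as a failure.
  if (key === null || schemaKeys.indexOf(key) === -1) {
    return { ok: false, code: 'identify_failed' }
  }
  return { ok: true, value: key }
}

const schemaKeys = ['database', 'frontend', 'ai']

// A discriminator with a bug: it can return "admin", which is not registered.
const buggy: Discriminator = (value) =>
  typeof value === 'object' && value !== null && 'isAdmin' in value ? 'admin' : null

const rejected = runDiscriminator(schemaKeys, buggy, { isAdmin: true })
const accepted = runDiscriminator(schemaKeys, () => 'ai', { isAdmin: true })
```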

Using identify

identify()

Detects which schema a value belongs to:

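const result = await registry.identify({ passwordHash: 'abc123', id: '1', email: 'a@b.com' })

if (result.ok) {
  result.value        // "database" | "frontend" | "ai"
  result.meta.schema  // the matched schema key
} else {
  result.issues[0].code // DobaIssueCode
}

// Result for the input above:
// ok: true
// value: "database"
// meta: { schema: "database" }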

identifyAndTransform()

The primary use case. Detect the source schema and transform to a target in one call:

const result = await registry.identifyAndTransform(data, 'ai')

if (result.ok) {
  result.value      // { id: string; email: string; isAdmin: boolean }
  result.meta.from  // "database" | "frontend" | "ai": the source schema detected by identify
  result.meta.path  // readonly ("database" | "frontend" | "ai")[]: ordered list of
                    // schema keys traversed, e.g. ['v1', 'v2', 'v3']
}

// Example result:
// ok: true
// value: { id: "user-123", email: "alice@example.com", isAdmin: true }
// meta: {
//   from: "database"
//   path: ["database", "frontend", "ai"]
//   steps: 2
// }

This runs identify() first, then feeds the result into transform(). The path is resolved automatically through the migration graph. All transform options work here:

await registry.identifyAndTransform(data, 'ai', {
  validate: 'each',
  pathStrategy: 'direct',
})
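
One way to picture path resolution through the migration graph is a breadth-first search over the "from->to" migration keys. The sketch below is an assumption about the general technique, not a description of doba's resolver or of what any pathStrategy value does; `findPathSketch` is a hypothetical name.

```typescript
// Breadth-first search over "from->to" keys; returns the shortest key path or null.
function findPathSketch(
  migrationKeys: readonly string[],
  from: string,
  to: string,
): string[] | null {
  // Build an adjacency list from the migration keys.
  const edges = new Map<string, string[]>()
  for (const key of migrationKeys) {
    const parts = key.split('->')
    const source = parts[0]
    const target = parts[1]
    if (parts.length !== 2 || source === undefined || target === undefined) continue
    edges.set(source, [...(edges.get(source) ?? []), target])
  }

  const queue: string[][] = [[from]]
  const seen = new Set<string>([from])
  while (queue.length > 0) {
    const path = queue.shift()
    if (path === undefined) break
    const last = path[path.length - 1]
    if (last === undefined) break
    if (last === to) return path
    for (const next of edges.get(last) ?? []) {
      if (!seen.has(next)) {
        seen.add(next)
        queue.push([...path, next])
      }
    }
  }
  return null // no chain of migrations connects the two schemas
}

const keys = ['database->frontend', 'frontend->ai']
const path = findPathSketch(keys, 'database', 'ai')
// path is ['database', 'frontend', 'ai'], i.e. two migration steps
const unreachable = findPathSketch(keys, 'ai', 'database')
```

Migrations are directed, so a path from database to ai does not imply one from ai back to database.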

Helpers

match

Chainable predicate builder. Each method adds an AND condition. The result is both chainable (add more conditions) and callable as (value: unknown) => boolean:

import { match } from 'dobajs'

// single checks
match.field('passwordHash')                    // field exists
match.field('version', 2)                      // field equals value (strict ===)
match.fields('displayName', 'avatar')          // all fields present
match.type('string')                           // typeof check
match.test((v) => Array.isArray(v))            // arbitrary predicate

// chaining = AND
match.field('passwordHash').field('email')
// true only if both passwordHash AND email exist

// complex guard
match.type('object')
  .fields('id', 'email')
  .test((v) => !('passwordHash' in (v as object)))
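
For readers curious how a value can be both chainable and callable, here is a minimal sketch of the pattern. It is an illustration under assumptions, not doba's source: `buildChain`, `fieldCheck`, and `matchSketch` are hypothetical names, and only `field` and `test` are modeled. For simplicity, an omitted second argument to `field` is treated as an existence check.

```typescript
type Predicate = (value: unknown) => boolean

// A chain is a predicate function that also carries builder methods.
interface Chain {
  (value: unknown): boolean
  field(name: string, expected?: unknown): Chain
  test(fn: Predicate): Chain
}

function fieldCheck(name: string, expected?: unknown): Predicate {
  return (value) => {
    if (typeof value !== 'object' || value === null || !(name in value)) return false
    // Without an expected value this is only an existence check; otherwise strict ===.
    return expected === undefined || (value as Record<string, unknown>)[name] === expected
  }
}

function buildChain(checks: readonly Predicate[]): Chain {
  // Calling the chain runs every accumulated check (logical AND).
  const run = (value: unknown): boolean => checks.every((check) => check(value))
  // Each builder method returns a new chain; the existing one is left untouched.
  return Object.assign(run, {
    field: (name: string, expected?: unknown): Chain =>
      buildChain([...checks, fieldCheck(name, expected)]),
    test: (fn: Predicate): Chain => buildChain([...checks, fn]),
  })
}

const matchSketch = buildChain([])
const isDatabaseUser = matchSketch.field('passwordHash').field('email')
const isVersion2 = matchSketch.field('version', 2)
```

Returning a fresh chain from every method keeps chains safe to share, at the cost of allocating a new array and closures per call.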

byField

Reads a field from the value and derives a schema key. Handles the common case where data has a type, version, or kind field:

import { byField } from 'dobajs'

// value.version matches schema key directly
byField('version')
// { version: "v1" } -> "v1"

// prefix/suffix for naming conventions
byField('version', { prefix: 'v' })
// { version: "2" } -> "v2"

byField('kind', { prefix: 'schema_', suffix: '_legacy' })
// { kind: "user" } -> "schema_user_legacy"

// explicit mapping when convention doesn't fit
byField('type', { map: { UserDB: 'database', UserFE: 'frontend' } })
// { type: "UserDB" } -> "database"
// { type: "Unknown" } -> null

prefix/suffix and map are mutually exclusive. Field values are converted to strings via String(). Returns null if the value isn't an object or the field is missing.
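
The rules above fit in a few lines. The following sketch mirrors the documented behavior (String() conversion, null for non-objects and missing fields, map versus prefix/suffix), but it is a hypothetical reimplementation named `byFieldSketch`, not the library's code.

```typescript
// Either prefix/suffix or map, never both (modeled as a union type).
type ByFieldOptions =
  | { prefix?: string; suffix?: string; map?: undefined }
  | { map: Record<string, string>; prefix?: undefined; suffix?: undefined }

function byFieldSketch(field: string, options: ByFieldOptions = {}) {
  const map = options.map
  const prefix = options.prefix ?? ''
  const suffix = options.suffix ?? ''
  return (value: unknown): string | null => {
    // Non-objects and objects without the field cannot be identified.
    if (typeof value !== 'object' || value === null || !(field in value)) {
      return null
    }
    const raw = String((value as Record<string, unknown>)[field])
    if (map !== undefined) {
      // Explicit mapping: unknown field values yield null.
      return Object.prototype.hasOwnProperty.call(map, raw) ? map[raw] ?? null : null
    }
    return `${prefix}${raw}${suffix}`
  }
}

const byVersion = byFieldSketch('version', { prefix: 'v' })
const byType = byFieldSketch('type', { map: { UserDB: 'database', UserFE: 'frontend' } })
```

Because the field value goes through String(), a numeric `version: 2` and a string `version: "2"` both produce "v2" in this sketch.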

firstMatch

Composes multiple discriminator functions. Tries each in order, returns the first non-null result:

import { byField, firstMatch } from 'dobajs'

identify: firstMatch(
  byField('_tag'),                    // try tagged data first
  byField('version', { prefix: 'v' }), // then version field
  (v) => typeof v === 'string' ? 'name' : null, // then typeof
)
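
Conceptually, this kind of composition is a loop that stops at the first non-null answer. A minimal sketch, with `firstMatchSketch` as a hypothetical stand-in rather than the library function:

```typescript
type Discriminator = (value: unknown) => string | null

function firstMatchSketch(...discriminators: Discriminator[]): Discriminator {
  return (value) => {
    for (const discriminate of discriminators) {
      const key = discriminate(value)
      if (key !== null) return key // later discriminators are not consulted
    }
    return null // every discriminator declined
  }
}

// Illustrative discriminators for the example.
const byTag: Discriminator = (v) =>
  typeof v === 'object' && v !== null && '_tag' in v
    ? String((v as Record<string, unknown>)['_tag'])
    : null
const byTypeof: Discriminator = (v) => (typeof v === 'string' ? 'name' : null)

const identifySketch = firstMatchSketch(byTag, byTypeof)
```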

tryParse

A sentinel value. When used as a guard, doba validates the value against that schema's ~standard.validate() instead of running a sync predicate. Import it from dobajs:

import { createRegistry, match, tryParse } from 'dobajs'

const registry = createRegistry({
  schemas: { cat: catSchema, dog: dogSchema, fish: fishSchema },
  migrations: { /* ... */ },
  identify: {
    cat: match.field('indoor'),   // cheap sync check
    dog: tryParse,                 // validate against dogSchema
    fish: tryParse,                // validate against fishSchema
  },
})

How it works:

  1. Sync guards run first (cheap, one function call each)
  2. If a sync guard matches, that's the result. tryParse schemas are never checked.
  3. If no sync guard matched, all tryParse schemas are validated in parallel
  4. If exactly one validates, that's the match
  5. If multiple validate, you get identify_ambiguous
  6. If none validate, you get identify_failed
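
The six steps above can be sketched as a two-phase function. This is an illustration under assumptions, not doba's implementation: validators are modeled as async functions returning a boolean rather than real ~standard.validate() calls, and `resolveSketch`, the result shape, and the `breed`/`species` fields are hypothetical.

```typescript
type SyncGuard = (value: unknown) => boolean
type Validator = (value: unknown) => Promise<boolean>

type Outcome =
  | { ok: true; value: string }
  | { ok: false; code: 'identify_failed' | 'identify_ambiguous'; matches: string[] }

async function resolveSketch(
  syncGuards: ReadonlyArray<[string, SyncGuard]>,
  validators: ReadonlyArray<[string, Validator]>,
  value: unknown,
): Promise<Outcome> {
  // Steps 1-2: cheap sync guards first; a match short-circuits everything else.
  for (const [key, guard] of syncGuards) {
    if (guard(value)) return { ok: true, value: key }
  }
  // Step 3: run all validation-based candidates in parallel.
  const results = await Promise.all(
    validators.map(async ([key, validate]) => ((await validate(value)) ? key : null)),
  )
  const matches = results.filter((key): key is string => key !== null)
  // Step 4: exactly one candidate validated.
  const [only] = matches
  if (matches.length === 1 && only !== undefined) return { ok: true, value: only }
  // Steps 5-6: several candidates is ambiguous, none is a failure.
  return {
    ok: false,
    code: matches.length > 1 ? 'identify_ambiguous' : 'identify_failed',
    matches,
  }
}

const has = (field: string) => (v: unknown): boolean =>
  typeof v === 'object' && v !== null && field in v

const syncGuards: Array<[string, SyncGuard]> = [['cat', has('indoor')]]
const validators: Array<[string, Validator]> = [
  ['dog', async (v) => has('breed')(v)],
  ['fish', async (v) => has('species')(v)],
]
```

With these stand-ins, a value containing both `breed` and `species` resolves to identify_ambiguous, while a value containing `indoor` resolves to "cat" without running any validator.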

Prefer sync guards over tryParse. Sync strategies (guard map, function, byField) run at ~120 to ~205ns per call. tryParse runs actual schema validation and costs ~435ns, roughly 2x to 3.6x slower depending on which sync strategy you compare against. Reserve it for schemas that are structurally hard to tell apart.

Error handling

Two issue codes specific to identification:

identify_failed

No guard matched and no tryParse schema validated the value:

const result = await registry.identify({ completely: 'unknown' })
if (!result.ok) {
  result.issues[0].code    // "identify_failed"
  result.issues[0].message // "no schema matched the provided value"
}

identify_ambiguous

Multiple tryParse schemas validated the same value. This only happens with tryParse. Sync guards use first-match-wins, so they can never produce ambiguity.

const result = await registry.identify(ambiguousData)
if (!result.ok) {
  result.issues[0].code    // "identify_ambiguous"
  result.issues[0].message // "multiple schemas matched: cat, dog"
  result.issues[0].meta    // { matches: ["cat", "dog"] }
}

Fix ambiguity by adding a sync guard for one of the conflicting schemas, or making the schemas more specific so they don't both validate the same input.
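
If you want to branch on these two codes in one place, a discriminated union keeps the handling exhaustive. The issue shape below is a simplified assumption based on the snippets in this section, and `describeIssue` is a hypothetical helper, not part of doba.

```typescript
// Hypothetical issue shape, modeled on the fields shown in this section.
type IdentifyIssue =
  | { code: 'identify_failed'; message: string }
  | { code: 'identify_ambiguous'; message: string; meta: { matches: string[] } }

function describeIssue(issue: IdentifyIssue): string {
  switch (issue.code) {
    case 'identify_failed':
      return 'no schema recognized this value'
    case 'identify_ambiguous':
      // meta.matches lists the schema keys that all validated the value.
      return `add a sync guard for one of: ${issue.meta.matches.join(', ')}`
  }
}

const hint = describeIssue({
  code: 'identify_ambiguous',
  message: 'multiple schemas matched: cat, dog',
  meta: { matches: ['cat', 'dog'] },
})
```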

Performance

All identify strategies are fast, but they aren't equal. Benchmarks on Apple M3 Pro:

Operation                        Time     Throughput
Guard map (match)                ~120ns   8.4M ops/sec
Function form                    ~195ns   5.1M ops/sec
byField                          ~205ns   4.9M ops/sec
tryParse (schema validation)     ~435ns   2.3M ops/sec
identifyAndTransform (1 hop)     ~816ns   1.2M ops/sec
identifyAndTransform (2 hops)    ~773ns   1.3M ops/sec

Build match chains once. Calling match.field('x').field('y') allocates new arrays and closures on each call (~1.6µs). Assign the chain to a variable and reuse it. Executing an already built chain takes ~10 to 30ns.
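
The difference between rebuilding and reusing can be shown with a stand-in builder that counts how often it runs. `buildGuard` below is a hypothetical substitute for a match chain, used so the sketch runs without the library; the pattern of hoisting the built guard is the point.

```typescript
// Stand-in for an expensive builder such as a match chain; counts constructions.
let builds = 0
function buildGuard(...names: string[]): (value: unknown) => boolean {
  builds += 1
  return (value) => {
    if (typeof value !== 'object' || value === null) return false
    const obj: object = value
    return names.every((name) => name in obj)
  }
}

// Anti-pattern: the guard is rebuilt on every call.
function isUserSlow(value: unknown): boolean {
  return buildGuard('id', 'email')(value)
}

// Preferred: build once at module scope, then reuse.
const isUser = buildGuard('id', 'email')
function isUserFast(value: unknown): boolean {
  return isUser(value)
}
```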

identifyAndTransform has minimal overhead compared to calling identify then transform separately. Use whichever reads better in your code.

Conditional types

When identify is not in the config, the methods don't exist on the registry type. This is enforced at the type level via function overloads:

// Without identify: methods don't exist
const plain = createRegistry({ schemas, migrations })
// plain.identify             -> type error: property does not exist
// plain.identifyAndTransform -> type error: property does not exist

// With identify: methods are present
const reg = createRegistry({ schemas, migrations, identify: { ... } })
await reg.identify(data)                      // works
await reg.identifyAndTransform(data, 'ai')    // works

The existing transform(), validate(), has(), findPath(), and explain() methods work exactly the same whether identify is configured or not.
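
The overload technique itself is plain TypeScript and can be shown without the library. The sketch below uses a deliberately tiny, hypothetical `makeRegistry` with one base method; it illustrates how overloads make a method appear only when a config key is present and says nothing about how doba's own types are written.

```typescript
type Discriminator = (value: unknown) => string | null

interface BaseRegistry {
  has(key: string): boolean
}
interface IdentifyingRegistry extends BaseRegistry {
  identify(value: unknown): string | null
}

// Overloads: the return type depends on whether `identify` is configured.
function makeRegistry(config: { schemas: string[]; identify: Discriminator }): IdentifyingRegistry
function makeRegistry(config: { schemas: string[] }): BaseRegistry
function makeRegistry(config: {
  schemas: string[]
  identify?: Discriminator
}): BaseRegistry | IdentifyingRegistry {
  const base: BaseRegistry = { has: (key) => config.schemas.indexOf(key) !== -1 }
  const identify = config.identify
  if (identify === undefined) return base
  return { ...base, identify: (value) => identify(value) }
}

const plainSketch = makeRegistry({ schemas: ['v1'] })
const detecting = makeRegistry({ schemas: ['v1'], identify: () => 'v1' })

// plainSketch.identify({})  // type error: property 'identify' does not exist
const detected = detecting.identify({})
```

The method is absent at runtime as well as in the types here, so a misuse is caught by the compiler before it could surface as an undefined call.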

Full example

Putting it all together with a versioned API:

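import { createRegistry, match } from 'dobajs'
import { z } from 'zod'

const v1Schema = z.object({ name: z.string(), admin: z.boolean() })
const v2Schema = z.object({ firstName: z.string(), lastName: z.string(), role: z.enum(['admin', 'user']) })
const v3Schema = z.object({ displayName: z.string(), role: z.enum(['admin', 'user']), email: z.string() })

const registry = createRegistry({
  schemas: { v1: v1Schema, v2: v2Schema, v3: v3Schema },
  migrations: {
    'v1->v2': (user) => ({
      firstName: user.name.split(' ')[0] ?? user.name,
      lastName: user.name.split(' ')[1] ?? '',
      role: user.admin ? 'admin' as const : 'user' as const,
    }),
    'v2->v3': (user) => ({
      displayName: `${user.firstName} ${user.lastName}`.trim(),
      role: user.role,
      email: 'unknown@example.com',
    }),
  },
  identify: {
    v1: match.field('admin'),
    v2: match.fields('firstName', 'lastName'),
    v3: match.field('displayName'),
  },
})

// Unknown data comes in from an API
const data: unknown = { name: 'Alice Smith', admin: true }

const result = await registry.identifyAndTransform(data, 'v3')
if (result.ok) {
  result.value      // { displayName: string; role: "admin" | "user"; email: string }
  result.meta.from  // "v1" | "v2" | "v3": the source schema detected by identify
  result.meta.path  // readonly ("v1" | "v2" | "v3")[]: ordered list of schema
                    // keys traversed, e.g. ['v1', 'v2', 'v3']
}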

For the full API reference including all types and method signatures, see the Identify API reference.
