Building reliable workflows with NATS and Go

How I use NATS JetStream as the backbone of durable workflow execution in production, and why it beats Kafka for this use case.

NATS JetStream changed how I think about distributed workflow execution. This is what I learned shipping it in production.

The problem with polling

Every workflow engine eventually hits the same wall: polling is cheap to build but expensive to run at scale. You end up with a herd of workers waking from time.Sleep on the same schedule and hammering the database with queries for work that isn’t there.

// The naive way — don't do this
for {
    jobs, err := db.GetPendingJobs(ctx)
    if err != nil {
        log.Error(err)
    }
    for _, job := range jobs {
        go process(job)
    }
    time.Sleep(500 * time.Millisecond)
}

The problem compounds when you add retries, dead-letter queues, and back-pressure. What started as 50 lines becomes a mini-broker.

Why NATS JetStream

JetStream gives you persistent, at-least-once delivery with consumer groups out of the box. The mental model is simple:

  • Stream — an ordered, persistent log of messages
  • Consumer — a cursor into that stream, with its own delivery semantics
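
js, err := nc.JetStream()
if err != nil {
    log.Fatal(err)
}

// Create the stream once; AddStream can fail, so check the error
if _, err := js.AddStream(&nats.StreamConfig{
    Name:     "WORKFLOWS",
    Subjects: []string{"workflow.>"},
    Storage:  nats.FileStorage,
}); err != nil {
    log.Fatal(err)
}

// Subscribe with a durable consumer and manual acknowledgement
sub, err := js.Subscribe("workflow.run", func(msg *nats.Msg) {
    var job Job
    if err := json.Unmarshal(msg.Data, &job); err != nil {
        msg.Term() // malformed payload: redelivery will not help
        return
    }

    if err := process(job); err != nil {
        msg.Nak() // ask for redelivery; a plain Nak adds no delay
        return
    }
    msg.Ack()
}, nats.Durable("worker"), nats.ManualAck())
if err != nil {
    log.Fatal(err)
}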

Backoff and retries

JetStream supports NakWithDelay for exponential backoff without any extra infrastructure:

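meta, err := msg.Metadata() // Metadata returns (*nats.MsgMetadata, error)
if err != nil {
    return // no ack sent; the message is redelivered after AckWait
}
delay := time.Duration(1<<meta.NumDelivered) * time.Second // 2s, 4s, 8s, ...
msg.NakWithDelay(delay)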
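
One thing to watch with a doubling delay is that it grows without bound. Here is a minimal sketch of a capped variant using only the standard library. The name backoffDelay and the base and max values are my own illustration, not part of nats.go, and it assumes the attempt counter is 1-based, as I understand NumDelivered to be.

```go
package main

import (
	"fmt"
	"time"
)

// backoffDelay returns base * 2^(attempt-1), capped at max.
// Hypothetical helper for illustration; attempt is assumed to be 1-based.
func backoffDelay(attempt uint64, base, max time.Duration) time.Duration {
	delay := base
	for i := uint64(1); i < attempt; i++ {
		// Stop doubling once the next step would reach or exceed the cap.
		if delay >= max/2 {
			return max
		}
		delay *= 2
	}
	if delay > max {
		return max
	}
	return delay
}

func main() {
	// Print the delay schedule for the first eight deliveries.
	for attempt := uint64(1); attempt <= 8; attempt++ {
		fmt.Println(attempt, backoffDelay(attempt, time.Second, time.Minute))
	}
}
```

The result would be passed to msg.NakWithDelay in place of the raw power of two.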

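After a configurable MaxDeliver, the server stops redelivering the message and publishes an advisory on $JS.EVENT.ADVISORY.CONSUMER.MAX_DELIVERIES.<stream>.<consumer>. The message does not move to a dead-letter stream on its own; to get one, capture those advisories in a separate stream and look up the original message by its stream sequence.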
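
The advisory published when a message exhausts MaxDeliver is a small JSON document. Here is a sketch of decoding it with the standard library only. The field names (stream, consumer, stream_seq, deliveries) follow the payload as I understand it from the JetStream docs. Treat them, the hand-written sample, and the MaxDeliverAdvisory type name as assumptions to check against your server version.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// MaxDeliverAdvisory mirrors the fields of interest in the advisory payload.
// The type name and field selection are illustrative assumptions.
type MaxDeliverAdvisory struct {
	Type       string `json:"type"`
	Stream     string `json:"stream"`
	Consumer   string `json:"consumer"`
	StreamSeq  uint64 `json:"stream_seq"`
	Deliveries uint64 `json:"deliveries"`
}

// parseAdvisory decodes the raw advisory body.
func parseAdvisory(data []byte) (MaxDeliverAdvisory, error) {
	var adv MaxDeliverAdvisory
	err := json.Unmarshal(data, &adv)
	return adv, err
}

func main() {
	// Sample payload written by hand for this example, not captured from a server.
	sample := []byte(`{"type":"io.nats.jetstream.advisory.v1.max_deliver","stream":"WORKFLOWS","consumer":"worker","stream_seq":42,"deliveries":5}`)
	adv, err := parseAdvisory(sample)
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	fmt.Printf("message %d in %s gave up after %d deliveries\n", adv.StreamSeq, adv.Stream, adv.Deliveries)
}
```

With the stream name and sequence in hand, the original message can be fetched and republished wherever failed jobs should go.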

What I’d do differently

If I were starting over, I’d define a proper WorkflowStep interface earlier and keep NATS as a pure transport layer — not let job-specific logic leak into the subscriber. The boundary matters.
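
To make that boundary concrete, here is a rough sketch of what such an interface might look like. Everything in it is hypothetical: the method set, the Outcome type, ErrPermanent, and the toy echoStep are my illustration of the idea, not the code Iris runs. The point is that a step only returns an error, and a thin dispatcher turns that into a transport-neutral outcome the subscriber can map onto Ack, Nak, or Term.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// WorkflowStep is a hypothetical interface; steps know nothing about NATS.
type WorkflowStep interface {
	Name() string
	Run(ctx context.Context, payload []byte) error
}

// Outcome is all the transport layer needs to know about a step's result.
type Outcome int

const (
	Done  Outcome = iota // acknowledge
	Retry                // negative-acknowledge and redeliver later
	Drop                 // terminate, never redeliver
)

// ErrPermanent marks failures that retrying cannot fix.
var ErrPermanent = errors.New("permanent failure")

// errTimeout stands in for a transient failure in this example.
var errTimeout = errors.New("upstream timeout")

// classify translates a step error into an Outcome.
func classify(err error) Outcome {
	switch {
	case err == nil:
		return Done
	case errors.Is(err, ErrPermanent):
		return Drop
	default:
		return Retry
	}
}

// Dispatch runs a step and reports what the transport should do next.
func Dispatch(ctx context.Context, step WorkflowStep, payload []byte) Outcome {
	return classify(step.Run(ctx, payload))
}

// echoStep is a toy step used only to exercise Dispatch.
type echoStep struct{}

func (echoStep) Name() string { return "echo" }

func (echoStep) Run(_ context.Context, payload []byte) error {
	if len(payload) == 0 {
		return fmt.Errorf("empty payload: %w", ErrPermanent)
	}
	return nil
}

func main() {
	ctx := context.Background()
	fmt.Println(Dispatch(ctx, echoStep{}, []byte("hi")) == Done)
	fmt.Println(Dispatch(ctx, echoStep{}, nil) == Drop)
	fmt.Println(classify(errTimeout) == Retry)
}
```

The subscriber callback then shrinks to decoding the message, calling Dispatch, and switching on the outcome, so swapping the transport would not touch any step.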


This is one post in an ongoing series on building Iris, my workflow automation platform.