Using the checker idiom to monitor software systems

When I worked at Stripe, nearly every engineering team used a monitoring system called Checker. Although widespread within the company, the checker idiom seems to be uncommon elsewhere. This post outlines what checker is, how it compares to other monitoring systems, and why it’s useful, and therefore why you might want to consider using a system like checker on your own team.

How it works

The checker framework consists of three components:

  1. A library for engineers to define checks.
  2. A scheduler to execute these checks at specified intervals.
  3. Alerting integrations to notify the relevant parties when a check fails.

I’ll use an example to illustrate the different components. Imagine we’re building software that generates daily reports at 11am, and we want to define an alert that notifies us by 12pm if a report wasn’t generated for the day.

Library

In Python, the core logic for the check might look like this:

def check_report_generated_daily():
    reports = get_reports_for_date(current_date())
    assert len(reports) > 0, f"Daily report not generated for {current_date()}"

You’ll notice that the check looks remarkably like a unit test. First, we query the database to get the reports that were generated today. Then, we assert that at least one report was generated. If the assertion fails, the check will raise an exception, which will be caught by the framework and trigger an alert.

Scheduler

We can add metadata to the check to specify when it should run:

@check(schedule="0 12 * * *") # 12pm
def check_report_generated_daily():
    reports = get_reports_for_date(current_date())
    assert len(reports) > 0, f"Daily report not generated for {current_date()}"

The simplest implementation of a scheduler would generate a cron job to run the check at the specified times. A more sophisticated scheduler could support running checks only on business days, automatically rerunning checks that fail due to transient errors, and adding jitter to avoid heavy load on the database at commonly scheduled times.
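
To make this concrete, here is a minimal sketch of the registration and cron-generation machinery. The REGISTRY list, the internals of the check decorator, and the to_crontab helper are illustrative assumptions, not checker’s actual implementation:

# Hypothetical registry that the scheduler reads from.
REGISTRY = []

def check(schedule, notify=None, channel=None):
    """Register the decorated function as a check with its cron schedule."""
    def decorator(func):
        REGISTRY.append({
            "func": func,
            "schedule": schedule,
            "notify": notify,
            "channel": channel or [],
        })
        return func
    return decorator

def to_crontab(runner="/usr/local/bin/run-check"):
    """Emit one crontab line per registered check (the simplest scheduler)."""
    return "\n".join(
        f"{entry['schedule']} {runner} {entry['func'].__name__}"
        for entry in REGISTRY
    )

Regenerating the crontab on each deploy would keep the schedule in sync with the code.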

Alerting

Lastly, we can specify who should be notified when the check fails, and how:

@check(
    schedule="0 12 * * *",
    notify="business_intelligence", 
    channel=["jira"], 
)
def check_report_generated_daily():
    reports = get_reports_for_date(current_date())
    assert len(reports) > 0, f"Daily report not generated for {current_date()}"

When a check’s assertion fails, the framework identifies the relevant notification channels (e.g., Slack, Jira, PagerDuty), and sends an alert to the appropriate team. In this case, a ticket will be created on the business intelligence team’s Jira board.
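
The dispatch side can be sketched in a few lines as well. Everything here is illustrative: run_check consumes the hypothetical registry entries from the sketch above, and the two send functions stand in for real Jira and Slack integrations.

def run_check(entry):
    """Execute a check; on assertion failure, alert on each configured channel."""
    try:
        entry["func"]()
    except AssertionError as exc:
        for channel in entry["channel"]:
            send_alert(channel, team=entry["notify"], message=str(exc))

def send_alert(channel, team, message):
    if channel == "jira":
        create_jira_ticket(board=team, summary=message)  # stand-in for the Jira API
    elif channel == "slack":
        post_slack_message(f"#{team}-alerts", message)   # stand-in for the Slack API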

Before looking more closely at why checker is so useful, let’s take a brief detour to compare it with other monitoring systems.

A brief comparison of monitoring systems

Most observability stacks will include time-series metrics, logs, and error tracking, all of which might provide some alerting capabilities. Here’s a condensed comparison of these monitoring options, plus checker.

| Feature | Time-series metrics (e.g. Prometheus) | Logs (e.g. Splunk) | Error tracking (e.g. Sentry) | Checker |
|---|---|---|---|---|
| Primary focus | Numerical data over time | Textual event data | Exceptions and crashes | Business logic and state |
| Data type | Mostly quantitative | Freeform | Mostly freeform | Any |
| Trigger | Threshold breaches | Log patterns/anomalies | Exceptions thrown | Custom assertions |
| Execution | Continuous | Event-driven | Event-driven | Scheduled |
| Query language | Custom (e.g. PromQL) | Often SQL-like | Usually GUI-based | Native programming language |
| Historical analysis | Strong | Strong | Medium | Limited |
| Real-time monitoring | Yes | Yes | Yes | No (scheduled) |
| Typical use case | System-level metrics | Debugging, audit trails | Unexpected errors | Business logic, state validation |

Now let’s take a closer look at checker.

Why checker is useful

Ergonomics

Checks are often easier to write than alternative monitoring implementations. This is again best explained with an example: imagine the infrastructure team has just implemented an automated database schema update process and wants to ensure all migrations complete on time. They can write a check to alert if any migration remains in the “in progress” state for more than 1 hour, as that indicates the migration process may be stuck.

@check(
    schedule="*/15 * * * *", # every 15 minutes
    notify="infrastructure",
    channel=["slack"],
)
def check_db_migrations_not_stuck():
    active_migrations = get_active_db_migrations()
    stuck_migrations = []
    for migration in active_migrations:
        if migration.start_time < current_time() - timedelta(hours=1):
            stuck_migrations.append(migration.id)
    assert len(stuck_migrations) == 0, f"Stalled migrations: {stuck_migrations}"

Without the checker idiom, the most likely alternative (at least according to Claude) is a Prometheus alerting rule, routed through Alertmanager. It might look like this:

groups:
  - name: DatabaseMigrationAlerts
    rules:
      - alert: MigrationStuck
        expr: db_migration_duration_seconds{status="in_progress"} > 3600
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Database migration stuck"
          description: "Migration {{ $labels.migration_id }} has been in progress for over 1 hour"

While the YAML that defines the alert is simple enough, there is a lot of work hiding behind the db_migration_duration_seconds metric. Emitting it requires writing a Prometheus exporter (or pushing the metric) to expose the migration duration, and then setting up a job to periodically poll the migration status. With checker, an engineer just writes code like the example above. Both the metric exporter and the polling job are unnecessary: the check can query the database directly, and the checker framework handles scheduling.
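
For contrast, here is roughly what the exporter side involves, sketched with the prometheus_client library; get_active_db_migrations and the 60-second poll interval are assumptions carried over from the check above.

import time
from prometheus_client import Gauge, start_http_server

MIGRATION_DURATION = Gauge(
    "db_migration_duration_seconds",
    "How long each migration has been running",
    ["migration_id", "status"],
)

def poll_migrations():
    # Same hypothetical helper the check uses, now feeding a metric instead.
    for migration in get_active_db_migrations():
        MIGRATION_DURATION.labels(
            migration_id=migration.id, status="in_progress"
        ).set(time.time() - migration.start_time.timestamp())

if __name__ == "__main__":
    start_http_server(8000)  # expose /metrics for Prometheus to scrape
    while True:
        poll_migrations()
        time.sleep(60)

That’s three moving pieces (exporter, poller, alerting rule) versus one function.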

Testing

Compared to a check, the Prometheus setup has several extra components that need to be tested: the polling job, the metric exporter, and the alerting rule itself (maybe you got the syntax wrong). With checker, on the other hand, the assertion logic is the only piece that needs tests.
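
As a sketch, assuming the first check lives in a hypothetical reports module whose helpers can be monkeypatched, the entire test suite for that alert might be:

import pytest
from reports import check_report_generated_daily  # hypothetical module

def test_alerts_when_no_reports(monkeypatch):
    # Simulate a day on which no report was generated.
    monkeypatch.setattr("reports.get_reports_for_date", lambda date: [])
    with pytest.raises(AssertionError):
        check_report_generated_daily()

def test_passes_when_report_exists(monkeypatch):
    monkeypatch.setattr("reports.get_reports_for_date", lambda date: ["report"])
    check_report_generated_daily()  # should not raise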

Everything else

The key property of checker that makes it ergonomic is that a check is just code. A check can be written in the same language as the business logic and live in the same repository. This means IDE integrations, testing, builds, deployments, versioning, and all the other ways code is managed automatically apply to checks.

Time-aware assertions

Most monitoring systems are time-agnostic. For instance, it doesn’t matter whether the server runs out of memory at 10am or 10pm; either way, an engineer needs to be paged. Other alerts need to be time-aware. An alert about incomplete payroll at 9am on the 1st of the month is actionable and urgent; the same alert at random times throughout the month would be meaningless (because payroll only runs once a month!).

Timing is the key element here — the checker framework’s ability to schedule checks allows for these kinds of business-critical, time-sensitive validations that other monitoring systems can’t accommodate as easily.
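
The payroll example maps directly onto a check. The team name, channel, and helper functions here are hypothetical:

@check(
    schedule="0 9 1 * *",  # 9am on the 1st of the month, when payroll runs
    notify="payroll",
    channel=["pagerduty"],
)
def check_payroll_complete():
    run = get_payroll_run_for_month(current_month())  # hypothetical helper
    assert run is not None and run.status == "complete", (
        f"Payroll incomplete for {current_month()}"
    )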

Detecting absence

Prometheus, Sentry, and logs are event-driven systems. These stacks aren’t designed to detect the lack of an event. (Prometheus does have an absent() function, but it’s not as powerful as code: it can only alert if a metric is missing, not if a metric is missing and its last reported value was below a certain threshold, both of which custom code could easily check. Prometheus also doesn’t guarantee 100% accuracy, so it’s possible the data is incomplete.) Checker, on the other hand, does very well here; the first example in this post, detecting a missing report, highlights this.

The reason checker can do this idiomatically is that it runs on a schedule, and that it queries the state of the system, rather than reacting to events.
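
For example, here is a sketch of an absence check that absent() alone couldn’t express: alert only if a data feed has gone quiet and its last reported value was already low. The feed helpers and the threshold of 100 are made up for illustration.

@check(schedule="*/10 * * * *", notify="data_platform", channel=["slack"])
def check_feed_not_silently_degraded():
    recent = get_feed_events_since(current_time() - timedelta(hours=1))
    if not recent:  # the feed has stopped reporting
        last = get_last_feed_event()
        # Only page if it was already trending low before going quiet.
        assert last.value >= 100, (
            f"Feed silent for 1h and last value was low ({last.value})"
        )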

Querying over state

Events tell you “something happened”, while polling lets you ask “what is the state of the system?” Often the requirement is to alert on a particular state, i.e. the effect of several events upon a system, rather than the events themselves. For example, a team may not care about the intermediate actions that reconcile cash, but does care that 99.99% of all cash is reconciled in aggregate. Often, it is also intractable to instrument and aggregate all events to get the same information, so a check is the only way to express the alert.

So far, the examples I’ve used query the production transactional database, but checker also allows querying a data warehouse in SQL. (In fact, checker is even more general than this: you can schedule any piece of code that contains an assertion, whether that’s a SQL query, a REST API call, or even a shell script.) Warehouse access is especially useful for complex checks that need to join several data sources. For example, a check might join several tables to verify that the sum of payment amounts, less refunds, equals the net cash flowing into the system. Expanding the scope of a check to inspect different data sources makes it easy to monitor the state of the entire system, not just an individual component.
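
A sketch of that reconciliation check, assuming a hypothetical warehouse_query helper that runs SQL against the warehouse and returns a row:

@check(schedule="0 6 * * *", notify="payments", channel=["jira"])
def check_cash_reconciles():
    # Join across payment, refund, and cash tables in the warehouse.
    row = warehouse_query("""
        SELECT
            (SELECT SUM(amount) FROM payments)
          - (SELECT SUM(amount) FROM refunds)      AS expected,
            (SELECT SUM(amount) FROM cash_inflows) AS actual
    """)
    assert abs(row.expected - row.actual) < 0.01, (
        f"Cash mismatch: expected {row.expected}, actual {row.actual}"
    )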

Invariant assertion

One way of looking at the checker idiom is that it gives us the ability to assert invariants like “deploys are not stale” or “an order object is always linked to an item object”. This is a sort of proof of correctness. In fact, most search results for “invariant assertion” relate to proving algorithm correctness. If you squint, that’s what checker does, except over an entire system rather than a single algorithm.
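
The order-item invariant, for instance, reduces to a one-line assertion over a query; get_orders_without_items is a hypothetical helper:

@check(schedule="0 * * * *", notify="orders", channel=["slack"])
def check_orders_linked_to_items():
    # Invariant: every order references at least one item.
    orphans = get_orders_without_items()
    assert len(orphans) == 0, f"Orders with no linked items: {orphans}"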

Conclusion

I find that checker fills a niche not covered by typical monitoring tools. It is time-aware, detects absences, and queries over state, all packaged in an easy-to-use pattern. I think the pattern is so useful that, at a large-enough scale, most software engineering teams end up reinventing elements of checker to monitor their systems. If you’re working on observability tools for your organization, consider building a checker framework for your stack.
