Miggo

Intro

One of my favourite software engineering posts is Alexis King’s “Parse, Don’t Validate”. Its conclusion is simple yet profound: Functions that parse inputs are more powerful versions of validation functions. This is because validation returns a boolean, while parsers transform the input into a different domain.

A one-sentence summary doesn’t do it justice. It took me a while to understand the implications of what she wrote — not only because the original is in Haskell and talks a lot about type design, but because its implications are far-reaching and subtle.

Alexis’ mantra echoed through my career. It affected how I write code and how I analyse it for vulnerabilities. Team members roll their eyes when I repeat the phrase for the thousandth time, yet it’s helped us avoid countless bugs, shaping some core structures of our application.

In honour of the original article turning six (hurray!), I want to explore its main points by showing real-life examples of how we use this pattern at Miggo, and how it can be found in the wild.

What’s all this parsing about?

Let’s set the scene. We’ve got a data pipeline written in Python. One of our data sources is telemetry spans: events describing operations that happened inside customer environments. These spans can represent a diverse set of operations, like:

Handling HTTP requests
Sending HTTP requests
Database calls
Using cloud resources (like calls to S3, SQS, Cloud Compute, lambda invocations)
And so forth

For example, an incoming HTTP request can look something like this:

{
  "tags": {
    "http.target": "/foo",
    "http.host": "https://example.com"
  },
  "resourceTags": {
    "serviceName": "..."
  }
}

We want to translate these raw JSON events into our domain model. One of its entities is the HTTPServer, which represents a server that listens for HTTP requests. It looks something like:

from dataclasses import dataclass

@dataclass
class HTTPServer:
  service_name: str
  domain: str

Where service_name is what’s in resourceTags['serviceName'], and domain is what’s in tags['http.host'].

We have a classic data boundary: We receive a loose data-structure as an input, and we want to either map it to a domain object or fail. Because we ingest a myriad of spans as input, and some won’t be worthy of our HTTPServer, we have to make sure the input matches our expectations.

How would you have implemented it? Think about it for a moment; there are a lot of options available. I’ll go make some hot cocoa.

Alright, let’s break it down! Here’s a first attempt:

def validate_is_span_http_server(span: dict):
  return "serviceName" in span["resourceTags"] and "http.host" in span["tags"]

def handle_one_span(span: dict) -> HTTPServer | None:
  if not validate_is_span_http_server(span):
    return None
  return HTTPServer(
    service_name=span["resourceTags"]["serviceName"],
    domain=span["tags"]["http.host"],
  )

Simple enough, right? We check that we got the expected attributes, and then we take them.

Let’s turn up the heat. One day, I slip on my keyboard and comment this part out:

def handle_one_span(span: dict) -> HTTPServer | None:
  # if not validate_is_span_http_server(span):
  #   return None
  return HTTPServer(
    service_name=span["resourceTags"]["serviceName"],
    domain=span["tags"]["http.host"],
  )

Unless this specific branch is covered by unit tests (you don’t just check the happy path, right?), we can gleefully deploy this change. And when we get a span that looks like:

{
  "tags": {
    "aws.operation": "…"
  },
  "resourceTags": {
    "serviceName": "…"
  }
}

We’ll throw a KeyError: 'http.host'. In codebases where there’s a preference to use get on dicts, we wouldn’t have even gotten an exception — we would have gotten an invalid HTTPServer with None for a domain. Yikes.
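To make that second failure mode concrete, here is a minimal sketch of the same handler written with get and with the validation call missing. The name handle_one_span_with_get and the sample values are made up for illustration, and the dataclass is repeated so the snippet runs on its own:

```python
from dataclasses import dataclass


@dataclass
class HTTPServer:
  service_name: str
  domain: str


def handle_one_span_with_get(span: dict) -> HTTPServer:
  # Hypothetical variant: no validation call, and every lookup uses .get
  return HTTPServer(
    service_name=span.get("resourceTags", {}).get("serviceName"),
    domain=span.get("tags", {}).get("http.host"),
  )


# A span describing an AWS call, not an incoming HTTP request (illustrative values)
aws_span = {
  "tags": {"aws.operation": "PutObject"},
  "resourceTags": {"serviceName": "billing"},
}

server = handle_one_span_with_get(aws_span)
print(server)  # HTTPServer(service_name='billing', domain=None)
```

Dataclasses don't check their type annotations at runtime, so nothing complains until some later code tries to use the domain as a string.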

Something else we constantly have to deal with is how frequently our inputs change. We can also get spans that look like:

{
  "tags": {
    "http.url": "https://example.com/foo/bar"
  }
}

Note the http.url instead of http.host. It means we have to parse the URL to extract the domain from it. Let’s use ada_url:

from ada_url import URL


def validate_is_span_http_server(span: dict):
  if "serviceName" not in span["resourceTags"]:
    return False
  if "http.host" in span["tags"]:
    return True
  return URL.can_parse(span["tags"].get("http.url"))

We were even very clever — we leveraged how can_parse(None) returns false, we have early returns, nothing can go wrong, bim-bam-boom, let’s create our object!

def handle_one_span(span: dict) -> HTTPServer | None:
  if not validate_is_span_http_server(span):
    return None

  domain = span["tags"].get("http.host")
  if not domain:
    domain = URL(span["tags"]["http.url"]).host

  return HTTPServer(
    service_name=span["resourceTags"]["serviceName"],
    domain=domain,
  )

Uhm…something smells off. We’re repeating the host-or-URL dance the validator has already done. Worse, for every valid span that lacks http.host we parse the URL twice: once in the inner function, and again in the outer. That’s not great.

Validation is a lossy process

Our “true” and “false” return values are lossy representations of:

Which tags exist and which tags don’t (serviceName, http.host, http.url)
What the semantics of these values are (string service name, parse-able url)

Consider: In the body of handle_one_span, we kinda just…assume that http.url exists when http.host doesn’t. handle_one_span also kinda just…assumes that http.url can even parse as a valid url. Because we’ve got eyes on both functions, we can see that that’s the case. But looking at the caller in isolation, no such guarantee is provided.

This is knowledge that our validation function may have gleaned, but none of it is inherent to the contract between them. If we change the implementation of one of them, we have to consider the ramifications on the other, and if we don’t — bugs aplenty.

This is tight coupling, and it makes puppies sad. Go pet a puppy to make it happy while we think how we can uncouple these functions.

Preserving semantics between function boundaries

Our goal in uncoupling is for the internal function to clearly communicate what it has gleaned. Doing that means returning a richer data structure than a boolean:

def span_to_server_span(span: dict) -> dict | None:
  try:
    service_name = span["resourceTags"]["serviceName"]
  except KeyError:
    return None

  domain = span["tags"].get("http.host")
  if not domain:
    try:
      http_url = URL(span["tags"]["http.url"])
      domain = http_url.host
    except (KeyError, ValueError):
      pass

  if not domain:
    return None

  return {
    "service_name": service_name,
    "domain": domain,
  }

def handle_one_span(span: dict) -> HTTPServer | None:
  if server_span := span_to_server_span(span):
    return HTTPServer(
      service_name=server_span["service_name"],
      domain=server_span["domain"],
    )
  return None

It’s the same function, but we keep what we’ve accomplished for later. Revisiting our critiques:

Previously, both handle_one_span and validate needed to know which tags exist and when. Now, only span_to_server_span does
Previously, both handle_one_span and validate needed to know the semantics of the values. Now, only span_to_server_span does

As long as they both maintain the contract of what the return value means, we can change either function independently. In fact, let’s take it a step further:

def span_to_server_span(span: dict) -> HTTPServer | None:
  try:
    service_name = span["resourceTags"]["serviceName"]
  except KeyError:
    return None

  domain = span["tags"].get("http.host")
  if not domain:
    try:
      http_url = URL(span["tags"]["http.url"])
      domain = http_url.host
    except (KeyError, ValueError):
      pass

  if not domain:
    return None

  return HTTPServer(
    service_name=service_name,
    domain=domain,
  )

def handle_one_span(span: dict) -> HTTPServer | None:
  if server_span := span_to_server_span(span):
    return server_span

This allows us to do all kinds of refactorings more easily. We can support more entities by chaining ifs together:

if server_span := span_to_server_span(span):
  return server_span
if client_span := span_to_client_span(span):
  return client_span

…and neither function needs to know about it, or what order they’re called in. We can add logic like tenant-specific filters to handle_one_span (or its caller!) without anyone having to know anything about tags or the values inside them — operating only on the domain model.

Another kind of refactoring we can do is change the span input type to not be a dict, but an interface with commonly used accessors, like:

class Span:
  def get_tag(self, name: str) -> str | None: ...
  def get_resource_tag(self, name: str) -> str | None: ...
  def get_service_name(self) -> str: ...

And only the inner functions need to be concerned.

Is this the best code in the world? Nah. I would have liked to factor out the domain extraction, and there’s more input validation to be done. But it’s definitely a start.

What does any of this have to do with parsing?

Returning to “Parse, don’t validate”. Usually, we see “parsers” in programming language implementations, as the part that turns strings into syntax trees. I claim (and Alexis has my back! …hopefully) that our span_to_server_span is a parser: It takes a loose input from one data domain (telemetry spans), and transforms it into another data domain (our entity model).

By parsing and not validating, we’ve freed our functions to only focus on one level of abstraction, and decoupled them from one another. As shown above, we can refactor span_to_server_span without worrying about its callers.

It’s easy to dismiss this idiom as being about strong types — that it’s irrelevant when using duck typing or when passing dicts around. It’s especially tempting when looking at the original article in Haskell. I disagree. When we made span_to_server_span both accept and return a dict, we made a validator into a parser: We created a function which turned data from one domain (telemetry spans) into another (our entity model).

When following this adage, we’ve demonstrated several important principles of software engineering: Decoupling concerns, writing functions that deal with only one level of abstraction, and keeping our values semantic.

Can’t I just use pydantic?

Libraries like Pydantic and attrs are fantastic parsers: You give them a loose data structure, and they give back a domain object. We could have defined HTTPServer as a BaseModel no problem. They offer post- and pre- validations to further specify how you want your domain objects to behave. 7/5 with rice.

While they’re good building blocks, they do not replace your parsing algorithm. Taking the domain example above, pydantic/attrs don’t know anything about spans: They need to be fed the data comprising the domain object. Sorry, it’s difficult to get out of this one.

Recognising validation functions in the wild

Now that we’ve seen parsing and validation functions, let’s see how you can apply this to your codebase. The first step is identifying these functions. The two patterns I most frequently see are:

Like the pipeline example — assumptions about the shape or semantics in one function being duplicated in another
A call to a validate function in the beginning of a long chain, relying on an early return or exception bubbling

We can see an example in Cobra, a CLI application builder in Go:

Runnable checks whether a Command has a Run or RunE member that isn’t nil
One of its callers, Execute, early-returns if Runnable returns false
Later on, Execute assumes that Command has either a Run or RunE member

Aside from “just remembering”, there’s nothing between these calls that ensures Runnable has to be called: A change in one can break the other.

A short trip to grep.app looking for validation functions brings us to Leaflet, a mapping library, where:

LatLng.validate checks whether its arguments can be potentially turned into a LatLng
The LatLng constructor calls validate, and then repeats its logic almost byte for byte

JavaScript class constructors have to return a class instance or throw an exception. If you don’t want to pay the cost of exception handling and prefer null, an alternative to a static validate can be a static from function. This is the kind of code review comment I find myself leaving, taking a leaf from programming languages like Rust for implementing explicit conversions from one domain to another.
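Translated to Python, the same idea is a classmethod that returns either an instance or None. This is a sketch of the pattern only, not Leaflet's API: the LatLng class below, the try_from name, the accepted input shapes and the range checks are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LatLng:
  lat: float
  lng: float

  @classmethod
  def try_from(cls, value: object) -> "LatLng | None":
    # Accept a (lat, lng) pair or a {"lat": ..., "lng": ...} mapping
    if isinstance(value, dict):
      pair = (value.get("lat"), value.get("lng"))
    elif isinstance(value, (list, tuple)) and len(value) == 2:
      pair = tuple(value)
    else:
      return None
    try:
      lat, lng = float(pair[0]), float(pair[1])
    except (TypeError, ValueError):
      return None
    # NaN fails every comparison, so this also rejects NaN inputs
    if not (-90 <= lat <= 90 and -180 <= lng <= 180):
      return None
    return cls(lat, lng)
```

Callers write `if point := LatLng.try_from(raw):` and never need a separate validate call whose logic could drift from the constructor's.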

Putting on my security researcher hat, these are common patterns in authn/authz bypasses and privilege escalation vulnerabilities. For example, CVE-2024-55949 in MinIO, an S3-compatible storage service, was patched by adding this at the top of the request handler:

// w = response, r = request
objectAPI, _ := validateAdminReq(ctx, w, r, policy.ImportIAMAction)

The validateAdminReq function, despite being called validate, translates the request object into an objectAPI. That’s a classic parser. A basic validation function would have returned a boolean indicating whether the action was permitted, and the request handler would have then created the objectAPI directly, something like:

// Pseudocode
if !validateAdminReq(ctx, w, r, policy.ImportIAMAction) {
  return ...;
}
objectAPI = newObjectAPI(…);

That would make it easy to miss or forget the validation. The MinIO patch is good: it makes it harder to forget the permissions check.

Taking it a step further and making it impossible (or weird, so it stands out in code reviews) to directly create an objectAPI without an explicit permissions check, we can reduce the likelihood of these vulnerabilities existing in the first place.

Conclusion

This is a lot of words to say how “Parse, don’t validate” is great. It’s one of those simple principles that conveys wisdom on how we should author software. Obvious, yet nuanced, like so many timeless lessons. While the original article is set in the context of type-driven design, it extends beyond typing to all languages: from C all the way to Elixir.
