Having the right data

In the beginning there was the relational database, and it was…OK, I guess. It was based on the relational model, and allowed operations that were within the relational algebra.

I mean it actually didn’t. The usual standard for relational databases is ISO 9075, or SQL. It doesn’t really implement the relational model, but something very similar to it. Still, there is a standard way for dealing with relational data, using a standard syntax to construct queries and statements that are mathematically provable.

I mean there actually isn’t. None of the “SQL databases” you can get hold of actually implement the SQL standard accurately or in its entirety. But it’s close enough.

At some point people realised that you couldn’t wake up the morning of your TechCrunch demo and code up your seed-round-winning prototype before your company logo hit the big screen, because it involved designing your data model. So the schemaless database became popular. These let you iterate quickly by storing any data of any shape in the database. If you realise you’re missing a field, you add the field. If you realise you need the data to be in a different form, you change its form. No pesky schemata to migrate, no validation.

I mean actually there is. It’s just that the schema and the validation are the responsibility of the application code: if you add a field, you need to know what to do when you see records without the field (equivalent to the field being null in a relational database). If you realise the data need to be in a different form, you need to validate whether the data are in that form and migrate the old data. And because everyone needs to do that and the database doesn’t offer those facilities, you end up with lots of wasteful, repeated, buggy code that sort of does it.

So the pendulum swings back, and we look for ways to get all of that safety back in an automatic way. Enter JSON schema. Here’s a sample of the schema (not the complete thing) for Covid-19 cases in Global.health:

{
  bsonType: 'object',
  additionalProperties: false,
  properties: {
    location: {
      bsonType: 'object',
      additionalProperties: false,
      properties: {
        country: { bsonType: 'string', maxLength: 2, minLength: 2 },
        administrativeAreaLevel1: { bsonType: 'string' },
        administrativeAreaLevel2: { bsonType: 'string' },
        administrativeAreaLevel3: { bsonType: 'string' },
        place: { bsonType: 'string' },
        name: { bsonType: 'string' },
        geoResolution: { bsonType: 'string' },
        query: { bsonType: 'string' },
        geometry: {
          bsonType: 'object',
          additionalProperties: false,
          required: [ 'latitude', 'longitude' ],
          properties: {
            latitude: { bsonType: 'number', minimum: -90, maximum: 90 },
            longitude: { bsonType: 'number', minimum: -180, maximum: 180 }
          }
        }
      }
    }
  }
}

This is just the bit that describes geographic locations, relevant to the falsehoods we believed about countries in an earlier post. This schema is stored as a validator in the database (you know, the database that’s easier to work with because it doesn’t have validators). But you can also validate objects in the application if you want. (Actually we currently have two shadow schemas: a Mongoose document description and an OpenAPI specification, in the application. It would be a good idea to normalise those: pull requests welcome!)

About Graham

I make it faster and easier for you to create high-quality code.
This entry was posted in software-engineering and tagged . Bookmark the permalink.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.