Highly opinionated SQL schema design that's easy to start with and scales well
2 points by hot_gril on April 25, 2023 | 9 comments
Today I'm going to share what I've learned over several years of using relational DBMSes (mostly Postgres). I'm probably not the first to land on this approach, and there are probably people who disagree.

This is how I would build a social media app, as an example. But it works for a wide variety of applications, from social to money to inventory tracking, and it's served me well each time.

1. Make each table a logical collection of events that have occurred, not objects or state. This is not CRUD. One row means one event; for example, each row in a `react` table means "a user reacted to some post." Keep these events as close as possible to the end user's intent (a rough sketch follows this list).

2. Never delete or update rows as a way to record events. This would mean forgetting that something happened. Only do that for manual corrections or special privacy reasons, e.g. GDPR requirements to erase user data.

3. Figure out the present state by aggregating events, either in SQL for realtime requests or perhaps in a mapreduce pipeline for batch. For example, the visible reactions on a post are only the latest one each user made, superseding previous ones.

4. On the technical side, in Postgres always use an `id bigserial` primary key and add a `created_at timestamptz` column for bookkeeping. Some other attributes might be unique, but never make them primary keys, and only create foreign keys to the `id` col. Similar for MySQL. If (for special use cases) you need to understand exactly which event happened first transactionally, order by some other autoincrement column (or just the ID if you're lazy). If you need to do that across multiple tables, make a separate table joining them.

5. Try to only add new tables and cols as needed for new features. It's ok and inevitable to end up with some deprecated ones that you eventually clean up more carefully.

6. Keep everything normalized as a baseline. If some query is too slow to aggregate past data, make an additional denormalized table tracking the current state (i.e. an exception to rules #1-3) alongside the regular ones you insert into in the same transaction.

7. Kinda implied by all the above, don't use ORMs or other heavy-handed ways to abstract away the DBMS. SQL is your interface.

8. You can get pretty far with a monolithic backend/DB this way, but when it's time for microservices: Draw a graph of your tables by foreign keys. Each connected component is a candidate for becoming its own service (with its own DB) if you don't foresee it becoming connected to the others. You aren't going to have cross-service transactions. Introduce caching along these boundaries. Never make a service persist info it's not authoritative on (ephemeral caching is OK). Things should be identified across services by some random UUID or incrementing int, not by some attribute that's believed to be unique at some point in time (e.g. a user's email address). Generally, whichever service observes things happening is the one that owns the data; don't make a service that resembles a custom DBMS for multiple others to R/W to.
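To make points 1, 3, and 4 concrete, here's a rough sketch of what a `react` table and its aggregation could look like in Postgres. The column names, and the `users`/`posts` tables it references, are just for illustration:

    CREATE TABLE react (
      id bigserial PRIMARY KEY,
      created_at timestamptz NOT NULL DEFAULT now(),
      user_id bigint NOT NULL REFERENCES users (id),
      post_id bigint NOT NULL REFERENCES posts (id),
      kind text NOT NULL  -- e.g. 'like', 'heart', or 'none' to supersede an earlier reaction
    );

    -- Present state (point 3): a user's visible reaction on a post is their latest event.
    SELECT DISTINCT ON (user_id) user_id, kind
    FROM react
    WHERE post_id = $1
    ORDER BY user_id, id DESC;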

Why is this so great? You have a clear signal on what exactly to store and what not to. You never lose information. You can always support new features with the existing info you have. You have fewer layers of abstraction between your UI (or API) and your database, which despite some conventional advice actually makes it easier to adapt to new features.



Could you provide an example of how you would implement this approach to store Credit/Debit events for an account? Additionally, how would you handle a scenario where there are 30,000 events on the account, and you need to calculate the balance to prevent overdraft?


Heh, you found the hard case. You want to add a denormalized table (point #6) specifically for locking on the balance, just because Postgres/MySQL `serializable` mode is way too slow to rely on. You still keep the baseline insert-only credits/debits table(s) that you insert into in the same transaction, and all the usual rules apply there.
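Roughly, with a hypothetical `balance` table keyed by account and a single signed `ledger_entry` event table standing in for the credits/debits tables (names are just illustrative), a debit looks like:

    BEGIN;
    -- Lock this account's denormalized balance row so concurrent debits serialize on it.
    SELECT amount FROM balance WHERE account_id = $1 FOR UPDATE;
    -- The application checks amount >= $2 here and rolls back if it would overdraft.
    INSERT INTO ledger_entry (account_id, delta) VALUES ($1, -$2);  -- the insert-only event
    UPDATE balance SET amount = amount - $2 WHERE account_id = $1;  -- keep the denormalized state in sync
    COMMIT;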

You can also do this without making such an exception. I used to keep a separate "pending" table that I'd insert into, commit, then check the balance with the pending row included before moving it to non-pending. So two transactions. That worked; the problem is it was annoying. It was a good solution, though, for debits/credits that involved an async external step that could fail or time out: simply ignore the pending rows that are too old and never got resolved. 30K rows is still small enough to query quickly.
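Sketching that version, with the pending state folded into a flag on the same hypothetical `ledger_entry` table (rather than a separate table) and an arbitrary timeout, just to keep it short:

    -- Transaction 1: record the intent as a pending event, then commit.
    INSERT INTO ledger_entry (account_id, delta, pending) VALUES ($1, -$2, true) RETURNING id;

    -- Balance check counts settled entries plus live pending ones; stale pending rows
    -- (an external step that timed out and never resolved) are simply ignored.
    SELECT sum(delta) FROM ledger_entry
    WHERE account_id = $1
      AND (NOT pending OR created_at > now() - interval '15 minutes');

    -- Transaction 2: if the check and the external step succeed, settle the row.
    UPDATE ledger_entry SET pending = false WHERE id = $3;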


What are the key trade-offs in your approach vs doing event sourcing?


Tbh I had to just look up event sourcing, but thanks, I'll add that to my vocabulary. Seems like my approach includes a form of event sourcing, though I'm not always storing the ordering of events, only in cases where it's needed. And that concept isn't specific to relational DBs.


Why normalize? Go even further and use a universal relation / Entity-Attribute-Value pattern. One table with columns entity_id, type = assert|retract, key, value

Essentially, use SQL as an immutable key-value store and use views for queries.
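A rough sketch of that shape in Postgres, using `attribute` for the key column to match the query below and treating attributes as single-valued for simplicity:

    CREATE TABLE facts (
      entity_id bigint NOT NULL,
      type text NOT NULL CHECK (type IN ('assert', 'retract')),
      attribute text NOT NULL,
      value text NOT NULL,
      created_at timestamptz NOT NULL DEFAULT now()
    );

    -- Current state: the latest event per (entity, attribute) wins; if it's a retract,
    -- the attribute is simply gone.
    CREATE VIEW current_facts AS
    SELECT entity_id, attribute, value
    FROM (
      SELECT DISTINCT ON (entity_id, attribute) entity_id, attribute, value, type
      FROM facts
      ORDER BY entity_id, attribute, created_at DESC
    ) latest
    WHERE type = 'assert';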


With a KV store of denormalized info, how would you store and efficiently query the count of "like" reacts user A made on all posts that have also been liked by user B?


    SELECT count(likes_a)
    FROM facts AS likes_a, facts AS likes_b
    WHERE likes_a.attribute = 'like'
      AND likes_b.attribute = 'like'
      AND likes_a.value = likes_b.value
      AND likes_a.entity_id != likes_b.entity_id
      AND likes_b.entity_id = 123


Ah, I also should've looked up EAV since there's a lot on this. I don't see what you gain from this other than not having to define new tables/cols for new use cases, and the downside is any DB structure has to be enforced entirely in code. I've been down routes like this before (not EAV but other sorta polymorphic DBs), and it's never ended well. Became unsafe to touch the code at some point, especially in a team setting.

On the more technical side, I don't know if Postgres would handle this efficiently. The `attribute` col has the classic low-cardinality problem for indexes. The one big table you use would have many very different access patterns. There are DBMSes more designed for this use case.

I'm also not seeing very much EAV adoption mentioned on Wikipedia. I can see this working for some specialized use cases, though. Has this worked well for you in the past?


I personally think it’s safer to enforce data constraints in application code, because “bad” data won’t break it. Putting the constraints in the database means your application is coupled to whatever database you use. Worse, because databases only offer a small set of constraints, it’s often the case that constraints are split between the database and application code. You have “validations” in both the application code and database definitions, and data must be checked/transformed anyway whenever it is marshaled to and from the database. Constraints in the database are effectively implicit, living in a set of migrations or often not part of any source control.

EAV indexing is much easier, because EAV is naturally represented as a tree. The whole database is the index. You only have four universal indexes that cover every query: EAV, AEV, AVE, and VAE. The mutual likes example above is covered by these indexes. And you never had to worry about it!
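Concretely, on a facts(entity_id, attribute, value) table those four would be something like:

    CREATE INDEX facts_eav ON facts (entity_id, attribute, value);
    CREATE INDEX facts_aev ON facts (attribute, entity_id, value);
    CREATE INDEX facts_ave ON facts (attribute, value, entity_id);
    CREATE INDEX facts_vae ON facts (value, attribute, entity_id);

The mutual-likes query above would presumably be served by facts_ave: filter on attribute = 'like', join on value.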

See Datomic for a real-world example of an EAV architecture.



