Idk about the grammatical correctness of the punctuation, but I really enjoyed reading his writing. I'd never read anything by him before; it was genuinely refreshing, especially given it was a glorified ad.
Can you elaborate a bit more on the challenges faced in making Postgres shard-able?
I remember that adding sharding to Postgres natively was an uphill battle. There were a few companies who had proprietary solutions for it. What you've been able to achieve is nothing less than a miracle.
1. People don't design schemas to be sharded, although many gravitate towards a common key, e.g. user_id or country_id or tenant_id or customer_id. Once that happens, sharding becomes easier.
2. Postgres provides a lot of guarantees that are tricky to maintain when sharded: atomic changes, referential integrity, check constraints, unique indexes (and constraints), to name a few. Those have to be built separately by a sharding layer (like PgDog) and have trade-offs, usually around performance. It's a lot more expensive to check a globally enforced constraint than a local one (network hops aren't free).
3. Online migrations from unsharded to sharded can be tricky: you have to redistribute terabytes of data while the DB continues to serve writes. You can't lose a single row - Postgres is used as a store of record and this can be a serious issue with business impact.
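The common-key pattern in point 1 can be sketched as hash-based routing on a shared key. A minimal illustration (all names here are illustrative, not PgDog's actual implementation):

```python
import hashlib

NUM_SHARDS = 4

def shard_for(tenant_id: int) -> int:
    """Map a tenant key to a shard with a stable hash.

    Illustrative only: a real sharding layer (PgDog, Citus, etc.)
    uses Postgres-compatible hash functions so rows land on the
    same shard the database's own partitioning would pick.
    """
    digest = hashlib.sha256(str(tenant_id).encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS

# All rows for one tenant land on one shard, so single-tenant
# queries stay local; cross-tenant queries still fan out.
```

Because routing is a pure function of the key, single-tenant queries can be answered by one shard without any coordination, which is exactly why schemas that already gravitate toward such a key shard well.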
We're taking increasingly bigger bites at this apple. We started with basic query routing and are now doing query rewrites as well. We didn't handle data movements previously and now have almost fully automatic resharding. It takes time, elbow grease and most importantly, willing and courageous early adopters to whom we owe a huge debt of gratitude.
That was my second question: how on earth can you replicate real-world Postgres workloads that benefit the most from sharding?
Are there some specific standard Postgres test suites you run PgDog through to ensure it's compliant with Postgres standards?
You've mentioned NoSQL quite a bit. What sort of techniques do shard-able NoSQL databases employ that make sharding inherently easier? Do you attempt to emulate some of those techniques in PgDog?
Lastly, how do you solve the problem of Postgres constraints? From what I've understood, PgDog runs standard Postgres instances as the shards. If, let's say, one table in shard 1 has a foreign key to a record in shard 2, how do you prevent Postgres from rejecting that record, since it technically doesn't exist on its current shard?
> Are there some specific standard Postgres test suites you run PgDog through to ensure it's compliant with Postgres standards?
That's right. We have many levels of testing: unit, integration, and acceptance, where we run the same query against an unsharded Postgres database and PgDog, and compare the result.
> what sort of techniques do shard-able NoSQL databases employ that make sharding inherently easier?
They remove features. For example, most of them don't support joins, so each table can be stored anywhere in the cluster with no data locality restrictions. There are no foreign key constraints either, or even transaction support. The list goes on. Ultimately, NoSQL databases are just K/V stores, with a fancy API. Scaling K/V is a solved problem.
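"Scaling K/V is a solved problem" usually means something like consistent hashing. A toy sketch of the idea (not any particular database's implementation):

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Toy consistent-hash ring: each key maps to the nearest node
    clockwise on the ring, so adding or removing a node only moves
    a fraction of the keys (unlike modulo hashing, which reshuffles
    almost everything)."""

    def __init__(self, nodes, vnodes=64):
        self._ring = []  # sorted list of (hash, node) virtual points
        for node in nodes:
            for i in range(vnodes):
                self._ring.append((self._hash(f"{node}:{i}"), node))
        self._ring.sort()

    @staticmethod
    def _hash(key: str) -> int:
        return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

    def node_for(self, key: str) -> str:
        h = self._hash(key)
        idx = bisect.bisect(self._ring, (h, "")) % len(self._ring)
        return self._ring[idx][1]

ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
# With no joins or FK constraints, any key can live anywhere:
# placement is purely a hashing decision, never a locality one.
```

The point being: once every row is just a key, placement is trivial, which is exactly the feature set these databases removed to get there.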
> one table in shard 1 has a foreign key to a record in shard 2 how do you prevent Postgres from rejecting that
We don't, at least not yet. We can and will build a more sophisticated query engine that will validate constraints, but it may not always be completely atomic or performant. Cross-shard queries are expensive, because of the laws of physics. For example, if a query is executed outside of a transaction, validating the constraint could introduce a race condition, while in non-sharded Postgres, all queries run inside implicit transactions.
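The race described above can be shown with a toy check-then-insert across two in-memory "shards" (plain dicts as stand-ins, nothing like PgDog's actual engine):

```python
# Two toy shards: shard 2 holds users, shard 1 holds orders with a
# logical foreign key to users on shard 2.
shard2_users = {101: "alice"}
shard1_orders = {}

def insert_order(order_id: int, user_id: int) -> bool:
    """Validate the cross-shard FK, then insert.

    The gap between the check and the insert is the race: another
    client could delete user_id on shard 2 in between, and the
    order would still be accepted. Closing that gap needs a
    cross-shard (distributed) transaction, which costs extra
    network round trips.
    """
    if user_id not in shard2_users:    # network hop to shard 2
        return False                   # FK violation, reject
    # <-- a concurrent DELETE of user_id can land right here
    shard1_orders[order_id] = user_id  # insert on shard 1
    return True

assert insert_order(1, 101)      # accepted: user exists on shard 2
assert not insert_order(2, 999)  # rejected: no such user
```

Single-node Postgres never exposes this window because the check and the write happen inside one transaction on one machine.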
Aaah, you've got me excited and thinking about all sorts of ways this can fix the issue. I really appreciate you taking the time to answer my questions; it's all very interesting.
Can't PgDog pull in the query planning and execution parts from Postgres, maintain a cache of the indexes available on the different Postgres shards, and then follow through on the execution? This way PgDog could technically scale up to as many instances as needed, keeping the Postgres instances themselves as just a persistence backend.
However, I understand that at that point you're basically building an entirely new database, not really a sharding support service on top of Postgres; you'd need to attempt to maintain feature parity with Postgres, which can turn into a maintenance pain.
Do you have any insights into how platforms like PlanetScale or CockroachDB are doing some of this stuff?
idk man, it's rare to fight the compiler once you've used Rust for long enough, unless you're doing something that's the slightest bit complex with async.
You get so good at schmoozing the compiler that you start to create actual logical bugs faster.
That goes for almost every language. I recall my first couple of weeks with various compiled languages, and they all had their 'wtf?' moments when a tiny mistake in the input generated reams of output. But once you get past that point you simply don't make those mistakes anymore. Try missing a '.' in a COBOL program and see what happens. Make sure there is enough paper in the box under LPT1...
Can someone provide a true engineer's perspective on the ADCs in ESP SoCs?
I've heard a lot of people trashing it, and most experienced engineers admit that it's finicky. However, if you have the knowledge, you can make it work as well as any STM chip.
ESP32s are so interesting: they're the only major chip to have their own newish ISA (before transitioning to RISC-V) and still be so successful.
If you need more accurate analog measurements, it is better to use an external ADC (with e.g. an SPI interface). This will cost quite a bit more but will save the hassle of calibrating each individual device. Mostly it comes down to how much dev time you want to invest vs. hardware cost vs. TTM.
I'm not too familiar with the ESP32 ADCs except I remember they're unusually lightly specified even for microcontroller ADCs. If "the knowledge" involves things you couldn't do - or rely on - in production like careful calibration and characterization, that would answer your question.
To be clear, and toward the OP's comment about the ESP32 ISA: Xtensa isn't really a self-contained architecture; it's designed to be customized (extended) by the vendor, and the ESP32 is one such customization.
The ADCs on the ESP32 are similar to other embedded MCUs in that they are not intended for audiophile-level audio capture, as some people seem to think they should be capable of.
The main value proposition for these ADCs is to hook them up to a simple potentiometer to allow physical input controls, and even for that purpose you need to average multiple samples to get a somewhat steady value. Of course the ADCs can be used for various other tasks, but "ADC" does not mean they can do anything any ADC can do; there's a wide variety of quality and purpose in the field of ADCs, and the ESP32's ADCs are a cheap and easy way to add a simple ADC function to the chip.
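That multi-sample averaging is simple to sketch. Here `read_adc_noisy` is a made-up stand-in for a real ADC read (e.g. `machine.ADC(...).read()` in MicroPython on an ESP32); the noise figure is invented for illustration:

```python
import random

def read_adc_noisy() -> int:
    """Stand-in for a real ADC read; returns a 12-bit-ish value
    around a true level of 2048, plus simulated noise."""
    return 2048 + random.randint(-40, 40)

def oversample(read_fn, n: int = 64) -> int:
    """Average n raw readings to tame ADC noise: the usual trick
    for getting a steady value from a potentiometer."""
    return sum(read_fn() for _ in range(n)) // n

value = oversample(read_adc_noisy)
# Single reads can be off by the full noise amplitude; the average
# of 64 samples sits much closer to the true level.
```

On a real ESP32 the same pattern applies, just with the hardware read in place of the fake one (and attention to the attenuation setting, since the raw transfer curve is nonlinear near the rails).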
I have been able to use the ADCs quite easily for input controls and monitoring slow-changing voltages, in applications where absolute precision wasn't the goal; it works perfectly fine for that.
Interesting. I was a mildly heavy cannabis user in my very early teens but stopped right before starting my junior year of high school. My use was mostly motivated by teen angst and peer pressure; I never really enjoyed it, and I always felt uncomfortably anxious and hypersensitive.
This study shows that anxiety was identified in many of the participants, which is pretty close to how I feel now. I am in general an anxious individual, and I wonder if this is because of the marijuana use. Then again, anxiety is pretty rampant in my family, so it could just be in my genes.
Whose messenger? You didn't point us to anyone's research.
I just don't see how sampling tokens constrained to a grammar can be worse than rejection-sampling whole answers against the same grammar. The latter needs to follow the same constraints naturally to not get rejected, and both can iterate in natural language before starting their structured answer.
Under a fair comparison, I'd expect the former to provide answers at least just as good while being more efficient. Possibly better if top-whatever selection happened after the grammar constraint.
I will die on this hill, and I have a bunch of other arXiv links from better peer-reviewed sources than yours to back my claim up (i.e. NeurIPS-caliber papers, with more citations than yours, claiming it does harm the outputs).
Any actual impact of structured/constrained generation on the outputs is a SAMPLER problem, and you can fix what little impact may exist with things like https://arxiv.org/abs/2410.01103
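To make the "sampler problem" framing concrete: constrained generation is just masking invalid tokens before sampling, leaving the model untouched. A toy sketch with a hypothetical five-token "grammar" and a uniform stand-in model (no real LLM involved):

```python
import math
import random

VOCAB = ["{", "}", '"key"', ":", "42", "hello"]

def allowed_tokens(prefix: list) -> set:
    """Toy 'grammar' accepting exactly the object {"key": 42}.
    A real implementation compiles the grammar to an automaton and
    asks it which tokens can legally extend the current prefix."""
    stages = [{"{"}, {'"key"'}, {":"}, {"42"}, {"}"}]
    return stages[len(prefix)] if len(prefix) < len(stages) else set()

def sample_constrained(logits: dict) -> list:
    out = []
    while (allowed := allowed_tokens(out)):
        # Mask: zero probability for tokens the grammar forbids,
        # then sample from what remains. A pure sampler-level change.
        weights = [math.exp(logits[t]) if t in allowed else 0.0
                   for t in VOCAB]
        out.append(random.choices(VOCAB, weights=weights)[0])
    return out

logits = {t: 0.0 for t in VOCAB}  # uniform toy "model"
result = sample_constrained(logits)  # always a grammar-valid sequence
```

Rejection sampling would instead generate unconstrained sequences and throw away every one the automaton rejects, which is strictly more work for the same accepted distribution only under specific conditions; the masking approach makes the constraint free per-token.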
This is really nice, especially the PDF report generation.
I feel very moronic making a dashboard for any product now. Enterprise customers prefer you integrate into their ERPs anyway.
I think we lost the plot as an industry. I've always advocated for having a read only database connection to be available for your customers to make their own visualisations. This should've been the standard 10 years ago, and its case is only stronger in this age of LLMs.
We get so involved with our products that we forget our customers are humans too. Nobody wants another account to manage or remember. Analytics and alerts should be push-based: configurable reports should get auto-generated and sent to your inbox, alerts should be pushed via notifications or emails, and customers should have the option to build their own dashboards with something like this.
Sane defaults make sense but location matters just as much.
> I've always advocated for having a read only database connection to be available for your customers to make their own visualisations.
Roughly three decades ago, that *was* the norm. One of the more popular tools for achieving that was Crystal Reports[1].
In the late 90s, it was almost routine for software vendors to bundle Crystal Reports with their software (very similar to how the MSSQL installer gets invoked by products), then configure an ODBC data source which connected to the appropriate database.
In my opinion, the primary stumbling block of this approach was the lack of a shared SQL query repository. So if you weren't intimately familiar with the data model you wanted to work with, you'd lose hours trying to figure it out on your own or rely on your colleagues sharing it via sneakernet or email.
Crystal Reports has since been acquired by SAP, and I haven’t touched it since the early ‘00s so I don’t know what it looks or functions like today.
My best friend from early uni days did a co-op with Crystal Services, and he's been with them for their entire history through Seagate Software, Crystal Decisions, BusinessObjects (and relocating from Canada to France) and then SAP. I myself have had 2 temporary retirements, at least 4 different careers and countless jobs in that time, and it's wild to know someone who has the same internal drive but has satisfied it with a much more linear path (though you could definitely argue he's seen just as much change as me). From employee ~50 to ~100,050!
This brings me back! My first job was at the Norwegian ERP Agresso, now part of Unit4. I started as a support technician, which was quite an experience since around that time, '97-'98, everyone was moving from Sybase/Ingres/Informix etc. to either MSSQL or Oracle. I got to interact with those older database systems and install and export/import data to systems running on e.g. Oracle across parallel Solaris servers at Saab Aerospace and Windows NT running on DEC Alpha at Ericsson, among other more vanilla deployments.
I was a developer, albeit not professionally, and my boss gave me the opportunity to develop the integration between Agresso and Crystal Reports, my first professional development project, for which I am still grateful. It was a DLL written in C++, and I imagine they shipped it for quite a while after I left for greener pastures.
I was already a free software and Linux enthusiast, so I did a vain skunkworks attempt at getting Agresso to run with MySQL, which failed, but my Linux server in the office came in handy when I needed some extra software in the field--I asked a colleague to put a CD in the server so I could download it to the client site some 500 km away, and deliver on the migration.
100% agreed regarding shipping a read-replica, for any sufficiently complex enterprise app (ERP, CRM, accounting, etc.).
Customers need it to build custom reports, archive data into a warehouse, drive downstream systems (notifications, audits, compliance), and answer edge-case questions you didn’t anticipate.
Because of that, I generally prefer these patterns over a half-baked built-in analytics UI or an opinionated REST API:
- Provide a read replica or CDC stream. Let sophisticated customers handle authz, modelling, and queries themselves. (This gets harder with multi-tenant DBs.)
- Optionally offer a hosted Data API, using something like PostgREST, Hasura, or Microsoft DAB. You handle permissions and safety, but stay largely unopinionated about access patterns.
Any built-in metrics or analytics layer will always miss edge cases.
With AI agents becoming first-class consumers of enterprise data, direct read access is going to be non-negotiable.
Also, I predict the days of charging customers to access their own goddamn data behind rate-limited, metered REST APIs are behind us.
I fully agree in spirit, but in practice, read replicas have some edge cases that are hard to control for. Namely, the incentives aren't fully aligned between the database host and the consumer, and that dynamic can lead to some difficult resourcing decisions for the DB host. An API, by contrast, can be rate limited, and its underlying queries can be optimized (however frustrating that might be for consumers).
The CDC stream option you flagged is more viable in my (admittedly biased) opinion. At my company (Prequel) our entire pitch is basically "you should give your customers a live replica of their data in whatever data platform they want it in" (and let us handle the cross-platform compatibility and multi-tenant DB challenges).
I think this problem could also be a killer use case for Open Table Formats, where the read-replica architecture can be mirrored but the cost of reader compute can be assumed by the data consumer.
To your point, this is only going to be more important with what will likely be a dramatic increase in AI agent data consumption.
In 1999-2000, the company I worked for gave a smallish number of key users full read rights to SAP (minus HR), shortly after introducing SAP to the company's global supply chain. The key users came from all orgs using SAP; basically every department had one or two.
I was part of this and "saw the light". We had such great visibility into all the processes, it was unreal. It tremendously sped up cross-org initiatives.
hi, dev building Shaper here. I agree re sending reports vs dashboards.
Many users use Shaper mostly as a UI to filter data and then download a PDF, PNG or CSV file to use elsewhere.
We are also currently working on functionality to send out those files directly as messages using Shaper's task feature.
I get your point, but generally with most enterprise-scale apps you really don’t want your transactional DB doubling as your data warehouse. The “push-based” operation should be limited to moving data from your tx environment to your analytical one.
Of course, if the “analytics” are limited to simple static reports, then a data warehouse is overkill.
Customers don’t want to learn your schema or deal with your clever optimizations either. If you expose a DB make sure you abstract everything away in a view and treat it like a versioned API.
The best example of this is IoT devices that share their data. Instead of reinventing the wheel with a dashboard for each customer, just give them some docs and restricted access via a replica.
> I've always advocated for having a read only database connection to be available for your customers to make their own visualisations.
A layer on top of the database to account for auth/etc. would be necessary anyways. Could be achieved to some degree with views, but I'd prefer an approach where you choose the publicly available data explicitly.
GraphQL almost delivered on that dream. Something more opinionated would've been much better, though.
That's exactly what I meant. It's a specific replica instance with its own security etc., but not necessarily a separate API you have to integrate with. APIs can stay for writes, but for reads you have the DB.
I loved the Blades dashboard. Something about idly pressing the shoulder buttons to flip through the blades while talking to my friend with that goofy wireless "Xbox communicator" on my ear.
Best Xbox console. It had pretty good games. Sad they were unable to keep that momentum going and are basically nope’ing from the console business altogether now.
I was able to pull together a Halo 3 LAN party last year, although the "consoles" were Linux PCs and the game was the MCC edition (60fps instead of 30). Split-screen was resurrected via mods. I bought a Microsoft gamepad receiver to bring original Xbox 360 controllers under Linux. Some people insisted they get to play on the original gamepad (otherwise it was a mixed bag of PlayStation and newer Xbox/PC controllers).
I also realized that Halo 3 itself would have been old enough to drink with us!
I don't know if the Altair 8800 would count as my first home computer, as I was too young to really understand what it was and mostly just liked to play with the paper tape feed on the Teletype attached to it. By the time we got the PET 2001, I was old enough to actually use it as intended.
Most systemd components do rely on some core systemd components like systemd (the service manager) and journald. I would say that a core thesis of systemd is that Linux needs/needed a set of higher-level abstractions, and that systemd-the-service-manager has provided those abstractions. The fact that other parts of systemd-the-project rely on those abstractions does not imply that the project is monolithic.
>Try running any part of the systemd software suite on an openrc system and see how that works out?
Well, from this POV it's kinda openrc's problem if it doesn't. What about trying to run any part of the OpenRC software suite on an Upstart system? The question of why anyone sane would want to is rhetorical, though...
Why obsess over whether systemd is monolithic, and to what degree, anyway? There certainly ARE optional systemd parts, so it's correct to say it's not entirely monolithic.
openrc-init can be used on an upstart system; the daemon manager itself can't, but that's because you'd have two different daemon managers. Beyond that, there aren't any other OpenRC software components, because it was designed to be a modular init system that just handles what it was intended to handle.
The rest of the system for example chrony, sysklogd, cron, etc run fine on upstart systems, because they aren't tied to systemd and are fully modular.
It's okay to be a monolith, that doesn't make it inherently bad or anything, but we should be honest about it, and it does come with some tradeoffs.
Explain the existence of "elogind" and "eudev" then?
It might be the case that one can disable some components of systemd, on a systemd system. It is certainly not the case that they are "loosely coupled", or there would be no incentive to maintain forks of core systemd components with the sole and explicit purpose of decoupling from systemd.
In theory. In practice, systemd is a mess of different components that have subtle dependencies on each other. And while the core of systemd is solid enough, everything around it is not.
It's a collection of tightly-coupled components that are functionally a monolith because large distros tend to rely on the various components rather than allowing modularity.