Thanks for pointing me at validation, I'll check it out. It's not completely clear to me whether it's possible to limit not only a particular document but the database, or how to handle a conflict if a document is changed in Pouch but rejected on the CouchDB server.
I'm not sure about the current situation, but previously there was a problem where the CouchDB file grew until it hit some filesystem limit and CouchDB just crashed.
From the start of the Envoy readme: it's not battle-tested or supported in any way. It also doesn't do any validation apart from limiting permissions for different users.
It would be easier to reimplement CouchDB than to create a smart proxy that estimates whether a given query is expensive or not.
I'm not talking about a rate-limiting proxy or load balancing across different backends, which could be implemented with nginx or something similar.
I'm not clear what you mean by limiting "not only a particular document but the database". As for a document changing in Pouch and being rejected on the server, that's one of two scenarios.
1) The client you wrote is bugged and generated bad data. This scenario can occur just as easily using Postgres and an application server. What does your app server do if a client tries to send bad data? (Answer: whatever you told it to do. Most likely throwing a 500 when your database refuses the incoming data.)
As for what happens when Pouch syncs to Couch: the server will let everything else sync, but not the bad document. The return value from the API call will tell you which documents didn't sync (see the sketch after the second scenario).
2) Someone is intentionally trying to shove bad data into your database. In this case it's worked as advertised and rejected the bad data. What do you care if a malicious client breaks?
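For the record, here's roughly how a rejected document surfaces during sync. This is a sketch, not production code: the database names and URL are placeholders, and it assumes PouchDB's standard replication events.

```js
const PouchDB = require('pouchdb');

const local = new PouchDB('recipes');
const remote = new PouchDB('https://couch.example.com/recipes'); // placeholder URL

local.replicate.to(remote)
  .on('denied', (err) => {
    // fires once per document the server refused, e.g. because a
    // validate_doc_update function threw {forbidden: ...}
    console.log('rejected by server:', err);
  })
  .on('complete', (info) => {
    // doc_write_failures counts the documents that didn't make it
    console.log('sync finished, failures:', info.doc_write_failures);
  })
  .on('error', (err) => console.error('replication error:', err));
```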
What kind of "expensive" query are you envisioning? Mango queries don't support joins, only simple equality filters, so in general the worst thing someone could do is send a query that doesn't use an index. But why are you letting the client query the server in the first place? Just have the client sync and query client-side. Or don't allow access to the _find endpoint and restrict them to the map/reduce view you handwrote (something like the sketch below).
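By "handwrote" I mean a view in a design document, along these lines (a sketch; the names are made up):

```js
// a design document with a single handwritten view; upload it to the db,
// then query it with GET /db/_design/recipes/_view/by_title?key="..."
{
  "_id": "_design/recipes",
  "views": {
    "by_title": {
      "map": "function (doc) { if (doc.type === 'recipe') { emit(doc.title, null); } }"
    }
  }
}
```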
If you must let them send arbitrary queries (which to me implies a relatively trusted user, but let's pretend they're not), then run the query with a limit of 1 or 0, examine the execution stats to see if they are using an index, and check their query to see if their limit is reasonable. But at this point you've entered into a scenario that's going to be very difficult with a custom API too.
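A hedged sketch of that probe (execution_stats is a real _find option; the URL and function name here are made up, and it assumes a runtime with a global fetch):

```js
// probe a client-supplied Mango query before running it for real
async function probeQuery(dbUrl, clientQuery) {
  const res = await fetch(`${dbUrl}/_find`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      ...clientQuery,
      limit: 1,               // don't pay for the full result yet
      execution_stats: true,  // ask Couch how much work the query did
    }),
  });
  const body = await res.json();
  // Couch attaches a warning when the query fell back to a full scan
  if (body.warning) {
    throw new Error(`query is not using an index: ${body.warning}`);
  }
  return body.execution_stats; // total_docs_examined, execution_time_ms, ...
}
```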
> I'm not clear what you mean by limiting "not only a particular document but the database".
I've limited document size to 10 MB and rate-limited updates to 10 per second. A client starts updating a document with random data at 10 requests per second. As far as I understand, Couch stores all versions, at least for some time. This means this one client could fill the space on my server at 100 MB/s. There is no such issue with Postgres, and no one allows clients to execute raw queries on the database without an application server. A document is only 10 MB, but the database is huge.
> What kind of "expensive" query are you envisioning?
I have never used Couch, so I don't know what could be expensive. Maybe some lookup without an index or something like that.
Sorry for my ignorance, but is it true that if I limit Couch to replication only, there will not be any non-indexed lookups?
It looks like implementing a secure system with Couch is very hard, but I can't find any best practices; mostly just authentication and basic validation.
> I've limited document size to 10 MB and rate-limited updates to 10 per second. A client starts updating a document with random data at 10 requests per second. As far as I understand, Couch stores all versions, at least for some time. This means this one client could fill the space on my server at 100 MB/s. There is no such issue with Postgres, and no one allows clients to execute raw queries on the database without an application server. A document is only 10 MB, but the database is huge.
Ah! Now we are getting somewhere! You're concerned about someone filling your disk.
OK, let's modify your scenario a little. Instead of updating an existing document, they create a new document. This is a malicious client; why make updates that'll get cleaned up in a few minutes when I can make the damage permanent?
So, CouchDB allows these writes, and now your disk is full.
What does Postgres with a custom API do? Allows these writes, and now your disk is full.
You're allowing 10 MB documents because that makes sense for your application, right? So your Postgres table is going to have a binary column or some other column meant to hold bulk data, and your API is going to accept it.
If it doesn't make sense, lower the max document size. Apply validations to limit what fields can be written to, and how big they can be. In Postgres this is called your "schema". Couch being "schemaless", it's now your validation function (a sketch follows below). Couch is no different from any other schemaless database such as Mongo, RethinkDB, and FoundationDB in this regard.
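A sketch of what I mean (validate_doc_update is the real mechanism, stored in a design document; the fields, limits, and ownership rule here are made up for illustration):

```js
// stored in a design doc, e.g. _design/validation, as "validate_doc_update"
function (newDoc, oldDoc, userCtx, secObj) {
  if (newDoc._deleted) return; // let deletions through; tombstones have no fields

  // enforce a "shape": this is your schema now (hypothetical fields)
  if (newDoc.type !== 'recipe') {
    throw({ forbidden: 'only recipe documents are allowed' });
  }
  if (typeof newDoc.title !== 'string' || newDoc.title.length > 200) {
    throw({ forbidden: 'title must be a string of at most 200 characters' });
  }
  // cap the bulk field instead of relying on the document size limit alone
  if (typeof newDoc.body !== 'string' || newDoc.body.length > 100000) {
    throw({ forbidden: 'body is too large' });
  }
  // only let users touch their own documents
  if (newDoc.owner !== userCtx.name && userCtx.roles.indexOf('_admin') === -1) {
    throw({ forbidden: 'you may only edit your own documents' });
  }
}
```

The blunt server-wide knob for the size part is the `max_document_size` setting in the `[couchdb]` config section.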
Also, your rate limiting here is weak. If I can post to your server at 100 Mb/s, I can saturate a 1 Gb/s link with only 10 clients. It doesn't matter if you reject my posts; if I can send them to the server, I can DoS you pretty easily.
The main thing Postgres gives you here is that it requires you to define your schema upfront (unless you use JSON columns, in which case it joins the schemaless club above). Couch will happily let you skip that, in which case someone wants to write a record of their car maintenance into your recipe book app? Couch is good with that. But take a step back: what actually stops them from putting that in the "description" column of your Postgres recipe app? Not much. So you have to think about what's important. Do I actually need to make sure these are all the same "shape"? If so, I need a validation function. If I can just shrug and say "garbage in, garbage out", then I just need controls around how much data they can insert, but hey, I needed those for Postgres anyway.
> Sorry for my ignorance, but is it true that if I limit Couch to replication only, there will not be any non-indexed lookups?
Correct (enough). The entirety of CouchDB is built around efficient replication. While it's not going to use a formal "index", getting all of the changes after a specific sequence number is an efficient operation.
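A sketch of that primitive, using the standard _changes endpoint (URL and checkpoint handling are placeholders):

```js
// pull everything that changed since the last checkpoint we stored
async function pullChanges(dbUrl, lastSeq) {
  const res = await fetch(
    `${dbUrl}/_changes?since=${encodeURIComponent(lastSeq)}&include_docs=true`
  );
  const body = await res.json();
  for (const row of body.results) {
    console.log('changed:', row.id, 'rev:', row.changes[0].rev);
  }
  return body.last_seq; // persist this as the next checkpoint
}
```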
It's trivial to limit the number of created documents in Postgres, CouchDB, or an application server through validation; I'm talking about updating a document, not creating a new one. In Postgres, if I update a 1 MB document, the used space will not always grow. In CouchDB the situation is different. In the case of a relational DB you have an application server with custom logic and validations; CouchDB, on the other hand, is accessible from outside.
My point is that it's very hard to create a safe CouchDB-based system, and most recommendations are limited to setting up an nginx proxy and authenticating users, which is not enough.
> It's trivial to limit the number of created documents in Postgres, CouchDB, or an application server through validation; I'm talking about updating a document, not creating a new one. In Postgres, if I update a 1 MB document, the used space will not always grow. In CouchDB the situation is different. In the case of a relational DB you have an application server with custom logic and validations; CouchDB, on the other hand, is accessible from outside.
It is? It's unclear to me why I'm allowing 10 updates to a (largish, 10 MB! Use a file or store it in S3!) document per second, but not 10 creates. Maybe I'm building Google Docs? Except I'd want old revisions, so those are creates. Plus 10 MB is a huge spreadsheet. But sure, let's roll with it.

Actually, Couch does not keep old versions of documents around, only old revision numbers. When a document is updated, the old version becomes eligible for compaction (basically garbage collection). So your attacker has to be fast enough to outrun the compactor, while being slow enough to not get temporarily banned from your service. It seems like less effort to me to use this power to flood your network I/O, which is almost certainly lower than your disk I/O. Or just choke your Postgres server on its 100 MB/s of disk I/O for updates, plus whatever is required to maintain your indexes.
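For reference, compaction can also be kicked manually per database (a sketch; the URL is a placeholder, and the request needs admin rights):

```js
// trigger compaction so old revision bodies become reclaimable
async function compact(dbUrl) {
  const res = await fetch(`${dbUrl}/_compact`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' }, // Couch requires this header
  });
  return res.json(); // { ok: true } -- compaction runs in the background
}
```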
I'm not actually advocating for Couch over Postgres. In my mind Postgres should be the default choice, and you switch to something else if you have a reason. For Couch, the biggest reason is that sync is built in, in such a way that you can leverage it for your own applications with minimal effort. In my experience sync can be devilishly hard for non-trivial cases, so depending on your app, that can be pretty compelling.
But so far you seem to be focused on DoS attacks. You're not going to find separate advice for Postgres vs Mongo vs Couch, because the backing system doesn't matter. The attacks and mitigations are identical no matter the back-end, namely: stop the traffic before it consumes your resources.
Couch is not equivalent to Mongo or a relational database, because it has to be accessible to clients if we want synchronisation. Securing an app server is a manageable problem, and there is a huge number of resources on how to do it correctly.
In the case of Couch, I've not seen any secure open-source example.
I'm not focused on DoS attacks; I'm just proposing different attack vectors.
Is it trivial? Let's say you have a back end and an app that lets you post comments, like this site. How do you stop someone from spamming comments? Each comment is represented by a row in a table, so the space will grow.
If you need to limit the number of items, it's trivial. You write something like `has_many :things, :before_add => :limit_things` in the app server, or create a constraint in SQL.
Spam prevention is not trivial, but it's a mostly solved problem. You can find a lot of articles on the topic.
But creating a secure CouchDB setup looks very non-trivial.
Yeah... that's a Rails callback, not an SQL constraint, and it can't be relied upon in the face of multiple simultaneous requests. Which kind of demonstrates my point: with a custom API, you have to understand your system, its requirements, and its limitations. You can't just read a blog post on "securing your webapp" and assume it's good.
Couch is no different. You have to understand Couch, you have to understand its features and limitations, and build your system within those constraints.
You seem to be asserting that because Couch is designed to be internet-connected, it can't be secure. If that's true, then I guess every customer of IBM Cloudant (Couch as a service), Realm (another database designed for mobile sync), and Firebase (Google's database as a service) is in trouble and just doesn't know it yet.
Security for all systems is non-trivial. Thinking it's trivial assures your system is not secure.
I'm not asserting that Couch is insecure; I need such a database. The problem is that I can't find any resource that would help me design a secure production system.
You can take even a trivial Rails blog or todo example from some book, and it will be limited in scope but more or less secure. I'm having a hard time finding a secure CouchDB example.
> Security for all systems is non-trivial.
But not equally hard.
If you use Firebase, you should understand that you're getting vendor lock-in, and in some cases you can end up spending much more money; but for some types of projects this platform is OK with me.
Same with CouchDB: I understand that if I get replication with the client, I need to pay for it by reorganising my data, or maybe by spending more resources to make the system secure. There is no free lunch.
No, I'm not assuming a constraint on the number of comments. The first example shows how easy it is to limit the number of created objects.
Spam prevention is a different topic: not so trivial, but a mostly solved problem.
Also, as far as database size goes, I don't believe there is a hard limit. I think you might be thinking of how MongoDB would silently corrupt databases larger than 2GB in its 32-bit version.
As far as I remember, it was a filesystem limit, not a CouchDB limit. The problem was that the file always grew, and CouchDB crashed when the limit was exceeded. I can't find the particular issue, but googling shows some issues [1] that make me think we should be very careful with DB size.