Waves Author here. Happy to answer any questions folks have about the book, about DynamoDB, or about self-publishing.
NoSQL modeling is waaay different than relational modeling. I think a lot of NoSQL advice out there is pretty bad, which results in people dismissing the technology altogether. I've been working with DynamoDB for a few years now, and there's no way I'll go back.
The book has been available for about a month now, and I've been pretty happy with the reception. Strong support from Rick Houlihan (AWS DynamoDB wizard) and a lot of other folks at AWS.
You can get a free preview by signing up at the landing page. If you buy and don't like it, there's a full money-back guarantee with no questions asked. Also, if you're having income problems due to COVID, hit me up and we'll make something work :)
Anyhow, hit me up with questions!
EDIT: Added a coupon code for folks hearing about the book here. Use the code "HACKERNEWS" to save $20 on Basic, $30 on Plus, or $50 on Premium. :)
The biggest problem I'm aware of with DynamoDB is the hot key / partition issue[1]. Throughput is distributed evenly across nodes, and you can't control how many nodes you have, so you always have a node that's hot, either temporarily or permanently, and you end up having to overprovision all your nodes to handle that hot case, which ends up costing far more than the alternatives. What's your take on this? This is the chief reason I avoid DynamoDB, which in theory would be a good fit for some of my problems.
As of a couple of years ago, DynamoDB redistributes throughput between shards based on usage [1], so in theory this should eliminate the hot shard problem. I haven't had a chance to test this in practice; if anybody has hands-on experience, I'd love to hear it.
You also finally have a way of identifying hot keys with the terribly named CloudWatch Contributor Insights for DynamoDB. [2]
For exceptional use cases, you also have the option of On-Demand Capacity to pay for what you use and not worry about capacity at all. [3]
Basically, most of these issues are gone. As long as you don't have extreme skew in your partition keys, you don't need to worry about throughput limits.
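If you want to sidestep capacity planning entirely, or chase down hot keys, both are one-call operations. A rough sketch with boto3 (the table name is a placeholder):

    import boto3

    dynamodb = boto3.client("dynamodb")

    # Switch an existing table to on-demand (pay-per-request) billing,
    # so you pay per request instead of provisioning capacity upfront.
    dynamodb.update_table(TableName="MyTable", BillingMode="PAY_PER_REQUEST")

    # Enable CloudWatch Contributor Insights to surface the hottest keys.
    dynamodb.update_contributor_insights(
        TableName="MyTable", ContributorInsightsAction="ENABLE"
    )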
What was your approach to self-publishing here? What tools did you use? If I wanted to publish a book but knew nothing about it, what resources should I read and what approach would you recommend?
The biggest piece of advice I can give you is not about any specific tool; it's about an approach. You need to think about how you will market the book if you're self-publishing.
Engage with the community that will be interested in the book. Write articles, help out on Twitter, write code libraries, etc.
For me, I wrote DynamoDBGuide.com two and a half years ago over Christmas break. I just wanted to make an easier introduction to DynamoDB after I watched Rick Houlihan's talk at re:Invent (which is awesome).
That led to other opportunities and to me being seen as an 'expert' (even when I wasn't!). I got more questions and spent more time on DynamoDB to the point where I started to know more. I gave a few talks, etc.
I finally decided to do a book and set up a landing page and mailing list. I basically followed the playbook that Adam Wathan described for his first book launch.[0] Write in public, release sample chapters, engage with people, etc.
In terms of tooling, I used AsciiDoc to generate the book and Gumroad to sell. On a 1-10 scale, I'd give AsciiDoc a 5 and Gumroad an 8. But the tooling barely matters -- think about how to find the people that are interested :)
Happy to answer any other questions, either in public or via email.
Just bought the book. I've been working at AWS and using DynamoDB for years now, but I'm sure there are things I could be doing better. I love that you've dedicated attention to analytics and operations too.
Honest question: would you say "NoSQL modeling is way more restrictive, labor intensive and painful, but in turn gives you consistent performance as you scale" is a fair characterization?
One note on this -- if you have an LSI, you can't have an item collection larger than 10GB, where an item collection refers to all the items with the same partition key in your main table and your LSI.
A DynamoDB table with an LSI can scale far beyond 10GB. That said, I would avoid LSIs in almost all circumstances. Just go with a GSI.
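Part of why GSIs are the safer default: unlike an LSI, a GSI can be added to an existing table at any time. A rough boto3 sketch (index and attribute names here are hypothetical):

    import boto3

    dynamodb = boto3.client("dynamodb")

    # Add a GSI to an existing table after the fact -- something you
    # can't do with an LSI, which must be defined at table creation.
    dynamodb.update_table(
        TableName="MyTable",
        AttributeDefinitions=[
            {"AttributeName": "GSI1PK", "AttributeType": "S"},
            {"AttributeName": "GSI1SK", "AttributeType": "S"},
        ],
        GlobalSecondaryIndexUpdates=[
            {
                "Create": {
                    "IndexName": "GSI1",
                    "KeySchema": [
                        {"AttributeName": "GSI1PK", "KeyType": "HASH"},
                        {"AttributeName": "GSI1SK", "KeyType": "RANGE"},
                    ],
                    "Projection": {"ProjectionType": "ALL"},
                    # Required for provisioned-capacity tables; omit if
                    # the table uses on-demand billing.
                    "ProvisionedThroughput": {
                        "ReadCapacityUnits": 5,
                        "WriteCapacityUnits": 5,
                    },
                }
            }
        ],
    )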
Thank you for the clarification, abd12 -- you are right. I only create an LSI if I know for sure the data within a given item collection will not go beyond 10GB... and since we never really know that for sure, we always go with a GSI.
I just answered this on Twitter, but I think there are two instances where it's a no-brainer to use DynamoDB:
- High-scale situations where you're worried about performance of a relational database, particularly joins, as it scales.
- If you're using serverless compute (e.g. AWS Lambda or AppSync), where traditional databases don't fit well with the connection model (see the sketch below).
That said, you can use DynamoDB for almost every OLTP application. It's more a matter of personal preference whether you want to use a relational database or something like DynamoDB. I pick DynamoDB every time b/c I understand how to use it and like the other benefits (billing model, permissions model, performance characteristics), but I won't say you're wrong if you don't choose it in these other situations.
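On the serverless point above: DynamoDB is accessed via signed HTTPS requests, so there's no connection pool to size or exhaust from Lambda. A minimal hypothetical handler (table and key names are made up):

    import os
    import boto3

    # Created once per container and reused across invocations. No
    # connection pool to manage -- every call is a stateless HTTPS request.
    table = boto3.resource("dynamodb").Table(os.environ["TABLE_NAME"])

    def handler(event, context):
        # Hypothetical lookup by user id from the event payload.
        resp = table.get_item(
            Key={"PK": f"USER#{event['userId']}", "SK": "PROFILE"}
        )
        return resp.get("Item")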
It's different than a relational database in that you need to model your data to your patterns, rather than model your data and then handle your patterns.
Once you learn the principles, it really is like clockwork. It changes your process, but you implement the same process every time.
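As a concrete illustration of modeling to your patterns (a hypothetical sketch, not an example from the book): design the primary key so that a single Query answers "get a user and their orders", rather than normalizing and joining:

    import boto3
    from boto3.dynamodb.conditions import Key

    table = boto3.resource("dynamodb").Table("MyTable")

    # Items are written so the key design *is* the access pattern:
    #   PK = "USER#alexdebrie", SK = "PROFILE"         (the user)
    #   PK = "USER#alexdebrie", SK = "ORDER#2020-..."  (their orders)
    resp = table.query(KeyConditionExpression=Key("PK").eq("USER#alexdebrie"))
    items = resp["Items"]  # user profile + orders in one request, no join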
Honestly, I think part of the problem is that there's a lot of bad NoSQL content out there. A little standardization of process in this space will go a long way, IMO :)
Thank you! Glad you found DynamoDBGuide.com helpful :)
Yep, I do talk about aggregations in the book. One strategy that I've discussed is available in a blog post here[0] and involves using DynamoDB Transactions to handle aggregates.
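The rough shape of that transactional-aggregate pattern, as I understand it (a sketch, not the exact code from the post; table and key names are made up): write the new item and bump a parent-level count in one transaction, so the aggregate can't drift from the underlying items.

    import boto3

    dynamodb = boto3.client("dynamodb")

    # Insert an order and increment the customer's order count atomically.
    dynamodb.transact_write_items(
        TransactItems=[
            {
                "Put": {
                    "TableName": "MyTable",
                    "Item": {
                        "PK": {"S": "USER#alexdebrie"},
                        "SK": {"S": "ORDER#1234"},
                        "Amount": {"N": "25"},
                    },
                    # Fail the whole transaction if this order already exists.
                    "ConditionExpression": "attribute_not_exists(PK)",
                }
            },
            {
                "Update": {
                    "TableName": "MyTable",
                    "Key": {
                        "PK": {"S": "USER#alexdebrie"},
                        "SK": {"S": "PROFILE"},
                    },
                    "UpdateExpression": "ADD OrderCount :one",
                    "ExpressionAttributeValues": {":one": {"N": "1"}},
                }
            },
        ]
    )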
If you're looking for large-scale aggregates for analytics (e.g. "What are my top-selling items last month?"), I have an Analytics supplement in the Plus package that includes notes on different patterns for analytics w/ DynamoDB.
Great point! It was originally about both of these things, but the storage aspect isn't discussed much anymore because it's really not a concern.
The data integrity issue is still a concern, and I talk about that in the book. You need to manage data integrity in your application and think about how to handle updates properly. But it's completely doable and many people have.
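One common building block here is optimistic locking with a version attribute -- a sketch of the general idea (not the book's code; the table and attribute names are hypothetical):

    import boto3
    from botocore.exceptions import ClientError

    table = boto3.resource("dynamodb").Table("MyTable")

    def update_profile(user_id, new_email, expected_version):
        """Only write if nobody else has changed the item since we read it."""
        try:
            table.update_item(
                Key={"PK": f"USER#{user_id}", "SK": "PROFILE"},
                UpdateExpression="SET Email = :email, Version = :next",
                ConditionExpression="Version = :expected",
                ExpressionAttributeValues={
                    ":email": new_email,
                    ":next": expected_version + 1,
                    ":expected": expected_version,
                },
            )
        except ClientError as e:
            if e.response["Error"]["Code"] == "ConditionalCheckFailedException":
                # Someone else wrote first: re-read and retry, or surface
                # the conflict to the caller.
                raise
            raise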
> You need to manage data integrity in your application
This is just another way of saying you need to implement your own system for managing consistency. Dynamo offers transactions now, but they don't offer actual serializability. Your transaction will simply fail if it runs into any contention. You might think that's ok, just retry. But because you've chosen to give up relational modelling of your data, this will happen a lot. If you want to use aggregates in Dynamo, you have to update an aggregate field every time you update your data, which means you create single points of contention for your failure-prone transactions to run into all over your app. There are ways around this, but they all come at the cost of further sacrificing consistency guarantees. And the main issue with that is that once you've reached an inconsistent state, it's incredibly difficult to even know it's happened, let alone fix it.
Then you run into the issue of actually implementing your data access. With a denormalized data set, implementing all your data access comes at the expense of increasing interface complexity. Your schema ends up having objects, which contain arrays of objects, which contain arrays... and you have to design all of your interfaces around keys that belong to the top level parent object.
The relational model wasn't designed to optimize one type of performance over another. It was designed to optimize operating on a relational dataset, regardless of the implementation details of the underlying management system. Trying to cram a relational dataset into a NoSQL DB unavoidably comes at the expense of some very serious compromises, with the primary benefit being that the DB itself is easier to administer. It's not as simple as cost of storage vs cost of compute. NoSQL DBs like Dynamo (actually, especially Dynamo) are great technology, with many perfectly valid use cases. But standing in for an RDBMS is not one of them, and everybody I've seen attempt to use Dynamo as an RDBMS eventually regrets it.
Are there any basic examples you can give around maintaining that integrity?
I'm liking DynamoDB for tasks that fit nicely within a single domain, have relatively pain-free access patterns, etc. And I've found good fits, but there are some places where the eventual consistency model makes me nervous.
I'm specifically thinking about updating multiple different DynamoDB keys that might need to be aggregated for a data object. The valid answer may be "don't do that!" – if so, what should I do?
> Are there any basic examples you can give around maintaining that integrity?
For those types of use cases, the OP's advice would actually require implementing a fully bespoke concurrency control system in your business logic layer. Without trying to disparage the OP, this is, for all intents and purposes, impossible (aside from also being very, very impractical). There are some things you can do to create additional almost-functional (though still highly impractical) consistency controls for Dynamo (like throttling through FIFO queues), but they all end up being worse performance and scaling trade-offs than you'd get from simply using an RDBMS.
A lot of it boils down to the fact that dynamo doesn’t have (and wasn’t designed to have) locking, meaning that pretty much any concurrency control system you want to implement on top of it, is eventually going to run into a brick wall. The best you’d possibly be able to do is a very, very slow and clunky reimplementation of some of Spanner’s design patterns.
Yeah, the limited transaction support is a killer for many use cases.
Thankfully, it’s not actually all that hard to implement your own “dynamo layer” on top of an SQL database and get most of the scaling benefits without giving up real transactions.
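To illustrate the idea (my sketch, not the parent's actual system): the core of such a layer is a two-part key plus a blob column, with key-range queries standing in for Query and real transactions underneath. Using sqlite3 just to keep the example self-contained:

    import sqlite3

    conn = sqlite3.connect("dynamo_layer.db")
    conn.execute("""
        CREATE TABLE IF NOT EXISTS items (
            pk   TEXT NOT NULL,
            sk   TEXT NOT NULL,
            data TEXT NOT NULL,   -- JSON blob for the item body
            PRIMARY KEY (pk, sk)
        )
    """)

    # Emulate begins_with(sk, 'ORDER#') with a key range: '$' is the
    # next ASCII character after '#', so this captures every 'ORDER#...'
    # sort key. Wrap writes in real SQL transactions as needed.
    rows = conn.execute(
        "SELECT sk, data FROM items WHERE pk = ? AND sk >= ? AND sk < ?",
        ("USER#alexdebrie", "ORDER#", "ORDER$"),
    ).fetchall()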
I think there are some features of DynamoDB that are miles ahead of other databases:
- Billing model. You pay for reads and writes directly rather than trying to guess how your queries turn into CPU & RAM. You can also scale reads and writes up and down independently (see the sketch below), or use pay-per-use pricing to avoid capacity planning.
- Permissions model. It integrates tightly with AWS IAM, so it works well with AWS compute (EC2, ECS/EKS, Lambda) via IAM roles. You don't need to think about credential management and rotation.
- Queries will perform the same as you scale. It's going to work the exact same in testing and staging as it does in prod. You don't need to rewrite when you get four times as many users.
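The independent read/write scaling mentioned in the first bullet is a single API call -- a rough sketch (the numbers and table name are made up):

    import boto3

    dynamodb = boto3.client("dynamodb")

    # Reads and writes are separate dials: quadruple read capacity for a
    # traffic spike without touching write capacity.
    dynamodb.update_table(
        TableName="MyTable",
        ProvisionedThroughput={
            "ReadCapacityUnits": 400,
            "WriteCapacityUnits": 100,
        },
    )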
A lot of folks are worried about migrations, but they're not as bad as you think. I've got a whole chapter on how to handle them. Plus, one of the examples imagines that we're revisiting a previous example a year later and want to add new objects and change some access patterns. I show how it all works, and migrations really aren't that scary.
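This isn't the book's code, but the usual shape of such a migration is a one-off backfill: scan the existing items and decorate the relevant ones with new index attributes for the new access pattern (all names here are hypothetical):

    import boto3

    table = boto3.resource("dynamodb").Table("MyTable")

    # Backfill: add GSI1 attributes to existing Order items so a new
    # "orders by status" access pattern can be served by a new GSI.
    scan_kwargs = {}
    while True:
        page = table.scan(**scan_kwargs)
        for item in page["Items"]:
            if item["SK"].startswith("ORDER#"):
                table.update_item(
                    Key={"PK": item["PK"], "SK": item["SK"]},
                    UpdateExpression="SET GSI1PK = :pk, GSI1SK = :sk",
                    ExpressionAttributeValues={
                        ":pk": f"STATUS#{item['Status']}",
                        ":sk": item["SK"],
                    },
                )
        if "LastEvaluatedKey" not in page:
            break
        scan_kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]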
Author here! If you want more, I just released a book on DynamoDB yesterday --> https://www.dynamodbbook.com/ . There's a launch discount for the next few days.
The book is highly recommended by folks at AWS, including Rick Houlihan, the leader of the NoSQL Blackbelt Team at AWS[0].
Happy to answer any questions you have! Also available on Twitter and via email (I'm easily findable).
After seeing Rick's re:Invent talk, the one where at about minute 40 everyone's heads exploded, I emailed him (I'm in a very far away other department of Amazon) to ask him for more, because everything he was saying was absolutely not the way my group was using DynamoDB (i.e., we were doing it wrong).
He could have ignored my email entirely. He's a busy guy, right? I wouldn't have held it against him at all. Instead, he was super nice, provided me with more documentation, and honestly was just really helpful.
Shawn lists a four-part breakdown that's pretty on-point:
- Background and basics (Chapters 1-6)
- General advice for modeling & implementation (Chapters 7-9)
- DynamoDB strategies, such as how to handle one-to-many relationships, many-to-many relationships, complex filtering, migrations, etc. (Chapters 10-16).
- Five full walkthrough examples, including some pretty complex ones. One of them implements most of the GitHub metadata backend. Nothing related to the git contents specifically but everything around Repos, Issues, PRs, Stars, Forks, Users, Orgs, etc.
You can probably find the content of the first nine chapters if you google around enough; it's just helpful to have it all in one place. But the last 13 chapters are unlike anything else available, if I do say so myself. Super in-depth stuff.
I think it will be helpful even if you've been using it for a while, but it really depends on how deep you went on DynamoDB.
It's a bit expensive, especially with exchange rates, but it is what it is -- these things take time and effort to produce.
From the website, I cannot see a table of contents for the book, unless it's behind the "provide an email for free chapters" signup. Can you please provide a publicly visible table of contents (no email required) on the site or here?
This will be the deciding factor in whether I make a purchase, as I'll be able to see what the book covers and whether it'll be of use to me, having several years of Dynamo experience.
The concepts are definitely applicable. You'll need to do some small work to translate vocabulary and work around a slightly different feature set, but most of it should work for you.
And yep, Cassandra is pretty similar to DynamoDB. Both are wide-column data stores. Some of the original folks that worked on Dynamo (not DynamoDB) at Amazon.com went to Facebook and worked on Cassandra. The concepts underlying Dynamo became the basis for DynamoDB at AWS.
If you take Cassandra and remove the power user features whose runtime cost is hard to predict, what you're left with is pretty close to DynamoDB. It's harder to use and incompatible with everything else, but its key feature is not overpromising capacity. We're only considering alternatives because there's no cost saving story around tiered storage.
Thanks! Honestly, I'm right with you. I would love a physical copy. I did a bit of research and didn't find any great options for making a physical copy of a self-published book for the number of copies I'm expecting to sell (given it's a fairly niche technical area).
That said, if anyone has any great recommendations here, I'm all ears. Actual experience would be best if possible, rather than the first thing you see in Google :).
The market is shaping each generation. Unfortunately, it's pretty common to interact with engineers who don't think before doing; they just start doing with the hope of "figuring it out as we go". This leaves no room for beautiful, simple code.
Is it a side-effect of the web boom? How much of this was the result of the PHP and JavaScript subcultures? Couldn't say.
If you read the parent comment with a feeling of frustration and dread, ask yourself: "Am I bad at writing a first draft? Or am I bad at editing my writing?" These are two different things.
For learning how to write a first draft without wanting to dig your nails into your arms (or resort to more drastic forms of self-harm), I recommend:
1) "Start with Why" -- Start each draft by writing your goal and then writing the "signs of success" which you can use to recognize making progress.
I find a particularly motivating form of "why" is a description of a problem someone finds themselves in which they want advice about. Reddit is a good source of these if you want inspiration, but you might be better off giving advice to your future self.
2) Take inspiration from automated testing -- If you are anxious about some section being [maliciously] misinterpreted, write down that anxiety with a pointer to the section and a promise to yourself to have a trusted friend read it.
3) Question-Driven-Drafting -- Start with a question. Write the first flawed answer that comes to mind. Write the first question or objection that comes to mind from that answer. Write the first response that comes to mind from that etc... Don't delete.
4) Alternate focusing with exploring -- Set a pomodoro timer. Use my method from #3 or another to produce a bunch of text. When it goes off, congratulate yourself and meditate for 2 minutes. Then set another pomodoro timer and start to turn your ideas into a structured outline: Try to extract the one-sentence key point from the text you just wrote. Then try to build a pyramid-shaped hierarchy under it with the names of the supporting points. When this pomodoro ends, meditate again and start another rambling pomodoro based on the most interesting point.
5) Be willing to "overthink things" -- When you have a question, be willing to actually trust yourself that the question is worth answering. If someone else thinks the answer is obvious, just move on to ask someone else. If someone screams at you that your question is excuse-making, bullshit, or procrastinating, just move on from them. You no longer have parents nor teachers to endure and can write from your own desire to understand the world and communicate ideas. The confusion you notice in yourself is worthy not of ridicule but of sympathetic curiosity.
6) Go for a walk -- It's just a generally good idea.
7) Dictate into otter.ai -- It is as good a use for your time while walking as any. The resulting text will be heavy with misspellings, but you'll be able to edit it and it will restore your sense of confidence in your ability to generate ideas.
8) Work with a writing coach or therapist if you can find a good one. They might be expensive, but they're less expensive than getting fired because you handed in a blank performance self-evaluation.
The two sentences right before what you quoted are helpful:
"I’m not a Docker or Flask performance expert, and that’s not the goal of this exercise. To remedy this, I decided to bump the specs on my deployments.
The general goal for this bakeoff is to get a best-case outcome for each of these architectures, rather than an apples-to-apples comparison of cost vs performance."
I wasn't trying to squeeze out every ounce of performance and determine the minimum number of instances to handle 100 req/sec. I was trying to normalize across the three patterns as much as possible to see best-case performance. I didn't want resource constraints to be an excuse.
I think there’s a pretty big flaw in your benchmark because that failure rate is _insane_ and you shouldn’t need anywhere near that hardware to accomplish this. I don’t think your data is credible as a result.