Silly benchmarks on completely untuned server code

nartz · on March 11, 2020

Thanks for this - but its the age-old argument of dynamic language vs static/compiled language mixed with hey bloated libraries! and hey bloated features! etc.

0. Performance isn't everything, especially to a lot of companies where having something at all is more important than just being the fastest

1. Dynamic languages are generally require less LOC, which generally equates to faster implementations

2. Rails, Django, <your favorite dynamic language framework> sure are generally bloated, but being able to drop in a library for nearly anything you need cannot be overlooked. Especially for small-midsize companies, spinning up another app server is generally easier than writing a whole bunch of multi-threaded code.

3. The appserver is generally not the bottleneck, rather, of course, its the database.

lmilcin · on March 11, 2020

> Dynamic languages are generally require less LOC, which generally equates to faster implementations

That's not what I observed. My understanding is dynamic languages promote people to think more higher level and care less about performance and program structure. While dynamic constructs can be efficient, the freedom they give is often misused by developers to create baroque structures that hurt performance mainly because developers don't know how to use them efficiently.

jchrisa · on March 11, 2020

If you want multi-datacenter consistency, then the best of transaction protocols will still measure latency in terms of the speed of light multiplied by the distance between datacenters.

kiadimoondi · on March 11, 2020

Exactly. And even then, database choice faces some of the same selection criteria. What's better supported, does it have well tested and documented client implementations (or do I have the time to write it myself), what features do I need, what features should I anticipate needing, is the performance good enough for what I plan on doing.

j88439h84 · on March 11, 2020

As mentioned last time, The modern Python async frameworks like Starlette would have no trouble handling this amount of traffic.

https://www.starlette.io/

derwiki · on March 11, 2020

Exciting! I've never heard of Starlette. I set up example.py and ran the same `ab` commands mentioned in this post and got around 2600 RPS. I fresh started the server and threw 100k requests at it with `ab -c 100 -n 100000`, which timed out:

  apr_socket_recv: Operation timed out (60)
  Total of 49189 requests completed

Same thing when omit `-c 100`. Am I doing something wrong?

j88439h84 · on March 11, 2020

Not sure, I can't reproduce the timeout. I'm seeing "5829.85 [#/sec] (mean)" on my (nearly 10-year-old) computer which is doing other stuff. Not too shabby, I think.

derwiki · on March 11, 2020

A friend helped me figure it out: https://danielmendel.github.io/blog/2013/04/07/benchmarkers-...

jiofih · on March 11, 2020

With the machine she used, there are dozens of options. Node, go, Java, rust, crystal, h2o, will all easily outperform those numbers while offering a ton of features and robust ecosystems.

Anyone can cook up a plain http server, making a real application out of it is the other 98% of the work.

derwiki · on March 11, 2020

I tried running as simple an HTTP server as possible with go:

  package main

  import (
      "fmt"
      "net/http"
  )

  func main() {
      http.HandleFunc("/", func (w http.ResponseWriter, r *http.Request) {
          fmt.Fprintf(w, "Foo")
      })

      http.ListenAndServe(":8088", nil)
  }

and when I run it with `ab -c 100 -n 100000`, it falls over before 10k requests:

  $ ab -c 100  -n 100000 http://127.0.0.1:8088/
  This is ApacheBench, Version 2.3 <$Revision: 1843412 $>
  Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
  Licensed to The Apache Software Foundation, http://www.apache.org/

  Benchmarking 127.0.0.1 (be patient)
  apr_socket_recv: Operation timed out (60)
  Total of 6388 requests completed

I'm extremely new to go so maybe I'm doing something wrong. Could someone help me understand?

derwiki · on March 11, 2020

Ahh this explains it: https://danielmendel.github.io/blog/2013/04/07/benchmarkers-...

jiofih · on March 11, 2020

You probably need to increase max open sockets using `ulimit`. Plus ab kinda sucks, use wrk instead.

bufferoverflow · on March 11, 2020

Language / framework and DB choice does matter for performance. A lot.

https://www.techempower.com/benchmarks/#section=data-r18&hw=...

As you can see, the fastest Python implementation is just 11% of the top solution in Rust.

And your key/value store choice matters a lot. Redis is well known, but it's much slower than Tarantool or LMDB. Which is slower than RocksDB or Aerospike (though take this claim with a grain of salt, they can perform differently under different loads (writes, reads, updates, deletes) and different number of concurrent requests).

kjeetgill · on March 11, 2020

I found myself completely nodding along to the first half of your comment only to find myself scratching my head to the last half.

I'm unfamiliar with Tarantool, but comparing a "server" database like redis with embedded ones like lmdb or rocksdb is so strange. It's comparing different engines to other whole cars.

NB: For anyone not in the know, "server" vs embedded here is about how your processes communicate with the DB, not about the type of hardware they're suitable to run on. An embedded database tends to be more of a library that attaches to a file or directory, often from a single process or occasionally from multiple processes on a single host.

"Server" databases tend to be connected to from processes on a number of servers over the network.

I keep using "server" in quotes here because I don't think I've seen that word used to make the distinction vs embedded databases reliable in literature.

kiadimoondi · on March 11, 2020

Some of those DBs solve different use cases than others. I wouldn't use Redis as an embedded DB (unless benchmarking indicated it fit my needs better), as the user of that software benefits most in using it with multi-machine access in mind. Embedded use cases are where RocksDB, LMDB, TokyoDB/KyotoDB, SQLite, etc. come into play.

Keeping the use case in mind (and it's possible evolution) will help pick the best tool for the job.

jstrong · on March 12, 2020

I don't think your "11%" is the best representation of the data.

black sheep (python): 101,508 req/sec

actix (rust): 886,499 req/sec

yes, 101508/886499 = 0.11

but your "x is % of y" doesn't seem to capture the relationship very well here IMO.

you could also say, the rust library processes 773% more requests per sec.

or, the python library processes 89% fewer requests per sec.

(using: percent change = d2/d1-1)

personally I prefer multiples in terms of the larger thing. actix does over 8x the requests per/sec as the python lib.

enitihas · on March 11, 2020

Redis is slower than LMDB? This sounds very hard to believe, since redis operates purely in memory, and all the data structures are designed with that in mind. LMDB does disk writes for every key value set operation. So I find it hard to believe redis is slower than LMDB

kjeetgill · on March 11, 2020

I don't know for sure, but I'd imagine it's dominated by network. Which kinda the point of the article: benchmarks need detail and context to interpret well.

james_s_tayler · on March 11, 2020

Redis writes it's replication log to disk.

enitihas · on March 11, 2020

That happens on a configured time interval, not on every request unlike lmdb.

jugg1es · on March 11, 2020

do you have evidence showing redis performance vs tarantool and lmdb? I'd be interested to see that.

_bxg1 · on March 11, 2020

I come from outside of the Python ecosystem, so I'm not familiar with the specific frameworks mentioned (I read the original post too).

Can someone simplify the argument that she's making about them? Something to do with maintaining separate, blocking threads in anticipation of requests vs spinning things up as you need them? Or is it a criticism of dynamic languages as HTTP servers in general (it's not clear what language was used in this article)?

lumost · on March 11, 2020

Probably a bit of both, but python and other dynamic languages make doing the smart thing very difficult. While I continue to work professionally in both static and dynamic languages, I've observed a few limitations that prevent writing a fast python webserver

- The Global Interpreter lock - you can't avoid forking processes in order to handle concurrent python code.

- Threads are cheaper than processes. A default java thread carries 2MB overhead, a python process for a typical app can easily be 2GB without very careful memory consideration.

- The pure single-threaded python runtime is 10-100x slower than your typical statically typed language, even when doing everything as carefully as possible you'll be 1-2 orders of magnitude off the best implementation in a statically typed language. Conversely a sloppy implementation in a statically typed language will probably work about as well as the best python implementation.

- Foreign Function Interfaces(FFIs) are slow, in Python an int32 consumes 24 bytes of memory whereas in C the int32 consumes 4 bytes. Stringing together 2 C calls that manipulate an int with python would require 4 allocations and 4 casts. Applications that avoid this overhead have to adopt symbolic APIs that inform an underlying C program how to connect multiple function calls and will grind to a halt if there is any python control flow e.g. TensorFlow.

Most of these limitations tend to be common to other dynamically typed languages such as Ruby, PHP, and Perl. While one can theoretically "drop down to C", it's often not that straightforward in common development scenarios. For businesses that require high(er) throughput, or low(er) latency the time spent fighting the interpreter or the associated AWS bills may not be worth the productivity gains from dynamic types in 2020 when compilers can perform an incremental build in milliseconds.

_bxg1 · on March 11, 2020

I don't think it's productive to argue that non-dynamic languages are superior across the board. Everybody knows they're faster. For many people, they're still the right decision.

I think (or at least assumed) that the post's discussion is more interesting than that. A sibling comment seemed to be zeroing in on the crux of the issue. "Python is an order of magnitude slower but it's web server is two orders of magnitude slower" is a meaningful and fruitful thing to talk about.

lumost · on March 11, 2020

That's fair, however the python language is two orders of magnitude slower than it's static brethren for a variety of tasks. Pretty much any time you need to use the native language in the interpreter. https://benchmarksgame-team.pages.debian.net/benchmarksgame/...

It may be worth looking at how this performance gap has trended over time. Anecdotally I recall the performance gap being much narrower 10 years ago.

zbobet2012 · on March 11, 2020

> Threads are cheaper than processes. A default java thread carries 2MB overhead, a python process for a typical app can easily be 2GB without very careful memory consideration.

Threads are not cheaper than processes in Linux in any significant way. See the stack overflow question on that: https://stackoverflow.com/questions/807506/threads-vs-proces...

A _langauges_ use of those may have poor performance.

hermitdev · on March 11, 2020

> a python process for a typical app can easily be 2GB without very careful memory consideration.

Sorry, this is not true. I have a Python app that uses 2 threads. It's memory doesnt even exceed 200 MB.

Yes, Python threads generally suck because of the GIL, but they dont cause memory bloat like what you're describing.

For context, I use Python extensively for extract/transform/load (ETL) work. I deal with quite large files. My loaders all run in mear linear time relative to the number of records in the file, and none use more than 300MB RAM. This is against Python 3.7.3.

If your multithreaded Python app is using 2 GB of RAM, it's not because of the threads. Best look elsewhere. Maybe you're caching something large in thread local storage?

mplanchard · on March 11, 2020

GP was talking about processes explicitly in comparison to threads, so multiprocessing not multithreading. I believe this was related to their previous comment about the GIL making true multithreaded performance impossible:

> - The Global Interpreter lock - you can't avoid forking processes in order to handle concurrent python code.

> - Threads are cheaper than processes. A default java thread carries 2MB overhead, a python process for a typical app can easily be 2GB without very careful memory consideration.

And it is indeed true that forking can be much more memory intensive than threading.

wmf · on March 11, 2020

1. There's various low-hanging performance fruit that hasn't been picked.

2. Python has several different ways to do concurrency (because they evolved over ~25 years) that shouldn't be mixed but it's very easy to accidentally mix them.

3. Binary RPC (e.g. gRPC or Thrift) is more efficient than REST.

hermitdev · on March 11, 2020

I have a lib that queries quant data from a internal webservice. We have to break up large queries into smaller ones (like 4 days at a time). We currently use multiprocessing to get some parallelism. I prototyped using aiohttp. It was nearly an order of magnitude faster using a single thread. Unfortunately, cant use aiohttp because it doesn't do Negotiate/SSPI auth, which aiohttp doesnt support.

hedora · on March 11, 2020

If you stand up a python json rpc server (using whatever libraries are mentioned at the top of the article) it will be nowhere near the untuned numbers. I’d guess more than an order of magnitude off, based on experience with similar python stacks.

If you tune and use a performance oriented language (on a big multicore machine), you can probably get about 100x better than the untuned not-python throughputs. So, for http hello world, 1000 untuned python servers ~= one beefy server.

_bxg1 · on March 11, 2020

What I don't understand is what the "untuned" case here is. You could stand up a Python JSON RPC server and simply "not tune it". That would be "untuned code". Presumably that isn't what's being described.

Is this some C code written by the author? Is it Apache? Something else? Is the point just how much faster native code is than Python?

hedora · on March 11, 2020

The article goes out of its way to avoid describing the “untuned” setup, though the numbers look about right for a single threaded program without piles of frameworks and serializers in the way.

Apparently someone else also built an untuned setup from scratch, and got better numbers.

I think the point of the article is that if the results are surprising to you, then you should try writing one too.

wmf · on March 11, 2020

I'm guessing the "untuned" code here is C or some other statically-typed compiled language and it's not using any "framework". I guess the point is that straight-line Python code might be 10x-20x slower than C but an inefficient architecture can be 100x-1000x slower while being more complex than a good architecture.

samkidman · on March 11, 2020

I tried this out with unicorn and this minimal config.ru which I believe replicates the author's setup:

  class TestApp
    def self.call(env)
      [200, [], ['Blah']]
    end
  end

  run TestApp

ab -n 10000 http://localhost:8080/ --> ~1900 RPS (~8400 RPS)

ab -k -n 10000 http://localhost:8080/ --> ~1900 RPS (~26000 RPS)

ab -c 100 -n 10000 http://localhost:8080/ --> ~2500 RPS (~14500 RPS)

I have no idea why it performed better on the concurrent benchmark. This was on my 2.3GHz i5 mac book pro. I think about 1/5th the performance of something closer to the machine is quite decent.

thedance · on March 11, 2020

FWIW, as a Xoogler, I find I can get pretty far with 1) abseil, 2) grpc, 3) protobuf, and 4) groping around in every open source Google project to find the thing I need (for example, until recently, Cord could be found in random projects but not in abseil). I don't know what it's like for ex-bookfaces but this gets me a good way to my goal, and as a bonus I get to use stuff I wasn't able to at G, like C++17, or io_uring, or even jemalloc.

enitihas · on March 11, 2020

Outside of google, what other C++ libraries do you think help in writing services in C++. C++ lacks a huge amount of libraries compared to something like java, e.g, I can't even find a half decent http client which handles all the important stuff.

devnulloverflow · on March 11, 2020

> I can't even find a half decent http client which handles all the important stuff.

I'm guessing you don't count cURL as half-decent. Because it's API is C++k let alone idiomatic, modern C++?

hermitdev · on March 11, 2020

I think boost.org is the first place to stop. Boost is often used as a proving ground for future standard libs. The guideline support lib [0] is also another tool. ACE [1] is also another (older) framework that adds a lot. I try and avoid ACE, despite having contributed, there may still be value add, depending on your need.

I've also seen interesting things from Abseil and Folly.

[0] https://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines [1]https://www.dre.vanderbilt.edu/~schmidt/ACE.html

thedance · on March 11, 2020

I don't know. I'd be very squicked out to expose a C++ HTTP client to the big bad world, i.e. as a web crawler. For private use, technically speaking gRPC contains an HTTP client.

More likely I'd fork a process in a safer language and direct it to make HTTP calls.

hermitdev · on March 11, 2020

Boost Beast seems to be pretty robust. Personally, I'd feel fine using it. I know the author Vinnie is very active with the community and very receptive and responsive to bug reports.

joana035 · on March 11, 2020

This looks like a good opportunity to make the existing ones a bit better :-)

MapleWalnut · on March 11, 2020

Another great post! I'm curious what people would suggest as a tech stack that's not python/gunicorn. Something with a GC, decent ORM, that can replace database-backed services written in Python with similar high level code.

dnautics · on March 11, 2020

So I repeated rachel's experiment with the tech stack that I use (elixir/phoenix). It's got a GC, a database adapter that really does a great job with validations and modeling, and at work I've replaced one django stack with it so far.

Not sure if my machine is at all comparable (i5-8265U unplugged laptop CPU @ 1.60GHz) but also the phoenix hello world does a hell of a lot more, like setting session cookies and producing a full front page that renders through two templates. They also took a bit longer to get to in developer time. While you go get a website at :4000 with two commands -- five if you have to install elixir from scratch, it's not really representative of performance: I had to go to prod to disable the live code-reloader and debug/info level logs, disable SSL (releasing to prod defaults to SSL-enabled) and perform a release (elixir these days really wants you to have devops hygeine).

1) I was pretty pleased with the performance. (1900 rps in the base case, 2600 rps in the keepalive case)

2) Unlike Rachel's platform, Elixir is better with concurrency out of the gate (went up to 7800 rps with -c 100), which tells me, the erlang VM is really doing something right.

3) Paying for the cost of having correctly implemented, difficult things like sessions and no XSS would be worth a massive performance hit, IMO.

4) Doing the right thing with Elixir is crazy easy, there's way fewer footguns than python, and code is typically extremely well documented and tested, and inspiring enough to make you want to document and test (though there are fewer libraries).

enitihas · on March 11, 2020

Golang? The go standard library is very good for writing network servers. It is effortless to write a service speaking HTTP with good enough performance. Doing the right thing is also easy. For example, using io.Copy uses splice or sendfile if possible to transfer the data. This makes writing good enough performance servers very easy

sytringy05 · on March 11, 2020

You could have a look at Reactive Java, node, golang, even nginx + openresty.

Basically anything with an event loop and good non blocking IO support for client requests and you will get loads of traffic through without too much hassle.

Of course a lot of this really depends on what your app needs to do over and above handle HTTP requests and what sort of library support you get from language / framework X.

Java, for example, has some great reactive libraries, but any JDBC driver blocks (as it has to, enforced by the JDBC spec). There's a couple of non JDBC drivers that support non blocking IO, but they are still evolving.

It's a similar story with other languages, there's always something necessary that isn't quite there yet.

That said you can still squeeze vast performance out of the blocking IO things with a solid engineering base (Tomcat, IIS if you can stomach windows) you just hit the thread limits earlier and harder than with the non blocking stuff.

jacob019 · on March 11, 2020

In my experience gevent is great, gunicorn not so much. Gevent has a built in HTTP server, it's fantastic. Just take care to avoid CPU bound or blocking code. I use PyMySQL instead of MySQLdb. If speed is a problem then use PyPy. Last summer I spent a month optimizing and updating our public facing django website, getting rid of gunicorn was a big improvement.

anonymoushn · on March 11, 2020

Doing great with Openresty over here.

tinco · on March 11, 2020

Ruby. It's the language with the most intuitive syntax and standard library, the most ergonomic external libraries and an extremely well polished web development ecosystem.

Close rivals might be PHP and Node.JS, there's pros and cons to all of them.

If something needs to be very performant you pluck it out into a little golang microservice.

cakoose · on March 11, 2020

The standard Ruby VM (MRI) is about the same speed as the standard Python VM (CPython), aka on the slow end. Twitter famously moved from Ruby to Java/Scala, largely for performance reasons.

Node.js is much faster because V8 is much faster, but it's still basically single threaded, so you need to run process-per-CPU, which is what the original blog post (to which this post is a follow-up) was complaining about.

tinco · on March 11, 2020

Yeah, the Ruby VM certainly is slow, but that doesn't really matter for most realistic workloads. In Twitter's case it definitely made sense, but do keep in mind that besides switching to Ruby they also switched their entire architecture around to a paradigm that Ruby had no mature ecosystem for at the time.

Node.JS is much faster both because it has a fast VM, and because it has concurrency much better dealt with in its standard library than Ruby does (i.e. a proper event loop is built in and its use is idiomatic).

That's not what the original blog post was about though, it was about how bad Gunicorn is. I've written Python, but never web apps (since Ruby's superiority there is obvious), but I don't think it's really fair to judge the whole language based on the use of some popular library that's shitty. I think it's symptomatic of Python that the library is shitty, but that's my dislike of Python shining through. The reality is that there are most likely great Python libraries for dealing with concurrent web requests, and she didn't bother to learn them.

On Ruby the community would without hesitation recommend you to run Puma or Passenger or even Iodine or Falcon or whatever fancy stuff you have nowadays. These webservers all deal with concurrency, optimizing memory usage, load balancing and managing queues correctly. They don't have silly things like fork'ing before importing libraries, because good software architecture is something that's highly valued in the Ruby community.

cakoose · on March 17, 2020

> That's not what the original blog post was about though, it was about how bad Gunicorn is.

Yes, the post starts out describing an issue with how Gunicorn listens for connections. Like you said, there are better libraries than Gunicorn so that's not a reason to jump to Ruby.

However, the article goes on to talk about other things, like the lack of real multithreading, import-time code execution, and overall efficiency, all of which also apply to Ruby.

I'm not saying Python or Ruby are a bad choice. It's just that MapleWalnut was soliciting an alternative that doesn't have the problems described in the original post, so Ruby doesn't qualify.