Great to see this SBC benchmark. I have done a benchmark of several available SBCs from an edge data stack viewpoint (Edge Data Stack Capabilities Ranking: https://joinbase.io/ranking/, but the article is still in progress... so sorry that there is no content yet).
What I have tested: a $20 RockPi S (a minimal square Arm64 dev board), a $0 unused IPTV box (S905L2 chip; one can be bought on the secondhand market for ~$7), a $30 NanoPi Neo3 (one of the few cheap SBCs with DDR4 on board), the Allwinner D1 (the first widely available RISC-V 64-bit board), and my own Xeon server (plus more cloud-based instances).
A possibly surprising fact: if modern software (such as my database here) takes full advantage of modern hardware, then the performance of even a small 40mm-square SBC (such as the RockPi S) can be comparable to widely used traditional software (such as PostgreSQL) running on a bare-metal server with a Xeon Platinum 8260.
There are SQLite (OLTP), DuckDB (OLAP), and some engine-based projects like the mentioned Apache Arrow (https://arrow.apache.org/) (OLAP). Apache Arrow has many language implementations; some do not include a query engine in their own repo (for example, the Rust implementation, which depends on DataFusion for SQL-like analytics), while others do (for example, C++).
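For a taste of the Rust side, here is a minimal DataFusion sketch (assuming a recent `datafusion` crate with the `SessionContext` API; the file and column names are hypothetical):

```rust
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    // Register a CSV file as a table, then query it with SQL.
    let ctx = SessionContext::new();
    ctx.register_csv("readings", "readings.csv", CsvReadOptions::new())
        .await?;
    let df = ctx
        .sql("SELECT device, avg(temp) FROM readings GROUP BY device")
        .await?;
    df.show().await?; // prints the result batches
    Ok(())
}
```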
There is a comprehensive OLAP benchmark by ClickHouse that also includes various kinds of embedded engines: https://benchmark.clickhouse.com/
What is more interesting is that, in fact, we do not have an embedded HTAP engine. One of my database products already implements about 3/4 of HTAP at the engine layer, but unfortunately it is still only free software, not an open source implementation.
I am the founder of this database product. A fundamental curiosity in my heart is: after decades of development, can the bottom layer of databases accessible to common users still evolve?
JoinBase is my initial answer to this question.
1. For the first time, it proposes the evolution of the database from the user's point of view.
2. For the first time, it supports all three DB payload types well in a single engine of a free, industrial-grade database: TP single-row transaction writes, and TP and AP reads.
3. For the first time, it is so fast that for most businesses, it is no longer necessary to use a distributed database for their data needs.
As the first end-to-end IoT database, it is interesting to see how much better we fit IoT domain data storage and analysis than existing open source projects.
I recommend a ClickHouse-compatible OLAP database project in Rust: [TensorBase](https://github.com/tensorbase/tensorbase/) for anyone who likes working on AP-side DBs in Rust.
FYI, recent information and progress on TensorBase:
1. TensorBase (TB for short) is not a reimplementation or clone of ClickHouse (CH for short). TensorBase just supports the ClickHouse wire protocol on its server side.
2. TB's in-Rust CH-compatible server side is faster than CH's own in C++. TB enables *F4* in the critical write path: Copy-Free, Lock-Free, Async-Free, and Dyn-Free (no dynamic object dispatch; a small sketch of this idea follows after the list).
The result of TB's architectural performance: the untuned write throughput of TB is ~2x that of CH in the Rust driver bench, or ~70% faster when using CH's own `clickhouse-client` command. Use [this parallel script](https://github.com/tensorbase/tools/blob/main/import_csv_to_...) to try it yourself!
5. For complex group-by aggregations, we have recently helped boost the TB engine to the same speed level as ClickHouse (not released yet, but coming soon).
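To make the "Dyn-Free" point in item 2 a bit more concrete, here is a tiny illustrative Rust sketch (my illustration, not TB's actual code): dynamic dispatch goes through a vtable on every call, while static dispatch via generics is monomorphized and can be fully inlined in a hot write path:

```rust
trait Decode {
    fn decode(&self, byte: u8) -> u64;
}

struct Plain;
impl Decode for Plain {
    fn decode(&self, byte: u8) -> u64 {
        byte as u64
    }
}

// Dynamic dispatch: every call is resolved through a vtable pointer.
fn sum_dyn(dec: &dyn Decode, data: &[u8]) -> u64 {
    data.iter().map(|&b| dec.decode(b)).sum()
}

// Static dispatch: monomorphized per concrete type, fully inlinable.
fn sum_static<D: Decode>(dec: &D, data: &[u8]) -> u64 {
    data.iter().map(|&b| dec.decode(b)).sum()
}

fn main() {
    let data = vec![1u8; 1 << 20];
    assert_eq!(sum_dyn(&Plain, &data), sum_static(&Plain, &data));
}
```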
The word "fastest" is often doubtful. You often see a micro-benchmark list ten products and then claim, "look, I run in the shortest time".
The problem is that people often compare apples to oranges. Do you know how to use ClickHouse correctly (there are 20-30 engines in ClickHouse to choose from; do you compare an in-memory engine against a database designed for disk persistence?), or Spark, or Arrow...? How can you guarantee a fair evaluation across ten or twelve products?
1. TensorBase is highly hackable. If you know Rust and C, then you can control the whole world. This is obviously not the case for Apache Arrow and DataFusion (on top of Arrow).
2. TensorBase uses whole-stage JIT optimization, which is faster (possibly hugely so in complex cases) than what is done in Gandiva. An expression-based computing kernel is far from providing top performance for an OLAP-style big-data system (see the sketch after this list).
3. TensorBase keeps some kinds of OLTP in mind (although in this early stage it is still OLAP-focused). Users do not truly think in terms of OLTP or OLAP; they just want all their queries to be fastest.
4. TensorBase is now APL v2 licensed. Enjoy hacking it yourself!
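To make point 2 above concrete, here is a tiny illustrative Rust sketch (not TensorBase's JIT itself) of why a fused whole-stage loop beats expression-at-a-time kernels:

```rust
// Expression-at-a-time: each operator materializes a full intermediate
// vector, so `a * b + c` makes two passes over memory.
fn eval_expression_at_a_time(a: &[i64], b: &[i64], c: &[i64]) -> Vec<i64> {
    let t: Vec<i64> = a.iter().zip(b).map(|(x, y)| x * y).collect();
    t.iter().zip(c).map(|(x, y)| x + y).collect()
}

// Whole-stage fusion: one tight loop, no intermediate buffer. This is
// roughly the shape of code a whole-stage compiled plan would emit.
fn eval_fused(a: &[i64], b: &[i64], c: &[i64]) -> Vec<i64> {
    a.iter().zip(b).zip(c).map(|((x, y), z)| x * y + z).collect()
}

fn main() {
    let (a, b, c) = (vec![2i64; 4], vec![3i64; 4], vec![4i64; 4]);
    assert_eq!(eval_expression_at_a_time(&a, &b, &c), eval_fused(&a, &b, &c));
}
```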
ps: a recent write-up about TensorBase (including comparisons with some query engines and projects in Rust) can be seen in this presentation:
https://tensorbase.io/2020/11/08/rustfest2020.html
It looks like TensorBase has many things in common with ClickHouse. Do you have benchmarks comparing the performance of TensorBase with ClickHouse, similar to the benchmarks published by ClickHouse [1]?
"raw csv processing at ~20GB/s" has been demonstrated in one my project as the tooling byproduct[1] based on Rayon(great Rust data parallelism library)[2] and wrapping of modified simdcsv[3] into one simple .rs.
Just a little more:
1. simdcsv has severe bugs, so do not use it beyond a demo.
2. The processing model of simdcsv still has room for good improvement: an estimated 2x more, i.e. ~40GB+/s in memory on a single modern socket, should be achievable in some scenarios (those without heavy string-to-complex-language-object conversions).
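A minimal sketch of the Rayon pattern mentioned above (illustrative only, not the actual tool; a real parser must also handle quoted fields and records straddling chunk boundaries):

```rust
// Split the input buffer into chunks and scan them in parallel; here we
// just count newline-delimited records, which is already close to a
// pure memory-bandwidth scan.
use rayon::prelude::*;

fn count_rows(csv: &[u8]) -> usize {
    csv.par_chunks(1 << 20) // ~1 MiB per work item
        .map(|chunk| chunk.iter().filter(|&&b| b == b'\n').count())
        .sum()
}

fn main() {
    let data = b"a,1\nb,2\nc,3\n".repeat(100_000);
    println!("rows: {}", count_rows(&data));
}
```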
Sharing some thoughts here, since I have recently been developing a similar thing:
1. "Query 1.6B rows in milliseconds, live" is just like "sum 1.6B numbers from memory in ms".
In fact, if not full SQL functionalities supported, a naive SQL query is just some tight loop on top of arrays(as partitions for naive data parallelism) and multi-core processors.
So, this kind is just several-line benchmark(assumed to ignore the data preparing and threading wrapping) to see how much time the sum loop can finish.
In fact again, this is just a naive memory bandwidth bench code.
Let's count: now the 6-channel xeon-sp can provide ~120GB/s bandwidth. Then sum loop with 1.6B 4-byte ints without compression in such processors' memory could be finished about ~1.6*4/120 ~= 50ms.
Then, if you find that you get 200ms in xxx db, you in fact has wasted 75% time(150ms) in other things than your own brew a small c program for such toy analysis.
2. Some readers like to see comparisons with ClickHouse (referred to as CH below).
The fact is that CH is a little slow for such naive cases (as seen on the web page[1] pointed out by others).
This is because CH is a real-world product: the optimizations here represent ten years of research and usage in the database industry, and all of that and much, much more is included in CH.
Can you keep such a statement in the title when you enable reading from persistent disk? Or when the query does a high-cardinality aggregation (imagine that a low-cardinality aggregation is just a tight loop plus a hash table that fits in L2 cache; a sketch of this difference follows below)?
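To make point 1 concrete, here is a minimal Rust sketch of such a naive memory-bandwidth bench (illustrative only, using the `rayon` crate; sized down from 1.6B elements so it runs comfortably on a laptop):

```rust
// Naive parallel sum: the loop body is trivial, so the runtime is
// dominated by how fast memory can stream the array in.
use rayon::prelude::*;
use std::time::Instant;

fn main() {
    let n = 100_000_000usize; // 100M x 4-byte ints = 0.4 GB; scale to taste
    let data = vec![1i32; n];

    let start = Instant::now();
    let total: i64 = data.par_iter().map(|&x| x as i64).sum();
    let secs = start.elapsed().as_secs_f64();

    let gb = (n * 4) as f64 / 1e9;
    println!("sum = {total}, effective bandwidth ~{:.1} GB/s", gb / secs);
}
```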
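And for point 2's cardinality question, a minimal sketch (illustrative only) of why low-cardinality aggregation can run at tight-loop speed while high-cardinality aggregation cannot:

```rust
// With few distinct keys the accumulator is a tiny array that stays in
// L1/L2 cache; with many distinct keys it becomes a large hash table
// and every row risks a cache miss.
use std::collections::HashMap;

// Low cardinality: at most 256 groups, accumulator fits in cache.
fn agg_low_cardinality(keys: &[u8], vals: &[i64]) -> [i64; 256] {
    let mut acc = [0i64; 256];
    for (&k, &v) in keys.iter().zip(vals) {
        acc[k as usize] += v;
    }
    acc
}

// High cardinality: unbounded groups, accumulator may dwarf the cache.
fn agg_high_cardinality(keys: &[u64], vals: &[i64]) -> HashMap<u64, i64> {
    let mut acc = HashMap::new();
    for (&k, &v) in keys.iter().zip(vals) {
        *acc.entry(k).or_insert(0) += v;
    }
    acc
}

fn main() {
    let keys_lo: Vec<u8> = (0..1_000u32).map(|i| (i % 16) as u8).collect();
    let keys_hi: Vec<u64> = (0..1_000u64).collect();
    let vals = vec![1i64; 1_000];
    assert_eq!(agg_low_cardinality(&keys_lo, &vals)[0], 63);
    assert_eq!(agg_high_cardinality(&keys_hi, &vals).len(), 1_000);
}
```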
1. We do plan to support full SQL functionality; we have a pretty good subset already [1]. I think what we do is more than "naive memory bandwidth bench code"; however, I am happy to listen and, when the time is right, implement functionality/features that you think we are missing.
2. "Can you hold such statement in the title when you enable reading from persistent disk?" We already persist to disk, the numbers you see imply reading from disk. We do this by using memory mapped files [2].