We used to run it self-hosted, but the architecture is rather complex, so we ended up switching to their SaaS offering. I tried advocating [0] for a leaner architecture for simpler setups and to run integration tests against... but it wasn't met with much enthusiasm.
I imagine 99% of installations (and certainly CI pipelines) would be fine with their WSGI web server and SQLite instead of ClickHouse, Relay, memcached, nginx, Postgres, Kafka and a host of other services. I wanted to take a stab at this myself, but given the complexity of the system, the uncertainty of being able to merge it back, and, last but not least, their license, I decided against it.
This has been the experience at my company as well. The previous version of Sentry Server was okay-ish to self-host, but the newest version requires setting up several more services that our infra team is unwilling to set up and maintain, as they differ from the tech stack that our devs use in the company. We ended up with SaaS too.
- Do NOT migrate to new tools, even if they are really shiny and popular - this may end up being a big hurdle you'll regret down the line (looking at you, Notion, Todoist, Google Keep, the list goes on). Just stick to your tools for at least a couple of years before exploring alternatives.
- Try search instead of organisation. I used to have elaborate folder structures for everything... but I've since given up. I still keep some basic structure, but I find most things via search anyway.
- Help your non-technical friends and family. You may have discovered the best clouds, NAS solutions, software, methodology... but they still use post-its for passwords and keep family recipes in that one notebook in the cupboard. Guide them, share your subscriptions with them, give them pointers.
- Use fewer services, even if they are not the best tools for the job. I find it much easier to reason about a handful of storage media (and their backups, pricing etc.) than having an app for every single activity in my life (e.g. I stopped using apps for recipes, I use my synchronised note taking app instead. Same for shopping lists, use my todo app. Etc.)
I've been taking notes for 2 years, zettelkasten-style, and 90% of them are just dumped in the same directory, without links or tags. If I'm looking for something and remember that I might have that in my notes, I just search for it.
That also implies that, when writing a note, I sometimes add a line with a few related keywords and synonyms.
I spent some time building my own web-based markdown editor and was using minisearch [0] to index and do fuzzy search on my notes. I've recently switched to Obsidian and plan to make a similar extension in the coming weeks.
If you're on a Linux system and the notes are just text files, a

    grep -rn . -e 'search term'

is probably sufficient. Use -rnw if you're looking for whole words.
Maybe leave your friends and family alone - sure, it's fine to mention a product if the topic comes up, but I find that organization is one of the things humans want help with the least - everyone has their own system.
I agree. I rarely ever use new tools, but I switched to Notion a few years ago and even recommended it to many friends and colleagues.
Big mistake. Lesson learned. It's been slow and buggy all this time, but I still kept using it. The final nail in the coffin was that it started using 15-20% of a CPU core while idle, and it has been doing so for 5+ months. Tried everything and gave up. Going back to simple text files.
Completely agree here. For better or worse (well, definitely worse), I've noticed that some part of my brain loves list-making and re-organizing. This leads me to an impulse to be in a never-ending process of migrating from one tool to another, wind at my back with a newly motivated epiphany about how to re-organize everything.
Of course, when I say it out loud, it's nonsense, but moment to moment it doesn't feel like that's what I'm doing.
I really love the wiki-style organization but I don't want to depend on a tool with features stuck to that tool. I want something universal, like text.
Right now I'm using Simplenote and its killer linked-notes feature, but I fear I've got a bit of lock-in there too. I at least believe I can easily back up and migrate out of Simplenote; I think it's kinda portable.
But I really think simple principles, like "search, don't sort", may cut through the unnecessary complexity that my brain loves to produce and administer.
If you want to stick with text, I can wholeheartedly recommend Obsidian. It's a bunch of Markdown files on disk and Obsidian does a great job of layering organization features on top of that. (I personally also pay for their Sync service, but you can sync to other devices using other cloud services — they're just Markdown files on disk, after all!)
I guess in truth I have a handful of "requirements" - text based, able to avoid lock-in, able to sync, and cross-platform availability. Looks like it checks all the boxes.
Your last point is a big one for me. Although I did not personally code the tools I use for 'PKM', they don't depend on any service provider remaining in existence in order to function. Not even my ISP.
> ... even if they are not the best tools for the job
Perhaps I'm mistaken, but from an interview with the game author, I gathered that you can use 10k+ words as guesses, with only ~2300 of those being candidates for the winning word.
But the analysis only uses the latter subset. So in theory there could be a valid 5-letter word that makes a good opener, even though it can never be the winning word.
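For illustration, here's a rough sketch of that kind of analysis in Python. The file names and the positional-frequency scoring are purely hypothetical stand-ins, not what the actual analysis does:

    from collections import Counter

    # Hypothetical word lists, one five-letter word per line: the ~2300
    # possible answers and the full ~10k list of allowed guesses.
    answers = [line.strip() for line in open("answers.txt")]
    guesses = [line.strip() for line in open("allowed_guesses.txt")]

    # Letter frequency at each position, computed over the answers only.
    position_counts = [Counter(word[i] for word in answers) for i in range(5)]

    def score(word):
        # Count each letter once so duplicates don't inflate the score.
        return sum(position_counts[i][c]
                   for i, c in enumerate(word)
                   if c not in word[:i])

    # The best-scoring opener may well be a guess that can never be the answer.
    print(max(guesses, key=score))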
What I find hilarious is that companies argue over who can query 100 TB faster and try to sell this to people. I've been on the receiving end of offers from both of the companies in question and have used both platforms (and sadly migrated some data jobs to them).
While they can crunch large datasets, they are laughably slow for the datasets most people have. So while I did propose we use these solutions for our big-ish data projects, management kept pushing for us to migrate our tiny datasets (tens of gigabytes or smaller), and performance predictably tanked compared to our other solutions (Postgres, Redshift, pandas etc.), never mind the immense cost of migrating everything and training everyone up.
Yes, these are very good products. But PLEASE, for the love of god, don't migrate to them unless you know you need them (and by 'need' I don't mean pimping your resume).
I've worked on building a database myself, and I must say that querying 100TB fast, let alone storing 100TB of data, is a real problem. Some companies (very few) don't have much choice but to use a DB that works at 100TB. If you have small data, you have a lot of options; if your data is large, you have very few. So it makes sense to compete on how fast a DB can query 100TB of data, even while being slow when you have just 10GB. Some databases are designed only for large data and should not be used if your data is small.
The larger your data, the more that building and maintaining indexes hurts you. This is why these systems do much better on large datasets than on small ones. It’s all about trade-offs.
To overcome this, they make use of caching, and if the small data is frequently accessed, performance is generally pretty good and acceptable for most use cases.
Very true. You have to understand the actual capabilities and your actual requirements. We work with petabyte size datasets and BigQuery is hard to beat. Our other reporting systems are still all in MySQL though.
It's my experience that if it's just tens of GBs, then use 'normal' solutions; if it's TBs, then Spark is great for that. Note I have only used Databricks & Spark, no Snowflake.
I remember benchmarking fast compression algorithms a while back and had lz4 as one of the top contenders, but I ended up going with Snappy over lz4 because I ran into some compatibility issues in lz4 land... I can't recall any specifics, but I think it was some Hadoop libraries not being able to read files lz4-compressed by some other versions of the library.
Has anyone run into the same issue? I'm considering reopening this investigation (even though I'm very happy with Snappy).
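For reference, a rough sketch of that kind of micro-benchmark (assuming the lz4 and python-snappy packages; 'sample.bin' is just a placeholder for whatever data you actually want to test):

    import time
    import lz4.frame   # pip package: lz4
    import snappy      # pip package: python-snappy

    # Placeholder input: any reasonably large file.
    data = open("sample.bin", "rb").read()

    def bench(name, compress, decompress):
        start = time.perf_counter()
        compressed = compress(data)
        mid = time.perf_counter()
        decompress(compressed)
        end = time.perf_counter()
        print(f"{name}: ratio {len(compressed) / len(data):.3f}, "
              f"compress {mid - start:.3f}s, decompress {end - mid:.3f}s")

    bench("snappy", snappy.compress, snappy.uncompress)
    bench("lz4", lz4.frame.compress, lz4.frame.decompress)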
It’s fairly difficult from the other side as well - contributing. I’ve been trying to complete Wikidata from a few open source datasets I am intensely familiar with… and it’s been rather painful. WD is the sole place I have ever interacted with that uses RDF, so I always forget the little syntax I learned last time around. I have some pre-existing queries versioned, because I’ll never be able to write them again. I even went to a local Wikimedia training to get acquainted with some necessary tooling, but I’m still super unproductive compared to e.g. SQL.
It’s sad, really, I’d love to contribute more, but the whole data model is so clunky to work with.
That being said, I now remember I stopped contributing for a slightly different reason. While I tried to fill WD with complete information about a given subject, this was never leveraged by a Wikimedia project - there is a certain resistance to generating Wikipedia articles/infoboxes from Wikidata, so you're fighting on two fronts: you always have to edit things in two places, and it's just a waste of everyone's time.
Unless the attitude becomes "all facts in infoboxes and most tables come from WD", the two "datasets" will continue diverging. That is obviously easier said than done, because relying on WD makes contributing to Wikipedia a lot more difficult... and that pretty much defeats its purpose.
The last piece of news I can immediately find is that it was deployed to the Catalan Wikipedia in August 2020, but I'm not sure what progress there has been since.
I have no problems with the data model, but sadly you can't insert RDF statements: you have to go through tools like QS and wikidata-cli and the WD update performance is dismal.
Nice. Reminds me of an optimisation trick from years ago: I was bottlenecked by one of these trigonometric functions when working with a probabilistic data structure... then I figured the input domain was pretty small (a couple dozen values), so I precomputed those and used an array lookup instead. A huge win in terms of perf, though obviously only applicable in these extreme cases.
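Something along these lines, sketched in Python - the 64-entry domain here is just an illustrative stand-in for whatever small, fixed set of inputs the hot path actually uses:

    import math

    # Hypothetical fixed domain: 64 evenly spaced angles used by the hot loop.
    STEPS = 64
    SIN_TABLE = [math.sin(i * 2 * math.pi / STEPS) for i in range(STEPS)]

    def fast_sin(step):
        # 'step' is the integer index into the precomputed domain,
        # not a raw angle in radians.
        return SIN_TABLE[step % STEPS]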
To put some numbers on it, using N terms of the Taylor series for sin(x) [1] with |x| <= 0.1, the maximum error percentage cannot exceed [2]:
    N    Error Limit
    1    0.167%          (1/6%)
    2    8.35 x 10^-5%   (1/11980%)
    3    1.99 x 10^-8%   (1/50316042%)
    4    2.76 x 10^-12%  (1/362275502328%)
Even for |x| as large as 1 the sin(x) = x approximation is within 20%, which is not too bad when you are just doing a rough ballpark calculation, and a two term approximation gets you under 1%. Three terms is under 0.024% (1/43%).
For |x| up to Pi/2 (90°) the one term approximation falls apart. The guarantee from the Taylor series is within 65% (in reality it does better, but is still off by 36%). Two terms is guaranteed to be within 8%, three within 0.5%, and four within 0.02%.
Here's a quick and dirty little Python program to compute a table of error bounds for a given x [3].
[1] x - x^3/3! + x^5/5! - x^7/7! + ...
[2] In reality the error will usually be quite a bit below this upper limit. The Taylor series for a given x is a convergent alternating series. That is, it is of the form A0 - A1 + A2 - A3 + ... where all the A's have the same sign, |Ak| is a decreasing sequence past some particular k, and |Ak| has a limit of 0 as k goes to infinity. Such a series converges to some value, and if you approximate that by just taking the first N terms the error cannot exceed the first omitted term as long as N is large enough to take you to the point where the sequence from there on is decreasing. This is the upper bound I'm using above.
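The linked program isn't reproduced here, but a minimal sketch in the same spirit - just the first-omitted-term bound from [2], expressed as a percentage of sin(x) - could look like:

    import math

    def error_bound_table(x, max_terms=4):
        # Bound for the N-term approximation: the first omitted term,
        # |x|^(2N+1) / (2N+1)!, as a percentage of sin(x).
        sin_x = abs(math.sin(x))
        for n in range(1, max_terms + 1):
            k = 2 * n + 1
            bound = abs(x) ** k / math.factorial(k)
            print(f"N={n}: <= {100 * bound / sin_x:.3g}%")

    error_bound_table(0.1)
    error_bound_table(1.0)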
The sin(x) = x approximation is actually exact (in terms of doubles) for |x| < 2^-26 = 1.4e-8. See also [1]. This happens to cover 48.6% of all doubles.
Similarly, cos(x) = 1 for |x| < 2^-27 = 7.45e-9 (see [2]).
Finally, sin(double(pi)) tells you the error in double(pi) - that is, how far the double representation of pi is away from the "real", mathematical pi [3].
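A quick sanity check of these claims in Python (the exact results depend on the platform's libm, so treat this as an illustration rather than a proof):

    import math

    x = 2.0 ** -28              # well inside both thresholds quoted above
    print(math.sin(x) == x)     # expected True: sin(x) rounds to x
    print(math.cos(x) == 1.0)   # expected True: cos(x) rounds to 1

    # sin of the double nearest pi ~= how far that double is from the real pi.
    print(math.sin(math.pi))    # roughly 1.2246e-16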
That is precisely the technique discussed in the article: it's the first term of the Taylor expansion. Except that the article used more terms of the expansion, and also used very slightly "wrong" coefficients to improve the overall accuracy within the small region.
That's what I assumed would have been a reasonable optimization!
What I really found amazing was that rather than reducing the number of multiplications to 2 (to calculate x^3), you can reduce it to 0, and it would still do surprisingly well.
Tangential at best, but why was the 'r' dropped from that term? Or why not call it caching? Why the weird "memo-ization"? It makes me think of a mass extinction event where everything is turned into a memo.
It's explained right in the linked Wikipedia page:
> The term "memoization" was coined by Donald Michie in 1968[3] and is derived from the Latin word "memorandum" ("to be remembered"), usually truncated as "memo" in American English, and thus carries the meaning of "turning [the results of] a function into something to be remembered". While "memoization" might be confused with "memorization" (because they are etymological cognates), "memoization" has a specialized meaning in computing.
The term memoization likely precedes the word caching (as related to computing, obviously weapon caches are far older). Memoization was coined in 1968. CPU caches only came about in the 80s as registers became significantly faster than main memory.
As Wikipedia outlines, the r was dropped because of the memo. It's derived from the Latin word memorandum, which does contain the r, just like memory, but apparently it was meant more as an analogy to written memos.
Hey Mike, thanks for all the work you've been doing - I first used D3 back in 2013/2014 and have been using it for side projects ever since.
I haven't dug into the whole v3/v4 modularisation, so that might answer it, but is there a way to minimise the dependencies that Plot brings? You say it needs D3, but what of D3 does it need specifically - the whole thing? It's just that it's 250K or so, so I was wondering what the minimal setup here is.
I remember using this editor in the very early 2000s, it was quite something back then. You could easily live edit files off of FTP. With PHP being all the rage, we were changing things pretty live :-)
[0]: https://github.com/getsentry/sentry/issues/32794