We used to run it self-hosted, but the architecture is rather complex, so we ended up switching to their SaaS offering. I tried advocating [0] for a leaner architecture for simpler setups and to run integration tests against... but it wasn't met with much enthusiasm.
I imagine 99% of installations (and certainly CI pipelines) would be fine with their WSGI web server and SQLite instead of ClickHouse, Relay, memcached, nginx, Postgres, Kafka and a host of other services. I wanted to take a stab at this myself, but given the complexity of the system, the uncertainty of being able to merge it back, and, last but not least, their license, I decided against it.
This has been the experience at my company as well. The previous version of Sentry Server was okay-ish to self-host, but the newest version requires setting up several more services that our infra team is unwilling to set up and maintain, as they differ from the tech stack that our devs use in the company. We ended up with SaaS too.
- Do NOT migrate to new tools, even if they are really shiny and popular - this may end up being a big hurdle you'll regret down the line (looking at you, Notion, Todoist, Google Keep, the list goes on). Just stick to your tools for at least a couple of years before exploring alternatives.
- Try search instead of organisation. I used to have elaborate folder structures for everything... but I've since given up. I still keep some basic structure, but I find most things via search anyway.
- Help your non-technical friends and family. You may have discovered the best clouds, NAS solutions, software, methodology... but they still use post-its for passwords and keep family recipes in that one notebook in the cupboard. Guide them, share your subscriptions with them, give them pointers.
- Use fewer services, even if they are not the best tools for the job. I find it much easier to reason about a handful of storage media (and their backups, pricing etc.) than having an app for every single activity in my life (e.g. I stopped using apps for recipes, I use my synchronised note taking app instead. Same for shopping lists, use my todo app. Etc.)
I've been taking notes for 2 years, zettelkasten-style, and 90% of them are just dumped in the same directory, without links or tags. If I'm looking for something and remember that I might have that in my notes, I just search for it.
That also implies that, when writing a note, I sometimes add a line with a few related keywords and synonyms.
I spent some time building my own web-based markdown editor and was using minisearch [0] to index and do fuzzy search on my notes. I've recently switched to Obsidian and plan to make a similar extension in the coming weeks.
If you're on a Linux system and the notes are just text files, a

    grep -rn . -e 'search term'

is probably sufficient. Use -rnw if you're looking for whole words.
Maybe leave your friends and family alone - sure, it's fine to mention a product if the topic comes up, but I find that organization is one of the things humans want help with the least - everyone has their own system.
I agree. I rarely ever use new tools, but I switched to Notion a few years ago and even recommended it to many friends and colleagues.
Big mistake. Lesson learned. It's been slow and buggy all this time, but I still kept using it. The final nail in the coffin was that it started using 15-20% of a CPU core while idle, and it has been doing so for 5+ months. Tried everything and gave up. Going back to simple text files.
Completely agree here. For better or worse (well, definitely worse), I've noticed that some part of my brain loves list-making and re-organizing. This leads me to an impulse to be in a never-ending process of migrating from one tool to another, wind at my back with a newly motivated epiphany about how to re-organize everything.
Of course, when I say it out loud, it's nonsense, but moment to moment it doesn't feel like that's what I'm doing.
I really love the wiki-style organization but I don't want to depend on a tool with features stuck to that tool. I want something universal, like text.
Right now I'm using Simplenote and its killer linked-notes feature, but I fear I've got a bit of lock-in there too. I at least believe I can easily back up and migrate out of Simplenote; I think it's kinda portable.
But I really think simple principles, like "search, don't sort", may cut through the unnecessary complexity that my brain loves to produce and administer.
If you want to stick with text, I can wholeheartedly recommend Obsidian. It's a bunch of Markdown files on disk and Obsidian does a great job of layering organization features on top of that. (I personally also pay for their Sync service, but you can sync to other devices using other cloud services — they're just Markdown files on disk, after all!)
I guess in truth I have a handful of "requirements" - text based, able to avoid lock-in, able to sync, and cross-platform availability. Looks like it checks all the boxes.
Your last point is a big one for me. Although I did not personally code the tools I use for 'PKM', they don't depend on any service provider remaining in existence in order to function. Not even my ISP.
> ... even if they are not the best tools for the job
Perhaps I'm mistaken, but from an interview with the game author, I gathered that you can use 10k+ words as guesses, with only ~2300 of those being candidates for the winning word.
But the analysis only uses the latter subset. So in theory there could be a valid 5-letter word that makes a good opener, even though it can never be the winning word.
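For illustration, here's a rough sketch of that kind of analysis in Python. The file names and the positional-frequency scoring are purely hypothetical stand-ins, not what the actual analysis does:

    from collections import Counter

    # Hypothetical word lists, one five-letter word per line: the ~2300
    # possible answers and the full ~10k list of allowed guesses.
    answers = [line.strip() for line in open("answers.txt")]
    guesses = [line.strip() for line in open("allowed_guesses.txt")]

    # Letter frequency at each position, computed over the answers only.
    position_counts = [Counter(word[i] for word in answers) for i in range(5)]

    def score(word):
        # Count each letter once so duplicates don't inflate the score.
        return sum(position_counts[i][c]
                   for i, c in enumerate(word)
                   if c not in word[:i])

    # The best-scoring opener may well be a guess that can never be the answer.
    print(max(guesses, key=score))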
What I find hilarious is that companies argue over who can query 100 TB faster and try to sell this to people. I've been on the receiving end of offers from both of the companies in question and have used both platforms (and sadly migrated some data jobs to them).
While they can crunch large datasets, they are laughably slow for the datasets most people have. So while I did propose we use these solutions for our big-ish data projects, management kept pushing for us to migrate our tiny datasets (tens of gigabytes or smaller), and performance predictably tanked compared to our other solutions (Postgres, Redshift, pandas etc.), never mind the immense cost of migrating everything and training everyone up.
Yes, these are very good products. But PLEASE, for the love of god, don't migrate to them unless you know you need them (and by 'need' I don't mean pimping your resume).
I've worked on building a database myself, and I must say that querying 100TB fast, let alone storing 100TB of data, is a real problem. Some companies (very few) don't have much choice but to use a DB that works at 100TB. If you have small data, you have a lot of options; if your data is large, you have very few. So it makes sense to compete on how fast a DB can query 100TB of data, even while being slow when you have just 10GB. Some databases are designed only for large data and should not be used if your data is small.
The larger your data, the more that building and maintaining indexes hurts you. This is why these systems do much better on large datasets than on small ones. It’s all about trade-offs.
To overcome this, they make use of caching, and if the small data is frequently accessed, performance is generally pretty good and acceptable for most use cases.
Very true. You have to understand the actual capabilities and your actual requirements. We work with petabyte size datasets and BigQuery is hard to beat. Our other reporting systems are still all in MySQL though.
It's my experience that if it's just tens of GBs, then use 'normal' solutions; if it's TBs, then Spark is great for that. Note I have only used Databricks & Spark, no Snowflake.
I remember benchmarking fast compression algorithms a while back and had lz4 as one of the top contenders, but I ended up going with Snappy over lz4 because I ran into some compatibility issues in lz4 land... I can't recall any specifics, but I think it was some Hadoop libraries not being able to read files lz4-compressed by some other versions of the library.
Has anyone run into the same issue? I'm considering reopening this investigation (even though I'm very happy with Snappy).
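For reference, a rough sketch of that kind of micro-benchmark (assuming the lz4 and python-snappy packages; 'sample.bin' is just a placeholder for whatever data you actually want to test):

    import time
    import lz4.frame   # pip package: lz4
    import snappy      # pip package: python-snappy

    # Placeholder input: any reasonably large file.
    data = open("sample.bin", "rb").read()

    def bench(name, compress, decompress):
        start = time.perf_counter()
        compressed = compress(data)
        mid = time.perf_counter()
        decompress(compressed)
        end = time.perf_counter()
        print(f"{name}: ratio {len(compressed) / len(data):.3f}, "
              f"compress {mid - start:.3f}s, decompress {end - mid:.3f}s")

    bench("snappy", snappy.compress, snappy.uncompress)
    bench("lz4", lz4.frame.compress, lz4.frame.decompress)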
It’s fairly difficult from the other side as well - contributing. I’ve been trying to complete Wikidata from a few open source datasets I am intensely familiar with… and it’s been rather painful. WD is the sole place I have ever interacted with that uses RDF, so I always forget the little syntax I learned last time around. I have some pre-existing queries versioned, because I’ll never be able to write them again. I even went to a local Wikimedia training to get acquainted with some necessary tooling, but I’m still super unproductive compared to e.g. SQL.
It’s sad, really, I’d love to contribute more, but the whole data model is so clunky to work with.
That being said, I now remember I stopped contributing for a slightly different reason. While I tried to fill WD with complete information about a given subject, this was never leveraged by a Wikimedia project - there is a certain resistance to generating Wikipedia articles/infoboxes from Wikidata, so you're fighting on two fronts: you always have to edit things in two places, and it's just a waste of everyone's time.
Unless the attitude becomes "all facts in infoboxes and most tables come from WD", the two "datasets" will continue diverging. That is obviously easier said than done, because relying on WD makes contributing to Wikipedia a lot more difficult... and that pretty much defeats its purpose.
The last piece of news I can immediately find is that it was deployed to the Catalan Wikipedia in August 2020, but I'm not sure what progress there has been since.
I have no problems with the data model, but sadly you can't insert RDF statements: you have to go through tools like QS and wikidata-cli and the WD update performance is dismal.
Nice. Reminds me of an optimisation trick from years ago: I was bottlenecked by one of these trigonometric functions when working with a probabilistic data structure... then I figured the input domain was pretty small (a couple dozen values), so I precomputed those and used an array lookup instead. A huge win in terms of perf, though obviously only applicable in these extreme cases.
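Something along these lines, sketched in Python - the 64-entry domain here is just an illustrative stand-in for whatever small, fixed set of inputs the hot path actually uses:

    import math

    # Hypothetical fixed domain: 64 evenly spaced angles used by the hot loop.
    STEPS = 64
    SIN_TABLE = [math.sin(i * 2 * math.pi / STEPS) for i in range(STEPS)]

    def fast_sin(step):
        # 'step' is the integer index into the precomputed domain,
        # not a raw angle in radians.
        return SIN_TABLE[step % STEPS]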
To put some numbers on it, using N terms of the Taylor series for sin(x) [1] with |x| <= 0.1, the maximum error percentage cannot exceed [2]:
    N    Error Limit
    1    0.167%          (1/6%)
    2    8.35 x 10^-5%   (1/11980%)
    3    1.99 x 10^-8%   (1/50316042%)
    4    2.76 x 10^-12%  (1/362275502328%)
Even for |x| as large as 1 the sin(x) = x approximation is within 20%, which is not too bad when you are just doing a rough ballpark calculation, and a two term approximation gets you under 1%. Three terms is under 0.024% (1/43%).
For |x| up to Pi/2 (90°) the one term approximation falls apart. The guarantee from the Taylor series is within 65% (in reality it does better, but is still off by 36%). Two terms is guaranteed to be within 8%, three within 0.5%, and four within 0.02%.
Here's a quick and dirty little Python program to compute a table of error bounds for a given x [3].
[1] x - x^3/3! + x^5/5! - x^7/7! + ...
[2] In reality the error will usually be quite a bit below this upper limit. The Taylor series for a given x is a convergent alternating series. That is, it is of the form A0 - A1 + A2 - A3 + ... where all the A's have the same sign, |Ak| is a decreasing sequence past some particular k, and |Ak| has a limit of 0 as k goes to infinity. Such a series converges to some value, and if you approximate that by just taking the first N terms the error cannot exceed the first omitted term as long as N is large enough to take you to the point where the sequence from there on is decreasing. This is the upper bound I'm using above.
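The linked program isn't reproduced here, but a minimal sketch in the same spirit - just the first-omitted-term bound from [2], expressed as a percentage of sin(x) - could look like:

    import math

    def error_bound_table(x, max_terms=4):
        # Bound for the N-term approximation: the first omitted term,
        # |x|^(2N+1) / (2N+1)!, as a percentage of sin(x).
        sin_x = abs(math.sin(x))
        for n in range(1, max_terms + 1):
            k = 2 * n + 1
            bound = abs(x) ** k / math.factorial(k)
            print(f"N={n}: <= {100 * bound / sin_x:.3g}%")

    error_bound_table(0.1)
    error_bound_table(1.0)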
The sin(x) = x approximation is actually exact (in terms of doubles) for |x| < 2^-26 = 1.4e-8. See also [1]. This happens to cover 48.6% of all doubles.
Similarly, cos(x) = 1 for |x| < 2^-27 = 7.45e-9 (see [2]).
Finally, sin(double(pi)) tells you the error in double(pi) - that is, how far the double representation of pi is away from the "real", mathematical pi [3].
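A quick sanity check of these claims in Python (the exact results depend on the platform's libm, so treat this as an illustration rather than a proof):

    import math

    x = 2.0 ** -28              # well inside both thresholds quoted above
    print(math.sin(x) == x)     # expected True: sin(x) rounds to x
    print(math.cos(x) == 1.0)   # expected True: cos(x) rounds to 1

    # sin of the double nearest pi ~= how far that double is from the real pi.
    print(math.sin(math.pi))    # roughly 1.2246e-16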
That is precisely the technique discussed in the article: it's the first term of the Taylor expansion. Except that the article used more terms of the expansion, and also used very slightly "wrong" coefficients to improve the overall accuracy within the small region.
That's what I assumed would have been a reasonable optimization!
What I really found amazing was that rather than reducing the number of multiplications to 2 (to calculate x^3), you can reduce it to 0, and it would still do surprisingly well.
Tangential at best, but why was the 'r' dropped from that term? Or why not call it caching? Why the weird "memo-ization"? It makes me think of a mass extinction event where everything is turned into a memo.
It's explained right in the linked Wikipedia page:
> The term "memoization" was coined by Donald Michie in 1968[3] and is derived from the Latin word "memorandum" ("to be remembered"), usually truncated as "memo" in American English, and thus carries the meaning of "turning [the results of] a function into something to be remembered". While "memoization" might be confused with "memorization" (because they are etymological cognates), "memoization" has a specialized meaning in computing.
The term memoization likely precedes the word caching (as related to computing, obviously weapon caches are far older). Memoization was coined in 1968. CPU caches only came about in the 80s as registers became significantly faster than main memory.
As Wikipedia outlines, the r was dropped because of the memo. It's derived from the Latin word memorandum, which does contain the r, just like memory, but apparently it was meant more as an analogy to written memos.
Hey Mike, thanks for all the work you've been doing - I first used D3 back in 2013/2014 and have been using it for side projects ever since.
I haven't dug into the whole v3/v4 modularisation, so that might answer it, but is there a way to minimise the dependencies that Plot brings? You say it needs D3, but what of D3 does it need specifically - the whole thing? It's just that it's 250K or so, so I was wondering what the minimal setup here is.
I remember using this editor in the very early 2000s, it was quite something back then. You could easily live edit files off of FTP. With PHP being all the rage, we were changing things pretty live :-)
[0]: https://github.com/getsentry/sentry/issues/32794