agencies · on Sept 30, 2023

What is state of the art currently?

rutgersnj22 · on Sept 30, 2023

here is a more recent paper where I am one of the authors: http://www.vldb.org/pvldb/vol9/p828-deng.pdf

itake · on Sept 30, 2023

https link:

https://www.vldb.org/pvldb/vol9/p828-deng.pdf

sa-code · on Sept 30, 2023

Lucene's WFST is absurdly fast

anonzzzies · on Sept 30, 2023

Also, are there implementations to look at? Or libraries, open-source DBs, or search engines that use these?

agencies · on Aug 23, 2023

You say that if someone has the chops to be a real mathlete they won't need Polya's _How To Solve It_.

I'll say I went to college 25 years ago with people who had competed internationally in high school and who placed competitively on the Putnam, and they LOVED Polya's book.

I think whether you enjoy seeing strategies laid out well (whether or not you've been able to figure some of it out yourself) depends more on your personality than on how good you are at solving creative math problems.

agencies · on Feb 28, 2023

Is the code or expanded explanation available?

feoren · on Feb 28, 2023

Unfortunately I did this for work so I can't open-source the code without some awkward conversations (which may be possible and worth it, but I haven't yet). I'm sure I could write a blog post on it, but I don't have a blog, so ... sorry. Peter Norvig's post plus my point about tries and an efficient comparison algorithm at least give you a head start.

abecedarius · on March 1, 2023

Norvig wrote a similar expansion in his chapter for the book Beautiful Data (along with several other small programs to do fun things with natural language corpus data).

You can find it with a web search.

(His version didn't use a trie, because Python's built-in dicts are much more efficient than a trie in Python, even with the extra redundancy that a trie eliminates.)
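For reference, a minimal sketch of the dict-based approach in the style of Norvig's spelling corrector: generate every string one edit away from the input and keep the ones found in a word-count dict. The toy corpus and the names `edits1`, `correct`, and `COUNTS` are assumptions for illustration; this is not the code from the book chapter or from the trie-based version discussed above.

```python
import re
from collections import Counter

LETTERS = "abcdefghijklmnopqrstuvwxyz"


def words(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def edits1(word):
    """Return the set of strings one delete, transpose, replace, or insert away."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:] for left, right in splits if right
                for c in LETTERS]
    inserts = [left + c + right for left, right in splits for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)


def correct(word, counts):
    """Pick the most frequent known word within one edit, else return the input."""
    if word in counts:
        return word
    candidates = [w for w in edits1(word) if w in counts]
    if not candidates:
        return word
    return max(candidates, key=counts.__getitem__)


# Toy corpus (an assumption); a real corrector would count a large text corpus.
COUNTS = Counter(words("the quick brown fox jumps over the lazy dog the end"))

print(correct("teh", COUNTS))
```

Each membership test is a single hash lookup in the dict, which is the trade-off the comment above describes: more redundant storage than a trie, but the lookups run in the interpreter's built-in code.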

agencies · on Feb 1, 2023

Here's a write-up from a relatively small/personal perspective

https://blog.qwertyforce.dev/posts/similar_image_search

agencies · on Jan 25, 2023

Have you tried https://historio.us/?

agencies · on Jan 25, 2023

How much would you be willing to pay for such a service?

gowings97 · on Jan 26, 2023

$20/month USD. I put a lot of value on being able to retrieve everything I've ever read.

metadat · on Jan 25, 2023

Why can't it be built into the browser?

Preserving privacy is nice when there's no compelling reason to sacrifice it. Does this really need to be a SaaS?

stonogo · on Jan 25, 2023

It is built into Safari (via the History interface). Opera had this feature before it became a Chrome reskin. Chrome used to support it, but it only worked on http (not https) sites and after some years the feature was dropped. Chrome addons like Falcon bring it back (but Falcon seems unattended these days). The Min Browser offers full-text search history out of the box, but the browser experience is ... eccentric.

There was a SaaS service for this called Recawl which required a browser plugin. memex.garden offered this as part of a SaaS but they dropped this feature. Browserparrot focuses on this, again as a SaaS, again with uncertain pricing. Diskernet does this fully-local, but the software is not free (and is only offered via subscription pricing). St. Clair Software's HistoryHound does full-text history search, but only on Mac; I suppose it's got a bigger featureset than the Safari tool, and it supports not-Safari browsers.

The field is littered with previous attempts to get this right.

agencies · on Jan 25, 2023

Depends on required features like cross-browser support, cross-device support, handling PDFs, OCR for images, etc. Some of the mentioned features already exist in the browser. Not sure if the browser vendors are incentivized to develop and maintain such features.

metadat · on Jan 25, 2023

I recall a project showcased on HN that does browser history full text indexing via a proxy. Perhaps that would be a better approach.

agencies · on Jan 25, 2023

Yeah, there are several threads on HN with lists of tools and their pros/cons.

kazinator · on Jan 25, 2023

It is?

In Firefox, type *, space and then search terms; this searches bookmarks.

Also, in the Bookmarks -> Manage Bookmarks menu there is a search box, for those who despise convenience.

metadat · on Jan 25, 2023

They're talking about including page content; Firefox only includes title, URL, and manually added tags.

agencies · on Aug 25, 2022

Interesting intro/overview in "What every software engineer should know about search" https://scribe.rip/p/what-every-software-engineer-should-kno...

agencies · on June 3, 2022

There is definitely demand and folks are willing to pay:

https://www.duckbillgroup.com/

https://www.vantage.sh/

kamrani · on June 3, 2022

Thanks for sharing the links. Is this something you'd be interested in as well?

agencies · on June 2, 2022

If it were easy/cheap enough to host, could a model like shared game servers or web/email hosting work? People pay $20 a month without thinking for web hosting. What does it take to make "search hosting" a thing, where cheap search hosting companies can crop up both at the low end with bare-bones offerings and others climb up the value chain with offerings like Squarespace...

marginalia_nu · on June 4, 2022

Cheap is probably a bit out of reach. I think right now, for hosting something like my search engine, you're looking at either a one-time cost of around $5000, or $200/month in server rentals. Maybe you can bring that down with the economies of scale, but it's never going to be anywhere close to $20.

agencies · on April 29, 2022

To clarify I'm not asking about HN itself but articles linked from HN.

As you said, the HN API is great, and there are at least two existing published crawls of it that help a lot.

arinlen · on April 29, 2022

> To clarify I'm not asking about HN itself but articles linked from HN.

I might not have a clear picture of what you're looking for, but items of type "story" returned by the HN API do have a URL field, which I believe corresponds to the submitted link.

You can scrape the text field of comment items, but that takes a bit more work.

lcnPylGDnU4H9OF · on April 29, 2022

Hopefully this will help: you're talking about a submission to HN, e.g. a link to a WSJ article complete with comments section, and OP is talking about the specific WSJ article.

krapp · on April 29, 2022

The fastest way to get that would probably still be through HN's API; you just have to take the URL field for stories and ignore everything else.
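A rough sketch of that step, assuming the public Firebase endpoint for HN items (`https://hacker-news.firebaseio.com/v0/item/<id>.json`) and its `type` and `url` fields. The helper names `fetch_item` and `story_url` are made up for illustration, and the network call is left unexercised here.

```python
import json
from urllib.request import urlopen

HN_ITEM = "https://hacker-news.firebaseio.com/v0/item/{id}.json"


def fetch_item(item_id):
    """Fetch one item from the HN API; the result may be None for missing items."""
    with urlopen(HN_ITEM.format(id=item_id), timeout=10) as resp:
        return json.load(resp)


def story_url(item):
    """Return the submitted link for a story, or None for anything else.

    Text-only stories (e.g. Ask HN) have no "url" key, so they also give None.
    """
    if not item or item.get("type") != "story":
        return None
    return item.get("url")


# Offline example with hand-written items shaped like API responses.
sample = [
    {"id": 1, "type": "story", "url": "https://example.com/article"},
    {"id": 2, "type": "comment", "text": "nice"},
    {"id": 3, "type": "story", "title": "Ask HN: ..."},
]
links = [u for u in map(story_url, sample) if u]
print(links)
```

This only yields the link itself; fetching the linked page's content is a separate request to whatever site the URL points at.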

tedunangst · on April 29, 2022

And how do you get the content once you have the URL?

arinlen · on April 29, 2022

> And how do you get the content once you have the URL?

I don't understand your question. If you have the URL, you just GET it, like any regular URL? Is there something that I'm missing?

agencies · on April 29, 2022

Many domains have expired or content is no longer available.

krapp · on April 29, 2022

Use IA more responsibly, perhaps. Instead of scraping it, convert the list of links from HN to point to IA? You still have to work with whatever limits the site puts up in any case.
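One way to do that without scraping, sketched under the assumption that the Wayback Machine availability endpoint (`https://archive.org/wayback/available`) is used to look up the closest snapshot for each link. The function names are hypothetical, the response shape is the one that endpoint is documented to return, and any rate limits still apply.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

AVAILABILITY = "https://archive.org/wayback/available?"


def availability_query(url, timestamp=None):
    """Build the lookup URL; timestamp is an optional YYYYMMDD[hhmmss] string."""
    params = {"url": url}
    if timestamp:
        params["timestamp"] = timestamp
    return AVAILABILITY + urlencode(params)


def closest_snapshot(response):
    """Pull the archived URL out of a decoded availability response, if any."""
    closest = response.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


def lookup(url, timestamp=None):
    """Network call: ask IA for the snapshot closest to the given timestamp."""
    with urlopen(availability_query(url, timestamp), timeout=10) as resp:
        return closest_snapshot(json.load(resp))


# Offline example: build the query for a link submitted on 2022-04-29.
print(availability_query("https://example.com/post", "20220429"))
```

Passing the story's submission date as the timestamp asks for the capture nearest to when the link was posted, which helps with domains that have since expired.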

agencies · on April 29, 2022

If an HN story is a link to Wikipedia, does the HN API serve the content of the Wikipedia page??