Hacker News | agencies's comments

What is state of the art currently?


Here is a more recent paper where I am one of the authors: http://www.vldb.org/pvldb/vol9/p828-deng.pdf



Lucene's WFST is absurdly fast


Also, are there implementations to look at? Or libraries, open-source DBs, or search engines that use these?


You say that if someone has the chops to be a real mathlete they won't need Polya's _How To Solve It_.

I'll say I went to college 25 years ago with people who had competed internationally in high school and who placed competitively on the Putnam, and they LOVED Polya's book.

I think whether you enjoy seeing strategies laid out well (whether or not you've been able to figure some of it out yourself) depends more on your personality than on how good you are at solving creative math problems.


Is the code or expanded explanation available?


Unfortunately I did this for work so I can't open-source the code without some awkward conversations (which may be possible and worth it, but I haven't yet). I'm sure I could write a blog post on it, but I don't have a blog, so ... sorry. Peter Norvig's post plus my point about tries and an efficient comparison algorithm at least give you a head start.


Norvig wrote a similar expansion in his chapter for the book Beautiful Data (along with several other small programs to do fun things with natural language corpus data).

You can find it with a web search.

(His version didn't use a trie, because Python's built-in dicts are much more efficient than a trie in Python, even with the extra redundancy that a trie eliminates.)
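For anyone who wants a head start, here's a minimal sketch of the dict-based approach in the style of Norvig's spelling corrector. It's written from memory, not copied from his chapter; the `correct` helper and the toy word counts are my own illustration, and it only looks one edit away.

```python
from collections import Counter
import string

def edits1(word, alphabet=string.ascii_lowercase):
    """All strings exactly one edit (delete, transpose, replace, insert) from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:] for left, right in splits if right for c in alphabet]
    inserts = [left + c + right for left, right in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)

def correct(word, counts):
    """Return word if known, else the most frequent known word one edit away."""
    if word in counts:
        return word
    candidates = [w for w in edits1(word) if w in counts]
    if not candidates:
        return word  # nothing known nearby; leave the input unchanged
    return max(candidates, key=lambda w: counts[w])

# Toy counts standing in for a real corpus (assumption for the example).
counts = Counter({"the": 10, "ten": 2, "trie": 1})
suggestion = correct("teh", counts)
```

Each membership test is a single hash lookup in the built-in dict, which is why this stays fast in Python without a trie.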


Here's a write-up from a relatively small/personal perspective

https://blog.qwertyforce.dev/posts/similar_image_search


Have you tried https://historio.us/?


How much would you be willing to pay for such a service?


$20/month USD. I put a lot of value on being able to retrieve everything I've ever read.


Why can't it be built into the browser?

Preserving privacy is nice when there's no compelling reason to sacrifice it. Does this really need to be a SaaS?


It is built into Safari (via the History interface). Opera had this feature before it became a Chrome reskin. Chrome used to support it, but it only worked on http (not https) sites and after some years the feature was dropped. Chrome addons like Falcon bring it back (but Falcon seems unattended these days). The Min Browser offers full-text search history out of the box, but the browser experience is ... eccentric.

There was a SaaS service for this called Recawl which required a browser plugin. memex.garden offered this as part of a SaaS but they dropped this feature. Browserparrot focuses on this, again as a SaaS, again with uncertain pricing. Diskernet does this fully locally, but the software is not free (and is only offered via subscription pricing). St. Clair Software's HistoryHound does full-text history search, but only on Mac; I suppose it's got a bigger feature set than the Safari tool, and it supports non-Safari browsers.

The field is littered with previous attempts to get this right.


Depends on required features like cross-browser support, cross-device support, handling PDFs, OCR for images, etc. Some of the mentioned features already exist in the browser. Not sure if the browser vendors are incentivized to develop and maintain such features.


I recall a project showcased on HN that does browser history full text indexing via a proxy. Perhaps that would be a better approach.


Yeah, several threads on HN have lists of tools and their pros/cons.


It is?

In Firefox, type *, space and then search terms; this searches bookmarks.

Also, in the Bookmarks -> Manage Bookmarks menu there is a search box, for those who despise convenience.


They're talking about including page content; Firefox only includes title, URL, and manually added tags.


Interesting intro/overview in "What every software engineer should know about search" https://scribe.rip/p/what-every-software-engineer-should-kno...


There is definitely demand and folks are willing to pay:

https://www.duckbillgroup.com/

https://www.vantage.sh/


Thanks for sharing the links. Is this something you'd be interested in as well?


If it were easy/cheap enough to host, could a model like shared game servers or web/email hosting work? People pay $20 a month without thinking for web hosting. What does it take to make "search hosting" a thing, where cheap search hosting companies can crop up both at the low end with bare-bones offerings and others climb up the value chain with offerings like Squarespace...


Cheap is probably a bit out of reach. I think right now, for hosting something like my search engine, you're looking at either a one-time cost of around $5000, or $200/month in server rentals. Maybe you can bring that down with the economies of scale, but it's never going to be anywhere close to $20.


To clarify I'm not asking about HN itself but articles linked from HN.

As you said, the HN API is great and there are at least two existing published crawls of it that help a lot.


> To clarify I'm not asking about HN itself but articles linked from HN.

I might not have a clear picture of what you're looking for, but items of type "story" returned by the HN API do have a URL field, which I believe corresponds to the submitted link.

You can scrape the text field of comment items, but that takes a bit more work.


Hopefully this will help: you're talking about a submission to HN, e.g. a link to a WSJ article complete with comments section, and OP is talking about the specific WSJ article.


The fastest way to get that would probably still be through HN's API: you just have to take the URL field for stories and ignore everything else.
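A rough sketch of what that could look like, assuming the public Firebase-backed HN API, where each item is a JSON object with a `type` field and, for link submissions, a `url` field. The helper names `fetch_item` and `story_url` are mine, not part of the API.

```python
import json
from urllib.request import urlopen

# Item endpoint of the public HN API.
ITEM_URL = "https://hacker-news.firebaseio.com/v0/item/{id}.json"

def fetch_item(item_id):
    """Fetch one HN item as a dict (network call; None if the item doesn't exist)."""
    with urlopen(ITEM_URL.format(id=item_id), timeout=10) as resp:
        return json.load(resp)

def story_url(item):
    """Return the submitted link for a story item, or None for anything else."""
    if not item or item.get("type") != "story":
        return None
    # Ask HN and other text-only posts are stories without a "url" key.
    return item.get("url")
```

Something like `story_url(fetch_item(some_id))` then gives you the link or `None`. Note that this only gets you the URL; fetching the article text is a separate request to the linked site.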


And how do you get the content once you have the URL?


> And how do you get the content once you have the URL?

I don't understand your question. If you have the URL, you just GET it, like any regular URL? Is there something that I'm missing?


Many domains have expired or content is no longer available.


Use IA more responsibly, perhaps. Instead of scraping it, convert the list of links from HN to point to IA? You still have to work with whatever limits the site puts up in any case.
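To make the link-rewriting idea concrete, here's a small sketch. It assumes the Wayback Machine's usual `https://web.archive.org/web/<timestamp>/<url>` address pattern, which as far as I know redirects to the capture nearest the timestamp, and it assumes you pass the Unix time from the HN item's `time` field. The `wayback_url` name is just for illustration.

```python
from datetime import datetime, timezone

def wayback_url(url, unix_time):
    """Build a Wayback Machine link for url near the given Unix time (UTC)."""
    # Wayback timestamps are 14 digits: YYYYMMDDhhmmss.
    stamp = datetime.fromtimestamp(unix_time, tz=timezone.utc).strftime("%Y%m%d%H%M%S")
    return f"https://web.archive.org/web/{stamp}/{url}"

# Example with a made-up submission time of one day after the Unix epoch.
link = wayback_url("http://example.com/", 86400)
```

This only rewrites the link; whether a capture actually exists near that date is something you'd still have to check, within whatever rate limits the site imposes.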


If an HN story is a link to Wikipedia, does the HN API serve the content of the Wikipedia page?

