
I build search engines for a living. While I appreciate the hacker spirit of your project with Solr, I also see this as a huge problem that leads to bad search experiences. Tuning boosts in Solr is not even close to a reasonable way to solve problems like this: arguably not even for an underfunded library, and certainly not for a high-traffic consumer website.

For one, you need disciplined acceptance criteria in the form of both qualitative standards (things a non-technical manager can look at and say yes or no to) and quantitative relevance metrics like mean reciprocal rank (MRR) and normalized discounted cumulative gain (NDCG), acquiring human-annotated relevance judgments if needed.
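
To make that concrete, the quantitative half can start as a small script run over a judged query set. A rough sketch (the judgment format and the search() helper here are placeholders, not any particular tool):

    # Rough sketch of offline relevance metrics over human-annotated judgments.
    # "judgments" maps each test query to graded relevance labels (0-3) keyed by
    # doc id; the query set, labels, and the search() function are assumptions.
    import math

    def reciprocal_rank(ranked_ids, relevant_ids):
        # 1/position of the first relevant result, 0 if none appears
        for i, doc_id in enumerate(ranked_ids, start=1):
            if doc_id in relevant_ids:
                return 1.0 / i
        return 0.0

    def dcg(gains):
        return sum(g / math.log2(i + 1) for i, g in enumerate(gains, start=1))

    def ndcg_at_k(ranked_ids, graded, k=10):
        gains = [graded.get(doc_id, 0) for doc_id in ranked_ids[:k]]
        ideal = sorted(graded.values(), reverse=True)[:k]
        return dcg(gains) / dcg(ideal) if dcg(ideal) > 0 else 0.0

    def evaluate(search, judgments, k=10):
        # Mean reciprocal rank and mean NDCG@k across the judged query set.
        mrr, ndcg = 0.0, 0.0
        for query, graded in judgments.items():
            ranked = search(query, rows=k)   # assumed: returns ranked doc ids
            relevant = {d for d, g in graded.items() if g > 0}
            mrr += reciprocal_rank(ranked, relevant)
            ndcg += ndcg_at_k(ranked, graded, k)
        n = len(judgments)
        return mrr / n, ndcg / n

Run that before and after every configuration change and you at least know whether the change helped across the whole judged set, not just the handful of queries someone eyeballed.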

When people focus only on qualitative feedback on top of boosts and hacks in an off-the-shelf tool, they usually end up with some weird witches’ brew of bizarre boosts and time-decay weighting that is extremely fragile and can’t be safely changed, or even understood, without the qualitative performance going haywire. You need disciplined study of quantitative ranking metrics to understand what drives performance and how it falls off as you move down the ranking, to make search index updates reproducible, and to make incremental improvement measurable.

Meanwhile, if you focus only on quantitative metrics, you might miss obvious red flags. The relevance judgments behind your NDCG might be biased in some way. You might surface results that are highly relevant to only one sense of the words in a query (like only showing fruit for “apple” and never tech gadgets). You need the people who make subjective appraisals of quality on users’ behalf to stay in the loop.

Here’s the point. When all of this is missing, you lose credibility with the people making the decisions. They’ll hear some engineer babble about NDCG and then say the darn thing doesn’t work in QA testing. Or they’ll say the qualitative results look OK and get angry when weird counter-examples pop up on the second or third results page, exactly the kind of thing quantitative metrics would have caught.

When this happens, executives and managers just want to punt. They want the “nobody ever got fired for buying IBM” equivalent for search, and that’s how you end up with Confluence still only supporting exact title matching and having no ability for actual content relevancy search.

In this sense, the little projects showing “look what us non-specialists could cook up by hacking some boosts in Solr!” do a lot of harm, and should not be held up as the plucky success stories they are often painted to be.



But it works? If you measure your tuning by the rank at which users click, you’d get feedback that could obviously help find more examples or even feed some kind of self-learning system, but if all you care about is, say, periodically boosting common best sellers, well, isn’t that good for (almost) everyone? ;-) My suggestion for GoodReads would be to rank exact title matches higher than partial or reordered ones, but that would just be a start. Of course, evaluate what’s working and what isn’t, and maybe keep test cases where you check whether lower-ranked results have improved. But any attempt at a fix is likely better than doing nothing, since you’re likely to spend that energy fixing searches for the popular books folks actually want to find...
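
For concreteness, the kind of query-time change I’m picturing looks roughly like this: edismax with a phrase boost plus an exact-title boost. The core name, field names, and weights are all made up, and title_exact is assumed to be a keyword-tokenized, lowercased copy of the title:

    # Rough sketch (field names, weights, and core name are made up): an edismax
    # request that ranks adjacent-phrase title matches above scattered word
    # matches, and a complete-title match above everything else.
    import requests

    def search_books(query, solr_url="http://localhost:8983/solr/books/select", rows=10):
        params = {
            "defType": "edismax",
            "q": query,
            "qf": "title^10 author^3 description",  # per-field term-match boosts
            "pf": "title^50",                        # extra boost when the words appear as a phrase in the title
            "bq": f'title_exact:"{query}"^200',      # exact whole-title match wins outright
            "fl": "id,title,author,score",
            "rows": rows,
        }
        resp = requests.get(solr_url, params=params)
        resp.raise_for_status()
        return [doc["id"] for doc in resp.json()["response"]["docs"]]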


> “But it works?”

This is the big red flag: non-specialists hacking on Solr boosts claiming something works because of a few qualitative test cases.

“It works” is a statement you can only make after you’ve done qualitative and quantitative goodness-of-fit testing.

You wouldn’t have a random IT employee make a stock-trading algorithm and then test it on a month of data and call it a success.

For a search solution to “work,” it needs to pass quantitative and qualitative checks, be explainable to stakeholders, and be reproducible and incrementally updatable. The training process and the chosen hyperparameters all need to be reproducible and tied to the outcome criteria they are meant to satisfy.

Making some hacks into a demo that superficially looks good is not at all the same as “it works.”


I wouldn’t do this with stock trading because there I’m risking everything. But I wouldn’t call the people adjusting search engine parameters completely untrained either, just not following a methodology that tests their changes against every query. I’ve found that for libraries, at least, the search engines folks are used to are simply SO BAD, so unoptimized, that a little hand tweaking and prioritizing of exact title matches goes a long way. And you’re confusing manual testing with no testing: they watched very carefully for counter-examples, kept a list of known-good titles that should return an expected set of results, and were known to roll back changes that had unexpected consequences. Effectively they had the risk appetite to test in production because the cost to end users is minimal, and the assumption when search doesn’t work is “oh, they must not have that book” or “oh, they need to fix this particular search,” not “oh, they broke search completely and must be fired” (no one says that last one).


> When this happens, executives and managers just want to punt. They want the “nobody ever got fired for buying IBM” equivalent for search, and that’s how you end up with Confluence still only supporting exact title matching and having no ability for actual content relevancy search

This whole post was a wild ride. But out of curiosity, have you considered walking up to Atlassian and saying “pay me $1MM a year and I’ll solve this problem for you”?


Obviously Atlassian would have to believe that high-quality search would bring in more than $1MM of additional profit. Judging by what Atlassian is actually doing... they apparently don't think a better search than they've got is necessary, or an investment that would pay for itself, right?


I hear you, but I firmly believe the search we built there works far better than Goodreads' as outlined in the OP, and it could be shown to with the kind of formalized evaluation you suggest (I am no longer there, though).

I agree that underfunded "DIY" enterprise software projects, managed and implemented without the proper expertise, are a problem in academic libraries and elsewhere, for search and for other things.

I still don't see the problem with setting up Solr indexed fields and boosts to ensure that "match as phrase" is boosted higher than non-adjacent matches (a feature built into Solr), and "match _complete_ title" is boosted highest of all. This is exactly what the Goodreads examples failed on. It is pretty simple to set up, I don't see much risk of it causing problems or being worse than not doing it, and it would solve those horrible Goodreads results specifically.
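
To be concrete about what I mean on the index side, roughly this via Solr's Schema API (the field and type names are just what I'd call them, not anything standard):

    # Rough sketch of the index side (names are made up): a title_exact field
    # that lowercases the whole title as a single token, so only a complete-title
    # match hits it, plus a copyField so it gets populated from title.
    import requests

    SOLR = "http://localhost:8983/solr/books"

    def add_exact_title_field():
        # A field type that keeps the whole value as one lowercased token.
        requests.post(f"{SOLR}/schema", json={
            "add-field-type": {
                "name": "string_exactish",
                "class": "solr.TextField",
                "analyzer": {
                    "tokenizer": {"class": "solr.KeywordTokenizerFactory"},
                    "filters": [{"class": "solr.LowerCaseFilterFactory"}],
                },
            }
        }).raise_for_status()
        requests.post(f"{SOLR}/schema", json={
            "add-field": {"name": "title_exact", "type": "string_exactish",
                          "indexed": True, "stored": False}
        }).raise_for_status()
        requests.post(f"{SOLR}/schema", json={
            "add-copy-field": {"source": "title", "dest": "title_exact"}
        }).raise_for_status()

Because title_exact keeps the whole title as a single lowercased token, only a query matching the complete title hits it, which is what you hang the biggest boost on at query time.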

I understand that, since this is what you specialize in, you see the risk of "look at what us non-specialists can set up" steering someone away from... actually I'm not sure what, hiring someone like you? (Which, if I were in charge of an academic library budget, which I'm not, I'd be willing to consider -- don't get me wrong, it's not a terrible idea!) In reality, I think what it steers people away from is... a relevancy search like Goodreads has. (Goodreads is _not_ a "plucky little project", and apparently they think their horrible search is good enough! It is not! And I do think they could make it a LOT better pretty easily, without having to spend millions on it.)

You seem to suggest that products like Solr or ElasticSearch should not be used or configured except by people as specialized as you, backed by relatively expensive search evaluation programs. While I'm sure that would result in better searches everywhere (not being sarcastic, I fully accept that), I think it's unrealistic. If you convince people their only choices are Solr/ElasticSearch/postgres-full-text out of the box, using only one indexed field with no relevance tuning; no search at all; or hiring you or an equivalently expensive search program, internal or external -- you're not going to get the expensive search program you want, and you're definitely not going to get "okay, we just won't have a search at all then." You're going to get people not touching the configuration at all, and ending up with Goodreads search.

Your search really doesn't have to be as bad as Goodreads', even without investing in the kind of program and expertise you are suggesting; I really believe that and stand by it. If you do invest in what you are suggesting, it will certainly be even better.

(PS: If you have disciplined acceptance criteria combined with qualitative feedback from experts etc. -- aren't you still going to end up tuning your Solr configuration to achieve improvement on those evaluations? I'm confused by your suggestion that tuning Solr configuration with boosts etc. is not the right tool. Or are you suggesting Solr is the wrong tool for... search?)
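
The loop I'm picturing is basically: try a few boost settings, score each one against a human-judged query set, keep the best. A very rough sketch, with made-up weights, field names, and judgment format:

    # Rough sketch: sweep a few boost settings and keep whichever scores best on
    # the judged query set. Core name, fields, weights, and judgment format are
    # all made up for illustration.
    import math
    import requests

    SOLR = "http://localhost:8983/solr/books/select"

    def ndcg_at_10(ranked_ids, graded):
        dcg = lambda gains: sum(g / math.log2(i + 1) for i, g in enumerate(gains, start=1))
        ideal = sorted(graded.values(), reverse=True)[:10]
        actual = [graded.get(d, 0) for d in ranked_ids[:10]]
        return dcg(actual) / dcg(ideal) if dcg(ideal) > 0 else 0.0

    def run_query(query, qf, pf):
        params = {"defType": "edismax", "q": query, "qf": qf, "pf": pf,
                  "fl": "id", "rows": 10}
        docs = requests.get(SOLR, params=params).json()["response"]["docs"]
        return [d["id"] for d in docs]

    def pick_best_boosts(judgments):
        # judgments: {query: {doc_id: graded relevance 0-3}}, from human annotators
        candidates = [("title^10 author^3 description", "title^50"),
                      ("title^20 author^2 description", "title^100"),
                      ("title^5 author^5 description^2", "title^25")]
        scored = []
        for qf, pf in candidates:
            mean_ndcg = sum(ndcg_at_10(run_query(q, qf, pf), graded)
                            for q, graded in judgments.items()) / len(judgments)
            scored.append((mean_ndcg, qf, pf))
        return max(scored)   # the config with the highest mean NDCG@10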



