jeanloolz's comments

Did not know either, so I researched. Us = xvideos.com, the 2nd largest porn site.


It depends on how you look at it, I suppose, but I believe Gemini surpasses OpenAI on many levels now: better photo and video models. The leaderboards for text and embeddings also put Google ahead of OpenAI.


That hunting dog analogy is epic and perfectly matches my experience.


A junior in SQL would need AI to write things they're not sure about, the same way Stackoverflow helped us for many years before AI. A senior in SQL, and in fact in any language, would use AI to go faster (I know I do).


I see this comparison too often and I don't think it's fair. Stackoverflow has peer review.


It's a fair statement. Good point.


I strongly second trafilatura. Sending just the extracted text to the LLM will save a huge amount of money. I used it on this recent project (shameless plug): https://github.com/philippe2803/contentmap. It's a simple Python library that creates a vector store for any website, using a domain's XML sitemap as a starting point. The challenge was that each domain has its own HTML structure, and to create a vector store we need the actual content, with HTML tags etc. removed. Trafilatura basically does that for any URL, in just a few lines of code.
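
To make the "keep the content, drop the HTML" step concrete, here is a crude standard-library sketch. To be clear, this is not trafilatura and does none of its boilerplate detection; the class name, the function name and the list of skipped tags are my own assumptions, for illustration only.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collects visible text and ignores a few non-content tags."""

    # Assumption: a hand-picked list of tags to skip, not a real heuristic.
    SKIP = {"script", "style", "nav", "header", "footer"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped tag.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Return the visible text of an HTML string, joined with single spaces."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)
```

A real extractor also has to cope with comment sections, cookie banners, sidebars and so on, which is exactly why a dedicated library is the better choice here.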


Good to know! Yes, trafilatura is great. Sure, it breaks sometimes, but everything breaks on some websites; the real questions are how often and how bad the breakage is. For general info, the library was described in a paper [1], where Table 1 provides some benchmarks.

I also forgot to mention another interesting scraper that's an LLM-based service. A quick search here tells me it was mentioned once by simonw, but I think it should be better known just for the convenience! Prepend "r.jina.ai/" to any URL to extract text. For example, check out [2] or [3].

[1] https://aclanthology.org/2021.acl-demo.15.pdf

[2] https://r.jina.ai/news.ycombinator.com/

[3] (this discussion) https://r.jina.ai/news.ycombinator.com/item?id=41428274
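
For anyone scripting this, the prefix trick above is just string manipulation. A minimal sketch, assuming the same URL shape as in [2] and [3] (scheme dropped, then the prefix added); the function name is mine, and nothing is fetched here.

```python
def jina_reader_url(url: str) -> str:
    """Build a reader URL by prefixing r.jina.ai, mirroring examples [2] and [3]."""
    # Drop the scheme so the result looks like the links in the comment above.
    for scheme in ("https://", "http://"):
        if url.startswith(scheme):
            url = url[len(scheme):]
            break
    return "https://r.jina.ai/" + url
```

Calling it on this discussion's URL gives the link in [3]; fetching that link with any HTTP client is a separate step.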


I built a similar thing as a python library that does just that: https://github.com/philippe2803/contentmap

Blog post that explains the rationale behind the library: https://philippeoger.com/pages/can-we-rag-the-whole-web

Just pass your XML sitemap to a Python class, and it will do the crawling, chunking, vectorizing and storage in an SQLite file for you. It uses the SQLiteVSS integration with Langchain, but I'm thinking of moving away from it and integrating the new sqlite-vec instead.
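
For readers who want the shape of that pipeline without installing anything, here is a rough standard-library sketch. It is not contentmap's API: every function name is made up for illustration, the "embedding" is a toy hashed bag-of-words rather than a real model, the search is a brute-force scan rather than SQLiteVSS or sqlite-vec, and page fetching is left out (the text for each URL is assumed to be already extracted).

```python
import json
import math
import sqlite3
import xml.etree.ElementTree as ET
import zlib

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(xml_text):
    """Return the <loc> URLs listed in a sitemap XML string."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)]


def chunk_words(text, size=50, overlap=10):
    """Split text into overlapping chunks of at most `size` words."""
    words = text.split()
    chunks = []
    for start in range(0, len(words), size - overlap):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break
    return chunks


def toy_embed(text, dim=64):
    """Toy stand-in for an embedding model: hashed word counts, L2-normalized."""
    vec = [0.0] * dim
    for word in text.lower().split():
        vec[zlib.crc32(word.encode("utf-8")) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def build_store(conn, pages):
    """Chunk and embed each page (url -> text) and store the rows in SQLite."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS chunks (url TEXT, content TEXT, embedding TEXT)"
    )
    for url, text in pages.items():
        for chunk in chunk_words(text):
            conn.execute(
                "INSERT INTO chunks VALUES (?, ?, ?)",
                (url, chunk, json.dumps(toy_embed(chunk))),
            )
    conn.commit()


def search(conn, query, k=3):
    """Brute-force cosine similarity over all stored chunks, best first."""
    q = toy_embed(query)
    scored = []
    for url, content, emb in conn.execute("SELECT url, content, embedding FROM chunks"):
        score = sum(a * b for a, b in zip(q, json.loads(emb)))
        scored.append((score, url, content))
    scored.sort(reverse=True)
    return scored[:k]
```

Usage would look like `conn = sqlite3.connect("site.db")`, then `build_store(conn, pages)` and `search(conn, "some question")`. A vector extension replaces the full-table scan in `search` with an indexed nearest-neighbour query, which is the part that matters once the site has more than a few thousand chunks.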


This is part of a dream of a tool I would like:

A relational crawler on a particular subject with nuanced, opaque, seemingly-temporally-unrelated connections that show a particular MIC conduction of acts::

"Follow all the congress members who have been a part of a particular committee, track their signatory/support for particular ACTs that have been passed, and look at their investment history from open data, quiver, etc - and show language in any public speaking talking about conflicts and arms deals occurring whereby their support of the funding for said conflicts are traceable to their ACTs, committee seat, speaking engagements, investment profit and reporting as compared to their stated net worth over each year as compared to the stated gains stated by their filings for investment. Apply this pattern to all congress, and their public-profile orbit of folks, without violating their otherwise private-related actions."

And give it a series of URLs with known content for which these nuances may be gleaned.

Or have a trainer bot that will constantly only consume this context from the open internet over time such that you can just have a graph over time for the data...

PYTHON: Run it all through txtai / your library ? nodes and ask questions of the data in real time?

And it reminds me of the work of this fine person/it:

https://mlops.systems/#category=isafpr

https://mlops.systems/#category=afghanistan


I know sqlite-vss has been upgraded lately, but it was unstable for a while before that. Are you having good experiences with it?


Actually, sqlite-vss has been untouched for quite some time, and the creator has officially announced that it is deprecated in favor of sqlite-vec, which has recently seen its first non-alpha release (v0.1.0). So I would embrace sqlite-vec now if I were you.

I have not used sqlite-vec much because it was only available as alpha releases until now, but the stable release finally came out a few days ago. I'm looking into integrating it and using it to make SQLite my go-to RAG database.


Very happy to see this extension already out. I tried some of the previous alpha versions, and it is much easier to use and integrate than the previous sqlite-vss extension. Kudos to the creator.


I originally added sqlite-vss (your original vector search implementation) to Langchain as a vectorstore. Do you think this one is mature enough to add to Langchain, or should I wait a bit?

Love your work by the way, I have been using sqlite-vss on a few projects already.


Hey really cool to see re sqlite-vss+langchain! You could try a langchain integration now, there's an (undocumented) sqlite-vec pypi package you can install that's similar to the sqlite-vss one. Though I'd only try it for dev stuff now (or stick to alpha releases), but things will be much more stable when v0.1.0 comes out. Though I doubt the main SQL API (the vec0 table) syntax will change much between now and then.


Cool beans! I'll look into it soon then


He says in the blog post and here that this isn't finished yet


Author of the article here. Just went through your website and I cannot believe I never heard about Mojeek. I'll probably have a go at your API eventually.


What an epic video! Great presentation, phenomenal explanation. Truly great work.

