It depends on how you look at it, I suppose, but I believe Gemini now surpasses OpenAI on many levels. It has better photo and video models, and the leaderboards for text and embeddings are also putting Google ahead of OpenAI.
A junior in SQL would need AI to write things they're not sure about, the same way Stack Overflow helped us for many, many years before AI. A senior in SQL, or in any language for that matter, would use AI to go faster (I know I do).
I strongly second trafilatura. Sending only the extracted text to the LLM, rather than the raw HTML, will save a huge amount of money.
I used it on this recent project (shameless plug): https://github.com/philippe2803/contentmap. It's a simple Python library that creates a vector store for any website, using the domain's XML sitemap as a starting point. The challenge was that each domain has its own HTML structure, and to create a vector store we need the actual content, with HTML tags and the like removed. Trafilatura basically does that for any URL, in just a few lines of code.
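To give a rough feel for the saving, here is a minimal sketch that compares the size of raw HTML with the size of its visible text. It uses a naive standard-library tag stripper as a stand-in, not trafilatura itself (which does much more, such as dropping navigation and boilerplate). The names `VisibleTextParser`, `visible_text` and `size_reduction` are made up for this illustration.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def visible_text(html: str) -> str:
    """Naive text extraction: drop tags, scripts and styles."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def size_reduction(html: str) -> float:
    """Fraction of characters removed by dropping markup (0.0 to 1.0)."""
    if not html:
        return 0.0
    return 1 - len(visible_text(html)) / len(html)
```

On real pages most of the payload tends to be markup, scripts and styling, so the text that actually needs to reach the LLM is usually a small fraction of the download.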
Good to know! Yes, trafilatura is great. Sure, it breaks sometimes, but everything breaks on some websites; the real questions are how often and how badly. For general info, the library was published here [1], and Table 1 there provides some benchmarks.
I also forgot to mention another interesting scraper, this one an LLM-based service. A quick search here tells me it was mentioned once by simonw, but I think it should be better known just for the convenience! Prepend "https://r.jina.ai/" to any URL to extract its text. For example, check out [2] or [3].
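As a small sketch of that trick, the helper below just builds the prefixed URL. The function name `reader_url` is hypothetical, and the code assumes the service accepts the full target URL (scheme included) appended to the prefix.

```python
from urllib.parse import urlsplit

# Prefix of the reader service mentioned above.
READER_PREFIX = "https://r.jina.ai/"


def reader_url(target: str) -> str:
    """Return the r.jina.ai URL that should serve `target` as extracted text."""
    parts = urlsplit(target)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        raise ValueError(f"expected an absolute http(s) URL, got {target!r}")
    return READER_PREFIX + target


if __name__ == "__main__":
    print(reader_url("https://example.com/post"))
```

The resulting URL can then be fetched with any HTTP client, for example `urllib.request.urlopen`.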
Just pass your XML sitemap to a Python class, and it will do the crawling, chunking, vectorizing and storage in an SQLite file for you. It currently uses the SQLiteVSS integration with LangChain, but I'm thinking of moving away from it and integrating with the new sqlite-vec instead.
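For anyone curious what such a pipeline looks like, here is a standard-library-only sketch of three of the steps: reading URLs from a sitemap, chunking text, and storing chunks in SQLite. This is not contentmap's actual API. The function names, the `chunks` table layout and the chunk sizes are all assumptions for illustration, and the crawling and vectorizing steps are left out.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Namespace used by standard sitemap files.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(sitemap_xml: str) -> list[str]:
    """Return every <loc> URL listed in a sitemap <urlset>."""
    root = ET.fromstring(sitemap_xml)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


def chunk_text(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character chunks that overlap slightly."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the last chunk already reaches the end of the text
        start += size - overlap
    return chunks


def store_chunks(conn: sqlite3.Connection, url: str, chunks: list[str]) -> None:
    """Persist chunks with their source URL and position (hypothetical schema)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS chunks ("
        "url TEXT NOT NULL, position INTEGER NOT NULL, content TEXT NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO chunks (url, position, content) VALUES (?, ?, ?)",
        [(url, i, chunk) for i, chunk in enumerate(chunks)],
    )
    conn.commit()
```

In a real run, each URL from `sitemap_urls` would be downloaded and cleaned (for example with trafilatura) before chunking, and each chunk would get an embedding stored next to it.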
A relational crawler on a particular subject with nuanced, opaque, seemingly-temporally-unrelated connections that show a particular MIC conduction of acts::
"Follow all the congress members who have been a part of a particular committee, track their signatory/support for particular ACTs that have been passed, and look at their investment history from open data, quiver, etc - and show language in any public speaking talking about conflicts and arms deals occurring whereby their support of the funding for said conflicts are traceable to their ACTs, committee seat, speaking engagements, investment profit and reporting as compared to their stated net worth over each year as compared to the stated gains stated by their filings for investment. Apply this pattern to all congress, and their public-profile orbit of folks, without violating their otherwise private-related actions."
And give it a series of URLs with known content for which these nuances may be gleaned.
Or have a trainer bot that will constantly only consume this context from the open internet over time such that you can just have a graph over time for the data...
PYTHON: Run it all through txtai / your library ? nodes and ask questions of the data in real time?
(And it reminds me of the work of this fine person/it::
Actually, sqlite-vss has been untouched for quite some time, and its creator has officially announced that it is deprecated in favor of sqlite-vec, which has recently seen its first non-alpha release (v0.1.0). So I would embrace sqlite-vec now if I were you.
I have not used sqlite-vec much because it was only available as alpha releases until now, but the stable release finally came out a few days ago. I'm looking into integrating it and using it to make SQLite my go-to RAG database.
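To make the "SQLite as a RAG database" idea concrete, here is a brute-force sketch that uses only the standard library. It is not sqlite-vec and does not use its API; it simply stores embeddings as packed float32 BLOBs and ranks rows by Euclidean distance in Python. The `docs` table and all function names are assumptions for illustration.

```python
import math
import sqlite3
import struct


def pack(vector: list[float]) -> bytes:
    """Serialize a vector as little-endian float32 values."""
    return struct.pack(f"<{len(vector)}f", *vector)


def unpack(blob: bytes) -> list[float]:
    """Inverse of pack(): read float32 values back from a BLOB."""
    return list(struct.unpack(f"<{len(blob) // 4}f", blob))


def create_store(conn: sqlite3.Connection) -> None:
    conn.execute(
        "CREATE TABLE IF NOT EXISTS docs (content TEXT NOT NULL, embedding BLOB NOT NULL)"
    )


def add_doc(conn: sqlite3.Connection, content: str, vector: list[float]) -> None:
    conn.execute(
        "INSERT INTO docs (content, embedding) VALUES (?, ?)",
        (content, pack(vector)),
    )


def nearest(conn: sqlite3.Connection, query: list[float], k: int = 3) -> list[str]:
    """Return the k stored contents closest to `query` (full table scan)."""
    rows = conn.execute("SELECT content, embedding FROM docs").fetchall()
    scored = [(math.dist(query, unpack(blob)), content) for content, blob in rows]
    return [content for _, content in sorted(scored)[:k]]
```

A full scan like this is fine for a few thousand chunks. A dedicated extension moves the distance computation into SQLite itself, which is the part that matters as the table grows.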
Very happy to see this extension already out. I tried some of the previous alpha versions, and it is much easier to use and integrate than the previous sqlite-vss extension. Kudos to the creator.
I originally added sqlite-vss (your original vector search implementation) to LangChain as a vector store. Do you think this one is mature enough to add to LangChain, or should I wait a bit?
Love your work by the way, I have been using sqlite-vss on a few projects already.
Hey really cool to see re sqlite-vss+langchain! You could try a langchain integration now, there's an (undocumented) sqlite-vec pypi package you can install that's similar to the sqlite-vss one. Though I'd only try it for dev stuff now (or stick to alpha releases), but things will be much more stable when v0.1.0 comes out. Though I doubt the main SQL API (the vec0 table) syntax will change much between now and then.
Author of the article here. I just went through your website, and I cannot believe I had never heard of Mojeek. I'll probably have a go at your API eventually.