Looks cool! Any advantages over the MiniLM model? It seems better on most MTEB tasks, but I'm wondering if maybe inference speed or something else is better here.


MiniLM is a better embedding model. This model does not perform attention calculations or use a deep learning framework after training, so you won't get the contextual benefits of transformer models with this one.

It's not meant to be a state-of-the-art model, though. I've imposed pretty limiting constraints in order to keep dependencies, size, and hardware requirements low, and speed high.

Even for a word embedding model it's quite lightweight, as those have much larger vocabularies and are typically a few gigabytes.
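For contrast, here is a minimal sketch of the kind of attention-free inference a static word embedding model does: a plain vocabulary lookup plus mean pooling. The vectors.txt file format is hypothetical (one word followed by its vector per line), not this project's actual format:

    # Static word-embedding lookup + mean pooling: no attention, and no
    # deep learning framework needed at inference time.
    import numpy as np

    def load_vectors(path):
        vectors = {}
        with open(path) as f:
            for line in f:
                if not line.strip():
                    continue
                word, *values = line.split()
                vectors[word] = np.array(values, dtype=np.float32)
        return vectors

    def embed(text, vectors):
        # Average the vectors of the words we know; unknown words are skipped.
        hits = [vectors[w] for w in text.lower().split() if w in vectors]
        return np.mean(hits, axis=0) if hits else None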


Which ones do use attention? Any recommendations?


Depends immensely on use case: What are your compute limitations? Are you fine with remote code? Are you doing symmetric or asymmetric retrieval? Do you need support in one language or many? Do you need to work on just text, or also audio, video, and images? Are you working in a specific domain?

A lot of people pick models based purely on one or two benchmarks and wind up viewing embedding-based projects as a failure.

If you do answer some of those I’d be happy to give my anecdotal feedback :)


Sorry, I wasn’t clear. I was speaking about utility models/libraries to compute things like meaning similarity with not just token embeddings but with attention too. I’m really interested in finding a good utility that leverages the transformer to compute “meaning similarity” between two texts.


Most current models are transformer encoders that use attention. I like most of the options that ollama provides.

I think this one is currently at the top of the MTEB leaderboard, though it has large-dimension vectors and is a multi-billion-parameter model: https://huggingface.co/nvidia/NV-Embed-v1
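To make the use case concrete, a minimal sketch of computing "meaning similarity" between two texts with an attention-based encoder (assuming the sentence-transformers library; all-MiniLM-L6-v2 is just a placeholder model choice, not a specific recommendation):

    # Encode two texts with a transformer encoder (attention included)
    # and compare them with cosine similarity.
    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

    a = model.encode("The cat sat on the mat.", convert_to_tensor=True)
    b = model.encode("A feline rested on the rug.", convert_to_tensor=True)

    print(util.cos_sim(a, b).item())  # closer to 1.0 = more similar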


Looks like the advantage is the size of the model itself: more lightweight and faster. MiniLM is 80 MB, while the smallest one here is 16 MB.


MiniLM isn't optimized to be as small as possible, though, and it's kind of dated. It was trained on a tiny number of similarity pairs compared to what we have available today.

As of the last time I did it, in 2022, MiniLM could be distilled to 40 MB with only limited loss in accuracy, and so could paraphrase-MiniLM-L3-v1 (down to 21 MB), by reducing the dimensions by half or more and learning a custom projection matrix (optionally including domain-specific or more recent training pairs). I imagine today you could get it down to 32 MB (i.e., project to ~156 dimensions) without accuracy loss.
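A minimal sketch of that dimension-reduction idea, assuming sentence-transformers and scikit-learn (paraphrase-MiniLM-L3-v2 stands in for the model above, and the target dimension and corpus placeholder are assumptions, not the exact 2022 recipe):

    # Fit a PCA projection on sample embeddings, then attach it to the
    # model as a Dense layer so encode() emits smaller vectors.
    import torch
    from sklearn.decomposition import PCA
    from sentence_transformers import SentenceTransformer, models

    model = SentenceTransformer("sentence-transformers/paraphrase-MiniLM-L3-v2")

    train_sentences = ["..."]  # your (optionally domain-specific) corpus;
                               # needs at least target_dim sentences for PCA
    embeddings = model.encode(train_sentences, convert_to_numpy=True)

    target_dim = 192  # half of the original 384 dimensions
    pca = PCA(n_components=target_dim)
    pca.fit(embeddings)

    dense = models.Dense(
        in_features=model.get_sentence_embedding_dimension(),
        out_features=target_dim,
        bias=False,
        activation_function=torch.nn.Identity(),
    )
    dense.linear.weight = torch.nn.Parameter(
        torch.tensor(pca.components_, dtype=torch.float32)
    )
    model.add_module("projection", dense)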


What are some recent sources for high quality similarity pairs?


Sensational title that misrepresents the message of the paper:

> However, when conducting more targeted automatic evaluations, we found that the imitation models close little to none of the large gap between LLaMA and ChatGPT. In particular, we demonstrate that imitation models improve on evaluation tasks that are heavily supported in the imitation training data. On the other hand, the models do not improve (or even decline in accuracy) on evaluation datasets for which there is little support. For example, training on 100k ChatGPT outputs from broad-coverage user inputs provides no benefits to Natural Questions accuracy (e.g., Figure 1, center), but training exclusively on ChatGPT responses for Natural-Questions-like queries drastically improves task accuracy.

Even if this isn't the way to replicate ChatGPT's performance across all tasks, it seems to work quite well on whichever tasks are covered by the imitation training data. That is still a big win.

Later on, the paper shows this also holds for factual correctness (leaving aside the argument over whether this is the right approach to factuality):

> For example, training on 100k ChatGPT outputs from broad-coverage user inputs provides no benefits to Natural Questions accuracy (e.g., Figure 1, center), but training exclusively on ChatGPT responses for Natural-Questions-like queries drastically improves task accuracy.


To be fair, this paper has been made obsolete in its entirety by recent research. It's not really the authors' fault, but folks need to start publishing faster, as posters or something, if they want their work to stay relevant.

A better title, knowing what we know now, might be "To outperform GPT-4, do more than imitate."


Link to said research?



And yet research on this topic suggests the opposite: "These days everyone seems to think that 'planting trees' is an important solution to the climate crisis. They're mostly wrong, and in this paper we explain why. Instead of planting trees, we need to talk about people managing landscapes." [1]

[1] https://twitter.com/ForrestFleisch1/status/13062214459331297...


That paper is in almost 100% agreement with what I'm stating and the point that I'm making. The issue is about biogeochemical cycling, and limiting the view of what a forest is to a set of trees, as opposed to a set of processes, is a blocking factor in most people's understanding of forests' potential for increased carbon storage.

The broader issue I would also cite is that we have an extraordinarily limited view of how these processes operate in most ecosystems, because they are massively understudied.


I don't think the paper you posted disagrees with OP.

Tree planting can be a reasonable force for good in dealing with climate change; however, it needs to be done in such a way that the local ecosystem (both currently and in the targeted "restored" state) is sustainable and able to survive with minimal human management in the long term.

And importantly, tree planting is only a viable form of carbon offsetting as long as it isn't harming or displacing the local communities.

A good example of this is Tentree/Veritree's efforts. Restoring the heavily logged Mangrove forests on the coasts of Kenya, Madagascar, and in Indonesia recreates the local ecosystems for animals, fish, insects, and plants that live in marshes while also rebuilding the natural sea wall that protects inland areas from flooding due to weather. It's a good carbon sequestration project while more importantly serving to repair local ecosystems and reduce the impacts of further climate change on the local residents. Importantly these projects also focus on educating the locals on responsible forest management so that they can continue to harvest lumber for construction purposes without impacting the ecosystem or the sustainability of the recreated marshlands.

You can have good tree planting, but it's more than just sticking saplings into the ground. Plenty of projects do really good work with the money they get toward forest restoration, and most importantly, these projects don't serve solely to offset environmental costs in Western society but rather to repair ecosystems in disadvantaged regions and help protect those communities against the oncoming threat of climate change.

TLDR: Tree planting as "more trees == less carbon" is obviously ineffective, but in the bigger picture tree-planting efforts can really make a difference, as long as you put a modicum of research into which projects you are funding.


Thank you for making the point more clearly. Something I want to highlight is how little we actually know about how carbon cycling works in most of the Earth's ecosystems. Even the uncertainty around carbon residency is something we understand very poorly in a broad geographic context. The answer is that we really don't know what most ecosystems' potential for carbon sequestration is; and the criticism I'm making is that just because we have high uncertainty around a system doesn't mean we shouldn't consider it a viable path, especially when it's probably the easiest thing we can implement, with a wide range of well-established co-benefits.


My main qualm with the parent comment was this in particular: "Number one, is that forests work as long term carbon storage and sinks." Even with the surrounding context, it sounded like this strategy would just "work."

The example you give with mangroves is a great one that does in fact work. Pragmatically and historically, however, most of the attempts have not, due to mismanagement and other unseen complications.

Seeing the further comments, I see the point the parent was making is more about the first principles of forests as carbon sinks, not about implementation.


> The example you give with mangroves is a great one that does in fact work. Pragmatically and historically, however, most of the attempts have not, due to mismanagement and other unseen complications.

You need to cite this if you're going to keep making that statement. I'm not arguing that markets are well implemented, that common practice is well defined or even very useful, or that we're even prioritizing the right outcomes, but the notion that forests don't sequester carbon over significant time horizons is 100% false. Global forests represent the most significant, most straightforward opportunity for removing carbon from the atmosphere, no debate. Yes, we need to do better at managing them from a climate change perspective (good fire, biodiversity, water), but there is simply no better option right now for doing any kind of meaningful drawdown of carbon from the atmosphere than forests.


Do you have any good write-ups on this? I was always interested in how this space is still locked up.


I think the reaction from software devs to how Copilot uses their code for ML is interesting, given that ML companies have been doing this with all other forms of produced content: texts, posts, messages, photo captions, etc. And most likely even less care went into adhering to laws or ethics there. Yes, code has licenses and thus more distinct legal ramifications, but on the other side are people who don't really understand that every time they interact with software or produce some content, everything is gathered and harnessed to power all these companies.


The Chief Medical Officer of Intel talking about what hacks on people's genetic information look like.

https://www.youtube.com/watch?v=HKQDSgBHPfY


Is sitting in on a housing court just allowed? How does one go about doing this?


Just go. Almost all courts, with the exception of family court, are open and you can just walk in and sit.

If you yourself are going to be involved in a case it's a good idea to sit in a few trials so you get an idea of how things go.


Usually, yes! I’m not sure how it works with remote hearings during Covid-19 but I imagine some courts are broadcasting live streams of the proceedings while others are invitation-only. In normal circumstances where courts are open you can just show up and sit in the gallery to observe. Most housing courts have an “eviction day” once a week. If you call the courthouse ahead of time the clerk can tell you what day of the week is eviction day.


Great to see you working on this!

I was wondering if you could estimate what it would cost to have always-on recording of all these radio conversations, the cost of running this speech-to-text ML, and the cost of labeling the data.

I think having these rough estimates will make donations easier for people.


I've got a year+ of the Ohio MARCS-IP site in Hamilton County Ohio recorded. Let me know if you need some data -- I'd be more than happy to get you the dump.

(trunk-recorder + rdio scanner).

The UI is:

https://cvgscan.iwdo.xyz for the live stuff, but, let me know if you're interested in the data -- my email is in my profile


Great question! Unfortunately the long-term costs aren't clear yet. Right now I'm using Google Speech as a bootstrapping technique, but that is prohibitively expensive to run long term.

I think once my models are viable enough to do this at scale, the cost will basically be the cost of running a dedicated server per N streams. So $100-300/mo per N streams, where N could roughly be at least 100 concurrent streams per server? I will know this better in "stage 2", where I'm attempting to scale this up. It's also a fairly distributable problem, so I could look into doing it folding@home style, or even have the stream's originator run transcription in some cases to keep costs down.
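For a rough sense of what that implies per stream (using only the hypothetical numbers above, not measurements):

    # Back-of-envelope: dedicated server at $100-300/mo serving ~100 streams.
    server_cost_low, server_cost_high = 100, 300  # USD per month, assumed
    streams_per_server = 100                      # assumed capacity

    low = server_cost_low / streams_per_server
    high = server_cost_high / streams_per_server
    print(f"~${low:.2f}-${high:.2f} per stream per month")  # ~$1.00-$3.00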


Something I came across for this: https://app.astralapp.com/


https://astralapp.com/ is probably a better link; the project is explained there.

