
what's the basis for converting hours of neural data to a number of tokens? is that counting the paired text tokens?

edit: oops sorry misread - the neural data is tokenised by our embedding model. the number of tokens per second of neural data varies and depends on the information content.
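
to make 'varies with information content' a bit more concrete, here's a toy sketch of variable-rate token budgeting - made-up numbers and a crude variance proxy, not our actual embedding model:

    # toy sketch: give each 1 s window of signal a token budget that grows
    # with a crude "information" proxy (variance). NOT the real tokenizer.
    import numpy as np

    def tokens_per_window(signal, fs=1000, window_s=1.0, base=2, extra=6):
        win = int(fs * window_s)
        budgets = []
        for i in range(len(signal) // win):
            chunk = signal[i * win:(i + 1) * win]
            info = np.tanh(np.std(chunk))          # squashed into [0, 1)
            budgets.append(base + int(round(extra * info)))
        return budgets

    rng = np.random.default_rng(0)
    quiet = rng.normal(0, 0.1, 5000)   # low-activity stretch -> ~base tokens/s
    busy = rng.normal(0, 2.0, 5000)    # high-activity stretch -> more tokens/s
    print(tokens_per_window(np.concatenate([quiet, busy])))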

we would be so down to buy s3 junkyard tbh - we were going around begging various storage clouds to offer us this before giving up and building it ourselves


I think it's <2x more drives than needed, not 20x (24 vs 14TB), but the racks holding the drives could've been denser. Around the same cost in any case, and our colo doesn't charge for space, so it's not a big deal - we were just going with what we were familiar with, but denser racks are something to try.


Oops sorry, my bad! Great to read all about it - good luck with the project.


yeah we weren't sure about including that number, esp whether it counts all the image attachments, but in any case it's at least around the right reference class for the largest text data operations.


yeah that's why we started paying people around the second half - not super clearly stated in the blogpost, but the novelty definitely wore off with plenty of drives left to stack, so we switched strategies to get it done in time.

I think everyone who showed up for a couple hours as part of the party had a good time tho, and the engraved hard drives we were giving out weren't cheap :p


it's not just in sf, it's across the street from our office

this has been incredibly nice for our first hardware project; if we ever expand substantially then we'd def care more about the colo costs.


yeah it's on the wishlist to try


Thanks to op for actually replying to the various comments here - really appreciate that (and for the initial post, of course!)


our training stack doesn't make strong assumptions about data integrity, it's chill
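
roughly the spirit of it: checksum each record and just drop anything that fails (toy sketch with a made-up record format, not our real pipeline):

    # toy sketch: tolerate corrupt records instead of assuming perfect storage
    import zlib

    def iter_good_records(records):
        dropped = 0
        for payload, stored_crc in records:
            if zlib.crc32(payload) == stored_crc:
                yield payload          # intact sample, feed it to training
            else:
                dropped += 1           # corrupt sample, silently skip
        if dropped:
            print(f"skipped {dropped} corrupt records")

    good = b"neural chunk"
    recs = [(good, zlib.crc32(good)), (b"bit-rotted chunk", 12345)]
    print(list(iter_good_records(recs)))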


just general research work. Once the recipes are efficient enough, the modality is a smaller detail.

On the product side we're trying to orient more towards 'productive work assistant' rather than the default pull of audio models towards being an 'ai friend'.


yeah this

it means that even after negotiating much better terms than baseline, we run into the fact that cloud providers just have a higher cost basis for the more premium/general product.

