
what's the basis for converting hours of neural data to a number of tokens? is that counting the paired text tokens?

edit: oops sorry misread - the neural data is tokenised by our embedding model. the number of tokens per second of neural data varies and depends on the information content.
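
to make 'varies with information content' a bit more concrete, here's a toy sketch of variable-rate token budgeting - made-up numbers and a crude variance proxy, not our actual embedding model:

    # toy sketch: give each 1 s window of signal a token budget that grows
    # with a crude "information" proxy (variance). NOT the real tokenizer.
    import numpy as np

    def tokens_per_window(signal, fs=1000, window_s=1.0, base=2, extra=6):
        win = int(fs * window_s)
        budgets = []
        for i in range(len(signal) // win):
            chunk = signal[i * win:(i + 1) * win]
            info = np.tanh(np.std(chunk))          # squashed into [0, 1)
            budgets.append(base + int(round(extra * info)))
        return budgets

    rng = np.random.default_rng(0)
    quiet = rng.normal(0, 0.1, 5000)   # low-activity stretch -> ~base tokens/s
    busy = rng.normal(0, 2.0, 5000)    # high-activity stretch -> more tokens/s
    print(tokens_per_window(np.concatenate([quiet, busy])))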

we would be so down to buy s3 junkyard tbh - we were going around begging various storage clouds to offer us this before giving up and building it ourselves


I think it's <2x more drives than needed, not 20x (24 vs 14TB), but the racks holding the drives could've been denser. Around the same cost in any case, and our colo doesn't charge for space, so it's not a big deal - we were just going with what we were familiar with, but denser racks are something to try.


Oops sorry, my bad! Great to read all about it - good luck with the project.


yeah we weren't sure about including that number, esp whether it counts all the image attachments, but in any case it's at least around the right reference class for the largest text data operations.


yeah that's why we started paying people around the second half - not super clearly stated in the blogpost, but the novelty definitely wore off with plenty of drives left to stack, so we switched strategies to get it done in time.

I think everyone who showed up for a couple hours as part of the party had a good time tho, and the engraved hard drives we were giving out weren't cheap :p


it's not just in sf, it's across the street from our office

this has been incredibly nice for our first hardware project; if we ever expand substantially then we'd def care more about the colo costs.


yeah it's on the wishlist to try


Thanks to op for actually replying to the various comments here - really appreciate that (and for the initial post, of course!)


our training stack doesn't make strong assumptions about data integrity, it's chill
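
roughly the spirit of it: checksum each record and just drop anything that fails (toy sketch with a made-up record format, not our real pipeline):

    # toy sketch: tolerate corrupt records instead of assuming perfect storage
    import zlib

    def iter_good_records(records):
        dropped = 0
        for payload, stored_crc in records:
            if zlib.crc32(payload) == stored_crc:
                yield payload          # intact sample, feed it to training
            else:
                dropped += 1           # corrupt sample, silently skip
        if dropped:
            print(f"skipped {dropped} corrupt records")

    good = b"neural chunk"
    recs = [(good, zlib.crc32(good)), (b"bit-rotted chunk", 12345)]
    print(list(iter_good_records(recs)))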


just general research work. Once the recipes are efficient enough, the modality is a smaller detail.

On the product side we're trying to orient more towards 'productive work assistant' rather than the default pull of audio models towards being an 'ai friend'.


yeah this

it means that even after negotiating much better terms than baseline, we run into the fact that cloud providers just have a higher cost basis for the more premium/general product.

