
To me, this feels less about Rust and more about moving away from copyleft.


This is the truth of it. They want to take everything proprietary to make more money off of it.


One of the more subtle aspects of retargeting GPU code to run on the CPU is the presence of fine-grained (read: block-level and warp-level) explicit synchronization mechanisms on the GPU. The same mechanisms are not available in CPU land, so additional care has to be taken to handle this. One example of work that tries this is https://arxiv.org/pdf/2207.00257 .

Interestingly, in the same work, and contrary to what you'd expect, transpiling GPU code to run on the CPU gives ~76% speedups on HPC workloads compared to a hand-optimized multi-core CPU implementation on Fugaku (a CPU-only supercomputer), after accounting for these differences in synchronization.
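
To make concrete what "handling this" looks like, here's a rough sketch (my own illustration, not the paper's actual code) of the standard trick: each GPU thread block becomes an ordinary loop on one CPU thread, and every __syncthreads() becomes a point where that loop is split in two, so all the work before the barrier finishes before any work after it starts.

    // GPU kernel (for reference):
    //   __global__ void k(float* out, const float* in) {
    //       __shared__ float s[BLOCK];
    //       s[threadIdx.x] = in[blockIdx.x * BLOCK + threadIdx.x];
    //       __syncthreads();                      // barrier
    //       out[blockIdx.x * BLOCK + threadIdx.x] = s[BLOCK - 1 - threadIdx.x];
    //   }
    //
    // CPU version after splitting the thread loop at the barrier:
    constexpr int BLOCK = 256;

    void k_cpu(float* out, const float* in, int num_blocks) {
        for (int b = 0; b < num_blocks; ++b) {    // one block at a time
            float s[BLOCK];                       // plays the role of shared memory
            for (int t = 0; t < BLOCK; ++t)       // code before __syncthreads()
                s[t] = in[b * BLOCK + t];
            // the barrier is gone: it became the boundary between the two loops
            for (int t = 0; t < BLOCK; ++t)       // code after __syncthreads()
                out[b * BLOCK + t] = s[BLOCK - 1 - t];
        }
    }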


A single CPU thread should be treated as basically a warp executing 4 SIMD vectors in parallel. The naïve implementation of __syncthreads() would be an atomic mechanism shared across all the threads that make up a GPU workgroup.

Looks like this entire paper is just about how to move/remove these barriers.
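
As a rough sketch of that naive mechanism (my own illustration, assuming for simplicity one OS thread per GPU thread rather than one per warp/SIMD lane), this is roughly what __syncthreads() degenerates into if you keep it around on the CPU, and why it's worth removing:

    #include <barrier>
    #include <thread>
    #include <vector>

    constexpr int BLOCK = 256;               // workgroup size

    int main() {
        std::barrier sync(BLOCK);            // plays the role of __syncthreads()
        std::vector<float> s(BLOCK);         // plays the role of __shared__ memory
        std::vector<std::thread> threads;

        for (int t = 0; t < BLOCK; ++t) {
            threads.emplace_back([&, t] {
                s[t] = float(t);             // code before the barrier
                sync.arrive_and_wait();      // __syncthreads(): every thread stalls here
                float v = s[BLOCK - 1 - t];  // code after the barrier
                (void)v;
            });
        }
        for (auto& th : threads) th.join();  // 256 OS threads for one workgroup: ouch
    }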


Yes, but in practice I believe people spam __syncthreads() in GPU kernels just to ensure correctness. There is value in statically proving that you don't need a synchronization instruction at a certain point. Doubly so in the transpilation case, when you find your naive __syncthreads() being called multiple times because it is present in the original CUDA code (or MLIR, in this case).

An interesting add-on to me would be the handling of conditionals. Because newer GPUs have independent thread scheduling, which is not present in older ones, you have to wonder what the desired behaviour is if you are using CPU execution as a debugger of sorts (or are just GPU-poor). It'd be super cool to expose those semantics as a compiler flag for the transpiler, letting me debug code as if it ran on an ancient GPU like a K80 for fast local debugging.

But the ambitious question here is this: if you take existing GPU code, run it through a transpiler, and generate better code than handwritten OpenMP, do you need to maintain an OpenMP backend for the CPU in the first place? It'd be better to express everything in a richer parallel model with support for nested synchronization, right? And let the compiler handle the job of inter-converting between parallelism models. It's like saying that if PyTorch 2.0 generates good Triton code, we could just transpile that to CPUs and get rid of the CPU backend. (Of course, Triton doesn't support all patterns, so you would fall back to ATen, and this kind of goes for a toss.)
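
For contrast, the handwritten-OpenMP side of that comparison is usually just the flat model below (a minimal sketch, nothing specific to the paper): one parallel loop, with the only synchronization being the implicit barrier at the end of it.

    // The "flat" CPU model: one parallel loop, implicit barrier at its end.
    void saxpy(float* y, const float* x, float a, int n) {
        #pragma omp parallel for
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
        // implicit barrier here: no nested, block-level sync points inside the loop
    }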


> Because newer GPUs have independent thread scheduling

I assume you mean at the warp level. The threads are not independent, and there are many shaders you can write to prove this fact.

I agree that statically proving that a sync is unnecessary can only be a good thing.

The question of why not simply take your GPU code and transpile it to CPU code is really a question of what you originally lost in writing the GPU code to begin with. If you are talking about ML work, most of that is expressed as a bunch of matrix operations that naturally translate to GPUs with low impedance. But other kinds of operations might be better expressed directly as CPU code (any serial operations). And for CPU to GPU, the loss, as you have pointed out, is probably in the synchronization.


Might be a bit out of context, but isn't the TPU also optimized for low-latency inference? (Judging by the original TPU architecture paper here: https://arxiv.org/abs/1704.04760.) If so, does Groq actually provide hardware support for LLM inference?


Jonathan Ross, an author on that paper, is Groq's founder and CEO. Groq's LPU is a natural continuation of the breakthrough ideas he had when designing Google's TPU.

Could you clarify your question about hardware support? Currently we build out our hardware to support our cloud offering, and we sell systems to enterprise customers.


Thanks for the quick reply! About hardware support, I was wondering if the LPU has a hardware instruction to compute the attention matrix similar to the MatrixMultiply/Convolve instruction in the TPU ISA. (Maybe a hardware instruction which fuses a softmax on the matmul epilogue?)
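
To spell out what I mean by fusing a softmax on the matmul epilogue, here is a purely illustrative scalar sketch (not anything from the TPU or LPU ISA): the softmax over each row of scores is applied before the row is ever written out, rather than materializing S = QK^T and reading it back.

    #include <algorithm>
    #include <cmath>
    #include <vector>

    // Illustrative only: attention scores with softmax fused into the matmul epilogue.
    // Q is n x d, K is m x d (row-major); P is n x m with P = softmax(Q K^T) row-wise.
    void scores_softmax_fused(const float* Q, const float* K, float* P,
                              int n, int m, int d) {
        std::vector<float> row(m);               // stand-in for on-chip storage
        for (int i = 0; i < n; ++i) {
            float mx = -INFINITY;
            for (int j = 0; j < m; ++j) {        // matmul: one row of Q K^T
                float s = 0.f;
                for (int k = 0; k < d; ++k) s += Q[i * d + k] * K[j * d + k];
                row[j] = s;
                mx = std::max(mx, s);
            }
            float denom = 0.f;
            for (int j = 0; j < m; ++j) {        // epilogue: softmax before writing out
                row[j] = std::exp(row[j] - mx);
                denom += row[j];
            }
            for (int j = 0; j < m; ++j) P[i * m + j] = row[j] / denom;
        }
    }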


We don't have a hardware instruction but we do have some patented technology around using a matrix engine to efficiently calculate other linear algebra operations such as convolution.


Are you considering targeting the consumer market? There are a lot of people throwing $2k-$4k into local setups, and they primarily care about inference.


At the moment we're concentrating on building out our API and serving the enterprise market.


The card is pretty useful to me as a first card since it has no foreign transaction fee.


Agreed. It became my primary card when I recently spent three months in London, because of that and the ubiquity of Apple Pay.


It isn't a real download, because you don't have access to the raw file.


It solves OP's problem. Why is that bad?


This is going to sound mean, but I don't intend it that way; a clear question calls for a clear answer.

The reason your solution is bad is that it only solves the problem for certain imagined values of 'problem'.

We don't know what OP's playback user story is (is it a Raspberry Pi? a Librem 5?), so there's no way to know that the user's problem is in fact solved, as you claim.

Your solution makes assumptions, and there's no way to know if they're reasonable, because this is HN, not Best Buy, and some of us are on some pretty interesting (and DRM-free) hardware. That's why it's bad.


My comment started with "As a last resort...", meaning if none of the DL scripts worked for him/her. I didn't advocate Premium as the primary solution.


Does YT Premium work offline? The OP says they will be going without internet, which is why they need to download them.


Yeah, with Premium, downloaded videos can be watched offline, but you need to connect to the internet every 30 days.

It’s a nice no-fuss solution if you just want to watch some videos offline (as mentioned, it’s not suitable for archival) and the device is iOS/Android.


I think I'd prefer to have the plain video file.


KRAZAM is really underrated. It is one of those few YouTube channels that produce really good content and stories relative to their size.



How does oneAPI/SYCL compare to CUDA? We certainly need an alternative to OpenCL, but every day, I can't help but notice the widening gulf between CUDA and any other GPGPU API out there.


Worse, it only does C++, while CUDA is polyglot, and the tooling isn't at the level of something like Nsight.
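
For reference, this is what the C++-only, single-source style looks like; a minimal SYCL 2020 sketch with unified shared memory (untested, just to show the shape of the API):

    #include <sycl/sycl.hpp>

    int main() {
        constexpr int N = 1 << 20;
        sycl::queue q;                                   // picks a default device (GPU or CPU)

        float* x = sycl::malloc_shared<float>(N, q);     // USM: visible to host and device
        float* y = sycl::malloc_shared<float>(N, q);
        for (int i = 0; i < N; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

        // Same source file, no separate kernel language: the lambda is the kernel.
        q.parallel_for(sycl::range<1>(N), [=](sycl::id<1> i) {
            y[i] = 2.0f * x[i] + y[i];
        }).wait();

        sycl::free(x, q);
        sycl::free(y, q);
    }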


I really don't have anything to say to the OP, but I wonder (being in a similar situation): with the recent push towards eSIM, will SMS-based 2FA become more problematic?

If a phone with an eSIM dies and you need some kind of OTP, I wonder how you'll receive it. You can't exactly 'transplant' the SIM into another phone.


SMS 2FA is just a terrible idea. I advise everyone to use something like TOTP, but also to store the TOTP seed as well as recovery codes in, e.g., a KeePass database.

You may want to use a different database than the one with the rest of your passwords. Sync these databases with something like Syncthing, which is completely controlled by you, can do untrusted encrypted nodes, and can not only sync but also take occasional backups for you.

Also, don't forget to put the master password of your KeePass databases into someone else's database. Someone you trust in person, e.g. a family member.

It may be a quite complicated setup, but once it's set up, it works and doesn't require much effort to maintain. If you get a new device, simply add a new Syncthing node.

