“It depends”: what’s your prior experience, what kind of roles interest you, and how big is the gap between what you have (plus a little ML knowledge and some side projects) and what those roles require?
I’d argue there’s a big need for people with solid fundamental CS, sysadmin, infra skills who can bridge the gap into ML practitioner/researcher understanding. Applications or inference generally are probably easiest to break into, especially if you already have service knowledge. If you want to work on distributed training or kernel/model optimization, you probably need to prove your chops there.
Neoclouds, startups in the AI space, maybe hw vendors are probably good places to look.
As someone in the space, this ticks a lot of boxes: Kubernetes-native, strong isolation, a Python SDK (ideal for ML scenarios). devmapper is a nice out-of-the-box approach.
Glancing at the README, is your business model technical support? Or what's your plan with this?
Anything interesting to share around startup time for large artifacts, scaling, passing through persistent storage (or GPUs) to these sandboxes?
Curious what things like 'Multi-node cluster capabilities for distributed workloads' mean exactly? Inter-VM networking?
No business model short-term. My goal is broad adoption, 100% open-source.
By multi-node I mean that so far I only support one k8s node, i.e. one machine, but I'm adding support for multiple soon. Still, on 20 CPUs I can run 50+ VM pods with fractional vCPU limits.
GPU passthrough: not possible today because I use Firecracker as the VMM. It's on the roadmap: add support for QEMU, which will make GPU passthrough possible.
Inter-VM networking: already possible on a single node. 1 VM = 1 pod, and you can have multiple pods per node (have a look at utils/stress-test.sh). Right now I default to deny-all ingress for safety (by default, k8s allows inter-pod communication), but I can make ingress configurable.
Startup time: a second, or a few seconds, depending on the base image (alpine, ubuntu, etc.) and whether you use a before_script (which I execute before the network lockdown).
Large artifacts: you can configure the resources allocated to a VM pod in the sandbox config; it basically uses k8s resource limits.
> No business model short-term. My goal is broad adoption, 100% open-source.
IMHO that's kind of a red flag. There's a happy path here where it's successful but stays low-maintenance enough that you just work on it in your spare time, or it takes off and gets community support, or you get sponsorships or such. But there's also an option where in a year or two it becomes your job and you decide to monetize by rug-pulling: announcing that actually paying the bills is more important than staying 100% open source. Not a dig at you, just something that's happened enough times that I get nervous when people don't have a plan and therefore don't have a plan to avoid the outcome that creates problems for users.
Sure, one day, if it really takes off, I could think about additionally offering a SaaS solution with paid enterprise features like SOC 2 compliance, RBAC, multiple supported clouds, etc. Why not. But I strongly believe that for it to be successful, it needs a strong open-source base. Then billing huge companies for compliance features or heavy usage makes sense, and that would support development of the open-source part too.
I like the Docker model, for instance: free for companies under 250 employees or $10m/y revenue.
In any case, it will always be open-source.
Those paid enterprise features wouldn't come from closed-sourcing anything: they would come from the compliance of a particular SaaS-offered infra setup that anybody else could reproduce. Just like Hugging Face.
The author got hired by Modular, the AI startup founded by the creators of LLVM and Swift, and is now working on the new language Mojo.
He’s been bringing a bunch of ideas from Vale to Mojo.
Oh nice! I just had an excuse to try Mojo via MAX inference, and it was pretty impressive. Basically on par with vLLM for some small benchmarks, with a bit of variance in TTFT and TPOT. Very cool!
Larger memory, weaker comms. You can optimize for this by doing things like increasing batch size and data parallelism instead of sharding schemes that need more comms.
At scale, training won’t be able to avoid comms entirely, while many models can fit in a single MI300 for serving.
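To make the tradeoff concrete, here's a back-of-envelope sketch in Rust (the model size, precision, and GPU count are assumptions for illustration, not measurements):

```rust
// Rough memory/comms tradeoff: data parallelism (DP) replicates the
// full model per GPU and pays one gradient all-reduce per step, while
// tensor-parallel (TP) sharding divides weights across GPUs but adds
// activation communication inside every layer.
fn main() {
    let params: f64 = 70e9;       // assumed 70B-parameter model
    let bytes_per_param = 2.0;    // bf16 weights
    let n_gpus = 8.0;

    let dp_gb_per_gpu = params * bytes_per_param / 1e9;          // full replica
    let tp_gb_per_gpu = params * bytes_per_param / n_gpus / 1e9; // sharded

    println!("DP: {dp_gb_per_gpu:.0} GB/GPU, comms: one gradient all-reduce per step");
    println!("TP: {tp_gb_per_gpu:.1} GB/GPU, comms: activation exchange in every layer");
}
```

With a big per-GPU memory pool, the replicated column is the one you can afford, which is exactly where a weaker interconnect hurts least.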
Can't speak to GCP specifically, but usually the issue is that these disks are host-attached and can't be migrated, so they need to be wiped on VM termination or migration -- that's when you lose data.
Reboots typically don't do anything special otherwise, unless they also trigger a host migration. GCP live migration does have some mention of support, though.
Note that stop/terminate via cloud APIs usually releases host capacity to other customers and will trigger a data wipe; a guest-initiated reboot typically will not.
Yeah, the system/application distinction feels somewhat superficial. The “multiple user spaces” inside a container thing sounds interesting (not sure what that means exactly), but it sounds more similar to a Kubernetes pod, except that maybe instead of different rootfs there’s some other isolation mechanism?
The "first" link (after the home button) on bbchallenge is the header bar link to https://bbchallenge.org/story which cites Aaronson in the first sentence (double first!). I would not describe it like OP for someone trying to find the actual link ;)
"One Collatz Coincidence", the 2nd story on the blog, also mentions Aaronson
That's going to be difficult because the language itself requires panic support to properly implement indexing, slicing, and integer division. There are checked methods that can be used instead, but to truly eliminate panics, the ordinary operators would have to be banned when used with non-const arguments, and this restriction would have to propagate to all dependencies as well.
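For a concrete sense of what that propagation means, here's a minimal sketch of the checked alternatives; every panicking operator has to be replaced one call site at a time:

```rust
// Panic-free versions of operations whose ordinary operators can panic.
// Each panic path becomes an explicit Option at the call site.
fn head(xs: &[i64]) -> Option<i64> {
    xs.first().copied() // instead of xs[0], which panics on an empty slice
}

fn div(a: i64, b: i64) -> Option<i64> {
    a.checked_div(b) // instead of a / b, which panics on b == 0 or i64::MIN / -1
}

fn sum(xs: &[i64]) -> Option<i64> {
    // instead of `+`, which panics on overflow in debug builds
    xs.iter().try_fold(0i64, |acc, &x| acc.checked_add(x))
}
```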
Yes, that’s right. The feature really wants compiler support for that reason. The simplest version wouldn’t be too hard to implement: every function just exports a flag saying whether it (or any of its callees) can panic. Then we have a nopanic keyword which emits a compiler error if the function (or any callee) can panic.
It would be annoying to use: as you say, you couldn’t even add regular numbers together or index into an array in nopanic code. But there are ways to work around it (like the wrapping types).
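For instance, the wrapping types give you ordinary arithmetic syntax back by defining overflow to wrap instead of panic (a sketch; nopanic itself is hypothetical):

```rust
use std::num::Wrapping;

// Wrapping arithmetic has no panic path: overflow is defined to wrap,
// so `+=` on Wrapping<u32> would be legal inside nopanic code.
fn checksum(data: &[u8]) -> u32 {
    let mut acc = Wrapping(0u32);
    for &b in data {
        acc += Wrapping(b as u32); // never panics, wraps on overflow
    }
    acc.0
}
```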
One problem is that implicit nopanic would add a new way to break semver compatibility in APIs. E.g., imagine a public API that just happens to not be able to panic. If the code is changed subtly, it could easily start panicking again. That could break callers, so it would have to be a major version bump. You’d probably have to require explicit nopanic at API boundaries (else assume all public functions from other crates can panic). And because of that, public APIs like std would need to be plastered with nopanic markers everywhere. It’s also not clear how that works through trait impls.
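A hypothetical illustration of that hazard: the signature never changes, but a small refactor reintroduces a panic path that an inferred nopanic would have advertised to callers:

```rust
mod v1_0 {
    // Genuinely panic-free: an implicit nopanic analysis would
    // advertise "cannot panic" to every caller.
    pub fn ratio(num: u64, den: u64) -> u64 {
        num.checked_div(den).unwrap_or(0) // no panic path at all
    }
}

mod v1_1 {
    // Same signature after an innocent-looking refactor, but `/`
    // panics when den == 0. Any caller relying on the inferred
    // nopanic guarantee breaks, making this silently semver-major.
    pub fn ratio(num: u64, den: u64) -> u64 {
        num / den
    }
}
```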
As far as I can tell, no_std doesn't change anything with regard to either the usability of panicking operators like integer division, slice indexing, etc. (they're still usable) or whether they panic on invalid input (they still do).
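A quick illustration as a plain no_std library crate; the panicking semantics come from the language and core, not from std:

```rust
#![no_std]

// no_std removes the standard library, not the language's panic paths:
// both of these still compile, and still panic (via core) on bad input.
pub fn first(xs: &[u32]) -> u32 {
    xs[0] // panics if xs is empty, exactly as with std
}

pub fn div(a: i32, b: i32) -> i32 {
    a / b // panics on b == 0 or i32::MIN / -1, exactly as with std
}
```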
The problem is false positives. Even if you can clearly see that some function will never panic (although it uses some feature which may panic), the compiler might not always see that. If the compiler says there are no panics, then there are no panics; but is it worth adding to the language if using it means mostly avoiding features that might panic?
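A toy example of such a false positive (a sketch; a real analysis could be smarter, but some conservatism is unavoidable):

```rust
// A human can see this never panics: `i` is always in bounds.
// But `v[i]` still compiles to an indexing operation with a panic
// branch, so a conservative "can any callee panic?" analysis must
// flag the whole function even though that branch is dead.
fn sum(v: &[u64]) -> u64 {
    let mut total: u64 = 0;
    for i in 0..v.len() {
        total = total.wrapping_add(v[i]); // bounds check can't actually fail
    }
    total
}
```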