I think a lot of these may have improved since your last experience with Keras. It's pretty easy to override the training loop and/or write a custom loss. The link below shows how to override the training/test step altogether; a custom loss is easier still, since you just define a new loss function or class.
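As a minimal sketch of the "new loss function" route: any callable with the `(y_true, y_pred)` signature works as a Keras loss. The weighting scheme below is made up purely for illustration.

```python
import tensorflow as tf

def asymmetric_mse(y_true, y_pred):
    # Penalize under-prediction (positive error) twice as hard
    # as over-prediction. The 2x factor is an arbitrary example.
    err = y_true - y_pred
    weights = tf.where(err > 0, 2.0, 1.0)
    return tf.reduce_mean(weights * tf.square(err))
```

You'd then pass it straight to `model.compile(loss=asymmetric_mse)`, no subclassing required.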
> - Keras's training loop assumes you can fit all the data in memory and that the data is fully preprocessed, which in the world of LLMs and big data is infeasible.
The TensorFlow backend has the excellent tf.data.Dataset API, which allows for out-of-core data and streaming preprocessing.
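For example, a pipeline can pull records from a generator (or from files on disk) so nothing has to fit in memory at once. The generator here is a toy stand-in for reading records off disk:

```python
import tensorflow as tf

def sample_generator():
    # Stand-in for streaming records from disk or a remote store.
    for i in range(100):
        yield [float(i)], [float(i) * 2]

ds = (tf.data.Dataset.from_generator(
          sample_generator,
          output_signature=(tf.TensorSpec((1,), tf.float32),
                            tf.TensorSpec((1,), tf.float32)))
      .batch(10)
      .prefetch(tf.data.AUTOTUNE))  # overlap preprocessing with training
```

`model.fit(ds)` consumes this directly, pulling one batch at a time.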
That's a fair implementation of a custom loss. Hugging Face's Trainer in transformers suggests a similar implementation, although theirs has less boilerplate.
> To return to the point about image augmentations being hard to add: It's so easy to explain what your training code should do "Just distort the hue a bit" and there seem to be operations explicitly for that: https://www.tensorflow.org/api_docs/python/tf/image/adjust_h.... but when you go to train with them, you'll discover that backpropagation isn't implemented, i.e. they break in training code.
Why not do the data augmentation during preprocessing, so that the transformations don't have to be differentiable? I.e., map over a tf.data.Dataset with the transformation (and append the result to the original dataset).
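A sketch of what I mean, on toy random "images" (shapes and the hue delta are arbitrary): because the augmentation runs in the input pipeline rather than inside the model, backprop never sees it.

```python
import tensorflow as tf

# Toy batch of 8 RGB images, 16x16, values in [0, 1].
images = tf.random.uniform((8, 16, 16, 3))
ds = tf.data.Dataset.from_tensor_slices(images)

# Map the (non-differentiable) hue shift over the dataset...
augmented = ds.map(lambda img: tf.image.adjust_hue(img, delta=0.1))

# ...and append the augmented copies to the original stream.
combined = ds.concatenate(augmented)
```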
This is cool - might be worth training a simple discriminator model to identify your utterances, and then you can use the plug-and-play language model (PPLM - https://github.com/huggingface/transformers/blob/master/exam...) to generate utterances modeling a specific speaker without special tokens. Could also take less time to fine-tune.
Additionally, the Nim code was not compiled with many optimizations turned on! (I.e., without -d:release).
$ nim c -o:base64_test_nim -d:danger --cc:gcc --verbosity:0 base64_test.nim
$ nim c -o:json_test_nim -d:danger --cc:gcc --verbosity:0 json_test.nim
IIRC the -d:danger flag is necessary for some optimizations (like disabling bounds checking) but -d:release is necessary for most optimizations to be enabled.
Edit: It appears I'm incorrect, -d:danger does imply -d:release in newer Nim versions.
It does imply release in the latest Nim versions; `-d:release` still has some checks enabled, and `-d:danger` is full-on release mode with all possible checks disabled.
I'd recommend Elements of Statistical Learning or ISLR instead, if you want to start with a theory-heavy introduction. Most of what you need for DS, I think, is better learned through projects or on the job.
Also, as others have mentioned, some of the most important skills for DS are data munging, data "presentation", and soft skills like managing expectations / relationships / etc.
I would not recommend this book if you want to get into DS with the idea that, "I'll read this and then I'll know everything I need to." It's too dense and academically focused, and it would probably be discouraging to read it all before getting your feet wet.
Congrats guys!!! This has been much-anticipated and I'm very excited. I personally wish that the owned reference stuff (https://nim-lang.org/araq/ownedrefs.html) had been part of 1.0, but I think that at some point shipping 1.0 >> everything else.
I've been following (and evangelizing) Nim for a while, this will make it easier to do so.
I don't quite understand the definition of "memory safety" in that document. If deallocation can cause other objects to end up pointing to the wrong thing and the wrong data, how is that different from memory corruption?
If your filesystem suddenly starts returning the contents of notepad.exe when asked for user32.dll and vice versa, is that not filesystem corruption?
If an admin user object can suddenly start pointing to the guest user object and still be considered "memory safe", that doesn't seem like a very safe definition of safety.
It's type safety. The pointer will always point to an object of the same type. This is common in operating systems, for example. You have some out of band way of verifying that the object's identity has not changed (deallocation would be a change of identity) when you access the object under some sort of serialization.
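The "out of band identity check" pattern can be sketched with a generation counter per slot (all names here are illustrative, not from any particular OS): a handle carries the generation it was issued under, and any deallocation bumps the counter, so stale handles are detected rather than silently resolving to the new occupant.

```python
class SlotPool:
    """Toy object pool where handles are (index, generation) pairs."""

    def __init__(self, size):
        self.objects = [None] * size
        self.generation = [0] * size

    def allocate(self, obj):
        for i, slot in enumerate(self.objects):
            if slot is None:
                self.objects[i] = obj
                # Handle records the generation it was issued under.
                return (i, self.generation[i])
        raise MemoryError("pool exhausted")

    def free(self, handle):
        i, _ = handle
        self.objects[i] = None
        self.generation[i] += 1  # change of identity bumps the counter

    def get(self, handle):
        i, gen = handle
        if self.generation[i] != gen:
            return None  # stale handle: the object was deallocated
        return self.objects[i]
```

In this scheme a held "admin" handle can never silently resolve to a "guest" object reusing the same slot; the generation mismatch surfaces the stale access instead.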
It's certainly a weaker definition of memory safety than you and I, and I would guess most people, would have in mind. So in that sense, I think the author is wrong to call it memory safety.
You're totally correct that a logic bug in this category could cause a credentials pointer to point to a different or higher set of credentials, and that is an implementation risk.
I guess the argument is that you'll never read random garbage instead of a well-formed object; and given that random garbage could result in pretty much arbitrary "undefined behavior", it should at least guarantee that your program will behave roughly as intended, even if giving incorrect results.
> and given that random garbage could result in pretty much arbitrary "undefined behavior"
Nitpick: UB doesn't come from reading random garbage, it's quite the opposite: UB could result in reading random garbage, but it could also result in many worse things.
I don't think the owned reference part will change the language drastically. For people who want to write kernels, games, or real-time code it would be great. Nim's GC is super configurable already. Yeah, owned references would be better, but it's already pretty good.
https://keras.io/examples/keras_recipes/trainer_pattern/