Purely functional languages have no reason to use 0-indexed arrays. Imperative languages that lean on for loops are more logically consistent with 0-indexed arrays. I think.
I wish FPGAs were fast enough, because I don't see how else this problem can be solved. It's easy to issue a patch that fixes logic implemented in an FPGA. It's impossible to fix hardware-rooted issues unless they can be worked around at a huge performance cost, if at all.
One problem with FPGAs is the absolute dependence on proprietary tools from the vendor; the hardware industry is much more closed in comparison. By using those tools, you have to agree to terms and conditions such as the following (this one is from Xilinx):
> By using this software, you agree not to: [...] display the object code of the Software on any computer screen.
From a security perspective, that doesn't inspire confidence. There's no way to do independent verification, something like a reproducible build with a compiler whose source code is open to audit. There's no equivalent of GCC or LLVM for FPGAs.
FPGAs are significantly slower and significantly more expensive than dedicated hardware.
The FPGAs aimed at data-center usage don't even rely on LUTs primarily anymore, because fully configurable LUTs are too slow and too expensive to manufacture at scale. Instead, data-center-sized FPGAs rely primarily on "DSP slices" (yeah, LUTs exist, but it's mostly DSP slices). They're very expensive (https://www.digikey.com/product-detail/en/xilinx-inc/A-U200-...) and require a very specific set of skills to work with.
DSP slices aren't being made at the exclusion of LUTs on modern FPGAs. There's nothing about LUTs that makes them harder to make at scale than DSP slices. A LUT is just a little bank of SRAM, and SRAM cells are generally the most mature on a process node.
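For intuition, here's a minimal Python sketch of the idea (my own illustration, nothing vendor-specific): a k-input LUT is just a 2^k-entry memory, and whatever you load into it defines the boolean function it computes.

```python
# Minimal sketch: a k-input LUT modeled as a 2^k-entry memory (SRAM).
# Any boolean function of k inputs is realized purely by its contents.

def make_lut(truth_table):
    """truth_table: list of 2^k output bits, indexed by the packed inputs."""
    def lut(*inputs):
        addr = 0
        for bit in inputs:              # pack the input bits into an address
            addr = (addr << 1) | (bit & 1)
        return truth_table[addr]        # "read the SRAM"
    return lut

# The same 4-input LUT "configured" as a 4-way XOR (parity) function.
xor4 = make_lut([bin(i).count("1") & 1 for i in range(16)])
assert xor4(1, 0, 1, 1) == 1
```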
> DSP slices aren't being made at the exclusion of LUTs on modern FPGAs.
Of course they are. Every square micrometer of the die is going to be either a LUT, DSP, or RAM on that chip. Xilinx decides how many of each is most useful to its customers.
> There's nothing about LUTs that makes them harder to make at scale than DSP slices.
Scaling "typical" designs on LUTs is worse than scalaing "typical" designs on a DSP Slice.
Synthesize a 32-bit Wallace-tree multiplier, for instance, and you'll use thousands of LUTs (maybe 2000ish). However, reserve a DSP slice for a 32-bit multiply routine, and you'll only use ONE slice.
However, those multipliers on the DSP slices could be "wasted" if you didn't need that many multipliers; maybe your arithmetic is primarily addition and subtraction. In any case, when most people talk about "reconfigurable FPGAs," they're talking about the LUTs, which can be the building block of any logic. They aren't talking about DSP slices, which are effectively prebuilt ALUs connected in a mesh.
Your original post is making it sound like FPGAs are switching from LUTs to DSP slices in the general case, which is blatantly not true. Yeah, it uses fewer resources to use a DSP slice instead of synthesizing a multiplier, but that's because a DSP slice is just an 18x18 MAC. It's not replacing LUTs for general logic.
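To make the resource comparison concrete, here's a rough Python sketch of what a synthesizer effectively does when mapping a wide multiply onto narrow multiplier primitives (the 17-bit split and the specific widths are my illustration, not Xilinx's actual mapping): a 32-bit product becomes a handful of narrow multiply-accumulates, each roughly what one DSP slice computes, whereas building the same thing out of LUTs costs thousands of them.

```python
# Rough sketch: a 32x32 multiply decomposed into narrow partial products,
# the kind a DSP slice's ~18x18 multiplier handles.
# (17-bit unsigned halves are an illustrative choice, not the real mapping.)

MASK17 = (1 << 17) - 1

def mul32_via_narrow_macs(a, b):
    a_lo, a_hi = a & MASK17, a >> 17        # split into low/high halves
    b_lo, b_hi = b & MASK17, b >> 17
    # Four narrow multiplies, shifted and accumulated.
    return (a_lo * b_lo
            + ((a_lo * b_hi + a_hi * b_lo) << 17)
            + ((a_hi * b_hi) << 34))

assert mul32_via_narrow_macs(0xDEADBEEF, 0x12345678) == 0xDEADBEEF * 0x12345678
```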
It wasn't my intent to say that FPGAs are switching away from LUTs in general. But I expect datacenter FPGAs to become more and more DSP-based in the future.
Think of today's supercomputer problems: deep learning, 3D rendering, finite element analysis, weather modeling, nuclear research, protein folding, even high-frequency trading. What do they all have in common?
They're all giant matrix-multiplication problems at their core... fundamentally built up from the multiply-and-accumulate primitive.
EDIT: The only real exception is maybe EDA. I don't know too much about the algorithms involved, but IIRC they involve binary decision diagrams (https://en.wikipedia.org/wiki/Binary_decision_diagram). So some problems aren't matrix-multiplication based... but I'd dare say the majority of today's supercomputer problems involve matrix multiplication.
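To spell out what "built up from the multiply-and-accumulate primitive" means, here's a minimal Python sketch (illustrative only): the inner loop of a matrix multiply is nothing but repeated MACs, which is exactly the operation a DSP slice provides in hardware.

```python
# Minimal sketch: matrix multiplication is nothing but repeated
# multiply-and-accumulate (MAC) operations.

def matmul(A, B):
    n, k, m = len(A), len(B), len(B[0])
    C = [[0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            acc = 0
            for p in range(k):
                acc += A[i][p] * B[p][j]   # one MAC -- the DSP-slice primitive
            C[i][j] = acc
    return C

assert matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]) == [[19, 22], [43, 50]]
```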
Supercomputers and data centers are two different things with markedly different requirements.
Also, deep learning isn't a great fit for FPGAs anyway. There's dedicated silicon that does a better job if you're looking for matrix multiplies per watt.
Additionally, high-frequency trading isn't doing much in the way of matrix multiplies, as they don't have enough time. They've got a few hundred clock cycles from packet in to reply out.
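To put rough numbers on that budget (the 300 MHz fabric clock is my assumption, not a figure from any actual deployment):

```python
# Rough arithmetic with assumed numbers: a few hundred cycles at a
# ~300 MHz fabric clock is on the order of a microsecond, total.
clock_hz = 300e6        # assumed FPGA fabric clock
budget_cycles = 300     # "a few hundred clock cycles"
print(budget_cycles / clock_hz * 1e9, "ns")   # -> 1000.0 ns
```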
Xilinx absolutely markets these Alveo U200 FPGAs as deep-learning accelerators.
> Additionally, high-frequency trading isn't doing much in the way of matrix multiplies, as they don't have enough time. They've got a few hundred clock cycles from packet in to reply out.
I'm not an HFT user, but I've always assumed that running Monte Carlo Black-Scholes simulations was roughly what HFT traders were doing. Maybe not Black-Scholes itself, but some other differential equation that requires a lot of Monte Carlo runs.
Either way, Black-Scholes (and other models) are partial differential equations, which are best simulated as a sequence of matrix multiplications. That's my understanding, anyway.
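As a hand-wavy illustration of that claim (a toy diffusion step, not how anyone actually prices options): an explicit finite-difference solver for a Black-Scholes-style PDE boils down to applying a tridiagonal update matrix over and over, i.e. a sequence of matrix multiplies.

```python
# Toy sketch: an explicit finite-difference scheme for a 1-D diffusion
# equation (the heat-equation core of Black-Scholes after a change of
# variables) is just a tridiagonal matrix applied repeatedly.
import numpy as np

n = 200                                   # spatial grid points
r = 0.4                                   # dt/dx^2, kept below 0.5 for stability
A = (np.eye(n) * (1 - 2 * r)              # u_new = A @ u_old
     + np.eye(n, k=1) * r
     + np.eye(n, k=-1) * r)

u = np.exp(-np.linspace(-3, 3, n) ** 2)   # some initial condition
for _ in range(1000):                     # time stepping = repeated matmuls
    u = A @ u
```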
They're marketing towards that, but I can't think of any serious deep learning shops that are using them. Modern GPUs have them beat, to say nothing of dedicated ASICs. The one niche I could see would be inference on Zynq-likes for non-power-constrained applications, but that's quite a tiny niche.
And there are cute tricks to avoid matrix multiplies in the critical path for HFT.
A lot of the inefficiency of FPGAs is in signal routing. If you are very likely to implement X, Y, or Z, then it may just be more economical to put those into the FPGA as discrete circuits. At best, maybe you can break them apart into common elements and provide those.
There really isn't much need for LUTs anymore, barring some fundamental change in data processing.
I see your point, but I think that is because of a lack of commitment to higher-integration functional blocks. A lot of work done on an FPGA is very regular.
I suppose LUTs could be the best way to create the glue logic that basically amounts to signal routing and synchronization, but it seems unlikely.
I won't be surprised if future FPGAs start building in on-chip non-blocking Clos networks to route messages across the chip more efficiently.
Dedicated hardware that serves a purpose. Sure, you can build a Clos network out of LUTs, but it'd be more efficient to make dedicated hardware for it instead.
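Back-of-the-envelope, here's why a Clos fabric is attractive compared to a flat crossbar (the sizing below is the textbook strictly-non-blocking configuration, not any shipping FPGA's interconnect): the crosspoint count grows roughly as N^1.5 instead of N^2.

```python
# Back-of-the-envelope: crosspoints in a full N x N crossbar vs. a
# strictly non-blocking three-stage Clos network (r ingress switches of
# size n x m, m middle switches of size r x r, r egress switches of
# size m x n, with m = 2n - 1 and N = r * n).
import math

def crossbar_crosspoints(N):
    return N * N

def clos_crosspoints(N):
    n = r = math.isqrt(N)       # square fabric: n = r = sqrt(N)
    m = 2 * n - 1               # strictly non-blocking condition
    return 2 * r * n * m + m * r * r

for N in (256, 1024, 4096):
    print(N, crossbar_crosspoints(N), clos_crosspoints(N))
# e.g. N=4096: 16,777,216 crossbar crosspoints vs. 1,560,576 for the Clos fabric
```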
If you can get by with higher-level functional blocks sitting off a network-on-chip, you're going in the direction of a modern SoC. The neat games you can play with routing those NoC packets around are close to what you're talking about.
But DSVPN requires a server on such a network too, with certain ports accessible from the internet. If you use a cloud server with WireGuard, it too can relay traffic between your home client and the destination.