There’s actually a parameter-for-parameter perplexity improvement for BitNet-1.58, and it increases as the model scales up.
So yes, post-training quantization has apparent perplexity issues, but if you train with quantization in from the start, it does better than the full-precision (FP) baseline.
Which makes sense through the lens of the superposition hypothesis, where the weights are actually representing a hyperdimensional virtual vector space. If the weights have too much precision, competing features might compromise on fuzzier representations instead of restructuring the virtual network onto better-matching nodes.
Constrained weight precision is probably going to be the future of pretraining within a generation or two, looking at the data so far.
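For anyone wondering what "constrained weight precision" looks like in practice, here is a rough sketch of absmean ternary quantization, the kind of scheme described for BitNet b1.58: scale the weights by their mean absolute value, round, and clip to {-1, 0, +1}. The function name `absmean_ternary`, the `eps` value, and the toy weights are my own assumptions for illustration. It only covers the forward quantizer, not the training loop that keeps latent full-precision weights around.

```python
def absmean_ternary(weights, eps=1e-8):
    """Quantize a flat list of floats to {-1, 0, +1} using an absmean scale.

    Hypothetical helper for illustration, not code from any BitNet release.
    """
    # gamma is the mean absolute weight; it acts as the per-tensor scale.
    gamma = sum(abs(w) for w in weights) / len(weights)
    scale = gamma + eps  # eps (assumed value) guards against division by zero
    # Round to the nearest integer, then clip into the ternary range.
    ternary = [max(-1, min(1, round(w / scale))) for w in weights]
    return ternary, gamma


# Toy weights, made up for the example.
weights = [0.9, -0.05, 0.4, -1.3, 0.02, -0.6]
ternary, gamma = absmean_ternary(weights)

# Dequantized view: every weight is now -gamma, 0, or +gamma.
dequantized = [t * gamma for t in ternary]
print(ternary)
print(dequantized)
```

Small weights collapse to 0 and everything else snaps to plus or minus one scale value, so there is no room for the in-between "fuzzy" values mentioned above.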