🔍 Point-by-Point Rebuttals
On power consumption: The report suggests that software optimizations let chips run at full capacity, and advises manufacturers to allocate more power headroom. Chan counters that this is counterproductive: a higher power budget often forces a lower operating frequency to stay within thermal limits, ultimately reducing compute throughput.
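Chan's tradeoff can be illustrated with a toy model (not from either source; all constants are illustrative): dynamic power grows roughly cubically with frequency, so under a fixed thermal cap, any extra power allocated to the chip (e.g. more always-on units, higher static draw) leaves less headroom for frequency.

```python
def sustained_freq_hz(thermal_cap_w: float, static_power_w: float,
                      k: float = 1.0e-27) -> float:
    """Toy model: highest sustainable frequency under a thermal cap.

    Assumes total power = static_power + k * f**3 (voltage scales
    roughly with frequency, so dynamic power P = C*V^2*f grows ~f^3).
    The constant k is purely illustrative.
    """
    headroom_w = max(thermal_cap_w - static_power_w, 0.0)
    return (headroom_w / k) ** (1.0 / 3.0)

# Raising the power budget consumed by the chip (higher static draw)
# under the same cooling limit *lowers* the sustainable frequency:
f_lean = sustained_freq_hz(thermal_cap_w=500.0, static_power_w=100.0)
f_fat = sustained_freq_hz(thermal_cap_w=500.0, static_power_w=250.0)
print(f_lean > f_fat)  # True
```

Under this (simplified) model, adding power headroom only helps if the cooling budget rises with it; otherwise frequency, and with it computational power, drops.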
On data transfer: The report advocates a "pull" model (the GPU actively reads data) over a "push" model (data is sent to the GPU), citing the high notification overhead of the latter. Chan disagrees, arguing that the "pull" method is inherently slower and that improving the network card's processing capability would be the better fix. The two may be talking past each other: the report focuses on notification overhead, while Chan is concerned with transmission latency.
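A toy latency model (my own sketch, not from either source) makes the talking-past-each-other concrete: "push" pays a per-message notification cost on top of one wire traversal, while "pull" pays roughly two traversals (read request out, data back). Which one wins depends entirely on which term dominates.

```python
def push_time_us(n_msgs: int, wire_latency_us: float,
                 notify_overhead_us: float) -> float:
    # Push: one wire traversal per message, plus the receiver-side
    # notification cost (doorbell/interrupt) the report worries about.
    return n_msgs * (wire_latency_us + notify_overhead_us)

def pull_time_us(n_msgs: int, wire_latency_us: float) -> float:
    # Pull: the receiver issues a read request and waits for the data,
    # so each message costs ~two wire traversals -- Chan's concern.
    return n_msgs * (2.0 * wire_latency_us)

# Cheap notifications: push wins (Chan's point).
print(push_time_us(1, 5.0, 1.0) < pull_time_us(1, 5.0))  # True
# Expensive notifications: pull wins (the report's point).
print(push_time_us(1, 5.0, 7.0) > pull_time_us(1, 5.0))  # True
```

Under these (hypothetical) numbers, both positions are internally consistent; they simply weight different cost terms.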
On activation functions: DeepSeek's report recommends replacing the complex SwiGLU activation function with a simpler one to reduce computational load. Chan finds this unnecessary, pointing out that the Sonic MoE architecture has already demonstrated that SwiGLU can achieve optimal performance.
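For context on what is being debated, here is a minimal NumPy sketch of the standard SwiGLU gated feed-forward unit, Swish(xW) ⊙ (xV); the dimensions and weights are hypothetical. The "complexity" the report objects to is the extra matrix multiply and the sigmoid inside the gate.

```python
import numpy as np

def swish(x: np.ndarray) -> np.ndarray:
    # Swish / SiLU: x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def swiglu(x: np.ndarray, w_gate: np.ndarray, w_val: np.ndarray) -> np.ndarray:
    # SwiGLU: the Swish-activated gate projection multiplies the
    # value projection elementwise -- two matmuls instead of one.
    return swish(x @ w_gate) * (x @ w_val)

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 8))        # batch of 2, hidden size 8 (illustrative)
w_gate = rng.standard_normal((8, 16))  # gate projection
w_val = rng.standard_normal((8, 16))   # value projection
print(swiglu(x, w_gate, w_val).shape)  # (2, 16)
```

Replacing SwiGLU with a plain single-projection activation (e.g. ReLU(xW)) would drop one matmul and the sigmoid, which is the saving the report is after; Chan's objection is that this trades away quality the gate provides.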