
TALOS-V2: Implementing the Transformer Architecture as Pure Hardware Circuitry on an FPGA to Achieve Over 50,000 Tokens per Second

Luthira Abeykoon and Krish Chhajer, Electrical and Computer Engineering undergraduates at the University of Toronto, have successfully ported Karpathy's MicroGPT—a minimalist GPT implementation consisting of only 200 lines of pure Python code and 4,192 parameters—entirely into hardware using SystemVerilog.
This project, named TALOS-V2 (Tensor Accelerated Logic for On-Chip Systems), operates without a GPU, PyTorch, or a CPU inference loop. Every step of the transformer architecture is executed as a dedicated hardware circuit, achieving a generation speed exceeding 50,000 tokens per second. The project has been fully open-sourced on GitHub.

Technical Implementation Details

The system runs on a DE1-SoC Cyclone V, an educational-grade Intel FPGA. Key architectural decisions include:
  • Weight Storage: Model weights are stored in on-chip ROM using a Q4.12 fixed-point format (see the fixed-point sketch after this list).

  • Systolic Array: The repetitive matrix-vector multiplication (MatVec) operations found throughout the model are implemented as a 16-channel systolic array. This single unit is time-shared to handle the Q/K/V projections, the Multi-Layer Perceptron (MLP), and the Language Model (LM) head (see the behavioral model after this list).
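
For a concrete sense of the arithmetic, here is a minimal Python sketch (illustrative only, not taken from the TALOS-V2 repository) of the Q4.12 format: each value is a 16-bit signed word with 4 integer bits and 12 fractional bits, covering roughly [-8.0, 8.0) at a resolution of 2^-12.

    FRAC_BITS = 12
    SCALE = 1 << FRAC_BITS                     # 4096
    Q_MIN, Q_MAX = -(1 << 15), (1 << 15) - 1   # 16-bit signed limits

    def to_q412(x: float) -> int:
        """Quantize a float to Q4.12, saturating at the 16-bit limits."""
        q = round(x * SCALE)
        return max(Q_MIN, min(Q_MAX, q))

    def from_q412(q: int) -> float:
        """Recover the float a Q4.12 word represents."""
        return q / SCALE

    def q_mul(a: int, b: int) -> int:
        """Fixed-point multiply: the product of two Q4.12 words carries
        24 fractional bits, so shift right by 12 to return to Q4.12."""
        return (a * b) >> FRAC_BITS

    # Example: 1.5 * -2.25 == -3.375
    print(from_q412(q_mul(to_q412(1.5), to_q412(-2.25))))  # -3.375

Fixed point keeps every multiply inside a single hardware multiplier and lets the weights sit in ROM as plain 16-bit words, which is presumably why the design avoids floating point entirely.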
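Likewise, a behavioral Python model (hypothetical; the actual design is SystemVerilog) of the time-shared multiply-accumulate datapath, where each inner-loop pass stands in for one clock cycle of 16 parallel multiplies feeding an adder tree:

    LANES = 16

    def matvec_q412(weights: list[list[int]], x: list[int]) -> list[int]:
        """y = W @ x in Q4.12, consuming LANES products per 'cycle'."""
        y = []
        for row in weights:
            acc = 0  # a wide accumulator register holds the running sum
            for base in range(0, len(x), LANES):
                # one cycle: 16 weight words stream out of ROM and are
                # multiplied against the input vector in parallel
                acc += sum(w * xi for w, xi in
                           zip(row[base:base + LANES], x[base:base + LANES]))
            y.append(acc >> 12)  # renormalize the sum back to Q4.12
        return y

    # The same unit serves every projection; only the ROM base address
    # changes between calls:
    #   q = matvec_q412(W_q, x); k = matvec_q412(W_k, x); v = matvec_q412(W_v, x)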

Overcoming the Attention Mechanism Challenge

The attention mechanism proved to be the most complex component to migrate from software to hardware. While a single line of Python code suffices for attention in a software environment, the hardware implementation required decomposing the process into eight distinct steps (modeled in the sketch after this list):
  1. Generation of Q/K/V (Query, Key, Value).

  2. Scanning dot products.

  3. Tracking maximum values.

  4. Approximating the exponential function.

  5. Accumulation.

  6. Division.

  7. Mixing with V.

  8. Final projection.
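
As a rough software model of those eight steps (a hypothetical float-level sketch; on the FPGA the same dataflow runs in fixed point, with the exponential replaced by a lookup-table approximation):

    import math

    def exp_approx(x: float) -> float:
        # step 4: stand-in for the hardware's table-based exponential
        return math.exp(x)

    def attention_step(x, W_q, W_k, W_v, W_o, k_cache, v_cache, d):
        matvec = lambda W, v: [sum(wi * vi for wi, vi in zip(row, v))
                               for row in W]
        # 1. generate Q/K/V for the current token
        q, k, v = matvec(W_q, x), matvec(W_k, x), matvec(W_v, x)
        k_cache.append(k); v_cache.append(v)
        # 2. scan dot products of the query against every cached key
        scores = [sum(qi * ki for qi, ki in zip(q, kj)) / math.sqrt(d)
                  for kj in k_cache]
        # 3. track the maximum so the exponentials stay in range
        m = max(scores)
        # 4.-5. approximate exp and accumulate the softmax denominator
        e = [exp_approx(s - m) for s in scores]
        denom = sum(e)
        # 6. divide to normalize the attention weights
        w = [ei / denom for ei in e]
        # 7. mix the cached values under those weights
        mixed = [sum(wj * vj[i] for wj, vj in zip(w, v_cache))
                 for i in range(len(v))]
        # 8. final output projection
        return matvec(W_o, mixed)

In hardware, each of these steps presumably maps onto its own state in a state machine, with dedicated registers holding the running maximum, the accumulator, and the normalized weights.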

Project Philosophy

The creators emphasize that the primary goal of TALOS-V2 was not to run large language models (LLMs), but rather to visualize every step of Transformer inference as tangible hardware components: memory units, counters, state machines, and Look-Up Tables (LUTs).

