The execution of the same Skill can vary dramatically across different model and harness combinations, sometimes even hindering performance. Researchers from Shanghai Jiao Tong University's IPADS lab analyzed over 118,000 Skills and found that:
15% of tasks saw a performance decrease after using a Skill.
87% of tasks showed no improvement on at least one model.
Some Skills caused token costs to skyrocket by 451% with no corresponding increase in success rate.
Model Capability Mismatch: A Skill may assume a highly capable model, but a smaller model may not understand its instructions; forcing the Skill anyway explains the performance drop observed in 15% of tasks.
Environment Dependencies Cause Errors: If a Skill requires a Python package that isn't installed on the user's machine, the LLM is forced into a cycle of trial and error, wasting a large number of tokens.
Slow and Costly: For highly repetitive, rigid tasks, the LLM must re-run its inference and tool-calling loop every time, driving token costs extremely high.
To tackle these pain points, the Shanghai Jiao Tong University team drew inspiration from traditional language virtual machine designs to create SkVM, a virtual machine architecture for natural language.
This process compiles Skills into a format that models can digest more easily. During Skill installation, the AOT compiler (itself composed of a compilation-optimization Skill and an LLM) generates multiple compiled artifacts that help the Agent Harness and LLM better understand the Skill at runtime. Before execution, SkVM performs three key compilation passes:
PASS-1: Capability-Based Compilation System: The system identifies 26 "Primitive Capabilities" (e.g., tool invocation, instruction following, format alignment) and benchmarks the LLM's proficiency in each, similar to a CPU benchmark. Unlike other LLM test suites, these capabilities test for basic, orthogonal, composable, and logic-agnostic skills rather than complex problem-solving. Each capability is scored to generate an objective capability profile for the LLM+Harness combination. The compiler then analyzes the Skill to determine its required primitive capabilities and levels. If a Skill's requirements exceed the LLM+Harness's capabilities, the compiler optimizes the Skill to lower its demands. For example, if a Skill uses relative paths for predefined scripts and the LLM+Harness lacks the ability to parse them, the compiler converts them to absolute paths during installation, reducing the required level for the "script execution" capability.
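The core of this pass is a comparison between two tables of scores. A minimal sketch of the idea, with entirely illustrative capability names, scores, and data shapes (the paper's actual profiling format is not shown here):

```python
# Hypothetical sketch of PASS-1: checking a Skill's required primitive
# capabilities against a benchmarked LLM+Harness capability profile.
# All names and numbers below are illustrative assumptions, not SkVM's API.

# Benchmark scores (0-100) for a small model + harness combination.
capability_profile = {
    "tool_invocation": 85,
    "instruction_following": 70,
    "format_alignment": 60,
    "relative_path_resolution": 30,  # weak spot: relative path parsing
}

# Capability levels the compiler extracted as requirements of the Skill.
skill_requirements = {
    "tool_invocation": 80,
    "relative_path_resolution": 50,
}

def find_mismatches(profile, requirements):
    """Return capabilities where the Skill demands more than the model offers."""
    return {
        cap: (profile.get(cap, 0), needed)
        for cap, needed in requirements.items()
        if profile.get(cap, 0) < needed
    }

mismatches = find_mismatches(capability_profile, skill_requirements)
# For each mismatch, the compiler would rewrite the Skill to lower the
# demanded level, e.g. converting relative script paths to absolute paths.
for cap, (have, need) in mismatches.items():
    print(f"{cap}: model scores {have}, Skill requires {need} -> optimize")
```

In this toy profile only `relative_path_resolution` falls short, so only that part of the Skill would be rewritten; capabilities the model already covers are left untouched.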
PASS-2: Environment Binding: Skills often define required environments and dependencies. During runtime, the LLM typically checks for and installs these, leading to significant token waste or installation failures. The AOT compiler automatically extracts the necessary packages and tools, generating installation/verification scripts. This allows for one-click environment setup before execution, eliminating the need for the LLM to troubleshoot.
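Conceptually, this pass turns a declarative dependency list into a concrete setup script at installation time. A minimal sketch, assuming a simple manifest format (the fields and script shape here are illustrative, not SkVM's actual artifact format):

```python
# Hypothetical sketch of PASS-2: turning a Skill's declared dependencies
# into a one-shot install/verify script generated at installation time,
# so the LLM never has to troubleshoot the environment at runtime.

skill_manifest = {
    "python_packages": ["pandas", "matplotlib"],
    "cli_tools": ["jq"],
}

def generate_setup_script(manifest):
    """Emit a shell script that installs and then verifies every dependency."""
    lines = ["#!/bin/sh", "set -e"]  # abort on the first failure
    pkgs = manifest.get("python_packages", [])
    if pkgs:
        lines.append("pip install " + " ".join(pkgs))
        for p in pkgs:
            # Verify the package actually imports, not just that pip ran.
            lines.append(f'python -c "import {p}"')
    for tool in manifest.get("cli_tools", []):
        lines.append(f"command -v {tool} >/dev/null  # verify tool is on PATH")
    return "\n".join(lines)

print(generate_setup_script(skill_manifest))
```

Running the generated script once before execution replaces the runtime trial-and-error cycle with a single deterministic setup step.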
PASS-3: Concurrency Extraction: Over 76% of Skills contain workflows, which Agent harnesses typically execute serially by default. AOT compilation can uncover parallelization opportunities at different granularities within a Skill's execution, including data parallelism (one instruction, multiple data), instruction parallelism (parallel execution of independent instructions), and thread parallelism (multiple independent sub-agents handling different subtasks). It then generates a parallelizable DAG (Directed Acyclic Graph) workflow. Developers can also register custom compilation optimization mechanisms with the AOT compiler for further pre-runtime optimization.
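The instruction-parallelism case can be sketched with a standard topological sort over the workflow DAG: steps whose dependencies are all satisfied form a stage that can run concurrently. The step names and dependencies below are illustrative, not taken from SkVM:

```python
# Hypothetical sketch of PASS-3: representing a Skill's workflow as a DAG
# and grouping independent steps into stages that can run in parallel.

from graphlib import TopologicalSorter

# Each step maps to the set of steps it depends on.
workflow = {
    "fetch_data": set(),
    "clean_data": {"fetch_data"},
    "train_model": {"clean_data"},
    "plot_stats": {"clean_data"},   # independent of train_model
    "report": {"train_model", "plot_stats"},
}

def parallel_stages(dag):
    """Group steps into stages; steps within a stage have no mutual deps."""
    ts = TopologicalSorter(dag)
    ts.prepare()
    stages = []
    while ts.is_active():
        ready = sorted(ts.get_ready())  # all of these can run concurrently
        stages.append(ready)
        ts.done(*ready)
    return stages

print(parallel_stages(workflow))
```

Here `train_model` and `plot_stats` land in the same stage, exposing parallelism that a serial harness would execute one after the other; each stage could be dispatched to concurrent tool calls or sub-agents.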
Beyond static compilation, SkVM employs Just-in-Time (JIT) compilation at runtime to accelerate Skill execution.
Code Solidification: Scripts defined in a Skill are often code templates with variable parameters. Each time the Skill runs, the LLM must regenerate the executable script, wasting a large number of tokens. To counter this, during the AOT phase, SkVM generates a code fingerprint, a template, and the corresponding parameter list. At runtime, the code generated by the LLM is matched against the pre-generated AOT code fingerprint. If the match succeeds several times in a row, SkVM uses JIT compilation to solidify the executable code, instantiating it directly from the input parameters rather than having the LLM regenerate it each time.
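A minimal sketch of the match-then-solidify idea. The fingerprinting scheme (strip literals, then hash, so only code structure is compared) and the streak threshold are illustrative assumptions, not SkVM's actual mechanism:

```python
# Hypothetical sketch of code solidification: matching LLM-generated code
# against an AOT-produced template fingerprint, then instantiating the
# template directly once the match has succeeded enough times in a row.

import hashlib
import re

# AOT artifacts: a code template with named parameters.
TEMPLATE = "print(open({path!r}).read()[:{n}])"

def fingerprint(code):
    """Replace string/number literals with a placeholder, then hash,
    so two scripts that differ only in parameters fingerprint alike."""
    normalized = re.sub(r"'[^']*'|\d+", "_", code)
    return hashlib.sha256(normalized.encode()).hexdigest()

AOT_FINGERPRINT = fingerprint(TEMPLATE.format(path="x", n=0))

class Solidifier:
    STREAK_NEEDED = 3  # consecutive matches before bypassing the LLM

    def __init__(self):
        self.streak = 0

    def run(self, llm_code, params):
        if fingerprint(llm_code) == AOT_FINGERPRINT:
            self.streak += 1
        else:
            self.streak = 0  # structure drifted; trust the LLM again
        if self.streak >= self.STREAK_NEEDED:
            # Solidified path: fill the template from parameters,
            # skipping LLM generation entirely.
            return TEMPLATE.format(**params)
        return llm_code
```

Once solidified, producing the script costs zero LLM tokens; a structural mismatch resets the streak so novel code is never silently replaced by the template.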
Adaptive Recompilation: If an error or retry occurs during runtime, the system collects error logs and feeds them back to the compiler for automatic re-optimization of the Skill. This prevents the same errors from recurring and improves the overall task success rate.
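The feedback loop itself is simple: the runtime error becomes extra context for another compiler invocation. A minimal sketch, where `compile_fn` stands in for the LLM-backed compiler and the prompt wording is an illustrative assumption:

```python
# Hypothetical sketch of adaptive recompilation: feeding a runtime error
# log back to the (LLM-backed) compiler so the Skill is rewritten before
# the same failure can recur. Names and prompt text are illustrative.

def recompile_on_error(skill_text, error_log, compile_fn):
    """Ask the compiler to patch the Skill using the observed failure."""
    prompt = (
        "The following Skill failed at runtime.\n"
        f"Skill:\n{skill_text}\n"
        f"Error log:\n{error_log}\n"
        "Rewrite the Skill so this error cannot recur."
    )
    return compile_fn(prompt)

# Usage with a stand-in compiler function:
patched = recompile_on_error(
    "run analyze.py on the dataset",
    "ModuleNotFoundError: No module named 'pandas'",
    compile_fn=lambda p: "install pandas first, then run analyze.py",
)
```

Because the patched Skill replaces the installed artifact, every later run benefits from the fix, which is what lifts the overall task success rate rather than just retrying the one failing call.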
The research team tested SkVM on 118 representative tasks, including code generation and data analysis. The results showed significant benefits, especially for weaker, smaller models, as it compensates for their shortcomings in handling complex JSON structures, environment dependencies, and script parsing. This enables a Qwen 30B model to achieve task success rates comparable to Opus 4.6. For top-tier models, using SkVM-compiled Skills reduced token consumption by up to 40%.