
Stage 1 (Reasoning Chains Construction): an LLM, guided by a curriculum/template c, generates a structured reasoning chain r from the query x, the set of available tools T, and the ground-truth label y.
Stage 2 (Function Calling Generation Using Generated Reasoning): we evaluate the effectiveness of r by prompting an LLM to predict a function call y' conditioned on (x, T, r), without providing c at inference.
Stage 3 (Training Sample Filtering): to ensure high-quality supervision, we compare the predicted y' with the reference y in two rounds of verification. The first applies Exact Match and AST-based structural comparison (AST/EM); the second uses an LLM-based judgment to identify semantically equivalent alternatives. Only samples that pass verification are included in the final dataset.
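
The Stage 3 filter can be illustrated with a minimal sketch. It assumes that function calls are written as Python-style call strings, that the AST comparison treats keyword-argument order as irrelevant, and that the LLM judge is consulted only when the AST/EM check fails. The names `parse_call`, `ast_or_exact_match`, `keep_sample`, and the `llm_judge` callback are hypothetical and introduced here for illustration; the actual call format and comparison rules may differ.

```python
import ast
from typing import Callable, Optional


def parse_call(call_str: str):
    """Parse a call such as 'f(a=1, b="x")' into (name, args, kwargs).

    Assumption: arguments are Python literals; anything else raises ValueError.
    """
    node = ast.parse(call_str.strip(), mode="eval").body
    if not isinstance(node, ast.Call):
        raise ValueError(f"not a function call: {call_str!r}")
    name = ast.unparse(node.func)  # requires Python 3.9+
    args = tuple(ast.literal_eval(a) for a in node.args)
    kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
    return name, args, kwargs


def ast_or_exact_match(pred: str, ref: str) -> bool:
    """Round 1: exact string match, then a structural (AST-based) comparison."""
    if pred.strip() == ref.strip():
        return True
    try:
        # Dict equality ignores keyword order; quoting style is normalized by parsing.
        return parse_call(pred) == parse_call(ref)
    except (SyntaxError, ValueError):
        return False


def keep_sample(
    pred: str,
    ref: str,
    llm_judge: Optional[Callable[[str, str], bool]] = None,
) -> bool:
    """Keep a sample if it passes round 1, or if the (hypothetical) judge accepts it."""
    if ast_or_exact_match(pred, ref):
        return True
    # Round 2: placeholder for an LLM-based semantic-equivalence judgment.
    return llm_judge is not None and bool(llm_judge(pred, ref))
```

For example, `keep_sample('get_weather(city="Paris", unit="C")', "get_weather(unit='C', city='Paris')")` passes in the first round, because the two calls differ only in quoting and keyword order. A pair with different argument values is kept only if the supplied judge accepts it.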