Kimi K2 Thinking
Chinese lab Moonshot AI has released Kimi K2 Thinking, a text-only reasoning LLM. As Sebastian Raschka observes, K2 Thinking architecturally looks like a scaled-up DeepSeek R1 in certain respects, e.g. the parameter count (1T vs. 671B) and the larger context window (256K vs. 128K tokens).
Benchmarks
Writing
K2 Thinking scores below the non-thinking variant, K2 Instruct, on the EQ Bench Creative Writing leaderboard, with worse “Slop” and “Repetition” scores, and produces outputs only half as long. (The score could still change: the documentation notes that temperature=1 gives the best performance, but the benchmark authors use temp=0.7.) Derya Unutmaz rates its “creativity and prose at least four out of five stars”, but in my limited testing it appears undertrained in German: occasionally broken syntax and grammar, unidiomatic word choices (which make it read like an overly literal translation from English), and incorrect typography.
Other Benchmarks
K2 Thinking currently appears to be the best open-weights model overall: Artificial Analysis places it just below GPT-5 High and well above Anthropic’s Claude Sonnet 4.5, recent open-weights winner Minimax M2, and OpenAI’s gpt-oss-120B.
Its benchmark outperformance is not universal, however: it trails on a range of coding tasks in particular (though it does consistently outperform DeepSeek V3.2).
Availability and pricing
As of now, the only third-party provider on OpenRouter besides Moonshot is Parasail, which is not listed on the K2 Vendor Verifier. It remains unclear whether the K2 Thinking Heavy variant, which uses a parallel decoding strategy and appears competitive with GPT-5 Pro in some scenarios, is publicly available. Artificial Analysis lists the cost of running its evals as a little less expensive than DeepSeek R1 (-$3) and somewhat more expensive than Claude 4.5 Haiku (+$17), and less than half the price of Claude 4.5 Sonnet. Ethan Mollick remarks that K2 Thinking produces a lot of thinking tokens: 1,595 tokens vs. 110 tokens with DeepSeek R1 in one test. Price per million tokens continues to be an insufficient basis for comparing LLMs, not least because the Moonshot API features an implicit cache ($0.15 per 1M input tokens on a cache hit vs. $0.60 otherwise).
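To make the "price per token is not the whole story" point concrete, here is a minimal sketch of how the effective bill depends on both the cache-hit ratio and the reasoning-token volume. It uses only the figures quoted above (Moonshot's $0.15/$0.60 per 1M input tokens on hit/miss, and the 1,595 vs. 110 thinking-token counts from Mollick's test); the cache-hit rate and the illustrative output price are assumptions, not published numbers.

```python
def blended_input_price(hit_rate, price_hit=0.15, price_miss=0.60):
    """Effective input price per 1M tokens under an implicit cache.

    Defaults are Moonshot's quoted cache-hit/miss rates; hit_rate is
    workload-dependent and assumed by the caller.
    """
    return hit_rate * price_hit + (1 - hit_rate) * price_miss

def request_cost(input_tokens, output_tokens, input_price, output_price):
    """Dollar cost of one request, with prices given per 1M tokens."""
    return (input_tokens * input_price + output_tokens * output_price) / 1e6

# A workload that hits the cache half the time pays a blended input rate:
rate = blended_input_price(hit_rate=0.5)  # 0.375 $/1M tokens

# Identical prompt and per-token price, but the thinking-token counts
# from the Mollick comparison (1,595 vs. 110) change the bill ~4x here.
# The $2.50/1M output price is purely hypothetical, for illustration.
verbose = request_cost(1_000, 1_595, rate, output_price=2.50)
terse = request_cost(1_000, 110, rate, output_price=2.50)
```

The point of the sketch is simply that two models with identical list prices can differ severalfold in real cost once caching behavior and reasoning verbosity enter the calculation.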
