Chinese artificial intelligence (AI) startup DeepSeek has unveiled a technique that enables fast training and inference over long texts.
According to DeepSeek's official account on X (formerly Twitter) on the 19th, the company's developers, including founder Liang Wenfeng, published a paper introducing a mechanism called 'Native Sparse Attention' on the preprint server arXiv. In the paper, DeepSeek noted that 'modeling long texts is crucial for next-generation language models, but the high cost of the standard attention mechanism poses a significant challenge' and added that 'sparse attention can improve efficiency while maintaining model capabilities.'
The existing 'Full Attention' mechanism, which computes relationships between all tokens (the unit of data processed in AI models), suffers from computational costs that grow quadratically as sequence length increases, which has spurred active research in the AI industry on sparse attention, which computes attention over only a subset of tokens.
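To illustrate the scaling issue, the following minimal sketch (not taken from DeepSeek's code; the sequence length and head dimension are arbitrary choices) computes standard full attention in NumPy. The n-by-n score matrix it materializes is what makes cost grow quadratically with sequence length.

```python
# Minimal sketch of standard full attention: every token attends to every
# other token, so the score matrix has shape (n, n) and compute/memory grow
# quadratically with sequence length n.
import numpy as np

def full_attention(Q, K, V):
    """Scaled dot-product attention over all token pairs."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                       # (n, n) score matrix: the bottleneck
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)      # row-wise softmax
    return weights @ V                                  # (n, d) output

n, d = 4096, 64                                         # hypothetical sequence length and head dim
Q, K, V = (np.random.randn(n, d) for _ in range(3))
out = full_attention(Q, K, V)                           # doubling n roughly quadruples the work
```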
DeepSeek announced, 'We are introducing the Native Sparse Attention (NSA) mechanism, which integrates algorithmic innovation with hardware optimization for efficient long-text modeling,' and explained, 'NSA adopts a dynamic hierarchical sparse strategy that combines token compression with token selection.'
Recently, inference-oriented models such as OpenAI's 'o' series, DeepSeek's 'R1', and Google's Gemini 2.0 have driven growing demand for long-text processing. Sparse attention methods are being researched to tackle the resulting surge in computation costs, but DeepSeek pointed to shortcomings in existing approaches: sparsity applied only at certain stages, incompatibility with the latest attention architectures, and designs that overlook training efficiency.
In contrast, the NSA developed by DeepSeek applies a dynamic hierarchical sparse strategy that compresses unimportant tokens and selects only essential ones, reducing computation costs and increasing speed. DeepSeek explained that token compression preserves awareness of the overall context, while token selection retains fine-grained detail.
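As a rough illustration of the idea described above, the following sketch (a simplified approximation, not DeepSeek's implementation; the mean-pooling compressor, block size, and top-k values are assumptions chosen for clarity) shows one sparse attention step that summarizes blocks of tokens for coarse context and then attends at full resolution only within the highest-scoring blocks.

```python
# Illustrative sketch only: a hierarchical sparse step for a single query
# token, combining token compression (coarse context) with token selection
# (fine detail), per the description above. Block size and top-k are arbitrary.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def sparse_attention_step(q, K, V, block_size=64, top_k=4):
    n, d = K.shape
    n_blocks = n // block_size
    # 1. Token compression: summarize each block of keys/values (mean pooling
    #    stands in for a learned compressor) to capture the overall context.
    K_c = K[: n_blocks * block_size].reshape(n_blocks, block_size, d).mean(axis=1)
    V_c = V[: n_blocks * block_size].reshape(n_blocks, block_size, d).mean(axis=1)
    # 2. Token selection: score compressed blocks against the query and keep
    #    only the top-k most relevant blocks at full token resolution.
    block_scores = K_c @ q / np.sqrt(d)
    selected = np.argsort(block_scores)[-top_k:]
    idx = np.concatenate([np.arange(b * block_size, (b + 1) * block_size) for b in selected])
    # 3. Attend over compressed summaries (coarse) plus selected tokens (fine)
    #    instead of all n tokens, shrinking the attended set dramatically.
    K_sparse = np.vstack([K_c, K[idx]])
    V_sparse = np.vstack([V_c, V[idx]])
    w = softmax(K_sparse @ q / np.sqrt(d))
    return w @ V_sparse

n, d = 4096, 64
q = np.random.randn(d)
K, V = np.random.randn(n, d), np.random.randn(n, d)
out = sparse_attention_step(q, K, V)    # attends to far fewer than n tokens
```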
DeepSeek claimed that NSA outscored full attention in benchmark tests, and that decoding over 64K-token sequences was 11.6 times faster with NSA, with backpropagation roughly six times faster as well. DeepSeek highlighted two key innovations of NSA: ▲ hardware-aligned optimization through a balanced algorithm design, yielding significant speed gains, and ▲ reduced pre-training computation cost without degrading model performance.