Training Data Sources

  • Web data (multilingual)

    Status: Reported

    Citation: Alibaba Cloud documentation

  • Proprietary Alibaba data

    Status: Alleged

Overview: A Chinese LLaMA Competitor

Qwen (from the Chinese name Tongyi Qianwen) is a family of multilingual and multimodal open-source models developed by Alibaba, one of China’s largest technology companies. From a legal and strategic perspective, Qwen is a direct competitor to Meta’s LLaMA: a top-tier model released openly by a technology giant under a custom license designed to hinder its biggest rivals.

Key Models

The Qwen family includes a range of powerful open-source models:

  • Qwen2: The flagship text generation model, released in multiple sizes (up to 72 billion parameters). It is known for its strong multilingual capabilities.
  • Qwen-VL: A multimodal version that accepts both text and images as input.

Like other major AI labs, Alibaba has not fully disclosed its training data, creating a familiar “black box” risk.

  • Stated Sources: Alibaba states the models are trained on a large corpus of multilingual web data and proprietary internal data.
  • Geopolitical Context: The emphasis on multilingual data, including a large amount of Chinese-language data, differentiates it from many Western models. However, the sources of this data are not disclosed, presenting the same risk of copyright infringement across multiple languages and jurisdictions.
  • Proprietary Data: The use of Alibaba’s own proprietary data (e.g., from its e-commerce platforms) would give it a unique data advantage, similar to xAI’s use of X data to train Grok, but details are not public.

The Tongyi Qianwen License Agreement

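Not all Qwen models are released under a standard permissive open-source license. Several, including the flagship Qwen2-72B, are governed by a custom license, the Tongyi Qianwen License Agreement, whose strategy closely mirrors Meta’s LLaMA license; the smaller Qwen2 models were released under Apache 2.0.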

Key Terms & Restrictions

  • Free for Most: The license charges no fee and permits commercial use, modification, and distribution for most users.
  • Commercial Restriction for Large Companies: The license contains a crucial restriction. Companies whose products or services have more than 100 million monthly active users may not use the model commercially without first obtaining a separate license from Alibaba.
  • The Target: This clause is aimed at Alibaba’s largest global and domestic competitors (e.g., Tencent, ByteDance, Google, Microsoft). It prevents them from directly using Qwen to improve their own products.
  • Commoditizing the Model Layer: Like Meta, Alibaba aims to commoditize the AI model itself, undermining the business model of companies that charge for API access.
  • Risk Transference: By open-sourcing the model, Alibaba transfers the direct legal risk of using a model trained on undisclosed data to the thousands of developers and smaller companies that build on top of it.
  • Controlled “Openness” and Geopolitics: Qwen is another example of “controlled open-source.” Because Alibaba is a Chinese company, its ability to enforce the license against a US or European company, and a Western rights holder’s ability to pursue copyright claims against Alibaba, both depend on complex international legal and geopolitical factors.
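
The commercial restriction for large companies described above reduces to a simple threshold test. The following is a minimal sketch of that rule as summarized in this section, not a reading of the license text itself; the names `MAU_THRESHOLD` and `needs_separate_license` are hypothetical and chosen only for illustration.

```python
# Illustrative sketch only: encodes the rule as summarized in this section.
# The constant and function names are hypothetical, not part of any Alibaba tooling.

MAU_THRESHOLD = 100_000_000  # monthly active users, per the restriction above


def needs_separate_license(monthly_active_users: int, commercial_use: bool) -> bool:
    """Return True when a separate license from Alibaba would be required.

    The rule as stated above applies only to commercial use by companies
    with MORE than 100 million monthly active users.
    """
    if monthly_active_users < 0:
        raise ValueError("monthly_active_users must be non-negative")
    return commercial_use and monthly_active_users > MAU_THRESHOLD


if __name__ == "__main__":
    # A startup with 2 million users shipping a commercial product.
    print(needs_separate_license(2_000_000, commercial_use=True))
    # A platform with 500 million users shipping a commercial product.
    print(needs_separate_license(500_000_000, commercial_use=True))
```

Under this reading, only a company that is both above the threshold and using the model commercially falls under the restriction; exactly how monthly active users are counted is a question for the license text, which this sketch does not attempt to answer.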