1. Introduction
In the rapidly evolving landscape of artificial intelligence, open-source models are gaining significant traction. The recent release of Qwen 3 has stirred the AI community, offering a powerful alternative to proprietary models like Gemini 2.5 Pro. This article delves into the features, performance, and advantages of Qwen 3, highlighting why it’s a game-changer in the AI domain.
2. Understanding Qwen 3
2.1. What is Qwen 3?
Qwen 3 is an open-source large language model (LLM) developed to provide high-performance AI capabilities. With its open weights and source code, it offers transparency and flexibility for developers and researchers.
2.2. Key Features
Open-Source: Fully accessible code and weights.
Hybrid Thinking Mode: Adjustable reasoning capabilities.
Tool Integration: Seamless function calling during chain-of-thought processes.
Multiple Model Variants: Including both Mixture of Experts and dense models.
3. Benchmark Comparisons
3.1. Performance Metrics
Qwen 3’s flagship model, Qwen3-235B-A22B, demonstrates impressive performance across various benchmarks:
LiveCodeBench: Scores 70.7%, surpassing Gemini 2.5 Pro’s 70.4%.
CodeForces ELO Rating: Achieves 2056, compared to Gemini 2.5 Pro’s 2001.
BFCL (Berkeley Function Calling Leaderboard): Attains a score of 70.8, outperforming Gemini 2.5 Pro’s 62.9.
3.2. Function Calling Capabilities
Qwen 3 excels in function calling tasks, crucial for agentic applications and coding assistance. Its superior performance in BFCL benchmarks underscores its proficiency in this area.
4. Hybrid Thinking Mode
4.1. Thinking vs. Non-Thinking Modes
Qwen 3 introduces a hybrid approach to problem-solving:
Thinking Mode: Engages in step-by-step reasoning for complex tasks.
Non-Thinking Mode: Provides rapid responses for straightforward queries.
4.2. Adjustable Thinking Budget
Users can configure the model’s reasoning depth by adjusting the token budget, balancing performance and speed according to task requirements.
5. Model Variants
5.1. Mixture of Experts (MoE) Models
Qwen3-235B-A22B: 235 billion parameters with 22 billion active parameters.
Qwen3-30B-A3B: 30 billion parameters with 3 billion active parameters, optimized for efficiency.
5.2. Dense Models
Qwen 3 offers six dense models ranging from 600 million to 32 billion parameters, catering to various computational capacities and application needs.
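A point worth making explicit about the MoE variants: the "active parameter" count cuts per-token compute, not storage. All weights must still be held in memory; only a subset of experts fires per token. A back-of-envelope sketch, using the parameter counts above (quantization byte-widths are approximate):

```python
# Back-of-envelope memory math for the MoE variants. Key point: all
# parameters must be resident in memory; only the "active" subset is
# used per token, which reduces compute, not storage.

def weight_gb(total_params_b: float, bytes_per_param: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return total_params_b * 1e9 * bytes_per_param / 1e9

# Qwen3-235B-A22B at different precisions:
fp16 = weight_gb(235, 2.0)    # 470.0 GB at 16-bit
q4 = weight_gb(235, 0.5)      # 117.5 GB at ~4-bit quantization
# Qwen3-30B-A3B is far more tractable for local use:
q4_30b = weight_gb(30, 0.5)   # 15.0 GB at ~4-bit quantization
```

So the 30B MoE model runs at roughly the speed of a 3B dense model per token, but still needs the memory footprint of a 30B model.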
6. Training and Data
6.1. Pre-training Stages
Qwen 3 underwent a comprehensive training process:
Stage 1: Pre-trained on over 30 trillion tokens to establish foundational language skills.
Stage 2: Focused on knowledge-intensive data, including STEM and reasoning tasks.
Stage 3: Extended context length to 32K tokens using high-quality long-context data.
6.2. Post-training Enhancements
Post-training involved:
Long Chain-of-Thought Training: Enhanced reasoning abilities.
Reinforcement Learning: Improved model exploration and exploitation capabilities.
Thinking Model Fusion: Integrated quick response capabilities.
General Reinforcement Learning: Strengthened general capabilities and corrected undesired behaviors.
7. Tool Integration and Use Cases
7.1. Tool Calling During Chain of Thought
Qwen 3’s ability to perform tool calls within its reasoning process enables complex task execution, such as:
Fetching data from APIs.
Organizing files based on type.
Generating and executing code snippets.
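The loop behind these tasks can be sketched as follows. This is an illustrative sketch, not Qwen's implementation: the tool schema follows the OpenAI-style function-calling format that Qwen-compatible servers commonly accept, the model call is stubbed out, and the tool name and star count are invented for the example.

```python
import json

# Sketch of a tool-calling round trip with the model call stubbed out.
# The schema follows the common OpenAI-style function-calling format;
# "get_github_stars" and its return values are illustrative only.

TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_github_stars",
        "description": "Return the star count for a GitHub repo.",
        "parameters": {
            "type": "object",
            "properties": {"repo": {"type": "string"}},
            "required": ["repo"],
        },
    },
}]

def get_github_stars(repo: str) -> int:
    # Stand-in for a real API call; the number is made up.
    return {"QwenLM/Qwen3": 12345}.get(repo, 0)

def dispatch(tool_call: dict) -> str:
    """Execute a tool call emitted by the model mid-reasoning."""
    fn = {"get_github_stars": get_github_stars}[tool_call["name"]]
    args = json.loads(tool_call["arguments"])
    return json.dumps(fn(**args))  # fed back as a "tool" message

# A simulated tool call, as the model might emit during its
# chain of thought:
call = {"name": "get_github_stars",
        "arguments": json.dumps({"repo": "QwenLM/Qwen3"})}
result = dispatch(call)
```

The distinguishing feature claimed for Qwen 3 is that such calls can happen inside the reasoning trace, so the model can reason, fetch, and keep reasoning with the result in a single inference run.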
7.2. Integration with Zapier MCP
Through Zapier’s MCP server, Qwen 3 can connect with over 7,000 applications, facilitating extensive automation and integration capabilities.
8. Comparison with Gemini 2.5 Pro
8.1. Performance Benchmarks
While Gemini 2.5 Pro leads on certain benchmarks, Qwen 3 trails it closely overall and surpasses it in specific areas such as function calling and code generation.
8.2. Open-Source Advantage
Unlike Gemini 2.5 Pro, Qwen 3’s open-source nature allows for:
Greater transparency.
Customization and fine-tuning.
Broader accessibility for research and development.
9. Deployment and Accessibility
9.1. Running Qwen 3 Locally
Qwen 3 can be deployed locally using platforms like LM Studio, offering users control over their AI applications without reliance on external APIs.
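Once a local server is running, querying it looks like a standard chat-completions call. A minimal sketch, assuming LM Studio's local OpenAI-compatible server on its default port (Ollama exposes a similar endpoint on a different port); the model name is an example and the actual send is left commented out since it needs a running server:

```python
import json
import urllib.request

# Minimal sketch of querying a locally served Qwen 3 model. Assumes a
# local OpenAI-compatible server (LM Studio's default port is shown);
# the model name is an example, not a fixed value.

URL = "http://localhost:1234/v1/chat/completions"

def local_chat_request(prompt: str,
                       model: str = "qwen3-30b-a3b") -> urllib.request.Request:
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    return urllib.request.Request(
        URL, data=body, headers={"Content-Type": "application/json"})

req = local_chat_request("Summarize Mixture of Experts in one sentence.")
# With a server running, send it like this:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

Because the interface mimics the OpenAI API, existing client code can usually be pointed at the local server by changing only the base URL.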
9.2. Platform Support
The model is compatible with various frameworks, including:
Ollama
MLX
Llama.cpp
KTransformers
10. Conclusion
Qwen 3 emerges as a formidable open-source LLM, challenging proprietary models with its robust performance, hybrid thinking capabilities, and extensive tool integration. Its accessibility and flexibility make it a valuable asset for developers, researchers, and organizations seeking advanced AI solutions.
11. FAQs
Q1: What sets Qwen 3 apart from other open-source models?
A1: Qwen 3’s hybrid thinking mode, superior function calling capabilities, and extensive tool integration distinguish it from other open-source LLMs.
Q2: Can Qwen 3 be fine-tuned for specific applications?
A2: Yes, its open-source nature allows for customization and fine-tuning to cater to specific use cases.
Q3: How does Qwen 3 handle complex tasks?
A3: Utilizing its thinking mode, Qwen 3 engages in step-by-step reasoning, making it adept at handling complex problems requiring deeper analysis.
Q4: Is Qwen 3 suitable for real-time applications?
A4: Absolutely. Its non-thinking mode provides quick responses, making it ideal for applications where speed is crucial.
Q5: Where can I access Qwen 3?
A5: Qwen 3 is available on platforms like Hugging Face, LM Studio, and can be integrated using frameworks such as Ollama and Llama.cpp.
Why does no one talk about mistral anymore
Good league promotion, now Mr Matthew is affiliated with Zapier… I bet that is a result of his work being highlighted on a Google show. Way to go, congratulations, This channel was one of the first I ever followed on AI developments…
Matthew, not sure if you saw this: https://selectcommitteeontheccp.house.gov/sites/evo-subsites/selectcommitteeontheccp.house.gov/files/evo-media-document/DeepSeek%20Final.pdf
Woowww that's incredible!
Benchmarks are like the first chapter of a book. I enjoy going over them, and it can be fun, but there's a lot more to look at. And a good first chapter only means the rest of the book has a reasonable likelihood of being good. No more, no less.
I am a beginner developer and I have tried all of the models, oh boy have I. For me, I was frustrated with AI until Gemini 2.5 Pro! Nothing compares. Nothing!
Hopefully one day we are getting models that run on normal machines at home 🙂
Qwen is not as good as Grok and Claude. I gave all three an incomplete piece of code in Flet, but Qwen even messed up the code it gave me, while Grok and Claude gave me correct, working code.
The fine print under the benchmarks reveals a lot. It's a great model, but not at the level of Gemini 2.5 Pro or GPT-4o.
New rule: until it's tested live, anyone can claim anything…
Qwen shows me that benchmarks don't reflect reality; Gemini 2.5 Pro absolutely cooks it in real-world use.
I can't believe you made a video out of reading Qwen3 release material, parroting the fake benchmark data and not even bothering to test it, while claiming it's amazing and being all enthusiastic. You literally have zero credibility anymore. Why the ridiculous thumbnail, like you just discovered that Qwen3 is Jesus reincarnated as an LLM? You can't possibly think people are liking this crap.
Yes right “comparable to Gemini 2.5 pro” 😂
Use at your own risk.
It is a CCP state owned company
I’m not talking about political ideology but commercial espionage
They can steal your ideas
Not multimodal, unfortunately. But otherwise the best 32B model I've ever tested locally, and I have tested A LOT. In thinking mode it outperforms even the best 70B models I have tested. It's also great that you can simply switch it to fast mode and even then it's still strong. There are some pitfalls, though: by default you can only use 32K context without changing configuration options that Ollama doesn't even expose, so with the main versions on their site you are stuck with 32K. Also, at low temperatures the reasoning can get stuck. Reasoning worked best at 0.8 for me and was still good at 0.6 (which is also the recommended default). But I've only tested the q4 from Ollama. There are probably better quantizations, especially from Unsloth, which already offers a 128K-context variant. I couldn't go full context with those yet either, but Ollama is just limited (no KV cache splitting, llama.cpp's split options not exposed).
ClosedAI is done.
I used the biggest one in HuggingChat and it's BAAAD. It's not even possible to steer it with a custom system prompt (for the moment that doesn't work even if it's activated by the user). Matthew's videos of the last year are all the same: benchmarks, fast generations, and his enthusiasm…
This is perfect for my university master's dissertation 😂. I literally have a meeting with my supervisor tomorrow and now I have a better foundation model that I can use for the project.
Qwen3 also fails needle-in-a-haystack at every level. I've given it the same 10K-token instruction document over 100 times now, and not once has it been able to keep the information straight: it references and uses info from part 5 of the data set in other parts, mixing everything up so badly that the output is legitimately gibberish and useless. If I were to give it a NIAH score, it would get 20%. The worst I've ever seen.
Qwen3 is honestly terrible. It's actively worse than Qwen 2.5 in everything other than answering knowledge-based questions. It doesn't follow instructions at all, and it can't be used to role-play an agent. Seriously, it is the worst model I've seen in the last year.
This is a complete joke. All he did was look at the benchmarks, didn't bother trying it. Berman is discrediting himself with this one. Keep it up and I will unsub. Do better!
One thing I don't understand: what exactly is in these benchmarks? Are they a specific type of real-world question or task, or something else? Please, if someone knows, share with me here 👇
…96GB of RAM… 😮 ….weeping at my little laptop….
I did a quick test with Qwen3 32B and it was much better without thinking. I wanted some Dart code. With thinking I got JavaScript instead of Dart, and the code didn't do exactly what I wanted. Without thinking I got some excellent Dart code. From now on, Qwen3 32B without thinking and Gemma 3 27B are my two favorite LLMs.
@ Matthew Berman:
I've discovered that apparently, according to the LLM, its knowledge cutoff date is October 2023. Can you verify this? Does this matter in your opinion? (I found it did in my case, based on the prompt I provided.)
It's still pretty damn good though, but it ain't Gemini 2.5 Pro.
Anyone else want to chime in? I welcome all opinions, as long as they're constructive to some degree (if you want to vent frustration then that's fine too).
I have a doubt: if an MoE model has very few active parameters, like Qwen 3 30B, will it require resources only for the 3 billion active parameters?
Sounds amazing. Tests show something else 🤣
Is it a deep agent like manus?
I strongly advise that we (U.S.) do not encourage use of and become dependent on Chinese AI models.
gemini cant be beaten
I wish new release videos like this came with a quick summary at the beginning… List each version of the new model, whether it can run on each of the common VRAM sizes (8/12/24/48/80/132…) and if it is multi-modal. I know that it could hurt your ratings because right now you have my attention for the whole video while I wait with growing frustration for the two items of information that I am looking for. More often than not, I walk away with an implied answer to both questions. For example having watched this video entirely, I sort of know that the 30B version can run in 92gb and (because you never mentioned it) that the model is not multi-modal. So I cross Qwen-3 off my testing list and move on. As always, thanks for a great video!
Where's FireShip right now?
Sorry, but for me it isn't the best model in the world. For now.
Qwen 3 is poop.