OpenAI’s new GPT-4.1 models can process a million tokens and solve coding problems better than ever


OpenAI launched a new family of AI models this morning that significantly improve coding abilities while cutting costs, responding directly to growing competition in the enterprise AI market.

The San Francisco-based AI company introduced three models — GPT-4.1, GPT-4.1 mini, and GPT-4.1 nano — all available immediately through its API. The new lineup performs better at software engineering tasks, follows instructions more precisely, and can process up to one million tokens of context, equivalent to about 750,000 words.

“GPT-4.1 offers exceptional performance at a lower cost,” said Kevin Weil, chief product officer at OpenAI, during Monday’s announcement. “These models are better than GPT-4o on just about every dimension.”

Perhaps most significant for enterprise customers is the pricing: GPT-4.1 will cost 26% less than its predecessor, while the lightweight nano version becomes OpenAI’s most affordable offering at just 12 cents per million tokens.
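To put the nano pricing in concrete terms, here is a back-of-envelope cost estimate using the announced $0.12-per-million-token figure. This is an illustrative sketch only: real OpenAI billing distinguishes input and output token rates, and the 50-million-token daily workload below is a hypothetical example, not a figure from the announcement.

```python
NANO_PRICE_PER_MILLION = 0.12  # USD per million tokens, per the announcement

def nano_cost(total_tokens: int) -> float:
    """Estimate the cost of processing a given token volume with GPT-4.1 nano."""
    return total_tokens / 1_000_000 * NANO_PRICE_PER_MILLION

# A hypothetical workload of 50 million tokens per day:
daily = nano_cost(50_000_000)
print(f"${daily:.2f} per day")  # $6.00 per day
```

At that rate, even high-volume tasks like autocomplete or classification stay in single-digit dollars per day, which is the economic argument behind the nano tier.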

How GPT-4.1’s improvements target enterprise developers’ biggest pain points

In a candid interview with VentureBeat, Michelle Pokrass, post-training research lead at OpenAI, emphasized that practical business applications drove the development process.

“GPT-4.1 was trained with one goal: being useful for developers,” Pokrass told VentureBeat. “We’ve found GPT-4.1 is much better at following the kinds of instructions that enterprises use in practice, which makes it much easier to deploy production-ready applications.”

This focus on real-world utility is reflected in benchmark results. On SWE-bench Verified, which measures software engineering capabilities, GPT-4.1 scored 54.6% — a substantial 21.4 percentage point improvement over GPT-4o.

For businesses developing AI agents that work independently on complex tasks, the improvements in instruction following are particularly valuable. On Scale’s MultiChallenge benchmark, GPT-4.1 scored 38.3%, outperforming GPT-4o by 10.5 percentage points.

Why OpenAI’s three-tiered model strategy challenges competitors like Google and Anthropic

The introduction of three distinct models at different price points addresses the diversifying AI marketplace. The flagship GPT-4.1 targets complex enterprise applications, while mini and nano versions address use cases where speed and cost efficiency are priorities.

“Not all tasks need the most intelligence or top capabilities,” Pokrass told VentureBeat. “Nano is going to be a workhorse model for use cases like autocomplete, classification, data extraction, or anything else where speed is the top concern.”
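The tiering Pokrass describes can be sketched as a simple routing rule: lightweight tasks go to nano, mid-tier tasks to mini, and everything else to the flagship model. The model identifiers below match the announced API names, but the task categories and the routing function itself are illustrative assumptions, not part of OpenAI's API.

```python
# Coarse task buckets (assumed for illustration; the lightweight set
# mirrors the examples Pokrass gave for nano).
LIGHTWEIGHT = {"autocomplete", "classification", "data_extraction"}
MID_TIER = {"summarization", "chat"}

def pick_model(task: str) -> str:
    """Choose a GPT-4.1 tier based on a coarse task category."""
    if task in LIGHTWEIGHT:
        return "gpt-4.1-nano"   # speed and cost are the priority
    if task in MID_TIER:
        return "gpt-4.1-mini"   # balance of capability and cost
    return "gpt-4.1"            # complex or agentic work

print(pick_model("classification"))  # gpt-4.1-nano
print(pick_model("agent_planning"))  # gpt-4.1
```

The design point is that model choice becomes a per-request decision rather than a per-application one, letting a single system mix tiers to control cost.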

Simultaneously, OpenAI announced plans to deprecate GPT-4.5 Preview — its largest and most expensive model released just two months ago — from its API by July 14. The company positioned GPT-4.1 as a more cost-effective replacement that delivers “improved or similar performance on many key capabilities at much lower cost and latency.”

This move allows OpenAI to reclaim computing resources while providing developers a more efficient alternative to its costliest offering, which had been priced at $75 per million input tokens and $150 per million output tokens.

Real-world results: How Thomson Reuters, Carlyle and Windsurf are leveraging GPT-4.1

Several enterprise customers who tested the models prior to launch reported substantial improvements in their specific domains.

Thomson Reuters saw a 17% improvement in multi-document review accuracy when using GPT-4.1 with its legal AI assistant, CoCounsel. This enhancement is particularly valuable for complex legal workflows involving lengthy documents with nuanced relationships between clauses.

Financial firm Carlyle reported 50% better performance on extracting granular financial data from dense documents — a critical capability for investment analysis and decision-making.

Varun Mohan, CEO of coding tool provider Windsurf (formerly Codeium), shared detailed performance metrics during the announcement.

“We found that GPT-4.1 reduces the number of times that it needs to read unnecessary files by 40% compared to other leading models, and also modifies unnecessary files 70% less,” Mohan said. “The model is also surprisingly less verbose… GPT-4.1 is 50% less verbose than other leading models.”

Million-token context: What businesses can do with 8x more processing capacity

All three models feature a context window of one million tokens — eight times larger than GPT-4o’s 128,000-token limit. This expanded capacity allows the models to process multiple lengthy documents or entire codebases at once.
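A rough way to gauge whether a document fits in that window is to use the article's approximation that one million tokens corresponds to about 750,000 words (roughly 0.75 words per token). This is a sketch under that assumption; a real deployment would count tokens with the model's actual tokenizer rather than estimating from word counts.

```python
CONTEXT_WINDOW = 1_000_000   # tokens, all three GPT-4.1 models
WORDS_PER_TOKEN = 0.75       # approximation implied by the announcement

def estimated_tokens(word_count: int) -> int:
    """Rough token estimate from a word count."""
    return round(word_count / WORDS_PER_TOKEN)

def fits_in_context(word_count: int) -> bool:
    """Does a document of this many words plausibly fit in the window?"""
    return estimated_tokens(word_count) <= CONTEXT_WINDOW

print(fits_in_context(700_000))  # True:  ~933,000 estimated tokens
print(fits_in_context(800_000))  # False: ~1,067,000 estimated tokens
```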

In a demonstration, OpenAI showed GPT-4.1 analyzing a 450,000-token NASA server log file from 1995, identifying an anomalous entry hidden deep within the data. This capability is particularly valuable for tasks involving large datasets, such as code repositories or corporate document collections.

However, OpenAI acknowledges performance degradation with extremely large inputs. On its internal OpenAI-MRCR test, accuracy dropped from around 84% with 8,000 tokens to 50% with one million tokens.

How the enterprise AI landscape is shifting as Google, Anthropic and OpenAI compete for developers

The release comes as competition in the enterprise AI space heats up. Google recently launched Gemini 2.5 Pro with a comparable one-million-token context window, while Anthropic’s Claude 3.7 Sonnet has gained traction with businesses seeking alternatives to OpenAI’s offerings.

Chinese AI startup DeepSeek also recently upgraded its models, putting additional pressure on OpenAI to maintain its leadership position.

“It’s been really cool to see how improvements in long context understanding have translated into better performance on specific verticals like legal analysis and extracting financial data,” Pokrass said. “We’ve found it’s critical to test our models beyond the academic benchmarks and make sure they perform well with enterprises and developers.”

By releasing these models specifically through its API rather than ChatGPT, OpenAI signals its commitment to developers and enterprise customers. The company plans to gradually incorporate features from GPT-4.1 into ChatGPT over time, but the primary focus remains on providing robust tools for businesses building specialized applications.

To encourage further research in long-context processing, OpenAI is releasing two evaluation datasets: OpenAI-MRCR for testing multi-round coreference abilities and Graphwalks for evaluating complex reasoning across lengthy documents.

For enterprise decision-makers, the GPT-4.1 family offers a more practical, cost-effective approach to AI implementation. As organizations continue integrating AI into their operations, these improvements in reliability, specificity, and efficiency could accelerate adoption across industries still weighing implementation costs against potential benefits.

While competitors chase larger, costlier models, OpenAI’s strategic pivot with GPT-4.1 suggests the future of AI may not belong to the biggest models, but to the most efficient ones. The real breakthrough may not be in the benchmarks, but in bringing enterprise-grade AI within reach of more businesses than ever before.


