Microsoft has introduced three new foundation AI models capable of generating text, voice, and images, a significant step in the company’s effort to expand its multimodal AI capabilities. Despite Microsoft’s ongoing partnership with OpenAI, the move signals the company’s ambition to compete more directly with other leading AI developers.
One of the models, MAI-Transcribe-1, can convert speech into text across 25 languages and operates 2.5 times faster than the company’s Azure Fast service, according to Microsoft. MAI-Voice-1 focuses on audio generation, enabling users to produce up to 60 seconds of sound in just one second and even create custom voices. Meanwhile, MAI-Image-2 is designed for image generation.
MAI-Image-2 was first introduced on March 19 through MAI Playground, a testing platform for large language models. All three models are now available via Microsoft Foundry, with the transcription and voice tools also accessible in MAI Playground.
These models were developed by Microsoft’s MAI Superintelligence team, established in November 2025 and led by Mustafa Suleyman. Suleyman emphasized the company’s focus on “Humanist AI,” aiming to design systems centered on real human communication and practical use cases.
Microsoft is positioning these models as cost-effective alternatives in an increasingly competitive AI market, where rivals like Google and OpenAI dominate. Pricing starts at $0.36 per hour for transcription, $22 per one million characters for voice generation, and $5 per one million input tokens (and $33 per one million output tokens) for image generation.
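To make the rate card concrete, here is a minimal sketch of how a monthly bill could be estimated from the published per-unit prices. The usage figures in the example are illustrative assumptions, not from the article, and the function name is hypothetical:

```python
# Published rates from the article (USD); usage amounts below are assumed.
TRANSCRIPTION_PER_HOUR = 0.36      # $0.36 per audio hour transcribed
VOICE_PER_MILLION_CHARS = 22.0     # $22 per 1M characters of voice output
IMAGE_INPUT_PER_MILLION = 5.0      # $5 per 1M input tokens (image model)
IMAGE_OUTPUT_PER_MILLION = 33.0    # $33 per 1M output tokens (image model)

def estimated_monthly_cost(audio_hours, voice_chars, input_tokens, output_tokens):
    """Estimate a monthly bill in USD for the given (hypothetical) usage."""
    return (audio_hours * TRANSCRIPTION_PER_HOUR
            + voice_chars / 1e6 * VOICE_PER_MILLION_CHARS
            + input_tokens / 1e6 * IMAGE_INPUT_PER_MILLION
            + output_tokens / 1e6 * IMAGE_OUTPUT_PER_MILLION)

# Example: 100 audio hours, 2M voice characters, 1M input + 0.5M output tokens
print(round(estimated_monthly_cost(100, 2_000_000, 1_000_000, 500_000), 2))
```

At that assumed volume the bill comes to roughly a hundred dollars, with transcription and voice generation dominating the total.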
While expanding its in-house AI capabilities, Microsoft remains committed to its partnership with OpenAI, having invested over $13 billion into the organization. The collaboration continues to power many of Microsoft’s products, even as the company builds its own advanced AI systems.