New York, NY – In a recent demonstration, Gavin Purcell, co-host of the AI for Humans podcast, shared a video on Reddit showing an argument between a human pretending to be an embezzler and their boss, the latter voiced by an artificial intelligence model known as Sesame's CSM. The exchange was realistic enough that it was difficult to tell which side of the conversation was the AI, underscoring the capabilities of the technology.
Sesame's CSM achieves its lifelike quality through an unusual design: two AI models (a backbone and a decoder), based on Meta's Llama architecture, process text and audio together. Trained on approximately 1 million hours of primarily English audio, CSM comes in three sizes, the largest with 8.3 billion parameters. Handling text and audio in a single-stage, multimodal transformer sets it apart from traditional text-to-speech systems, which typically generate speech in separate stages.
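For illustration only, the sketch below shows what a single-stage backbone-and-decoder model operating on interleaved text and audio tokens can look like in PyTorch. The layer counts, vocabulary sizes, codebook layout, and the use of a non-causal encoder are assumptions made to keep the example short; they are not Sesame's published configuration.

```python
# Illustrative sketch of a single-stage backbone + decoder speech model.
# Sizes, the audio codebook layout, and the non-causal layers are
# assumptions for this example, not Sesame's actual implementation.
import torch
import torch.nn as nn

class BackboneDecoderTTS(nn.Module):
    def __init__(self, text_vocab=32_000, audio_vocab=1024,
                 d_backbone=512, d_decoder=256, n_codebooks=8):
        super().__init__()
        # Text tokens and audio codec tokens share one sequence that the
        # backbone processes jointly (single stage, multimodal).
        self.text_emb = nn.Embedding(text_vocab, d_backbone)
        self.audio_emb = nn.Embedding(audio_vocab, d_backbone)
        backbone_layer = nn.TransformerEncoderLayer(
            d_model=d_backbone, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(backbone_layer, num_layers=6)
        # A smaller decoder refines each backbone state into per-codebook
        # audio code distributions (a real system would decode causally).
        self.proj = nn.Linear(d_backbone, d_decoder)
        decoder_layer = nn.TransformerEncoderLayer(
            d_model=d_decoder, nhead=4, batch_first=True)
        self.decoder = nn.TransformerEncoder(decoder_layer, num_layers=2)
        self.code_heads = nn.ModuleList(
            [nn.Linear(d_decoder, audio_vocab) for _ in range(n_codebooks)])

    def forward(self, text_ids, audio_ids):
        # Interleave conversational text and prior audio in one sequence.
        seq = torch.cat([self.text_emb(text_ids),
                         self.audio_emb(audio_ids)], dim=1)
        hidden = self.backbone(seq)
        refined = self.decoder(self.proj(hidden))
        # Logits over each audio codebook for every sequence position.
        return [head(refined) for head in self.code_heads]

model = BackboneDecoderTTS()
text = torch.randint(0, 32_000, (1, 16))   # toy text token ids
audio = torch.randint(0, 1024, (1, 32))    # toy audio codec token ids
logits = model(text, audio)
print(logits[0].shape)  # torch.Size([1, 48, 1024]): one code distribution per position
```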
In blind tests on isolated speech samples, human evaluators showed no significant preference between CSM-generated audio and real human recordings, suggesting near-human quality when clips are heard without context. Once conversational context was introduced, however, evaluators consistently favored the real human speech, pointing to a gap in fully contextual speech generation that Sesame aims to close.
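As a rough illustration of how such a blind comparison can be scored, the snippet below tallies listener preferences with and without conversational context. The trial data and the aggregation are placeholders for the example, not Sesame's evaluation protocol or results.

```python
# Toy scoring of a blind A/B preference test between CSM samples and
# human recordings. The ratings below are made-up placeholder data.
from collections import Counter

# Each trial records which clip the listener preferred.
no_context_trials = ["csm", "human", "tie", "csm", "human", "tie"]
with_context_trials = ["human", "human", "csm", "human", "human", "tie"]

def preference_rates(trials):
    counts = Counter(trials)
    total = len(trials)
    return {choice: counts[choice] / total for choice in ("csm", "human", "tie")}

print("no context:  ", preference_rates(no_context_trials))
print("with context:", preference_rates(with_context_trials))
# A roughly even split (first case) suggests near-human quality in isolation;
# a consistent skew toward "human" (second case) reflects the gap that
# appears once conversational context is added.
```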
Despite these advances, Sesame co-founder Brendan Iribe acknowledged the system's limitations. In a comment on Hacker News, he said the model still struggles with tone, prosody, pacing, interruptions, timing, and overall conversation flow, but expressed optimism that the technology will continue to improve.