Google’s new AI turns text into music


Researchers at Google have created an AI that can generate minutes-long pieces of music from a text prompt, and can even transform a whistled or hummed melody into other instruments, similar to how systems like DALL-E generate images from written prompts (via TechCrunch). The model is called MusicLM, and while you can't play around with it yourself, the company has uploaded a number of samples it produced using the model.

The examples are impressive: 30-second snippets that sound like real songs, created from paragraph-length descriptions that prescribe a genre, mood, and even specific instruments, as well as a five-minute piece generated from just one or two words like "melodic techno." Perhaps my favorite is the "Story Mode" demo, in which the model is essentially given a script and morphs between prompts. For example, this prompt:

Electronic song played in a video game (0:00-0:15)

Meditation song by the river (0:15-0:30)

Fire (0:30-0:45)

Fireworks (0:45-0:60)

The resulting track might not be to everyone's taste, but I could absolutely see it having been composed by humans (I listened to it on loop dozens of times while writing this article). Also featured on the demo site are examples of what the model produces when asked to generate 10-second clips of specific instruments like the cello or maracas (the latter example is one case where the system does a relatively poor job), clips of certain genres, music that would fit a prison escape, and even what novice and advanced piano players sound like.

MusicLM can also simulate human vocals, and while it seems to get the tone and overall sound of voices right, there is a quality to them that is clearly off. The best way I can describe it is that they sound grainy or staticky. That quality isn't obvious in the example above, but I think the following one illustrates it well.

That, by the way, is the result of a prompt asking for music that would be played at a gym. You may also have noticed that the lyrics are nonsense, but in a way you might not necessarily catch if you aren't paying attention, like hearing someone sing in Simlish, or in something that sounds like English but isn't.

I won't pretend to know how Google achieved these results, but for those of you who are the type of person who can make sense of the figures, the company has published a research paper that explains them in detail.

A diagram illustrating the "hierarchical sequence-to-sequence modeling task" used by the researchers, which leverages AudioLM, another Google project.
Chart: Google
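To give a rough sense of what "hierarchical sequence-to-sequence modeling" means here, the sketch below fakes the staging with stand-in functions. It only illustrates the structure and is not Google's implementation: in the actual paper, each stage is a learned Transformer operating on tokens from MuLan, w2v-BERT, and SoundStream, while every function below is a hypothetical placeholder.

```python
# Toy sketch of hierarchical sequence-to-sequence generation. NOT Google's
# implementation: in the paper each stage is a learned Transformer over
# tokens from MuLan, w2v-BERT, and SoundStream. Every function here is a
# hypothetical stand-in, kept deterministic so the staging is visible.
import random

def text_to_conditioning(prompt: str) -> list[int]:
    """Stand-in for a joint text/music embedding model (MuLan in the paper):
    maps the prompt to a short sequence of conditioning tokens."""
    random.seed(prompt)  # deterministic per prompt, for the demo
    return [random.randrange(1024) for _ in range(8)]

def generate_semantic(cond: list[int], length: int) -> list[int]:
    """Stand-in for the first stage: coarse "semantic" tokens capturing
    long-term structure such as melody and rhythm."""
    tiled = cond * (length // len(cond) + 1)
    return [(tok * 31 + i) % 512 for i, tok in enumerate(tiled[:length])]

def generate_acoustic(semantic: list[int]) -> list[int]:
    """Stand-in for the second stage: fine-grained "acoustic" tokens, several
    per semantic token, which a neural codec would decode into a waveform."""
    return [(s * 7 + k) % 4096 for s in semantic for k in range(4)]

cond = text_to_conditioning("melodic techno")
semantic = generate_semantic(cond, length=16)  # coarse musical structure
acoustic = generate_acoustic(semantic)         # 4x finer acoustic detail
print(len(cond), "conditioning ->", len(semantic), "semantic ->",
      len(acoustic), "acoustic tokens")
```

The point of the hierarchy is that each stage only has to solve a simpler problem: the first sketches the long-term shape of the music, and the second fills in acoustic detail at a much higher token rate.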

AI-generated music has a long history stretching back decades; there are systems that have been credited with composing pop songs, copying Bach better than a human could in the '90s, and accompanying live performances. One recent approach uses the AI image generation engine Stable Diffusion to turn text prompts into spectrograms, which are then turned into music. According to the paper, MusicLM outperforms other systems both in its "quality and adherence to the caption" and in the fact that it can take in audio and copy melodies.
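For a sense of what the spectrogram-to-music step involves, the snippet below inverts a magnitude spectrogram into a waveform with the Griffin-Lim algorithm. This is a minimal sketch of the general technique, not that system's actual code: the spectrogram here is a synthetic toy, whereas the real pipeline gets its spectrograms from Stable Diffusion.

```python
# Minimal sketch of the spectrogram -> audio step, assuming librosa and
# soundfile are installed. The spectrogram is a synthetic toy, standing in
# for one that a text-to-image model would produce.
import numpy as np
import librosa
import soundfile as sf

sr, n_fft, hop = 22050, 2048, 512
n_frames = 200  # roughly 4.6 seconds of audio at this hop length

# Build a toy magnitude spectrogram: a 440 Hz band that ramps up in volume.
freqs = librosa.fft_frequencies(sr=sr, n_fft=n_fft)
S = np.zeros((len(freqs), n_frames))
S[np.argmin(np.abs(freqs - 440.0)), :] = np.linspace(0.1, 1.0, n_frames)

# Griffin-Lim iteratively estimates the phase the spectrogram discarded,
# turning magnitudes back into a playable waveform.
audio = librosa.griffinlim(S, n_iter=32, hop_length=hop)
sf.write("tone.wav", audio, sr)
```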

That last capability powers one of the coolest demos the researchers put out. On the site, you can play the input audio of someone humming or whistling a tune, then hear how the model reproduces it as an electronic synth lead, a string quartet, a guitar solo, and so on. From the examples I listened to, it handles the task very well.
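To make "copying a melody" a little more concrete: one plausible first step is extracting the pitch contour from the hummed input. The sketch below does that with librosa's pYIN tracker on a synthesized hum; it is my own illustration, since the paper conditions the model on learned melody embeddings rather than an explicit pitch track like this.

```python
# Illustrative only: extract a melody contour from a hummed input using
# librosa's pYIN pitch tracker. MusicLM itself conditions on learned melody
# embeddings, not an explicit f0 track like this one.
import numpy as np
import librosa

sr = 22050
t = np.linspace(0, 2.0, int(sr * 2.0), endpoint=False)

# Synthetic "hum": a tone gliding from 220 Hz up one octave over two seconds.
freq = 220.0 * 2.0 ** (t / 2.0)
phase = 2 * np.pi * np.cumsum(freq) / sr
hum = 0.5 * np.sin(phase)

# pYIN returns a per-frame pitch estimate plus a voiced/unvoiced decision.
f0, voiced, _ = librosa.pyin(hum, fmin=80, fmax=600, sr=sr)
print(librosa.hz_to_note(f0[voiced])[:12])  # contour as note names, e.g. A3...
```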

As with other forays into this type of AI, Google is being much more cautious with MusicLM than some of its peers may be with similar technology. "We have no plans to release models at this point," the paper concludes, citing risks of "potential misappropriation of creative content" (read: plagiarism) and potential cultural appropriation or misrepresentation.

It's always possible the technology will show up in one of Google's fun musical experiments at some point, but for now, the only people who will be able to make use of the research are others building music AI systems. Google says it is publicly releasing a dataset containing about 5,500 music-text pairs, which could help in training and evaluating other music AIs.
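If you want a feel for what a music-text pair looks like, the snippet below loads that dataset (published by the researchers as MusicCaps). The Hugging Face namespace and column names here are assumptions about the public release, so check the official release page for the canonical source.

```python
# Sketch of browsing the ~5,500-pair music-text dataset (MusicCaps).
# Assumes the Hugging Face `datasets` library and that the set is hosted
# under google/MusicCaps; the column names below are assumptions too.
from datasets import load_dataset

ds = load_dataset("google/MusicCaps", split="train")
print(len(ds))  # roughly 5.5k rows

row = ds[0]
# Each row pairs a YouTube clip ID with a rich free-text caption.
print(row["ytid"], "->", row["caption"][:100])
```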

