Are you looking to use text-to-speech to add an AI voice over to your video? Do you want an easy-to-use API to add a background voice over to your video? Then this post has everything you need.
There are many use cases where you might want to generate a video with an AI voice over, from explainer videos to automated news and weather updates.
I will show you how to create an AI voice over and add it to a video using Shotstack's media APIs. This is a step-by-step, easy-to-follow guide, so even those with no prior programmatic video editing experience can automate the generation of videos with professional-sounding AI narration.
Shotstack is an API-driven, video automation platform for creating, editing, and distributing dynamic videos at scale. In this post, I will guide you through generating the AI voice over using the Shotstack Create API. Then, I will show you how to integrate the voice over with your video assets using the Shotstack Edit API.
Before you start, register on the Shotstack website to get a free API key. You'll need this key to make authorized requests to the Create and Edit APIs. You should also be familiar with the cURL utility and running commands from the command line of your operating system.
The Create API generates video, images, audio, and text using the latest generative AI services. In addition to built-in services, it also allows you to seamlessly invoke third-party services: ElevenLabs for hyper-realistic voice overs, HeyGen or D-ID for text-to-avatar creation, or Stability AI to generate images from text prompts.
Let's explore two ways to generate a voice over using the Create API: one using the built-in API service and one using ElevenLabs. To make the example more realistic, we'll mock up a weather report style video using our own script and a background video from Pexels.
We will send a POST request to the Create API using cURL to generate a voice over from text. Execute the following command in your shell. Make sure to use your own stage/sandbox API key as the value for the x-api-key header parameter, instead of YOUR_API_KEY:
curl -X POST \
-H "Content-Type: application/json" \
-H "x-api-key: YOUR_API_KEY" \
https://api.shotstack.io/create/stage/assets/ \
-d '
{
    "provider": "shotstack",
    "options": {
        "type": "text-to-speech",
        "text": "Moving down to the Central area we are seeing clear skies, a gentle breeze and mild temperatures, perfect for an evening stroll. Temperatures are hovering around a comfortable 55 degrees fahrenheit, the ideal weather for outdoor activities.",
        "voice": "Matthew",
        "language": "en-US"
    }
}'
The stage keyword in the URL is the environment you are working in. The JSON body of the command (the value of the -d option) includes the text to convert to speech and the chosen voice and language combination. To get a list of all available languages and voices, check out the text-to-speech API docs.
Expect an output like the following:
{
    "data": {
        "type": "asset",
        "id": "01hmg-6n6yd-k3q2w-me4kg-3rgtn9",
        "attributes": {
            "owner": "c2jsl2d4xd",
            "provider": "shotstack",
            "type": "text-to-speech",
            "status": "queued",
            "created": "2024-01-19T06:31:14.425Z",
            "updated": "2024-01-19T06:31:14.425Z"
        }
    }
}
Copy the id from the response as we will be using it to check the status of the audio in the next step.
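If you are scripting these calls, you can capture the response and extract the id programmatically instead of copying it by hand. Here is a minimal sketch using the sample response above; python3 is used for JSON parsing so no extra tools are required:

```shell
# Sample Create API response (normally captured with: RESPONSE=$(curl -s ...))
RESPONSE='{"data":{"type":"asset","id":"01hmg-6n6yd-k3q2w-me4kg-3rgtn9","attributes":{"status":"queued"}}}'

# Extract the asset id for use in the status-check URL
ASSET_ID=$(echo "$RESPONSE" | python3 -c "import sys, json; print(json.load(sys.stdin)['data']['id'])")
echo "$ASSET_ID"
```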
It can take time for the voice over generation to complete so wait for a few seconds and then run the following command. Make sure to replace the id in the URL with the one received in the response of the last API call.
curl -X GET \
-H "Content-Type: application/json" \
-H "x-api-key: YOUR_API_KEY" \
https://api.shotstack.io/create/stage/assets/01hmg-6n6yd-k3q2w-me4kg-3rgtn9
Expect an output similar to this:
{
    "data": {
        "type": "asset",
        "id": "01hmg-6n6yd-k3q2w-me4kg-3rgtn9",
        "attributes": {
            "owner": "c2jsl2d4xd",
            "provider": "shotstack",
            "type": "text-to-speech",
            "url": "https://shotstack-create-api-stage-assets.s3.amazonaws.com/c2jsl2d4xd/01hmg-6n6yd-k3q2w-me4kg-3rgtn9.mp3",
            "status": "done",
            "created": "2024-01-19T06:31:14.425Z",
            "updated": "2024-01-19T06:31:20.181Z"
        }
    }
}
The status parameter in the response should show done once the audio has been generated. If generation is still in progress, you might see statuses like rendering, saving, or queued. In that case, just wait a few seconds and resend the same GET request. Once the status is done, download or visit the link in the url parameter of the response. It should play an MP3 audio file.
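In a script, this check boils down to reading the status field and branching on it. A minimal sketch using the sample response above (python3 handles the JSON parsing so no extra tools are needed):

```shell
# Sample status-check response (normally: RESPONSE=$(curl -s .../assets/ASSET_ID))
RESPONSE='{"data":{"attributes":{"status":"done","url":"https://shotstack-create-api-stage-assets.s3.amazonaws.com/c2jsl2d4xd/01hmg-6n6yd-k3q2w-me4kg-3rgtn9.mp3"}}}'

# Read the status field from the response
STATUS=$(echo "$RESPONSE" | python3 -c "import sys, json; print(json.load(sys.stdin)['data']['attributes']['status'])")

if [ "$STATUS" = "done" ]; then
  # The url field is only populated once generation has finished
  URL=$(echo "$RESPONSE" | python3 -c "import sys, json; print(json.load(sys.stdin)['data']['attributes']['url'])")
  echo "Audio ready: $URL"
else
  echo "Still processing (status: $STATUS); retry in a few seconds"
fi
```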
Here is an example of the Shotstack-generated audio:
Another method to generate a realistic AI voice over is to leverage the third-party integration with ElevenLabs.
Now we are ready to use ElevenLabs. For the POST request, we simply need to change the provider attribute inside the JSON body. It should look like this. Just like with the other API requests, use your own stage/sandbox API key as the value for the x-api-key header parameter, instead of YOUR_API_KEY.
curl -X POST \
-H "Content-Type: application/json" \
-H "x-api-key: YOUR_API_KEY" \
https://api.shotstack.io/create/stage/assets/ \
-d '
{
    "provider": "elevenlabs",
    "options": {
        "type": "text-to-speech",
        "text": "Moving down to the Central area we are seeing clear skies, a gentle breeze and mild temperatures, perfect for an evening stroll. Temperatures are hovering around a comfortable 55 degrees fahrenheit, the ideal weather for outdoor activities.",
        "voice": "Adam"
    }
}'
The ElevenLabs request body is very similar to the Shotstack text-to-speech request, except the provider value is elevenlabs, there is no language option, and the list of voices is different. In this example we use the voice Adam. For a full list of voices, check out the ElevenLabs options.
The response is the same as using the Shotstack text to speech and includes the id of the asset being generated. Use the id to check the status of the asset using exactly the same approach as before, like this:
curl -X GET \
-H "Content-Type: application/json" \
-H "x-api-key: YOUR_API_KEY" \
https://api.shotstack.io/create/stage/assets/01hmg-acrmr-3yd0c-q5c61-w5b7sh
As before the response includes the status and url of the generated audio file. Here is an example of the audio generated:
For the second part of this guide, we will use the Edit API to add our AI voice over to our weather video. First, create an empty file named video.json and paste the following JSON into it:
{
    "timeline": {
        "background": "#000000",
        "tracks": [
            {
                "clips": [
                    {
                        "asset": {
                            "type": "audio",
                            "src": "https://shotstack-create-api-stage-assets.s3.amazonaws.com/c2jsl2d4xd/01hmg-acrmr-3yd0c-q5c61-w5b7sh.mp3"
                        },
                        "start": 0,
                        "length": 16
                    }
                ]
            },
            {
                "clips": [
                    {
                        "asset": {
                            "type": "video",
                            "src": "https://player.vimeo.com/external/428974406.hd.mp4?s=8e75e82ef712ac04df173007f2e5f32ee00180fd&profile_id=174&oauth2_token_id=57447761"
                        },
                        "start": 0,
                        "length": 17,
                        "transition": {
                            "in": "fade",
                            "out": "fade"
                        }
                    }
                ]
            }
        ]
    },
    "output": {
        "format": "mp4",
        "resolution": "hd"
    }
}
This JSON file defines a timeline with two tracks, one containing the generated voice over as an audio clip and one containing the Pexels background video, along with the output format and resolution.
Note that the MP3 file generated using ElevenLabs is 16 seconds long, so we have set the length of the audio clip to 16 seconds. The video from Pexels is 22 seconds long but we have cut it short at 17 seconds.
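If you plan to generate many of these videos, you can template this file from the shell rather than editing it by hand. A minimal sketch that writes video.json with a heredoc, parameterizing the audio URL and clip lengths (the values below are the example asset and lengths used in this guide):

```shell
# Values that change per render: the generated voice over and the clip lengths
AUDIO_URL="https://shotstack-create-api-stage-assets.s3.amazonaws.com/c2jsl2d4xd/01hmg-acrmr-3yd0c-q5c61-w5b7sh.mp3"
AUDIO_LENGTH=16
VIDEO_LENGTH=17

# Write the edit JSON, substituting the variables above
cat > video.json <<EOF
{
  "timeline": {
    "background": "#000000",
    "tracks": [
      {
        "clips": [
          {
            "asset": { "type": "audio", "src": "$AUDIO_URL" },
            "start": 0,
            "length": $AUDIO_LENGTH
          }
        ]
      },
      {
        "clips": [
          {
            "asset": {
              "type": "video",
              "src": "https://player.vimeo.com/external/428974406.hd.mp4?s=8e75e82ef712ac04df173007f2e5f32ee00180fd&profile_id=174&oauth2_token_id=57447761"
            },
            "start": 0,
            "length": $VIDEO_LENGTH,
            "transition": { "in": "fade", "out": "fade" }
          }
        ]
      }
    ]
  },
  "output": { "format": "mp4", "resolution": "hd" }
}
EOF

# Confirm the file is valid JSON before posting it to the Edit API
python3 -m json.tool video.json > /dev/null && echo "video.json is valid"
```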
Now, we will send a POST request to the Edit API to generate the video based on this JSON file. Run the following command from your shell with your own API key:
curl -X POST \
-H "Content-Type: application/json" \
-H "x-api-key: YOUR_API_KEY" \
-d @video.json \
https://api.shotstack.io/edit/stage/render
Expect a response like this:
{
    "success": true,
    "message": "Created",
    "response": {
        "message": "Render Successfully Queued",
        "id": "b609765f-ec2e-4727-8f22-ded38efd4f4d"
    }
}
Copy the id from the response as we will be using it to check the status of the video render in the next step.
The Edit API takes time to render a video. After you send the render request to the API, wait a few seconds and run the command below. Make sure to replace the id in the URL with the one received in the response of the last API call.
curl -X GET \
-H "Content-Type: application/json" \
-H "x-api-key: YOUR_API_KEY" \
https://api.shotstack.io/edit/stage/render/b609765f-ec2e-4727-8f22-ded38efd4f4d
A response similar to the one below will be returned:
{
    "success": true,
    "message": "OK",
    "response": {
        "id": "b609765f-ec2e-4727-8f22-ded38efd4f4d",
        "owner": "c2jsl2d4xd",
        "plan": "sandbox",
        "status": "done",
        "error": "",
        "duration": 15,
        "billable": 15,
        "renderTime": 9911.59,
        "url": "https://shotstack-api-stage-output.s3-ap-southeast-2.amazonaws.com/c2jsl2d4xd/b609765f-ec2e-4727-8f22-ded38efd4f4d.mp4",
        "poster": null,
        "thumbnail": null,
        "data": {
            ...
        }
    }
}
Just like with the Create API, the status parameter in the response should show done once the video has finished rendering. If rendering is still in progress, you might see statuses like rendering, saving, or queued. In that case, just wait a few more seconds and resend the same GET request.
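Rather than resending the request by hand, a small loop can do the waiting for you. Here is a minimal sketch (python3 is used for JSON parsing so no extra tools are required); SHOTSTACK_API_KEY and poll_render are names introduced for this example:

```shell
# Poll the Edit API every few seconds until the render status is "done"
poll_render() {
  status=""
  while [ "$status" != "done" ]; do
    status=$(curl -s -H "x-api-key: $SHOTSTACK_API_KEY" \
      "https://api.shotstack.io/edit/stage/render/$1" |
      python3 -c "import sys, json; print(json.load(sys.stdin)['response']['status'])")
    echo "render status: $status"
    [ "$status" = "done" ] || sleep 5
  done
}

# Usage, with your own key and the render id from the previous response:
# SHOTSTACK_API_KEY=YOUR_API_KEY poll_render b609765f-ec2e-4727-8f22-ded38efd4f4d
```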
Once the status is done, download or visit the link in the url parameter of the response. The URL points to an MP4 video file which will play in your browser or a video player.
Here is the final weather video:
In this post, we explored easy ways to generate AI voice overs using an API and seamlessly blend them into videos with nothing more than simple JSON payloads and API requests.
From this simple example you can imagine creating a full featured personalised weather report video with a different voice over and background video for each location and weather conditions. You could also add lower thirds titles, weather icons and a subtle background soundtrack using the Edit API to make the video even more engaging.
To learn more, visit the official docs or check out more of our developer guides and tutorials.
Every month we share articles like this one to keep you up to speed with automated video editing.