Skip to content

morioka/tiny-openai-whisper-api

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

60 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

tiny-openai-whisper-api

OpenAI Whisper API-style local server, runnig on FastAPI. This is for companies behind proxies or security firewalls.

This API will be compatible with OpenAI Whisper (speech to text) API. See also Create transcription - API Reference - OpenAI API.

Some of code has been copied from whisper-ui

Now, this server emulates the following OpenAI APIs.

Running Environment

This was built & tested on Python 3.10.9, Ubutu22.04/WSL2.

  • openai=1.55.0
  • openai-whisper=20240930

Setup

sudo apt install ffmpeg
pip install fastapi python-multipart pydantic uvicorn openai-whisper httpx
# or pip install -r requirements.txt

or

docker compose build

Usage

server

export WHISPER_MODEL=turbo  # 'turbo' if not supecified
export PYTHONPATH=.
uvicorn main:app --host 0.0.0.0

or

docker compose up

client

note: Authorization header is ignored.

example 1: typical usecase, identical to OpenAI Whisper API example

curl --request POST \
  http://127.0.0.1:8000/v1/audio/transcriptions \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: multipart/form-data" \
  -F model="whisper-1" \
  -F file="@/path/to/file/openai.mp3"

example 2: set the output format as text, described in quickstart.

curl --request POST \
  http://127.0.0.1:8000/v1/audio/transcriptions \
  -H "Content-Type: multipart/form-data" \
  -F model="whisper-1" \
  -F file="@/path/to/file/openai.mp3" \
  -F response_format=text

example 3: Windows PowerShell5

Set-ExecutionPolicy -Scope Process -ExecutionPolicy RemoteSigned
$env:OPENAI_API_KEY="dummy"
$env:OPENAI_BASE_URL="http://localhost:8000/v1"

.\tests\powershell\Invoke-Whisper-Audio-Transcription.ps1 "C:\temp\alloy.wav"

experimental: gpt-4o-audio-preview, chat-completions

currenly "output text only" mode is supported. If "output text and audio" is specified, the system makes "output text only" response.

  • output text and audio
    • modalities=["text", "audio"], audio={"voice": "alloy", "format": "wav"}
completion = client.chat.completions.create(
    model="gpt-4o-audio-preview-2024-10-01",
    modalities=["text", "audio"],
    audio={"voice": "alloy", "format": "wav"},
    messages=[
        {
            "role": "user",
            "content": [
                { 
                    "type": "text",
                    "text": "What is in this recording?"
                },
                {
                    "type": "input_audio",
                    "input_audio": {
                        "data": encoded_string,
                        "format": "wav"
                    }
                }
            ]
        },
    ]
)

print(completion.choices[0].message.audio.transcript)
ChatCompletion(id='chatcmpl-AXQt8BTMW4Gh1OcJ5qStVDNZGdzSq', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content=None, refusal=None, role='assistant', audio=ChatCompletionAudio(id='audio_6744555cc6d48190b67e70798ab606c3', data='{{base64-wav}}', expires_at=1732535148, transcript='The recording contains a voice stating that the sun rises in the east and sets in the west, a fact that has been observed by humans for thousands of years.'), function_call=None, tool_calls=None))], created=1732531546, model='gpt-4o-audio-preview-2024-10-01', object='chat.completion', service_tier=None, system_fingerprint='fp_130ac2f073', usage=CompletionUsage(completion_tokens=236, prompt_tokens=86, total_tokens=322, completion_tokens_details=CompletionTokensDetails(accepted_prediction_tokens=0, audio_tokens=188, reasoning_tokens=0, rejected_prediction_tokens=0, text_tokens=48), prompt_tokens_details=PromptTokensDetails(audio_tokens=69, cached_tokens=0, text_tokens=17, image_tokens=0)))
  • output text only
    • modalities=["text"]
completion = client.chat.completions.create(
    model="gpt-4o-audio-preview-2024-10-01",
    modalities=["text"],
    messages=[
        {
            "role": "user",
            "content": [
                { 
                    "type": "text",
                    "text": "What is in this recording?"
                },
                {
                    "type": "input_audio",
                    "input_audio": {
                        "data": encoded_string,
                        "format": "wav"
                    }
                }
            ]
        },
    ]
)

print(completion.choices[0].message.content)
ChatCompletion(id='chatcmpl-AXTBlZypmtf1CCWrR6X5uX55r4VHY', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content="The recording contains a statement about the sun's movement, stating that the sun rises in the east and sets in the west, a fact that has been observed by humans for thousands of years.", refusal=None, role='assistant', audio=None, function_call=None, tool_calls=None))], created=1732540389, model='gpt-4o-audio-preview-2024-10-01', object='chat.completion', service_tier=None, system_fingerprint='fp_130ac2f073', usage=CompletionUsage(completion_tokens=38, prompt_tokens=86, total_tokens=124, completion_tokens_details=CompletionTokensDetails(accepted_prediction_tokens=0, audio_tokens=0, reasoning_tokens=0, rejected_prediction_tokens=0, text_tokens=38), prompt_tokens_details=PromptTokensDetails(audio_tokens=69, cached_tokens=0, text_tokens=17, image_tokens=0)))

experimental: Dify integration

  • LLM registration ... OK
  • audio file transcription ... NG
    • dify-0.12.1 doesn't pass an user prompot which contains audio file (data) to this tiny-openai-whisper-api server.
    • probably on dify-0.12.1, "native (audio) file processing capabilities" is available only for openai:gpt-4o-audio-preview. How can we give these feature to openai-compatible LLMs?

TODO

License

Whisper is licensed under MIT. Everything else by morioka is licensed under MIT.

About

OpenAI Whisper API-style local server, runnig on FastAPI

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published