From 471dc4ed481f70c1e1503faf5541aa8bf907e0fc Mon Sep 17 00:00:00 2001
From: Toshiaki Takeuchi
Date: Wed, 15 May 2024 10:29:25 -0400
Subject: [PATCH] Adding Whisper support

---
 README.md                                   |  33 ++-
 examples/UsingWhisperToTranscribeSpeech.mlx | Bin 0 -> 4945 bytes
 openAIAudio.m                               | 253 ++++++++++++++++++++
 3 files changed, 283 insertions(+), 3 deletions(-)
 create mode 100644 examples/UsingWhisperToTranscribeSpeech.mlx
 create mode 100644 openAIAudio.m

diff --git a/README.md b/README.md
index ad32884..8508d6e 100644
--- a/README.md
+++ b/README.md
@@ -4,13 +4,14 @@
 
 This repository contains example code to demonstrate how to connect MATLAB to the OpenAI™ Chat Completions API (which powers ChatGPT™) as well as OpenAI Images API (which powers DALL·E™). This allows you to leverage the natural language processing capabilities of large language models directly within your MATLAB environment.
 
-The functionality shown here serves as an interface to the ChatGPT and DALL·E APIs. To start using the OpenAI APIs, you first need to obtain OpenAI API keys. You are responsible for any fees OpenAI may charge for the use of their APIs. You should be familiar with the limitations and risks associated with using this technology, and you agree that you shall be solely responsible for full compliance with any terms that may apply to your use of the OpenAI APIs.
+The functionality shown here serves as an interface to the Chat Completions, Images, and Audio APIs. To start using the OpenAI APIs, you first need to obtain OpenAI API keys. You are responsible for any fees OpenAI may charge for the use of their APIs. You should be familiar with the limitations and risks associated with using this technology, and you agree that you shall be solely responsible for full compliance with any terms that may apply to your use of the OpenAI APIs.
 
 Some of the current LLMs supported are:
 - gpt-3.5-turbo, gpt-3.5-turbo-1106, gpt-3.5-turbo-0125
 - gpt-4-turbo, gpt-4-turbo-2024-04-09 (GPT-4 Turbo with Vision)
 - gpt-4, gpt-4-0613
-- dall-e-2, dall-e-3
+- dall-e-2, dall-e-3 (Images)
+- whisper-1, tts-1, tts-1-hd (Audio)
 
 For details on the specification of each model, check the official [OpenAI documentation](https://platform.openai.com/docs/models).
 
@@ -328,6 +329,31 @@
 imshow(images{1})
 % Should output an image based on the prompt
 ```
 
+## Getting Started with Audio API
+Generate speech from your text with OpenAI using the function `openAIAudio.speech` as follows:
+```matlab
+exampleText = "Here is an example!";
+[y,Fs] = openAIAudio.speech(exampleText);
+sound(y,Fs)
+audiowrite("example.wav",y,Fs) % save the audio to a file
+```
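+
+You can change the model, voice, and speaking speed through name-value arguments (see `openAIAudio.speech` for the full list). A minimal sketch of the documented options; the values here are just illustrations:
+```matlab
+[y,Fs] = openAIAudio.speech(exampleText, ModelName="tts-1-hd", Voice="nova", Speed=1.2);
+sound(y,Fs)
+```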
+
+Transcribe speech from an audio file using the function `openAIAudio.transcriptions` as follows:
+```matlab
+output = openAIAudio.transcriptions("example.wav");
+delete("example.wav")
+output.text
+```
+
+This will return the original input text.
+```shell
+>> output.text
+
+ans =
+
+    'Here is an example.'
+```
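+
+Audio in another language can be translated directly into English text using the function `openAIAudio.translations`. A minimal sketch, where `"speech_in_spanish.mp3"` is a placeholder for your own recording:
+```matlab
+output = openAIAudio.translations("speech_in_spanish.mp3"); % hypothetical file name
+output.text % English translation of the spoken audio
+```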
+
 ## Examples
 To learn how to use this in your workflows, see [Examples](/examples/).
 
@@ -340,7 +366,8 @@ To learn how to use this in your workflows, see [Examples](/examples/).
 - [DescribeImagesUsingChatGPT.mlx](/examples/DescribeImagesUsingChatGPT.mlx): Learn how to use GPT-4 Turbo with Vision to understand the content of an image.
 - [AnalyzeSentimentinTextUsingChatGPTinJSONMode.mlx](/examples/AnalyzeSentimentinTextUsingChatGPTinJSONMode.mlx): Learn how to use JSON mode in chat completions
 - [UsingDALLEToEditImages.mlx](/examples/UsingDALLEToEditImages.mlx): Learn how to edit images
-- [UsingDALLEToGenerateImages.mlx](/examples/UsingDALLEToGenerateImages.mlx): Create variations of images and editimages.
+- [UsingDALLEToGenerateImages.mlx](/examples/UsingDALLEToGenerateImages.mlx): Learn how to generate images and create variations of images.
+- [UsingWhisperToTranscribeSpeech.mlx](/examples/UsingWhisperToTranscribeSpeech.mlx): Transcribe speech and have it read aloud.
 
 ## License
 
diff --git a/examples/UsingWhisperToTranscribeSpeech.mlx b/examples/UsingWhisperToTranscribeSpeech.mlx
new file mode 100644
index 0000000000000000000000000000000000000000..c9ff877a5d9ccd239d5999c001361f3bfcff431e
GIT binary patch
literal 4945
[4945 bytes of base85-encoded binary data omitted]

literal 0
HcmV?d00001

diff --git a/openAIAudio.m b/openAIAudio.m
new file mode 100644
index 0000000..40cd897
--- /dev/null
+++ b/openAIAudio.m
@@ -0,0 +1,253 @@
+classdef openAIAudio
+    %openAIAudio Collection of static methods to connect to the Audio API from OpenAI.
+    %
+    %   openAIAudio Functions:
+    %       speech         - Text to Speech API from OpenAI that generates
+    %                        audio from text.
+    %       transcriptions - Speech to Text API from OpenAI that generates
+    %                        a text transcription from an audio file.
+    %       translations   - Speech to Text API from OpenAI that generates
+    %                        an English translation from a foreign-language audio file.
+
+    methods (Access=public,Static)
+
+        function [y,Fs,response] = speech(text,nvp)
+            % SPEECH Generate speech using the OpenAI API
+            %
+            %   [y,Fs,response] = OPENAIAUDIO.SPEECH(text) generates audio
+            %   from the input TEXT using the OpenAI API, and returns the
+            %   sampled data, y, and the sample rate for that data, Fs.
+            %   Use `audiowrite(filename,y,Fs)` to save the audio to a file.
+            %
+            %   [y,Fs,response] = OPENAIAUDIO.SPEECH(__, Name=Value) specifies additional options
+            %   using one or more name-value arguments:
+            %
+            %   ModelName - Name of the model to use for speech generation.
+            %               "tts-1" (default) or "tts-1-hd"
+            %   Voice     - The voice to use in the generated audio. Options are
+            %               "alloy" (default), "echo", "fable", "onyx",
+            %               "nova", and "shimmer". A preview of each voice is available at
+            %               https://platform.openai.com/docs/guides/text-to-speech/voice-options
+            %   Speed     - The speed of the generated audio, from 0.25 to 4. Default is 1.
+            %   TimeOut   - Connection timeout in seconds (default: 10 secs)
+            %   ApiKey    - OpenAI API key. If unspecified, the environment
+            %               variable OPENAI_API_KEY is used.
+            %
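+            %   Example (a minimal sketch; assumes a valid OPENAI_API_KEY is set):
+            %       [y,Fs] = openAIAudio.speech("Hello, world!");
+            %       sound(y,Fs)
+            %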
+
+            arguments
+                text          (1,1) {mustBeTextScalar}
+                nvp.ModelName (1,1) {mustBeMember(nvp.ModelName,["tts-1","tts-1-hd"])} = "tts-1"
+                nvp.Voice     (1,1) {mustBeMember(nvp.Voice,["alloy","echo","fable","onyx","nova","shimmer"])} = "alloy"
+                nvp.Speed     (1,1) {mustBeNumeric,mustBeInRange(nvp.Speed,0.25,4)} = 1
+                nvp.TimeOut   (1,1) {mustBeReal,mustBePositive} = 10
+                nvp.ApiKey    {mustBeNonzeroLengthTextScalar}
+            end
+
+            endpoint = "https://api.openai.com/v1/audio/speech";
+            % Prefer an explicitly supplied key over the environment variable
+            if isfield(nvp,"ApiKey")
+                apikey = nvp.ApiKey;
+            else
+                apikey = getenv("OPENAI_API_KEY");
+            end
+            timeout = nvp.TimeOut;
+            params = struct("model",nvp.ModelName,"input",text,"voice",nvp.Voice);
+            if nvp.Speed ~= 1
+                params.speed = nvp.Speed;
+            end
+
+            % Send the HTTP Request
+            response = sendRequest(apikey, endpoint, params, timeout);
+            if isfield(response.Body.Data,"error")
+                y = [];
+                Fs = [];
+            else
+                y = response.Body.Data{1};
+                Fs = response.Body.Data{2};
+            end
+
+        end
+
+        function [output,response] = transcriptions(filepath,nvp)
+            % TRANSCRIPTIONS Transcribe audio using the OpenAI API
+            %
+            %   [output, response] = OPENAIAUDIO.TRANSCRIPTIONS(filepath) generates
+            %   a text transcription from the input audio file FILEPATH using the
+            %   OpenAI API.
+            %
+            %   [output, response] = OPENAIAUDIO.TRANSCRIPTIONS(__, Name=Value)
+            %   specifies additional options using one or more name-value arguments:
+            %
+            %   ModelName              - Name of the model to use for transcription.
+            %                            Only "whisper-1" is currently available.
+            %   Language               - The language of the input audio. Supplying it
+            %                            improves accuracy and reduces latency. Use
+            %                            the two-letter ISO 639-1 code for the language:
+            %                            https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes
+            %   Prompt                 - Optional text to guide the model's style
+            %                            or continue a previous audio segment.
+            %   ResponseFormat         - The format of the transcript output:
+            %                            "json" (default), "text", "srt", "vtt",
+            %                            or "verbose_json"
+            %   Temperature            - The sampling temperature between 0 and 1.
+            %                            Higher values like 0.8 make the output
+            %                            more random. Default is 0.
+            %   TimestampGranularities - The timestamp granularity used when
+            %                            ResponseFormat is set to "verbose_json":
+            %                            "segment" (default), and/or "word". Choosing
+            %                            "word" adds latency.
+            %   TimeOut                - Connection timeout in seconds (default: 10 secs)
+            %   ApiKey                 - OpenAI API key. If unspecified, the environment
+            %                            variable OPENAI_API_KEY is used.
+            %
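+            %   Example (a minimal sketch; "speech.mp3" stands in for your own audio file):
+            %       out = openAIAudio.transcriptions("speech.mp3", Language="en");
+            %       out.text
+            %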
+
+            arguments
+                filepath                   {mustBeValidFileType(filepath)}
+                nvp.ModelName              {mustBeMember(nvp.ModelName,"whisper-1")} = "whisper-1"
+                nvp.Language               {mustBeValidLanCode(nvp.Language)}
+                nvp.Prompt                 {mustBeTextScalar}
+                nvp.ResponseFormat         {mustBeMember(nvp.ResponseFormat, ...
+                                               ["json","text","srt","vtt","verbose_json"])} = "json"
+                nvp.Temperature            (1,1) {mustBeInRange(nvp.Temperature,0,1)} = 0
+                nvp.TimestampGranularities (1,:) {mustBeText,mustBeMember(nvp.TimestampGranularities, ...
+                                               ["segment","word"])}
+                nvp.TimeOut                (1,1) {mustBeReal,mustBePositive} = 10
+                nvp.ApiKey                 {mustBeNonzeroLengthTextScalar}
+            end
+
+            endpoint = "https://api.openai.com/v1/audio/transcriptions";
+            % Prefer an explicitly supplied key over the environment variable
+            if isfield(nvp,"ApiKey")
+                apikey = nvp.ApiKey;
+            else
+                apikey = getenv("OPENAI_API_KEY");
+            end
+            timeout = nvp.TimeOut;
+
+            import matlab.net.http.io.*
+            params = struct('model', nvp.ModelName, 'file', FileProvider(filepath));
+            if isfield(nvp,"Language")
+                params.language = nvp.Language;
+            end
+            if isfield(nvp,"Prompt")
+                params.prompt = nvp.Prompt;
+            end
+            if nvp.ResponseFormat ~= "json"
+                params.response_format = nvp.ResponseFormat;
+            end
+            if nvp.Temperature > 0
+                params.temperature = nvp.Temperature;
+            end
+            if isfield(nvp,"TimestampGranularities")
+                if nvp.ResponseFormat == "verbose_json"
+                    if isscalar(nvp.TimestampGranularities) && nvp.TimestampGranularities ~= "segment"
+                        params.timestamp_granularities = {nvp.TimestampGranularities};
+                    else
+                        params.timestamp_granularities = nvp.TimestampGranularities;
+                    end
+                else
+                    warning("Set ResponseFormat to 'verbose_json' to enable TimestampGranularities.")
+                end
+            end
+            keyval = [fieldnames(params) struct2cell(params)].';
+            body = MultipartFormProvider(keyval{:});
+
+            % Send the HTTP Request
+            response = sendRequest(apikey, endpoint, body, timeout);
+            if isfield(response.Body.Data,"error")
+                output = "";
+            else
+                output = response.Body.Data;
+            end
+        end
+
+        function [output,response] = translations(filepath,nvp)
+            % TRANSLATIONS Translate audio into English text using the OpenAI API
+            %
+            %   [output, response] = OPENAIAUDIO.TRANSLATIONS(filepath) generates
+            %   an English translation from the input audio file FILEPATH using the
+            %   OpenAI API.
+            %
+            %   [output, response] = OPENAIAUDIO.TRANSLATIONS(__, Name=Value)
+            %   specifies additional options using one or more name-value arguments:
+            %
+            %   ModelName      - Name of the model to use for translation.
+            %                    Only "whisper-1" is currently available.
+            %   Prompt         - Optional text to guide the model's style
+            %                    or continue a previous audio segment. The
+            %                    prompt must be in English.
+            %   ResponseFormat - The format of the transcript output:
+            %                    "json" (default), "text", "srt", "vtt",
+            %                    or "verbose_json"
+            %   Temperature    - The sampling temperature between 0 and 1.
+            %                    Higher values like 0.8 make the output
+            %                    more random. Default is 0.
+            %   TimeOut        - Connection timeout in seconds (default: 10 secs)
+            %   ApiKey         - OpenAI API key. If unspecified, the environment
+            %                    variable OPENAI_API_KEY is used.
+            %
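+            %   Example (a minimal sketch; "speech_in_german.mp3" stands in for your own audio file):
+            %       out = openAIAudio.translations("speech_in_german.mp3");
+            %       out.text   % English translation
+            %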
+
+            arguments
+                filepath           {mustBeValidFileType(filepath)}
+                nvp.ModelName      {mustBeMember(nvp.ModelName,"whisper-1")} = "whisper-1"
+                nvp.Prompt         {mustBeTextScalar}
+                nvp.ResponseFormat {mustBeMember(nvp.ResponseFormat, ...
+                                       ["json","text","srt","vtt","verbose_json"])} = "json"
+                nvp.Temperature    (1,1) {mustBeInRange(nvp.Temperature,0,1)} = 0
+                nvp.TimeOut        (1,1) {mustBeReal,mustBePositive} = 10
+                nvp.ApiKey         {mustBeNonzeroLengthTextScalar}
+            end
+
+            endpoint = "https://api.openai.com/v1/audio/translations";
+            % Prefer an explicitly supplied key over the environment variable
+            if isfield(nvp,"ApiKey")
+                apikey = nvp.ApiKey;
+            else
+                apikey = getenv("OPENAI_API_KEY");
+            end
+            timeout = nvp.TimeOut;
+
+            import matlab.net.http.io.*
+            params = struct('model', nvp.ModelName, 'file', FileProvider(filepath));
+            if isfield(nvp,"Prompt")
+                params.prompt = nvp.Prompt;
+            end
+            if nvp.ResponseFormat ~= "json"
+                params.response_format = nvp.ResponseFormat;
+            end
+            if nvp.Temperature > 0
+                params.temperature = nvp.Temperature;
+            end
+            keyval = [fieldnames(params) struct2cell(params)].';
+            body = MultipartFormProvider(keyval{:});
+
+            % Send the HTTP Request
+            response = sendRequest(apikey, endpoint, body, timeout);
+            if isfield(response.Body.Data,"error")
+                output = "";
+            else
+                output = response.Body.Data;
+            end
+
+        end
+
+    end
+
+end
+
+function response = sendRequest(apikey,endpoint,body,timeout)
+    % sendRequest Send a request to the given endpoint and return the response
+    headers = matlab.net.http.HeaderField('Authorization', "Bearer " + apikey);
+    if isa(body,'struct')
+        headers(2) = matlab.net.http.HeaderField('Content-Type', 'application/json');
+    end
+    request = matlab.net.http.RequestMessage('post', headers, body);
+    httpOpts = matlab.net.http.HTTPOptions;
+    httpOpts.ConnectTimeout = timeout;
+    response = send(request, matlab.net.URI(endpoint), httpOpts);
+end
+
+function mustBeValidFileType(filePath)
+    mustBeFile(filePath);
+    s = dir(filePath);
+    if ~endsWith(s.name, [".flac",".mp3",".mp4",".mpeg",".mpga",".m4a",".ogg",".wav",".webm"])
+        error("Not a valid audio file type.")
+    end
+    mustBeLessThan(s.bytes,4e+6) % limit upload size (the OpenAI API itself accepts files up to 25 MB)
+end
+
+function mustBeNonzeroLengthTextScalar(content)
+    mustBeNonzeroLengthText(content)
+    mustBeTextScalar(content)
+end
+
+function mustBeValidLanCode(code)
+    mustBeTextScalar(code)
+    if strlength(code) ~= 2
+        error("Use a 2-letter ISO 639-1 language code.")
+    end
+
+end
\ No newline at end of file