E2E Phi 3 Mini 4K Instruct Whisper Demo
Interactive Phi 3 Mini 4K Instruct Chatbot with Whisper
Introduction:
The Interactive Phi 3 Mini 4K Instruct Chatbot is a tool that allows users to interact with the Microsoft Phi 3 Mini 4K instruct demo using text or audio input. The chatbot can be used for a variety of tasks, such as translation, weather updates, and general information gathering.
Create your Hugging Face Access Token
1. Create a new token.
2. Provide a name for it.
3. Select write permissions.
4. Copy the token and save it in a safe place.
The following Python code performs two main tasks: importing the `os` module and setting an environment variable.
- Importing the `os` module: The `os` module in Python provides a way to interact with the operating system. It allows you to perform various operating-system-related tasks, such as accessing environment variables and working with files and directories. In this code, the `os` module is imported using the `import` statement, which makes its functionality available in the current Python script.
- Setting an environment variable: An environment variable is a value that can be accessed by programs running on the operating system. It is a way to store configuration settings or other information that can be used by multiple programs. In this code, a new environment variable is set via the `os.environ` dictionary: the key is `'HF_TOKEN'`, and the value is taken from the `HUGGINGFACE_TOKEN` variable. The `HUGGINGFACE_TOKEN` variable is defined just above this code snippet and is assigned a string value `"hf_**************"` using the `#@param` syntax, which is often used in Jupyter notebooks to allow user input and parameter configuration directly in the notebook interface. By setting the `'HF_TOKEN'` environment variable, the token can be accessed by other parts of the program or by other programs running on the same operating system.
Overall, this code imports the `os` module and sets an environment variable named `'HF_TOKEN'` with the value provided in the `HUGGINGFACE_TOKEN` variable.
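The cell described above can be sketched as follows (the token value is the placeholder shown in the notebook, not a real token):

```python
import os

# Placeholder token; in the notebook this comes from a #@param form field.
HUGGINGFACE_TOKEN = "hf_**************"  #@param {type:"string"}

# Store the token in an environment variable so that other parts of the
# program (e.g. Hugging Face libraries) can pick it up automatically.
os.environ['HF_TOKEN'] = HUGGINGFACE_TOKEN
```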
This code snippet defines a function called clear_output that is used to clear the output of the current cell in Jupyter Notebook or IPython. Let's break down the code and understand its functionality:
The function clear_output takes one parameter called wait, which is a boolean value. By default, wait is set to False. This parameter determines whether the function should wait until new output is available to replace the existing output before clearing it.
The function itself is used to clear the output of the current cell. In Jupyter Notebook or IPython, when a cell produces output, such as printed text or graphical plots, that output is displayed below the cell. The clear_output function allows you to clear that output.
The implementation of the function is not provided in the code snippet, as indicated by the ellipsis (...). The ellipsis represents a placeholder for the actual code that performs the clearing of the output. The implementation of the function may involve interacting with the Jupyter Notebook or IPython API to remove the existing output from the cell.
Overall, this function provides a convenient way to clear the output of the current cell in Jupyter Notebook or IPython, making it easier to manage and update the displayed output during interactive coding sessions.
Perform text-to-speech (TTS) using the Edge TTS service. Let's go through the relevant function implementations one by one:
- `calculate_rate_string(input_value)`: This function takes an input value and calculates the rate string for the TTS voice. The input value represents the desired speed of the speech, where a value of 1 represents normal speed. The function calculates the rate by subtracting 1 from the input value and multiplying by 100, then determines the sign based on whether the input value is greater than or equal to 1. It returns the rate string in the format `"{sign}{rate}"`.
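Based on that description, the function can be sketched roughly like this (a reconstruction, not the notebook's exact code):

```python
def calculate_rate_string(input_value):
    """Convert a speed multiplier (1.0 = normal) into a signed rate string."""
    rate = (input_value - 1) * 100           # percentage offset from normal speed
    sign = '+' if input_value >= 1 else '-'  # Edge TTS expects an explicit sign
    return f"{sign}{abs(int(rate))}"
```

For example, a speed of 1.5 yields `"+50"` and a speed of 0.5 yields `"-50"`.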
- `make_chunks(input_text, language)`: This function takes an input text and a language as parameters. It splits the input text into chunks based on language-specific rules. In this implementation, if the language is "English", the function splits the text at each period (`.`), removes any leading or trailing whitespace, appends a period to each chunk, and returns the filtered list of chunks.
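A sketch of that chunking logic, assuming only the English rule described above (the fallback for other languages is an assumption):

```python
def make_chunks(input_text, language):
    """Split text into sentence-sized chunks for the TTS service."""
    if language == "English":
        # Split on periods, drop empty/whitespace-only pieces,
        # and restore the trailing period on each chunk.
        pieces = input_text.strip().split(".")
        return [p.strip() + "." for p in pieces if p.strip()]
    # Other languages are not handled in this sketch.
    return [input_text]
```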
- `tts_file_name(text)`: This function generates a file name for the TTS audio file based on the input text. It performs several transformations on the text: removing a trailing period (if present), converting the text to lowercase, stripping leading and trailing whitespace, and replacing spaces with underscores. It then truncates the text to a maximum of 25 characters (or uses the full text if shorter). Finally, it generates a random string using the `uuid` module and combines it with the truncated text to create the file name in the format `/content/edge_tts_voice/{truncated_text}_{random_string}.mp3`.
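The file-naming step might look like this (a reconstruction; the `"empty"` fallback for blank input is an assumption):

```python
import uuid

def tts_file_name(text):
    """Build a unique .mp3 path from the input text plus a random suffix."""
    if text.endswith("."):
        text = text[:-1]                             # drop a trailing period
    text = text.lower().strip().replace(" ", "_")    # normalize for a file name
    truncated_text = text[:25] if text else "empty"  # assumption: fallback name
    random_string = str(uuid.uuid4())[:8]            # short random suffix
    return f"/content/edge_tts_voice/{truncated_text}_{random_string}.mp3"
```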
- `merge_audio_files(audio_paths, output_path)`: This function merges multiple audio files into a single audio file. It takes a list of audio file paths and an output path as parameters. The function initializes an empty `AudioSegment` object called `merged_audio`, then iterates through each audio file path, loads the file using the `AudioSegment.from_file()` method from the `pydub` library, and appends it to the `merged_audio` object. Finally, it exports the merged audio to the specified output path in MP3 format.
- `edge_free_tts(chunks_list, speed, voice_name, save_path)`: This function performs the TTS operation using the Edge TTS service. It takes a list of text chunks, the speed of the speech, the voice name, and the save path as parameters. If the number of chunks is greater than 1, the function creates a directory for storing the individual chunk audio files. It then iterates through each chunk, constructs an Edge TTS command using the `calculate_rate_string()` function, the voice name, and the chunk text, and executes the command using `os.system()`. If the command execution is successful, it appends the path of the generated audio file to a list. After processing all the chunks, it merges the individual audio files using `merge_audio_files()` and saves the merged audio to the specified save path. If there is only one chunk, it directly generates the Edge TTS command and saves the audio to the save path. Finally, it returns the save path of the generated audio file.
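The command handed to `os.system()` for one chunk might be assembled roughly like this. The flags below match the public `edge-tts` CLI, but the notebook's exact command string is an assumption, and `build_edge_tts_command` is a hypothetical helper used only for illustration:

```python
def calculate_rate_string(input_value):
    """Convert a speed multiplier (1.0 = normal) into a signed rate string."""
    rate = (input_value - 1) * 100
    sign = '+' if input_value >= 1 else '-'
    return f"{sign}{abs(int(rate))}"

def build_edge_tts_command(text, speed, voice_name, save_path):
    """Assemble the shell command passed to os.system() for one chunk."""
    rate_str = calculate_rate_string(speed)
    return (f'edge-tts --rate={rate_str}% --voice "{voice_name}" '
            f'--text "{text}" --write-media "{save_path}"')
```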
- `random_audio_name_generate()`: This function generates a random audio file name using the `uuid` module. It generates a random UUID, converts it to a string, takes the first 8 characters, appends the `.mp3` extension, and returns the random audio file name.
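This one is small enough to reconstruct directly from the description:

```python
import uuid

def random_audio_name_generate():
    """Return a short random file name such as '1a2b3c4d.mp3'."""
    random_uuid = uuid.uuid4()
    return str(random_uuid)[:8] + ".mp3"
```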
- `talk(input_text)`: This function is the main entry point for performing the TTS operation. It takes an input text as a parameter and first checks its length to determine whether it is a long sentence (600 characters or more). Based on the length and the value of the `translate_text_flag` variable, it determines the language and generates the list of text chunks using `make_chunks()`. It then generates a save path for the audio file using `random_audio_name_generate()`. Finally, it calls `edge_free_tts()` to perform the TTS operation and returns the save path of the generated audio file.
Overall, these functions work together to split the input text into chunks, generate a file name for the audio file, perform the TTS operation using the Edge TTS service, and merge the individual audio files into a single audio file.
This section covers the implementation of two functions, `convert_to_text` and `run_text_prompt`, as well as the declaration of two classes, `str` and `Audio`.
The convert_to_text function takes an audio_path as input and transcribes the audio to text using a model called whisper_model. The function first checks if the gpu flag is set to True. If it is, the whisper_model is used with certain parameters such as word_timestamps=True, fp16=True, language='English', and task='translate'. If the gpu flag is False, the whisper_model is used with fp16=False. The resulting transcription is then saved to a file named 'scan.txt' and returned as the text.
The run_text_prompt function takes a message and a chat_history as input. It uses the phi_demo function to generate a response from a chatbot based on the input message. The generated response is then passed to the talk function, which converts the response into an audio file and returns the file path. The Audio class is used to display and play the audio file. The audio is displayed using the display function from the IPython.display module, and the Audio object is created with the autoplay=True parameter, so the audio starts playing automatically. The chat_history is updated with the input message and the generated response, and an empty string and the updated chat_history are returned.
The str class is a built-in class in Python that represents a sequence of characters. It provides various methods for manipulating and working with strings, such as capitalize, casefold, center, count, encode, endswith, expandtabs, find, format, index, isalnum, isalpha, isascii, isdecimal, isdigit, isidentifier, islower, isnumeric, isprintable, isspace, istitle, isupper, join, ljust, lower, lstrip, partition, replace, removeprefix, removesuffix, rfind, rindex, rjust, rpartition, rsplit, rstrip, split, splitlines, startswith, strip, swapcase, title, translate, upper, zfill, and more. These methods allow you to perform operations like searching, replacing, formatting, and manipulating strings.
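A few of the `str` methods mentioned above in action:

```python
s = " Phi 3 Mini 4K Instruct "

print(s.strip())                    # remove leading/trailing whitespace
print(s.strip().lower())            # lowercase the whole string
print(s.strip().replace(" ", "_"))  # replace spaces with underscores
print(s.split())                    # split on whitespace into a list of words
print(s.strip().startswith("Phi"))  # test a prefix -> True
```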
The Audio class is a custom class that represents an audio object. It is used to create an audio player in the Jupyter Notebook environment. The class accepts various parameters such as data, filename, url, embed, rate, autoplay, and normalize. The data parameter can be a numpy array, a list of samples, a string representing a filename or URL, or raw PCM data. The filename parameter is used to specify a local file to load the audio data from, and the url parameter is used to specify a URL to download the audio data from. The embed parameter determines whether the audio data should be embedded using a data URI or referenced from the original source. The rate parameter specifies the sampling rate of the audio data. The autoplay parameter determines whether the audio should start playing automatically. The normalize parameter specifies whether the audio data should be normalized (rescaled) to the maximum possible range. The Audio class also provides methods like reload to reload the audio data from file or URL, and attributes like src_attr, autoplay_attr, and element_id_attr to retrieve the corresponding attributes for the audio element in HTML.
Overall, these functions and classes are used to transcribe audio to text, generate audio responses from a chatbot, and display and play audio in the Jupyter Notebook environment.