Hmm... not sure if this is the right place to post this. Since there's nothing in the rules about pre-trained models, I'm posting it here. (Hoping I don't get kicked off the server... <:nails:1159569314848972891>)
First off, KLM is a pre-trained model. It is based on Dr. Joon-cheol Park's thesis, whose script covers all Korean sounds, including bilabials, alveolars, velars, uvulars, and glottals. The model was trained on that 40-page script read by various voice actors as well as ordinary Korean men and women, and it also includes song data from male and female vocalists.
Unlike typical pre-trained models, it has a very wide vocal range, which should be very helpful for those covering songs. Unfortunately, the shared version omits some scripts and audio that could not be licensed, and it only supports a 32k sample rate.
(It can only be applied to RVC V2 / 32k training.)
Applio is currently the most widely used software among developers on this server, and it is intuitive and easy to use even for beginners. The files have been extracted so they can be used directly in Applio's Custom Pre-trained section.
Just unzip the files into the \rvc\pretraineds\pretraineds_custom folder where Applio is installed.
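If you'd rather script the install, here's a minimal Python sketch. It assumes you've already downloaded KLMv7s_32k.zip into your working folder and that Applio lives at C:\Applio; adjust both paths to your setup:

```python
import zipfile
from pathlib import Path

# Adjust this to wherever your Applio is installed.
applio_dir = Path(r"C:\Applio")
target = applio_dir / "rvc" / "pretraineds" / "pretraineds_custom"
target.mkdir(parents=True, exist_ok=True)

# Extract the downloaded archive straight into the custom pre-trained folder.
with zipfile.ZipFile("KLMv7s_32k.zip") as zf:
    zf.extractall(target)
```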
Since Korean and Japanese have similar structures and pronunciation, most Japanese voices should work fine with KLM, but I'm not sure whether it will apply properly to so-called "AYAYA"-type or VTuber voices. It is not suitable for voices that include ASMR or robotic sounds.
Here at AI Hub there are many experienced creators who do a lot of research, but most novice creators don't have the resources or hardware needed to train a model. This model was designed to minimize server or Colab costs for those who use it for hobbies or study.
If it's really difficult to collect or clean a dataset, speech conversion can get by with a dataset of 10 seconds or less, but this is not recommended. When a model's data is too limited, white noise tends to be severely boosted, so a dataset of at least 3 minutes is preferable.
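If you want to sanity-check your dataset length before training, something like this works (assumes WAV files in a folder named "dataset" and the soundfile package; both names are just examples):

```python
from pathlib import Path

import soundfile as sf  # pip install soundfile

# Sum the duration of every WAV file in the (example) "dataset" folder.
total_sec = 0.0
for wav in Path("dataset").glob("*.wav"):
    info = sf.info(str(wav))
    total_sec += info.frames / info.samplerate

print(f"Total dataset length: {total_sec / 60:.1f} min")
if total_sec < 180:
    print("Warning: under 3 minutes; expect boosted white noise in the result.")
```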
I am still training the pre-trained model and expect to share a version supporting 32k, 40k, and 48k as early as October or November of this year. The amount of data is so much larger than the shared version's that this is expected to take a significant amount of time. This pre-trained model has no copyright or usage restrictions and can be used for any purpose. However, if you do use it, it would help me a lot in identifying issues and improving the model if you could mention in your posts that you used KLM.
Recommendations for training:
Dataset -
Model dataset: 5~8 mins of speech
(opt.) Vocal dataset: 1~3 mins
Train -
"USE RVC V2 / 32K / RMVPE"
Batch size per GPU: 4
Epochs: 50~150 (the number of epochs should scale with the amount of data; for details, ask the experts in the model-maker channel, or see the sketch after this list for one rough way to pick a number)
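To keep the recommendations in one place, here's an illustrative Python config. This is not Applio's actual API (you set these values in Applio's UI), and the epoch heuristic is purely my own assumption:

```python
# Recommended KLM training settings (illustrative only; set these in Applio's UI).
KLM_TRAIN_CONFIG = {
    "rvc_version": "v2",
    "sample_rate": 32000,   # KLM only supports 32k
    "f0_method": "rmvpe",
    "batch_size_per_gpu": 4,
    "epochs": (50, 150),    # scale with dataset size
}

def suggest_epochs(dataset_minutes: float) -> int:
    """Rough heuristic mapping dataset length to an epoch count.

    This formula is my own assumption, not an official recommendation:
    0 min -> 50 epochs, 8 min (top of the recommended range) -> 150.
    """
    lo, hi = KLM_TRAIN_CONFIG["epochs"]
    return max(lo, min(hi, round(lo + (dataset_minutes / 8.0) * (hi - lo))))

print(suggest_epochs(6))  # 125 epochs for a 6-minute dataset
```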
Pre-Trained Model Link -
https://huggingface.co/SeoulStreamingStation/IU-Voice-MultiLanguage-V1/resolve/main/KLMv7s_32k.zip?download=true
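If you prefer to download it from a script instead of the browser, the huggingface_hub library can fetch the same file (repo ID and filename taken from the link above):

```python
from huggingface_hub import hf_hub_download  # pip install huggingface_hub

# Fetch KLMv7s_32k.zip from the repo linked above; returns a local cache path.
zip_path = hf_hub_download(
    repo_id="SeoulStreamingStation/IU-Voice-MultiLanguage-V1",
    filename="KLMv7s_32k.zip",
)
print(zip_path)
```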
SAMPLES: Speech & Vocal
6 mins of dataset / batch size 4 / 100 epochs
Normal = RVC V2 Pre-trained model
KLM = KLMv7S
The inaccurate pronunciation in the samples is not a problem with the model; it's because I was lazy and cleaned up the target dataset automatically.