Photo by Alex Knight on Unsplash

Fine-tuning Azure Open AI models for tasks like text classification.

Srikanth Machiraju

--

Things you can do with Open AI, but probably should not.

Since Open AI GPT models gained popularity, there have been questions about what else they can do beyond tasks like Q&A and summarization.

One common use case I have come across in many articles and blogs is using Open AI GPT’s fine-tuning capabilities for data classification. In this blog post, I will share my learnings from fine-tuning Azure Open AI GPT models for a classification task.

Note: For people who are new to Azure, Microsoft offers GPT models on a pay-per-use basis on its cloud platform, Azure. We can build GPT-based solutions on Azure using the Open AI Python SDKs.

I recommend you go through my notebook first to understand how GPT models can be fine-tuned for custom tasks like data classification.

Few-shot learning vs. fine-tuning

Using a few-shot learning approach, we can adapt the model to some extent by providing a few request-and-response samples in the prompt.

Few-shot learning examples (Image source: https://arxiv.org/abs/2005.14165)

However, this approach has a few limitations. The first is the token limit: the entire prompt, including the examples, must fit within the model’s token limit (2k per prompt for GPT-3 DaVinci). Secondly, the samples must be repeated in every prompt (remember, GPT models are stateless). A minimal sketch of such a few-shot prompt is shown below.
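For context, here is what a few-shot classification prompt looks like with the v0.x Open AI Python SDK against Azure. This is a minimal sketch; the resource URL, API version, deployment name, labels, and examples are hypothetical placeholders.

```python
import openai

# Azure Open AI connection settings (placeholders, assuming the v0.x openai SDK).
openai.api_type = "azure"
openai.api_base = "https://<your-resource>.openai.azure.com/"
openai.api_version = "2022-12-01"
openai.api_key = "<your-api-key>"

# Few-shot prompt: every request must carry the examples again,
# and the whole prompt must fit within the model's token limit.
prompt = (
    "Classify the news headline into one of: tech, sport, business, politics.\n\n"
    "Headline: Chip maker unveils new 3nm processor\nLabel: tech\n\n"
    "Headline: Midfielder signs record transfer deal\nLabel: sport\n\n"
    "Headline: Central bank raises interest rates again\nLabel:"
)

response = openai.Completion.create(
    engine="davinci-deployment",  # hypothetical deployment name
    prompt=prompt,
    max_tokens=1,
    temperature=0,
)
print(response["choices"][0]["text"].strip())  # expected: "business"
```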

Open AI’s fine-tuning capability, in contrast, lets you train the model on your larger dataset in a one-time training activity that produces a customized GPT model. Nonetheless, this approach has its own limitations, which are explained in detail below.

Note: The limitations mentioned here are relative to traditional, simpler models that can be trained to classify text.

Limitations

Token-based pricing model

Open AI models are priced on a per-token basis. For fine-tuning there is some relief, because we are charged for training hours rather than for the tokens used in training. But when the model is used for inference, the same token-based pricing applies: for every new text that needs classification, you pay per token. The picture below shows the inference cost for the sample dataset.

Inference cost of sample dataset (Image by Author)

Assuming a word roughly equates to 4 tokens and considering the “Ada” model, which is the cheapest at $0.0004 per 1,000 tokens, the cost of inference for 1,000 words is (1000 * 4) / 1000 * $0.0004 = $0.0016. In my opinion, this is huge compared to a traditional model deployed on serverless compute.
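The same back-of-the-envelope calculation expressed as a small Python helper; the four-tokens-per-word conversion and the Ada price are the assumptions used above, not measured values.

```python
# Back-of-the-envelope inference cost, using the assumptions from the text:
# ~4 tokens per word and $0.0004 per 1,000 tokens for the "Ada" model.
TOKENS_PER_WORD = 4
PRICE_PER_1K_TOKENS = 0.0004  # USD, Ada

def inference_cost(num_words: int) -> float:
    tokens = num_words * TOKENS_PER_WORD
    return tokens / 1000 * PRICE_PER_1K_TOKENS

print(f"${inference_cost(1000):.4f}")  # $0.0016 for 1,000 words
```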

Model performance

There are two types of models typically used for classification tasks: generative models and discriminative models. GPT models are generative. Generative models learn the data distribution of the input, so they can generalize well from small datasets and are very good at generating new data points from the learned distribution. For classification, they rely on the joint probability of the input text and the label occurring together. Because Open AI GPT models are pre-trained on knowledge from the world wide web, they tend to hallucinate during inference; the pre-existing knowledge and data patterns can heavily influence the classification of a custom dataset.

As a result, you may see the model predict labels that do not exist in the training set. In the image below, the GPT model predicted a class called “techn”, which does not exist in the dataset.

Discriminative models, on the other hand, learn decision boundaries. When they are wrong, they map the input to an incorrect but existing label. These models learn a function that maps inputs to outputs and make no assumptions about the data distribution.

From the notebook, you can see that the accuracy of the Open AI model is around 88.46% (for a billed training time of ~20 minutes), while a simple random forest classifier reaches around 91.5% (for a training time of 14 seconds).
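As a reference point, a baseline along those lines can be built in a few lines with scikit-learn. This is a sketch, not the exact pipeline from the notebook; the dataset file and the “text” and “label” column names are placeholders.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Hypothetical dataset with "text" and "label" columns.
df = pd.read_csv("news.csv")
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["label"], test_size=0.2, random_state=42
)

# TF-IDF features + random forest: a simple discriminative baseline.
clf = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=5000, stop_words="english")),
    ("rf", RandomForestClassifier(n_estimators=200, random_state=42)),
])
clf.fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```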

To some extent, the hallucinations of the Open AI model can be reduced using prompt engineering techniques (more about that in my blog here), which may help map unrecognized inputs to a constant keyword like “Unknown”. For a detailed explanation of generative vs. discriminative models, read this article.
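A simple post-processing guard, independent of prompt engineering, is to map any prediction that falls outside the known label set to a constant like “Unknown”. A minimal sketch, with a hypothetical label set:

```python
# Known labels from the training set; anything else is treated as a hallucination.
KNOWN_LABELS = {"tech", "sport", "business", "politics", "entertainment"}

def normalize_prediction(raw: str) -> str:
    """Map predictions outside the training label set to a constant keyword."""
    label = raw.strip().lower()
    return label if label in KNOWN_LABELS else "Unknown"

print(normalize_prediction(" techn"))  # -> "Unknown" (the hallucinated class above)
print(normalize_prediction("Sport"))   # -> "sport"
```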

Model Portability

Azure Open AI models are not portable: they cannot be downloaded or dockerized, and a customized model can only be deployed once within the Azure Open AI service; we cannot create more than one deployment of the same model. Simple models, on the other hand, as you can see from the notebook, can be serialized and saved to a file from which they can be deployed any number of times. The compressed size of the random forest model is around 7 MB.
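For contrast, serializing a scikit-learn model is essentially a one-liner with joblib. A minimal sketch, assuming the `clf` pipeline from the earlier baseline sketch; the file name and compression level are illustrative.

```python
import joblib

# Serialize the trained pipeline to a compressed file (roughly a few MB for
# a random forest like the one above), which can be shipped to and deployed
# in any environment, any number of times.
joblib.dump(clf, "rf_classifier.joblib", compress=3)

# Later, in another environment:
clf = joblib.load("rf_classifier.joblib")
```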

Another limitation is that a deployed model cannot be promoted to the next environment. This follows from the fact that a customized model can only be deployed once: if we have dev, test, and production environments, we need to train the model multiple times (which is undesirable given the cost and reproducibility concerns).

A deployed model is charged on an hourly basis irrespective of usage, and if you leave the model unattended or idle for more than 15 days it will be deleted automatically.

Training & Inference limitations

The training dataset cannot be greater than 200 MB when fine-tuning Open AI models, which is an important limitation to consider. If your training dataset is larger than 200 MB, you may have to split it and train the model in stages; Open AI models support re-training a customized model.
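If the JSONL training file exceeds the limit, one option is to split it into chunks that stay under 200 MB and fine-tune incrementally. A rough sketch of the splitting step; the file names and the safety margin are placeholders.

```python
MAX_CHUNK_BYTES = 190 * 1024 * 1024  # stay safely under the 200 MB limit

def split_jsonl(path: str, prefix: str = "train_part") -> None:
    """Split a JSONL training file into chunks below MAX_CHUNK_BYTES."""
    part, size = 0, 0
    out = open(f"{prefix}_{part}.jsonl", "w", encoding="utf-8")
    with open(path, encoding="utf-8") as src:
        for line in src:
            line_bytes = len(line.encode("utf-8"))
            if size + line_bytes > MAX_CHUNK_BYTES:
                out.close()
                part += 1
                size = 0
                out = open(f"{prefix}_{part}.jsonl", "w", encoding="utf-8")
            out.write(line)
            size += line_bytes
    out.close()

split_jsonl("training_data.jsonl")
```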

Also, only one training job is allowed to run at a time. If you trigger multiple training jobs, they wait in a queue.

During training or inference, the Open AI models may refuse to complete a prompt due to content filtering restrictions. Although this is not a problem during training, it will impact inference if Open AI detects that the prompt falls under one of four categories of harmful content (violence, hate, sexual, and self-harm). I do not consider this a key limitation because the content filtering can be configured, but it does demand additional processing steps during inference.
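One way to keep the inference pipeline from breaking is to catch filtered requests and fall back to a default label. This is a hedged sketch assuming the v0.x SDK, where (to my understanding) a filtered prompt surfaces as an InvalidRequestError and a filtered completion is flagged via finish_reason; the deployment name and prompt separator are hypothetical.

```python
import openai

def classify(text: str, engine: str = "my-finetuned-model") -> str:
    """Classify text, falling back to 'Unknown' when content filtering blocks the call."""
    try:
        response = openai.Completion.create(
            engine=engine,                 # hypothetical deployment name
            prompt=f"{text}\n\n###\n\n",   # assumed fine-tuning prompt separator
            max_tokens=1,
            temperature=0,
        )
    except openai.error.InvalidRequestError:
        # Prompt rejected (e.g. flagged by the content filter): fall back.
        return "Unknown"

    choice = response["choices"][0]
    if choice.get("finish_reason") == "content_filter":
        return "Unknown"
    return choice["text"].strip()
```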

You may read this blog post to learn more about content filtering configurations.

Summary

In summary, while there are a few things you can do with Open AI, that does not mean you should. Open AI models are built for much more sophisticated tasks, and using them for simple tasks that can be handled by simpler models is counterproductive. If there is a good reason why you feel Open AI should be used for classification, you are welcome to use my notebook to learn how to do it. Please also let me know in the comments why you think Open AI is more useful than simple models for data classification tasks.
