Creating our own preference dataset
Our model can currently write paragraphs about topics related to machine learning, but it doesn’t have the same writing style as the original authors. This is a typical use case for preference alignment, where we want to change the “voice” of the model so it closely imitates the source data. It’s important to note that, experimentally, DPO tends to make models more verbose and push them toward very formal language. We will therefore need to apply DPO surgically to avoid this pitfall and instead adopt the less formal style of these blog articles.
In this section, we will create a preference dataset where the chosen answers are extracts from the original text, while the rejected answers are generated by the model. To implement this, we will adapt the code from Chapter 5, which was designed to generate instruction datasets.
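The following is a minimal sketch of this idea, not the exact code from Chapter 5: it assumes each sample already contains an instruction and the corresponding extract from the article, and it uses the OpenAI Python client as an example backend to produce the rejected answers (the model name, field names, and helper functions are illustrative assumptions).

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate_rejected_answer(instruction: str, model: str = "gpt-4o-mini") -> str:
    """Ask a model to answer the instruction; its output becomes the 'rejected' answer."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": instruction}],
        temperature=0.7,
    )
    return response.choices[0].message.content

def build_preference_pairs(samples: list[dict]) -> list[dict]:
    """Pair each instruction with the original extract (chosen) and a generated answer (rejected)."""
    pairs = []
    for sample in samples:
        pairs.append({
            "prompt": sample["instruction"],
            "chosen": sample["extract"],  # the original author's text
            "rejected": generate_rejected_answer(sample["instruction"]),
        })
    return pairs

if __name__ == "__main__":
    # Illustrative sample; in practice these come from the instruction-generation pipeline.
    samples = [
        {
            "instruction": "Write a paragraph explaining why preference alignment is useful.",
            "extract": "Preference alignment lets us nudge a model toward a specific voice...",
        }
    ]
    with open("preference_dataset.jsonl", "w") as f:
        for pair in build_preference_pairs(samples):
            f.write(json.dumps(pair) + "\n")
```

The key design choice here is that the human-written extract always plays the role of the chosen answer, so the preference signal consistently rewards the authors' style over the model's default, more formal phrasing.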
As seen in the previous section, preference and instruction datasets rely on the same principles...