Practice – Creating your own custom English tokenizer
As we discussed in the previous section, Rasa has a powerful extension system that allows you to create custom components. In this section, we will show you how to create a custom English tokenizer.
As discussed in Writing Rasa extensions, the easiest way to create a custom component is to inherit from a base class provided by Rasa. Our tokenizer needs to inherit from rasa.nlu.tokenizers.tokenizer.Tokenizer and override the tokenize() method.
For the sake of simplicity, we will split English text into tokens in the most basic way: by splitting the text on whitespace. One possible implementation of our English tokenizer is as follows:
from rasa.nlu.tokenizers.tokenizer import Token, Tokenizer


class MyWhitespaceTokenizer(Tokenizer):
    def __init__(self, component_config):
        super().__init__(component_config)

    def tokenize(self, message, attribute):
        # A simple whitespace-based approach: split the text on spaces and
        # record each token's character offset in the original text so that
        # downstream components can map tokens back to the input.
        text = message.get(attribute)
        tokens = []
        offset = 0
        for word in text.split():
            start = text.index(word, offset)
            tokens.append(Token(word, start))
            offset = start + len(word)
        return tokens
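To see the tokenizer in action outside of a full pipeline, you can call it directly on a Message object. The snippet below is a minimal sketch that assumes the Rasa 2.x import paths for Message and the TEXT attribute constant; the example sentence and the empty component configuration are ours, chosen only for illustration.

from rasa.shared.nlu.constants import TEXT
from rasa.shared.nlu.training_data.message import Message

# Instantiate the custom tokenizer with an empty configuration
tokenizer = MyWhitespaceTokenizer({})

# Wrap a sample sentence in a Message and tokenize its TEXT attribute
message = Message(data={TEXT: "book a flight to London"})
tokens = tokenizer.tokenize(message, TEXT)

print([(token.text, token.start, token.end) for token in tokens])
# [('book', 0, 4), ('a', 5, 6), ('flight', 7, 13), ('to', 14, 16), ('London', 17, 23)]

In a real bot, you would not call the component by hand like this; instead, you would reference the class by its module path in the pipeline section of config.yml, as described in Writing Rasa extensions.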