Quote:
Originally Posted by Fuzz

I know this is not super relevant stuff, but it would be fascinating to know why they have so much trouble with this sort of thing. I presume it comes down to tokenization of the word, without dissecting the word itself. Once tokenized, it can tell you the colour, shape, properties, everything about it, but the word itself is a single object.
|
You were on the right track, though the bolded part is not quite right either. This link explains it about as well as anything can (note that tokenization doesn't necessarily happen at the word level: a token can be a group of letters within a word, and special characters get their own tokens).
https://platform.openai.com/tokenizer
Quote:
|
A helpful rule of thumb is that one token generally corresponds to ~4 characters of text for common English text. This translates to roughly ¾ of a word (so 100 tokens ~= 75 words).
|
You can test it out yourself at the tokenizer link above.
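To make the idea concrete, here is a toy sketch of the general approach. This is not OpenAI's actual BPE tokenizer, and the vocabulary is made up for the example; it just greedily matches the longest known chunk at each position, which is roughly why the model ends up seeing a word as a couple of opaque IDs rather than a string of letters.

```python
# Toy greedy tokenizer -- NOT the real OpenAI BPE, just an illustration.
# The vocabulary and IDs below are invented for this example.
VOCAB = {"straw": 101, "berry": 102, "str": 103, "aw": 104,
         "b": 105, "e": 106, "r": 107, "y": 108}

def tokenize(text: str) -> list[int]:
    """Greedily match the longest vocabulary entry at each position."""
    ids = []
    i = 0
    while i < len(text):
        for j in range(len(text), i, -1):  # try the longest substring first
            if text[i:j] in VOCAB:
                ids.append(VOCAB[text[i:j]])
                i = j
                break
        else:
            raise ValueError(f"no token covers {text[i]!r}")
    return ids

print(tokenize("strawberry"))  # [101, 102]
```

The model receives `[101, 102]`, two opaque IDs, not ten letters, so a question like "how many r's are in strawberry?" asks it about characters it never directly sees.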