Thread: The A.I. Thread
Old 08-08-2025, 12:22 AM   #758
Firebot
#1 Goaltender
 
Join Date: Jul 2011

Quote:
Originally Posted by Fuzz View Post

I know this is not super relevant stuff, but it would be fascinating to know why they have so much trouble with this sort of thing. I presume it comes down to tokenization of the word, without dissecting the word itself. Once tokenized, it can tell you the colour, shape, properties, everything about it, but the word itself is a single object.
You were on the right track, though the bolded part isn't quite right either. The link below explains it about as well as anything can (and tokenization doesn't necessarily happen at the word level; a token can also be a group of letters, and special characters get their own tokens).
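To make that concrete, here is a toy sketch of why letter-level questions are hard. The vocabulary and IDs below are completely made up (real tokenizers like BPE learn tens of thousands of entries), but the principle is the same: the model receives integer token IDs, not individual letters.

```python
# Toy illustration only: this is NOT any real tokenizer's vocabulary;
# the entries and IDs are invented for the example.
TOY_VOCAB = {"straw": 101, "berry": 102, "r": 18, "s": 19, "t": 20}

def toy_tokenize(text):
    """Greedy longest-match tokenization over the toy vocabulary."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest possible piece first, shrinking until a match.
        for length in range(len(text) - i, 0, -1):
            piece = text[i:i + length]
            if piece in TOY_VOCAB:
                tokens.append(TOY_VOCAB[piece])
                i += length
                break
        else:
            raise ValueError(f"no token for {text[i]!r}")
    return tokens

# "strawberry" becomes just two IDs. The three r's are invisible to the
# model unless it has memorized facts about those particular token IDs.
print(toy_tokenize("strawberry"))  # [101, 102]
```

So once the word is tokenized, the model can reason about the token's meaning, but the individual letters inside it are simply not part of the input.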



https://platform.openai.com/tokenizer

Quote:
A helpful rule of thumb is that one token generally corresponds to ~4 characters of text for common English text. This translates to roughly ¾ of a word (so 100 tokens ~= 75 words).
You can test it out yourself at the link above.

Last edited by Firebot; 08-08-2025 at 12:26 AM.