Estimate LLM Token Count - Korean vs English Differences

Problem

You need to estimate the token count before calling an LLM API. Installing tiktoken is overkill when a rough estimate is enough.

Solution

function estimateTokenCount(text: string): number {
  const words = text.trim().split(/\s+/).filter(w => w.length > 0);
  let tokens = 0;

  for (const word of words) {
    const hasKorean = /[가-힣]/.test(word);
    if (hasKorean) {
      tokens += Math.ceil(word.length * 1.5);  // Korean: ~1.5 tokens/char
    } else {
      tokens += Math.ceil(word.length / 4);    // English: ~4 chars/token (~0.25 tokens/char)
    }
  }

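  // Markdown syntax (fenced code, inline code, headings) adds a small flat overhead of 2 tokens per element.
  const mdElements = text.match(/```[\s\S]*?```|`[^`]+`|#{1,6}\s/g);
  if (mdElements) tokens += mdElements.length * 2;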

  return Math.max(1, Math.round(tokens));
}

function formatTokens(count: number): string {
  return count < 1000 ? `${count} tokens` : `${(count / 1000).toFixed(1)}K tokens`;
}
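
A pre-call check built on an estimator like the one above could look roughly like this. It is only a sketch: checkInputBudget, BudgetCheck, and TokenEstimator are hypothetical names, and the token limit and price are placeholders you would take from your provider's documentation.

```typescript
// Hypothetical helper, not part of the original snippet.
// Any function that maps text to an approximate token count can be plugged in.
type TokenEstimator = (text: string) => number;

interface BudgetCheck {
  estimatedTokens: number;
  withinLimit: boolean;
  estimatedCostUsd: number;
}

function checkInputBudget(
  text: string,
  estimate: TokenEstimator,
  maxInputTokens: number,      // assumed input limit for the target model
  usdPerMillionTokens: number  // assumed input price; check current pricing
): BudgetCheck {
  const estimatedTokens = estimate(text);
  return {
    estimatedTokens,
    withinLimit: estimatedTokens <= maxInputTokens,
    estimatedCostUsd: (estimatedTokens / 1e6) * usdPerMillionTokens,
  };
}
```

Passing estimateTokenCount as the estimate argument gives a quick go/no-go before the request is sent. Because the count is approximate, leaving some headroom below the real limit is safer than comparing against it exactly.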

Key Points

  • Each Hangul syllable takes 3 bytes in UTF-8, and BPE tokenizers usually split Korean into more tokens than English. The same content in Korean often uses roughly 1.5-2x more tokens, depending on the tokenizer.
  • This estimate isn’t exact, but it’s sufficient for pre-call cost estimation and input length checks.
  • For precise counts, use OpenAI’s tiktoken or Anthropic’s token counting API. The approximation is enough whenever a rough figure will do.
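
The 3-bytes-per-syllable point can be checked with a small helper. This is a minimal sketch: utf8ByteLength is a hypothetical name, and it computes the UTF-8 length from code points so it does not depend on any runtime API.

```typescript
// Hypothetical helper: UTF-8 byte length of a string, computed per code point.
function utf8ByteLength(text: string): number {
  let bytes = 0;
  for (const ch of text) {               // iterates by code point, not UTF-16 unit
    const cp = ch.codePointAt(0)!;
    if (cp < 0x80) bytes += 1;           // ASCII
    else if (cp < 0x800) bytes += 2;
    else if (cp < 0x10000) bytes += 3;   // includes Hangul syllables (U+AC00-U+D7A3)
    else bytes += 4;                     // supplementary planes, e.g. most emoji
  }
  return bytes;
}

console.log(utf8ByteLength("token")); // 5 (5 characters, 1 byte each)
console.log(utf8ByteLength("토큰"));   // 6 (2 characters, 3 bytes each)
```

Byte length is only a proxy for token count, since the actual split depends on the tokenizer's vocabulary, but it shows why Korean text tends to be heavier per character.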