Microsoft.ML.Tokenizers 1.0.0

About

Microsoft.ML.Tokenizers provides implementations of the various tokenization algorithms used in NLP transforms.

Key Features

  • Extensible tokenizer architecture that allows for specialization of the Normalizer, PreTokenizer, Model/Encoder, and Decoder
  • BPE - Byte pair encoding model
  • English Roberta model
  • Tiktoken model
  • Llama model
  • Phi2 model
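
To make the "byte pair encoding" entry above more concrete, here is a minimal, self-contained sketch of how ranked merge rules turn characters into larger tokens. It is an illustration only: `ToyBpe`, `Encode`, and the `mergeRanks` table are hypothetical names invented for this example and are not part of Microsoft.ML.Tokenizers, whose BPE model also handles vocabularies, byte-level fallbacks, and caching that this sketch ignores.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper for illustration; NOT a type from Microsoft.ML.Tokenizers.
public static class ToyBpe
{
    // Splits a word into single characters, then repeatedly merges the adjacent
    // pair with the lowest rank until no merge rule applies.
    // A lower rank means the rule was learned earlier and takes priority.
    public static List<string> Encode(
        string word,
        IReadOnlyDictionary<(string, string), int> mergeRanks)
    {
        List<string> symbols = word.Select(c => c.ToString()).ToList();

        while (symbols.Count > 1)
        {
            int bestIndex = -1;
            int bestRank = int.MaxValue;

            // Scan all adjacent pairs and remember the best-ranked one.
            for (int i = 0; i < symbols.Count - 1; i++)
            {
                if (mergeRanks.TryGetValue((symbols[i], symbols[i + 1]), out int rank)
                    && rank < bestRank)
                {
                    bestRank = rank;
                    bestIndex = i;
                }
            }

            if (bestIndex < 0)
            {
                break; // no merge rule applies to any remaining pair
            }

            // Replace the chosen pair with its concatenation.
            symbols[bestIndex] = symbols[bestIndex] + symbols[bestIndex + 1];
            symbols.RemoveAt(bestIndex + 1);
        }

        return symbols;
    }
}
```

With the assumed rules ("l", "o") at rank 0, ("lo", "w") at rank 1, and ("e", "r") at rank 2, calling `ToyBpe.Encode("lower", rules)` merges step by step into the two tokens `low` and `er`. A real tokenizer would then map each resulting token to an integer id through its vocabulary.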

How to Use

using System;
using System.Collections.Generic;
using System.IO;
using System.Net.Http;
using Microsoft.ML.Tokenizers;

//
// Using Tiktoken Tokenizer
//

// Initialize the tokenizer for the `gpt-4` model
Tokenizer tokenizer = TiktokenTokenizer.CreateForModel("gpt-4");

string source = "Text tokenization is the process of splitting a string into a list of tokens.";

Console.WriteLine($"Tokens: {tokenizer.CountTokens(source)}");
// prints: Tokens: 16

var trimIndex = tokenizer.GetIndexByTokenCountFromEnd(source, 5, out string processedText, out _);
Console.WriteLine($"5 tokens from end: {processedText.Substring(trimIndex)}");
// prints: 5 tokens from end:  a list of tokens.

trimIndex = tokenizer.GetIndexByTokenCount(source, 5, out processedText, out _);
Console.WriteLine($"5 tokens from start: {processedText.Substring(0, trimIndex)}");
// prints: 5 tokens from start: Text tokenization is the

IReadOnlyList<int> ids = tokenizer.EncodeToIds(source);
Console.WriteLine(string.Join(", ", ids));
// prints: 1199, 4037, 2065, 374, 279, 1920, 315, 45473, 264, 925, 1139, 264, 1160, 315, 11460, 13

//
// Using Llama Tokenizer
//

// Open a stream to the remote Llama tokenizer model data file
using HttpClient httpClient = new();
const string modelUrl = @"https://huggingface.co/hf-internal-testing/llama-tokenizer/resolve/main/tokenizer.model";
using Stream remoteStream = await httpClient.GetStreamAsync(modelUrl);

// Create the Llama tokenizer using the remote stream
Tokenizer llamaTokenizer = LlamaTokenizer.Create(remoteStream);
string input = "Hello, world!";
ids = llamaTokenizer.EncodeToIds(input);
Console.WriteLine(string.Join(", ", ids));
// prints: 1, 15043, 29892, 3186, 29991

Console.WriteLine($"Tokens: {llamaTokenizer.CountTokens(input)}");
// prints: Tokens: 5

Main Types

The main types provided by this library are:

  • Microsoft.ML.Tokenizers.Tokenizer
  • Microsoft.ML.Tokenizers.BpeTokenizer
  • Microsoft.ML.Tokenizers.EnglishRobertaTokenizer
  • Microsoft.ML.Tokenizers.TiktokenTokenizer
  • Microsoft.ML.Tokenizers.Normalizer
  • Microsoft.ML.Tokenizers.PreTokenizer
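
The division of labor among a normalizer, a pre-tokenizer, an encoding model, and a decoder can be sketched with plain delegates. This is a simplified picture under stated assumptions: `ToyPipeline` and its constructor parameters are hypothetical and do not mirror the signatures of the `Normalizer`, `PreTokenizer`, or `Tokenizer` types listed above. It only shows the order in which such stages are typically applied.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical pipeline for illustration; NOT the library's Tokenizer API.
public sealed class ToyPipeline
{
    private readonly Func<string, string> _normalize;                 // stage 1: clean up the text
    private readonly Func<string, IEnumerable<string>> _preTokenize;  // stage 2: split into pieces
    private readonly IReadOnlyDictionary<string, int> _vocabulary;    // stage 3: piece -> id
    private readonly Dictionary<int, string> _idToToken;              // stage 4: id -> piece
    private readonly int _unknownId;

    public ToyPipeline(
        Func<string, string> normalize,
        Func<string, IEnumerable<string>> preTokenize,
        IReadOnlyDictionary<string, int> vocabulary,
        int unknownId)
    {
        _normalize = normalize;
        _preTokenize = preTokenize;
        _vocabulary = vocabulary;
        _unknownId = unknownId;
        // Build the reverse lookup used by Decode.
        _idToToken = vocabulary.ToDictionary(pair => pair.Value, pair => pair.Key);
    }

    // Normalize, split, then look up each piece; unknown pieces map to unknownId.
    public List<int> Encode(string text) =>
        _preTokenize(_normalize(text))
            .Select(piece => _vocabulary.TryGetValue(piece, out int id) ? id : _unknownId)
            .ToList();

    // Map ids back to pieces and join them with single spaces.
    public string Decode(IEnumerable<int> ids) =>
        string.Join(" ", ids.Select(id => _idToToken.TryGetValue(id, out var token) ? token : "<unk>"));
}
```

For instance, with a lower-casing normalizer, a pre-tokenizer that splits on spaces, and the assumed vocabulary `<unk>` = 0, `hello` = 1, `world` = 2, encoding "Hello  World again" gives the ids 1, 2, 0, and decoding those ids gives "hello world <unk>". Keeping the stages separate is what lets one stage be swapped (say, a different normalizer) without touching the others.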

Additional Documentation

Feedback & Contributing

Microsoft.ML.Tokenizers is released as open source under the MIT license. Bug reports and contributions are welcome at the GitHub repository.

No packages depend on Microsoft.ML.Tokenizers.

.NET 8.0

.NET Standard 2.0

Version                    Downloads  Last updated
3.0.0-preview.26160.2      2          03/16/2026
2.0.0                      3          03/04/2026
2.0.0-preview.25527.5      2          03/05/2026
2.0.0-preview.25503.2      2          03/05/2026
2.0.0-preview.25373.1      2          03/05/2026
2.0.0-preview.1.25127.4    2          03/05/2026
2.0.0-preview.1.25125.4    2          03/05/2026
1.0.3                      3          03/05/2026
1.0.2                      2          03/05/2026
1.0.1                      2          03/05/2026
1.0.0                      3          03/05/2026
0.22.0                     2          03/05/2026
0.22.0-preview.24526.1     2          03/05/2026
0.22.0-preview.24522.7     2          03/05/2026
0.22.0-preview.24378.1     2          03/05/2026
0.22.0-preview.24271.1     2          03/05/2026
0.22.0-preview.24179.1     2          03/05/2026
0.22.0-preview.24162.2     2          03/05/2026
0.21.1                     2          03/05/2026
0.21.0                     2          03/05/2026
0.21.0-preview.23511.1     2          03/05/2026
0.21.0-preview.23266.6     2          03/05/2026
0.21.0-preview.22621.2     2          03/05/2026
0.20.1                     2          03/05/2026
0.20.1-preview.22573.9     2          03/05/2026
0.20.0                     2          03/05/2026
0.20.0-preview.22551.1     2          03/05/2026