Skip to content

Update tokenizers requirement from 0.22 to 0.23#110

Open
dependabot[bot] wants to merge 1 commit intomainfrom
dependabot/cargo/tokenizers-0.23
Open

Update tokenizers requirement from 0.22 to 0.23#110
dependabot[bot] wants to merge 1 commit intomainfrom
dependabot/cargo/tokenizers-0.23

Conversation

@dependabot
Copy link
Copy Markdown
Contributor

@dependabot dependabot Bot commented on behalf of github May 2, 2026

Updates the requirements on tokenizers to permit the latest version.

Release notes

Sourced from tokenizers's releases.

Release v0.23.1

TL;DR

tokenizers 0.23.1 is the first proper stable release in the 0.23 line — 0.23.0 only ever shipped as rc0 because the release pipeline itself was broken (Node side hadn't shipped multi-platform binaries since 2023, Python side was on pyo3 0.27 without free-threaded support). 0.23.1 is the version where everything actually goes out the door together: full Node multi-platform wheels for the first time in years, Python 3.14 (regular and free-threaded 3.14t), full type hints for every Python class, and a stack of measurable perf wins on the BPE / added-vocab hot paths.

There is no functional 0.23.0 published — we tag 0.23.1 directly so users don't accidentally pull a never-shipped version.


🚨 Breaking changes

  • Drop Python 3.9 (#1952) — requires-python = ">=3.10"; 3.9 users stay on 0.22.x.
  • add_tokens normalizes content at insertion (#1995) — re-saved tokenizer.json may differ in the added_tokens block. Existing files load unchanged.
  • Type stubs are precise (#1928, #1997) — methods that returned Any now return real types; mypy --strict may surface previously-hidden errors. Stub layout also moved from tokenizers/<sub>/__init__.pyi to tokenizers/<sub>.pyi. This breaks the surface of some of the processors like RobertaProcessign's __init__ .
  • 3.14t-only: setters/getters return PyResult<T> because of Arc<RwLock<Tokenizer>>; a poisoned lock surfaces as PyException instead of a panic.

⚡ Performance — measured locally on this Mac, not lifted from PRs

Run with cargo bench --bench <name> -- --save-baseline v0_22_2 on v0.22.2, then --baseline v0_22_2 on v0.23.1. Numbers are point-in-time wall clock on a single laptop; relative deltas are what matters, absolute numbers will differ on CI hardware.

Added-vocabulary deserialize — the headline win (#1995, #1999)

bench: improve added_vocab_deserialize to reflect real-world workloads (#2000) is now representative of how transformers actually loads tokenizer.json files. The combined effect of daachorse for the matching automaton plus the normalize-on-insert refactor is enormous on this workload:

benchmark v0.22.2 v0.23.1 change
100k tokens, special, no norm ~410 ms 248 ms −40%
100k tokens, non-special, no norm ~7.1 s 273 ms −96%
100k tokens, special, NFKC ~395 ms 235 ms −40%
100k tokens, non-special, NFKC ~7.4 s 290 ms −96%
400k tokens, special, no norm ~15 s 980 ms −94%

Real-world impact: loading a Llama-3-style tokenizer with a large set of added tokens dropped from "noticeable pause" to "instant".

BPE encode

benchmark v0.22.2 v0.23.1 change
BPE GPT2 encode batch, no cache 530 ms 446 ms −16%
BPE GPT2 encode batch (cached) 690 ms 685 ms noise
BPE GPT2 encode (single) 1.95 s 1.94 s noise
BPE Train (small) 32.6 ms 31.5 ms −3%
BPE Train (big) 1.01 s 988 ms −2%

The BPE per-thread cache PR (#2028) shows much larger wins on highly-parallel workloads (+47–62% at 88+ threads on a server box, per the PR's own measurements on Vera). Single-thread batch numbers above are flat or slightly improved because cache-hit overhead was already low without contention.

Llama-3 encode

... (truncated)

Commits

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


Dependabot commands and options

You can trigger Dependabot actions by commenting on this PR:

  • @dependabot rebase will rebase this PR
  • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
  • @dependabot show <dependency name> ignore conditions will show all of the ignore conditions of the specified dependency
  • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
  • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
  • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)

Updates the requirements on [tokenizers](https://github.com/huggingface/tokenizers) to permit the latest version.
- [Release notes](https://github.com/huggingface/tokenizers/releases)
- [Changelog](https://github.com/huggingface/tokenizers/blob/main/RELEASE.md)
- [Commits](huggingface/tokenizers@v0.22.0...v0.23.1)

---
updated-dependencies:
- dependency-name: tokenizers
  dependency-version: 0.23.1
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
@dependabot dependabot Bot added dependencies Pull requests that update a dependency file rust Pull requests that update Rust code labels May 2, 2026
@dependabot dependabot Bot requested a review from a team as a code owner May 2, 2026 05:03
@dependabot dependabot Bot added the rust Pull requests that update Rust code label May 2, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

dependencies Pull requests that update a dependency file rust Pull requests that update Rust code

Projects

None yet

Development

Successfully merging this pull request may close these issues.

0 participants